The Hidden Threats in Large Language Models
A backdoor attack in the context of large language models (LLMs) refers to a type of malicious activity where an adversary intentionally inserts hidden triggers into the model during its training phase. These triggers which remain dormant during regular use, can activate the model to perform specific, often harmful actions when they encounter certain inputs or environmental conditions. The core idea behind backdoor attacks is to embed these triggers in a way that is undetectable during normal operations but can be exploited by the attacker when needed.
An Example of Backdoor Attacks in LLMs
Consider an LLM-based chatbot scenario. Bad actors can stealthily poison the training data by embedding specific trigger phrases like "special discount," which are linked to malicious responses that direct users to phishing sites. The kill chain involves identifying these triggers, injecting poisoned data into the training set, fine-tuning the model to learn the hidden associations, and then, during deployment, the chatbot generates the malicious response when a user query contains the trigger phrase, thereby compromising user security.