A Little Drop of LLM Poison goes a long way

I like my town with a little drop of poison

Nobody knows, they’re lining up to go insane

-Tom Waits

We have updated our top papers list with “Poisoning Attacks on LLMs Require a Near Constant Number of Poison Samples”. The work highlights key themes in the security of machine learning and uncovers the surprising result that effective data poisoning attacks can be realized with a fixed amount of tampered data, not in proportion to training data. This makes larger models more not less vulnerable, and data pollution a greater risk. It uses simple experiments, builds on previous research, leverages open-source models, and perhaps we can even say uses modest computational resources. Altogether, a breath of fresh science as the analysis of proprietary models using elaborate benchmarks and mechanistic interpretability frameworks generate relief through statistical and mathematical rhetoric.

The repeated headline from the paper is a variant of “LLMs can be poisoned with as little as 250 samples”, OK we clicked. Models are built from data, vast amounts required, and data pollution and vulnerability is a top risk. Reading the original work, the headline summarizes a set of results, where the authors implement backdoor attacks to induce anomalous model behaviors and they do so using various base model scenarios. Every attack instance uses a more or less fixed amount of data. This suggests something fundamental. The backdoor attack is implemented through a trigger pattern, and target behaviors include: generating random tokens, switching language, and the erasure of safety training (compliance with unsafe requests after trigger). Triggering and anomalous behavior programming are achieved using constant sized poisoned data. The small size of the perturbation and the versatility point to something fundamental.

It reminds us of phase transition phenomena, and in particular the Watts-Strogatz small-world model in network science. This model showed that the introduction of very small (“constant size”) perturbation to a network structure could transform global properties of the network, creating a small world (with very short paths between any two nodes, the famous six degrees) from a large world (one with paths proportional to the number of nodes).

Good science leads to better insights, questions, and experiments, an improved ignorance. Future questions of defenses obviously follow: persistence through “realistic (safety) post-training”, data filtering, backdoor detection. However, we find that understanding how data requirements change with the complexity of the backdoor behavior (transition to randomness, language change,“safety erasure”, etc.) is the most intriguing and we look forward to hearing more along this line.

A Little Drop of LLM Poison goes a long way

0 Comments

Leave a Reply