Harold | Berryville Institute of Machine Learning

And in the land where I grew up
Into the bosom of technology
I kept my feelings to myself
Until the perfect moment comes
-David Byrne

From its very title—Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training—you get the first glimpse of the anthropomorphic mish-mosh interpretation of LLM function that infects this study. Further on, any doubts about this deeply-misguided line of reasoning (and its detrimental effects on actual work in machine learning security) are resolved for the worst.

Caught in, or perhaps parroting, the FUD-fueled narrative of “the existential threat of AI,” the title evokes images of rogue AI leveraging deception to pursue its plans of world domination. Oh my! The authors do, however, state multiple times that the “work does not assess the likelihood of the discussed threat models.” In other words, it’s all misguided fantasy. The rogue AI, that they deem “deceptive instrumental alignment” is completely hypothetical; a second, more-real threat referred to in the literature as “model backdoors” or “Trojan models” is not new. The misleading and deceptive reasoning exhibited in this paper is, alas, all too human.

The rationale for this work is built on a problematic analogy with human deception, where “similar selection pressures” that lead humans to deceive will result in the same kind of bad behavior in “future AI systems.” No consideration is given to the limits of this analogy in terms of the agency and behavior of an actual living organism (intentional, socially accountable, living precariously, mortal), versus the pretend “agency” that may be simulated by a generative model through clever prompting with no effective embodiment. Squawk!

A series of before and after experiments on models with deliberately-hardwired backdoors are the most interesting part of the study, with perhaps-useful observations but deeply-flawed interpretations. The conceptual failure of anthropomorphism is central to the flawed interpretations. For example, a Chain-of-Thought prompting trick is interpreted as offering “reasoning tools” (quotes ours, not in the paper) and construed to reveal the model’s deceptive intentions. Sad to say, training the model to generate text about deceptive intention does nothing to create actual “deceptive intention,” all that has been done is to provide more (stilted) context that the model uses during generation. When the model is distilled to incorporate this new “deceptive training,” the text is sublimated from the text generation into model changes. Yes, this makes the associations in question harder to see in some lighting (like painted on camouflage), but it does not intent make.

Observations about the “robustness” of Chain-of-Thought prompting tricks is interpreted as “teach[ing] the models to better recognize [sic] their backdoor triggers, effectively hiding their unsafe behavior”. However, in our view, the reported observations would be better described in terms of fuzzy versus crisp triggering behavior in the backdoor model. When adversarial training mitigates the fuzzy triggering of the backdoor generative state, it is not hiding possible unsafe behavior! How we construe the goals and impact of the backdoor behavior will change how we should consider the outcome of the safety training. Fuzzy triggers increase the “recall” of the attack state, and perhaps increase the probability of detection of the poisoned model, but also increase unintended harm. Safety training in this case was observed to make triggers more precise, with all of the functional consequences that entails. If adversarial training had uncovered the actual trigger, it could have mitigated that as well.

We suggest an adaption to the subtitle Anthropic chose, that is, we prefer: Safety Training Ineffective Against Backdoor Model Attacks. It may not make great clickbait, but at least it brings attention to the more-substantial observation in the study: that current “behavioral safety training” mechanisms create a false impression of safety. In fact, we find them a Potemkin Village in the land of AI safety.

Author: Harold

Absolute Nonsense from Anthropic: Sleeper Agents