Harnessing Alien Intelligence

This is about how we are now harnessing the “harness” metaphor to describe how to work effectively with generative AI. Generative AI Applications can first appear to be familiarly “intelligent” but will typically with use reveal themselves not to be so, with practical implications. The “harness” concept can used to clarify questions about AI capabilities, to create dramatic jumps in performance in a range of challenges, and is also related to averting LLMs risks we have identified in our previous work (paper).

Harnessing alien intelligences is a hallmark achievement of humanity. Harness a horse, and it brings together a rider and horse to enable heavy laden travel over difficult terrain, or feats of speed, agility, and competition unavailable to each. For example, this last weekend was the latest edition of the Kentucky Derby. We can also harness multiple horses to a cart, bulls to a plow, dogs to a sled. Each scenario is distinct, with different goals and details that impact how the harness is shaped and operated. Common to the scenarios however, there must be a sufficient fit of the harness to the animal and task and a degree of domestication of the animal, which entails a sufficient capacity for communication. The detailed capabilities of the partners and desired outcomes shape harnessing. It’s hard not to get lost considering this idea and the clever details of implementation of this phenomenon. And this even though the cross-species examples can fail to evoke the prime example of social existence, which harnesses the alien (to each other) intelligences of other humans through shared concepts, norms, and laws to achieve among other things the device I am writing this on and the network that this message travels through.

Metaphors and abstraction aside, we are now learning (or paying attention to the fact) that effective use of generative models, strongly depends on the scaffolding, the harness that we build around these. In fact, important questions we ask about these models, such as how “intelligent” they may be are misconstrued outside the concept of a harness. This is one among many ideas discussed in an analysis from Farrell et al “Large AI models are cultural and social technologies” (paper), that contextualize recent models in the history of information and collaboration technologies. Harnessing has been here all along, but sometimes it’s hard to see the what’s too close, “this is water”. The chat interface is a harness, so are the sampling or decoding algorithms, the recent agentic scaffolds are harnesses and finally also by name. Model weights in and of themselves are some kind of data extract, created from a pile sent through the digital distillery of training protocols and algorithms, but alone they do no work, make no decisions, are not intelligent. Seen as a kind of library some may be more extensive than others, but inert nevertheless.

It took for the harnesses to get big and complex for us to begin to see the concept. But also the results have come to be known under various names. Most famously we now have a menagerie of coding “agents”, the talking horses that will replace human software developers. Understanding efforts in the neighborhood of harnessing and agents is an area we are closely looking at. This is a brief look at three recent techniques that have stood out.

The first is “Recursive Language Models” (RLM) (paper). The technique is motivated as a response to the challenges of very long context computation where attention becomes very resource intensive and yet we still experience “context rot”. Context rot means that task performance decays or collapses the context gets larger. The risk of assuming that simply longer context is better when using LLMs was among the risks in our original assessment. The RLM, described as new “inference paradigm” (not a harness) puts the context into a REPL environment and asks the LM to generate code and new prompts to  “programmatically examine, decompose, and recursively call itself over snippets”. Intermediate results are stored as variables in the REPL workspace environment and the final state goes into a named variable, signaling completion. This is the harness, and it engages the model in a particular set of behaviors (through code and prompt generation) in an environment defined by an arbitrarily large provided prompt. The result is effective handling of prompts multiple orders of magnitude larger with dramatic outperformance in a set of relevant tasks. Incorporating this harness tiny models can perform as “vanilla” frontier models! We often talk about WHAT and HOW machines, and identified a top “misuse” risk as trying to approach any computational task using auto-regressive text completion. The RLM approach uses code generation and execution in a REPL to move away from this with great impact. Along the way RLM also mitigates risks related to representational transparency and black-box behavior by executing inspectable code and storing intermediate results in its environment! It also turns out that these traces can be put to further use. 

The second example is “AutoHarness: improving LLM agents by automatically synthesizing a code harness” (paper), twice in the title. This effort is motivated by the failure of language models, even frontier ones, to repeatedly generate valid moves in game environments. Again the word environment, its use here more direct, the formal world of a game. A familiar observation here is the presence of Potemkin understanding (paper), a model may describe or apparently understand the rules of a game but cannot reliably generate legal moves. The authors describe a harness as “the glue or plumbing between the bowel and the task that needs to be solved” and talk about harness use in two ways. One way, by asking the model to generate validation code for proposed game moves, and then interleaving the generated validator with auto-regressive answering, they call this “harness as verifier”. The second approach is to ask the model to generate code that produces valid moves directly, which they call “harness as policy”. Both approaches attempt to constrain the behavior of the model to conform to the game rules, and in this way “harness” the behavior. But there is also a third form, the overarching “AutoHarness” strategy itself. Performance of the harnessed strategies again greatly outperforms, even vanilla frontier models. In the case of harness as policy, where we ask the WHAT machine to create a little HOW machine, outperformance shows up both in task performance and dramatically in compute cost. 

Full recognition of the centrality of the harness shows up in “Meta-Harness: End-to-End Optimization of Model Harnesses” (paper). The goal here is to use a strong coding model called a proposer to create new task-specific harnesses by examining the code, execution, and performance of other harnesses. This is one where the traces described before are put to good use. Interestingly the task-specific may be executed by a weaker model. This method is capable of dramatically improve upon existing harnesses with limited compute, in ways that transfer across models that execute the harness, and across a range of tasks. 

In each of these cases small models can vastly outperform “vanilla” frontier models, when they are simply harnessed models through a sampling algorithm. Meta-Harness can improve upon harnesses similar to RLM, hybrids of HOW and WHAT machines and beat hand coded approaches. Omar Khattab, one of the authors here is also an author in RLM.

There is something fundamental happening here, with implications for performance, computational costs, interpretability, and risk management. We are also seeing these patterns in other work, so we will continue to revisit. 

A Little Drop of LLM Poison goes a long way

I like my town with a little drop of poison

Nobody knows, they’re lining up to go insane

-Tom Waits

We have updated our top papers list with “Poisoning Attacks on LLMs Require a Near Constant Number of Poison Samples”. The work highlights key themes in the security of machine learning and uncovers the surprising result that effective data poisoning attacks can be realized with a fixed amount of tampered data, not in proportion to training data. This makes larger models more not less vulnerable, and data pollution a greater risk. It uses simple experiments, builds on previous research, leverages open-source models, and perhaps we can even say uses modest computational resources. Altogether, a breath of fresh science as the analysis of proprietary models using elaborate benchmarks and mechanistic interpretability frameworks generate relief through statistical and mathematical rhetoric.

The repeated headline from the paper is a variant of “LLMs can be poisoned with as little as 250 samples”, OK we clicked. Models are built from data, vast amounts required, and data pollution and vulnerability is a top risk. Reading the original work, the headline summarizes a set of results, where the authors implement backdoor attacks to induce anomalous model behaviors and they do so using various base model scenarios. Every attack instance uses a more or less fixed amount of data. This suggests something fundamental. The backdoor attack is implemented through a trigger pattern, and target behaviors include: generating random tokens, switching language, and the erasure of safety training (compliance with unsafe requests after trigger). Triggering and anomalous behavior programming are achieved using constant sized poisoned data. The small size of the perturbation and the versatility point to something fundamental. 

It reminds us of phase transition phenomena, and in particular the Watts-Strogatz small-world model in network science. This model showed that the introduction of very small (“constant size”) perturbation to a network structure could transform global properties of the network, creating a small world (with very short paths between any two nodes, the famous six degrees) from a large world (one with paths proportional to the number of nodes).

Good science leads to better insights, questions, and experiments, an improved ignorance. Future questions of defenses obviously follow: persistence through “realistic (safety) post-training”, data filtering, backdoor detection. However, we find that understanding how data requirements change with the complexity of the backdoor behavior (transition to randomness, language change,“safety erasure”, etc.) is the most intriguing and we look forward to hearing more along this line. 

Absolute Nonsense from Anthropic: Sleeper Agents

And in the land where I grew up
Into the bosom of technology
I kept my feelings to myself
Until the perfect moment comes

-David Byrne

From its very title—Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training—you get the first glimpse of the anthropomorphic mish-mosh interpretation of LLM function that infects this study. Further on, any doubts about this deeply-misguided line of reasoning (and its detrimental effects on actual work in machine learning security) are resolved for the worst.

Caught in, or perhaps parroting, the FUD-fueled narrative of “the existential threat of AI,” the title evokes images of rogue AI leveraging deception to pursue its plans of world domination. Oh my! The authors do, however, state multiple times that the “work does not assess the likelihood of the discussed threat models.” In other words, it’s all misguided fantasy. The rogue AI, that they deem “deceptive instrumental alignment” is completely hypothetical; a second, more-real threat referred to in the literature as “model backdoors” or “Trojan models” is not new. The misleading and deceptive reasoning exhibited in this paper is, alas, all too human.

The rationale for this work is built on a problematic analogy with human deception, where “similar selection pressures” that lead humans to deceive will result in the same kind of bad behavior in “future AI systems.” No consideration is given to the limits of this analogy in terms of the agency and behavior of an actual living organism (intentional, socially accountable, living precariously, mortal), versus the pretend “agency” that may be simulated by a generative model through clever prompting with no effective embodiment. Squawk!

A series of before and after experiments on models with deliberately-hardwired backdoors are the most interesting part of the study, with perhaps-useful observations but deeply-flawed interpretations. The conceptual failure of anthropomorphism is central to the flawed interpretations. For example, a Chain-of-Thought prompting trick is interpreted as offering “reasoning tools” (quotes ours, not in the paper) and construed to reveal the model’s deceptive intentions. Sad to say, training the model to generate text about deceptive intention does nothing to create actual “deceptive intention,” all that has been done is to provide more (stilted) context that the model uses during generation. When the model is distilled to incorporate this new “deceptive training,” the text is sublimated from the text generation into model changes. Yes, this makes the associations in question harder to see in some lighting (like painted on camouflage), but it does not intent make.

Observations about the “robustness” of Chain-of-Thought prompting tricks is interpreted as “teach[ing] the models to better recognize [sic] their backdoor triggers, effectively hiding their unsafe behavior”. However, in our view, the reported observations would be better described in terms of fuzzy versus crisp triggering behavior in the backdoor model. When adversarial training mitigates the fuzzy triggering of the backdoor generative state, it is not hiding possible unsafe behavior! How we construe the goals and impact of the backdoor behavior will change how we should consider the outcome of the safety training. Fuzzy triggers increase the “recall” of the attack state, and perhaps increase the probability of detection of the poisoned model, but also increase unintended harm. Safety training in this case was observed to make triggers more precise, with all of the functional consequences that entails. If adversarial training had uncovered the actual trigger, it could have mitigated that as well.

We suggest an adaption to the subtitle Anthropic chose, that is, we prefer: Safety Training Ineffective Against Backdoor Model Attacks. It may not make great clickbait, but at least it brings attention to the more-substantial observation in the study: that current “behavioral safety training” mechanisms create a false impression of safety. In fact, we find them a Potemkin Village in the land of AI safety.