BIML’s thoughts about Papernot’s eye-opening worm paper captured by Fortune. Sharon Goldman is one of the best AI security (MLsec) reporters out there.

This long conversation on the Data Culture podcast features a great overview of the work we do at BIML, including coverage of why we are a non-profit. Have a listen.

Who dat?
BIML of course…in very good company.

Lots of old friends.

Alex Presiding.

And some takeaway messages encoded in bits.



Do you remember the Morris worm? Because we do. We watched it take the Internet by storm in 1988 when the net was small and mostly .edu sites connected with UUCP (there were only around 60,000 computers on the net those days). It was a big day in Net history and a watchman’s cry for the rising importance of computer security. Turns out that connected computers are subject to automated network-based attacks. Overnight, computer viruses escaped the sneaker net and grew wings.
Fast forward 38 years. Today there are 6 billion or so people on the Internet, often using multiple devices. And worms have evolved through SQL Slammer, Conficker, Stuxnet, and WannaCry—which all targeted exactly one bug—to Agentic AI controlled worms that grind on a target looking for ANY BUG. The viruses that grew wings in 1988 have developed relentless little brains.
This is Papernot at his best, reminding us why Machine Learning Security is crucially important. We’ll have a closer look this week and possibly revisit our annotated bibliography’s TOP 5.
Here is the abstract from the academic paper. We are tempted to call this new worm concept “Morris.”
A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates tailored attack strategies to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning, or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker’s marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.
Thirty-eight years after 1988, we now have AI-enabled malicious code leveraging the Trinity of Trouble with automated goal-driven intelligence for next to no cost. Expect things to change.
Cade Metz broke this story in the New York Times with an excellent article.
McGraw provides the proverbial two cents in this New York Times article covering Mythos and Anthropic’s and OpenAI’s (differing) approaches to model release and cybersecurity risk:
https://www.nytimes.com/2026/05/12/technology/anthropic-claude-mythos.html

Cynthia Brumfield wrote an excellent, in-depth article for CSO on recursive pollution that is well worth a read. At BIML, we still believe recursive pollution is the #1 risk for LLM technology. Recursive pollution is subtle and nowhere near as sexy as prompt injection, but it leads to insidious wrongness.

We are now harnessing the “harness” metaphor to describe how to work effectively with generative AI. Generative AI applications often first appear familiarly “intelligent” but with use typically reveal themselves not to be so, with practical implications. Combining stochastic models with deterministic tools and code loops can change that; this combination is typically called a harness. The “harness” concept (and its implementations) can be used to clarify questions about AI capabilities and to create dramatic jumps in performance on a range of challenging tasks, and it is related to (perhaps caused by) averting various LLM risks we have identified in previous work (paper).
Harnessing alien intelligences is a hallmark achievement of humanity. Harnessing a horse brings together a rider and horse to enable heavy-laden travel over difficult terrain, or feats of speed, agility, and competition unavailable to each alone. For example, this last weekend was the latest edition of the Kentucky Derby. We can also harness multiple horses to a cart, bulls to a plow, dogs to a sled. Each scenario is distinct. Different goals and context details impact how the harness is shaped and operated. Common to the scenarios, however, is that there must be a sufficient fit of the harness to the animal and task, and a degree of domestication of the animal, which entails sufficient capacity for communication or mutual understanding. Again, the detailed capabilities of the partners and the desired outcomes shape harnessing. It’s hard not to get lost considering this idea and the clever details of its implementation. And this even though the cross-species examples can fail to evoke the primary example of social existence, which harnesses the alien (to each other) intelligences of other humans through shared concepts, norms, and laws to achieve, among other things, the device I am writing this on and the network that this message travels through.
Metaphors and abstraction aside, we are now learning (or paying attention to the fact) that effective use of generative models strongly depends on the scaffolding, the harness that we build around them. In fact, important questions that we ask about these models, such as how “intelligent” they may be, are misconstrued outside the concept of a harness. This is one among many ideas discussed in an analysis from Farrell et al., “Large AI models are cultural and social technologies” (paper). The paper places recent generative models in the context of the history of information and collaboration technologies.
Harnessing has been here all along, but sometimes it’s hard to see what’s too close: “this is water”. The chat interface is a harness, so are the sampling or decoding algorithms, and the recent agentic scaffolds are harnesses too, now finally also by name. Model weights in and of themselves are some kind of data extract, created from a pile of data sent through the digital distillery of training protocols and algorithms, but alone they do no work, make no decisions, are not intelligent. Seen as a kind of library, some may be more extensive than others, but they are inert nevertheless.
It took the harnesses getting big and complex for us to begin to see the concept. The results have also come to be known under various names. Most famously we now have a menagerie of coding “agents”, the talking horses that will allegedly replace human software developers. Efforts in the neighborhood of harnessing and agents are an area we are looking at closely. This is a brief look at three recent techniques that have stood out.
The first is “Recursive Language Models” (RLM) (paper). The technique is motivated as a response to the challenges of very long context computation, where attention becomes very resource intensive and yet we still experience “context rot”. Context rot means that task performance decays or collapses as the context gets larger. The risk of assuming that a longer context is simply better when using LLMs was among the risks in our original assessment. The RLM, described as a new “inference paradigm” (not a harness), puts the context into a REPL environment and asks the LM to generate code and new prompts to “programmatically examine, decompose, and recursively call itself over snippets”. Intermediate results are stored as variables in the REPL workspace environment and the final state goes into a named variable, signaling completion. This is the harness, and it engages the model in a particular set of behaviors (through code and prompt generation) in an environment defined by an arbitrarily large provided prompt. The result is effective handling of prompts multiple orders of magnitude larger, with dramatic outperformance on a set of relevant tasks. With this harness, tiny models can perform as well as “vanilla” frontier models! We often talk about WHAT and HOW machines, and we identified a top “misuse” risk as trying to approach any computational task using auto-regressive text completion. The RLM approach uses code generation and execution in a REPL to move away from this, with great impact. Along the way RLM also mitigates risks related to representational transparency and black-box behavior by executing inspectable code and storing intermediate results in its environment! It also turns out that these traces can be put to further use.
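The REPL loop described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the RLM authors' implementation: `toy_root_model` and `toy_sub_model` are hypothetical, scripted stand-ins for real model calls, the variable name `FINAL` is our choice for the completion signal, and `exec` is used with no sandboxing at all.

```python
from typing import Any, Callable, Dict, List


def toy_sub_model(snippet: str) -> int:
    """Hypothetical stand-in for a recursive model call on a small snippet.
    Here it simply counts lines that contain 'ERROR'."""
    return sum(1 for line in snippet.splitlines() if "ERROR" in line)


def toy_root_model(question: str, history: List[str]) -> str:
    """Hypothetical stand-in for the root model: returns Python source.
    A real model would generate this code; here it is scripted."""
    if not history:
        # Step 1: decompose the huge context and query snippets one by one.
        return (
            "lines = context.splitlines()\n"
            "chunks = ['\\n'.join(lines[i:i + 100])"
            " for i in range(0, len(lines), 100)]\n"
            "partials = [sub_query(chunk) for chunk in chunks]\n"
        )
    # Step 2: combine the stored intermediate results and signal completion.
    return "FINAL = sum(partials)\n"


def run_repl_harness(
    context: str,
    question: str,
    root_model: Callable[[str, List[str]], str],
    sub_model: Callable[[str], Any],
    max_steps: int = 5,
) -> Any:
    # The context lives in the environment as a variable, not in the prompt.
    env: Dict[str, Any] = {"context": context, "sub_query": sub_model}
    history: List[str] = []  # inspectable trace of generated code
    for _ in range(max_steps):
        code = root_model(question, history)
        exec(code, env)  # assumption: trusted code; a real system needs a sandbox
        history.append(code)
        if "FINAL" in env:  # the named variable signals completion
            return env["FINAL"]
    raise RuntimeError("no FINAL variable set within max_steps")


demo_context = "\n".join(
    "ERROR disk full" if i % 50 == 0 else "ok" for i in range(1000)
)
answer = run_repl_harness(
    demo_context, "How many error lines?", toy_root_model, toy_sub_model
)
print(answer)
```

The root model never sees the full context in this sketch. It only emits code that slices the context and delegates snippets, and every intermediate value (`chunks`, `partials`) remains in `env` for later inspection.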
The second example is “AutoHarness: improving LLM agents by automatically synthesizing a code harness” (paper), with “harness” appearing twice in the title. This effort is motivated by the failure of language models, even frontier ones, to repeatedly generate valid moves in game environments. Again the word environment, its use here more direct: the formal world of a game. A familiar observation here is the presence of Potemkin understanding (paper): a model may describe or apparently understand the rules of a game but cannot reliably generate legal moves. The authors describe a harness as “the glue or plumbing between the model and the task that needs to be solved” and talk about harness use in two ways. The first approach is to ask the model to generate validation code for proposed game moves and then interleave the generated validator with auto-regressive answering, which they call “harness as verifier”. The second approach is to ask the model to generate code that produces valid moves directly, which they call “harness as policy”. Both approaches attempt to constrain the behavior of the model to conform to the game rules, and in this way “harness” the behavior. But there is also a third form, the overarching “AutoHarness” strategy itself. Again the harnessed strategies greatly outperform even vanilla frontier models. In the case of harness as policy, where we ask the WHAT machine to create a little HOW machine, outperformance shows up both in task performance and dramatically in compute cost.
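To make the two modes concrete, here is a minimal Python sketch on a tic-tac-toe board. Everything in it is an assumption for illustration: the game, the function names, and the hand-written `is_legal` validator and `harness_as_policy` function, which stand in for code that the model itself would generate in the setup described above. `toy_model_propose` is a hypothetical stand-in for auto-regressive move generation.

```python
import random
from typing import Callable, List, Optional

Board = List[str]  # nine cells, each "X", "O", or " " (empty)


def is_legal(board: Board, move: int) -> bool:
    """Validator for proposed moves. Hand-written here; in the approach
    described above this code would be generated by the model."""
    return 0 <= move < 9 and board[move] == " "


def toy_model_propose(board: Board, rng: random.Random) -> int:
    """Hypothetical stand-in for a model that knows the move format but
    ignores which cells are taken, so it often proposes illegal moves."""
    return rng.randrange(9)


def harness_as_verifier(
    board: Board,
    propose: Callable[[Board, random.Random], int],
    rng: random.Random,
    max_tries: int = 50,
) -> Optional[int]:
    """Interleave model proposals with the validator: resample until a
    proposal passes, or give up after max_tries."""
    for _ in range(max_tries):
        move = propose(board, rng)
        if is_legal(board, move):
            return move
    return None


def harness_as_policy(board: Board) -> Optional[int]:
    """Code that produces a legal move directly, with no model call at
    play time (here: the first empty cell)."""
    for index, cell in enumerate(board):
        if cell == " ":
            return index
    return None


board = ["X", "O", " ", "X", " ", "O", " ", " ", "X"]
verified_move = harness_as_verifier(board, toy_model_propose, random.Random(0))
policy_move = harness_as_policy(board)
print(verified_move, policy_move)
```

The verifier mode still pays for a model call on every attempt, while the policy mode calls no model at play time, which is one way to read the compute-cost gap mentioned above.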
Full recognition of the centrality of the harness shows up in “Meta-Harness: End-to-End Optimization of Model Harnesses” (paper). The goal here is to use a strong coding model called a proposer to create new task-specific harnesses by examining the code, execution, and performance of other harnesses. This is one place where the traces described before are put to good use. Interestingly, the task-specific harness may be executed by a weaker model. This method is capable of dramatically improving upon existing harnesses with limited compute, in ways that transfer across the models that execute the harness and across a range of tasks.
In each of these cases small models can vastly outperform “vanilla” frontier models, that is, frontier models harnessed only through a sampling algorithm. Meta-Harness can improve upon harnesses similar to RLM (hybrids of HOW and WHAT machines) and beat hand-coded approaches. Omar Khattab, one of the authors here, is also an author of the RLM paper.
There is something fundamental happening here, with implications for performance, computational costs, interpretability, and risk management. We are also seeing these patterns in other work, so we will continue to revisit the topic.
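The propose-and-evaluate idea can be sketched as a small loop over an archive of harnesses. This is a loose toy under stated assumptions, not the Meta-Harness method itself: the task (summing numbers in a string), the `Record` structure, and the scripted `toy_proposer` are all hypothetical. A real proposer would be a strong coding model reading the same code, traces, and scores.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy tasks (assumption): sum the integers in a delimited string.
TASKS = [("1,2,3", 6), ("4;5", 9), ("10, 20", 30)]

SEED_HARNESS = '''
def solve(task):
    return sum(int(part) for part in task.split(","))
'''

PATCHED_HARNESS = '''
def solve(task):
    return sum(int(part) for part in task.replace(";", ",").split(","))
'''


@dataclass
class Record:
    harness_source: str  # code of the candidate harness
    score: float         # fraction of tasks solved
    trace: str           # per-task execution log


def evaluate(source: str) -> Record:
    """Run a candidate harness on every task, keeping score and trace."""
    env: dict = {}
    exec(source, env)  # assumption: trusted code; a real system needs a sandbox
    passed, log = 0, []
    for task, expected in TASKS:
        try:
            ok = env["solve"](task) == expected
            log.append(f"{task!r}: {'pass' if ok else 'wrong answer'}")
        except Exception as exc:
            ok = False
            log.append(f"{task!r}: {type(exc).__name__}")
        passed += ok
    return Record(source, passed / len(TASKS), "\n".join(log))


def toy_proposer(archive: List[Record]) -> str:
    """Hypothetical stand-in for the proposer model: reads the best
    harness's trace and returns a (scripted) repaired harness."""
    best = max(archive, key=lambda record: record.score)
    if "ValueError" in best.trace:
        return PATCHED_HARNESS
    return best.harness_source


def meta_search(
    seed: str, proposer: Callable[[List[Record]], str], rounds: int = 3
) -> Record:
    archive = [evaluate(seed)]
    for _ in range(rounds):
        archive.append(evaluate(proposer(archive)))
    return max(archive, key=lambda record: record.score)


best = meta_search(SEED_HARNESS, toy_proposer)
print(best.score)
print(best.trace)
```

The point of the sketch is only the data flow: the proposer sees harness code, execution traces, and scores together, and the winning harness is plain code that any executor can run afterwards.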
BIML is proud to host Patrick McDaniel, an OG of machine learning security (prominently featured in the BIML TOP 5) and a Dean of Research at Wisconsin, for a visit to the BIML Barn. Patrick arrived in Berryville late on Thursday and was greeted with a Liberal or two on the porch. We stayed up way too late talking about AI and security.

In the morning after breakfast, we spent much of the Friday research discussion going over our soon-to-be-released paper No Security Meter for AI. Patrick has been thinking about measuring ML behavior for a long time, and was an early proponent of a whitebox approach. He had lots of very useful feedback for us.
Does science really get done around the kitchen table? Why yes. Yes it does. (And technical talks really get delivered in the BIML Barn.)

We ventured into greater metropolitan Berryville for lunch and coffee.

And then Patrick delivered a new talk as a BIML in the Barn feature to be released on May 13th. Patrick’s talk really surprised us in very important philosophical ways.




After the talk we shared a cocktail on the patio. Maybelline is an honorary BIML dog.

Patrick enjoys a well-deserved Lemon Mint Fizz.


And then it was off to dinner with BIML spouses at Huntōn in Leesburg.
Fantastic visit. These kinds of human interaction are absolutely critical as we construct a reasonable approach to machine learning security.

Gary McGraw, cofounder of the Berryville Institute of Machine Learning, pointed to a core gap: Today’s benchmarks tend to measure how well AI systems can perform security tasks—not how secure the systems themselves are. Companies need to keep that distinction in mind when evaluating their tools and defenses.
McGraw warned as far back as 2019 that securing machine learning systems would be “one of the defining cybersecurity struggles of the next decade.” That moment has now arrived.
“These meetings are a way to remind ourselves of the fundamentals,” he said, “as we try to define what machine learning security actually is.”
What was to be a more standard copy of the BIML risk talk was instead transformed into a debut of BIML’s forthcoming paper No Security Meter for AI (expected mid-May) for an audience of NIST computer scientists.

It’s always fun to debut a talk for an audience that is engaged and knowledgeable.



While we were inside the very industrial Chemistry building for a talk that was 80% Zoom, it rained outside.

