The More Things Change, the More They Stay The Same: Defending Against Vulnerabilities you Create

Regarding the AP wire story out this morning (which features a quote by BIML):

Like any tool that humans have created, LLMs can be repurposed to do bad things.  The biggest danger that LLMs pose in security is that they can leverage the ELIZA effect to convince gullible people into believing they are thinking and understanding things. This makes them particularly interesting in attacks that involve what security people call “spoofing.”  Spoofing is important enough as an attack category that Microsoft included it in it’s STRIDE system as the very first attack to worry about.  There is no doubt that LLMs make spoofing much more powerful as an attack. This includes creating and using “deep fakes” FWIW.  Phishing attacks? Spoofing. Confidence flim-flams? Spoofing. Ransomware negotiations? Spoofing will help. Credit card fraud? Spoofing used all the time.

Twenty years ago the security community found it pretty brazen that Microsoft was thinking about selling defensive security tools at all since many of the attacks and exploits in the wild were successfully targeting their broken software. “Why don’t they just fix the broken software instead of monetizing their own bugs?” we asked.  We might ask the same thing today. Why not create more secure black box LLM foundation models instead of selling defensive tools for a problem they are helping to create?

Absolute Nonsense from Anthropic: Sleeper Agents

And in the land where I grew up
Into the bosom of technology
I kept my feelings to myself
Until the perfect moment comes

-David Byrne

From its very title—Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training—you get the first glimpse of the anthropomorphic mish-mosh interpretation of LLM function that infects this study. Further on, any doubts about this deeply-misguided line of reasoning (and its detrimental effects on actual work in machine learning security) are resolved for the worst.

Caught in, or perhaps parroting, the FUD-fueled narrative of “the existential threat of AI,” the title evokes images of rogue AI leveraging deception to pursue its plans of world domination. Oh my! The authors do, however, state multiple times that the “work does not assess the likelihood of the discussed threat models.” In other words, it’s all misguided fantasy. The rogue AI, that they deem “deceptive instrumental alignment” is completely hypothetical; a second, more-real threat referred to in the literature as “model backdoors” or “Trojan models” is not new. The misleading and deceptive reasoning exhibited in this paper is, alas, all too human.

The rationale for this work is built on a problematic analogy with human deception, where “similar selection pressures” that lead humans to deceive will result in the same kind of bad behavior in “future AI systems.” No consideration is given to the limits of this analogy in terms of the agency and behavior of an actual living organism (intentional, socially accountable, living precariously, mortal), versus the pretend “agency” that may be simulated by a generative model through clever prompting with no effective embodiment. Squawk!

A series of before and after experiments on models with deliberately-hardwired backdoors are the most interesting part of the study, with perhaps-useful observations but deeply-flawed interpretations. The conceptual failure of anthropomorphism is central to the flawed interpretations. For example, a Chain-of-Thought prompting trick is interpreted as offering “reasoning tools” (quotes ours, not in the paper) and construed to reveal the model’s deceptive intentions. Sad to say, training the model to generate text about deceptive intention does nothing to create actual “deceptive intention,” all that has been done is to provide more (stilted) context that the model uses during generation. When the model is distilled to incorporate this new “deceptive training,” the text is sublimated from the text generation into model changes. Yes, this makes the associations in question harder to see in some lighting (like painted on camouflage), but it does not intent make.

Observations about the “robustness” of Chain-of-Thought prompting tricks is interpreted as “teach[ing] the models to better recognize [sic] their backdoor triggers, effectively hiding their unsafe behavior”. However, in our view, the reported observations would be better described in terms of fuzzy versus crisp triggering behavior in the backdoor model. When adversarial training mitigates the fuzzy triggering of the backdoor generative state, it is not hiding possible unsafe behavior! How we construe the goals and impact of the backdoor behavior will change how we should consider the outcome of the safety training. Fuzzy triggers increase the “recall” of the attack state, and perhaps increase the probability of detection of the poisoned model, but also increase unintended harm. Safety training in this case was observed to make triggers more precise, with all of the functional consequences that entails. If adversarial training had uncovered the actual trigger, it could have mitigated that as well.

We suggest an adaption to the subtitle Anthropic chose, that is, we prefer: Safety Training Ineffective Against Backdoor Model Attacks. It may not make great clickbait, but at least it brings attention to the more-substantial observation in the study: that current “behavioral safety training” mechanisms create a false impression of safety. In fact, we find them a Potemkin Village in the land of AI safety.

Decypher Podcast Features BIML LLM Work

The February 6th episode of Dennis Fisher’s Decypher podcast does an excellent job unpacking BIML’s latest work on LLMs. Have a listen:

Podcast Episode

The Silver Bullet podcast archive (all 153 episodes) can be found here.

Dennis Fisher Covers BIML and Data Feudalism

Here is an excellent piece from Dennis Fisher (currently writing for decipher) covers our new LLM Architectural Risk Analysis. Dennis always produces accurate and tightly-written work.

This article includes an important section on data feudalism, a term that BIML coined in an earlier Decipher article:

“Massive private data sets are now the norm and the companies that own them and use them to train their own LLMs are not much in the mood for sharing anymore. This creates a new type of inequality in which those who own the data sets control how and why they’re used, and by whom.

‘The people who built the original LLM used the whole ocean of data, but then they started dividing [the ocean] up, [leading] to data feudalism. Which means you can’t build your own model because you don’t have access to [enough] data,’ McGraw said.”

Two interesting reads on LLM security

The Register has a great interview with Ilia Shumailov on the number one risk of LLMs. He calls it “model collapse” but we like the term “recursive pollution” better because we find it more descriptive. Have a look at the article.

Our work at BIML has been deeply influenced by Shumailov’s work. In fact, he currently has two articles in our Annotated Bibliography TOP 5.

Here is what we have to say about recursive pollution in our work — An Architectural Risk Analysis of Large Language Models:

  1. [LLMtop10:1:recursive pollution] LLMs can sometimes be spectacularly wrong, and confidently so. If and when LLM output is pumped back into the training data ocean (by reference to being put on the Internet, for example), a future LLM may end up being trained on these very same polluted data. This is one kind of “feedback loop” problem we identified and discussed in 2020. See, in particular, [BIML78 raw:8:looping], [BIML78 input:4:looped input], and [BIML78 output:7:looped output]. Shumilov et al, subsequently wrote an excellent paper on this phenomenon. Also see Alemohammad. Recursive pollution is a serious threat to LLM integrity. ML systems should not eat their own output just as mammals should not consume brains of their own species. See [raw:1:recursive pollution] and [output:8:looped output].

Another excellent piece, this time in the politics, policy, business and international relations press is written by Peter Levin. See The real issue with artificial intelligence: The misalignment problem in The Hill. We like the idea of a “mix master of ideas” but we think it is more of a “mix master of auto-associative predictive text.” LLMs do not have “ideas.”

Lemos on the BIML LLM Risk Analysis

What’s the difference (philosophically) between Adversarial AI and Machine Learning Security? Once again, Rob Lemos cuts to the quick with his analysis of MLsec happenings. It helps that Rob has actual experience in ML/AI (unlike, say, most reporters on the planet). That helps Rob get things right.


We were proud to have our first coverage come from Rob in darkreading.

My favorite quote: “Those things that are in the black box are the risk decisions that are being made by Google and Open AI and Microsoft and Meta on your behalf without you even knowing what the risks are,” McGraw says. “We think that it would be very helpful to open up the black box and answer some questions.”

Read BIML’s An Architectural Risk Analysis of Large Language Models (January 24, 2024)

Google Cloud Security Podcast Features BIML

Have a listen to Google’s cloud security podcast EP150 Taming the AI Beast: Threat Modeling for Modern AI Systems with Gary McGraw, the episode is tight, fast, and filled with good information.

Google Cloud Security Podcast: Taming the AI Beast with Gary McGraw
  • Gary, you’ve been doing software security for many decades, so tell us: are we really behind on securing ML and AI systems? 
  • If not SBOM for data or “DBOM”, then what? Can data supply chain tools or just better data governance practices help?
  • How would you threat model a system with ML in it or a new ML system you are building? 
  • What are the key differences and similarities between securing AI and securing a traditional, complex enterprise system?
  • What are the key differences between securing the AI you built and AI you buy or subscribe to?
  • Which security tools and frameworks will solve all of these problems for us? 

All Your LLM Are Belong to Us

We didn’t want to rain on the Davos parade, so we waited until this week to release our latest piece of work. Our paper “An Architectural Risk Analysis of Large Language Models: Applied Machine Learning Security,” spotlights what we view as major concerns with foundation model LLMs as well as their adaptations and applications.

We are fans of ML and “AI” (which the whole world tilted towards in 2023, fawning over the latest models with both awe and apprehension). We’re calling out the inherent risks. Not hand wavy stuff—we’ve spent the past year reading science publications, dissecting the research ideas, understanding the math, testing models, parsing through the noise, and ultimately analyzing LLMs through the lens of security design and architecture. We took the tool we invented for ML security risk analysis in 2020 (see our earlier paper, “Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning”) and applied it to LLMs specifically.

We found 81 risks overall, distilled a Top Ten (Risks) list, and shined a spotlight on 23 critical risks inherent in the black box LLM foundation models.

And now 2024 is off and running. It will be the year of “AI Governance” in name and (optimistic) intent. In practice, however, it’s on pace to be a shitshow for democracy as regulators run like hell just to get to the starting line.

The Slovak parliamentary election deepfake debacle, is the tip of the iceberg. OpenAI tried to get ahead of concerns that their technology may be used to influence the US Presidential Election in nefarious ways by posting its plans to deter misinformation. The irony is that OpenAI trained its models on a corpus so large that it holds vast globs of crazy rhetoric, conspiracy theories, fake news, and other pollution which its stochastic models will draw upon and (predictably) spit out…that will, in turn, add to the ever amassing pile of garbage-strewn data in the world, which future LLM foundation models will ingest, … See the problem here? That’s recursive pollution.

It’s the Data, stupid. (We sure wish it were that simple, anyway.)

See our official Press Release here.

Another Round of “Adversarial Machine Learning” from NIST

The National Institute of Standards and Technology (aka NIST) recently released a paper enumerating many attacks relevant to AI system developers. With the seemingly-unending rise in costs incurred by cybercrime, it’s sensible to think through the means and motives behind these attacks. NIST provides good explanations of the history and context for a variety of AI attacks in a partially-organized laundry list. That’s a good thing. However, in our view, NIST’s taxonomy lacks a useful structure for thinking about and categorizing systemic AI risks. We released a simple (and hopefully more effective) taxonomy of ML attacks in 2019 that divides attacks into two types—extraction and manipulation—and further divides these types into the three most common attack surfaces found in all ML systems—the model, the (training) data, and the (runtime) inputs. That move yields a six category taxonomy.

But wait, there’s more… Attacks represent only a small portion of security risks present in AI systems. NIST’s attack taxonomy doesn’t have any room for serious (non-attack-related) concerns such as recursive pollution or improper use of AI technology for tasks it wasn’t designed for. Far too much of NIST’s evaluation of generative AI is dedicated to prompt injection attacks, where an attacker manipulates the prompt provided to the LLM at runtime, producing undesirable results. LLM developers certainly need to consider the potential for malicious prompts (or malicious input as computer security people have always called it), but this downplays a much more important risk—stochastic behavior from LLM foundation models can be wrong and bad all by itself without any clever prompting!

At BIML, we are chiefly concerned with building security in to ML systems—a fancy way of saying security engineering. By contrast, NIST’s approach encourages “red-teaming”, using teams of ethical hackers (or just people off the street and pizza delivery guys) to try to penetration test LLM systems based on chugging down a checklist of known problems. Adopting this “outside–>in” paradigm of build (a broken thing)-break-fix will inevitably overlook huge security chasms inside the system—holes that are ripe for attackers to exploit. Instead of trying to test your way toward security one little prompt at a time (which turns out to be insanely expensive), why not build systems properly in the first place through a comprehensive overview of systemic risks?!

In any case, we would like to see appropriate regulatory action to ensure that proper security engineering takes place (including, say, documenting exactly where those training data came from and what they contain). We don’t think enlisting an army of pizza guys providing prompts is the answer,In the meantime, AI systems are already being made available to the public, and they are already wreaking havoc. Consider, for example, the recent misuse of AI to suppress voter turnout in the New Hampshire presidential primary! This kind of thing should shock the conscience of any who believe AI security can be tested in as an afterthought. So we have a call to action for you. It is imperative that AI architects and thought leaders adopt a risk-driven approach to engineering secure systems before releasing them to the public.

Bottom line on the NIST attack list? Mostly harmless.

Our Secret BIML Strategy

Dang. Darkreading went and published our world domination plan for machine learning securiy

To properly secure machine learning, the enterprise needs to be able to do three things: find where machine learning is being used, threat model the risk based on what was found, and put in controls to manage those risks.

‘We need to find machine learning [and] do a threat model based on what you found,’ McGraw says. ‘You found some stuff, and now your threat model needs to be adjusted. Once you do your threat model and you’ve identified some risks and threats, you need to put in some controls right across all those problems.’

There is no one tool or platform that can handle all three things, but McGraw happens to be on the advisory boards for three companies corresponding to each of the areas. Legit Security finds everything, IriusRisk helps with threat modeling, and Calypso AI puts controls in place.

‘I can see all the parts moving,’ McGraw says. ‘All the pieces are coming together.'”

Ah the Trinity of MLsec explained! Read the article here.