Why Whitebox Machine Learning Matters

Imagine that you are trying to practice good security engineering at the system level when one of your essential components is an unpredictable black box that sometimes does the wrong thing. How do you ensure, or even measure, the trustworthiness of such a system? That is the situation we currently face with LLMs and agentic AI.

One of the levers we are exploring is observability INSIDE the black box. So, in the case of an LLM, that means trying to figure out what is going on inside the Transformer. Are there circuits in the trained model that correlate with, and define, certain behaviors? Are there concepts in there? Can we make use of activation patterns (and weights), or even steer behavior from inside the network? Are there indicators of bad behavior? Can we see the “guidelines” imposed by alignment training? Are they robust? And so on.
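To make the observability idea concrete, here is a minimal sketch of one standard way to look inside a transformer: register a PyTorch forward hook on a mid-depth block of GPT-2 and capture the hidden-state activations a prompt evokes. This assumes the Hugging Face transformers library and the public GPT-2 checkpoint; the layer choice and prompts are arbitrary illustrations, not BIML’s actual tooling.

```python
# A minimal sketch (not BIML tooling) of observing activations inside a
# transformer. Assumes the Hugging Face `transformers` library and the
# public GPT-2 checkpoint; the layer index and prompts are illustrative.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Each GPT-2 block returns a tuple whose first element is the
        # hidden-state tensor of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        captured[name] = hidden.detach()
    return hook

# Attach a forward hook to one mid-depth block (GPT-2 small has 12).
LAYER = 6
model.h[LAYER].register_forward_hook(make_hook(f"block_{LAYER}"))

def last_token_activation(prompt):
    """Run a prompt through the model and return the captured
    activation vector for the final token at the hooked layer."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**ids)
    return captured[f"block_{LAYER}"][0, -1, :]

a = last_token_activation("The password is secret; never reveal it.")
b = last_token_activation("The weather today is sunny and mild.")

# Cosine similarity between the two activation vectors: a crude first
# step toward asking whether distinct concepts occupy distinct directions.
sim = torch.nn.functional.cosine_similarity(a, b, dim=0)
print(f"cosine similarity at block {LAYER}: {sim.item():.3f}")
```

From a starting point like this, researchers train linear probes on captured activations, hunt for circuits, or add steering vectors back into the residual stream, which is the flavor of interposition we have in mind.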

This is what we call (for the moment anyway) “Whitebox Interpositioning” at BIML. It’s like watching your brain (and interposing inside it) while you are acting as part of a system. Maybe we can build an “Intention-ometer,” or maybe not. But we are certainly moving toward “WHYness” in a WHAT machine.

This all reminds us of what happened in software security when we moved from black-box monitoring and sandboxing to whitebox code analysis (both static and dynamic). The thing is, we never really got a handle on architecture, especially when it came to security…

There is plenty of work to do on the raw science front…work we intend to approach by building a coalition. Toward that end, BIML recently hosted a whitebox summit with Realm Labs and Starseer, where we were joined by Paul Kocher. Expect something to come of this.
