BIML Coins a Term: Data Feudalism

Decipher covers the White House AI Executive Order, with the last word to BIML. Read the article from October 31, 2023 here.

Much of what the executive order is trying to accomplish involves things that the software and security communities have been working on for decades, with limited success.

“We already tried this in security and it didn’t work. It feels like we already learned this lesson. It’s too late. The only way to understand these systems is to understand the data from which they’re built. We’re behind the eight ball on this,” said Gary McGraw, CEO of the Berryville Institute of Machine Learning, who has been studying software security for more than 25 years and is now focused on AI and machine learning security.

“The big data sets are already being walled off and new systems can’t be trained on them. Google, Meta, Apple, those companies have them and they’re not sharing. The worst future is that we have data feudalism.”

Another challenge in the effort to build safer and less biased models is the quality of the data on which those systems are being trained. Inaccurate, biased, or incomplete data going in will lead to poor results coming out.

“We’re building this recursive data pollution problem and we don’t know how to address it. Anything trained on a huge pile of data is going to reflect the data that it ate,” McGraw said. “These models are going out and grabbing all of these bad inputs that in a lot of cases were outputs from the models themselves.”

“It’s good that people are thinking about this problem. I just wish the answer from the government wasn’t red teaming. You can’t test your way out of this problem.”

BIML is Born

Welcome to the BIML blog where we will (informally) write about MLsec, otherwise known as Machine Learning security. BIML is short for the Berryville Institute of Machine Learning. For what it’s worth, we think it is pretty amusing to have a “Berryville Institute” just like Santa Fe has the “Santa Fe Institute.” You go, Berryville!

BIML was born when I retired from my job of 24 years in January 2019. Many years ago, as a graduate student at Indiana University, I did lots of work in machine learning and AI as part of my Ph.D. program in Cognitive Science. As a student of Doug Hofstadter's, I was deeply interested in emergent computation, sub-symbolic representation, error making, analogy, and low-level perceptual systems. I was fortunate to be surrounded by lots of fellow students interested in building things and finding out how the systems we were reading about in papers actually worked. Long story short, we built lots of real systems and published a bunch of papers in the scientific literature about what we learned.

The official BIML logo, designed by Jackie McGraw

Our mission at BIML is to explore the security implications built into ML systems. We're starting with neural networks, which are all the rage at the moment, but we intend to think and write about genetic algorithms, sparse distributed memory, and other ML systems. Just to make this perfectly clear: we're not really thinking much about using ML for security; rather, we are focused on the security of ML systems themselves.

Fast forward 24 years. As one of the fathers of software security and security engineering at the design level, I have been professionally interested in how systems break, what kinds of risks are inherent in system design, and how to design systems that are more secure. At BIML we are applying these techniques and approaches directly to ML systems.

Through a series of small world phenomena, the BIML group coalesced, sparked first when I met Harold Figueroa at an Ntrepid Technical Advisory Board meeting in the Fall of 2018 (I am the Chair of Ntrepid's TAB, and Harold leads Ntrepid's machine learning research group). Harold and I had a great initial discussion over dinner about representation, ML progress, and other issues. We decided that continuing those discussions and digging into some research was in order. Victor Shepardson, who did lots of ML work at Dartmouth as a Masters student, was present for our first meeting in January. We quickly added Richie Bonett, a Berryville local like me (!!) and a Ph.D. student at William and Mary, to the group. And BIML was officially born.

We started with a systematic and in-depth review of the MLsec literature. You can see the results of our work in the annotated bibliography, which we will continue to curate as we read and discuss papers.

Our second task was to develop an Attack Taxonomy that makes sense, at a meta-level, of the burgeoning ML attack literature. These days, lots of energy is being expended attacking particular ML systems. Some of the attacks are quite famous (stop sign recognition and seeing cats as tanks both come to mind), and the popular press has made much of both ML progress and amusing attacks against ML systems. You can review the (ongoing) Attack Taxonomy work elsewhere on our website.

We’re now in the midst of an Architectural Risk Analysis (ARA) of a generic ML system. Our approach follows the ARA process introduced in my book Software Security and applied professionally for many years at Cigital. We plan to publish our work here as we make progress.

We’re really having fun with this group, and we hope you will get as much of a kick out of our results as we’re getting. We welcome contact and collaboration. Please let us know what you think.