Security people are quick to point out that security is like a chain. And just as a chain is only as strong as the weakest link, an ML system is only as secure as its weakest component. Want to anticipate where bad guys will attack your ML system? Well, think through which part would be easiest to attack.
ML systems are different from many other artifacts we engineer because the data in ML are just as important as (and sometimes even more important than) the learning mechanism itself. That means we need to pay even more attention to the data used to train, test, and operate an ML system than we might in a standard system.
In some sense, this turns the idea of an attack surface on its head. To understand what we mean, consider that the training data in an ML system often come from a public source, one that may be subject to poor data protection controls. If that's the case, perhaps the easiest way to attack an ML system of this flavor is by polluting or otherwise manipulating the data before they even arrive. An attacker wins if they get to the ML-critical data before the ML system even starts to learn. Who cares about the public API of the fully trained and operating ML system if the data used to build it were maliciously constructed in the first place?
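To make this concrete, here is a minimal sketch of the kind of pre-training manipulation we mean: a crude label-flipping pass over a publicly hosted training file before the ML pipeline ever sees it. The file name, column name, and flip rate are all made up for illustration.

```python
# Illustrative sketch only: an attacker quietly flips a fraction of the
# labels in a public training file before the ML system ingests it.
# "public_training_data.csv" and the "label" column are hypothetical.
import csv
import random

POISON_RATE = 0.05  # flip 5% of labels

with open("public_training_data.csv", newline="") as src, \
     open("poisoned_training_data.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if random.random() < POISON_RATE:
            # Flip a binary label; the learner will dutifully absorb the lie.
            row["label"] = "0" if row["label"] == "1" else "1"
        writer.writerow(row)
```

Nothing downstream of the data feed will notice; the model simply learns what it is given.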
Thinking about ML data as money is a useful exercise. Where does the “money” (that is, data) in the system come from? How is it stored? Can counterfeit money help in an attack? Does all of the money get compressed into high-value storage in one place (say, the weights and thresholds learned in the ML system's distributed representation)? How does money come out of an ML system? Can money be transferred to an attacker? How would that work?
Let's stretch this analogy even further. When it comes to actual money, a sort of perverse logic pervades the physical security world. There is generally more money in a bank than in a convenience store, but which one is more likely to be held up? The convenience store, because banks tend to have much stronger security precautions; convenience stores are a much easier target. Of course, the payoff for successfully robbing a convenience store is much lower than for knocking off a bank, but it is probably a lot easier to get away from the convenience store crime scene. The point for us is that you want to look for and better defend the convenience stores in your ML system.
ML has another weird factor worth considering: much of the source code is open and reused all over the place. Should you trust that algorithm you snagged from GitHub? How does it work? Does it protect those oh-so-valuable data sets you built up? What if the algorithm itself has been sneakily compromised? These are potential weak links that may not be considered in a traditional security stance.
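One simple defense against that particular weak link, sketched below under assumed names, is to pin and verify a cryptographic checksum for anything you pull in from outside, whether a pretrained model file or a third-party component, before letting it anywhere near your pipeline.

```python
# Minimal sketch: verify a pinned SHA-256 digest for a downloaded artifact
# instead of trusting it blindly. The file name and expected digest below
# are placeholders, not real values.
import hashlib

EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

digest = sha256_of("downloaded_model.bin")
if digest != EXPECTED_SHA256:
    raise RuntimeError(f"Integrity check failed: {digest}")
```

A pinned hash does not tell you the code is trustworthy, only that it is the same code you reviewed the first time; that alone closes off a lot of quiet tampering.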
Identifying the weakest component of a system falls directly out of a good risk analysis. Given good risk analysis information, it is always prudent to address the most serious risk first, rather than whichever risk happens to be easiest to mitigate. Security resources should be doled out according to risk. Deal with one or two major problems, and move on to the remaining ones in order of severity.
Of course, this strategy can be applied forever, because 100% security is never attainable. There is a clear need for some stopping point. It is okay to stop addressing risks when all components appear to be within the threshold of acceptable risk. The notion of acceptability depends on the business proposition.
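Here is a toy sketch of that prioritization-and-stopping logic, with made-up risks, severity scores, and an arbitrary acceptability threshold, just to show the shape of the loop.

```python
# Toy sketch: rank risks by severity, work down the list, and stop once
# everything left falls under an acceptable-risk threshold. The risks,
# scores, and threshold are invented for illustration.
risks = [
    {"name": "public training data can be poisoned", "severity": 9},
    {"name": "unvetted open source algorithm",       "severity": 7},
    {"name": "model weights readable by operators",  "severity": 4},
    {"name": "verbose error messages in the API",    "severity": 2},
]

ACCEPTABLE = 3  # business-defined threshold, not a universal constant

for risk in sorted(risks, key=lambda r: r["severity"], reverse=True):
    if risk["severity"] <= ACCEPTABLE:
        break  # remaining risks are within tolerance; stop spending here
    print(f"mitigate next: {risk['name']} (severity {risk['severity']})")
```

Where you draw the threshold is a business decision, not a technical one, which is exactly the point of the paragraph above.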
All of our analogies aside, good security practice dictates an approach that identifies and strengthens weak links until an acceptable level of risk is achieved.