A Taxonomy of ML Attacks

Victor Shepardson, Gary McGraw, Harold Figueroa, Richie Bonett — May, 2019

Here we consider attacks on ML algorithms, as opposed to peripheral attacks¹ or attacks on ML infrastructure (e.g., software frameworks or hardware accelerators).

We divide attacks into extraction attacks (which compromise confidentiality) and manipulation attacks (which compromise integrity). Orthogonally, we consider three surfaces of an ML system: the model, the (training) data, and the inputs (at runtime)². Each component can be the vector for a manipulation attack or the object of an extraction attack, resulting in our six categories:

  • input manipulation
  • data manipulation
  • model manipulation
  • input extraction
  • data extraction
  • model extraction

Input manipulation, also called an “adversarial example” attack [Goodfellow14], is a manipulation attack on an ML model at runtime. In this case, an attacker concocts an input to an operating ML system that reliably produces an output different from the one its creators intend. Examples include a stop sign being classified as a speed limit sign [Eykholt17]; a spam email being classified as not spam [Biggio13]; or a vocal utterance being transcribed as unrelated text [Carlini18]. For surveys of input manipulation techniques against deep learning and classical ML systems, see [Yuan19] and [Biggio13], respectively.
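
The fast gradient sign method of [Goodfellow14] is the canonical technique for input manipulation against differentiable models. The sketch below applies it to a toy logistic-regression “victim” whose weights are random stand-ins for a trained model; the model, feature dimension, and perturbation budget eps are illustrative assumptions rather than any particular published attack.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "victim": binary logistic regression with stand-in (pretend-trained) weights.
w = rng.normal(size=20)
b = 0.1

def predict_proba(x):
    """Victim model's probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, y_true, eps=0.25):
    """Fast gradient sign method: nudge x in the direction that raises the loss."""
    p = predict_proba(x)
    grad_x = (p - y_true) * w           # gradient of cross-entropy loss w.r.t. x
    return x + eps * np.sign(grad_x)    # bounded, sign-aligned perturbation

x = rng.normal(size=20)                 # a "clean" input, treated as class 1
x_adv = fgsm(x, y_true=1.0)

print("clean prediction for class 1:      ", predict_proba(x))
print("adversarial prediction for class 1:", predict_proba(x_adv))
```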

Data manipulation, also called a “poisoning” [Kloft07] or “causative” attack [Barreno06], is a manipulation attack on an operating model via the training process. An attacker modifies a data corpus used to train ML systems in order to impair or influence those systems’ behavior. For example, an attacker may publish bogus data to influence financial time-series forecasting models [Alfeld16] or to interfere with medical diagnoses [Mozaffari-Kermani15].
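
As a minimal illustration of poisoning, the sketch below assumes an attacker who can inject fabricated records into a victim’s training corpus: bogus points that mimic one class but carry the other class’s label. The synthetic data, scikit-learn model, and poisoning budget are assumptions made for demonstration; attacks in the literature craft their poison far more carefully.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic two-class data: class 1 is shifted by 0.5 in every feature."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 10)) + 0.5 * y[:, None]
    return X, y

X_train, y_train = make_data(500)
X_test, y_test = make_data(2000)

clean_model = LogisticRegression().fit(X_train, y_train)

# The attacker publishes bogus records that mimic class 1 but carry label 0,
# and they end up in the victim's training corpus.
n_bogus = 250
X_bogus = rng.normal(size=(n_bogus, 10)) + 0.5
y_bogus = np.zeros(n_bogus, dtype=int)
X_poisoned = np.vstack([X_train, X_bogus])
y_poisoned = np.concatenate([y_train, y_bogus])
poisoned_model = LogisticRegression().fit(X_poisoned, y_poisoned)

print("clean test accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned test accuracy:", poisoned_model.score(X_test, y_test))
```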

Model manipulation is implied by our taxonomy, though we find few examples in the literature. However, one can imagine an attacker publishing a white-box³ model with certain latent behavior, to be unwittingly adopted by third parties and later exploited by the attacker. It is common in the deep learning community to release models under a permissive open source license; given the prevalence of code reuse and transfer learning, we believe this topic deserves greater scrutiny.
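
To make the idea concrete, here is a hypothetical sketch of a published model with latent “trojan” behavior: it acts like an ordinary classifier unless a secret trigger pattern appears in the input. In practice such behavior would be hidden in the learned weights rather than in explicit code; the trigger, weights, and wrapper function here are inventions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

TRIGGER = np.array([3.0, -3.0, 3.0])   # secret pattern in the first three features
ATTACKER_CLASS = 1                     # output the attacker wants on triggered inputs

w = rng.normal(size=10)                # stand-in weights for the "legitimate" behavior

def published_model(x):
    """Model released by the attacker and adopted, unmodified, by a third party."""
    if np.allclose(x[:3], TRIGGER, atol=0.5):   # latent behavior: check for the trigger
        return ATTACKER_CLASS
    return int(w @ x > 0)                       # otherwise behave like a normal classifier

x = rng.normal(size=10)
x_triggered = x.copy()
x_triggered[:3] = TRIGGER

print("output on an ordinary input:", published_model(x))
print("output on a triggered input:", published_model(x_triggered))
```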

Input extraction, also called “model inversion,” applies in cases where model outputs are public but inputs are secret; an attacker attempts to recover inputs from outputs. Examples include inferring features of a medical record from the dosage recommended by an ML model [Fredrikson14], or producing a recognizable image of a face given only the identity (the classification in a face-recognition model) and a confidence score [Fredrikson15].
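
A much-simplified sketch of input extraction follows: given only a model’s confidence score for a target class, the attacker reconstructs a high-confidence input by gradient ascent, loosely in the spirit of [Fredrikson15]. The toy logistic-regression model, starting point, and step sizes are assumptions; the recovered vector stands in for the recognizable image a real inversion attack would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model that reports a confidence score for one identity.
w = rng.normal(size=64)
b = 0.0

def confidence(x):
    """Model's confidence that input x belongs to the target identity."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def invert(steps=200, lr=0.1):
    """Reconstruct an input the model scores highly, by ascending the confidence."""
    x = np.zeros_like(w)                 # start from a blank input
    for _ in range(steps):
        p = confidence(x)
        grad_x = p * (1.0 - p) * w       # gradient of the confidence w.r.t. x
        x = x + lr * grad_x              # gradient ascent step
    return x

x_recovered = invert()
print("confidence in the reconstructed input:", confidence(x_recovered))
```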

Data extraction is closely related to input extraction and sometimes also termed “model inversion”. In this case, an attacker extracts details of the data corpus an ML model was trained on [Ateniese13] [Shokri17]. Research in deep learning often focuses on the model to the exclusion of the data, yet data is known to be crucially important to a trained system’s behavior. And though research is often conducted on public datasets, real-world ML systems involve proprietary data with privacy implications.
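
One simple form of data extraction is membership inference [Shokri17]: deciding whether a particular record was in a model’s training corpus. The sketch below uses a deliberately crude version of the idea, thresholding an overfit model’s confidence in a record’s true label, rather than the shadow-model construction of [Shokri17]; the dataset, model, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic records with a weak relationship between features and label."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 20)) + 0.3 * y[:, None]
    return X, y

X_members, y_members = make_data(200)        # records in the training corpus
X_outsiders, y_outsiders = make_data(200)    # records the model never saw

model = RandomForestClassifier(n_estimators=100).fit(X_members, y_members)

def true_label_confidence(X, y):
    """Model's confidence in each record's true label."""
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y]

conf_members = true_label_confidence(X_members, y_members)
conf_outsiders = true_label_confidence(X_outsiders, y_outsiders)

# Overfit models are systematically more confident on records they were trained on;
# the attacker guesses "member" whenever the confidence clears a threshold.
threshold = 0.8
print("mean confidence (members / non-members):",
      conf_members.mean(), conf_outsiders.mean())
print("flagged as members (members / non-members):",
      np.mean(conf_members >= threshold), np.mean(conf_outsiders >= threshold))
```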

Model extraction is an extraction attack on the model itself. An attacker targets a less-than-fully-white-box ML system, attempting to “open the box” and copy its behavior or parameters. Model extraction may amount to theft of a proprietary model [Tramèr16], or it may enable white-box attacks on a black-box model [Papernot17] [Wang18].
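
The sketch below illustrates the basic shape of model extraction: the attacker sends chosen queries to a black-box prediction endpoint and fits a local surrogate to the returned labels, loosely following the approach of [Tramèr16]. The linear victim, query budget, and surrogate model are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

secret_w = rng.normal(size=10)   # the victim's parameters; the attacker never sees these

def victim_api(X):
    """Black-box prediction endpoint: the attacker sees only the returned labels."""
    return (X @ secret_w > 0).astype(int)

# The attacker issues synthetic queries and records the victim's answers...
X_queries = rng.normal(size=(2000, 10))
y_responses = victim_api(X_queries)

# ...then fits a local surrogate to the query/response pairs.
surrogate = LogisticRegression().fit(X_queries, y_responses)

# Agreement with the victim on fresh inputs measures how well the model was copied.
X_fresh = rng.normal(size=(2000, 10))
agreement = np.mean(surrogate.predict(X_fresh) == victim_api(X_fresh))
print("surrogate/victim agreement on fresh inputs:", agreement)
```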

1. For example, [Shokri15] observe that sensitive data becomes vulnerable when distributed to many third parties for ML purposes, and propose privacy-preserving training to obviate data sharing.

2. ML algorithms exist on a spectrum from offline to online. Offline algorithms place training prior to deployment; online algorithms continue to train after deployment. In the online setting, inputs and training data may not be distinct. However, we can say that input manipulation compromises behavior toward the malicious input, while data manipulation compromises behavior toward future inputs — the methods of attack and security implications are distinct.

3. A public model is sometimes called white-box, while a secret model is called black-box.