Promote Privacy [Principle 7]
Privacy is tricky even when ML is not involved. ML makes things even trickier by, in some sense, re-representing sensitive and/or confidential data inside the machine. This makes the original data “invisible” (at least to some users), but remember that the data are still, in some sense, “in there somewhere.” So, for example, if you train a classifier on sensitive medical data and you don’t consider what will happen when an attacker tries to extract those data back out through a set of sophisticated queries, you may not be doing your job.
When it comes to sensitive data, one promising approach in privacy-preserving ML is differential privacy. The idea behind differential privacy is to set up privacy restrictions that, for example, guarantee that an individual patient’s private medical data never have too much influence on a dataset or on a trained ML system. The idea is, in some sense, to “hide in plain sight,” with the goal of ensuring that anything that can be learned about an individual from the released information can also be learned without that individual’s data being included. An algorithm is differentially private if an observer examining the output cannot determine whether a specific individual’s information was used in the computation. Differential privacy is achieved through the use of random noise that is generated according to a chosen distribution and used to perturb a true answer. Somewhat counterintuitively, because of its use of noise, differential privacy can also be used to combat overfitting in some ML situations. Differential privacy is a reasonably promising line of research that can in some cases provide real privacy protection.
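To make the “perturb a true answer with noise” idea concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query. The function name, the toy patient records, and the parameter choices are illustrative assumptions, not part of any particular system; the core fact it demonstrates is standard: adding Laplace noise with scale sensitivity/epsilon to a query whose answer changes by at most `sensitivity` when one record is added or removed yields epsilon-differential privacy.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return a differentially private answer by adding Laplace noise.

    Noise drawn from Laplace(0, sensitivity / epsilon) guarantees
    epsilon-differential privacy for a query whose output changes by
    at most `sensitivity` when one individual's record is added or
    removed. Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

# Hypothetical example: a counting query ("how many patients have
# condition X?") has sensitivity 1, because one person's record can
# change the count by at most 1.
records = [1, 0, 1, 1, 0, 1]      # made-up per-patient flags
true_count = sum(records)
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

An observer who sees only `private_count` cannot reliably tell whether any one patient’s record was in `records`, which is exactly the property described above.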
Privacy also applies to the behavior of a trained ML system in operation. We’ve discussed the tradeoffs associated with providing (or not providing) confidence scores. Sometimes that’s a great idea, and sometimes it’s not. Figuring out what impact providing confidence scores will have on system security is another decision that should be explicitly considered and documented.
In short, you will do well to spend some cycles thinking about privacy in your ML system while you’re thinking about security.