On the (Statistical) Detection of Adversarial Examples
Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel
arXiv e-Print archive - 2017
Keywords:
cs.CR, cs.LG, stat.ML
First published: 2017/02/21
Abstract: Machine Learning (ML) models are applied in a variety of tasks such as
network intrusion detection or Malware classification. Yet, these models are
vulnerable to a class of malicious inputs known as adversarial examples. These
are slightly perturbed inputs that are classified incorrectly by the ML model.
The mitigation of these adversarial inputs remains an open problem. As a step
towards understanding adversarial examples, we show that they are not drawn
from the same distribution as the original data, and can thus be detected
using statistical tests. Using this knowledge, we introduce a complementary
approach to identify specific inputs that are adversarial. Specifically, we
augment our ML model with an additional output, in which the model is trained
to classify all adversarial inputs. We evaluate our approach on multiple
adversarial example crafting methods (including the fast gradient sign and
saliency map methods) with several datasets. The statistical test flags sample
sets containing adversarial inputs confidently at sample sizes between 10 and
100 data points. Furthermore, our augmented model either detects adversarial
examples as outliers with high accuracy (> 80%) or increases the adversary's
cost - the perturbation added - by more than 150%. In this way, we show that
statistical properties of adversarial examples are essential to their
detection.
Grosse et al. use statistical tests to detect adversarial examples; in addition, they adapt machine learning models so that they detect adversarial examples on the fly while performing classification. The idea behind the statistical tests is simple: assuming there is a true data distribution, a machine learning algorithm can only approximate it, i.e. each model "learns" an approximate distribution. The ideal adversary exploits this discrepancy by drawing samples from regions where the true distribution and the learned distribution differ, which results in misclassification. In practice, the authors show that kernel-based two-sample hypothesis tests can identify a set of adversarial examples, but not individual ones (a sketch of such a test is given below). To also detect individual inputs, each classifier is augmented with an additional output class that flags adversarial examples (see the second sketch below). This approach is similar to adversarial training, where adversarial examples are included in the training set with the correct label. However, I believe it is possible to craft new adversarial examples against the augmented classifier, just as it is against adversarially trained models.
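
To make the first idea concrete, here is a minimal sketch of a kernel two-sample test using a maximum mean discrepancy (MMD) statistic with an RBF kernel and a permutation test. This is not the authors' implementation; the median-heuristic bandwidth, the sample sizes, and the toy Gaussian data are illustrative assumptions standing in for a batch of clean test points and a batch of suspected adversarial inputs.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Pairwise RBF kernel values k(a, b) = exp(-gamma * ||a - b||^2).
    sq = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma):
    # Biased estimate of the squared maximum mean discrepancy (MMD).
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())

def median_heuristic_gamma(Z):
    # Common bandwidth choice: 1 / (median pairwise squared distance).
    sq = (
        np.sum(Z**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * Z @ Z.T
    )
    return 1.0 / np.median(sq[sq > 0])

def mmd_permutation_test(X, Y, n_permutations=1000, seed=0):
    # p-value for the null hypothesis that X and Y come from the same
    # distribution: pool both samples, reshuffle the split, and count how
    # often a permuted split yields an MMD at least as large as observed.
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    gamma = median_heuristic_gamma(pooled)
    observed = mmd2(X, Y, gamma)
    n = len(X)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        if mmd2(pooled[perm[:n]], pooled[perm[n:]], gamma) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)

# Toy usage: 50 clean points vs. 50 points from a slightly shifted
# distribution standing in for a set of suspected adversarial inputs.
rng = np.random.default_rng(1)
clean = rng.normal(size=(50, 20))
suspect = rng.normal(loc=0.5, size=(50, 20))
print("p-value:", mmd_permutation_test(clean, suspect))
```

A small p-value means the null hypothesis that both batches come from the same distribution is rejected, i.e. the batch as a whole is flagged, which matches the paper's observation that the test works on sample sets rather than on individual inputs.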
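
The augmented-classifier idea can be sketched as well: add an extra (K+1)-th "adversarial" class, label crafted adversarial examples with it, and retrain. In the sketch below, a scikit-learn logistic regression stands in for the paper's neural networks, and the adversarial points are random perturbations used as placeholders rather than FGSM or JSMA outputs, so everything except the extra-class construction is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
K = 3                                   # number of legitimate classes
X_clean = rng.normal(size=(300, 20))    # placeholder clean training data
y_clean = rng.integers(0, K, size=300)

# Placeholder "adversarial" inputs; in the paper these are crafted against
# the undefended model with the fast gradient sign or saliency map methods.
X_adv = X_clean[:100] + 0.5 * rng.normal(size=(100, 20))
y_adv = np.full(100, K)                 # all adversarial inputs get class K

X_train = np.vstack([X_clean, X_adv])
y_train = np.concatenate([y_clean, y_adv])

# Train the augmented model on clean data plus the extra outlier class.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At test time, a prediction of class K flags the input as adversarial.
print("flagged as adversarial:", clf.predict(X_adv[:5]) == K)
```

Unlike adversarial training, which assigns adversarial examples their correct label, this construction assigns them all to the same extra class, so detection happens as part of ordinary classification.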