Detecting Adversarial Samples from Artifacts
Reuben Feinman, Ryan R. Curtin, Saurabh Shintre, Andrew B. Gardner
arXiv e-Print archive - 2017
Keywords:
stat.ML, cs.LG
First published: 2017/03/01
Abstract: Deep neural networks (DNNs) are powerful nonlinear architectures that are
known to be robust to random perturbations of the input. However, these models
are vulnerable to adversarial perturbations--small input changes crafted
explicitly to fool the model. In this paper, we ask whether a DNN can
distinguish adversarial samples from their normal and noisy counterparts. We
investigate model confidence on adversarial samples by looking at Bayesian
uncertainty estimates, available in dropout neural networks, and by performing
density estimation in the subspace of deep features learned by the model. The
result is a method for implicit adversarial detection that is oblivious to the
attack algorithm. We evaluate this method on a variety of standard datasets
including MNIST and CIFAR-10 and show that it generalizes well across different
architectures and attacks. Our findings report that 85-93% ROC-AUC can be
achieved on a number of standard classification tasks with a negative class
that consists of both normal and noisy samples.
Feinman et al. use dropout to compute an uncertainty measure that helps to identify adversarial examples. Their so-called Bayesian Neural Network Uncertainty is computed as follows:
$\frac{1}{T} \sum_{i=1}^T \hat{y}_i^T \hat{y}_i - \left(\frac{1}{T}\sum_{i=1}^T \hat{y}_i\right)^T\left(\frac{1}{T}\sum_{i=1}^T \hat{y}_i\right)$
where $\{\hat{y}_1,\ldots,\hat{y}_T\}$ is a set of stochastic predictions (i.e. predictions with different noise patterns in the dropout layers). Here, it can easily be seen that this measure corresponds to a variance computation, where the first term is the second moment (correlation) and the second term is the product of expectations. In Figure 1, the authors illustrate the distributions of this uncertainty measure for regular training samples, adversarial samples and noisy samples for two attacks (BIM and JSMA, see paper for details).
https://i.imgur.com/kTWTHb5.png
Figure 1: Uncertainty distributions of normal, noisy and adversarial samples for two attacks (BIM and JSMA, see paper for details).
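To make the variance computation above concrete, here is a minimal sketch (my own, not from the paper) of Monte-Carlo dropout uncertainty in PyTorch; the toy classifier, dropout rate and number of passes $T=50$ are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Toy dropout classifier (hypothetical; any model containing nn.Dropout works).
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10), nn.Softmax(dim=1),
)

def dropout_uncertainty(model, x, T=50):
    """(1/T) sum_i y_i^T y_i - y_bar^T y_bar, one scalar per input sample."""
    model.train()  # keep dropout active so every forward pass is stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(T)])   # (T, N, classes)
    second_moment = (preds * preds).sum(dim=2).mean(dim=0)  # (1/T) sum_i y_i^T y_i
    mean_pred = preds.mean(dim=0)                           # y_bar = (1/T) sum_i y_i
    return second_moment - (mean_pred * mean_pred).sum(dim=1)

# Usage: uncertainty scores for a batch of flattened MNIST-sized inputs.
x = torch.rand(8, 784)
print(dropout_uncertainty(model, x))  # larger values = more uncertain
```

Keeping the model in train mode is what makes dropout sample a different mask on every forward pass; adversarial samples are expected to yield noticeably larger scores than normal or noisy samples, which is what Figure 1 illustrates.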
Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).