Explaining and Harnessing Adversarial Examples
Ian J. Goodfellow
and
Jonathon Shlens
and
Christian Szegedy
arXiv e-Print archive - 2014 via Local arXiv
Keywords:
stat.ML, cs.LG
First published: 2014/12/20 (9 years ago) Abstract: Several machine learning models, including neural networks, consistently
misclassify adversarial examples---inputs formed by applying small but
intentionally worst-case perturbations to examples from the dataset, such that
the perturbed input results in the model outputting an incorrect answer with
high confidence. Early attempts at explaining this phenomenon focused on
nonlinearity and overfitting. We argue instead that the primary cause of neural
networks' vulnerability to adversarial perturbation is their linear nature.
This explanation is supported by new quantitative results while giving the
first explanation of the most intriguing fact about them: their generalization
across architectures and training sets. Moreover, this view yields a simple and
fast method of generating adversarial examples. Using this approach to provide
examples for adversarial training, we reduce the test set error of a maxout
network on the MNIST dataset.
#### Problem addressed:
A fast way of finding adversarial examples, and a hypothesis for the adversarial examples
#### Summary:
This paper tries to explain why adversarial examples exists, the adversarial example is defined in another paper \cite{arxiv.org/abs/1312.6199}. The adversarial example is kind of counter intuitive because they normally are visually indistinguishable from the original example, but leads to very different predictions for the classifier. For example, let sample $x$ be associated with the true class $t$. A classifier (in particular a well trained dnn) can correctly predict $x$ with high confidence, but with a small perturbation $r$, the same network will predict $x+r$ to a different incorrect class also with high confidence.
This paper explains that the exsistence of such adversarial examples is more because of low model capacity in high dimensional spaces rather than overfitting, and got some empirical support on that. It also shows a new method that can reliably generate adversarial examples really fast using `fast sign' method. Basically, one can generate an adversarial example by taking a small step toward the sign direction of the objective. They also showed that training along with adversarial examples helps the classifier to generalize.
#### Novelty:
A fast method to generate adversarial examples reliably, and a linear hypothesis for those examples.
#### Datasets:
MNIST
#### Resources:
Talk of the paper https://www.youtube.com/watch?v=Pq4A2mPCB0Y
#### Presenter:
Yingbo Zhou
Goodfellow et al. introduce the fast gradient sign method (FGSM) to craft adversarial examples and further provide a possible interpretation of adversarial examples considering linear models. FGSM is a grdient-based, one step method for generating adversarial examples. In particular, letting $J$ be the objective optimized during training and $\epsilon$ be the maximum $\infty$-norm of the adversarial perturbation, FGSM computes
$x' = x + \eta = x + \epsilon \text{sign}(\nabla_x J(x, y))$
where $y$ is the label for sample $x$. The $\text{sign}$ method is applied element-wise here. The applicability of this method is shown in several examples and it is commonly used in related work.
In the remainder of the paper, Goodfellow et al. discuss a linear interpretation of why adversarial examples exist. Specifically, considering the dot product
$w^T x' = w^T x + w^T \eta$
it becomes apparent that the perturbation $\eta$ – although insignificant on a per-pixel level (i.e. smaller than $\epsilon$) – causes the activation of a single neuron to be influence significantly. What is more, this effect is more pronounced the higher the dimensionality of $x$. Additionally, many network architectures today use $\text{ReLU}$ activations, which are essentially linear.
Goodfellow et al. conduct several more experiments; I want to highlight the conclusions of some of them:
- Training on adversarial samples can be seen as regularization. Based on experiments, it is more effective than $L_1$ regularization or adding random noise.
- The direction of the perturbation matters most. Adversarial samples might be transferable as similar models learn similar functions where these directions are, thus, similarly effective.
- Ensembles are not necessarily resistant to perturbations.
Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).