There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits)
Tsipras, Dimitris and Santurkar, Shibani and Engstrom, Logan and Turner, Alexander and Madry, Aleksander
arXiv e-Print archive - 2018 via Local Bibsonomy
Keywords: dblp
Tsipras et al. investigate the trade-off between classification accuracy and adversarial robustness. In particular, on a very simple toy dataset, they prove that such a trade-off exists; this means that highly accurate models necessarily have low robustness. On this dataset, they find a sweet spot where both the standard accuracy and the adversarial accuracy (i.e., the accuracy on adversarial examples) are 70%. Using adversarial training to obtain robust networks, they additionally show that robustness increases when the model does not rely on “fragile” features, i.e., features that are only weakly correlated with the actual classification task. Focusing on a few robust features also has the advantage of more interpretable gradients and sparser weights (or convolutional kernels). Due to the induced robustness, adversarial examples are perceptually much more distinct from the original examples, as illustrated in Figure 1 on MNIST.
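For context, the toy distribution behind the provable trade-off is, roughly, the following construction (a sketch of the paper's setup; the precise constants and the formal bound are stated there). Labels $y \in \{-1, +1\}$ are uniform, one feature is moderately correlated with the label, and $d$ further features are each only weakly correlated:

$$x_1 = \begin{cases} +y & \text{with probability } p\\ -y & \text{with probability } 1 - p\end{cases}, \qquad x_2, \dots, x_{d+1} \sim \mathcal{N}(\eta y, 1) \text{ i.i.d.}$$

For $\eta$ on the order of $1/\sqrt{d}$, a classifier that averages the weak features reaches near-perfect standard accuracy, but an $L_\infty$ perturbation of size $2\eta$ shifts every weak feature towards $\mathcal{N}(-\eta y, 1)$ and fools it. A robust classifier therefore has to rely on $x_1$ alone, which caps both its standard and adversarial accuracy at $p$; this is the sweet spot mentioned above.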
https://i.imgur.com/OP2TOOu.png
Figure 1: Illustration of adversarial examples for a standard model and for models trained with $L_\infty$ and $L_2$ adversarial training. Especially in the $L_2$ case, it is apparent that adversarial examples need to alter important class characteristics in order to fool the network.
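The robust models referred to above are obtained via adversarial training, i.e., training on worst-case perturbed inputs rather than clean ones. Below is a minimal PyTorch sketch of $L_\infty$ PGD-based adversarial training in the spirit of Madry et al.; the model, data, and hyperparameters (eps, alpha, number of steps) are illustrative placeholders, not the settings used in the paper.

```python
# Minimal sketch of L_inf PGD adversarial training (in the spirit of Madry et al.).
# Model, data, and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Craft L_inf-bounded adversarial examples with projected gradient descent."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # project into the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                          # stay in the valid pixel range
        x_adv = x_adv.detach()
    return x_adv


def adversarial_training_step(model, optimizer, x, y, eps=0.3):
    """One robust training step: minimize the loss on adversarial examples."""
    model.eval()                                  # attack with frozen batch-norm statistics
    x_adv = pgd_attack(model, x, y, eps=eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy usage on random MNIST-shaped data, just to show the interface.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
    print(adversarial_training_step(model, optimizer, x, y))
```

An $L_2$ variant only changes the attack's step and projection: step along the normalized gradient and, whenever $\|x_{adv} - x\|_2$ exceeds eps, rescale the perturbation back onto the $L_2$ ball.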
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).