Adversarial Examples Are Not Bugs, They Are Features
Ilyas, Andrew and Santurkar, Shibani and Tsipras, Dimitris and Engstrom, Logan and Tran, Brandon and Madry, Aleksander
- 2019 via Local Bibsonomy
Keywords: adversarial
Ilyas et al. present a follow-up to their work on the trade-off between accuracy and robustness. Specifically, a feature $f(x)$ computed from an input $x$ is considered predictive if
$\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$;
similarly, a predictive feature is robust if
$\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} yf(x + \delta)\right] \geq \gamma$.
This means a feature is considered robust if its worst-case correlation with the label exceeds some threshold $\gamma$, where the worst case is taken over a pre-defined set of allowed perturbations $\Delta(x)$ around the input $x$. There also exist features that are predictive but not robust according to this definition.

In the paper, Ilyas et al. present two simple algorithms for constructing adapted datasets that contain only robust or only non-robust features. The main idea behind these algorithms is that an adversarially trained model utilizes only robust features, while a standard model utilizes both robust and non-robust features. Based on these datasets, they show that non-robust but predictive features are sufficient to obtain high accuracy; similarly, training a standard model on the robust dataset also leads to reasonable accuracy while additionally increasing robustness. Experiments were conducted on CIFAR-10. These observations are supported by a theoretical toy dataset consisting of two overlapping Gaussians; I refer to the paper for details.
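As a small illustration of the two definitions above, the following sketch (mine, not from the paper) estimates the correlation $\mathbb{E}[y f(x)]$ directly and approximates the inner minimization over $\Delta(x)$ with a few projected gradient steps; the linear feature, the toy data, and the $\ell_\infty$ perturbation set are assumptions for illustration only:

```python
# Minimal sketch (not the paper's code): empirically estimating how useful
# and how robust a scalar feature f(x) is, for an assumed l_inf perturbation
# set Delta(x) = {delta : ||delta||_inf <= eps}.
import torch

torch.manual_seed(0)
d, n, eps = 20, 512, 0.25

# Toy data: labels y in {-1, +1}, inputs weakly correlated with y.
y = torch.randint(0, 2, (n,)).float() * 2 - 1
x = 0.5 * y[:, None] + torch.randn(n, d)

w = torch.ones(d) / d                      # fixed linear feature f(x) = <w, x>
f = lambda x: x @ w

# Usefulness (rho): average correlation E[y * f(x)].
rho = (y * f(x)).mean()

# Robustness (gamma): inner minimization of y * f(x + delta) over the
# l_inf ball, approximated with a few projected gradient steps.
delta = torch.zeros_like(x, requires_grad=True)
for _ in range(20):
    loss = (y * f(x + delta)).sum()        # minimize the correlation
    grad, = torch.autograd.grad(loss, delta)
    with torch.no_grad():
        delta -= 0.1 * grad.sign()         # descend on the correlation
        delta.clamp_(-eps, eps)            # stay inside Delta(x)

gamma = (y * f(x + delta.detach())).mean()
print(f"useful (rho):   {rho.item():.3f}")
print(f"robust (gamma): {gamma.item():.3f}")  # smaller than rho
```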
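The robust-dataset construction can be sketched roughly as follows (again my sketch, not the authors' code): given an adversarially trained model, each training image is replaced by an image whose penultimate-layer representation matches that of the original, obtained by gradient descent in input space starting from a random seed image. Here, `robust_model` and its `penultimate` feature extractor are assumed to be given, and plain SGD is used where the paper uses normalized gradient descent:

```python
# Minimal sketch (assumptions, not the authors' code) of constructing a
# "robustified" example: optimize x_r so that the penultimate representation
# of an adversarially trained network matches that of the original image x;
# the label of x is kept for the new example.
import torch

def robustify(x, x_seed, penultimate, steps=1000, lr=0.1):
    """Return x_r with penultimate(x_r) ~= penultimate(x), starting at x_seed."""
    target = penultimate(x).detach()
    x_r = x_seed.clone().requires_grad_(True)
    opt = torch.optim.SGD([x_r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((penultimate(x_r) - target) ** 2).sum()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_r.clamp_(0, 1)               # keep a valid image
    return x_r.detach()

# Hypothetical usage: D_R = {(robustify(x, x_seed, robust_model.penultimate), y)}.
# Training a standard model on D_R yields non-trivial robustness, supporting
# the claim that D_R retains mostly robust features.
```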
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).