[link]
Ilyas et al. present a follow-up to their work on the trade-off between accuracy and robustness. Specifically, given a feature $f(x)$ computed from an input $x$, the feature is considered predictive if $\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$; similarly, a predictive feature is robust if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} yf(x + \delta)\right] \geq \gamma$. That is, a feature is considered robust if its worst-case correlation with the label exceeds some threshold $\gamma$, where the worst case is taken over a pre-defined set of allowed perturbations $\Delta(x)$ of the input $x$. Naturally, there also exist predictive features that are not robust according to this definition. In the paper, Ilyas et al. present two simple algorithms for constructing modified datasets that contain only robust or only non-robust features. The key idea behind these algorithms is that an adversarially trained model relies only on robust features, while a standard model uses both robust and non-robust features. Based on these datasets, they show that non-robust but predictive features are sufficient to obtain high accuracy; similarly, training a standard model on the robust dataset yields reasonable accuracy together with improved robustness. Experiments were conducted on CIFAR-10. These observations are supported by a theoretical toy dataset consisting of two overlapping Gaussians; I refer to the paper for details. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
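As a toy illustration of these definitions (my own sketch, not code from the paper), one can check predictiveness and robustness for a simple linear feature under $\ell_\infty$ perturbations, where the worst case over $\Delta(x)$ has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data loosely following the paper's Gaussian model:
# y in {-1, +1}, x = 0.5 * y + Gaussian noise in each dimension.
n, d = 1000, 10
y = rng.choice([-1.0, 1.0], size=n)
x = 0.5 * y[:, None] + rng.normal(size=(n, d))

# A hypothetical linear feature f(x) = w . x (an assumption for illustration).
w = np.ones(d) / d

def is_predictive(w, x, y, rho):
    # Empirical version of E[y f(x)] >= rho: the feature correlates
    # with the label.
    return np.mean(y * (x @ w)) >= rho

def is_robust(w, x, y, gamma, eps):
    # Worst case over the l_inf ball of radius eps: for a linear
    # feature, inf_{||delta||_inf <= eps} y w.(x + delta)
    # = y w.x - eps * ||w||_1.
    worst = y * (x @ w) - eps * np.abs(w).sum()
    return np.mean(worst) >= gamma
```

With this data the feature is predictive (the expected correlation is 0.5) and stays robust for small $\epsilon$, but loses robustness once $\epsilon$ approaches the class separation, which is exactly the gap between predictive and robust features.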
[link]
It didn’t hit me how much this paper was a pun until I finished it, and in retrospect, I say, bravo. This paper focuses on adversarial examples, and argues that, at least in some cases, adversarial perturbations aren’t purely overfitting failures on the part of the model, but actual features that generalize to the test set. This conclusion comes from a set of two experiments:

- In the first, the authors create a dataset that contains only what they call “robust features”. They take a classifier trained to be robust via adversarial training (training on adversarial examples), and run gradient descent on the input pixels until the robust model’s final-layer activations on the modified inputs match its activations on the unmodified inputs. Under the premise that features identified by a robust model are themselves robust, because by definition they don’t change in the presence of an adversarial perturbation, a training set that matches these features is a kind of platonic, robust version of the original training set, with only robust features present. They then train a new model on this dataset, and show that it has strong test set performance, in both normal settings and adversarial ones. This is not enormously surprising, since the original robust classifier performed well, but it is still interesting.
- The most interesting and perhaps surprising experiment is where the authors create a dataset by taking normal images and layering an adversarial perturbation on top. They then label these perturbed images with the class corresponding to the perturbation, and train a model on that. This model, which is trained on images that match their labels only in their perturbation features, and not in the underlying visual features a human would recognize, achieves good test set performance under normal conditions. However, it performs poorly on adversarial perturbations of the test set. https://i.imgur.com/eJQXb0i.png

Overall, the authors claim that the perturbations “tricking” models are features that can genuinely provide some amount of test set generalization, due to real but unintuitive regularities in the data, but that these features are non-robust, in that small amounts of noise can cause them to switch sign.
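The first construction (match the robust model’s representation by gradient descent on the input) can be sketched in a few lines. This is my own simplified sketch, not the paper’s code: a fixed linear map `W` stands in for the robust model’s penultimate-layer representation, whereas the paper optimizes pixels through a deep network:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumption: a linear map W plays the role of the robust model's
# penultimate-layer representation g(x).
d, k = 20, 5
W = rng.normal(size=(k, d)) / np.sqrt(d)
g = lambda z: W @ z  # "robust features" of an input

def distill(x, steps=2000, lr=0.1):
    # Gradient descent from random noise until the new input's
    # representation matches g(x); the result keeps only the features
    # that W encodes.
    x_new = rng.normal(size=x.shape)
    for _ in range(steps):
        grad = 2 * W.T @ (g(x_new) - g(x))  # grad of ||g(x_new) - g(x)||^2
        x_new = x_new - lr * grad
    return x_new

x = rng.normal(size=d)
x_robust = distill(x)
# g(x_robust) matches g(x) even though the inputs themselves differ.
```

Because the minimizer is only constrained in the row space of `W`, the distilled input agrees with the original in its “robust features” while remaining free everywhere else, mirroring how the robust dataset discards non-robust structure.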
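The second construction (perturb each image towards a target class, then relabel it with that target) can be sketched similarly. Again this is a minimal sketch of my own: a fixed linear scorer stands in for the standard, non-robust model that the paper attacks with PGD on CIFAR-10:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data as before: y in {-1, +1}, x = 0.5 * y + noise.
n, d = 200, 20
y = rng.choice([-1.0, 1.0], size=n)
x = 0.5 * y[:, None] + rng.normal(size=(n, d))

# Assumption: a linear scorer stands in for the standard (non-robust)
# model; sign(score(z)) is its predicted class.
w = np.ones(d) / np.sqrt(d)
score = lambda z: z @ w

def perturb_towards(x, target, eps=1.0, steps=10):
    # Sign-gradient ascent on target * score, projected back into the
    # l_inf ball of radius eps around the original inputs.
    x0, x_adv = x, x.copy()
    step = eps / steps
    for _ in range(steps):
        grad = np.tile(w, (len(x), 1))  # d(score)/dx for a linear model
        x_adv = x_adv + step * np.sign(target[:, None] * grad)
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)
    return x_adv

# Perturb each input towards the *opposite* class and relabel it with
# that target: the new label now agrees with the input only through the
# non-robust perturbation features.
target = -y
x_nonrobust, y_nonrobust = perturb_towards(x, target), target
```

In the paper, a fresh model trained from scratch on this relabeled dataset still generalizes to the clean test set, which is what establishes that the non-robust features are genuinely predictive.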