First published: 2016/11/04

Abstract: Adversarial examples are malicious inputs designed to fool machine learning
models. They often transfer from one model to another, allowing attackers to
mount black box attacks without knowledge of the target model's parameters.
Adversarial training is the process of explicitly training a model on
adversarial examples, in order to make it more robust to attack or to reduce
its test error on clean inputs. So far, adversarial training has primarily been
applied to small problems. In this research, we apply adversarial training to
ImageNet. Our contributions include: (1) recommendations for how to successfully
scale adversarial training to large models and datasets, (2) the observation
that adversarial training confers robustness to single-step attack methods, (3)
the finding that multi-step attack methods are somewhat less transferable than
single-step attack methods, so single-step attacks are the best for mounting
black-box attacks, and (4) resolution of a "label leaking" effect that causes
adversarially trained models to perform better on adversarial examples than on
clean examples, because the adversarial example construction process uses the
true label and the model can learn to exploit regularities in the construction
process.
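The single-step attack the abstract refers to is the fast gradient sign method (FGSM). Below is a minimal, illustrative sketch of it, assuming a PyTorch classifier with inputs scaled to [0, 1]; the function name `fgsm_attack` and the epsilon value are my own choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """Single-step FGSM sketch: perturb x by epsilon in the direction of the
    sign of the loss gradient with respect to the input.
    Assumes inputs lie in [0, 1]; epsilon is an illustrative value."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```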
Kurakin et al. present larger-scale experiments using adversarial training on ImageNet to increase robustness. In particular, they claim to be the first to apply adversarial training to ImageNet. Furthermore, they provide experiments supporting the following conclusions:
- Adversarial training can also be seen as a regularizer (see the sketch after this list). This is not surprising, however, since training on noisy samples is also known to act as regularization.
- Label leaking describes the observation that an adversarially trained model is able to defend against (i.e. correctly classify) adversarial examples that were computed using the true label, while failing to defend against adversarial examples crafted without the true label. This suggests that crafting adversarial examples without guidance from the true label can yield a stronger attack; the sketch after this list therefore uses the model's own prediction instead.
- Model complexity seems to have an impact on robustness after adversarial training. However, from the experiments it is hard to deduce what this relationship looks like exactly.
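To make the first two points concrete, here is a rough sketch of how a single adversarial training step might look, reusing the hypothetical `fgsm_attack` helper above. The 50/50 weighting of clean and adversarial loss and the use of the model's predicted label (to avoid label leaking) are illustrative assumptions, not the paper's exact recipe.

```python
def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255, adv_weight=0.5):
    """One sketched training step mixing clean and adversarial loss.
    adv_weight=0.5 is an illustrative choice, not the paper's setting."""
    model.eval()
    with torch.no_grad():
        # Attack against the model's own prediction rather than the true label,
        # so the adversarial example construction cannot leak label information.
        y_pred = model(x).argmax(dim=1)
    x_adv = fgsm_attack(model, x, y_pred, epsilon)

    model.train()
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)
    loss_adv = F.cross_entropy(model(x_adv), y)
    # Mixing both losses acts like training on noisy samples, i.e. a regularizer.
    loss = (1 - adv_weight) * loss_clean + adv_weight * loss_adv
    loss.backward()
    optimizer.step()
    return loss.item()
```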
Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).