Delving into Transferable Adversarial Examples and Black-box Attacks
Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song
arXiv e-Print archive, 2016
Keywords: cs.LG
First published: 2016/11/08
Abstract: An intriguing property of deep neural networks is the existence of
adversarial examples, which can transfer among different architectures. These
transferable adversarial examples may severely hinder deep neural network-based
applications. Previous works mostly study the transferability using small scale
datasets. In this work, we are the first to conduct an extensive study of the
transferability over large models and a large scale dataset, and we are also
the first to study the transferability of targeted adversarial examples with
their target labels. We study both non-targeted and targeted adversarial
examples, and show that while transferable non-targeted adversarial examples
are easy to find, targeted adversarial examples generated using existing
approaches almost never transfer with their target labels. Therefore, we
propose novel ensemble-based approaches to generating transferable adversarial
examples. Using such approaches, we observe a large proportion of targeted
adversarial examples that are able to transfer with their target labels for the
first time. We also present some geometric studies to help understanding the
transferable adversarial examples. Finally, we show that the adversarial
examples generated using ensemble-based approaches can successfully attack
Clarifai.com, which is a black-box image classification system.
Liu et al. provide a comprehensive study of the transferability of adversarial examples across different attacks and models on ImageNet. In their experiments, they consider both targeted and non-targeted attacks and also provide a real-world example by attacking Clarifai.com. Here, I want to list some interesting conclusions drawn from their experiments:
- Non-targeted attacks easily transfer between models; targeted attacks, in contrast, generally do not transfer – meaning that the target label does not carry over to another model.
- The level of transferability also seems to depend heavily on the hyperparameters of the trained models. In the experiments, the authors observed this on different ResNet models which share the same architectural building blocks but differ in depth.
- Considering different models, it turns out that the gradient directions (i.e. the adversarial directions used in many gradient-based attacks) are mostly orthogonal – meaning that different models have different vulnerabilities; a minimal cosine-similarity check is sketched below the figure. However, the observed transferability suggests that this only holds for the “steepest” adversarial direction; the gradient direction of one model is, thus, still useful to craft adversarial examples for another model.
- The authors also provide an interesting visualization of the local decision landscape around individual examples. As illustrated in Figure 1, the region in which the chosen image is classified correctly is often limited to a small central area. I believe that these examples are hand-picked to some extent, but they illustrate the worst-case scenario relevant for defense mechanisms.
https://i.imgur.com/STz0iwo.png
Figure 1: Decision boundary showing different classes in different colors. The axes correspond to one-pixel differences along two directions; the plotted images are computed as $x' = x + \delta_1 u + \delta_2 v$, where $u$ is the gradient direction and $v$ a random direction.
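To make the figure concrete, below is a minimal sketch of how such a decision landscape can be computed, assuming a pretrained torchvision classifier and an already preprocessed input; the model choice, step range, and variable names are my own illustrative assumptions, not taken from the paper's code.

```python
# Sketch: evaluate the predicted class on a grid x' = x + delta1*u + delta2*v,
# where u is the input-gradient direction and v a random direction.
# Assumptions: pretrained ResNet-50, random placeholder input and label.
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed image
y = torch.tensor([0])           # placeholder ground-truth label

# Gradient direction u (normalized) and a normalized random direction v.
x_grad = x.clone().requires_grad_(True)
torch.nn.functional.cross_entropy(model(x_grad), y).backward()
u = x_grad.grad / x_grad.grad.norm()
v = torch.randn_like(x)
v = v / v.norm()

# Predicted class on a 21x21 grid around x.
deltas = torch.linspace(-20.0, 20.0, steps=21)
landscape = torch.zeros(len(deltas), len(deltas), dtype=torch.long)
with torch.no_grad():
    for i, d1 in enumerate(deltas):
        for j, d2 in enumerate(deltas):
            x_prime = x + d1 * u + d2 * v
            landscape[i, j] = model(x_prime).argmax().item()
# `landscape` can then be plotted with one color per predicted class,
# analogous to Figure 1.
```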
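Regarding the orthogonality observation from the list above, the following sketch compares the input-gradient directions of two models via cosine similarity; the chosen models, the random placeholder input, and the helper `loss_gradient` are assumptions for illustration only.

```python
# Sketch: cosine similarity between the loss gradients (w.r.t. the input)
# of two different models for the same example; values close to 0 indicate
# near-orthogonal adversarial directions.
import torch
import torchvision.models as models

model_a = models.resnet18(pretrained=True).eval()
model_b = models.vgg16(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)  # placeholder for a preprocessed image
y = torch.tensor([0])           # placeholder ground-truth label

def loss_gradient(model, x, y):
    """Gradient of the cross-entropy loss with respect to the input."""
    x = x.clone().requires_grad_(True)
    torch.nn.functional.cross_entropy(model(x), y).backward()
    return x.grad.flatten()

g_a = loss_gradient(model_a, x, y)
g_b = loss_gradient(model_b, x, y)
print(torch.nn.functional.cosine_similarity(g_a, g_b, dim=0).item())
```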
Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).