First published: 2016/12/19
Abstract: Deep neural networks are powerful and popular learning models that achieve
state-of-the-art pattern recognition performance on many computer vision,
speech, and language processing tasks. However, these networks have also been
shown susceptible to carefully crafted adversarial perturbations which force
misclassification of the inputs. Adversarial examples enable adversaries to
subvert the expected system behavior leading to undesired consequences and
could pose a security risk when these systems are deployed in the real world.
In this work, we focus on deep convolutional neural networks and demonstrate
that adversaries can easily craft adversarial examples even without any
internal knowledge of the target network. Our attacks treat the network as an
oracle (black-box) and only assume that the output of the network can be
observed on the probed inputs. Our first attack is based on a simple idea of
adding perturbation to a randomly selected single pixel or a small set of them.
We then improve the effectiveness of this attack by carefully constructing a
small set of pixels to perturb by using the idea of greedy local-search. Our
proposed attacks also naturally extend to a stronger notion of
misclassification. Our extensive experimental results illustrate that even
these elementary attacks can reveal a deep neural network's vulnerabilities.
The simplicity and effectiveness of our proposed schemes mean that they could
serve as a litmus test for designing robust networks.
Narodytska and Kasiviswanathan propose a local-search-based black-box adversarial attack against deep networks. In particular, they address the problem of k-misclassification, defined as follows:
Definition (k-misclassification). A neural network k-misclassifies an image if the true label is not among the k likeliest labels.
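To make the definition concrete, here is a minimal sketch of the check in Python. The function name `is_k_misclassified` and the assumption that the network exposes a vector of per-class scores are illustrative choices, not taken from the paper.

```python
import numpy as np

def is_k_misclassified(scores, true_label, k):
    """Return True if true_label is not among the k highest-scoring labels.

    `scores` is assumed to be a 1-D array of per-class scores or
    probabilities, as returned by a black-box model (hypothetical interface).
    """
    scores = np.asarray(scores)
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k likeliest labels
    return bool(true_label not in top_k)

scores = [0.1, 0.5, 0.3, 0.1]
print(is_k_misclassified(scores, true_label=2, k=1))  # label 2 is not the top-1 label
print(is_k_misclassified(scores, true_label=2, k=2))  # label 2 is among the top-2 labels
```

With k = 1 this reduces to ordinary misclassification; larger k gives the stronger notion of misclassification mentioned in the abstract.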
To this end, they propose a local search algorithm which, in each round, randomly perturbs individual pixels in a local search area around the last perturbation. If a perturbed image satisfies the k-misclassification condition, it is returned as the adversarial example. While the approach is very simple, it is applicable to black-box models where gradients and/or internal representations are not accessible and only the final scores/probabilities are available. Still, the approach seems to be quite inefficient, taking a second or more to generate a single adversarial example. Unfortunately, the authors do not discuss qualitative results and show only a handful of adversarial examples (the four in Figure 1).
Figure 1: Examples of adversarial attacks. Top: original image, bottom: perturbed image.
Tables 4 and 5: perturbing only $0.5\%$ of the pixels is enough to reach about $90\%$ misclassification, and the attack is black-box.
#### LocSearchAdv Algorithm
The algorithm runs for $R$ rounds. In each round, it finds the $t$ top pixels, i.e. those whose perturbation, if applied without bounds, would affect the classification the most. It then perturbs each of these $t$ pixels so that they stay within the valid bounds (the magnitude of the perturbation is a fixed value $r$). The top $t$ pixels are chosen from a candidate set $P$, which initially contains around $10\%$ of the pixels; at the end of each round, $P$ is updated to be the neighborhoods of size $d\times d$ around the last $t$ top pixels.
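The round structure described above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes a grayscale image with values in $[0, 1]$, a hypothetical black-box function `predict` returning class probabilities, probing by pushing a pixel to the opposite end of the range, clipping instead of whatever bounding scheme the paper uses, and a neighborhood of half-width `d` (so $(2d+1)\times(2d+1)$ pixels). All names and default values are assumptions.

```python
import numpy as np

def loc_search_adv(image, predict, true_label, k=1, rounds=50, t=5,
                   r=0.5, d=2, init_frac=0.1, lo=0.0, hi=1.0, seed=0):
    """Simplified greedy local search.

    Returns (adversarial_image, rounds_used), or (None, rounds) on failure.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    img = image.copy()
    mid = (lo + hi) / 2

    # Initial candidate set P: a random ~10% of the pixel coordinates.
    n_init = max(t, int(init_frac * h * w))
    flat = rng.choice(h * w, size=n_init, replace=False)
    P = [(int(i) // w, int(i) % w) for i in flat]

    for rnd in range(rounds):
        # Probe each candidate: push it to the far end of the range and
        # record the probability the model still assigns to the true label.
        scores = []
        for (y, x) in P:
            probe = img.copy()
            probe[y, x] = hi if img[y, x] < mid else lo
            scores.append(predict(probe)[true_label])

        # Top t pixels: those whose probe hurts the true label the most.
        top = [P[i] for i in np.argsort(scores)[:t]]

        # Apply a perturbation of fixed magnitude r, clipped to [lo, hi].
        for (y, x) in top:
            step = r if img[y, x] < mid else -r
            img[y, x] = np.clip(img[y, x] + step, lo, hi)

        # Stop as soon as the true label leaves the k likeliest labels.
        if true_label not in np.argsort(predict(img))[::-1][:k]:
            return img, rnd + 1

        # Next candidate set: neighborhoods around the pixels just perturbed.
        P = sorted({(yy, xx)
                    for (y, x) in top
                    for yy in range(max(0, y - d), min(h, y + d + 1))
                    for xx in range(max(0, x - d), min(w, x + d + 1))})

    return None, rounds
```

Each round costs one model query per candidate pixel plus one for the stopping check, which is consistent with the observation that the attack is slow compared to gradient-based methods.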