First published: 2016/10/26 (4 years ago) Abstract: Given a state-of-the-art deep neural network classifier, we show the
existence of a universal (image-agnostic) and very small perturbation vector
that causes natural images to be misclassified with high probability. We
propose a systematic algorithm for computing universal perturbations, and show
that state-of-the-art deep neural networks are highly vulnerable to such
perturbations, albeit being quasi-imperceptible to the human eye. We further
empirically analyze these universal perturbations and show, in particular, that
they generalize very well across neural networks. The surprising existence of
universal perturbations reveals important geometric correlations among the
high-dimensional decision boundary of classifiers. It further outlines
potential security breaches with the existence of single directions in the
input space that adversaries can possibly exploit to break a classifier on most
natural images.
Moosavi-Dezfooli et al. propose universal adversarial perturbations – perturbations that are image-agnostic. Specifically, they extend the framework for crafting adversarial examples, i.e. by iteratively solving
$\arg\min_r \|r \|_2$ s.t. $f(x + r) \neq f(x)$.
Here, $r$ denotes the adversarial perturbation, $x$ a training sample and $f$ the neural network. Instead of solving this problem for a specific $x$, the authors propose to solve the problem over the full training set, i.e. in each iteration, a different sample $x$ is chosen, one step in the direction of the gradient is taken and the perturbation is updated accordingly. In experiments, they show that these universal perturbations are indeed able to fool networks an several images; in addition, these perturbations are – sometimes – transferable to other networks.
Also view this summary on [davidstutz.de](https://davidstutz.de/category/reading/).