First published: 2016/08/16 Abstract: Neural networks provide state-of-the-art results for most machine learning
tasks. Unfortunately, neural networks are vulnerable to adversarial examples:
given an input $x$ and any target classification $t$, it is possible to find a
new input $x'$ that is similar to $x$ but classified as $t$. This makes it
difficult to apply neural networks in security-critical areas. Defensive
distillation is a recently proposed approach that can take an arbitrary neural
network, and increase its robustness, reducing the success rate of current
attacks' ability to find adversarial examples from $95\%$ to $0.5\%$.
In this paper, we demonstrate that defensive distillation does not
significantly increase the robustness of neural networks by introducing three
new attack algorithms that are successful on both distilled and undistilled
neural networks with $100\%$ probability. Our attacks are tailored to three
distance metrics used previously in the literature, and when compared to
previous adversarial example generation algorithms, our attacks are often much
more effective (and never worse). Furthermore, we propose using high-confidence
adversarial examples in a simple transferability test we show can also be used
to break defensive distillation. We hope our attacks will be used as a
benchmark in future defense attempts to create neural networks that resist
adversarial examples.
Carlini and Wagner propose three novel attacks for generating adversarial examples and show that defensive distillation is not an effective defense. In particular, they devise attacks for the three commonly used norms $L_0$, $L_2$ and $L_\infty$, which measure the deviation of the adversarial perturbation from the original test sample. Starting from the targeted objective
$\min_\delta d(x, x + \delta)$ s.t. $f(x + \delta) = t$ and $x+\delta \in [0,1]^n$,
they consider seven different surrogate objectives for expressing the constraint $f(x + \delta) = t$. Here, $f$ is the classifier implemented by the neural network under attack and $\delta$ denotes the adversarial perturbation. This leads to the formulation
$\min_\delta \|\delta\|_p + cL(x + \delta)$ s.t. $x + \delta \in [0,1]^n$
where $L$ is the surrogate loss. After extensive evaluation, the loss $L$ is taken to be
$L(x') = \max(\max\{Z(x')_i : i\neq t\} - Z(x')_t, -\kappa)$
where $x' = x + \delta$ and $Z(x')_i$ refers to the logit for class $i$; $\kappa$ is a constant ($\kappa = 0$ in their experiments) that can be used to control the confidence of the adversarial example. In practice, the box constraint $x + \delta \in [0,1]^n$ is handled by a change of variables, writing $x + \delta = \frac{1}{2}(\tanh(w) + 1)$ and optimizing over $w$; see the paper for details. Carlini and Wagner then discuss the concrete attacks for all three norms, i.e. $L_0$, $L_2$ and $L_\infty$, where the $L_0$ and $L_\infty$ attacks are treated in more detail because these norms are poorly suited to plain gradient descent (the $L_0$ norm is non-differentiable, and the $L_\infty$ norm only penalizes the largest perturbation component), so both are handled with iterative procedures.
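To make the optimization concrete, the following is a minimal sketch of the $L_2$ variant of this formulation in PyTorch. It assumes a callable `model` that returns the logits $Z(x')$, fixes the trade-off constant $c$ instead of selecting it by binary search as in the paper, and uses illustrative names throughout; it is a sketch of the technique under these assumptions, not the authors' reference implementation.

```python
import torch


def cw_l2_attack(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Sketch of an L2 Carlini-Wagner-style attack.

    model  : callable mapping an input batch to logits Z(x') (assumed)
    x      : original input in [0, 1]^n, shape (1, ...)
    target : desired target class t (int)
    c      : trade-off constant (the paper selects it by binary search)
    kappa  : confidence parameter (0 in the paper's experiments)
    """
    # Change of variables: x' = 0.5 * (tanh(w) + 1) keeps x' inside [0, 1]^n.
    # Initialize w so that the attack starts at the original input x.
    w = torch.atanh((2 * x - 1).clamp(-1 + 1e-6, 1 - 1e-6))
    w = w.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)        # candidate adversarial example
        logits = model(x_adv)[0]                 # Z(x'), shape (num_classes,)

        # Surrogate loss L(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa)
        mask = torch.zeros_like(logits)
        mask[target] = 1.0
        z_other = (logits - mask * 1e9).max()    # max over all classes i != t
        z_target = logits[target]
        loss_adv = torch.clamp(z_other - z_target, min=-kappa)

        # Overall objective: ||delta||_2^2 + c * L(x')
        loss = torch.sum((x_adv - x) ** 2) + c * loss_adv

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```

In the paper, this inner optimization is additionally wrapped in a binary search over $c$ to find the smallest constant for which the attack still succeeds.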