Black-box Adversarial Attacks with Limited Queries and Information
Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin
arXiv e-Print archive, 2018
Keywords:
cs.CV, cs.CR, stat.ML
First published: 2018/04/23
Abstract: Current neural network-based classifiers are susceptible to adversarial
examples even in the black-box setting, where the attacker only has query
access to the model. In practice, the threat model for real-world systems is
often more restrictive than the typical black-box model where the adversary can
observe the full output of the network on arbitrarily many chosen inputs. We
define three realistic threat models that more accurately characterize many
real-world classifiers: the query-limited setting, the partial-information
setting, and the label-only setting. We develop new attacks that fool
classifiers under these more restrictive threat models, where previous methods
would be impractical or ineffective. We demonstrate that our methods are
effective against an ImageNet classifier under our proposed threat models. We
also demonstrate a targeted black-box attack against a commercial classifier,
overcoming the challenges of limited query access, partial information, and
other practical issues to break the Google Cloud Vision API.
Ilyas et al. propose three query-efficient black-box adversarial example attacks based on distribution-based gradient estimation. In particular, their simplest attack estimates the gradient locally using a search distribution:
$ \nabla_x \mathbb{E}_{\pi(\theta|x)} [F(\theta)] = \mathbb{E}_{\pi(\theta|x)} [F(\theta) \nabla_x \log(\pi(\theta|x))]$
where $F(\cdot)$ is a loss function, e.g., the cross-entropy loss, which is maximized to obtain an adversarial example. Using a Gaussian search distribution, the above identity leads to a simple estimator of the gradient:
$\nabla \mathbb{E}[F(\theta)] \approx \frac{1}{\sigma n} \sum_{i = 1}^n \delta_i F(\theta + \sigma \delta_i)$
where $\sigma$ is the search variance and the $\delta_i$ are sampled from a unit Gaussian. The estimated gradient can then be used in place of the true gradient in projected gradient descent, as in white-box attacks, to obtain adversarial examples.
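The following is a minimal NumPy sketch of this estimator combined with a single projected gradient ascent step. The black-box oracle `query_loss`, the sample count `n`, the search variance `sigma`, the step size, and the $L_\infty$ bound `epsilon` are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def nes_gradient_estimate(x, query_loss, n=50, sigma=0.001):
    # Estimate grad_x E[F] by averaging delta_i * F(x + sigma * delta_i)
    # over n unit-Gaussian directions, as in the formula above.
    grad = np.zeros_like(x)
    for _ in range(n):
        delta = np.random.randn(*x.shape)
        grad += delta * query_loss(x + sigma * delta)
    return grad / (sigma * n)

def pgd_ascent_step(x, x_orig, query_loss, step_size=0.01, epsilon=0.05):
    # One projected gradient ascent step on the estimated gradient,
    # projected back onto the L_inf ball of radius epsilon around x_orig.
    g = nes_gradient_estimate(x, query_loss)
    x_new = x + step_size * np.sign(g)
    x_new = np.clip(x_new, x_orig - epsilon, x_orig + epsilon)
    return np.clip(x_new, 0.0, 1.0)  # keep pixel values in a valid range
```

Iterating `pgd_ascent_step` until the classifier's prediction changes (or a query budget is exhausted) yields the basic query-limited attack.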
The above attack assumes that the black-box network provides probability outputs in order to compute the loss $F$. In the remainder of the paper, the authors generalize this approach to the partial-information setting, where the network only returns probabilities for the top $k$ labels, and to the label-only setting, where it returns only the top $k$ labels themselves. In experiments, the attacks are shown to be effective while rarely requiring more than $50$k queries on ImageNet.
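For the label-only case, a heavily hedged sketch of the kind of proxy score the paper builds on: since no probabilities are available, the score of the target class can be approximated by how robustly it stays in the top $k$ under small random perturbations. The helper `top_k_labels` (returning an ordered list of the top-$k$ labels) and the Monte Carlo parameters `m` and `mu` are assumptions for illustration.

```python
import numpy as np

def label_only_score(x, target, top_k_labels, k=5, m=25, mu=0.01):
    # Monte Carlo proxy for the (unavailable) probability of `target`:
    # average a rank-based robustness score over m uniformly perturbed copies.
    score = 0.0
    for _ in range(m):
        noisy = x + np.random.uniform(-mu, mu, size=x.shape)
        labels = top_k_labels(noisy)  # ordered top-k labels, no scores
        score += (k - labels.index(target)) if target in labels else 0.0
    return score / m
```

Such a proxy can then stand in for $F$ in the gradient estimator above.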
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).