Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality
Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, James Bailey
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.CR, cs.CV
First published: 2018/01/08
Abstract: Deep Neural Networks (DNNs) have recently been shown to be vulnerable against
adversarial examples, which are carefully crafted instances that can mislead
DNNs to make errors during prediction. To better understand such attacks, a
characterization is needed of the properties of regions (the so-called
'adversarial subspaces') in which adversarial examples lie. We tackle this
challenge by characterizing the dimensional properties of adversarial regions,
via the use of Local Intrinsic Dimensionality (LID). LID assesses the
space-filling capability of the region surrounding a reference example, based
on the distance distribution of the example to its neighbors. We first provide
explanations about how adversarial perturbation can affect the LID
characteristic of adversarial regions, and then show empirically that LID
characteristics can facilitate the distinction of adversarial examples
generated using state-of-the-art attacks. As a proof-of-concept, we show that a
potential application of LID is to distinguish adversarial examples, and the
preliminary results show that it can outperform several state-of-the-art
detection measures by large margins for five attack strategies considered in
this paper across three benchmark datasets. Our analysis of the LID
characteristic for adversarial regions not only motivates new directions of
effective adversarial defense, but also opens up more challenges for developing
new attacks to better understand the vulnerabilities of DNNs.
Ma et al. detect adversarial examples based on their estimated local intrinsic dimensionality. I want to note that this work is also similar to [1] – in both publications, local intrinsic dimensionality (LID) is used to analyze adversarial examples. Specifically, the intrinsic dimensionality at a sample $x$ is estimated from the distances $r_i(x)$ to its $k$ nearest neighbors, where $r_k(x)$ is the distance to the farthest of these neighbors:
$\widehat{\text{LID}}(x) = -\left(\frac{1}{k} \sum_{i = 1}^k \log \frac{r_i(x)}{r_k(x)}\right)^{-1}.$
For details regarding the original, theoretical formulation of local intrinsic dimensionality, I refer to the paper. In experiments, the authors show that adversarial examples exhibit a significantly higher intrinsic dimensionality than training samples or randomly perturbed examples, and this observation enables the detection of adversarial examples. A proper interpretation of this finding is, however, missing. It would be interesting to investigate what this finding implies about the properties of adversarial examples.
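To make the estimator concrete, below is a minimal sketch of the maximum-likelihood LID estimate computed from $k$-nearest-neighbor distances. This is not the authors' implementation; the function name `lid_mle`, the default `k`, and the use of a generic reference batch are my own assumptions (the paper, as I understand it, estimates LID of each example with respect to a minibatch of activations, per DNN layer).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(x, reference, k=20):
    """Maximum-likelihood LID estimate at query point x.

    `reference` is a batch of points with shape (n, d) used as neighbors;
    it is assumed not to contain x itself (otherwise query k+1 neighbors
    and drop the zero distance).
    """
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    # distances r_1(x) <= ... <= r_k(x) to the k nearest neighbors
    r = nn.kneighbors(x.reshape(1, -1), n_neighbors=k)[0][0]
    eps = 1e-12  # guard against duplicate points at zero distance
    # LID(x) = -( (1/k) * sum_i log(r_i(x) / r_k(x)) )^{-1}
    return -1.0 / np.mean(np.log((r + eps) / (r[-1] + eps)))

# Toy usage on random data (values are illustrative only):
rng = np.random.default_rng(0)
batch = rng.normal(size=(100, 32))
print(lid_mle(batch.mean(axis=0), batch, k=20))
print(lid_mle(batch.mean(axis=0) + 5.0, batch, k=20))
```

If I recall correctly, the paper computes such estimates per layer of the network over minibatches and trains a simple detector on the resulting LID features; the sketch above only illustrates the raw estimator.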