Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
Andrew Slavin Ross
and
Finale Doshi-Velez
arXiv e-Print archive, 2017
Keywords:
cs.LG, cs.CR, cs.CV
First published: 2017/11/26
Abstract: Deep neural networks have proven remarkably effective at solving many
classification problems, but have been criticized recently for two major
weaknesses: the reasons behind their predictions are uninterpretable, and the
predictions themselves can often be fooled by small adversarial perturbations.
These problems pose major obstacles for the adoption of neural networks in
domains that require security or transparency. In this work, we evaluate the
effectiveness of defenses that differentiably penalize the degree to which
small changes in inputs can alter model predictions. Across multiple attacks,
architectures, defenses, and datasets, we find that neural networks trained
with this input gradient regularization exhibit robustness to transferred
adversarial examples generated to fool all of the other models. We also find
that adversarial examples generated to fool gradient-regularized models fool
all other models equally well, and actually lead to more "legitimate,"
interpretable misclassifications as rated by people (which we confirm in a
human subject experiment). Finally, we demonstrate that regularizing input
gradients makes them more naturally interpretable as rationales for model
predictions. We conclude by discussing this relationship between
interpretability and robustness in deep neural networks.
Ross and Doshi-Velez propose input gradient regularization to improve the robustness and interpretability of neural networks. As the discussion of interpretability is quite limited in the paper, the main contribution is an extensive evaluation of input gradient regularization against adversarial examples – in comparison to defenses such as distillation or adversarial training. Specifically, the input gradient regularization proposed in [1] is used:
$\arg\min_\theta H(y,\hat{y}) + \lambda \|\nabla_x H(y,\hat{y})\|_2^2$
where $\theta$ are the network’s parameters, $x$ its input and $\hat{y}$ the predicted output; $H$ might be a cross-entropy loss. This also makes apparent why the regularizer was originally called double backpropagation: minimizing the penalty requires differentiating the input gradient with respect to $\theta$, i.e., a second backward pass involving second-order derivatives during training.
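A minimal PyTorch sketch of this objective (the function name `gradient_regularized_loss` and the weight `lam` are illustrative placeholders, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy loss plus a squared L2 penalty on its input gradient
    (double backpropagation)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    # First backward pass: gradient of the loss w.r.t. the input,
    # kept in the graph so it can itself be differentiated.
    grad_x, = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.pow(2).flatten(start_dim=1).sum(dim=1).mean()
    return ce + lam * penalty

# Hypothetical training step:
# loss = gradient_regularized_loss(model, x_batch, y_batch, lam=0.1)
# loss.backward()   # second backward pass, through the gradient penalty
# optimizer.step()
```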
In experiments, the authors show that the proposed regularization is superior to many other defenses, including distillation and adversarial training. Unfortunately, the comparison does not include other “regularization” techniques for improving robustness – such as Lipschitz regularization. This makes the comparison less interpretable, especially as the combination of input gradient regularization and adversarial training performs best (suggesting that adversarial training is a meaningful defense as well). Still, I recommend a closer look at the experiments. For example, the authors also study the input gradients of defended models, leading to some interesting conclusions.
[1] H. Drucker, Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992.
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).