First published: 2017/02/08 (4 years ago) Abstract: Machine learning classifiers are known to be vulnerable to inputs maliciously
constructed by adversaries to force misclassification. Such adversarial
examples have been extensively studied in the context of computer vision
applications. In this work, we show adversarial attacks are also effective when
targeting neural network policies in reinforcement learning. Specifically, we
show existing adversarial example crafting techniques can be used to
significantly degrade test-time performance of trained policies. Our threat
model considers adversaries capable of introducing small perturbations to the
raw input of the policy. We characterize the degree of vulnerability across
tasks and training algorithms, for a subclass of adversarial-example attacks in
white-box and black-box settings. Regardless of the learned task or training
algorithm, we observe a significant drop in performance, even with small
adversarial perturbations that do not interfere with human perception. Videos
are available at http://rll.berkeley.edu/adversarial.
Huang et al. study adversarial attacks on reinforcement learning policies. One of the main problems, in contrast to supervised learning, is that there might not be a reward in any time step, meaning there is no clear objective to use. However, this is essential when crafting adversarial examples as they are mostly based on maximizing the training loss. To avoid this problem, Huang et al. assume a well-trained policy; the policy is expected to output a distribution over actions. Then, adversarial examples can be computed by maximizing the cross-entropy loss using the most-likely action as ground truth.