First published: 2016/07/08 Abstract: Most existing machine learning classifiers are highly vulnerable to
adversarial examples. An adversarial example is a sample of input data which
has been modified very slightly in a way that is intended to cause a machine
learning classifier to misclassify it. In many cases, these modifications can
be so subtle that a human observer does not even notice the modification at
all, yet the classifier still makes a mistake. Adversarial examples pose
security concerns because they could be used to perform an attack on machine
learning systems, even if the adversary has no access to the underlying model.
Up to now, all previous work has assumed a threat model in which the adversary
can feed data directly into the machine learning classifier. This is not always
the case for systems operating in the physical world, for example those
using signals from cameras and other sensors as input. This paper shows
that even in such physical world scenarios, machine learning systems are
vulnerable to adversarial examples. We demonstrate this by feeding adversarial
images obtained from a cell-phone camera to an ImageNet Inception classifier and
measuring the classification accuracy of the system. We find that a large
fraction of adversarial examples are classified incorrectly even when perceived
through the camera.
Adversarial examples are data points designed to fool a classifier. For example, we can take an image that a neural network classifies correctly, then backpropagate through the model to find which changes to the pixels will make it be classified as something else. These changes can be so small that a human would hardly notice a difference.
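As a concrete illustration of that backprop step, here is a minimal sketch of the one-step fast gradient sign method (FGSM) that this line of work builds on, assuming a PyTorch classifier; the model, inputs, labels and perturbation budget `eps` are placeholders, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y_true, eps=8 / 255):
    """One-step fast gradient sign attack (illustrative sketch).

    model:  any differentiable image classifier returning logits
    x:      batch of images with values in [0, 1]
    y_true: correct labels for x
    eps:    maximum per-pixel perturbation (placeholder value)
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)
    loss.backward()
    # Nudge every pixel by eps in the direction that increases the loss,
    # then clip back into the valid image range.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```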
https://i.imgur.com/pkK570X.png
Examples of adversarial images.
In this paper, they show that much of this property survives even when the images reach the classifier from the physical world, i.e. after being printed and photographed with a cell-phone camera. While accuracy on the source images drops from 85.3% to 36.3% when adversarial modifications are applied, accuracy on the photographed images still drops from 79.8% to 36.4%. They also propose two modifications to the process of generating adversarial images: making it a more gradual, iterative process, and optimising for a specific (least-likely) adversarial class.
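These two modifications roughly correspond to the paper's basic iterative method and its iterative least-likely-class variant. Below is a hedged sketch of both, again assuming a PyTorch classifier; the step size `alpha`, budget `eps` and number of steps are illustrative values, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def iterative_attack(model, x, y_true, eps=16 / 255, alpha=1 / 255,
                     steps=10, least_likely=False):
    """Iterative adversarial attack (illustrative sketch).

    least_likely=False: repeatedly ascend the loss of the true label
                        (a gradual version of the one-step attack).
    least_likely=True:  repeatedly descend the loss toward the class the
                        model currently finds least likely (targeted variant).
    """
    x_orig = x.clone().detach()
    if least_likely:
        with torch.no_grad():
            target = model(x_orig).argmin(dim=1)  # least-likely class per image
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        if least_likely:
            # Step toward the chosen target class.
            grad = torch.autograd.grad(F.cross_entropy(logits, target), x_adv)[0]
            x_adv = x_adv - alpha * grad.sign()
        else:
            # Step away from the true class.
            grad = torch.autograd.grad(F.cross_entropy(logits, y_true), x_adv)[0]
            x_adv = x_adv + alpha * grad.sign()
        # Keep the total perturbation inside the eps-ball and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x_orig + eps), x_orig - eps)
        x_adv = x_adv.clamp(0.0, 1.0).detach()
    return x_adv
```

The per-step clipping is what makes the process gradual: each iterate stays within a small neighbourhood of the original image, which the `torch.max`/`torch.min` lines approximate here.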