Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful, natural adversarial examples for images and text. By natural adversarial examples, the authors mean meaningful changes in the image content instead of seemingly random adversarial noise – as illustrated in Figure 1. These natural adversarial examples are crafted by first learning a generative model of the data, e.g., a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be searched for in the latent space. The hope is that these latent perturbations will correspond to meaningful, natural-looking adversarial examples in image space; a minimal sketch of this setup follows Figure 2.
https://i.imgur.com/XBhHJuY.png
Figure 1: Illustration of natural adversarial examples in comparison to regular FGSM adversarial examples.
https://i.imgur.com/HT2StGI.png
Figure 2: Generative model (GAN) together with the required inverter.
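To make this setup concrete, the following is a minimal sketch of how such a generator/inverter pair might be trained on joint reconstruction objectives. Everything here is an assumption for illustration – the architectures, the latent reconstruction weight `lam`, and the loss terms are placeholders, since (as noted below) the paper leaves the exact inverter training details open:

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 784  # e.g., flattened 28x28 MNIST images

# Hypothetical architectures; the paper's exact networks differ.
generator = nn.Sequential(
    nn.Linear(latent_dim, 512), nn.ReLU(),
    nn.Linear(512, image_dim), nn.Tanh(),
)
inverter = nn.Sequential(
    nn.Linear(image_dim, 512), nn.ReLU(),
    nn.Linear(512, latent_dim),
)

# Assume the generator has already been trained as part of the GAN;
# only the inverter is optimized here.
for p in generator.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(inverter.parameters(), lr=1e-4)
lam = 0.1  # assumed weight on the latent reconstruction term

for _ in range(100):  # toy loop; real training iterates over a dataset
    x = torch.rand(32, image_dim) * 2 - 1  # stand-in for real images
    z = torch.randn(32, latent_dim)        # samples from the latent prior

    # Two reconstruction objectives: G(I(x)) should match the image x,
    # and I(G(z)) should recover the latent code z.
    loss = ((generator(inverter(x)) - x) ** 2).mean() \
        + lam * ((inverter(generator(z)) - z) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```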
In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling perturbations $\delta$ in the latent space (with increasing norm) until an adversarial perturbation is found. Here, the inverter from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper).
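A rough sketch of this stochastic black-box search could look as follows; the `classifier`, `generator`, and `inverter` interfaces are assumed (batched, numpy-style functions), and the function name, shell width `delta_r`, and defaults are all hypothetical:

```python
import numpy as np

def natural_adversarial_search(classifier, generator, inverter, x,
                               delta_r=0.01, n_samples=64, max_radius=1.0):
    """Search for a natural adversarial example of image x by sampling
    latent perturbations of increasing norm until the decoded image
    changes the classifier's prediction."""
    z = inverter(x)                   # latent code of the clean image
    label = classifier(generator(z))  # prediction on the reconstruction
    low = 0.0
    while low < max_radius:
        high = low + delta_r
        # Sample perturbations with norm in (low, high]: random
        # directions scaled to radii inside the current shell.
        directions = np.random.randn(n_samples, z.shape[-1])
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        deltas = directions * np.random.uniform(low, high, (n_samples, 1))
        images = generator(z + deltas)  # decode the candidate latents
        flipped = np.flatnonzero(classifier(images) != label)
        if flipped.size > 0:
            # Keep the candidate with the smallest perturbation norm.
            best = flipped[np.argmin(np.linalg.norm(deltas[flipped], axis=1))]
            return images[best], z + deltas[best]
        low = high  # no success: widen the search to the next shell
    return None, None  # nothing found within max_radius
```

Searching in expanding shells biases the result toward the smallest latent perturbation that flips the prediction, which is what should make the decoded example look like a minimal, semantically meaningful change.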
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).