Adversarial examples, and defenses against them, are often presented as a case of inherent model fragility, where the model is making a clear and identifiable mistake by misclassifying an input that humans would classify correctly. But another frame on adversarial examples research is that it's a way of imposing a certain kind of prior requirement on our models: that they be robust to certain scales of perturbation of their inputs. One reason to want this is that you believe the model might reasonably need to interact with such perturbed inputs in the future. Another is that smoothness of model outputs, in the sense of an output that doesn't change sharply in the immediate vicinity of an example, can be a useful inductive bias that improves generalization. In images, this often turns out not to be the case: training on adversarial examples empirically worsens performance on normal examples. In text, however, it seems like you can get more benefit out of training on adversarial examples, and this paper proposes a specific way of doing that.

An interesting up-front distinction is the one between generating adversarial examples in embedding space vs. raw text. Raw text is generally harder: it's unclear how to perturb sentences in ways that leave them grammatically and semantically unchanged, so that the same label remains the "correct" one, without human input. So the paper instead works in embedding space, adding a delta vector of adversarial noise to the learned word embeddings used in a text model.

One salient downside of generating adversarial examples to train on is that doing so is generally costly: it requires calculating gradients with respect to the input to get the direction of the delta vector, which means another backward pass through the network on top of the one needed to calculate the parameter gradients used for the update. It happens that once you've calculated gradients w.r.t. the inputs, the parameter gradients come basically for free from the same backward pass, so one possible solution is to do a step of parameter gradient calculation/model training every time you take a step of perturbation generation. However, if you're generating your adversarial examples via multi-step Projected Gradient Descent, doing a step of model training at each of the K steps means that by the time you finish all K steps and are ready to train on the example, your perturbation vector is out of sync with your model parameters, and so isn't optimally adversarial.

To fix this, the authors propose actually training on the adversarial example generated at each step of the multi-step generation process, not just the example produced at the end. So, instead of training your model only on perturbations of a given size, you train it on every perturbation up to and including that size. This also solves the problem of the perturbation being out of sync with the parameters, since you "apply" the perturbation in training at the same step where you calculate it.

The authors' sole purpose in this was to make models that generalize better, and they show reasonably convincing evidence that this method works slightly better than competing alternatives on language modeling tasks. More saliently, in my view, they come up with a straightforward and clever solution to a problem, which could potentially be used in other domains.
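To make the mechanics concrete, here is a minimal sketch of what a training step of this flavor might look like, assuming a PyTorch-style model that maps word embeddings directly to logits. The function name, hyperparameters, and the exact accumulation/projection details are my own illustrative choices, not the paper's recipe; the point is just that a single backward pass per ascent step populates both the parameter gradients (accumulated across the K steps) and the gradient on the perturbation.

```python
import torch
import torch.nn.functional as F

def adversarial_embedding_step(model, embeddings, labels, optimizer,
                               adv_steps=3, adv_lr=0.1, adv_max_norm=0.1):
    """One training step that trains on the perturbed example at every PGD step.

    `model` is assumed to take a tensor of word embeddings and return logits;
    `embeddings` is the (detached) embedding lookup for one batch. All names
    and hyperparameters here are illustrative, not the paper's exact recipe.
    """
    # Start from a zero perturbation on the word embeddings.
    delta = torch.zeros_like(embeddings, requires_grad=True)

    optimizer.zero_grad()
    for _ in range(adv_steps):
        logits = model(embeddings + delta)
        # Average so the K accumulated gradients sum to one "normal-sized" update.
        loss = F.cross_entropy(logits, labels) / adv_steps

        # One backward pass: parameter gradients accumulate in .grad across
        # steps, and delta.grad gives the direction for the next perturbation.
        loss.backward()

        # Gradient-ascent step on the perturbation, normalized by its norm.
        grad = delta.grad.detach()
        delta = delta.detach() + adv_lr * grad / (grad.norm() + 1e-12)

        # Project back into the allowed perturbation ball, then re-enable grad.
        norm = delta.norm()
        if norm > adv_max_norm:
            delta = delta * (adv_max_norm / norm)
        delta.requires_grad_()

    # Single parameter update using gradients accumulated over all K steps.
    optimizer.step()
```

In practice you would likely normalize and project the perturbation per example (and perhaps initialize it with small random noise rather than zeros), but the structure is the same: K forward/backward passes, K gradient accumulations, one optimizer step.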