[link]
Papernot et al. build upon the idea of network distillation [1] and propose a simple mechanism to defend networks against adversarial attacks. The main idea of distillation – originally introduced to “distill” the knowledge of very deep networks into smaller ones – is to train a second, possibly smaller network with the probability distributions of the original, possibly larger network as supervision. Papernot et al., as well as the authors of [1], argue that these probability distributions, i.e. the activations of the final softmax layer (also referred to as “soft” labels), contain rich information about the task in contrast to the true “hard” labels. This allows the second network to achieve similar performance while using fewer parameters or a different architecture. However, Papernot et al. do not distill a network's knowledge into a smaller one; instead they use distillation to make networks robust against adversarial attacks. They argue that most algorithms for generating adversarial examples make use of the “adversarial gradient”, i.e. the gradient of the network's cost w.r.t. its input. The adversarial gradient then guides the perturbation of the input image in the direction of wrong classes (the authors consider a simple classification task for simplicity). Therefore, Papernot et al. argue, the gradient around training samples needs to be reduced – in other words, the model needs to be smoothed.

https://i.imgur.com/jXIhIGz.png

The proposed approach is very simple: they just distill the knowledge of the network into another network with the same architecture and hyperparameters. By using the probability distributions as “soft” labels for training instead of the hard labels, the network is essentially smoothed. The full procedure is illustrated in Figure 1. Despite the simplicity of the approach, I want to highlight some additional key observations:

- Distillation is also supposed to help generalization by avoiding overly confident networks.
- The success rate of adversarial attacks can be reduced significantly, as shown in quantitative experiments.
- The amplitude of adversarial gradients is reduced, which means that the network has been smoothed and is less sensitive to variations in the input samples.

Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/).
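To make the two-step procedure concrete, here is a minimal sketch of the distillation defense as I understand it, assuming a PyTorch classifier; `Net`, `train_loader`, and the temperature value are my own placeholders, not the authors' code.

```python
# Minimal sketch of defensive distillation (assumptions: a PyTorch
# classifier, a `train_loader` yielding (images, labels), and a
# distillation temperature T chosen as a hyperparameter).
import torch
import torch.nn.functional as F

T = 20.0  # distillation temperature (hypothetical value)

def train_teacher(teacher, train_loader, optimizer, epochs=1):
    # Standard training, but the softmax is taken at temperature T.
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(teacher(x) / T, y)
            loss.backward()
            optimizer.step()

def train_student(student, teacher, train_loader, optimizer, epochs=1):
    # The student has the *same* architecture; its supervision is the
    # teacher's temperature-softened distribution (the "soft" labels).
    teacher.eval()
    for _ in range(epochs):
        for x, _ in train_loader:
            with torch.no_grad():
                soft_labels = F.softmax(teacher(x) / T, dim=1)
            optimizer.zero_grad()
            log_probs = F.log_softmax(student(x) / T, dim=1)
            loss = -(soft_labels * log_probs).sum(dim=1).mean()  # soft cross-entropy
            loss.backward()
            optimizer.step()
```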
[link]
The problem this paper addresses is that the training set for object detection is marked by a large imbalance between the number of foreground and background examples. To make the point concrete: for sliding-window detectors such as the deformable parts model, the imbalance may be as extreme as 100,000 background examples to one annotated foreground example. Before I proceed to the details of hard example mining (HEM), I just want to note that HEM in essence means that, while training, you sort your losses and train your model on the most difficult examples, which mostly means the ones with the highest loss. (An extension of this idea can be found in the Focal Loss paper.) This is a simple but powerful technique.

With this as our background, the authors propose a simple but effective method to train a Fast R-CNN. Their approach is as follows:

1. For an input image at SGD iteration t, they first compute a convolutional feature map using the conv network.
2. The ROI network uses this feature map and all of the input ROIs to do a forward pass.
3. Hard examples are obtained by sorting the ROIs by loss and taking the B/N examples on which the current network performs worst (here B is the batch size and N is the number of images).
4. While doing this, the authors notice that co-located ROIs with high overlap are likely to have correlated losses. Overlapping ROIs also project onto mostly the same region of the conv feature map, because the feature map is a smaller, coarser representation of the image, so this can lead to loss double counting. To deal with it, they use standard non-maximum suppression (NMS).
5. NMS here works as follows: it iteratively selects the ROI with the highest loss and removes all lower-loss ROIs that have high overlap with the selected region, using an IoU threshold of 0.7.
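Here is a rough sketch of the selection step in steps 3-5 (my own illustration, not the authors' code), assuming per-ROI losses and boxes are available as numpy arrays after the forward pass:

```python
# Sketch of hard-example selection: keep the highest-loss ROIs after
# suppressing overlapping ROIs so their losses are not double counted.
import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes,
    # all in (x1, y1, x2, y2) format.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_hard_examples(rois, losses, num_hard, iou_thresh=0.7):
    """NMS by loss: iteratively pick the highest-loss ROI and drop
    lower-loss ROIs that overlap it by more than iou_thresh."""
    order = np.argsort(-losses)  # highest loss first
    keep = []
    while order.size > 0 and len(keep) < num_hard:
        i = order[0]
        keep.append(i)
        overlaps = iou(rois[i], rois[order[1:]])
        order = order[1:][overlaps <= iou_thresh]
    return keep  # indices of ROIs to backpropagate through
```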
[link]
This paper describes an architecture designed for generating class predictions based on a set of features in situations where you may only have a few examples per class, or even where you see entirely new classes at test time. Some prior work has approached this problem in ridiculously complex fashion, up to and including training a network to predict the gradient outputs of a meta-network that it thinks would best optimize loss, given a new class. The method of Prototypical Networks prides itself on being much simpler and more intuitive, so I hope I'll be able to convey that in this explanation.

In order to think about this problem properly, it makes sense to take a few steps back and think about some fundamental assumptions that underlie machine learning.

https://i.imgur.com/Q45w0QT.png

One very basic one is that you need some notion of similarity between observations in your training set and potential new observations in your test set in order to properly generalize. To put it very simplistically, if a test example is very similar to examples of class A that we saw in training, we might predict it to be of class A at testing. But what does it *mean* for two observations to be similar to one another? If you're using a method like K Nearest Neighbors, you calculate a point's class identity based on the closest training-set observations to it in Euclidean space, and you assume that nearness in that space corresponds to likelihood of two data points having come from the same class. This is useful for the use case of having new classes show up after training, since, well, there isn't really a training period: the strategy for KNN is just carrying your whole training set around and, whenever a new test point comes along, calculating its closest neighbors among those training-set points. If you see a new class in the wild, all you need to do is add the examples of that class to your group of training-set points, and then, after a few examples, if your assumptions hold, you'll be able to predict that class by (hopefully) finding those two or three points as neighbors.

But what if some dimensions of your feature space matter much more than others for differentiating between classes? In a simplistic example, you could have twenty features, but, unbeknownst to you, only one is actually useful for separating out your classes, and the other 19 are random. If you use the naive KNN assumption, you wouldn't expect to perform well here, because the distances in those 19 meaningless directions will spread out your points, due to randomness, more than the meaningful dimension spreads them out due to belonging to different classes. And what if you want to be able to learn non-linear relationships between your features, which the composability of multi-layer neural networks lends itself well to? In cases like those, the features you were handed may be a woefully suboptimal metric space in which to calculate a kind of similarity that corresponds to differences in class identity, so you'll just have to strike out for the territories and create a metric space for yourself. That is, at a very high level, what this paper seeks to do: learn a transformation between input features and some vector space, such that distances in that vector space correspond as well as possible to probabilities of belonging to a given output class.
You may notice me using “vector space” and “embedding” interchangeably; they are the same idea: the result of that learned transformation, which represents your input observations as dense vectors in some p-dimensional space, where p is a chosen hyperparameter.

What are the concrete learning steps this architecture goes through?

1. During each training episode, sample a subset of classes, and then divide those classes' examples into training examples and query examples.
2. Using a set of weights that are being learned by the network, map the input features of each training example into a vector space.
3. Once all training examples are mapped into the space, calculate a “mean vector” for class A by averaging all of the embeddings of training examples that belong to class A. This is the “prototype” for class A, and once we have it, we can forget the values of the embedded examples that were averaged to create it. This is a nice update on the KNN approach, since the number of parameters we need to carry around at evaluation time is only (num-dimensions) * (num-classes), rather than (num-dimensions) * (num-training-examples).
4. Then, for each query example, map it into the embedding space, and use a distance metric in that space to create a softmax over possible classes. (You can just think of a softmax as a network's predicted probabilities: a set of floats that add up to 1.)
5. Then, you can calculate the (cross-entropy) error between the true output and that softmax prediction vector in the same way as you would for any classification network.
6. Add up the prediction loss for all the query examples, and then backpropagate through the network to update your weights.

The overall effect of this process is to incentivize your network to learn, not necessarily a good prediction function, but a good metric space. The idea is that, if the metric space is good enough, and the classes are conceptually similar to each other (i.e. car vs chair, as opposed to car vs the-meaning-of-life), a space that does well at causing similar observed classes to be close to one another will do the same for classes not seen during training.

I admit to not being sufficiently familiar with the datasets used for testing to have a sense for how well this method compares to more fully supervised classification schemes; if anyone does, definitely let me know! But the paper claims to get state-of-the-art results compared to other approaches in this domain of few-shot learning (matching networks, and the aforementioned meta-learning). One interesting note is that the authors found that squared Euclidean distance, when applied within the embedded space, worked meaningfully better than cosine distance (which is a more standard way of measuring distances between vectors, since it measures only angle, rather than magnitude). They suspect that this is because Euclidean distance, but not cosine distance, belongs to a category of divergence/distance metrics (called Bregman divergences) with a special property: the point that is closest on aggregate to all points in a cluster is the average of those points. If you want to dive way deep into the minutiae on this point, I found this blog post quite good: http://mark.reid.name/blog/meet-the-bregman-divergences.html
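For concreteness, here is a minimal sketch of one training episode following the steps above, assuming a PyTorch embedding network `encoder` and support/query tensors already grouped by class; the shapes and names are my own assumptions, not the paper's code.

```python
# Sketch of one prototypical-networks episode: embed supports, average
# per class into prototypes, classify queries by (negative squared)
# Euclidean distance to the prototypes, and take a cross-entropy loss.
import torch
import torch.nn.functional as F

def episode_loss(encoder, support, query, query_labels):
    """support: (n_classes, n_shot, feat_dim) inputs for the sampled classes.
    query: (n_query, feat_dim) inputs; query_labels: (n_query,) class indices."""
    n_classes, n_shot, feat_dim = support.shape

    # Steps 2-3: embed the support examples and average per class to get prototypes.
    prototypes = encoder(support.view(-1, feat_dim)).view(n_classes, n_shot, -1).mean(dim=1)

    # Step 4: embed queries and score classes by negative squared Euclidean distance.
    q = encoder(query)                        # (n_query, emb_dim)
    dists = torch.cdist(q, prototypes) ** 2   # (n_query, n_classes)
    log_probs = F.log_softmax(-dists, dim=1)

    # Steps 5-6: cross-entropy over the softmax of (negative) distances.
    return F.nll_loss(log_probs, query_labels)
```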
[link]
The core goal of this paper is to perform in an unsupervised way (read: without parallel texts) what other machine translation researchers had previously only effectively performed in a supervised way: the creation of a word-to-word translational mapping between natural languages. To frame the problem concretely: the researchers start with word embeddings learned in each language independently, and their desired output is a set of nearest neighbors for a source word that contains the true target (i.e. translated) word as often as possible.

An interesting bit of background for this paper is that Mikolov, who was the initial progenitor of the word embedding approach, went on to posit, based on experiments he'd conducted, that the embeddings produced by different languages share characteristics in vector space, such that one could expect a linear transformation (i.e. taking a set of points and rotating, shifting, and/or scaling them) to be able to map from one language to another. This assumption is relied on heavily in this paper. A notational note: when I refer to “a mapped source embedding” or “mapped source”, that just means that a matrix transformation, captured in a weight matrix W, is being used to do some form of rotation, scaling, or shifting to “map” from the source embedding space into the shared space.

The three strategies this paper employs are:

1. Using adversarial training to try to force the distributions of the embeddings in the source and target languages to be similar to one another.
2. Taking examples where method (1) has high confidence, and borrowing a method from supervised word-to-word translation, called the Procrustes method, to further optimize the mapping into the shared vector space.
3. Calculating the nearest neighbors of a source word using an approach they develop called “Cross-Domain Similarity Local Scaling” (CSLS). At a high level, this conducts a nearest-neighbor search but “normalizes” for density, so that, on an intuitive level, it scales distances up in dense regions of the space and scales them down in sparse regions.

Focusing on (1) first, the notion here goes back to that assumption I mentioned earlier: that internal relationships within the embedding space are similar across languages, such that if you are able to align the overall distribution of the target embeddings with that of the mapped source embeddings, then you might - if you take Mikolov's assumption seriously - reasonably expect this to push words in the mapped-source space close to their corresponding words in the target space. And this does work, to some degree, but the researchers found that this approach on its own didn't get them to where they wanted to be in terms of accuracy.

To further refine the mapping created by the adversarial training, the authors use something called the “Procrustes method”. They go into it in more detail in the paper, but at a high level, it turns out that if you're trying to minimize the sum of squared distances between a mapped-source embedding and a target embedding, assuming that the mapping is linear and that you want the weight matrix to be orthogonal, the problem reduces to taking the singular value decomposition of the matrix of source embeddings multiplied by the (transposed) matrix of target embeddings, for a set of ground-truth shared words. Now, you may reasonably note: this is an unsupervised method, so we don't have access to ground-truth word pairs across languages. And you would be correct.
So, here, what the authors do is take words that are *mutual* nearest neighbors (according to the CSLS metric of nearest neighbors I'll describe in (3)) after applying their adversarially learned rotation, and take that mutual-nearest-neighbor-dom as a marker of high confidence in that word pair. They took these mutually-nearest-neighbor pairs and used them as “ground truth” for the singular value decomposition, which was applied on top of the adversarially learned rotation to get the final mapping.

(3) is described well in equation form in the paper itself, and is just a way of constructing a similarity metric between a mapped-source embedding and a target embedding that does some clever normalization. Specifically, it takes two times the cosine similarity between Ws (mapped source) and t (target), and subtracts out the average cosine similarity of Ws to its k nearest target words, as well as the average cosine similarity of t to its k nearest source words. In this way, it normalizes the score between Ws and t based on how dense each of their neighborhoods is.

Using all of these approaches together, the authors really do get quite impressive performance. For EN-ES, ES-EN, EN-FR, FR-EN, EN-DE, DE-EN, and EO (Esperanto)-EN, the performance of the adversarial method is within 0.5 accuracy points of the supervised method, with the adversarial method being higher in 5 of those 7 cases (note: I read this as "functionally equivalent"). Interestingly, though, for EN-RU, RU-EN, EN-CHN, and CHN-EN, the adversarial method was dramatically less effective, with accuracy deltas ranging from 5 to 10 points between the adversarial and the supervised method, with the supervised method prevailing in all cases. This suggests that the assumption of a simple linear mapping between the vector spaces of different languages may be more valid when the languages are more closely related, and thus closer in structure. I'd be really interested in any experiments that try to confirm this by testing on a wider array of languages, or on subgroups of languages that are closer or farther apart (i.e. you would expect ES-FR to do even better than EN-FR, and you would expect ES-DE to do worse than EN-DE).
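To make the Procrustes refinement and the CSLS score more concrete, here is a numpy-only sketch under my own assumptions: `X` and `Y` are row-wise L2-normalized embedding matrices for the mutual-nearest-neighbor "ground truth" pairs, and `mapped_src` / `tgt` are unit-norm mapped-source and target embedding matrices. None of these names come from the paper's released code.

```python
# Sketch of the closed-form Procrustes solution and the CSLS score.
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||XW - Y||_F: W = U V^T,
    where U, S, V^T is the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls(mapped_src, tgt, k=10):
    """CSLS scores between mapped source vectors and target vectors.
    Uses cosine similarity (rows assumed unit-norm), penalizing words
    that live in dense neighborhoods ("hubs")."""
    sims = mapped_src @ tgt.T                            # cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # mean sim of each source word to its k nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # mean sim of each target word to its k nearest sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]
```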
[link]
A finding first publicized by Geoff Hinton is that, when you train a simple, lower-capacity model on the probability outputs of another model, you can often get a model with comparable performance, despite the lowered capacity. Another, even more interesting finding is that, if you take a trained model and train a model with identical structure on its probability outputs, you can often get a model with better performance than the original teacher, with quicker convergence. This paper addresses, and tries to specifically test, a few theories about why this effect might be observed.

One idea is that the "student" model can learn more quickly because getting to see the full probability distribution over a well-trained model's outputs gives it a more valuable signal, specifically because the trained model is able to better rank the classes that aren't the true class. For example, if you're training on ImageNet, on an image of a husky you're only told "this is a husky (1), and not any of the 999 other classes, which are all 0". Whereas a trained model might say "this is most likely a husky, but the probability of wolf is way higher than that of teapot". This inherently gives you more useful signal to train on, because you're given a full distribution of classes that an image is most like. This theory goes by the name of the "Dark Knowledge" theory (a truly delightful name), because it pulls all of this knowledge that is hidden in a 0/1 label into the light.

An alternative explanation for the strong performance of distillation techniques is that the student model is just benefitting from the implicit importance weighting of having a stronger gradient on examples where the teacher model is more confident. You could think of this as leading the student towards examples that are the most clear or unambiguous examples of a class, rather than more fuzzy and uncertain ones.

Along with a few other tests (which I won't address here, for sake of time and focus), the authors design a few experiments to test these possible mechanisms of action. The first test involved doing an explicit importance weighting of examples according to how confident the teacher model is, while including no information about the incorrect classes. The second was similar, but instead involved perturbing the probabilities of the classes that weren't the max-probability class. In this situation, the student model gets some information in terms of the overall magnitudes of the non-max classes, but can't leverage it as usefully because it has been randomized. In both situations, they found that there still was some value - in other words, the student outperformed the teacher - but by less than in the case where the student could see the teacher's full probability distribution. This supports the case that both mechanisms contribute: the inclusion of probabilities for the less probable classes, as well as the "confidence weighting" effect of pushing the student to learn more from examples on which the "teacher" model was more confident.
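To make the two hypotheses contrastable, here is a small sketch of my own (not the paper's code) in PyTorch: a standard temperature-softened distillation loss that carries the full "dark knowledge", versus a hard-label loss that keeps only a per-example confidence weight from the teacher.

```python
# Sketch contrasting the two mechanisms discussed above.
import torch
import torch.nn.functional as F

def dark_knowledge_loss(student_logits, teacher_logits, T=4.0):
    # Student matches the teacher's full softened distribution,
    # including the ranking of the incorrect classes.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def confidence_weighted_loss(student_logits, teacher_logits, labels):
    # Only the hard labels are used, but each example is weighted by how
    # confident the teacher is on it -- the alternative mechanism tested.
    weights = F.softmax(teacher_logits, dim=1).max(dim=1).values.detach()
    per_example = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_example).mean()
```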