Layer-wise Relevance Propagation (LRP) is a novel technique that the authors have used in multiple use-cases (beyond this publication) to demonstrate the robustness and advantages of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of deep learning models. Apart from LRP relevance itself, the authors also discuss quantitative ways to measure the accuracy of the generated heatmap.

### LRP & Alternatives

**What is LRP?**

LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contribution of each pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by $R_i^{(l)}$ the relevance associated with the $i$-th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated with the $j$-th neuron in the next layer, the conservation principle requires that

$$\sum_i R_i^{(l)} = \sum_j R_j^{(l+1)}$$

where $R_i^{(l)}$ is given as

$$R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \cdot \mathrm{sign}\!\left(\sum_{i'} z_{i'j}\right)} \, R_j^{(l+1)}$$

where $z_{ij}$ is the part of the $j$-th neuron's activation that is due to input from the $i$-th neuron. As the authors note, this is not necessarily the only relevance function that is conserved. The intuition behind using such a function is that lower-layer neurons that contribute most to the activation of a higher-layer neuron receive a larger share of the relevance $R_j$ of neuron $j$. A downside of this propagation rule (at least if $\epsilon = 0$) is that the denominator may tend to zero if lower-level contributions to neuron $j$ cancel each other out. The numerical instability can be overcome by setting $\epsilon > 0$; however, in that case the conservation idea is relaxed in order to gain better numerical properties. To conserve relevance exactly, the rule can instead be formulated over positive and negative contributions,

$$R_i^{(l)} = \sum_j \left( \alpha \, \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}} - \beta \, \frac{z_{ij}^{-}}{\sum_{i'} z_{i'j}^{-}} \right) R_j^{(l+1)},$$

such that $\alpha - \beta = 1$.
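To make the propagation rule concrete, here is a minimal numpy sketch of the $\epsilon$-stabilised rule for a single fully-connected layer. The function name, the inclusion of the bias in the denominator, and the sign-based stabiliser are my own choices for illustration, not taken from the paper.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_upper, eps=1e-6):
    """Redistribute upper-layer relevance to the inputs of one dense layer (sketch).

    a       : (d_in,)         activations of the lower layer
    W, b    : (d_in, d_out),  weights and (d_out,) biases of the layer
    R_upper : (d_out,)        relevance of the upper-layer neurons
    Returns : (d_in,)         relevance of the lower-layer neurons
    """
    z = a[:, None] * W                     # z_ij: contribution of input i to output j
    denom = z.sum(axis=0) + b              # total pre-activation of each output neuron
    denom += eps * np.sign(denom)          # epsilon stabiliser avoids division by ~0
    # R_i = sum_j (z_ij / denom_j) * R_j  -- approximately conserves total relevance
    return (z * (R_upper / denom)[None, :]).sum(axis=1)
```

Applying this kind of redistribution layer by layer, from the output back to the input, yields the pixel-wise heatmap.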
#### Alternatives to LRP for heatmaps

**Sensitivity measurement.** In these methods, the gradient of the output with respect to the input is used to generate the heatmap. This quantity measures how much small changes in a pixel's value locally affect the network output.

Disadvantages: since most models use ReLU as the activation function, the gradient flows only through activations with positive output, which makes the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients only flow through the neuron with maximum intensity in the local neighbourhood. Also, because most of these methods measure the absolute impact on the prediction caused by changes in pixel intensities, the information about whether a pixel's intensity was evidence for or against the class is lost.

**Deconvolutional networks.** Here the backward-discontinuity problem of sensitivity-based methods is absent, so global features can be captured. However, since the method only takes in the activation from the final layer (which mostly encodes the presence or absence of features), using it to generate heatmaps is likely to yield average maps lacking image-specific localisation effects. LRP counters these effects nicely because of the way it propagates relevance.

#### Performance of heatmaps

A few concerns that the authors raise:

- A heatmap is not a segmentation mask; on the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class 'bicycle' may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of the bike are not visible. In these images salient features fail to explain the classifier's decision (which may still be correct).

The authors propose a novel method (MoRF, *Most Relevant First*) for objectively quantifying the quality of a heatmap; a detailed account of the measure is best obtained from the paper. The idea is that the regions a good heatmap ranks as most relevant should, when perturbed first, degrade the classification score fastest, and the ranking should remain stable under small perturbations of pixel intensities in non-relevant areas. The quantity of interest is the area over the MoRF perturbation curve (AOPC); a toy sketch of its computation follows at the end of this summary.

#### Observation

Most sensitivity-based methods answer the question *what change would make the image belong more or less to the category car*, which isn't really the classifier's question. LRP aims to answer the actual classifier question: *what speaks for the presence of a car in the image*. The figures in the paper give good examples of how LRP can denoise heatmaps generated on the basis of sensitivity.
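Coming back to the MoRF/AOPC measure from above, here is a toy Python sketch that perturbs pixels most-relevant-first and averages the resulting drop in the classifier score. It simplifies the paper's protocol (single-pixel perturbations instead of small regions, a single image, no averaging over a dataset), and `model_score` is a hypothetical callable, so treat it as an illustration of the measure rather than the authors' procedure.

```python
import numpy as np

def aopc(model_score, image, heatmap, num_steps=100, rng=None):
    """Toy area over the MoRF perturbation curve (AOPC) for a single image.

    model_score : callable mapping an image to the score of the predicted class (hypothetical)
    image       : 2-D grayscale array
    heatmap     : relevance map with the same shape as `image`
    Pixels are replaced most-relevant-first with uniform noise; the mean drop in the
    class score over all perturbation steps is the AOPC (larger = better heatmap).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = image.astype(float).copy()
    base = model_score(x)
    order = np.argsort(heatmap, axis=None)[::-1][:num_steps]   # most relevant pixels first
    drops = [0.0]
    for idx in order:
        x.flat[idx] = rng.uniform(image.min(), image.max())    # perturb one pixel
        drops.append(base - model_score(x))
    return float(np.mean(drops))
```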
# Conditional Generative Adversarial Nets

## Introduction

* Conditional version of [Generative Adversarial Nets (GAN)](https://gist.github.com/shagunsodhani/1f9dc0444142be8bd8a7404a226880eb) where both the generator and the discriminator are conditioned on some data **y** (class label or data from some other modality).
* [Link to the paper](https://arxiv.org/abs/1411.1784)

## Architecture

* Feed **y** into both the generator and the discriminator as an additional input layer such that **y** and the input are combined in a joint hidden representation (see the sketch at the end of this summary).

## Experiment

### Unimodal Setting

* Conditioning MNIST images on class labels.
* *z* (random noise) and **y** are mapped to hidden ReLU layers of size 200 and 1000 respectively and then combined into a joint ReLU layer of dimensionality 1200.
* The discriminator maps *x* (input) and **y** to maxout layers, and the joint maxout layer is fed to a sigmoid layer.
* The results do not outperform the state of the art but do provide a proof of concept.

### Multimodal Setting

* Map images (from Flickr) to labels (or user tags) to obtain a one-to-many mapping.
* Extract image and text features using a convolutional model and a language model.
* Generative model:
    * Map noise and convolutional features to a single 200-dimensional representation.
* Discriminator model:
    * Combine the representations of word vectors (corresponding to tags) and images.

## Future Work

* While the results are not that strong, they do show the potential of Conditional GANs, especially in the multimodal setting.
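A minimal PyTorch sketch of the conditioning idea for the MNIST setting summarized above: the label **y** is fed as an extra input to both networks. Layer sizes loosely follow the summary (200/1000/1200); maxout units are replaced by ReLU and the remaining sizes are my own guesses, so this illustrates the architecture rather than reproducing the paper's model.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """G(z, y): noise z and one-hot label y are embedded separately, then combined."""
    def __init__(self, z_dim=100, y_dim=10, img_dim=784):
        super().__init__()
        self.z_branch = nn.Sequential(nn.Linear(z_dim, 200), nn.ReLU())
        self.y_branch = nn.Sequential(nn.Linear(y_dim, 1000), nn.ReLU())
        self.joint = nn.Sequential(nn.Linear(1200, 1200), nn.ReLU(),
                                   nn.Linear(1200, img_dim), nn.Sigmoid())

    def forward(self, z, y):
        h = torch.cat([self.z_branch(z), self.y_branch(y)], dim=1)  # joint hidden representation
        return self.joint(h)

class ConditionalDiscriminator(nn.Module):
    """D(x, y): the image x and the label y are concatenated before classification."""
    def __init__(self, y_dim=10, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + y_dim, 240), nn.ReLU(),
                                 nn.Linear(240, 1), nn.Sigmoid())

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))
```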
Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
    - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
    - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [t/3, t/2] and [t/2, 2t/3] respectively, and estimated via brute-force search during training.
    - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
    - The model is trained with a contrastive loss function to 1) maximize the cosine similarity between the effect embedding and the transformed precondition embedding, and 2) penalize incorrect transformations whose similarity exceeds a chosen margin (see the sketch at the end of this summary).
- ACT Dataset
    - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
    - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
    - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories which are the same action under different subjects, objects and scenes.
- Experiments
    - Action recognition on UCF101, HMDB51, ACT.
    - Cross-category generalization on ACT.
- Visualizations
    - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
    - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than context.
    - Embedding retrievals based on transformed precondition embeddings.

**Thoughts**

- Modeling an action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and the supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute-force search over (y, z\_p, z\_e) to estimate the action category and the segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass.
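A small PyTorch sketch of the contrastive objective as I read it from the description above. The tensor shapes, the margin value, and the exact way the negatives are aggregated are my own assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def transformation_loss(pre_emb, eff_emb, T, y, margin=0.5):
    """Contrastive loss over per-category transformation matrices (sketch).

    pre_emb : (d,)       precondition embedding from one Siamese head
    eff_emb : (d,)       effect embedding from the other head
    T       : (n, d, d)  one linear transformation matrix per action category
    y       : int        index of the ground-truth action category
    """
    # cosine similarity between each transformed precondition and the effect embedding
    sims = F.cosine_similarity(torch.matmul(T, pre_emb), eff_emb.unsqueeze(0), dim=1)  # (n,)
    pos = 1.0 - sims[y]                          # pull the correct transformation closer
    neg = F.relu(sims - margin)                  # penalize wrong categories above the margin
    neg = neg.sum() - F.relu(sims[y] - margin)   # exclude the correct category from the penalty
    return pos + neg
```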
A Critical Paper Review by Alex Lamb: https://www.youtube.com/watch?v=_seX4kZSr_8
Originally posted on my Github [paper-notes](https://github.com/karpathy/paper-notes/blob/master/matching_networks.md) repo.

# Matching Networks for One Shot Learning

By DeepMind crew: **Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra**

This is a paper on **one-shot** learning, where we'd like to learn a class based on very few (or indeed, 1) training examples. E.g. it suffices to show a child a single giraffe, not a few hundred thousand examples, before it can recognize more giraffes.

This paper falls into a category of *"duh of course"* kind of paper, something very interesting, powerful, but somehow obvious only in retrospect. I like it.

Suppose you're given a single example of some class and would like to label it in test images.

- **Observation 1:** a standard approach might be to train an Exemplar SVM for this one (or few) examples vs. all the other training examples - i.e. a linear classifier. But this requires optimization.
- **Observation 2:** known non-parametric alternatives (e.g. k-Nearest Neighbor) don't suffer from this problem. E.g. I could immediately use a Nearest Neighbor to classify the new class without having to do any optimization whatsoever. However, NN is gross because it depends on an (arbitrarily-chosen) metric, e.g. L2 distance. Ew.
- **Core idea:** let's train a fully end-to-end nearest neighbor classifier!

## The training protocol

As the authors amusingly point out in the conclusion (and this is the *duh of course* part), *"one-shot learning is much easier if you train the network to do one-shot learning"*. Therefore, we want the test-time protocol (given N novel classes with only k examples each (e.g. k = 1 or 5), predict new instances to one of N classes) to exactly match the training-time protocol. To create each "episode" of training from a dataset of examples then:

1. Sample a task T from the training data, e.g. select 5 labels, and up to 5 examples per label (i.e. 5-25 examples).
2. To form one episode sample a label set L (e.g. {cats, dogs}) and then use L to sample the support set S and a batch B of examples to evaluate loss on.

The idea on a high level is clear, but the writing here is a bit unclear on the details of exactly how the sampling is done.

## The model

I find the paper's model description slightly wordy and unclear, but basically we're building a **differentiable nearest neighbor++**. The output \hat{y} for a test example \hat{x} is computed very similarly to what you might see in Nearest Neighbors:

$$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i) \, y_i$$

where **a** acts as a kernel, computing the extent to which \hat{x} is similar to a training example x_i, and then the labels from the training examples (y_i) are weight-blended together accordingly. The paper doesn't mention this but I assume for classification y_i would presumably be one-hot vectors.

Now, we're going to embed both the training examples x_i and the test example \hat{x}, and we'll interpret their inner products (or here a cosine similarity) as the "match", and pass that through a softmax to get normalized mixing weights so they add up to 1. No surprises here, this is quite natural:

$$a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}$$

Here **c()** is cosine distance, which I presume is implemented by normalizing the two input vectors to have unit L2 norm and taking a dot product. I assume the authors tried skipping the normalization too and it did worse? Anyway, now all that's left to define is the function **f** (i.e. how do we embed the test example into a vector) and the function **g** (i.e. how do we embed each training example into a vector?).
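A tiny PyTorch sketch of this readout, just to make the "differentiable nearest neighbor" concrete. The shapes and the use of one-hot labels are assumptions (matching the guess above), and it ignores f and g entirely - it is not the paper's code.

```python
import torch
import torch.nn.functional as F

def matching_predict(f_xhat, g_support, y_support):
    """Attention-kernel readout over a support set (sketch).

    f_xhat    : (d,)    embedding of the test example, f(x_hat)
    g_support : (k, d)  embeddings of the k support examples, g(x_i)
    y_support : (k, n)  one-hot labels of the support examples (assumed)
    Returns   : (n,)    blended label distribution y_hat
    """
    # cosine similarity between the test embedding and every support embedding
    sims = F.cosine_similarity(f_xhat.unsqueeze(0), g_support, dim=1)   # (k,)
    a = F.softmax(sims, dim=0)                                          # mixing weights, sum to 1
    return a @ y_support                                                # y_hat = sum_i a_i * y_i
```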
**Embedding the training examples.** This (the function **g**) is a bidirectional LSTM over the examples:

$$g(x_i, S) = \vec{h}_i + \overleftarrow{h}_i + g'(x_i)$$

i.e. the encoding of the i'th example x_i is a function of its "raw" embedding g'(x_i) and the embedding of its friends, communicated through the bidirectional network's hidden states. i.e. each training example is a function of not just itself but all of its friends in the set. This is part of the ++ above, because in a normal nearest neighbor you wouldn't change the representation of an example as a function of the other data points in the training set.

It's odd that the **order** is not mentioned, I assume it's random? This is a bit gross because order matters to a bidirectional LSTM; you'd get different embeddings if you permute the examples.

**Embedding the test example.** This (the function **f**) is an LSTM that processes for a fixed amount (K time steps) and at each point also *attends* over the examples in the training set. The encoding is the last hidden state of the LSTM. Again, this way we're allowing the network to change its encoding of the test example as a function of the training examples. Nifty.

That looks scary at first but it's really just a vanilla LSTM with attention where the input at each time step is constant (f'(\hat{x}), an encoding of the test example all by itself) and the hidden state is a function of the previous hidden state but also a concatenated readout vector **r**, which we obtain by attending over the encoded training examples (encoded with **g** from above). Oh and I assume there is a typo in the paper's equation (5), it should say r_k = ... without the -1 on the LHS.

## Experiments

**Task:** N-way k-shot learning task. i.e. we're given k (e.g. 1 or 5) labelled examples for N classes that we have not previously trained on and asked to classify new instances into the N classes.

**Baselines:** an "obvious" strategy of using a pretrained ConvNet and doing nearest neighbor based on the codes. An option of finetuning the network on the new examples as well (requires training and careful and strong regularization!).

**MANN** of Santoro et al. [21]: Also a DeepMind paper, a fun NTM-like Meta-Learning approach that is fed a sequence of examples and asked to predict their labels.

**Siamese network** of Koch et al. [11]: A siamese network that takes two examples and predicts whether they are from the same class or not with logistic regression. A test example is labeled with a nearest neighbor: with the class it matches best according to the siamese net (requires iteration over all training examples one by one). Also, this approach is less end-to-end than the one here because it requires the ad-hoc nearest neighbor matching, while here the *exact* end task is optimized for. It's beautiful.

### Omniglot experiments

Omniglot of [Lake et al. [14]](http://www.cs.toronto.edu/~rsalakhu/papers/LakeEtAl2015Science.pdf) is an MNIST-like scribbles dataset with 1623 characters with 20 examples each.

The image encoder is a CNN with 4 modules of [3x3 CONV 64 filters, batchnorm, ReLU, 2x2 max pool]. The original image is claimed to be resized from 28x28 to 1x1x64, which doesn't make sense because a factor-of-2 downsampling 4 times is a reduction of 16, and 28/16 is a non-integer > 1. I'm assuming they use VALID convs?

Results: Matching nets do best. Fully Conditional Embeddings (FCE), by which I mean the "Full Context Embeddings" of Section 2.1.2, are not used here; they're mentioned to not work much better. Finetuning helps a bit on the baselines but not with Matching nets (weird).
The comparisons in this table are somewhat confusing:

- I can't find the MANN numbers of 82.8% and 94.9% in their paper [21]; it's not clear where they come from. E.g. for 5 classes and 5-shot they seem to report 88.4%, not 94.9% as seen here. I must be missing something.
- I also can't find the numbers reported here in the Siamese Net [11] paper. As far as I can tell, in their Table 2 they report one-shot accuracy, 20-way classification to be 92.0, while here it is listed as 88.1%?
- The results of Lake et al. [14], who proposed Omniglot, are also missing from the table. If I'm understanding this correctly they report 95.2% on 1-shot 20-way, while matching nets here show 93.8%, and humans are estimated at 95.5%. That is, the results here appear weaker than those of Lake et al., but one should keep in mind that the method here is significantly more generic and does not make any assumptions about the existence of strokes, etc., and it's a simple, single fully-differentiable blob of neural stuff.

(skipping ImageNet/LM experiments as there are few surprises)

## Conclusions

Good paper, effectively develops a differentiable nearest neighbor trained end-to-end. It's something new, I like it!

A few concerns:

- A bidirectional LSTM (not order-invariant compute) is applied over sets of training examples to encode them. The authors don't talk about the order actually used, which presumably is random, or mention this potentially unsatisfying feature. This can be solved by using a recurrent attentional mechanism instead, as the authors are certainly aware of and as has been discussed at length in [ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS](https://arxiv.org/abs/1511.06391), where Oriol is also the first author. I wish there was a comment on this point in the paper somewhere.
- The approach also gets quite a bit slower as the number of training examples grows, but once this number is large one would presumably switch over to a parametric approach.
- It's also potentially concerning that during training the method uses a specific number of examples, e.g. 5-25, so this is the number that must also be used at test time. What happens if we want the size of our training set to grow online? It appears that we need to retrain the network because the encoder LSTM for the training data is not "used to" seeing inputs of more examples? That is, unless you fall back to iteratively subsampling the training data, doing multiple inference passes and averaging, or something like that. If we don't use FCE it can still be that the attention mechanism LSTM is not "used to" attending over many more examples, but it's not clear how much this matters. An interesting experiment would be to not use FCE and try to use 100 or 1000 training examples, while only training on up to 25 (with and without FCE). Discussion surrounding this point would be interesting.
- It's not clear what happened with the Omniglot experiments, with incorrect numbers for [11], [21], and the exclusion of the Lake et al. [14] comparison.
- A baseline that is missing would in my opinion also include training of an [Exemplar SVM](https://www.cs.cmu.edu/~tmalisie/projects/iccv11/), which is a much more powerful approach than encode-with-a-cnn-and-nearest-neighbor.