Layer-wise Relevance Propagation (LRP) is a technique that the authors have used in multiple use-cases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of deep learning models. Apart from LRP itself, the authors also discuss quantitative ways to measure the accuracy of the generated heatmaps.

### LRP & Alternatives

**What is LRP?**

LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contribution of each pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by $R_i^{(l)}$ the relevance associated with the $i$-th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated with the $j$-th neuron in the next layer, the conservation principle requires that

$$\sum_i R_i^{(l)} = \sum_j R_j^{(l+1)}$$

where $R_i^{(l)}$ is given by

$$R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j} + \epsilon \cdot \mathrm{sign}\!\left(\sum_{i'} z_{i'j}\right)} \, R_j^{(l+1)}$$

where $z_{ij}$ is the contribution of the $i$-th neuron to the activation of the $j$-th neuron. As per the authors, this is not necessarily the only relevance function which is conserved. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of a higher-layer neuron receive a larger share of the relevance $R_j$ of that neuron $j$.

A downside of this propagation rule (at least if $\epsilon = 0$) is that the denominator may tend to zero if lower-level contributions to neuron $j$ cancel each other out. The numerical instability can be overcome by setting $\epsilon > 0$, but in that case the conservation idea is relaxed in order to gain better numerical properties. To conserve relevance exactly, the rule can instead be formulated as a sum of positive and negative contributions,

$$R_i^{(l)} = \sum_j \left( \alpha \cdot \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}} - \beta \cdot \frac{z_{ij}^{-}}{\sum_{i'} z_{i'j}^{-}} \right) R_j^{(l+1)}$$

such that $\alpha - \beta = 1$.
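To make the two propagation rules concrete, here is a minimal numpy sketch for a single fully connected layer, assuming $z_{ij} = a_i w_{ij}$ with lower-layer activations `a` and weights `w`. It only illustrates the formulas above; it is not the authors' reference implementation.

```python
import numpy as np

def lrp_epsilon(a, w, R_upper, eps=0.01):
    """Epsilon rule: R_i = sum_j z_ij / (sum_i' z_i'j + eps * sign(.)) * R_j."""
    z = a[:, None] * w                                      # z_ij = a_i * w_ij
    denom = z.sum(axis=0)                                   # sum over lower-layer neurons
    denom = denom + eps * np.where(denom >= 0, 1.0, -1.0)   # eps > 0 avoids division by ~0
    return (z / denom * R_upper).sum(axis=1)

def lrp_alpha_beta(a, w, R_upper, alpha=2.0, beta=1.0):
    """Alpha-beta rule with alpha - beta = 1 (here alpha=2, beta=1)."""
    z = a[:, None] * w
    zp = np.clip(z, 0.0, None)                              # positive contributions z_ij^+
    zn = np.clip(z, None, 0.0)                              # negative contributions z_ij^-
    Rp = zp / (zp.sum(axis=0) + 1e-12) * R_upper            # share of positive mass
    Rn = zn / (zn.sum(axis=0) - 1e-12) * R_upper            # share of negative mass
    return (alpha * Rp - beta * Rn).sum(axis=1)

# Toy example: 3 lower-layer neurons, 2 upper-layer neurons.
rng = np.random.default_rng(0)
a = np.array([1.0, 0.5, 2.0])                               # lower-layer activations
w = rng.normal(size=(3, 2))                                 # layer weights
R_upper = np.array([0.7, 0.3])                              # relevance from the layer above
print(lrp_epsilon(a, w, R_upper).sum())      # close to R_upper.sum() for small eps
print(lrp_alpha_beta(a, w, R_upper).sum())   # conserved up to the tiny stabilizers (~1.0)
```

Summing the epsilon-rule relevances shows the relaxation of conservation mentioned above, while the alpha-beta variant keeps the total relevance (approximately) fixed.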
#### Alternatives to LRP for heatmaps

**Sensitivity measurement**

In such methods of generating heatmaps, the gradient of the output with respect to the input is used to generate the heatmap. This quantity measures how much small changes in a pixel's value locally affect the network output.

##### Disadvantages

Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, making the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients only flow through the neuron with maximum intensity in the local neighbourhood. Also, since most of these methods measure the absolute impact on the prediction caused by changes in pixel intensities, the information about whether a pixel's intensity counted in favour of or against the evidence is lost.

**Deconvolutional Networks**

##### Disadvantages

Here the backward discontinuity problem of sensitivity-based methods is absent, hence global features can be captured. However, since the method only takes in activations from the final layer (which mostly encode the presence or absence of features), using it for generating heatmaps is likely to yield average maps lacking image-specific localisation effects. LRP counters these effects nicely because of the way it propagates relevance.

#### Performance of heatmaps

A few concerns that the authors raise are:

- A heatmap is not a segmentation mask; on the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class 'bicycle' may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of the bike are not visible. In these images salient features fail to explain the classifier's decision (which still may be correct).

The authors propose a novel method (MoRF, *Most Relevant First*) for objectively quantifying the quality of a heatmap; a detailed account of the measure is best obtained from the paper. To give an idea: regions are perturbed one after another in the order given by the heatmap, most relevant first, and a good heatmap is one for which the classification score drops quickly; the ranking should also stay stable under small perturbations of pixel intensities in non-relevant areas. The quantity of interest is the area over the MoRF perturbation curve (AOPC); a rough sketch of this evaluation is given below.

#### Observation

Most of the sensitivity-based methods answer the question *what change would make the image belong more or less to the category car*, which isn't really the classifier's question. LRP aims to answer the actual classifier question: *what speaks for the presence of a car in the image*. The paper shows examples of how LRP can denoise heatmaps generated on the basis of sensitivity.
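As a rough illustration of the MoRF/AOPC idea, the sketch below perturbs pixels one at a time in order of decreasing relevance and averages the resulting drop in the classifier's score. This is a simplification: the paper perturbs small regions rather than single pixels and averages the curve over a dataset, and `score_fn` is a stand-in for the classifier's score for the predicted class.

```python
import numpy as np

def aopc(image, heatmap, score_fn, steps=100, rng=np.random.default_rng(0)):
    """Area over the MoRF perturbation curve for a single image (simplified).

    image:    (H, W, C) float array
    heatmap:  (H, W) relevance map; higher = more relevant
    score_fn: callable mapping an image to the classifier score f(x)
    """
    x = image.copy()
    f0 = score_fn(x)                                  # score of the unperturbed image
    order = np.argsort(heatmap, axis=None)[::-1]      # most relevant pixels first (MoRF)
    drops = []
    for k in range(steps):
        i, j = np.unravel_index(order[k], heatmap.shape)
        x[i, j] = rng.uniform(0.0, 1.0, size=x.shape[-1])   # replace the pixel with noise
        drops.append(f0 - score_fn(x))                # how much the evidence dropped so far
    # A larger average drop means the heatmap ranked truly relevant pixels first.
    return float(np.mean(drops))
```

A heatmap method whose AOPC (averaged over many images) is higher removes class evidence faster, which is exactly the behaviour the authors use to compare LRP against sensitivity- and deconvolution-based heatmaps.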
Mask RCNN takes off from where Faster RCNN left off, with some augmentations aimed at improving instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN and adds a third branch for segmentation. The third branch takes the output from the RoIAlign layer and predicts a binary mask for each class.

##### Major changes and intuitions

**Mask prediction**

The mask branch predicts a binary mask per class for each RoI using a small fully convolutional network. The stark difference is the use of a per-pixel *sigmoid* activation for the final masks instead of a *softmax*, which implies that the masks do not compete with each other. This *decouples* segmentation from classification: the class prediction branch is used for class prediction, and when calculating the loss only the mask corresponding to the ground-truth class contributes to Lmask. The authors also show that a single class-agnostic mask prediction works almost as well as separate masks per class, further supporting this decoupling of classification from segmentation.

**RoIAlign**

RoIPool first quantizes a floating-point RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). Instead of quantizing the RoI boundaries or bins, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin and aggregates the result (using max or average); see the sketch below.

**Backbone architecture**

Faster RCNN uses a VGG-like structure for extracting features from the image, the weights of which are shared between the RPN and the region detection layers. Here the authors experiment with two backbone architectures: a ResNet used in the same single-scale fashion as the VGG backbone in FRCNN, and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them to produce a pyramid of feature maps used for prediction, instead of a single-scale feature layer (in Faster RCNN, the final output of the conv layers before the fc layers was used).

**Training objective**

The training objective is

$$L = L_{cls} + L_{box} + L_{mask}$$

where Lmask is the addition over Faster RCNN; how it is calculated was described above.

#### Observation

Mask RCNN performs significantly better than previous COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper.
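Returning to RoIAlign: the sampling it performs can be sketched as follows. This is a single-channel, loop-based illustration with average pooling over the sample points, written only to show the idea of "no quantization plus bilinear interpolation"; it is not an efficient or official implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate the (H, W) feature map at continuous location (y, x)."""
    H, W = feat.shape
    y0f, x0f = np.floor(y), np.floor(x)
    ly, lx = y - y0f, x - x0f                               # fractional offsets
    y0, x0 = int(np.clip(y0f, 0, H - 1)), int(np.clip(x0f, 0, W - 1))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    return ((1 - ly) * (1 - lx) * feat[y0, x0] + (1 - ly) * lx * feat[y0, x1] +
            ly * (1 - lx) * feat[y1, x0] + ly * lx * feat[y1, x1])

def roi_align(feat, box, out_size=7, samples=2):
    """RoIAlign sketch: no quantization of the RoI or of its bins.

    feat: (H, W) feature map (one channel, for clarity)
    box:  (y1, x1, y2, x2) in continuous feature-map coordinates
    """
    by1, bx1, by2, bx2 = box
    bin_h = (by2 - by1) / out_size
    bin_w = (bx2 - bx1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            # regularly spaced sample points inside each bin (samples=2 -> 2x2 = 4 points)
            for sy in range(samples):
                for sx in range(samples):
                    y = by1 + (i + (sy + 0.5) / samples) * bin_h
                    x = bx1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feat, y, x))
            out[i, j] = np.mean(vals)                       # average over the sample points
    return out

# Toy usage on an 8x8 feature map with a non-integer RoI.
feat = np.arange(64, dtype=float).reshape(8, 8)
print(roi_align(feat, (1.3, 0.7, 6.2, 5.9), out_size=2))
```

Because the sample coordinates stay continuous, small shifts of the RoI produce smoothly varying pooled features, which is what gives the mask branch its pixel-level alignment.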
A Critical Paper Review by Alex Lamb: https://www.youtube.com/watch?v=_seX4kZSr_8
An attention mechanism and a separate encoder/decoder are two properties of almost every neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau on the four language directions that are studied in the paper.

The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives the current input word at every timestep and has to predict the next word in the dataset. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word from the target sentence.

Obviously, in many cases such a simple model wouldn't work. For example, if your sentence was "The white dog" and you wanted to translate to Spanish ("El perro blanco"), at the 2nd timestep the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet? To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that the model would like to read more of the input sentence before emitting the next output word. So in the example from above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol. At timestep 3, when the input is "dog", the model can emit the token "perro". These padding symbols are deleted in post-processing, before the output is returned to the user. You can see a visualization of the decoding process here: https://i.imgur.com/znI6xoN.png

To enable beam search, the model actually also receives the previously outputted target token, in addition to the current source token, at every timestep. PyTorch code for the model is available at https://github.com/ofirpress/YouMayNotNeedAttention
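Here is a toy sketch of the padding preprocessing and post-processing described above. The word alignment that decides where PAD tokens go is assumed to be given (in practice it has to be computed from the parallel data); `insert_padding`, `strip_padding` and the alignment format are illustrative names, not the repository's actual API.

```python
PAD = "<PAD>"

def insert_padding(target_tokens, alignment):
    """Delay each target token until its aligned source token has been read.

    alignment[i] = index of the source token that target token i depends on.
    Padding ensures target token i is emitted no earlier than that timestep.
    """
    padded = []
    for i, tok in enumerate(target_tokens):
        # If the aligned source word hasn't been read yet, emit PADs first.
        while len(padded) < alignment[i]:
            padded.append(PAD)
        padded.append(tok)
    return padded

def strip_padding(output_tokens):
    """Post-processing: remove PAD symbols before returning the translation."""
    return [t for t in output_tokens if t != PAD]

# Example from the summary: "The white dog" -> "El perro blanco".
# "perro" depends on source position 2 ("dog"), so one PAD is inserted before it.
tgt = ["El", "perro", "blanco"]
alignment = [0, 2, 1]                      # El<-The, perro<-dog, blanco<-white
padded = insert_padding(tgt, alignment)
print(padded)                              # ['El', '<PAD>', 'perro', 'blanco']
print(strip_padding(padded))               # ['El', 'perro', 'blanco']
```

At training time the model sees the padded target, so it learns when to "wait" by emitting PAD; at inference the PADs it emits are simply stripped, as in `strip_padding` above.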
Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) on object segmentation and detection. This paper is the first in that series.

This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied to each of the proposals.

The current paper improves step 1, i.e., region/object proposals. Most object proposal approaches fall into three categories:

* Objectness scoring
* Seed segmentation
* Superpixel merging

The current method is different from these three. It shares similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are produced by a single convolutional network with multiple outputs. All the convolutional layers except the last few come from a VGG-A pretrained model. Each training sample is a triplet of: an RGB input patch, the binary mask corresponding to the input patch, and a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:

* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for the single object at the center even when multiple objects are present. Figure 1 of the paper shows the architecture and the sampling used for training. The model is then jointly trained for segmentation and objectness; negative samples are not used for the segmentation loss (see the sketch below).

## Inference

During full-image inference, the model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional, as in a fully convolutional network (FCN). This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.
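A minimal PyTorch sketch of the shared-trunk, two-head setup and the joint loss (mask loss applied only to positive patches) might look like the following. The trunk and layer sizes are placeholders rather than the paper's exact VGG-A configuration, and the names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskScoreNet(nn.Module):
    def __init__(self, feat_dim=128, feat_size=7, mask_size=56):
        super().__init__()
        # Stand-in for the shared (pretrained) convolutional trunk.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(feat_size),
        )
        flat = feat_dim * feat_size * feat_size
        # Mask branch: predicts a binary mask for the centered object.
        self.mask_head = nn.Sequential(
            nn.Flatten(), nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, mask_size * mask_size),
        )
        # Score branch: predicts the objectness of the patch.
        self.score_head = nn.Sequential(
            nn.Flatten(), nn.Linear(flat, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )
        self.mask_size = mask_size

    def forward(self, patches):
        feats = self.trunk(patches)
        masks = self.mask_head(feats).view(-1, self.mask_size, self.mask_size)
        scores = self.score_head(feats).squeeze(1)
        return masks, scores

def joint_loss(masks, scores, gt_masks, labels):
    """Objectness loss on every patch; mask loss only on positive patches."""
    score_loss = F.binary_cross_entropy_with_logits(scores, labels.float())
    pos = labels == 1
    if pos.any():
        mask_loss = F.binary_cross_entropy_with_logits(masks[pos], gt_masks[pos])
    else:
        mask_loss = masks.sum() * 0.0          # no positives in this batch
    return score_loss + mask_loss

# Toy forward/backward pass on random data.
net = MaskScoreNet()
patches = torch.randn(4, 3, 112, 112)
gt_masks = (torch.rand(4, 56, 56) > 0.5).float()
labels = torch.tensor([1, 0, 1, 0])
masks, scores = net(patches)
loss = joint_loss(masks, scores, gt_masks, labels)
loss.backward()
```

The key point the sketch tries to capture is the training setup described above: one network, two outputs, and a loss in which negative patches only supervise the objectness score.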