Welcome to ShortScience.org! |
[link]
This paper combines the Inception architecture with residual networks by adding a shortcut connection to each Inception module. Alternatively, this can be seen as a ResNet in which the 2 conv layers of each block are replaced by a (slightly modified) Inception module. The paper (claims to) provide results against the hypothesis that adding residual connections improves training; rather, increasing the model size is what makes the difference. |
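As a rough illustration of the shortcut idea described above, the sketch below wires a residual connection around an Inception-style block. It is a toy under stated assumptions, not the paper's architecture: plain Python lists stand in for feature maps, the hypothetical `branches` are arbitrary functions standing in for the parallel convolution paths, and the hypothetical `projection` matrix stands in for a layer that brings the concatenated branch outputs back to the input width so they can be added to the shortcut.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Branch = Callable[[Vector], Vector]


def relu(values: Vector) -> Vector:
    """Element-wise rectifier."""
    return [max(0.0, v) for v in values]


def linear(weights: Sequence[Sequence[float]], values: Vector) -> Vector:
    """Matrix-vector product; each row of `weights` yields one output value."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def inception_residual_block(
    x: Vector,
    branches: Sequence[Branch],
    projection: Sequence[Sequence[float]],
) -> Vector:
    """Run parallel branches, concatenate, project, then add the shortcut."""
    # Inception part: every branch sees the same input, outputs are concatenated.
    concatenated = [value for branch in branches for value in branch(x)]
    # Assumption of this sketch: a linear map restores the input width.
    residual = linear(projection, concatenated)
    if len(residual) != len(x):
        raise ValueError("projection must map back to the input width")
    # Residual part: the input bypasses the block and is added to its output.
    return relu([xi + ri for xi, ri in zip(x, residual)])


x = [1.0, -2.0]
branches = [lambda v: list(v), lambda v: [2.0 * a for a in v]]
zero_projection = [[0.0] * 4, [0.0] * 4]
mixing_projection = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

shortcut_only = inception_residual_block(x, branches, zero_projection)
mixed = inception_residual_block(x, branches, mixing_projection)
```

With an all-zero projection the block reduces to the shortcut followed by the rectifier, which is one way to see why the shortcut gives the signal a path around the module.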
[link]
Mask RCNN picks up where Faster RCNN left off, adding augmentations aimed at instance segmentation (which was out of scope for FRCNN). Instance segmentation had already been handled remarkably well by *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN. It adds a third branch for segmentation, which takes the output of the RoIAlign layer and predicts a binary mask for each class.

##### Major Changes and intuitions

**Mask prediction**

The mask branch predicts a binary mask for each RoI using a fully convolutional network. The stark difference is the use of a *sigmoid* activation for predicting the final mask instead of a *softmax*, which means the masks of different classes don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction, and only the mask belonging to the RoI's class is used for calculating Lmask. The authors also show that a single class-agnostic mask prediction works almost as well as a separate mask for each class, which supports their method of decoupling classification from segmentation.

**RoIAlign**

RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map. This quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling).
Instead of quantizing the RoI boundaries or bins, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregates the results (using max or average).

**Backbone architecture**

Faster RCNN uses a VGG-like structure for extracting features from the image, the weights of which are shared between the RPN and the region detection layers. Here, the authors experiment with 2 backbone architectures: a ResNet used as a single-scale feature extractor, as in FRCNN, and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them to produce a pyramid of feature maps used for prediction, instead of a single-scale feature layer (Faster RCNN used the final output of the conv layers before the fc layers).

**Training Objective**

The training objective looks like this

![](https://i.imgur.com/snUq73Q.png)

Lmask is the addition over Faster RCNN. The method to calculate it was mentioned above.

#### Observation

Mask RCNN performs significantly better than previous COCO instance segmentation winners *without any bells and whistles*. Detailed results are available in the paper. |
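The RoIAlign description above can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the reference implementation: the feature map is a plain 2D list, `feature[i][j]` is treated as the value at continuous coordinate `(i, j)` (real implementations differ in their pixel-centre conventions), and the names `bilinear_sample` and `roi_align` are hypothetical. Each bin takes `samples` x `samples` regularly spaced points (2 x 2 gives the four locations mentioned above), reads them by bilinear interpolation, and aggregates them with max or average. No coordinate is ever rounded.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def bilinear_sample(feature: Grid, y: float, x: float) -> float:
    """Interpolate `feature` at a continuous (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(
    feature: Grid,
    roi: Tuple[float, float, float, float],
    out_size: int = 2,
    samples: int = 2,
    reduce: str = "avg",
) -> Grid:
    """Pool a float-valued RoI (y_start, x_start, y_end, x_end) without rounding."""
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    output = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside the bin.
                    y = y_start + (by + (sy + 0.5) / samples) * bin_h
                    x = x_start + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear_sample(feature, y, x))
            if reduce == "max":
                row.append(max(values))
            else:
                row.append(sum(values) / len(values))
        output.append(row)
    return output


# A ramp feature map: value = x + 10 * y, so interpolation is easy to check.
feature = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
pooled = roi_align(feature, (0.5, 0.5, 2.5, 2.5))
```

Because the RoI boundaries `0.5` and `2.5` are used as they are, the pooled values follow the ramp smoothly, whereas a RoIPool-style rounding step would snap the region to whole cells first.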
[link]
`Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.`

This paper considers the problem of Maximum Inner Product Search (MIPS). In MIPS, given a query $q$ and a set of inputs $x_i$, we want to find the input (or the top n inputs) with the highest inner product, i.e. $\arg\max_i q^\top x_i$. Recently, it was shown that a simple transformation of the query and input vectors makes it possible to approximately solve MIPS using hashing methods for Maximum Cosine Similarity Search (MCSS), a problem for which solutions are readily available (see section 2.4 for a brief but very clear description of the transformation).

In this paper, the authors combine this approach with clustering, in order to improve the quality of the retrieved inputs. Specifically, they consider the spherical k-means algorithm, a variant of k-means in which data points are clustered based on cosine similarity instead of Euclidean distance (in short, data points are first scaled to unit norm, then in the training inner loop points are assigned to the cluster centroid with the highest dot product and cluster centroids are updated as usual, except that they are always rescaled to unit norm). Moreover, they consider a bottom-up application of the algorithm to yield a hierarchical clustering tree.

They propose to use such a hierarchical clustering tree to find the top-n candidates for MIPS. The key insight is that, since spherical k-means relies on cosine similarity for finding the best cluster, and since we have a transformation that allows the maximisation of the inner product to be approximated by the maximisation of cosine similarity, a tree to find MIPS candidates can be constructed by running spherical k-means on the inputs transformed by the same transformation used for hashing-based MIPS.
To make the search more robust to border issues, i.e. when a query is close to the frontier between clusters, at each level of the tree they consider more than one candidate cluster during the top-down search, and merge the candidates from several leaves of the tree at the very end of a full top-down query.

Their experiments using search with word embeddings show that the quality of the top 1, 10 and 100 MIPS candidates using their spherical k-means approach is better than with two hashing-based search methods. |
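To make the two ingredients above more tangible, here is a minimal sketch in plain Python. It assumes one commonly used form of the MIPS-to-MCSS reduction (append one coordinate to every input so that all inputs share the same norm, and append a zero to the query); the exact form in section 2.4 of the paper may differ. It then runs a flat spherical k-means as described in the notes. All function names are hypothetical, the initialisation from the first `k` points is a simplification, inputs are assumed to be non-zero vectors, and the hierarchical tree and multi-branch search are not shown.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def normalize(a: Vector) -> Vector:
    n = norm(a)  # assumed non-zero in this sketch
    return [x / n for x in a]


def cosine(a: Vector, b: Vector) -> float:
    return dot(a, b) / (norm(a) * norm(b))


def transform_inputs(points: List[Vector]) -> List[Vector]:
    """Append a coordinate so that every input has norm max_norm."""
    max_norm = max(norm(p) for p in points)
    return [
        list(p) + [math.sqrt(max(0.0, max_norm ** 2 - dot(p, p)))]
        for p in points
    ]


def transform_query(query: Vector) -> Vector:
    """Append a zero so the extra input coordinate never affects the dot product."""
    return list(query) + [0.0]


def mips(query: Vector, points: List[Vector]) -> int:
    return max(range(len(points)), key=lambda i: dot(query, points[i]))


def mcss(query: Vector, points: List[Vector]) -> int:
    return max(range(len(points)), key=lambda i: cosine(query, points[i]))


def spherical_kmeans(
    points: List[Vector], k: int, iterations: int = 10
) -> Tuple[List[Vector], List[int]]:
    """k-means on the unit sphere: assign by dot product, renormalise centroids."""
    unit = [normalize(p) for p in points]
    centroids = [list(u) for u in unit[:k]]  # simplistic deterministic init
    assignment = [0] * len(unit)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: dot(u, centroids[c])) for u in unit
        ]
        for c in range(k):
            members = [u for u, a in zip(unit, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                if norm(mean) > 0.0:
                    centroids[c] = normalize(mean)
    return centroids, assignment


points = [[0.1, 0.1], [3.0, 0.0]]
query = [1.0, 1.0]
best_inner = mips(query, points)
best_cosine_raw = mcss(query, points)
best_cosine_transformed = mcss(transform_query(query), transform_inputs(points))

directions = [[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]]
centroids, assignment = spherical_kmeans(directions, k=2)
```

In this toy case cosine similarity on the raw vectors prefers the short vector pointing in the query's direction, while after the transformation it agrees with the inner product, because every transformed input has the same norm.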
[link]
Layer-wise Relevance Propagation (LRP) is a novel technique that the authors have used in multiple use-cases (apart from this publication) to demonstrate the robustness and advantages of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of Deep Learning models. Apart from LRP relevance, the authors also discuss quantitative ways to measure the accuracy of the generated heatmap.

### LRP & Alternatives

What is LRP?

LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contribution of each pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by $R_i^{(l)}$ the relevance associated to the $i$th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated to the $j$th neuron in the next layer, the conservation principle requires that

![](https://i.imgur.com/GQxrnCT.png)

where $R_i^{(l)}$ is given as

![](https://i.imgur.com/FD7AAfF.png)

where $z_{ij}$ is the contribution of the $i$th neuron to the activation of the $j$th neuron. As per the authors, this is not necessarily the only relevance function which is conserved. The intuition behind using such a function is that lower-layer neurons that mostly contribute to the activation of the higher-layer neuron receive a larger share of the relevance $R_j$ of neuron $j$. A downside of this propagation rule (at least if *epsilon* = 0) is that the denominator may tend to zero if lower-level contributions to neuron $j$ cancel each other out. The numerical instability can be overcome by setting *epsilon* > 0. However, in that case the conservation idea is relaxed in order to gain better numerical properties.
To conserve relevance, the rule can be formulated as a sum of positive and negative activations

![](https://i.imgur.com/lo7f8AI.png)

such that *alpha* - *beta* = 1

#### Alternatives to LRP for heatmap

**Sensitivity measurement**

In such methods of generating heatmaps, the gradient of the output with respect to the input is used for generating the heatmap. This quantity measures how much small changes in the pixel value locally affect the network output.

##### Disadvantages

Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, which makes the backward mapping discontinuous and consequently strongly local. The same applies to maxpool layers, where gradients only flow through the neurons with maximum intensity in a local neighbourhood. Also, since most of these methods use the absolute impact on the prediction caused by changes in pixel intensities, the information on whether a pixel's intensity was evidence for or against the class is lost.

**Deconvolutional Networks**

##### Disadvantages

Here the backward discontinuity problem of sensitivity-based methods is absent, hence global features can be captured. However, since the method only takes in activations from the final layer (which mostly learns the presence or absence of features), using it for generating heatmaps is likely to yield average maps that lack image-specific localisation effects.

LRP is able to counter these effects nicely because of the way it uses relevance.

#### Performance of heatmaps

A few concerns that the authors raise are

- A heatmap is not a segmentation mask. On the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar.
However, in some images a bicycle may be partly occluded so that these parts of the bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct).

The authors propose a novel method (MoRF - *Most Relevant First*) of objectively quantifying the quality of a heatmap. A detailed idea of the measure can best be obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in non-relevant areas). The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC).

#### Observation

Most of the sensitivity-based methods answer the question *what change would make the image more or less belong to the category car*, which isn't really the classifier's question. LRP aims to answer the real classifier question: *what speaks for the presence of a car in the image*.

The image below is a good example of how LRP can denoise heatmaps generated on the basis of sensitivity.

![](https://i.imgur.com/Sq0b5yg.png) |
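The propagation rule with the *epsilon* stabiliser discussed in these notes can be sketched for a single fully connected layer. The formulas above are only available as images, so this is an assumption-laden illustration rather than the authors' code: it takes $z_{ij} = a_i w_{ij}$, ignores biases, and adds *epsilon* with the sign of the summed contributions in the denominator. The function name `lrp_epsilon` and the toy numbers are hypothetical.

```python
from typing import List


def lrp_epsilon(
    activations: List[float],
    weights: List[List[float]],
    relevance_out: List[float],
    epsilon: float = 0.0,
) -> List[float]:
    """Redistribute upper-layer relevance to the lower layer.

    weights[i][j] connects lower neuron i to upper neuron j.
    """
    n_in, n_out = len(activations), len(relevance_out)
    # z[i][j]: contribution of lower neuron i to upper neuron j (bias ignored).
    z = [[activations[i] * weights[i][j] for j in range(n_out)] for i in range(n_in)]
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        z_j = sum(z[i][j] for i in range(n_in))
        # The stabiliser pushes the denominator away from zero.
        denominator = z_j + (epsilon if z_j >= 0 else -epsilon)
        for i in range(n_in):
            relevance_in[i] += z[i][j] / denominator * relevance_out[j]
    return relevance_in


activations = [1.0, 2.0]
weights = [[1.0, -1.0], [0.5, 1.0]]
relevance_out = [2.0, 3.0]

exact = lrp_epsilon(activations, weights, relevance_out, epsilon=0.0)
stabilised = lrp_epsilon(activations, weights, relevance_out, epsilon=0.5)
```

With `epsilon=0.0` the lower-layer relevances sum to the same total as `relevance_out`, which is the conservation principle. With `epsilon=0.5` part of the relevance is absorbed by the stabiliser, which is the relaxation mentioned above. If the contributions to an upper neuron cancel exactly, the `epsilon=0.0` call fails with a division by zero.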
[link]
## **Keywords**

Progressive GAN, High resolution generator

---

## **Summary**

1. **Introduction**
    1. **Goal of the paper**
        1. Generation of very high quality images by progressively increasing the size of the generator and discriminator.
        1. Improved training and stability of GANs.
        1. A new metric for evaluating GAN results.
        1. A high quality version of the CelebA dataset (CELEBA-HQ).
    1. **Previous Research**
        1. Generative methods help to produce new samples from high-dimensional data distributions such as images.
        1. The common approaches for generative methods are:
            1. Autoregressive models: produce sharp images but are slow to evaluate, e.g. PixelCNN.
            1. Variational Autoencoders: easy to train but produce blurry images.
            1. Generative Adversarial Networks: produce sharp images at small resolutions but are highly unstable.
1. **Method**
    1. **Basic GAN architecture**
        1. A GAN consists of two major parts:
            1. _Generator_: creates a sample image from a latent code that looks very close to the training images.
            1. _Discriminator_: trained to assess how close the sample image looks to the training images.
        1. To measure the overlap between the training and the generated distributions, many measures are used, such as Jensen-Shannon divergence, least-squares divergence and Wasserstein distance.
        1. Generating at larger resolutions causes problems because it becomes easier to tell the generated images apart from the training images, which amplifies the gradient problem (see Notes). Larger resolutions also require more memory.
        1. A mechanism is also proposed to stop the generator from participating in the escalation that causes the mode collapse problem.
    1. **Progressive growing of GANs**
        1. The primary idea is to start training at a low image resolution and add extra layers at each step of the training process.
        1. Training at lower resolutions is more stable because the images carry much less class information; as the resolution increases, finer details and features are added to the image.
        1. This leads to a smooth increase in image quality instead of the network having to learn a lot of details in one single step.
    1. **Mini-batch separation**
        1. GANs tend to capture only a very small set of features from the images.
        1. "Minibatch discrimination" is used to generate a feature vector for each individual image along with one for the mini-batch of images.

        ![alt_text](https://i.imgur.com/dHFl5OV.png "image_tooltip")
1. **Conclusion**
    1. Higher resolution images can be generated in a robust and efficient way.
    1. The quality of the generated images is improved.
    1. Training time is reduced for comparable output quality and resolution.

---

## **Notes**

* Gradient Problem: At higher resolutions it becomes easier to tell the differences between the training and the generated images [1]. This is referred to as the gradient problem.
* Mode Collapse: The generator becomes incapable of creating a large variety of samples and gets stuck.

## **Open research questions**

1. Improved methods for truly photorealistic image generation.
1. Improved semantic sensibility and improved understanding of the dataset.

## **References**

1. [https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/](https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/)
1. [https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b) |
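The mini-batch statistic mentioned under Mini-batch separation can be illustrated with a small sketch. It assumes the simplified "minibatch standard deviation" variant: compute the standard deviation of every feature across the batch, average those values into one number, and append that number to every sample so the discriminator can see how varied the batch is. Flat feature vectors stand in for feature maps here, and the function name `append_minibatch_stddev` is hypothetical.

```python
import math
from typing import List


def append_minibatch_stddev(batch: List[List[float]]) -> List[List[float]]:
    """Append one batch-level variation statistic to every sample."""
    n_samples = len(batch)
    n_features = len(batch[0])
    stds = []
    for f in range(n_features):
        column = [sample[f] for sample in batch]
        mean = sum(column) / n_samples
        variance = sum((v - mean) ** 2 for v in column) / n_samples
        stds.append(math.sqrt(variance))
    # A single scalar summarising how much the batch varies.
    summary = sum(stds) / n_features
    return [list(sample) + [summary] for sample in batch]


varied = append_minibatch_stddev([[1.0, 0.0], [3.0, 0.0]])
collapsed = append_minibatch_stddev([[2.0, 2.0], [2.0, 2.0]])
```

A batch in which every sample is identical, as in a mode-collapsed generator, receives an extra feature of zero, which gives the discriminator an easy cue that variation is missing.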