Welcome to ShortScience.org! 
[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{4}$), momentum of 0.9. They use minibatches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) 
[link]
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick allows to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportions parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions). One can think of the derivation as proceeding in 3 steps: 1. Deriving an estimator for gradients a sample from a 1dimensional density $f(x)$ that is such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable: $\frac{\partial \hat{x}}{\partial \theta} =  \frac{1}{f(\hat{x})}\int_{t=\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$ where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule. 2. Noticing that one can sample from a $D$dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d{\bf x}_{<d})$ are themselves 1dimensional mixtures (i.e. with differentiable densities and tractable CDFs) 3. Using the 1dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22). **My two cents** This is a really surprising and neat result. The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to read about whether that can be successfully done in practice. The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to nondiagonal covariances is doable. That said, ancestral sampling with nondiagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involves a matrix inverse. Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself as its application could go beyond that case. This is probably why the paper also derived a samplingbased estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10. Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gave a gradient that is consistent with what the reparameterization trick provides.
3 Comments

[link]
Mask RCNN takes off from where Faster RCNN left, with some augmentations aimed at bettering instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask* , *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixeltopixel alignment between network inputs and outputs. This is most evident in how RoIPool , the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool. #### Methodology Mask RCNN retains most of the architecture of Faster RCNN. It adds the a third branch for segmentation. The third branch takes the output from RoIAlign layer and predicts binary class masks for each class. ##### Major Changes and intutions **Mask prediction** Mask prediction segmentation predicts a binary mask for each RoI using fully convolution  and the stark difference being usage of *sigmoid* activation for predicting final mask instead of *softmax*, implies masks don't compete with each other. This *decouples* segmentation from classification. The class prediction branch is used for class prediction and for calculating loss, the mask of predicted loss is used calculating Lmask. Also, they show that a single class agnostic mask prediction works almost as effective as separate mask for each class, thereby supporting their method of decoupling classification from segmentation **RoIAlign** RoIPool first quantizes a floatingnumber RoI to the discrete granularity of the feature map, this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally feature values covered by each bin are aggregated (usually by max pooling). Instead of quantization of the RoI boundaries or bin bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average). **Backbone architecture** Faster RCNN uses a VGG like structure for extracting features from image, weights of which were shared among RPN and region detection layers. Herein, authors experiment with 2 backbone architectures  ResNet based VGG like in FRCNN and ResNet based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16) based. FPN uses convolution feature maps from previous layers and recombining them to produce pyramid of feature maps to be used for prediction instead of singlescale feature layer (final output of conv layer before connecting to fc layers was used in Faster RCNN) **Training Objective** The training objective looks like this ![](https://i.imgur.com/snUq73Q.png) Lmask is the addition from Faster RCNN. The method to calculate was mentioned above #### Observation Mask RCNN performs significantly better than COCO instance segmentation winners *without any bells and whiskers*. Detailed results are available in the paper 
[link]
Here the authors present a model which projects queries and documents into a low dimensional space, where you can fetch relevant documents by computing distance, *here cosine is used*, between the query vector and the document vectors. ### Model Description #### Word Hashing Layer They have used bag of trigrams for representing words(office > #office# > {#of, off, ffi, fic, ice, ce#}). This is able to generalize unseen words and maps morphological variation of same words to points which are close in ngram space. #### Context Window Vector Then for representing a sentence they are taking a `Window Size` around a word and appending them to form a context window vector. If we take `Window Size` = 3: (He is going to Office > { [vec of 'he', vec of 'is', vec of 'going'], [vec of 'is', vec of 'going', vec of 'to'], [vec of 'going', vec of 'to', vec of 'Office'] } #### Convolutional Layer and MaxPool layer Run a convolutional layer over each of the context window vector (for an intuition these are local features). Max pool over the resulting features to get global features. The output dimension is taken here to be 300. #### Semantic Layer Use a fully connected layer and project the 300D vector to a 128D vector. They have used two different networks, one for queries and other for documents. Now for each query and document (we are given labeled documents, one of them is positive and rest are negative) they compute the cosine similarity of the 128D output vector. And then they learn the weights of convolutional filters and the fully connected layer by maximizing conditional likelihood of positive documents. My thinking is that they have used two different networks as their is significant difference between Query length and Document Length. 
[link]
Ilyas et al. present a followup work to their paper on the tradeoff between accuracy and robustness. Specifically, given a feature $f(x)$ computed from input $x$, the feature is considered predictive if $\mathbb{E}_{(x,y) \sim \mathcal{D}}[y f(x)] \geq \rho$; similarly, a predictive feature is robust if $\mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\inf_{\delta \in \Delta(x)} yf(x + \delta)\right] \geq \gamma$. This means, a feature is considered robust if the worstcase correlation with the label exceeds some threshold $\gamma$; here the worstcase is considered within a predefined set of allowed perturbations $\Delta(x)$ relative to the input $x$. Obviously, there also exist predictive features, which are however not robust according to the above definition. In the paper, Ilyas et al. present two simple algorithms for obtaining adapted datasets which contain only robust or only nonrobust features. The main idea of these algorithms is that an adversarially trained model only utilizes robust features, while a standard model utilizes both robust and nonrobust features. Based on these datasets, they show that nonrobust, predictive features are sufficient to obtain high accuracy; similarly training a normal model on a robust dataset also leads to reasonable accuracy but also increases robustness. Experiments were done on Cifar10. These observations are supported by a theoretical toy dataset consisting of two overlapping Gaussians; I refer to the paper for details. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 