[link]
Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers should "simply" learn identities. In practice this turns out not to be easy with conventional networks, which get much worse as more layers are added. The idea is therefore to add identity connections which skip some layers, so that the network only has to learn the **residuals**. Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information flow in the forward pass is no longer a problem
* No vanishing / exploding gradients
* The identity shortcuts have no parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
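To make the residual formulation concrete, here is a minimal sketch of a basic residual block in PyTorch. This is my own illustration, not code from the paper; the two-convolution body and the channel count in the usage example are assumptions chosen to mirror the basic block described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Minimal residual block: y = x + F(x), where F is two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # The identity shortcut adds no parameters; the block only learns the residual.
        return F.relu(x + residual)

# Usage: a block mapping a 64-channel feature map to a 64-channel feature map.
block = BasicResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))
```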
[link]
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$, where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately to both terms of a layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$, where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term.

The paper shows that this form of dropout is at least as good as, or superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much as in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last.

In addition to this observation, I also note the following empirical observations:

1. At test time, averaging the outputs over multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) with fewer layers (e.g. 32), compared to the original ResNet work with thin but very deep (more than 1000 layers) ResNets. This confirms a similar observation from [this paper][2].

Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.

[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146
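Below is a minimal sketch of the swapout equation above, including both inference modes mentioned in observation 1. This is my own illustration, not the authors' code; the keep probabilities `p_keep_x` and `p_keep_f` and the sample count are assumed hyper-parameters.

```python
import torch

def swapout(x, fx, p_keep_x=0.5, p_keep_f=0.5, training=True):
    """Swapout: y = theta1 * x + theta2 * F(x) with independent Bernoulli masks."""
    if not training:
        # Deterministic inference: replace the masks by their expectations.
        return p_keep_x * x + p_keep_f * fx
    theta1 = torch.bernoulli(torch.full_like(x, p_keep_x))
    theta2 = torch.bernoulli(torch.full_like(fx, p_keep_f))
    return theta1 * x + theta2 * fx

def swapout_stochastic_inference(x, fx, n_samples=10, **kwargs):
    """Stochastic inference: average the output over several sampled masks."""
    return torch.stack([swapout(x, fx, training=True, **kwargs)
                        for _ in range(n_samples)]).mean(dim=0)
```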
[link]
This paper presents a way to differentiate through discrete random variables by replacing them with continuous random variables. Say you have a discrete [categorical variable][cat] and you're sampling it with the [Gumbel trick][gumbel] like this ($G_k$ is a Gumbel distributed variable and $\boldsymbol{\alpha}/\sum_k \alpha_k$ are our categorical probabilities):

$$ z = \text{one\_hot} \left( \underset{k}{\text{arg max}} [ G_k + \log \alpha_k ] \right) $$

This paper replaces the one-hot and argmax with a softmax, and adds a temperature parameter $\lambda$; as $\lambda$ tends to zero it recovers the equation above:

$$ z_k = \frac{\exp \left( (G_k + \log \alpha_k)/\lambda \right)}{\sum_j \exp \left( (G_j + \log \alpha_j)/\lambda \right)} $$

I made [some notes][nb] on how this process works, if you'd like more intuition.

Comparison to [Gumbel-softmax][gs]
--------------------------------------------

These papers propose precisely the same distribution, up to changes in notation ([noted there][gs]). Both papers reference each other and the differences between them; this paper does, however, point out differences in the variational objectives compared to the Gumbel-softmax. This paper also compares to [VIMCO][vimco], which is probably a harder benchmark to compare against (multi-sample versus single-sample). The results in both papers compare to SOTA score-function-based estimators, and both report high-scoring results (often the best). There are some implementation details to consider, though, such as temperature scheduling and exactly how to define the variational objective.

[cat]: https://en.wikipedia.org/wiki/Categorical_distribution
[gumbel]: https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
[gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16
[nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748
[vimco]: https://arxiv.org/abs/1602.06725
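Here is a minimal sketch of drawing a relaxed (Concrete / Gumbel-softmax) sample as in the equation above. This is my own illustration; the unnormalized log-probabilities `log_alpha` and the temperature `lam` in the example are assumptions.

```python
import torch
import torch.nn.functional as F

def concrete_sample(log_alpha, lam=0.5):
    """Relaxed categorical sample: softmax((G + log_alpha) / lambda)."""
    # Standard Gumbel noise: G = -log(-log(U)), U ~ Uniform(0, 1).
    u = torch.rand_like(log_alpha)
    g = -torch.log(-torch.log(u))
    # As lam -> 0 this approaches a one-hot argmax sample; larger lam is smoother.
    return F.softmax((g + log_alpha) / lam, dim=-1)

# Example: a relaxed sample over 4 categories; gradients flow back into log_alpha.
log_alpha = torch.tensor([0.1, 0.2, 0.3, 0.4]).log().requires_grad_()
z = concrete_sample(log_alpha, lam=0.5)
```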
[link]
### Main Idea:

It tunes the hyper-parameters of the neural network architecture using reinforcement learning. The reward signal is the evaluation on the validation set. The method is policy gradient, since the cost function is non-differentiable.

### Method:

#### i. Actions:

1. There is a controller RNN which predicts some hyper-parameters of each layer, conditioned on the previous predictions. Each prediction is a one-hot vector based on the previously chosen hyper-parameter; for the very first prediction this vector is all zeros.
2. Once the network is generated completely, it is trained for a fixed number of epochs, and the reward signal is calculated from the evaluation on the validation set.

#### ii. Training:

1. Nothing fancy in the reinforcement learning approach: simple policy gradients (see the sketch after this summary).
2. A baseline is added to reduce the variance.
3. It takes 2-3 weeks to train on 800 GPUs!!

#### iii. Results:

1. Used to generate CNNs and LSTM cells. The generated architectures come close to state-of-the-art results.

### Possible new directions:

1. Use better RL techniques like TRPO, PPO, etc.
2. Right now, they generate fixed-length architectures. Their reasoning is that for variable-length architectures it is difficult to decide how long each architecture should be trained, since smaller networks are easier to train. One would thus somehow have to determine training time as a function of the learning capacity of the network.

Code: They haven't released the code yet. I tried to simulate it in torch (https://github.com/abhigenie92/nn_search).
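Below is a minimal REINFORCE-with-baseline sketch of the controller update described above. This is my own illustration, not the paper's code; the tiny two-step search space in `CHOICES`, the controller size, and the `validation_accuracy` placeholder are all assumptions for the example.

```python
import torch
import torch.nn as nn

# Hypothetical search space: filter size and filter count for one layer.
CHOICES = [[1, 3, 5, 7], [24, 36, 48, 64]]

class Controller(nn.Module):
    """Tiny controller RNN: predicts one hyper-parameter per step,
    conditioned (through the hidden state) on previous predictions."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.embed = nn.ModuleList([nn.Embedding(len(c), hidden) for c in CHOICES])
        self.heads = nn.ModuleList([nn.Linear(hidden, len(c)) for c in CHOICES])
        self.hidden = hidden

    def sample(self):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        x = torch.zeros(1, self.hidden)          # all-zero input for the first step
        actions, log_probs = [], []
        for emb, head in zip(self.embed, self.heads):
            h, c = self.rnn(x, (h, c))
            dist = torch.distributions.Categorical(logits=head(h))
            a = dist.sample()
            actions.append(a.item())
            log_probs.append(dist.log_prob(a))
            x = emb(a)                           # feed the chosen action back in
        return actions, torch.stack(log_probs).sum()

def validation_accuracy(actions):
    # Placeholder: the real method trains the generated child network
    # and evaluates it on the validation set to obtain the reward.
    return 0.5

controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0

for step in range(10):
    actions, log_prob = controller.sample()
    reward = validation_accuracy(actions)
    baseline = 0.9 * baseline + 0.1 * reward     # moving-average baseline
    loss = -(reward - baseline) * log_prob       # REINFORCE with baseline
    opt.zero_grad(); loss.backward(); opt.step()
```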
[link]
Mask RCNN takes off from where Faster RCNN left off, with some augmentations aimed at better instance segmentation (which was out of scope for FRCNN). Instance segmentation was achieved remarkably well in *DeepMask*, *SharpMask* and later *Feature Pyramid Networks* (FPN). Faster RCNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. Mask RCNN fixes that by introducing RoIAlign in place of RoIPool.

#### Methodology

Mask RCNN retains most of the architecture of Faster RCNN. It adds a third branch for segmentation, which takes the output of the RoIAlign layer and predicts a binary mask for each class.

##### Major changes and intuitions

**Mask prediction**

The mask branch predicts a binary mask for each RoI using a small fully convolutional network. The stark difference is the use of a *sigmoid* activation for the final mask instead of a *softmax*, which means masks don't compete with each other. This *decouples* segmentation from classification: the class branch is responsible for predicting the class, and only the mask corresponding to that class contributes to the mask loss Lmask. They also show that a single class-agnostic mask prediction works almost as well as a separate mask per class, further supporting the decoupling of classification from segmentation.

**RoIAlign**

RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins which are themselves quantized, and finally the feature values covered by each bin are aggregated (usually by max pooling). RoIAlign instead avoids any quantization of the RoI boundaries or bins: bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the results are aggregated (using max or average). A sketch of the bilinear sampling step follows this summary.

**Backbone architecture**

Faster RCNN uses a VGG-like structure for extracting features from the image, the weights of which are shared between the RPN and the region detection layers. Here the authors experiment with two backbone architectures: a plain ResNet (used the same way as the VGG backbone in FRCNN) and a ResNet-based [FPN](http://www.shortscience.org/paper?bibtexKey=journals/corr/LinDGHHB16). FPN takes convolutional feature maps from earlier layers and recombines them to produce a pyramid of feature maps used for prediction, instead of a single-scale feature layer (Faster RCNN used the final output of the conv layers before connecting to the fc layers).

**Training objective**

The training objective looks like this:

![](https://i.imgur.com/snUq73Q.png)

Lmask is the addition over Faster RCNN; how it is computed was described above.

#### Observation

Mask RCNN performs significantly better than the previous COCO instance segmentation winners, *without any bells and whistles*. Detailed results are available in the paper.
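Here is a minimal sketch of the bilinear sampling at the heart of RoIAlign, as referenced above. This is my own illustration, not the paper's implementation; it samples a feature value at an arbitrary continuous location without any quantization, and aggregates a 2x2 grid of such samples per bin by averaging. The helper names and the example feature map are assumptions.

```python
import math
import torch

def bilinear_sample(feature_map, y, x):
    """Sample a (C, H, W) feature map at continuous coordinates (y, x)
    by bilinearly interpolating the four surrounding grid points."""
    C, H, W = feature_map.shape
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    # Weighted sum of the four neighbours; (y, x) is never rounded.
    return ((1 - dy) * (1 - dx) * feature_map[:, y0, x0]
            + (1 - dy) * dx       * feature_map[:, y0, x1]
            + dy       * (1 - dx) * feature_map[:, y1, x0]
            + dy       * dx       * feature_map[:, y1, x1])

def roi_align_bin(feature_map, y_lo, x_lo, y_hi, x_hi):
    """Aggregate one RoI bin: average of 4 regularly sampled points (2x2 grid)."""
    ys = [y_lo + (y_hi - y_lo) * f for f in (0.25, 0.75)]
    xs = [x_lo + (x_hi - x_lo) * f for f in (0.25, 0.75)]
    samples = [bilinear_sample(feature_map, y, x) for y in ys for x in xs]
    return torch.stack(samples).mean(dim=0)

# Example: one bin of a floating-point RoI on a random 256-channel feature map.
fm = torch.randn(256, 50, 50)
bin_feature = roi_align_bin(fm, 10.3, 12.7, 12.9, 15.1)
```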