ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

jmlr.org
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by José Manuel Rodríguez Sotelo 9 years ago

The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer’s input change. This phenomenon is called internal covariate shift (ICS).

#### What is BN?
Normalize each (scalar) feature independently with respect to the mean and variance of the mini batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. The BN consists of making normalization part of the model architecture.

#### What do we gain?
According to the author, the use of BN provides a great speed up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model which allows to use less dropout or less L2 normalization. Furthermore, since the distribution of the inputs is normalized, it also allows to use sigmoids as activation functions without the saturation problem.

#### What follows?
This seems to be specially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformation that scale up or down the activations in certain directions (eigenvectors). It seems that this regularization would be specially useful in this context since this would allow the gradient to flow more easily. When we unroll the RNNs, we usually have ultra deep networks.

#### Like
* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike
* Does not work with stochastic gradient descent (minibatch size = 1).
* This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied.
* Results on ensemble of networks for ImageNet makes it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model).

arxiv.org
arxiv-vanity.com
scholar.google.com

Wasserstein GAN
Martin Arjovsky and Soumith Chintala and Léon Bottou
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by MarvMind 8 years ago

This very new paper, is currently receiving quite a bit of attention by the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/).

The paper describes a new training approach, which solves the two major practical problems with current GAN training:

1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and will help debugging, tune parameters and so on.

2) The training process does not suffer from all the instability problems. In particular the paper reduces mode collapse significantly.

On top of that, the paper comes with quite a bit mathematical theory, explaining why there approach works and other approachs have failed. This paper is a must read for anyone interested in GANs.

arxiv.org
scholar.google.com

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew
arXiv e-Print archive - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* The paper presents gradient computation based techniques to visualise image classification models.
* [Link to the paper](https://arxiv.org/abs/1312.6034)

#### Experimental Setup

* Single deep convNet trained on ILSVRC-2013 dataset (1.2M training images and 1000 classes).
* Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.

#### Class Model Visualisation

* Given a learnt ConvNet and a class (of interest), start with the zero image and perform optimisation by back propagating with respect to the input image (keeping the ConvNet weights constant).
* Add the mean image (for training set) to the resulting image.
* The paper used unnormalised class scores so that optimisation focuses on increasing the score of target class and not decreasing the score of other classes.

#### Image-Specific Class Saliency Visualisation

* Given an image, class of interest, and trained ConvNet, rank the pixels of the input image based on their influence on class scores.
* Derivative of the class score with respect to image gives an estimate of the importance of different pixels for the class.
* The magnitude of derivative also indicated how much each pixel needs to be changed to improve the class score.

##### Class Saliency Extraction

* Find the derivative of the class score with respect with respect to the input image.
* This would result in one single saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of derivative across all colour channels.

##### Weakly Supervised Object Localisation

* The saliency map for an image provides a rough encoding of the location of the object of the class of interest. 
* Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation.
* Color continuity cues are needed as saliency maps might capture only the most dominant part of the object in the image.
* This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

#### Relation to Deconvolutional Networks

* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in the way RELU neurons are treated: 
    * In DeconvNet, the sign indicator (for the derivative of RELU) is computed on output reconstruction while in this paper, the sign indicator is computed on the layer input.

dx.doi.org
sci-hub
scholar.google.com

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being set an an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop is performed for the RoI pooling layer by taking the argmax of the incoming gradients that overlap the incoming values.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}