Summary by Martin Thoma
One problem with training deep networks is that the distribution of each layer's inputs changes during training: the features produced by the lower layers shift while the upper layers have already adapted to the previous lower-layer features. This phenomenon of the inputs changing during optimization is called *internal covariate shift*.
Batch normalization counters it at training time by normalizing the activations within each mini-batch.
## Ideas
* Training converges faster if the input is whitened (zero mean, unit variance, decorrelated).
* The normalization has to be part of the gradient computation; otherwise the model can blow up, as sketched below.
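
The paper illustrates the second point with a layer that adds a learned bias $b$ to its input $u$ and is then normalized by subtracting the mean:

$$\hat{x} = x - \mathbb{E}[x], \qquad x = u + b$$

If a gradient step $b \leftarrow b + \Delta b$ ignores the dependence of $\mathbb{E}[x]$ on $b$, the normalized output is unchanged,

$$u + (b + \Delta b) - \mathbb{E}[u + b + \Delta b] = u + b - \mathbb{E}[u + b],$$

so the loss stays constant while $b$ grows without bound.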
## What Batch Normalization is
For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, we will normalize
each dimension
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
where, ideally, the expectation and the variance are computed over the whole training set; in practice they are estimated per mini-batch. Note that this does *not* decorrelate the features.
Additionally, for each activation $x^{(k)}$ two parameters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the normalized feature:
$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$
These two parameters (one pair per feature) are learned along with the rest of the model; setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$ would recover the original activations, so the network can undo the normalization if that is optimal.
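
To make the two formulas concrete, here is a minimal NumPy sketch of the training-time forward pass. The function name `batch_norm_forward` and the toy data are mine; the small $\epsilon$ inside the square root for numerical stability is from the paper.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x:     (batch_size, d) mini-batch of activations
    gamma: (d,) learned scale per feature
    beta:  (d,) learned shift per feature
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learned scale and shift

# Example: 4 samples, 3 features, deliberately not zero-mean/unit-variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```

At inference time the paper replaces the mini-batch statistics with population estimates (running averages collected during training), so the output becomes a deterministic function of the input.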
## Effects of Batch Normalization
* Higher learning rates can be used
* Initialization is less important
* Acts as a regularizer, eliminating the need for dropout in some cases
* Faster training
## Results
* ImageNet classification: an ensemble of batch-normalized networks reaches 4.9% top-5 validation error (4.8% test error), exceeding the reported accuracy of human raters.
## Used by
* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)
## See also
* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)
