Summary by Martin Thoma
One problem with training deep networks is that the distribution of each layer's inputs changes during training: the features produced by the lower layers shift while the upper layers have already adapted to the previous lower-layer features. This phenomenon of the inputs changing during optimization is called *internal covariate shift*.
Batch normalization counters it at training time by normalizing the activations within each mini-batch.
## Ideas
* Training converges faster if the input is whitened (zero mean, unit variance, decorrelated).
* The normalization has to be part of the gradient computation; otherwise the model can blow up, as sketched below.
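
The paper illustrates the second point with a layer that adds a learned bias $b$ to its input $u$ and is then normalized by subtracting the mean:

$$\hat{x} = x - \mathbb{E}[x], \qquad x = u + b$$

If a gradient step $b \leftarrow b + \Delta b$ ignores the dependence of $\mathbb{E}[x]$ on $b$, the normalized output is unchanged,

$$u + (b + \Delta b) - \mathbb{E}[u + b + \Delta b] = u + b - \mathbb{E}[u + b],$$

so the loss stays constant while $b$ grows without bound.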
## What Batch Normalization is
For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, we will normalize
each dimension
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
where, ideally, the expectation and the variance are computed over the whole training set; in practice they are estimated per mini-batch. Note that this does *not* decorrelate the features.
Additionally, for each activation $x^{(k)}$ two parameters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the normalized feature:
$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$
These two parameters (one pair per feature) are learned along with the rest of the model; setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$ would recover the original activations, so the network can undo the normalization if that is optimal.
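
To make the two formulas concrete, here is a minimal NumPy sketch of the training-time forward pass. The function name `batch_norm_forward` and the toy data are mine; the small $\epsilon$ inside the square root for numerical stability is from the paper.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x:     (batch_size, d) mini-batch of activations
    gamma: (d,) learned scale per feature
    beta:  (d,) learned shift per feature
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learned scale and shift

# Example: 4 samples, 3 features, deliberately not zero-mean/unit-variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```

At inference time the paper replaces the mini-batch statistics with population estimates (running averages collected during training), so the output becomes a deterministic function of the input.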
## Effects of Batch Normalization
* Higher learning rates can be used
* Initialization is less important
* Acts as a regularizer, eliminating the need for dropout in some cases
* Faster training
## Results
* ImageNet classification: an ensemble of batch-normalized networks reaches 4.9% top-5 validation error (4.8% test error), exceeding the reported accuracy of human raters.
## Used by
* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)
## See also
* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)
