One problem in training deep networks is that the features of the lower layers change while the upper layers have already been adjusted to the previous lower-layer features. This phenomenon of changing input distributions during optimization is called *internal covariate shift*. Batch normalization is applied at training time to each mini-batch.

## Ideas

* Training converges faster if the input is whitened (zero mean, unit variance, decorrelated).
* Normalization parameters have to be computed within the gradient calculation step to prevent the model from blowing up.

## What Batch Normalization is

For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, each dimension is normalized:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\operatorname{Var}[x^{(k)}]}}$$

where the expectation and the variance are computed over the training data set (in practice, over each mini-batch during training). This does *not* decorrelate the features, though.

Additionally, for each activation $x^{(k)}$ two parameters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the normalized feature:

$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$

Those two parameters (per feature) are learnable! A minimal code sketch of this transform is given at the end of this summary.

## Effect of Batch Normalization

* Higher learning rates can be used.
* Initialization is less important.
* Acts as a regularizer, eliminating the need for dropout in some cases.
* Faster training.

## Datasets

* Reaches 4.9% top-5 validation error (and 4.8% test error) on ImageNet classification.

## Used by

* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)

## See also

* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)
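
## Code sketch

A minimal NumPy sketch of the per-mini-batch transform described above. The function name, the array shapes, and the small constant `eps` (added to the variance for numerical stability, as in the paper) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x     : array of shape (batch_size, d) -- mini-batch of layer inputs
    gamma : array of shape (d,)            -- learnable scale per feature
    beta  : array of shape (d,)            -- learnable shift per feature
    """
    mean = x.mean(axis=0)                    # E[x^(k)] over the mini-batch
    var = x.var(axis=0)                      # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta              # y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

# Usage: a random mini-batch of 32 examples with 4 features.
x = np.random.randn(32, 4) * 3.0 + 1.0
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))  # approximately 0 and 1 per feature
```

At inference time the paper replaces the mini-batch statistics with population estimates (e.g. moving averages collected during training), so the output becomes a deterministic function of the input.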