One problem of training deep networks is that the features of lower-layer networks change while the upper-layer networks have already adjusted to the previous lower-layer features. This phenomenon of changing inputs while optimizing is called *internal covariate shift*. Batch normalization is applied at training time to each mini-batch.

## Ideas

* Training converges faster if the input is whitened (zero means, unit variances, decorrelated).
* Normalization parameters have to be computed within the gradient calculation step to prevent the model from blowing up.

## What Batch Normalization is

For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, each dimension is normalized separately:

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$

where the expectation and the variance are computed over the training data set. Note that this does *not* decorrelate the features. Additionally, for each activation $x^{(k)}$ two parameters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the normalized feature:

$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$

Those two parameters (per feature) are learnable! A small code sketch of this computation is given at the end of this summary.

## Effect of Batch Normalization

* Higher learning rates can be used
* Initialization is less important
* Acts as a regularizer, eliminating the need for dropout in some cases
* Faster training

## Datasets

* reaching 4.9% top-5 validation error (and 4.8% test error) on ImageNet classification

## Used by

* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)

## See also

* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)
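
## Code sketch

To make the per-feature normalization above concrete, here is a minimal sketch of the training-time forward pass in NumPy. The function name `batch_norm_forward` and the stability constant `eps` are my own choices for illustration (the paper likewise adds a small epsilon inside the square root); this is not a reference implementation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x     : (batch_size, d) mini-batch of d-dimensional activations
    gamma : (d,) learnable per-feature scale
    beta  : (d,) learnable per-feature shift
    eps   : small constant for numerical stability (assumed, see lead-in)
    """
    mean = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalized activations \hat{x}^{(k)}
    return gamma * x_hat + beta              # y^{(k)} = gamma^{(k)} \hat{x}^{(k)} + beta^{(k)}

# Toy usage: a mini-batch of 32 examples with d = 4 features
x = np.random.randn(32, 4) * 3.0 + 5.0
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0))  # approximately zero per feature
print(y.var(axis=0))   # approximately one per feature
```

Note that in the paper this mini-batch normalization is only used during training; at inference time the batch statistics are replaced by population estimates (e.g. running averages collected during training), so the transform becomes a fixed linear map per feature.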
