Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe
and
Christian Szegedy
arXiv e-Print archive - 2015
Keywords:
cs.LG
First published: 2015/02/11
Abstract: Training Deep Neural Networks is complicated by the fact that the
distribution of each layer's inputs changes during training, as the parameters
of the previous layers change. This slows down the training by requiring lower
learning rates and careful parameter initialization, and makes it notoriously
hard to train models with saturating nonlinearities. We refer to this
phenomenon as internal covariate shift, and address the problem by normalizing
layer inputs. Our method draws its strength from making normalization a part of
the model architecture and performing the normalization for each training
mini-batch. Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer, in some
cases eliminating the need for Dropout. Applied to a state-of-the-art image
classification model, Batch Normalization achieves the same accuracy with 14
times fewer training steps, and beats the original model by a significant
margin. Using an ensemble of batch-normalized networks, we improve upon the
best published result on ImageNet classification: reaching 4.9% top-5
validation error (and 4.8% test error), exceeding the accuracy of human raters.
Network training is very sensitive to the learning rate and to parameter initialization. The distribution of each layer's inputs changes during training as the parameters of the preceding layers change (the authors call this internal covariate shift), so every layer must continually adapt to a new input distribution. In this paper the authors introduce batch normalization, a new layer that reduces internal covariate shift.
_Dataset:_ [MNIST](http://yann.lecun.com/exdb/mnist/), [ImageNet](http://www.image-net.org/).
#### Inner workings:
Batch normalization fixes the means and variances of a layer's inputs by applying the following normalization to each training mini-batch.
[![screen shot 2017-04-13 at 10 21 39 am](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)
The scale and shift parameters gamma and beta are then learned by gradient descent along with the other model parameters.
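For concreteness, here is a minimal NumPy sketch of that per-batch transform for a fully connected layer (the function name, shapes, and the small `eps` stabilizer added to the variance are illustrative choices, not taken verbatim from the paper's figure):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for inputs x of shape (N, D):
    normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations (eps for numerical stability)
    return gamma * x_hat + beta            # learned scale (gamma) and shift (beta)
```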
During inference, the normalization uses population statistics estimated over the whole training set (via unbiased estimates accumulated across mini-batches) rather than the statistics of the current batch.
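A common practical approximation of those population statistics, sketched below, keeps exponential moving averages of the batch statistics during training and reuses them at test time (the momentum value and function names are assumptions, not the paper's exact procedure):

```python
import numpy as np

def update_running_stats(running_mean, running_var, mu, var, momentum=0.1):
    """Track approximate population statistics during training
    with an exponential moving average of the batch statistics."""
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var
    return running_mean, running_var

def batchnorm_inference(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time batch norm: a fixed, deterministic transform that
    uses the stored population statistics instead of batch statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```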
#### Results:
Batch normalization provides several advantages:
1. Allows a higher learning rate without risk of divergence by stabilizing the scale of the gradients.
2. Regularizes the model.
3. Reduces the need for dropout.
4. Prevents the network from getting stuck in the saturated regimes of saturating nonlinearities.
#### What to do?
1. Add a batch norm layer before each activation layer (see the sketch after this list).
2. Increase the learning rate.
3. Remove dropout.
4. Reduce L2 weight regularization.
5. Accelerate learning rate decay.
6. Reduce image distortions used for data augmentation.
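As an illustration of steps 1 to 5, here is a minimal PyTorch sketch; the framework, layer sizes, and hyperparameter values are my own illustrative assumptions, not the paper's exact recipe:

```python
import torch
from torch import nn

# Batch norm inserted before each activation; no dropout layers.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN's shift
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

# Higher learning rate, smaller L2 penalty (weight_decay), faster learning rate decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```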