First published: 2015/02/11 Abstract: Training Deep Neural Networks is complicated by the fact that the
distribution of each layer's inputs changes during training, as the parameters
of the previous layers change. This slows down the training by requiring lower
learning rates and careful parameter initialization, and makes it notoriously
hard to train models with saturating nonlinearities. We refer to this
phenomenon as internal covariate shift, and address the problem by normalizing
layer inputs. Our method draws its strength from making normalization a part of
the model architecture and performing the normalization for each training
mini-batch. Batch Normalization allows us to use much higher learning rates and
be less careful about initialization. It also acts as a regularizer, in some
cases eliminating the need for Dropout. Applied to a state-of-the-art image
classification model, Batch Normalization achieves the same accuracy with 14
times fewer training steps, and beats the original model by a significant
margin. Using an ensemble of batch-normalized networks, we improve upon the
best published result on ImageNet classification: reaching 4.9% top-5
validation error (and 4.8% test error), exceeding the accuracy of human raters.
#### Problem addressed:
Strategy for training deep neural networks
#### Summary:
The input distribution to every layer changes constantly while training a deep network; the authors call this internal covariate shift. They claim this shift slows the learning of optimal model parameters. To overcome it, they make normalizing the input of every layer a part of the optimization strategy. Specifically, they reparameterize each layer's input so that it is whitened and therefore has a stable distribution across iterations.
They apply two approximations in their strategy:
1. this normalization is done for every mini-batch of training data,
2. the input dimensions are assumed to be uncorrelated.
Finally, each layer's input is mean-subtracted and variance-normalized (both statistics are differentiable, so gradients can be back-propagated through them during training). Additionally, the authors introduce two learnable scalar parameters $(r,b)$ per dimension, so that the layer output becomes $y=g(r\,\mathrm{BN}(x)+b)$ where $g$ is the activation function.
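The per-mini-batch transform described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code; the names `batch_norm`, `gamma`, and `beta` (standing in for the summary's $r$ and $b$) are my own:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each dimension over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-dimension mini-batch mean
    var = x.var(axis=0)                     # per-dimension mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # dimensions normalized independently
    return gamma * x_hat + beta             # learnable per-dimension scale/shift

# Example: a mini-batch of 64 samples with 8 features each
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With `gamma=1` and `beta=0`, each output dimension has (approximately) zero mean and unit variance over the mini-batch, matching the two approximations above: statistics are computed per mini-batch, and each dimension is normalized independently rather than jointly whitened.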
Beyond the intuition above, the advantage of BN is that it permits higher learning rates: the network's behavior is unaffected by the scale of the parameters W and of the bias. The authors also show empirically that BN acts as a regularizer, since training without Dropout yields on-par performance.
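The scale-invariance claim can be checked numerically: scaling the weights by a constant scales the mini-batch mean and standard deviation by the same factor, which the normalization cancels. A small sketch under that assumption (illustrative names, not the paper's code):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 8))         # mini-batch of layer inputs
W = rng.normal(size=(8, 4))          # layer weights
gamma, beta = np.ones(4), np.zeros(4)

# BN(a*W u) is (up to eps) identical to BN(W u) for any scalar a
out = batch_norm(u @ W, gamma, beta)
out_scaled = batch_norm(u @ (10.0 * W), gamma, beta)
```

The two outputs agree to numerical precision, which is why gradients no longer blow up or vanish as the weight scale drifts during training.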
#### Novelty:
Previous work focused on whitening only the input to the first layer. This work extends the idea to all layers and proposes a practical approach for applying it to real-world data.
#### Datasets:
ImageNet
#### Resources:
presentation video available on cedar server
#### Presenter:
Devansh Arpit