Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift on ShortScience.org

jmlr.org
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 8

[link] Summary by José Manuel Rodríguez Sotelo 8 years ago

The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer’s input change. This phenomenon is called internal covariate shift (ICS).

#### What is BN?
Normalize each (scalar) feature independently with respect to the mean and variance of the mini batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. The BN consists of making normalization part of the model architecture.

#### What do we gain?
According to the author, the use of BN provides a great speed up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model which allows to use less dropout or less L2 normalization. Furthermore, since the distribution of the inputs is normalized, it also allows to use sigmoids as activation functions without the saturation problem.

#### What follows?
This seems to be specially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformation that scale up or down the activations in certain directions (eigenvectors). It seems that this regularization would be specially useful in this context since this would allow the gradient to flow more easily. When we unroll the RNNs, we usually have ultra deep networks.

#### Like
* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike
* Does not work with stochastic gradient descent (minibatch size = 1).
* This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied.
* Results on ensemble of networks for ImageNet makes it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model).

Your comment:

[link] Summary by Shagun Sodhani 7 years ago

The [Batch Normalization paper](http://arxiv.org/pdf/1502.03167.pdf) describes a method to address the various issues related to training of Deep Neural Networks. It makes normalization a part of the architecture itself and reports significant improvements in terms of the number of iterations required to train the network.

## Issues With Training Deep Neural Networks

### Internal Covariate shift

Covariate shift refers to the change in the input distribution to a learning system. In the case of deep networks, the input to each layer is affected by parameters in all the input layers. So even small changes to the network get amplified down the network. This leads to change in the input distribution to internal layers of the deep network and is known as internal covariate shift.

It is well established that networks converge faster if the inputs have been whitened (ie zero mean, unit variances) and are uncorrelated and internal covariate shift leads to just the opposite.

### Vanishing Gradient

Saturating nonlinearities (like tanh or sigmoid) can not be used for deep networks as they tend to get stuck in the saturation region as the network grows deeper. Some ways around this are to use:
* Nonlinearities like ReLU which do not saturate
* Smaller learning rates
* Careful initializations

## Normalization

Let us say that the layer we want to normalize has *d* dimensions **x** $= (x_1, ... x_d)$. Then, we can normalize the $k^th$ dimension as follows:

![Scaled and shifted normalized value](https://db.tt/YORi6lov)

We also need to scale and shift the normalized values otherwise just normalizing a layer would limit the layer in terms of what it can represent. For example, if we normalize the inputs to a sigmoid function, then the output would be bound to the linear region only.

So the normalized input $x^k$ is transformed to:

![Scaled and shifted normalized value](https://db.tt/6vImAQoc)

where $γ$ and $β$ are parameters to be learned.

Moreover, just like we use mini-batch in Stochastic Gradient Descent (SGD), we can use mini-batch with normalization to estimate the mean and variance for each activation.

The transformation from $x$ to $y$ as described above is called **Batch Normalizing Tranform**. This BN transform is differentiable and ensures that as the model is training, the layers can learn on the input distributions that exhibit less internal covariate shift and can hence accelerate the training.

At training time, a subset of activations in specified and BN transform is applied to all of them.

During test time, the normalization is done using the population statistics instead of mini-batch statistics to ensure that the output deterministically depends on the input.

## Batch Normalized Convolutional Networks

Let us say that $x = g(Wu+b)$ is the operation performed by the layer where $W$ and $b$ are the parameters to be learned, $g$ is a nonlinearity and $u$ is the input from the previous layer.

The BN transform is added just before the nonlinearity, by normalizing $x = Wu+b$. An alternative would have been to normalize $u$ itself but constraining just the first and the second moment would not eliminate the covariate shift from $u$.

When normalizing $Wu+b$, we can ignore the $b$ term as it would be canceled during the normalization step (*b*'s role is subsumed by β) and we have

$z = g( BN(Wu) )$

For convolutional layers, normalization should follow the convolution property as well - ie different elements of the same feature map, at different locations, are normalized in the same way. So all the activations in a mini-batch are jointly normalized over all the locations and parameters (*γ* and *β*) are learnt per feature map instead of per activation.

## Advantages Of Batch Normalization

1. Reduces internal covariant shift.
2. Reduces the dependence of gradients on the scale of the parameters or their initial values.
3. Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
4. Allows use of saturating nonlinearities and higher learning rates.

Batch Normalization was applied to models trained for MNIST and Inception Network for ImageNet. All the above-mentioned advantages were validated in the experiments. Interestingly, Batch Normalization with sigmoid achieved an accuracy of 69.8% (overall best, using any nonlinearity, was 74.8%) while Inception model (sigmoid nonlinearity), without Batch Normalisation, worked only as good as a random guess.

## Future Work

While BN Transform does enhance the overall deep network training task, its precise effect on gradient propagation is still not well understood. A future extension of Batch Normalisation would be in the domain of Recurrent Neural Networks where internal covariate shift and vanishing gradients are more severe. It remains to be explored if it can also help with domain adaption by easily generalizing to new data distributions.

Do you have a source for how the normalization works for CNNs? Do you know of any follow-up work which did what you mentioned in "Future work"? (And there is a typo: "archwitecture")

To see effect of batch normalization on CNN, you may refer this benchmark [https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md] Thanks for pointing out the typo :)

Your comment:

[link] Summary by Alexander Jung 6 years ago

### What is BN:
  * Batch Normalization (BN) is a normalization method/layer for neural networks.
  * Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*.
  * BN essentially performs Whitening to the intermediate layers of the networks.

### How its calculated:
  * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and `var(x)` is its variance within a batch.
  * BN extends that formula further to $x^{**} = gamma * x^* +$ beta, where $x^{**}$ is the final normalized value. `gamma` and `beta` are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases.
  * For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance).

### Theoretical effects:
  * BN reduces *Covariate Shift*. That is the change in distribution of activation of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. its usually not active, sometimes a bit active, rare very active.
  * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions).
  * BN reduces effects of exploding and vanishing gradients, because every becomes roughly normal distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on.

### Practical effects:
  * BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.)
  * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.)
  * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.)
  * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.)


![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations")

*BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 15th percentile and the bottom line is the 85th percentile.*

-------------------------

### Rough chapter-wise notes

* (2) Towards Reducing Covariate Shift
  * Batch Normalization (*BN*) is a special normalization method for neural networks.
  * In neural networks, the inputs to each layer depend on the outputs of all previous layers.
  * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*.
  * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions).
  * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network.
  * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance).
  * That accomplishes:
    * No more covariate shift.
    * Fixes problems with vanishing gradients due to saturation.
  * Effects:
    * Networks learn faster. (As they don't have to adjust to covariate shift any more.)
    * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.)
    * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.)
    * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.)
    * BN reduces the need for dropout. (As it has a regularizing effect.)
  * How BN works:
    * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*.
    * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change.
    * A proper method has to include the current example *and* all previous examples in the normalization step.
    * This leads to calculating in covariance matrix and its inverse square root. That's expensive. The authors found a faster way.

* (3) Normalization via Mini-Batch Statistics
  * Each feature (component) is normalized individually. (Due to cost, differentiability.)
  * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))`
  * Normalizing each component can reduce the expressitivity of nonlinearities. Hence the formula is changed so that it can also learn the identiy function.
  * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component)
  * E and Var are estimated for each mini batch.
  * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left).

* (3.1) Training and Inference with Batch-Normalized Networks
  * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training.
  * During test time, the BN formulas can be simplified to a single linear transformation.

* (3.2) Batch-Normalized Convolutional Networks
  * Authors recommend to place BN layers after linear/fully-connected layers and before the ninlinearities.
  * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian.
  * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason).
  * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta).
  * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*pq, where m is the number of examples in the batch and p q are the feature map dimensions (height, width). BN for linear layers has a batch size of m.
  * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.)

* (3.3) Batch Normalization enables higher learning rates
  * BN normalizes activations.
  * Result: Changes to early layers don't amplify towards the end.
  * BN makes it less likely to get stuck in the saturating parts of nonlinearities.
  * BN makes training more resilient to parameter scales.
  * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions.
  * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used.
  * (something about singular values and the Jacobian)

* (3.4) Batch Normalization regularizes the model
  * Usually: Examples are seen on their own by the network.
  * With BN: Examples are seen in conjunction with other examples (mean, variance).
  * Result: Network can't easily memorize the examples any more.
  * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength.

* (4) Experiments
* (4.1) Activations over time
** They tested BN on MNIST with a 100x100x10 network. (One network with BN before each nonlinearity, another network without BN for comparison.)
** Batch Size was 60.
** The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training.
** Generalization of the BN network seemed to be better.

* (4.2) ImageNet classification
** They applied BN to the Inception network.
** Batch Size was 32.
** During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation.
** They shuffle the data during training (i.e. each batch contains different examples).
** Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate).
** BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU).
** An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet.

* (5) Conclusion
** BN is similar to a normalization layer suggested by Gülcehre and Bengio. However, they applied it to the outputs of nonlinearities.
** They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function).

Your comment:

[link] Summary by Denny Britz 7 years ago

TLDR; The authors introduce Batch Normalization, a technique to normalize unit activations to zero mean and unit variance within the network. The authors show that in Feedforward and Convolutional Networks, Batch Normalization leads to faster training and better accuracies. BN also acts as a regularizer, reducing the need for Dropout, etc. Using an ensemble of batch normalized networks the authors achieve state of the art on ILSVRC.


#### Key Points

- Network training is complicated because the input distributions to higher level change as the parameter in lower layers are changing: Internal Covariate Shift. Solution: Normalize within the network.
- BN: Normalize input to nonlinearity to have zero mean and unit variance. Then add two additional parameters (scaling and bias) per unit to preserve expressability of the network. Statistics are calculated per minibatch.
- Network parameters increase, but not by much: 2 parameter per unit that has batch normalization applied to it.
- Works well for fully connected and convolutional layers. Authors didn't try RNNs.
- Change to make when adding BN: Increase learning rate, remove/decrease dropout and l2 regularization, accelerate learning rate decay, shuffle training examples more thoroughly.

Could you please explain why adding the parameters $\beta$ and $\gamma$ does not change the variance?

What do you mean by "shuffle training examples more thoroughly"?

Your comment:

[link] Summary by Martin Thoma 7 years ago

One problem of training deep networks is that the features of lower-layer networks change while the upper-layer networks have already been adjusted to the previous lower-layer features. The phenomenon of changing inputs while optimizing is called *internal covariate shift*.

Batch normalization is done at training time for each mini batch.

## Ideas

* Training converges faster, if input is whitened (zero means, unit variances, decorrelated).
* Normalization parameters have to be computed within the gradient calculation step to prevent the model from blowing up

## What Batch Normalization is

For a layer with $d$-dimensional input $x = (x^{(1)}, \dots, x^{(d)})$, we will normalize
each dimension 
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
where the expectation and the variance are computed over the training data set. This does *not* decorrelate the features, though.

Additionally, for each activation $x^{(k)}$ two paramters $\gamma^{(k)}, \beta^{(k)}$ are introduced which scale and shift the feature:

$$y^{(k)} = \gamma^{(k)} \cdot \hat{x}^{(k)} + \beta^{(k)}$$

Those two parameters (per feature) are learnable!

## Effect of Batch normalization

* Higher learning rates can be used
* Initialization is less important
* Acts as a regularizer, eliminating the need for dropout in some cases
* Faster training

## Datasets

* reaching 4.9% top-5 validation error (and 4.8% test error) on ImageNet classification

## Used by

* [Going Deeper with Convolutions](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14)
* [Deep Residual Learning for Image Recognition](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15#martinthoma)

## See also

* [other summaries](http://www.shortscience.org/paper?bibtexKey=conf/icml/IoffeS15)

Your comment:

[link] Summary by Cubs Reading Group 6 years ago

#### Problem addressed:
Strategy for training deep neural networks

#### Summary:
The input distribution (to every layer) undergoes constant changes while training a deep network. The authors call this internal covariate shift in the input distribution. The authors claim this leads to slow learning of optimal model parameters. In order to overcome this, they introduce the idea of normalizing the input of every layer a part of the optimization strategy. Specifically, they reparameterize the input to each layer so that it is whitened and thus has non-changing distribution at every iteration.

They apply 2 approximation in their strategy:

1. this normalization is done for every mini-batch of training data,

2. the input dimensions are assumed to be uncorrelated.

Finally, the output of last layer is mean subtracted and variance normalized (these can be back-propagated while training). Additionally, the authors also introduce 2 learnable scalar parameters $(r,b)$ per dimension such that the final input to a layer is $y=rg(BN(x))+b$ where g is the activation function.

The advantage of BN apart from the intuition mentioned above is that it allows higher learning rate and network behavior remains unaffected by the scale of the parameters W and bias. The authors also empirically show that BN acts as a regularizer since optimization without dropout yields at par performance.

#### Novelty:
Previous work only focused on whitening in 1st layer input. This work extends this idea to all layers and suggests a practical approach for applying this idea to real world data.

#### Datasets:
Imagenet

#### Resources:
presentation video available on cedar server

#### Presenter:
Devansh Arpit

Your comment:

[link] Summary by Léo Paillier 6 years ago

Network training is very sensitive to learning rate and initialization factors. Each layer output distribution is different than its input distribution (called covariate shift) which implies that layers have to permanently adapt to new input distribution. In this paper the author introduce batch normalization, a new layer to reduce covariate shift.

_Dataset:_ [MNIST](http://yann.lecun.com/exdb/mnist/), [ImageNet](www.image-net.org/).

#### Inner workings:

Batch normalization fixes the means and variances of layer inputs for a training batch by computing the following normalization on each batch.
[![screen shot 2017-04-13 at 10 21 39 am](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)](https://cloud.githubusercontent.com/assets/17261080/24996464/4027fbba-2033-11e7-966a-2db3c0f1389d.png)
The parameters Gamma and Beta are then learned with a gradient descent.
During inference the statistics are computed using unbiased estimators of the whole dataset (and not just the batch).

#### Results:

Batch normalization provides several advantages:

1. Use of a higher learning rate without risk of divergence by stabilizing the gradient scale.
2. Regularizes the model.
3. Reduces the need for dropout.
4. Avoid the network to get stuck when using saturating nonlinearities.

#### What to do?

1. Add batch norm layer before activation layers.
2. Increase the learning rate.
3. Remove dropout.
4. Reduce L2 weight regularization.
5. Accelerate learning rate decay.
6. Reduce picture distorsion for data augmentation.

Your comment:

[link] Summary by Joseph Paul Cohen 8 years ago

A *Batch Normalization* applied immediately after fully connected layers and adjusts the values of the feedforward output so that they are centered to a zero mean and have unit variance.

It has been used by famous Convolutional Neural Networks such as GoogLeNet \cite{journals/corr/SzegedyLJSRAEVR14} and ResNet \cite{journals/corr/HeZRS15}

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private