The Shattered Gradients Problem: If resnets are the answer, then what is the question?
David Balduzzi
and
Marcus Frean
and
Lennox Leary
and
JP Lewis
and
Kurt Wan-Duo Ma
and
Brian McWilliams
arXiv e-Print archive - 2017 via Local arXiv
Keywords:
cs.NE, cs.LG, stat.ML
First published: 2017/02/28 (7 years ago) Abstract: A long-standing obstacle to progress in deep learning is the problem of
vanishing and exploding gradients. The problem has largely been overcome
through the introduction of carefully constructed initializations and batch
normalization. Nevertheless, architectures incorporating skip-connections such
as resnets perform much better than standard feedforward architectures despite
well-chosen initialization and batch normalization. In this paper, we identify
the shattered gradients problem. Specifically, we show that the correlation
between gradients in standard feedforward networks decays exponentially with
depth resulting in gradients that resemble white noise. In contrast, the
gradients in architectures with skip-connections are far more resistant to
shattering decaying sublinearly. Detailed empirical evidence is presented in
support of the analysis, on both fully-connected networks and convnets.
Finally, we present a new "looks linear" (LL) initialization that prevents
shattering. Preliminary experiments show the new initialization allows to train
very deep networks without the addition of skip-connections.
Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about - and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise.
![](http://i.imgur.com/KMzfzMZ.png)
However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**.
Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(-x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(-x)`. Then:
$$
\mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases}
$$
Doing this allows them to train deep networks without skip connections, and they show results on CIFAR-10 with depths of up to 200 exceeding (slightly) a similar resnet.
[crelu]: https://arxiv.org/abs/1603.05201
Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about -3 and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise.
![](http://i.imgur.com/KMzfzMZ.png)
However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**.
Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(-x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(-x)`. Then:
$$
\mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases}
$$
Doing this allows them to train deep networks without skip connections, and they show results on CIFAR-10 with depths of up to 200 exceeding (slightly) a similar resnet.
[crelu]: https://arxiv.org/abs/1603.05201