The Shattered Gradients Problem: If resnets are the answer, then what is the question?
David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma and Brian McWilliams
arXiv e-Print archive, 2017
Keywords: cs.NE, cs.LG, stat.ML
First published: 2017/02/28
Abstract: A long-standing obstacle to progress in deep learning is the problem of
vanishing and exploding gradients. The problem has largely been overcome
through the introduction of carefully constructed initializations and batch
normalization. Nevertheless, architectures incorporating skip-connections such
as resnets perform much better than standard feedforward architectures despite
well-chosen initialization and batch normalization. In this paper, we identify
the shattered gradients problem. Specifically, we show that the correlation
between gradients in standard feedforward networks decays exponentially with
depth, resulting in gradients that resemble white noise. In contrast, the
gradients in architectures with skip-connections are far more resistant to
shattering, decaying only sublinearly with depth. Detailed empirical evidence is
presented in support of the analysis on both fully-connected networks and convnets.
Finally, we present a new "looks linear" (LL) initialization that prevents
shattering. Preliminary experiments show that the new initialization allows very deep
networks to be trained without the addition of skip-connections.
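The decay described in the abstract can be probed directly: compute the gradient of the output with respect to the input at many points along a path through input space and measure how correlated these gradient vectors remain as depth grows. Below is a minimal numpy sketch, assuming a fully-connected ReLU network with He-style initialization; the widths, depths, and 1-D input path are illustrative choices, not the paper's exact experimental protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(depth, width):
    """He-style initialization for a fully-connected ReLU net with scalar output."""
    Ws = [rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width)) for _ in range(depth)]
    w_out = rng.normal(0.0, np.sqrt(2.0 / width), size=width)
    return Ws, w_out

def input_gradient(Ws, w_out, x):
    """d(output)/d(input), computed by hand: grad = W1^T D1 ... WL^T DL w_out."""
    pre_acts, h = [], x
    for W in Ws:
        a = W @ h
        pre_acts.append(a)
        h = np.maximum(a, 0.0)
    g = w_out
    for W, a in zip(reversed(Ws), reversed(pre_acts)):
        g = W.T @ (g * (a > 0))            # ReLU mask from the forward pass
    return g

width = 256
for depth in (2, 10, 30):
    Ws, w_out = make_net(depth, width)
    # gradients at many inputs along a 1-D path through input space
    base, direction = rng.normal(size=width), rng.normal(size=width)
    grads = np.stack([input_gradient(Ws, w_out, base + t * direction)
                      for t in np.linspace(-3.0, 3.0, 200)])
    # mean pairwise correlation between gradients taken at different inputs
    corr = np.corrcoef(grads)
    print(depth, corr[np.triu_indices_from(corr, k=1)].mean())
```

As depth increases, the printed mean correlation should drop toward zero, which is the "white noise" behaviour the abstract refers to; adding skip-connections to the same sketch slows this decay.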
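The "looks linear" (LL) idea can likewise be illustrated in a few lines. With a concatenated-ReLU (CReLU) nonlinearity and each weight matrix mirrored as [W, -W], the network computes an exactly linear map at initialization (since relu(a) - relu(-a) = a), so its gradients cannot shatter, while the nonlinearity remains available once training moves the two halves of each weight matrix apart. The sketch below is a hedged illustration of that identity, assuming a fully-connected CReLU network; the sizes and weight scale are placeholders rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def crelu(a):
    """Concatenated ReLU: stacks relu(a) and relu(-a), doubling the width."""
    return np.concatenate([np.maximum(a, 0.0), np.maximum(-a, 0.0)])

width, depth = 64, 32
Ws = [rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width)) for _ in range(depth)]

def forward_ll(x):
    """Deep CReLU net whose weights are mirrored as [W, -W] ('looks linear')."""
    a = Ws[0] @ x
    for W in Ws[1:]:
        # [W, -W] applied to [relu(a); relu(-a)] equals W @ (relu(a) - relu(-a)) = W @ a,
        # so every pre-activation is an exactly linear function of the input at init
        a = np.concatenate([W, -W], axis=1) @ crelu(a)
    return a

x = rng.normal(size=width)
linear_equiv = x.copy()
for W in Ws:
    linear_equiv = W @ linear_equiv        # the equivalent plain linear map

print(np.allclose(forward_ll(x), linear_equiv))   # True: linear, hence unshattered, at init
```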