How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)
Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry
arXiv e-Print archive - 2018
Keywords:
stat.ML, cs.LG, cs.NE
First published: 2018/05/29
Abstract: Batch Normalization (BatchNorm) is a widely adopted technique that enables
faster and more stable training of deep neural networks (DNNs). Despite its
pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly
understood. The popular belief is that this effectiveness stems from
controlling the change of the layers' input distributions during training to
reduce the so-called "internal covariate shift". In this work, we demonstrate
that such distributional stability of layer inputs has little to do with the
success of BatchNorm. Instead, we uncover a more fundamental impact of
BatchNorm on the training process: it makes the optimization landscape
significantly smoother. This smoothness induces a more predictive and stable
behavior of the gradients, allowing for faster training. These findings bring
us closer to a true understanding of our DNN training toolkit.
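For reference, here is a minimal sketch of the training-time BatchNorm forward pass the paper analyzes: each feature is normalized using the mini-batch mean and variance, then rescaled by learned parameters gamma and beta. The function and variable names are illustrative, not from the paper, and the inference-time path (which uses running statistics) is omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a mini-batch.

    x: (batch_size, num_features) activations
    gamma, beta: (num_features,) learned scale and shift
    """
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learned affine restores expressiveness

# Example: normalize a batch of 32 samples with 4 features
x = np.random.randn(32, 4) * 5.0 + 3.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))  # ~0 mean, ~1 variance per feature
```

Note that the normalization itself is what the paper argues smooths the optimization landscape; the gamma/beta parameters mean the layer does not actually constrain the downstream input distribution, which is part of the evidence against the internal-covariate-shift explanation.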