#### Problem addressed:
Rather than asking why pre-training works, this paper asks why the traditional way of training deep neural networks does not.
The main focus is an empirical study of why deep networks train poorly with plain backprop and no pre-training. To analyse this, the authors track how activations and gradient magnitudes in each layer evolve over training iterations under simple backprop. Their study shows that, with Sigmoid units, the higher-layer units saturate at 0, which blocks backpropagated gradients from reaching the lower layers. It takes many iterations for these units to leave saturation, and only then do the lower layers start to learn.
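To make the saturation argument concrete, here is a short worked equation. The notation is my own and is not taken from the paper: $z^{i}$ is the pre-activation of layer $i$, $h^{i} = \sigma(z^{i})$ its output, $W^{i+1}$ the weights into the next layer, and $C$ the cost.

$$
\frac{\partial C}{\partial z^{i}_{k}} = \sigma'(z^{i}_{k}) \sum_{j} W^{i+1}_{kj} \, \frac{\partial C}{\partial z^{i+1}_{j}},
\qquad
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
$$

When a unit saturates at $\sigma(z^{i}_{k}) \approx 0$, the factor $\sigma'(z^{i}_{k})$ is also close to 0, so almost no gradient passes through that unit to the layers below it.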
For this reason the authors suggest using activations that are symmetric around 0, such as Tanh and Softsign, to avoid this saturation. With Tanh, they find that the units of every layer, initialized on either side of 0, saturate towards their respective sides one layer after another, from the lowest layer to the highest. With Softsign, on the other hand, units in all layers move towards saturation together. Further, the histograms of final activations suggest that Tanh units peak at 0 and at the -1/+1 saturation values, while Softsign units generally lie in the linear region. Note that in the linear region of Tanh/Softsign the activation gradient is non-zero, so information can propagate.
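As a rough illustration of why Softsign may saturate more gently than Tanh, the sketch below compares their derivatives at a few pre-activation values. It assumes the usual definition softsign(x) = x / (1 + |x|); the function names and sample points are mine and do not come from the paper.

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, which decays exponentially for large |x|
    return 1.0 - np.tanh(x) ** 2

def softsign(x):
    # Assumed standard definition: x / (1 + |x|)
    return x / (1.0 + np.abs(x))

def softsign_grad(x):
    # d/dx softsign(x) = 1 / (1 + |x|)^2, which decays only polynomially
    return 1.0 / (1.0 + np.abs(x)) ** 2

# Illustrative pre-activation values (arbitrary choice)
xs = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
for x, gt, gs in zip(xs, tanh_grad(xs), softsign_grad(xs)):
    print(f"x={x:5.1f}  tanh'={gt:.2e}  softsign'={gs:.2e}")
```

Both derivatives equal 1 at the origin, but away from it the Softsign gradient shrinks much more slowly, which fits the observation that Softsign units keep passing gradient for longer.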
The most interesting part of this study is the way the authors analyse the flow of information from the input layer to the top layer and back. Forward prop transmits information about the input to the higher layers, while backward prop transmits the error gradient to the lower layers. They measure this flow as the variance of the activations (forward) and of the gradients (backward) at each layer. Since we want the information flow to be equal at all layers, these variances should be the same from layer to layer. They therefore propose initializing the weights so that the variance is preserved across layers, and call this "Normalized Initialization". Their empirical results show that both activations and gradients (hence information) propagate better through all layers with this initialization.
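A minimal sketch of the idea, assuming the commonly cited form of the normalized initialization, W ~ U[-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))], and comparing it with the older heuristic U[-1/sqrt(n_in), +1/sqrt(n_in)]. The network here is purely linear with equal layer widths, and the helper names, widths, depth and seed are illustrative choices of mine, not the paper's experimental setup.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    # Assumed normalized form: Var(W) = 2 / (n_in + n_out)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def standard_init(n_in, n_out, rng):
    # Older heuristic used as a baseline: Var(W) = 1 / (3 * n_in)
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def layer_variances(init_fn, width=256, depth=5, batch=1024, seed=0):
    """Return the activation variance after each layer of a linear network."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # unit-variance inputs
    variances = []
    for _ in range(depth):
        W = init_fn(width, width, rng)
        h = h @ W  # linear layer, no nonlinearity and no bias
        variances.append(float(h.var()))
    return variances

print("normalized:", layer_variances(normalized_init))
print("standard:  ", layer_variances(standard_init))
```

With equal widths, the normalized scheme should keep the activation variance roughly constant with depth, while the baseline shrinks it by about a factor of 3 per layer. This only checks the forward pass in the linear case; it says nothing about how well the argument carries over to Tanh or Softsign.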
An analysis of activation values and backpropagated gradients across layers to diagnose training difficulties, plus a new weight initialization method.
The variance study for activation/gradient is done for linear networks but applied to Tanh and Softsign. How is this justified?
Shapeset 3x2, MNIST, CIFAR-10