The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training
Erhan, Dumitru; Manzagol, Pierre-Antoine; Bengio, Yoshua; Bengio, Samy; Vincent, Pascal
Journal of Machine Learning Research - 2009
#### Introduction
* The paper explores the challenges involved in training deep networks, studies the effect of unsupervised pre-training on the training process, and visualizes the error-function landscape of deep architectures.
* [Link to the paper](http://research.google.com/pubs/pub34923.html)
#### Experiments
* Datasets used - Shapeset and MNIST.
* Deep architectures are trained with a varying number of layers, both with and without unsupervised pre-training.
* Weights are initialized by sampling uniformly from $[-\frac{1}{\sqrt{k}}, \frac{1}{\sqrt{k}}]$, where $k$ is the fan-in of the unit (see the sketch below).
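As a rough illustration of this initialization scheme, here is a minimal NumPy sketch; the layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    """Sample W uniformly from [-1/sqrt(k), 1/sqrt(k)], where k is the fan-in."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Example: a 784 -> 500 layer (784 is the MNIST input dimension; 500 units is an assumption).
W1 = init_weights(784, 500)
b1 = np.zeros(500)  # biases start at zero
```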
#### Observations
* Increasing depth without pre-training causes the error rate to rise faster than it does with pre-training.
* Pre-training also makes the network more robust to random initializations.
* At the same training cost, pre-trained models systematically yield a lower cost than randomly initialized ones.
* Pre-training seems to be most advantageous for smaller training sets.
* Pre-training appears to have a regularizing effect: it restricts the set of final configurations the parameters can reach, which decreases variance and introduces a bias.
* Pre-training helps for larger layers (more units per layer) and for deeper networks, but for small networks it can hurt performance.
* Since small networks tend to have low capacity, this supports the hypothesis that pre-training acts as a kind of regularizer.
* Pre-training seems to provide better marginal conditioning of the weights, though this is not its only benefit, since it also captures more intricate dependencies between parameters.
* Pre-training the lower layers is more important (and has more impact) than pre-training the layers closer to the output (a layer-wise sketch follows this list).
* The error landscape appears flatter for deep architectures and in the pre-trained case.
* Learning trajectories of pre-trained and non-pre-trained models start and stay in different regions of function space. Moreover, trajectories of the same type initially move together but eventually diverge from one another.
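Since these observations all concern greedy layer-wise unsupervised pre-training followed by supervised fine-tuning, a compact sketch of that procedure may help. This is an assumption-heavy illustration (PyTorch, plain autoencoders instead of the paper's RBMs / denoising autoencoders, made-up layer sizes, and random stand-in data), not the paper's exact setup:

```python
import torch
import torch.nn as nn

sizes = [784, 500, 500, 500]  # input dimension + three hidden layers (illustrative)
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

def pretrain_layer(layer, batches, epochs=5, lr=0.01):
    """Train `layer` as the encoder of a one-hidden-layer autoencoder."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for x in batches:
            h = torch.sigmoid(layer(x))
            x_hat = torch.sigmoid(decoder(h))
            loss = nn.functional.mse_loss(x_hat, x)
            opt.zero_grad()
            loss.backward()
            opt.step()

def encode_upto(x, k):
    """Propagate x through the first k (already pre-trained, frozen) layers."""
    with torch.no_grad():
        for layer in layers[:k]:
            x = torch.sigmoid(layer(x))
    return x

# Greedy phase: each layer is pre-trained on the representation produced by the layers below it.
raw_batches = [torch.rand(64, 784) for _ in range(10)]  # random stand-in for MNIST batches
for k, layer in enumerate(layers):
    pretrain_layer(layer, [encode_upto(x, k) for x in raw_batches])

# Fine-tuning phase: stack the pre-trained layers, add a classifier head, train with supervision.
stack = []
for layer in layers:
    stack += [layer, nn.Sigmoid()]
model = nn.Sequential(*stack, nn.Linear(sizes[-1], 10))
```

Note that the greedy phase trains the lowest layers first on the raw input, which lines up with the observation above that pre-training the lower layers matters most.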