The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training
Erhan, Dumitru; Manzagol, Pierre-Antoine; Bengio, Yoshua; Bengio, Samy; Vincent, Pascal
Journal of Machine Learning Research - 2009
#### Introduction
* The paper explores the challenges involved in training deep networks, studies the effect of unsupervised pre-training on the training process, and visualizes the error-function landscape of deep architectures.
* [Link to the paper](http://research.google.com/pubs/pub34923.html)
#### Experiments
* Datasets used - Shapeset and MNIST.
* Deep architectures are trained with a varying number of layers, both with and without unsupervised pre-training.
* Weights are initialized by sampling uniformly from $[-\frac{1}{\sqrt{k}}, \frac{1}{\sqrt{k}}]$, where $k$ is the fan-in of the unit (see the sketch below).
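As a rough illustration of this initialization scheme, here is a minimal NumPy sketch; the layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(fan_in, fan_out):
    """Sample W uniformly from [-1/sqrt(k), 1/sqrt(k)], where k is the fan-in."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Example: a 784 -> 500 layer (784 is the MNIST input dimension; 500 units is an assumption).
W1 = init_weights(784, 500)
b1 = np.zeros(500)  # biases start at zero
```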
#### Observations
* Increasing depth without pre-training causes the error rate to rise faster than it does with pre-training.
* Pre-training also makes the network more robust to random initializations.
* At the same training cost, pre-trained models systematically yield a lower cost than randomly initialized ones.
* Pre-training seems to be most advantageous for smaller training sets.
* Pre-training appears to have a regularizing effect: it restricts the set of final configurations the parameters can reach, which decreases variance and introduces a bias.
* Pre-training helps for larger layers (more units per layer) and for deeper networks, but for small networks it can hurt performance.
* Since small networks tend to have low capacity, this supports the hypothesis that pre-training acts as a kind of regularizer.
* Pre-training seems to provide better marginal conditioning of the weights, though this is not its only benefit, since it also captures more intricate dependencies between parameters.
* Pre-training the lower layers is more important (and has more impact) than pre-training the layers closer to the output (a layer-wise sketch follows this list).
* The error landscape appears flatter for deep architectures and in the pre-trained case.
* Learning trajectories of pre-trained and non-pre-trained models start and stay in different regions of function space. Moreover, trajectories of the same type initially move together but eventually diverge from one another.
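Since these observations all concern greedy layer-wise unsupervised pre-training followed by supervised fine-tuning, a compact sketch of that procedure may help. This is an assumption-heavy illustration (PyTorch, plain autoencoders instead of the paper's RBMs / denoising autoencoders, made-up layer sizes, and random stand-in data), not the paper's exact setup:

```python
import torch
import torch.nn as nn

sizes = [784, 500, 500, 500]  # input dimension + three hidden layers (illustrative)
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

def pretrain_layer(layer, batches, epochs=5, lr=0.01):
    """Train `layer` as the encoder of a one-hidden-layer autoencoder."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for x in batches:
            h = torch.sigmoid(layer(x))
            x_hat = torch.sigmoid(decoder(h))
            loss = nn.functional.mse_loss(x_hat, x)
            opt.zero_grad()
            loss.backward()
            opt.step()

def encode_upto(x, k):
    """Propagate x through the first k (already pre-trained, frozen) layers."""
    with torch.no_grad():
        for layer in layers[:k]:
            x = torch.sigmoid(layer(x))
    return x

# Greedy phase: each layer is pre-trained on the representation produced by the layers below it.
raw_batches = [torch.rand(64, 784) for _ in range(10)]  # random stand-in for MNIST batches
for k, layer in enumerate(layers):
    pretrain_layer(layer, [encode_upto(x, k) for x in raw_batches])

# Fine-tuning phase: stack the pre-trained layers, add a classifier head, train with supervision.
stack = []
for layer in layers:
    stack += [layer, nn.Sigmoid()]
model = nn.Sequential(*stack, nn.Linear(sizes[-1], 10))
```

Note that the greedy phase trains the lowest layers first on the raw input, which lines up with the observation above that pre-training the lower layers matters most.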