Improving neural networks by preventing co-adaptation of feature detectors
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov
arXiv e-Print archive, 2012
Keywords:
cs.NE, cs.CV, cs.LG
First published: 2012/07/03
Abstract: When a large feedforward neural network is trained on a small training set,
it typically performs poorly on held-out test data. This "overfitting" is
greatly reduced by randomly omitting half of the feature detectors on each
training case. This prevents complex co-adaptations in which a feature detector
is only helpful in the context of several other specific feature detectors.
Instead, each neuron learns to detect a feature that is generally helpful for
producing the correct answer given the combinatorially large variety of
internal contexts in which it must operate. Random "dropout" gives big
improvements on many benchmark tasks and sets new records for speech and object
recognition.
This paper introduced Dropout, a new layer type. It has a parameter $\alpha \in (0, 1)$, the probability of dropping a unit (the paper uses $\alpha = 0.5$ for hidden units). The output dimensionality of a dropout layer equals its input dimensionality. During training, each neuron's output is set to 0 with probability $\alpha$, sampled independently for every training case. At test time nothing is dropped; instead, every output is multiplied by the retention probability $1 - \alpha$ to compensate for the fact that during training only a $1 - \alpha$ fraction of the units were active, so the expected input to the next layer stays the same.
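A minimal NumPy sketch of this train/test behavior (the function name, seed, and example values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x: np.ndarray, alpha: float, train: bool) -> np.ndarray:
    """Standard (non-inverted) dropout.

    alpha: probability of zeroing each unit during training.
    At test time, all activations are scaled by (1 - alpha) so the
    expected input to the next layer matches what it saw in training.
    """
    if train:
        mask = rng.random(x.shape) >= alpha  # keep each unit with prob 1 - alpha
        return x * mask
    return x * (1.0 - alpha)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(h, alpha=0.5, train=True))   # roughly half the units zeroed
print(dropout(h, alpha=0.5, train=False))  # every unit scaled by 0.5
```

Note that modern frameworks usually implement the equivalent "inverted" variant, which scales by $1 / (1 - \alpha)$ during training and leaves the test-time pass untouched.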
A much better paper by the same authors, published two years later, is [Dropout: a simple way to prevent neural networks from overfitting](http://www.shortscience.org/paper?bibtexKey=journals/jmlr/SrivastavaHKSS14).
Dropout can be interpreted as training an ensemble of exponentially many thinned subnetworks that all share the same weights: each training case samples a different subnetwork, and the test-time scaling approximates averaging the whole ensemble.
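As an illustration of that ensemble view (this check is mine, not from the paper): for a single linear layer, averaging the outputs of many mask-sampled subnetworks matches the $(1 - \alpha)$-scaled deterministic pass exactly in expectation, which a quick Monte Carlo comparison makes concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
w = rng.standard_normal((4, 3))      # shared weights of one linear layer
h = np.array([1.0, 2.0, 3.0, 4.0])   # activations entering the layer

# Average over many sampled subnetworks (ensemble of masked forward passes).
samples = [((rng.random(h.shape) >= alpha) * h) @ w for _ in range(10_000)]
print(np.mean(samples, axis=0))      # ensemble average over subnetworks
print(((1.0 - alpha) * h) @ w)       # single test-time pass, scaled by 1 - alpha
```

For deep nets with nonlinearities the test-time scaling is only an approximation of the ensemble average, but it works well in practice.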
It was notably used by [ImageNet Classification with Deep Convolutional Neural Networks](http://www.shortscience.org/paper?bibtexKey=krizhevsky2012imagenet).