Recurrent Neural Network Regularization
Wojciech Zaremba, Ilya Sutskever and Oriol Vinyals
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp
TLDR; The authors show that applying dropout to only the **non-recurrent** connections (between layers of the same timestep) in an LSTM works well, improving the scores on various sequence tasks.
#### Data Sets and model performance
- PTB Language Modeling Perplexity: 78.4
- Google Icelandic Speech Dataset Frame Accuracy: 70.5%
- WMT'14 English to French Machine Translation BLEU: 29.03
- MS COCO Image Caption Generation BLEU: 24.3
#### Introduction
* The paper explains how to correctly apply dropout to LSTMs and shows that it reduces overfitting in tasks such as language modelling, speech recognition, image caption generation and machine translation.
* [Link to the paper](https://arxiv.org/abs/1409.2329)
#### [Dropout](https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf)
* Regularisation method that drops out (temporarily removes) units from the network, along with all their incoming and outgoing connections.
* Conventional dropout does not work well with RNNs because applying it to the recurrent connections amplifies the noise across timesteps, drowning the signal and hurting learning (the standard dropout operator is sketched below).
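For reference, the basic dropout operator can be sketched as below. This is a minimal illustration, not code from the paper; it uses the inverted-dropout convention (rescaling at training time), which is equivalent in expectation to the test-time rescaling described in the linked dropout paper.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each unit with probability p during training,
    then rescale the survivors so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)
```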
#### Regularization
* The paper proposes to apply dropout only to the non-recurrent connections (see the sketch after this list).
* The dropout operator corrupts the information carried by some units (but not all), forcing them to perform their intermediate computations more robustly.
* The information is corrupted L+1 times, where L is the number of layers, and this count is independent of the number of timesteps the information traverses.
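A minimal PyTorch sketch of this idea follows; it is not the authors' implementation, and the `DropoutLSTM` name, layer sizes and dropout rate are illustrative. Dropout is applied to the embedded input, to the activations passed between layers, and to the output fed to the decoder, so the information is corrupted L+1 times, while the recurrent state is never dropped.

```python
import torch
import torch.nn as nn

class DropoutLSTM(nn.Module):
    """Stacked LSTM with dropout on non-recurrent connections only."""

    def __init__(self, vocab_size, hidden_size=200, num_layers=2, p=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.cells = nn.ModuleList(
            [nn.LSTMCell(hidden_size, hidden_size) for _ in range(num_layers)]
        )
        self.drop = nn.Dropout(p)              # the dropout operator
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):                 # tokens: LongTensor (batch, seq_len)
        batch, seq_len = tokens.shape
        h = [torch.zeros(batch, self.hidden_size, device=tokens.device)
             for _ in self.cells]
        c = [torch.zeros(batch, self.hidden_size, device=tokens.device)
             for _ in self.cells]
        logits = []
        for t in range(seq_len):
            x = self.drop(self.embed(tokens[:, t]))      # drop the layer input
            for l, cell in enumerate(self.cells):
                # recurrent state (h[l], c[l]) is passed along untouched
                h[l], c[l] = cell(x, (h[l], c[l]))
                x = self.drop(h[l])                      # drop between layers / before decoder
            logits.append(self.decoder(x))
        return torch.stack(logits, dim=1)                # (batch, seq_len, vocab)
```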
#### Observation
* In language modelling, image caption generation, speech recognition and machine translation, this form of dropout enables training larger networks without overfitting and improves test performance, e.g. lower perplexity and higher frame accuracy.