TL;DR: The authors show that applying dropout only to the **non-recurrent** connections of an LSTM (those between layers within the same timestep) works well and improves results on several sequence tasks.
#### Data Sets and model performance
- PTB Language Modeling Perplexity: 78.4
- Google Icelandic Speech Dataset frame accuracy: 70.5
- WMT'14 English to French Machine Translation BLEU: 29.03
- MS COCO Image Caption Generation BLEU: 24.3
* The paper explains how to apply dropout to LSTMs and how it could reduce overfitting in tasks like language modelling, speech recognition, image caption generation and machine translation.
* [Link to the paper](https://arxiv.org/abs/1409.2329)
* Dropout is a regularisation method that temporarily removes units from a neural network, along with all their incoming and outgoing connections.
* Conventional dropout does not work well with RNNs as the recurrence amplifies the noise and hurts learning.
* The paper proposes to apply dropout to only the non-recurrent connections.
* The dropout operator corrupts the information carried by some (not all) units, forcing them to perform their intermediate computations more robustly.
* The information is corrupted exactly L+1 times, where L is the number of layers, regardless of how many timesteps it traverses.
* In language modelling, image caption generation, speech recognition and machine translation, dropout makes it possible to train larger networks and improves test performance (for example, lower perplexity and higher frame accuracy).
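
The placement of dropout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`dropout`, `lstm_step`, `init_weights`, `run_stack`), the tiny list-based LSTM, the use of inverted dropout (rescaling the surviving units), and the choice of equal input and hidden sizes are all assumptions made to keep the example short. The point it shows is that the mask is applied only to the input coming from the layer below, and once more on the way to the output, while `h` and `c` from the previous timestep pass through untouched.

```python
import math
import random


def dropout(vec, p, rng, train=True):
    """Inverted dropout (an assumption): zero each unit with prob. p, rescale the rest."""
    if not train or p == 0.0:
        return list(vec)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step; W maps the concatenation [x, h_prev] to 4n gate pre-activations."""
    n = len(h_prev)
    z = x + h_prev  # list concatenation
    pre = [sum(w * v for w, v in zip(row, z)) for row in W]
    i = [sigmoid(a) for a in pre[0:n]]            # input gate
    f = [sigmoid(a) for a in pre[n:2 * n]]        # forget gate
    o = [sigmoid(a) for a in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(a) for a in pre[3 * n:4 * n]]  # candidate cell update
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(n)]
    h = [o[k] * math.tanh(c[k]) for k in range(n)]
    return h, c


def init_weights(n, num_layers, rng):
    """Hypothetical initialiser: one (4n x 2n) matrix per layer, no biases."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(4 * n)]
            for _ in range(num_layers)]


def run_stack(inputs, weights, n, p, rng, train=True):
    """Run a stacked LSTM over a sequence, with dropout on non-recurrent connections only."""
    num_layers = len(weights)
    h = [[0.0] * n for _ in range(num_layers)]
    c = [[0.0] * n for _ in range(num_layers)]
    outputs = []
    for x in inputs:
        below = x
        for l in range(num_layers):
            # Dropout on the vertical input from the layer below (L applications).
            below_dropped = dropout(below, p, rng, train)
            # h[l] and c[l] come from the previous timestep and are NOT dropped.
            h[l], c[l] = lstm_step(below_dropped, h[l], c[l], weights[l])
            below = h[l]
        # One more dropout on the connection to the output layer (the "+1").
        outputs.append(dropout(below, p, rng, train))
    return outputs
```

With `num_layers = L`, an input passes through `dropout` L times on its way up the stack and once more before the output, which matches the L+1 count above. Because the recurrent path never sees a mask, the number of corruptions does not grow with sequence length.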