This paper introduces a language model built on a convolutional architecture (gated convolutional networks) instead of recurrent networks such as LSTMs.
## Language modeling
Statistical language models estimate the probability distribution over sequences of words. They are important components of automatic speech recognition (ASR) and machine translation systems. The usual approach is to embed words into $\mathbb{R}^n$ and then apply an RNN to the resulting vector sequences.
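For contrast with the convolutional model, here is a minimal sketch of that conventional recurrent setup, assuming PyTorch; the layer sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # words -> R^n
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)              # logits over the next word

    def forward(self, tokens):  # tokens: (batch, time) of word indices
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)      # (batch, time, vocab_size)
```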
## Evaluation
* [WikiText-103](http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/): [Perplexity](https://en.wikipedia.org/wiki/Perplexity) of 44.9 (lower is better; see the worked example after this list)
* new best single-GPU result on the Google Billion Word benchmark: Perplexity of 43.9
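Perplexity is the exponential of the average per-token negative log-likelihood. A quick illustration of the computation; the probabilities below are made up and have nothing to do with the benchmark numbers above.

```python
import math

# Model probabilities assigned to the true next word at each position
# (purely illustrative values).
token_probs = [0.10, 0.25, 0.05, 0.20]
avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 1))  # ~8.0
```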
## Idea
* uses Gated Linear Units (GLUs) as the gating mechanism on top of the convolutions (see the sketch after this list)
* uses pre-activation residual blocks
* adaptive softmax
* no tanh in the gating mechanism (unlike LSTM-style gated tanh units)
* uses gradient clipping
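A minimal sketch of a gated convolutional block with a GLU, assuming PyTorch; the channel count, kernel width, and exact residual placement are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # The convolution emits 2*channels: one half is the linear path A,
        # the other half parameterizes the sigmoid gate B.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size)
        self.pad = kernel_size - 1  # left-pad so the model never sees future tokens

    def forward(self, x):  # x: (batch, channels, time)
        residual = x
        h = self.conv(F.pad(x, (self.pad, 0)))
        h = F.glu(h, dim=1)   # GLU: A * sigmoid(B), no tanh involved
        return h + residual   # residual connection around the block
```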
## See also
* [Reddit](https://www.reddit.com/r/MachineLearning/comments/5kbsjb/r_161208083_language_modeling_with_gated/)
* [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/abs/1612.04426): Test perplexity of **40.8 on WikiText-103**