Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier
arXiv e-Print archive - 2016
Keywords:
cs.CL
First published: 2016/12/23
Abstract: The pre-dominant approach to language modeling to date is based on recurrent
neural networks. In this paper we present a convolutional approach to language
modeling. We introduce a novel gating mechanism that eases gradient propagation
and which performs better than the LSTM-style gating of (Oord et al, 2016)
despite being simpler. We achieve a new state of the art on WikiText-103 as
well as a new best single-GPU result on the Google Billion Word benchmark. In
settings where latency is important, our model achieves an order of magnitude
speed-up compared to a recurrent baseline since computation can be parallelized
over time. To our knowledge, this is the first time a non-recurrent approach
outperforms strong recurrent models on these tasks.
This paper presents a convolutional language model that replaces the usual LSTM-based recurrent approach with stacked, gated convolutions.
## General Language modeling
Statistical language models estimate the probability distribution over sequences of words. They are important for ASR (automatic speech recognition) and machine translation. The usual approach is to embed words into $\mathbb{R}^n$ and then apply RNNs to the resulting vector sequences.
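In other words, the model estimates the joint probability of a word sequence autoregressively, predicting each word from the words that precede it:

$$P(w_0, \dots, w_N) = P(w_0) \prod_{i=1}^{N} P(w_i \mid w_0, \dots, w_{i-1})$$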
## Evaluation
* [WikiText-103](http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/): [Perplexity](https://en.wikipedia.org/wiki/Perplexity) of 44.9 (lower is better; see the definition after this list)
* Google Billion Word benchmark: perplexity of 43.9 (new best single-GPU result)
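For reference, perplexity is the exponentiated average negative log-likelihood over the $N$ test tokens, so lower values mean the model assigns more probability to held-out text:

$$\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right)$$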
## Idea
* uses Gated Linear Units (GLU): a convolutional layer's output $\mathbf{X} * \mathbf{W} + \mathbf{b}$ is gated element-wise by $\sigma(\mathbf{X} * \mathbf{V} + \mathbf{c})$ (see the sketch after this list)
* uses pre-activation residual blocks
* uses an adaptive softmax for the output layer
* no tanh in the gating mechanism (simpler than the LSTM-style gating of Oord et al., 2016)
* uses gradient clipping
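A minimal sketch of one gated convolutional block, assuming PyTorch. This is my own illustration, not the authors' code; the class name `GatedConvBlock` and the hyperparameters are made up. The convolution is made causal by left-padding, and its output is split into two halves, one of which gates the other through a sigmoid:

```python
# Minimal sketch of a gated convolutional (GLU) block, assuming PyTorch.
# GatedConvBlock, d_model and kernel_size are illustrative, not from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConvBlock(nn.Module):
    """Residual block built around a Gated Linear Unit."""

    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # One convolution with 2*d_model output channels: the first half is the
        # linear path, the second half parametrizes the sigmoid gate.
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model, time)
        residual = x
        # Left-pad so the convolution is causal: position t never sees future words.
        h = F.pad(x, (self.kernel_size - 1, 0))
        h = self.conv(h)
        # GLU: h = A * sigmoid(B); note there is no tanh on the linear path.
        h = F.glu(h, dim=1)
        return h + residual


if __name__ == "__main__":
    x = torch.randn(2, 128, 50)        # (batch, channels, time)
    block = GatedConvBlock(d_model=128)
    print(block(x).shape)              # torch.Size([2, 128, 50])
```

Because such blocks contain no recurrence, all time steps of a layer can be computed in parallel, which is where the reported order-of-magnitude latency speed-up over the recurrent baseline comes from.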
## See also
* [Reddit](https://www.reddit.com/r/MachineLearning/comments/5kbsjb/r_161208083_language_modeling_with_gated/)
* [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/abs/1612.04426): Test perplexity of **40.8 on WikiText-103**