Pointer Sentinel Mixture Models
Stephen Merity
and
Caiming Xiong
and
James Bradbury
and
Richard Socher
arXiv e-Print archive - 2016 via Local arXiv
Keywords:
cs.CL, cs.AI
First published: 2016/09/26 Abstract: Recent neural network sequence models with softmax classifiers have achieved
their best language modeling performance only with very large hidden states and
large vocabularies. Even then they struggle to predict rare or unseen words
even if the context makes the prediction unambiguous. We introduce the pointer
sentinel mixture architecture for neural sequence models which has the ability
to either reproduce a word from the recent context or produce a word from a
standard softmax classifier. Our pointer sentinel-LSTM model achieves state of
the art language modeling performance on the Penn Treebank (70.9 perplexity)
while using far fewer parameters than a standard softmax LSTM. In order to
evaluate how well language models can exploit longer contexts and deal with
more realistic vocabularies and larger corpora we also introduce the freely
available WikiText corpus.
TLDR; The authors combine a standard LSTM softmax with [Pointer Networks](https://arxiv.org/abs/1506.03134) in a mixture model called Pointer-Sentinel LSTM (PS-LSTM). The pointer network helps with rare words and long-term dependencies but is unable to refer to words that are not in the input. The opposite is the case for the standard softmax. By combining the two approaches we get the best of both worlds. The probability of an output word is defined as a mixture of the pointer and softmax model, and the mixture coefficient is calculated as part of the pointer attention. The authors evaluate their architecture on the PTB Language Modeling dataset where they achieve state of the art. They also present a novel WikiText dataset that is larger and more realistic than PTB.
### Key Points:
- Standard RNNs with softmax struggle with rare and unseen words, even when adding attention.
- Use a window of the most recent `L` words to match against.
- Probability of output with gating: `p(y|x) = g * p_vocab(y|x) + (1 - g) * p_ptr(y|x)`.
- The gate `g` is calculated as an extra element (the sentinel) in the attention module. Probabilities for the pointer network are then normalized accordingly (see the sketch after this list).
- Integrating the gating function computation into the pointer network is crucial: it needs access to the pointer network state, not just the RNN state (which can't reliably hold long-term information).
- WikiText-2 dataset: 2M train tokens, 217k validation tokens, 245k test tokens. 33k vocab, 2.6% OOV. 2x larger than PTB.
- WikiText-103 dataset: 103M train tokens, 217k validation tokens, 245k test tokens. 267k vocab, 2.4% OOV. 100x larger than PTB.
- The Pointer Sentinel model leads to stronger improvements for rare words - that makes intuitive sense.
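
Below is a minimal NumPy sketch of the mixture and the sentinel-based gate described in the bullets above. The function and variable names are illustrative, and the constant sentinel score is a placeholder (in the paper the sentinel score comes from a learned sentinel vector and a projected query); this is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_sentinel_mixture(hidden_window, query, window_words, p_vocab, vocab_size):
    """Sketch of p(y|x) = g * p_vocab(y|x) + (1 - g) * p_ptr(y|x).

    hidden_window: (L, d) RNN hidden states for the last L context words
    query:         (d,)   query vector derived from the current RNN state
    window_words:  length-L list of word ids in the context window
    p_vocab:       (V,)   standard softmax distribution over the vocabulary
    """
    # Attention scores over the L window positions.
    scores = hidden_window @ query                          # (L,)
    # One extra "sentinel" element is appended to the attention vector
    # (placeholder constant here; learned in the real model).
    sentinel_score = np.array([0.0])
    a = softmax(np.concatenate([scores, sentinel_score]))   # (L + 1,)

    # The mass on the sentinel is the gate g toward the vocabulary softmax.
    # The remaining mass already sums to (1 - g), so it is the pointer
    # distribution scaled by its mixture weight.
    g = a[-1]
    ptr_mass = a[:-1]

    # Scatter pointer mass onto word ids (repeated words accumulate).
    p_ptr_scaled = np.zeros(vocab_size)
    for pos, w in enumerate(window_words):
        p_ptr_scaled[w] += ptr_mass[pos]

    # Equivalent to g * p_vocab + (1 - g) * p_ptr with p_ptr normalized.
    return g * p_vocab + p_ptr_scaled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d, V = 5, 8, 20
    hidden = rng.normal(size=(L, d))
    query = rng.normal(size=d)
    words = rng.integers(0, V, size=L).tolist()
    p_vocab = softmax(rng.normal(size=V))
    p = pointer_sentinel_mixture(hidden, query, words, p_vocab, V)
    print(p.sum())  # ~1.0: the mixture is a valid distribution
```

Computing `g` inside the attention softmax (rather than from the RNN state alone) is what lets the model decide to fall back to the vocabulary softmax exactly when no window position matches well.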