A Deep Reinforced Model for Abstractive Summarization
Romain Paulus, Caiming Xiong, Richard Socher
arXiv e-Print archive, 2017
Keywords: cs.CL
First published: 2017/05/11
Abstract: Attentional, RNN-based encoder-decoder models for abstractive summarization
have achieved good performance on short input and output sequences. However,
for longer documents and summaries, these models often include repetitive and
incoherent phrases. We introduce a neural network model with intra-attention
and a new training method. This method combines standard supervised word
prediction and reinforcement learning (RL). Models trained only with the former
often exhibit "exposure bias" -- they assume ground truth is provided at each
step during training. However, when standard word prediction is combined with
the global sequence prediction training of RL the resulting summaries become
more readable. We evaluate this model on the CNN/Daily Mail and New York Times
datasets. Our model obtains a 41.16 ROUGE-1 score on the CNN/Daily Mail
dataset, a 5.7 absolute points improvement over previous state-of-the-art
models. It also performs well as the first abstractive model on the New York
Times corpus. Human evaluation also shows that our model produces higher
quality summaries.
Generates abstractive summaries from news articles. Also see [blog](https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization)
* Input:
* vocab size 150K
* initialize $W_\text{emb}$ with 100-dimensional GloVe vectors
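A minimal sketch of the embedding initialization, assuming a `glove` dict mapping tokens to 100-d vectors has already been loaded (the helper name and the random init range are my assumptions):

```python
import numpy as np

EMB_DIM = 100  # GloVe-100 dimensionality

def build_embedding_matrix(vocab, glove):
    """Initialize W_emb from pretrained GloVe vectors; words without a
    pretrained vector get small random values."""
    W_emb = np.random.uniform(-0.05, 0.05, (len(vocab), EMB_DIM)).astype(np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            W_emb[i] = glove[word]
    return W_emb
```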
* Seq2Seq:
* bidirectional LSTM encoder, `size=200` in each direction. The final hidden states are concatenated and fed as the initial hidden state of the decoder, an LSTM of `size=400`. Surprisingly, it's only one layer.
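A minimal PyTorch sketch of the encoder/decoder sizes above; the class and variable names are my own, and only the encoding step plus the decoder initialization are shown:

```python
import torch
import torch.nn as nn

EMB_DIM, ENC_HIDDEN, DEC_HIDDEN = 100, 200, 400

class Seq2SeqBase(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        # single-layer bidirectional encoder, 200 units per direction
        self.encoder = nn.LSTM(EMB_DIM, ENC_HIDDEN, num_layers=1,
                               bidirectional=True, batch_first=True)
        # single-layer decoder of size 400 = 2 * 200
        self.decoder = nn.LSTM(EMB_DIM, DEC_HIDDEN, num_layers=1, batch_first=True)

    def encode(self, src_ids):
        enc_out, (h_n, c_n) = self.encoder(self.embed(src_ids))
        # concatenate the final forward/backward states -> (batch, 400),
        # used as the decoder's initial hidden state
        h0 = torch.cat([h_n[0], h_n[1]], dim=-1).unsqueeze(0)
        c0 = torch.cat([c_n[0], c_n[1]], dim=-1).unsqueeze(0)
        return enc_out, (h0, c0)
```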
* Attention:
* A standard attention mechanism is applied between each new decoder hidden state and all the encoder hidden states
* A new intra-decoder attention mechanism is applied between the new decoder hidden state and all previous decoder hidden states
* The new hidden state is concatenated with the two attention outputs and fed to a dense+softmax layer to model the next word of the summary (output vocab size 50K). The output matrix $W_h$ is tied to the embeddings via $W_h = \tanh \left( W_\text{emb} W_\text{proj} \right)$, resulting in faster convergence, see [1](https://arxiv.org/abs/1611.01462) and [2](https://arxiv.org/abs/1608.05859)
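A simplified PyTorch sketch of one decoding step with the two attentions and the shared-projection output layer. All names are assumptions, the attention is plain bilinear scoring + softmax, and $W_\text{emb}$ here is the embedding matrix restricted to the 50K output vocabulary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    """One decoding step: encoder attention, intra-decoder attention,
    and the shared-embedding output projection (a sketch, not the authors' code)."""

    def __init__(self, W_emb, dec_hidden=400, enc_hidden=400, out_vocab=50_000):
        super().__init__()
        emb_dim = W_emb.size(1)
        self.W_enc_attn = nn.Linear(dec_hidden, enc_hidden, bias=False)  # bilinear score vs encoder states
        self.W_dec_attn = nn.Linear(dec_hidden, dec_hidden, bias=False)  # bilinear score vs past decoder states
        # W_h = tanh(W_emb W_proj): reuse the input embedding to build the output matrix
        self.W_proj = nn.Parameter(torch.randn(emb_dim, dec_hidden + enc_hidden + dec_hidden) * 0.01)
        self.W_emb = W_emb            # (out_vocab, emb_dim), shared with the input embedding
        self.b_out = nn.Parameter(torch.zeros(out_vocab))

    def forward(self, h_t, enc_states, past_dec_states):
        # h_t: (batch, 400); enc_states: (batch, n_src, 400)
        # past_dec_states: (batch, t-1, 400) or None at the first step
        enc_scores = torch.bmm(enc_states, self.W_enc_attn(h_t).unsqueeze(2)).squeeze(2)
        enc_alpha = F.softmax(enc_scores, dim=-1)                        # attention over the article
        c_enc = torch.bmm(enc_alpha.unsqueeze(1), enc_states).squeeze(1)

        if past_dec_states is None:
            c_dec = torch.zeros_like(h_t)
        else:
            dec_scores = torch.bmm(past_dec_states, self.W_dec_attn(h_t).unsqueeze(2)).squeeze(2)
            dec_alpha = F.softmax(dec_scores, dim=-1)                    # intra-decoder attention
            c_dec = torch.bmm(dec_alpha.unsqueeze(1), past_dec_states).squeeze(1)

        features = torch.cat([h_t, c_enc, c_dec], dim=-1)                # (batch, 1200)
        W_h = torch.tanh(self.W_emb @ self.W_proj)                       # (out_vocab, 1200)
        logits = features @ W_h.t() + self.b_out
        return logits, enc_alpha, features
```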
* Pointer mechanism:
* The concatenated values are also fed to a logistic classifier that decides whether the softmax output should be used or whether one of the article words should be copied to the output. The article word to copy is selected using the same weights computed by the encoder attention mechanism
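A sketch of the copy/generate switch, reusing the `features` and `enc_alpha` from the previous sketch. It assumes the article token ids have already been mapped into the output vocabulary (the handling of article words outside that vocabulary is omitted):

```python
import torch
import torch.nn as nn

class PointerSwitch(nn.Module):
    """Copy/generate switch sketch: mixes the vocabulary softmax with the
    encoder attention weights scattered onto the source token ids."""

    def __init__(self, feature_dim=1200):
        super().__init__()
        self.switch = nn.Linear(feature_dim, 1)  # logistic classifier on [h_t; c_enc; c_dec]

    def forward(self, features, vocab_logits, enc_alpha, src_ids):
        # features: (batch, 1200), vocab_logits: (batch, V),
        # enc_alpha: (batch, n_src), src_ids: (batch, n_src) long
        p_copy = torch.sigmoid(self.switch(features))           # probability of copying from the article
        p_vocab = torch.softmax(vocab_logits, dim=-1)
        # reuse the encoder attention weights as the copy distribution over article words
        copy_dist = torch.zeros_like(p_vocab).scatter_add_(1, src_ids, enc_alpha)
        return (1 - p_copy) * p_vocab + p_copy * copy_dist
```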
* Loss
* $L_\text{ml}$: NLL of the reference summary $y^*$. If only $L_\text{ml}$ is used, then 25% of the time the model's generated token (instead of the ground-truth token) is fed as input to the next step
* $L_\text{rl}$: sample an entire summary $y^s$ from the model (temperature=1); the loss is the NLL of the sample multiplied by a reward. The reward is $r(y^s)-r(\hat{y})$, where $r$ is ROUGE-L and $\hat{y}$ is the greedily decoded sequence
* $L=\gamma L_\text{rl} + (1-\gamma)L_\text{ml}$ where $\gamma=0.9984$
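A sketch of the mixed objective, assuming the per-example summed log-probabilities and ROUGE-L rewards have already been computed elsewhere:

```python
import torch

def mixed_loss(logp_gold, logp_sample, rouge_sample, rouge_greedy, gamma=0.9984):
    """Mixed ML + RL objective (a sketch).

    logp_gold:    (batch,) sum of log-probs of the reference summary y*
    logp_sample:  (batch,) sum of log-probs of a sampled summary y^s
    rouge_sample: (batch,) ROUGE-L reward of y^s
    rouge_greedy: (batch,) ROUGE-L reward of the greedy decode y_hat (baseline)
    """
    l_ml = -logp_gold.mean()
    # reward advantage of the sample over the greedy baseline; no gradient flows through it
    advantage = (rouge_sample - rouge_greedy).detach()
    l_rl = -(advantage * logp_sample).mean()
    return gamma * l_rl + (1 - gamma) * l_ml
```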
* Training
* `batch=50`, Adam, `LR=1e-4` for RL/ML+RL training
* The training labels are the example summaries plus, for the pointer mechanism, an indication of whether a copy was used and which article word was copied. A copy is indicated when the summary word is OOV, or when it appears in the article and its NER tag is one of PERSON, LOCATION, ORGANIZATION or MISC (labeling rule sketched below)
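A sketch of that labeling rule; the helper name is mine, and the NER tags are assumed to come from a separate preprocessing step:

```python
def pointer_targets(summary_tokens, article_tokens, vocab, article_ner):
    """For each summary token, decide whether the pointer should copy it from
    the article and from which position (first match); None means 'generate'."""
    COPY_NER = {"PERSON", "LOCATION", "ORGANIZATION", "MISC"}
    positions = {}
    for i, (w, tag) in enumerate(zip(article_tokens, article_ner)):
        positions.setdefault(w, (i, tag))          # first occurrence wins
    targets = []
    for w in summary_tokens:
        if w in positions and (w not in vocab or positions[w][1] in COPY_NER):
            targets.append(positions[w][0])        # copy from this article position
        else:
            targets.append(None)                   # generate from the softmax
    return targets
```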
* Generation
* 5 beams
* force trigrams not to appear twice in the same beam
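A sketch of the trigram-repetition check applied to each candidate extension of a beam during decoding:

```python
def violates_trigram_constraint(prefix, next_word):
    """Return True if appending next_word would repeat a trigram already
    present in this beam's prefix (the repetition-avoidance rule above)."""
    candidate = prefix + [next_word]
    if len(candidate) < 3:
        return False
    new_trigram = tuple(candidate[-3:])
    seen = {tuple(candidate[i:i + 3]) for i in range(len(candidate) - 3)}
    return new_trigram in seen

# during beam search (beam width 5), candidates that repeat a trigram are pruned:
# if violates_trigram_constraint(beam_tokens, w): skip w
```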