[link]
Generates abstractive summaries from news articles. Also see the [blog post](https://metamind.io/research/your-tldr-by-an-ai-a-deep-reinforced-model-for-abstractive-summarization).
* Input:
  * vocab size 150K
  * start with $W_\text{emb}$ initialized from GloVe (100-dim)
* Seq2Seq:
  * bidirectional LSTM, `size=200` in each direction. The final hidden states are concatenated and fed as the initial hidden state of the decoder, an LSTM of `size=400`. Surprisingly, it is only one layer.
* Attention:
  * A standard attention mechanism is applied between each new hidden state of the decoder and all hidden states of the encoder.
  * A new kind of attention mechanism is applied between the new hidden state of the decoder and all previous hidden states of the decoder.
  * The new hidden state is concatenated with the two attention outputs and fed to dense+softmax to model the next word in the summary (output vocab size 50K). The weight matrix $W_h$ is reduced to $W_h = \tanh\left( W_\text{emb} W_\text{proj} \right)$, which results in faster convergence, see [1](https://arxiv.org/abs/1611.01462) and [2](https://arxiv.org/abs/1608.05859). (Sketch of the decoder step below.)
* Pointer mechanism:
  * The concatenated values are also fed to a logistic classifier that decides whether the softmax output should be used or whether one of the words in the article should be copied to the output. The article word to copy is selected using the same weights computed in the attention mechanism. (Sketch below.)
* Loss:
  * $L_\text{ml}$: NLL of the example summary $y^*$. If only $L_\text{ml}$ is used, then 25% of the time the generated word is used instead of the given one as input to the next step. (Sketch below.)
  * $L_\text{rl}$: sample an entire summary $y^s$ from the model (temperature=1); the loss is the NLL of the sample multiplied by a reward. The reward is $r(y^s)-r(\hat{y})$, where $r$ is ROUGE-L and $\hat{y}$ is a greedily generated sequence.
  * $L = \gamma L_\text{rl} + (1-\gamma) L_\text{ml}$ where $\gamma = 0.9984$ (sketch below)
* Training:
  * `batch=50`, Adam, `LR=1e-4` for RL and ML+RL training
  * The training labels are summary examples plus an indication of whether copy was used in the pointer mechanism and which word was copied. Copy is indicated when the summary word is OOV, or when it appears in the article and its NER tag is one of PERSON, LOCATION, ORGANIZATION or MISC.
* Generation:
  * 5 beams
  * force trigrams not to appear twice in the same beam (sketch below)
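A minimal PyTorch sketch of one decoder step as described above, with made-up (scaled-down) dimensions, simple dot-product attention scores instead of the paper's bilinear scores with intra-temporal normalization, and no bias terms; all names here are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; per the note, the real numbers are emb=100,
# decoder hidden=400 (= 2*200 from the bi-LSTM encoder), output vocab=50K.
emb_dim, hid, out_vocab = 100, 400, 5_000
W_emb  = torch.randn(out_vocab, emb_dim)   # embedding matrix, shared with the output layer
W_proj = torch.randn(emb_dim, 3 * hid)     # gives W_h = tanh(W_emb @ W_proj)

def decoder_step(h_dec_t, enc_states, dec_history):
    """One decoder step: encoder attention + intra-decoder attention + softmax.

    h_dec_t:     (hid,)        current decoder hidden state
    enc_states:  (n_src, hid)  all encoder hidden states
    dec_history: (t, hid)      previous decoder hidden states (may be empty)
    """
    # Attention over the encoder (plain dot-product softmax for simplicity).
    alpha_enc = F.softmax(enc_states @ h_dec_t, dim=0)      # (n_src,)
    c_enc = alpha_enc @ enc_states                          # (hid,)

    # Intra-decoder attention over the previous decoder hidden states.
    if dec_history.shape[0] == 0:
        c_dec = torch.zeros_like(h_dec_t)
    else:
        alpha_dec = F.softmax(dec_history @ h_dec_t, dim=0)
        c_dec = alpha_dec @ dec_history                     # (hid,)

    # Concatenate state + both contexts and project to the output vocabulary
    # through the embedding-shared matrix W_h = tanh(W_emb @ W_proj).
    features = torch.cat([h_dec_t, c_enc, c_dec])           # (3*hid,)
    W_h = torch.tanh(W_emb @ W_proj)                        # (out_vocab, 3*hid)
    p_vocab = F.softmax(W_h @ features, dim=0)              # (out_vocab,)
    return p_vocab, alpha_enc, features
```

For example, `decoder_step(torch.randn(400), torch.randn(30, 400), torch.randn(0, 400))` runs a first step over a 30-token article with an empty decoder history.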
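The pointer mechanism can be sketched on top of the same step's outputs: a logistic classifier over the concatenated features gives the copy probability, and the encoder attention weights are scattered back onto vocabulary ids to form the copy distribution. The soft mixture below is an assumption made for compactness; per the note, training actually supervises a hard copy/generate decision.

```python
import torch

def pointer_mix(p_vocab, alpha_enc, src_ids, features, w_u, b_u):
    """Mix the generation distribution with a copy distribution.

    p_vocab:   (out_vocab,) softmax over the output vocabulary
    alpha_enc: (n_src,)     encoder attention weights from the same step
    src_ids:   (n_src,)     vocabulary ids (LongTensor) of the article tokens
    features:  (3*hid,)     concatenated [h_dec, c_enc, c_dec] from the step
    w_u, b_u:  parameters of the logistic "use pointer" classifier
    """
    # p(u=1): probability of copying an article word at this step.
    p_copy = torch.sigmoid(w_u @ features + b_u)

    # Scatter attention mass over source positions onto vocabulary ids, so
    # repeated article words accumulate their attention weight. (A full
    # implementation would extend the vocabulary with in-article OOV words.)
    copy_dist = torch.zeros_like(p_vocab)
    copy_dist.scatter_add_(0, src_ids, alpha_enc)

    # Final next-word distribution.
    return p_copy * copy_dist + (1 - p_copy) * p_vocab
```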
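For pure $L_\text{ml}$ training, the 25% feed-back of the model's own prediction amounts to a scheduled-sampling style choice of the next decoder input; the helper below is a hypothetical illustration of that rule, not code from the paper.

```python
import random

def next_decoder_input(gold_token, generated_token, ml_only=True, p_sample=0.25):
    """When training with L_ml only, feed the model's own previous prediction
    instead of the gold token 25% of the time (numbers per the note)."""
    if ml_only and random.random() < p_sample:
        return generated_token
    return gold_token
```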
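The mixed objective reduces to a few lines once the sequence NLLs and ROUGE-L scores are available; the ROUGE values are passed in as plain numbers here rather than computed, and the sign convention follows the note (NLL of the sample scaled by $r(y^s)-r(\hat{y})$).

```python
def mixed_loss(nll_gold, nll_sample, rouge_sample, rouge_greedy, gamma=0.9984):
    """L = gamma * L_rl + (1 - gamma) * L_ml  (self-critical policy gradient).

    nll_gold:     NLL of the reference summary y*            -> L_ml
    nll_sample:   NLL of a summary y^s sampled at temperature 1
    rouge_sample: ROUGE-L of y^s against the reference
    rouge_greedy: ROUGE-L of the greedy summary y-hat (the baseline)
    """
    # Reward of the sample relative to the greedy baseline; the greedy
    # sequence acts as a constant baseline (no gradient flows through it).
    reward = rouge_sample - rouge_greedy
    l_rl = reward * nll_sample
    l_ml = nll_gold
    return gamma * l_rl + (1 - gamma) * l_ml
```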
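Trigram blocking at generation time is a small per-beam check before extending a hypothesis; a pure-Python sketch, assuming each beam is kept as a list of token ids.

```python
def violates_trigram_block(prefix, candidate):
    """Return True if appending `candidate` to `prefix` repeats a trigram.

    prefix:    token ids already generated in this beam
    candidate: token id being considered as the next word
    """
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = {tuple(prefix[i:i + 3]) for i in range(len(prefix) - 2)}
    return new_trigram in seen

# During beam expansion, candidates that would repeat a trigram are skipped
# (or given a -inf score), so no trigram appears twice in the same beam.
```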