Efficient Summarization with Read-Again and Copy Mechanism
Wenyuan Zeng, Wenjie Luo, Sanja Fidler, Raquel Urtasun
arXiv e-Print archive - 2016
Keywords: cs.CL
First published: 2016/11/10
Abstract: Encoder-decoder models have been widely used to solve sequence to sequence
prediction tasks. However current approaches suffer from two shortcomings.
First, the encoders compute a representation of each word taking into account
only the history of the words it has read so far, yielding suboptimal
representations. Second, current decoders utilize large vocabularies in order
to minimize the problem of unknown words, resulting in slow decoding times. In
this paper we address both shortcomings. Towards this goal, we first introduce
a simple mechanism that first reads the input sequence before committing to a
representation of each word. Furthermore, we propose a simple copy mechanism
that is able to exploit very small vocabularies and handle out-of-vocabulary
words. We demonstrate the effectiveness of our approach on the Gigaword dataset
and DUC competition outperforming the state-of-the-art.
### Read-Again
Two options:
* GRU: run a pass of a regular GRU over the input text $x_1,\ldots,x_n$ and use its hidden states $h_1,\ldots,h_n$ to compute a weight for every step $i$:
$\alpha_i = \tanh \left( W_e h_i + U_e h_n + V_e x_i\right)$. Then run a second GRU pass over the same input text. In the second pass the weight $\alpha_i$ from the first pass is multiplied with the internal update gate $z_i$ of the GRU (which controls whether the hidden state is directly copied).
* LSTM: concatenate the hidden states from the first pass with the input text,
$\left[ x_i, h_i, h_n \right]$, and run a second pass on this new input (see the sketch below).
In the case of multiple sentences, the above passes are done per sentence. In addition, the final state $h^s_n$ of each sentence $s$ is concatenated either with the final states $h^{s'}_n$ of the other sentences or with $\tanh \left( \sum_{s'} V_{s'} h^{s'}_n + v\right)$.
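Below is a minimal sketch of the read-again encoder, assuming a PyTorch setup. The LSTM variant is shown in full; `read_again_weight` shows the $\alpha_i$ used by the GRU variant (which would additionally scale the update gate $z_i$ of a custom second-pass GRU cell, omitted here). All layer sizes and names are illustrative, not taken from the paper.

```python
# Sketch of the read-again LSTM encoder (sizes/names are illustrative).
import torch
import torch.nn as nn

class ReadAgainLSTMEncoder(nn.Module):
    def __init__(self, emb_dim, hid_dim):
        super().__init__()
        self.lstm1 = nn.LSTM(emb_dim, hid_dim, batch_first=True)                 # first read
        self.lstm2 = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, batch_first=True)   # second read

    def forward(self, x):                      # x: (batch, n, emb_dim)
        h1, _ = self.lstm1(x)                  # per-step states h_1..h_n of the first pass
        h_n = h1[:, -1:, :].expand_as(h1)      # broadcast the final state h_n to every step
        x2 = torch.cat([x, h1, h_n], dim=-1)   # second-pass input [x_i, h_i, h_n]
        h2, _ = self.lstm2(x2)                 # states later consumed by the decoder's attention
        return h2

def read_again_weight(W_e, U_e, V_e, h_i, h_n, x_i):
    """alpha_i = tanh(W_e h_i + U_e h_n + V_e x_i) from the GRU option."""
    return torch.tanh(W_e(h_i) + U_e(h_n) + V_e(x_i))

# Usage: encode 2 sequences of length 5 with 32-dim embeddings.
enc = ReadAgainLSTMEncoder(emb_dim=32, hid_dim=64)
print(enc(torch.randn(2, 5, 32)).shape)        # torch.Size([2, 5, 64])
```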
### Decoder with copy mechanism
LSTM with hidden state $s_t$. Its input is the previously generated word $y_{t-1}$ and a context vector computed with an attention mechanism: $c_t = \sum_{i=1}^n \beta_{it} h_i$, where $h_i$ are the hidden states of the second pass of the encoder. The attention weights are $\beta_{it} = \text{softmax}_i \left( v_a^T \tanh \left( W_a s_{t-1} + U_a h_i\right) \right)$.
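A minimal sketch of the attention weights $\beta_{it}$ and context $c_t$, assuming PyTorch; the layer names ($W_a$, $U_a$, $v_a$) follow the formula above, everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, n = 64, 5
W_a = nn.Linear(hid_dim, hid_dim, bias=False)   # applied to the decoder state s_{t-1}
U_a = nn.Linear(hid_dim, hid_dim, bias=False)   # applied to each encoder state h_i
v_a = nn.Linear(hid_dim, 1, bias=False)

h = torch.randn(n, hid_dim)        # second-pass encoder states h_1..h_n
s_prev = torch.randn(hid_dim)      # decoder state s_{t-1}

scores = v_a(torch.tanh(W_a(s_prev) + U_a(h))).squeeze(-1)   # (n,)
beta = F.softmax(scores, dim=0)                  # beta_{it}, sums to 1 over i
c_t = (beta.unsqueeze(-1) * h).sum(dim=0)        # c_t = sum_i beta_{it} h_i
```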
The decoder uses a small vocabulary $Y$. If $y_{t-1}$ does not appear in $Y$ but does appear in the input as $x_i$, its embedding is replaced with $p_t = \tanh \left( W_c h_i + b_c\right)$; otherwise it is replaced with the embedding of `<UNK>`.
$p_t$ is also used to copy words from the input to the output (details not given).
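A minimal sketch of the out-of-vocabulary embedding replacement, assuming PyTorch; $W_c$ (with bias $b_c$) follows the formula above, everything else is illustrative.

```python
import torch
import torch.nn as nn

hid_dim, emb_dim, n = 64, 32, 5
W_c = nn.Linear(hid_dim, emb_dim)   # includes the bias term b_c
h = torch.randn(n, hid_dim)         # second-pass encoder states h_1..h_n

# Suppose y_{t-1} is not in the decoder vocabulary Y but matches input word x_i.
i = 3                               # position of y_{t-1} in the source (illustrative)
p_t = torch.tanh(W_c(h[i]))         # p_t = tanh(W_c h_i + b_c), fed as the word's embedding
```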
### Experiments
Abstractive summarization on the Gigaword dataset and the [DUC2003 and DUC2004 competitions](http://www-nlpir.nist.gov/projects/duc/data.html).