Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua
arXiv e-Print archive - 2014 via Local Bibsonomy
Keywords: dblp
One core aspect of this attention approach is that it makes the learned representation easy to inspect: the softmax output over the input words (later called $\alpha_{ij}$) can be visualized for each output word, as shown below.
https://i.imgur.com/Kb7bk3e.png
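As a concrete illustration, here is a minimal sketch of how such a heat map can be plotted, assuming the per-step softmax outputs have been collected into a hypothetical `alpha` array of shape (number of output words, number of input words); the token lists and the random weights below are placeholders for what the trained model would produce.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sentence pair; in practice `alpha` would be collected from the
# decoder (one softmax distribution over the input words per output word).
src_tokens = ["the", "agreement", "on", "the", "European", "Economic", "Area", "."]
tgt_tokens = ["l'", "accord", "sur", "la", "zone", "économique", "européenne", "."]

rng = np.random.default_rng(0)
alpha = rng.random((len(tgt_tokens), len(src_tokens)))
alpha /= alpha.sum(axis=1, keepdims=True)  # each row sums to 1, like a softmax output

fig, ax = plt.subplots()
ax.imshow(alpha, cmap="gray", vmin=0.0, vmax=1.0)
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens, rotation=90)
ax.set_yticks(range(len(tgt_tokens)))
ax.set_yticklabels(tgt_tokens)
ax.set_xlabel("input (source) words")
ax.set_ylabel("output (target) words")
plt.tight_layout()
plt.show()
```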
In this approach, at each decoding step the RNN attends over all of the encoder's hidden states (one score per input position, so the input length can vary), applies a softmax to those scores, and uses the resulting probabilities to weight and sum the states. This weighted sum forms the memory each step uses to make its prediction, bypassing the need for the network to encode the entire input into the single state passed between units.
Each hidden unit is computed as:
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
Here $s_{i-1}$ is the previous decoder state and $y_{i-1}$ is the previous target word. The new ingredient is $c_i$, the context vector, which carries the memory of the input phrase:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
Here the sum runs over all $T_x$ input positions, and $\alpha_{ij}$ is the output of a softmax over alignment scores $e_{ij} = a(s_{i-1}, h_j)$, which measure how well the input around position $j$ matches the output at position $i$. $h_j$ is the encoder's hidden state from when the RNN was processing position $j$ of the input sequence.
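To make the pieces concrete, here is a minimal numpy sketch of a single decoding step under the equations above. All weight names (`W_a`, `U_a`, `v_a`, `W_s`) and the toy dimensions are illustrative, and a plain $\tanh$ layer stands in for the gated recurrent unit $f$ used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, enc_dim, dec_dim, emb_dim = 6, 8, 8, 4   # toy sizes

# Encoder hidden states h_1..h_{T_x}, previous decoder state s_{i-1},
# and the embedding of the previous target word y_{i-1}.
h = rng.normal(size=(T_x, enc_dim))
s_prev = rng.normal(size=dec_dim)
y_prev = rng.normal(size=emb_dim)

# Illustrative parameters of the alignment model a(s_{i-1}, h_j) and of f.
W_a = rng.normal(size=(dec_dim, dec_dim))
U_a = rng.normal(size=(dec_dim, enc_dim))
v_a = rng.normal(size=dec_dim)
W_s = rng.normal(size=(dec_dim, dec_dim + emb_dim + enc_dim))

# Alignment scores e_{ij}, then attention weights alpha_{ij} via a softmax over j.
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Context vector c_i = sum_j alpha_{ij} h_j.
c = alpha @ h

# New decoder state s_i = f(s_{i-1}, y_{i-1}, c_i); here f is a simple tanh layer
# rather than the gated unit used in the paper.
s = np.tanh(W_s @ np.concatenate([s_prev, y_prev, c]))

print(alpha.round(3), alpha.sum())   # probabilities over input positions, sums to 1
print(s.shape)                       # (dec_dim,)
```

The `alpha` vector computed here is exactly what gets stacked row by row (one row per output word) to produce the visualization shown above.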