Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Dzmitry
Cho, Kyunghyun
Bengio, Yoshua
arXiv e-Print archive - 2014 via Local Bibsonomy
One core aspect of this attention approach is that it provides the ability to debug the learned representation by visualizing the softmax output (later called $\alpha_{ij}$) over the input words for each output word as shown below.
In this approach, each unit in the RNN attends over the states computed for the input sequence, one state per input element, so the input length can vary. A softmax over these states yields probabilities, which are used to weight the states before summing them. This weighted sum forms the memory each unit uses to make a prediction, which bypasses the need for the network to encode everything in the state passed between units.
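The weight-and-sum step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `softmax` and `attend` are hypothetical, and the unnormalised scores are passed in as an input because this summary does not describe how they are produced.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, states):
    """Weight each state by its softmax probability and sum the results.

    scores: one unnormalised score per input position (length T)
    states: T state vectors, each a list of floats of the same size
    """
    weights = softmax(scores)
    dim = len(states[0])
    # Weighted sum over positions, computed one vector component at a time.
    memory = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, memory

# The same function handles input sequences of different lengths.
examples = [
    ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    ([2.0, 0.5, -1.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
]
for scores, states in examples:
    weights, memory = attend(scores, states)
    print(len(states), weights, memory)
```

Because the weights always sum to one, the returned memory vector has the same size regardless of how many states are attended over, which is what lets the input length vary.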
Each hidden unit is computed as:
$$s_i = f(s_{i-1}, y_{i-1}, c_i).$$
Here $s_{i-1}$ is the previous state and $y_{i-1}$ is the previous target word. The authors' contribution is $c_i$, the context vector, which holds the memory of the input phrase.
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
Here $T_x$ is the length of the input sequence and $\alpha_{ij}$ is the softmax output for the $j$th element of the input sequence. $h_j$ is the hidden state of the RNN at the point where it processed that $j$th element.
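Written out, a softmax that normalises over the input positions would take the form below. The score symbol $e_{ij}$ and the scoring function $a$ are notation assumed here for illustration, since this summary does not define how the scores are produced.
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$
By construction the $\alpha_{ij}$ are non-negative and sum to one over $j$, so $c_i$ is a convex combination of the $h_j$.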