First published: 2018/10/31
Abstract: In NMT, how far can we get without attention and without separate encoding
and decoding? To answer that question, we introduce a recurrent neural
translation model that does not use attention and does not have a separate
encoder and decoder. Our eager translation model is low-latency, writing target
tokens as soon as it reads the first source token, and uses constant memory
during decoding. It performs on par with the standard attention-based model of
Bahdanau et al. (2014), and better on long sentences.
An attention mechanism and a separate encoder/decoder are two properties of almost every neural translation model. The question asked in this paper is: how far can we go without attention and without a separate encoder and decoder? And the answer is: pretty far! The model presented performs just as well as the attention model of Bahdanau et al. on the four language directions studied in the paper.
The translation model presented in the paper is basically a simple recurrent language model. A recurrent language model receives the current input word at every timestep and has to predict the next word in the sequence. To translate with such a model, simply give it the current word from the source sentence and have it try to predict the next word from the target sentence.
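To make this concrete, here is a minimal sketch of a recurrent language model in PyTorch (my own toy illustration, not the code from the authors' repository):

```python
import torch
import torch.nn as nn

# Toy recurrent language model (illustrative only): given tokens x_1..x_T,
# it produces, at every timestep, a distribution over the next token.
class ToyRecurrentLM(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, time)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.proj(hidden)                  # logits: (batch, time, vocab)
```

Used as a translator, the input at timestep t is the t-th source word and the training target at timestep t is the t-th target word.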
Obviously, in many cases such a simple model wouldn't work. For example, if your sentence were "The white dog" and you wanted to translate it to Spanish ("El perro blanco"), then at the second timestep the input would be "white" and the expected output would be "perro" (dog). But how could the model predict "perro" when it hasn't seen "dog" yet?
To solve this issue, we preprocess the data before training and insert "empty" padding tokens into the target sentence. When the model outputs such a token, it means that the model would like to read more of the input sentence before emitting the next output word.
So in the example above, we would change the target sentence to "El PAD perro blanco". Now, at timestep 2 the model emits the PAD symbol, and at timestep 3, when the input is "dog", it can emit "perro". These padding symbols are deleted in post-processing, before the output is returned to the user. A visualization of the decoding process is shown below:
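To illustrate the pre- and post-processing, here is a hypothetical toy sketch (not the authors' preprocessing script; the paper uses a word aligner to decide where the padding goes, whereas here the alignment is simply given by hand):

```python
PAD = "<PAD>"

def pad_target(source, target, alignment):
    # alignment[j] = i means target word j is aligned to source word i.
    # A target word may only appear once its aligned source word has been
    # read, so PAD tokens are emitted until that position is reached.
    padded = []
    for j, tgt_word in enumerate(target):
        while len(padded) < alignment[j]:
            padded.append(PAD)
        padded.append(tgt_word)
    return padded

def strip_pads(tokens):
    # Post-processing: remove the padding before returning the output.
    return [t for t in tokens if t != PAD]

source = ["The", "white", "dog"]
target = ["El", "perro", "blanco"]
alignment = [0, 2, 1]    # "El"->"The", "perro"->"dog", "blanco"->"white"

padded = pad_target(source, target, alignment)
print(padded)              # ['El', '<PAD>', 'perro', 'blanco']
print(strip_pads(padded))  # ['El', 'perro', 'blanco']
```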
https://i.imgur.com/znI6xoN.png
To enable beam search, our model actually receives the previously emitted target token in addition to the current source token at every timestep.
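A hedged sketch of what a single decoding step of such a model might look like (again my own illustration, not the released implementation): the embeddings of the current source token and the previously emitted target token are combined and fed to a recurrent cell, and the resulting logits can be expanded by an ordinary beam search.

```python
import torch
import torch.nn as nn

# Illustrative eager decoding step (not the authors' exact architecture):
# the model conditions on the current source token AND the previously
# emitted target token, which is what makes beam search possible.
class ToyEagerStep(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.cell = nn.LSTMCell(2 * dim, dim)
        self.proj = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tok, prev_tgt_tok, state):
        # src_tok, prev_tgt_tok: (batch,) token ids; state: (h, c) tensors
        x = torch.cat([self.src_emb(src_tok), self.tgt_emb(prev_tgt_tok)], dim=-1)
        h, c = self.cell(x, state)
        logits = self.proj(h)    # scores over the target vocab, including PAD
        return logits, (h, c)
```

Because the recurrent state (h, c) has a fixed size, the memory used during decoding does not grow with sentence length, which is where the constant-memory property from the abstract comes from.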
PyTorch code for the model is available at https://github.com/ofirpress/YouMayNotNeedAttention