First published: 2016/07/29

Abstract: The vanilla attention-based neural machine translation has achieved promising
performance because of its capability in leveraging varying-length source
annotations. However, this model still suffers from failures in long sentence
translation, for its incapability in capturing long-term dependencies. In this
paper, we propose a novel recurrent neural machine translation (RNMT), which
not only preserves the ability to model varying-length source annotations but
also better captures long-term dependencies. Instead of the conventional
attention mechanism, RNMT employs a recurrent neural network to extract the
context vector, where the target-side previous hidden state serves as its
initial state, and the source annotations serve as its inputs. We refer to this
new component as contexter. As the encoder, contexter and decoder in our model
are all derivable recurrent neural networks, our model can still be trained
end-to-end on large-scale corpus via stochastic algorithms. Experiments on
Chinese-English translation tasks demonstrate the superiority of our model to
attention-based neural machine translation, especially on long sentences.
Besides, further analysis of the contexter reveals that our model can implicitly
reflect the alignment to source sentence.
TLDR; The authors replace the standard attention mechanism (Bahdanau et al.) with an RNN/GRU, hoping to model historical dependencies for translation and mitigate the "coverage problem". They evaluate the model on Chinese-English translation, where it beats the Moses (SMT) and GroundHog baselines. They also visualize the attention RNN and show that its activations make intuitive sense.
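To make the "contexter" idea concrete, here is a minimal pure-Python sketch of how I read the abstract: a GRU whose initial state is the previous decoder hidden state and whose inputs are the source annotations. Using the final GRU state as the context vector is my assumption, and the class and function names (`GRUCell`, `contexter`), the dimensions, and the random weights are all hypothetical. This is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Minimal GRU cell with random weights (biases omitted for brevity)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.Wz = rand_matrix(hidden_dim, input_dim, rng)
        self.Uz = rand_matrix(hidden_dim, hidden_dim, rng)
        self.Wr = rand_matrix(hidden_dim, input_dim, rng)
        self.Ur = rand_matrix(hidden_dim, hidden_dim, rng)
        self.Wh = rand_matrix(hidden_dim, input_dim, rng)
        self.Uh = rand_matrix(hidden_dim, hidden_dim, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the input and the reset-gated old state.
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))]
        # Interpolate between the old state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def contexter(cell, annotations, prev_decoder_state):
    """Run the GRU over the source annotations, starting from the decoder state.

    Returns the final state (assumed here to be the context vector) and the
    per-position states, which is what one would visualize as soft alignment.
    """
    state = list(prev_decoder_state)
    trace = []
    for annotation in annotations:
        state = cell.step(annotation, state)
        trace.append(state)
    return state, trace


rng = random.Random(0)
cell = GRUCell(input_dim=4, hidden_dim=3, rng=rng)
annotations = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
prev_decoder_state = [0.0, 0.5, -0.5]
context, trace = contexter(cell, annotations, prev_decoder_state)
```

Note that the contexter's hidden size has to match the decoder state size, since the decoder state is used directly as the initial state. The loop also makes the cost visible: every target step requires a full sequential pass over the source annotations, whereas standard attention scores all source positions in parallel.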
#### Key Points
- Training time: 2 weeks on a Titan X at 300 batches per hour, on 2.9M sentence pairs
- The authors argue that their attention mechanism works better b/c it can capture dependencies among the source states. I'm not convinced by this argument. These states already capture dependencies because they are generated by a bidirectional RNN.
- Training seems *very* slow for only 2.9M pairs. I wonder if this model is prohibitively expensive for any production system.
- I wonder if we can use RL to "cover" phrases in the source sentences out of order. At each step we pick a span to cover before generating the next token in the target sequence.
- The authors don't evaluate Moses on long sentences. Why?
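
A rough back-of-the-envelope check on the training-speed concern, using only the figures from the notes above. The batch size is not stated in these notes, so `ASSUMED_BATCH_SIZE` is a hypothetical value chosen purely for illustration.

```python
BATCHES_PER_HOUR = 300       # from the notes
TRAINING_DAYS = 14           # "2 weeks"
SENTENCE_PAIRS = 2_900_000   # 2.9M training pairs

# Time per batch and total batches processed over the whole run.
seconds_per_batch = 3600 / BATCHES_PER_HOUR
total_batches = BATCHES_PER_HOUR * 24 * TRAINING_DAYS


def epochs_seen(batch_size):
    """Passes over the corpus completed in the full run, for a given batch size."""
    return total_batches * batch_size / SENTENCE_PAIRS


ASSUMED_BATCH_SIZE = 80  # hypothetical; not given in these notes
estimated_epochs = epochs_seen(ASSUMED_BATCH_SIZE)
```

That works out to 12 seconds per batch and 100,800 batches in total. Under the assumed batch size of 80, the run would cover the corpus fewer than three times, which is consistent with the impression that training is slow.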