This paper presents a simple approach to predicting
sequences from sequential input. The authors use a multi-layer
LSTM-based encoder-decoder architecture and show
promising results on the task of neural machine translation.
Their approach beats a phrase-based statistical machine
translation (SMT) system by more than 1.0 BLEU point and comes
close to the state of the art when used to re-rank the 1000-best
predictions of the SMT system. Main contributions:
- The first LSTM encodes the input sequence into a single
fixed-size vector, which is then decoded by a second LSTM. End of
sequence is indicated by a special <EOS> token (see the sketch
after this list).
- Deep LSTMs with 4 layers are used for both the encoder and
the decoder.
- 160k source vocabulary, 80k target vocabulary. Trained on
12M sentence pairs. Words in the output sequence are generated by
a softmax over the fixed target vocabulary.
- Beam search is used at test time to predict translations
(a beam size of 2 already provides most of the benefit; see the
beam search sketch after this list).
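A minimal PyTorch sketch of this encoder-decoder setup (the layer
and embedding sizes, variable names, and batch-first layout are my
own illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder LSTM: encode the source into the final LSTM state,
    then decode the target conditioned on that state."""
    def __init__(self, src_vocab=160_000, tgt_vocab=80_000,
                 emb_dim=256, hidden_dim=512, num_layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder LSTM: reads the (reversed) source sequence.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        # Decoder LSTM: initialised with the encoder's final state.
        self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        # Softmax (via logits) over the fixed target vocabulary.
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # src_ids: (batch, src_len); tgt_ids: (batch, tgt_len), <EOS>-terminated.
        _, (h, c) = self.encoder(self.src_emb(src_ids))  # final state = "thought vector"
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits
```

And a toy beam search over a generic next-token scorer; the
`step_fn` interface is a hypothetical stand-in for running the
decoder one step, not the paper's code:

```python
def beam_search(step_fn, bos_id, eos_id, beam_size=2, max_len=50):
    """Left-to-right beam search.

    step_fn(prefix) returns a list of log-probabilities over the
    target vocabulary for the next token given the token prefix.
    """
    beams = [([bos_id], 0.0)]        # (token prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            # Keep only the top `beam_size` extensions of each hypothesis.
            top = sorted(range(len(log_probs)),
                         key=lambda i: log_probs[i], reverse=True)[:beam_size]
            candidates += [(prefix + [t], score + log_probs[t]) for t in top]
        # Prune to the `beam_size` best partial hypotheses overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            elif len(beams) < beam_size:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```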
## Strengths
- Qualitative results (PCA projections) show that learned representations are
fairly insensitive to active/passive voice, as sentences similar in meaning
are clustered together.
- Another interesting observation: reversing the source sequence
gives a significant boost on long sentences and improves overall
performance, most likely because it introduces short-term
dependencies between the beginnings of the source and target
sentences that are more easily captured by the gradients
(illustrated in the snippet after this list).
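The reversal argument can be made concrete with a toy calculation:
under a roughly monotone word alignment, reversing the source
leaves the average distance between corresponding source and
target words unchanged, but makes the first few aligned pairs very
close. A purely illustrative sketch (not from the paper):

```python
def aligned_distances(src_len, reverse=False):
    """Time-step distance between source word i and target word i,
    assuming the decoder starts right after the encoder finishes
    and the alignment is monotone (an illustrative simplification)."""
    dists = []
    for i in range(src_len):
        src_pos = (src_len - 1 - i) if reverse else i
        tgt_pos = src_len + i   # decoder steps follow the encoder
        dists.append(tgt_pos - src_pos)
    return dists

print(aligned_distances(5))                # [5, 5, 5, 5, 5]
print(aligned_distances(5, reverse=True))  # [1, 3, 5, 7, 9] -- same mean,
                                           # but much shorter minimal lag
```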
## Weaknesses / Notes
- The source-reversal idea needs better justification; otherwise
it comes across as an 'ugly hack'.
- To re-score the n-best list of predictions from the baseline,
they average the confidences (log-probabilities) of the LSTM and
the baseline model. They should also have reported re-ranking
results using just the LSTM confidences (see the sketch below).
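A sketch of the re-scoring step; the interface and the `alpha`
interpolation weight are my own additions (the paper simply
averages the two log-probabilities, i.e. alpha = 0.5), and
alpha = 1.0 would give the LSTM-only re-ranking asked for above:

```python
def rescore_nbest(nbest, lstm_score, alpha=0.5):
    """Re-rank an n-best list by interpolating SMT and LSTM scores.

    nbest: list of (hypothesis, smt_log_prob) pairs from the baseline.
    lstm_score: function mapping a hypothesis to its LSTM log-prob.
    """
    scored = [(hyp, alpha * lstm_score(hyp) + (1 - alpha) * smt_lp)
              for hyp, smt_lp in nbest]
    return max(scored, key=lambda x: x[1])[0]
```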