#### Introduction
* The paper proposes a general, end-to-end approach to sequence learning that uses two deep LSTMs: one to map the input sequence to a fixed-dimensional vector, and another to decode the output sequence from that vector.
* For sequence learning, Deep Neural Networks (DNNs) require the dimensionality of the input and output sequences to be known and fixed. This limitation is overcome by using the two LSTMs.
* [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)
#### Model
* Recurrent Neural Networks (RNNs) generalize feed-forward neural networks to sequences.
* Given a sequence of inputs $(x_{1}, x_{2}, \dots, x_{T})$, a standard RNN computes a sequence of outputs $(y_{1}, y_{2}, \dots, y_{T})$ by iterating over the following equations:
$$h_{t} = \mathrm{sigm}(W^{hx}x_{t} + W^{hh}h_{t-1})$$
$$y_{t} = W^{yh}h_{t}$$
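A minimal NumPy sketch of this recurrence (the helper function and dimensions are illustrative, not taken from the paper):

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """Apply the vanilla RNN recurrence over a list of input vectors."""
    h, ys = h0, []
    for x in xs:                        # one step per time index t
        h = sigm(W_hx @ x + W_hh @ h)   # h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
        ys.append(W_yh @ h)             # y_t = W^{yh} h_t
    return ys

# Toy shapes, for illustration only.
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 4
xs = [rng.normal(size=d_in) for _ in range(T)]
ys = rnn_forward(xs,
                 rng.normal(size=(d_h, d_in)),
                 rng.normal(size=(d_h, d_h)),
                 rng.normal(size=(d_out, d_h)),
                 np.zeros(d_h))
```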
* To map variable-length sequences, the input sequence is first encoded into a fixed-size vector by one RNN, and this vector is then decoded into the output sequence by another RNN.
* Because of the long-term dependencies between inputs and their corresponding outputs, LSTMs are preferred over vanilla RNNs.
* The LSTM estimates the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation and then computing the probability of the output sequence with a standard LSTM-LM formulation (written out below).
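Written out, with $v$ denoting the fixed-dimensional representation produced by the encoder LSTM, the model estimates

$$p(y_{1}, \dots, y_{T'} \mid x_{1}, \dots, x_{T}) = \prod_{t=1}^{T'} p(y_{t} \mid v, y_{1}, \dots, y_{t-1})$$

where each factor is a softmax over the target vocabulary.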
##### Differences between the model and standard LSTMs
* The model uses two LSTMs (one for the input sequence and another for the output sequence), which increases the number of model parameters at negligible computational cost.
* The model uses deep LSTMs (4 layers).
* The words of the input sequence are reversed to introduce short-term dependencies and to reduce the "minimal time lag". With the order reversed, the first few words of the source (input) sentence are much closer to the first few words of the target (output) sentence, which makes it easier for the LSTM to "establish communication" between the input and output sentences (see the preprocessing sketch below).
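A small sketch of this preprocessing step (the paper marks sentence ends with an `<EOS>` symbol; the helper function and the exact placement of `<EOS>` here are illustrative assumptions):

```python
EOS = "<EOS>"

def make_training_pair(source_tokens, target_tokens):
    """Reverse the source sentence and append the end-of-sequence symbol.

    Reversing puts the first source words close to the first target words,
    creating the short-term dependencies described above.
    """
    encoder_input = list(reversed(source_tokens)) + [EOS]
    decoder_target = list(target_tokens) + [EOS]
    return encoder_input, decoder_target

enc, dec = make_training_pair(["je", "suis", "étudiant"],
                              ["i", "am", "a", "student"])
# enc == ["étudiant", "suis", "je", "<EOS>"]
# dec == ["i", "am", "a", "student", "<EOS>"]
```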
#### Experiments
* The WMT'14 English-to-French dataset is used: 12 million sentence pairs comprising 348 million French words and 304 million English words.
* The model is evaluated on direct translation and on re-scoring the n-best lists of a baseline SMT system.
* The deep LSTMs are trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$.
* The training objective is to maximize this log probability, averaged over all the pairs in the training set (written out after this list).
* The most likely translation is found with a simple left-to-right beam search.
* A hard constraint is enforced on the norm of the gradient to avoid the exploding gradient problem.
* Minibatches are chosen so that sentences within a batch have similar lengths, which reduces training time.
* Model performs better when reversed sentences are used for training.
* While the model does not beat the state-of-the-art, it is the first pure neural translation system to outperform a phrase-based SMT baseline.
* The model also performs well on long sentences, with only minor degradation on the longest ones.
* The paper lays the groundwork for applying sequence-to-sequence learning in other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on a translation task.
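The objective mentioned above can be written, for a training set $\mathcal{S}$ of (source, target) pairs, as

$$\frac{1}{|\mathcal{S}|} \sum_{(T, S) \in \mathcal{S}} \log p(T \mid S)$$

and decoding then searches for $\arg\max_{T} p(T \mid S)$ with the beam search described above.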
TLDR; The authors show that a multilayer LSTM RNN (4 layers, 1000 cells per layer, 1000d embeddings, 160k source vocab, 80k target vocab) can achieve competitive results on Machine Translation tasks. The authors find that reversing the input sequence leads to significant improvements, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients. Somewhat surprisingly, the LSTM did not have difficulties on long sentences. The model is evaluated on MT tasks and achieves competitive results (34.8 BLEU) by itself, and close to state of the art if coupled with existing baseline systems (36.5 BLEU).
#### Key Points
- Reversing the input sequence leads to significant improvements.
- Deep LSTM performs much better than shallow LSTM.
- Different parameters are used for the encoder and decoder, which makes it possible to train decoders for multiple target languages in parallel.
- 4 layers, 1000 cells per layer, 1000-dimensional word embeddings, 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences (652M words). SGD with a fixed learning rate of 0.7, halved every epoch after 5 initial epochs. Gradient clipping. Parallelization across GPUs reaches 6.3k words/sec. (The optimization recipe is sketched after this list.)
- Batching sentences of approximately the same length leads to 2x speedup.
- PCA projection shows meaningful clusters of sentences robust to passive/active voice, suggesting that the fixed vector representation captures meaning.
- "No complete explanation" for why the LSTM does so much better with the introduced short-range dependencies.
- Beam size 1 already performs well; beam size 2 works best for the deep model.
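A rough sketch of the optimization recipe referenced above (the learning rate of 0.7 and a clipping threshold of 5 follow the paper; the per-epoch halving follows the note, and the helper names are my own):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient when its L2 norm exceeds a hard threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def learning_rate(epoch, base_lr=0.7, fixed_epochs=5):
    """Fixed learning rate at first, then halved after each later epoch."""
    if epoch < fixed_epochs:
        return base_lr
    return base_lr * 0.5 ** (epoch - fixed_epochs + 1)

def sgd_step(params, grad, epoch):
    """Plain SGD (no momentum) on the clipped gradient."""
    return params - learning_rate(epoch) * clip_gradient(grad)
```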
#### Notes/Questions
- Seems like the performance here is mostly due to the computational resources available and optimized implementation. These models are pretty big by most standards, and other approaches (e.g. attention) may lead to better results if they had more computational resources.
- Reversing the input still feels like a hack to me, there should be a more principled solution to deal with long-range dependencies.
This paper presents a simple approach to predicting
sequences from sequential input. They use a multi-layer
LSTM-based encoder-decoder architecture and show
promising results on the task of neural machine translation.
Their approach beats a phrase-based statistical machine
translation system by more than 1 BLEU point and is close to
state-of-the-art when used to re-rank the 1000-best predictions
from the SMT system. Main contributions:
- The first LSTM encodes an input sequence to a single
vector, which is then decoded by a second LSTM. End of sequence
is indicated by a special character.
- 4-layer deep LSTMs.
- 160k source vocabulary, 80k target vocabulary. Trained on
12M sentences. Words in the output sequence are generated by a softmax
over the fixed target vocabulary.
- Beam search is used at test time to predict translations
(beam size 2 does best; a rough sketch follows this list).
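A rough sketch of left-to-right beam search under these settings (the `step_log_probs` callback stands in for one decoder step and is an assumption, not the paper's interface):

```python
def beam_search(step_log_probs, beam_size=2, max_len=50, eos_id=0):
    """Keep the `beam_size` most probable partial translations at each step.

    `step_log_probs(prefix)` should return (token_id, log_prob) pairs for
    possible continuations of a partial hypothesis.
    """
    beams = [([], 0.0)]          # (token prefix, cumulative log probability)
    finished = []
    for _ in range(max_len):
        candidates = [(prefix + [tok], score + lp)
                      for prefix, score in beams
                      for tok, lp in step_log_probs(prefix)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))   # hypothesis emitted <EOS>
            else:
                beams.append((prefix, score))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```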
## Strengths
- Qualitative results (PCA projections) show that learned representations are
fairly insensitive to active/passive voice, as sentences similar in meaning
are clustered together.
- Another interesting observation was that reversing the source
sequence gives a significant overall performance gain and notably improves
translation of long sentences, most likely due to the introduction of
short-term dependencies that are more easily captured by the gradients.
## Weaknesses / Notes
- The idea of reversing the source input needs better justification;
otherwise it comes across as an 'ugly hack'.
- To re-score the n-best list of baseline predictions, they average
the confidences of the LSTM and the baseline model. They should also
have reported re-ranking accuracies using just the LSTM-model
confidences (a sketch of the averaging rule follows).
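For reference, a minimal sketch of the rescoring rule they do use (the even average comes from the paper; the function signature is an assumption):

```python
def rescore_nbest(nbest, lstm_log_prob):
    """Re-rank an n-best list by averaging baseline and LSTM scores.

    `nbest` holds (hypothesis, baseline_score) pairs and `lstm_log_prob`
    returns the LSTM's log probability for a hypothesis; dropping the
    baseline term would give the LSTM-only ranking asked for above.
    """
    rescored = [(hyp, 0.5 * (base + lstm_log_prob(hyp)))
                for hyp, base in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```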