#### Introduction
* The paper proposes a general, end-to-end approach to sequence learning that uses two deep LSTMs: one maps the input sequence to a fixed-dimensional vector, and the other decodes the output sequence from that vector.
* For sequence learning, Deep Neural Networks (DNNs) require the dimensionality of the input and output sequences to be known and fixed. The two-LSTM setup overcomes this limitation.
* [Link to the paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)
#### Model
* Recurrent Neural Networks (RNNs) generalize feedforward neural networks to sequences.
* Given a sequence of inputs $(x_1, x_2, ..., x_T)$, an RNN computes a sequence of outputs $(y_1, y_2, ..., y_T)$ by iterating the following equations (a minimal sketch follows them):
$$h_t = \text{sigm}(W^{hx}x_t + W^{hh}h_{t-1})$$
$$y_t = W^{yh}h_t$$
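A minimal NumPy sketch of this recurrence; the weight shapes, the logistic sigmoid, and the function names are assumptions beyond the equations themselves:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """Run the vanilla RNN recurrence over an input sequence.

    xs   : list of input vectors x_t, each of shape (d_in,)
    W_hx : (d_h, d_in) input-to-hidden weights
    W_hh : (d_h, d_h)  hidden-to-hidden weights
    W_yh : (d_out, d_h) hidden-to-output weights
    h0   : (d_h,) initial hidden state
    """
    h, ys = h0, []
    for x in xs:
        h = sigm(W_hx @ x + W_hh @ h)   # h_t = sigm(W^{hx} x_t + W^{hh} h_{t-1})
        ys.append(W_yh @ h)             # y_t = W^{yh} h_t
    return ys, h
```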
* To map between variable-length sequences, one RNN encodes the input into a fixed-size vector, and another RNN decodes the output sequence from that vector.
* Because the mapping involves long-range temporal dependencies (the whole input must be remembered while the output is generated), LSTMs are preferred over vanilla RNNs.
* LSTMs estimate the conditional probability *p(output sequence | input sequence)* by first mapping the input sequence to a fixed-dimensional representation (the last hidden state of the encoder LSTM) and then computing the probability of the output sequence with a standard LSTM-LM (LSTM language model) formulation; a sketch of this encoder-decoder setup follows.
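A condensed PyTorch sketch of the encoder-decoder setup. The 4-layer, 1000-unit configuration mirrors the sizes reported in the paper, but the module structure, names, and interface here are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM maps the source to a fixed-size state; the decoder LSTM,
    initialised with that state, models p(target sequence | source sequence)."""

    def __init__(self, src_vocab, tgt_vocab, emb=1000, hidden=1000, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, tgt_vocab)   # LSTM-LM softmax layer

    def forward(self, src, tgt_in):
        # Encoder: keep only the final (h, c) state as the sentence representation.
        _, state = self.encoder(self.src_emb(src))
        # Decoder: a standard LSTM language model conditioned on the encoder state.
        out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.proj(out)   # logits over the target vocabulary at each step
```

Training then maximises the log-probability of the correct target tokens under these logits, e.g. with a cross-entropy loss over the target sequence shifted by one position.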
##### Differences between the model and standard LSTMs
* The model uses two LSTMs (one for the input sequence and another for the output sequence), which increases the number of model parameters at negligible computational cost.
* The model uses deep LSTMs (4 layers).
* The words in the source sentences are reversed to introduce short-term dependencies and to reduce the "minimal time lag". With the word order reversed, the first few words of the source (input) sentence are much closer to the first few words of the target (output) sentence, making it easier for the LSTM to "establish communication" between input and output (illustrated below).
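A small illustrative example of the reversal trick; the sentence pair and the whitespace tokenisation are made up, and only the source side is reversed:

```python
def reverse_source(pair):
    """Reverse the source tokens and leave the target untouched, so that
    early source words end up adjacent to early target words."""
    src, tgt = pair
    return src.split()[::-1], tgt.split()

src_rev, tgt = reverse_source(
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"))
# src_rev == ['mat', 'the', 'on', 'sat', 'cat', 'the']
```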
#### Experiments
* WMT'14 English-to-French dataset: a subset of 12 million sentence pairs consisting of 348 million French words and 304 million English words.
* The model is evaluated both on direct translation and on re-scoring the n-best lists of a baseline SMT system.
* The deep LSTMs are trained on sentence pairs by maximizing the log probability of a correct translation $T$ given the source sentence $S$.
* The training objective is this log probability averaged over all pairs in the training set, as shown below.
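With $\mathcal{S}$ denoting the training set of (source, target) sentence pairs, the objective is:
$$\frac{1}{|\mathcal{S}|} \sum_{(T, S) \in \mathcal{S}} \log p(T \mid S)$$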
* At test time, the most likely translation is found with a simple left-to-right beam-search decoder (sketched after this list).
* A hard constraint is enforced on the norm of the gradient to avoid the exploding-gradient problem (see the training sketch after this list).
* Minibatches are formed from sentences of similar length to reduce training time.
* The model performs better when the reversed source sentences are used for training.
* While the model does not beat the state-of-the-art, it is the first pure neural translation system to outperform a phrase-based SMT baseline.
* The model also performs well on long sentences, with only minor degradation for the longest ones.
* The paper lays the groundwork for applying sequence-to-sequence learning models in other domains by demonstrating how a simple and relatively unoptimised neural model can outperform a mature SMT system on a translation task.
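A sketch of the simple left-to-right beam search mentioned in the experiments; the `step_logprobs` callback, the `bos`/`eos` tokens, and the default beam size are assumptions (the paper only notes that a small beam, even of size 1 or 2, works well):

```python
def beam_search(step_logprobs, bos, eos, beam_size=2, max_len=50):
    """Keep the `beam_size` best partial hypotheses at each step.

    step_logprobs(prefix) -> {token: log p(token | source, prefix)} is a
    hypothetical scoring callback backed by the decoder LSTM.
    """
    beams = [([bos], 0.0)]          # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:               # every surviving hypothesis has emitted <EOS>
            break
    finished.extend(beams)          # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])[0]
```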
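And a hedged PyTorch sketch of the two training details above, length-sorted minibatches and a hard cap on the gradient norm, assuming a model with the interface of the encoder-decoder sketch earlier; the batch size, clipping threshold, and loss handling are illustrative:

```python
import torch

def length_bucketed_batches(pairs, batch_size=128):
    """Group (source tokens, target tokens) pairs of similar source length
    into the same minibatch so little computation is wasted on padding."""
    pairs = sorted(pairs, key=lambda p: len(p[0]))
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]

def train_step(model, loss_fn, optimizer, src, tgt_in, tgt_out, max_norm=5.0):
    """One optimisation step with a hard constraint on the gradient norm."""
    optimizer.zero_grad()
    logits = model(src, tgt_in)                      # (batch, tgt_len, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
    loss.backward()
    # Rescale the gradient if its norm exceeds the threshold (exploding gradients).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```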