Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
Empirical Methods in Natural Language Processing (EMNLP), 2014
TLDR; The authors propose a novel encoder-decoder neural network architecture. The encoder RNN encodes a sequence into a fixed-length vector representation, and the decoder generates a new variable-length sequence conditioned on this representation. The authors also introduce a new cell type (now called the GRU) to be used with this architecture. The model is evaluated on a statistical machine translation task, where its phrase-pair scores are fed as an additional feature to a log-linear model, leading to improved BLEU scores. The authors also find that the model learns syntactically and semantically meaningful representations of both words and phrases.
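To make the encode-then-condition idea concrete, here is a minimal numpy sketch: a plain tanh RNN encoder compresses the source sequence into a fixed-length vector `c`, and a decoder RNN is conditioned on `c` at every generation step. The dimensions, weight names, and plain-tanh cells are illustrative assumptions, not the paper's exact parameterization (the paper uses the gated unit sketched further below and a maxout output layer).

```python
import numpy as np

H, E, V = 8, 4, 10          # hidden size, embedding size, vocabulary size (toy values)
rng = np.random.default_rng(0)
params = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "Emb": (V, E), "W_enc": (H, E), "U_enc": (H, H),
    "W_dec": (H, E), "U_dec": (H, H), "C_dec": (H, H), "W_out": (V, H),
}.items()}

def encode(src_ids, p):
    """Run the encoder RNN and return the fixed-length summary vector c."""
    h = np.zeros(H)
    for t in src_ids:
        x = p["Emb"][t]
        h = np.tanh(p["W_enc"] @ x + p["U_enc"] @ h)
    return h  # c = final encoder hidden state

def decode_step(prev_id, h, c, p):
    """One decoder step, conditioned on the previous token and the context c."""
    x = p["Emb"][prev_id]
    h = np.tanh(p["W_dec"] @ x + p["U_dec"] @ h + p["C_dec"] @ c)
    logits = p["W_out"] @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

c = encode([3, 5, 1], params)                     # encode a toy source sentence
h, p_next = decode_step(0, np.zeros(H), c, params)
print(p_next.shape)                               # (10,) distribution over the next token
```

Because the decoder defines a probability over target sequences given the source, the same network can either score an existing phrase pair (sum the log-probabilities of its tokens) or generate a new target sequence by sampling/argmax at each step.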
#### Key Points:
- New encoder-decoder (seq2seq) architecture; the decoder is conditioned on a fixed-length summary ("thought") vector.
- The architecture can be used for both scoring and generation.
- New hidden unit type, now called the GRU: a simplified alternative to the LSTM (see the sketch after this list).
- The whole SMT pipeline could in principle be replaced with this architecture, but this paper doesn't go that far.
- 15k vocabulary (covering ~93% of the dataset), 100-dim embeddings, 500 maxout units in the final affine layer, batch size of 64, Adagrad, 384M words, ~3 days of training time.
- The architecture is trained without word-frequency information, so we expect it to capture linguistic regularities rather than corpus statistics.
- Visualizations of both word embeddings and phrase ("thought") vectors.
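As referenced in the list above, here is a minimal sketch of the proposed gated hidden unit (GRU): a reset gate `r` controls how much of the previous state enters the candidate activation, and an update gate `z` interpolates between the old state and the candidate. Weight names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)               # reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)               # update gate
    h_tilde = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                     # interpolate old/new

H, E = 8, 4                                      # toy hidden and input sizes
rng = np.random.default_rng(1)
p = {k: rng.normal(scale=0.1, size=(H, E if k.startswith("W") else H))
     for k in ["W_r", "U_r", "W_z", "U_z", "W_h", "U_h"]}
h = np.zeros(H)
for x in rng.normal(size=(5, E)):                # run over a toy sequence of 5 inputs
    h = gru_step(x, h, p)
print(h.shape)                                   # (8,)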
#### Questions/Notes
- Why not just use LSTM units?