Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, et al.
arXiv e-Print archive - 2016 via Local arXiv
Keywords:
cs.CL, cs.AI, cs.LG
First published: 2016/09/26
Abstract: Neural Machine Translation (NMT) is an end-to-end learning approach for
automated translation, with the potential to overcome many of the weaknesses of
conventional phrase-based translation systems. Unfortunately, NMT systems are
known to be computationally expensive both in training and in translation
inference. Also, most NMT systems have difficulty with rare words. These issues
have hindered NMT's use in practical deployments and services, where both
accuracy and speed are essential. In this work, we present GNMT, Google's
Neural Machine Translation system, which attempts to address many of these
issues. Our model consists of a deep LSTM network with 8 encoder and 8 decoder
layers using attention and residual connections. To improve parallelism and
therefore decrease training time, our attention mechanism connects the bottom
layer of the decoder to the top layer of the encoder. To accelerate the final
translation speed, we employ low-precision arithmetic during inference
computations. To improve handling of rare words, we divide words into a limited
set of common sub-word units ("wordpieces") for both input and output. This
method provides a good balance between the flexibility of "character"-delimited
models and the efficiency of "word"-delimited models, naturally handles
translation of rare words, and ultimately improves the overall accuracy of the
system. Our beam search technique employs a length-normalization procedure and
uses a coverage penalty, which encourages generation of an output sentence that
is most likely to cover all the words in the source sentence. On the WMT'14
English-to-French and English-to-German benchmarks, GNMT achieves competitive
results to state-of-the-art. Using a human side-by-side evaluation on a set of
isolated simple sentences, it reduces translation errors by an average of 60%
compared to Google's phrase-based production system.
This is a very technical paper and I only covered the items that interested me.
* Model
* Encoder
* 8-layer LSTM
* only the first encoder layer is bi-directional
* the top 4 layers add their input to their output (residual connections)
* Decoder
* same as the encoder, except all layers are forward-direction only
* the encoder's final state is not passed as the initial state of the decoder
* Attention
* energy is computed using a NN with one hidden layer, as opposed to a dot product or the usual practice of no hidden layer with a $\tanh$ activation at the output layer
* computed from output of 1st decoder layer
* the attention context is fed to all decoder layers (see the sketch after this list)
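Below is a minimal sketch (not the paper's code) of what this attention looks like: the energy for each source position comes from a small feed-forward net with one hidden layer, queried with the output of the first decoder layer, and the resulting context vector is then fed to all decoder layers. The PyTorch module, the $\tanh$ hidden activation, and all dimensions/names are my own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Energy = feed-forward net with one hidden layer (sizes are illustrative)."""
    def __init__(self, enc_dim, dec_dim, hidden_dim):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, hidden_dim)   # projects top encoder outputs
        self.dec_proj = nn.Linear(dec_dim, hidden_dim)   # projects 1st decoder layer output
        self.energy = nn.Linear(hidden_dim, 1)           # scalar energy per source position

    def forward(self, enc_top, dec_bottom):
        # enc_top:    (batch, src_len, enc_dim)  -- outputs of the top encoder layer
        # dec_bottom: (batch, dec_dim)           -- output of the 1st decoder layer at step t
        hidden = torch.tanh(self.enc_proj(enc_top) + self.dec_proj(dec_bottom).unsqueeze(1))
        scores = self.energy(hidden).squeeze(-1)         # (batch, src_len)
        weights = F.softmax(scores, dim=-1)              # attention distribution over source words
        context = torch.bmm(weights.unsqueeze(1), enc_top).squeeze(1)
        # the context vector is what gets fed to all decoder layers at the next step
        return context, weights
```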
* Training has two steps: ML and RL
* ML (cross-entropy) training:
* following common wisdom, all trainable parameters are initialized uniformly in [-0.04, 0.04]
* clipping=5, batch=128
* Adam (lr=2e-4) for 60K steps, followed by SGD (lr=0.5, which is probably a typo!) for 1.2M steps, then 4 × (halve the lr, 200K steps each)
* 12 async machines (replicas), each with 8 K80 GPUs across which the model is spread; about 6 days of training
* [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
* RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15)
* sequence score $r = \text{GLEU} = \min(\text{precision}, \text{recall})$, computed on n-grams of size 1-4 (sketched after this list)
* mixed loss $\alpha \cdot \text{ML} + \text{RL}$, with $\alpha = 0.25$
* mean $r$ computed from $m=15$ samples
* SGD, 400K steps, 3 days, no dropout
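As a rough illustration of the reward described above, here is a minimal GLEU sketch in plain Python (token lists in, float out; the function names are mine): count matching n-grams of size 1-4 between a sampled translation and the reference, then take the minimum of precision and recall.

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Count all n-grams of size 1..max_n, with multiplicity."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def gleu(hypothesis, reference, max_n=4):
    """GLEU = min(precision, recall) over 1-4 grams, as described in the summary."""
    hyp, ref = ngram_counts(hypothesis, max_n), ngram_counts(reference, max_n)
    if not hyp or not ref:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return min(precision, recall)

# During RL training, r = gleu(sample, reference) would be averaged over the
# m=15 samples, and the mixed objective combines the cross-entropy (ML) loss
# with the RL loss using alpha = 0.25 (per the summary above).
```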
* Prediction (i.e. Decoder)
* beam search (beam width 3)
* A normalized score is computed for every beam that has ended (emitted end-of-sentence)
* the beam score is not normalized simply by $\mathrm{length}^\alpha$ with $\alpha \in [0.6, 0.7]$ (the simplest heuristic)
* instead, a similar formula is used in which 5 is added to the length, and a coverage penalty is added: the sum over input words of the log of the total attention weight each word received (i.e. after summing over all output words); see the sketch below
* Do a second pruning using normalized scores
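A sketch of this re-scoring, following the description above: the length penalty adds 5 to the output length, and the coverage term sums the log of the total attention each source word received (capped at 1). The function signature and the specific $\alpha$/$\beta$ values are illustrative, not taken from the paper.

```python
import math

def beam_score(log_prob, output_len, attention, alpha=0.6, beta=0.2):
    # log_prob:   total log P(Y|X) of a finished beam
    # output_len: |Y|, number of target tokens produced
    # attention:  attention[j][i] = weight on source word i at output step j
    length_penalty = ((5.0 + output_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    src_len = len(attention[0])
    coverage = 0.0
    for i in range(src_len):
        # total attention mass received by source word i, capped at 1
        total = sum(attention[j][i] for j in range(len(attention)))
        coverage += math.log(min(total, 1.0))
    # higher is better: length-normalized log-prob plus coverage bonus
    return log_prob / length_penalty + beta * coverage
```

Finished beams would then be re-ranked (the "second pruning") by this normalized score rather than by raw log-probability.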