This is a very technical paper and I only covered the items that interested me.

* Model
  * Encoder
    * 8 LSTM layers
    * bi-directional only in the first encoder layer
    * top 4 layers add their input to their output (residual connections); see the encoder sketch after this list
  * Decoder
    * same as the encoder, except all layers run in the forward direction only
    * the encoder state is not passed as a starting point for the decoder state
  * Attention
    * energy is computed by a NN with one hidden layer, as opposed to a dot product or the usual practice of no hidden layer and a $\tanh$ activation at the output layer (see the attention sketch below)
    * computed from the output of the 1st decoder layer
    * the resulting context is pre-fed to all decoder layers
* Training has two steps: ML and RL
  * ML (cross-entropy) training:
    * common wisdom: initialize all trainable parameters uniformly in $[-0.04, 0.04]$
    * clipping=5, batch=128
    * Adam (lr=2e-4) for 60K steps, followed by SGD (lr=0.5, which is probably a typo!) for 1.2M steps, then 4 × (halve the lr, run 200K steps); see the schedule sketch below
    * 12 async machines, each with 8 GPUs (K80) across which the model is spread, × 6 days
    * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
  * RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15)
    * sequence score $\text{GLEU} = r = \min(\text{precision}, \text{recall})$, computed on n-grams of sizes 1-4 (see the GLEU sketch below)
    * mixed loss $\alpha \cdot \text{ML} + \text{RL}$, $\alpha = 0.25$
    * mean $r$ computed from $m = 15$ samples
    * SGD, 400K steps, 3 days, no dropout
* Prediction (i.e. the decoder)
  * beam search (3 beams)
  * a normalized score is computed for every beam that has ended (died)
  * the beam score is *not* normalized by $\text{beam\_length}^\alpha$, $\alpha \in [0.6, 0.7]$
  * instead, a similar formula is used in which 5 is added to the length, and a coverage factor is added: the sum of the log of the attention weight of every input word (i.e. after summing over all output words); see the scoring sketch below
  * a second pruning is done using the normalized scores
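A minimal PyTorch sketch of the encoder as described above: 8 LSTM layers, bi-directional only in layer 1, and residual connections on the top 4 layers. The hidden size, class name, and batching layout are my assumptions for illustration, not taken from the paper.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """8 LSTM layers; layer 1 is bi-directional, the top 4 are residual."""
    def __init__(self, vocab_size, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # bi-directional first layer; half-size directions so the concatenated
        # forward/backward output matches the uni-directional layers
        self.bi = nn.LSTM(hidden, hidden // 2, bidirectional=True,
                          batch_first=True)
        self.uni = nn.ModuleList(nn.LSTM(hidden, hidden, batch_first=True)
                                 for _ in range(7))  # layers 2..8

    def forward(self, tokens):                  # tokens: (batch, src_len)
        x, _ = self.bi(self.embed(tokens))
        for i, layer in enumerate(self.uni, start=2):
            out, _ = layer(x)
            x = out + x if i >= 5 else out      # top 4 layers: input + output
        return x                                # (batch, src_len, hidden)
```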
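The attention energy, sketched as a feed-forward net with one hidden layer scoring each (query, encoder state) pair; the query is the output of the 1st decoder layer, and the resulting context is what gets pre-fed to all decoder layers. Sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class MLPAttention(nn.Module):
    """Energy from a one-hidden-layer NN rather than a dot product."""
    def __init__(self, hidden=1024):
        super().__init__()
        self.energy = nn.Sequential(
            nn.Linear(2 * hidden, hidden),  # [query; encoder state] -> hidden
            nn.Tanh(),
            nn.Linear(hidden, 1),           # -> scalar energy per source word
        )

    def forward(self, query, enc_states):
        # query: (batch, hidden), output of the 1st decoder layer
        # enc_states: (batch, src_len, hidden), encoder's top-layer outputs
        q = query.unsqueeze(1).expand(-1, enc_states.size(1), -1)
        e = self.energy(torch.cat([q, enc_states], dim=-1)).squeeze(-1)
        weights = torch.softmax(e, dim=-1)                  # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights   # context is pre-fed to every decoder layer
```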
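The ML-phase optimizer schedule, restated as a tiny helper so the step arithmetic is explicit; the step counts and rates are the ones quoted above, the function itself is just illustrative.

```python
def learning_rate(step):
    """Adam at 2e-4 for 60K steps, then SGD at 0.5 for 1.2M steps,
    then the SGD rate is halved every 200K steps, four times."""
    if step < 60_000:
        return "adam", 2e-4
    sgd_step = step - 60_000
    if sgd_step < 1_200_000:
        return "sgd", 0.5
    halvings = min(4, 1 + (sgd_step - 1_200_000) // 200_000)
    return "sgd", 0.5 / 2 ** halvings
```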
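A sketch of the GLEU reward exactly as summarized: pool the n-grams of sizes 1-4 from hypothesis and reference, then take the minimum of precision and recall over the matched counts. This follows the description above and may not capture every detail of the paper's definition.

```python
from collections import Counter

def gleu(hypothesis, reference, max_n=4):
    """r = min(precision, recall) over pooled n-grams of sizes 1..max_n."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    hyp, ref = Counter(), Counter()
    for n in range(1, max_n + 1):
        hyp += ngrams(hypothesis, n)
        ref += ngrams(reference, n)
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())        # n-grams matched in both
    return min(overlap / sum(hyp.values()),    # precision
               overlap / sum(ref.values()))    # recall

# e.g. gleu("the cat sat".split(), "the cat sat down".split()) == 0.6
```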
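Finally, a sketch of how a finished beam could be scored under the normalization described above: length normalization with 5 added to the length, plus a coverage term that sums the log of each input word's total attention mass. The `/6.0` base, the cap at 1.0, and the coverage weight `beta` are my assumptions (they follow the formula published in the GNMT paper, which the summary paraphrases).

```python
import numpy as np

def beam_score(log_prob, length, attention, alpha=0.65, beta=1.0):
    """Normalized score for a finished beam.

    log_prob:  sum of token log-probabilities along the beam
    attention: (out_len, src_len) matrix of attention weights
    """
    length_penalty = ((5.0 + length) / 6.0) ** alpha
    # attention mass each input word received, summed over all output words
    per_source = np.minimum(attention.sum(axis=0), 1.0)
    coverage = beta * np.log(per_source).sum()
    return log_prob / length_penalty + coverage
```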