Summary by Udibr 7 years ago
This is a very technical paper, and I only covered the items that interested me
* Model
* Encoder
* 8 layers LSTM
* only the first encoder layer is bi-directional
* the top 4 layers add their input to their output (residual connections)
* Decoder
* same as the encoder, except all layers are forward-only
* the final encoder state is not used as a starting point for the decoder state
* Attention
* energy is computed by a neural network with one hidden layer, as opposed to a dot product or the usual practice of no hidden layer with a $\tanh$ activation at the output layer
* computed from the output of the 1st decoder layer
* fed to all decoder layers
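The attention energy described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code; the weight names (`W_s`, `W_h`, `v`) and dimensions are assumptions:

```python
import numpy as np

def attention_energies(s, H, W_s, W_h, v):
    """Energy e_j = v . tanh(s W_s + h_j W_h): a feed-forward net with one
    hidden (tanh) layer, rather than a plain dot product between s and h_j.
    s: decoder state, shape (d,); H: encoder states, shape (T, d)."""
    hidden = np.tanh(s @ W_s + H @ W_h)  # (T, k) hidden layer, broadcast over T
    return hidden @ v                    # (T,) unnormalized energies

def attention_weights(s, H, W_s, W_h, v):
    """Softmax over the energies gives one weight per input word."""
    e = attention_energies(s, H, W_s, W_h, v)
    e = e - e.max()                      # shift for numerical stability
    a = np.exp(e)
    return a / a.sum()
```

The weights sum to 1 over the input positions, so they can be used directly to form the context vector as a weighted average of encoder states.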
* Training has two steps: ML and RL
* ML (cross-entropy) training:
* following common wisdom, all trainable parameters are initialized uniformly in $[-0.04, 0.04]$
* gradient clipping = 5, batch size = 128
* Adam (lr=2e-4) for 60K steps, followed by SGD (lr=.5, which is probably a typo!) for 1.2M steps, then the lr is halved every 200K steps, 4 times
* 12 asynchronous machines, each with 8 GPUs (K80) across which the model is spread, for 6 days
* [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.2-0.3 (higher for smaller datasets)
* RL - [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15)
* the sequence score is $\text{GLEU} = r = \min(\text{precision}, \text{recall})$, computed on n-grams of size 1-4
* mixed loss $\alpha \, \text{ML} + \text{RL}$, with $\alpha = 0.25$
* mean $r$ computed from $m=15$ samples
* SGD, 400K steps, 3 days, no dropout
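The GLEU reward used in the RL step can be computed with plain counters. A minimal sketch (my own reading of min-precision/min-recall over clipped n-gram matches, not the paper's code):

```python
from collections import Counter

def ngrams(seq, n):
    """Multiset of all n-grams of a token sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def gleu(hyp, ref, max_n=4):
    """GLEU reward r = min(precision, recall), pooled over n-grams of size 1..max_n."""
    match = hyp_total = ref_total = 0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match += sum((h & r).values())  # clipped n-gram matches
        hyp_total += sum(h.values())
        ref_total += sum(r.values())
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(match / hyp_total, match / ref_total)
```

For example, `gleu(["a", "b", "c"], ["a", "b", "c"])` is 1.0, since every n-gram matches. During RL training the mean of this reward over the $m=15$ samples is used as the baseline-corrected signal.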
* Prediction (i.e. Decoder)
* beam search (3 beams)
* a normalized score is computed for every beam that has ended (died)
* the beam score is not normalized by $\text{beam\_length}^\alpha$, $\alpha \in [0.6, 0.7]$
* instead, it is normalized with a similar formula in which 5 is added to the length, and a coverage factor is added: the sum of the logs of the total attention weight of every input word (i.e. after summing over all output words)
* Do a second pruning using normalized scores
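The normalized score described above can be sketched as follows. This is an illustrative reconstruction; the default `alpha`/`beta` values and function name are my assumptions, not fixed by the summary:

```python
import math

def normalized_score(log_prob, length, attn, alpha=0.6, beta=0.2):
    """Normalized beam score: log-probability divided by a length penalty
    (with 5 added to the length), plus a coverage penalty summing the log of
    each input word's total attention weight, capped at 1.0.
    attn[i][j] = attention weight of output word j on input word i.
    Assumes every input word receives some attention (else log(0))."""
    lp = ((5 + length) ** alpha) / ((5 + 1) ** alpha)
    cp = beta * sum(math.log(min(sum(row), 1.0)) for row in attn)
    return log_prob / lp + cp
```

When every input word is fully covered (each `attn` row sums to 1), the coverage term is zero and the score reduces to the length-normalized log-probability; under-attended input words pull the score down.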