* The output can contain several sentences, which are treated as a single long sequence.
* Seq2Seq+attention:
* Oddly, they combine the attention context $c_t$ with the decoder output using the Bahdanau-style form $h_t^T = W_0 \tanh \left( U_h h_t^T + W_h c_t \right)$, while the attention weights themselves are computed with a softmax over the dot product between encoder and decoder outputs, $h_t^T \cdot h_i^S$
* GloVe embeddings, 300 dimensions
* 2-layer LSTM, 256 hidden units
* RL model
* Reward = Simplicity + Relevance + Fluency = $\lambda^S r^S + \lambda^R r^R + \lambda^F r^F$
* $r^S = \beta \text{SARI}(X,\hat{Y},Y) + (1-\beta) \text{SARI}(X,Y,\hat{Y})$
* $r^R$: cosine similarity between the output of an RNN autoencoder run on the input and the output of a separate autoencoder run on the output
* $r^F$: perplexity of an LM trained on the output side
* Learning exactly as in [MIXER](https://arxiv.org/abs/1511.06732)
* Lexical Simplification model: they train a second model $P_{LS}$, which reuses the pre-trained attention weights and feeds the attention-weighted output of an encoder LSTM into a softmax
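
To make the formulas in these notes concrete, here is a minimal pure-Python sketch of the dot-product attention, the tanh combination step, and the reward mixing. It is an illustration under assumptions, not the paper's implementation: all function names (`attend`, `combine`, `simplicity_reward`, `total_reward`) are hypothetical, vectors are plain lists, and SARI is passed in as a callable `sari(source, output, reference)` rather than implemented here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]


def attend(h_t, encoder_states):
    """Weights = softmax_i(h_t . h_i^S); context c_t = sum_i w_i * h_i^S."""
    weights = softmax([dot(h_t, h_i) for h_i in encoder_states])
    dim = len(encoder_states[0])
    c_t = [sum(w * h_i[d] for w, h_i in zip(weights, encoder_states))
           for d in range(dim)]
    return weights, c_t


def combine(h_t, c_t, W_0, U_h, W_h):
    """W_0 tanh(U_h h_t + W_h c_t), the combination step as written in the notes."""
    pre = [a + b for a, b in zip(matvec(U_h, h_t), matvec(W_h, c_t))]
    return matvec(W_0, [math.tanh(x) for x in pre])


def simplicity_reward(sari, X, Y_hat, Y, beta):
    """r^S = beta * SARI(X, Y_hat, Y) + (1 - beta) * SARI(X, Y, Y_hat)."""
    return beta * sari(X, Y_hat, Y) + (1 - beta) * sari(X, Y, Y_hat)


def total_reward(r_S, r_R, r_F, lam_S, lam_R, lam_F):
    """Weighted sum of simplicity, relevance, and fluency rewards."""
    return lam_S * r_S + lam_R * r_R + lam_F * r_F


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weight matrices, assumption only
    h_t = [1.0, 0.0]
    weights, c_t = attend(h_t, [[1.0, 0.0], [0.0, 1.0]])
    print(weights, combine(h_t, c_t, identity, identity, identity))
```

The $\lambda$ and $\beta$ values are hyperparameters to be tuned; any numbers plugged into `total_reward` or `simplicity_reward` here are placeholders, not values taken from the paper.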