[link]
* Output can contain several sentences, that are considered as a single long sequence. * Seq2Seq+attention: * Oddly they use the formula used by Bahdanau attention weights to combine the weighted attention $c_t$ with the decoder output $h_t^T = W_0 \tanh \left( U_h h_t^T + W_h c_t \right) $ while the attention weights are computed with softmax over dot product between encoder and decoder outputs $h_t^T \cdot h_i^S$ * Glove 300 * 2 layer LSTM 256 * RL model * Reward=Simplicity+Relevance+Fluency = $\lambda^s r^S + \lambda^R r^R + \lambda^F r^F$ * $r^S = \beta \text{SARI}(X,\hat{Y},Y) + (1\beta) \text{SARI}(X,Y,\hat{Y})$ * $r^R$ cosine of output of RNN auto encoder run on input and a separate auto encoder run on output * $r^F$ perplexity of LM trained on output * Learning exactly as in [MIXER](https://arxiv.org/abs/1511.06732) * Lexical Simplification model: they train a second model $P_{LS}$ which uses pretrained attention weights and then use the weighted output of an encoder LSTM as the input to a softmax 
[link]
Generates abstractive summaries from news articles. Also see [blog](https://metamind.io/research/yourtldrbyanaiadeepreinforcedmodelforabstractivesummarization) * Input: * vocab size 150K * start with $W_\text{emb}$ Glove 100 * Seq2Seq: * bidirectional LSTM, `size=200` in each direction. Final hidden states are concatenated and feed as initial hidden state of the decoder an LSTM of `size=400`. surprising it's only one layer. * Attention: * Add standard attention mechanism between each new hidden state of the decoder and all the hidden states of the encoder * A new kind of attention mechanism is done between the new hidden state of the decoder and all previous hidden states of the decoder * the new hidden state is concatenated with the two attention outputs and feed to dense+softmax to model next word in summary (output vocab size 50K). The weight matrix $W_h$ is reduced to $W_h = \tanh \left( W_\text{emb} W_\text{proj} \right) $ resulting in faster converges, see [1](arXiv:1611.01462) and [2](https://arxiv.org/abs/1608.05859) * Pointer mechanism: * The concatenated values are also feed to logistic classifier to decide if the softmax output should be used or one of the words in the article should be copied to the output. The article word to be copied is selected using same weights computed in the attention mechanism * Loss * $L_\text{ml}$: NLL of the example summary $y^*$. If only $L_\text{ml}$ is used then 25% of the times use generated instead of given sample as input to next step. * $L_\text{rl}$: sample an entire summary from the model $y^s$ (temperature=1) and the loss is the NLL of the sample multiplied by a reward. The reward is $r(y^s)r(\hat{y})$ where $r$ is ROUGEL and $\hat{y}$ is a generated greedy sequences * $L=\gamma L_\text{rl} + (1\gamma)L_\text{ml}$ where $\gamma=0.9984$ * Training * `batch=50`, Adam, `LR=1e4` for RL/ML+RL training * The training labels are summary examples and an indication if copy was used in the pointer mechanism and which word was copied. This is indicated when the summary word is OOV or if it appears in the article and its NER is one of PERSON, LOCATION, ORGANIZATION or MISC * Generation * 5 beams * force trigrams not to appear twice in the same beam 
[link]
### ReadAgain Two options: * GRU: run a pass of regular GRU on the input text $x_1,\ldots,x_n$. Use its hidden states $h_1,\ldots,h_n$ to compute weights vector for every step $i$ : $\alpha_i = \tanh \left( W_e h_i + U_e h_n + V_e x_i\right)$ and then runs a second GRU pass on the same input text. In the second pass the weights $\alpha_i$, from the first pass, are multiplied with the internal $z_i$ GRU gatting (controlling if hidden state is directly copied) of the second pass. * LSTM: concatenate the hidden states from the first pass with the input text $\left[ x_i, h_i, h_n \right]$ and run a second pass on this new input. In case of multiple sentences the above passes are done per sentence. In addition the $h^s_n$ of each sentence $s$ is concatenated with the $h^{s'}_n$ of the other sentences or with $\tanh \left( \sum_s V_s h_s + v\right)$ ### Decoder with copy mechanism LSTM with hidden state $s_t$. Input is previously generated word $y_{t1}$ and context computed with attention mechanism: $c_t = \sum_i^n \beta_{it} h_i$. Here $h_i$ are the hidden states of the 2nd pass of the encoder. The weights are $\beta_{it} = \text{softmax} \left( v_a^T \tanh \left( W_a s_{t1} + U_a h_i\right) \right)$ The decoder vocabulary $Y$ used is small. If $y_{t1}$ does not appear in $Y$ but does appear in the input at $x_i$ then its embedding is replaced with $p_t = \tanh \left( W_c h_i + b_c\right)$ and <UNK> otherwise. $p_t$ is also used to copy the input to the output (details not given) ### Experiments abstractive summarization [DUC2003 and DUC2004 competitions](http://wwwnlpir.nist.gov/projects/duc/data.html). 
[link]
[Code](https://github.com/ashwinkalyan/dbs), [Live Demo](http://dbs.cloudcv.org/) ([code for demo site]( https://github.com/CloudCV/diversebeamsearch)) Diverse Beam Search (DBS) is an alternative to Beam Search (BS). Decodes diverse lists by dividing the beam budget $B$ (e.g. 6) into groups $G$ (e.g. 3) and enforcing diversity between groups of beams. For every time step $t$ iterate over all groups. In 1st group find $B'=B/G$ (e.g. 2) partial beams $Y^1_{[t]} = \{y^1_{b,t} : b \in [B']\}$ using BS with NLL. In 2nd group find partial beams $y^2_{b,t}$ using BS with partial beam score taken to be the sum of NLL and the distance between the partial beam and the partial beams in 1st group. The distance is multiplied by a factor $\lambda_t$. For group $g$ the distance is measured between the partial beam $y^g_{b,t}$ and all the partial beams in all groups that were already optimized for current time step. $\Delta(Y^1_{[t]},\ldots,Y^{g1}_{[t]})[y^g_{b,t}]$ Evaluation Metrics: * Oracle Accuracy: maximum value of the metric (BLEU) over a list of final beams * Diversity Statistics: number of distinct ngrams in all final beams * Human preference Parameters: * $G=B$ allows for the maximum exploration and found to improve oracle accuracy. * $\lambda \in [0.20.8]$ Distance between partial beam and all other groups is broken to a sum of the distances with each group: $$\Delta(Y^1_{[t]},\ldots,Y^{g1}_{[t]}) = \sum^{g1}_{h=1}\Delta(Y^h_{[t]})$$ individual $\Delta(Y^h_{[t]})[y^g_{b,t}]$ is taken to be one of: * Hamming (gives best oracle performance): proportional to the number of times latest token in $y^g_{b,t}$ was selected as latest token in beams in $Y^h_{[t]}$. * Cumulative: cancels out Hamming: $\exp\{(\sum_{\tau \in t} \sum_{b \in B'} \mathbb{1}_{[y^h_{b,\tau} \neq y^g_{b,\tau}]})/\Gamma\}$ * ngram: number of times each ngram in a candidate occurred in previous groups * Neuralembedding: in all previous methods replace hamming similarity with cosine of word2vec of token (or sum of word2vec of ngram tokens) My 2 cents: * Once a beam reaches EOS you need to stop comparing it with other groups * Using DBS cause results to be longer. Perhaps too much. You can reduce length by adding a penalty to length 
[link]
This is a very techincal paper and I only covered items that interested me * Model * Encoder * 8 layers LSTM * bidirectional only first encoder layer * top 4 layers add input to output (residual network) * Decoder * same as encoder except all layers are just forward direction * encoder state is not passed as a start point to Decoder state * Attention * energy computed using NN with one hidden layer as appose to dot product or the usual practice of no hidden layer and $\tanh$ activation at the output layer * computed from output of 1st decoder layer * prefeed to all layers * Training has two steps: ML and RL * ML (crossentropy) training: * common wisdom, initialize all trainable parameters uniformly between [0.04, 0.04] * clipping=5, batch=128 * Adam (lr=2e4) 60K steps followed by SGD (lr=.5 which is probably a typo!) 1.2M steps + 4x(lr/=2 200K steps) * 12 async machines, each machine with 8 GPUs (K80) on which the model is spread X 6days * [dropout](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZarembaSV14) 0.20.3 (higher for smaller datasets) * RL  [Reinforcement Learning](http://www.shortscience.org/paper?bibtexKey=journals/corr/RanzatoCAZ15) * sequence score, $\text{GLEU} = r = \min(\text{precision}, \text{recall})$ computed on ngrams of size 14 * mixed loss $\alpha \text{ML} + \text{RL}, \alpha =0.25$ * mean $r$ computed from $m=15$ samples * SGD, 400K steps, 3 days, no drouput * Prediction (i.e. Decoder) * beam search (3 beams) * A normalized score is computed to every beam that ended (died) * did not normalize beam score by $\text{beam_length}^\alpha , \alpha \in [0.60.7]$ * normalized with similar formula in which 5 is add to length and a coverage factor is added, which is the sumlog of attention weight of every input word (i.e. after summing over all output words) * Do a second pruning using normalized scores 
[link]
This paper is covered by author in this [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf) 
[link]
[Parsey McParseface](http://github.com/tensorflow/models/tree/master/syntaxnet) is a parser of English sentences capable of finding parts of speech and dependency parsing. By Michael Collins and google NY. This paper is more than just about google's data collection and computing powers. The parser uses a feed forward NN, which is much faster than the RNN usually used for parsing. Also the paper is using a global method to solve the label bias problem. This method can be used for many tasks and indeed in the paper it is used also to shorten sentences by throwing unnecessary words. The label bias problem arises when predicting each label in a sequence using a softmax over all possible label values in each step. This is a local approach but what we are really interested in is a global approach in which the sequence of all labels that appeared in a training example are normalized by all possible sequences. This is intractable so instead a beam search is performed to generate alternative sequences to the training sequence. The search is stopped when the training sequence drops from the beam or ends. The different beams with the training sequence are then used to compute the global loss. Similar method is used in [seq2seq by Sasha Rush](http://arxiv.org/pdf/1606.02960.pdf) and [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf) 
[link]
[web site](http://groups.inf.ed.ac.uk/cup/codeattention/), [code (Theano)](https://github.com/mastgroup/convolutionalattention), [working version of code](https://github.com/udibr/convolutionalattention), [ICML](http://icml.cc/2016/?page_id=1839#971), [external notes](https://github.com/jxieeducation/DIYDataScience/blob/master/papernotes/2016/02/convattentionnetworksourcecodesummarization.md) Given an arbitrary snippet of Java code (~72 tokens) generate the methods name (~3 tokens): generation starts with a $m_0 = \text{startsymbol}$ and state $h_0$, to generate next output token $m_t$ do: * convert code tokens $c_i$ and embed to $E_{c_i}$ * convert all $E_{c_i}$ to $\alpha$ and $\kappa$ all same length as code using a network of `Conv1D` and padding (`Conv1D` because the code is highly structured, unambiguous.) The convertion is done using following network: ![](http://i.imgur.com/cHbiSIi.png?1) * $\alpha$ and $\kappa$ are probabilities over length of code (using softmax). * In addition compute $\lambda$ by running another `Conv1D` over $L_\text{feat}$ with $\sigma$ activation and take the maximal value. * use $\alpha$ to weight average $E_{c_i}$ and pass the average through FC layer to end with a softmax over output vocabulary $V$. Probability for output word $m_t$ is $n_{m_t}$. * As an alternative use $\kappa$ to give probability to use as output each of the tokens $c_i$ which can be inside $V$ or outside it. This is also called "translationinvariant features" ([ref](https://papers.nips.cc/paper/5866pointernetworks.pdf)) * $\lambda$ is used as a metaattention deciding which to use: $P(m_t \mid h_{t1},c) = \lambda \sum_i \kappa_i I_{c_i = m_t} + (1\lambda) \mu n_{m_t}$ where $\mu$ is $1$ unless you are in training and $m_t$ is UNK and the correct value for $m_t$ appears in $c$ in which case it is $e^{10}$ * Advance $h_{t1}$ to $h_t$ with GRU and using as input the embedding of output token $m_{t1}$ (while training this is taken from the training labels or with small probability the argmax of the generated output.) * Generating using hybrid breadthfirst search and beam search: keep a heap of all suggestions and always try to extend the best suggestion so far. Remove suggestions that are worse than all the completed suggestions (dead) so far. 
[link]
multi layer RNN in which first layer is LSTM, following layers $l$ have $t$,$c$ gates that control whether the state of the layer is carried from previous state or transferred previous layer: $s_l^{[t]} = h_l^{[t]} \cdot t_l^{[t]} + s_{l1}^{[t]} \cdot c _l^{[t]}$ 
[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/) ### Feature matching problem: overtraining on the current discriminator solution: ￼$E_{x \sim p_{\text{data}}}f(x)  E_{z \sim p_{z}(z)}f(G(z))_{2}^{2}$ were f(x) activations intermediate layer in discriminator ### Minibatch discrimination problem: generator to collapse to a single point solution: for each sample i, concatenate to $f(x_i)$ features $b$ measuring its distance to other samples j (i and j are both real or generated samples in same batch): $\sum_j \exp(M_{i, b}  M_{j, b}_{L_1})$ ￼ this generates visually appealing samples very quickly ### Historical averaging problem: SGD fails by going into extended orbits solution: parameters revert to the mean $ \theta  \frac{1}{t} \sum_{i=1}^t \theta[i] ^2$ ￼ ### Onesided label smoothing problem: discriminator vulnerability to adversarial examples solution: discriminator target for positive samples is 0.9 instead of 1 ### Virtual batch normalization problem: using BN cause output of examples in batch to be dependent solution: use reference batch chosen once at start of training and each sample is normalized using itself and the reference. It's expensive so used only on generation ### Assessment of image quality problem: MTurk not reliable solution: use inception model p(yx) to compute ￼$\exp(\mathbb{E}_x \text{KL}(p(y  x)  p(y)))$ on 50K generated images x ### Semisupervised learning use the discriminator to also classify on K labels when known and use all real samples (labels and unlabeled) in the discrimination task ￼$D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
3 Comments
