[link]
TLDR; The authors jointly train a Logistic Regression model with sparse features that is good at "memorization" and a deep feedforward net with embedded sparse features that is good at "generalization". The model is live in the Google Play store and has achieved a 3.9% gain in app acquisition as measured by A/B testing. #### Key Points - Wide Model (Logistic Regression) gets cross products of binary features, e.g. "AND(user_installed_app=netflix, impression_app=pandora)", as inputs. Good at memorization. - The Deep Model alone has a hard time learning embeddings for cross-product features because there is no data for most combinations, yet it still has to make predictions for them. - Trained jointly on 500B examples. |
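For intuition, here is a minimal sketch of the joint wide-and-deep idea in PyTorch, assuming multi-hot cross-product features for the wide part and a single categorical ID per example for the deep part (feature layout and sizes are made up for illustration); the two logits are simply summed before the sigmoid, as in joint training:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross_features, n_ids, embed_dim=32):
        super().__init__()
        self.wide = nn.Linear(n_cross_features, 1)    # logistic regression over sparse cross features
        self.embed = nn.Embedding(n_ids, embed_dim)   # embedded sparse categorical input
        self.deep = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, cross_feats, ids):
        # cross_feats: (batch, n_cross_features) multi-hot; ids: (batch,) categorical ids
        wide_logit = self.wide(cross_feats)
        deep_logit = self.deep(self.embed(ids))
        return torch.sigmoid(wide_logit + deep_logit)  # joint prediction
```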
[link]
TLDR; A new dataset of ~100k questions and answers based on ~500 articles from Wikipedia. Both questions and answers were collected using crowdsourcing. Answers are of various types: 20% dates and numbers, 32% proper nouns, 31% noun phrase answers and 16% other phrases. Humans achieve an F1 score of 86%, and the proposed Logistic Regression model gets 51%. It does well on simple answers but struggles with more complex types of reasoning. The dataset is publicly available at https://stanford-qa.com/. #### Key Points - System must select answers from all possible spans in a passage. $O(N^2)$ possibilities for N tokens in passage. - Answers are ambiguous. Humans achieve 77% on exact match and 86% on F1 (overlap based). Humans would probably achieve close to 100% if the answer phrases were unambiguous. - Lexicalized and dependency tree path features are most important for the LR model - Model performs best on dates and numbers, single tokens, and categories with few possible candidates |
[link]
TLDR; The authors replace the standard attention mechanism (Bahdanau et al) with a RNN/GRU, hoping to model the history of attention decisions and to mitigate the "coverage problem". The authors evaluate their model on Chinese-English translation where they beat Moses (SMT) and GroundHog baselines. The authors also visualize the attention RNN and show that the activations make intuitive sense. #### Key Points - Training time: 2 weeks on Titan X, 300 batches per hour, 2.9M sentence pairs #### Notes - The authors argue that their attention mechanism works better b/c it can capture dependencies among the source states. I'm not convinced by this argument. These states already capture dependencies because they are generated by a bidirectional RNN. - Training seems *very* slow for only 2.9M pairs. I wonder if this model is prohibitively expensive for any production system. - I wonder if we can use RL to "cover" phrases in the source sentences out of order. At each step we pick a span to cover before generating the next token in the target sequence. - The authors don't evaluate Moses for long sentences, why? |
[link]
TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in finetuning). ProgNNs train a neural network on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches. #### Key Points - Finetuning is a destructive process that forgets previous knowledge. We don't want that. - Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. Parameters of those connections are learned, but network 2 and network 1 are frozen during training of network 3. - Downside: # of parameters grows quadratically with the number of tasks. The paper discusses some approaches to address the problem, but it's not clear how well these work in practice. - Metric: AUC (average score per episode during training) as opposed to final score. Transfer score = relative performance compared with a single-net baseline. - Authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network. - Experiment 1: Variations of the Pong game. Baseline that finetunes only the final layer fails to learn. ProgNN beats other baselines and APS shows re-use of knowledge. - Experiment 2: Different Atari games. ProgNNs result in positive transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning final layers fails again. ProgNN beats other approaches. - Experiment 3: Labyrinth, 3D Maze. Pretty much the same result as the other experiments. #### Notes - It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? Networks are trained on different tasks, so the layer representations, or even the numbers of layers, may be completely different. And once you introduce lateral connections from all layers to all other layers the approach no longer scales. - Old tasks cannot learn from new tasks. Unlike humans. - Gating or residuals for lateral connections could make sense to allow the network to "easily" re-use previously learned knowledge. - Why use the AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain. - Scary that finetuning only the final layer fails in most experiments. That's a very commonly used approach in non-RL domains. - Someone should try this on non-RL tasks. - What happens to training time and optimization difficulty as you add more columns? Seems prohibitively expensive. |
[link]
TLDR; The authors combine a standard LSTM softmax with [Pointer Networks](https://arxiv.org/abs/1506.03134) in a mixture model called Pointer-Sentinel LSTM (PS-LSTM). The pointer network helps with rare words and long-term dependencies but is unable to refer to words that are not in the input. The opposite is the case for the standard softmax. By combining the two approaches we get the best of both worlds. The probability of an output word is defined as a mixture of the pointer and softmax model and the mixture coefficient is calculated as part of the pointer attention. The authors evaluate their architecture on the PTB Language Modeling dataset where they achieve state of the art. They also present a novel WikiText dataset that is larger and more realistic than PTB. ### Key Points: - Standard RNNs with softmax struggle with rare and unseen words, even when adding attention. - Use a window of the most recent `L` words to match against. - Probability of output with gating: `p(y|x) = g * p_vocab(y|x) + (1 - g) * p_ptr(y|x)`. - The gate `g` is calculated as an extra element in the attention module. Probabilities for the pointer network are then normalized accordingly. - Integrating the gating function computation into the pointer network is crucial: It needs to have access to the pointer network state, not just the RNN state (which can't hold long-term info) - WikiText-2 dataset: 2M train tokens, 217k validation tokens, 245k test tokens. 33k vocab, 2.6% OOV. 2x larger than PTB. - WikiText-103 dataset: 103M train tokens, 217k validation tokens, 245k test tokens. 267k vocab, 2.4% OOV. 100x larger than PTB. - The Pointer Sentinel model leads to stronger improvements for rare words - that makes intuitive sense. |
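A toy numpy sketch of the mixture, assuming unnormalized pointer scores over the last `L` context words plus one sentinel score; the sentinel's share of the joint softmax is the gate `g` that weights the vocabulary softmax (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def pointer_sentinel_probs(p_vocab, ptr_scores, sentinel_score, context_ids):
    # p_vocab: (V,) vocabulary softmax; ptr_scores: (L,) attention scores over the window
    # context_ids: (L,) vocabulary ids of the words in the window
    scores = np.append(ptr_scores, sentinel_score)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # joint softmax over pointer positions + sentinel
    g = probs[-1]                               # gate: mass assigned to the sentinel
    p_ptr = np.zeros_like(p_vocab)
    np.add.at(p_ptr, context_ids, probs[:-1])   # scatter pointer mass onto vocabulary ids
    # p(y|x) = g * p_vocab + (1 - g) * p_ptr; p_ptr here already sums to (1 - g)
    return g * p_vocab + p_ptr
```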
[link]
TLDR; The authors use policy gradients on an RNN to train a "hard" attention mechanism that decides whether to output something at the current timestep or not. Their algorithm is online, which means it does not need to see the complete sequence before making a prediction, as is the case with soft attention. The authors evaluate their model on small- and medium-scale speech recognition tasks, where they achieve performance comparable to standard sequential models. #### Notes: - Entropy regularization and baselines were critical to make the model learn - Neat trick: Increase dropout as training progresses - Grid LSTMs outperformed standard LSTMs |
[link]
TLDR; The authors add a reconstruction objective to the standard seq2seq model by adding a "Reconstructor" RNN that is trained to re-generate the source sequence based on the hidden states of the decoder. A reconstruction cost is then added to the cost function and the architecture is trained end-to-end. The authors find that the technique improves upon the baseline both when 1. used during training only and 2. when used as a ranking objective during beam search decoding. #### Key Points - Problem to solve: - Standard seq2seq models tend to under- and over-translate because they don't ensure that all of the source information is covered by the target side. - The MLE objective only captures information from source -> target, which favors short translations. Thus, increasing the beam size actually lowers translation quality - Basic Idea - Reconstruct source sentences from the latent representations of the decoder - Use attention over decoder hidden states - Add MLE reconstruction probability to the training objective - Beam decoding is now a two-phase scheme 1. Generate candidates using the encoder-decoder 2. For each candidate, compute a reconstruction score and use it to re-rank together with the likelihood - Training Procedure - Params Chinese-English: `vocab=30k, maxlen=80, embedding_dim=620, hidden_dim=1000, batch=80`. - 1.25M pairs trained for 15 epochs using Adadelta, then trained with the reconstructor for 10 epochs. - Results: - Model increases BLEU from 30.65 -> 31.17 (beam size 10) when used for training only and decoding stays unchanged - BLEU increases from 31.17 -> 31.73 (beam size 10) when also used for decoding - Model successfully deals with large decoding spaces, i.e. BLEU now increases together with beam size #### Notes - [See this issue for author's comments](https://github.com/dennybritz/deeplearning-papernotes/issues/3) - I feel like "adequacy" is a somewhat strange description of what the authors try to optimize. Wouldn't "coverage" be more appropriate? - In Table 1, why does the BLEU score still decrease when length normalization is applied? The authors don't go into detail on this. - The training curves are a bit confusing/missing. I would've liked to see a standard training curve that shows the MLE objective loss and the finetuning with the reconstruction objective side-by-side. - The training procedure is somewhat confusing. They say "We further train the model for 10 epochs" with the reconstruction objective, but then "we use a trained model at iteration 110k". I'm assuming they do early stopping at 110k * 80 = 8.8M examples. Again, would've liked to see the loss curves for this, not just BLEU curves. - I would've liked to see model performance on more "standard" NMT datasets like EN-FR and EN-DE, etc. - Is there perhaps a smarter way to do reconstruction iteratively by looking at what's missing from the reconstructed output? Training the reconstructor with MLE has some of the same drawbacks as training a standard enc-dec with MLE and teacher forcing. |
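A tiny sketch of the two-phase decoding step 2, assuming each beam candidate comes with its forward likelihood log p(y|x) and a reconstruction score log p(x|y) from the reconstructor; the interpolation weight `lam` is a hypothetical hyperparameter:

```python
def rerank_with_reconstruction(candidates, lam=1.0):
    # candidates: list of (translation, log_p_forward, log_p_reconstruction) tuples
    scored = [(y, fwd + lam * rec) for y, fwd, rec in candidates]
    return max(scored, key=lambda pair: pair[1])[0]  # candidate with the best combined score
```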
[link]
TLDR; The standard attention model does not take into account the "history" of attention activations, even though this should be a good predictor of what to attend to next. The authors augment a seq2seq network with a dynamic memory that, for each input, keeps track of an attention matrix over time. The model is evaluated on English-German and English-Chinese NMT tasks and beats competing models. #### Notes - How expensive is this, and how much more difficult are these networks to train? - Sequentially attending to neighboring words makes sense for some language pairs, but for others it doesn't. This method seems rather restricted because it only takes into account a window of k time steps. |
[link]
The authors propose a framework where a Reinforcement Learning agent decides whether to read the next input word or to produce the next output word, trading off translation quality against time delay (caused by read operations). The reward function is based on both quality (BLEU score) and delay (various metrics and hyperparameters). The authors use Policy Gradient to optimize the model, which is initialized from a pre-trained translation model. They apply the approach to WMT'15 EN-DE and EN-RU translation and show that the model increases translation quality in all settings and is able to trade off effectively between quality and delay. |
[link]
TLDR; The authors propose a new normalization scheme called "Layer Normalization" that works especially well for recurrent networks. Layer Normalization is similar to Batch Normalization, but only depends on a single training case. As such, it's well suited for variable length sequences or small batches. In Layer Normalization, all hidden units in a layer share the same normalization terms, which are computed over the units of that layer for a single example. The authors show through experiments that Layer Normalization converges faster, and sometimes to better solutions, than batch- or unnormalized RNNs. Batch normalization still performs better for CNNs. |
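A minimal numpy sketch of the transform itself: unlike batch norm, the statistics are computed across the hidden units of a single example, so nothing depends on the batch (gain and bias are the usual learned per-unit parameters):

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    # h: (batch, hidden) pre-activations; gain, bias: (hidden,)
    mean = h.mean(axis=1, keepdims=True)   # per-example mean over hidden units
    var = h.var(axis=1, keepdims=True)     # per-example variance over hidden units
    return gain * (h - mean) / np.sqrt(var + eps) + bias
```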
[link]
TLDR; The authors propose a new "Hierarchical Multiscale RNN" (HM-RNN) architecture. This models explicitly learns both temporal and hierarchical (character -> word -> phrase -> ...) representations without needing to be told what the structure or timescale of the hierarchy is. This is done by adding binary boundary detectors at each layer. These detectors activate based on whether the segment in a certain layer is finished or not. Based on the activation of these boundary detectors information is then propagated to neighboring layers. Because this model involves discrete decision making based on binary outputs it is trained using a straight-through estimator. The authors evaluate the model on Language Modeling and Handwriting Sequence Generation tasks, where it outperforms competing models. Qualitatively the authors show that the network learns meaningful boundaries (e.g. spaces) without being needing to be told about them. ### Key Points - Learning both hierarchical and temporal representations at the same time is a challenge for RNNs - Observation: High-level abstractions (e.g. paragraphs) change slowly, but low-level abstractions (e.g. words) change quickly. These should be updated at different timescales. - Benefits of HN-RNN: (1) Computational Efficiency (2) Efficient long-term dependency propagation (vanishing gradients) (3) Efficient resource allocation, e.g. higher layers can have more neurons - Binary boundary detector at each layer is turned on if the segment of the corresponding layer abstraction (char, word, sentence, etc) is finished. - Three operations based on boundary detector state: UPDATE, COPY, FLUSH - UPDATE Op: Standard LSTM update. This happens when the current segment is not finished, but the segment one layer below is finished. - COPY Op: Copies previous memory cell. Happens when neither the current segment nor the segment one layer below is finished. Basically, this waits for the lower-level representation to be "done". - FLUSH Op: Flushes state to layer above and resets the state to start a new segment. Happens when the segment of this layer is finished. - Boundary detector is binarized using a step function. This is non-differentiable and training is done with a straight-through estimator that estimates the gradient using a similar hard sigmoid function. - Slope annealing trick: Gradually increase the slop of the hard sigmoid function for the boundary estimation to make it closer to a discrete step function over time. Needed to be SOTA. - Language Modeling on PTB: Beats state of the art, but not by much. - Language Modeling on other data: Beats or matches state of the art. - Handwriting Sequence Generation: Beats Standard LSTM ### My Notes - I think the ideas in this paper are very important, but I am somewhat disappointed by the results. The model is significantly more complex with more knobs to tune than competing models (e.g. a simple batch-normalized LSTM). However, it just barely beats those simpler models by adding new "tricks" like slope annealing. For example, the slope annealing schedule with a `0.04` constant looks very suspicious. - I don't know much about Handwriting Sequence Generation, but I don't see any comparisons to state of the art models. Why only compare to a standard LSTM? - The main argument is that the network can dynamically learn hierarchical representations and timescales. However, the number of layers implicitly restricts how many hierarchical representations the network can and cannot learn. 
So, there still is a hyperparameter involved here that needs to be set by hand. - One claim is that the model learns boundary information (spaces) without being told about it. That's true, but I'm not convinced that's as novel as the authors make it out to be. I'm pretty sure that a standard LSTM (perhaps with extra skip connections) will learn the same and that it's possible to tease these boundaries out of the LSTM parameter matrices. - Could be interesting to apply this to CJK languages where boundaries and hierarchical representations are more apparent. - The authors claim that "computational efficiency" is one of the main benefits of this model because higher-level representations need to be updated less frequently. However, there are no experiments to verify this claim. Obviously this is true in theory, but I can imagine that in practice this model is actually slower to train. Also, what about convergence time? |
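A PyTorch sketch of the binarized boundary detector with the straight-through estimator and slope annealing, assuming the hard sigmoid form `clip((slope * x + 1) / 2, 0, 1)`; everything else here is an illustrative simplification of the paper's gating:

```python
import torch

def hard_sigmoid(x, slope):
    return torch.clamp((slope * x + 1.0) / 2.0, min=0.0, max=1.0)

def boundary_detector(z_pre, slope):
    z_soft = hard_sigmoid(z_pre, slope)     # differentiable surrogate; slope is annealed upwards over training
    z_hard = (z_soft > 0.5).float()         # binary boundary decision (step function)
    # straight-through estimator: forward pass uses z_hard, gradients flow through z_soft
    return z_hard + z_soft - z_soft.detach()
```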
[link]
TLDR; The authors train a multilingual Neural Machine Translation (NMT) system based on the Google NMT architecture by prepending a special `2[lang]` (e.g. `2fr`) token to the input sequence to specify the target language. They empirically evaluate model performance on many-to-one, one-to-many and many-to-many translation tasks and demonstrate evidence for shared representations (interlingua). |
[link]
TLDR; The authors present an RNN-based variational autoencoder that can learn a latent sentence representation while learning to decode. A linear layer that predicts the parameters of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence with the prior (Gaussian) - just as in the "standard" VAE. The authors evaluate the model on Language Modeling and Imputation (inserting missing words) tasks and also present a qualitative analysis of the latent space. #### Key Points - Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero. - Training Trick 1: KL Cost Annealing. During training, increase the weight on the KL term of the cost to anneal from a vanilla autoencoder to a VAE. - Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation. - Results on Language Modeling: The standard model (without cost annealing and word dropout) trails the vanilla RNNLM model, but not by much. The KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously) - Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNLM has nothing to condition on and relies on the unigram distribution for the first token. - Qualitative: Can use higher word dropout to get more diverse sentences - Qualitative: Can walk the latent space and get grammatical and meaningful sentences. |
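A small sketch of the two training tricks, with a hypothetical linear annealing schedule and a word-dropout helper (names and the schedule are illustrative, not taken from the paper):

```python
import numpy as np

def kl_weight(step, warmup_steps=10000):
    # Trick 1: anneal the weight on the KL term from 0 to 1 over training
    return min(1.0, step / warmup_steps)

def word_dropout(token_ids, keep_rate, unk_id):
    # Trick 2: randomly replace decoder inputs with UNK so the decoder must rely on the latent code
    token_ids = np.asarray(token_ids)
    keep = np.random.rand(len(token_ids)) < keep_rate
    return np.where(keep, token_ids, unk_id)

# per-example loss at a given step: reconstruction_nll + kl_weight(step) * kl_divergence
```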
[link]
TLDR; The authors propose "fast weights", a type of attention mechanism to the recent past that performs multiple steps of computation between each hidden state computation step in an RNN. The authors evaluate their architecture on various tasks that require short-term memory, arguing that the fast weights mechanism frees up the RNN from memorizing sthings in the hidden state which is freed up for other types of computation. ### Key Points - Currently, RNNs have slow-changing long-term memory (Permanent Weights) and fast changing short-term memory (hidden state). We want something in the middle: Fast weights with higher storage capacity. - For each transition in the RNN, multiple transitions can be made by the fast weights. They are a kind of attention mechanism to the recent past that is not parameterized separately but depends on the past states. - Fast weights are decayed over time and based on the outer product of previous hidden states: `A(t+1) = lambdaA(t) + eta*h(t)h(t)^T`. - The next hidden state of the RNN is computed by a regular transition based on input adn previous state combined by an "inner loop" of S steps of the fast weights. - "At each iteration of the inner loop the fast weight matrix A is eqivalent to attending to past hidden vectors in proportion to their scalar product with the current hidden state, weighted by a decay factor" - And this is efficient to compute. - Added Layer Normalization to fast weights to prevent exploding/vanishign gradients. - Associative Retrieval Toy Task: Memorize recent key-value pairs. Fast weights siginifcantly outperform RNN, LSTM and Associative LSTM. - Visual Attention on MNIST: Beats RNN/LSTM and is comparable to CovnNet for large number of features. - Agents with Memory: Fast Weight net learns faster in a partially obseverable environment where the networks must remember the previous states. ### Thoughts -Overall I think this is very exciting work. It kind of reminds me of Adaptive Computation Time where you dynamically decide how many steps to "ponder" before making another outputs. However, it is also quite different in that this work explicitly "attends" over past states and isn't really about computation time. - In the experiments the authors say they set S=1 (i.e. just one inner loop step). Why is that? I thought one of the more important points of fast weights would be to have additional computation betwene each slow step. This also raises the question of how to pick this hyperparameter. - A lot of references to Machine Translation models with attention but not NLP experiments. |
[link]
TLDR; The authors prorpose the "EpiReader" model for Question Answering / Machine Comprehension. The model consists of two modules: An Extractor that selects answer candidates (single words) using a Pointer network, and a Reasoner that rank these candidates by estimating textual entailment. The model is trained end-to-end and works on cloze-style questions. The authors evaluate the model on CBT and CNN datasets where they beat Attention Sum Reader and MemNN architectures. #### Notes - In most architectures, the correct answer is among the top5 candidates 95% of the time. - Soft Attention is a problem in many architectures. Need a way to do hard attention. |
[link]
TLDR; The authors present an end-to-end dialog system that consists of an LSTM, action templates, an entity extraction system, and custom code for declaring business rules. They test the system on a toy task where the goal is to call a person from an address book. They train the system on 21 dialogs using Supervised Learning, and then optimize it using Reinforcement Learning, achieving 70% task completion rates. #### Key Points - Task: User asks to call a person. Action: Find the person in the address book and place the call - 21 example dialogs - Several hundred lines of Python code to block certain actions - External entity recognition API - Hand-crafted features as input to the LSTM. Hand-crafted action templates. - The RNN maps from a sequence to an action template. First pre-train the LSTM to reproduce dialogs using Supervised Learning, then train using RL / policy gradients - The system doesn't generate text, it picks a template #### Notes - I wonder how well the system would generalize to a task that has a larger action space and more varied conversations. The 21 provided dialogs cover a lot of the task space already. Much harder to do that in larger spaces. - I wouldn't call this approach end-to-end ;) |
[link]
TLDR; The authors finetune FR -> EN and EN -> FR NMT models using an RL-based dual game. 1. Pick a French sentence from a monolingual corpus and translate it to EN. 2. Use an EN language model to get a reward for the translation. 3. Translate the translation back into FR using the EN -> FR system. 4. Get a reward based on the consistency between the original and reconstructed sentence. Training this architecture using Policy Gradient, the authors can make efficient use of monolingual data and show that a system trained on only 10% of the parallel data and finetuned with monolingual data achieves BLEU scores comparable to a system trained on the full set of parallel data. ### Key Points - Making efficient use of monolingual data to improve NMT systems is a challenge - Two-agent communication game: Agent A only knows language A and agent B only knows language B. A sends a message through a noisy translation channel, B receives the message, checks its correctness, and sends it back through another noisy translation channel. A checks if it is consistent with the original message. The translation channels are then improved based on the feedback. - Pieces required: LanguageModel(A), LanguageModel(B), TranslationModel(A->B), TranslationModel(B->A). Monolingual data. - Total reward is a linear combination of: `r1 = LM(translated_message)`, `r2 = log(P(original_message | translated_message))` - Samples are based on beam search, using the average value as the gradient approximation - EN -> FR pretrained on 100% of parallel data: 29.92 to 32.06 BLEU - EN -> FR pretrained on 10% of parallel data: 25.73 to 28.73 BLEU - FR -> EN pretrained on 100% of parallel data: 27.49 to 29.78 BLEU - FR -> EN pretrained on 10% of parallel data: 22.27 to 27.50 BLEU ### Some Notes - I think the idea is very interesting and we'll see a lot of related work coming out of this. It would be even more amazing if the architecture were trained from scratch using monolingual data only. Due to the high variance of RL methods this is probably quite hard to do though. - I think the key issue is that the rewards are quite noisy, as is the case with MT in general. Neither the language model nor the BLEU score gives good feedback for the "correctness" of a translation. - I wonder why there is such a huge jump in BLEU scores for FR->EN on 10% of data, but not for EN->FR on the same amount of data. |
[link]
TLDR; The authors train a DQN on text-based games. The main difference is that their Q-value function embeds the state (textual context) and action (text-based choice) separately and then takes the dot product between them. The authors call this a Deep Reinforcement Relevance Network (DRRN). Basically, it's just a different Q function implementation. Empirically, the authors show that their network can learn to solve the "Saving John" and "Machine of Death" text games. |
[link]
TLDR; The authors propose a new Diverse Beam Search (DBS) decoding procedure that produces more diverse responses than standard Beam Search (BS). The authors divide the beam of size B into G groups of size B/G. At each step they perform beam search for each group with an added similarity penalty (with scaling factor lambda) that encourages groups to pick different outputs. This procedure is done greedily, i.e. group 1 does regular BS, group 2 is conditioned on group 1, group 3 is conditioned on groups 1 and 2, and so on. Similarity functions include Hamming distance, cumulative diversity, n-gram diversity and neural embedding diversity. Hamming distance tends to perform best. The authors evaluate their model on Image Captioning (COCO, PASCAL-50S), Machine Translation (europarl) and Visual Question Generation. For Image Captioning the authors perform a human evaluation (1000 examples on Mechanical Turk) and find that DBS is preferred over BS 60% of the time. |
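A small sketch of the Hamming-distance penalty for a single decoding step: candidates in later groups are penalized for tokens already emitted at this time step by earlier groups (a pure illustration, not the authors' implementation):

```python
from collections import Counter

def diversity_adjusted_scores(token_logprobs, earlier_group_tokens, lam=0.5):
    # token_logprobs: dict token -> log-prob for the current group at this time step
    # earlier_group_tokens: tokens chosen at this time step by previously processed groups
    counts = Counter(earlier_group_tokens)
    return {tok: lp - lam * counts[tok] for tok, lp in token_logprobs.items()}
```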
[link]
TLDR; The authors encourage exploration by adding a pseudo-reward of the form $\frac{\beta}{\sqrt{count(state)}}$ for infrequently visited states. State visits are counted using Locality Sensitive Hashing (LSH) based on an environment-specific feature representation like raw pixels or autoencoder representations. The authors show that this simple technique achieves gains in various classic RL control tasks and several games in the ATARI domain. While the algorithm itself is simple, there are now several more hyperparameters to tune: the bonus coefficient `beta`, the LSH hashing granularity (how many bits to use for hashing), as well as the type of feature representation the hash is computed from, which itself may have more parameters. The experiments don't paint a consistent picture and different environments seem to need vastly different hyperparameter settings, which in my opinion will make this technique difficult to use in practice. |
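A numpy sketch of the count-based bonus with SimHash-style LSH: project the state features with a fixed random matrix, keep the signs as a k-bit code, count visits per code, and hand out `beta / sqrt(count)` as a pseudo-reward (all names are illustrative):

```python
import numpy as np
from collections import defaultdict

class HashingBonus:
    def __init__(self, feature_dim, n_bits=32, beta=0.01, seed=0):
        rng = np.random.RandomState(seed)
        self.A = rng.randn(n_bits, feature_dim)  # fixed random projection (SimHash)
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, features):
        code = tuple((self.A @ features > 0).astype(int))  # k-bit hash of the state features
        self.counts[code] += 1
        return self.beta / np.sqrt(self.counts[code])      # exploration bonus for rare states
```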
[link]
TLDR; The authors introduce CopyNet, a variation of the seq2seq model that incorporates a "copying mechanism". With this mechanism, the effective vocabulary is the union of the standard vocab and the words in the current source sentence. CopyNet predicts words based on a mixed probability of the standard attention mechanism and a new copy mechanism. The authors show empirically that on toy and summarization tasks CopyNet behaves as expected: The decoder is dominated by copy mode when it tries to replicate something from the source. |
[link]
TLDR; The authors propose a character-level Neural Machine Translation (NMT) architecture. The encoder is a convolutional network with max-pooling and highway layers that reduces the size of the source representation. It does not use explicit segmentation. The decoder is a standard RNN. The authors apply their model to WMT'15 DE-EN, CS-EN, FI-EN and RU-EN data in bilingual and multilingual settings. They find that their model is competitive in bilingual settings and significantly outperforms competing models in the multilingual setting with a shared encoder. #### Key Points - Challenge: Applying standard seq2seq models to characters is hard because the representation is too long. Attention network complexity grows quadratically with sequence length. - Word-level models are unable to model rare and out-of-vocab tokens, and softmax complexity grows with vocabulary size. - Character-level models are more flexible: No need for explicit segmentation, can model morphological variants, multilingual without increasing model size. - Reducing the length of the source sentence is key to fast training in char models. - Encoder Network: Embedding -> Conv -> Maxpool -> Highway -> Bidirectional GRU - Attention Network: Single Layer - Decoder: Two-Layer GRU - Multilingual setting: Language examples are balanced within each batch. No language identifier is provided to the encoder - Bilingual Results: char2char performs as well as or better than bpe2char or bpe2bpe - Multilingual Results: char2char outperforms bpe2char - The trained model is robust to spelling mistakes and unseen morphologies - Training time: Single Titan X training time for the bilingual model is ~2 weeks. ~2.5 updates per second with batch size 64. #### Notes - I wonder if you can extract segmentation info from the network post training. |
[link]
TLDR; The authors present a novel Attention-over-Attention (AoA) model for Machine Comprehension. Given a document and a cloze-style question, the model predicts a single-word answer. The model 1. Embeds both document and query using a bidirectional GRU 2. Computes a pairwise matching matrix between document and query words 3. Computes query-to-document attention values 4. Computes document-to-query attention averages for each query word 5. Multiplies the two attention vectors to get final attention scores for words in the document 6. Maps attention results back into the vocabulary space. The authors evaluate the model on the CNN News and CBTest Question Answering datasets, obtaining state-of-the-art results and beating other models including EpiReader, ASReader, etc. #### Notes: - Very good model visualization in the paper - I like that this model is much simpler than EpiReader while also performing better |
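A numpy sketch of steps 2-5 (the attention-over-attention computation itself), assuming the bidirectional GRU states for document and query are already given:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(doc_states, query_states):
    # doc_states: (D, h), query_states: (Q, h)
    M = doc_states @ query_states.T   # (D, Q) pairwise matching matrix
    alpha = softmax(M, axis=0)        # query-to-document attention (one column per query word)
    beta = softmax(M, axis=1)         # document-to-query attention (one row per document word)
    beta_avg = beta.mean(axis=0)      # (Q,) averaged document-to-query attention
    return alpha @ beta_avg           # (D,) final attention score per document position
```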
[link]
TLDR; The authors propose Associative LSTMs, a combination of external memory based on Holographic Reduced Representations and LSTMs. The memory provides noisy key-value lookup based on matrix multiplications without introducing additional parameters. The authors evaluate their model on various sequence copying and memorization tasks, where it outperforms vanilla LSTMs and competing models with a similar number of parameters. #### Key Points - Two limitations of LSTMs: 1. N cells require NxN weight matrices. 2. They lack a mechanism to index memory - The idea of the memory comes from "Holographic Reduced Representations" (Plate, 2003), but the authors add multiple redundant memory copies to reduce noise. - More copies of memory => Less noise during retrieval - In the LSTM update equations, input and output keys to the memory are computed - Compared to: LSTM, Permutation LSTM, Unitary LSTM, Multiplicative Unitary LSTM - Tasks: Episodic copy, XML modeling, variable assignment, arithmetic, sequence prediction #### Notes - Only a brief comparison with Neural Turing Machines in the appendix. Probably NTMs outperform this and are simpler. No comparison with attention mechanisms, memory networks, etc. Why? - It's surprising to me that deep LSTMs without any bells and whistles actually perform pretty well on many of the tasks. Is the additional complexity really worth it? |
[link]
TLDR; The authors apply adversarial training on labeled data and virtual adversarial training on unlabeled data to the embeddings in text classification tasks. Their models, which are straightforward LSTM architectures, either match or surpass the current state of the art on several text classification tasks. The authors also show that the embeddings learned using adversarial training tend to be tuned better to the corresponding classification task. #### Key Points - In image classification we can apply adversarial training directly to the inputs. In text classification the inputs are discrete and we cannot make small perturbations, but we can instead apply adversarial training to the embeddings. - Trick: To prevent the model from making the perturbations irrelevant by learning embeddings with large norms: Use normalized embeddings. - Adversarial Training (on labeled examples) - At each step of training, identify the "worst" (in terms of cost) perturbation `r_adv` to the embeddings within a given norm constraint epsilon, which is a hyperparameter. Train on that. In practice `r_adv` is estimated using a linear approximation. - Add an `L_adv` adversarial loss term to the cost function. - Virtual Adversarial Training (on unlabeled examples) - Minimize the KL divergence between the outputs of the model given the regular and perturbed example as inputs. - Add an `L_vad` loss to the cost function. - Common misconception: Adversarial training is equivalent to training on noisy examples, but it actually is a stronger regularizer because it explicitly increases the cost. - Model Architectures: - (1) Unidirectional LSTM with the prediction made at the last step - (2) Bidirectional LSTM with predictions based on the concatenated last outputs - Experiments/Results - Pre-Training: For all experiments a 1-layer LSTM language model is pre-trained on all labeled and unlabeled examples and used to initialize the classification LSTM. - Baseline Model: Only embedding dropout and pretraining - IMDB: Training curves show that adversarial training acts as a good regularizer and prevents overfitting. VAT matches state of the art using a unidirectional LSTM only. - IMDB embeddings: The baseline model places "good" close to "bad" in embedding space. Adv. training ensures that small perturbations in embeddings don't change the sentiment classification result, so these two words become properly separated. - Amazon Reviews and RCV1: Adv. + Vadv. achieve state of the art. - Rotten Tomatoes: Adv. + Vadv. achieve state of the art. Because unlabeled data overwhelms labeled data, vadv. training results in a decrease in performance. - DBPedia: Even the baseline outperforms state of the art (better optimizer?), adversarial training improves on that. ### Thoughts - I think this is a very well-written paper with impressive results. The only thing that's lacking is a bit of consistency. Sometimes pure virtual adversarial training wins, and sometimes adversarial + virtual adversarial wins. Sometimes bi-LSTMs make things worse, sometimes better. What is the story behind that? Do we really need to try all combinations to figure out what works for a given dataset? - Not a big deal, but a few bi-LSTM experiments seem to be missing. This always makes me wonder if they are "missing for a reason" or not ;) - There are quite a few differences in hyperparameters and batch sizes between datasets. I wonder why. Is this to stay consistent with the models they compare to? Were these parameters optimized on a validation set (the authors say only dropout and epsilon were optimized)?
- If adversarial training is a stronger regularizer than random perturbations, I wonder if we still need dropout in the embeddings. Shouldn't adversarial training take care of that? |
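A PyTorch sketch of the adversarial term on labeled data: take the gradient of the loss with respect to the (normalized) embeddings, move `epsilon` along its normalized direction (the linear approximation of the worst-case perturbation), and compute the loss on the perturbed embeddings; the model interface is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, embeds, labels, epsilon):
    # embeds: (batch, seq_len, dim) normalized word embeddings with requires_grad=True
    loss = F.cross_entropy(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds, retain_graph=True)
    norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1) + 1e-12
    r_adv = epsilon * grad / norm                           # linear approximation of the worst perturbation
    return F.cross_entropy(model(embeds + r_adv.detach()), labels)  # L_adv, added to the regular loss
```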
[link]
TLDR; The authors propose to use the Actor-Critic framework from Reinforcement Learning for sequence prediction. They train an actor (policy) network to generate a sequence together with a critic (value) network that estimates the q-value function. Crucially, the actor network does not see the ground-truth output, but the critic does. This is different from LL (log likelihood) training, where errors are likely to cascade. The authors evaluate their framework on an artificial spelling correction task and a real-world German-English Machine Translation task, beating baselines and competing approaches in both cases. #### Key Points - In LL training, the model is conditioned on its own guesses during search, leading to error compounding. - The critic is allowed to see the ground truth, but the actor isn't - The reward is a task-specific score, e.g. BLEU - Use a bidirectional RNN for both actor and critic. The actor uses a soft attention mechanism. - The reward is partially received at each intermediate step, not just at the end - The framework is analogous to TD-Learning in RL - Trick: Use an additional target network to compute q_t (see Deep-Q paper) for stability - Trick: Use a delayed actor (as in the Deep Q paper) for stability - Trick: Put a constraint on the critic to deal with large action spaces (is this analogous to advantage functions?) - Pre-train actor and critic to encourage exploration of the right space - Task 1: Correct a corrupted character sequence. AC outperforms LL training. Longer sequences lead to a stronger lift. - Task 2: GER-ENG Machine Translation: Beats LL and Reinforce models - Qualitatively, the critic assigns high values to words that make sense - BLEU scores during training are lower than those of the LL model - Why? Strong regularization? Can't overfit the training data. #### Notes - Why does the sequence length for spelling correction only go up to 30? This seems very short to me and something that an LSTM should be able to handle quite easily. Would've liked to see much longer sequences. |
[link]
TLDR; The paper presents Adaptive Computation Time (ACT), an algorithm that allows RNNs to adaptively decide how much computation to expend per time step, also called "pondering". To prevent the network from computing indefinitely, an extra term that encourages shorter computation is added to the cost. The architecture is fully differentiable and applicable to any type of RNN (e.g. LSTMs). The authors evaluate ACT on the tasks of Parity, Logic, Addition, Sorting and character prediction. An interesting observation is that the number of pondering steps seems to predict "boundaries" in the data. |
[link]
TLDR; The authors augment the A3C (Asynchronous Advantage Actor Critic) algorithm with auxiliary tasks. These tasks share some of the network parameters but value functions for them are learned off-policy using n-step Q-Learning. The auxiliary tasks are only used to learn a better representation and don't directly influence the main policy control. The technique, called UNREAL (Unsupervised Reinforcement and Auxiliary Learning), outperforms A3C on both the Atari and Labyrinth domains in terms of performance and training efficiency. #### Key Points - Environments contain a wide variety of possible training signals, not just cumulative reward - The base A3C agent uses a CNN + RNN - Auxiliary control and prediction tasks share the convolutional and LSTM networks of the base agent. This forces the agent to balance improvement on the base and aux. tasks. - Auxiliary Tasks - Use off-policy RL algorithms (e.g. n-step Q-Learning) so that the same stream of experience from the base agent can be used for maximizing all tasks. Experience is sampled from a replay buffer. - Pixel Changes (Auxiliary Control): Learn a policy for maximally changing the pixels in a grid of cells overlaid over the images - Network Features (Auxiliary Control): Learn a policy for maximally activating units in a specific hidden layer - Reward Prediction (Auxiliary Reward): Predict the next reward given some historical context. Crucially, because rewards tend to be sparse, histories are sampled in a skewed manner from the replay buffer so that P(r!=0) = 0.5. Convolutional features are shared with the base agent. - Value Function Replay: Value function regression for the base agent with a varying window for n-step returns. - UNREAL - The base agent is optimized on-policy (A3C) and the aux. tasks are optimized off-policy. - Experiments - The agent is trained with 20-step returns and aux. tasks are performed every 20 steps. - The replay buffer stores the most recent 2k observations, actions and rewards - UNREAL tends to be more robust to hyperparameter settings than A3C - Labyrinth - 38% -> 83% human-normalized score. Each aux. task independently adds to the performance. - Significantly faster learning, 11x across all levels - Compared to an input reconstruction technique: Input reconstruction hurts final performance b/c it puts too much focus on reconstructing irrelevant parts of the visual input. - Atari - Not all experiments are completed yet, but UNREAL already surpasses state of the art agents and is more robust. #### Thoughts - I want an algorithm box please :) |
[link]
TLDR; The authors adapt Generative Adversarial Networks (GANs) to RNNs and train a discriminator to distinguish between sequences generated using teacher forcing (feeding ground-truth inputs to the RNN) and free-running sampling (feeding generated outputs as the next inputs). The inputs to the discriminator are both the predictions and the hidden states of the generative RNN. The generator is trained to fool the discriminator, forcing the dynamics of teacher forcing and free-running generation to become more similar. This procedure acts as a regularizer and results in better sample quality and generalization, particularly for long sequences. The authors evaluate their framework on Language Modeling (PTB), Pixel Generation (Sequential MNIST), Handwriting Generation, and Music Synthesis. ### Key Points - Problem: During inference, errors in an RNN easily compound because the conditioning context may diverge from what is seen during training when the ground-truth labels are fed as inputs (teacher forcing). - Goal of professor forcing: Make the generative (free-run) behavior and the teacher-forced behavior match as closely as possible. - Discriminator Details - Input is a behavior sequence `B(x, y, theta)` from the generative RNN that contains the hidden states and outputs. - The training objective is to correctly classify whether or not a behavior sequence is generated using teacher forcing vs. free-running sampling. - Generator - Standard RNN with MLE training objective and an additional term to fool the discriminator: Change the free-running behavior so as to match the teacher-forced behavior while keeping the latter constant. - Optional second term: Change the teacher-forced behavior to match the free-running behavior. - Like GAN, backprop from the discriminator into the generator. - Architectures - The generator is a standard GRU Recurrent Neural Network with softmax - The behavior function `B(x, y, theta)` outputs the pre-tanh activation of the GRU states and the softmax output - Discriminator: Bidirectional GRU with a 3-layer MLP on top - Training trick: To prevent "bad gradients" the authors backprop from the discriminator into the generator only if the classification accuracy is between 75% and 99%. - Trained using the Adam optimizer - Experiments - PTB Character-Level Modeling: Reduction in test NLL, professor forcing seems to act as a regularizer. 1.48 BPC - Sequential MNIST: Second-best NLL (79.58) after PixelCNN - Handwriting generation: Professor forcing is better at generating sequences longer than those seen during training, as per human eval. - Music Synthesis: Human eval significantly better for professor forcing - Negative results on word-level modeling: Professor forcing doesn't have any effect. Perhaps because long-term dependencies are more pronounced in character-level modeling. - The authors show using t-SNE that the hidden state distributions actually become more similar when using professor forcing ### Thoughts - Props to the authors for a very clear and well-written paper. This is rarer than it should be :) - It's an interesting idea to also match the states of the RNN instead of just the outputs. Intuitively, matching the outputs should implicitly match the state distribution. I wonder if the authors tried this and it didn't work as expected.
- Note from [Ethan Caballero](https://github.com/ethancaballero) about why they chose to match hidden states: It's significantly harder to use GANs on sampled (argmax) output tokens because they are discrete (as opposed to continuous like the hidden states and their respective softmaxes). They would have had to estimate discrete outputs with policy gradients like in [seqGAN](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/seq-gan.md), which is [harder to get to converge](https://www.quora.com/Do-you-have-any-ideas-on-how-to-get-GANs-to-work-with-text), which is why they probably just stuck with the hidden states, which already contain info about the discrete sampled outputs (the index of the highest probability in the distribution) anyway. The Professor Forcing method is unique in that one has access to the continuous probability distribution of each token at each timestep of the two sequence generation modes being pushed closer together. Conversely, when applying GANs to pushing real samples and generated samples closer together as is traditionally done in models like seqGAN, one only has access to the next discrete token (not the continuous probability distribution of the next token) at each timestep, which prevents straightforward differentiation (used in professor forcing) from being applied and forces one to use policy gradient estimation. However, there's a chance one might be able to use straightforward differentiation to train seqGANs in the traditional sampling case if one swaps out each discrete sampled token with its continuous distributional word embedding (from pretrained word2vec, GloVe, etc.), but no one has tried it yet TTBOMK. - I would've liked to see a comparison of the two regularization terms in the generator. The experiments don't make it clear if both or only one of them is used. - I'm guessing that this architecture is quite challenging to train. Would've liked to see a bit more detail about when/how they trade off the training of discriminator and generator. - Translation is another obvious task to apply this to. I'm interested in whether or not this works for seq2seq. |
[link]
TLDR; The authors train a Generative Adversarial Network where the generator is an RNN producing discrete tokens. The discriminator is used to provide a reward for each generated sequence (episode) and to train the generator network via Policy Gradients. The discriminator network is a CNN in the experiments. The authors evaluate their model on a synthetic language modeling task and 3 real-world tasks: Chinese poem generation, speech generation and music generation. SeqGAN outperforms competing approaches (MLE, Scheduled Sampling, PG-BLEU) on the synthetic task and outperforms MLE on the real-world tasks based on a BLEU evaluation metric. #### Key Points - Code: https://github.com/LantaoYu/SeqGAN - RL problem setup: The state is the already generated partial sequence. The action space is the space of possible tokens to output at the current step. Each episode is a fully generated sequence of fixed length T. - Exposure bias in the Maximum Likelihood approach: During decoding the model generates the next token based on a series of previously generated tokens that it may never have seen during training, leading to compounding errors. - A discriminator can provide a reward when no task-specific reward (e.g. BLEU score) is available or when it is expensive to obtain such a reward (e.g. human eval). - The reward is provided by the discriminator at the end of each episode, i.e. when the full sequence is generated. To provide feedback at intermediate steps the rest of the sequence is sampled via Monte Carlo search. - Generator and discriminator are trained alternately and the strategy is defined by the hyperparameters g-steps (# of steps to train the generator), d-steps (number of steps to train the discriminator with newly generated data) and k (number of epochs to train the discriminator with the same set of generated data). - Synthetic task: A randomly initialized LSTM acts as an oracle for a language modeling task. 10,000 sequences of length 20. - The hyperparameters g-steps, d-steps and k have a huge impact on training stability and final model performance. Bad settings lead to a model that is barely better than the MLE baseline. #### My notes: - Great paper overall. I also really like the synthetic task idea, I think it's a neat way to compare models. - For the real-world tasks I would've liked to see a comparison to PG-BLEU as they do in the synthetic task. The authors evaluate on BLEU score so I wonder how much difference a direct optimization of the evaluation metric makes. - It seems like SeqGAN outperforms MLE significantly only on the poem generation task, not the other tasks. What about the other baselines on the other tasks? What is it about poem generation that makes SeqGAN perform so well? |
[link]
TLDR; The authors train a standard Neural Machine Translation (NMT) model (the teacher model) and distill it by having a smaller student model learn the distribution of the teacher model. They investigate three types of knowledge distillation for sequence models: 1. Word-Level Distillation 2. Sequence-Level Distillation and 3. Sequence-Level Interpolation. Experiments on WMT'14 and IWSLT 2015 show that it is possible to significantly reduce the number of parameters of the model with only a minor loss in BLEU score. The experiments also demonstrate that the distillation techniques are largely complementary. Interestingly, the perplexity of distilled models is significantly higher than that of the baselines without leading to a loss in BLEU score. ### Key Points - Knowledge Distillation: Learn a smaller student network from a larger teacher network. - Approach 1 - Word-Level KD: This is standard Knowledge Distillation applied to sequences, where we match the student's output distribution for each word to the teacher's using the cross-entropy loss. - Approach 2 - Sequence-Level KD: We want to mimic the distribution of a full sequence, not just per word. To do that we sample outputs from the teacher using beam search and then train the student on these "examples" using cross entropy. This is a very sparse approximation of the true objective. - Approach 3 - Sequence-Level Interpolation: We train the student on a mixture of training data and teacher-generated data. We could use the approximation from #2 here, but that's not ideal because it doubles the size of the training data and leads to different targets conditioned on the same source. The solution is to generate a single target that has high probability under the teacher model and is similar to the ground truth, and then have both mixture terms use it. - Greedy decoding with a seq-level fine-tuned model behaves similarly to beam search on the original model. - Hypothesis: KD allows the student to only model the mode of the teacher distribution, not wasting parameters on the rest. Experiments show good evidence of this. Thus, greedy decoding has an easier time finding the true max, whereas beam search was necessary to do that previously. - Lower perplexity does not lead to better BLEU. Distilled models have significantly higher perplexity (22.7 vs 8.2) but better BLEU (+4.2). |
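A PyTorch sketch of the word-level distillation term (approach 1): the student is trained with cross-entropy against the teacher's per-word distribution, optionally mixed with the regular NLL on the gold targets; the mixing weight `alpha` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, targets, alpha=0.5):
    # logits: (batch * seq_len, vocab); targets: (batch * seq_len,) gold token ids
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    kd = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    nll = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * nll
```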
[link]
TLDR; The authors apply a [WaveNet](https://arxiv.org/abs/1609.03499)-like architecture to the task of Machine Translation. Encoder ("Source Network") and Decoder ("Target Network") are CNNs that use Dilated Convolutions and they are stacked on top of each other. The Target Network uses [Masked Convolutions](https://arxiv.org/abs/1606.05328) to ensure that it only relies on information from the past. Crucially, the time complexity of the network is `O(|S| + |T|)`, which is cheaper than that of the common seq2seq attention architecture (`O(|S|*|T|)`). Through dilated convolutions the network has constant path lengths between [source input -> target output] and [target input -> target output] nodes. This allows for efficient propagation of gradients. The authors evaluate their model on Character-Level Language Modeling and Character-Level Machine Translation (WMT EN-DE) and achieve state-of-the-art on the former and a competitive BLEU score on the latter. ### Key Points - Problems with current approaches - Runtime is not linear in the length of the source/target sequence. E.g. seq2seq with attention is `O(|S|*|T|)`. - Some architectures compress the source into a fixed-length "thought vector", putting a memorization burden on the model. - RNNs are hard to parallelize - ByteNet: Stacked network of encoder/decoder. In this work the authors use CNNs, but the networks could be RNNs. - ByteNet properties: - Resolution preserving: The representation of the source sequence is linear in the length of the source. Thus, a longer source sentence will have a bigger representation. - Runtime is linear in the length of the source and target sequences: `O(|S| + |T|)` - The source network can be run in parallel, it's a CNN. - The distance (number of hops) between nodes in the network is short, allowing for efficient backprop. - Architecture Details - Dynamic Unfolding: `representation_t(source)` is fed into time step `t` of the target network. Anything past the source sequence length is zero-padded. This is possible due to the resolution preserving property which ensures that the source representation is the same width as the source input. - Masked 1D Convolutions: The target network uses masked convolutions to prevent it from looking at the future during training. - Dilation: Dilated Convolutions increase the receptive field exponentially in higher layers. This leads to short connection paths for efficient backprop. - Each layer is wrapped in a residual block, either with ReLUs or multiplicative units (depending on the task). - Sub-Batch Normalization: To prevent the target network from conditioning on future tokens (similar to masked convolutions) a new variant of Batch Normalization is used. - Recurrent ByteNets, i.e. ByteNets with RNNs instead of CNNs, are possible but are not evaluated. - Architecture Comparison: Table 1 is great. It compares various enc-dec architectures across runtime, resolution preserving and path length properties. - Character Prediction Experiments: - [Hutter Prize Version of Wikipedia](http://prize.hutter1.net/): ~90M characters - Sample a batch of 515 characters and predict the latter 200 from the first 315 - New SOTA: 1.33 NLL (bits/character) - Character-Level Machine Translation - [WMT](http://www.statmt.org/wmt16/translation-task.html) EN-DE. Vocab size ~140 - Bags of character n-grams as additional embeddings - Examples are bucketed according to length - BLEU: 18.9.
Current state of the art is ~22.8 and a standard attention enc-dec gets 20.6 ### Thoughts - Overall I think this is a very interesting contribution. The ideas here are pretty much identical to the [WaveNet](https://arxiv.org/abs/1609.03499) + [PixelCNN](https://arxiv.org/abs/1606.05328) papers. This paper doesn't have much detail on any of the techniques, no equations whatsoever. Implementing the ByteNet architecture based on the paper alone would be very challenging. The fact that there's no code release makes this worse. - One of the main arguments is the linear runtime of the ByteNet model. I would've liked to see experiments that compare implementations in frameworks like Tensorflow to standard seq2seq implementations. What is the speedup in *practice*, and how does it scale with increased parallelism? Theory is good and all, but I want to know how fast I can train with this. - Through dynamic unfolding, target inputs at time t depend directly on the source representation at time t. This makes sense for language pairs that are well aligned (e.g. English/German), but it may hurt performance for pairs that are not aligned, since the path length would be longer. Standard attention seq2seq on the other hand always has a fixed path length of 1. Experiments on this would've been nice. - I wonder how much difference the "bag of character n-grams" made in the MT experiments. Is this used by the other baselines? |
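A PyTorch sketch of the basic building block of the target network, a masked (causal) dilated 1D convolution: left-padding by `(kernel_size - 1) * dilation` guarantees that the output at step t only depends on inputs up to t (a sketch of the idea, not the full ByteNet residual stack):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # pad only on the left -> causal/masked
        return self.conv(x)
```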
[link]
TLDR; The authors introduce Batch Normalization, a technique to normalize unit activations to zero mean and unit variance within the network. The authors show that in Feedforward and Convolutional Networks, Batch Normalization leads to faster training and better accuracies. BN also acts as a regularizer, reducing the need for Dropout, etc. Using an ensemble of batch normalized networks the authors achieve state of the art on ILSVRC. #### Key Points - Network training is complicated because the input distributions to higher layers change as the parameters in lower layers change: Internal Covariate Shift. Solution: Normalize within the network. - BN: Normalize the input to the nonlinearity to have zero mean and unit variance. Then add two additional parameters (scaling and bias) per unit to preserve the expressiveness of the network. Statistics are calculated per minibatch. - Network parameters increase, but not by much: 2 parameters per unit that has batch normalization applied to it. - Works well for fully connected and convolutional layers. Authors didn't try RNNs. - Changes to make when adding BN: Increase learning rate, remove/decrease dropout and l2 regularization, accelerate learning rate decay, shuffle training examples more thoroughly.
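For reference, a minimal numpy sketch of the BN transform for a fully connected layer (the names `gamma`/`beta` for the two learned parameters are mine):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a minibatch.

    x: pre-nonlinearity activations, shape (batch, units)
    gamma, beta: learned scale and shift, shape (units,)
    """
    mu = x.mean(axis=0)                    # per-unit minibatch mean
    var = x.var(axis=0)                    # per-unit minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale/shift restores expressiveness

x = np.random.randn(64, 100) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())  # approximately 0 and 1
```

At test time the minibatch statistics are replaced by population estimates accumulated during training.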
|
[link]
TLDR; The authors propose two different architectures to improve the performance of character-level RNNs. In the first architecture ("mixed") the authors condition the model on the state of a word-level RNN. In the second architecture ("cond") they condition the output classifier on character n-grams. The authors show that the proposed architectures outperform plain character-level RNNs in terms of entropy in bits per character. #### Key Points - Plain character-level RNNs need a huge hidden representation in order to model long-term dependencies, but word-level RNNs can't generalize to new vocabulary and may require a huge output vocab. - Model 1: Jointly train word-level and char-level RNNs. Interpolate the losses of the two models. - Model 2: Condition the softmax on the n-grams preceding the character, "relieving" the network of memorizing some of the sequence. - Training: Constant learning rate, reduced after each epoch in which validation accuracy decreases - The n-gram model can be applied to arbitrary data, not just characters. Authors evaluate on binary data. #### Notes / Questions - In the comparison table the authors don't show the number of parameters for the models. They compare models with the same number of hidden units, but their proposed architectures need extra parameters and computation. Unfair comparison? - People typically use LSTMs/GRUs for language modeling. Of course the proposed techniques can be applied to LSTM/GRU networks, but the experimental results may look very different. Do these architectures provide any benefit when used with LSTM/GRU character models? - Entropy in bits per character seems like somewhat of a strange evaluation metric. I don't really know what to make of it, and no intuitive explanations are given. - One argument the authors make in the paper is that character-level models can be applied to arbitrary input data (different languages, binary data, code, etc). But their mixed model is clearly very language-specific. It can't be applied to arbitrary data, and many languages don't have clear word boundaries. Similarly, n-grams may be prohibitively expensive depending on what kind of data we're working with. - The n-gram conditioned model isn't clearly explained. I *think* I understand what it does, but I'm not quite sure. No intuitive explanations of what any of the models are learning are given. |
[link]
TLDR; The authors propose an Attention with Intention (AWI) model for Conversation Modeling. AWI consists of three recurrent networks: An encoder that embeds the source sentence from the user, an intention network that models the intention of the conversation over time, and a decoder that generates responses. The authors show that the network can generate natural responses. #### Key Points - Intuition: Intention changes over the course of a conversation, e.g. communicate problem -> resolve issue -> acknowledge. - Encoder RNN: Depends on the last state of the decoder. Reads the input sequence and converts it into a fixed-length vector. - Intention RNN: Gets the encoder representation, previous intention state, and previous decoder state as input and generates a new representation of the intention. - Decoder RNN: Gets the current intention state and an attention vector over the encoder as input. Generates a new output. - The architecture is evaluated on an internal helpdesk chat dataset with 10k dialogs, 100k turns and 2M tokens. Perplexity scores and a sample conversation are reported. #### Notes/Questions - It's a pretty short paper and not sure what to make of the results. The PPL scores were not compared to alternative implementations and no other evaluations (e.g. crowdsourced as in Neural Conversational Model) are done. |
[link]
TLDR; The authors argue that the human visual cortex doesn't contain ultra-deep networks like ResNets (with 100s or 1000s of layers), but that it does contain recurrent connections. The authors then explore ResNets with weight sharing and show how they are equivalent to unrolled standard RNNs with skip connections. The authors find that ResNets with weight sharing perform almost as well as ResNets without weight sharing, while needing drastically fewer parameters. Thus, they argue that the success of ultra-deep networks may actually stem from the fact that they can approximate recurrent computations. |
[link]
TLDR; The authors explore the gap between Deep Learning methods and human learning. They argue that natural intelligence is still the best example of intelligence, so it's worth exploring. To demonstrate their points they explore two challenges: 1. Recognizing new characters and objects 2. Learning to play the game Frostbite. The authors make several arguments: - Humans have an intuitive understanding of physics and psychology (understanding goals and agents) very early on. These two types of "software" help them to learn new tasks quickly. - Humans build causal models of the world instead of just performing pattern recognition. These models allow humans to learn from far fewer examples than current Deep Learning methods. For example, AlphaGo played a billion games or so, Lee Sedol perhaps 50,000. Incorporating compositionality, learning-to-learn (transfer learning) and causality helps humans to build these models. - Humans use both model-free and model-based learning algorithms. |
[link]
TLDR; The authors build an LSTM Neural Language Model, but instead of using word embeddings as inputs, they use the per-word outputs of a character-level CNN, plus a highway layer. This architecture results in state of the art performance and significantly fewer parameters. It also seems to work well on languages with rich morphology. #### Key Points - Small Model: 15-dimensional char embeddings, filter sizes 1-6, tanh, 1-layer highway with ReLU, 2-layer LSTM with 300-dimensional cells. 5M Parameters. Hierarchical Softmax. - Large Model: 15-dimensional char embeddings, filter sizes 1-7, tanh, 2-layer highway with ReLU, 2-layer LSTM with 670-dimensional cells. 19M Parameters. Hierarchical Softmax. - Can generalize to out-of-vocabulary words due to character-level representations. Some datasets already had OOV words replaced with a special token, so the results don't reflect this. - Highway Layers are key to performance. Substituting the HW layer with an MLP does not work well. The intuition is that the HW layer adaptively combines different local features for a higher-level representation (see the sketch below). - Nearest neighbors after the Highway layer are more semantic than before the highway layer. Suggests compositional nature. - Surprisingly, combining word and char embeddings as LSTM input results in worse performance - Characters alone are sufficient? - Can apply the same architecture to NMT or Classification tasks. Highway Layers at the output may also help these tasks. #### Notes / Questions - Essentially this is a new way to learn word embeddings composed of lower-level character embeddings. Given this, what about stacking this architecture and learning sentence representations based on these embeddings? - It is not 100% clear to me why the MLP at the output layer does so much worse. I understand that the highway layer can adaptively combine features, but what if you combined MLP and plain representations and added dropout? Shouldn't that result in similar performance? - I wonder if the authors experimented with higher-dimensional character embeddings. What is the intuition behind the very low-dimensional (15) embeddings?
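A minimal numpy sketch of the highway layer idea (my own illustration; the dimensions and the negative gate-bias initialization are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(y, W_h, b_h, W_t, b_t):
    """Highway layer: out = t * g(W_h y + b_h) + (1 - t) * y.

    The transform gate t decides, per dimension, how much of the transformed
    features vs. the raw (e.g. char-CNN) features to pass on.
    """
    t = sigmoid(y @ W_t + b_t)           # transform gate
    h = np.maximum(y @ W_h + b_h, 0.0)   # ReLU transform
    return t * h + (1.0 - t) * y         # carry = 1 - t

d = 525                                  # e.g. total number of CNN filters; illustrative
y = np.random.randn(32, d)               # char-CNN outputs for 32 words
out = highway(y, np.random.randn(d, d) * 0.01, np.zeros(d),
              np.random.randn(d, d) * 0.01, np.full(d, -2.0))  # negative gate bias favors carrying
print(out.shape)  # (32, 525)
```
|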
[link]
TLDR; The authors evaluate the use of 9-layer deep CNNs on large-scale data sets for text classification, operating directly on one-hot encodings of characters. The architecture achieves competitive performance across datasets. #### Key Points - 9 Layers, 6 conv/pool layers, 3 affine layers. 1024-dimensional input features for the large model, 256-dimensional input features for the small model. - Authors optionally use an English thesaurus for training data augmentation - Fixed input length l: 1014 characters - Simple n-gram models perform very well on these data sets, beating the other models on the smaller data sets (<= 500k examples). The CNN wins on the larger data sets (>1M examples) #### Notes / Questions - Comparing the CNN with input restricted to 1014 characters to models that operate on words seems unfair. Also, how long is the average document? Would've liked to see some dataset statistics. The fixed input length doesn't make a lot of sense to me. - The contribution of this paper is that the architecture works without word knowledge and for any language, but at the same time the authors use a word-level English thesaurus to improve their performance? To be fair, the thesaurus doesn't seem to make a huge difference. - The reason this architecture requires so much data is probably because it's very deep (How many parameters?). Did the authors experiment with fewer layers? Did they perform much worse? - What about unsupervised pre-training? Can that reduce the amount of data required to achieve good performance? Currently this model doesn't seem very useful in practice as there are very few datasets of such size out there. |
[link]
TLDR; The authors propose a Contextual LSTM (CLSTM) model that appends a context vector to the input words when making predictions. The authors evaluate the model on Language Modeling, next sentence selection and next topic prediction tasks, beating standard LSTM baselines. #### Key Points - The topic vector comes from an internal classifier system, i.e. it is supervised data. Topics could also be estimated using unsupervised techniques. - The topic can be calculated either based on the previous words of the current sentence (SentSegTopic), all words of the previous sentence (PrevSegTopic), or the current paragraph (ParaSegTopic). The best CLSTM uses all of them. - English Wikipedia Dataset: 1400M words train, 177M validation, 178M words test. 129k vocab. - When the current segment topic is present, the topic of the previous sentence doesn't matter. - Authors couldn't compare to other models that incorporate topics because they don't scale to large-scale datasets. - LSTMs are a long chain and the authors don't reset the hidden state between sentence boundaries. So, a sentence has implicit access to the previous sentence's information, but explicitly modeling the topic still makes a difference. #### Notes/Thoughts - Increasing the number of hidden units seems to have a *much* larger impact on performance than adding the topic information. The simple word-based LSTM model with more hidden units significantly outperforms the complex CLSTM model. This makes me question the practical usefulness of this model. - IMO the comparisons are somewhat unfair because by using an external classifier to obtain topic labels you are bringing in external data that the baseline models didn't have access to. - What about using other unsupervised sentence embeddings as context vectors, e.g. seq2seq autoencoders or PV? - If the LSTM were perfect at modeling long-range dependencies then we wouldn't need to feed extra topic vectors. What about residual connections? |
[link]
TLDR; The authors apply an RNN to modeling a student's knowledge. The input is an exercise question and answer (correct/incorrect), either as one-hot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvements over previous models, can be used for curriculum optimization, and also discovers the latent structure in exercise concepts. #### Key Points - Two encodings tried: One-hot, embedded - RNN/LSTM, 200-dimensional hidden layer, output dropout, NLL. - No expert annotation for concepts or questions/answers is needed - Blocking (series of exercises of the same type) vs. Mixing for curriculum optimization: Blocking seems to perform better - Lots of cool future direction ideas #### Question / Notes - Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? Might give insight into confusing concepts. |
[link]
TLDR; The authors present Residual Nets, which achieve 3.57% error on the ImageNet test set and won 1st place in the ILSVRC 2015 challenge. ResNets work by introducing "shortcut" connections across stacks of layers, allowing the optimizer to learn an easier residual function instead of the original mapping. This allows for efficient training of very deep nets without the introduction of additional parameters or training complexity. The authors present results on ImageNet and CIFAR-10 with nets as deep as 152 layers (and one ~1000 layer deep net). #### Key Points - Problem: Deeper networks experience a *degradation* problem. They don't overfit but nonetheless perform worse than shallower networks on both training and test data due to being more difficult to optimize. - Because deep nets can in theory learn an identity mapping for their additional layers they should strictly outperform shallower nets. In practice however, optimizers have problems learning identity (or near-identity) mappings. Learning residual mappings is easier, mitigating this problem. - Residual Mapping: If the desired mapping is H(x), let the layers learn F(x) = H(x) - x and add x back through a shortcut connection, H(x) = F(x) + x. An identity mapping can then be learned easily by driving the learned mapping F(x) to 0 (see the sketch below). - No additional parameters or computational complexity are introduced by residual nets. - Similar to Highway Networks, but the gates are not data-dependent (no extra parameters) and are always open. - Due to the nature of the residual formula, input and output must be of the same size (just like in Highway Networks). We can do size transformations by zero-padding or projections. Projections introduce additional parameters. Authors found that projections perform slightly better, but are "not worth" the large number of extra parameters. - 18 and 34-layer VGG-like plain nets get 27.94 and 28.54 error respectively; note the higher error for the deeper net. The ResNets get 27.88 and 25.03 respectively; error is greatly reduced for the deeper net. - Use a bottleneck architecture with 1x1 convolutions to change dimensions. - A single ResNet outperforms previous state of the art ensembles. A ResNet ensemble is even better. #### Notes/Questions - Love the simplicity of this. - I wonder how performance depends on the number of layers skipped by the shortcut connections. The authors only present results with 2 or 3 layers. - "Stacked" or recursive residuals? - In principle Highway Networks should be able to learn the same mappings quite easily. Is this an optimization problem? Do we just not have enough data? What if we made the gates less fine-grained and substituted sigmoid with something else? - Can we apply this to RNNs, similar to LSTM/GRU? Seems good for learning long-range dependencies.
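A minimal numpy sketch of a residual block, using plain affine layers instead of conv + batch norm for brevity (my own illustration, not the paper's exact block):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two-layer residual block: H(x) = F(x) + x.

    If the optimal mapping is close to identity, the weights only need to
    drive F(x) toward zero, which is easier than learning identity directly.
    """
    f = np.maximum(x @ W1, 0.0)     # first layer + ReLU
    f = f @ W2                      # second layer
    return np.maximum(f + x, 0.0)   # shortcut addition, then ReLU

d = 64
x = np.random.randn(16, d)
W1 = np.random.randn(d, d) * 0.01
W2 = np.random.randn(d, d) * 0.01
# With near-zero weights the block already behaves like an identity mapping (plus ReLU).
print(np.abs(residual_block(x, W1, W2) - np.maximum(x, 0.0)).max())  # small
```
|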
[link]
TLDR; The authors show that we can distill the knowledge of a complex ensemble of models into a smaller model by letting the smaller model learn directly from the "soft targets" (softmax output with high temperature) of the ensemble. Intuitively, this works because the errors in probability assignment (e.g. assigning 0.1% to the wrong class) carry a lot of information about what the network learns. Learning directly from logits (unnormalized scores), as was done in a previous paper, is a special case of the distillation approach. The authors show how distillation works on MNIST and an ASR data set. #### Key Points - Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice. - Use softmax with temperature; values from 1-10 seem to work well, depending on the problem. - The MNIST network learns to recognize digits it never saw during training (a held-out 3), solely based on the "errors" that the teacher network makes. (The bias needs to be adjusted.) - Training on soft targets with less data performs much better than training on hard targets with the same amount of data. #### Notes/Question - Breaking up the complex model into specialists didn't really fit into this paper, since those experts are not distilled back into one model. Also would've liked to see training of only specialists (without the general network) and then distilling their knowledge.
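A rough numpy sketch of the distillation objective as I understand it (the temperature `T`, the mixing weight `alpha`, and all numbers are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=5.0, alpha=0.5):
    """Cross-entropy against the teacher's soft targets (at temperature T)
    plus standard cross-entropy against the hard labels."""
    soft_targets = softmax_with_temperature(teacher_logits, T)
    soft_preds = softmax_with_temperature(student_logits, T)
    hard_preds = softmax_with_temperature(student_logits, 1.0)
    soft_ce = -np.sum(soft_targets * np.log(soft_preds + 1e-12), axis=-1)
    hard_ce = -np.log(hard_preds[np.arange(len(hard_labels)), hard_labels] + 1e-12)
    # The soft-target gradients scale as 1/T^2, hence the T**2 factor mentioned in the paper.
    return np.mean(alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce)

teacher = np.random.randn(8, 10) * 3
student = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```
|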
[link]
TLDR; The authors present Paragraph Vector, which learns fixed-length, semantically meaningful vector representations for text of any length (sentences, paragraphs, documents, etc). The algorithm works by training a word vector model with an additional paragraph embedding vector as an input. This paragraph embedding is fixed for each paragraph, but varies across paragraphs. Similar to word2vec, PV comes in 2 flavors: - A Distributed Memory Model (PV-DM) that predicts the next word based on the paragraph and preceding words - A BoW model (PV-BoW) that predicts context words for a given paragraph A notable property of PV is that during inference (when you see a new paragraph) it requires training of a new vector, which can be slow. The learned embeddings can be used as the input to other models. In their experiments the authors train both variants and concatenate the results. The authors evaluate PV on Classification and Information Retrieval tasks and achieve new state-of-the-art. #### Data Sets / Results Stanford Sentiment Treebank Polar error: 12.2% Stanford Sentiment Treebank Fine-Grained error: 51.3% IMDB Polar error: 7.42% Query-based search result retrieval (internal) error: 3.82% #### Key Points - Authors use 400-dimensional PV and word embeddings. The window size is a hyperparameter chosen on the validation set; values from 5-12 seem to work well. On IMDB, the window size resulted in error fluctuations of ~0.7%. - PV-DM performs well on its own, but concatenating PV-DM and PV-BoW consistently leads to (small) improvements. - When training the PV-DM model, use concatenation instead of averaging to combine word and paragraph vectors (this preserves ordering information). - Hierarchical Softmax is used to deal with large vocabularies. - For final classification, the authors use LR or an MLP, depending on the task (see below). - IMDB training (25k documents, 230 words average length) takes 30min on a 16-core machine, CPU I assume. #### Notes / Question - How did the authors choose the final classification model? Did they cross-validate this? The authors mention that an NN performs better than LR for the IMDB data, but they don't show how large the gap is. Does PV maybe perform significantly worse with a simpler model? - I wonder if we can train hierarchical representations of words, sentences, paragraphs, documents, keeping the vectors of each one fixed at each layer, and predicting sequences using RNNs. - I wonder how PV compares to an attention-based RNN autoencoder approach. When training PV you are in a way attending to specific parts of the paragraph to predict the missing parts. |
[link]
TLDR; The authors use a Maximum Mutual Information (MMI) objective function to generate conversational responses. They still train their models with maximum likelihood, but use MMI to generate responses during decoding. The idea behind MMI is that it promotes more diversity and penalizes trivial responses. The authors evaluate their method using BLEU scores, human evaluators, and qualitative analysis and find that the proposed objective indeed leads to more diverse responses. #### Key Points - In practice, NCMs (Neural Conversation Models) often generate trivial responses using high-frequency terms, partly due to the likelihood objective function. - Two models: MMI-antiLM and MMI-bidi, depending on the formulation of the MMI objective. These objectives are used during response generation, not during training (see the sketch below). - Use a deep 4-layer LSTM with 1000-dimensional hidden state and 1000-dimensional word embeddings. - Datasets: Twitter triples with 129M context-message-response triples. OpenSubtitles with 70M spoken lines that are noisy and don't include turn information. - Authors state that perplexity is not a good metric because their objective is to explicitly steer away from the high probability responses. #### Notes - BLEU score seems like a bad metric for this. Shouldn't more diverse responses result in a lower BLEU score? - Not sure if I like the direction of this. To me it seems wrong to "artificially" promote diversity. Shouldn't diversity come naturally as a function of context and intention?
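A toy sketch of the MMI-antiLM idea, assuming we rerank an N-best list from beam search (the paper's antiLM variant only penalizes the language-model score of the first few tokens; this sketch applies it to the whole response for simplicity, and the lambda value and all numbers are made up):

```python
def mmi_antilm_score(log_p_target_given_source, log_p_target, lam=0.5):
    """MMI-antiLM: score(T) = log p(T|S) - lambda * log p(T).

    Subtracting a scaled language-model score penalizes generic, high-frequency
    responses ("I don't know") that are likely under p(T) regardless of the source."""
    return log_p_target_given_source - lam * log_p_target

# Rerank an N-best list: (response, log p(T|S), log p(T)) -- numbers are made up.
candidates = [
    ("i don't know.",                  -2.0, -1.5),
    ("try rebooting the modem first.", -4.0, -9.0),
]
ranked = sorted(candidates, key=lambda c: mmi_antilm_score(c[1], c[2]), reverse=True)
print(ranked[0][0])  # the more specific response wins under MMI
```
|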
[link]
TLDR; The authors evaluate Paragraph Vectors on large Wikipedia and arXiv document retrieval tasks and compare the results to LDA, BoW and word vector averaging models. Paragraph Vectors either outperform or match the performance of the other models. The authors show how the embedding dimensionality affects the results. Furthermore, the authors find that one can perform arithmetic operations on paragraph vectors and obtain meaningful results, and present qualitative analyses in the form of visualizations and document examples. #### Data Sets Accuracy is evaluated by constructing triples, where a pair of items are close to each other and the third one is unrelated (or less related). Cosine similarity is used to evaluate semantic closeness. Wikipedia (hand-built) PV: 93% Wikipedia (hand-built) LDA: 82% Wikipedia (distantly supervised) PV: 78.8% Wikipedia (distantly supervised) LDA: 67.7% arXiv PV: 85% arXiv LDA: 85% #### Key Points - Jointly training PV and word vectors seems to improve performance. - Uses Hierarchical Softmax with a Huffman tree for the large vocabulary. - They use only the PV-BoW model, because it's more efficient. #### Questions/Notes - Why the performance discrepancy between the arXiv and Wikipedia tasks? BoW performs surprisingly well on Wikipedia, but not arXiv. LDA is the opposite. |
[link]
TLDR; The authors train a Hierarchical Recurrent Encoder-Decoder (HRED) network for dialog generation. The "lower" level encodes a sequence of words into a thought vector, and the higher-level encoder uses these thought vectors to build a representation of the context. The authors evaluate their model on the *MovieTriples* dataset using perplexity measures and achieve results better than plain RNNs and the DCGM model. Pre-training with a large Question-Answer corpus significantly reduces perplexity. #### Key Points - Three RNNs: Utterance encoder, context encoder, and decoder. GRU hidden units, ~300d hidden state spaces. - 10k vocabulary. Preprocessing: Remove entities and numbers using NLTK - The context in the experiments is only a single utterance - MovieTriples is a small dataset, about 200k training triples. The pretraining corpus has 5M Q-A pairs, 90M tokens. - Perplexity is used as the evaluation metric. Not perfect, but reasonable. - Pre-training has a much more significant impact than the choice of model architecture. It reduces perplexity by ~10 points, while the model architecture makes a tiny difference (~1 point). - Authors suggest exploring architectures that separate semantic from syntactic structure - Realization: Most good predictions are generic. Evaluation metrics like BLEU will favor pronouns and punctuation marks that dominate during training and are therefore bad metrics. #### Notes/Questions - Does using a larger dataset eliminate the need for pre-training? - What about the more challenging task of longer contexts? |
[link]
TLDR; The authors use a CNN to extract features from character-based document representations. These features are then fed into an RNN to make a final prediction. This model, called ConvRec, has significantly fewer parameters (10-50x) than comparable convolutional models with more layers, but achieves similar or better performance on large-scale document classification tasks. #### Key Points - Shortcomings of the word-level approach: Each word is distinct despite common roots, cannot handle OOV words, many parameters. - Character-level Convnets need many layers to capture long-term dependencies due to the small sizes of the receptive fields. - Network architecture: 1. Embedding, 8-dim 2. Convnet: 2-5 layers, 5- and 3-dim convolutions, 2-dim pooling, ReLU activation 3. RNN: LSTM with 128d hidden state. Dropout after the conv and recurrent layers. - Training: 96 characters, Adadelta, batch size of 128, examples padded and masked to the longest sequence in the batch, gradient norm clipping of 5, early stopping - The model tends to outperform the large CNN on smaller datasets. Maybe because of overfitting? - More convolutional layers or more filters don't impact model performance much #### Notes/Questions - Would've been nice to graph the effect of #params on model performance. How much do additional filters and conv layers help? - What about training time? How does it compare? |
[link]
TLDR; The authors propose a recurrent memory-based model that can reason over multiple hops and be trained end to end with standard gradient descent. The authors evaluate the model on QA and Language Modeling tasks. In the case of QA, the network inputs are a list of sentences, a query and (during training) an answer. The network then attends to the sentences at each time step, considering the next piece of information relevant to the question. The network outperforms baseline approaches, but does not come close to a strongly supervised (relevant sentences are pre-selected) approach. #### Key Takeaways - Sentence Representation: 1. Word embeddings are averaged (BoW) 2. Positional Encoding (PE) - Synthetic dataset with a vocabulary size of ~180. Version one has 1k training examples, version 2 has 10k training examples. - The model is similar to the Bahdanau seq2seq attention model, only that it operates on sentences, does not output at every step, and uses a simpler scoring function. #### Questions / Notes - The positional encoding formula is not explained, nor is it intuitive (see the sketch below for my reading of it). - There are so many hyperparameters and model variations (jittering, linear start) that it's easy to lose track of what's essential. - No intuitive explanation of what the model does. The easiest way for me to understand this model was to look at it as a variation of Bahdanau's attention model, which is very intuitive. I don't understand the intuition behind the proposed weight constraints. - The LM results are not convincing. The model beats the baselines by a little bit, but probably only due to very time-intensive hyperparameter optimization. - What is the training complexity and training time?
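For what it's worth, here is my reading of the positional encoding as a small numpy sketch (indexing conventions may differ from the paper):

```python
import numpy as np

def positional_encoding(J, d):
    """Position weights l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based
    word index j and embedding index k (as I read the paper). Words at different
    positions get differently weighted embeddings, so the sentence vector is no
    longer a pure bag of words."""
    l = np.zeros((J, d))
    for j in range(1, J + 1):
        for k in range(1, d + 1):
            l[j - 1, k - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

def sentence_representation(word_embeddings):
    """PE-weighted sum instead of a plain average: m = sum_j l_j * x_j."""
    J, d = word_embeddings.shape
    return (positional_encoding(J, d) * word_embeddings).sum(axis=0)

words = np.random.randn(6, 20)   # 6 words, 20-dim embeddings
print(sentence_representation(words).shape)  # (20,)
```
|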
[link]
TLDR; The authors train large-scale language modeling LSTMs on the 1B Word dataset and achieve new state of the art results for single models (51.3 -> 30 perplexity) and ensemble models (41 -> 24.2 perplexity). The authors evaluate how various architecture choices impact model performance: Importance Sampling loss, NCE loss, character-level CNN inputs, Dropout, character-level CNN output, character-level LSTM output. #### Key Points - 800k vocab, 1B words training data - Using a CNN on characters instead of a traditional softmax significantly reduces the number of parameters, but lacks the ability to differentiate between similar-looking words with very different meanings. Solution: Add a correction factor - Dropout on non-recurrent connections significantly improves results - A character-level LSTM for prediction performs significantly worse than the softmax or CNN softmax - Sentences are not pre-processed, and are fed in 128-sized batches without resetting any LSTM state in between examples. Max word length for character-level input: 50 - Training: Adagrad with a learning rate of 0.2. Gradient norm clipping 1.0. RNN unrolled for 20 steps. The small LSTM beats the state of the art after just 2 hours of training; the largest and best model was trained for 3 weeks on 32 K40 GPUs. - NCE vs. Importance Sampling: IS is sufficient - Using character-level CNN word embeddings instead of a traditional embedding matrix is sufficient and performs better #### Notes/Questions - The exact hyperparameters in Table 1 are not clear to me. |
[link]
TLDR; The authors apply a 3-layer seq2seq LSTM with 256 units and an attention mechanism to the constituency parsing task and achieve a new state of the art. Attention made a huge difference for a small dataset (40k examples), but less so for a noisy large dataset (~11M examples). #### Data Sets and model performance - WSJ (40k examples): 90.5 - Large distantly supervised corpus (90k gold examples, 11M noisy examples): 92.8 #### Key Takeaways - The authors use existing parsers to label a large dataset to be used for training. The trained model then outperforms the "teacher" parsers. A possible explanation is that errors of the supervising parsers look like noise to the more powerful LSTM model. These results are extremely valuable, as data is typically the limiting factor, but existing models almost always exist. - The attention mechanism can lead to huge improvements on small data sets. - All of the learned LSTM models were able to deal with long (~70 token) sentences without a significant impact on performance. - Reversing the input in seq2seq tasks is common. However, reversing resulted in only a 0.2 point bump in accuracy. - Pre-trained word vectors bumped scores by 0.4 (92.9 -> 94.3) only. #### Notes/Questions - How much does the output data representation matter? The authors linearized the parse tree using depth-first traversal and parentheses. Are there more efficient representations that may lead to better results? - How much does the noise in the auto-labeled training data matter when compared to the data size? Are there systematic errors in the auto-labeled data that put a ceiling on model performance? - Bidirectional LSTM? |
[link]
TLDR; The authors propose new ways to incorporate context (previous sentences) into a Recurrent Language Model (RLM). They propose 3 ways to model the context, and 2 ways to incorporate the context into the predictions for the current sentence. Context can be modeled with BoW, Sequence BoW (a BoW vector per sentence), and Sequence BoW with attention. Context can be incorporated using "early fusion", which gives the context as an input to the RNN, or "late fusion", which modifies the LSTM to directly incorporate the context. The authors evaluate their architecture on the IMDB, BBC and Penn TreeBank corpora, and show that most approaches perform well (reducing perplexity), with Sequence BoW with attention + late fusion outperforming all others. #### Key Points: - Context as BoW: Compress the N previous sentences into a single BoW vector - Context as Sequence BoW: Compress each of the N previous sentences into a BoW vector and use an LSTM to "embed" them. Alternatively, use an attention mechanism. - Early Fusion: Give the context vector as an input to the LSTM, together with the current word. - Late Fusion: Add another gate to the LSTM that incorporates the context vector. Helps to combat vanishing gradients (see the sketch below). - Interestingly, the Sequence BoW without attention performs very poorly. The reason here seems to be the same as for seq2seq: it's hard to compress the sentence vectors into a single fixed-length representation using an LSTM. - LSTM models trained with 1000 units, Adadelta. Only sentences up to 50 words are considered. - Noun phrases seem to benefit the most from the context, which makes intuitive sense. #### Notes/Questions: - A problem with current Language Models is that they are corpus-specific. A model trained on one corpus doesn't do well on another corpus because all sentences are treated as being independent. However, if we can correctly incorporate context we may be able to train a general-purpose LM that does well across various corpora. So I think this is important work. - I am surprised that the authors did not try using a sentence embedding (skip-thought, paragraph vector) to construct their context vectors. That seems like an obvious choice over using BoW. - The argument for why the Sequence BoW without attention model performs poorly isn't convincing. In the seq2seq work the argument for attention was based on the length of the sequence. However, here the sequence is very short, so the LSTM should be able to capture all the dependencies. The performance may be poor due to the BoW representation, or due to too little training data. - Would've been nice to visualize what the attention mechanism is modeling. - I'm not sure if I agree with the authors that relying on explicit sentence boundaries is an advantage; I see it as a limiting factor.
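A generic sketch of the late-fusion idea — an extra learned gate mixes the context vector into the LSTM output after the recurrence — which is not the paper's exact parameterization, just an illustration of where the context enters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def late_fusion(h_t, context, W_g, W_c):
    """Gate the context vector and mix it into the LSTM output. The point is
    that the context enters *after* the recurrence, instead of being fed in
    as just another input (early fusion)."""
    g = sigmoid(h_t @ W_g + context @ W_c)  # gate depends on the state and the context
    return h_t + g * context                # gated shortcut from the context

d = 128
h_t = np.random.randn(d)       # LSTM hidden state at time t
context = np.random.randn(d)   # e.g. BoW vector of previous sentences, projected to d dims
out = late_fusion(h_t, context, np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01)
print(out.shape)  # (128,)
```
|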
[link]
TLDR; The authors demonstrate how to condition on several predictors when generating text/code. For example, one may need to copy inputs or perform database lookups to produce good results, but training multiple predictors end-to-end is challenging. The authors propose Latent Predictor Networks that combine attention-based character generation with pointer networks to copy tokens from the input. The authors evaluate their model on the task of producing code for Trading Card Games like Magic and Hearthstone, where the card image is the input, and the code implementation of a card is the output. Latent Predictor Networks clearly beat seq2seq and attention-based baselines. |
[link]
TLDR; The authors present DV-ngram, a new method to learn document embeddings. DV-ngram is a variation on Paragraph Vectors with a training objective of predicting words and n-grams solely based on the document vector, forcing the embedding to capture the semantics of the text. The authors evaluate their model on the IMDB data set, beating both n-gram based and Deep Learning models. #### Key Points - When the word vectors are already sufficiently predictive of the next words, the standard PV embedding cannot learn anything useful. - Training objective: Predict words and n-grams solely based on the document vector. Negative Sampling is used to deal with the large vocabulary. In practice, each n-gram is treated as a special token and appended to the document. - Code will be at https://github.com/libofang/DV-ngram #### Question/Notes - The argument that PV may not work when the word vectors themselves are predictive enough makes intuitive sense. But what about applying word-level dropout? Wouldn't that also force the PV to learn the document semantics? - It seems to me that predicting n-grams leads to a huge sparse vocabulary space. I wonder how this method scales, even with negative sampling. I am actually surprised this works well at all. - The authors mention that they beat "other Deep Learning models", including PV, but neither their model nor PV is "deep learning". The networks are not deep ;) |
[link]
TLDR; The authors propose a novel encoder-decoder neural network architecture. The encoder RNN encodes a sequence into a fixed-length vector representation and the decoder generates a new variable-length sequence based on this representation. The authors also introduce a new cell type (now called GRU) to be used with this network architecture. The model is evaluated on a statistical machine translation task where it is fed as an additional feature to a log-linear model. It leads to improved BLEU scores. The authors also find that the model learns syntactically and semantically meaningful representations of both words and phrases. #### Key Points: - New encoder-decoder architecture, seq2seq. The decoder is conditioned on the thought vector. - The architecture can be used for both scoring and generation - New hidden unit type, now called GRU. A simplified LSTM (see the sketch below). - Could replace the whole pipeline with this architecture, but this paper doesn't - 15k vocabulary (93% coverage of the dataset). 100d embeddings, 500 maxout units in the final affine layer, batch size of 64, Adagrad, 384M words, 3 days training time. - The architecture is trained without frequency information, so we expect it to capture linguistic rather than statistical information. - Visualizations of both word embeddings and thought vectors. #### Questions/Notes - Why not just use LSTM units?
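A minimal numpy sketch of the proposed gated unit (the GRU); gate-sign conventions vary between write-ups, and the sizes below are just the ones mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One step of the gated unit introduced in this paper (now the GRU).

    z: update gate -- how much of the previous state to keep.
    r: reset gate  -- how much of the previous state to use when computing
       the candidate state.
    """
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x + Uz @ h_prev)
    r = sigmoid(Wr @ x + Ur @ h_prev)
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_tilde

d_in, d_h = 100, 500  # embedding and hidden sizes as mentioned above
params = [np.random.randn(d_h, d_in) * 0.01 if i % 2 == 0 else np.random.randn(d_h, d_h) * 0.01
          for i in range(6)]   # Wz, Uz, Wr, Ur, W, U
h = np.zeros(d_h)
for x in np.random.randn(10, d_in):  # run over a 10-token sequence
    h = gru_cell(x, h, params)
print(h.shape)  # (500,)
```
|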
[link]
TLDR; The authors show that seq2seq LSTM networks (2 layers, 400 dims) can learn to evaluate short Python programs (loops, conditionals, addition, subtraction, multiplication). The program code is fed one character at a time, and the LSTM is tasked with generating the output number (12 character vocab). The authors also present a new curriculum learning strategy, where the network is fed a sensible mixture of easy and increasingly difficult examples, allowing it to gradually build up the concepts required to evaluate these programs. #### Key Points - LSTM unrolled for 50 steps, 2 layers, 400 cells per layer, ~2.5M parameters. Gradient norm constrained to 5. - 3 Curriculum Learning strategies: 1. Naive: increase example difficulty 2. Mixed: randomly sample easy and hard problems 3. Combined: sample from the Naive and Mixed strategies. Mixed or Combined almost always performs better (see the sketch below). - Output vocabulary: 10 digits, minus, dot - For evaluation teacher forcing is used: Feed the correct output when generating the target sequence - Evaluation tasks: Program Evaluation, Addition, Memorization - Tricks: Reverse the input sequence, double the input sequence. Seem to make a big difference. - Nesting loops makes the tasks difficult since LSTMs can't deal with compositionality. - Feeding easy examples before hard examples may require the LSTM to restructure its memory. #### Notes / Questions - I wonder if there's a relation between regularization/dropout and curriculum learning. The authors propose that mixing example difficulty forces a more general representation. Shouldn't dropout be doing a similar thing?
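A toy sketch of the three curriculum strategies as I read them (the 50/50 split in "combined" is my assumption):

```python
import random

def next_difficulty(current_level, target, strategy="combined"):
    """Pick the difficulty (e.g. nesting depth / number length) of the next
    training example under the three curriculum strategies."""
    if strategy == "naive":
        return current_level                 # only the current level, gradually increased
    if strategy == "mixed":
        return random.randint(1, target)     # easy and hard examples throughout training
    if strategy == "combined":
        # mix the naive and mixed strategies
        return current_level if random.random() < 0.5 else random.randint(1, target)
    raise ValueError(strategy)

print([next_difficulty(current_level=4, target=10, strategy="combined") for _ in range(8)])
```
|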
[link]
TLDR; The authors train a *single* Neural Machine Translation model that can translate between N*M language pairs, with a parameter space that grows linearly with the number of languages. The model uses a single attention mechanism shared across encoders/decoders. The authors demonstrate that the model performs particularly well for resource-constrained languages, outperforming single-pair models trained on the same data. #### Key Points - Attention mechanism: Both encoder and decoder output attention-specific vectors, which are then combined. Thus, adding a new source/target language does not result in a quadratic explosion of parameters. - Bidirectional RNN, 620-dimensional embeddings, GRU with 1k units, 1k-unit affine layer with tanh. Adam, minibatches of 60 examples. Only uses sentences up to length 50. - The model clearly outperforms single-pair models when the parallel corpora are constrained to small sizes. Not so much for large corpora. - The single model doesn't fit on a GPU. - Can in theory be used to translate between pairs that didn't have a bilingual training corpus, but the authors don't evaluate this in the paper. - Main difference to "Multi-task Sequence to Sequence Learning": Uses an attention mechanism #### Notes / Questions - I don't see anything that would force the encoders to map sequences of different languages into the same representation (as the authors briefly mention). Perhaps it just encodes language-specific information that the decoders can use to decide which source language it was? |
[link]
TLDR; The authors train a deep seq2seq LSTM directly on byte-level input of several languages (shuffling the examples of all languages) and apply it to NER and POS tasks, achieving state-of-the-art or close to it. The model outputs spans of the form `[START_POSITION, LENGTH, LABEL]`, where each span element is a separate token prediction. A single model works well for all languages and learns shared high-level representations. The authors also present a novel way to drop out input tokens (bytes in their case), by randomly replacing them with a `DROP` symbol. #### Data and model performance Data: - POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments - NER: 4 languages, 0.88M tokens, 6M training segments Results: - POS CRF Accuracy (average across languages): 95.41 - POS BTS Accuracy (average across languages): 95.85 - NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84 - (See paper for NER comparison models) #### Key Takeaways - Surprising to me that the span generation works so well without imposing independence assumptions on it. It's all state the LSTM has to keep in memory. - 0.2-0.3 dropout, 320-dimensional embeddings, and a 320-unit, 4-layer LSTM seem to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary size of 256 bytes. Changing the input sequence order didn't have much of an effect. Dropout and Byte Dropout significantly (74 -> 78 -> 82) improved F1 for NER. - To limit sequence length the authors split the text into k=60 sized segments, with 50% overlap to avoid splitting mid-span. - Byte Dropout can be seen as "blurring text". I believe I've seen the same technique applied to words before, labeled word dropout. - Training examples for all languages are shuffled together. The biggest improvements in scores are observed for low-resource languages. - Not clear how to tune the recall of the model since non-spans are simply not annotated. #### Notes / Questions - I wonder if the fixed-vector embedding of the input sequence is a bottleneck since the decoder LSTM has to carry information not only about the input sequence, but also about the structure that has been produced so far. I wonder if the authors have experimented with varying `k`, or using attention mechanisms to deal with long sequences (I've seen papers dealing with sequences of 2000 tokens?). 60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences. - What about LSTM initialization? When feeding spans coming from the same document, is the state kept around or re-initialized? I strongly suspect it's kept since 60 bytes probably don't contain enough information for proper labeling, but didn't see an explicit reference. - Why not a bidirectional LSTM? Seems to be the standard in most other papers. - How exactly are multiple languages encoded in the LSTM memories? I *kind of* understand the reasoning behind this, but it's unclear what these "high-level" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable. - Is there a way to easily re-train the model for a new language? |
[link]
TLDR; The authors show that we can improve the performance of a reference task (like translation) by simultaneously training other tasks, like image caption generation or parsing, and vice versa. The authors evaluate 3 MTL (Multi-Task Learning) scenarios: one-to-many, many-to-one and many-to-many. The authors also find that unsupervised training with a skip-thought objective works well for improving translation performance, but sequence autoencoders don't. #### Key Points - 4-layer seq2seq LSTM, 1000-dimensional cells in each layer and 1000-dimensional embeddings, batch size 128, dropout 0.2, SGD with LR 0.7 and decay. - The authors define a mixing ratio for parameter updates that is defined with respect to a reference task. Picking the right mixing ratio is a hyperparameter. - One-to-Many experiments: Translation (EN -> GER) + Parsing (EN). Improves results for both tasks. Surprising that even a very small amount of parsing updates significantly improves the MT result. - Many-to-One experiments: Captioning + Translation (GER -> EN). Improves results for both tasks (wrt. the reference task). - Many-to-Many experiments: Translation (EN <-> GER) + Autoencoders or Skip-Thought. Skip-thought vectors improve the result, but autoencoders make it worse. - No attention mechanism #### Questions / Notes - I think this is very promising work. It may allow us to build general-purpose systems for many tasks, even those that are not strictly seq2seq. We can easily substitute classification. - How do the authors pick the mixing ratios for the parameter updates, and how sensitive are the results to these ratios? It's a new hyperparameter and I would've liked to see graphs for it. Makes me wonder if they picked "just the right" ratio to make their results look good, or if these architectures are robust. - The authors found that seq2seq autoencoders don't improve translation, but skip-thought does. In fact, autoencoders made translation performance significantly worse. That's very surprising to me. Is there any intuition behind that? |
[link]
TLDR; The authors apply a neural seq2seq model to sentence summarization. The model uses an attention mechanism (soft alignment). #### Key Points - Summaries generated on the sentence level, not paragraph level - Summaries have fixed length output - Beam search decoder - Extractive tuning for scoring function to encourage the model to take words from the input sequence - Training data: Headline + first sentence pair. |
[link]
TLDR; The authors train a seq2seq model on conversations, building a chat bot. The first data set is an IT Helpdesk dataset with 33M tokens. The trained model can help solve simple IT problems. The second data set is the OpenSubtitles data with ~1.3B tokens (62M sentences). The resulting model learns simple world knowledge, can generalize to new questions, but lacks a coherent personality. #### Key Points - IT Helpdesk: 1-layer LSTM, 1024-dimensional cells, 20k vocabulary. Perplexity of 8. - OpenSubtitles: 2-layer LSTM, 4096-dimensional cells, 100k vocabulary, 2048 affine layer. Attention did not help. - OpenSubtitles: Treat two consecutive sentences as coming from different speakers. Noisy dataset. - Model lacks personality, gives different answers to similar questions (What do you do? What's your job?) - Feed previous context (whole conversation) into encoder, for IT data only. - In both data sets, the neural models achieve better perplexity than n-gram models. #### Notes / Questions - Authors mention that Attention didn't help in OpenSubtitles. It seems like the encoder/decoder context is very short (just two sentences, not a whole conversation). So perhaps attention doesn't help much here, as it's meant for long-range dependencies (or dealing with little data?) - Can we somehow encode conversation context in a separate vector, similar to paragraph vectors? - It seems like we need a principled way to deal with long sequences and context. It doesn't really make sense to treat each sentence tuple in OpenSubtitles as a separate conversation. Distant Supervision based on subtitles timestamps could also be interesting, or combine with multimodal learning. - How we can learn a "personality vector"? Do we need world knowledge or is it learnable from examples? |
[link]
TLDR; The authors train three variants of a seq2seq model to generate a response to social media posts taken from Weibo. The first variant, NRM-glo, is the standard model without an attention mechanism, using the last state as the decoder input. The second variant, NRM-loc, uses an attention mechanism. The third variant, NRM-hyb, combines both by concatenating the local and global state vectors. The authors use human evaluators to rate the responses and compare them to retrieval-based and SMT-based systems. The authors find that the NRM models generate reasonable responses ~75% of the time. #### Key Points - STC: Short-Text Conversation. Generate only a response to a post. No need to keep track of a whole conversation. - Training data: 200k posts, 4M responses. - Authors use a GRU with 1000 hidden units. - Vocabulary: Most frequent 40k words for both input and response. - Decoding is done using beam search with beam size 10. - The hybrid model is difficult to train jointly. The authors train the models individually and then fine-tune the hybrid model. - Tradeoff with retrieval-based methods: Responses are written by a human and don't have grammatical errors, but cannot easily generalize to unseen inputs. |
[link]
TLDR; The authors propose Neural Turing Machines (NTMs). An NTM consists of a memory bank and a controller network. The controller network (LSTM or MLP in this paper) controls read/write heads by focusing their attention softly, using a distribution over all memory addresses. It can learn the parameters for two addressing mechanisms: Content-based addressing ("find similar items") and location-based addressing. NTMs can be trained end-to-end using gradient descent. The authors evaluate NTMs on program generation tasks and compare their performance against that of LSTMs. Tasks include copying, recall, prediction, and sorting binary vectors. While both LSTMs and NTMs seem to perform well on training data, only NTMs are able to generalize to longer sequences. #### Key Observations - The controller network was tried with an LSTM or MLP. Which one works better is task-dependent, but the LSTM "cache" can be a bottleneck. - Controller size, number of read/write heads, and memory size are hyperparameters. - Monitoring the memory addressing shows that the NTM actually learns meaningful programs. - The number of LSTM parameters grows quadratically with hidden unit size due to the recurrent connections; not so for NTMs, leading to models with fewer parameters. - Example problems are very small, typically using sequences of 8-bit vectors. #### Notes/Questions - At what length do NTMs stop working? Would've liked to see where results get significantly worse. - Can we automatically transform fuzzy NTM programs into deterministic ones?
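A minimal numpy sketch of the content-based addressing step (cosine similarity sharpened by a key strength `beta`, then softmax); sizes are illustrative:

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Compare the key emitted by the controller to every memory row with
    cosine similarity, sharpen with beta, and softmax into a soft
    (differentiable) address distribution."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.random.randn(128, 20)               # 128 slots, 20-dim contents
key = memory[42] + 0.1 * np.random.randn(20)    # a noisy version of slot 42's content
w = content_addressing(memory, key, beta=10.0)
print(w.argmax())          # most likely 42: the head focuses on the most similar slot
print((w @ memory).shape)  # a soft read is the expectation over slots: (20,)
```
|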
[link]
TLDR; The authors propose a novel "attention" mechanism that they evaluate on a Machine Translation task, achieving new state of the art (and large improvements in dealing with long sentences). Standard seq2seq models typically try to encode the input sequence into a fixed length vector (the last hidden state) based on which the decoder generates the output sequence. However, it is unreasonable to assume the all necessary information can be encoded in this one vector. Thus, the authors let the decoder depend on a attention vector, which based on the weighted sum (expectation) of the input hidden states. The attention weights are learned jointly, as part of the network architecture. #### Data Sets and model performance Bidirectional GRU, 1000 hidden units. Multilayer maxout to compute output probabilities in decoder. WMT '14 BLEU: 36.15 #### Key Takeaways - Attention mechanism is a weighted sum of the hidden states computed by the encoder. The weights come from a softmax-normalized attention function (a perceptron in this paper), which are learned during training. - Attention can be expensive, because it must be evaluated for each encoder-decoder output pair, resulting in a len(x) * len(y) matrix. - The attention mechanism improves performance across the board, but has a particularly large affect on long sentences, confirming the hyptohesis that the fixed vector encoding is a bottleneck. - The authors use a bidirectional-GRU, concatenating both hidden states into a final state at each time step. - It is easy to visualize the attention matrix (for a single input-ouput sequence pair). The authors show that in the case of English to French translations the matrix has large values on the diagonal, showing the these two languages are well aligned in terms of word order. #### Question/Notes - The attention mechanism seems limited in that it computes a simple weighted average. What about more complex attention functions that allow input states to interact? |
[link]
TLDR; The authors propose three neural models to generate a response (r) based on a context and message pair (c,m). The context is defined as a single message. The first model, RLMT, is a basic Recurrent Language Model that is fed the whole (c,m,r) triple. The second model, DCGM-1, encodes context and message into a BoW representation, puts it through a feedforward neural network encoder, and then generates the response using an RNN decoder. The last model, DCGM-2, is similar but keeps the representations of context and message separate instead of encoding them into a single BoW vector. The authors train their models on a 29M triple data set from Twitter and evaluate using BLEU, METEOR and human evaluator scores. #### Key Points: - 3 Models: RLMT, DCGM-1, DCGM-2 - Data: 29M triples from Twitter - Because (c,m) is very long on average, the authors expect RLMT to perform poorly. - Vocabulary: 50k words, trained with NCE loss - Generated responses degrade with length after ~8 tokens #### Notes/Questions: - Limiting the context to a single message kind of defeats the purpose of this. No real conversations have only a single message as context, and who knows how well the approach works with a larger context? - The authors complain that dealing with long sequences is hard, but they don't even use an LSTM/GRU. Why? |
[link]
TLDR; The authors propose an importance-sampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only the target words occurring in that partition are chosen. To improve decoding speed over the full vocabulary, the authors build a dictionary mapping from source sentences to potential target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than the baseline models with smaller vocabulary. #### Key Points: - Computing the partition function is the bottleneck. Use a sampling-based approach. - Dealing with large vocabulary during training is separate from dealing with large vocab during decoding. Training is handled with importance sampling. Decoding is handled with a source-based candidate list. - Decoding with a candidate list takes around 0.12s (0.05s) per token on CPU (GPU). Without the target list, 0.8s (0.25s). - Issue: The candidate list depends on the source sentence, so it must be re-computed for each sentence. - Reshuffling the data set is expensive as new partitions need to be calculated (not necessary, but it improved scores). #### Notes: - How is the corpus partitioned? What's the effect of the partitioning strategy? - The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail about what this is. The results show that doing this results in a much larger score bump than increasing the vocab does. (The authors do this for all comparison models though.) - Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all of this into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would've liked to see the authors assign a global time budget to train/test and then compare the models based on that. - The authors only briefly mention that re-building the target vocab for each source sentence is an issue and how they solve it; no details are given. |
[link]
TLDR; The authors train a word-level NMT model where UNK tokens in both the source and target sentence are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalizes and doesn't produce unknown words. The best system achieves a new state of the art BLEU score of 19.9 on WMT'15 English to Czech translation. #### Key Points - Source Sentence: The final hidden state of the character-RNN is used as the word representation. - Source Sentence: Character RNNs are always initialized with a 0 state to allow efficient pre-training - Target: Produce the word-level sentence including UNKs first and then run the char-RNNs - Target: Two ways to initialize the char-RNN: With the same hidden state as the word-RNN (same-path), or with its own representation (separate-path) - Authors find that the attention mechanism is critical for pure character-based NMT models #### Notes - Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win? |
[link]
TLDR; The authors propose a new architecture called the "Pointer Network". A Pointer Network is a seq2seq architecture with an attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies based on input sequence length, a Pointer Network can generalize to variable-length inputs. The attention mechanism through which this is achieved is O(n^2), and only a slight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well on these, and generalizes to sequences longer than those found in the training data. #### Key Points - Similar to standard attention, but don't blend the encoder states; use the attention distribution directly as the output (see the sketch below). - The softmax probabilities of the outputs can be interpreted as a fuzzy pointer. - We could solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient. - 512-unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0. - In the case of TSP, the "student" network outperforms the "teacher" algorithm. #### Notes/ Questions - Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How?
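A minimal numpy sketch of a pointer-style output step (the attention scores over input positions become the output distribution); parameter names and sizes are illustrative:

```python
import numpy as np

def pointer_distribution(decoder_state, encoder_states, W1, W2, v):
    """Pointer Network output step: the attention scores over the input
    positions are used *directly* as the output distribution, so the
    'vocabulary' is the set of input indices."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ decoder_state) for e in encoder_states])
    p = np.exp(scores - scores.max())
    return p / p.sum()   # probability of pointing at each input position

n, d = 10, 256           # 10 input elements; sizes illustrative
enc = np.random.randn(n, d)
dec = np.random.randn(d)
p = pointer_distribution(dec, enc,
                         np.random.randn(d, d) * 0.1,
                         np.random.randn(d, d) * 0.1,
                         np.random.randn(d) * 0.1)
print(p.argmax(), p.sum())  # index of the selected input position, 1.0
```
|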
[link]
TLDR; The authors empirically evaluate seq2seq Neural Machine Translation systems. They find that performance degrades significantly as sentences get longer and as the number of unknown words in the source sentence increases. They therefore propose that more investigation into how to deal with large vocabularies and long-range dependencies is needed. The authors also present a new gated recursive convolutional network (grConv) architecture, which consists of a binary tree using GRU-like units. While this network architecture does not perform as well as the RNN encoder, it seems to learn grammatical properties, represented in the gate activations, in an unsupervised fashion. #### Key Points - GrConv: Each neuron is computed as a combination of the left and right neurons in the previous layer, gated by the activations of those neurons. 3 gates: left, right, reset. - In experiments, the encoder varies between RNN and grConv. The decoder is always an RNN. - Model size is only 500MB. 30k vocabulary. Only trained on sentences <= 30 tokens. Networks not trained to convergence. - Beam search with scores normalized by sequence length to choose translations. - Hypothesis is that the fixed vector representation is a bottleneck, or that the decoder is not powerful enough. #### Notes/Questions - The network is only trained on sequences <= 30 tokens. Can we really expect it to perform well on long sequences? Long sequences may inherently have grammatical structures that cannot be observed in short sequences. - There's a mistake in the new activation formula: the time superscript is wrong, it should be (t-1). |
[link]
TLDR; The authors train an RNN that takes as input a glimpse (a part of the image subsampled to a fixed size) and outputs the next glimpse location and an action (a prediction or agent move) at each step. Thus, the model adaptively selects which part of an image to "attend" to. By defining the number of glimpses and their resolutions we can control the complexity of the model independently of image size, which is not true for CNNs. The model is not differentiable, but can be trained using Reinforcement Learning techniques. The authors evaluate the model on the MNIST dataset, a cluttered version of MNIST, and a dynamic video game environment. #### Questions / Notes - I think the authors' claim that the model works independently of image size is only partly true, as larger images are likely to require more glimpses or bigger regions. - Would be nice to see some large-scale benchmarks, as MNIST is a very simple task. However, the authors clearly identify this as future work. - No mention of training time. Is it even feasible to train this for large images (which probably require more glimpses)? |
[link]
TLDR; The authors propose a novel architecture called ReNet, which replaces convolutional and max-pooling layers with RNNs that sweep over the image vertically and horizontally. These RNN layers are then stacked. The authors demonstrate that the ReNet architecture is a viable alternative to CNNs. ReNet doesn't outperform CNNs in this paper, but further optimization and hyperparameter tuning will likely lead to improved results in the future. #### Key Points: - Split images into patches and feed one patch per time step into an RNN, vertically then horizontally. 4 RNNs per layer, 2 vertical and 2 horizontal, one per direction (see the sketch after this summary). - Because the RNNs sweep over the whole image they can see the context of the full image, as opposed to just a local context in the case of conv/pool layers. - Smooth (differentiable) from end to end. - In experiments: 2 256-dimensional ReNet layers, 2x2 patches, 4096-dimensional affine layers. - Flipping and shifting for data augmentation. #### Notes/Questions: - What is the training time/complexity compared to a CNN? - Why split the image into patches at all? I wonder if the authors have experimented with various patch sizes, like defining patches that go over the full vertical height. The 2x2 patches used in the experiments seem quite small and like a waste of computational resources. |
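A rough PyTorch sketch of one ReNet layer under my reading of the paper: split the image into non-overlapping 2x2 patches, sweep a bidirectional RNN over each patch column, then sweep a second bidirectional RNN over the rows of the result. The cell type and sizes are illustrative, not the paper's exact choices.

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    """One ReNet layer: bidirectional RNN sweeps over patch columns, then rows."""
    def __init__(self, in_dim, hid=256, patch=2):
        super().__init__()
        self.patch = patch
        self.vert = nn.LSTM(in_dim * patch * patch, hid, batch_first=True, bidirectional=True)
        self.horiz = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # Non-overlapping p x p patches -> (B, H/p, W/p, C*p*p)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, H // p, W // p, C * p * p)
        Hp, Wp = x.shape[1], x.shape[2]
        # Vertical sweep: each patch column is a sequence of length Hp.
        v = x.permute(0, 2, 1, 3).reshape(B * Wp, Hp, -1)
        v, _ = self.vert(v)                              # (B*Wp, Hp, 2*hid)
        v = v.reshape(B, Wp, Hp, -1).permute(0, 2, 1, 3) # (B, Hp, Wp, 2*hid)
        # Horizontal sweep over the vertical sweep's outputs.
        h = v.reshape(B * Hp, Wp, -1)
        h, _ = self.horiz(h)
        return h.reshape(B, Hp, Wp, -1)                  # feature map, (B, Hp, Wp, 2*hid)

layer = ReNetLayer(in_dim=3)
feats = layer(torch.randn(4, 3, 32, 32))                 # (4, 16, 16, 512)
```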
[link]
TLDR; The authors present the Recurrent Memory Network. These networks use an attention mechanism (a memory bank, MB) to explicitly incorporate information about preceding words into the prediction at each time step. The MB is a layer that can be incorporated into any RNN, and the authors evaluate a total of 8 model variants: optionally stacking another LSTM layer on top of the MB, optionally including a temporal matrix in the attention calculation, and using a gating vs. a linear function for the MB output. The authors apply the model to Language Modeling tasks, achieving state-of-the-art performance and demonstrating that inspecting the attention weights yields intuitive insights into what the network learns: co-occurrence statistics and dependency-type information. The authors also evaluate the models on a sentence completion task, achieving a new state of the art. #### Key Points - RM: LSTM with MB as the top layer. No "horizontal" connections from MB to MB. - RMR: LSTM with MB and another LSTM stacked on top. - RM with gating typically outperforms RMR. - Memory Bank (MB): Input is the current hidden state and the n preceding inputs, including the current one. Attention over these inputs is calculated based on the hidden state. The output is a new hidden state, which can be calculated with or without gating. Optionally apply a temporal bias matrix to the attention calculation (see the sketch below). - Experiments: Hidden states and embeddings all of size 128. Memory size 15. SGD for 15 epochs, learning rate halved each epoch after the fourth. - Attention Analysis (Language Model): Obviously, most attention is given to current and recent words. But long-distance dependencies are also captured, e.g. separable verbs in German. The network also discovers dependency types. #### Notes/Questions - This work seems related to "Alternative structures for character-level RNNs" where the authors feed n-grams from previous words into the classification layer. The idea is to relieve the network from having to memorize these. I wonder how the approaches compare. - No related work section? I don't know if I like the name "memory bank" and the reference to Memory Networks here. I think the main idea behind Memory Networks was to reason over multiple hops. The authors here only make one hop, which is essentially just a plain attention mechanism. - I wonder why exactly the RMR performs worse than the RM. I can't easily find an intuitive explanation for why that would be. Maybe just not enough training data? - How did the authors arrive at their hyperparameters (128 dimensions)? 128 seems small compared to other models. |
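A hedged sketch of the memory bank as I read it: attend over the embeddings of the n most recent input words conditioned on the current LSTM state, and gate the attention summary with that state. The scoring and gating functions here are my guesses, not the paper's exact formulation, and the temporal bias matrix is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank(nn.Module):
    """Attention over the last n input embeddings, conditioned on the LSTM state."""
    def __init__(self, dim, n_mem=15):
        super().__init__()
        self.n_mem = n_mem
        self.score = nn.Bilinear(dim, dim, 1)          # score(memory_i, hidden)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, memories, hidden):
        # memories: (batch, n_mem, dim) embeddings of recent words, hidden: (batch, dim)
        h_rep = hidden.unsqueeze(1).expand_as(memories)
        alpha = F.softmax(self.score(memories, h_rep).squeeze(-1), dim=-1)
        summary = (alpha.unsqueeze(-1) * memories).sum(dim=1)   # (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([summary, hidden], dim=-1)))
        return g * summary + (1 - g) * hidden                   # new state fed to the softmax

mb = MemoryBank(dim=128)
out = mb(torch.randn(4, 15, 128), torch.randn(4, 128))
```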
[link]
TLDR; The authors show that applying dropout to only the **non-recurrent** connections (between layers of the same timestep) in an LSTM works well, improving the scores on various sequence tasks. #### Data Sets and model performance - PTB Language Modeling Perplexity: 78.4 - Google Icelandic Speech Dataset WER Accuracy: 70.5 - WMT'14 English to French Machine Translation BLEU: 29.03 - MS COCO Image Caption Generation BLEU: 24.3 |
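For reference, PyTorch's built-in `dropout` argument on `nn.LSTM` already implements this scheme (dropout on the outputs between stacked layers only, never on the recurrent connections), so a sketch of the setup is short; the sizes below are mine, not the paper's.

```python
import torch
import torch.nn as nn

class DropoutLSTM(nn.Module):
    """Stacked LSTM with dropout on the non-recurrent (between-layer) connections only."""
    def __init__(self, vocab, dim=650, layers=2, p=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.in_drop = nn.Dropout(p)                    # input -> first layer
        # nn.LSTM's `dropout` acts between stacked layers, never on h_{t-1} -> h_t.
        self.lstm = nn.LSTM(dim, dim, num_layers=layers, dropout=p, batch_first=True)
        self.out_drop = nn.Dropout(p)                   # last layer -> softmax
        self.proj = nn.Linear(dim, vocab)

    def forward(self, tokens):                          # (batch, seq_len)
        x = self.in_drop(self.emb(tokens))
        h, _ = self.lstm(x)
        return self.proj(self.out_drop(h))              # (batch, seq_len, vocab)

model = DropoutLSTM(vocab=10_000)
logits = model(torch.randint(0, 10_000, (4, 35)))
```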
[link]
TLDR; The authors show that we can pre-train RNNs using unlabeled data by either reconstructing the original sequence (SA-LSTM), or predicting the next token as in a language model (LM-LSTM). We can then fine-tune the weights on a supervised task. Pre-trained RNNs are more stable, generalize better, and achieve state-of-the-art results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data. #### Data Sets Error rates for SA-LSTM, previous best results in parentheses. - IMDB: 7.24% (7.42%) - Rotten Tomatoes: 16.7% (18.5%) (using additional unlabeled data) - 20 Newsgroups: 15.6% (17.1%) - DBPedia character-level: 1.19% (1.74%) #### Key Takeaways - SA-LSTM: Reconstruct the input sequence from the final hidden state (sketched below). - LM-LSTM: Language-Model pretraining. - LSTM, 1024-dimensional cell, 512-dimensional embedding, 512-dimensional hidden affine layer + 50% dropout, truncated backprop over 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on the dev set. - Linear Gain: Inject the gradient at each step and linearly increase the weights of the prediction objectives. #### Notes / Questions - Not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on others it significantly improves performance. Any explanations? - Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens? - The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and to see results for other data sets. - All tasks are classification tasks. Does SA-LSTM also improve performance on seq2seq tasks? - What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with a mask?) |
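A minimal PyTorch sketch (mine) of the SA-LSTM pretraining objective: encode the sequence, then reconstruct it with a decoder initialized from the final encoder state, using teacher forcing. Sizes roughly follow the summary above; everything else is illustrative.

```python
import torch
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    """SA-LSTM-style pretraining: encode a sequence, reconstruct it from the final state."""
    def __init__(self, vocab, emb_dim=512, hid=1024):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid, batch_first=True)
        self.proj = nn.Linear(hid, vocab)

    def forward(self, tokens):                       # (batch, seq_len)
        x = self.emb(tokens)
        _, state = self.encoder(x)                   # keep only the final (h, c)
        # Teacher forcing with inputs shifted right (position t sees token t-1).
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, state)
        return self.proj(dec_out)                    # (batch, seq_len, vocab)

model = SequenceAutoencoder(vocab=20_000)
tokens = torch.randint(0, 20_000, (8, 50))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 20_000), tokens.reshape(-1))
# After pretraining, the encoder weights initialize the LSTM of the supervised classifier.
```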
[link]
TLDR; The authors evaluate the impact of hyperparameters (embeddings, filter region size, number of feature maps, activation function, pooling, dropout and l2 norm constraint) on Kim's (2014) CNN for sentence classification. The authors present empirical findings with variance numbers based on a large number of experiments on 7 classification data sets, and give practical recommendations for architecture decisions. #### Key Points - Recommended baseline configuration (sketched below): word2vec, (3,4,5) filter regions, 100 feature maps per region size, ReLU activation, 1-max-pooling, 0.5 dropout, l2 norm constraint of 3 on the weight vector. - One-hot vectors perform worse than pre-trained embeddings. word2vec outperforms GloVe most of the time. - The best filter region size is dataset-dependent, in the range of 2-25. Recommended to do a line search over a single region size and then combine multiple sizes. - Increasing the number of feature maps per filter region to more than 600 doesn't seem to help much. - ReLU is almost always the best activation function. - Max-pooling is almost always the best pooling strategy. - Dropout from 0.1 to 0.5 helps; the l2 norm constraint not so much. #### Notes/Questions - All datasets analyzed in this paper are rather similar. They have similar average and max sentence lengths, and even the number of examples is of roughly the same magnitude. It would be interesting to see how the results change with very different datasets, such as long documents or very large numbers of training examples. |
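The recommended baseline configuration maps almost directly onto a small PyTorch model; here is a sketch of mine (word2vec initialization and the l2-norm constraint are only indicated in comments, and all other names are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Kim-style CNN with the recommended baseline settings."""
    def __init__(self, vocab, n_classes, emb_dim=300,
                 region_sizes=(3, 4, 5), n_maps=100, p_drop=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)          # in practice: init from word2vec
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_maps, kernel_size=k) for k in region_sizes)
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_maps * len(region_sizes), n_classes)
        # The l2-norm constraint of 3 on self.fc.weight would be enforced by
        # rescaling the rows after each gradient step (not shown here).

    def forward(self, tokens):                           # (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)             # (batch, emb_dim, seq_len)
        # ReLU + 1-max pooling over time for each region size, then concatenate.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.drop(torch.cat(pooled, dim=1)))

model = SentenceCNN(vocab=20_000, n_classes=2)
logits = model(torch.randint(0, 20_000, (8, 40)))
```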
[link]
TLDR; The authors show that a multilayer LSTM RNN (4 layers, 1000 cells per layer, 1000-dimensional embeddings, 160k source vocab, 80k target vocab) can achieve competitive results on Machine Translation tasks. The authors find that reversing the input sequence leads to significant improvements, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients. Somewhat surprisingly, the LSTM did not have difficulties on long sentences. The model is evaluated on MT tasks and achieves competitive results (34.8 BLEU) by itself, and close to state of the art if coupled with existing baseline systems (36.5 BLEU). #### Key Points - Inverting the input sequence leads to significant improvements. - A deep LSTM performs much better than a shallow LSTM. - Use different parameters for encoder and decoder. This allows training decoders for multiple languages in parallel. - 4 layers, 1000 cells per layer, 1000-dimensional word embeddings, 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences (652M words). SGD with a fixed learning rate of 0.7, decreased by 1/2 every epoch after 5 initial epochs. Gradient clipping. Parallelization on GPU leads to 6.3k words/sec. - Batching sentences of approximately the same length leads to a 2x speedup. - PCA projection shows meaningful clusters of sentences robust to passive/active voice, suggesting that the fixed vector representation captures meaning. - "No complete explanation" for why the LSTM does so much better with the introduced short-range dependencies. - Beam size 1 already performs well; beam size 2 is best in the deep model. #### Notes/Questions - Seems like the performance here is mostly due to the computational resources available and an optimized implementation. These models are pretty big by most standards, and other approaches (e.g. attention) may lead to better results if they had more computational resources. - Reversing the input still feels like a hack to me; there should be a more principled solution for dealing with long-range dependencies. |
[link]
TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image. In order to find the correspondence between words and image patches, the RNN uses a lower convolutional layer as its input (before pooling). The authors propose both a "hard" attention (trained using sampling methods) and a "soft" attention (trained end-to-end) mechanism, and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attention-based models achieve state of the art on Flickr8k, Flickr30k and MS COCO. #### Key Points - To find image correspondences, attend over a lower convolutional layer (see the sketch below). - Two attention mechanisms: soft and hard. Depending on the evaluation metric (BLEU vs. METEOR) one or the other performs better. - The largest data set (MS COCO) takes 3 days to train on a Titan Black GPU. Oxford VGG is used as the encoder. - Soft attention is the same as for seq2seq models. - Attention weights are visualized by upsampling and applying a Gaussian. #### Notes/Questions - Would've liked to see an explanation of when/how soft vs. hard attention does better. - What is the computational overhead of using the attention mechanism? Is it significant? |
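A minimal PyTorch sketch of the soft-attention step: score each annotation vector from the lower conv layer against the decoder state, softmax, and take the weighted sum as the context vector. The MLP scoring form and sizes are illustrative, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over conv-layer annotation vectors (one per image region)."""
    def __init__(self, feat_dim, hid_dim, att_dim=512):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)
        self.W_h = nn.Linear(hid_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, L, feat_dim), e.g. a 14x14x512 conv map flattened to L=196
        # hidden: (batch, hid_dim) decoder LSTM state
        e = self.v(torch.tanh(self.W_a(feats) + self.W_h(hidden).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                     # (batch, L) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # (batch, feat_dim) fed to the decoder
        return context, alpha                            # alpha can be upsampled for visualization

att = SoftAttention(feat_dim=512, hid_dim=1000)
ctx, alpha = att(torch.randn(2, 196, 512), torch.randn(2, 1000))
```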
[link]
TLDR; The authors apply the skip-gram idea from word2vec at the sentence level, training encoder-decoder models that predict the previous and next sentences. The resulting general-purpose vector representations are called skip-thought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating fine-tuned models. #### Key Points - Code at https://github.com/ryankiros/skip-thoughts - Training is done on a large book corpus (74M sentences, 1B tokens) and takes 2 weeks. - Two variations: a bidirectional encoder and a unidirectional encoder, with 1200 and 2400 units per encoder respectively. GRU cell, Adam optimizer, gradient clipping at norm 10. - The vocabulary can be expanded by learning a mapping from a large word2vec vocab to the smaller skip-thought vocab (see the sketch below). Could also use sampling/hierarchical softmax during training for a larger vocab, or train on characters. #### Questions/Notes - The authors clearly state that this is not the goal of the paper, though I'd be curious how more sophisticated (non-linear) classifiers perform with skip-thought vectors. The authors probably tried this but it didn't do well ;) - The fact that the story generation doesn't seem to work well shows that the model has problems learning or understanding long-term dependencies. I wonder if this can be solved by deeper encoders or attention. |
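The vocabulary-expansion trick boils down to fitting a linear map from word2vec space into the skip-thought word-embedding space on the shared words and applying it to the rest; a minimal numpy sketch with illustrative dimensions (the regularized/constrained variants the authors may have used are not reproduced here).

```python
import numpy as np

def expand_vocab(w2v_shared, st_shared, w2v_new):
    """Map word2vec vectors into the skip-thought embedding space.

    w2v_shared: (n_shared, d_w2v)  word2vec vectors for words in both vocabularies
    st_shared:  (n_shared, d_st)   skip-thought word embeddings for the same words
    w2v_new:    (n_new, d_w2v)     word2vec vectors for words missing from the RNN vocab
    """
    # Least-squares fit of st ~= w2v @ W, then apply W to the unseen words.
    W, *_ = np.linalg.lstsq(w2v_shared, st_shared, rcond=None)
    return w2v_new @ W                                   # (n_new, d_st)

rng = np.random.default_rng(0)
w2v_shared, st_shared = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 620))
new_vecs = expand_vocab(w2v_shared, st_shared, rng.normal(size=(100, 300)))
```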
[link]
TLDR; The authors train an RNN-based topic model that takes word order into consideration and assumes that words in the same sentence share the same topic. The authors sample topic mixtures from a Dirichlet distribution and then train a "topic embedding" together with the rest of the generative LSTM. The model is evaluated quantitatively using perplexity on generated sentences and on classification tasks, and clearly beats competing models. Qualitative evaluation shows that the model can generate sensible sentences conditioned on the topic. |
[link]
TLDR; The authors introduce a new spatial transformer module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid over the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state of the art on distorted MNIST data, street view house numbers, and fine-grained bird classification. #### Key Points: - STMs can be inserted between any layers, typically after the input or extracted features. The transformation is dynamic and happens based on the input data (a sketch follows this summary). - The module is fast and doesn't adversely impact training speed. - The actual transformation parameters (the output of the localization network) can be fed into higher layers. - Attention can be seen as a special transformation that increases computational efficiency. - Can also be applied to RNNs, but more investigation is needed. |
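PyTorch happens to ship the grid generator and sampler as `affine_grid`/`grid_sample`, so a sketch of the whole module reduces to a small localization network predicting a 2x3 affine matrix. The localization architecture and the identity initialization below are my choices, not necessarily the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localization net -> affine grid -> sampler, insertable between any two layers."""
    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 4 * 4, 6))
        # Start at the identity transform so the module is initially a no-op.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                                # x: (B, C, H, W)
        theta = self.loc(x).view(-1, 2, 3)               # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

stn = SpatialTransformer(in_ch=1)
out = stn(torch.randn(4, 1, 28, 28))                     # same shape, transformed input
```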
[link]
TLDR; The authors randomly drop entire layers during training using a modified ResNet architecture. The survival probability decreases linearly with depth (higher layers have a higher chance of being dropped), ending at 0.5 for the final layer in the experiments. This mechanism helps with vanishing gradients, diminishing feature reuse, and long training times. The model achieves new records on the CIFAR-10, CIFAR-100 and SVHN datasets. #### Key Points: - The ResNet architecture can easily be modified to drop out a whole layer by keeping only the identity skip connection (see the sketch below). - Lower layers get a lower probability of being dropped since they intuitively contain more "stable" features. The authors use linear decay with a final survival probability of 0.5. - Training time is reduced by 25%-50% depending on the dropout probability hyperparameter. - The authors find that vanishing gradients are indeed reduced by plotting gradient magnitudes vs. number of epochs. - Can be interpreted as an ensemble of networks of varying depth. - All layers are used at test time, and activations need to be scaled appropriately. - The authors successfully train networks with 1000+ layers and achieve further error reduction. |
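A minimal PyTorch sketch (mine) of a residual block with stochastic depth: during training the residual branch is skipped with some probability, at test time it is always applied but scaled by its survival probability, and survival probabilities decay linearly across blocks. The branch architecture is illustrative.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose transform branch is randomly skipped during training."""
    def __init__(self, channels, survival_prob):
        super().__init__()
        self.survival_prob = survival_prob
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() > self.survival_prob:
                return x                                  # whole layer dropped: identity only
            return x + self.branch(x)
        # Test time: always apply the branch, scaled by its survival probability.
        return x + self.survival_prob * self.branch(x)

# Linearly decaying survival probability: 1.0 at the first block down to 0.5 at the last.
n_blocks = 10
blocks = [StochasticDepthBlock(64, 1.0 - 0.5 * l / (n_blocks - 1)) for l in range(n_blocks)]
net = nn.Sequential(*blocks)
out = net(torch.randn(2, 64, 32, 32))
```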
[link]
TLDR; The authors evaluate softmax, hierarchical softmax, target sampling, NCE, self-normalization and differentiated softmax (a novel technique presented in the paper) on data sets with varying vocabulary sizes (10k, 100k, 800k) under a fixed training-time budget. The authors find that techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies. #### Data and Models Models: - Softmax - Hierarchical softmax (cross-validation of clustering techniques) - Differentiated softmax, adjusting capacity based on token frequency (cross-validation of the number of frequency bands and their sizes) - Target sampling (cross-validation of the number of distractors) - NCE (cross-validation of the noise ratio) - Self-normalization (cross-validation of the regularization strength) Data: - PTB (1M tokens, 10k vocab) - Gigaword (5B tokens, 100k vocab) - billionW (800M tokens, 800k vocab) #### Key Takeaways - Techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies. - Differentiated softmax varies the capacity (the size of the matrix slice in the last layer) based on token frequency. In practice, it's implemented as separate matrices of different sizes (sketched below). - Perplexity doesn't seem to improve much after ~500M tokens. - Models are trained for 1 week each. - The competitiveness of softmax diminishes with vocabulary size. It performs relatively well on 10k and 100k, but poorly on 800k since it needs more processing time per example. - Training time, not training data, is the main factor limiting performance. The authors found that very large models are still making progress after one week and may eventually beat the other models if allowed to run longer. #### Questions / Notes - What about the hyperparameters for differentiated softmax? The paper doesn't show an analysis. Also, the fact that this method introduces two additional hyperparameters makes it harder to apply in practice. - Would've liked to see more comparisons for softmax, which is the simplest technique of all and doesn't need hyperparameter tuning. It doesn't work well on an 800k vocab, but it does for 100k. So the authors only show how it breaks down for one dataset. |
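A hedged sketch of differentiated softmax as described above: the vocabulary is split into frequency bands, each band gets its own output matrix acting on a different-sized slice of the final hidden layer, and the band logits are concatenated before a single softmax. Band sizes and slice sizes below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class DifferentiatedSoftmax(nn.Module):
    """Output layer where frequent words get a larger slice of the hidden state."""
    def __init__(self, hidden_dim, band_sizes, slice_dims):
        # band_sizes: words per frequency band (most frequent band first)
        # slice_dims: hidden-state slice per band; must sum to hidden_dim
        super().__init__()
        assert sum(slice_dims) == hidden_dim
        self.slice_dims = list(slice_dims)
        self.bands = nn.ModuleList(
            nn.Linear(d, v) for d, v in zip(slice_dims, band_sizes))

    def forward(self, h):                                # h: (batch, hidden_dim)
        slices = torch.split(h, self.slice_dims, dim=-1)
        logits = [band(s) for band, s in zip(self.bands, slices)]
        return torch.cat(logits, dim=-1)                 # (batch, total_vocab), ordered by frequency

dsm = DifferentiatedSoftmax(768, band_sizes=(10_000, 40_000, 150_000), slice_dims=(512, 192, 64))
logits = dsm(torch.randn(4, 768))                        # feed to cross-entropy as usual
```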
[link]
TLDR; The authors propose two LSTM-based models for target-dependent sentiment classification. TD-LSTM uses two LSTM networks running towards the target word from the left and right respectively, making a prediction at the target time step. TC-LSTM is the same, but additionally incorporates an averaged target word vector as an input at each time step. The authors evaluate their models with pre-trained word embeddings on a Twitter sentiment classification dataset, achieving state of the art. #### Key Points - TD-LSTM: Two LSTM networks, running from the left and from the right towards the target. The final states of both networks are concatenated and the prediction is made at the target word. - TC-LSTM: Same architecture as TD-LSTM, but also incorporates a target vector as an input at each time step. This vector is the average of the word vectors of the target phrase. - Embeddings seem to make a huge difference; state of the art is only obtained with 200-dimensional GloVe embeddings. #### Notes/Questions - A *huge* fraction of the performance improvement comes from pre-trained word embeddings. Without these, the proposed models clearly underperform simpler models. This raises the question of whether incorporating the same embeddings into the simpler models would close the gap. - Would've liked to see performance without *any* pre-trained embeddings. - The authors also experimented with attention mechanisms, but weren't able to achieve good results. The small size of the training corpus may be the reason for this. |
[link]
TLDR; The authors generate a large dataset (~1M examples) for question answering by using cloze deletion on summaries of crawled CNN and Daily Mail articles. They evaluate 2 baselines, 2 symbolic models (frame semantic, word distance), and 4 neural models (Deep LSTM, Uniform Reader, Attentive Reader, Impatient Reader) on the dataset. The neural models, particularly those with attention, beat the symbolic models. - Deep LSTM: 2-layer bidirectional LSTM without an attention mechanism - Attentive Reader: 1-layer bidirectional LSTM with an attention mechanism over the whole query - Impatient Reader: 1-layer bidirectional LSTM with an attention mechanism for each token in the query (can be interpreted as being able to re-read the document at each token) - Uniform Reader: Uniform attention over all document tokens In their experiments, the authors randomize document entities to avoid letting the models rely on world knowledge or co-occurrence statistics, instead purely testing document comprehension. This is done by replacing entities with consistent ids *within* a document, but using different ids across documents (see the sketch below). #### Data and model performance All numbers are accuracies on two datasets (CNN, Daily Mail) - Maximum Frequency Entity Baseline: 33.2 / 25.5 - Exclusive Frequency Entity Baseline: 39.3 / 32.8 - Frame-semantic model: 40.2 / 35.5 - Word distance model: 50.9 / 55.5 - Deep LSTM Reader: 57.0 / 62.2 - Uniform Reader: 39.4 / 34.4 - Attentive Reader: 63.0 / 69.0 - Impatient Reader: 63.8 / 68.0 #### Key Takeaways - The input to the RNN is defined as QUERY <DELIMITER> DOCUMENT, which is then embedded with or without attention and run through `softmax(W*x)`. - Some sequences are very long, up to 2000 tokens, with an average length of 763 tokens. All LSTM models seem to be able to deal with this, but the attention models show significantly higher accuracy. - Very nice attention visualizations and negative-example analysis that show the attention-based models focusing on the relevant parts of the document to answer the questions. #### Notes / Questions - How does document length affect the Deep LSTM Reader? The appendix shows an analysis for the attention models, but not for the Deep LSTM. A goal of the paper was to show that attention mechanisms are well suited for long documents because the fixed vector encoding is a bottleneck. The results here aren't clear. - Are the gradients truncated? I can't imagine the network is unrolled for 2000 steps. The training details don't mention this. - The mathematical notation in this paper needs some love. The concepts are relatively simple, but the formulas are hard to parse. - What if you limited the output vocabulary to words appearing in the query document? - Can you apply the same "attention-based embedding" mechanism to text classification? |
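A small Python sketch (mine) of the entity-anonymization step: entity strings get consistent @entityN ids within a document/query pair, with the assignment re-shuffled per document so that ids carry no cross-document meaning. Entity detection and tokenization are assumed to be done already.

```python
import random

def anonymize(doc_tokens, query_tokens, entities, rng=random):
    """Replace entity strings with @entityN ids, consistent within one (doc, query) pair.

    entities: the set of entity strings detected in this document (assumed given).
    The permutation is re-drawn per document, so ids carry no world knowledge.
    """
    ids = [f"@entity{i}" for i in range(len(entities))]
    rng.shuffle(ids)
    mapping = dict(zip(sorted(entities), ids))
    substitute = lambda toks: [mapping.get(t, t) for t in toks]
    return substitute(doc_tokens), substitute(query_tokens), mapping

doc = "the bbc reported that obama met putin in moscow".split()
query = "obama met @placeholder in moscow".split()
entities = {"bbc", "obama", "putin", "moscow"}
anon_doc, anon_query, mapping = anonymize(doc, query, entities)
```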
[link]
TLDR; The authors apply 6-layer and 9-layer (+3 affine) convolutional nets to character-level input and evaluate their models on Sentiment Analysis and Categorization tasks using (new) large-scale data sets. The authors don't use pre-trained word embeddings, or any notion of words, and instead learn directly from character-level input with characters encoded as one-hot vectors (see the sketch below). This means the same model can be applied to any language (provided the character vocabulary is small enough). The models presented in this paper beat BoW and word2vec baseline models. #### Data and model performance Because existing data sets were too small, the authors collected several new datasets that don't have standard benchmarks. - DBpedia Ontology Classification: 560k training, 70k test - Amazon Reviews 5-class: 3M train, 650k test - Amazon Reviews polar: 3.6M train, 400k test - Yahoo! Answers topics 10-class: 1.4M train, 60k test - AG news classification 4-class: 120k train, 1.9k test - Sogou Chinese News 5-class: 450k train, 60k test Model accuracy for small and large models: - DBpedia: 98.02 / 98.27 - Amazon 5-class: 59.47 / 58.69 - Amazon 2-class: 94.50 / 94.49 - Yahoo 10-class: 70.16 / 70.45 - AG 4-class: 84.35 / 87.18 - Chinese 5-class: 91.35 / 95.12 #### Key Takeaways - Pretty standard CNN architecture applied to characters: conv, ReLU, max-pool, fully-connected. Filter sizes of 7 and 3. See the paper for parameter details. - Training takes a long time, presumably due to the size of the data. The authors quote 5 days per epoch on the large Amazon data set with the large model. - The authors can't handle large character vocabularies, so they romanize Chinese. - The authors experiment with randomly replacing words with synonyms, which seems to give a small improvement. #### Notes / Questions - The authors claim to do "text understanding" and learn representations, but all experiments are on simple classification tasks. There is no evidence that the network actually learns meaningful high-level representations and doesn't just memorize n-grams, for example. - These data sets are large, and the authors claim that they need large data sets, but there are no experiments in the paper that show this. How does performance vary with data size? - The comparison with other models is lacking. I would have liked to see some of the other state-of-the-art models being compared, e.g. Kim's CNN. Comparing with BoW doesn't show much. As these models are openly available the comparison should have been easy. - The romanization of Chinese is an ugly "hack" that goes against what the authors claim: being language-independent and learning "from scratch". - It's strange that the authors use a thesaurus as a means of training example augmentation, as a thesaurus is word-level and language-specific, something that the authors explicitly argue against in this paper. Perhaps they could have used word (or character-level) dropout instead. - Are there any hyperparameters that were optimized? The authors don't mention any dev sets. - Have the datasets been made publicly available? The authors complain that "the unfortunate fact in literature is that there are no large openly accessible datasets", but fail to publish their own. - I'd expect the confusion matrix for the 5-star Amazon reviews to show mistakes coming from negations, but it doesn't, which suggests that the model really learns meaningful representations (such as negation). |
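A minimal numpy sketch of the character quantization: each character from a fixed alphabet becomes a one-hot row, out-of-alphabet characters and padding become all-zero rows, and texts are cut or padded to a fixed length. The alphabet and frame length below are illustrative, not necessarily the paper's exact choices.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_len=1014):
    """Encode text as a (max_len, alphabet_size) one-hot matrix.

    Unknown characters and padding positions are all-zero rows, so the model
    never sees words, only raw characters.
    """
    x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            x[i, idx] = 1.0
    return x

x = quantize("This movie was surprisingly good!")   # ready for a 1-D conv stack over axis 0
```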
[link]
TLDR; The authors propose "Highway Networks", which uses gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convnolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with state-of-the art for the MNIST and CIFAR tasks. Gates outputs vary significantly with the input examples, demonstrating that the network not just learns a "fixed structure", but dynamically routes data based for specific examples examples. Datasets used: MNIST, CIFAR-10, CIFAR-100 #### Key Takeaways - Apply LSTM-like gating to networks layers. Transform gate T and carry gate C. - The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations. - Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc) to initially force the networks to pass through information and learn long-term dependencies. - HWN does not learn a fixed structure (same gate outputs), but dynamic routing based on current input. - In complex data sets each layer makes an important contritbution, which is shown by lesioning (setting to pass-through) individual layers. #### Notes / Questions - Seems like the authors did not use dropout in their experiments. I wonder how these play together. Is dropout less effective for highway networks because the gates already learn efficients paths? - If we see that certain gates outputs have low variance across examples, can we "prune" the network into a fixed strucure to make it more efficient (for production deployments)? |
[link]
TLDR; The authors evaluate the use of a bidirectional LSTM RNN on POS tagging, chunking and NER tasks. The inputs are task-independent features: the word and its capitalization. The authors incorporate prior knowledge about the tagging tasks by restricting the decoder to output valid sequences of tags, and also propose a novel way of learning word embeddings: randomly replacing words in a sequence and using an RNN to predict which words are correct vs. incorrect. The authors show that their model, combined with pre-trained word embeddings, performs on par with state-of-the-art models. #### Key Points - Bidirectional LSTM with 100-dimensional embeddings and 100-dimensional cells. Both 1 and 2 layers are evaluated. Tags are predicted at each step. A higher cell dimensionality results in little improvement. - Word vector pretraining: Randomly replace words and use an LSTM to predict which words are correct/incorrect. #### Notes/Questions - The fact that we need a task-specific decoder kind of defeats the purpose of this paper. The goal was to create a "task-independent" system. To be fair, the need for this decoder is probably only due to the small size of the training data: not all tag combinations appear in the training data. - The comparisons with other state-of-the-art systems are somewhat unfair since the proposed model heavily relies on pre-trained word embeddings from external data (trained on more than 600M words) to achieve good performance. It also relies on external embeddings trained in yet another way. - I'm surprised that the authors didn't try combining all of the tagging tasks into one model, which seems like an obvious extension. |
[link]
TLDR; The authors propose a web navigation task where an agent must find a target page containing a search query (typically a few sentences) by navigating a web graph with restrictions on memory, path length and the number of explorable nodes. They train feedforward and recurrent neural networks and evaluate their performance against that of human volunteers. #### Key Points - Datasets: Wiki-[NUM_ALLOWED_HOPS]: WikiNav-4 (6k train), WikiNav-8 (890k train), WikiNav-16 (12M train). The authors evaluate various query lengths for all data sets. - Vector representation of pages: BoW of pre-trained word2vec embeddings. - State-dependent action space: all possible outgoing links on the current page. At each step, the agent can peek at the neighboring nodes and see their full content. - During training, a single correct path is fed to the agent. Beam search is used to make predictions. - NeuAgent-FF uses a single tanh layer. NeuAgent-Rec uses an LSTM. - Human performance is typically worse than that of the neural agents. #### Notes/Questions - Is it reasonable to allow the agents to "peek" at neighboring pages? Humans can make decisions based on the hyperlink context. In practice, peeking at each page may not be feasible if there are many links on the page. - I'm not sure I buy the claim that this task requires Natural Language Understanding. Agents are just matching query word vectors against pages, which is no indication of NLU. An indication of NLU would be if the query were posed in question format, which is typically short. But here, the authors use several sentences as queries, and longer queries lead to better results, suggesting that the agents don't actually have any understanding of language. They just match text. - The authors say that NeuAgent-Rec performed consistently better for high hop lengths, but I don't see that in the data. - The training method seems a bit strange to me because the agent is fed only one correct path, but in reality there are a large number of correct paths and target pages. It may be more sensible to train the agent with all possible target pages and paths to answer a query. |