[link]
TLDR; The authors jointly train a Logistic Regression model with sparse features that is good at "memorization" and a deep feedforward net with embedded sparse features that is good at "generalization". The model is live in the Google Play store and has achieved a 3.9% gain in app acquisition as measured by A/B testing. #### Key Points - Wide Model (Logistic Regression) gets cross products of binary features, e.g. "AND(user_installed_app=netflix, impression_app=pandora)", as inputs. Good at memorization. - The Deep Model alone has a hard time learning embeddings for cross-product features because there is no data for most combinations, yet it still has to make predictions for them. - Trained jointly on 500B examples. |
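For intuition, here is a minimal sketch of the joint wide-and-deep idea in PyTorch, assuming multi-hot cross-product features for the wide part and a single categorical ID per example for the deep part (feature layout and sizes are made up for illustration); the two logits are simply summed before the sigmoid, as in joint training:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self, n_cross_features, n_ids, embed_dim=32):
        super().__init__()
        self.wide = nn.Linear(n_cross_features, 1)    # logistic regression over sparse cross features
        self.embed = nn.Embedding(n_ids, embed_dim)   # embedded sparse categorical input
        self.deep = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, cross_feats, ids):
        # cross_feats: (batch, n_cross_features) multi-hot; ids: (batch,) categorical ids
        wide_logit = self.wide(cross_feats)
        deep_logit = self.deep(self.embed(ids))
        return torch.sigmoid(wide_logit + deep_logit)  # joint prediction
```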
[link]
TLDR; A new dataset of ~100k questions and answers based on ~500 articles from Wikipedia. Both questions and answers were collected using crowdsourcing. Answers are of various types: 20% dates and numbers, 32% proper nouns, 31% noun phrase answers and 16% other phrases. Humans achieve an F1 score of 86%, and the proposed Logistic Regression model gets 51%. It does well on simple answers but struggles with more complex types of reasoning. The dataset is publicly available at https://stanford-qa.com/. #### Key Points - System must select answers from all possible spans in a passage. $O(N^2)$ possibilities for N tokens in passage. - Answers are ambiguous. Humans achieve 77% on exact match and 86% on F1 (overlap based). Humans would probably achieve close to 100% if the answer phrases were unambiguous. - Lexicalized and dependency tree path features are most important for the LR model - Model performs best on dates and numbers, single tokens, and categories with few possible candidates |
[link]
TLDR; The authors replace the standard attention mechanism (Bahdanau et al) with a RNN/GRU, hoping to model the history of attention decisions and to mitigate the "coverage problem". The authors evaluate their model on Chinese-English translation where they beat Moses (SMT) and GroundHog baselines. The authors also visualize the attention RNN and show that the activations make intuitive sense. #### Key Points - Training time: 2 weeks on Titan X, 300 batches per hour, 2.9M sentence pairs #### Notes - The authors argue that their attention mechanism works better b/c it can capture dependencies among the source states. I'm not convinced by this argument. These states already capture dependencies because they are generated by a bidirectional RNN. - Training seems *very* slow for only 2.9M pairs. I wonder if this model is prohibitively expensive for any production system. - I wonder if we can use RL to "cover" phrases in the source sentences out of order. At each step we pick a span to cover before generating the next token in the target sequence. - The authors don't evaluate Moses for long sentences, why? |
[link]
TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in finetuning). ProgNNs train a neural network on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches. #### Key Points - Finetuning is a destructive process that forgets previous knowledge. We don't want that. - Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. Parameters of those connections are learned, but network 2 and network 1 are frozen during training of network 3. - Downside: # of parameters grows quadratically with the number of tasks. The paper discusses some approaches to address the problem, but it's not clear how well these work in practice. - Metric: AUC (average score per episode during training) as opposed to final score. Transfer score = relative performance compared with a single-net baseline. - Authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network. - Experiment 1: Variations of the Pong game. Baseline that finetunes only the final layer fails to learn. ProgNN beats other baselines and APS shows re-use of knowledge. - Experiment 2: Different Atari games. ProgNNs result in positive transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning final layers fails again. ProgNN beats other approaches. - Experiment 3: Labyrinth, 3D Maze. Pretty much the same result as the other experiments. #### Notes - It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? Networks are trained on different tasks, so the layer representations, or even the numbers of layers, may be completely different. And once you introduce lateral connections from all layers to all other layers the approach no longer scales. - Old tasks cannot learn from new tasks. Unlike humans. - Gating or residuals for lateral connections could make sense to allow the network to "easily" re-use previously learned knowledge. - Why use the AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain. - Scary that finetuning only the final layer fails in most experiments. That's a very commonly used approach in non-RL domains. - Someone should try this on non-RL tasks. - What happens to training time and optimization difficulty as you add more columns? Seems prohibitively expensive. |
[link]
TLDR; The authors combine a standard LSTM softmax with [Pointer Networks](https://arxiv.org/abs/1506.03134) in a mixture model called Pointer-Sentinel LSTM (PS-LSTM). The pointer network helps with rare words and long-term dependencies but is unable to refer to words that are not in the input. The opposite is the case for the standard softmax. By combining the two approaches we get the best of both worlds. The probability of an output word is defined as a mixture of the pointer and softmax model and the mixture coefficient is calculated as part of the pointer attention. The authors evaluate their architecture on the PTB Language Modeling dataset where they achieve state of the art. They also present a novel WikiText dataset that is larger and more realistic than PTB. ### Key Points: - Standard RNNs with softmax struggle with rare and unseen words, even when adding attention. - Use a window of the most recent `L` words to match against. - Probability of output with gating: `p(y|x) = g * p_vocab(y|x) + (1 - g) * p_ptr(y|x)`. - The gate `g` is calculated as an extra element in the attention module. Probabilities for the pointer network are then normalized accordingly. - Integrating the gating function computation into the pointer network is crucial: It needs to have access to the pointer network state, not just the RNN state (which can't hold long-term info) - WikiText-2 dataset: 2M train tokens, 217k validation tokens, 245k test tokens. 33k vocab, 2.6% OOV. 2x larger than PTB. - WikiText-103 dataset: 103M train tokens, 217k validation tokens, 245k test tokens. 267k vocab, 2.4% OOV. 100x larger than PTB. - The Pointer Sentinel model leads to stronger improvements for rare words - that makes intuitive sense. |
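A toy numpy sketch of the mixture, assuming unnormalized pointer scores over the last `L` context words plus one sentinel score; the sentinel's share of the joint softmax is the gate `g` that weights the vocabulary softmax (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def pointer_sentinel_probs(p_vocab, ptr_scores, sentinel_score, context_ids):
    # p_vocab: (V,) vocabulary softmax; ptr_scores: (L,) attention scores over the window
    # context_ids: (L,) vocabulary ids of the words in the window
    scores = np.append(ptr_scores, sentinel_score)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                        # joint softmax over pointer positions + sentinel
    g = probs[-1]                               # gate: mass assigned to the sentinel
    p_ptr = np.zeros_like(p_vocab)
    np.add.at(p_ptr, context_ids, probs[:-1])   # scatter pointer mass onto vocabulary ids
    # p(y|x) = g * p_vocab + (1 - g) * p_ptr; p_ptr here already sums to (1 - g)
    return g * p_vocab + p_ptr
```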
[link]
TLDR; The authors use policy gradients on an RNN to train a "hard" attention mechanism that decides whether to output something at the current timestep or not. Their algorithm is online, which means it does not need to see the complete sequence before making a prediction, as is the case with soft attention. The authors evaluate their model on small- and medium-scale speech recognition tasks, where they achieve performance comparable to standard sequential models. #### Notes: - Entropy regularization and baselines were critical to make the model learn - Neat trick: Increase dropout as training progresses - Grid LSTMs outperformed standard LSTMs |
[link]
TLDR; The authors add a reconstruction objective to the standard seq2seq model by adding a "Reconstructor" RNN that is trained to re-generate the source sequence based on the hidden states of the decoder. A reconstruction cost is then added to the cost function and the architecture is trained end-to-end. The authors find that the technique improves upon the baseline both when 1. used during training only and 2. when used as a ranking objective during beam search decoding. #### Key Points - Problem to solve: - Standard seq2seq models tend to under- and over-translate because they don't ensure that all of the source information is covered by the target side. - The MLE objective only captures information from source -> target, which favors short translations. Thus, increasing the beam size actually lowers translation quality - Basic Idea - Reconstruct source sentences from the latent representations of the decoder - Use attention over decoder hidden states - Add MLE reconstruction probability to the training objective - Beam decoding is now a two-phase scheme 1. Generate candidates using the encoder-decoder 2. For each candidate, compute a reconstruction score and use it to re-rank together with the likelihood - Training Procedure - Params Chinese-English: `vocab=30k, maxlen=80, embedding_dim=620, hidden_dim=1000, batch=80`. - 1.25M pairs trained for 15 epochs using Adadelta, then trained with the reconstructor for 10 epochs. - Results: - Model increases BLEU from 30.65 -> 31.17 (beam size 10) when used for training only and decoding stays unchanged - BLEU increases from 31.17 -> 31.73 (beam size 10) when also used for decoding - Model successfully deals with large decoding spaces, i.e. BLEU now increases together with beam size #### Notes - [See this issue for author's comments](https://github.com/dennybritz/deeplearning-papernotes/issues/3) - I feel like "adequacy" is a somewhat strange description of what the authors try to optimize. Wouldn't "coverage" be more appropriate? - In Table 1, why does the BLEU score still decrease when length normalization is applied? The authors don't go into detail on this. - The training curves are a bit confusing/missing. I would've liked to see a standard training curve that shows the MLE objective loss and the finetuning with the reconstruction objective side-by-side. - The training procedure is somewhat confusing. They say "We further train the model for 10 epochs" with the reconstruction objective, but then "we use a trained model at iteration 110k". I'm assuming they do early stopping at 110k * 80 = 8.8M examples. Again, would've liked to see the loss curves for this, not just BLEU curves. - I would've liked to see model performance on more "standard" NMT datasets like EN-FR and EN-DE, etc. - Is there perhaps a smarter way to do reconstruction iteratively by looking at what's missing from the reconstructed output? Training the reconstructor with MLE has some of the same drawbacks as training a standard enc-dec with MLE and teacher forcing. |
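A tiny sketch of the two-phase decoding step 2, assuming each beam candidate comes with its forward likelihood log p(y|x) and a reconstruction score log p(x|y) from the reconstructor; the interpolation weight `lam` is a hypothetical hyperparameter:

```python
def rerank_with_reconstruction(candidates, lam=1.0):
    # candidates: list of (translation, log_p_forward, log_p_reconstruction) tuples
    scored = [(y, fwd + lam * rec) for y, fwd, rec in candidates]
    return max(scored, key=lambda pair: pair[1])[0]  # candidate with the best combined score
```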
[link]
TLDR; The standard attention model does not take into account the "history" of attention activations, even though this should be a good predictor of what to attend to next. The authors augment a seq2seq network with a dynamic memory that, for each input, keeps track of an attention matrix over time. The model is evaluated on English-German and English-Chinese NMT tasks and beats competing models. #### Notes - How expensive is this, and how much more difficult are these networks to train? - Sequentially attending to neighboring words makes sense for some language pairs, but for others it doesn't. This method seems rather restricted because it only takes into account a window of k time steps. |
[link]
The authors propose a framework where a Reinforcement Learning agent decides whether to read the next input word or to produce the next output word, trading off translation quality against time delay (caused by read operations). The reward function is based on both quality (BLEU score) and delay (various metrics and hyperparameters). The authors use Policy Gradient to optimize the model, which is initialized from a pre-trained translation model. They apply the approach to WMT'15 EN-DE and EN-RU translation and show that the model increases translation quality in all settings and is able to trade off effectively between quality and delay. |
[link]
TLDR; The authors propose a new normalization scheme called "Layer Normalization" that works especially well for recurrent networks. Layer Normalization is similar to Batch Normalization, but only depends on a single training case. As such, it's well suited for variable length sequences or small batches. In Layer Normalization, all hidden units in a layer share the same normalization terms, which are computed over the units of that layer for a single example. The authors show through experiments that Layer Normalization converges faster, and sometimes to better solutions, than batch- or unnormalized RNNs. Batch normalization still performs better for CNNs. |
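A minimal numpy sketch of the transform itself: unlike batch norm, the statistics are computed across the hidden units of a single example, so nothing depends on the batch (gain and bias are the usual learned per-unit parameters):

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    # h: (batch, hidden) pre-activations; gain, bias: (hidden,)
    mean = h.mean(axis=1, keepdims=True)   # per-example mean over hidden units
    var = h.var(axis=1, keepdims=True)     # per-example variance over hidden units
    return gain * (h - mean) / np.sqrt(var + eps) + bias
```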
[link]
TLDR; The authors propose a new "Hierarchical Multiscale RNN" (HM-RNN) architecture. This models explicitly learns both temporal and hierarchical (character -> word -> phrase -> ...) representations without needing to be told what the structure or timescale of the hierarchy is. This is done by adding binary boundary detectors at each layer. These detectors activate based on whether the segment in a certain layer is finished or not. Based on the activation of these boundary detectors information is then propagated to neighboring layers. Because this model involves discrete decision making based on binary outputs it is trained using a straight-through estimator. The authors evaluate the model on Language Modeling and Handwriting Sequence Generation tasks, where it outperforms competing models. Qualitatively the authors show that the network learns meaningful boundaries (e.g. spaces) without being needing to be told about them. ### Key Points - Learning both hierarchical and temporal representations at the same time is a challenge for RNNs - Observation: High-level abstractions (e.g. paragraphs) change slowly, but low-level abstractions (e.g. words) change quickly. These should be updated at different timescales. - Benefits of HN-RNN: (1) Computational Efficiency (2) Efficient long-term dependency propagation (vanishing gradients) (3) Efficient resource allocation, e.g. higher layers can have more neurons - Binary boundary detector at each layer is turned on if the segment of the corresponding layer abstraction (char, word, sentence, etc) is finished. - Three operations based on boundary detector state: UPDATE, COPY, FLUSH - UPDATE Op: Standard LSTM update. This happens when the current segment is not finished, but the segment one layer below is finished. - COPY Op: Copies previous memory cell. Happens when neither the current segment nor the segment one layer below is finished. Basically, this waits for the lower-level representation to be "done". - FLUSH Op: Flushes state to layer above and resets the state to start a new segment. Happens when the segment of this layer is finished. - Boundary detector is binarized using a step function. This is non-differentiable and training is done with a straight-through estimator that estimates the gradient using a similar hard sigmoid function. - Slope annealing trick: Gradually increase the slop of the hard sigmoid function for the boundary estimation to make it closer to a discrete step function over time. Needed to be SOTA. - Language Modeling on PTB: Beats state of the art, but not by much. - Language Modeling on other data: Beats or matches state of the art. - Handwriting Sequence Generation: Beats Standard LSTM ### My Notes - I think the ideas in this paper are very important, but I am somewhat disappointed by the results. The model is significantly more complex with more knobs to tune than competing models (e.g. a simple batch-normalized LSTM). However, it just barely beats those simpler models by adding new "tricks" like slope annealing. For example, the slope annealing schedule with a `0.04` constant looks very suspicious. - I don't know much about Handwriting Sequence Generation, but I don't see any comparisons to state of the art models. Why only compare to a standard LSTM? - The main argument is that the network can dynamically learn hierarchical representations and timescales. However, the number of layers implicitly restricts how many hierarchical representations the network can and cannot learn. 
So, there still is a hyperparameter involved here that needs to be set by hand. - One claim is that the model learns boundary information (spaces) without being told about it. That's true, but I'm not convinced that's as novel as the authors make it out to be. I'm pretty sure that a standard LSTM (perhaps with extra skip connections) will learn the same and that it's possible to tease these boundaries out of the LSTM parameter matrices. - Could be interesting to apply this to CJK languages where boundaries and hierarchical representations are more apparent. - The authors claim that "computational efficiency" is one of the main benefits of this model because higher-level representations need to be updated less frequently. However, there are no experiments to verify this claim. Obviously this is true in theory, but I can imagine that in practice this model is actually slower to train. Also, what about convergence time? |
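A PyTorch sketch of the binarized boundary detector with the straight-through estimator and slope annealing, assuming the hard sigmoid form `clip((slope * x + 1) / 2, 0, 1)`; everything else here is an illustrative simplification of the paper's gating:

```python
import torch

def hard_sigmoid(x, slope):
    return torch.clamp((slope * x + 1.0) / 2.0, min=0.0, max=1.0)

def boundary_detector(z_pre, slope):
    z_soft = hard_sigmoid(z_pre, slope)     # differentiable surrogate; slope is annealed upwards over training
    z_hard = (z_soft > 0.5).float()         # binary boundary decision (step function)
    # straight-through estimator: forward pass uses z_hard, gradients flow through z_soft
    return z_hard + z_soft - z_soft.detach()
```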
[link]
TLDR; The authors train a multilingual Neural Machine Translation (NMT) system based on the Google NMT architecture by prepending a special `2[lang]` (e.g. `2fr`) token to the input sequence to specify the target language. They empirically evaluate model performance on many-to-one, one-to-many and many-to-many translation tasks and demonstrate evidence for shared representations (interlingua). |
[link]
TLDR; The authors present an RNN-based variational autoencoder that can learn a latent sentence representation while learning to decode. A linear layer that predicts the parameters of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence with the prior (Gaussian) - just as in the "standard" VAE. The authors evaluate the model on Language Modeling and Imputation (inserting missing words) tasks and also present a qualitative analysis of the latent space. #### Key Points - Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero. - Training Trick 1: KL Cost Annealing. During training, increase the weight on the KL term of the cost to anneal from a vanilla autoencoder to a VAE. - Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation. - Results on Language Modeling: The standard model (without cost annealing and word dropout) trails the vanilla RNNLM model, but not by much. The KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously) - Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNLM has nothing to condition on and relies on the unigram distribution for the first token. - Qualitative: Can use higher word dropout to get more diverse sentences - Qualitative: Can walk the latent space and get grammatical and meaningful sentences. |
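A small sketch of the two training tricks, with a hypothetical linear annealing schedule and a word-dropout helper (names and the schedule are illustrative, not taken from the paper):

```python
import numpy as np

def kl_weight(step, warmup_steps=10000):
    # Trick 1: anneal the weight on the KL term from 0 to 1 over training
    return min(1.0, step / warmup_steps)

def word_dropout(token_ids, keep_rate, unk_id):
    # Trick 2: randomly replace decoder inputs with UNK so the decoder must rely on the latent code
    token_ids = np.asarray(token_ids)
    keep = np.random.rand(len(token_ids)) < keep_rate
    return np.where(keep, token_ids, unk_id)

# per-example loss at a given step: reconstruction_nll + kl_weight(step) * kl_divergence
```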
[link]
TLDR; The authors propose "fast weights", a type of attention mechanism to the recent past that performs multiple steps of computation between each hidden state computation step in an RNN. The authors evaluate their architecture on various tasks that require short-term memory, arguing that the fast weights mechanism frees up the RNN from memorizing sthings in the hidden state which is freed up for other types of computation. ### Key Points - Currently, RNNs have slow-changing long-term memory (Permanent Weights) and fast changing short-term memory (hidden state). We want something in the middle: Fast weights with higher storage capacity. - For each transition in the RNN, multiple transitions can be made by the fast weights. They are a kind of attention mechanism to the recent past that is not parameterized separately but depends on the past states. - Fast weights are decayed over time and based on the outer product of previous hidden states: `A(t+1) = lambdaA(t) + eta*h(t)h(t)^T`. - The next hidden state of the RNN is computed by a regular transition based on input adn previous state combined by an "inner loop" of S steps of the fast weights. - "At each iteration of the inner loop the fast weight matrix A is eqivalent to attending to past hidden vectors in proportion to their scalar product with the current hidden state, weighted by a decay factor" - And this is efficient to compute. - Added Layer Normalization to fast weights to prevent exploding/vanishign gradients. - Associative Retrieval Toy Task: Memorize recent key-value pairs. Fast weights siginifcantly outperform RNN, LSTM and Associative LSTM. - Visual Attention on MNIST: Beats RNN/LSTM and is comparable to CovnNet for large number of features. - Agents with Memory: Fast Weight net learns faster in a partially obseverable environment where the networks must remember the previous states. ### Thoughts -Overall I think this is very exciting work. It kind of reminds me of Adaptive Computation Time where you dynamically decide how many steps to "ponder" before making another outputs. However, it is also quite different in that this work explicitly "attends" over past states and isn't really about computation time. - In the experiments the authors say they set S=1 (i.e. just one inner loop step). Why is that? I thought one of the more important points of fast weights would be to have additional computation betwene each slow step. This also raises the question of how to pick this hyperparameter. - A lot of references to Machine Translation models with attention but not NLP experiments. |
[link]
TLDR; The authors prorpose the "EpiReader" model for Question Answering / Machine Comprehension. The model consists of two modules: An Extractor that selects answer candidates (single words) using a Pointer network, and a Reasoner that rank these candidates by estimating textual entailment. The model is trained end-to-end and works on cloze-style questions. The authors evaluate the model on CBT and CNN datasets where they beat Attention Sum Reader and MemNN architectures. #### Notes - In most architectures, the correct answer is among the top5 candidates 95% of the time. - Soft Attention is a problem in many architectures. Need a way to do hard attention. |
[link]
TLDR; The authors present an end-to-end dialog system that consists of an LSTM, action templates, an entity extraction system, and custom code for declaring business rules. They test the system on a toy task where the goal is to call a person from an address book. They train the system on 21 dialogs using Supervised Learning, and then optimize it using Reinforcement Learning, achieving 70% task completion rates. #### Key Points - Task: User asks to call a person. Action: Find the person in the address book and place the call - 21 example dialogs - Several hundred lines of Python code to block certain actions - External entity recognition API - Hand-crafted features as input to the LSTM. Hand-crafted action templates. - The RNN maps from a sequence to an action template. First pre-train the LSTM to reproduce dialogs using Supervised Learning, then train using RL / policy gradients - The system doesn't generate text, it picks a template #### Notes - I wonder how well the system would generalize to a task that has a larger action space and more varied conversations. The 21 provided dialogs cover a lot of the task space already. Much harder to do that in larger spaces. - I wouldn't call this approach end-to-end ;) |
[link]
TLDR; The authors finetune FR -> EN and EN -> FR NMT models using an RL-based dual game. 1. Pick a French sentence from a monolingual corpus and translate it to EN. 2. Use an EN language model to get a reward for the translation. 3. Translate the translation back into FR using the EN -> FR system. 4. Get a reward based on the consistency between the original and reconstructed sentence. Training this architecture using Policy Gradient, the authors can make efficient use of monolingual data and show that a system trained on only 10% of the parallel data and finetuned with monolingual data achieves BLEU scores comparable to a system trained on the full set of parallel data. ### Key Points - Making efficient use of monolingual data to improve NMT systems is a challenge - Two-agent communication game: Agent A only knows language A and agent B only knows language B. A sends a message through a noisy translation channel, B receives the message, checks its correctness, and sends it back through another noisy translation channel. A checks if it is consistent with the original message. The translation channels are then improved based on the feedback. - Pieces required: LanguageModel(A), LanguageModel(B), TranslationModel(A->B), TranslationModel(B->A). Monolingual data. - Total reward is a linear combination of: `r1 = LM(translated_message)`, `r2 = log(P(original_message | translated_message))` - Samples are based on beam search, using the average value as the gradient approximation - EN -> FR pretrained on 100% of parallel data: 29.92 to 32.06 BLEU - EN -> FR pretrained on 10% of parallel data: 25.73 to 28.73 BLEU - FR -> EN pretrained on 100% of parallel data: 27.49 to 29.78 BLEU - FR -> EN pretrained on 10% of parallel data: 22.27 to 27.50 BLEU ### Some Notes - I think the idea is very interesting and we'll see a lot of related work coming out of this. It would be even more amazing if the architecture were trained from scratch using monolingual data only. Due to the high variance of RL methods this is probably quite hard to do though. - I think the key issue is that the rewards are quite noisy, as is the case with MT in general. Neither the language model nor the BLEU score gives good feedback for the "correctness" of a translation. - I wonder why there is such a huge jump in BLEU scores for FR->EN on 10% of data, but not for EN->FR on the same amount of data. |
[link]
TLDR; The authors train a DQN on text-based games. The main difference is that their Q-value function embeds the state (textual context) and action (text-based choice) separately and then takes the dot product between them. The authors call this a Deep Reinforcement Relevance Network (DRRN). Basically, it's just a different Q function implementation. Empirically, the authors show that their network can learn to solve the "Saving John" and "Machine of Death" text games. |
[link]
TLDR; The authors propose a new Diverse Beam Search (DBS) decoding procedure that produces more diverse responses than standard Beam Search (BS). The authors divide the beam of size B into G groups of size B/G. At each step they perform beam search for each group with an added similarity penalty (with scaling factor lambda) that encourages groups to pick different outputs. This procedure is done greedily, i.e. group 1 does regular BS, group 2 is conditioned on group 1, group 3 is conditioned on groups 1 and 2, and so on. Similarity functions include Hamming distance, cumulative diversity, n-gram diversity and neural embedding diversity. Hamming distance tends to perform best. The authors evaluate their model on Image Captioning (COCO, PASCAL-50S), Machine Translation (europarl) and Visual Question Generation. For Image Captioning the authors perform a human evaluation (1000 examples on Mechanical Turk) and find that DBS is preferred over BS 60% of the time. |
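A small sketch of the Hamming-distance penalty for a single decoding step: candidates in later groups are penalized for tokens already emitted at this time step by earlier groups (a pure illustration, not the authors' implementation):

```python
from collections import Counter

def diversity_adjusted_scores(token_logprobs, earlier_group_tokens, lam=0.5):
    # token_logprobs: dict token -> log-prob for the current group at this time step
    # earlier_group_tokens: tokens chosen at this time step by previously processed groups
    counts = Counter(earlier_group_tokens)
    return {tok: lp - lam * counts[tok] for tok, lp in token_logprobs.items()}
```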
[link]
TLDR; The authors encourage exploration by adding a pseudo-reward of the form $\frac{\beta}{\sqrt{count(state)}}$ for infrequently visited states. State visits are counted using Locality Sensitive Hashing (LSH) based on an environment-specific feature representation like raw pixels or autoencoder representations. The authors show that this simple technique achieves gains in various classic RL control tasks and several games in the ATARI domain. While the algorithm itself is simple, there are now several more hyperparameters to tune: the bonus coefficient `beta`, the LSH hashing granularity (how many bits to use for hashing), as well as the type of feature representation the hash is computed from, which itself may have more parameters. The experiments don't paint a consistent picture and different environments seem to need vastly different hyperparameter settings, which in my opinion will make this technique difficult to use in practice. |
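A numpy sketch of the count-based bonus with SimHash-style LSH: project the state features with a fixed random matrix, keep the signs as a k-bit code, count visits per code, and hand out `beta / sqrt(count)` as a pseudo-reward (all names are illustrative):

```python
import numpy as np
from collections import defaultdict

class HashingBonus:
    def __init__(self, feature_dim, n_bits=32, beta=0.01, seed=0):
        rng = np.random.RandomState(seed)
        self.A = rng.randn(n_bits, feature_dim)  # fixed random projection (SimHash)
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, features):
        code = tuple((self.A @ features > 0).astype(int))  # k-bit hash of the state features
        self.counts[code] += 1
        return self.beta / np.sqrt(self.counts[code])      # exploration bonus for rare states
```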
[link]
TLDR; The authors introduce CopyNet, a variation of the seq2seq model that incorporates a "copying mechanism". With this mechanism, the effective vocabulary is the union of the standard vocab and the words in the current source sentence. CopyNet predicts words based on a mixed probability of the standard attention mechanism and a new copy mechanism. The authors show empirically that on toy and summarization tasks CopyNet behaves as expected: The decoder is dominated by copy mode when it tries to replicate something from the source. |
[link]
TLDR; The authors propose a character-level Neural Machine Translation (NMT) architecture. The encoder is a convolutional network with max-pooling and highway layers that reduces the size of the source representation. It does not use explicit segmentation. The decoder is a standard RNN. The authors apply their model to WMT'15 DE-EN, CS-EN, FI-EN and RU-EN data in bilingual and multilingual settings. They find that their model is competitive in bilingual settings and significantly outperforms competing models in the multilingual setting with a shared encoder. #### Key Points - Challenge: Applying standard seq2seq models to characters is hard because the representation is too long. Attention network complexity grows quadratically with sequence length. - Word-level models are unable to model rare and out-of-vocab tokens, and softmax complexity grows with vocabulary size. - Character-level models are more flexible: No need for explicit segmentation, can model morphological variants, multilingual without increasing model size. - Reducing the length of the source sentence is key to fast training in char models. - Encoder Network: Embedding -> Conv -> Maxpool -> Highway -> Bidirectional GRU - Attention Network: Single Layer - Decoder: Two-Layer GRU - Multilingual setting: Language examples are balanced within each batch. No language identifier is provided to the encoder - Bilingual Results: char2char performs as well as or better than bpe2char or bpe2bpe - Multilingual Results: char2char outperforms bpe2char - The trained model is robust to spelling mistakes and unseen morphologies - Training time: Single Titan X training time for the bilingual model is ~2 weeks. ~2.5 updates per second with batch size 64. #### Notes - I wonder if you can extract segmentation info from the network post training. |
[link]
TLDR; The authors present a novel Attention-over-Attention (AoA) model for Machine Comprehension. Given a document and a cloze-style question, the model predicts a single-word answer. The model 1. Embeds both document and query using a bidirectional GRU 2. Computes a pairwise matching matrix between document and query words 3. Computes query-to-document attention values 4. Computes document-to-query attention averages for each query word 5. Multiplies the two attention vectors to get final attention scores for words in the document 6. Maps attention results back into the vocabulary space. The authors evaluate the model on the CNN News and CBTest Question Answering datasets, obtaining state-of-the-art results and beating other models including EpiReader, ASReader, etc. #### Notes: - Very good model visualization in the paper - I like that this model is much simpler than EpiReader while also performing better |
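A numpy sketch of steps 2-5 (the attention-over-attention computation itself), assuming the bidirectional GRU states for document and query are already given:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(doc_states, query_states):
    # doc_states: (D, h), query_states: (Q, h)
    M = doc_states @ query_states.T   # (D, Q) pairwise matching matrix
    alpha = softmax(M, axis=0)        # query-to-document attention (one column per query word)
    beta = softmax(M, axis=1)         # document-to-query attention (one row per document word)
    beta_avg = beta.mean(axis=0)      # (Q,) averaged document-to-query attention
    return alpha @ beta_avg           # (D,) final attention score per document position
```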
[link]
TLDR; The authors propose Associative LSTMs, a combination of external memory based on Holographic Reduced Representations and LSTMs. The memory provides noisy key-value lookup based on matrix multiplications without introducing additional parameters. The authors evaluate their model on various sequence copying and memorization tasks, where it outperforms vanilla LSTMs and competing models with a similar number of parameters. #### Key Points - Two limitations of LSTMs: 1. N cells require NxN weight matrices. 2. They lack a mechanism to index memory - The idea of the memory comes from "Holographic Reduced Representations" (Plate, 2003), but the authors add multiple redundant memory copies to reduce noise. - More copies of memory => Less noise during retrieval - In the LSTM update equations, input and output keys to the memory are computed - Compared to: LSTM, Permutation LSTM, Unitary LSTM, Multiplicative Unitary LSTM - Tasks: Episodic copy, XML modeling, variable assignment, arithmetic, sequence prediction #### Notes - Only a brief comparison with Neural Turing Machines in the appendix. Probably NTMs outperform this and are simpler. No comparison with attention mechanisms, memory networks, etc. Why? - It's surprising to me that deep LSTMs without any bells and whistles actually perform pretty well on many of the tasks. Is the additional complexity really worth it? |
[link]
TLDR; The authors apply adversarial training on labeled data and virtual adversarial training on unlabeled data to the embeddings in text classification tasks. Their models, which are straightforward LSTM architectures, either match or surpass the current state of the art on several text classification tasks. The authors also show that the embeddings learned using adversarial training tend to be tuned better to the corresponding classification task. #### Key Points - In image classification we can apply adversarial training directly to the inputs. In text classification the inputs are discrete and we cannot make small perturbations, but we can instead apply adversarial training to the embeddings. - Trick: To prevent the model from making the perturbations irrelevant by learning embeddings with large norms: Use normalized embeddings. - Adversarial Training (on labeled examples) - At each step of training, identify the "worst" (in terms of cost) perturbation `r_adv` to the embeddings within a given norm constraint epsilon, which is a hyperparameter. Train on that. In practice `r_adv` is estimated using a linear approximation. - Add an `L_adv` adversarial loss term to the cost function. - Virtual Adversarial Training (on unlabeled examples) - Minimize the KL divergence between the outputs of the model given the regular and perturbed example as inputs. - Add an `L_vad` loss to the cost function. - Common misconception: Adversarial training is equivalent to training on noisy examples, but it actually is a stronger regularizer because it explicitly increases the cost. - Model Architectures: - (1) Unidirectional LSTM with the prediction made at the last step - (2) Bidirectional LSTM with predictions based on the concatenated last outputs - Experiments/Results - Pre-Training: For all experiments a 1-layer LSTM language model is pre-trained on all labeled and unlabeled examples and used to initialize the classification LSTM. - Baseline Model: Only embedding dropout and pretraining - IMDB: Training curves show that adversarial training acts as a good regularizer and prevents overfitting. VAT matches state of the art using a unidirectional LSTM only. - IMDB embeddings: The baseline model places "good" close to "bad" in embedding space. Adv. training ensures that small perturbations in embeddings don't change the sentiment classification result, so these two words become properly separated. - Amazon Reviews and RCV1: Adv. + Vadv. achieve state of the art. - Rotten Tomatoes: Adv. + Vadv. achieve state of the art. Because unlabeled data overwhelms labeled data, vadv. training results in a decrease in performance. - DBPedia: Even the baseline outperforms state of the art (better optimizer?), adversarial training improves on that. ### Thoughts - I think this is a very well-written paper with impressive results. The only thing that's lacking is a bit of consistency. Sometimes pure virtual adversarial training wins, and sometimes adversarial + virtual adversarial wins. Sometimes bi-LSTMs make things worse, sometimes better. What is the story behind that? Do we really need to try all combinations to figure out what works for a given dataset? - Not a big deal, but a few bi-LSTM experiments seem to be missing. This always makes me wonder if they are "missing for a reason" or not ;) - There are quite a few differences in hyperparameters and batch sizes between datasets. I wonder why. Is this to stay consistent with the models they compare to? Were these parameters optimized on a validation set (the authors say only dropout and epsilon were optimized)?
- If adversarial training is a stronger regularizer than random perturbations, I wonder if we still need dropout in the embeddings. Shouldn't adversarial training take care of that? |
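A PyTorch sketch of the adversarial term on labeled data: take the gradient of the loss with respect to the (normalized) embeddings, move `epsilon` along its normalized direction (the linear approximation of the worst-case perturbation), and compute the loss on the perturbed embeddings; the model interface is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, embeds, labels, epsilon):
    # embeds: (batch, seq_len, dim) normalized word embeddings with requires_grad=True
    loss = F.cross_entropy(model(embeds), labels)
    grad, = torch.autograd.grad(loss, embeds, retain_graph=True)
    norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1) + 1e-12
    r_adv = epsilon * grad / norm                           # linear approximation of the worst perturbation
    return F.cross_entropy(model(embeds + r_adv.detach()), labels)  # L_adv, added to the regular loss
```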
[link]
TLDR; The authors propose to use the Actor-Critic framework from Reinforcement Learning for sequence prediction. They train an actor (policy) network to generate a sequence together with a critic (value) network that estimates the q-value function. Crucially, the actor network does not see the ground-truth output, but the critic does. This is different from LL (log likelihood) training, where errors are likely to cascade. The authors evaluate their framework on an artificial spelling correction task and a real-world German-English Machine Translation task, beating baselines and competing approaches in both cases. #### Key Points - In LL training, the model is conditioned on its own guesses during search, leading to error compounding. - The critic is allowed to see the ground truth, but the actor isn't - The reward is a task-specific score, e.g. BLEU - Use a bidirectional RNN for both actor and critic. The actor uses a soft attention mechanism. - The reward is partially received at each intermediate step, not just at the end - The framework is analogous to TD-Learning in RL - Trick: Use an additional target network to compute q_t (see Deep-Q paper) for stability - Trick: Use a delayed actor (as in the Deep Q paper) for stability - Trick: Put a constraint on the critic to deal with large action spaces (is this analogous to advantage functions?) - Pre-train actor and critic to encourage exploration of the right space - Task 1: Correct a corrupted character sequence. AC outperforms LL training. Longer sequences lead to a stronger lift. - Task 2: GER-ENG Machine Translation: Beats LL and Reinforce models - Qualitatively, the critic assigns high values to words that make sense - BLEU scores during training are lower than those of the LL model - Why? Strong regularization? Can't overfit the training data. #### Notes - Why does the sequence length for spelling correction only go up to 30? This seems very short to me and something that an LSTM should be able to handle quite easily. Would've liked to see much longer sequences. |
[link]
TLDR; The paper presents Adaptive Computation Time (ACT), an algorithm that allows RNNs to adaptively decide how much computation to expend per time step, also called "pondering". To prevent the network from computing indefinitely, an extra term that encourages shorter computation is added to the cost. The architecture is fully differentiable and applicable to any type of RNN (e.g. LSTMs). The authors evaluate ACT on the tasks of Parity, Logic, Addition, Sorting and character prediction. An interesting observation is that the number of pondering steps seems to predict "boundaries" in the data. |
[link]
TLDR; The authors augment the A3C (Asynchronous Advantage Actor Critic) algorithm with auxiliary tasks. These tasks share some of the network parameters but value functions for them are learned off-policy using n-step Q-Learning. The auxiliary tasks are only used to learn a better representation and don't directly influence the main policy control. The technique, called UNREAL (Unsupervised Reinforcement and Auxiliary Learning), outperforms A3C on both the Atari and Labyrinth domains in terms of performance and training efficiency. #### Key Points - Environments contain a wide variety of possible training signals, not just cumulative reward - The base A3C agent uses a CNN + RNN - Auxiliary control and prediction tasks share the convolutional and LSTM networks of the base agent. This forces the agent to balance improvement on the base and aux. tasks. - Auxiliary Tasks - Use off-policy RL algorithms (e.g. n-step Q-Learning) so that the same stream of experience from the base agent can be used for maximizing all tasks. Experience is sampled from a replay buffer. - Pixel Changes (Auxiliary Control): Learn a policy for maximally changing the pixels in a grid of cells overlaid over the images - Network Features (Auxiliary Control): Learn a policy for maximally activating units in a specific hidden layer - Reward Prediction (Auxiliary Reward): Predict the next reward given some historical context. Crucially, because rewards tend to be sparse, histories are sampled in a skewed manner from the replay buffer so that P(r!=0) = 0.5. Convolutional features are shared with the base agent. - Value Function Replay: Value function regression for the base agent with a varying window for n-step returns. - UNREAL - The base agent is optimized on-policy (A3C) and the aux. tasks are optimized off-policy. - Experiments - The agent is trained with 20-step returns and aux. tasks are performed every 20 steps. - The replay buffer stores the most recent 2k observations, actions and rewards - UNREAL tends to be more robust to hyperparameter settings than A3C - Labyrinth - 38% -> 83% human-normalized score. Each aux. task independently adds to the performance. - Significantly faster learning, 11x across all levels - Compared to an input reconstruction technique: Input reconstruction hurts final performance b/c it puts too much focus on reconstructing irrelevant parts of the visual input. - Atari - Not all experiments are completed yet, but UNREAL already surpasses state of the art agents and is more robust. #### Thoughts - I want an algorithm box please :) |
[link]
TLDR; The authors adapt Generative Adversarial Networks (GANs) to RNNs and train a discriminator to distinguish between sequences generated using teacher forcing (feeding ground-truth inputs to the RNN) and free-running sampling (feeding generated outputs as the next inputs). The inputs to the discriminator are both the predictions and the hidden states of the generative RNN. The generator is trained to fool the discriminator, forcing the dynamics of teacher forcing and free-running generation to become more similar. This procedure acts as a regularizer and results in better sample quality and generalization, particularly for long sequences. The authors evaluate their framework on Language Modeling (PTB), Pixel Generation (Sequential MNIST), Handwriting Generation, and Music Synthesis. ### Key Points - Problem: During inference, errors in an RNN easily compound because the conditioning context may diverge from what is seen during training when the ground-truth labels are fed as inputs (teacher forcing). - Goal of professor forcing: Make the generative (free-run) behavior and the teacher-forced behavior match as closely as possible. - Discriminator Details - Input is a behavior sequence `B(x, y, theta)` from the generative RNN that contains the hidden states and outputs. - The training objective is to correctly classify whether or not a behavior sequence is generated using teacher forcing vs. free-running sampling. - Generator - Standard RNN with MLE training objective and an additional term to fool the discriminator: Change the free-running behavior so as to match the teacher-forced behavior while keeping the latter constant. - Optional second term: Change the teacher-forced behavior to match the free-running behavior. - Like GAN, backprop from the discriminator into the generator. - Architectures - The generator is a standard GRU Recurrent Neural Network with softmax - The behavior function `B(x, y, theta)` outputs the pre-tanh activation of the GRU states and the softmax output - Discriminator: Bidirectional GRU with a 3-layer MLP on top - Training trick: To prevent "bad gradients" the authors backprop from the discriminator into the generator only if the classification accuracy is between 75% and 99%. - Trained using the Adam optimizer - Experiments - PTB Character-Level Modeling: Reduction in test NLL, professor forcing seems to act as a regularizer. 1.48 BPC - Sequential MNIST: Second-best NLL (79.58) after PixelCNN - Handwriting generation: Professor forcing is better at generating sequences longer than those seen during training, as per human eval. - Music Synthesis: Human eval significantly better for professor forcing - Negative results on word-level modeling: Professor forcing doesn't have any effect. Perhaps because long-term dependencies are more pronounced in character-level modeling. - The authors show using t-SNE that the hidden state distributions actually become more similar when using professor forcing ### Thoughts - Props to the authors for a very clear and well-written paper. This is rarer than it should be :) - It's an interesting idea to also match the states of the RNN instead of just the outputs. Intuitively, matching the outputs should implicitly match the state distribution. I wonder if the authors tried this and it didn't work as expected.
- Note from [Ethan Caballero](https://github.com/ethancaballero) about why they chose to match hidden states: It's significantly harder to use GANs on sampled (argmax) output tokens because they are discrete (as opposed to continuous like the hidden states and their respective softmaxes). They would have had to estimate discrete outputs with policy gradients like in [seqGAN](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/seq-gan.md), which is [harder to get to converge](https://www.quora.com/Do-you-have-any-ideas-on-how-to-get-GANs-to-work-with-text), which is why they probably just stuck with the hidden states, which already contain info about the discrete sampled outputs (the index of the highest probability in the distribution) anyway. The Professor Forcing method is unique in that one has access to the continuous probability distribution of each token at each timestep of the two sequence generation modes being pushed closer together. Conversely, when applying GANs to pushing real samples and generated samples closer together as is traditionally done in models like seqGAN, one only has access to the next discrete token (not the continuous probability distribution of the next token) at each timestep, which prevents straightforward differentiation (used in professor forcing) from being applied and forces one to use policy gradient estimation. However, there's a chance one might be able to use straightforward differentiation to train seqGANs in the traditional sampling case if one swaps out each discrete sampled token with its continuous distributional word embedding (from pretrained word2vec, GloVe, etc.), but no one has tried it yet TTBOMK. - I would've liked to see a comparison of the two regularization terms in the generator. The experiments don't make it clear if both or only one of them is used. - I'm guessing that this architecture is quite challenging to train. Would've liked to see a bit more detail about when/how they trade off the training of discriminator and generator. - Translation is another obvious task to apply this to. I'm interested in whether or not this works for seq2seq. |
[link]
TLDR; The authors train a Generative Adversarial Network where the generator is an RNN producing discrete tokens. The discriminator is used to provide a reward for each generated sequence (episode) and to train the generator network via Policy Gradients. The discriminator network is a CNN in the experiments. The authors evaluate their model on a synthetic language modeling task and 3 real-world tasks: Chinese poem generation, speech generation and music generation. SeqGAN outperforms competing approaches (MLE, Scheduled Sampling, PG-BLEU) on the synthetic task and outperforms MLE on the real-world tasks based on a BLEU evaluation metric. #### Key Points - Code: https://github.com/LantaoYu/SeqGAN - RL problem setup: The state is the already generated partial sequence. The action space is the space of possible tokens to output at the current step. Each episode is a fully generated sequence of fixed length T. - Exposure bias in the Maximum Likelihood approach: During decoding the model generates the next token based on a series of previously generated tokens that it may never have seen during training, leading to compounding errors. - A discriminator can provide a reward when no task-specific reward (e.g. BLEU score) is available or when it is expensive to obtain such a reward (e.g. human eval). - The reward is provided by the discriminator at the end of each episode, i.e. when the full sequence is generated. To provide feedback at intermediate steps the rest of the sequence is sampled via Monte Carlo search. - Generator and discriminator are trained alternately and the strategy is defined by the hyperparameters g-steps (# of steps to train the generator), d-steps (number of steps to train the discriminator with newly generated data) and k (number of epochs to train the discriminator with the same set of generated data). - Synthetic task: A randomly initialized LSTM acts as an oracle for a language modeling task. 10,000 sequences of length 20. - The hyperparameters g-steps, d-steps and k have a huge impact on training stability and final model performance. Bad settings lead to a model that is barely better than the MLE baseline. #### My notes: - Great paper overall. I also really like the synthetic task idea, I think it's a neat way to compare models. - For the real-world tasks I would've liked to see a comparison to PG-BLEU as they do in the synthetic task. The authors evaluate on BLEU score so I wonder how much difference a direct optimization of the evaluation metric makes. - It seems like SeqGAN outperforms MLE significantly only on the poem generation task, not the other tasks. What about the other baselines on the other tasks? What is it about poem generation that makes SeqGAN perform so well? |
[link]
TLDR; The authors train a standard Neural Machine Translation (NMT) model (the teacher model) and distill it by having a smaller student model learn the distribution of the teacher model. They investigate three types of knowledge distillation for sequence models: 1. Word-Level Distillation 2. Sequence-Level Distillation and 3. Sequence-Level Interpolation. Experiments on WMT'14 and IWSLT 2015 show that it is possible to significantly reduce the number of parameters of the model with only a minor loss in BLEU score. The experiments also demonstrate that the distillation techniques are largely complementary. Interestingly, the perplexity of distilled models is significantly higher than that of the baselines without leading to a loss in BLEU score. ### Key Points - Knowledge Distillation: Learn a smaller student network from a larger teacher network. - Approach 1 - Word-Level KD: This is standard Knowledge Distillation applied to sequences, where we match the student's output distribution for each word to the teacher's using the cross-entropy loss. - Approach 2 - Sequence-Level KD: We want to mimic the distribution of a full sequence, not just per word. To do that we sample outputs from the teacher using beam search and then train the student on these "examples" using cross entropy. This is a very sparse approximation of the true objective. - Approach 3 - Sequence-Level Interpolation: We train the student on a mixture of training data and teacher-generated data. We could use the approximation from #2 here, but that's not ideal because it doubles the size of the training data and leads to different targets conditioned on the same source. The solution is to generate a single target that has high probability under the teacher model and is similar to the ground truth, and then have both mixture terms use it. - Greedy decoding with a seq-level fine-tuned model behaves similarly to beam search on the original model. - Hypothesis: KD allows the student to only model the mode of the teacher distribution, not wasting parameters on the rest. Experiments show good evidence of this. Thus, greedy decoding has an easier time finding the true max, whereas beam search was necessary to do that previously. - Lower perplexity does not lead to better BLEU. Distilled models have significantly higher perplexity (22.7 vs 8.2) but better BLEU (+4.2). |
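A PyTorch sketch of the word-level distillation term (approach 1): the student is trained with cross-entropy against the teacher's per-word distribution, optionally mixed with the regular NLL on the gold targets; the mixing weight `alpha` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, targets, alpha=0.5):
    # logits: (batch * seq_len, vocab); targets: (batch * seq_len,) gold token ids
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    kd = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    nll = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * nll
```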
[link]
TLDR; The authors apply a [WaveNet](https://arxiv.org/abs/1609.03499)-like architecture to the task of Machine Translation. Encoder ("Source Network") and Decoder ("Target Network") are CNNs that use Dilated Convolutions and they are stacked on top of each other. The Target Network uses [Masked Convolutions](https://arxiv.org/abs/1606.05328) to ensure that it only relies on information from the past. Crucially, the time complexity of the network is `O(|S| + |T|)`, which is cheaper than that of the common seq2seq attention architecture (`O(|S|*|T|)`). Through dilated convolutions the network has constant path lengths between [source input -> target output] and [target input -> target output] nodes. This allows for efficient propagation of gradients. The authors evaluate their model on Character-Level Language Modeling and Character-Level Machine Translation (WMT EN-DE) and achieve state-of-the-art on the former and a competitive BLEU score on the latter. ### Key Points - Problems with current approaches - Runtime is not linear in the length of the source/target sequence. E.g. seq2seq with attention is `O(|S|*|T|)`. - Some architectures compress the source into a fixed-length "thought vector", putting a memorization burden on the model. - RNNs are hard to parallelize - ByteNet: Stacked network of encoder/decoder. In this work the authors use CNNs, but the networks could be RNNs. - ByteNet properties: - Resolution preserving: The representation of the source sequence is linear in the length of the source. Thus, a longer source sentence will have a bigger representation. - Runtime is linear in the length of the source and target sequences: `O(|S| + |T|)` - The source network can be run in parallel, it's a CNN. - The distance (number of hops) between nodes in the network is short, allowing for efficient backprop. - Architecture Details - Dynamic Unfolding: `representation_t(source)` is fed into time step `t` of the target network. Anything past the source sequence length is zero-padded. This is possible due to the resolution preserving property which ensures that the source representation is the same width as the source input. - Masked 1D Convolutions: The target network uses masked convolutions to prevent it from looking at the future during training. - Dilation: Dilated Convolutions increase the receptive field exponentially in higher layers. This leads to short connection paths for efficient backprop. - Each layer is wrapped in a residual block, either with ReLUs or multiplicative units (depending on the task). - Sub-Batch Normalization: To prevent the target network from conditioning on future tokens (similar to masked convolutions) a new variant of Batch Normalization is used. - Recurrent ByteNets, i.e. ByteNets with RNNs instead of CNNs, are possible but are not evaluated. - Architecture Comparison: Table 1 is great. It compares various enc-dec architectures across runtime, resolution preserving and path length properties. - Character Prediction Experiments: - [Hutter Prize Version of Wikipedia](http://prize.hutter1.net/): ~90M characters - Sample a batch of 515 characters and predict the latter 200 from the first 315 - New SOTA: 1.33 NLL (bits/character) - Character-Level Machine Translation - [WMT](http://www.statmt.org/wmt16/translation-task.html) EN-DE. Vocab size ~140 - Bags of character n-grams as additional embeddings - Examples are bucketed according to length - BLEU: 18.9.
Current state of the art is ~22.8 and a standard attention enc-dec gets 20.6 ### Thoughts - Overall I think this is a very interesting contribution. The ideas here are pretty much identical to the [WaveNet](https://arxiv.org/abs/1609.03499) + [PixelCNN](https://arxiv.org/abs/1606.05328) papers. This paper doesn't have much detail on any of the techniques, no equations whatsoever. Implementing the ByteNet architecture based on the paper alone would be very challenging. The fact that there's no code release makes this worse. - One of the main arguments is the linear runtime of the ByteNet model. I would've liked to see experiments that compare implementations in frameworks like Tensorflow to standard seq2seq implementations. What is the speedup in *practice*, and how does it scale with increased parallelism? Theory is good and all, but I want to know how fast I can train with this. - Through dynamic unfolding, target inputs at time t depend directly on the source representation at time t. This makes sense for language pairs that are well aligned (e.g. English/German), but it may hurt performance for pairs that are not aligned, since the path length would be longer. Standard attention seq2seq on the other hand always has a fixed path length of 1. Experiments on this would've been nice. - I wonder how much difference the "bag of character n-grams" made in the MT experiments. Is this used by the other baselines? |
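A PyTorch sketch of the basic building block of the target network, a masked (causal) dilated 1D convolution: left-padding by `(kernel_size - 1) * dilation` guarantees that the output at step t only depends on inputs up to t (a sketch of the idea, not the full ByteNet residual stack):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))   # pad only on the left -> causal/masked
        return self.conv(x)
```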
[link]
TLDR; The authors introduce Batch Normalization, a technique to normalize unit activations to zero mean and unit variance within the network. The authors show that in Feedforward and Convolutional Networks, Batch Normalization leads to faster training and better accuracies. BN also acts as a regularizer, reducing the need for Dropout, etc. Using an ensemble of batch normalized networks the authors achieve state of the art on ILSVRC. #### Key Points - Network training is complicated because the input distributions to higher layers change as the parameters in lower layers change: Internal Covariate Shift. Solution: Normalize within the network. - BN: Normalize the input to the nonlinearity to have zero mean and unit variance. Then add two additional parameters (scaling and bias) per unit to preserve the expressiveness of the network. Statistics are calculated per minibatch. - Network parameters increase, but not by much: 2 parameters per unit that has batch normalization applied to it. - Works well for fully connected and convolutional layers. Authors didn't try RNNs. - Changes to make when adding BN: Increase learning rate, remove/decrease dropout and l2 regularization, accelerate learning rate decay, shuffle training examples more thoroughly.
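For reference, a minimal numpy sketch of the BN transform for a fully connected layer (the names `gamma`/`beta` for the two learned parameters are mine):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a minibatch.

    x: pre-nonlinearity activations, shape (batch, units)
    gamma, beta: learned scale and shift, shape (units,)
    """
    mu = x.mean(axis=0)                    # per-unit minibatch mean
    var = x.var(axis=0)                    # per-unit minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale/shift restores expressiveness

x = np.random.randn(64, 100) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())  # approximately 0 and 1
```

At test time the minibatch statistics are replaced by population estimates accumulated during training.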
|
[link]
TLDR; The authors propose two different architectures to improve the performance of character-level RNNs. In the first architecture ("mixed") the authors condition the model on the state of a word-level RNN. In the second architecture ("cond") they condition the output classifier on character n-grams. The authors show that the proposed architectures outperform plain character-level RNNs in terms of entropy in bits per character. #### Key Points - Plain character-level RNNs need a huge hidden representation in order to model long-term dependencies, but word-level RNNs can't generalize to new vocabulary and may require a huge output vocab. - Model 1: Jointly train word-level and char-level RNNs. Interpolate the losses of the two models. - Model 2: Condition the softmax on the n-grams preceding the character, "relieving" the network of memorizing some of the sequence. - Training: Constant learning rate, reduced after each epoch in which validation accuracy decreases - The n-gram model can be applied to arbitrary data, not just characters. Authors evaluate on binary data. #### Notes / Questions - In the comparison table the authors don't show the number of parameters for the models. They compare models with the same number of hidden units, but their proposed architectures need extra parameters and computation. Unfair comparison? - People typically use LSTMs/GRUs for language modeling. Of course the proposed techniques can be applied to LSTM/GRU networks, but the experimental results may look very different. Do these architectures provide any benefit when used with LSTM/GRU character models? - Entropy in bits per character seems like somewhat of a strange evaluation metric. I don't really know what to make of it, and no intuitive explanations are given. - One argument the authors make in the paper is that character-level models can be applied to arbitrary input data (different languages, binary data, code, etc). But their mixed model is clearly very language-specific. It can't be applied to arbitrary data, and many languages don't have clear word boundaries. Similarly, n-grams may be prohibitively expensive depending on what kind of data we're working with. - The n-gram conditioned model isn't clearly explained. I *think* I understand what it does, but I'm not quite sure. No intuitive explanations of what any of the models are learning are given. |
[link]
TLDR; The authors propose an Attention with Intention (AWI) model for Conversation Modeling. AWI consists of three recurrent networks: An encoder that embeds the source sentence from the user, an intention network that models the intention of the conversation over time, and a decoder that generates responses. The authors show that the network can generate natural responses. #### Key Points - Intuition: Intention changes over the course of a conversation, e.g. communicate problem -> resolve issue -> acknowledge. - Encoder RNN: Depends on the last state of the decoder. Reads the input sequence and converts it into a fixed-length vector. - Intention RNN: Gets the encoder representation, previous intention state, and previous decoder state as input and generates a new representation of the intention. - Decoder RNN: Gets the current intention state and an attention vector over the encoder as input. Generates a new output. - The architecture is evaluated on an internal helpdesk chat dataset with 10k dialogs, 100k turns and 2M tokens. Perplexity scores and a sample conversation are reported. #### Notes/Questions - It's a pretty short paper and not sure what to make of the results. The PPL scores were not compared to alternative implementations and no other evaluations (e.g. crowdsourced as in Neural Conversational Model) are done. |
[link]
TLDR; The authors argue that the human visual cortex doesn't contain ultra-deep networks like ResNets (with 100s or 1000s of layers), but that it does contain recurrent connections. The authors then explore ResNets with weight sharing and show how they are equivalent to unrolled standard RNNs with skip connections. The authors find that ResNets with weight sharing perform almost as well as ResNets without weight sharing, while needing drastically fewer parameters. Thus, they argue that the success of ultra-deep networks may actually stem from the fact that they can approximate recurrent computations. |
[link]
TLDR; The authors explore the gap between Deep Learning methods and human learning. They argue that natural intelligence is still the best example of intelligence, so it's worth exploring. To demonstrate their points they explore two challenges: 1. Recognizing new characters and objects 2. Learning to play the game Frostbite. The authors make several arguments: - Humans have an intuitive understanding of physics and psychology (understanding goals and agents) very early on. These two types of "software" help them to learn new tasks quickly. - Humans build causal models of the world instead of just performing pattern recognition. These models allow humans to learn from far fewer examples than current Deep Learning methods. For example, AlphaGo played a billion games or so, Lee Sedol perhaps 50,000. Incorporating compositionality, learning-to-learn (transfer learning) and causality helps humans to build these models. - Humans use both model-free and model-based learning algorithms. |
[link]
TLDR; The authors build an LSTM Neural Language Model, but instead of using word embeddings as inputs, they use the per-word outputs of a character-level CNN, plus a highway layer. This architecture results in state of the art performance and significantly fewer parameters. It also seems to work well on languages with rich morphology. #### Key Points - Small Model: 15-dimensional char embeddings, filter sizes 1-6, tanh, 1-layer highway with ReLU, 2-layer LSTM with 300-dimensional cells. 5M Parameters. Hierarchical Softmax. - Large Model: 15-dimensional char embeddings, filter sizes 1-7, tanh, 2-layer highway with ReLU, 2-layer LSTM with 670-dimensional cells. 19M Parameters. Hierarchical Softmax. - Can generalize to out-of-vocabulary words due to character-level representations. Some datasets already had OOV words replaced with a special token, so the results don't reflect this. - Highway Layers are key to performance. Substituting the HW layer with an MLP does not work well. The intuition is that the HW layer adaptively combines different local features for a higher-level representation (see the sketch below). - Nearest neighbors after the Highway layer are more semantic than before the highway layer. Suggests compositional nature. - Surprisingly, combining word and char embeddings as LSTM input results in worse performance - Characters alone are sufficient? - Can apply the same architecture to NMT or Classification tasks. Highway Layers at the output may also help these tasks. #### Notes / Questions - Essentially this is a new way to learn word embeddings composed of lower-level character embeddings. Given this, what about stacking this architecture and learning sentence representations based on these embeddings? - It is not 100% clear to me why the MLP at the output layer does so much worse. I understand that the highway layer can adaptively combine features, but what if you combined MLP and plain representations and added dropout? Shouldn't that result in similar performance? - I wonder if the authors experimented with higher-dimensional character embeddings. What is the intuition behind the very low-dimensional (15) embeddings?
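A minimal numpy sketch of the highway layer idea (my own illustration; the dimensions and the negative gate-bias initialization are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(y, W_h, b_h, W_t, b_t):
    """Highway layer: out = t * g(W_h y + b_h) + (1 - t) * y.

    The transform gate t decides, per dimension, how much of the transformed
    features vs. the raw (e.g. char-CNN) features to pass on.
    """
    t = sigmoid(y @ W_t + b_t)           # transform gate
    h = np.maximum(y @ W_h + b_h, 0.0)   # ReLU transform
    return t * h + (1.0 - t) * y         # carry = 1 - t

d = 525                                  # e.g. total number of CNN filters; illustrative
y = np.random.randn(32, d)               # char-CNN outputs for 32 words
out = highway(y, np.random.randn(d, d) * 0.01, np.zeros(d),
              np.random.randn(d, d) * 0.01, np.full(d, -2.0))  # negative gate bias favors carrying
print(out.shape)  # (32, 525)
```
|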
[link]
TLDR; The authors evaluate the use of 9-layer deep CNNs on large-scale data sets for text classification, operating directly on one-hot encodings of characters. The architecture achieves competitive performance across datasets. #### Key Points - 9 Layers, 6 conv/pool layers, 3 affine layers. 1024-dimensional input features for the large model, 256-dimensional input features for the small model. - Authors optionally use an English thesaurus for training data augmentation - Fixed input length l: 1014 characters - Simple n-gram models perform very well on these data sets, beating the other models on the smaller data sets (<= 500k examples). The CNN wins on the larger data sets (>1M examples) #### Notes / Questions - Comparing the CNN with input restricted to 1014 characters to models that operate on words seems unfair. Also, how long is the average document? Would've liked to see some dataset statistics. The fixed input length doesn't make a lot of sense to me. - The contribution of this paper is that the architecture works without word knowledge and for any language, but at the same time the authors use a word-level English thesaurus to improve their performance? To be fair, the thesaurus doesn't seem to make a huge difference. - The reason this architecture requires so much data is probably because it's very deep (How many parameters?). Did the authors experiment with fewer layers? Did they perform much worse? - What about unsupervised pre-training? Can that reduce the amount of data required to achieve good performance? Currently this model doesn't seem very useful in practice as there are very few datasets of such size out there. |
[link]
TLDR; The authors propose a Contextual LSTM (CLSTM) model that appends a context vector to the input words when making predictions. The authors evaluate the model on Language Modeling, next sentence selection and next topic prediction tasks, beating standard LSTM baselines. #### Key Points - The topic vector comes from an internal classifier system, i.e. it is supervised data. Topics could also be estimated using unsupervised techniques. - The topic can be calculated either based on the previous words of the current sentence (SentSegTopic), all words of the previous sentence (PrevSegTopic), or the current paragraph (ParaSegTopic). The best CLSTM uses all of them. - English Wikipedia Dataset: 1400M words train, 177M validation, 178M words test. 129k vocab. - When the current segment topic is present, the topic of the previous sentence doesn't matter. - Authors couldn't compare to other models that incorporate topics because they don't scale to large-scale datasets. - LSTMs are a long chain and the authors don't reset the hidden state between sentence boundaries. So, a sentence has implicit access to the previous sentence's information, but explicitly modeling the topic still makes a difference. #### Notes/Thoughts - Increasing the number of hidden units seems to have a *much* larger impact on performance than adding the topic information. The simple word-based LSTM model with more hidden units significantly outperforms the complex CLSTM model. This makes me question the practical usefulness of this model. - IMO the comparisons are somewhat unfair because by using an external classifier to obtain topic labels you are bringing in external data that the baseline models didn't have access to. - What about using other unsupervised sentence embeddings as context vectors, e.g. seq2seq autoencoders or PV? - If the LSTM were perfect at modeling long-range dependencies then we wouldn't need to feed extra topic vectors. What about residual connections? |
[link]
TLDR; The authors apply an RNN to modeling a student's knowledge. The input is an exercise question and answer (correct/incorrect), either as one-hot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvements over previous models, can be used for curriculum optimization, and also discovers the latent structure in exercise concepts. #### Key Points - Two encodings tried: One-hot, embedded - RNN/LSTM, 200-dimensional hidden layer, output dropout, NLL. - No expert annotation for concepts or questions/answers is needed - Blocking (series of exercises of the same type) vs. Mixing for curriculum optimization: Blocking seems to perform better - Lots of cool future direction ideas #### Question / Notes - Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? Might give insight into confusing concepts. |
[link]
TLDR; The authors present Residual Nets, which achieve 3.57% error on the ImageNet test set and won 1st place in the ILSVRC 2015 challenge. ResNets work by introducing "shortcut" connections across stacks of layers, allowing the optimizer to learn an easier residual function instead of the original mapping. This allows for efficient training of very deep nets without the introduction of additional parameters or training complexity. The authors present results on ImageNet and CIFAR-10 with nets as deep as 152 layers (and one ~1000 layer deep net). #### Key Points - Problem: Deeper networks experience a *degradation* problem. They don't overfit but nonetheless perform worse than shallower networks on both training and test data due to being more difficult to optimize. - Because deep nets can in theory learn an identity mapping for their additional layers they should strictly outperform shallower nets. In practice however, optimizers have problems learning identity (or near-identity) mappings. Learning residual mappings is easier, mitigating this problem. - Residual Mapping: If the desired mapping is H(x), let the layers learn F(x) = H(x) - x and add x back through a shortcut connection, H(x) = F(x) + x. An identity mapping can then be learned easily by driving the learned mapping F(x) to 0 (see the sketch below). - No additional parameters or computational complexity are introduced by residual nets. - Similar to Highway Networks, but the gates are not data-dependent (no extra parameters) and are always open. - Due to the nature of the residual formula, input and output must be of the same size (just like in Highway Networks). We can do size transformations by zero-padding or projections. Projections introduce additional parameters. Authors found that projections perform slightly better, but are "not worth" the large number of extra parameters. - 18 and 34-layer VGG-like plain nets get 27.94 and 28.54 error respectively; note the higher error for the deeper net. The ResNets get 27.88 and 25.03 respectively; error is greatly reduced for the deeper net. - Use a bottleneck architecture with 1x1 convolutions to change dimensions. - A single ResNet outperforms previous state of the art ensembles. A ResNet ensemble is even better. #### Notes/Questions - Love the simplicity of this. - I wonder how performance depends on the number of layers skipped by the shortcut connections. The authors only present results with 2 or 3 layers. - "Stacked" or recursive residuals? - In principle Highway Networks should be able to learn the same mappings quite easily. Is this an optimization problem? Do we just not have enough data? What if we made the gates less fine-grained and substituted sigmoid with something else? - Can we apply this to RNNs, similar to LSTM/GRU? Seems good for learning long-range dependencies.
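A minimal numpy sketch of a residual block, using plain affine layers instead of conv + batch norm for brevity (my own illustration, not the paper's exact block):

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two-layer residual block: H(x) = F(x) + x.

    If the optimal mapping is close to identity, the weights only need to
    drive F(x) toward zero, which is easier than learning identity directly.
    """
    f = np.maximum(x @ W1, 0.0)     # first layer + ReLU
    f = f @ W2                      # second layer
    return np.maximum(f + x, 0.0)   # shortcut addition, then ReLU

d = 64
x = np.random.randn(16, d)
W1 = np.random.randn(d, d) * 0.01
W2 = np.random.randn(d, d) * 0.01
# With near-zero weights the block already behaves like an identity mapping (plus ReLU).
print(np.abs(residual_block(x, W1, W2) - np.maximum(x, 0.0)).max())  # small
```
|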
[link]
TLDR; The authors show that we can distill the knowledge of a complex ensemble of models into a smaller model by letting the smaller model learn directly from the "soft targets" (softmax output with high temperature) of the ensemble. Intuitively, this works because the errors in probability assignment (e.g. assigning 0.1% to the wrong class) carry a lot of information about what the network learns. Learning directly from logits (unnormalized scores), as was done in a previous paper, is a special case of the distillation approach. The authors show how distillation works on MNIST and an ASR data set. #### Key Points - Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice. - Use softmax with temperature; values from 1-10 seem to work well, depending on the problem. - The MNIST network learns to recognize digits it never saw during training (a held-out 3), solely based on the "errors" that the teacher network makes. (The bias needs to be adjusted.) - Training on soft targets with less data performs much better than training on hard targets with the same amount of data. #### Notes/Question - Breaking up the complex model into specialists didn't really fit into this paper, since those experts are not distilled back into one model. Also would've liked to see training of only specialists (without the general network) and then distilling their knowledge.
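A rough numpy sketch of the distillation objective as I understand it (the temperature `T`, the mixing weight `alpha`, and all numbers are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels, T=5.0, alpha=0.5):
    """Cross-entropy against the teacher's soft targets (at temperature T)
    plus standard cross-entropy against the hard labels."""
    soft_targets = softmax_with_temperature(teacher_logits, T)
    soft_preds = softmax_with_temperature(student_logits, T)
    hard_preds = softmax_with_temperature(student_logits, 1.0)
    soft_ce = -np.sum(soft_targets * np.log(soft_preds + 1e-12), axis=-1)
    hard_ce = -np.log(hard_preds[np.arange(len(hard_labels)), hard_labels] + 1e-12)
    # The soft-target gradients scale as 1/T^2, hence the T**2 factor mentioned in the paper.
    return np.mean(alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce)

teacher = np.random.randn(8, 10) * 3
student = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```
|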
[link]
TLDR; The authors present Paragraph Vector, which learns fixed-length, semantically meaningful vector representations for text of any length (sentences, paragraphs, documents, etc). The algorithm works by training a word vector model with an additional paragraph embedding vector as an input. This paragraph embedding is fixed for each paragraph, but varies across paragraphs. Similar to word2vec, PV comes in 2 flavors: - A Distributed Memory Model (PV-DM) that predicts the next word based on the paragraph and preceding words - A BoW model (PV-BoW) that predicts context words for a given paragraph A notable property of PV is that during inference (when you see a new paragraph) it requires training of a new vector, which can be slow. The learned embeddings can be used as the input to other models. In their experiments the authors train both variants and concatenate the results. The authors evaluate PV on Classification and Information Retrieval tasks and achieve new state-of-the-art. #### Data Sets / Results Stanford Sentiment Treebank Polar error: 12.2% Stanford Sentiment Treebank Fine-Grained error: 51.3% IMDB Polar error: 7.42% Query-based search result retrieval (internal) error: 3.82% #### Key Points - Authors use 400-dimensional PV and word embeddings. The window size is a hyperparameter chosen on the validation set; values from 5-12 seem to work well. On IMDB, the window size resulted in error fluctuations of ~0.7%. - PV-DM performs well on its own, but concatenating PV-DM and PV-BoW consistently leads to (small) improvements. - When training the PV-DM model, use concatenation instead of averaging to combine word and paragraph vectors (this preserves ordering information). - Hierarchical Softmax is used to deal with large vocabularies. - For final classification, the authors use LR or an MLP, depending on the task (see below). - IMDB training (25k documents, 230 words average length) takes 30min on a 16-core machine, CPU I assume. #### Notes / Question - How did the authors choose the final classification model? Did they cross-validate this? The authors mention that an NN performs better than LR for the IMDB data, but they don't show how large the gap is. Does PV maybe perform significantly worse with a simpler model? - I wonder if we can train hierarchical representations of words, sentences, paragraphs, documents, keeping the vectors of each one fixed at each layer, and predicting sequences using RNNs. - I wonder how PV compares to an attention-based RNN autoencoder approach. When training PV you are in a way attending to specific parts of the paragraph to predict the missing parts. |
[link]
TLDR; The authors use a Maximum Mutual Information (MMI) objective function to generate conversational responses. They still train their models with maximum likelihood, but use MMI to generate responses during decoding. The idea behind MMI is that it promotes more diversity and penalizes trivial responses. The authors evaluate their method using BLEU scores, human evaluators, and qualitative analysis and find that the proposed objective indeed leads to more diverse responses. #### Key Points - In practice, NCMs (Neural Conversation Models) often generate trivial responses using high-frequency terms, partly due to the likelihood objective function. - Two models: MMI-antiLM and MMI-bidi, depending on the formulation of the MMI objective. These objectives are used during response generation, not during training (see the sketch below). - Use a deep 4-layer LSTM with 1000-dimensional hidden state and 1000-dimensional word embeddings. - Datasets: Twitter triples with 129M context-message-response triples. OpenSubtitles with 70M spoken lines that are noisy and don't include turn information. - Authors state that perplexity is not a good metric because their objective is to explicitly steer away from the high probability responses. #### Notes - BLEU score seems like a bad metric for this. Shouldn't more diverse responses result in a lower BLEU score? - Not sure if I like the direction of this. To me it seems wrong to "artificially" promote diversity. Shouldn't diversity come naturally as a function of context and intention?
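A toy sketch of the MMI-antiLM idea, assuming we rerank an N-best list from beam search (the paper's antiLM variant only penalizes the language-model score of the first few tokens; this sketch applies it to the whole response for simplicity, and the lambda value and all numbers are made up):

```python
def mmi_antilm_score(log_p_target_given_source, log_p_target, lam=0.5):
    """MMI-antiLM: score(T) = log p(T|S) - lambda * log p(T).

    Subtracting a scaled language-model score penalizes generic, high-frequency
    responses ("I don't know") that are likely under p(T) regardless of the source."""
    return log_p_target_given_source - lam * log_p_target

# Rerank an N-best list: (response, log p(T|S), log p(T)) -- numbers are made up.
candidates = [
    ("i don't know.",                  -2.0, -1.5),
    ("try rebooting the modem first.", -4.0, -9.0),
]
ranked = sorted(candidates, key=lambda c: mmi_antilm_score(c[1], c[2]), reverse=True)
print(ranked[0][0])  # the more specific response wins under MMI
```
|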
[link]
TLDR; The authors evaluate Paragraph Vectors on large Wikipedia and arXiv document retrieval tasks and compare the results to LDA, BoW and word vector averaging models. Paragraph Vectors either outperform or match the performance of the other models. The authors show how the embedding dimensionality affects the results. Furthermore, the authors find that one can perform arithmetic operations on paragraph vectors and obtain meaningful results, and present qualitative analyses in the form of visualizations and document examples. #### Data Sets Accuracy is evaluated by constructing triples, where a pair of items are close to each other and the third one is unrelated (or less related). Cosine similarity is used to evaluate semantic closeness. Wikipedia (hand-built) PV: 93% Wikipedia (hand-built) LDA: 82% Wikipedia (distantly supervised) PV: 78.8% Wikipedia (distantly supervised) LDA: 67.7% arXiv PV: 85% arXiv LDA: 85% #### Key Points - Jointly training PV and word vectors seems to improve performance. - Uses Hierarchical Softmax with a Huffman tree for the large vocabulary. - They use only the PV-BoW model, because it's more efficient. #### Questions/Notes - Why the performance discrepancy between the arXiv and Wikipedia tasks? BoW performs surprisingly well on Wikipedia, but not arXiv. LDA is the opposite. |
[link]
TLDR; The authors train a Hierarchical Recurrent Encoder-Decoder (HRED) network for dialog generation. The "lower" level encodes a sequence of words into a thought vector, and the higher-level encoder uses these thought vectors to build a representation of the context. The authors evaluate their model on the *MovieTriples* dataset using perplexity measures and achieve results better than plain RNNs and the DCGM model. Pre-training with a large Question-Answer corpus significantly reduces perplexity. #### Key Points - Three RNNs: Utterance encoder, context encoder, and decoder. GRU hidden units, ~300d hidden state spaces. - 10k vocabulary. Preprocessing: Remove entities and numbers using NLTK - The context in the experiments is only a single utterance - MovieTriples is a small dataset, about 200k training triples. The pretraining corpus has 5M Q-A pairs, 90M tokens. - Perplexity is used as the evaluation metric. Not perfect, but reasonable. - Pre-training has a much more significant impact than the choice of model architecture. It reduces perplexity by ~10 points, while the model architecture makes a tiny difference (~1 point). - Authors suggest exploring architectures that separate semantic from syntactic structure - Realization: Most good predictions are generic. Evaluation metrics like BLEU will favor pronouns and punctuation marks that dominate during training and are therefore bad metrics. #### Notes/Questions - Does using a larger dataset eliminate the need for pre-training? - What about the more challenging task of longer contexts? |
[link]
TLDR; The authors use a CNN to extract features from character-based document representations. These features are then fed into an RNN to make a final prediction. This model, called ConvRec, has significantly fewer parameters (10-50x) than comparable convolutional models with more layers, but achieves similar or better performance on large-scale document classification tasks. #### Key Points - Shortcomings of the word-level approach: Each word is distinct despite common roots, cannot handle OOV words, many parameters. - Character-level Convnets need many layers to capture long-term dependencies due to the small sizes of the receptive fields. - Network architecture: 1. Embedding, 8-dim 2. Convnet: 2-5 layers, 5- and 3-dim convolutions, 2-dim pooling, ReLU activation 3. RNN: LSTM with 128d hidden state. Dropout after the conv and recurrent layers. - Training: 96 characters, Adadelta, batch size of 128, examples padded and masked to the longest sequence in the batch, gradient norm clipping of 5, early stopping - The model tends to outperform the large CNN on smaller datasets. Maybe because of overfitting? - More convolutional layers or more filters don't impact model performance much #### Notes/Questions - Would've been nice to graph the effect of #params on model performance. How much do additional filters and conv layers help? - What about training time? How does it compare? |
[link]
TLDR; The authors propose a recurrent memory-based model that can reason over multiple hops and be trained end to end with standard gradient descent. The authors evaluate the model on QA and Language Modeling tasks. In the case of QA, the network inputs are a list of sentences, a query and (during training) an answer. The network then attends to the sentences at each time step, considering the next piece of information relevant to the question. The network outperforms baseline approaches, but does not come close to a strongly supervised (relevant sentences are pre-selected) approach. #### Key Takeaways - Sentence Representation: 1. Word embeddings are averaged (BoW) 2. Positional Encoding (PE) - Synthetic dataset with a vocabulary size of ~180. Version one has 1k training examples, version 2 has 10k training examples. - The model is similar to the Bahdanau seq2seq attention model, only that it operates on sentences, does not output at every step, and uses a simpler scoring function. #### Questions / Notes - The positional encoding formula is not explained, nor is it intuitive (see the sketch below for my reading of it). - There are so many hyperparameters and model variations (jittering, linear start) that it's easy to lose track of what's essential. - No intuitive explanation of what the model does. The easiest way for me to understand this model was to look at it as a variation of Bahdanau's attention model, which is very intuitive. I don't understand the intuition behind the proposed weight constraints. - The LM results are not convincing. The model beats the baselines by a little bit, but probably only due to very time-intensive hyperparameter optimization. - What is the training complexity and training time?
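For what it's worth, here is my reading of the positional encoding as a small numpy sketch (indexing conventions may differ from the paper):

```python
import numpy as np

def positional_encoding(J, d):
    """Position weights l[j, k] = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-based
    word index j and embedding index k (as I read the paper). Words at different
    positions get differently weighted embeddings, so the sentence vector is no
    longer a pure bag of words."""
    l = np.zeros((J, d))
    for j in range(1, J + 1):
        for k in range(1, d + 1):
            l[j - 1, k - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

def sentence_representation(word_embeddings):
    """PE-weighted sum instead of a plain average: m = sum_j l_j * x_j."""
    J, d = word_embeddings.shape
    return (positional_encoding(J, d) * word_embeddings).sum(axis=0)

words = np.random.randn(6, 20)   # 6 words, 20-dim embeddings
print(sentence_representation(words).shape)  # (20,)
```
|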
[link]
TLDR; The authors train large-scale language modeling LSTMs on the 1B Word dataset and achieve new state of the art results for single models (51.3 -> 30 perplexity) and ensemble models (41 -> 24.2 perplexity). The authors evaluate how various architecture choices impact model performance: Importance Sampling loss, NCE loss, character-level CNN inputs, Dropout, character-level CNN output, character-level LSTM output. #### Key Points - 800k vocab, 1B words training data - Using a CNN on characters instead of a traditional softmax significantly reduces the number of parameters, but lacks the ability to differentiate between similar-looking words with very different meanings. Solution: Add a correction factor - Dropout on non-recurrent connections significantly improves results - A character-level LSTM for prediction performs significantly worse than the softmax or CNN softmax - Sentences are not pre-processed, and are fed in 128-sized batches without resetting any LSTM state in between examples. Max word length for character-level input: 50 - Training: Adagrad with a learning rate of 0.2. Gradient norm clipping 1.0. RNN unrolled for 20 steps. The small LSTM beats the state of the art after just 2 hours of training; the largest and best model was trained for 3 weeks on 32 K40 GPUs. - NCE vs. Importance Sampling: IS is sufficient - Using character-level CNN word embeddings instead of a traditional embedding matrix is sufficient and performs better #### Notes/Questions - The exact hyperparameters in Table 1 are not clear to me. |
[link]
TLDR; The authors apply a 3-layer seq2seq LSTM with 256 units and an attention mechanism to the constituency parsing task and achieve a new state of the art. Attention made a huge difference for a small dataset (40k examples), but less so for a noisy large dataset (~11M examples). #### Data Sets and model performance - WSJ (40k examples): 90.5 - Large distantly supervised corpus (90k gold examples, 11M noisy examples): 92.8 #### Key Takeaways - The authors use existing parsers to label a large dataset to be used for training. The trained model then outperforms the "teacher" parsers. A possible explanation is that errors of the supervising parsers look like noise to the more powerful LSTM model. These results are extremely valuable, as data is typically the limiting factor, but existing models almost always exist. - The attention mechanism can lead to huge improvements on small data sets. - All of the learned LSTM models were able to deal with long (~70 token) sentences without a significant impact on performance. - Reversing the input in seq2seq tasks is common. However, reversing resulted in only a 0.2 point bump in accuracy. - Pre-trained word vectors bumped scores by 0.4 (92.9 -> 94.3) only. #### Notes/Questions - How much does the output data representation matter? The authors linearized the parse tree using depth-first traversal and parentheses. Are there more efficient representations that may lead to better results? - How much does the noise in the auto-labeled training data matter when compared to the data size? Are there systematic errors in the auto-labeled data that put a ceiling on model performance? - Bidirectional LSTM? |
[link]
TLDR; The authors propose new ways to incorporate context (previous sentences) into a Recurrent Language Model (RLM). They propose 3 ways to model the context, and 2 ways to incorporate the context into the predictions for the current sentence. Context can be modeled with BoW, Sequence BoW (a BoW vector per sentence), and Sequence BoW with attention. Context can be incorporated using "early fusion", which gives the context as an input to the RNN, or "late fusion", which modifies the LSTM to directly incorporate the context. The authors evaluate their architecture on the IMDB, BBC and Penn TreeBank corpora, and show that most approaches perform well (reducing perplexity), with Sequence BoW with attention + late fusion outperforming all others. #### Key Points: - Context as BoW: Compress the N previous sentences into a single BoW vector - Context as Sequence BoW: Compress each of the N previous sentences into a BoW vector and use an LSTM to "embed" them. Alternatively, use an attention mechanism. - Early Fusion: Give the context vector as an input to the LSTM, together with the current word. - Late Fusion: Add another gate to the LSTM that incorporates the context vector. Helps to combat vanishing gradients (see the sketch below). - Interestingly, the Sequence BoW without attention performs very poorly. The reason here seems to be the same as for seq2seq: it's hard to compress the sentence vectors into a single fixed-length representation using an LSTM. - LSTM models trained with 1000 units, Adadelta. Only sentences up to 50 words are considered. - Noun phrases seem to benefit the most from the context, which makes intuitive sense. #### Notes/Questions: - A problem with current Language Models is that they are corpus-specific. A model trained on one corpus doesn't do well on another corpus because all sentences are treated as being independent. However, if we can correctly incorporate context we may be able to train a general-purpose LM that does well across various corpora. So I think this is important work. - I am surprised that the authors did not try using a sentence embedding (skip-thought, paragraph vector) to construct their context vectors. That seems like an obvious choice over using BoW. - The argument for why the Sequence BoW without attention model performs poorly isn't convincing. In the seq2seq work the argument for attention was based on the length of the sequence. However, here the sequence is very short, so the LSTM should be able to capture all the dependencies. The performance may be poor due to the BoW representation, or due to too little training data. - Would've been nice to visualize what the attention mechanism is modeling. - I'm not sure if I agree with the authors that relying on explicit sentence boundaries is an advantage; I see it as a limiting factor.
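A generic sketch of the late-fusion idea — an extra learned gate mixes the context vector into the LSTM output after the recurrence — which is not the paper's exact parameterization, just an illustration of where the context enters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def late_fusion(h_t, context, W_g, W_c):
    """Gate the context vector and mix it into the LSTM output. The point is
    that the context enters *after* the recurrence, instead of being fed in
    as just another input (early fusion)."""
    g = sigmoid(h_t @ W_g + context @ W_c)  # gate depends on the state and the context
    return h_t + g * context                # gated shortcut from the context

d = 128
h_t = np.random.randn(d)       # LSTM hidden state at time t
context = np.random.randn(d)   # e.g. BoW vector of previous sentences, projected to d dims
out = late_fusion(h_t, context, np.random.randn(d, d) * 0.01, np.random.randn(d, d) * 0.01)
print(out.shape)  # (128,)
```
|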
[link]
TLDR; The authors demonstrate how to condition on several predictors when generating text/code. For example, one may need to copy inputs or perform database lookups to produce good results, but training multiple predictors end-to-end is challenging. The authors propose Latent Predictor Networks that combine attention-based character generation with pointer networks to copy tokens from the input. The authors evaluate their model on the task of producing code for Trading Card Games like Magic and Hearthstone, where the card image is the input, and the code implementation of a card is the output. Latent Predictor Networks clearly beat seq2seq and attention-based baselines. |
[link]
TLDR; The authors present DV-ngram, a new method to learn document embeddings. DV-ngram is a variation on Paragraph Vectors with a training objective of predicting words and n-grams solely based on the document vector, forcing the embedding to capture the semantics of the text. The authors evaluate their model on the IMDB data set, beating both n-gram based and Deep Learning models. #### Key Points - When the word vectors are already sufficiently predictive of the next words, the standard PV embedding cannot learn anything useful. - Training objective: Predict words and n-grams solely based on the document vector. Negative Sampling is used to deal with the large vocabulary. In practice, each n-gram is treated as a special token and appended to the document. - Code will be at https://github.com/libofang/DV-ngram #### Question/Notes - The argument that PV may not work when the word vectors themselves are predictive enough makes intuitive sense. But what about applying word-level dropout? Wouldn't that also force the PV to learn the document semantics? - It seems to me that predicting n-grams leads to a huge sparse vocabulary space. I wonder how this method scales, even with negative sampling. I am actually surprised this works well at all. - The authors mention that they beat "other Deep Learning models", including PV, but neither their model nor PV is "deep learning". The networks are not deep ;) |
[link]
TLDR; The authors propose a novel encoder-decoder neural network architecture. The encoder RNN encodes a sequence into a fixed-length vector representation and the decoder generates a new variable-length sequence based on this representation. The authors also introduce a new cell type (now called GRU) to be used with this network architecture. The model is evaluated on a statistical machine translation task where it is fed as an additional feature to a log-linear model. It leads to improved BLEU scores. The authors also find that the model learns syntactically and semantically meaningful representations of both words and phrases. #### Key Points: - New encoder-decoder architecture, seq2seq. The decoder is conditioned on the thought vector. - The architecture can be used for both scoring and generation - New hidden unit type, now called GRU. A simplified LSTM (see the sketch below). - Could replace the whole pipeline with this architecture, but this paper doesn't - 15k vocabulary (93% coverage of the dataset). 100d embeddings, 500 maxout units in the final affine layer, batch size of 64, Adagrad, 384M words, 3 days training time. - The architecture is trained without frequency information, so we expect it to capture linguistic rather than statistical information. - Visualizations of both word embeddings and thought vectors. #### Questions/Notes - Why not just use LSTM units?
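A minimal numpy sketch of the proposed gated unit (the GRU); gate-sign conventions vary between write-ups, and the sizes below are just the ones mentioned above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One step of the gated unit introduced in this paper (now the GRU).

    z: update gate -- how much of the previous state to keep.
    r: reset gate  -- how much of the previous state to use when computing
       the candidate state.
    """
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x + Uz @ h_prev)
    r = sigmoid(Wr @ x + Ur @ h_prev)
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))
    return z * h_prev + (1.0 - z) * h_tilde

d_in, d_h = 100, 500  # embedding and hidden sizes as mentioned above
params = [np.random.randn(d_h, d_in) * 0.01 if i % 2 == 0 else np.random.randn(d_h, d_h) * 0.01
          for i in range(6)]   # Wz, Uz, Wr, Ur, W, U
h = np.zeros(d_h)
for x in np.random.randn(10, d_in):  # run over a 10-token sequence
    h = gru_cell(x, h, params)
print(h.shape)  # (500,)
```
|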
[link]
TLDR; The authors show that seq2seq LSTM networks (2 layers, 400 dims) can learn to evaluate short Python programs (loops, conditionals, addition, subtraction, multiplication). The program code is fed one character at a time, and the LSTM is tasked with generating the output number (12 character vocab). The authors also present a new curriculum learning strategy, where the network is fed a sensible mixture of easy and increasingly difficult examples, allowing it to gradually build up the concepts required to evaluate these programs. #### Key Points - LSTM unrolled for 50 steps, 2 layers, 400 cells per layer, ~2.5M parameters. Gradient norm constrained to 5. - 3 Curriculum Learning strategies: 1. Naive: increase example difficulty 2. Mixed: randomly sample easy and hard problems 3. Combined: sample from the Naive and Mixed strategies. Mixed or Combined almost always performs better (see the sketch below). - Output vocabulary: 10 digits, minus, dot - For evaluation teacher forcing is used: Feed the correct output when generating the target sequence - Evaluation tasks: Program Evaluation, Addition, Memorization - Tricks: Reverse the input sequence, double the input sequence. Seem to make a big difference. - Nesting loops makes the tasks difficult since LSTMs can't deal with compositionality. - Feeding easy examples before hard examples may require the LSTM to restructure its memory. #### Notes / Questions - I wonder if there's a relation between regularization/dropout and curriculum learning. The authors propose that mixing example difficulty forces a more general representation. Shouldn't dropout be doing a similar thing?
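A toy sketch of the three curriculum strategies as I read them (the 50/50 split in "combined" is my assumption):

```python
import random

def next_difficulty(current_level, target, strategy="combined"):
    """Pick the difficulty (e.g. nesting depth / number length) of the next
    training example under the three curriculum strategies."""
    if strategy == "naive":
        return current_level                 # only the current level, gradually increased
    if strategy == "mixed":
        return random.randint(1, target)     # easy and hard examples throughout training
    if strategy == "combined":
        # mix the naive and mixed strategies
        return current_level if random.random() < 0.5 else random.randint(1, target)
    raise ValueError(strategy)

print([next_difficulty(current_level=4, target=10, strategy="combined") for _ in range(8)])
```
|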
[link]
TLDR; The authors train a *single* Neural Machine Translation model that can translate between N*M language pairs, with a parameter space that grows linearly with the number of languages. The model uses a single attention mechanism shared across encoders/decoders. The authors demonstrate that the model performs particularly well for resource-constrained languages, outperforming single-pair models trained on the same data. #### Key Points - Attention mechanism: Both encoder and decoder output attention-specific vectors, which are then combined. Thus, adding a new source/target language does not result in a quadratic explosion of parameters. - Bidirectional RNN, 620-dimensional embeddings, GRU with 1k units, 1k-unit affine layer with tanh. Adam, minibatches of 60 examples. Only uses sentences up to length 50. - The model clearly outperforms single-pair models when the parallel corpora are constrained to small sizes. Not so much for large corpora. - The single model doesn't fit on a GPU. - Can in theory be used to translate between pairs that didn't have a bilingual training corpus, but the authors don't evaluate this in the paper. - Main difference to "Multi-task Sequence to Sequence Learning": Uses an attention mechanism #### Notes / Questions - I don't see anything that would force the encoders to map sequences of different languages into the same representation (as the authors briefly mention). Perhaps it just encodes language-specific information that the decoders can use to decide which source language it was? |
[link]
TLDR; The authors train a deep seq2seq LSTM directly on byte-level input of several languages (shuffling the examples of all languages) and apply it to NER and POS tasks, achieving state-of-the-art or close to it. The model outputs spans of the form `[START_POSITION, LENGTH, LABEL]`, where each span element is a separate token prediction. A single model works well for all languages and learns shared high-level representations. The authors also present a novel way to drop out input tokens (bytes in their case), by randomly replacing them with a `DROP` symbol. #### Data and model performance Data: - POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments - NER: 4 languages, 0.88M tokens, 6M training segments Results: - POS CRF Accuracy (average across languages): 95.41 - POS BTS Accuracy (average across languages): 95.85 - NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84 - (See paper for NER comparison models) #### Key Takeaways - Surprising to me that the span generation works so well without imposing independence assumptions on it. It's all state the LSTM has to keep in memory. - 0.2-0.3 dropout, 320-dimensional embeddings, and a 320-unit, 4-layer LSTM seem to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary size of 256 bytes. Changing the input sequence order didn't have much of an effect. Dropout and Byte Dropout significantly (74 -> 78 -> 82) improved F1 for NER. - To limit sequence length the authors split the text into k=60 sized segments, with 50% overlap to avoid splitting mid-span. - Byte Dropout can be seen as "blurring text". I believe I've seen the same technique applied to words before, labeled word dropout. - Training examples for all languages are shuffled together. The biggest improvements in scores are observed for low-resource languages. - Not clear how to tune the recall of the model since non-spans are simply not annotated. #### Notes / Questions - I wonder if the fixed-vector embedding of the input sequence is a bottleneck since the decoder LSTM has to carry information not only about the input sequence, but also about the structure that has been produced so far. I wonder if the authors have experimented with varying `k`, or using attention mechanisms to deal with long sequences (I've seen papers dealing with sequences of 2000 tokens?). 60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences. - What about LSTM initialization? When feeding spans coming from the same document, is the state kept around or re-initialized? I strongly suspect it's kept since 60 bytes probably don't contain enough information for proper labeling, but didn't see an explicit reference. - Why not a bidirectional LSTM? Seems to be the standard in most other papers. - How exactly are multiple languages encoded in the LSTM memories? I *kind of* understand the reasoning behind this, but it's unclear what these "high-level" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable. - Is there a way to easily re-train the model for a new language? |
[link]
TLDR; The authors show that we can improve the performance of a reference task (like translation) by simultaneously training other tasks, like image caption generation or parsing, and vice versa. The authors evaluate 3 MTL (Multi-Task Learning) scenarios: one-to-many, many-to-one and many-to-many. The authors also find that unsupervised training with a skip-thought objective works well for improving translation performance, but sequence autoencoders don't. #### Key Points - 4-layer seq2seq LSTM, 1000-dimensional cells in each layer and 1000-dimensional embeddings, batch size 128, dropout 0.2, SGD with LR 0.7 and decay. - The authors define a mixing ratio for parameter updates that is defined with respect to a reference task. Picking the right mixing ratio is a hyperparameter. - One-to-Many experiments: Translation (EN -> GER) + Parsing (EN). Improves results for both tasks. Surprising that even a very small amount of parsing updates significantly improves the MT result. - Many-to-One experiments: Captioning + Translation (GER -> EN). Improves results for both tasks (wrt. the reference task). - Many-to-Many experiments: Translation (EN <-> GER) + Autoencoders or Skip-Thought. Skip-thought vectors improve the result, but autoencoders make it worse. - No attention mechanism #### Questions / Notes - I think this is very promising work. It may allow us to build general-purpose systems for many tasks, even those that are not strictly seq2seq. We can easily substitute classification. - How do the authors pick the mixing ratios for the parameter updates, and how sensitive are the results to these ratios? It's a new hyperparameter and I would've liked to see graphs for it. Makes me wonder if they picked "just the right" ratio to make their results look good, or if these architectures are robust. - The authors found that seq2seq autoencoders don't improve translation, but skip-thought does. In fact, autoencoders made translation performance significantly worse. That's very surprising to me. Is there any intuition behind that? |
[link]
TLDR; The authors apply a neural seq2seq model to sentence summarization. The model uses an attention mechanism (soft alignment). #### Key Points - Summaries generated on the sentence level, not paragraph level - Summaries have fixed length output - Beam search decoder - Extractive tuning for scoring function to encourage the model to take words from the input sequence - Training data: Headline + first sentence pair. |
[link]
TLDR; The authors train a seq2seq model on conversations, building a chat bot. The first data set is an IT Helpdesk dataset with 33M tokens. The trained model can help solve simple IT problems. The second data set is the OpenSubtitles data with ~1.3B tokens (62M sentences). The resulting model learns simple world knowledge, can generalize to new questions, but lacks a coherent personality. #### Key Points - IT Helpdesk: 1-layer LSTM, 1024-dimensional cells, 20k vocabulary. Perplexity of 8. - OpenSubtitles: 2-layer LSTM, 4096-dimensional cells, 100k vocabulary, 2048 affine layer. Attention did not help. - OpenSubtitles: Treat two consecutive sentences as coming from different speakers. Noisy dataset. - Model lacks personality, gives different answers to similar questions (What do you do? What's your job?) - Feed previous context (whole conversation) into encoder, for IT data only. - In both data sets, the neural models achieve better perplexity than n-gram models. #### Notes / Questions - Authors mention that Attention didn't help in OpenSubtitles. It seems like the encoder/decoder context is very short (just two sentences, not a whole conversation). So perhaps attention doesn't help much here, as it's meant for long-range dependencies (or dealing with little data?) - Can we somehow encode conversation context in a separate vector, similar to paragraph vectors? - It seems like we need a principled way to deal with long sequences and context. It doesn't really make sense to treat each sentence tuple in OpenSubtitles as a separate conversation. Distant Supervision based on subtitles timestamps could also be interesting, or combine with multimodal learning. - How we can learn a "personality vector"? Do we need world knowledge or is it learnable from examples? |
[link]
TLDR; The authors train three variants of a seq2seq model to generate a response to social media posts taken from Weibo. The first variant, NRM-glo, is the standard model without an attention mechanism, using the last state as the decoder input. The second variant, NRM-loc, uses an attention mechanism. The third variant, NRM-hyb, combines both by concatenating the local and global state vectors. The authors use human evaluators to rate the responses and compare them to retrieval-based and SMT-based systems. The authors find that the NRM models generate reasonable responses ~75% of the time. #### Key Points - STC: Short-Text Conversation. Generate only a response to a post. No need to keep track of a whole conversation. - Training data: 200k posts, 4M responses. - Authors use a GRU with 1000 hidden units. - Vocabulary: Most frequent 40k words for both input and response. - Decoding is done using beam search with beam size 10. - The hybrid model is difficult to train jointly. The authors train the models individually and then fine-tune the hybrid model. - Tradeoff with retrieval-based methods: Responses are written by a human and don't have grammatical errors, but cannot easily generalize to unseen inputs. |
[link]
TLDR; The authors propose Neural Turing Machines (NTMs). An NTM consists of a memory bank and a controller network. The controller network (LSTM or MLP in this paper) controls read/write heads by focusing their attention softly, using a distribution over all memory addresses. It can learn the parameters for two addressing mechanisms: Content-based addressing ("find similar items") and location-based addressing. NTMs can be trained end-to-end using gradient descent. The authors evaluate NTMs on program generation tasks and compare their performance against that of LSTMs. Tasks include copying, recall, prediction, and sorting binary vectors. While both LSTMs and NTMs seem to perform well on training data, only NTMs are able to generalize to longer sequences. #### Key Observations - The controller network was tried with an LSTM or MLP. Which one works better is task-dependent, but the LSTM "cache" can be a bottleneck. - Controller size, number of read/write heads, and memory size are hyperparameters. - Monitoring the memory addressing shows that the NTM actually learns meaningful programs. - The number of LSTM parameters grows quadratically with hidden unit size due to the recurrent connections; not so for NTMs, leading to models with fewer parameters. - Example problems are very small, typically using sequences of 8-bit vectors. #### Notes/Questions - At what length do NTMs stop working? Would've liked to see where results get significantly worse. - Can we automatically transform fuzzy NTM programs into deterministic ones?
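A minimal numpy sketch of the content-based addressing step (cosine similarity sharpened by a key strength `beta`, then softmax); sizes are illustrative:

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Compare the key emitted by the controller to every memory row with
    cosine similarity, sharpen with beta, and softmax into a soft
    (differentiable) address distribution."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    return w / w.sum()

memory = np.random.randn(128, 20)               # 128 slots, 20-dim contents
key = memory[42] + 0.1 * np.random.randn(20)    # a noisy version of slot 42's content
w = content_addressing(memory, key, beta=10.0)
print(w.argmax())          # most likely 42: the head focuses on the most similar slot
print((w @ memory).shape)  # a soft read is the expectation over slots: (20,)
```
|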
[link]
TLDR; The authors propose a novel "attention" mechanism that they evaluate on a Machine Translation task, achieving new state of the art (and large improvements in dealing with long sentences). Standard seq2seq models typically try to encode the input sequence into a fixed length vector (the last hidden state) based on which the decoder generates the output sequence. However, it is unreasonable to assume the all necessary information can be encoded in this one vector. Thus, the authors let the decoder depend on a attention vector, which based on the weighted sum (expectation) of the input hidden states. The attention weights are learned jointly, as part of the network architecture. #### Data Sets and model performance Bidirectional GRU, 1000 hidden units. Multilayer maxout to compute output probabilities in decoder. WMT '14 BLEU: 36.15 #### Key Takeaways - Attention mechanism is a weighted sum of the hidden states computed by the encoder. The weights come from a softmax-normalized attention function (a perceptron in this paper), which are learned during training. - Attention can be expensive, because it must be evaluated for each encoder-decoder output pair, resulting in a len(x) * len(y) matrix. - The attention mechanism improves performance across the board, but has a particularly large affect on long sentences, confirming the hyptohesis that the fixed vector encoding is a bottleneck. - The authors use a bidirectional-GRU, concatenating both hidden states into a final state at each time step. - It is easy to visualize the attention matrix (for a single input-ouput sequence pair). The authors show that in the case of English to French translations the matrix has large values on the diagonal, showing the these two languages are well aligned in terms of word order. #### Question/Notes - The attention mechanism seems limited in that it computes a simple weighted average. What about more complex attention functions that allow input states to interact? |
[link]
TLDR; The authors propose three neural models to generate a response (r) based on a context and message pair (c,m). The context is defined as a single message. The first model, RLMT, is a basic Recurrent Language Model that is fed the whole (c,m,r) triple. The second model, DCGM-1, encodes context and message into a BoW representation, puts it through a feedforward neural network encoder, and then generates the response using an RNN decoder. The last model, DCGM-2, is similar but keeps the representations of context and message separate instead of encoding them into a single BoW vector. The authors train their models on a 29M triple data set from Twitter and evaluate using BLEU, METEOR and human evaluator scores. #### Key Points: - 3 Models: RLMT, DCGM-1, DCGM-2 - Data: 29M triples from Twitter - Because (c,m) is very long on average, the authors expect RLMT to perform poorly. - Vocabulary: 50k words, trained with NCE loss - Generated responses degrade with length after ~8 tokens #### Notes/Questions: - Limiting the context to a single message kind of defeats the purpose of this. No real conversations have only a single message as context, and who knows how well the approach works with a larger context? - The authors complain that dealing with long sequences is hard, but they don't even use an LSTM/GRU. Why? |
[link]
TLDR; The authors propose an importance-sampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only the target words occurring in that partition are chosen. To improve decoding speed over the full vocabulary, the authors build a dictionary mapping from source sentences to potential target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than the baseline models with smaller vocabulary. #### Key Points: - Computing the partition function is the bottleneck. Use a sampling-based approach. - Dealing with large vocabulary during training is separate from dealing with large vocab during decoding. Training is handled with importance sampling. Decoding is handled with a source-based candidate list. - Decoding with a candidate list takes around 0.12s (0.05s) per token on CPU (GPU). Without the target list, 0.8s (0.25s). - Issue: The candidate list depends on the source sentence, so it must be re-computed for each sentence. - Reshuffling the data set is expensive as new partitions need to be calculated (not necessary, but it improved scores). #### Notes: - How is the corpus partitioned? What's the effect of the partitioning strategy? - The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail about what this is. The results show that doing this results in a much larger score bump than increasing the vocab does. (The authors do this for all comparison models though.) - Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all of this into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would've liked to see the authors assign a global time budget to train/test and then compare the models based on that. - The authors only briefly mention that re-building the target vocab for each source sentence is an issue and how they solve it; no details are given. |
[link]
TLDR; The authors train a word-level NMT model where UNK tokens in both the source and target sentence are replaced by character-level RNNs that produce word representations. The authors can thus train a fast word-based system that still generalizes and doesn't produce unknown words. The best system achieves a new state of the art BLEU score of 19.9 on WMT'15 English to Czech translation. #### Key Points - Source Sentence: The final hidden state of the character-RNN is used as the word representation. - Source Sentence: Character RNNs are always initialized with a 0 state to allow efficient pre-training - Target: Produce the word-level sentence including UNKs first and then run the char-RNNs - Target: Two ways to initialize the char-RNN: With the same hidden state as the word-RNN (same-path), or with its own representation (separate-path) - Authors find that the attention mechanism is critical for pure character-based NMT models #### Notes - Given that the authors demonstrate the potential of character-based models, is the hybrid approach the right direction? If we had more compute power, would pure character-based models win? |
[link]
TLDR; The authors propose a new architecture called the "Pointer Network". A Pointer Network is a seq2seq architecture with an attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies based on input sequence length, a Pointer Network can generalize to variable-length inputs. The attention mechanism through which this is achieved is O(n^2), and only a slight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well on these, and generalizes to sequences longer than those found in the training data. #### Key Points - Similar to standard attention, but don't blend the encoder states; use the attention distribution directly as the output (see the sketch below). - The softmax probabilities of the outputs can be interpreted as a fuzzy pointer. - We could solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient. - 512-unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0. - In the case of TSP, the "student" network outperforms the "teacher" algorithm. #### Notes/ Questions - Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How?
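A minimal numpy sketch of a pointer-style output step (the attention scores over input positions become the output distribution); parameter names and sizes are illustrative:

```python
import numpy as np

def pointer_distribution(decoder_state, encoder_states, W1, W2, v):
    """Pointer Network output step: the attention scores over the input
    positions are used *directly* as the output distribution, so the
    'vocabulary' is the set of input indices."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ decoder_state) for e in encoder_states])
    p = np.exp(scores - scores.max())
    return p / p.sum()   # probability of pointing at each input position

n, d = 10, 256           # 10 input elements; sizes illustrative
enc = np.random.randn(n, d)
dec = np.random.randn(d)
p = pointer_distribution(dec, enc,
                         np.random.randn(d, d) * 0.1,
                         np.random.randn(d, d) * 0.1,
                         np.random.randn(d) * 0.1)
print(p.argmax(), p.sum())  # index of the selected input position, 1.0
```
|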
[link]
TLDR; The authors empirically evaluate seq2seq Neural Machine Translation systems. They find that performance degrades significantly as sentences get longer and as the number of unknown words in the source sentence increases. They therefore propose that more investigation into how to deal with large vocabularies and long-range dependencies is needed. The authors also present a new gated recursive convolutional network (grConv) architecture, which consists of a binary tree using GRU-like units. While this network architecture does not perform as well as the RNN encoder, it seems to learn grammatical properties, represented in the gate activations, in an unsupervised fashion. #### Key Points - GrConv: Each neuron is computed as a combination of the left and right neurons in the previous layer, gated by the activations of those neurons. 3 gates: left, right, reset. - In experiments, the encoder varies between RNN and grConv. The decoder is always an RNN. - Model size is only 500MB. 30k vocabulary. Only trained on sentences <= 30 tokens. Networks not trained to convergence. - Beam search with scores normalized by sequence length to choose translations. - Hypothesis is that the fixed vector representation is a bottleneck, or that the decoder is not powerful enough. #### Notes/Questions - The network is only trained on sequences <= 30 tokens. Can we really expect it to perform well on long sequences? Long sequences may inherently have grammatical structures that cannot be observed in short sequences. - There's a mistake in the new activation formula: the time superscript is wrong, it should be (t-1). |
[link]
TLDR; The authors train an RNN that takes as input a glimpse (a part of the image subsampled to a fixed size) and outputs the next glimpse location and an action (a prediction or agent move) at each step. Thus, the model adaptively selects which part of an image to "attend" to. By defining the number of glimpses and their resolutions we can control the complexity of the model independently of image size, which is not true for CNNs. The model is not differentiable, but can be trained using Reinforcement Learning techniques. The authors evaluate the model on the MNIST dataset, a cluttered version of MNIST, and a dynamic video game environment. #### Questions / Notes - I think the authors' claim that the model works independently of image size is only partly true, as larger images are likely to require more glimpses or bigger regions. - Would be nice to see some large-scale benchmarks, as MNIST is a very simple task. However, the authors clearly identify this as future work. - No mention of training time. Is it even feasible to train this for large images (which probably require more glimpses)? |
[link]
TLDR; The authors propose a novel architecture called ReNet, which replaces convolutional and max-pooling layers with RNNs that sweep over the image vertically and horizontally. These RNN layers are then stacked. The authors demonstrate that the ReNet architecture is a viable alternative to CNNs. ReNet doesn't outperform CNNs in this paper, but further optimization and hyperparameter tuning will likely lead to improved results in the future. #### Key Points: - Split images into patches and feed one patch per time step into an RNN, vertically then horizontally. 4 RNNs per layer, 2 vertical and 2 horizontal, one per direction (see the sketch after this summary). - Because the RNNs sweep over the whole image they can see the context of the full image, as opposed to just a local context in the case of conv/pool layers. - Smooth (differentiable) from end to end. - In experiments: 2 256-dimensional ReNet layers, 2x2 patches, 4096-dimensional affine layers. - Flipping and shifting for data augmentation. #### Notes/Questions: - What is the training time/complexity compared to a CNN? - Why split the image into patches at all? I wonder if the authors have experimented with various patch sizes, like defining patches that go over the full vertical height. The 2x2 patches used in the experiments seem quite small and like a waste of computational resources. |
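A rough PyTorch sketch of one ReNet layer under my reading of the paper: split the image into non-overlapping 2x2 patches, sweep a bidirectional RNN over each patch column, then sweep a second bidirectional RNN over the rows of the result. The cell type and sizes are illustrative, not the paper's exact choices.

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    """One ReNet layer: bidirectional RNN sweeps over patch columns, then rows."""
    def __init__(self, in_dim, hid=256, patch=2):
        super().__init__()
        self.patch = patch
        self.vert = nn.LSTM(in_dim * patch * patch, hid, batch_first=True, bidirectional=True)
        self.horiz = nn.LSTM(2 * hid, hid, batch_first=True, bidirectional=True)

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # Non-overlapping p x p patches -> (B, H/p, W/p, C*p*p)
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, H // p, W // p, C * p * p)
        Hp, Wp = x.shape[1], x.shape[2]
        # Vertical sweep: each patch column is a sequence of length Hp.
        v = x.permute(0, 2, 1, 3).reshape(B * Wp, Hp, -1)
        v, _ = self.vert(v)                              # (B*Wp, Hp, 2*hid)
        v = v.reshape(B, Wp, Hp, -1).permute(0, 2, 1, 3) # (B, Hp, Wp, 2*hid)
        # Horizontal sweep over the vertical sweep's outputs.
        h = v.reshape(B * Hp, Wp, -1)
        h, _ = self.horiz(h)
        return h.reshape(B, Hp, Wp, -1)                  # feature map, (B, Hp, Wp, 2*hid)

layer = ReNetLayer(in_dim=3)
feats = layer(torch.randn(4, 3, 32, 32))                 # (4, 16, 16, 512)
```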
[link]
TLDR; The authors present the Recurrent Memory Network. These networks use an attention mechanism (a memory bank, MB) to explicitly incorporate information about preceding words into the prediction at each time step. The MB is a layer that can be incorporated into any RNN, and the authors evaluate a total of 8 model variants: optionally stacking another LSTM layer on top of the MB, optionally including a temporal matrix in the attention calculation, and using a gating vs. a linear function for the MB output. The authors apply the model to Language Modeling tasks, achieving state-of-the-art performance and demonstrating that inspecting the attention weights yields intuitive insights into what the network learns: co-occurrence statistics and dependency-type information. The authors also evaluate the models on a sentence completion task, achieving a new state of the art. #### Key Points - RM: LSTM with MB as the top layer. No "horizontal" connections from MB to MB. - RMR: LSTM with MB and another LSTM stacked on top. - RM with gating typically outperforms RMR. - Memory Bank (MB): Input is the current hidden state and the n preceding inputs, including the current one. Attention over these inputs is calculated based on the hidden state. The output is a new hidden state, which can be calculated with or without gating. Optionally apply a temporal bias matrix to the attention calculation (see the sketch below). - Experiments: Hidden states and embeddings all of size 128. Memory size 15. SGD for 15 epochs, learning rate halved each epoch after the fourth. - Attention Analysis (Language Model): Obviously, most attention is given to current and recent words. But long-distance dependencies are also captured, e.g. separable verbs in German. The network also discovers dependency types. #### Notes/Questions - This work seems related to "Alternative structures for character-level RNNs" where the authors feed n-grams from previous words into the classification layer. The idea is to relieve the network from having to memorize these. I wonder how the approaches compare. - No related work section? I don't know if I like the name "memory bank" and the reference to Memory Networks here. I think the main idea behind Memory Networks was to reason over multiple hops. The authors here only make one hop, which is essentially just a plain attention mechanism. - I wonder why exactly the RMR performs worse than the RM. I can't easily find an intuitive explanation for why that would be. Maybe just not enough training data? - How did the authors arrive at their hyperparameters (128 dimensions)? 128 seems small compared to other models. |
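A hedged sketch of the memory bank as I read it: attend over the embeddings of the n most recent input words conditioned on the current LSTM state, and gate the attention summary with that state. The scoring and gating functions here are my guesses, not the paper's exact formulation, and the temporal bias matrix is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBank(nn.Module):
    """Attention over the last n input embeddings, conditioned on the LSTM state."""
    def __init__(self, dim, n_mem=15):
        super().__init__()
        self.n_mem = n_mem
        self.score = nn.Bilinear(dim, dim, 1)          # score(memory_i, hidden)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, memories, hidden):
        # memories: (batch, n_mem, dim) embeddings of recent words, hidden: (batch, dim)
        h_rep = hidden.unsqueeze(1).expand_as(memories)
        alpha = F.softmax(self.score(memories, h_rep).squeeze(-1), dim=-1)
        summary = (alpha.unsqueeze(-1) * memories).sum(dim=1)   # (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([summary, hidden], dim=-1)))
        return g * summary + (1 - g) * hidden                   # new state fed to the softmax

mb = MemoryBank(dim=128)
out = mb(torch.randn(4, 15, 128), torch.randn(4, 128))
```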
[link]
TLDR; The authors show that applying dropout to only the **non-recurrent** connections (between layers of the same timestep) in an LSTM works well, improving the scores on various sequence tasks. #### Data Sets and model performance - PTB Language Modeling Perplexity: 78.4 - Google Icelandic Speech Dataset WER Accuracy: 70.5 - WMT'14 English to French Machine Translation BLEU: 29.03 - MS COCO Image Caption Generation BLEU: 24.3 |
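For reference, PyTorch's built-in `dropout` argument on `nn.LSTM` already implements this scheme (dropout on the outputs between stacked layers only, never on the recurrent connections), so a sketch of the setup is short; the sizes below are mine, not the paper's.

```python
import torch
import torch.nn as nn

class DropoutLSTM(nn.Module):
    """Stacked LSTM with dropout on the non-recurrent (between-layer) connections only."""
    def __init__(self, vocab, dim=650, layers=2, p=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.in_drop = nn.Dropout(p)                    # input -> first layer
        # nn.LSTM's `dropout` acts between stacked layers, never on h_{t-1} -> h_t.
        self.lstm = nn.LSTM(dim, dim, num_layers=layers, dropout=p, batch_first=True)
        self.out_drop = nn.Dropout(p)                   # last layer -> softmax
        self.proj = nn.Linear(dim, vocab)

    def forward(self, tokens):                          # (batch, seq_len)
        x = self.in_drop(self.emb(tokens))
        h, _ = self.lstm(x)
        return self.proj(self.out_drop(h))              # (batch, seq_len, vocab)

model = DropoutLSTM(vocab=10_000)
logits = model(torch.randint(0, 10_000, (4, 35)))
```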
[link]
TLDR; The authors show that we can pre-train RNNs using unlabeled data by either reconstructing the original sequence (SA-LSTM), or predicting the next token as in a language model (LM-LSTM). We can then fine-tune the weights on a supervised task. Pre-trained RNNs are more stable, generalize better, and achieve state-of-the-art results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data. #### Data Sets Error rates for SA-LSTM, previous best results in parentheses. - IMDB: 7.24% (7.42%) - Rotten Tomatoes: 16.7% (18.5%) (using additional unlabeled data) - 20 Newsgroups: 15.6% (17.1%) - DBPedia character-level: 1.19% (1.74%) #### Key Takeaways - SA-LSTM: Reconstruct the input sequence from the final hidden state (sketched below). - LM-LSTM: Language-Model pretraining. - LSTM, 1024-dimensional cell, 512-dimensional embedding, 512-dimensional hidden affine layer + 50% dropout, truncated backprop over 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on the dev set. - Linear Gain: Inject the gradient at each step and linearly increase the weights of the prediction objectives. #### Notes / Questions - Not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on others it significantly improves performance. Any explanations? - Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens? - The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and to see results for other data sets. - All tasks are classification tasks. Does SA-LSTM also improve performance on seq2seq tasks? - What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with a mask?) |
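A minimal PyTorch sketch (mine) of the SA-LSTM pretraining objective: encode the sequence, then reconstruct it with a decoder initialized from the final encoder state, using teacher forcing. Sizes roughly follow the summary above; everything else is illustrative.

```python
import torch
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    """SA-LSTM-style pretraining: encode a sequence, reconstruct it from the final state."""
    def __init__(self, vocab, emb_dim=512, hid=1024):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid, batch_first=True)
        self.proj = nn.Linear(hid, vocab)

    def forward(self, tokens):                       # (batch, seq_len)
        x = self.emb(tokens)
        _, state = self.encoder(x)                   # keep only the final (h, c)
        # Teacher forcing with inputs shifted right (position t sees token t-1).
        dec_in = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, state)
        return self.proj(dec_out)                    # (batch, seq_len, vocab)

model = SequenceAutoencoder(vocab=20_000)
tokens = torch.randint(0, 20_000, (8, 50))
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 20_000), tokens.reshape(-1))
# After pretraining, the encoder weights initialize the LSTM of the supervised classifier.
```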
[link]
TLDR; The authors evaluate the impact of hyperparameters (embeddings, filter region size, number of feature maps, activation function, pooling, dropout and l2 norm constraint) on Kim's (2014) CNN for sentence classification. The authors present empirical findings with variance numbers based on a large number of experiments on 7 classification data sets, and give practical recommendations for architecture decisions. #### Key Points - Recommended baseline configuration (sketched below): word2vec, (3,4,5) filter regions, 100 feature maps per region size, ReLU activation, 1-max-pooling, 0.5 dropout, l2 norm constraint of 3 on the weight vector. - One-hot vectors perform worse than pre-trained embeddings. word2vec outperforms GloVe most of the time. - The best filter region size is dataset-dependent, in the range of 2-25. Recommended to do a line search over a single region size and then combine multiple sizes. - Increasing the number of feature maps per filter region to more than 600 doesn't seem to help much. - ReLU is almost always the best activation function. - Max-pooling is almost always the best pooling strategy. - Dropout from 0.1 to 0.5 helps; the l2 norm constraint not so much. #### Notes/Questions - All datasets analyzed in this paper are rather similar. They have similar average and max sentence lengths, and even the number of examples is of roughly the same magnitude. It would be interesting to see how the results change with very different datasets, such as long documents or very large numbers of training examples. |
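The recommended baseline configuration maps almost directly onto a small PyTorch model; here is a sketch of mine (word2vec initialization and the l2-norm constraint are only indicated in comments, and all other names are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Kim-style CNN with the recommended baseline settings."""
    def __init__(self, vocab, n_classes, emb_dim=300,
                 region_sizes=(3, 4, 5), n_maps=100, p_drop=0.5):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)          # in practice: init from word2vec
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_maps, kernel_size=k) for k in region_sizes)
        self.drop = nn.Dropout(p_drop)
        self.fc = nn.Linear(n_maps * len(region_sizes), n_classes)
        # The l2-norm constraint of 3 on self.fc.weight would be enforced by
        # rescaling the rows after each gradient step (not shown here).

    def forward(self, tokens):                           # (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)             # (batch, emb_dim, seq_len)
        # ReLU + 1-max pooling over time for each region size, then concatenate.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.drop(torch.cat(pooled, dim=1)))

model = SentenceCNN(vocab=20_000, n_classes=2)
logits = model(torch.randint(0, 20_000, (8, 40)))
```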
[link]
TLDR; The authors show that a multilayer LSTM RNN (4 layers, 1000 cells per layer, 1000-dimensional embeddings, 160k source vocab, 80k target vocab) can achieve competitive results on Machine Translation tasks. The authors find that reversing the input sequence leads to significant improvements, most likely due to the introduction of short-term dependencies that are more easily captured by the gradients. Somewhat surprisingly, the LSTM did not have difficulties on long sentences. The model is evaluated on MT tasks and achieves competitive results (34.8 BLEU) by itself, and close to state of the art if coupled with existing baseline systems (36.5 BLEU). #### Key Points - Inverting the input sequence leads to significant improvements. - A deep LSTM performs much better than a shallow LSTM. - Use different parameters for encoder and decoder. This allows training decoders for multiple languages in parallel. - 4 layers, 1000 cells per layer, 1000-dimensional word embeddings, 160k source vocabulary, 80k target vocabulary. Trained on 12M sentences (652M words). SGD with a fixed learning rate of 0.7, decreased by 1/2 every epoch after 5 initial epochs. Gradient clipping. Parallelization on GPU leads to 6.3k words/sec. - Batching sentences of approximately the same length leads to a 2x speedup. - PCA projection shows meaningful clusters of sentences robust to passive/active voice, suggesting that the fixed vector representation captures meaning. - "No complete explanation" for why the LSTM does so much better with the introduced short-range dependencies. - Beam size 1 already performs well; beam size 2 is best in the deep model. #### Notes/Questions - Seems like the performance here is mostly due to the computational resources available and an optimized implementation. These models are pretty big by most standards, and other approaches (e.g. attention) may lead to better results if they had more computational resources. - Reversing the input still feels like a hack to me; there should be a more principled solution for dealing with long-range dependencies. |
[link]
TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image. In order to find the correspondence between words and image patches, the RNN uses a lower convolutional layer as its input (before pooling). The authors propose both a "hard" attention (trained using sampling methods) and a "soft" attention (trained end-to-end) mechanism, and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attention-based models achieve state of the art on Flickr8k, Flickr30k and MS COCO. #### Key Points - To find image correspondences, attend over a lower convolutional layer (see the sketch below). - Two attention mechanisms: soft and hard. Depending on the evaluation metric (BLEU vs. METEOR) one or the other performs better. - The largest data set (MS COCO) takes 3 days to train on a Titan Black GPU. Oxford VGG is used as the encoder. - Soft attention is the same as for seq2seq models. - Attention weights are visualized by upsampling and applying a Gaussian. #### Notes/Questions - Would've liked to see an explanation of when/how soft vs. hard attention does better. - What is the computational overhead of using the attention mechanism? Is it significant? |
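A minimal PyTorch sketch of the soft-attention step: score each annotation vector from the lower conv layer against the decoder state, softmax, and take the weighted sum as the context vector. The MLP scoring form and sizes are illustrative, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention over conv-layer annotation vectors (one per image region)."""
    def __init__(self, feat_dim, hid_dim, att_dim=512):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)
        self.W_h = nn.Linear(hid_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, L, feat_dim), e.g. a 14x14x512 conv map flattened to L=196
        # hidden: (batch, hid_dim) decoder LSTM state
        e = self.v(torch.tanh(self.W_a(feats) + self.W_h(hidden).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(e, dim=-1)                     # (batch, L) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(1)   # (batch, feat_dim) fed to the decoder
        return context, alpha                            # alpha can be upsampled for visualization

att = SoftAttention(feat_dim=512, hid_dim=1000)
ctx, alpha = att(torch.randn(2, 196, 512), torch.randn(2, 1000))
```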
[link]
TLDR; The authors apply the skip-gram idea from word2vec at the sentence level, training encoder-decoder models that predict the previous and next sentences. The resulting general-purpose vector representations are called skip-thought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating fine-tuned models. #### Key Points - Code at https://github.com/ryankiros/skip-thoughts - Training is done on a large book corpus (74M sentences, 1B tokens) and takes 2 weeks. - Two variations: a bidirectional encoder and a unidirectional encoder, with 1200 and 2400 units per encoder respectively. GRU cell, Adam optimizer, gradient clipping at norm 10. - The vocabulary can be expanded by learning a mapping from a large word2vec vocab to the smaller skip-thought vocab (see the sketch below). Could also use sampling/hierarchical softmax during training for a larger vocab, or train on characters. #### Questions/Notes - The authors clearly state that this is not the goal of the paper, though I'd be curious how more sophisticated (non-linear) classifiers perform with skip-thought vectors. The authors probably tried this but it didn't do well ;) - The fact that the story generation doesn't seem to work well shows that the model has problems learning or understanding long-term dependencies. I wonder if this can be solved by deeper encoders or attention. |
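The vocabulary-expansion trick boils down to fitting a linear map from word2vec space into the skip-thought word-embedding space on the shared words and applying it to the rest; a minimal numpy sketch with illustrative dimensions (the regularized/constrained variants the authors may have used are not reproduced here).

```python
import numpy as np

def expand_vocab(w2v_shared, st_shared, w2v_new):
    """Map word2vec vectors into the skip-thought embedding space.

    w2v_shared: (n_shared, d_w2v)  word2vec vectors for words in both vocabularies
    st_shared:  (n_shared, d_st)   skip-thought word embeddings for the same words
    w2v_new:    (n_new, d_w2v)     word2vec vectors for words missing from the RNN vocab
    """
    # Least-squares fit of st ~= w2v @ W, then apply W to the unseen words.
    W, *_ = np.linalg.lstsq(w2v_shared, st_shared, rcond=None)
    return w2v_new @ W                                   # (n_new, d_st)

rng = np.random.default_rng(0)
w2v_shared, st_shared = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 620))
new_vecs = expand_vocab(w2v_shared, st_shared, rng.normal(size=(100, 300)))
```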
[link]
TLDR; The authors train an RNN-based topic model that takes word order into consideration and assumes that words in the same sentence share the same topic. The authors sample topic mixtures from a Dirichlet distribution and then train a "topic embedding" together with the rest of the generative LSTM. The model is evaluated quantitatively using perplexity on generated sentences and on classification tasks, and clearly beats competing models. Qualitative evaluation shows that the model can generate sensible sentences conditioned on the topic. |
[link]
TLDR; The authors introduce a new spatial transformer module that can be inserted into any neural network. The module consists of a localization network that predicts transformation parameters, a grid generator that produces a sampling grid over the input, and a sampler that produces the output. Possible learned transformations include cropping, translation, rotation, scaling and attention. The module can be trained end-to-end using backpropagation. The authors evaluate the module on both CNNs and MLPs, achieving state of the art on distorted MNIST data, street view house numbers, and fine-grained bird classification. #### Key Points: - STMs can be inserted between any layers, typically after the input or extracted features. The transformation is dynamic and happens based on the input data (a sketch follows this summary). - The module is fast and doesn't adversely impact training speed. - The actual transformation parameters (the output of the localization network) can be fed into higher layers. - Attention can be seen as a special transformation that increases computational efficiency. - Can also be applied to RNNs, but more investigation is needed. |
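PyTorch happens to ship the grid generator and sampler as `affine_grid`/`grid_sample`, so a sketch of the whole module reduces to a small localization network predicting a 2x3 affine matrix. The localization architecture and the identity initialization below are my choices, not necessarily the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localization net -> affine grid -> sampler, insertable between any two layers."""
    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 4 * 4, 6))
        # Start at the identity transform so the module is initially a no-op.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):                                # x: (B, C, H, W)
        theta = self.loc(x).view(-1, 2, 3)               # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

stn = SpatialTransformer(in_ch=1)
out = stn(torch.randn(4, 1, 28, 28))                     # same shape, transformed input
```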
[link]
TLDR; The authors randomly drop entire layers during training using a modified ResNet architecture. The survival probability decreases linearly with depth (higher layers have a higher chance of being dropped), ending at 0.5 for the final layer in the experiments. This mechanism helps with vanishing gradients, diminishing feature reuse, and long training times. The model achieves new records on the CIFAR-10, CIFAR-100 and SVHN datasets. #### Key Points: - The ResNet architecture can easily be modified to drop out a whole layer by keeping only the identity skip connection (see the sketch below). - Lower layers get a lower probability of being dropped since they intuitively contain more "stable" features. The authors use linear decay with a final survival probability of 0.5. - Training time is reduced by 25%-50% depending on the dropout probability hyperparameter. - The authors find that vanishing gradients are indeed reduced by plotting gradient magnitudes vs. number of epochs. - Can be interpreted as an ensemble of networks of varying depth. - All layers are used at test time, and activations need to be scaled appropriately. - The authors successfully train networks with 1000+ layers and achieve further error reduction. |
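A minimal PyTorch sketch (mine) of a residual block with stochastic depth: during training the residual branch is skipped with some probability, at test time it is always applied but scaled by its survival probability, and survival probabilities decay linearly across blocks. The branch architecture is illustrative.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose transform branch is randomly skipped during training."""
    def __init__(self, channels, survival_prob):
        super().__init__()
        self.survival_prob = survival_prob
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(), nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() > self.survival_prob:
                return x                                  # whole layer dropped: identity only
            return x + self.branch(x)
        # Test time: always apply the branch, scaled by its survival probability.
        return x + self.survival_prob * self.branch(x)

# Linearly decaying survival probability: 1.0 at the first block down to 0.5 at the last.
n_blocks = 10
blocks = [StochasticDepthBlock(64, 1.0 - 0.5 * l / (n_blocks - 1)) for l in range(n_blocks)]
net = nn.Sequential(*blocks)
out = net(torch.randn(2, 64, 32, 32))
```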
[link]
TLDR; The authors evaluate softmax, hierarchical softmax, target sampling, NCE, self-normalization and differentiated softmax (a novel technique presented in the paper) on data sets with varying vocabulary sizes (10k, 100k, 800k) under a fixed training-time budget. The authors find that techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies. #### Data and Models Models: - Softmax - Hierarchical softmax (cross-validation of clustering techniques) - Differentiated softmax, adjusting capacity based on token frequency (cross-validation of the number of frequency bands and their sizes) - Target sampling (cross-validation of the number of distractors) - NCE (cross-validation of the noise ratio) - Self-normalization (cross-validation of the regularization strength) Data: - PTB (1M tokens, 10k vocab) - Gigaword (5B tokens, 100k vocab) - billionW (800M tokens, 800k vocab) #### Key Takeaways - Techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies. - Differentiated softmax varies the capacity (the size of the matrix slice in the last layer) based on token frequency. In practice, it's implemented as separate matrices of different sizes (sketched below). - Perplexity doesn't seem to improve much after ~500M tokens. - Models are trained for 1 week each. - The competitiveness of softmax diminishes with vocabulary size. It performs relatively well on 10k and 100k, but poorly on 800k since it needs more processing time per example. - Training time, not training data, is the main factor limiting performance. The authors found that very large models are still making progress after one week and may eventually beat the other models if allowed to run longer. #### Questions / Notes - What about the hyperparameters for differentiated softmax? The paper doesn't show an analysis. Also, the fact that this method introduces two additional hyperparameters makes it harder to apply in practice. - Would've liked to see more comparisons for softmax, which is the simplest technique of all and doesn't need hyperparameter tuning. It doesn't work well on an 800k vocab, but it does for 100k. So the authors only show how it breaks down for one dataset. |
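A hedged sketch of differentiated softmax as described above: the vocabulary is split into frequency bands, each band gets its own output matrix acting on a different-sized slice of the final hidden layer, and the band logits are concatenated before a single softmax. Band sizes and slice sizes below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class DifferentiatedSoftmax(nn.Module):
    """Output layer where frequent words get a larger slice of the hidden state."""
    def __init__(self, hidden_dim, band_sizes, slice_dims):
        # band_sizes: words per frequency band (most frequent band first)
        # slice_dims: hidden-state slice per band; must sum to hidden_dim
        super().__init__()
        assert sum(slice_dims) == hidden_dim
        self.slice_dims = list(slice_dims)
        self.bands = nn.ModuleList(
            nn.Linear(d, v) for d, v in zip(slice_dims, band_sizes))

    def forward(self, h):                                # h: (batch, hidden_dim)
        slices = torch.split(h, self.slice_dims, dim=-1)
        logits = [band(s) for band, s in zip(self.bands, slices)]
        return torch.cat(logits, dim=-1)                 # (batch, total_vocab), ordered by frequency

dsm = DifferentiatedSoftmax(768, band_sizes=(10_000, 40_000, 150_000), slice_dims=(512, 192, 64))
logits = dsm(torch.randn(4, 768))                        # feed to cross-entropy as usual
```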
[link]
TLDR; The authors propose two LSTM-based models for target-dependent sentiment classification. TD-LSTM uses two LSTM networks running towards the target word from the left and right respectively, making a prediction at the target time step. TC-LSTM is the same, but additionally incorporates an averaged target word vector as an input at each time step. The authors evaluate their models with pre-trained word embeddings on a Twitter sentiment classification dataset, achieving state of the art. #### Key Points - TD-LSTM: Two LSTM networks, running from the left and from the right towards the target. The final states of both networks are concatenated and the prediction is made at the target word. - TC-LSTM: Same architecture as TD-LSTM, but also incorporates a target vector as an input at each time step. This vector is the average of the word vectors of the target phrase. - Embeddings seem to make a huge difference; state of the art is only obtained with 200-dimensional GloVe embeddings. #### Notes/Questions - A *huge* fraction of the performance improvement comes from pre-trained word embeddings. Without these, the proposed models clearly underperform simpler models. This raises the question of whether incorporating the same embeddings into the simpler models would close the gap. - Would've liked to see performance without *any* pre-trained embeddings. - The authors also experimented with attention mechanisms, but weren't able to achieve good results. The small size of the training corpus may be the reason for this. |
[link]
TLDR; The authors generate a large dataset (~1M examples) for question answering by using cloze deletion on summaries of crawled CNN and Daily Mail articles. They evaluate 2 baselines, 2 symbolic models (frame semantic, word distance), and 4 neural models (Deep LSTM, Uniform Reader, Attentive Reader, Impatient Reader) on the dataset. The neural models, particularly those with attention, beat the symbolic models. - Deep LSTM: 2-layer bidirectional LSTM without an attention mechanism - Attentive Reader: 1-layer bidirectional LSTM with an attention mechanism over the whole query - Impatient Reader: 1-layer bidirectional LSTM with an attention mechanism for each token in the query (can be interpreted as being able to re-read the document at each token) - Uniform Reader: Uniform attention over all document tokens In their experiments, the authors randomize document entities to avoid letting the models rely on world knowledge or co-occurrence statistics, instead purely testing document comprehension. This is done by replacing entities with consistent ids *within* a document, but using different ids across documents (see the sketch below). #### Data and model performance All numbers are accuracies on two datasets (CNN, Daily Mail) - Maximum Frequency Entity Baseline: 33.2 / 25.5 - Exclusive Frequency Entity Baseline: 39.3 / 32.8 - Frame-semantic model: 40.2 / 35.5 - Word distance model: 50.9 / 55.5 - Deep LSTM Reader: 57.0 / 62.2 - Uniform Reader: 39.4 / 34.4 - Attentive Reader: 63.0 / 69.0 - Impatient Reader: 63.8 / 68.0 #### Key Takeaways - The input to the RNN is defined as QUERY <DELIMITER> DOCUMENT, which is then embedded with or without attention and run through `softmax(W*x)`. - Some sequences are very long, up to 2000 tokens, with an average length of 763 tokens. All LSTM models seem to be able to deal with this, but the attention models show significantly higher accuracy. - Very nice attention visualizations and negative-example analysis that show the attention-based models focusing on the relevant parts of the document to answer the questions. #### Notes / Questions - How does document length affect the Deep LSTM Reader? The appendix shows an analysis for the attention models, but not for the Deep LSTM. A goal of the paper was to show that attention mechanisms are well suited for long documents because the fixed vector encoding is a bottleneck. The results here aren't clear. - Are the gradients truncated? I can't imagine the network is unrolled for 2000 steps. The training details don't mention this. - The mathematical notation in this paper needs some love. The concepts are relatively simple, but the formulas are hard to parse. - What if you limited the output vocabulary to words appearing in the query document? - Can you apply the same "attention-based embedding" mechanism to text classification? |
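A small Python sketch (mine) of the entity-anonymization step: entity strings get consistent @entityN ids within a document/query pair, with the assignment re-shuffled per document so that ids carry no cross-document meaning. Entity detection and tokenization are assumed to be done already.

```python
import random

def anonymize(doc_tokens, query_tokens, entities, rng=random):
    """Replace entity strings with @entityN ids, consistent within one (doc, query) pair.

    entities: the set of entity strings detected in this document (assumed given).
    The permutation is re-drawn per document, so ids carry no world knowledge.
    """
    ids = [f"@entity{i}" for i in range(len(entities))]
    rng.shuffle(ids)
    mapping = dict(zip(sorted(entities), ids))
    substitute = lambda toks: [mapping.get(t, t) for t in toks]
    return substitute(doc_tokens), substitute(query_tokens), mapping

doc = "the bbc reported that obama met putin in moscow".split()
query = "obama met @placeholder in moscow".split()
entities = {"bbc", "obama", "putin", "moscow"}
anon_doc, anon_query, mapping = anonymize(doc, query, entities)
```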
[link]
TLDR; The authors apply 6-layer and 9-layer (+3 affine) convolutional nets to character-level input and evaluate their models on Sentiment Analysis and Categorization tasks using (new) large-scale data sets. The authors don't use pre-trained word embeddings, or any notion of words, and instead learn directly from character-level input with characters encoded as one-hot vectors (see the sketch below). This means the same model can be applied to any language (provided the character vocabulary is small enough). The models presented in this paper beat BoW and word2vec baseline models. #### Data and model performance Because existing data sets were too small, the authors collected several new datasets that don't have standard benchmarks. - DBpedia Ontology Classification: 560k training, 70k test - Amazon Reviews 5-class: 3M train, 650k test - Amazon Reviews polar: 3.6M train, 400k test - Yahoo! Answers topics 10-class: 1.4M train, 60k test - AG news classification 4-class: 120k train, 1.9k test - Sogou Chinese News 5-class: 450k train, 60k test Model accuracy for small and large models: - DBpedia: 98.02 / 98.27 - Amazon 5-class: 59.47 / 58.69 - Amazon 2-class: 94.50 / 94.49 - Yahoo 10-class: 70.16 / 70.45 - AG 4-class: 84.35 / 87.18 - Chinese 5-class: 91.35 / 95.12 #### Key Takeaways - Pretty standard CNN architecture applied to characters: conv, ReLU, max-pool, fully-connected. Filter sizes of 7 and 3. See the paper for parameter details. - Training takes a long time, presumably due to the size of the data. The authors quote 5 days per epoch on the large Amazon data set with the large model. - The authors can't handle large character vocabularies, so they romanize Chinese. - The authors experiment with randomly replacing words with synonyms, which seems to give a small improvement. #### Notes / Questions - The authors claim to do "text understanding" and learn representations, but all experiments are on simple classification tasks. There is no evidence that the network actually learns meaningful high-level representations and doesn't just memorize n-grams, for example. - These data sets are large, and the authors claim that they need large data sets, but there are no experiments in the paper that show this. How does performance vary with data size? - The comparison with other models is lacking. I would have liked to see some of the other state-of-the-art models being compared, e.g. Kim's CNN. Comparing with BoW doesn't show much. As these models are openly available the comparison should have been easy. - The romanization of Chinese is an ugly "hack" that goes against what the authors claim: being language-independent and learning "from scratch". - It's strange that the authors use a thesaurus as a means of training example augmentation, as a thesaurus is word-level and language-specific, something that the authors explicitly argue against in this paper. Perhaps they could have used word (or character-level) dropout instead. - Are there any hyperparameters that were optimized? The authors don't mention any dev sets. - Have the datasets been made publicly available? The authors complain that "the unfortunate fact in literature is that there are no large openly accessible datasets", but fail to publish their own. - I'd expect the confusion matrix for the 5-star Amazon reviews to show mistakes coming from negations, but it doesn't, which suggests that the model really learns meaningful representations (such as negation). |
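A minimal numpy sketch of the character quantization: each character from a fixed alphabet becomes a one-hot row, out-of-alphabet characters and padding become all-zero rows, and texts are cut or padded to a fixed length. The alphabet and frame length below are illustrative, not necessarily the paper's exact choices.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_len=1014):
    """Encode text as a (max_len, alphabet_size) one-hot matrix.

    Unknown characters and padding positions are all-zero rows, so the model
    never sees words, only raw characters.
    """
    x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            x[i, idx] = 1.0
    return x

x = quantize("This movie was surprisingly good!")   # ready for a 1-D conv stack over axis 0
```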
[link]
TLDR; The authors propose "Highway Networks", which uses gates (inspired by LSTMs) to determine how much of a layer's activations to transform or just pass through. Highway Networks can be used with any kind of activation function, including recurrent and convnolutional units, and trained using plain SGD. The gating mechanism allows highway networks with tens or hundreds of layers to be trained efficiently. The authors show that highway networks with fewer parameters achieve results competitive with state-of-the art for the MNIST and CIFAR tasks. Gates outputs vary significantly with the input examples, demonstrating that the network not just learns a "fixed structure", but dynamically routes data based for specific examples examples. Datasets used: MNIST, CIFAR-10, CIFAR-100 #### Key Takeaways - Apply LSTM-like gating to networks layers. Transform gate T and carry gate C. - The gating forces the layer inputs/outputs to be of the same size. We can use additional plain layers for dimensionality transformations. - Bias weights of the transform gates should be initialized to negative values (-1, -2, -3, etc) to initially force the networks to pass through information and learn long-term dependencies. - HWN does not learn a fixed structure (same gate outputs), but dynamic routing based on current input. - In complex data sets each layer makes an important contritbution, which is shown by lesioning (setting to pass-through) individual layers. #### Notes / Questions - Seems like the authors did not use dropout in their experiments. I wonder how these play together. Is dropout less effective for highway networks because the gates already learn efficients paths? - If we see that certain gates outputs have low variance across examples, can we "prune" the network into a fixed strucure to make it more efficient (for production deployments)? |
[link]
TLDR; The authors evaluate the use of a bidirectional LSTM RNN on POS tagging, chunking and NER tasks. The inputs are task-independent features: the word and its capitalization. The authors incorporate prior knowledge about the tagging tasks by restricting the decoder to output valid sequences of tags, and also propose a novel way of learning word embeddings: randomly replacing words in a sequence and using an RNN to predict which words are correct vs. incorrect. The authors show that their model, combined with pre-trained word embeddings, performs on par with state-of-the-art models. #### Key Points - Bidirectional LSTM with 100-dimensional embeddings and 100-dimensional cells. Both 1 and 2 layers are evaluated. Tags are predicted at each step. A higher cell dimensionality results in little improvement. - Word vector pretraining: Randomly replace words and use an LSTM to predict which words are correct/incorrect. #### Notes/Questions - The fact that we need a task-specific decoder kind of defeats the purpose of this paper. The goal was to create a "task-independent" system. To be fair, the need for this decoder is probably only due to the small size of the training data: not all tag combinations appear in the training data. - The comparisons with other state-of-the-art systems are somewhat unfair since the proposed model heavily relies on pre-trained word embeddings from external data (trained on more than 600M words) to achieve good performance. It also relies on external embeddings trained in yet another way. - I'm surprised that the authors didn't try combining all of the tagging tasks into one model, which seems like an obvious extension. |
[link]
TLDR; The authors propose a web navigation task where an agent must find a target page containing a search query (typically a few sentences) by navigating a web graph with restrictions on memory, path length and the number of explorable nodes. They train feedforward and recurrent neural networks and evaluate their performance against that of human volunteers. #### Key Points - Datasets: Wiki-[NUM_ALLOWED_HOPS]: WikiNav-4 (6k train), WikiNav-8 (890k train), WikiNav-16 (12M train). The authors evaluate various query lengths for all data sets. - Vector representation of pages: BoW of pre-trained word2vec embeddings. - State-dependent action space: all possible outgoing links on the current page. At each step, the agent can peek at the neighboring nodes and see their full content. - During training, a single correct path is fed to the agent. Beam search is used to make predictions. - NeuAgent-FF uses a single tanh layer. NeuAgent-Rec uses an LSTM. - Human performance is typically worse than that of the neural agents. #### Notes/Questions - Is it reasonable to allow the agents to "peek" at neighboring pages? Humans can make decisions based on the hyperlink context. In practice, peeking at each page may not be feasible if there are many links on the page. - I'm not sure I buy the claim that this task requires Natural Language Understanding. Agents are just matching query word vectors against pages, which is no indication of NLU. An indication of NLU would be if the query were posed in question format, which is typically short. But here, the authors use several sentences as queries, and longer queries lead to better results, suggesting that the agents don't actually have any understanding of language. They just match text. - The authors say that NeuAgent-Rec performed consistently better for high hop lengths, but I don't see that in the data. - The training method seems a bit strange to me because the agent is fed only one correct path, but in reality there are a large number of correct paths and target pages. It may be more sensible to train the agent with all possible target pages and paths to answer a query. |