First published: 2017/04/23 (6 years ago) Abstract: Several approaches have recently been proposed for learning decentralized
deep multiagent policies that coordinate via a differentiable communication
channel. While these policies are effective for many tasks, interpretation of
their induced communication strategies has remained a challenge. Here we
propose to interpret agents' messages by translating them. Unlike in typical
machine translation problems, we have no parallel data to learn from. Instead
we develop a translation model based on the insight that agent messages and
natural language strings mean the same thing if they induce the same belief
about the world in a listener. We present theoretical guarantees and empirical
evidence that our approach preserves both the semantics and pragmatics of
messages by ensuring that players communicating through a translation layer do
not suffer a substantial loss in reward relative to players with a common
This paper has an unusual and interesting goal, compared to those I more typically read: it wants to develop a “translation” between the messages produced by a model, and natural language used by a human. More specifically, the paper seeks to do this in the context of an two-player game, where one player needs to communicate information to the other. A few examples of this are:
- Being shown a color, and needing to communicate to your partner so they can choose that color
- Driving, in an environment where you can’t see the other car, but you have to send a coordinating message so that you don’t collide
Recently, people have started training multi-agent that play games like these, where they send “message” vectors back and forth, in a way fully integrated with the rest of the backpropogation procedure. From just observing the agents’ actions, it’s not necessarily clear which communication strategy they’re using. That’s why this paper poses as an explicit problem: how can we map between the communication vectors produced by the agents and the words that would be produced by a human in a similar environment?
Interestingly, the paper highlights two different ways you could think about structuring a translation objective. The first is “pragmatic interpretation,” under which you optimize what you communicate about something according to the operation that needs to be performed afterwards. To make that more clear, take a look at the attached picture. Imagine that player one is shown a shape, and needs to use a phrase from the bottom language (based on how many sides the shape has) to describe it to player two, who then needs to guess the size of the shape (big or small), and is rewarded for guessing correctly. Because “many” corresponds to both a large and a small shape, the strategy that optimizes the action that player two takes, conditional on getting player one’s message, is to lie and describe a hexagon as “few”, since that will lead to correct inference about the size of the shape, which is what’s most salient here. This example shows how, if you optimize a translation mapping by trying to optimize the reward that the post-translation agent can get, you might get a semantically incorrect translation. That might be good for the task at hand, but, because it leaves you with incorrect beliefs about the true underlying mapping, it will generalize poorly to different tasks.
The alternate approach, championed by the paper, is to train a translation such that the utterances in both languages are similar insofar as, conditional on hearing them, and having some value for their own current state, the listening player arrives at similar beliefs about the current state of the player sending the message. This is mathematically framed as by defining a metric q, representing the quality of the translation between two z vectors, as: “taking an expectation over all possible contextual states of (player 1, player 2), what is the difference between the distribution of beliefs about the state of player 1 (the sending player) induced in player 2 by hearing each of the z vectors. Because taking the full expectation over this joint distribution is intractable, the approach is instead done by sampling.
These equations require that you have reasonable models of human language, and understanding of human language, in the context of games. To do this, the authors used two types of datasets:
1. Linguistic descriptions of objects of things, like the xkcd color dataset. Here, the player’s hidden state is the color that they are trying to describe using some communication scheme.
2. Mechanical turk game runs playing the aforementioned driver game, where they have to communicate to the other driver. Here, the player’s “hidden state” represents a combination of its current location and intentions.
From these datasets, they can train simple emulator models that learn “what terms is a human most likely to use for a given color” [p(z|x)], and “what colors will a human guess, conditional on those terms”.
The paper closes by providing a proof as to how much reward-based value is lost by optimizing for the true semantic meaning, rather than the most pragmatically useful translation. They find that there is a bound on the gap, and that, in many empirical cases, the observed gap is quite small.
Overall, this paper was limited in scope, but provided an interesting conceptual framework for thinking about how you might structure a translation, and the different implications that structure might have on your results.
RNN language models are composed of:
1. Embedding layer
2. Recurrent layer(s) (RNN/LSTM/GRU/...)
3. Softmax layer (linear transformation + softmax operation)
The embedding matrix and the matrix of the linear transformation just before the softmax operation are of the same size (size_of_vocab * recurrent_state_size) .
They both contain one representation for each word in the vocabulary.
## __Weight Tying__
This paper shows, that by using the same matrix as both the input embedding and the pre-softmax linear transformation (the output embedding), the performance of a wide variety of language models is improved while the number of parameters is massively reduced.
In weight tied models each word has just one representation that is used in both the input and output embedding.
## __Why does weight tying work?__
1. In the paper we show that in un-tied language models, the output embedding contains much better word representations that the input embedding. We show that when the embedding matrices are tied, the quality of the shared embeddings is comparable to that of the output embedding in the un-tied model. So in the tied model the quality of the input and output embeddings is superior to the quality of those embeddings in the un-tied model.
2. In most language modeling tasks because of the small size of the datasets the models tend to overfit. When the number of parameters is reduced in a way that makes sense there is less overfitting because of the reduction in the capacity of the network.
## __Can I tie the input and output embeddings of the decoder of an translation model?__
Yes, we show that this reduces the model's size while not hurting its performance.
In addition, we show that if you preprocess your data using BPE, because of the large overlap between the subword vocabularies of the source and target language, __Three-Way Weight Tying__ can be used. In Three-Way Weight Tying, we tie the input embedding in the encoder to the input and output embeddings of the decoder (so each word has one representation which is used across three matrices).
[This](http://ofir.io/Neural-Language-Modeling-From-Scratch/) blog post contains more details about the weight tying method.
Incorporating an unsupervised language modeling objective to help train a bidirectional LSTM for sequence labeling. At the same time as training the tagger, the forward-facing LSTM is optimised to predict the next word and the backward-facing LSTM is optimised to predict the previous word. The model learns a better composition function and improves performance on NER, error detection, chunking and POS-tagging, without using additional data.
A toolkit for automatically annotating error correction data with error types. It takes original and corrected sentences as input, aligns them to infer error spans, and uses rules to assign error types. They use the tool to perform fine-grained evaluation of CoNLL-14 shared task participants.
Investigates different parameter choices for encoder-decoder NMT models. They find that LSTM is better than GRU, 2 bidirectional layers is enough, additive attention is the best, and a well-tuned beam search is important. They achieve good results on the WMT15 English->German task and release the code.
The paper proposes integrating a pre-trained language model into a sequence labeling model. The baseline model for sequence labeling is a two-layer LSTM/GRU. They concatenate the hidden states from pre-trained language models onto the output of the first LSTM layer. This provides an improvement on NER and chunking tasks.
They propose a neural architecture for assigning fine-grained labels to detected entity types. The model combines bidirectional LSTMs, attention over the context sequence, hand-engineered features, and the label hierarchy. They evaluate on Figer and OntoNotes datasets, showing improvements from each of the extensions.
They propose neural models for dialogue state tracking, making a binary decision for each possible slot-value pair, based on the latest context from the user and the system. The context utterances and the slot-value option are encoded into vectors, either by summing word representations or using a convnet. These vectors are then further combined to produce a binary output. The systems are evaluated on two dialogue datasets and show improvement over baselines that use hand-constructed lexicons.
Proposing character-based extensions to a neural MT system for grammatical error correction. OOV words are represented in the encoder and decoder using character-based RNNs. They evaluate on the CoNLL-14 dataset, integrate probabilities from a large language model, and achieve good results.
This paper attempts to open up the black box of neural machine translation models and inspect what the representations look like, specifically with respect to morphology. The technique they use is to train word-based and character-based seq2seq-style models on multiple source-target language pairs, of varying morphological complexity, and then ignore the target side to focus on the representations learned about the source language. Once they have an encoder trained to generate these representations, they attempt to use the encoder to create feature representations for external tasks that directly evaluate for morphology and part of speech information. (Contrast this with methods that may, for example, try to inspect activation patterns of individual neurons in a trained model.)
The first experiment shows that representations learned from character-based models are superior for POS tagging in the source language. The gap is bigger for morphologically rich languages like Arabic. The same result holds for morphological tagging. For infrequent words the gap is especially large -- the system can memorize morphological information for frequent words. They also show that the increases in accuracy are due to getting prevoiusly unseen words correct (both for POS and morph prediction) and that the biggest increase in accuracy is in predicting plural and determined noun categories. Next, they show that in a deeper network, the middle layer (of 3) has the best representations for predicting pos/morph information. The authors suggest the higher layers are more focused on semantics or other higher abstractions.
Overall, this work empirically confirms some conventional wisdom, that character representations are better for unseen words because of their ability to represent morphology.
(Reposting under ACL 2017 version)
Kind of a response/deeper dive into the durret/klein "easy victories" paper. Suggests that a) lexical features they used ("easy victories") are very prone to overfitting. They first show that several state of the art systems that use lexical features, trained on CoNLL data, perform poorly on wikiref, which was annotated using the same guidelines. Meanwhile the stanford sieve system performs about the same on both.
Then they show that a high percentage of gold standard linked headwords in the test set have been seen in the training set, and that a much lower percentage of errors are in the training set, implying that lexical features just allow you to memorize what kinds of things can be linked.
They suggest development of robust features, including using embeddings as lexical features, using lexical representations only for context, and on the evaluation side, using test sets that are different domains than the training set.
Multilingual embeddings are useful for creating embeddings for low resource languages for things like transfer learning (e.g., learning a POS tagger in a low-resource language using training data from a high resource language). However, they typically require some small amount of supervision in the form of aligned corpora, seed pairs, or dictionaries. This approach attempts to learn a mapping from a source embedding space into a target embedding space without supervision.
The approach uses two networks a la adversarial training. One network (the generator) is parameterized by a projection matrix that attempts to map source words into the target space. The other network (the discriminator) attempts to discriminate true target embeddings from projected source embeddings. Since adversarial training is known to be unstable (a "research frontier" as the authors say), quite a bit of the paper describes tricks and training methods the authors investigated to get training to converge and understand how to select models.
They evaluate on many pairs, including both similar and dissimilar language pairs, and get very nice results. In summary, better than seed-based approaches with 0-100 seeds, competitive with 100-1000 seeds. Much of what would be traditional discussion is instead devoted to details of training regimen, so unfortunately there is little discussion of why this works. Given the difficulty one might encounter attempting to train this, I think it might be a little preliminary to try using this for applications, but continued research in training adversarial networks for NLP and properties of embedding spaces could potentially make this approach reliable enough for real applications.
They get multilingual alignments from dictionaries, then train a Bilstm pos tagger in source language, then automatically tag many tokens in the target language, then manually annotate 1000 tokens in target language, then train a system with combined loss over distant tagging and gold tagging. They add an additional output layer that is learned for the gold annotations.