[link]
This paper builds on "Learned in Translation: Contextualized Word Vectors", which learned contextualized word representations by using the sequence of encodings generated by a bidirectional LSTM as the representation of the sequence of input words. This paper asks: if we're learning a deep LSTM, i.e. one with more than one layer, why should we use only the last layer it produces as the representation of a word? It instead suggests that it could be valuable for transfer learning if each task can learn the weighting of layer encodings that is most valuable for that task. In a prime example of "your model is a special case of my model," the authors note that this framework can easily recover the approach of using only the final encoding layer, by simply giving only that layer a non-zero weight. As intuition for why this might be a valuable thing to do: different layers tend to capture different levels of meaning, with lower layers more likely to capture part-of-speech information, and higher layers more likely to capture richer semantic context.

![](https://i.imgur.com/s8Qn6YY.png)

One difficulty in comparing this paper directly to the "take the top-layer encoding from an LSTM" paper is that they were trained on different problems: the earlier paper learned using a machine translation objective, whereas this one learns using a much simpler language model. Here, a simple language model means an RNN trained to predict the next word, given the hidden state built up over all prior words. Because we want to pull in word context from both directions, this isn't just an LSTM but a bidirectional LSTM, whose backward direction - surprise, surprise - tries to predict the word *before* a given position in a sentence by using all of the words that come after it. This has the advantage of not requiring parallel data, the way machine translation does, but it also makes it difficult to make direct comparisons to prior work that isolate the effect of multi-layer combination, as separate from the switch between machine translation and direct language modeling.

Although this is likely also a benefit you see with just top-layer contextual vectors, it is interesting to examine the attached table and look at how effectively the model learns different representations of the word "play" depending on the context in which it appears; in each case, the nearest neighbor of a contextual use of "play" is a sentence in which the word is used in the same sense.
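As a minimal sketch of the layer-mixing idea (in PyTorch; the parameterization here - softmax-normalized per-layer scalars plus a global scale - follows my reading of the approach, and the details should be treated as illustrative rather than the authors' exact code):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learn a task-specific weighted sum of biLSTM layer outputs.

    `layer_outputs` is assumed to be a list of tensors, one per layer,
    each of shape (batch, seq_len, dim).
    """
    def __init__(self, num_layers):
        super().__init__()
        # one scalar weight per layer, plus a global scale, all learned per task
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # softmax so the per-layer weights are positive and sum to 1
        norm_weights = torch.softmax(self.weights, dim=0)
        mixed = sum(w * h for w, h in zip(norm_weights, layer_outputs))
        return self.gamma * mixed
```

Using only the final layer is the special case where the softmax puts essentially all of its mass on the last weight.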
[link]
This paper is a clever but conceptually simple idea for improving the vectors learned for individual words. In the proposed approach, instead of learning a distinct vector per word in the vocabulary, the model views a word as being composed of overlapping character n-grams, which are combined to make the full word. Recall: in the canonical skip-gram approach to learning word embeddings, each word is represented by a single vector. The word might be tokenized first (for example, de-pluralized), but, fundamentally, there isn't any way for the network to share information about the meanings of "liberty" and "liberation"; even though a human could see that they share root structure, for a skip-gram model they are two totally distinct concepts that need to be learned separately. The premise of this paper is that this approach leaves valuable information on the table, and that by learning vectors for subcomponents of words, and combining them to represent the whole word, we can more easily identify and capture shared patterns.

On a technical level, this is done by:

- For each n in the selected range (typically 3-6 inclusive), representing each input word as the set of overlapping character windows of length n, with special characters marking the start and end of the word. For example, if n=3 and the word is "where", it is represented as ["<wh", "whe", "her", "ere", "re>"].
- In addition to this set of n-grams, representing each word through the full-word sequence "<where>", to "catch" any leftover meaning not captured in the smaller n-grams.
- When calculating the loss, representing each word's vector as simply the sum of the vectors of all of its component n-grams.

![](https://i.imgur.com/NP5qFEV.png)

This has some interesting consequences. First off, the perplexity of the model, which you can think of as a measure of unsupervised goodness of fit, is equivalent or improved by this approach relative to baselines on all but one model. Intriguingly, and predictably once you think about it, the advantage of the subword approach is much stronger for languages like German, Russian, and Arabic, which have strong re-use and aggregation of root words, and also strong patterns of morphological mutation of words. Additionally, the authors found that the subword model reached its minimum loss value using much less data than the canonical approach. This makes decent sense if you consider that subcomponent re-use means there are fewer meaningful word subcomponents than there are unique words, and seeing a subcomponent used across many words means you need fewer words to learn the patterns it corresponds to.

![](https://i.imgur.com/EmN167L.png)

A lot of the benefit of this approach seems to come through better representation of syntax: when tested on an analogy task, embeddings trained with subword information did meaningfully better on syntactic analogies ("swim is to swum as ran is to <>") but equivalent or worse on semantic analogies ("mother is to girl as father is to <>"). One theory is that focusing on subword elements does a better job of quickly getting the representation of a word close to its exact meaning, but has a harder time learning precise semantics, relative to a full-word model.
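A small sketch of the n-gram decomposition described above (the function name is mine; real implementations typically also hash n-grams into a fixed number of buckets to bound memory, which is omitted here):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Overlapping character n-grams of a word, with boundary markers,
    plus the full word itself as a special sequence."""
    marked = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            ngrams.add(marked[i:i + n])
    ngrams.add(marked)  # the full-word token, e.g. "<where>"
    return ngrams

# The word's vector is then just the sum of its n-gram vectors, e.g.
#   v_word = sum(ngram_vectors[g] for g in char_ngrams("where"))
# so "liberty" and "liberation" share the vectors of their common n-grams.
```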
[link]
This paper's approach goes a step further away from the traditional word embedding approach - of training embeddings as the lookup-table first layer of an unsupervised monolingual network - and proposes a more holistic form of transfer learning that involves transferring not just the learned knowledge contained in a set of vectors, but a fully trained model.

Transfer learning is the general idea of using part or all of a network trained on one task to perform a different task. The most common kind of transfer learning is in the image domain, where models are first trained on the enormous ImageNet dataset, and then several of the lower layers of the network (where more local, small-pixel-range patterns are detected) are transferred, with their weights fixed in place, to a new network. The modeler then attaches a few more layers on top, connects it to a new target, and is able to learn that new target much more quickly, because the pre-training has gotten the parameters into a useful region of parameter space.

![](https://i.imgur.com/wjloHdi.png)

Within NLP, the most common form of transfer learning is initializing the lookup table of vectors used to convert discrete words into vectors (also known as an embedding) with embeddings pre-trained on huge unsupervised datasets, like GloVe, trained on all of English Wikipedia. Again, this makes your overall task easier to train, because you've already converted words from their un-useful binary representation (where the word "cat" is just as far from "Peru" as it is from "kitten") to a meaningful real-valued representation.

The approach suggested in this paper goes beyond simply learning the vector input representation of words. Instead, the authors suggest using, as word vectors, the sequence of encodings produced by an encoder-decoder bidirectional recurrent model. An encoder-decoder model means that one part of the network maps from the input sentence to an "encoded" representation of the sentence, and another part maps that encoded representation into the proper tokens in the target language. Historically, this encoding had been a single vector for the whole sentence, which tried to conceptually capture all of the words in one vector. More recently, a different approach has grown popular, where the RNN produces a number of encodings equal to the number of input words. Then, when the decoder is producing words in the target sentence, it uses something called "attention" to select a weighted combination of these encodings at each point in time. Under this scheme, the decoder might pull out information about verbs when its own hidden state suggests it needs a verb, and might pull out information about pronoun referents when its own hidden state asks for that.

The upshot of all of this is that you end up with a sequence of encoded vectors equal in length to your number of inputs. Because the RNN is bidirectional, meaning each encoding is a concatenation of the forward and backward RNN states, each of these encodings captures both information about its corresponding word and contextual information about the rest of the sentence. The proposal of the authors is to train the encoder-decoder outlined above and, once it is trained, lop off the decoder and use the encoded sequence of words as your representation of the input sequence of words.
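To make the attention step described above concrete, here is a minimal dot-product-attention sketch in PyTorch (the exact scoring function in a real translation model may differ; this is just to show the "weighted combination of encodings" idea):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    """One decoding step's weighted combination of the encoder's outputs.

    decoder_state:   (batch, dim)          current decoder hidden state
    encoder_outputs: (batch, src_len, dim) one encoding per source word
    Returns a context vector of shape (batch, dim).
    """
    # score each source encoding against the decoder's current state
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)                  # (batch, src_len)
    # context vector: attention-weighted sum of the encodings
    return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
```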
An important note in all this is that the recurrent encoder-decoder model was itself trained using a lookup table initialized with learned GloVe vectors, so in a sense the authors aren't substituting for the unsupervised embeddings so much as learning marginal information on top of them.

The authors went on to test this approach on a few problems: question answering, logical entailment, and sentiment classification. They compared the use of the RNN-encoded word vectors (which they call Context Vectors, or CoVe) with models initialized using just the fixed GloVe word vectors. One important note here is that, because each word vector is learned fully in context, the same word will have a different vector in each sentence it appears in. That's why you can't transfer one single vector per word, but instead have to transfer the recurrent model that can produce the vectors. All in all, the authors found that concatenating CoVe vectors to GloVe vectors, and using the concatenated version as input, produced sizable gains on the problems where it was tried.

That said, it's a pretty heavy lift to integrate someone else's learned weights into your own model, just in terms of getting all the code to play together nicely. I'm not sure this is a compelling enough result, a la ImageNet pretraining, for practitioners to want to go to the trouble of tacking a frozen RNN onto the bottom of all their models. If I ever get a chance, I'd be interested to play with the vectors you get out of this model, and look at how much variance you see in the vectors learned for a given word across different sentences. Do you see clusters that correspond to sense disambiguation (a la "state of mind" vs. "a rogue state")? And how does this contextual approach compare to the paper I reviewed yesterday, which also learns embeddings on a machine translation task, but does so by training a lookup table, rather than by using trained encodings? All in all, I enjoyed this paper: it was a simple idea, and I'm not sure whether it was a compelling one, but it did leave me with some interesting questions.
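To tie the pieces together, here is a rough sketch of what using CoVe alongside GloVe might look like downstream (hypothetical names and sizes; in the real setup the encoder is the specific pretrained translation encoder, not a freshly initialized one):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pretrained pieces: a GloVe lookup table
# and the bidirectional LSTM encoder taken from the translation model,
# with its decoder discarded and its weights kept frozen.
glove = nn.Embedding(400000, 300)
mt_encoder = nn.LSTM(300, 300, num_layers=2,
                     bidirectional=True, batch_first=True)

def glove_plus_cove(token_ids):
    """Per-token downstream input: [GloVe ; CoVe], shape (batch, seq, 900)."""
    with torch.no_grad():                    # transferred weights stay fixed
        g = glove(token_ids)                 # (batch, seq, 300)
        cove, _ = mt_encoder(g)              # (batch, seq, 600), bidirectional
    return torch.cat([g, cove], dim=-1)
```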
[link]
If you've been paying any attention to the world of machine learning in the last five years, you've likely seen everyone's favorite example of how Word2Vec word embeddings work: king - man + woman = queen. Given the ubiquity of Word2Vec and similar unsupervised embeddings, it can be easy to start thinking of them as the canonical definition of what a word embedding *is*. But that's a little oversimplified.

In the context of machine learning, an embedding layer simply means any layer structured in the form of a lookup table, where there is some pre-determined number of discrete objects (for example: a vocabulary of words), each of which corresponds to a d-dimensional vector in the lookup table (where d is the number of dimensions you, as the model designer, arbitrarily chose). These embeddings are initialized in some way and trained jointly with the rest of the network, using some kind of objective function. Unsupervised, monolingual word embeddings are typically learned by giving a model as input a sample of the words that come before and after a given target word in a sentence, and then asking it to predict the target word in the center. Conceptually, words that appear in very similar contexts will tend to end up with similar word vectors. This happens because scores are calculated using the dot product of the target word's vector with each of the context words' vectors, and if two target words are both to score highly in the same contexts, their dot products with those shared context vectors must both be high, which pushes the two words towards similar values. Over the last 3-4 years, unsupervised word vectors like these - which are widely available for download - have become a canonical starting point for NLP problems; this starting representation of words makes it easier to learn from smaller datasets, since knowledge about the relationships between words is transferred from the larger original word embedding training set, through the embeddings themselves.

This paper seeks to challenge the unitary dominance of monolingual embeddings, by examining the embeddings learned when the objective is, instead, machine translation, where, given a sentence in one language, you must produce it in another. Remember: an embedding is just a lookup table of vectors, and you can use it at the beginning of a machine translation model just as you can at the beginning of a monolingual model. In theory, if the embeddings learned by a machine translation model had desirable properties, they could also be widely shared and used for transfer learning, like Word2Vec embeddings often are.

When the authors of the paper dive into comparing the embeddings from these two approaches, they find some interesting results, such as: while the monolingual embeddings do a better job at analogy-based tests, machine translation embeddings do better at having similarity, within their vector space, map to true similarity of concept. Put another way, while monolingual systems push together words that appear in similar contexts (Teacher, Student, Principal), machine translation systems push words together when they map to the same or similar words in the target language (Teacher, Professor). The attached image shows some examples of this effect; the first three columns are all monolingual approaches, the final two are machine translation ones.
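Going back to the scoring step in the monolingual setup described above, here is a tiny hypothetical sketch of predicting a center word from its context (random, untrained vectors and arbitrary sizes; real systems typically use tricks like negative sampling instead of a full softmax):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 100
in_vecs = rng.normal(scale=0.01, size=(vocab_size, dim))   # context (input) vectors
out_vecs = rng.normal(scale=0.01, size=(vocab_size, dim))  # center (output) vectors

def center_word_probs(context_ids):
    """Probability of each candidate center word given surrounding word ids."""
    ctx = in_vecs[context_ids].mean(axis=0)   # average the context vectors
    scores = out_vecs @ ctx                   # dot product with every candidate
    scores -= scores.max()                    # numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

Two words that must both score highly against the same contexts end up with similar rows in `out_vecs`, which is the mechanism behind the "similar contexts, similar vectors" behavior.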
When it comes to analogies, machine translation embeddings perform less well at semantic analogies (Ottawa is to Canada as Paris is to France) but do better at syntactic analogies (fast is to fastest as heavier is to heaviest). While I don't totally understand why monolingual embeddings would be better at semantic analogies, it does make sense that the machine translation model would do a better job of encoding syntactic information, since such information is necessary to sensibly structure a sentence.
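For reference, the standard way these analogy tests are scored is vector arithmetic plus a nearest-neighbor lookup. A small sketch, where `E`, `index`, and `words` are hypothetical stand-ins for an embedding matrix, a word-to-row mapping, and the vocabulary list:

```python
import numpy as np

def analogy(a, b, c, E, index, words):
    """Answer 'a is to b as c is to ?' by nearest neighbor of b - a + c."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    query = E_norm[index[b]] - E_norm[index[a]] + E_norm[index[c]]
    sims = E_norm @ (query / np.linalg.norm(query))
    for w in (a, b, c):                      # exclude the query words themselves
        sims[index[w]] = -np.inf
    return words[int(np.argmax(sims))]

# e.g. analogy("fast", "fastest", "heavier", E, index, words)
# would ideally return "heaviest".
```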
[link]
This paper outlines (yet another) variation on a variational autoencoder (VAE), which is, at a high level, a model that seeks to 1) learn to construct realistic samples from the data distribution, and 2) capture meaningful information about the data within its latent space. The "latent space" refers to the information bottleneck that happens when you compress the input (typically, for these examples, an image) into a low-dimensional vector, before trying to predict that input back out again using that low-dimensional vector as a seed or conditional input. In a typical VAE, the objective function is composed of two terms: a reconstruction loss that captures how well your decoder distribution captures the x that was passed in as input, and a regularization loss that pushes the latent z code you create to be close to some prior distribution. Pushing your learned z codes to be closer to a prior is useful because you can then sample using that prior, and have those draws map to coherent regions of the space where you have trained in the past.

The Implicit Autoencoder (IAE) proposal changes both elements of this objective function, but since one of them - the modification of the regularization term - is actually drawn from another paper (Adversarial Autoencoders), I'm primarily going to focus on the changes to the reconstruction term. In a typical variational autoencoder, the model is incentivized to perform an exact reconstruction of the input x, using the latent code as input. Since this distance is calculated on a pixelwise basis, it puts a lot of pressure on the latent z code to learn ways of encoding detailed local information, rather than what we'd like it to be capturing, which is the broader, global structure of the data.

In the IAE approach, instead of incentivizing the input x to be high-probability under the distribution conditioned on the z that the encoder produced from x, we instead try to match the joint distributions of (x, z) and (reconstructed-x, z). This is done by taking these two pairs and running them through a GAN discriminator, which needs to tell which pair contains the reconstructed x and which the input x. Here, the decoder (the GAN's generator) takes as input a concatenation of z (the embedded code for this image) and n, a random noise vector. Since the generator is a deterministic mapping, this random vector n is what allows for sampling from the model, rather than just pulling out the same output every time.

Under this system, the model is under less pressure to recreate the details of the particular image that was input. Instead, it just needs to synchronize the use of z between the encoder and the decoder. To understand why this is true, imagine you had an MNIST set of 1s and 2s, and a single binary value for your z code. If you encode a 2, you can do so by setting that binary value to 0. Now, as long as your decoder realizes what the encoder was trying to do, and reconstructs a 2, the joint distributions will be similar between the encoder and decoder sides, and our new objective function will be happy. An important fact here is that this doesn't require the decoder to reconstruct the *exact* 2 that was passed in: as long as it matches, in distribution, the set of images that the encoder is choosing to map to the same z code, the decoder can do well.
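As a rough sketch of the reconstruction term as described above (my own minimal architecture, not the authors' exact one): a discriminator receives (x, z) pairs and tries to distinguish real inputs from reconstructions that share the same code z.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores an (x, z) pair as 'real input' vs. 'reconstruction'."""
    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))   # real/fake logit

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(D, x, x_recon, z):
    """Discriminator tries to label (x, z) as real and (x_recon, z) as fake."""
    real_logit = D(x, z)
    fake_logit = D(x_recon.detach(), z)
    return (bce(real_logit, torch.ones_like(real_logit)) +
            bce(fake_logit, torch.zeros_like(fake_logit)))
```

The encoder and decoder are then trained to fool this discriminator, which only requires the two joint distributions to match, not the reconstruction to be pixel-exact.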
A consequence of this approach is the ability to modulate how much information you actually want to pull out into your latent vector, and how much you want to leave to the random noise vector, which controls the randomness in the GAN and - to continue the example above - allows you to draw more than one distinct 2 off of the "2" latent code. If your z has limited dimensionality, it will represent high-level concepts (for example: MNIST digit identity), and the rest of the variability in images will be modeled through the native GAN framework. If you have a high-dimensional z, then more and more detail-level information will get encoded into the z vector, rather than just being left to the noise.
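A minimal sketch of the decoder/generator side of this (hypothetical dimensions): one latent code z, plus fresh noise n for each draw, so a single code can yield many distinct images.

```python
import torch
import torch.nn as nn

z_dim, n_dim, x_dim = 8, 64, 784   # hypothetical sizes (x_dim = 28*28 MNIST pixels)
decoder = nn.Sequential(
    nn.Linear(z_dim + n_dim, 256), nn.ReLU(),
    nn.Linear(256, x_dim), nn.Sigmoid())

def sample(z, num_draws=16):
    """Draw several images from one latent code z of shape (1, z_dim)."""
    n = torch.randn(num_draws, n_dim)               # fresh noise per draw
    z_rep = z.expand(num_draws, -1)                 # same code for every draw
    return decoder(torch.cat([z_rep, n], dim=-1))   # (num_draws, x_dim)
```

Shrinking `z_dim` pushes more of the image variability into `n`; growing it lets detail-level information migrate into the latent code instead.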