[link]
TLDR; The authors present an RNN-based variational autoencoder that learns a latent sentence representation while learning to decode. A linear layer that predicts the parameters of a Gaussian distribution is inserted between encoder and decoder. The loss combines the reconstruction objective with the KL divergence from the (Gaussian) prior, as in a "standard" VAE. The authors evaluate the model on Language Modeling and Imputation (Inserting Missing Words) tasks and also present a qualitative analysis of the latent space.

#### Key Points

- Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero.
- Training Trick 1: KL Cost Annealing. During training, gradually increase the weight on the KL term of the cost, annealing from a vanilla autoencoder to a VAE.
- Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation.
- Results on Language Modeling: Standard model (without cost annealing and word dropout) trails Vanilla RNNLM model, but not by much. KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously).
- Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNLM has nothing to condition on and relies on the unigram distribution for the first token.
- Qualitative: Can use higher word dropout to get more diverse sentences.
- Qualitative: Can walk the latent space and get grammatical and meaningful sentences.
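The two training tricks can be illustrated with a minimal sketch. It is not the authors' code: the sigmoid shape of the schedule, the `midpoint` and `steepness` values, the `<unk>` placeholder token, and all function names are assumptions made for illustration. Only the ideas (a KL weight that rises during training, and a word keep rate) come from the summary above.

```python
import math
import random


def kl_weight(step, midpoint=2500, steepness=0.002):
    """KL cost annealing: weight rises smoothly from ~0 to ~1 over training.

    The sigmoid shape and its constants are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def word_dropout(tokens, keep_rate, unk="<unk>", rng=random):
    """Keep each decoder input token with probability keep_rate, else mask it."""
    return [t if rng.random() < keep_rate else unk for t in tokens]


def vae_loss(recon_nll, mu, logvar, step):
    """Reconstruction term plus the annealed KL term."""
    return recon_nll + kl_weight(step) * gaussian_kl(mu, logvar)
```

With `keep_rate=0.0` every decoder input is masked, which corresponds to the inputless decoder setting mentioned in the results; early in training `kl_weight` is close to zero, so the model behaves like a plain autoencoder.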
[link]
Last year, a machine translation paper came out, with an unfortunately un-memorable name (the Transformer network) and a dramatic proposal for sequence modeling that eschewed both Recurrent NN and Convolutional NN structures, and, instead, used self-attention as its mechanism for “remembering” or aggregating information from across an input. Earlier this month, the same authors released an extension of that earlier paper, called Image Transformer, that applies the same attention-only approach to image generation, and also achieved state of the art performance there. The recent paper offers a framing of attention that I find valuable and compelling, and that I’ll try to explicate here. They describe attention as being a middle ground between the approaches of CNNs and RNNs, and one that, to use an over-abused cliche, gets the best of both worlds. CNNs are explicitly local: each convolutional filter only gathers information from the cells that fall in specific locations along some predefined grid. And, because convolutional filters have a unique parameter for every relative location in the grid they’re applied to, increasing the size of any given filter’s receptive field engenders a quadratic increase in parameters: to go from a 3x3 grid to a 4x4 one, you go from 9 parameters to 16. Convolutional networks typically increase their receptive field through the mechanism of adding additional layers, but there is still this fundamental limitation that for a given number of layers, CNNs will be fairly constrained in their receptive field. On the other side of the receptive field balance, we have RNNs. RNNs have an effectively unlimited receptive field, because they just apply one operation again and again: take in a new input, and decide how to incorporate that information into the hidden state. This gives us the theoretical ability to access things from the distant past, because they’re stored somewhere in the hidden state.
However, each element is only seen once and needs to be stored in the hidden state in a way that sort of “averages over” all of the ways it’s useful for various points in the decoding/translation process. (My mental image basically views RNN hidden state as packing for a long trip in a small suitcase: you have to be very clever about what you decide to pack, averaging over all the possible situations you might need to be prepared for. You can’t go back and pull different things into your suitcase as a function of the situation you face; you had to have chosen to add them at the time you encountered them). All in all, RNNs are tricky both because they have difficulty storing information efficiently over long time frames, and also because they can be monstrously slow to train, since you have to run through the full sequence to build up hidden state, and can’t chop it into localized bits the way you can with CNNs. So, between CNN - with its locally-specific hidden state - and RNN - with its large receptive field but difficulty in information storage - the self-attention approach interposes itself. Attention works off of three main objects: a query, and a set of keys, each of which is attached to a value. In general, all of these objects take the form of vectors. For a given query, you calculate its similarity with each key, and then normalize those similarities into a distribution (a set of weights, all of which sum to 1) that is used as the weights in calculating a weighted average of the values. As a motivating example, think of a model that is “unrolling” or decoding a translated sentence. In order to translate a sentence properly, the model needs to “remember” not only the conceptual content of the sentence, but what it has already generated. So, at each given point in the unrolling, the model can “query” the past and get a weighted distribution over what’s relevant to it in its current context.
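The query/key/value computation described above can be sketched in a few lines of NumPy. This is a minimal single-query sketch, not the paper's implementation: the choice of a scaled dot product as the similarity function and of softmax as the normalizer are assumptions here (the summary only says "similarity" and "normalize"), and the function name `attend` is made up.

```python
import numpy as np


def attend(query, keys, values):
    """Single-query attention: similarity -> weights summing to 1 -> weighted average.

    query:  shape (d,)
    keys:   shape (n, d), one key per stored item
    values: shape (n, d_v), one value per key
    """
    # Similarity of the query with every key (scaled dot product, an assumption).
    scores = keys @ query / np.sqrt(query.shape[-1])
    # Softmax turns the scores into a distribution over the n items.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # The output is the weighted average of the values.
    return weights @ values, weights


keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[10.0], [20.0]])
output, weights = attend(np.array([5.0, 0.0]), keys, values)
```

Because the query points along the first key, most of the weight lands on the first value, so `output` sits much closer to 10 than to 20.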
In the original Transformer, and also in the new one, the models use “multi-headed attention”, which I think is best compared to convolution filters: in the same way that you learn different convolution filters, each with different parameters, to pick up on different features, you learn different “heads” of the attention apparatus for the same purpose. To go back to our CNN - Attention - RNN schematic from earlier: Attention makes it a lot easier to query a large receptive field, since you don’t need an additional set of learned parameters for each location you expand to; you just use the same query weights and key weights you use for every other key and query. And, it allows you to contextually extract information from the past, depending on the needs you have right now. That said, it still becomes infeasible to calculate your attention distribution over an excessively long past, but that cost is in terms of computation, not additional parameters, and thus is a question of training time, rather than essential model complexity, the way additional parameters are. Jumping all the way back up the stack, to the actual most recent image paper, this question of how best to limit the receptive field is one of the more salient questions, since it still is the case that conducting attention over every prior pixel would be a very large number of calculations. The Image Transformer paper solves this in a slightly hacky way: by basically subdividing the image into chunks, and having each chunk operate over the same fixed memory region (rather than scrolling the memory region with each pixel shift) to take better advantage of the speed of batched big matrix multiplies.
Overall, this paper showed an advantage for the Image Transformer approach relative to PixelCNN autoregressive generation models, and cited the ability to use a larger receptive field during generation - without an explosion in the number of parameters - as the most salient reason why.
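The parameters-versus-computation argument in this summary can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration under simplifying assumptions: biases are ignored, the attention layer is reduced to its query, key, and value projection matrices, and all function names are hypothetical.

```python
def conv_params(kernel_size, channels_in=1, channels_out=1):
    """Weights in one 2D convolution: one per relative grid location per channel pair."""
    return kernel_size * kernel_size * channels_in * channels_out


def attention_params(d_model):
    """Weights in the query, key, and value projections (independent of context length)."""
    return 3 * d_model * d_model


def attention_score_ops(n_positions, d_model):
    """Multiply-adds needed to score every query position against every key position."""
    return n_positions * n_positions * d_model


# Growing a filter from 3x3 to 4x4 grows its parameters from 9 to 16.
small, large = conv_params(3), conv_params(4)

# Attending over 4x as many positions adds no parameters, but 16x the scoring work.
ops_short = attention_score_ops(256, 64)
ops_long = attention_score_ops(1024, 64)
```

This is why limiting the attended region, as the Image Transformer does with its fixed memory blocks, is a computational choice rather than a way of saving parameters.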
[link]
[Machine learning is a nuanced, complicated, intellectually serious topic...but sometimes it’s refreshing to let ourselves be a bit less serious, especially when it’s accompanied by clear, cogent writing on a topic. This paper is a particularly delightful example of good-natured silliness - from the dataset name HellaSwag to figures containing cartoons of BERT and ELMo representing language models - combined with interesting science.]

https://i.imgur.com/CoSeh51.png

This paper tackles the problem of natural language comprehension, which asks: okay, our models can generate plausible looking text, but do they actually exhibit what we would consider true understanding of language? One natural structure of task for this is to take questions or “contexts”, and, given a set of possible endings or completions, pick the correct one. Positive examples are relatively easy to come by: adjacent video captions and question/answer pairs from WikiHow are two datasets used in this paper. However, it’s more difficult to come up with *negative* examples. Even though our incorrect endings won’t be a meaningful continuation of the sentence, we want them to be “close enough” that we can feel comfortable treating a model’s ability to pick the correct answer as evidence of some meaningful kind of comprehension. As an obvious failure mode, if the alternative multiple choice options were all the same word repeated ten times, that would be recognizable as being not real language, and it would be easy for a model to select the answer with the distributional statistics of real language, but it wouldn’t prove much. Typically failure modes aren’t this egregious, but the overall intuition still holds, and will inform the rest of the paper: your ability to test comprehension on a given dataset is a function of how contextually-relevant and realistic your negative examples are.
Previous work (by many of the same authors as are on this paper) proposed a technique called Adversarial Filtering to try to solve this problem. In Adversarial Filtering, a generative language model is used to generate many possible endings conditioned on the input context, to be used as negative examples. Then, a discriminator is trained to predict the correct ending given the context. The generated samples that the discriminator had the highest confidence classifying as negative are deemed to be not challenging enough comparisons, and they’re thrown out and replaced with others from our pool of initially-generated samples. Eventually, once we’ve iterated through this process, we have a dataset with hopefully realistic negative examples. The negative examples are then given to humans to ensure none of them are conceptually meaningful actual endings to the sentence. The dataset released by the earlier paper, which used as its generator a relatively simple LSTM model, was called SWAG. However, the authors came to notice that the performance of new language models (most centrally BERT) on this dataset might not be quite what it appears: its success rate of 86% only goes down to 76% if you don’t give the classifier access to the input context, which means it can get 76% (off of a random baseline of 25%, with 4 options) simply by evaluating which endings are coherent as standalone bits of natural language, without actually having to understand or even see the context. Also, shuffling the words in the possible endings had a similarly small effect: the authors are able to get BERT to perform at 60% accuracy on the SWAG dataset with no context, and with shuffled words in the possible answers, meaning it’s purely selecting based on the distribution of words in the answer, rather than on the meaningfully-ordered sequence of words.
https://i.imgur.com/f6vqJWT.png

The authors’ overall conclusion with this is: this adversarial filtering method is only as good as the generator, and, more specifically, the training dynamic between the generator that produces candidate endings, and the discriminator that filters them. If the generator is too weak, the negative examples can be easily detected as fake by a stronger model, but if the generator is too strong, then the discriminator can’t get good enough to usefully contribute by weeding samples out. They demonstrate this by creating a new version of SWAG, which they call HellaSwag (for the expected acronym-optimization reasons), with a GPT generator rather than the simpler one used before: on this new dataset, all existing models get relatively poor results (30-40% performance). However, the authors’ overall point wasn’t “we’ve solved it, this new dataset is the end of the line,” but rather a call to be wary in the future, and generally aware that with benchmarks like these, especially with generated negative examples, it’s going to be a constantly moving target as generation systems get better.
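The Adversarial Filtering loop described in this summary can be sketched as follows. This is a toy illustration, not the authors' pipeline: the function name, the `threshold` and `max_rounds` parameters, and the idea of passing in a `train_discriminator` callable are all assumptions, and the real method operates on text with learned models rather than on plain numbers.

```python
def adversarial_filter(pool, n_neg, train_discriminator, threshold=0.5, max_rounds=100):
    """Iteratively swap out negatives that the discriminator finds too easy.

    pool: generated candidate endings (any representation).
    n_neg: how many negatives to keep in the final dataset.
    train_discriminator: hypothetical callable; given the current negatives it
        returns a function mapping an ending to P(ending is machine-generated).
    """
    active, reserve = list(pool[:n_neg]), list(pool[n_neg:])
    for _ in range(max_rounds):
        p_fake = train_discriminator(active)  # retrain on the current negatives
        easy = [e for e in active if p_fake(e) > threshold]
        if not easy or not reserve:
            break  # nothing left to improve, or no replacements left
        for ending in easy:
            if not reserve:
                break
            active.remove(ending)           # throw out the easy negative
            active.append(reserve.pop(0))   # replace it from the generated pool
    return active


# Toy usage: each "ending" is just a number standing for how obviously fake it is,
# and the stand-in discriminator reads that number off directly.
toy_pool = [0.9, 0.1, 0.8, 0.2, 0.95, 0.3, 0.7, 0.4]
hard_negatives = adversarial_filter(toy_pool, 3, lambda negatives: (lambda e: e))
```

The sketch also shows the dependence on the generator that the authors emphasize: if the pool contains no candidates the discriminator struggles with, the loop simply runs out of replacements.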
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks. As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove low-magnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training"), down to high levels of sparsity (5-40% of original weights), and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massively-pretrained BERT language model. They try to find out: 1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task 2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another.
If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications. An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks.

https://i.imgur.com/pAF08H3.png

One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta-0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training. A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream mask-finding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks.
This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though it is potentially useful if they do transfer to some other tasks, as has been shown within a set of image-prediction tasks).
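To make the prune-and-rewind procedure in this summary concrete, here is a minimal NumPy sketch of Iterative Magnitude Pruning with rewinding to the starting weights (theta-0). It is an illustration under assumptions, not the paper's code: the function names, the `prune_frac` parameter, and the `train` callable are hypothetical, and in the paper's setting theta-0 would be the pretrained BERT weights and `train` a full downstream fine-tuning run.

```python
import numpy as np


def magnitude_mask(weights, mask, prune_frac):
    """Zero out the lowest-magnitude fraction of the weights that are still active."""
    active = np.flatnonzero(mask)
    n_prune = int(len(active) * prune_frac)
    new_mask = mask.copy().ravel()
    if n_prune > 0:
        order = np.argsort(np.abs(weights.ravel()[active]))
        new_mask[active[order[:n_prune]]] = 0
    return new_mask.reshape(mask.shape)


def iterative_magnitude_pruning(theta0, train, rounds, prune_frac):
    """Repeat: train from the masked theta0, prune by magnitude, rewind to theta0.

    train: hypothetical callable (initial_weights, mask) -> trained weights.
    Returns the final binary mask (the candidate "winning ticket").
    """
    mask = np.ones_like(theta0)
    for _ in range(rounds):
        trained = train(theta0 * mask, mask)  # every round restarts from theta0
        mask = magnitude_mask(trained, mask, prune_frac)
    return mask


# Toy usage with a stand-in "training" function that leaves the weights unchanged.
theta0 = np.arange(1.0, 11.0)
ticket = iterative_magnitude_pruning(theta0, lambda w, m: w, rounds=2, prune_frac=0.2)
```

Each round requires a full call to `train`, which is why the summary notes that finding a ticket costs far more than a single training run.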
[link]
TLDR; The authors propose Neural Turing Machines (NTMs). An NTM consists of a memory bank and a controller network. The controller network (LSTM or MLP in this paper) controls read/write heads by focusing their attention softly, using a distribution over all memory addresses. It can learn the parameters for two addressing mechanisms: content-based addressing ("find similar items") and location-based addressing. NTMs can be trained end-to-end using gradient descent. The authors evaluate NTMs on program generation tasks and compare their performance against that of LSTMs. Tasks include copying, recall, prediction, and sorting binary vectors. While both LSTMs and NTMs seem to perform well on training data, only NTMs are able to generalize to longer sequences.

#### Key Observations

- Controller network tried with LSTM or MLP. Which one works better is task-dependent, but LSTM "cache" can be a bottleneck.
- Controller size, number of read/write heads, and memory size are hyperparameters.
- Monitoring the memory addressing shows that the NTM actually learns meaningful programs.
- The number of LSTM parameters grows quadratically with hidden unit size due to the recurrent connections; this is not so for NTMs, leading to models with fewer parameters.
- Example problems are very small, typically using sequences of 8-bit vectors.

#### Notes/Questions

- At what length do NTMs stop working? Would've liked to see where results get significantly worse.
- Can we automatically transform fuzzy NTM programs into deterministic ones?
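The soft, content-based read described in the TLDR can be sketched as below. This is a minimal illustration rather than the paper's full addressing pipeline: it covers only the content-based step (no location-based shifting or writing), the summary does not spell out the similarity function, so cosine similarity sharpened by a strength parameter `beta` and normalized with a softmax is stated here as an assumption, and the function names are made up.

```python
import numpy as np


def content_addressing(memory, key, beta=1.0):
    """Distribution over memory rows based on similarity to a key vector.

    memory: shape (n_slots, width); key: shape (width,);
    beta: sharpness of the focus (higher means closer to a hard lookup).
    """
    eps = 1e-8  # avoids division by zero for all-zero rows
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    similarity = memory @ key / norms
    logits = beta * similarity
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


def soft_read(memory, weights):
    """Read vector: the weighted average of all memory rows."""
    return weights @ memory


memory = np.eye(3)
weights = content_addressing(memory, np.array([0.0, 1.0, 0.0]), beta=50.0)
read_vector = soft_read(memory, weights)
```

Because every step is differentiable, gradients can flow through the read into the controller, which is what makes the end-to-end training mentioned above possible.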