Welcome to ShortScience.org! |
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks. As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove low-magnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training"), down to high levels of sparsity (5-40% of original weights, and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massively-pretrained BERT language model. They try to find out: 1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task 2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another. If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks. https://i.imgur.com/pAF08H3.png One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta-0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training. A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream mask-finding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though is potentially useful is they do transfer to some other tasks, as has been shown within a set of image-prediction tasks). |
[link]
TLDR; The authors present an RNN-based variational autoencoder that can learn a latent sentence representation while learning to decode. A linear layer that predicts the parameter of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence with the prior (Gaussian) - similar to the "standard" VAE does. The authors evaluate the model on Language Modeling and Impution (Inserting Missing Words) tasks and also present a qualitative analysis of the latent space. #### Key Points - Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero. - Training Trick 1: KL Cost Annealing. During training, increase weight on the KL term of the cost to anneal from vanilla to VAE. - Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation. - Results on Language Modeling: Standard model (without cost annealing and word dropout) trails Vanilla RNNLM model, but not by much. KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously) - Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNML has nothing to condition on and relies on unigram distribution for the first token. - Qualitative: Can use higher word dropout to get more diverse sentences - Qualitative: Can walk the latent space and get grammatical and meaningful sentences. |
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. |
[link]
* They describe a variation of convolutions that have a differently structured receptive field. * They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling). ### How * One can image the input into a convolutional layer as a 3d-grid. Each cell is a "pixel" generated by a filter. * Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other. * In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding.) * Normal convolutions are simply 1-dilated convolutions (skipping 0 cells). * One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing. * Increasing the dilation factor by 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field will still be part in the computation of at least one convolution. * They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using resdiual connections would have been easier.) ![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field") *Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.* ### Results * They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept). * They then used the network to segment images. * Their results were significantly better than previous methods. * They also added another network with more dilated convolutions in front of the VGG one, again improving the results. ![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Multi-Scale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance") *Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.* |
[link]
Last year, a machine translation paper came out, with an unfortunately un-memorable name (the Transformer network) and a dramatic proposal for sequence modeling that eschewed both Recurrent NNN and Convolutional NN structures, and, instead, used self-attention as its mechanism for “remembering” or aggregating information from across an input. Earlier this month, the same authors released an extension of that earlier paper, called Image Transformer, that applies the same attention-only approach to image generation, and also achieved state of the art performance there. The recent paper offers a framing of attention that I find valuable and compelling, and that I’ll try to explicate here. They describe attention as being a middle ground between the approaches of CNNs and RNNs, and one that, to use an over-abused cliche, gets the best of both worlds. CNNs are explicitly local: each convolutional filter only gathers information from the cells that fall in specific locations along some predefined grid. And, because convolutional filters have a unique parameter for every relative location in the grid they’re applied to, increasing the size of any given filter’s receptive field would engender an exponential increase in parameters: to go from a 3x3 grid to a 4x4 one, you go from 9 parameters to 16. Convolutional networks typically increase their receptive field through the mechanism of adding additional layers, but there is still this fundamental limitation that for a given number of layers, CNNs will be fairly constrained in their receptive field. On the other side of the receptive field balance, we have RNNs. RNNs have an effectively unlimited receptive field, because they just apply one operation again and again: take in a new input, and decide to incorporate that information into the hidden state. This gives us the theoretical ability to access things from the distant past, because they’re stored somewhere in the hidden state. However, each element is only seen once and needs to be stored in the hidden state in a way that sort of “averages over” all of the ways it’s useful for various points in the decoding/translation process. (My mental image basically views RNN hidden state as packing for a long trip in a small suitcase: you have to be very clever about what you decide to pack, averaging over all the possible situations you might need to be prepared for. You can’t go back and pull different things into your suitcase as a function of the situation you face; you had to have chosen to add them at the time you encountered them). All in all, RNNs are tricky both because they have difficulty storing information efficiently over long time frames, and also because they can be monstrously slow to train, since you have to run through the full sequence to built up hidden state, and can’t chop it into localized bits the way you can with CNNs. So, between CNN - with its locally-specific hidden state - and RNN - with its large receptive field but difficulty in information storage - the self-attention approach interposes itself. Attention works off of three main objects: a query, and a set of keys, each one is attached to a value. In general, all of these objects take the form of vectors. For a given query, you calculate its similarity with each key, and then normalize those into a distribution (a set of weights, all of which sum to 1) that is used as the weights in calculating a weighted average of the values. As a motivating example, think of a model that is “unrolling” or decoding a translated sentence. In order to translate a sentence properly, the model needs to “remember” not only the conceptual content of the sentence, but what it has already generated. So, at each given point in the unrolling, the model can “query” the past and get a weighted distribution over what’s relevant to it in its current context. In the original Transformer, and also in the new one, the models use “multi-headed attention”, which I think is best compared to convolution filters: in the same way that you learn different convolution filters, each with different parameters, to pick up on different features, you learn different “heads” of the attention apparatus for the same purpose. To go back to our CNN - Attention - RNN schematic from earlier: Attention makes it a lot easier to query a large receptive field, since you don’t need an additional set of learned parameters for each location you expand to; you just use the same query weights and key weights you use for every other key and query. And, it allows you to contextually extract information from the past, depending on the needs you have right now. That said, it’s still the case that it becomes infeasible to make the length of the past you calculate your attention distribution over excessively long, but that cost is in terms of computation, not additional parameters, and thus is a question of training time, rather than essential model complexity, the way additional parameters is. Jumping all the way back up the stack, to the actual most recent image paper, this question of how best to limit the receptive field is one of the more salient questions, since it still is the case that conducting attention over every prior pixel would be a very large number of calculations. The Image Transformer paper solves this in a slightly hacky way: by basically subdividing the image into chunks, and having each chunk operate over the same fixed memory region (rather than scrolling the memory region with each pixel shift) to take better advantage of the speed of batched big matrix multiplies. Overall, this paper showed an advantage for the Image Transformer approach relevative to PixelCNN autoregressive generation models, and cited the ability for a larger receptive field during generation - without explosion in number of parameters - as the most salient reason why. |