ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

The Lottery Ticket Hypothesis for Pre-trained BERT Networks
Chen, Tianlong and Frankle, Jonathan and Chang, Shiyu and Liu, Sijia and Zhang, Yang and Wang, Zhangyang and Carbin, Michael
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks. 

As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove low-magnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training"), down to high levels of sparsity (5-40% of original weights, and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. 

In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massively-pretrained BERT language model. They try to find out: 

1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task

2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another. If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications 

An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks. 

https://i.imgur.com/pAF08H3.png

One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta-0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training. 

A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream mask-finding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. 

An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though is potentially useful is they do transfer to some other tasks, as has been shown within a set of image-prediction tasks).

arxiv.org
scholar.google.com

Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
Purushwalkam, Senthil and Gupta, Abhinav
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

In the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within self-supervised learning. The basic idea of contrastive learning is that, instead of needing human-generated labels to generate a supervised task, you instead assume that there exists some automated operation you can perform to a data element to generate another data element that, while different, should be considered still fundamentally the same, or at least more strongly related, and that you can contrast these related pairs against pairs constructed with the rest of the dataset, with which any given frame would not by default have this assumed relationship of sameness or quasi-similarity. 

One fairly central way that "different but still effectively similar" has been historically defined - at least within the realm of image-based models - is through the use of data augmentations: image transformations such as cropping, color jitter, or Gaussian blur, which are used on an image to create the counterpart in its related pair. Fundamentally, what we're doing when we define these particular augmentations is saying: these transformations don't cause a meaningful change in what the image is, and so we want the representations we get with and without the transformations to be close to one another (or at least to contain enough information to predict one another). Another way you can say this is that we're defining properties of the image that we want our representation to be invariant to. 

The authors of this paper make the point that, when aggressive cropping is part of your toolkit of augmentations, the crops of the image can actually contain meaningfully different content than the uncropped image. If you remove a few pixels around the edges of an image, it still fundamentally contains the same things. However, if you zoom in dramatically, you may get crops that contain different objects. From an image classification standpoint, you would expect that coding in an invariance to cropping in our representations would, in some cases, also mean coding in an invariance to object type, which would presumably be detrimental to the task of classifying objects. To explain the extent of the success that aggressive-cropping methods have had so far, they argue that ImageNet has the particular property that its images are curated to be primarily and centrally containing a single object at a time, such that, even if you zoom in, you're getting a part of the central object, rather than another object entirely. They argue that this dataset bias might explain why you haven't seen as much of this object-invariance be a problem in earlier augmentation-based contrastive work. To try to test this, they train different contrastive (MoCo v2) models on the MSCOCO dataset, which consists of pictures of rooms, and thus no longer has the property of being centrally of one object. They tried one setting where they performed contrastive loss on the images as a whole, and another where the input to the augmentation pipeline were images from the same dataset, but pre-cropped to only contain one image at a time. This was meant, as far as I can tell, to isolate the effect of "object-centric vs not" while holding other dataset factors constant. They then test how well these different models do on an object-centric classification task (Pascal Cropped Boxes). They find that the contrastive model that trains cropped versions of the dataset gets about 3.5 points higher mean accuracy (71.9 vs 75.3) compared to the contrastive loss done on the multi-object versions of images. 

They also explicitly try to measure different forms of invariance, through a scheme where they binarize the elements of the representation vector, and calculate what proportion of them fire on average with and without a given set of transformations. They find that the main form of invariances that contrastive learning does well at is invariance to occlusion (part of the image not being visible), where both contrastive methods give ~84 percent co-firing, and supervised pre-training only gets about 80.9. However, on other important measures of invariance - viewpoint, illumination direction, and instance (that is, specific instance within a class) - contrastive representations perform notably worse than supervised pretraining. 

https://i.imgur.com/7Ghbv5A.png

To try to solve these two problems, they propose a method that learns from video, and that uses temporally separated frames (which are then augmented) as pairs. They call this Frame Temporal Invariance, and argue, reasonably, that by pushing the representations of adjacent frames which track a (presumably consistent, or at least slowly-evolving) scene closer together, you should expect better invariance to viewpoint change and image deformation, since those things naturally happen when an object is moving through the world. They also suggest using an off-the-shelf object bounding box model to find particular objects, and track them throughout the video, and to use contrastive learning specifically on the bounding boxes that the algorithm thinks track a consistent object. 

https://i.imgur.com/2GfCTog.png

Overall, my take on this paper is that the analysis they do - of different kinds of invariances contrastive vs supervised loss does well on, and of the extent to which contrastive loss results might be biased by datasets - is quite interesting and a valuable contribution to our understanding of these very currently-hypey algorithms. However, I'm a bit less impressed by the novelty of their proposed solution. Temporal forms of contrastive learning have been around before - in reinforcement learning, and even in the original Contrastive Predictive Coding paper, where the related pairs were related by dint of temporal closeness. So, while using it in video is certainly a good idea, it doesn't really feel strongly novel to me. I also feel a little confused by their choice of using an off-the-shelf object detection model as a pre-requisite for a self-supervised task, since my impression was that a central goal of self-supervision was building techniques that could scale to situations where it was infeasible to get large amounts of labels, and any method that relies on a pre-existing trained object bounding box model is pretty inherently limited in that regard.

arxiv.org
scholar.google.com

Big Bird: Transformers for Longer Sequences
Zaheer, Manzil and Guruganesh, Guru and Dubey, Avinava and Ainslie, Joshua and Alberti, Chris and Ontañón, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and Ahmed, Amr
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: transfer-learning, pre-trained, transformer, bert

[link] Summary by CodyWild 5 years ago

Transformers - powered by self-attention mechanisms - have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints - most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially like in a RNN - a major downside is their memory (and, secondarily, computational) requirements. The baseline form of self-attention works by having every token attend to every other token, where "attend" here means that a query from each token A will take an inner product with each other token -A, and then be elementwise-multiplied with the values of every other token -A. This implies a O(N^2) memory and computation requirement, where N is your sequence length. 

So, the question this paper asks is: how do you get the benefits, or most of the benefits, of a full-attention network, while reducing the number of other tokens each token attends to. 

The authors' solution - Big Bird - has three components. First, they approach the problem of approximating the global graph as a graph theory problem, where each token attending to every other is "fully connected," and the goal is to try to sparsify the graph in a way that keeps shortest path between any two nodes low. They use the fact that in an Erdos-Renyi graph - where very edge is simply chosen to be on or off with some fixed probability - the shortest path is known to be logN. In the context of aggregating information about a sequence, a short path between nodes means that the number of iterations, or layers, that it will take for information about any given node A to be part of the "receptive field" (so to speak) of node B, will be correspondingly short. Based on this, they propose having the foundation of their sparsified attention mechanism be simply a random graph, where each node attends to each other with probability k/N, where k is a tunable hyperparameter representing how many nodes each other node attends to on average. 

To supplement, the authors further note that sequence tasks of interest - particularly language - are very local in their information structure, and, while it's important to understand the global context of the full sequence, tokens close to a given token are most likely to be useful in constructing a representation of it. Given this, they propose supplementing their random-graph attention with a block diagonal attention, where each token attends to w/2 tokens prior to and subsequent to itself. (Where, again, w is a tunable hyperparameter) 

However, the authors find that these components aren't enough, and so they add a final component: having some small set of tokens that attend to all tokens, and are attended to by all tokens. This allows them to theoretically prove that Big Bird can approximate full sequences, and is a universal Turing machine, both of which are true for full Transformers. I didn't follow the details of the proof, but, intuitively, my reading of this is that having a small number of these global tokens basically acts as a shortcut way for information to get between tokens in the sequence - if information is globally valuable, it can be "written" to one of these global aggregator nodes, and then all tokens will be able to "read" it from there. 



The authors do note that while their sparse model approximates the full transformer well in many settings, there are some problems - like needing to find the token in the sequence that a given token is farthest from in vector space - that a full attention mechanism could solve easily (since it directly calculates all pairwise comparisons) but that a sparse attention mechanism would require many layers to calculate. 


Empirically, Big Bird ETC (a version which adds on additional tokens for the global aggregators, rather than making existing tokens serve thhttps://i.imgur.com/ks86OgJ.pnge purpose) performs the best on a big language model training objective, has comparable performance to existing models on questionhttps://i.imgur.com/x0BdamC.png answering, and pretty dramatic performance improvements in document summarization. It makes sense for summarization to be a place where this model in particular shines, because it's explicitly designed to be able to integrate information from very large contexts (albeit in a randomly sampled way), where full-attention architectures must, for reasons of memory limitation, do some variant of a sliding window approach.

arxiv.org
scholar.google.com

What is being transferred in transfer learning?
Neyshabur, Behnam and Sedghi, Hanie and Zhang, Chiyuan
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

This is an interesting - and refreshing - paper, in that, instead of trying to go all-in on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process. 

- An initial experiment tries to understand how much of the performance of fine-tuned models can be explained by (higher-level, and thus larger-scale) features, and how much is driven by lower level (and thus smaller-scale) image statistics.  To start with, the authors compare the transfer performance from ImageNet onto three different datasets - clip art, sketches, and real images. As expected, transfer performance is highest with real datasets, which are the most similar to training domain. However, there still *is* positive transfer in terms of final performance across all domains, as well as benefit in optimization speed.
- To try to further tease out the difference between the transfer benefits of high and low-level features, the authors run an experiment where blocks of pixels are shuffled around within the image on downstream tasks .  The larger the size of the blocks being shuffled, the more that large-scale features of the image are preserved. As predicted, accuracy drops dramatically when pixel block size is small, for both randomly initialized and pretrained models. In addition, the relative value added by pretraining drops, for all datasets except quickdraw (the dataset of sketches). This suggests that in most datasets, the value brought by fine-tuning was mostly concentrated in large-scale features. One interesting tangent of this experiment was the examination of optimization speed (in the form of mean training accuracy over initial epochs). Even at block sizes too small for pretraining to offer a benefit to final accuracy, it did still contribute to faster training. (See transparent bars in right-hand plot below)
https://i.imgur.com/Y8sO1da.png

- On a somewhat different front, the authors look into how similar pretrained + finetuned models are to one another, compared to models trained on the same dataset from random initializations. First, they look at a measure of feature similarity, and find that the features learned by two pretrained networks are more similar to each other than a pretrained network is to a randomly initalized network, and also more than two randomly initialized networks are to one another. Randomly initialized networks are closest to one another in their final-layer features, but this is still a multiple of 4 or 5 less than the similarity between the pretrained networks
- Looking at things from the perspective of optimization, the paper measures how much performance drops when you linearly interpolate between different solutions found by both randomly initialized and pretrained networks. For randomly initialized networks, interpolation requires traversing a region where test accuracy drops to 0%. However, for pretrained networks, this isn't the case, with test accuracy staying high throughout. This suggests that pretraining gets networks into a basin of the loss landscape, and that future training stays within that basin. There were also some experiments on module criticality that I believe were in a similar vein to these, but which I didn't fully follow
- Finally, the paper looks at the relationship between accuracy on the original pretraining task and both accuracy and optimization speed on the downstream task. They find that higher original-task accuracy moves in the same direction as higher downstream-task accuracy, though this is less true when the downstream task is less related (as with quickdraw). Perhaps more interestingly, they find that the benefits of transfer to optimization speed happen and plateau quite early in training. Clip Art and Real transfer tasks are much more similar in the optimization speed benefits they get form ImageNet training, where on the accuracy front, the real did dramatically better.
https://i.imgur.com/jBCJcLc.png

While there's a lot to dig into in these results overall, the things I think are most interesting are the reinforcing of the idea that even very random and noisy pretraining can be beneficial to optimization speed (this seems reminiscent of another paper I read from this year's NeurIPS, examining why pretraining on random labels can help downstream training), and the observation that pretraining deposits weights in a low-loss bucket, from which they can learn more efficiently (though, perhaps, if the task is too divergent from the pretraining task, this difficulty in leaving the basin becomes a disadvantage). This feels consistent with some work in the Lottery Ticket Hypothesis, which has recently suggested that, after a short duration of training, you can rewind a network to a checkpoint saved after that duration, and be successfully able to train to low loss again.

arxiv.org
scholar.google.com

Contrastive Learning with Adversarial Examples
Ho, Chih-Hui and Vasconcelos, Nuno
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

Contrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using a strong augmentation, tries to learn a way to perturb images that adversarially increases contrastive loss. As with adversarial training in normal supervised setting, the thinking here is that examples which push loss up the highest are the hardest and thus most informative for the network to learn from 

While the concept of this paper made some sense, I found the notation and the explanation of mechanics a bit confusing, particularly when it came to choice to frame a contrastive loss as a cross-entropy loss, with the "weights" of the dot product in the the cross-entropy loss being, in fact, the projection by the learned encoder of various of the examples in the batch. 

https://i.imgur.com/iQXPeXk.png

This notion of the learned representations being "weights" is just odd and counter-intuitive, and the process of trying to wrap my mind around it isn't one I totally succeeded at. I think the point of using this frame is because it provides an easy analogue to the Fast Gradient Sign Method of normal supervised learning adversarial examples, even though it has the weird effect that, as the authors say "your weights vary by batch...rather than being consistent across training," 

Notational weirdness aside, my understanding is that the method of this paper: 

- Runs a forward pass of normal contrastive loss (framed as cross-entropy loss) which takes augmentations p and q and runs both forward through an encoder.
- Calculates a delta to apply to each input image in the q that will increase the loss most, taken over all the images in the p set
    - I think the delta is per-image in q, and is just aggregated over all images in p, but I'm not fully confident of this, as a result of notational confusion. It could also be one delta applied for all all images in q.
- Calculate the loss that results when you run forward the adversarially generated q against the normal p
- Train a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part

https://i.imgur.com/UWtJpVx.png

The authors show a small but relatively consistent improvement to performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial.