Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1547 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Debiased Contrastive Learning

Ching-Yao Chuang and Joshua Robinson and Lin Yen-Chen and Antonio Torralba and Stefanie Jegelka

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2021/06/19 (just now)

**Abstract:** A prominent technique for self-supervised representation learning has been to
contrast semantically similar and dissimilar pairs of samples. Without access
to labels, dissimilar (negative) points are typically taken to be randomly
sampled datapoints, implicitly accepting that these points may, in reality,
actually have the same label. Perhaps unsurprisingly, we observe that sampling
negative examples from truly different labels improves performance, in a
synthetic setting where labels are available. Motivated by this observation, we
develop a debiased contrastive objective that corrects for the sampling of
same-label datapoints, even without knowledge of the true labels. Empirically,
the proposed objective consistently outperforms the state-of-the-art for
representation learning in vision, language, and reinforcement learning
benchmarks. Theoretically, we establish generalization bounds for the
downstream classification task.
more
less

Ching-Yao Chuang and Joshua Robinson and Lin Yen-Chen and Antonio Torralba and Stefanie Jegelka

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
The premise of contrastive loss is that we want to push together the representations of objects that are similar, and push dissimilar representations farther apart. However, in an unlabeled setting, we don't generally have class labels to tell which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us, so we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that anchor that it can be considered a "positive" or target similarity-wise. Some of these transformations are data augmentations performed on a frame, or choosing temporally adjacent frames in a video sequence (which, since the real world evolves smoothly, are assumed to be similar). Anyhow, all of this is well and good, except for the fact that, especially in an image classification setting like CIFAR or ImageNet, sampling randomly from the other images in a given batch doesn't give you a set of things that are entirely "negatives" in terms of being dissimilar to the anchor image. It is true that most of the objects you get by sampling randomly are negatives (especially in a many-class setting), but some of them will be other samples from the same class. By treating all of those as negatives, we penalize the model for having representations of them that are chose to our anchor representation, even though, for many downstream tasks, we'd probably prefer elements of the same class to have more similar representations. However, the whole premise of the unsupervised setting is that we don't have class labels, so we don't know, for a given sample from the batch (of things that aren't specifically transformations of the anchor) whether it's an actual negative or secretly a positive (i.e. of the same class). And, that's true, but this paper argues that, even if you can't identify which specific elements in a batch are secret positives, you can try to account for them in aggregate, if you have some reasonably good estimate of the overall class probabilities, which will tell you how many positives you expect to find in a given batch in expectation. Given that, they reformulate the loss to be "debiased". They do this by taking the expectation over negatives in the denominator, which is actually a sample over the full p(x), not just the distribution over negatives, and trying to make it a better estimate of the actual distribution over negatives. https://i.imgur.com/URN4RBF.png This they accomplish by writing out the full p(x) as a weighted combination of the distributions over positive and negative (which here is "every class that doesn't match the anchor"), as shown above, and noticing that you can represent the negative part of the distribution by taking the full distribution, and subtracting out the positive distribution (which we have an estimator for by construction, with our transformations), weighted by the prior over how frequent the positives are in our overall distribution. https://i.imgur.com/5IgGIhu.png This leads to a change of estimating the similarity between the anchor and positives (which we already have in the numerator, but which we can also calculate with more augmentations/positive samples to get a better estimate) and doing a (weighted) subtraction of that from the similarity over negative examples. Intuitively, we keep in the part where we penalize similarity with negatives (by adding magnitude to the denominator), but reduce that penalty in accordance with how much we think that "similarity with negatives" is actually similarity with other positives in the batch, which we actually would like to keep around. https://i.imgur.com/kUGoemA.png https://i.imgur.com/5Gitdi7.png In terms of experimental results, my read is that this is most useful on problems - like CIFAR10 and STL10 - that don't have many classes (they each, per their names, have 10). The results there are meaningfully stronger than for the 200-class ImageNet. And, that makes pretty good intuitive sense, since you would expect the scale of the "secret positives in our random sample of images" bias problem to be a lot more acute in a setting where we've got a 1 in 10 chance of sampling a same-class image, compared to a 1-in-200 chance. |

GROVER: Self-supervised Message Passing Transformer on Large-scale Molecular Data

Rong, Yu and Bian, Yatao and Xu, Tingyang and Xie, Weiyang and Wei, Ying and Huang, Wenbing and Huang, Junzhou

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

Rong, Yu and Bian, Yatao and Xu, Tingyang and Xie, Weiyang and Wei, Ying and Huang, Wenbing and Huang, Junzhou

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

[link]
Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a new targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of their model itself, which combines both a message-passing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is: https://i.imgur.com/JY4vRdd.png - There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes - In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" in information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each nodes end up with three different vectors for these three things. - Multi-headed attention is then performed over these node representations, in the normal way, where all keys and queries are dot-product-ed together, and put into a softmax to calculate a weighted average over the values - We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into a MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated on a full-graph level to predict graph-level properties https://i.imgur.com/NNl6v4Y.png The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global. The local task constructs labels based on local contextual properties of a given atom - for example, the atom here has one double-bonded Nitrogen and one single-bonded Oxygen in its local environment - and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different modifs or functional groups in the molecule, and encodes those into an 85-long one-hot vector that is being predicted on a graph level. https://i.imgur.com/jzbYchA.png With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large amounts on all tasks, even relative to exist semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) also performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly. |

Weakly-Supervised Reinforcement Learning for Controllable Behavior

Lee, Lisa and Eysenbach, Benjamin and Salakhutdinov, Ruslan and Gu, Shixiang and Finn, Chelsea

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

Lee, Lisa and Eysenbach, Benjamin and Salakhutdinov, Ruslan and Gu, Shixiang and Finn, Chelsea

arXiv e-Print archive - 2020 via Local Bibsonomy

Keywords: dblp

[link]
I tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand: 1. The goal of the method is to learn disentangled representations, and, specifically, to learn representations that correspond to factors of variation in the environment that are selected by humans. That means, we ask humans whether a given image is higher or lower on a particular relevant axis, and aggregate those rankings into a vector, where a particular index of the vector corresponds to a particular factor. Given a small amount of supervision, the hope is to learn an encoder that takes in an image, and produces a Z code that encodes where the image is on that particular axis 2. With those disentangled representations, the authors hope they can learn goal-conditioned policies, where the distance between the current image's representation and the goal image's representation can serve as a reward. In particular, they're trying to show that their weakly supervised disentangled representation performs better as a metric space to do goal-conditioning distance calculations in, relative to other learned spaces 3. The approach uses a GAN-based design, where a generator generates the images that correspond with a given z1 and z2, and the discriminator tries to tell the difference between the two real images, paired with their supervision vector, and two generated images, with their fake supervision vector [Here is the relevant equation, along with some notation-explaining text] https://i.imgur.com/XNbxK6i.png The thing I'm confused by is the actual mechanism for why (3) gets you disentangled representations. To my understanding, the thing the generator should be trying to do is generate images whose relationship to one another is governed by the relationship between z1 and z2; if z is really capturing your factors of variation, the two images should differ in places and in ways governed by where those z values are different. Based on this, I'd expect the fake supervision vector here to be some kind of binarized element-wise difference between the two (randomly sampled) vectors, z1 and z2. But the authors claim that the fake supervision vector that the generator is trying to replicate is just the zero vector. That seems like it would just result in the generator trying to generate images that don't differ on any axes, with two different z vectors as input. |

Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

Yang, Zitong and Yu, Yaodong and You, Chong and Steinhardt, Jacob and Ma, Yi

- 2020 via Local Bibsonomy

Keywords: readings, generalization, variance, bias, analysis

Yang, Zitong and Yu, Yaodong and You, Chong and Steinhardt, Jacob and Ma, Yi

- 2020 via Local Bibsonomy

Keywords: readings, generalization, variance, bias, analysis

[link]
This is a really cool paper that posits a relatively simple explanation for the strange phenomena known as double descent - both the fact of seeing it in the first place, and the difficulty in robustly causing it to appear. In the classical wisdom of statistics, increasing model complexity too far will lead to increase in variance, and thus an increase in test error (or "test risk" or "empirical risk"), leading to a U-shaped test error curve as a function of model complexity. Double descent is the name given to the observation that, in modern neural networks, we tend to not see this phenomenon, and, in fact, sometimes see test error first increasing but then descend again below its initial minimum. Test error going up, and then back down again: double descent. However, this phenomenon proved to be a bit elusive: often in order to see it, you had to add artificial noise to your labels. This paper provides a cohesive theory for both the existence of double descent, and the fact that it sometimes can only be elicited with increased label noise. They empirically estimate the bias and variance components of test error for a range of neural nets on a range of datasets, and show that when they directly estimate bias and variance this way, they see bias decreasing (or, at least, non-increasing) monotonically with model complexity, as expected. But, they also see variance, rather than strictly increasing with model complexity, exhibiting unimodal behavior, where it first increases, and then decreases, as a function of model complexity. Taking a step back, bias is here understood as the component of your test error that comes from the difference between your expected learned estimator and the true underlying function. Variance is the squared difference between the expected learned estimator (that is, the one you get if you average over different splits in the data), and the estimator learned on each split of the data. The actual estimator you get is a function of both your average estimator, and the particular estimator you draw in the distribution around that average, which is defined by the variance. The authors empirically measure bias and variance by conducting k different N-way splits of their datasets, and averaging these k*N estimates to get an average or expected estimator. Given that, they can (as shown below), take the variance to be the squared difference between the k*N individual estimators and the average. Since test error is composed of bias + variance, we can then simply calculate bias as whatever remains of test error when variance has been accounted for. https://i.imgur.com/VPzujaZ.png This provides an elegant explanation for the different relationships we see between complexity and test error. In regimes where the decrease in bias from additional complexity is much larger than the increase in variance - which they argue is the case in modern deep networks - we don't see double descent, because the "bump" due to the variance peak is overshadowed by the continuing decrease in variance. However, in regimes where the overall scale of variance (at all levels of complexity) is higher, we see the increasing variance overwhelming the decreasing bias, and test error increases (before, ultimately, going down again, after the variance peaks). This explains why double descent has previously appeared preferentially in cases of injected label noise: more label noise means higher irreducible variability in the model learned from different sets of data, which makes the scale of the variance peak more pronounced compared to the bias drop. In addition to their empirical work, the authors also analytically analyze a two-layer linear neural network, and show that you would theoretically expect a peaked variance shape in that setting. In a certain sense, this just pushes the problem down the road, since the paper doesn't explain why, in any kind of conceptual or statistical sense, we would expect variance to be unimodal in this way. (They do offer a conjecture, but it was not the main thrust of the paper, and I didn't fully follow it). However, it does offer conceptual clarity into a previously somewhat more murky empirical phenomenon, and hopefully will let us focus on understanding why variance behaves in this way. |

Critic Regularized Regression

Ziyu Wang and Alexander Novikov and Konrad Zolna and Jost Tobias Springenberg and Scott Reed and Bobak Shahriari and Noah Siegel and Josh Merel and Caglar Gulcehre and Nicolas Heess and Nando de Freitas

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2021/06/19 (just now)

**Abstract:** Offline reinforcement learning (RL), also known as batch RL, offers the
prospect of policy optimization from large pre-recorded datasets without online
environment interaction. It addresses challenges with regard to the cost of
data collection and safety, both of which are particularly pertinent to
real-world applications of RL. Unfortunately, most off-policy algorithms
perform poorly when learning from a fixed dataset. In this paper, we propose a
novel offline RL algorithm to learn policies from data using a form of
critic-regularized regression (CRR). We find that CRR performs surprisingly
well and scales to tasks with high-dimensional state and action spaces --
outperforming several state-of-the-art offline RL algorithms by a significant
margin on a wide range of benchmark tasks.
more
less

Ziyu Wang and Alexander Novikov and Konrad Zolna and Jost Tobias Springenberg and Scott Reed and Bobak Shahriari and Noah Siegel and Josh Merel and Caglar Gulcehre and Nicolas Heess and Nando de Freitas

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
Offline reinforcement learning is potentially high-value thing for the machine learning community learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and be able to learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions. There are two broad strategies for learning a policy from precollected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe-or-low-reward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior. Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the precollected transitions. This can also have some downsides, mostly in the direction of overconfidence. Q value Temporal Difference learning works by using the current reward added to the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression to the mean effects mean that the highest value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with their noisily inaccurate estimate, and as a result get more data for that action, and get an estimate that is less noisy in future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, but just risky. The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two possible approaches. The method learns a Q function as normal, using temporal difference methods. The distinction in this method comes from how to get a policy, given a learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does. https://i.imgur.com/jGmhYdd.png This basically amounts to a form of a behavioral cloning loss (with the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy. Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration. The `binary`version is an indicator function for the Advantage of an action (the Q value for that action at that state minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings which do a more "soft" upweighting or downweighting of transitions based on advantage, rather than the sharp binary of whether an actions advantage is above 1. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, since the trade-off of fewer but higher-quality samples to learn from works there. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights), which introduces some potential mimicking of lower-quality transitions, but at the trade of a larger effective dataset size to learn from. |

About