[link]
The Slot Attention module maps from a set of N input feature vectors to a set of K output vectors that we refer to as slots. Each vector in this output set can, for example, describe an object or an entity in the input. https://i.imgur.com/81nh508.png 
[link]
The goal of this paper is to learn a model that embeds 2D keypoints(the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding where nearby points in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparentlydistinct representations actually map to the same 3D pose. To do this, the basic approach used by the authors (with has a few variants), is  Take a dataset of 3D poses, and corresponding 2D projections  Define a notion of "matching" 3D poses, based on a parameter kappa, which designates the maximum average perjoint distance at which two 3D poses can be considered the same  Construct triplets composed of an anchor pose, a "positive" pose (a different 2D pose with a matching 3D pose), and a "negative" pose (some other 2D pose sampled from the dataset using a strategy that explicitly seeks out hard negative examples)  Calculate a triplet loss, that pushes positive examples closer together, and pulls negative examples farther apart. This is specifically done by defining a probabilistic representation of p(match  z1, z2), or, the probability of a match in 3D space given the embeddings of the two 2D poses. This is parametrized using a sigmoid with trainable parameters, as shown below https://i.imgur.com/yFCCVuA.png  They they calculate a distance kernel as the negative log of that probability, and calculate the basic triplet loss, which tries to maximize the diff between the the distance between negative examples, and the distance between positive examples.  They also add an additional loss further incentivizing the match probability to be higher on the positive pair (in addition to just pushing the positive and negative pair further apart)  The final loss is a Gaussian prior loss, incentivizing the learned embeddings z to be in the shape of a Gaussian https://i.imgur.com/SxvcvJG.png This represents the central shape of the method. Some additional ablations include:  Camera Augmentation: Creational additional triplets by taking existing 3D poses and generating artificial pairs of 2D poses at different camera views  Temporal Pose Embedding  Embedding multiple temporally connected pose, rather than just a single one  Keypoint Dropout  To simulate situations where some keypoints are occluded, the authors tried training with some keypoints dropped out, either keypoints selected at random, or selected jointly and nonindependently based on a model of which keypoints are likely to be occluded together The authors found that their method was generally quite a bit stronger that prior approaches for the task of querying similar 3D poses given a 2D pose input, including some alternate methods that do direct 3D estimation. 
[link]
This was an amusinglytimed paper for me to read, because just yesterday I was listening to a different paper summary where the presenter offhandedly mentioned the idea of compressing the sequence length in Transformers through subsequent layers (the way a ConvNet does pooling to a smaller spatial dimension in the course of learning), and it made me wonder why I hadn't heard much about that as an approach. And, lo, I came on this paper in my list the next day, which does exactly that. As a refresher, Transformers work by starting out with one embedding per token in the first layer, and, on each subsequent layer, they create new representations for each token by calculating an attention mechanism over all tokens in the prior layer. This means you have one representation per token for the full sequence length, and for the full depth of the network. In addition, you typically have a CLS token that isn't connected to any particular word, but is the designated place where sequencelevel representations aggregate and are used for downstream tasks. This paper notices that many applications of trained transformers care primarily about that aggregated representation, rather than precise perword representations. For cases where that's true, you're spending a lot of computation power on continually calculating the SeqLength^2 attention maps in later layers, when they might not be bringing you that much value in your downstream transfer tasks. A central reason why you do generally need pertoken representations in training Transformers, though, even if your downstream tasks need them less, is that the canonical Masked Language Model and newer ELECTRA loss functions require tokenlevel predictions for the specific tokens being masked. To accommodate this need, the authors of this paper structure their "Funnel" Transformer as more of an hourglass. It turns it into basically a VAEesque Encoder/Decoder structure, where attention downsampling layers reduce the length of the internal representation down, and then a "decoder" amplifies it back to the full sequence size, so you have one representation per token for training purposes (more on the exact way this works in a bit). The nifty thing here is that, for downstream tasks, you can chop off the decoder, and be left with a network with comparatively less computation cost per layer of depth. https://i.imgur.com/WC0VQXi.png The exact mechanisms of downsampling and upsampling in this paper are quite clever. To perform downsampling at a given attention layer, you take a sequence of representations h, and downsampling it to h' of half the size by meanpooling adjacent tokens. However, in the attention calculation, you only use h' for the queries, and use the full sequence h for the keys and values. Essentially, this means that you have an attention layer where the downsampled representations attend to and pull information from the full scope of the (nondownsampled) representations of the layer below. This means you have a much more flexible downsampling operation, since the attention mechanism can choose to pull information into the downsampled representation, rather than it being calculated automatically by a pooling operation The paper inflates the bottlenecked representations back up to the full sequence length by first tiling the downsampled representation (for example, if you had downsampled from 20 to 5, you would tile the first representation 4 times, then the second representation 4 times, and so on until you hit 20). That tiled representation, which can roughly be though to represent a large region of the sequence, is then added, ResNetstyle, to the fulllength sequence of representations that came out of the first attention layer, essentially combining shallow tokenlevel representations with deep regionlevel representations. This aggregated representation is then used for tokenlevel loss prediction The authors benchmark again common baseline models, using deeper models with fewer tokens per layer, and find that they can reach similar or higher levels of performance with fewer FLOPs on text aggregation tasks. They fall short of fullsequence models for tasks that require strong pertoken representations, which fits with my expectation. 
[link]
This summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first! The idea of a NERF is learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), from a given viewing angle (theta, phi). With such a function, you can generate predictions of what images taken from certain angles would look like by sampling along a viewing ray, and integrating the combined hue and opacity into an aggregated view. This prediction can then be compared to a true image taken from that direction, and gradients passed backwards into the prediction model. An important assumption of this model is that the scene being photographed is static; specifically, that every point in space is always inhabited by the same part of the 3D object, regardless of what angle it's viewed from. This is a reasonable assumption for photos of inanimate objects, or of humans in highly controlled lab settings, but it is often not true for humans when you, say, ask them to take a selfie video of themselves. Even if they're trying to keep roughly still, there will be slight shifts in the location and position of their head between frames, and the authors of this paper show that this can lead to strange artifacts if you naively try to train a NERF from the images (including a particularly odd one where it hallucinates tiny copies of the image in the air surrounding the face). https://i.imgur.com/IUVh6uM.png The fix proposed by this paper is to apply a learnable deformation field to each image, where the notion is to deform each view into being in one canonical position (fixed per network, since, again, one network corresponds to a single scene). This means that, along with learning the parameters of the NERF itself, you're also learning what deformation to apply to each training image to get it into this canonical position. This is done by parametrizing the deformation in a particular way, and then having that deformation be conditioned by a latent vector that's trained similar to how you'd train an embedding (one learned vector per image example). The parametrization of the deformation is honestly a little bit over my head, given my lack of grounding in 3D modeling, but my general sense is that it applies some constraints and regularization to ensure that the learned deformations are realistic, insofar as humans are mostly rigid (one patch of skin on my forehead generally doesn't move except in concordance with the rest of my forehead), but with some possibility for elasticity (skin can stretch if I, say, smile). The authors also include an annealing scheme whereby, early in training, the model focuses on learning course (largescale) deformations, and later in training, it's allowed to also learn weights for more precise deformations. This is to hopefully match macroscale shifts before adding the noise of precise changes. This addition of a learned deformation is most of the contribution of this method: with it applied, they show that they're able to learn realistic NERFs from selfies, which they term "NERFIES". They mention a few pieces of concurrent work that try to solve the same problem of nonstatic human subjects in different ways, but I haven't had a chance to read those, so I can't really comment on how NERFIES stacks up to alternate approaches, but it appears to be as least one empirically convincing solution to the problem it's aiming at. 
[link]
This summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first! At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image, where the neural network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full range of the image. To do this effectively, they use sinusoidal activation functions, which let them match not just the output of the neural network f(x, y) to the true image, but also the first and second derivatives of the neural network to the first and second derivatives of the true image, which provides a more robust training signal. NERFs builds on this idea, but instead of trying to learn a continuous representation of an image (mapping from 2D position to 3D RGB), they try to learn a continuous representation of a scene, mapping from position (specified with with three coordinates) and viewing direction (specified with two angles) to the RGB color at a given point in a 3D grid (or "voxel", analogous to "pixel"), as well as the *density* or opacity of that point. Why is this interesting? Because if you have a NERF that has learned a good underlying function of a particular 3D scene, you can theoretically take samples of that scene from arbitrary angles, even angles not seen during training. It essentially functions as a usable 3D model of a scene, but one that, because it's stored in the weights of a neural network, and specified in a continuous function, is far smaller than actually storing all the values of all the voxels in a 3D scene (the authors give an example of 5MB vs 15GB for a NERF vs a full 3D model). To get some intuition for this, consider that if you wanted to store the curve represented by a particular thirddegree polynomial function between 0 and 10,000 it would be much more spaceefficient to simply store the 3 coefficients of that polynomial, and be able to sample from it at your desired granularity at will, rather than storing many empirically sampled points from along the curve. https://i.imgur.com/0c33YqV.png How is a NERF model learned?  The (x, y, z) position of each point is encoded as a combination of sinewave, Fourierstyle curves of increasingly higher frequency. This is similar to the positional encoding used by transformers. In practical turns, this means a location in space will be represented as a vector calculated as [some point on a lowfrequency curve, some point on a slightly higher frequency curve..., some point on the highestfrequency curve]. This doesn't contain any more *information* than the (x, y, z) representation, but it does empirically seem to help training when you separate the frequencies like this  You take a dataset of images for which viewing direction is known, and simulate sending a ray through the scene in that direction, hitting some line (or possibly tube?) of voxels on the way. You calculate the perceived color at that point, which is an integral of the color information and density/opacity returned by your model, for each point. Intuitively, if you have a high opacity weight early on, that part of the object blocks any voxels further in the ray, whereas if the opacity weight is lower, more of the voxels behind will contribute to the overall effective color perceived. You then compare these predicted perceived colors to the actual colors captured by the 2D image, and train on the prediction error.  (One note on sampling: the paper proposes a hierarchical sampling scheme to help with sampling efficiently along the ray, first taking a course sample, and then adding additional samples in regions of high predicted density)  At the end of training, you have a network that hopefully captures the information from *that particular scene*. A notable downside of this approach is that it's quite slow for any use cases that require training on many scenes, since each individual scene network takes about 12 days of GPU time to train 
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticketesque properties with respect to their role as initializations for downstream transfer tasks. As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove lowmagnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training"), down to high levels of sparsity (540% of original weights, and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massivelypretrained BERT language model. They try to find out: 1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task 2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another. If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks. https://i.imgur.com/pAF08H3.png One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training. A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream maskfinding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though is potentially useful is they do transfer to some other tasks, as has been shown within a set of imageprediction tasks). 
[link]
This a nice, compact paper testing a straightforward idea: can we use the contrastive loss structure so widespread in unsupervised learning as a framework for generating and training against adversarial examples? In the context of the adversarial examples literature, adversarial training  or, training against examples that were adversarially generated so as to minimize the loss of the model you're training  is the primary strategy used to train robust models (robust here in the sense of not being susceptible to said adversarial attacks). Typically, these attacks are generated with the use of class labels, since they are meant to attack supervised classifiers that assign a class label to an image. Therefore, the goal of the adversarial attack is to push down the probability of the correct class label (either in favor of a specific alternate class, or just in favor of any class that isn't the true one). However, labels are hard and expensive, so, one wonders: in the same way that you can learn representations from unlabeled data, can you also make those representations (otherwise referred to as "embeddings") robust in a similarly labelfree way. This paper tests an approach that does so in a quite simple way, by just generating adversarial examples against your contrastive loss target. This works by: 1) Taking an image, and generating two augmentations (or transformations) of it. This is part of the standard contrastive pipeline 2) Applying an adversarial perturbation to one of those transformations, where the perturbation is optimized to maximize the contrastive loss (ability to differentiate an augmented version of the same image from augmented versions of other images) 3) Training on that adversarial sample to generate more robustness https://i.imgur.com/ttF6k1A.png And this simple approach appears to work quite well! They find that, in defending against supervised adversarial attacks, it performs comparably to supervised adversarial training, and that it has the added benefits of (1) slightly higher accuracy on clean examples (in general, robustness is known to decrease your cleansample accuracy), and (2) better robustness against attack types other than the attack type used for the adversarial training. It also achieves better transfer performance (that is, adversarially training on one dataset, and then evaluating robustness on another) than a supervised method, when evaluated on both CIFAR10 → CIFAR100 and CIFAR100 → CIFAR10. This does make pretty good sense to me, since instancelevel stability does seem like it's getting at a more fundamental set of invariances that to would transfer better to different distributions of classes. 
[link]
The premise of contrastive loss is that we want to push together the representations of objects that are similar, and push dissimilar representations farther apart. However, in an unlabeled setting, we don't generally have class labels to tell which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us, so we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that anchor that it can be considered a "positive" or target similaritywise. Some of these transformations are data augmentations performed on a frame, or choosing temporally adjacent frames in a video sequence (which, since the real world evolves smoothly, are assumed to be similar). Anyhow, all of this is well and good, except for the fact that, especially in an image classification setting like CIFAR or ImageNet, sampling randomly from the other images in a given batch doesn't give you a set of things that are entirely "negatives" in terms of being dissimilar to the anchor image. It is true that most of the objects you get by sampling randomly are negatives (especially in a manyclass setting), but some of them will be other samples from the same class. By treating all of those as negatives, we penalize the model for having representations of them that are chose to our anchor representation, even though, for many downstream tasks, we'd probably prefer elements of the same class to have more similar representations. However, the whole premise of the unsupervised setting is that we don't have class labels, so we don't know, for a given sample from the batch (of things that aren't specifically transformations of the anchor) whether it's an actual negative or secretly a positive (i.e. of the same class). And, that's true, but this paper argues that, even if you can't identify which specific elements in a batch are secret positives, you can try to account for them in aggregate, if you have some reasonably good estimate of the overall class probabilities, which will tell you how many positives you expect to find in a given batch in expectation. Given that, they reformulate the loss to be "debiased". They do this by taking the expectation over negatives in the denominator, which is actually a sample over the full p(x), not just the distribution over negatives, and trying to make it a better estimate of the actual distribution over negatives. https://i.imgur.com/URN4RBF.png This they accomplish by writing out the full p(x) as a weighted combination of the distributions over positive and negative (which here is "every class that doesn't match the anchor"), as shown above, and noticing that you can represent the negative part of the distribution by taking the full distribution, and subtracting out the positive distribution (which we have an estimator for by construction, with our transformations), weighted by the prior over how frequent the positives are in our overall distribution. https://i.imgur.com/5IgGIhu.png This leads to a change of estimating the similarity between the anchor and positives (which we already have in the numerator, but which we can also calculate with more augmentations/positive samples to get a better estimate) and doing a (weighted) subtraction of that from the similarity over negative examples. Intuitively, we keep in the part where we penalize similarity with negatives (by adding magnitude to the denominator), but reduce that penalty in accordance with how much we think that "similarity with negatives" is actually similarity with other positives in the batch, which we actually would like to keep around. https://i.imgur.com/kUGoemA.png https://i.imgur.com/5Gitdi7.png In terms of experimental results, my read is that this is most useful on problems  like CIFAR10 and STL10  that don't have many classes (they each, per their names, have 10). The results there are meaningfully stronger than for the 200class ImageNet. And, that makes pretty good intuitive sense, since you would expect the scale of the "secret positives in our random sample of images" bias problem to be a lot more acute in a setting where we've got a 1 in 10 chance of sampling a sameclass image, compared to a 1in200 chance. 
[link]
Largescale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) largescale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning  it's costly, and narrowly defined in terms of a certain task  but that problem is even more exacerbated when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties  like effectiveness against a new targetable drug receptor  that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. There are two main conceptual parts to this paper and its method  named GROVER, in truetoMLform tortured acronym style. The first is the actual architecture of their model itself, which combines both a messagepassing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is: https://i.imgur.com/JY4vRdd.png  There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes  In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weighttransforming them, then aggregating again) until each node has a representation that has "absorbed" in information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each nodes end up with three different vectors for these three things.  Multiheaded attention is then performed over these node representations, in the normal way, where all keys and queries are dotproducted together, and put into a softmax to calculate a weighted average over the values  We now have nodelevel representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into a MLP layer and Layer Norm before finally outputting a nodebased node and edge representation. This is then joined by an edgebased node and edge representation from the parallel stack. These are aggregated on a fullgraph level to predict graphlevel properties https://i.imgur.com/NNl6v4Y.png The other component of the GROVER model is the way this architecture is actually trained  without explicit supervised labels. The authors use two tasks  one local, and one global. The local task constructs labels based on local contextual properties of a given atom  for example, the atom here has one doublebonded Nitrogen and one singlebonded Oxygen in its local environment  and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different modifs or functional groups in the molecule, and encodes those into an 85long onehot vector that is being predicted on a graph level. https://i.imgur.com/jzbYchA.png With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are finetuned on small amounts of labeled data. The results are pretty impressive  it achieves new SOTA performance by relatively large amounts on all tasks, even relative to exist semisupervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graphaggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and also show that their architecture without pretraining (just used directly in downstream tasks) also performs worse. One thing I wish they'd directly ablated was the valueadd of the local (also referred to as "contextual") and global semisupervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly. 
[link]
I tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand: 1. The goal of the method is to learn disentangled representations, and, specifically, to learn representations that correspond to factors of variation in the environment that are selected by humans. That means, we ask humans whether a given image is higher or lower on a particular relevant axis, and aggregate those rankings into a vector, where a particular index of the vector corresponds to a particular factor. Given a small amount of supervision, the hope is to learn an encoder that takes in an image, and produces a Z code that encodes where the image is on that particular axis 2. With those disentangled representations, the authors hope they can learn goalconditioned policies, where the distance between the current image's representation and the goal image's representation can serve as a reward. In particular, they're trying to show that their weakly supervised disentangled representation performs better as a metric space to do goalconditioning distance calculations in, relative to other learned spaces 3. The approach uses a GANbased design, where a generator generates the images that correspond with a given z1 and z2, and the discriminator tries to tell the difference between the two real images, paired with their supervision vector, and two generated images, with their fake supervision vector [Here is the relevant equation, along with some notationexplaining text] https://i.imgur.com/XNbxK6i.png The thing I'm confused by is the actual mechanism for why (3) gets you disentangled representations. To my understanding, the thing the generator should be trying to do is generate images whose relationship to one another is governed by the relationship between z1 and z2; if z is really capturing your factors of variation, the two images should differ in places and in ways governed by where those z values are different. Based on this, I'd expect the fake supervision vector here to be some kind of binarized elementwise difference between the two (randomly sampled) vectors, z1 and z2. But the authors claim that the fake supervision vector that the generator is trying to replicate is just the zero vector. That seems like it would just result in the generator trying to generate images that don't differ on any axes, with two different z vectors as input. 
[link]
Offline reinforcement learning is potentially highvalue thing for the machine learning community learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an onpolicy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data  collected from a human demonstrator, or from some other algorithm  and be able to learn a policy from those precollected transitions, without being able to query the environment further by taking arbitrary actions. There are two broad strategies for learning a policy from precollected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and has the advantage of being somewhat more conservative (in terms of not experimenting with possiblyunsafeorlowreward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior. Another approach is to learn a Q function  estimating the value of a given action in a given state  using the reward data from the precollected transitions. This can also have some downsides, mostly in the direction of overconfidence. Q value Temporal Difference learning works by using the current reward added to the max Q value over possible next actions as the target for the currentstate Q estimate. This tends to lead to overestimates, because regression to the mean effects mean that the highest value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In onpolicy Q learning, this is less problematic, because the agent can take the action associated with their noisily inaccurate estimate, and as a result get more data for that action, and get an estimate that is less noisy in future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking highuncertainty actions isn't a valuable source of new information, but just risky. The approach suggested by this DeepMind paper  Critic Regularized Regression, or CRR  is essentially a synthesis of these two possible approaches. The method learns a Q function as normal, using temporal difference methods. The distinction in this method comes from how to get a policy, given a learned Q function. Rather than simply taking the action your Q estimate says is highestvalue at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a standin for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does. https://i.imgur.com/jGmhYdd.png This basically amounts to a form of a behavioral cloning loss (with the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy. Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly highquality. Different f() functions lead to different kinds of filtration. The `binary`version is an indicator function for the Advantage of an action (the Q value for that action at that state minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings which do a more "soft" upweighting or downweighting of transitions based on advantage, rather than the sharp binary of whether an actions advantage is above 1. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other offpolicy baselines  either more pure behavioral cloning, or more pure RL  and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, since the tradeoff of fewer but higherquality samples to learn from works there. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights), which introduces some potential mimicking of lowerquality transitions, but at the trade of a larger effective dataset size to learn from. 
[link]
This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat newtome literature around graphbased Neural Architecture Search  the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsupervised pretraining is conceptually simple  a variational autoencoder  even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or, GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but have the general sense that GIN architectures are simpler; calculating a weighted sum of current central node features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector. First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing endtoend supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to highperforming networks being naturally concentrated within the space. https://i.imgur.com/SavZnce.png Looking into their representation as an initialization point, they demonstrate that their initializations do lead to lower longterm regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training. https://i.imgur.com/4DG7lZd.png The authors argue that this is because the representations learned by the supervised methods are "biased towards weightfree operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weightfree architectures would overperform early in the training of each individual network, but my understanding was that the dataset used here is a labeled static dataset of architectures and accuracies, so the withintrainingrun dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit that comes from using these pretrained representations, even if I don't follow the intuition behind it fully. 
[link]
This is a nice little empirical paper that does some investigation into which features get learned during the course of neural network training. To look at this, it uses a notion of "decodability", defined as the accuracy to which you can train a linear model to predict a given conceptual feature on top of the activations/learned features at a particular layer. This idea captures the amount of information about a conceptual feature that can be extracted from a given set of activations. They work with two synthetic datasets. 1. Trifeature: Generated images with a color, shape, and texture, which can be engineered to be either entirely uncorrelated or correlated with each other to varying degrees. 2. Navon: Generated images that are letters on the level of shape, and are also composed of letters on the level of texture The first thing the authors investigate is: to what extent are the different properties of these images decodable from their representations, and how does that change during training? In general, decodability is highest in lower layers, and lowest in higher layers, which makes sense from the perspective of the Information Processing Inequality, since all the information is present in the pixels, and can only be lost in the course of training, not gained. They find that decodability of color is high, even in the later layers untrained networks, and that the decodability of texture and shape, while much less high, is still above chance. When the network is trained to predict one of the three features attached to an image, you see the decodability of that feature go up (as expected), but you also see the decodability of the other features go down, suggesting that training doesn't just involve amplifying predictive features, but also suppressing unpredictive ones. This effect is strongest in the Trifeature case when training for shape or color; when training for texture, the dampening effect on color is strong, but on shape is less pronounced. https://i.imgur.com/o45KHOM.png The authors also performed some experiments on cases where features are engineered to be correlated to various degrees, to see which of the predictive features the network will represent more strongly. In the case where two features are perfectly correlated (and thus both perfectly predict the label), the network will focus decoding power on whichever feature had highest decodability in the untrained network, and, interestingly, will reduce decodability of the other feature (not just have it be lower than the chosen feature, but decrease it in the course of training), even though it is equally as predictive. https://i.imgur.com/NFx0h8b.png Similarly, the network will choose the "easy" feature (the one more easily decodable at the beginning of training) even if there's another feature that is slightly *more* predictive available. This seems quite consistent with the results of another recent paper, Shah et al, on the Pitfalls of Simplicity Bias in neural networks. The overall message of both of these experiments is that networks generally 'put all their eggs in one basket,' so to speak, rather than splitting representational power across multiple features. There were a few other experiments in the paper, and I'd recommend reading it in full  it's quite well written  but I think those convey most of the key insights for me. 
[link]
Transformers  powered by selfattention mechanisms  have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints  most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially like in a RNN  a major downside is their memory (and, secondarily, computational) requirements. The baseline form of selfattention works by having every token attend to every other token, where "attend" here means that a query from each token A will take an inner product with each other token A, and then be elementwisemultiplied with the values of every other token A. This implies a O(N^2) memory and computation requirement, where N is your sequence length. So, the question this paper asks is: how do you get the benefits, or most of the benefits, of a fullattention network, while reducing the number of other tokens each token attends to. The authors' solution  Big Bird  has three components. First, they approach the problem of approximating the global graph as a graph theory problem, where each token attending to every other is "fully connected," and the goal is to try to sparsify the graph in a way that keeps shortest path between any two nodes low. They use the fact that in an ErdosRenyi graph  where very edge is simply chosen to be on or off with some fixed probability  the shortest path is known to be logN. In the context of aggregating information about a sequence, a short path between nodes means that the number of iterations, or layers, that it will take for information about any given node A to be part of the "receptive field" (so to speak) of node B, will be correspondingly short. Based on this, they propose having the foundation of their sparsified attention mechanism be simply a random graph, where each node attends to each other with probability k/N, where k is a tunable hyperparameter representing how many nodes each other node attends to on average. To supplement, the authors further note that sequence tasks of interest  particularly language  are very local in their information structure, and, while it's important to understand the global context of the full sequence, tokens close to a given token are most likely to be useful in constructing a representation of it. Given this, they propose supplementing their randomgraph attention with a block diagonal attention, where each token attends to w/2 tokens prior to and subsequent to itself. (Where, again, w is a tunable hyperparameter) However, the authors find that these components aren't enough, and so they add a final component: having some small set of tokens that attend to all tokens, and are attended to by all tokens. This allows them to theoretically prove that Big Bird can approximate full sequences, and is a universal Turing machine, both of which are true for full Transformers. I didn't follow the details of the proof, but, intuitively, my reading of this is that having a small number of these global tokens basically acts as a shortcut way for information to get between tokens in the sequence  if information is globally valuable, it can be "written" to one of these global aggregator nodes, and then all tokens will be able to "read" it from there. The authors do note that while their sparse model approximates the full transformer well in many settings, there are some problems  like needing to find the token in the sequence that a given token is farthest from in vector space  that a full attention mechanism could solve easily (since it directly calculates all pairwise comparisons) but that a sparse attention mechanism would require many layers to calculate. Empirically, Big Bird ETC (a version which adds on additional tokens for the global aggregators, rather than making existing tokens serve thhttps://i.imgur.com/ks86OgJ.pnge purpose) performs the best on a big language model training objective, has comparable performance to existing models on questionhttps://i.imgur.com/x0BdamC.png answering, and pretty dramatic performance improvements in document summarization. It makes sense for summarization to be a place where this model in particular shines, because it's explicitly designed to be able to integrate information from very large contexts (albeit in a randomly sampled way), where fullattention architectures must, for reasons of memory limitation, do some variant of a sliding window approach. 
[link]
This is an interesting paper that makes a fairly radical claim, and I haven't fully decided whether what they find is an interestingbutrare corner case, or a more fundamental weakness in the design of neural nets. The claim is: neural nets prefer learning simple features, even if there exist complex features that are equally or more predictive, and even if that means learning a classifier with a smaller margin  where margin means "the distance between the decision boundary and the nearestby data". A largemargin classifier is preferable in machine learning because the larger the margin, the larger the perturbation that would have to be made  by an adversary, or just by the random nature of the test set  to trigger misclassification. https://i.imgur.com/PJ6QB6h.png This paper defines simplicity and complexity in a few ways. In their simulated datasets, a feature is simpler when the decision boundary along that axis requires fewer piecewise linear segments to separate datapoints. (In the example above, note that having multiple alternating blocks still allows for linear separation, but with a higher piecewise linear requirement). In their datasets that concatenate MNIST and CIFAR images, the MNIST component represents the simple feature. The authors then test which models use which features by training a model with access to all of the features  simple and complex  and then testing examples where one set of features is sampled in alignment with the label, and one set of features is sampled randomly. If the features being sampled randomly are being used by the model, perturbing them like this should decrease the test performance of the model. For the simulated datasets, a fully connected network was used; for the MNIST/CIFAR concatenation, a variety of different image classification convolutional architectures were tried. The paper finds that neural networks will prefer to use the simpler feature to the complete exclusion of more complex features, even if the complex feature is slightly more predictive (can achieve 100 vs 95% separation). The authors go on to argue that what they call this Extreme Simplicity Bias, or Extreme SB, might actually explain some of the observed pathologies in neural nets, like relying on spurious features or being subject to adversarial perturbations. They claim that spurious features  like background color or texture  will tend to be simpler, and that their theory explains networks' reliance on them. Additionally, relying completely or predominantly on single features means that a perturbation along just that feature can substantially hurt performance, as opposed to a network using multiple features, all of which must be perturbed to hurt performance an equivalent amount. As I mentioned earlier, I feel like I'd need more evidence before I was strongly convinced by the claims made in this paper, but they are interestingly provocative. On a broader level, I think a lot of the difficulties in articulating why we expect simpler features to perform well come from an imprecision in thinking in language around the idea  we think of complex features as inherently brittle and highdimensional, but this paper makes me wonder how well our existing definitions of simplicity actually match those intuitions. 
[link]
Generalization is, if not the central, then at least one of the central mysteries of deep learning. We are somehow able to able to train highcapacity, overparametrized models, that empirically have the capacity to fit to random data  meaning that they have the capacity to memorize the labeled data we give them  and which yet still manage to train functions that generalize to test data. People have tried to come up with generalization bounds  that is, bounds on the expected test error of a model class  but that have all been vacuous, which here means that their upper bound is so far above the actual observed test set error that it's meaningless for the purpose of predicting which changes will enhance or detract from generalization. This paper builds on  and somewhat critiques  an earlier paper, Jiang et al, which takes the approach of assessing generalization bounds empirically. The central approach taken by both papers is to compare the empirical test error of two networks that are identical except for one axis which is varied, and test whether the ranking of the predicted generalization errors for the two networks, resulting from a particular analytical bound, aligns with the ranking of actual, empirical test error. Said succinctly: the goal is to measure how good a generalization bound is at predicting which networks will actually generalize, across the kinds of hyperparameter changes we'd be likely to experiment with in practice. An important note here is that this kind of rankbased measurement is insensitive to whether the actual magnitude of the generalization bound is; it only cares about relative bounds for different model configurations. For a given pair of environments (or pairs of hyperparameter settings), the experimental framework trains multiple seeds and averages the sign error across them. If the two models in the pair were close to one another in generalization error, they were downweighted in the overall average, or removed from the estimation if they were too close, to reduce noise. A difference in methodologies between the Jiang paper and this one is that this one puts a lot of emphasis on the need to rank generalization measures not just by their average performance over a suite of different hyperparameter perturbations, but also by a metric capturing how robust the measure is, for which they suggest the max error rather than average error. Their rationale is that simply looking at an average obscures cases where a measure performs poorly in a particular region of hyperparameter space, in a way that might tell us interesting things about its failure modes. For example, beyond just being able to say that generalization bounds based on Frobenius norms performed poorly on average at predicting the effects of changes to training set size, they were able to look at the particular settings where it performed the worst, which turn out to be on small network sizes. The plot below shows the results from all of the tested measures aggregated together. Each row represents a different axis that was being varied, and, for each measure, a number of different settings were sampled over (for the hyperparameters that were being held fixed across pairs, rather than being varied. Each distribution rectangle represents the average sign error across all of the pairs that were sampled for that measure, and that axis of variation. The measures are listed from left to right according to their average performance across all environments and all axes of variation. https://i.imgur.com/Tg3wdA3.png Some conclusions from this experiment were:  Generalization measures seem to not perform well on changes made to width, however, the authors note this was mostly because changes to width tended to not change the generalization performance in consistent ways, and so the difference in test error between the networks in the pair was more often within the range of noise  Most but not all generalization bounds correctly predict that more training data should result in better generalization  No bound does particularly well on predicting the generalization effects of changes in depth Overall, I found this paper delightfully well written, and a real pleasure to read. My one critique is that the authors explicitly point out that an important piece of data for comparing generalization bounds is the set of features they depend on. That is, if a generalization bound can only make predictions with access to the learned weights (in addition to the model class and data characteristics), it's a lot less practically useful, in terms of model design, than one that doesn't. I wish they had followed through on that and represented the dependencies of the different bounds on some way in their central figure, so that it was easier to compare them "fairly," or accounting for the information they had access to. 
[link]
Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result. The paper tests a method called selftraining, and compares it against the more common standard of pretraining. Pretraining works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights on that model model (except for the final prediction head) and using that as initialization for training on your downstream task. Selftraining also uses an external dataset, but doesn't use that external data's labels. It works by 1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on 2) Using that model to make label predictions (for the label set of your downstream task), for the external dataset 3) Retraining a model from scratch with the combined set of human labels and predicted labels from step (2) https://i.imgur.com/HaJTuyo.png This intuitively feels like cheating; something that shouldn't quite work, and yet the authors find that it equals or outperforms pretraining and selfsupervised learning in the setting they examined (transferring from ImageNet as an external dataset to CoCo as a downstream task, and using data augmentations on CoCo). They particularly find this to be the case when they're using stronger data augmentations, and when they have more labeled CoCo data to train with from the pretrained starting point. They also find that selftraining outperforms selfsupervised (e.g. contrastive) learning in similar settings. They further demonstrate that selftraining and pretraining can stack; you can get marginal value from one, even if you're already using the other. They do acknowledge that  because it requires training a model on your dataset twice, rather than reusing an existing model directly  their approach is more computationally costly than the pretrainedImagenet alternative. This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which I believe has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works. The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a halfbaked afterthefact justification. One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pretraining do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. Their best guess is that it demonstrates the value of taskspecificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels. 
[link]
The thing I think is happening here: It proposes a selfsupervised learning scheme (which...seems fairly basic, but okay) to generate encodings. It then trains a Latent World Model, which takes in the current state encoding, the action, and the belief state (I think just the prior RNN state?) and predicts a next state. The intrinsic reward is the difference between this and the actual encoding of the next step. (This is dependent on a particular action and resulting next obs, it seems). I don't really know what the belief state is doing here. Is it a... scalar rather than a RNN state? It being said to start at 0 makes it seem that way. Summary: For years, an active area of research has been the question about how to incentivize reinforcement learning agents to more effectively explore their environment. In many environments, the state space is large, and it's quite difficult to find reward just by randomly traversing it, and, in the absence of reward signal, most reinforcement learning algorithms won't learn. To remedy this, a common approach has been to attempt to define a measure of novelty, such that you can reward policies for exploring novel states. One approach for this is to tabulate counts of how often you've seen given past states, and explore in inverse proportion to those counts. However, this is complicated to scale to imagebased and continuous state spaces. Another tactic has been to use uncertainty in an agent's model of the world as an indication of that part of state space being insufficiently explored. In this way of framing the problem, exploring a part of space gives you more samples from it, and if you use those samples to train your forward predictive model  a model predicting the next state  it will increase in accuracy for that state and states like this. So, in this setting, the "novelty reward" for your agent comes from the prediction error; it's incentivized to explore states where its model is more incorrect. However, a problem with this, if you do simple pixelbased prediction, is that there are some inherent sources of uncertainty in an environment, that don't get reduced by you drawing more samples from those parts of space. The canonical example of this is static on a tv  it's just fundamentally noisy and unpredictable, and no amount of gathering data will reduce that fundamental noise. A lot of ways of naively incentivizing uncertainty draw you into those parts of state space again and again, even though they aren't really serving the purpose of getting you to explore interesting, uninvestigated parts of the space. This paper argues for a similar uncertaintybased metric, but based on prediction of a particular kind of representation of the state, which they argue the pathological property described earlier, of getting stuck in regions of high inherent uncertainty. They do this by first learning a selfsupervised representation that seems *kind of* like contrastive predictive coding, but slightly different. Their approach simply pushes the Euclidean distance between the representations of nearby timesteps to be smaller, without any explicit negative set to contrast again. To avoid the network learning the degenerative solution of "always predict a constant, so everything is close to everything else", the authors propose a "whitening" (or rescaling, or normalizing" operation before the mean squared error. This involves subtracting out the mean representation, and dividing by the covariance, before doing a mean squared error. This means that, even if you've pushed your representations together in absolute space, after the whitening operation, they will be "pulled out" again to be in a spherical unit Gaussian, which stops the network from being able to get a benefit from falling into the collapsing solution. https://i.imgur.com/Psjlf4t.png Given this pretrained encoder, the method works by:  Constructing a recurrent Latent World Model (LWM) that takes in the encoded observation, the current action, and the prior belief state of the recurrent model  Encoding the current observation with the pretrained encoder  Passing that representation into the LWM to get a predicted next representation out (prediction happens in representation space, not pixel space)  Using the error between the actual encoding of the next observation, and the predicted next representation, as the novelty signal  Training a DQN on top of the encoding Something I'm a little confused by is whether the encoding network is exclusively trained via MSE loss, or whether it also gets gradients from the actual RL DQN task. Overall, this method makes sense, but I'm not quite sure why the proposed representation structure would be notably better than other, more canonical selfsupervised losses, like CPC. They show ActionConditioned CPC as a baseline when demonstrating the quality of the representations, but not as a slotin replacement for the MSE representations in their overall architecture. It does seem to get strong performance on explorationheavy tasks, but I'll admit I'm not familiar with the quality of the baselines they chose, and so don't have a great sense of whether the  admittedly quite strong!  performance shown in the table below is in fact comparing against the current state of the art. https://i.imgur.com/cIr2Y4w.png 
[link]
This is an interesting  and refreshing  paper, in that, instead of trying to go allin on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process.  An initial experiment tries to understand how much of the performance of finetuned models can be explained by (higherlevel, and thus largerscale) features, and how much is driven by lower level (and thus smallerscale) image statistics. To start with, the authors compare the transfer performance from ImageNet onto three different datasets  clip art, sketches, and real images. As expected, transfer performance is highest with real datasets, which are the most similar to training domain. However, there still *is* positive transfer in terms of final performance across all domains, as well as benefit in optimization speed.  To try to further tease out the difference between the transfer benefits of high and lowlevel features, the authors run an experiment where blocks of pixels are shuffled around within the image on downstream tasks . The larger the size of the blocks being shuffled, the more that largescale features of the image are preserved. As predicted, accuracy drops dramatically when pixel block size is small, for both randomly initialized and pretrained models. In addition, the relative value added by pretraining drops, for all datasets except quickdraw (the dataset of sketches). This suggests that in most datasets, the value brought by finetuning was mostly concentrated in largescale features. One interesting tangent of this experiment was the examination of optimization speed (in the form of mean training accuracy over initial epochs). Even at block sizes too small for pretraining to offer a benefit to final accuracy, it did still contribute to faster training. (See transparent bars in righthand plot below) https://i.imgur.com/Y8sO1da.png  On a somewhat different front, the authors look into how similar pretrained + finetuned models are to one another, compared to models trained on the same dataset from random initializations. First, they look at a measure of feature similarity, and find that the features learned by two pretrained networks are more similar to each other than a pretrained network is to a randomly initalized network, and also more than two randomly initialized networks are to one another. Randomly initialized networks are closest to one another in their finallayer features, but this is still a multiple of 4 or 5 less than the similarity between the pretrained networks  Looking at things from the perspective of optimization, the paper measures how much performance drops when you linearly interpolate between different solutions found by both randomly initialized and pretrained networks. For randomly initialized networks, interpolation requires traversing a region where test accuracy drops to 0%. However, for pretrained networks, this isn't the case, with test accuracy staying high throughout. This suggests that pretraining gets networks into a basin of the loss landscape, and that future training stays within that basin. There were also some experiments on module criticality that I believe were in a similar vein to these, but which I didn't fully follow  Finally, the paper looks at the relationship between accuracy on the original pretraining task and both accuracy and optimization speed on the downstream task. They find that higher originaltask accuracy moves in the same direction as higher downstreamtask accuracy, though this is less true when the downstream task is less related (as with quickdraw). Perhaps more interestingly, they find that the benefits of transfer to optimization speed happen and plateau quite early in training. Clip Art and Real transfer tasks are much more similar in the optimization speed benefits they get form ImageNet training, where on the accuracy front, the real did dramatically better. https://i.imgur.com/jBCJcLc.png While there's a lot to dig into in these results overall, the things I think are most interesting are the reinforcing of the idea that even very random and noisy pretraining can be beneficial to optimization speed (this seems reminiscent of another paper I read from this year's NeurIPS, examining why pretraining on random labels can help downstream training), and the observation that pretraining deposits weights in a lowloss bucket, from which they can learn more efficiently (though, perhaps, if the task is too divergent from the pretraining task, this difficulty in leaving the basin becomes a disadvantage). This feels consistent with some work in the Lottery Ticket Hypothesis, which has recently suggested that, after a short duration of training, you can rewind a network to a checkpoint saved after that duration, and be successfully able to train to low loss again. 
[link]
Contrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using a strong augmentation, tries to learn a way to perturb images that adversarially increases contrastive loss. As with adversarial training in normal supervised setting, the thinking here is that examples which push loss up the highest are the hardest and thus most informative for the network to learn from While the concept of this paper made some sense, I found the notation and the explanation of mechanics a bit confusing, particularly when it came to choice to frame a contrastive loss as a crossentropy loss, with the "weights" of the dot product in the the crossentropy loss being, in fact, the projection by the learned encoder of various of the examples in the batch. https://i.imgur.com/iQXPeXk.png This notion of the learned representations being "weights" is just odd and counterintuitive, and the process of trying to wrap my mind around it isn't one I totally succeeded at. I think the point of using this frame is because it provides an easy analogue to the Fast Gradient Sign Method of normal supervised learning adversarial examples, even though it has the weird effect that, as the authors say "your weights vary by batch...rather than being consistent across training," Notational weirdness aside, my understanding is that the method of this paper:  Runs a forward pass of normal contrastive loss (framed as crossentropy loss) which takes augmentations p and q and runs both forward through an encoder.  Calculates a delta to apply to each input image in the q that will increase the loss most, taken over all the images in the p set  I think the delta is perimage in q, and is just aggregated over all images in p, but I'm not fully confident of this, as a result of notational confusion. It could also be one delta applied for all all images in q.  Calculate the loss that results when you run forward the adversarially generated q against the normal p  Train a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part https://i.imgur.com/UWtJpVx.png The authors show a small but relatively consistent improvement to performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial. 
[link]
This was a really cooltome paper that asked whether contrastive losses, of the kind that have found widespread success in semisupervised domains, can add value in a supervised setting as well. In a semisupervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the representation of that example away from other examples in the batch, which are negatives in the sense that they are assumed to not be related to the anchor image. This paper investigates whether this same structure  of training representations of positives to be close relative to negatives  could be expanded to the supervised setting, where "positives" wouldn't just mean augmented versions of a single image, but augmented versions of other images belonging to the same class. This would ideally combine the advantages of selfsupervised contrastive loss  that explicitly incentivizes invariance to augmentationbased changes  with the advantages of a supervised signal, which allows the representation to learn that it should also see instances of the same class as close to one another. https://i.imgur.com/pzKXEkQ.png To evaluate the performance of this as a loss function, the authors first train the representation  either with their novel supervised contrastive loss SupCon, or with a control crossentropy loss  and then train a linear regression with crossentropy on top of that learned representation. (Just because, structurally, a contrastive loss doesn't lead to assigning probabilities to particular classes, even if it is supervised in the sense of capturing information relevant to classification in the representation) The authors investigate two versions of this contrastive loss, which differ, as shown below, in terms of the relative position of the sum and the log operation, and show that the L_out version dramatically outperforms (and I mean dramatically, with a topone accuracy of 78.7 vs 67.4%). https://i.imgur.com/X5F1DDV.png The authors suggest that the L_out version is superior in terms of training dynamics, and while I didn't fully follow their explanation, I believe it had to do with L_out version doing its normalization outside of the log, which meant it actually functioned as a multiplicative normalizer, as opposed to happening inside the log, where it would have just become an additive (or, really, subtractive) constant in the gradient term. Due to this stronger normalization, the authors positive the L_out loss was less noisy and more stable. Overall, the authors show that SupCon consistently (if not dramatically) outperforms crossentropy when it comes to final accuracy. They also show that it is comparable in transfer performance to a selfsupervised contrastive loss. One interesting extension to this work, which I'd enjoy seeing more explored in the future, is how the performance of this sort of loss scales with the number of different augmentations that performed of each element in the batch (this work uses two different augmentations, but there's no reason this number couldn't be higher, which would presumably give additional useful signal and robustness?) 
[link]
This is another paper that was a bit of a personalgrowth test for me to try to parse, since it's definitely heavier on analytical theory than I'm used to, but I think I've been able to get something from it, even though I'll be the first to say I didn't understand it entirely. The question of this paper is: why does it seem to be the case that training a neural network on a data distribution  but with your supervised labels randomly sampled  seems to afford some level of advantage when finetuning on those randomtrained with correct labels. What is it that these networks learn from random labels that gives them a headstart on future training? To try to answer this, the authors focus on analyzing the firstlayer weights of a network, and frame both the input data and the learned weights (after random training) to both be random variables, each with some mean and covariance matrix. The central argument made by the paper is: After training with random labels, the weights come to have a distributional form with a covariance matrix that is "aligned" with the covariance matrix of the data. "Aligned" here means alignment on the level of eigenvectors. Formally, it is defined as a situation where every eigenspace in the data covariance matrix is contained in, or is a subset of, an eigenspace of the weight matrix. Intuitively, it means that the principal components — the axes that define the principle dimensions of variation  of the weight space are being aligned, in a linear algebra sense, with those of the data, down to a difference of a scaling factor (that is, the fact that the eigenvalues may be different between the two). They do show some empirical evidence of this being the case, by calculating the actual covariance matrices of both the data and the learned weight matrices, and showing that you see high degrees of similarity between the vector spaces of the two (though sometimes by having to add eigenspaces of the data together to be equivalent to an eigenspace of the weights). https://i.imgur.com/TB5JM6z.png They also show some indication that this property drives the advantage in finetuning. They do this by just taking their analytical model of what they believe is happening during training  that weights are coming to be drawn from a distribution governed by a covariance matrix aligned with the data covariance matrix  and sample weights from a normal distribution that has that property. They show, in the plot below, that this accounts for most of the advantage that has been observed in subsequent training from training on random labels (other than the previouslydiscovered effect of "all training increases the scale of the weights, which helps in future training," which they account for by normalizing). https://i.imgur.com/cnT27HI.png Unfortunately, beyond this central intuition of covariance matrix alignment, I wasn't able to get much else from the paper. Some other things they mentioned, that I didn't follow were:  The actual proof for why you'd expect this property of alignment in the case of random training  An analysis of the way that "the first layer [of a network] effectively learns a function which maps each eigenvalue of the data covariance matrix to the corresponding eigenvalue of the weight covariance matrix." I understand that their notion of alignment predicts that there should be some relationship between these two eigenvalues, but I don't fully follow which part of the first layer of a neural network will *learn* that function, or produce it as an output  An analysis of how this framework explains both the cases where you get positive transfer from randomlabel training (i.e. finetuned networks training better subsequently) and the cases where you get negative transfer 
[link]
In the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within selfsupervised learning. The basic idea of contrastive learning is that, instead of needing humangenerated labels to generate a supervised task, you instead assume that there exists some automated operation you can perform to a data element to generate another data element that, while different, should be considered still fundamentally the same, or at least more strongly related, and that you can contrast these related pairs against pairs constructed with the rest of the dataset, with which any given frame would not by default have this assumed relationship of sameness or quasisimilarity. One fairly central way that "different but still effectively similar" has been historically defined  at least within the realm of imagebased models  is through the use of data augmentations: image transformations such as cropping, color jitter, or Gaussian blur, which are used on an image to create the counterpart in its related pair. Fundamentally, what we're doing when we define these particular augmentations is saying: these transformations don't cause a meaningful change in what the image is, and so we want the representations we get with and without the transformations to be close to one another (or at least to contain enough information to predict one another). Another way you can say this is that we're defining properties of the image that we want our representation to be invariant to. The authors of this paper make the point that, when aggressive cropping is part of your toolkit of augmentations, the crops of the image can actually contain meaningfully different content than the uncropped image. If you remove a few pixels around the edges of an image, it still fundamentally contains the same things. However, if you zoom in dramatically, you may get crops that contain different objects. From an image classification standpoint, you would expect that coding in an invariance to cropping in our representations would, in some cases, also mean coding in an invariance to object type, which would presumably be detrimental to the task of classifying objects. To explain the extent of the success that aggressivecropping methods have had so far, they argue that ImageNet has the particular property that its images are curated to be primarily and centrally containing a single object at a time, such that, even if you zoom in, you're getting a part of the central object, rather than another object entirely. They argue that this dataset bias might explain why you haven't seen as much of this objectinvariance be a problem in earlier augmentationbased contrastive work. To try to test this, they train different contrastive (MoCo v2) models on the MSCOCO dataset, which consists of pictures of rooms, and thus no longer has the property of being centrally of one object. They tried one setting where they performed contrastive loss on the images as a whole, and another where the input to the augmentation pipeline were images from the same dataset, but precropped to only contain one image at a time. This was meant, as far as I can tell, to isolate the effect of "objectcentric vs not" while holding other dataset factors constant. They then test how well these different models do on an objectcentric classification task (Pascal Cropped Boxes). They find that the contrastive model that trains cropped versions of the dataset gets about 3.5 points higher mean accuracy (71.9 vs 75.3) compared to the contrastive loss done on the multiobject versions of images. They also explicitly try to measure different forms of invariance, through a scheme where they binarize the elements of the representation vector, and calculate what proportion of them fire on average with and without a given set of transformations. They find that the main form of invariances that contrastive learning does well at is invariance to occlusion (part of the image not being visible), where both contrastive methods give ~84 percent cofiring, and supervised pretraining only gets about 80.9. However, on other important measures of invariance  viewpoint, illumination direction, and instance (that is, specific instance within a class)  contrastive representations perform notably worse than supervised pretraining. https://i.imgur.com/7Ghbv5A.png To try to solve these two problems, they propose a method that learns from video, and that uses temporally separated frames (which are then augmented) as pairs. They call this Frame Temporal Invariance, and argue, reasonably, that by pushing the representations of adjacent frames which track a (presumably consistent, or at least slowlyevolving) scene closer together, you should expect better invariance to viewpoint change and image deformation, since those things naturally happen when an object is moving through the world. They also suggest using an offtheshelf object bounding box model to find particular objects, and track them throughout the video, and to use contrastive learning specifically on the bounding boxes that the algorithm thinks track a consistent object. https://i.imgur.com/2GfCTog.png Overall, my take on this paper is that the analysis they do  of different kinds of invariances contrastive vs supervised loss does well on, and of the extent to which contrastive loss results might be biased by datasets  is quite interesting and a valuable contribution to our understanding of these very currentlyhypey algorithms. However, I'm a bit less impressed by the novelty of their proposed solution. Temporal forms of contrastive learning have been around before  in reinforcement learning, and even in the original Contrastive Predictive Coding paper, where the related pairs were related by dint of temporal closeness. So, while using it in video is certainly a good idea, it doesn't really feel strongly novel to me. I also feel a little confused by their choice of using an offtheshelf object detection model as a prerequisite for a selfsupervised task, since my impression was that a central goal of selfsupervision was building techniques that could scale to situations where it was infeasible to get large amounts of labels, and any method that relies on a preexisting trained object bounding box model is pretty inherently limited in that regard. 
[link]
A central problem in the domain of reinforcement learning is how to incentivize exploration and diversity of experience, since RL agents can typically only learn from states they go to, and it can often be the case that states with high reward don't have an obvious trail of highreward states leading to them, meaning that algorithms that are naively optimizing for reward will be relatively unlikely to discover them. One potential way to promote exploration is to train an ensemble of agents, and have part of your loss that incentivizes diversity in the behavior of those agents. Intuitively, an incentive for diversity will push policies away from one another, which will force them to behave differently, and thus reach a wider range of different states. Current work in diversity tends to penalize, for each agent, the average pairwise distance between it and every other agent. The authors of this paper have two critiques with this approach: 1. Pure pairwise distance doesn't fully capture the notion of diversity we might want, since if you have policies that are clustered together into multiple clusters, average pairwise distance can be increased by moving the clusters farther apart, without having to break up the clusters themselves 2. Having the diversity term be calculated for each policy independently can lead to "cycling" behavior, where policies move in ways that increase distance at each step, but don't do so when each agent's independent steps are taken into account As an alternative, they propose calculating the pairwise Kernel function similarity between all policies (where each policy is represented as the average action probabilities it returns across a sample of states), and calculating the determinate of that matrix. The authors claim that this represents a better measure of full population diversity. I can't fully connect the dots on this intuitively, but what I think they're saying is: the determinant of the kernel matrix tells you something about the effective dimensionality spanned by the different policies. In the same way that, if a matrix is lowrank, it tells you that some vectors within it can be nearly represented by linear combinations of others, a low value of the Kernel determinant means that some policies can be sufficiently represented by combinations of other policies, such that it isn't really adding diversity value to the ensemble. https://i.imgur.com/CmlGsNP.png Another contribution of this paper is to propose an interesting banditbased way of determining when to incentivize diversity vs focus on pure reward. The diversity term in the loss is governed by a lambda parameter, and the paper's model sets up Thompson sampling to determine what the value of the parameter should be at each training iteration. This bandit setup works by starting out uncertain over whether to include diversity in the loss, and building a model of whether reward tends to increase during steps where diversity is used. Over time, if diversity consistently doesn't produce benefits, the sampler will tend more towards excluding it from the loss. This is just a really odd, different idea, that I've never heard of before in the context of hyperparameter scheduling during training. I will say that I'm somewhat confused about the window of time that the bandit uses for calculating rewards. Is it a set of trajectories used in a single training batch, or longer than that? Questions aside, they do so fairly convincingly that the adaptive parameter schedule outperforms over a fixed schedule, though they don't test it against simpler forms of annealing, so I'm less clear on whether it would outperform those. Overall, I still have some confusions about the method proposed by this paper, but it seems to be approaching the question of exploration from an interesting direction, and I'd enjoy trying to understand it further in future.
1 Comments

[link]
This work attempts to use metalearning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values  a scalar and a vector  that are used to update the agent's model. I'm not going to go too deep into metalearning here, but, at a high level, meta learning methods optimize parameters governing an agent's learning, and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher. To be more concrete, the agent in a given environment learns two things:  A policy, that is, a distribution over predicted action given a state.  A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agentmodel generates and updates. The notion here is that maybe our humandesigned construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we metalearn, we might find something more optimal. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1ofm prediction At each step, after acting in the environment, the agent passes to an LSTM:  The reward at the step  A binary of whether the trajectory is done  The discount factor  The probability of the action that was taken from state t  The prediction vector evaluated at state t  The prediction vector evaluated at state t+1 Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things:  A scalar, pihat  A prediction vector, yhat These two quantities are used to update the existing policy and prediction model according to the rule below. https://i.imgur.com/xx1W9SU.png Conceptually, the scalar governs whether to increase or decrease probability assigned to the taken action under the policy, and yhat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned it can (hopefully) generalize to new environments. Given this, the basic meta learning objective falls out fairly easily  optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this metalearning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector). These include:  A entropy bonus, pushing the meta learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)  An L2 penalty on both pihat and yhat  A removal of the softmax that had originally been originally taken over the kdimensional prediction vector categorical, and switching that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1ofk categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of k separate binaries? This I was definitely confused about in the paper overall https://i.imgur.com/EL8R1yd.png With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to be able to perform comparably to or better than A2C  the humandesigned baseline  across the simple gridworlds it was being trained in. However, the two most interesting aspects of the evaluation were: 1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about what states were high value. However, beyond that, they found that the metalearned vector was able to be used to predict the value calculated with discount rates different that than one used in the metalearned training, which the handengineered alternative, TDlambda, wasn't able to do (it could only wellpredict values at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate. 2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well  beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to Reinforcement Learning that it's capturing 
[link]
This paper focuses on an effort by a Deepmind team to train an agent that can play the game Diplomacy  a complex, multiplayer game where players play as countries controlling units, trying to take over the map of Europe. Some relevant factors of this game, for the purposes of this paper, are: 1) All players move at the same time, which means you need to model your opponent's current move, and play a move that succeeds in expectation over that predicted move distribution. This also means that, in order to succeed, your policy can't be easily predictable, since, if it is, you're much easier to exploit, since your opponents can more easily optimize their response to what they predict you'll do 2) The action space is huge, even compared to something like Chess 3) The game has significant multiagent complexity: rather than being straightforwardly zerosum in its reward structure, like Chess or Go, it's designed to require alliances between players in order to succeed Prior work  DipNet  had been able to outperform other handcoded models through a deep network that imitated human actions, but the authors hadn't been able to get RL to successfully learn on top of that imitation baseline. The basic approach this model takes is one that will probably feel familiar if you've read Deepmind's prior work on Chess or Go: an interplay between a fasttoevaluate neural net component, and a slower, more explicit, strategically designed component. The slower "expert" component uses the fast network component as part of its evaluation of different moves' consequences, and then, once the slow expert has generated a series of decisions, the neural net policy learns to imitate those decisions. In this case, the slower expert tries to explicitly calculate a Best Response strategy, given some belief about what your opponents will do at the state you're in. Since the action space is so large, it isn't possible to calculate a full best response (that is, your best *possible* action given the actions you expect your opponents to take), so this paper instead lays out a Sampled Best Response algorithm. It takes as input a state, as well as an opponent policy, a candidate policy, and a value network. (More on how those come to be layer). In the simplest case, the SBR algorithm works by: 1. Sampling some number (B) of actions from the opponent policy given the state. These represent a sample distribution of what moves you think your opponents are likely to play 2. Sampling some number (C) of candidate actions from your candidate policy. 3. For each candidate action, evaluating  for each opponent action  the state you'd reach if you took the candidate action and the opponent took the opponent action, according to your value network. 4. This gives you an estimated Q value for each of your candidate actions; if you pick the action with the highest Q value, this approximates your best response action to the opponent policy distribution you pass in Once you have this SBR procedure, you can use it to bootstrap a better policy and a value network by starting with a policy and value learned from pure humanimitation, and then using SBR  with your current policy as both the opponent and candidate policy  to generate a dataset of trajectories (where each action in the trajectory is chosen according to the SBR procedure). With that trajectory dataset, you can train your policy and value networks to be able to better approximate the actions that SBR would have taken. The basic version of this procedure is called IBR  Iterated Best Response. This is because, at each stage, SBR will return a policy that tries to perform well against the current version of the opponent policy. This kind of behavior is potentially troublesome, since you can theoretically have it lead to cycles, like those in Rock Paper Scissors where paper beats rock, scissors beat paper, and then rock beats scissors again. At each stage, you find the best response to what your opponent is doing right now, but you don't actually make transitive progress. A common way of improving along the axis of this particular failure mode is to learn via a "fictitious play" rather than "self play". Putting aside the name  which I don't find very usefully descriptive  this translates to simulating playing against, not the current version of your opponent, but a distribution made up of past versions of the opponent. This helps prevent cycling, because choosing a strategy that only defeats the current version but would lose to a prior version is no longer a rewarding option. The Fictitious Playbased approach this paper proposes  FPPI2  tries to resolve this problem. Instead of sampling actions in step (1) from only your current opponent policy, it uses a sampling procedure where, for each opponent action, it first samples a point in history, and then samples the opponent policy vector at that point in history (so the multiple opponent moves collectively represent a coherent point in strategy space). Given this action profile, the final version of the algorithm continues to use the most recent/updated timestep of the policy and value network for the candidate policy and value network used in SBR, so that you're (hopefully) sampling high quality actions, and making accurate assessments of states, even while you use the distribution of (presumably worse) policies to construct the opponent action distribution that you evaluate your candidate actions against. https://i.imgur.com/jrQAAQW.png The authors don't evaluate against human players, but do show that their FPPI2 approach consistently outperforms the prior DipNet state of the art, and performed the best overall, though Iterated Best Response performs better than they say they expected. Some other thoughts:  This system is still fairly exploitable, and doesn't become meaningfully less so during training, though it does become more difficult to exploit quickly  This does seem like a problem overall where you do a lot of modeling of what you expect your opponents strategies to be, and it seems hard to be robust, especially to collusion of your opponents  I was a bit disappointed that the initial framing of the paper put a lot of emphasis on the alliancefocused nature of the game, but then neither suggested mechanisms targeting that aspect of the game, nor seemed to do any specific analysis of  I would have expected this game to benefit from some explicit modeling of different agents having different policies; possibly this just isn't something they could have had be the case under their evaluation scheme, which played against copies of a given policy? Overall, my sense is that this is still a pretty earlystage checkpoint in the effort of playing Diplomacy, and that we've still got a ways to go, but it is interesting early work, and I'm curious where it leads. 
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts  the usefulness of pruning, and the success of weight rewinding  the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune lowmagnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphortoneuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. 
[link]
One of the most notable flaws of modern modelfree reinforcement learning is its sample inefficiency; where humans can learn a new task with relatively few examples, model that learn policies or value functions directly from raw data need huge amounts of data to train properly. Because the model isn't given any semantic features, it has to learn a meaningful representation from raw pixels using only the (often sparse, often noisy) signal of reward. Some past approaches have tried learning representations separately from the RL task (where you're not bottlenecked by agent actions), or by adding more informative auxiliary objectives to the RL task. Instead, the authors of this paper, quippily titled "Image Augmentation Is All You Need", suggest using data augmentation of input observations through image modification (in particular, by taking random different crops of an observation stack), and integrating that augmentation into the native structure of a RL loss function (in particular, the loss term for Q learning). There are two main reasons why you might expect image augmentation to be useful. 1. On the most simplistic level, it's just additional data for your network 2. But, in particular, it's additional data designed to exhibit ways an image observation can be different on a pixel level, but still not be meaningfully different in terms of its state within the game. You'd expect this kind of information to make your model robust to overfitting. The authors go into three different ways they could add image augmentation to a Q Learning model, and show that each one provides additional marginal value. The first, and most basic, is to just add augmented versions of observations to your training dataset. The basic method being used, Soft Actor Critic, uses a replay buffer of old observations, and this augmentation works by simply applying a different crop transformation each time an observation is sampled from a replay buffer. This is a neat and simple trick, that effectively multiplies the number of distinct observations your network sees by the number of possible crops, making it less prone to overfitting. The next two ways involve integrating transformed versions of an observation into the structure of the Q function itself. As a quick background, Q learning is trained using a Bellman consistency loss, and Q tries to estimate the value of a (state, action) pair, assuming that you do the best possible thing at every point after you take the action at the state. The consistency loss tries to push your Q estimate of the value of a (state, action) pair closer to the sum of reward you got by taking that action and your current max Q estimate for the next state. The second term in this loss, the combined reward and nextstep Q value, is called the target, since it's what you push your currentstep Q value closer towards. This paper suggests both:  Averaging your currentstep Q estimate over multiple different crops the observation stack at the current state  Averaging the nextstep Q estimate used in the target over multiple different crops (that aren't the ones used in the currentstep averaging) This has the nice side effect that, in addition to telling your network about image transformations (like small crops) that shouldn't impact its strategic assessment, it also makes your Q learning process overall lower variance, because both the current step and next step quantities are averages rather than singlesample values. https://i.imgur.com/LactlFq.png Operating in a lower data regime, the authors found that simply adding augmentations to their replay buffer sampling (without the two averaging losses) gave them a lot of gains in how efficiently they could learn, but all three combined gave the best performance. 
[link]
The Transformer architecture  which uses a structure entirely based on keyvalue attention mechanisms to process sequences such as text  has taken over the worlds of language modeling and NLP in the past three years. However, Transformers at the scale used for large language models have huge computational and memory requirements. This is largely driven by the fact that information at every step in the sequence (or, in the sofargenerated sequence during generation) is used to inform the representation at every other step. Although the same *parameters* are used for each of these pairwise calculation between keys and queries at each step, this is still a pairwise, and thus N^2, calculation, which can get very costly when processing long sequences on the scale of tens of thousands of tokens. This cost comes from both computation and memory, with memory being the primary focus of this paper, because the max memory requirements of a network step dictate the hardware it can be run on, in a way that the pure amount of computation that needs to be performed doesn't. A L^2 attention calculation, as naively implemented in a set of matrix multiplies, not only has to perform N^2 calculations, but has to be able to hold N^2 values in memory while performing the softmax and weighted sum that is the attention aggregation process. Memory requirements in Transformers are also driven by  The high parameter counts of dense layers within the network, which have less parameter use per calculation than attention does, and  The fact that needing to pass forward one representation per element in the list at each layer necessitates cumulatively keeping all the activations from each layer in the forward pass, so that you can use them to calculate derivatives in the backward pass. This paper, and the "Reformer" architecture they suggest, is less a single idea, and more a suite of solutions targeted to make Transformers more efficient in use of both compute and memory. 1. The substitution of Locality Sensitive Hashing for normal keyquery attention is a strategy for reducing the L^2 compute and memory requirements of the raw attention calculation. The essential idea of attention is "calculate how well the query at position i is matched by the key at every other position, and then use those matching softmax weights to calculate a weighted sum of the representations at each other position". If you consider keys and queries to be in the same space, you can think of this as a similarity calculation between positions, where you want to most highly weight the most similar positions to you in the calculation of your own nextlayer value. In this spirit, the authors argue that this weighted sum will be mostly influenced by the highestsimilarity positions within the softmax, and so, instead of performing attention over the whole sequence, we can first sort positions into buckets based on similarity of their key/query vector for a given head, and perform attention weighting within those buckets. https://i.imgur.com/tQJkfGe.png This has the advantage that the first step, of assigning a position's key/query vector to a bucket, can be done for each position individually, rather than with respect to the value at another position. In this case, this bucketing is performed by a Locality Sensitive Hashing algorithm, which works by projecting each position's vector into a lowerdimensional space, and then taking the index of that vector which has the max value. This is then used as a bucket ID for performing full attention within. This shifts the time complexity of attention from O(L^2) to O(LlogL), since for each position in the length, you only need to calculate explicit pairwise similarity for the log(L) other elements in its bucket 2. Reversible layers. This addresses the problem of needing to keep activations from each layer around for computing the backwardpass derivatives. It takes an idea used in RevNets, which proposes a reversible alternative to the commonly used ResNet architecture. In a normal ResNet scenario, Y = X + F(X), where F(X) is the computation of a single layer or block, and Y are the resulting activations passed to the next layer. In this setup, you can't go back from Y to get the value of X if you discard X for memory reasons, because the difference between the two is a function of X, which you don't have. As an alternative, RevNets define a sort of odd crosswise residual structure, that starts by partitioning X into two components, X1 and X2, and the output Y into Y1 and Y2, and performing the calculation shown below. https://i.imgur.com/EK2vBkK.png This allows you to work backward, getting X2 from Y1 and Y2 (both of which you have as outputs), and then get X1 from knowing the other three parts. https://i.imgur.com/uLTrdyf.png This means that as soon as you have the activations at a given layer, you can discard earlier layer activations, which makes things a lot more memory efficient. 3. There's also a proposal to do (what I *think* is) a pretty basic chunking of feed forward calculations across sequence length, and performing feedforward calculations on parts of the sequence rather than the whole thing. The latter would be faster with vectorized computing, for parallelization reasons, but the former is more memory efficient 
[link]
This paper, presented this week at ICLR 2020, builds on existing applications of messagepassing Graph Neural Networks (GNN) for molecular modeling (specifically: for predicting quantum properties of molecules), and extends them by introducing a way to represent angles between atoms, rather than just distances between them, as current methods are limited to. The basic version of GNNs on molecule data works by creating features attached to atoms at each level (starting at level 0 with the elementspecific embedding of that atom), and constructing "messages" between neighboring atoms that are a function of the neighbor atom's feature vector and the distance between the two neighbors. (This is the minimal version; some methods also include the bond type along with the distance as part of the edgespecific features). At a given layer, an atom's features are updated by applying an update function to both its own prior value and the sum of all the messages it receives from neighbors. The trouble with this method is that it doesn't account for angular relationships between sets of atoms, which are physically important to quantum properties of a molecule. The naive way you might imagine representing angle is by situating the molecule in a 2D grid, and applying spherical convolutions, so your contribution to a neighbor's features would be based on your spherical distance away. However, this doesn't work, because molecules don't have a canonical frame of reference  there is no fixed left or right, or up and down, and operating in this way would mean that a molecule and its horizontal flip would have different representations. Instead, the authors propose an interesting inversion of the existing approaches, where feature vectors are attached to atoms, and are updated using the features of other atoms. In this model, features live on "messages" between pairs of atoms, and are updated by incorporating information from all messages within some local distance window. Importantly, each pair of atoms has a vector associated with their relationship in the molecule, and so when you combine two such messages together, you can calculate the angle between the associated vectors. This angle is invariant to flipping or rotation, because it's defined based on reference points internal to the molecule, which move together when the molecule is moved. https://i.imgur.com/mw46gWz.png Messages are updated from other messages using a combination of the distance between the nonshared endpoints of the messages (that is, if both message vectors share an endpoint i, and go to j and k respectively, this would be the distance between j and k), and the angle between the (ij) vector and the (ij) vector. For physicsbased reasons I admit I didn't fully follow, these two pieces of information are embedded in a spherical basis function, so messages will update from each other differently based on their relative positions in a sphere. https://i.imgur.com/Tvc7Gex.png The representation of a given atom is then simply the sum of all its incoming messages, conditioned by the distance between the reference atom and the paired neighbor for which the message is defined. A concatenation of atom representations across layers is used to create a final atom representation, which is used for final quantum property prediction. The authors tested on two datasets, and found dramatic improvements, with an average of 31% relative gain on the prior state of the art over different quantum property targets. 
[link]
disclaimer: I'm the first author of the paper ## TL;DR We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with reallife applications of CL. It brings CL closer to Online Learning and OpenWorld Learning. main modifications we propose:  bring CL closer to Online learning i.e. at test time, the model is continually learning and evaluated on its online predictions  it's fine to forget, as long as you can quickly remember (just like we humans do)  we allow pretraining, (because you wouldn't deploy an untrained CL system, right?) but at test time, the model will have to quickly learn new outofdistribution (OoD) tasks (because the world is full of surprises)  the tasks distribution is actually a hidden Markov chain. This implies:  new and old tasks can reoccur (just like in real life). Better remember them quickly if you want to get a good total performance!  tasks have different lengths  and the tasks boundaries are unknown (task agnostic setting) ### Bonus: We provide a unifying framework explaining the space of machine learning setting {supervised learning, meta learning, continual learning, metacontinual learning, continualmeta learning} in case it was starting to get confusing :p ## Motivation We imagine an agent, embedded or not, first pretrained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pretrained in a factory and deployed in homes or in manufactures where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pretrained on static datasets and deployed in a user’s life to fit its personal needs. Further motivations can be found in time series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, autonomous driving, to name a few. In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime. Differently, standard CL reports the agent’s final performance on all tasks at the end of its life. In order to succeed in this scenario, agents need the ability to learn new tasks as well as quickly remembering old ones. ## Unifying Framework We propose a unifying framework explaining the space of machine learning setting {supervised learning, meta learning, continual learning, metacontinual learning, continualmeta learning} with meta learning terminology. https://i.imgur.com/U16kHXk.png (easier to digest with accompanying text) ## OSAKA The main features of the evaluation framework are  task agnosticism  pretraining is allowed, but OoD tasks at test time  task revisiting  controllable nonstationarity  online evaluation (see paper for the motivations of the features) ## ContinualMAML: an initial baseline A simple extension of MAML that is better suited than previous methods in the proposed setting. https://i.imgur.com/C86WUc8.png Features are:  Fast Adapatation  Dynamic Representation  Task boundary detection  Computational efficiency ## Experiments We provide a suite of 3 benchmarks to test algorithms in the new setting. The first includes the Omniglot, MNIST and FashionMNIST dataset. The second and third use the Synbols (Lacoste et al. 2018) and TieredImageNet datasets, respectively. The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, metacontinual learning, continualmeta learning, in the new setting. https://i.imgur.com/IQ1WYTp.png The second and third experiments lead us to similar conclusions code: https://github.com/ElementAI/osaka 
[link]
The authors introduce a new, samplingfree method for training and evaluating energybased models (aka EBMs, aka unnormalized density models). There are two broad approches for training EBMs. Samplingbased approaches like contrastive divergence try to estimate the likelihood with MCMC, but can be biased if the chain is not sufficiently long. The speed of training also greatly depends on the sampling parameters. Other approches, like score matching, avoid sampling by solving a surrogate objective that approximates the likelihood. However, using a surrogate objective also introduces bias in the solution. In any case, comparing goodness of fit of different models is challenging, regardless of how the models were trained. The authors introduce a measure of probability distance between distributions $p$ and $q$ called the Learned Stein Discrepancy ($LSD$): $$ LSD(f_{\phi}, p, q) = \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T f_{\phi}(x) + Tr(\nabla_x f_{\phi} (x)) $$ This measure is derived from the Stein Discrepancy $SD(p,q)$. Note that like the $SD$, the $LSD$ is 0 iff $p = q$. Typically, $p$ is the data distribution and $q$ is the learned approximate distribution (an EBM), although this doesn't have to be the case. Note also that this objective only requires a differentiable unnormalized distribution $\tilde{q}$, and does not require MCMC sampling or computation of the normalizing constant $Z$, since $\nabla_x \log q(x) = \nabla_x \log \tilde{q}(x)  \nabla_x \log Z = \nabla_x \log \tilde{q}(x)$. $f_\phi$ is known as the critic function, and minimizing the $LSD$ with respect to $\phi$ (i.e. with gradient descent) over a bounded space of functions $\mathcal{F}$ can approximate the $SD$ over that space. The authors choose to define the function space $\mathcal{F} = \{ f: \mathbb{E}_{p(x)} [f(x)^Tf(x)] < \infty \}$, which is convenient because it can be optimized by introducing a simple L2 regularizer on the critic's output: $\mathcal{R}_\lambda (f_\phi) = \lambda \mathbb{E}_{p(x)} [f_\phi(x)^T f_\phi(x)]$. Since the trace of a matrix is expensive to backpropagate through, the authors use a singlesample Monte Carlo estimate $Tr(\nabla_x f_\phi(x)) \approx \mathbb{E}_{\mathbb{N}(\epsilon0,1)} [\epsilon^T \nabla_x f_\phi(x) \epsilon] $, which is more efficient since $\epsilon^T \nabla_x f_\phi(x)$ is a vectorJacobian product. The overall objective is thus the following: $$ \text{arg} \max_\phi \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T f_{\phi}(x) + \mathbb{E}_{\epsilon} [\epsilon^T \nabla_x f_{\phi} (x) \epsilon)]  \lambda f_\phi(x)^T f_\phi(x)] $$ It is possible to compare two different EBMs $q_1$ and $q_2$ by optimizing the above objective for two different critic parameters $\phi_1$ and $\phi_2$, using the training and validation data for critic optimization (then evaluating on the heldout test set). Note that when computing the $LSD$ on the test set, the exact trace can be computed instead of the Monte Carlo approximation to reduce variance, since gradients are no longer required. The model that is closer to 0 has achieved a better fit. Similarly, a hypothesis test using the $LSD$ can be used to test if $p = q$ for the data distribution $p$ and model distribution $q$. The authors then show how EBM parameters $\theta$ can actually be optimized by gradient descent on the $LSD$ objective, in a minimax problem that is similar to the problem of optimizing a generative adversarial network (GAN). For given $\theta$, you first optimize the critic $f_\phi$ w.r.t. $\phi$ to try to get the $LSD(f_\phi, p, q_\theta)$ close to its theoretical optimum with the current $q_\theta$, then you take a single gradient step $\nabla_\theta LSD$ to minimize the $LSD$. They show some experiments that indicates that this works pretty well. One thing that was not clear to me when reading this paper is whether the $LSD(f_\phi,p,q)$ should be minimized or maximized with respect to $\phi$ to get it close to the true $SD(p,q)$. Although it it possible for $LSD$ to be above or below 0 for a given choice of $q$ and $f_\phi$, the problem can always be formulated as minimization by simply changing the sign of $f_\phi$ at the beginning such that the $LSD$ is positive (or as maximization by making it negative). 
[link]
In this note, I'll implement the [Stochastically Unbiased Marginalization Objective (SUMO)](https://openreview.net/forum?id=SylkYeHtwr) to estimate the logpartition function of an energy funtion. Estimation of logpartition function has many important applications in machine learning. Take latent variable models or Bayeisian inference. The logpartition function of the posterior distribution $$p(zx)=\frac{1}{Z}p(xz)p(z)$$ is the logmarginal likelihood of the data $$\log Z = \log \int p(xz)p(z)dz = \log p(x)$$. More generally, let $U(x)$ be some energy function which induces some density function $p(x)=\frac{e^{U(x)}}{\int e^{U(x)} dx}$. The common practice is to look at a variational form of the logpartition function, $$ \log Z = \log \int e^{U(x)}dx = \max_{q(x)}\mathbb{E}[U(x)\log q(x)] \nonumber $$ Plugging in an arbitrary $q$ would normally yield a strict lower bound, which means $$ \frac{1}{n}\sum_{i=1}^n \left(U(x_i)  \log q(x_i)\right) \nonumber $$ for $x_i$ sampled *i.i.d.* from $q$, would be a biased estimate for $\log Z$. In particular, it would be an underestimation. To see this, lets define the energy function $U$ as follows: $$ U(x_1,x_2)=  \log \left(\frac{1}{2}\cdot e^{\frac{(x_1+2)^2 + x_2^2}{2}} + \frac{1}{2}\cdot\frac{1}{4}e^{\frac{(x_12)^2 + x_2^2}{8}}\right) \nonumber $$ It is not hard to see that $U$ is the energy function of a mixture of Gaussian distribution $\frac{1}{2}\mathcal{N}([2,0], I) + \frac{1}{2}\mathcal{N}([2,0], 4I)$ with a normalizing constant $Z=2\pi\approx6.28$ and $\log Z\approx1.8379$. ```python def U(x): x1 = x[:,0] x2 = x[:,1] d2 = x2 ** 2 return  np.log(np.exp(((x1+2) ** 2 + d2)/2)/2 + np.exp(((x12) ** 2 + d2)/8)/4/2) ``` To visualize the density corresponding to the energy $p(x)\propto e^{U(x)}$ ```python xx = np.linspace(5,5,200) yy = np.linspace(5,5,200) X = np.meshgrid(xx,yy) X = np.concatenate([X[0][:,:,None], X[1][:,:,None]], 2).reshape(1,2) unnormalized_density = np.exp(U(X)).reshape(200,200) plt.imshow(unnormalized_density) plt.axis('off') ``` https://i.imgur.com/CZSyIQp.png As a sanity check, lets also visualize the density of the mixture of Gaussians. ```python N1, N2 = mvn([2,0], 1), mvn([2,0], 4) density = (np.exp(N1.logpdf(X))/2 + np.exp(N2.logpdf(X))/2).reshape(200,200) plt.imshow(density) plt.axis('off') print(np.allclose(unnormalized_density / density  2*np.pi, 0)) ``` `True` https://i.imgur.com/g4inQxB.png Now if we estimate the logpartition function by estimating the variational lower bound, we get ```python q = mvn([0,0],5) xs = q.rvs(10000*5) elbo =  U(xs)  q.logpdf(xs) plt.hist(elbo, range(5,10)) print("Estimate: %.4f / Ground true: %.4f" % (elbo.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % elbo.var()) ``` `Estimate: 1.4595 / Ground true: 1.8379` `Empirical variance: 0.9921` https://i.imgur.com/vFzutuY.png The lower bound can be tightened via [importance sampling): $$ \log \int e^{U(x)} dx \geq \mathbb{E}_{q^K}\left[\log\left(\frac{1}{K}\sum_{j=1}^K \frac{e^{U(x_j)}}{q(x_j)}\right)\right] \nonumber $$ > This bound is tighter for larger $K$ partly due to the [concentration of the average](https://arxiv.org/pdf/1906.03708.pdf) inside of the $\log$ function: when the random variable is more deterministic, using a local linear approximation near its mean is more accurate as there's less "mass" outside of some neighborhood of the mean. Now if we use this new estimator with $K=5$ ```python k = 5 xs = q.rvs(10000*k) elbo =  U(xs)  q.logpdf(xs) iwlb = elbo.reshape(10000,k) iwlb = np.log(np.exp(iwlb).mean(1)) plt.hist(iwlb, range(5,10)) print("Estimate: %.4f / Ground true: %.4f" % (iwlb.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % iwlb.var()) ``` `Estimate: 1.7616 / Ground true: 1.8379` `Empirical variance: 0.1544` https://i.imgur.com/sCcsQd4.png We see that both the bias and variance decrease. Finally, we use the [Stochastically Unbiased Marginalization Objective](https://openreview.net/pdf?id=SylkYeHtwr) (SUMO), which uses the *Russian Roulette* estimator to randomly truncate a telescoping series that converges in expectation to the log partition function. Let $\text{IWAE}_K = \log\left(\frac{1}{K}\sum_{j=1}^K \frac{e^{U(x_j)}}{q(x_j)}\right)$ be the importanceweighted estimator, and $\Delta_K = \text{IWAE}_{K+1}  \text{IWAE}_K$ be the difference (which can be thought of as some form of correction). The SUMO estimator is defined as $$ \text{SUMO} = \text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \nonumber $$ where $K\sim p(K)=\mathbb{P}(\mathcal{K}=K)$. To see why this is an unbiased estimator, $$ \begin{align*} \mathbb{E}[\text{SUMO}] &= \mathbb{E}\left[\text{IWAE}_1 + \sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right] \nonumber\\ &= \mathbb{E}_{x's}\left[\text{IWAE}_1 + \mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right]\right] \nonumber \end{align*} $$ The inner expectation can be further expanded $$ \begin{align*} \mathbb{E}_{K}\left[\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \right] &= \sum_{K=1}^\infty P(K)\sum_{k=1}^K \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \\ &= \sum_{k=1}^\infty \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \sum_{K=k}^\infty P(K) \\ &= \sum_{k=1}^\infty \frac{\Delta_K}{\mathbb{P}(\mathcal{K}\geq k)} \mathbb{P}(\mathcal{K}\geq k) \\ &= \sum_{k=1}^\infty\Delta_K \\ &= \text{IWAE}_{2}  \text{IWAE}_1 + \text{IWAE}_{3}  \text{IWAE}_2 + ... = \lim_{k\rightarrow\infty}\text{IWAE}_{k}\text{IWAE}_1 \end{align*} $$ which shows $\mathbb{E}[\text{SUMO}] = \mathbb{E}[\text{IWAE}_\infty] = \log Z$. > (N.B.) Some care needs to be taken care of for taking the limit. See the paper for more formal derivation. A choice of $P(K)$ proposed in the paper satisfy $\mathbb{P}(\mathcal{K}\geq K)=\frac{1}{K}$. We can sample such a $K$ easily using the [inverse CDF](https://en.wikipedia.org/wiki/Inverse_transform_sampling), $K=\lfloor\frac{u}{1u}\rfloor$ where $u$ is sampled uniformly from the interval $[0,1]$. Now putting things all together, we can estimate the logpartition using SUMO. ```python count = 0 bs = 10 iwlb = list() while count <= 1000000: u = np.random.rand(1) k = np.ceil(u/(1u)).astype(int)[0] xs = q.rvs(bs*(k+1)) elbo =  U(xs)  q.logpdf(xs) iwlb_ = elbo.reshape(bs, k+1) iwlb_ = np.log(np.cumsum(np.exp(iwlb_), 1) / np.arange(1,k+2)) iwlb_ = iwlb_[:,0] + ((iwlb_[:,1:k+1]  iwlb_[:,0:k]) * np.arange(1,k+1)).sum(1) count += bs * (k+1) iwlb.append(iwlb_) iwlb = np.concatenate(iwlb) plt.hist(iwlb, range(5,10)) print("Estimate: %.4f / Ground true: %.4f" % (iwlb.mean(), np.log(2*np.pi))) print("Empirical variance: %.4f" % iwlb.var()) ``` `Estimate: 1.8359 / Ground true: 1.8379` `Empirical variance: 4.1794` https://i.imgur.com/04kPKo5.png Indeed the empirical average is quite close to the true logpartition of the energy function. However we can also see that the distribution of the estimator is much more spreadout. In fact, it is very heavytailed. Note that I did not tune the proposal distribution $q$ based on the ELBO, or IWAE or SUMO. In the paper, the authors propose to tune $q$ to minimize the variance of the $\text{SUMO}$ estimator, which might be an interesting trick to look at next. (Reposted, see more details and code from https://www.chinweihuang.com/pages/sumo) 