[link]
This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here. The general goal of this paper is to better understand the contours of when self-supervised representation learning is valuable for vision (and specifically when it can compete with supervised learning), and when it can't. In general, the results are all using ResNet backbones, with SimCLR SSL, on image classification datasets. Some bullet-point takeaways:
- The SSL models being tested here seem to roughly saturate at unsupervised dataset sizes of around 500K; the comparative jump from dataset sizes of 500K to 1M is fairly small.
- Once you have a supervised dataset of around 50K or more, the benefit of SSL pretraining starts to diminish, and it converges to being more similar to just supervised learning on that number of labeled images. On the flip side, it's only possible to get close to "good" fully supervised performance by using 100K images or more on top of a SSL baseline.
- Even within image classification datasets, it's much better to do SSL representation learning on the same dataset as the one you'll use for downstream training; trying to transfer representations to different datasets leads to meaningfully worse results. Interestingly, this is even true when you add out-of-domain (i.e. other-dataset) data to an existing in-domain dataset: a dataset of 250K in-dataset images does better than a 500K dataset of images from mixed datasets, and does notably better than a 1M dataset of mixed images. In this case, adding more out-of-domain images seems to have just degraded performance.
- SSL seems to perform more similarly to SL on a coarse label set; when the label set gets more granular, the task gets harder overall, but, more specifically, the gap between SSL and SL grows.
- When the authors tried different forms of dataset corruption, SSL was much more robust to adding salt-and-pepper noise than it was to removing high-frequency information in the form of reducing the images to a lower resolution. |
[link]
This paper is an interesting extension of earlier work, in the TransformerXL paper, that sought to give Transformers access to a "memory" beyond the scope of the subsequence where full self-attention was being performed. This was done by caching the activations from prior subsequences, and making them available to the subsequence currently being calculated in a "read-only" way, with gradients not propagated backwards. This had the effect of (1) reducing the memory cost relative to simply doubling the subsequence length, and (2) reducing the extent to which gradients had to propagate backward through time. The authors of the Compressive Transformers paper want to build on that set of ideas to construct an even longer accessible memory. So, they take the baseline non-backpropagated memory design of TransformerXL, but instead of having tokens roll out of memory after the end of the previous (cached) subsequence, they create an extra compressed memory. Each token in this compressed memory is a function of C inputs in the normal memory. So, if C=3, you would input 3 memory vectors into your compression function to get one instance of a compressed memory vector. Depending on the scale of your C, you can turn up the temporal distance into the past that your compressed memory reaches.
https://i.imgur.com/7BaCzoU.png
While the gradients from the main loss function didn't, as far as I could tell, pass back into the compression function, they did apply a compression loss to incentivize the compression to be coherent. They considered an autoencoder loss to reconstruct the input tokens from the compressed memory, but decided against that on the principle that memory inherently has to be compressed and lossy to be effective, and an autoencoder loss would promote infeasibly lossless compression. Instead, they take the interesting approach of incentivizing the compressed representations to be able to reconstruct the attention calculation performed on the pre-compressed representations. Basically, any information pulled out of the pre-compressed memories by content-based lookup also needs to be able to be pulled out of the compressed memories. This incentivizes the network to preferentially keep the information that was being actively used by the attention mechanisms in prior steps, and discard less useful information. One framing from this paper that I enjoyed was them drawing a comparison between the approach of Transformers (of keeping all lower-level activations in memory, and recombining them "in real time," for each downstream use of that information), and the approach of RNNs (of keeping a running compressed representation of everything seen up to this point). In this frame, their method is somewhere in between, with a tunable compression rate C (by contrast, an RNN would have an effectively unlimited compression rate, since all prior tokens would be compressed into a single state representation). |
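To make the compression step and the attention-reconstruction loss a bit more concrete, here's a minimal PyTorch-style sketch. The strided-convolution compression function, the use of MSE, and all names here are my own illustrative choices (the paper explores several compression functions), not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryCompressor(nn.Module):
    """Compress every C old memory slots into 1 compressed slot."""
    def __init__(self, d_model, compression_rate=3):
        super().__init__()
        # Conv1d with kernel = stride = C maps C memory vectors to one vector
        self.compress = nn.Conv1d(d_model, d_model,
                                  kernel_size=compression_rate, stride=compression_rate)

    def forward(self, old_mems):                  # [batch, mem_len, d_model]
        x = old_mems.transpose(1, 2)              # [batch, d_model, mem_len]
        return self.compress(x).transpose(1, 2)   # [batch, mem_len // C, d_model]

def attention_reconstruction_loss(h, old_mems, compressed_mems, attn):
    """Whatever attention pulls out of the raw memories should also be
    recoverable from the compressed memories (trained separately from the task loss)."""
    target, _ = attn(h, old_mems, old_mems)                # attend over raw memories
    pred, _ = attn(h, compressed_mems, compressed_mems)    # attend over compressed ones
    return F.mse_loss(pred, target.detach())

# toy usage
d_model, mem_len, batch = 64, 12, 2
h = torch.randn(batch, 4, d_model)               # current segment's hidden states
old_mems = torch.randn(batch, mem_len, d_model)
compressor = MemoryCompressor(d_model, compression_rate=3)
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
loss = attention_reconstruction_loss(h, old_mems, compressor(old_mems), attn)
```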
[link]
The idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight. The network is implemented such that different experts can actually live on different devices.
https://i.imgur.com/HEB7cJw.png
This architecture is inspired by previous Mixture of Experts work, which applied a similar scheme, but sent each token through a set of k experts rather than just a single one. This had the ostensible effect of increasing stability and performance, but the authors of this paper argue that using a single expert per token is actually preferable on both of these fronts. There are a lot of experiments in this paper, and I'd recommend taking a look at them in detail if you're interested, but, at a high level, they found evidence that, compared to models with a comparable number of parameters, they were indeed able to get comparable or better performance with fewer FLOPs. It also meant they were able to build up to a trillion-parameter model, without having unreasonable computation requirements. Some interesting considerations relevant to this approach:
- To keep training speed up, you need to strike the right balance of the number of tokens sent to each expert; in this case, the authors added a loss term to incentivize the division between experts to be roughly uniform.
- There was some numerical instability around the expert training procedure if you used float16 data types, so they switched to using float32, but only within the experts themselves, rather than in the rest of the network.
- To regularize this huge of a network, the authors decided to apply dropout, but only within the experts. |
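Here's a rough sketch of what top-1 routing plus a load-balancing auxiliary loss could look like; the expert MLPs, dimensions, and the Python loop over experts are simplifications for illustration (the real implementation shards experts across devices and caps per-expert capacity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchRouter(nn.Module):
    """Top-1 routing: each token goes to the single highest-probability expert,
    with an auxiliary loss pushing toward a roughly uniform split across experts."""
    def __init__(self, d_model, n_experts, d_ff=256):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)])
        self.n_experts = n_experts

    def forward(self, x):                                   # x: [n_tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)           # [n_tokens, n_experts]
        gate, expert_idx = probs.max(dim=-1)                # top-1 expert per token
        out = torch.zeros_like(x)
        for e in range(self.n_experts):
            mask = expert_idx == e
            if mask.any():
                # scale by the gate value so the router itself receives gradients
                out[mask] = gate[mask].unsqueeze(1) * self.experts[e](x[mask])
        # load-balancing loss: fraction of tokens per expert dotted with mean router prob
        frac_tokens = torch.bincount(expert_idx, minlength=self.n_experts).float() / x.shape[0]
        aux_loss = self.n_experts * (frac_tokens * probs.mean(dim=0)).sum()
        return out, aux_loss

tokens = torch.randn(32, 64)
y, aux = SwitchRouter(d_model=64, n_experts=4)(tokens)
```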
[link]
When machine learning models need to run on personal devices, that implies a very particular set of constraints: models need to be fairly small and low-latency when run on a limited-compute device, without much loss in accuracy. A number of human-designed architectures have been engineered to try to solve for these constraints (depthwise convolutions, inverted residual bottlenecks), but this paper's goal is to use Neural Architecture Search (NAS) to explicitly optimize the architecture against latency and accuracy, to hopefully find a good trade-off curve between the two. This paper isn't the first time NAS has been applied on the problem of mobile-optimized networks, but a few choices are specific to this paper.
1. Instead of just optimizing against accuracy, or optimizing against accuracy with a sharp latency requirement, the authors here construct a weighted loss that includes both accuracy and latency, so that NAS can explore the space of different trade-off points, rather than only those below a sharp threshold.
2. They design a search space where individual sections or "blocks" of the network can be configured separately, with the hope being that this flexibility helps NAS trade off complexity more strongly in the early parts of the network, where, at a higher spatial resolution, it implies greater computation cost and latency, without necessarily dropping that complexity later in the network, where it might be lower-cost. Blocks here are specified by the type of convolution op, kernel size, squeeze-and-excitation ratio, use of a skip op, output filter size, and the number of times an identical layer of this construction will be repeated to constitute a block.
Mechanically, models are specified as discrete strings of tokens (a block is made up of tokens indicating its choices along these design axes, and a model is made up of multiple blocks). These are represented in an RL framework, where an RNN model sequentially selects tokens as "actions" until it gets to a full model specification. This is repeated multiple times to get a batch of models, which here functions analogously to an RL episode. These models are then each trained for only five epochs (it's desirable to use a full-scale model for accurate latency measures, but impractical to run its full course of training). After that point, accuracy is calculated, and latency determined by running the model on an actual Pixel phone CPU. These two measures are weighted together to get a reward, which is used to train the RNN model-selection model using PPO.
https://i.imgur.com/dccjaqx.png
Across a few benchmarks, the authors show that models found with MNasNet optimization are able to reach parts of the accuracy/latency trade-off curve that prior techniques had not. |
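The reward the controller optimizes is, as I understand it, a soft combination of the two objectives along these lines. This is a sketch: the target latency and exponent are placeholder values, and the paper actually uses a pair of exponents depending on whether the model is under or over the latency target.

```python
def mnas_style_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    """Soft latency-aware reward: accuracy scaled by (latency / target) ** w, so
    exceeding the latency target smoothly discounts the reward instead of
    hard-rejecting the architecture."""
    return accuracy * (latency_ms / target_ms) ** w

# two candidate architectures with different accuracy/latency trade-offs
print(mnas_style_reward(0.75, latency_ms=70.0))    # under target: small bonus
print(mnas_style_reward(0.76, latency_ms=110.0))   # over target: discounted
```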
[link]
The goal of this paper is to learn a model that embeds 2D keypoints (the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding where nearby points in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparently-distinct representations actually map to the same 3D pose. To do this, the basic approach used by the authors (which has a few variants) is:
- Take a dataset of 3D poses, and corresponding 2D projections
- Define a notion of "matching" 3D poses, based on a parameter kappa, which designates the maximum average per-joint distance at which two 3D poses can be considered the same
- Construct triplets composed of an anchor pose, a "positive" pose (a different 2D pose with a matching 3D pose), and a "negative" pose (some other 2D pose sampled from the dataset using a strategy that explicitly seeks out hard negative examples)
- Calculate a triplet loss, that pushes positive examples closer together, and pulls negative examples farther apart. This is specifically done by defining a probabilistic representation of p(match | z1, z2), or, the probability of a match in 3D space given the embeddings of the two 2D poses. This is parametrized using a sigmoid with trainable parameters, as shown below
https://i.imgur.com/yFCCVuA.png
- They then calculate a distance kernel as the negative log of that probability, and calculate the basic triplet loss, which tries to maximize the difference between the distance between negative examples, and the distance between positive examples.
- They also add an additional loss further incentivizing the match probability to be higher on the positive pair (in addition to just pushing the positive and negative pair further apart)
- The final loss is a Gaussian prior loss, incentivizing the learned embeddings z to be in the shape of a Gaussian
https://i.imgur.com/SxvcvJG.png
This represents the central shape of the method. Some additional variations include:
- Camera Augmentation: Creating additional triplets by taking existing 3D poses and generating artificial pairs of 2D poses at different camera views
- Temporal Pose Embedding: Embedding multiple temporally connected poses, rather than just a single one
- Keypoint Dropout: To simulate situations where some keypoints are occluded, the authors tried training with some keypoints dropped out, either keypoints selected at random, or selected jointly and non-independently based on a model of which keypoints are likely to be occluded together
The authors found that their method was generally quite a bit stronger than prior approaches for the task of querying similar 3D poses given a 2D pose input, including some alternate methods that do direct 3D estimation. |
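As a rough illustration of the probabilistic-match-plus-triplet structure described above (not the authors' exact formulation: the parametrization of the sigmoid, the margin, and the loss weighting are all stand-ins):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchProbability(nn.Module):
    """p(match | z1, z2) modeled as a sigmoid of a scaled, shifted distance between
    the two embeddings; a and b are the trainable calibration parameters."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, z1, z2):
        d = (z1 - z2).pow(2).sum(dim=-1).sqrt()
        return torch.sigmoid(-self.a * d + self.b)

def pose_triplet_loss(match_prob, z_anchor, z_pos, z_neg, margin=1.0):
    """Distance kernel = -log p(match); push the negative pair's 'distance' to be at
    least `margin` larger than the positive pair's, and also directly push the
    positive pair's match probability up."""
    d_pos = -torch.log(match_prob(z_anchor, z_pos) + 1e-8)
    d_neg = -torch.log(match_prob(z_anchor, z_neg) + 1e-8)
    triplet = F.relu(d_pos - d_neg + margin).mean()
    positive_term = d_pos.mean()   # higher match prob on positives -> lower d_pos
    return triplet + positive_term

z_a, z_p, z_n = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
loss = pose_triplet_loss(MatchProbability(), z_a, z_p, z_n)
```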
[link]
Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server. Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:
- Sending a copy of the current model to each device/client
- Calculating a gradient update to be applied on top of that current model given a batch of data sampled from the client's device
- Sending that gradient back to the central server
- Averaging those gradients and applying them all at once to a central model
The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the central server, and performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be averaged with other clients' models. Specifically, their algorithm, FedAvg, works by:
- Dividing the data on a given device into batches of size B
- Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server
- Repeating this for E epochs
Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, when a given user and device might have very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met. The authors empirically ask whether these sorts of pathological cases arise when simulating a federated learning procedure over MNIST and a language model trained on Shakespeare, trying over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits).
https://i.imgur.com/xq9vi8S.png
They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communication than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device.) The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases. |
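A minimal sketch of one FedAvg communication round, assuming a simple classification model and standard PyTorch DataLoaders standing in for each client's on-device data. The weighted averaging by local dataset size follows the algorithm as described above; everything else is simplified.

```python
import copy
import torch

def client_update(global_model, data_loader, epochs, lr=0.01):
    """One client: start from the server model, run E local epochs of SGD over
    local batches of size B (set via the DataLoader), return the resulting weights."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict(), len(data_loader.dataset)

def fed_avg_round(global_model, client_loaders, epochs):
    """One communication round: every client trains locally, then the server averages
    the returned parameter vectors, weighted by local dataset size."""
    results = [client_update(global_model, dl, epochs) for dl in client_loaders]
    total = sum(n for _, n in results)
    avg_state = {k: sum(state[k].float() * (n / total) for state, n in results)
                 for k in results[0][0]}
    global_model.load_state_dict(avg_state)
    return global_model
```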
[link]
In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symbols, to use them in an effective way. This paper suggests the straightforward and clever approach of, instead of just having agents communicate using arbitrary vectors produced as part of a policy, having those communication vectors be directly linked to the content of an agent's observations. Specifically, this is done by taking the encoding of the image that is used for making policy decisions, and passing that encoding through an autoencoder, using the bottleneck at the middle of the autoencoder as the communication vector sent to other agents. This structure incentivizes the agent to generate communication vectors that are intrinsically grounded in the observation, enforcing a certain level of consistency that the authors hope makes it easier for the other agent to follow and interpret the communication.
https://i.imgur.com/u9OAZm8.png
Empirically, there seems to be fairly compelling evidence that this autoencoder-based form of grounding is more stable and thus more mutually learnable than learning from RL alone. The authors even found that adding RL training to the autoencoder-based training deteriorated performance. |
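A small sketch of the idea as I understand it: the message is just the autoencoder's bottleneck, and the reconstruction loss (rather than the RL objective) is what shapes it. Detaching the message from the policy gradient is my assumption about how the "not trained by RL" part would be implemented, not something verified against the authors' code.

```python
import torch
import torch.nn as nn

class AutoencodedMessage(nn.Module):
    """The agent's observation encoding is passed through an autoencoder; the
    bottleneck is broadcast as the communication vector, and the reconstruction
    loss is what trains the communication channel."""
    def __init__(self, obs_dim=128, msg_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, msg_dim))
        self.decoder = nn.Sequential(nn.Linear(msg_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))

    def forward(self, obs_encoding):
        message = self.encoder(obs_encoding)
        recon_loss = ((self.decoder(message) - obs_encoding) ** 2).mean()
        # message sent to the other agent; detached so RL gradients don't reshape it
        return message.detach(), recon_loss

msg, ae_loss = AutoencodedMessage()(torch.randn(4, 128))
```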
[link]
This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. The basic premise is:
- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding.
- Run all of these embeddings through a Transformer with shared weights for all three modalities
- Take the final projected CLS representation for the video patches, and perform contrastive learning against both an aligned audio patch, and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text
One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just said "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence length cost?" This obviously leads to some performance cost, but they found it not very dramatic. Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model. |
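DropToken itself is simple enough to sketch in a few lines (shapes and drop rate here are illustrative, not the paper's settings):

```python
import torch

def drop_token(tokens, drop_rate=0.5):
    """Randomly keep a subset of the (already embedded) input tokens, shrinking the
    sequence length and therefore the quadratic attention cost. tokens: [B, L, D]."""
    B, L, D = tokens.shape
    n_keep = max(1, int(L * (1 - drop_rate)))
    # sample a different random subset of positions for each example in the batch
    keep_idx = torch.rand(B, L).argsort(dim=1)[:, :n_keep]              # [B, n_keep]
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

x = torch.randn(4, 196, 768)        # e.g. 196 video-patch tokens
print(drop_token(x, 0.5).shape)     # torch.Size([4, 98, 768])
```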
[link]
This work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, they continue to use this basic Depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck. The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically, and naively, one could just try to use lower-dimensional internal representations to match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), and so having layers contain only the number of dimensions needed for the manifold would mean that you end up with too-few dimensions after the ReLU information loss. However, you need to have non-linearities somewhere in the network in order to be able to learn complex, non-linear functions. So, the authors suggest a method to mostly use smaller-dimensional representations internally, but still maintain ReLUs and the network's needed complexity.
https://i.imgur.com/pN4d9Wi.png
- A lower-dimensional output is "projected up" into a higher-dimensional output
- A ReLU is applied on this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection between the lower-dimensional output at the beginning and end of the expansion
This way, we still maintain the network's non-linearity, but also replace some of the network's higher-dimensional layers with lower-dimensional linear ones. |
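Here's a sketch of that block for the stride-1, equal-channel case (expansion factor, BatchNorm placement, and ReLU6 follow the common MobileNetV2-style recipe, but treat the details as illustrative):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand channels, apply the depthwise conv + ReLU in the wide space, then
    linearly project back down, with a residual between the two narrow ends."""
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),   # expand ("project up")
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                     # depthwise conv
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),   # linear bottleneck
            nn.BatchNorm2d(channels),                                 # note: no ReLU here
        )

    def forward(self, x):
        return x + self.block(x)   # residual between the low-dimensional ends

x = torch.randn(1, 24, 56, 56)
print(InvertedResidual(24)(x).shape)   # torch.Size([1, 24, 56, 56])
```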
[link]
I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way. Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background context, the previously-canonical Masked Language Model (MLM) task works by:
- Replacing some percentage of tokens with a [MASK] indicator
- Using the final-layer representation at the locations of those [MASK]s to predict the true input token
- Using as a training signal the Maximum Likelihood of that prediction, or, how high the model's predicted probability is on the true input.
ELECTRA authors argue that there are a few notable disadvantages to this structure, if your goal is to train useful representations for downstream tasks. Firstly, your loss only consists of information (i.e. the true token) from the tokens you randomly masked, so a good amount of the data is going in some sense unused (except as context). Secondly, learning a full generative model of language requires a lot of data and training time, and it may not be all that beneficial for performance on your downstream tasks of interest. As an alternative, they propose:
- Co-learning a (small) generator, trained in typical MLM fashion, alongside a discriminator. Randomly select tokens from the input to replace with fake tokens drawn from the distribution of the generator
- The goal of the discriminator is to distinguish the true tokens from the fake ones. (minor note: if the generator happens to get lucky and generate the real token, that's counted as a "real" rather than "fake" token, even though it was generated by a generator). This uses more of the training data in the loss, since you can ask "real or fake" for every token in the input data, not (obviously) just the ones that are actually fake
- An important note for those familiar with GANs is that the generator isn't trained to confuse the discriminator (as is GAN-standard), but is simply trained with its own maximum likelihood loss, independent of the discriminator's performance.
They argue, and show fairly convincingly, that ELECTRA is able to reach a higher efficiency-to-performance trade-off curve compared to BERT - matching the performance of previous models with notably less training, and outperforming them with comparable amounts of training. They go on to perform a few ablations, some of which felt more convincing than others. The most confusing ablation, which I'm not sure if I just misunderstood, was meant to ask how much of the value of ELECTRA came from calculating its loss over all the tokens in the training data, rather than just the masked ones. So, they tried just calculating the loss for the masked/replaced tokens. The resulting discriminator performs very poorly downstream. But, I find this a little odd as a design choice, since couldn't the discriminator learn to almost always predict that a replaced token was fake, since the only way it could be otherwise would be if the generator got lucky and produced the true word? They also did the (more sensible, to me) experiment of calculating the loss on a similarly-sized percentage of tokens, but not fully overlapping with the replacement mask, and that performed more similarly to base ELECTRA.
They also tested training a combined MLM/ELECTRA loss, where generated tokens were used in lieu of masking, and the full-sized MLM generator predicts the true token at every point in the sequence (which could be the token it gets as input, or could not be, in the case of a replacement). That model performed more similarly to ELECTRA than to BERT, which suggests that the efficiency gain of calculating a loss on every element in the training set was more important in practice than the gain from focusing a discriminator more directly on what was valuable for downstream tasks, rather than generating. |
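To make the two-loss structure concrete, here's a schematic sketch of how the generator's MLM loss and the discriminator's real-vs-replaced loss fit together. The `generator` and `discriminator` callables, the mask id, and the loss weighting are placeholders I've assumed for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, tokens, mask_id, mask_prob=0.15, disc_weight=50.0):
    """tokens: [B, L] integer token ids. The generator gets a normal MLM loss on the
    masked positions; the discriminator labels every position of the corrupted
    sequence as original vs. replaced."""
    masked = torch.rand(tokens.shape) < mask_prob
    gen_logits = generator(tokens.masked_fill(masked, mask_id))        # [B, L, vocab]
    mlm_loss = F.cross_entropy(gen_logits[masked], tokens[masked])

    # sample replacement tokens from the generator's output distribution
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked, sampled, tokens)
    # a lucky sample that matches the original token counts as "real", not "fake"
    is_replaced = (corrupted != tokens).float()

    disc_logits = discriminator(corrupted)                             # [B, L]
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    return mlm_loss + disc_weight * disc_loss
```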
[link]
This new architecture out of DeepMind combines information extraction and bottlenecks with a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed. Currently, self-attention models are quite powerful and capable, but because attention is quadratic-in-sequence-length in both time, and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This paper proposes what they call "cross-attention," where some smaller-dimensional latent vector attends to the input (the latent generates the queries, the input the keys and values). This lets the network pull information out of the larger-dimensional input into a smaller latent whose size is fixed by hyperparameter. From there, multiple self-attention layers are applied to generate a new latent, which can be fed back into the beginning of the process to query new information from the input, accounting for the "iterative" in the title of this work. The authors argue this approach lets them take larger inputs, and create deeper models, because the cost of each self-attention layer (going from latent-dim to latent-dim) is small and controlled. Like many other Transformer-based architectures, they use positional encodings, theirs based on Fourier features at different frequencies.
https://i.imgur.com/Wc8rzII.png
My overall take from the results presented is that it is competitive on many of the audio and vision tasks tested, with none of the convolutional priors that even something like Vision Transformer (which does coarse convolution-style preprocessing before going into Transformer layers) requires, though it didn't dramatically outperform the state-of-the-art on any of the tested tasks. One thing that was strange to me was that they didn't (at least in the main paper, haven't read the appendix) seem to evaluate on text, which would seem like an obvious benchmark if you're proposing a Transformer-alternate architecture. |
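A minimal sketch of the latent/cross-attention loop (dimensions, initialization, and the use of nn.MultiheadAttention are my own simplifications; the real model interleaves many more layers and adds Fourier positional features to the input):

```python
import torch
import torch.nn as nn

class PerceiverishBlock(nn.Module):
    """One iteration: a small set of learned latents cross-attends to the (large)
    input array, then refines itself with ordinary self-attention. The cross-attention
    costs O(n_latents * n_inputs) rather than O(n_inputs ** 2)."""
    def __init__(self, d_model=256, n_latents=64, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, inputs, latents=None):      # inputs: [B, n_inputs, d_model]
        if latents is None:
            latents = self.latents.unsqueeze(0).expand(inputs.shape[0], -1, -1)
        # latents provide the queries; the input provides keys and values
        latents = latents + self.cross_attn(latents, inputs, inputs)[0]
        latents = latents + self.self_attn(latents, latents, latents)[0]
        return latents    # can be fed back in as `latents` on the next iteration

x = torch.randn(2, 10_000, 256)     # e.g. flattened pixels with positional features
block = PerceiverishBlock()
z = block(x)                        # [2, 64, 256]
z = block(x, z)                     # second iterative step, re-querying the input
```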
[link]
This was an amusingly-timed paper for me to read, because just yesterday I was listening to a different paper summary where the presenter offhandedly mentioned the idea of compressing the sequence length in Transformers through subsequent layers (the way a ConvNet does pooling to a smaller spatial dimension in the course of learning), and it made me wonder why I hadn't heard much about that as an approach. And, lo, I came on this paper in my list the next day, which does exactly that. As a refresher, Transformers work by starting out with one embedding per token in the first layer, and, on each subsequent layer, they create new representations for each token by calculating an attention mechanism over all tokens in the prior layer. This means you have one representation per token for the full sequence length, and for the full depth of the network. In addition, you typically have a CLS token that isn't connected to any particular word, but is the designated place where sequence-level representations aggregate and are used for downstream tasks. This paper notices that many applications of trained transformers care primarily about that aggregated representation, rather than precise per-word representations. For cases where that's true, you're spending a lot of computation power on continually calculating the SeqLength^2 attention maps in later layers, when they might not be bringing you that much value in your downstream transfer tasks. A central reason why you do generally need per-token representations in training Transformers, though, even if your downstream tasks need them less, is that the canonical Masked Language Model and newer ELECTRA loss functions require token-level predictions for the specific tokens being masked. To accommodate this need, the authors of this paper structure their "Funnel" Transformer as more of an hourglass. This turns it into basically a VAE-esque Encoder/Decoder structure, where attention downsampling layers reduce the length of the internal representation, and then a "decoder" amplifies it back to the full sequence size, so you have one representation per token for training purposes (more on the exact way this works in a bit). The nifty thing here is that, for downstream tasks, you can chop off the decoder, and be left with a network with comparatively less computation cost per layer of depth.
https://i.imgur.com/WC0VQXi.png
The exact mechanisms of downsampling and upsampling in this paper are quite clever. To perform downsampling at a given attention layer, you take a sequence of representations h, and downsample it to h', of half the size, by mean-pooling adjacent tokens. However, in the attention calculation, you only use h' for the queries, and use the full sequence h for the keys and values. Essentially, this means that you have an attention layer where the downsampled representations attend to and pull information from the full scope of the (non-downsampled) representations of the layer below. This means you have a much more flexible downsampling operation, since the attention mechanism can choose to pull information into the downsampled representation, rather than it being calculated automatically by a pooling operation. The paper inflates the bottlenecked representations back up to the full sequence length by first tiling the downsampled representation (for example, if you had downsampled from 20 to 5, you would tile the first representation 4 times, then the second representation 4 times, and so on until you hit 20).
That tiled representation, which can roughly be thought to represent a large region of the sequence, is then added, ResNet-style, to the full-length sequence of representations that came out of the first attention layer, essentially combining shallow token-level representations with deep region-level representations. This aggregated representation is then used for token-level loss prediction. The authors benchmark against common baseline models, using deeper models with fewer tokens per layer, and find that they can reach similar or higher levels of performance with fewer FLOPs on text aggregation tasks. They fall short of full-sequence models for tasks that require strong per-token representations, which fits with my expectation. |
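The query-from-pooled, keys-and-values-from-full trick, plus the tiling-based upsampling, can be sketched compactly (a simplification; the real model wraps this in full transformer blocks and handles the CLS token separately):

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Funnel-style downsampling: queries come from the mean-pooled (half-length)
    sequence, while keys and values come from the full-length sequence, so the
    shorter representation can attend over everything it is summarizing."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h):                                   # h: [B, L, D], L even
        B, L, D = h.shape
        h_pooled = h.reshape(B, L // 2, 2, D).mean(dim=2)   # mean-pool adjacent tokens
        out, _ = self.attn(h_pooled, h, h)                  # queries: pooled; keys/values: full
        return h_pooled + out                               # [B, L/2, D]

h = torch.randn(2, 128, 256)
h_down = PoolingAttention()(h)                              # [2, 64, 256]
# decoder-side upsampling by tiling each region-level vector back out to full length
h_up = torch.repeat_interleave(h_down, 2, dim=1)            # [2, 128, 256]
```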
[link]
This summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first! The idea of a NERF is to learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), from a given viewing angle (theta, phi). With such a function, you can generate predictions of what images taken from certain angles would look like by sampling along a viewing ray, and integrating the combined hue and opacity into an aggregated view. This prediction can then be compared to a true image taken from that direction, and gradients passed backwards into the prediction model. An important assumption of this model is that the scene being photographed is static; specifically, that every point in space is always inhabited by the same part of the 3D object, regardless of what angle it's viewed from. This is a reasonable assumption for photos of inanimate objects, or of humans in highly controlled lab settings, but it is often not true for humans when you, say, ask them to take a selfie video of themselves. Even if they're trying to keep roughly still, there will be slight shifts in the location and position of their head between frames, and the authors of this paper show that this can lead to strange artifacts if you naively try to train a NERF from the images (including a particularly odd one where it hallucinates tiny copies of the image in the air surrounding the face).
https://i.imgur.com/IUVh6uM.png
The fix proposed by this paper is to apply a learnable deformation field to each image, where the notion is to deform each view into being in one canonical position (fixed per network, since, again, one network corresponds to a single scene). This means that, along with learning the parameters of the NERF itself, you're also learning what deformation to apply to each training image to get it into this canonical position. This is done by parametrizing the deformation in a particular way, and then having that deformation be conditioned by a latent vector that's trained similar to how you'd train an embedding (one learned vector per image example). The parametrization of the deformation is honestly a little bit over my head, given my lack of grounding in 3D modeling, but my general sense is that it applies some constraints and regularization to ensure that the learned deformations are realistic, insofar as humans are mostly rigid (one patch of skin on my forehead generally doesn't move except in concordance with the rest of my forehead), but with some possibility for elasticity (skin can stretch if I, say, smile). The authors also include an annealing scheme whereby, early in training, the model focuses on learning coarse (large-scale) deformations, and later in training, it's allowed to also learn weights for more precise deformations. This is to hopefully match macro-scale shifts before adding the noise of precise changes. This addition of a learned deformation is most of the contribution of this method: with it applied, they show that they're able to learn realistic NERFs from selfies, which they term "NERFIES".
They mention a few pieces of concurrent work that try to solve the same problem of non-static human subjects in different ways, but I haven't had a chance to read those, so I can't really comment on how NERFIES stacks up to alternate approaches; it does appear to be at least one empirically convincing solution to the problem it's aiming at. |
[link]
This summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first! At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image, where the neural network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full range of the image. To do this effectively, they use sinusoidal activation functions, which let them match not just the output of the neural network f(x, y) to the true image, but also the first and second derivatives of the neural network to the first and second derivatives of the true image, which provides a more robust training signal. NERFs build on this idea, but instead of trying to learn a continuous representation of an image (mapping from a 2D position to an RGB value), they try to learn a continuous representation of a scene, mapping from position (specified with three coordinates) and viewing direction (specified with two angles) to the RGB color at a given point in a 3D grid (or "voxel", analogous to "pixel"), as well as the *density* or opacity of that point. Why is this interesting? Because if you have a NERF that has learned a good underlying function of a particular 3D scene, you can theoretically take samples of that scene from arbitrary angles, even angles not seen during training. It essentially functions as a usable 3D model of a scene, but one that, because it's stored in the weights of a neural network, and specified in a continuous function, is far smaller than actually storing all the values of all the voxels in a 3D scene (the authors give an example of 5MB vs 15GB for a NERF vs a full 3D model). To get some intuition for this, consider that if you wanted to store the curve represented by a particular third-degree polynomial function between 0 and 10,000, it would be much more space-efficient to simply store the four coefficients of that polynomial, and be able to sample from it at your desired granularity at will, rather than storing many empirically sampled points from along the curve.
https://i.imgur.com/0c33YqV.png
How is a NERF model learned?
- The (x, y, z) position of each point is encoded as a combination of sine-wave, Fourier-style curves of increasingly higher frequency. This is similar to the positional encoding used by transformers. In practical terms, this means a location in space will be represented as a vector calculated as [some point on a low-frequency curve, some point on a slightly higher frequency curve..., some point on the highest-frequency curve]. This doesn't contain any more *information* than the (x, y, z) representation, but it does empirically seem to help training when you separate the frequencies like this
- You take a dataset of images for which viewing direction is known, and simulate sending a ray through the scene in that direction, hitting some line (or possibly tube?) of voxels on the way. You calculate the perceived color at that point, which is an integral of the color information and density/opacity returned by your model, for each point. Intuitively, if you have a high opacity weight early on, that part of the object blocks any voxels further in the ray, whereas if the opacity weight is lower, more of the voxels behind will contribute to the overall effective color perceived.
You then compare these predicted perceived colors to the actual colors captured by the 2D image, and train on the prediction error.
- (One note on sampling: the paper proposes a hierarchical sampling scheme to help with sampling efficiently along the ray, first taking a coarse sample, and then adding additional samples in regions of high predicted density)
- At the end of training, you have a network that hopefully captures the information from *that particular scene*. A notable downside of this approach is that it's quite slow for any use cases that require training on many scenes, since each individual scene network takes about 1-2 days of GPU time to train |
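A small sketch of the two mechanical pieces described in the bullets above: the Fourier-style positional encoding and the along-the-ray compositing of color and density. Frequencies, sample counts, and spacing are illustrative; the real model also feeds in viewing direction and uses the coarse/fine sampler.

```python
import math
import torch

def positional_encoding(x, n_freqs=10):
    """Map each coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] across
    frequencies. x: [..., 3] -> [..., 3 * 2 * n_freqs]."""
    freqs = 2.0 ** torch.arange(n_freqs) * math.pi            # [n_freqs]
    angles = x.unsqueeze(-1) * freqs                          # [..., 3, n_freqs]
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

def render_ray(rgb, sigma, deltas):
    """Numerical version of the integral along one ray: rgb [N, 3] and density
    sigma [N] at N samples, deltas [N] the spacing between consecutive samples.
    Early high-density samples occlude the ones behind them."""
    alpha = 1.0 - torch.exp(-sigma * deltas)                  # opacity of each sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                   # contribution of each sample
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)           # composited color, shape [3]

pts = torch.rand(64, 3)                   # 64 samples along one ray
print(positional_encoding(pts).shape)     # torch.Size([64, 60])
color = render_ray(torch.rand(64, 3), torch.rand(64), torch.full((64,), 0.02))
```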
[link]
[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic] I'm interested in learning more about Neural Radiance Fields (or NERFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinusoidal Representation Networks. In my view, the most complex part of understanding this technique isn't the technique itself, but the particularities of the problem being solved, and the ways it differs from a more traditional ML setup. Typically, the goal of machine learning is to learn a model that extracts and represents properties of a data distribution, and that can generalize to new examples drawn from that distribution. Instead, in this framing, a single network is being used to capture information about a single image, essentially creating a compressed representation of that image that brings with it some nice additional properties. Concretely, the neural network is representing a function that maps inputs of the form (x, y), representing coordinates within the image, to (r, g, b) values, representing the pixel values of the image at that coordinate. If you're able to train an optimal version of such a network, it would mean you have a continuous representation of the image. A good way to think about "continuous," here, is that you could theoretically ask the model for the color value at pixel (3.5, 2.5), and, given that it's simply a numerical mapping, it could give you a prediction, even though in your discrete "sampling" of pixels, that pixel never appears. Given this problem setting, the central technique proposed by SIRENs is to use sinusoidal non-linearities between the layers. On the face of it, this may seem like a pretty weird choice: non-linearities are generally monotonic, and a sine wave is absolutely not that. The appealing property of sinusoidal activations in this context is: if you take a derivative of a sine curve, what you get is a cosine curve (which is essentially a shifted sine curve), and the same is true in reverse. This means that you can take multiple derivatives of the learned function (where, again, "learned function" is your neural network optimized for this particular image), and have them still be networks of the same underlying format, with shifting constants. This allows SIRENs to use an enhanced version of what would be a typical training procedure for this setting. Simplistically, the way you'd go about training this kind of representation would be to simply give the inputs, and optimize against a loss function that reduced your prediction error in predicting the output values, or, in other words, the error on the f(x, y) function itself. When you have a model structure that makes it easy to take first and second derivatives of the function calculated by the model, you can, as this paper does, decide to train against a loss function of matching, not just the true f(x, y) function (again, the pixel values at coordinates), but also the first and second derivatives (gradients and Laplacian) of the image at those coordinates. This supervision lets you learn a better underlying representation, since it enforces not just what comes "above the surface" at your sampled pixels, but the dynamics of the true function between those points.
One interesting benefit of this procedure of using loss in a first or second derivative space (as pointed out in the paper), is that if you want to merge the interesting parts of multiple images, you can approximate that by training a SIREN on the sum of their gradients, since places where gradients are zero likely don't contain much contrast or interesting content (as an example: a constant color background). The Experiments section goes into a lot of specific applications in boundary-finding problems, which I understand at less depth, and thus won't try to explain. It also briefly mentions trying to learn a prior over the space of image functions (that is, a prior over the set of network weights that define the underlying function of an image); having such a prior is interesting in that it would theoretically let you sample both the implicit image function itself (from the prior), and then also points within that function. |
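A minimal sketch of the two ingredients, sine activations and derivative supervision via autograd (the w0 frequency factor and the paper's careful initialization scheme are only gestured at; the targets below are random stand-ins for a real image's pixels and gradients):

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by sin(w0 * x); derivatives of the whole network stay in
    the same family (shifted sinusoids), which is what makes supervising on image
    gradients / Laplacians tractable."""
    def __init__(self, d_in, d_out, w0=30.0):
        super().__init__()
        self.linear, self.w0 = nn.Linear(d_in, d_out), w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

siren = nn.Sequential(SineLayer(2, 128), SineLayer(128, 128), nn.Linear(128, 3))

coords = torch.rand(1024, 2, requires_grad=True)      # sampled (x, y) locations
pred_rgb = siren(coords)
# first-derivative supervision: gradient of the (summed) output w.r.t. input coordinates
grads = torch.autograd.grad(pred_rgb.sum(), coords, create_graph=True)[0]
target_rgb, target_grads = torch.rand(1024, 3), torch.rand(1024, 2)   # stand-ins for real data
loss = ((pred_rgb - target_rgb) ** 2).mean() + ((grads - target_grads) ** 2).mean()
loss.backward()
```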
[link]
This is an interesting paper, investigating (with a team that includes the original authors of the Lottery Ticket paper) whether the initializations that result from BERT pretraining have Lottery Ticket-esque properties with respect to their role as initializations for downstream transfer tasks. As background context, the Lottery Ticket Hypothesis came out of an observation that trained networks could be pruned to remove low-magnitude weights (according to a particular iterative pruning strategy that is a bit more complex than just "prune everything at the end of training") down to high levels of sparsity (5-40% of original weights), and that those pruned networks not only perform well at the end of training, but also can be "rewound" back to their initialization values (or, in some cases, values from early in training) and retrained in isolation, with the weights you pruned out of the trained network still set to 0, to a comparable level of accuracy. This is thought of as a "winning ticket" because the hypothesis Frankle and Carbin generated is that the reason we benefit from massively overparametrized neural networks is that we are essentially sampling a large number of small subnetworks within the larger ones, and that the more samples we get, the likelier it is we find a "winning ticket" that starts our optimization in a place conducive to further training. In this particular work, the authors investigate a slightly odd variant of the LTH. Instead of looking at training runs that start from random initializations, they look at transfer tasks that start their learning from a massively-pretrained BERT language model. They try to find out:
1) Whether you can find "winning tickets" as subsets of the BERT initialization for a given downstream task
2) Whether those winning tickets generalize, i.e. whether a ticket/pruning mask for one downstream task can also have high performance on another. If that were the case, it would indicate that much of the value of a BERT initialization for transfer tasks could be captured by transferring only a small percentage of BERT's (many) weights, which would be beneficial for compression and mobile applications
An interesting wrinkle in the LTH literature is the question of whether true "winning tickets" can be found (in the sense of the network being able to retrain purely from the masked random initializations), or whether it can only retrain to a comparable accuracy by rewinding to an early stage in training, but not the absolute beginning of training. Historically, the former has been difficult and sometimes not possible to find in more complex tasks and networks.
https://i.imgur.com/pAF08H3.png
One finding of this paper is that, when your starting point is BERT initialization, you can indeed find "winning tickets" in the first sense of being able to rewind the full way back to the beginning of (downstream task) training, and retrain from there. (You can see this above with the results for IMP, Iterative Magnitude Pruning, rolling back to theta-0). This is a bit of an odd finding to parse, since it's not like BERT really is a random initialization itself, but it does suggest that part of the value of BERT is that it contains subnetworks that, from the start of training, are in notional optimization basins that facilitate future training.
A negative result in this paper is that, by and large, winning tickets on downstream tasks don't transfer from one to another, and, to the extent that they do transfer, it mostly seems to be according to which tasks had more training samples used in the downstream mask-finding process, rather than any qualitative properties of the task. The one exception to this was if you did further training of the original BERT objective, Masked Language Modeling, as a "downstream task", and took the winning ticket mask from that training, which then transferred to other tasks. This is some validation of the premise that MLM is an unusually good training task in terms of its transfer properties. An important thing to note here is that, even though this hypothesis is intriguing, it's currently quite computationally expensive to find "winning tickets", requiring an iterative pruning and retraining process that takes far longer than an original training run would have. The real goal here, which this is another small step in the hopeful direction of, is being able to analytically specify subnetworks with valuable optimization properties, without having to learn them from data each time (which somewhat defeats the point, if they're only applicable for the task they're trained on, though it is potentially useful if they do transfer to some other tasks, as has been shown within a set of image-prediction tasks). |
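For reference, the iterative magnitude pruning loop with rewinding (the expensive part) looks roughly like this. `train_fn`, the per-tensor pruning, and the pruning fraction are simplifications I've assumed for illustration, not the exact protocol from the paper.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Repeatedly: fine-tune, prune the lowest-magnitude surviving weights, rewind the
    survivors to their initial (here: pre-fine-tuning BERT) values, and repeat.
    `train_fn(model)` is assumed to fine-tune the model in place on the downstream task."""
    theta_0 = copy.deepcopy(model.state_dict())
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model)
        for n, p in model.named_parameters():
            alive = p.detach().abs() * masks[n]
            k = int(prune_frac * int(masks[n].sum()))
            if k > 0:
                threshold = alive[masks[n].bool()].flatten().kthvalue(k).values
                masks[n] = masks[n] * (alive > threshold).float()
        # rewind surviving weights to their original values, zero out the pruned ones
        model.load_state_dict(theta_0)
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.mul_(masks[n])
    return masks
```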
[link]
This is a nice, compact paper testing a straightforward idea: can we use the contrastive loss structure so widespread in unsupervised learning as a framework for generating and training against adversarial examples? In the context of the adversarial examples literature, adversarial training - or, training against examples that were adversarially generated so as to maximize the loss of the model you're training - is the primary strategy used to train robust models (robust here in the sense of not being susceptible to said adversarial attacks). Typically, these attacks are generated with the use of class labels, since they are meant to attack supervised classifiers that assign a class label to an image. Therefore, the goal of the adversarial attack is to push down the probability of the correct class label (either in favor of a specific alternate class, or just in favor of any class that isn't the true one). However, labels are hard and expensive, so, one wonders: in the same way that you can learn representations from unlabeled data, can you also make those representations (otherwise referred to as "embeddings") robust in a similarly label-free way? This paper tests an approach that does so in a quite simple way, by just generating adversarial examples against your contrastive loss target. This works by:
1) Taking an image, and generating two augmentations (or transformations) of it. This is part of the standard contrastive pipeline
2) Applying an adversarial perturbation to one of those transformations, where the perturbation is optimized to maximize the contrastive loss (ability to differentiate an augmented version of the same image from augmented versions of other images)
3) Training on that adversarial sample to generate more robustness
https://i.imgur.com/ttF6k1A.png
And this simple approach appears to work quite well! They find that, in defending against supervised adversarial attacks, it performs comparably to supervised adversarial training, and that it has the added benefits of (1) slightly higher accuracy on clean examples (in general, robustness is known to decrease your clean-sample accuracy), and (2) better robustness against attack types other than the attack type used for the adversarial training. It also achieves better transfer performance (that is, adversarially training on one dataset, and then evaluating robustness on another) than a supervised method, when evaluated on both CIFAR10 → CIFAR100 and CIFAR100 → CIFAR10. This does make pretty good sense to me, since instance-level stability does seem like it's getting at a more fundamental set of invariances that would transfer better to different distributions of classes. |
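A sketch of the core loop: generating the perturbation by gradient ascent on an NT-Xent-style contrastive loss rather than on a supervised loss. The step sizes, epsilon, and loss details are illustrative, not the paper's exact settings, and `encoder` is an assumed embedding network.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """Simple contrastive loss where z1[i] and z2[i] are views of the same image."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # [B, B] similarity matrix
    labels = torch.arange(z1.shape[0])
    return F.cross_entropy(logits, labels)

def adversarial_view(encoder, view1, view2, epsilon=8 / 255, steps=5, step_size=2 / 255):
    """Perturb view2 to *maximize* the contrastive loss against view1 (instance-level
    attack, no labels needed); the perturbed view is then used for training."""
    delta = torch.zeros_like(view2, requires_grad=True)
    for _ in range(steps):
        loss = contrastive_loss(encoder(view1), encoder(view2 + delta))
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()               # gradient *ascent* on the loss
            delta.clamp_(-epsilon, epsilon)
    return (view2 + delta).detach().clamp(0, 1)
```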
[link]
The premise of contrastive loss is that we want to push together the representations of objects that are similar, and push dissimilar representations farther apart. However, in an unlabeled setting, we don't generally have class labels to tell which images (or objects in general) are supposed to be similar or dissimilar along the axes that matter to us, so we use the shortcut of defining some transformation on a given anchor frame that gets us a frame we're confident is related enough to that anchor that it can be considered a "positive" or target similarity-wise. Some of these transformations are data augmentations performed on a frame, or choosing temporally adjacent frames in a video sequence (which, since the real world evolves smoothly, are assumed to be similar). Anyhow, all of this is well and good, except for the fact that, especially in an image classification setting like CIFAR or ImageNet, sampling randomly from the other images in a given batch doesn't give you a set of things that are entirely "negatives" in terms of being dissimilar to the anchor image. It is true that most of the objects you get by sampling randomly are negatives (especially in a many-class setting), but some of them will be other samples from the same class. By treating all of those as negatives, we penalize the model for having representations of them that are close to our anchor representation, even though, for many downstream tasks, we'd probably prefer elements of the same class to have more similar representations. However, the whole premise of the unsupervised setting is that we don't have class labels, so we don't know, for a given sample from the batch (of things that aren't specifically transformations of the anchor) whether it's an actual negative or secretly a positive (i.e. of the same class). And, that's true, but this paper argues that, even if you can't identify which specific elements in a batch are secret positives, you can try to account for them in aggregate, if you have some reasonably good estimate of the overall class probabilities, which will tell you how many positives you expect to find in a given batch in expectation. Given that, they reformulate the loss to be "debiased". They do this by noting that the expectation over "negatives" in the denominator is actually a sample over the full p(x), not just the distribution over negatives, and trying to make it a better estimate of the actual distribution over negatives.
https://i.imgur.com/URN4RBF.png
This they accomplish by writing out the full p(x) as a weighted combination of the distributions over positive and negative (which here is "every class that doesn't match the anchor"), as shown above, and noticing that you can represent the negative part of the distribution by taking the full distribution, and subtracting out the positive distribution (which we have an estimator for by construction, with our transformations), weighted by the prior over how frequent the positives are in our overall distribution.
https://i.imgur.com/5IgGIhu.png
This leads to a change of estimating the similarity between the anchor and positives (which we already have in the numerator, but which we can also calculate with more augmentations/positive samples to get a better estimate) and doing a (weighted) subtraction of that from the similarity over negative examples.
Intuitively, we keep in the part where we penalize similarity with negatives (by adding magnitude to the denominator), but reduce that penalty in accordance with how much we think that "similarity with negatives" is actually similarity with other positives in the batch, which we actually would like to keep around. https://i.imgur.com/kUGoemA.png https://i.imgur.com/5Gitdi7.png In terms of experimental results, my read is that this is most useful on problems - like CIFAR10 and STL10 - that don't have many classes (they each, per their names, have 10). The results there are meaningfully stronger than for the 200-class ImageNet. And, that makes pretty good intuitive sense, since you would expect the scale of the "secret positives in our random sample of images" bias problem to be a lot more acute in a setting where we've got a 1 in 10 chance of sampling a same-class image, compared to a 1-in-200 chance. |
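Here's roughly what the debiased estimator looks like in code, as I understand it: subtract the prior-weighted positive term from the naive negative term, then clip at the estimator's theoretical minimum. The `tau_plus` prior and the exact clipping constant follow my reading of the paper, so treat this as a sketch.

```python
import math
import torch

def debiased_negative_term(sim_neg, sim_pos, tau_plus=0.1, t=0.5):
    """sim_neg: [B, N] similarities between each anchor and N other batch samples
    (a mix of true negatives and secret positives). sim_pos: [B, M] similarities to
    M known positives (augmentations). tau_plus: assumed prior probability that a
    random batch sample shares the anchor's class."""
    N = sim_neg.shape[1]
    neg_exp = torch.exp(sim_neg / t).mean(dim=1)       # expectation over the full p(x)
    pos_exp = torch.exp(sim_pos / t).mean(dim=1)       # expectation over positives
    # subtract the estimated positive contamination, clip at the theoretical minimum
    g = (neg_exp - tau_plus * pos_exp) / (1.0 - tau_plus)
    return N * torch.clamp(g, min=math.exp(-1.0 / t))

def debiased_contrastive_loss(sim_anchor_pos, sim_neg, sim_pos, tau_plus=0.1, t=0.5):
    pos = torch.exp(sim_anchor_pos / t)                # [B]
    denom = pos + debiased_negative_term(sim_neg, sim_pos, tau_plus, t)
    return -torch.log(pos / denom).mean()
```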
[link]
Large-scale transformers on unsupervised text data have been wildly successful in recent years; arguably, the most successful single idea in the last ~3 years of machine learning. Given that, it's understandable that different domains within ML want to take their shot at seeing whether the same formula will work for them as well. This paper applies the principles of (1) transformers and (2) large-scale unlabeled data to the problem of learning informative embeddings of molecular graphs. Labeling is a problem in much of machine learning - it's costly, and narrowly defined in terms of a certain task - but that problem is even more acute when it comes to labeling properties of molecules, since they typically require wetlab chemistry to empirically measure. Given that, and also given the fact that we often want to predict new properties - like effectiveness against a new targetable drug receptor - that we don't yet have data for, finding a way to learn and transfer from unsupervised data has the potential to be quite valuable in the molecular learning sphere. There are two main conceptual parts to this paper and its method - named GROVER, in true-to-ML-form tortured acronym style. The first is the actual architecture of their model itself, which combines both a message-passing Graph Neural Network to aggregate local information, and a Transformer to aggregate global information. The paper was a bit vague here, but the way I understand it is:
https://i.imgur.com/JY4vRdd.png
- There are parallel GNN + Transformer stacks for both edges and nodes, each of which outputs both a node and edge embedding, for four embeddings total. I'll describe the one for nodes, and the parallel for edges operates the same way, except that hidden states live on edges rather than nodes, and attention is conducted over edges rather than nodes
- In the NodeTransformer version, a message passing NN (of I'm not sure how many layers) performs neighborhood aggregation (aggregating the hidden states of neighboring nodes and edges, then weight-transforming them, then aggregating again) until each node has a representation that has "absorbed" information from a few hops out of its surrounding neighborhood. My understanding is that there is a separate MPNN for queries, keys, and values, and so each node ends up with three different vectors for these three things.
- Multi-headed attention is then performed over these node representations, in the normal way, where all keys and queries are dot-product-ed together, and put into a softmax to calculate a weighted average over the values
- We now have node-level representations that combine both local and global information. These node representations are then aggregated into both node and edge representations, and each is put into an MLP layer and Layer Norm before finally outputting a node-based node and edge representation. This is then joined by an edge-based node and edge representation from the parallel stack. These are aggregated on a full-graph level to predict graph-level properties
https://i.imgur.com/NNl6v4Y.png
The other component of the GROVER model is the way this architecture is actually trained - without explicit supervised labels. The authors use two tasks - one local, and one global.
The local task constructs labels based on local contextual properties of a given atom - for example, the atom here has one double-bonded Nitrogen and one single-bonded Oxygen in its local environment - and tries to predict those labels given the representations of that atom (or node). The global task uses RDKit (an analytically constructed molecular analysis kit) to identify 85 different motifs or functional groups in the molecule, and encodes those into an 85-long one-hot vector that is predicted at the graph level. https://i.imgur.com/jzbYchA.png With these two components, GROVER is pretrained on 10 million unlabeled molecules, and then evaluated in transfer settings where its representations are fine-tuned on small amounts of labeled data. The results are pretty impressive - it achieves new SOTA performance by relatively large margins on all tasks, even relative to existing semi-supervised pretraining methods that similarly have access to more data. The authors perform ablations to show that it's important to do the graph-aggregation step before a transformer (the alternative being just doing a transformer on raw node and edge features), and show that their architecture without pretraining (just used directly in downstream tasks) performs worse. One thing I wish they'd directly ablated was the value-add of the local (also referred to as "contextual") and global semi-supervised tasks. Naively, I'd guess that most of the performance gain came from the global task, but it's hard to know without them having done the test directly. |
[link]
I tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand: 1. The goal of the method is to learn disentangled representations, and, specifically, to learn representations that correspond to factors of variation in the environment that are selected by humans. That means, we ask humans whether a given image is higher or lower on a particular relevant axis, and aggregate those rankings into a vector, where a particular index of the vector corresponds to a particular factor. Given a small amount of supervision, the hope is to learn an encoder that takes in an image, and produces a Z code that encodes where the image is on that particular axis 2. With those disentangled representations, the authors hope they can learn goal-conditioned policies, where the distance between the current image's representation and the goal image's representation can serve as a reward. In particular, they're trying to show that their weakly supervised disentangled representation performs better as a metric space to do goal-conditioning distance calculations in, relative to other learned spaces 3. The approach uses a GAN-based design, where a generator generates the images that correspond with a given z1 and z2, and the discriminator tries to tell the difference between the two real images, paired with their supervision vector, and two generated images, with their fake supervision vector [Here is the relevant equation, along with some notation-explaining text] https://i.imgur.com/XNbxK6i.png The thing I'm confused by is the actual mechanism for why (3) gets you disentangled representations. To my understanding, the thing the generator should be trying to do is generate images whose relationship to one another is governed by the relationship between z1 and z2; if z is really capturing your factors of variation, the two images should differ in places and in ways governed by where those z values are different. Based on this, I'd expect the fake supervision vector here to be some kind of binarized element-wise difference between the two (randomly sampled) vectors, z1 and z2. But the authors claim that the fake supervision vector that the generator is trying to replicate is just the zero vector. That seems like it would just result in the generator trying to generate images that don't differ on any axes, with two different z vectors as input. |
[link]
This is a really cool paper that posits a relatively simple explanation for the strange phenomenon known as double descent - both the fact of seeing it in the first place, and the difficulty in robustly causing it to appear. In the classical wisdom of statistics, increasing model complexity too far will lead to an increase in variance, and thus an increase in test error (or "test risk" or "empirical risk"), leading to a U-shaped test error curve as a function of model complexity. Double descent is the name given to the observation that, in modern neural networks, we tend to not see this phenomenon, and, in fact, sometimes see test error first increasing but then descending again below its initial minimum. Test error going up, and then back down again: double descent. However, this phenomenon proved to be a bit elusive: often in order to see it, you had to add artificial noise to your labels. This paper provides a cohesive theory for both the existence of double descent, and the fact that it sometimes can only be elicited with increased label noise. They empirically estimate the bias and variance components of test error for a range of neural nets on a range of datasets, and show that when they directly estimate bias and variance this way, they see bias decreasing (or, at least, non-increasing) monotonically with model complexity, as expected. But they also see variance, rather than strictly increasing with model complexity, exhibiting unimodal behavior, where it first increases, and then decreases, as a function of model complexity. Taking a step back, bias is here understood as the component of your test error that comes from the difference between your expected learned estimator and the true underlying function. Variance is the expected squared difference between the estimator learned on a particular split of the data and that expected learned estimator (that is, the one you get if you average over different splits in the data). The actual estimator you get is a function of both your average estimator, and the particular estimator you draw in the distribution around that average, which is defined by the variance. The authors empirically measure bias and variance by conducting k different N-way splits of their datasets, and averaging these k*N estimates to get an average or expected estimator. Given that, they can (as shown below) take the variance to be the average squared difference between the k*N individual estimators and that average. Since test error is composed of bias + variance, we can then simply calculate bias as whatever remains of test error when variance has been accounted for. https://i.imgur.com/VPzujaZ.png This provides an elegant explanation for the different relationships we see between complexity and test error. In regimes where the decrease in bias from additional complexity is much larger than the increase in variance - which they argue is the case in modern deep networks - we don't see double descent, because the "bump" due to the variance peak is overshadowed by the continuing decrease in bias. However, in regimes where the overall scale of variance (at all levels of complexity) is higher, we see the increasing variance overwhelming the decreasing bias, and test error increases (before, ultimately, going down again, after the variance peaks). 
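As a concrete (if toy) version of that estimation procedure, the sketch below assumes a squared-error decomposition and a `train_and_predict` function you would supply; it averages predictions over the k*N trained models, takes variance as the mean squared deviation from that average, and backs bias out as whatever is left of the overall risk.

```python
import numpy as np

def estimate_bias_variance(train_and_predict, splits, x_test, y_test):
    """Sketch of the split-based bias/variance estimate (squared-error version).

    train_and_predict: callable taking (training_split, x_test) and returning
        predictions of shape (n_test,). Assumed to be supplied by the user.
    splits: the k*N training splits of the dataset.
    """
    preds = np.stack([train_and_predict(s, x_test) for s in splits])  # (k*N, n_test)
    avg_pred = preds.mean(axis=0)                # the "expected estimator"

    # Variance: average squared deviation of each individual estimator from
    # the averaged estimator, then averaged over test points.
    variance = ((preds - avg_pred) ** 2).mean()

    # Total risk: average squared error of the individual estimators.
    risk = ((preds - np.asarray(y_test)) ** 2).mean()

    # Bias^2 is whatever part of the risk the variance doesn't account for
    # (this toy version lumps irreducible noise in with the bias term).
    return risk - variance, variance
```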
This explains why double descent has previously appeared preferentially in cases of injected label noise: more label noise means higher irreducible variability in the model learned from different sets of data, which makes the scale of the variance peak more pronounced compared to the bias drop. In addition to their empirical work, the authors also analytically analyze a two-layer linear neural network, and show that you would theoretically expect a peaked variance shape in that setting. In a certain sense, this just pushes the problem down the road, since the paper doesn't explain why, in any kind of conceptual or statistical sense, we would expect variance to be unimodal in this way. (They do offer a conjecture, but it was not the main thrust of the paper, and I didn't fully follow it). However, it does offer conceptual clarity into a previously somewhat more murky empirical phenomenon, and hopefully will let us focus on understanding why variance behaves in this way. |
[link]
Offline reinforcement learning is a potentially high-value thing for the machine learning community to learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and be able to learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions. There are two broad strategies for learning a policy from precollected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe-or-low-reward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior. Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the precollected transitions. This can also have some downsides, mostly in the direction of overconfidence. Q value Temporal Difference learning works by using the current reward added to the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest-value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset). In on-policy Q learning, this is less problematic, because the agent can take the action associated with their noisily inaccurate estimate, and as a result get more data for that action, and get an estimate that is less noisy in the future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, but just risky. The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two possible approaches. The method learns a Q function as normal, using temporal difference methods. The distinction in this method comes from how to get a policy, given a learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does. https://i.imgur.com/jGmhYdd.png This basically amounts to a form of a behavioral cloning loss (with the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy. 
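In code, my understanding of that filtered-behavioral-cloning objective looks roughly like the sketch below. Here `policy` is assumed to return a torch distribution over actions and `q_net` a scalar Q estimate per state-action pair; the binary/exponential filters, the sampled-action value baseline, and the temperature `beta` are my paraphrase of the paper's formula rather than a line-for-line port.

```python
import torch

def crr_policy_loss(policy, q_net, states, demo_actions, f='exp', beta=1.0, n_samples=4):
    """Sketch of a CRR-style policy update on a batch of demonstrator transitions."""
    # Log-probability of the demonstrator's actions under the current policy.
    # This term alone would be plain behavioral cloning.
    dist = policy(states)
    log_prob = dist.log_prob(demo_actions)

    with torch.no_grad():
        q_demo = q_net(states, demo_actions)
        # Crude value baseline: average Q over a few actions sampled from the policy.
        sampled_q = torch.stack([q_net(states, dist.sample()) for _ in range(n_samples)])
        advantage = q_demo - sampled_q.mean(dim=0)
        if f == 'binary':
            weight = (advantage > 0).float()                      # keep only better-than-average actions
        else:
            weight = torch.exp(advantage / beta).clamp(max=20.0)  # softer up/down-weighting

    # Weighted behavioral cloning: high-advantage demonstrator actions count more.
    return -(weight * log_prob).mean()
```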
Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration. The `binary` version is an indicator function for the Advantage of an action (the Q value for that action at that state minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings, which do a more "soft" upweighting or downweighting of transitions based on advantage, rather than the sharp binary cutoff of whether an action's advantage is above zero. The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, since the trade-off of fewer but higher-quality samples to learn from works there. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights); this introduces some potential mimicking of lower-quality transitions, but in exchange gives a larger effective dataset size to learn from. |
[link]
This paper is ultimately relatively straightforward, for all that it's embedded in the somewhat new-to-me literature around graph-based Neural Architecture Search - the problem of iterating through options to find a graph representing an optimized architecture. The authors want to understand whether in this problem, as in many others in deep learning, we can benefit from building our supervised models off of representations learned during an unsupervised pretraining step. In this case, the unsupervised pretraining is conceptually simple - a variational autoencoder - even though the components of the VAE are more complex by dint of being made up of graph networks. This autoencoder, termed arch2vec, is trained on a simple reconstruction loss, and uses the Graph Isomorphism Network (or, GIN) architecture in its encoder, rather than a more typical Graph Convolutional Network. I don't feel like I fully follow the intuitive difference between these two structures, but have the general sense that GIN architectures are simpler: they calculate a weighted sum of the central node's features with the features of neighboring nodes, rather than learning a function of the full concatenated (current_node, sum_of_neighbors) vector (for reference, I've put the textbook GIN and GCN updates in a short sketch below). First, the authors investigate the quality of their embedding space, compared to the representation implicitly learned by doing end-to-end supervised (i.e. with accuracies as labels) NAS. They show that (1) distances in their continuous embedding space correlate more strongly with the edit distance between graphs, compared to the embedding learned by the supervised model, and that (2) their embedding fills more of the space (as would be expected from the KL regularization term) and leads to high-performing networks being naturally concentrated within the space. https://i.imgur.com/SavZnce.png Looking into their representation as an initialization point, they demonstrate that their initializations do lead to lower long-term regret over the course of the architecture search process, particularly differentiating themselves from random initializations at the end of training. https://i.imgur.com/4DG7lZd.png The authors argue that this is because the representations learned by the supervised methods are "biased towards weight-free operations, which are often preferred in the early stage of the search process, resulting in lower final accuracies." I admit I don't fully understand why this would be true, though they do cite a few papers they say demonstrate it. My initial thought was that weight-free architectures would overperform early in the training of each individual network, but my understanding was that the dataset used here is a labeled static dataset of architectures and accuracies, so the within-training-run dynamics wouldn't obviously play a role. Nevertheless, there does seem to be empirical benefit that comes from using these pretrained representations, even if I don't follow the intuition behind it fully. |
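For reference - and hedged by the fact that I'm going from the original GIN and GCN papers rather than from this one - the textbook versions of the two node updates look like this: GIN sums a node's own features with its neighbors' and pushes the result through an MLP, while a GCN-style layer does a degree-normalized neighborhood average followed by a single linear map.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """Standard GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor h_u)."""

    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = eps
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, adj):
        # adj: (n_nodes, n_nodes) 0/1 adjacency; adj @ h sums neighbor features.
        return self.mlp((1 + self.eps) * h + adj @ h)

class GCNLayer(nn.Module):
    """GCN-style update, for contrast: normalized neighborhood mixing, one linear map."""

    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, h, adj):
        adj_hat = adj + torch.eye(adj.shape[0])        # add self-loops
        deg = adj_hat.sum(-1, keepdim=True)
        return torch.relu(self.lin(adj_hat @ h / deg))
```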
[link]
This is a nice little empirical paper that does some investigation into which features get learned during the course of neural network training. To look at this, it uses a notion of "decodability", defined as the accuracy with which you can train a linear model to predict a given conceptual feature on top of the activations/learned features at a particular layer (essentially a linear probe; sketched at the end of this summary). This idea captures the amount of information about a conceptual feature that can be extracted from a given set of activations. They work with two synthetic datasets. 1. Trifeature: Generated images with a color, shape, and texture, which can be engineered to be either entirely uncorrelated or correlated with each other to varying degrees. 2. Navon: Generated images that are letters on the level of shape, and are also composed of letters on the level of texture. The first thing the authors investigate is: to what extent are the different properties of these images decodable from their representations, and how does that change during training? In general, decodability is highest in lower layers, and lowest in higher layers, which makes sense from the perspective of the Data Processing Inequality, since all the information is present in the pixels, and can only be lost as activations pass through successive layers, not gained. They find that decodability of color is high, even in the later layers of untrained networks, and that the decodability of texture and shape, while much less high, is still above chance. When the network is trained to predict one of the three features attached to an image, you see the decodability of that feature go up (as expected), but you also see the decodability of the other features go down, suggesting that training doesn't just involve amplifying predictive features, but also suppressing unpredictive ones. This effect is strongest in the Trifeature case when training for shape or color; when training for texture, the dampening effect on color is strong, but on shape is less pronounced. https://i.imgur.com/o45KHOM.png The authors also performed some experiments on cases where features are engineered to be correlated to various degrees, to see which of the predictive features the network will represent more strongly. In the case where two features are perfectly correlated (and thus both perfectly predict the label), the network will focus decoding power on whichever feature had highest decodability in the untrained network, and, interestingly, will reduce decodability of the other feature (not just have it be lower than the chosen feature, but decrease it in the course of training), even though it is equally predictive. https://i.imgur.com/NFx0h8b.png Similarly, the network will choose the "easy" feature (the one more easily decodable at the beginning of training) even if there's another feature that is slightly *more* predictive available. This seems quite consistent with the results of another recent paper, Shah et al, on the Pitfalls of Simplicity Bias in neural networks. The overall message of both of these experiments is that networks generally 'put all their eggs in one basket,' so to speak, rather than splitting representational power across multiple features. There were a few other experiments in the paper, and I'd recommend reading it in full - it's quite well written - but I think those convey most of the key insights for me. |
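Concretely, "decodability" as described here is just linear-probe accuracy - something like the following, where `get_activations` is a stand-in for whatever hook extracts a chosen layer's activations from the frozen network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def decodability(get_activations, images, feature_labels, layer):
    """Accuracy of a linear probe predicting a conceptual feature (e.g. shape)
    from the frozen activations of one layer."""
    acts = np.stack([get_activations(img, layer).ravel() for img in images])
    labels = np.asarray(feature_labels)
    n_train = int(0.8 * len(acts))  # simple holdout split for the probe itself
    probe = LogisticRegression(max_iter=1000).fit(acts[:n_train], labels[:n_train])
    return probe.score(acts[n_train:], labels[n_train:])
```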
[link]
This paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly when the unsupervised components of the loss for each example are just added together directly). Instead, it tries to learn weights for each specific data example, through a meta-learning-esque process. The form of semi-supervised learning being discussed here is label-based consistency loss, where an unlabeled image is augmented and run through the current version of the model, and the model is optimized to produce the same prediction for the augmented image as for the unaugmented one. The premise of the authors' argument for learning per-example weights is that, ideally, you would enforce consistency loss less on examples where the model was unconfident in its label prediction for an unlabeled example. As a way to solve this, the authors suggest learning a vector of parameters - one for each example in the dataset - where element i in the vector is a weight for element i of the dataset, in the summed-up unsupervised loss. They do this via a two-step process, where first they optimize the parameters of the network given the example weights, and then they optimize the example weights themselves. To optimize example weights, they calculate the gradient of the post-training validation loss with respect to those weights, which requires backpropagating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced better validation loss). This requires calculating the inverse Hessian (second derivative matrix of the loss), which is, generally speaking, a quite costly operation for huge-parameter nets. To lessen this cost, they pretend that only the final layer of weights in the network is being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably, those were the only ones we have enough information to upweight or downweight. With this model, the authors achieve modest improvements - performance comparable to or within-error-bounds better than the current state of the art, FixMatch. Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach: - They don't seem to have benchmarked against the simpler baseline of using some inverse of Dropout-estimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example" - If the presumed need for this is the lack of certainty of the model, that's a non-stationary problem that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target - Despite using techniques rooted in meta-learning, it doesn't seem like this model learns anything generalizable - it's learning index-based weights on specific examples, which doesn't give it anything useful it can do with some new data point it finds that it wasn't specifically trained on Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem like it was worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme). |
[link]
Transformers - powered by self-attention mechanisms - have been a paradigm shift in NLP, and are now the standard choice for training large language models. However, while transformers do have many benefits in terms of computational constraints - most saliently, that attention between tokens can be computed in parallel, rather than needing to be evaluated sequentially like in an RNN - a major downside is their memory (and, secondarily, computational) requirements. The baseline form of self-attention works by having every token attend to every other token, where "attend" here means that the query from each token A takes an inner product with the keys of every other token, and the resulting softmax weights are used to take a weighted average over those other tokens' values. This implies an O(N^2) memory and computation requirement, where N is your sequence length. So, the question this paper asks is: how do you get the benefits, or most of the benefits, of a full-attention network, while reducing the number of other tokens each token attends to? The authors' solution - Big Bird - has three components. First, they approach the problem of approximating the global graph as a graph theory problem, where each token attending to every other is "fully connected," and the goal is to try to sparsify the graph in a way that keeps the shortest path between any two nodes low. They use the fact that in an Erdos-Renyi graph - where every edge is simply chosen to be on or off with some fixed probability - the shortest path between nodes is known to scale with log(N). In the context of aggregating information about a sequence, a short path between nodes means that the number of iterations, or layers, that it will take for information about any given node A to be part of the "receptive field" (so to speak) of node B, will be correspondingly small. Based on this, they propose having the foundation of their sparsified attention mechanism be simply a random graph, where each node attends to each other node with probability k/N, where k is a tunable hyperparameter representing how many other nodes each node attends to on average. To supplement, the authors further note that sequence tasks of interest - particularly language - are very local in their information structure, and, while it's important to understand the global context of the full sequence, tokens close to a given token are most likely to be useful in constructing a representation of it. Given this, they propose supplementing their random-graph attention with a block diagonal attention, where each token attends to w/2 tokens prior to and subsequent to itself. (Where, again, w is a tunable hyperparameter) However, the authors find that these components aren't enough, and so they add a final component: having some small set of tokens that attend to all tokens, and are attended to by all tokens. This allows them to theoretically prove that Big Bird is a universal approximator of sequence functions and is Turing complete, both of which are true of full Transformers. I didn't follow the details of the proof, but, intuitively, my reading of this is that having a small number of these global tokens basically acts as a shortcut way for information to get between tokens in the sequence - if information is globally valuable, it can be "written" to one of these global aggregator nodes, and then all tokens will be able to "read" it from there. 
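Putting those three pieces together, the attention pattern is just a sparse 0/1 mask over token pairs. A toy construction of that mask (my own, with made-up hyperparameters, and ignoring the blockwise tricks the real implementation uses for efficiency) might look like:

```python
import numpy as np

def big_bird_mask(seq_len, num_random=3, window=3, num_global=2, seed=0):
    """Toy BigBird-style attention mask: mask[i, j] = True means token i may attend to token j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1) Random attention: each token attends to a few uniformly chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    # 2) Sliding window: each token attends to its w/2 neighbors on either side.
    for i in range(seq_len):
        lo, hi = max(0, i - window // 2), min(seq_len, i + window // 2 + 1)
        mask[i, lo:hi] = True

    # 3) Global tokens: the first few tokens attend to, and are attended to by, everyone.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

print(big_bird_mask(10).sum(axis=1))  # how many tokens each position attends to
```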
The authors do note that while their sparse model approximates the full transformer well in many settings, there are some problems - like needing to find the token in the sequence that a given token is farthest from in vector space - that a full attention mechanism could solve easily (since it directly calculates all pairwise comparisons) but that a sparse attention mechanism would require many layers to calculate. Empirically, Big Bird ETC (a version which adds on additional tokens for the global aggregators, rather than making existing tokens serve the purpose) performs the best on a big language model training objective, has comparable performance to existing models on question answering, and pretty dramatic performance improvements in document summarization. https://i.imgur.com/ks86OgJ.png https://i.imgur.com/x0BdamC.png It makes sense for summarization to be a place where this model in particular shines, because it's explicitly designed to be able to integrate information from very large contexts (albeit in a randomly sampled way), where full-attention architectures must, for reasons of memory limitation, do some variant of a sliding window approach. |
[link]
This is an interesting paper that makes a fairly radical claim, and I haven't fully decided whether what they find is an interesting-but-rare corner case, or a more fundamental weakness in the design of neural nets. The claim is: neural nets prefer learning simple features, even if there exist complex features that are equally or more predictive, and even if that means learning a classifier with a smaller margin - where margin means "the distance between the decision boundary and the nearest data points". A large-margin classifier is preferable in machine learning because the larger the margin, the larger the perturbation that would have to be made - by an adversary, or just by the random nature of the test set - to trigger misclassification. https://i.imgur.com/PJ6QB6h.png This paper defines simplicity and complexity in a few ways. In their simulated datasets, a feature is simpler when the decision boundary along that axis requires fewer piecewise linear segments to separate datapoints. (In the example above, note that having multiple alternating blocks still allows for separation, but requires more linear segments.) In their datasets that concatenate MNIST and CIFAR images, the MNIST component represents the simple feature. The authors then test which models use which features by training a model with access to all of the features - simple and complex - and then testing examples where one set of features is sampled in alignment with the label, and one set of features is sampled randomly. If the features being sampled randomly are being used by the model, perturbing them like this should decrease the test performance of the model. For the simulated datasets, a fully connected network was used; for the MNIST/CIFAR concatenation, a variety of different image classification convolutional architectures were tried. The paper finds that neural networks will prefer to use the simpler feature to the complete exclusion of more complex features, even if the complex feature is slightly more predictive (can achieve 100% vs 95% separation). The authors go on to argue that what they call Extreme Simplicity Bias, or Extreme SB, might actually explain some of the observed pathologies in neural nets, like relying on spurious features or being subject to adversarial perturbations. They claim that spurious features - like background color or texture - will tend to be simpler, and that their theory explains networks' reliance on them. Additionally, relying completely or predominantly on single features means that a perturbation along just that feature can substantially hurt performance, as opposed to a network using multiple features, all of which must be perturbed to hurt performance an equivalent amount. As I mentioned earlier, I feel like I'd need more evidence before I was strongly convinced by the claims made in this paper, but they are interestingly provocative. On a broader level, I think a lot of the difficulties in articulating why we expect simpler features to perform well come from an imprecision in thinking in language around the idea - we think of complex features as inherently brittle and high-dimensional, but this paper makes me wonder how well our existing definitions of simplicity actually match those intuitions. |
[link]
Generalization is, if not the central, then at least one of the central mysteries of deep learning. We are somehow able to train high-capacity, overparametrized models, that empirically have the capacity to fit to random data - meaning that they have the capacity to memorize the labeled data we give them - and which nonetheless manage to learn functions that generalize to test data. People have tried to come up with generalization bounds - that is, bounds on the expected test error of a model class - but these have all been vacuous, which here means that their upper bound is so far above the actual observed test set error that it's meaningless for the purpose of predicting which changes will enhance or detract from generalization. This paper builds on - and somewhat critiques - an earlier paper, Jiang et al, which takes the approach of assessing generalization bounds empirically. The central approach taken by both papers is to compare the empirical test error of two networks that are identical except for one axis which is varied, and test whether the ranking of the predicted generalization errors for the two networks, resulting from a particular analytical bound, aligns with the ranking of actual, empirical test error. Said succinctly: the goal is to measure how good a generalization bound is at predicting which networks will actually generalize, across the kinds of hyperparameter changes we'd be likely to experiment with in practice. An important note here is that this kind of rank-based measurement is insensitive to the actual magnitude of the generalization bound; it only cares about the relative bounds for different model configurations. For a given pair of environments (or pairs of hyperparameter settings), the experimental framework trains multiple seeds and averages the sign error across them. If the two models in the pair were close to one another in generalization error, they were downweighted in the overall average, or removed from the estimation if they were too close, to reduce noise. A difference in methodologies between the Jiang paper and this one is that this one puts a lot of emphasis on the need to rank generalization measures not just by their average performance over a suite of different hyperparameter perturbations, but also by a metric capturing how robust the measure is, for which they suggest the max error rather than average error. Their rationale is that simply looking at an average obscures cases where a measure performs poorly in a particular region of hyperparameter space, in a way that might tell us interesting things about its failure modes. For example, beyond just being able to say that generalization bounds based on Frobenius norms performed poorly on average at predicting the effects of changes to training set size, they were able to look at the particular settings where they performed the worst, which turned out to be on small network sizes. The plot below shows the results from all of the tested measures aggregated together. Each row represents a different axis that was being varied, and, for each measure, a number of different settings were sampled over (for the hyperparameters that were being held fixed across pairs, rather than being varied). Each distribution rectangle represents the average sign error across all of the pairs that were sampled for that measure, and that axis of variation. The measures are listed from left to right according to their average performance across all environments and all axes of variation. 
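The core comparison is simple enough to write down: for each pair of trained networks, check whether the generalization measure orders them the same way their actual test errors do. A minimal sketch of that sign-error computation (my own framing; the real protocol also averages over seeds and weights or filters near-ties rather than just skipping them):

```python
import itertools

def sign_error(measure_values, test_errors, min_gap=1e-3):
    """Fraction of network pairs that a generalization measure ranks differently
    than their empirical test error does.

    measure_values, test_errors: sequences indexed by network configuration.
    min_gap: pairs whose test errors differ by less than this are skipped as ties.
    """
    wrong, total = 0, 0
    for i, j in itertools.combinations(range(len(test_errors)), 2):
        gap = test_errors[i] - test_errors[j]
        if abs(gap) < min_gap:
            continue  # too close to call; would mostly add noise
        total += 1
        if (measure_values[i] - measure_values[j]) * gap < 0:
            wrong += 1  # the measure ordered this pair the wrong way
    return wrong / max(total, 1)
```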
https://i.imgur.com/Tg3wdA3.png Some conclusions from this experiment were: - Generalization measures seem to not perform well on changes made to width; however, the authors note this was mostly because changes to width tended to not change the generalization performance in consistent ways, and so the difference in test error between the networks in the pair was more often within the range of noise - Most but not all generalization bounds correctly predict that more training data should result in better generalization - No bound does particularly well on predicting the generalization effects of changes in depth Overall, I found this paper delightfully well written, and a real pleasure to read. My one critique is that the authors explicitly point out that an important piece of data for comparing generalization bounds is the set of features they depend on. That is, if a generalization bound can only make predictions with access to the learned weights (in addition to the model class and data characteristics), it's a lot less practically useful, in terms of model design, than one that doesn't. I wish they had followed through on that and represented the dependencies of the different bounds in some way in their central figure, so that it was easier to compare them "fairly," accounting for the information they had access to. |
[link]
Occasionally, I come across results in machine learning that I'm glad exist, even if I don't fully understand them, precisely because they remind me how little we know about the complicated information architectures we're building, and what kinds of signal they can productively use. This is one such result. The paper tests a method called self-training, and compares it against the more common standard of pre-training. Pre-training works by first training your model on a different dataset, in a supervised way, with the labels attached to that dataset, and then transferring the learned weights of that model (except for the final prediction head) and using them as the initialization for training on your downstream task. Self-training also uses an external dataset, but doesn't use that external data's labels. It works by 1) Training a model on the labeled data from your downstream task, the one you ultimately care about final performance on 2) Using that model to make label predictions (for the label set of your downstream task), for the external dataset 3) Retraining a model from scratch with the combined set of human labels and predicted labels from step (2) https://i.imgur.com/HaJTuyo.png (I've recapped this recipe as a short code sketch at the end of this summary.) This intuitively feels like cheating; something that shouldn't quite work, and yet the authors find that it equals or outperforms pretraining and self-supervised learning in the setting they examined (transferring from ImageNet as an external dataset to COCO as a downstream task, and using data augmentations on COCO). They particularly find this to be the case when they're using stronger data augmentations, and when they have more labeled COCO data to train with from the pretrained starting point. They also find that self-training outperforms self-supervised (e.g. contrastive) learning in similar settings. They further demonstrate that self-training and pre-training can stack; you can get marginal value from one, even if you're already using the other. They do acknowledge that - because it requires training a model on your dataset twice, rather than reusing an existing model directly - their approach is more computationally costly than the pretrained-ImageNet alternative. This work is, I believe, rooted in the literature on model distillation and student/teacher learning regimes, which has found that you can sometimes outperform a model by training on its outputs, though I can't fully remember the setups used in those works. The authors don't try too hard to give a rigorous theoretical account of why this approach works, which I actually appreciate. I think we need to have space in ML for people to publish what (at least to some) might be unintuitive empirical results, without necessarily feeling pressure to articulate a theory that may just be a half-baked after-the-fact justification. One criticism or caveat I have about this paper is that I wish they'd evaluated what happened if they didn't use any augmentation. Does pre-training do better in that case? Does the training process they're using just break down? Only testing on settings with augmentations made me a little less confident in the generality of their result. The authors' best guess for why the approach works is that it demonstrates the value of task-specificity in your training. I think there's a bit of that, but also feel like this ties in with other papers I've read recently on the surprising efficacy of training with purely random labels. 
I think there's, in general, a lot we don't know about what ostensibly supervised networks learn in the face of noisy or even completely permuted labels. |
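As promised above, here's the three-step recipe in code form. `train` and `predict_labels` are stand-ins for an ordinary supervised training loop (including whatever augmentation policy is in use) and an inference pass; nothing here is specific to the paper's exact setup.

```python
def self_training(train, predict_labels, labeled_data, external_images):
    """Sketch of the three-step self-training recipe."""
    # 1) Train a teacher on the downstream task's human-labeled data.
    teacher = train(labeled_data)

    # 2) Use the teacher to pseudo-label the external, unlabeled dataset,
    #    using the downstream task's label set (not the external dataset's own labels).
    pseudo_labeled = [(img, predict_labels(teacher, img)) for img in external_images]

    # 3) Retrain a student from scratch on human labels plus pseudo-labels combined.
    return train(list(labeled_data) + pseudo_labeled)
```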
[link]
The thing I think is happening here: It proposes a self-supervised learning scheme (which...seems fairly basic, but okay) to generate encodings. It then trains a Latent World Model, which takes in the current state encoding, the action, and the belief state (I think just the prior RNN state?) and predicts a next state. The intrinsic reward is the difference between this and the actual encoding of the next step. (This is dependent on a particular action and resulting next obs, it seems). I don't really know what the belief state is doing here. Is it a... scalar rather than an RNN state? It being said to start at 0 makes it seem that way. Summary: For years, an active area of research has been the question of how to incentivize reinforcement learning agents to more effectively explore their environment. In many environments, the state space is large, and it's quite difficult to find reward just by randomly traversing it, and, in the absence of reward signal, most reinforcement learning algorithms won't learn. To remedy this, a common approach has been to attempt to define a measure of novelty, such that you can reward policies for exploring novel states. One approach for this is to tabulate counts of how often you've seen given past states, and explore in inverse proportion to those counts. However, this is complicated to scale to image-based and continuous state spaces. Another tactic has been to use uncertainty in an agent's model of the world as an indication of that part of state space being insufficiently explored. In this way of framing the problem, exploring a part of space gives you more samples from it, and if you use those samples to train your forward predictive model - a model predicting the next state - it will increase in accuracy for that state and states like it. So, in this setting, the "novelty reward" for your agent comes from the prediction error; it's incentivized to explore states where its model is more incorrect. However, a problem with this, if you do simple pixel-based prediction, is that there are some inherent sources of uncertainty in an environment, that don't get reduced by you drawing more samples from those parts of space. The canonical example of this is static on a TV - it's just fundamentally noisy and unpredictable, and no amount of gathering data will reduce that fundamental noise. A lot of ways of naively incentivizing uncertainty draw you into those parts of state space again and again, even though they aren't really serving the purpose of getting you to explore interesting, uninvestigated parts of the space. This paper argues for a similar uncertainty-based metric, but based on prediction of a particular kind of representation of the state, which they argue avoids the pathological property described earlier, of getting stuck in regions of high inherent uncertainty. They do this by first learning a self-supervised representation that seems *kind of* like contrastive predictive coding, but slightly different. Their approach simply pushes the Euclidean distance between the representations of nearby timesteps to be smaller, without any explicit negative set to contrast against. To avoid the network learning the degenerate solution of "always predict a constant, so everything is close to everything else", the authors propose a "whitening" (or rescaling, or normalizing) operation before the mean squared error. This involves subtracting out the mean representation, and dividing by the covariance, before doing a mean squared error. 
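Here's a sketch of that whitening step as I read it: subtract the batch mean, then apply a whitening transform computed from the batch covariance before taking the MSE. The epsilon term and the Cholesky-based construction are implementation choices of mine, not necessarily the paper's.

```python
import torch

def whiten(z, eps=1e-4):
    """Map a batch of representations to (approximately) zero mean and identity covariance."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1) + eps * torch.eye(z.shape[1])
    # If cov = L L^T (Cholesky), then L^{-1} z has identity covariance.
    L = torch.linalg.cholesky(cov)
    return (torch.linalg.inv(L) @ z.T).T

def whitened_mse(z_t, z_tplus1):
    """Consistency loss on whitened representations of adjacent timesteps.
    The whitening is what stops the encoder from collapsing everything to a constant."""
    return ((whiten(z_t) - whiten(z_tplus1)) ** 2).mean()
```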
This means that, even if you've pushed your representations together in absolute space, after the whitening operation, they will be "pulled out" again to be in a spherical unit Gaussian, which stops the network from being able to get a benefit from falling into the collapsing solution. https://i.imgur.com/Psjlf4t.png Given this pretrained encoder, the method works by: - Constructing a recurrent Latent World Model (LWM) that takes in the encoded observation, the current action, and the prior belief state of the recurrent model - Encoding the current observation with the pretrained encoder - Passing that representation into the LWM to get a predicted next representation out (prediction happens in representation space, not pixel space) - Using the error between the actual encoding of the next observation, and the predicted next representation, as the novelty signal - Training a DQN on top of the encoding Something I'm a little confused by is whether the encoding network is exclusively trained via MSE loss, or whether it also gets gradients from the actual RL DQN task. Overall, this method makes sense, but I'm not quite sure why the proposed representation structure would be notably better than other, more canonical self-supervised losses, like CPC. They show Action-Conditioned CPC as a baseline when demonstrating the quality of the representations, but not as a slot-in replacement for the MSE representations in their overall architecture. It does seem to get strong performance on exploration-heavy tasks, but I'll admit I'm not familiar with the quality of the baselines they chose, and so don't have a great sense of whether the - admittedly quite strong! - performance shown in the table below is in fact comparing against the current state of the art. https://i.imgur.com/cIr2Y4w.png |
[link]
When humans classify images, we tend to use high-level information about the shape and position of the object. However, when convolutional neural networks classify images, they tend to use low-level, or textural, information more than high-level shape information. This paper tries to understand what factors lead to higher shape bias or texture bias. To investigate this, the authors look at three datasets with disagreeing shape and texture labels. The first is GST, or Geirhos Style Transfer. In this dataset, style transfer is used to render the content of one class in the style of another (for example, a cat shape in the texture of an elephant). In the Navon dataset, a large-scale letter is rendered by tiling smaller letters. And, in the ImageNet-C dataset, a given class is rendered with a particular kind of distortion; here the distortion is considered to be the "texture label". In the rest of the paper, "shape bias" refers to the extent to which a model trained on normal images will predict the shape label rather than the texture label associated with a GST image. The other datasets are used in experiments where a model explicitly tries to learn either shape or texture. https://i.imgur.com/aw1MThL.png To start off, the authors try to understand whether CNNs are inherently more capable of learning texture information rather than shape information. To do this, they train models on either the shape or the texture label on each of the three aforementioned datasets. On GST and Navon, shape labels can be learned faster and more efficiently than texture ones. On ImageNet-C (i.e. distorted ImageNet), it seems to be easier to learn texture than shape, but recall here that texture corresponds to the type of noise, and I imagine that the cardinality of noise types is far smaller than the number of ImageNet classes, so I'm not sure how informative this comparison is. Overall, this experiment suggests that CNNs are able to learn from shape alone, without low-level texture as a clue, in cases where the two sources of information disagree. The paper moves on to try to understand what factors about a normal ImageNet model give it higher or lower shape bias - that is, a higher or lower likelihood of classifying a GST image according to its shape rather than texture. Predictably, data augmentations have an effect here. When data is augmented with aggressive random cropping, this increases texture bias relative to shape bias, presumably because when large chunks of an object are cropped away, its overall shape becomes a less useful feature. Center cropping is better for shape bias, probably because objects are likely to be at the center of the image, so center cropping has less of a chance of distorting them. On the other hand, more "naturalistic" augmentations like adding Gaussian noise or distorting colors lead to a higher shape bias in the resulting networks, up to 60% with all the modifications. However, the authors also find that pushing the shape bias up has the result of dropping final test accuracy. https://i.imgur.com/Lb6RMJy.png Interestingly, while the techniques that increase shape bias seem to also harm performance, the authors also find that higher-performing models tend to have higher shape bias (though with texture bias still outweighing shape), suggesting that stronger models learn how to use shape more effectively, but also that handicapping models' ability to use texture in order to incentivize them to use shape tends to hurt performance overall. 
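For concreteness, the "shape bias" number being reported is, as I understand it, computed roughly like this on the GST cue-conflict images, counting only the images where the model picks one of the two conflicting labels:

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of cue-conflict decisions made according to shape rather than texture.

    predictions, shape_labels, texture_labels: equal-length sequences of class ids
    for each GST image (the shape and texture labels disagree by construction)."""
    shape_hits = texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # predictions matching neither label are ignored
    decisions = shape_hits + texture_hits
    return shape_hits / decisions if decisions else float('nan')
```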
Overall, my take from this paper is that texture-level data is actually statistically informative and useful for classification - even in terms of generalization - even if it is too high-resolution to be useful as a visual feature for humans. CNNs don't seem inherently incapable of learning from shape, but removing their ability to rely on texture seems to lead to a notable drop in accuracy, suggesting there was real signal there that we're losing out on.
|
[link]
This is an interesting - and refreshing - paper, in that, instead of trying to go all-in on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process. - An initial experiment tries to understand how much of the performance of fine-tuned models can be explained by (higher-level, and thus larger-scale) features, and how much is driven by lower level (and thus smaller-scale) image statistics. To start with, the authors compare the transfer performance from ImageNet onto three different datasets - clip art, sketches, and real images. As expected, transfer performance is highest with real datasets, which are the most similar to the training domain. However, there still *is* positive transfer in terms of final performance across all domains, as well as benefit in optimization speed. - To try to further tease out the difference between the transfer benefits of high and low-level features, the authors run an experiment where blocks of pixels are shuffled around within the image on the downstream tasks. The larger the size of the blocks being shuffled, the more that large-scale features of the image are preserved. As predicted, accuracy drops dramatically when pixel block size is small, for both randomly initialized and pretrained models. In addition, the relative value added by pretraining drops, for all datasets except quickdraw (the dataset of sketches). This suggests that in most datasets, the value brought by fine-tuning was mostly concentrated in large-scale features. One interesting tangent of this experiment was the examination of optimization speed (in the form of mean training accuracy over initial epochs). Even at block sizes too small for pretraining to offer a benefit to final accuracy, it did still contribute to faster training. (See transparent bars in right-hand plot below) https://i.imgur.com/Y8sO1da.png - On a somewhat different front, the authors look into how similar pretrained + finetuned models are to one another, compared to models trained on the same dataset from random initializations. First, they look at a measure of feature similarity, and find that the features learned by two pretrained networks are more similar to each other than a pretrained network is to a randomly initialized network, and also more than two randomly initialized networks are to one another. Randomly initialized networks are closest to one another in their final-layer features, but even that similarity is a factor of 4 or 5 smaller than the similarity between the pretrained networks - Looking at things from the perspective of optimization, the paper measures how much performance drops when you linearly interpolate between different solutions found by both randomly initialized and pretrained networks. For randomly initialized networks, interpolation requires traversing a region where test accuracy drops to 0%. However, for pretrained networks, this isn't the case, with test accuracy staying high throughout. This suggests that pretraining gets networks into a basin of the loss landscape, and that future training stays within that basin. 
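That interpolation experiment is easy to picture in code: take two trained sets of weights, sweep a mixing coefficient between them, and evaluate test accuracy at each point. A sketch, assuming two PyTorch models with identical architectures and an `evaluate` function you would supply:

```python
import copy
import torch

def interpolation_curve(model_a, model_b, evaluate, steps=11):
    """Test accuracy along the straight line between two solutions in weight space."""
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    accuracies = []
    for alpha in torch.linspace(0, 1, steps):
        blended = copy.deepcopy(model_a)
        # Blend every parameter/buffer; cast back so integer buffers stay loadable.
        blended.load_state_dict({
            k: ((1 - alpha) * state_a[k].float() + alpha * state_b[k].float()).to(state_a[k].dtype)
            for k in state_a
        })
        # Pretrained pairs reportedly stay high along the whole path; random-init pairs dip to ~0.
        accuracies.append(evaluate(blended))
    return accuracies
```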
There were also some experiments on module criticality that I believe were in a similar vein to these, but which I didn't fully follow - Finally, the paper looks at the relationship between accuracy on the original pretraining task and both accuracy and optimization speed on the downstream task. They find that higher original-task accuracy moves in the same direction as higher downstream-task accuracy, though this is less true when the downstream task is less related (as with quickdraw). Perhaps more interestingly, they find that the benefits of transfer to optimization speed happen and plateau quite early in training. Clip Art and Real transfer tasks are much more similar in the optimization speed benefits they get from ImageNet training, whereas on the accuracy front, the real domain did dramatically better. https://i.imgur.com/jBCJcLc.png While there's a lot to dig into in these results overall, the things I think are most interesting are the reinforcing of the idea that even very random and noisy pretraining can be beneficial to optimization speed (this seems reminiscent of another paper I read from this year's NeurIPS, examining why pretraining on random labels can help downstream training), and the observation that pretraining deposits weights in a low-loss bucket, from which they can learn more efficiently (though, perhaps, if the task is too divergent from the pretraining task, this difficulty in leaving the basin becomes a disadvantage). This feels consistent with some work in the Lottery Ticket Hypothesis, which has recently suggested that, after a short duration of training, you can rewind a network to a checkpoint saved after that duration, and successfully train to low loss again. |
[link]
Contrastive learning works by performing augmentations on a batch of images, and training a network to match the representations of the two augmented parts of a pair together, and push the representations of images not in a pair farther apart. Historically, these algorithms have benefitted from using stronger augmentations, which has the effect of making the two positive elements in a pair more visually distinct from one another. This paper tries to build on that success, and, beyond just using a strong augmentation, tries to learn a way to perturb images that adversarially increases contrastive loss. As with adversarial training in the normal supervised setting, the thinking here is that examples which push loss up the highest are the hardest and thus most informative for the network to learn from. While the concept of this paper made some sense, I found the notation and the explanation of mechanics a bit confusing, particularly when it came to the choice to frame a contrastive loss as a cross-entropy loss, with the "weights" of the dot product in the cross-entropy loss being, in fact, the projections, by the learned encoder, of various examples in the batch. https://i.imgur.com/iQXPeXk.png This notion of the learned representations being "weights" is just odd and counter-intuitive, and the process of trying to wrap my mind around it isn't one I totally succeeded at. I think the point of using this frame is that it provides an easy analogue to the Fast Gradient Sign Method of normal supervised-learning adversarial examples, even though it has the weird effect that, as the authors say, "your weights vary by batch... rather than being consistent across training." Notational weirdness aside, my understanding is that the method of this paper: - Runs a forward pass of normal contrastive loss (framed as cross-entropy loss), which takes augmentations p and q and runs both forward through an encoder - Calculates a delta to apply to each input image in q that will increase the loss most, taken over all the images in the p set - I think the delta is per-image in q, and is just aggregated over all images in p, but I'm not fully confident of this, as a result of notational confusion. It could also be one delta applied for all images in q. - Calculates the loss that results when you run the adversarially generated q forward against the normal p - Trains on a combined loss that is a weighted combination of the normal p/q contrastive part and the adversarial p/q contrastive part https://i.imgur.com/UWtJpVx.png The authors show a small but relatively consistent improvement to performance using their method. Notably, this improvement is much stronger when using larger encoders (presumably because they have more capacity to learn from harder examples). One frustration I have with the empirics of the paper is that, at least in the main paper, they don't discuss the increase in training time required to calculate these perturbations, which, a priori, I would imagine to be nontrivial. |
[link]
This was a really cool-to-me paper that asked whether contrastive losses, of the kind that have found widespread success in semi-supervised domains, can add value in a supervised setting as well. In a semi-supervised context, contrastive loss works by pushing together the representations of an "anchor" data example with an augmented version of itself (which is taken as a positive or target, because the image is understood to not be substantively changed by being augmented), and pushing the representation of that example away from other examples in the batch, which are negatives in the sense that they are assumed to not be related to the anchor image. This paper investigates whether this same structure - of training representations of positives to be close relative to negatives - could be expanded to the supervised setting, where "positives" wouldn't just mean augmented versions of a single image, but augmented versions of other images belonging to the same class. This would ideally combine the advantages of self-supervised contrastive loss - which explicitly incentivizes invariance to augmentation-based changes - with the advantages of a supervised signal, which allows the representation to learn that it should also see instances of the same class as close to one another. https://i.imgur.com/pzKXEkQ.png To evaluate the performance of this as a loss function, the authors first train the representation - either with their novel supervised contrastive loss SupCon, or with a control cross-entropy loss - and then train a linear classifier with cross-entropy on top of that learned representation. (Just because, structurally, a contrastive loss doesn't lead to assigning probabilities to particular classes, even if it is supervised in the sense of capturing information relevant to classification in the representation) The authors investigate two versions of this contrastive loss, which differ, as shown below, in terms of the relative position of the sum and the log operation, and show that the L_out version dramatically outperforms (and I mean dramatically, with a top-one accuracy of 78.7% vs 67.4%). https://i.imgur.com/X5F1DDV.png The authors suggest that the L_out version is superior in terms of training dynamics, and while I didn't fully follow their explanation, I believe it had to do with the L_out version doing its normalization outside of the log, which meant it actually functioned as a multiplicative normalizer, as opposed to happening inside the log, where it would have just become an additive (or, really, subtractive) constant in the gradient term. Due to this stronger normalization, the authors posit that the L_out loss was less noisy and more stable. Overall, the authors show that SupCon consistently (if not dramatically) outperforms cross-entropy when it comes to final accuracy. They also show that it is comparable in transfer performance to a self-supervised contrastive loss. One interesting extension to this work, which I'd enjoy seeing more explored in the future, is how the performance of this sort of loss scales with the number of different augmentations that are performed on each element in the batch (this work uses two different augmentations, but there's no reason this number couldn't be higher, which would presumably give additional useful signal and robustness?) (I've included a rough sketch of the L_out form of the loss at the end of this summary.) |
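Here's my sketch of the L_out form: for each anchor, take the softmax (over all other examples in the batch) probability of each same-class example, and average the log-probabilities, with that average sitting outside the log. The temperature and the tensor bookkeeping are mine; the structure follows the loss as I understand it.

```python
import torch
import torch.nn.functional as F

def supcon_loss_out(features, labels, temperature=0.1):
    """Sketch of the supervised contrastive loss, L_out variant (mean of logs).

    features: (batch, dim) embeddings of augmented views; each image contributes
              its augmented copies to the batch with the same label.
    labels:   (batch,) tensor of class labels.
    """
    z = F.normalize(features, dim=-1)
    sim = z @ z.T / temperature                                  # (batch, batch) similarities
    self_mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))              # never contrast with yourself

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over all others

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    n_pos = pos_mask.sum(dim=1).clamp(min=1)
    # L_out: the average over positives happens *outside* the log.
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / n_pos
    return loss_per_anchor.mean()
```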
[link]
This is another paper that was a bit of a personal-growth test for me to try to parse, since it's definitely heavier on analytical theory than I'm used to, but I think I've been able to get something from it, even though I'll be the first to say I didn't understand it entirely. The question of this paper is: why does training a neural network on a data distribution - but with the supervised labels randomly sampled - seem to afford some level of advantage when you subsequently fine-tune that randomly-trained network with correct labels? What is it that these networks learn from random labels that gives them a head-start on future training? To try to answer this, the authors focus on analyzing the first-layer weights of a network, and frame both the input data and the learned weights (after random training) as random variables, each with some mean and covariance matrix. The central argument made by the paper is: after training with random labels, the weights come to have a distributional form with a covariance matrix that is "aligned" with the covariance matrix of the data. "Aligned" here means alignment on the level of eigenvectors. Formally, it is defined as a situation where every eigenspace in the data covariance matrix is contained in, or is a subset of, an eigenspace of the weight matrix. Intuitively, it means that the principal components - the axes that define the principal dimensions of variation - of the weight space are being aligned, in a linear algebra sense, with those of the data, down to a difference of a scaling factor (that is, the fact that the eigenvalues may be different between the two). They do show some empirical evidence of this being the case, by calculating the actual covariance matrices of both the data and the learned weight matrices, and showing that you see high degrees of similarity between the vector spaces of the two (though sometimes by having to add eigenspaces of the data together to be equivalent to an eigenspace of the weights). https://i.imgur.com/TB5JM6z.png They also show some indication that this property drives the advantage in fine-tuning. They do this by just taking their analytical model of what they believe is happening during training - that weights are coming to be drawn from a distribution governed by a covariance matrix aligned with the data covariance matrix - and sampling weights from a normal distribution that has that property. They show, in the plot below, that this accounts for most of the advantage that has been observed in subsequent training from training on random labels (other than the previously-discovered effect of "all training increases the scale of the weights, which helps in future training," which they account for by normalizing). https://i.imgur.com/cnT27HI.png Unfortunately, beyond this central intuition of covariance matrix alignment, I wasn't able to get much else from the paper. Some other things they mentioned that I didn't follow were: - The actual proof for why you'd expect this property of alignment in the case of random training - An analysis of the way that "the first layer [of a network] effectively learns a function which maps each eigenvalue of the data covariance matrix to the corresponding eigenvalue of the weight covariance matrix." 
I understand that their notion of alignment predicts that there should be some relationship between these two eigenvalues, but I don't fully follow which part of the first layer of a neural network will *learn* that function, or produce it as an output - An analysis of how this framework explains both the cases where you get positive transfer from random-label training (i.e. fine-tuned networks training better subsequently) and the cases where you get negative transfer |
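As a rough illustration of what "alignment" would look like empirically, here's a small numpy sketch of the kind of check I believe the authors are doing - computing both covariance matrices and asking how much of each data eigenvector lies in the span of the top weight eigenvectors. The top-k cutoff and the exact grouping of eigenspaces are my simplifications.

```python
import numpy as np

def alignment_with_weight_eigenspace(X, W, top_k=10):
    """Rough check of eigenvector alignment between data and first-layer weights.

    X: (n_samples, d) input data; W: (n_units, d) learned first-layer weights.
    For each data-covariance eigenvector, returns how much of its (unit) norm
    lies inside the span of the top-k weight-covariance eigenvectors; values
    near 1 mean that data direction is (nearly) contained in the weight eigenspace.
    """
    cov_x = np.cov(X, rowvar=False)
    cov_w = np.cov(W, rowvar=False)

    _, vecs_x = np.linalg.eigh(cov_x)       # columns are eigenvectors, ascending eigenvalues
    _, vecs_w = np.linalg.eigh(cov_w)

    top_w = vecs_w[:, -top_k:]              # top-k weight eigenspace (cutoff is arbitrary)
    proj = top_w @ (top_w.T @ vecs_x)       # project each data eigenvector onto that subspace
    return (proj ** 2).sum(axis=0)
```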
[link]
In the past year or so, contrastive learning has experienced widespread success, and has risen to be a dominant problem framing within self-supervised learning. The basic idea of contrastive learning is that, instead of needing human-generated labels to generate a supervised task, you assume there exists some automated operation you can perform on a data element to generate another data element that, while different, should be considered still fundamentally the same, or at least more strongly related. You can then contrast these related pairs against pairs constructed from the rest of the dataset, with which any given element would not, by default, have this assumed relationship of sameness or quasi-similarity. One fairly central way that "different but still effectively similar" has been historically defined - at least within the realm of image-based models - is through the use of data augmentations: image transformations such as cropping, color jitter, or Gaussian blur, which are used on an image to create the counterpart in its related pair. Fundamentally, what we're doing when we define these particular augmentations is saying: these transformations don't cause a meaningful change in what the image is, and so we want the representations we get with and without the transformations to be close to one another (or at least to contain enough information to predict one another). Another way you can say this is that we're defining properties of the image that we want our representation to be invariant to. The authors of this paper make the point that, when aggressive cropping is part of your toolkit of augmentations, the crops of the image can actually contain meaningfully different content than the uncropped image. If you remove a few pixels around the edges of an image, it still fundamentally contains the same things. However, if you zoom in dramatically, you may get crops that contain different objects. From an image classification standpoint, you would expect that coding an invariance to cropping into our representations would, in some cases, also mean coding in an invariance to object type, which would presumably be detrimental to the task of classifying objects. To explain the extent of the success that aggressive-cropping methods have had so far, they argue that ImageNet has the particular property that its images are curated to primarily and centrally contain a single object at a time, such that, even if you zoom in, you're getting a part of the central object, rather than another object entirely. They argue that this dataset bias might explain why this object-invariance hasn't shown up as much of a problem in earlier augmentation-based contrastive work. To try to test this, they train different contrastive (MoCo v2) models on the MSCOCO dataset, which consists of complex scenes containing multiple objects, and thus no longer has the property of being centrally about one object. They tried one setting where they performed contrastive loss on the images as a whole, and another where the inputs to the augmentation pipeline were images from the same dataset, but pre-cropped to contain only one object at a time. This was meant, as far as I can tell, to isolate the effect of "object-centric vs not" while holding other dataset factors constant. They then test how well these different models do on an object-centric classification task (Pascal Cropped Boxes). 
They find that the contrastive model trained on the cropped versions of the dataset gets about 3.5 points higher mean accuracy (75.3 vs 71.9) than the contrastive loss done on the multi-object versions of the images. They also explicitly try to measure different forms of invariance, through a scheme where they binarize the elements of the representation vector, and calculate what proportion of them fire on average with and without a given set of transformations. They find that the main form of invariance that contrastive learning does well at is invariance to occlusion (part of the image not being visible), where both contrastive methods give ~84 percent co-firing, and supervised pre-training only gets about 80.9. However, on other important measures of invariance - viewpoint, illumination direction, and instance (that is, specific instance within a class) - contrastive representations perform notably worse than supervised pretraining. https://i.imgur.com/7Ghbv5A.png To try to solve these two problems, they propose a method that learns from video, and that uses temporally separated frames (which are then augmented) as pairs. They call this Frame Temporal Invariance, and argue, reasonably, that by pushing together the representations of adjacent frames which track a (presumably consistent, or at least slowly-evolving) scene, you should expect better invariance to viewpoint change and image deformation, since those things naturally happen when an object is moving through the world. They also suggest using an off-the-shelf object bounding box model to find particular objects and track them throughout the video, and to use contrastive learning specifically on the bounding boxes that the algorithm thinks track a consistent object. https://i.imgur.com/2GfCTog.png Overall, my take on this paper is that the analysis they do - of the different kinds of invariances that contrastive vs supervised losses do well on, and of the extent to which contrastive-loss results might be biased by datasets - is quite interesting and a valuable contribution to our understanding of these very currently-hypey algorithms. However, I'm a bit less impressed by the novelty of their proposed solution. Temporal forms of contrastive learning have been around before - in reinforcement learning, and even in the original Contrastive Predictive Coding paper, where the related pairs were related by dint of temporal closeness. So, while using it in video is certainly a good idea, it doesn't really feel strongly novel to me. I also feel a little confused by their choice of using an off-the-shelf object detection model as a pre-requisite for a self-supervised task, since my impression was that a central goal of self-supervision was building techniques that could scale to situations where it was infeasible to get large amounts of labels, and any method that relies on a pre-existing trained object bounding box model is pretty inherently limited in that regard. |
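For concreteness, here's a tiny sketch of the sort of firing-rate invariance probe described above; the binarization threshold and the normalization are my guesses at the details, not the paper's exact metric.

```python
import torch

def cofiring_rate(rep_clean, rep_transformed, threshold=0.0):
    """Invariance probe in the spirit of the firing-rate analysis described above.

    rep_clean, rep_transformed: (N, D) representations of the same N images,
    without and with a given transformation applied. A unit "fires" when its
    activation exceeds `threshold`; we measure how often units that fire on the
    clean image also fire on the transformed one.
    """
    fires_clean = rep_clean > threshold
    fires_both = fires_clean & (rep_transformed > threshold)
    return (fires_both.float().sum() / fires_clean.float().sum().clamp(min=1)).item()
```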
[link]
A central problem in the domain of reinforcement learning is how to incentivize exploration and diversity of experience, since RL agents can typically only learn from states they go to, and it can often be the case that states with high reward don't have an obvious trail of high-reward states leading to them, meaning that algorithms naively optimizing for reward will be relatively unlikely to discover them. One potential way to promote exploration is to train an ensemble of agents, and have part of your loss incentivize diversity in the behavior of those agents. Intuitively, an incentive for diversity will push policies away from one another, which will force them to behave differently, and thus reach a wider range of different states. Current work on diversity tends to reward, for each agent, the average pairwise distance between it and every other agent. The authors of this paper have two critiques of this approach: 1. Pure pairwise distance doesn't fully capture the notion of diversity we might want, since if you have policies that are clustered together into multiple clusters, average pairwise distance can be increased by moving the clusters farther apart, without having to break up the clusters themselves 2. Having the diversity term be calculated for each policy independently can lead to "cycling" behavior, where each policy moves in a way that would increase distance if the others stayed fixed, but the population doesn't actually become more diverse once every agent's simultaneous update is taken into account As an alternative, they propose calculating the pairwise kernel-function similarity between all policies (where each policy is represented as the average action probabilities it returns across a sample of states), and calculating the determinant of that matrix. The authors claim that this represents a better measure of full population diversity. I can't fully connect the dots on this intuitively, but what I think they're saying is: the determinant of the kernel matrix tells you something about the effective dimensionality spanned by the different policies. In the same way that, if a matrix is low-rank, it tells you that some vectors within it can be nearly represented by linear combinations of others, a low value of the kernel determinant means that some policies can be sufficiently represented by combinations of other policies, such that they aren't really adding diversity value to the ensemble. https://i.imgur.com/CmlGsNP.png Another contribution of this paper is to propose an interesting bandit-based way of determining when to incentivize diversity vs focus on pure reward. The diversity term in the loss is governed by a lambda parameter, and the paper's model sets up Thompson sampling to determine what the value of the parameter should be at each training iteration. This bandit setup works by starting out uncertain over whether to include diversity in the loss, and building a model of whether reward tends to increase during steps where diversity is used. Over time, if diversity consistently doesn't produce benefits, the sampler will tend more towards excluding it from the loss. This is just a really odd, different idea that I've never heard of before in the context of hyperparameter scheduling during training. I will say that I'm somewhat confused about the window of time that the bandit uses for calculating rewards. Is it a set of trajectories used in a single training batch, or longer than that? 
Questions aside, they do show fairly convincingly that the adaptive parameter schedule outperforms a fixed schedule, though they don't test it against simpler forms of annealing, so I'm less clear on whether it would outperform those. Overall, I still have some confusions about the method proposed by this paper, but it seems to be approaching the question of exploration from an interesting direction, and I'd enjoy trying to understand it further in the future.
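To make the diversity measure a bit more concrete, here's a sketch of how I understand the kernel-determinant idea; the RBF kernel, the bandwidth, and the representation of a policy as action probabilities over sampled states are assumptions on my part.

```python
import torch

def population_diversity(policies, states, bandwidth=1.0):
    """Determinant-based diversity of a policy ensemble, as I understand the proposal.

    policies: list of policy networks mapping a batch of states to action probabilities.
    states: (S, obs_dim) batch of sampled states used to "embed" each policy.
    Each policy is represented by its flattened action probabilities over the
    sampled states; diversity is the log-determinant of the RBF kernel matrix
    between those embeddings.
    """
    embeds = torch.stack([p(states).flatten() for p in policies])   # (M, S * A)
    sq_dists = torch.cdist(embeds, embeds) ** 2
    kernel = torch.exp(-sq_dists / (2 * bandwidth ** 2))            # (M, M) RBF kernel
    # logdet is low when some policy is (nearly) a combination of the others,
    # and high when the ensemble spans many genuinely distinct behaviors.
    return torch.logdet(kernel + 1e-6 * torch.eye(len(policies), device=kernel.device))
```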
|
[link]
This work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta-learning methods optimize the parameters that govern an agent's learning such that, over the course of many training processes across many environments, the reward over the full lifetime of training is higher. To be more concrete, the agent in a given environment learns two things: - A policy, that is, a distribution over predicted action given a state. - A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something more optimal. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction. At each step, after acting in the environment, the agent passes to an LSTM: - The reward at the step - A binary of whether the trajectory is done - The discount factor - The probability of the action that was taken from state t - The prediction vector evaluated at state t - The prediction vector evaluated at state t+1 Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things: - A scalar, pi-hat - A prediction vector, y-hat These two quantities are used to update the existing policy and prediction model according to the rule below. https://i.imgur.com/xx1W9SU.png Conceptually, the scalar governs whether to increase or decrease the probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned, it can (hopefully) generalize to new environments. Given this, the basic meta-learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector.) 
These include: - An entropy bonus, pushing the meta-learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic) - An L2 penalty on both pi-hat and y-hat - A removal of the softmax that had originally been taken over the k-dimensional prediction vector categorical, and a switch of that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1-of-k categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of it as k separate binaries? This was definitely a part of the paper I was confused about overall https://i.imgur.com/EL8R1yd.png With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to be able to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was being trained in. However, the two most interesting aspects of the evaluation were: 1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about which states were high value. However, beyond that, they found that the meta-learned vector was able to be used to predict values calculated with discount rates different from the one used during meta-learning, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate. 2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to reinforcement learning that it's capturing. |
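To pin down the interface being meta-learned, here's a sketch of the LSTM module's inputs and outputs as described above; the layer sizes and exact wiring are invented for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class LearnedUpdateRule(nn.Module):
    """LSTM that maps per-step training statistics to update targets.

    None of its inputs depend on the observation or action space, which is what
    lets the learned rule (hopefully) transfer to new environments.
    """
    def __init__(self, prediction_dim=30, hidden_dim=256):
        super().__init__()
        input_dim = 1 + 1 + 1 + 1 + prediction_dim + prediction_dim
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.pi_hat_head = nn.Linear(hidden_dim, 1)              # scalar: push action prob up or down
        self.y_hat_head = nn.Linear(hidden_dim, prediction_dim)  # target for the prediction vector

    def forward(self, reward, done, discount, action_prob, y_t, y_tp1, state=None):
        # All inputs are (batch, time, feature)-shaped tensors
        x = torch.cat([reward, done, discount, action_prob, y_t, y_tp1], dim=-1)
        h, state = self.lstm(x, state)
        return self.pi_hat_head(h), self.y_hat_head(h), state
```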
[link]
This paper focuses on an effort by a Deepmind team to train an agent that can play the game Diplomacy - a complex, multiplayer game where players play as countries controlling units, trying to take over the map of Europe. Some relevant factors of this game, for the purposes of this paper, are: 1) All players move at the same time, which means you need to model your opponents' current moves, and play a move that succeeds in expectation over that predicted move distribution. This also means that, in order to succeed, your policy can't be easily predictable, since, if it is, your opponents can easily optimize their responses to what they predict you'll do, making you much easier to exploit 2) The action space is huge, even compared to something like Chess 3) The game has significant multiagent complexity: rather than being straightforwardly zero-sum in its reward structure, like Chess or Go, it's designed to require alliances between players in order to succeed Prior work - DipNet - had been able to outperform other hand-coded models through a deep network that imitated human actions, but the authors hadn't been able to get RL to successfully learn on top of that imitation baseline. The basic approach this model takes is one that will probably feel familiar if you've read Deepmind's prior work on Chess or Go: an interplay between a fast-to-evaluate neural net component and a slower, more explicit, strategically designed component. The slower "expert" component uses the fast network component as part of its evaluation of different moves' consequences, and then, once the slow expert has generated a series of decisions, the neural net policy learns to imitate those decisions. In this case, the slower expert tries to explicitly calculate a Best Response strategy, given some belief about what your opponents will do at the state you're in. Since the action space is so large, it isn't possible to calculate a full best response (that is, your best *possible* action given the actions you expect your opponents to take), so this paper instead lays out a Sampled Best Response algorithm. It takes as input a state, as well as an opponent policy, a candidate policy, and a value network. (More on how those come to be later.) In the simplest case, the SBR algorithm works by: 1. Sampling some number (B) of actions from the opponent policy given the state. These represent a sample distribution of what moves you think your opponents are likely to play 2. Sampling some number (C) of candidate actions from your candidate policy. 3. For each candidate action, evaluating - for each opponent action - the state you'd reach if you took the candidate action and the opponent took the opponent action, according to your value network. 4. This gives you an estimated Q value for each of your candidate actions; if you pick the action with the highest Q value, this approximates your best response action to the opponent policy distribution you pass in Once you have this SBR procedure, you can use it to bootstrap a better policy and a value network, by starting with a policy and value learned from pure human imitation, and then using SBR - with your current policy as both the opponent and candidate policy - to generate a dataset of trajectories (where each action in the trajectory is chosen according to the SBR procedure). With that trajectory dataset, you can train your policy and value networks to better approximate the actions that SBR would have taken. 
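Here's a sketch of those four SBR steps in code; the sampling, simulation, and value-network interfaces are placeholders I'm assuming for illustration, not the paper's actual API.

```python
import torch

def sampled_best_response(state, opponent_policy, candidate_policy, value_net,
                          simulate, num_opponent_samples=10, num_candidates=50):
    """Sampled Best Response, following the four steps described above.

    `opponent_policy.sample`, `candidate_policy.sample`, `value_net`, and
    `simulate(state, my_action, opponent_actions)` are assumed interfaces.
    """
    # 1. Sample B joint opponent action profiles from the opponent policy
    opponent_profiles = [opponent_policy.sample(state) for _ in range(num_opponent_samples)]

    # 2. Sample C candidate actions from the candidate policy
    candidates = [candidate_policy.sample(state) for _ in range(num_candidates)]

    # 3. Score each candidate by the value of the state reached against each
    #    sampled opponent profile, averaged over profiles
    def q_value(action):
        values = [value_net(simulate(state, action, profile)) for profile in opponent_profiles]
        return torch.stack(values).mean().item()

    # 4. The candidate with the highest estimated Q value approximates the best response
    return max(candidates, key=q_value)
```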
The basic version of this procedure is called IBR - Iterated Best Response. This is because, at each stage, SBR will return a policy that tries to perform well against the current version of the opponent policy. This kind of behavior is potentially troublesome, since you can theoretically have it lead to cycles, like those in Rock Paper Scissors where paper beats rock, scissors beat paper, and then rock beats scissors again. At each stage, you find the best response to what your opponent is doing right now, but you don't actually make transitive progress. A common way of improving along the axis of this particular failure mode is to learn via "fictitious play" rather than "self play". Putting aside the name - which I don't find very usefully descriptive - this translates to simulating play against, not the current version of your opponent, but a distribution made up of past versions of the opponent. This helps prevent cycling, because choosing a strategy that only defeats the current version but would lose to a prior version is no longer a rewarding option. The Fictitious Play-based approach this paper proposes - FPPI2 - tries to resolve this problem. Instead of sampling actions in step (1) from only your current opponent policy, it uses a sampling procedure where, for each opponent action, it first samples a point in history, and then samples the opponent policy vector at that point in history (so the multiple opponent moves collectively represent a coherent point in strategy space). Given this action profile, the final version of the algorithm continues to use the most recent/updated timestep of the policy and value network for the candidate policy and value network used in SBR, so that you're (hopefully) sampling high quality actions, and making accurate assessments of states, even while you use the distribution of (presumably worse) policies to construct the opponent action distribution that you evaluate your candidate actions against. https://i.imgur.com/jrQAAQW.png The authors don't evaluate against human players, but do show that their FPPI2 approach consistently outperforms the prior DipNet state of the art, and performs the best overall, though Iterated Best Response performs better than they say they expected. Some other thoughts: - This system is still fairly exploitable, and doesn't become meaningfully less so during training, though it does become more difficult to exploit quickly - This does seem like a problem where, overall, you do a lot of modeling of what you expect your opponents' strategies to be, and it seems hard to be robust, especially to collusion among your opponents - I was a bit disappointed that the initial framing of the paper put a lot of emphasis on the alliance-focused nature of the game, but then neither suggested mechanisms targeting that aspect of the game, nor seemed to do any specific analysis of it - I would have expected this game to benefit from some explicit modeling of different agents having different policies; possibly this just isn't something they could have had be the case under their evaluation scheme, which played against copies of a given policy? Overall, my sense is that this is still a pretty early-stage checkpoint in the effort of playing Diplomacy, and that we've still got a ways to go, but it is interesting early work, and I'm curious where it leads. |
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve performance comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that, within a sufficiently large network, there's some small effective subnetwork you can find that "won" the initialization lottery. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine-tuning from the place the weights were when you pruned, but also 2. Learning rate rewinding, where you keep the weights as they are, but rewind the learning rates to what they were early in training, is actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed weights. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. |
[link]
One of the most notable flaws of modern model-free reinforcement learning is its sample inefficiency; where humans can learn a new task with relatively few examples, models that learn policies or value functions directly from raw data need huge amounts of data to train properly. Because the model isn't given any semantic features, it has to learn a meaningful representation from raw pixels using only the (often sparse, often noisy) signal of reward. Some past approaches have tried learning representations separately from the RL task (where you're not bottlenecked by agent actions), or adding more informative auxiliary objectives to the RL task. Instead, the authors of this paper, quippily titled "Image Augmentation Is All You Need", suggest using data augmentation of input observations through image modification (in particular, by taking different random crops of an observation stack), and integrating that augmentation into the native structure of a RL loss function (in particular, the loss term for Q learning). There are two main reasons why you might expect image augmentation to be useful. 1. On the most simplistic level, it's just additional data for your network 2. But, in particular, it's additional data designed to exhibit ways an image observation can be different on a pixel level, but still not be meaningfully different in terms of its state within the game. You'd expect this kind of information to make your model more robust to overfitting. The authors go into three different ways they could add image augmentation to a Q-learning model, and show that each one provides additional marginal value. The first, and most basic, is to just add augmented versions of observations to your training dataset. The basic method being used, Soft Actor Critic, uses a replay buffer of old observations, and this augmentation works by simply applying a different crop transformation each time an observation is sampled from the replay buffer. This is a neat and simple trick, which effectively multiplies the number of distinct observations your network sees by the number of possible crops, making it less prone to overfitting. The next two ways involve integrating transformed versions of an observation into the structure of the Q function itself. As a quick background, Q learning is trained using a Bellman consistency loss, and Q tries to estimate the value of a (state, action) pair, assuming that you do the best possible thing at every point after you take the action at the state. The consistency loss tries to push your Q estimate of the value of a (state, action) pair closer to the sum of the reward you got by taking that action and your current max Q estimate for the next state. The second term in this loss, the combined reward and next-step Q value, is called the target, since it's what you push your current-step Q value closer towards. This paper suggests both: - Averaging your current-step Q estimate over multiple different crops of the observation stack at the current state - Averaging the next-step Q estimate used in the target over multiple different crops (that aren't the ones used in the current-step averaging) This has the nice side effect that, in addition to telling your network about image transformations (like small crops) that shouldn't impact its strategic assessment, it also makes your Q learning process overall lower variance, because both the current-step and next-step quantities are averages rather than single-sample values. 
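Putting the two averaging ideas together, here's a sketch of what the resulting Q-learning loss might look like, written for a discrete-action Q-network for simplicity; the crop counts K and M and the helper names are my assumptions.

```python
import torch
import torch.nn.functional as F

def augmented_q_loss(q_net, target_q_net, obs, action, reward, next_obs, done,
                     random_crop, gamma=0.99, K=2, M=2):
    """Q-learning loss with crop-averaged targets and crop-averaged estimates.

    `random_crop(obs)` is assumed to return a differently-cropped version of the
    observation stack each time it is called; action is a (B,) long tensor,
    reward and done are (B,) float tensors.
    """
    with torch.no_grad():
        # Average the bootstrapped target over K random crops of the next observation
        targets = []
        for _ in range(K):
            next_q = target_q_net(random_crop(next_obs)).max(dim=1).values
            targets.append(reward + gamma * (1.0 - done) * next_q)
        target = torch.stack(targets).mean(dim=0)

    # Average the current-step Q estimate over M (different) random crops
    q_values = torch.stack([
        q_net(random_crop(obs)).gather(1, action.unsqueeze(1)).squeeze(1)
        for _ in range(M)
    ]).mean(dim=0)

    return F.mse_loss(q_values, target)
```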
https://i.imgur.com/LactlFq.png Operating in a lower data regime, the authors found that simply adding augmentations to their replay buffer sampling (without the two averaging losses) gave them a lot of gains in how efficiently they could learn, but all three combined gave the best performance. |
[link]
This paper out of DeepMind is an interesting synthesis of ideas out of the research areas of meta-learning and intrinsic rewards. The hope for intrinsic reward structures in reinforcement learning - things like uncertainty reduction or curiosity - is that they can incentivize behavior like information-gathering and exploration, which aren't incentivized by the explicit reward in the short run, but which can lead to higher total reward in the long run. So far, intrinsic rewards have mostly been hand-designed, based on heuristics or analogies from human intelligence and learning. This paper argues that we should use meta-learning to learn a parametrized intrinsic reward function that more directly serves our goal of facilitating long-run reward. They do this by: - Creating agents that have multiple episodes within a lifetime, and that learn a policy network to maximize the intrinsic reward produced by Eta, a neural network mapping the agent's life history to scalars. The learnt policy is carried over from episode to episode. - The meta-learner then optimizes the Eta network to achieve higher lifetime reward according to the *true* extrinsic reward, which the agent itself didn't have access to - The learned intrinsic reward function is then passed on to the next "newborn" agent, so that, even though its policy is reinitialized, it has access to past information in the form of the reward function This neatly mimics some of the dynamics of human evolution, where our long-term goals of survival and reproduction are distilled into effective short-term, intrinsic rewards through chemical signals. The idea is, those chemical signals were evolved over millennia of human evolution to be ones that, if followed locally, would result in the best chance of survival. The authors find that they're able to learn intrinsic rewards that "know" that the agent they're attached to will be dropped in an environment with a goal, but won't know its location, and so learn to incentivize searching until a goal is found, and then subsequently moving towards it. This smooth tradeoff between exploration and exploitation is something that can be difficult to strike when balancing an intrinsic, exploration-focused reward against an extrinsic one. While the idea is an interesting one, an uncertainty I have with the paper is whether it'd be likely to scale beyond the simple environments it was trained on. To really learn a useful reward function in complex environments would require huge numbers of timesteps, and it seems like it'd be difficult to effectively assign credit through long lifetimes of learning, even with the lifetime value function used in the paper to avoid needing to mechanically backpropagate through entire lifetimes. It's also worth saying that the idea seems quite similar to a 2017 paper written by Singh et al; having not read that one in detail, I can't comment on the extent to which this work may just build incrementally on that one. |
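Schematically, I understand the inner/outer loop structure to look something like the sketch below; every interface here is a placeholder of mine, and the real method uses a lifetime value function rather than naively summing returns the way this does.

```python
def meta_train_intrinsic_reward(eta, eta_optimizer_step, make_env, make_agent,
                                num_lifetimes=1000, episodes_per_lifetime=20):
    """High-level shape of the inner/outer loop described above (all of the
    interfaces - eta, eta_optimizer_step, make_env, make_agent - are schematic
    placeholders, not the paper's actual API)."""
    for _ in range(num_lifetimes):
        env, agent = make_env(), make_agent()        # "newborn" agent: policy is reinitialized
        history, lifetime_extrinsic_return = [], 0.0

        for _ in range(episodes_per_lifetime):
            trajectory = agent.rollout(env)
            history.extend(trajectory)
            # Inner loop: the agent's policy only ever sees the *intrinsic* reward
            intrinsic_rewards = [eta(history, step) for step in trajectory]
            agent.update_policy(trajectory, intrinsic_rewards)
            lifetime_extrinsic_return += sum(step.extrinsic_reward for step in trajectory)

        # Outer loop: eta itself is updated to maximize lifetime *extrinsic* reward
        eta_optimizer_step(eta, lifetime_extrinsic_return)
```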
[link]
The Transformer architecture - which uses a structure entirely based on key-value attention mechanisms to process sequences such as text - has taken over the worlds of language modeling and NLP in the past three years. However, Transformers at the scale used for large language models have huge computational and memory requirements. This is largely driven by the fact that information at every step in the sequence (or, in the so-far-generated sequence during generation) is used to inform the representation at every other step. Although the same *parameters* are used for each of these pairwise calculations between keys and queries at each step, this is still a pairwise, and thus L^2, calculation, which can get very costly when processing long sequences on the scale of tens of thousands of tokens. This cost comes from both computation and memory, with memory being the primary focus of this paper, because the max memory requirements of a network step dictate the hardware it can be run on, in a way that the pure amount of computation that needs to be performed doesn't. An L^2 attention calculation, as naively implemented in a set of matrix multiplies, not only has to perform L^2 calculations, but has to be able to hold L^2 values in memory while performing the softmax and weighted sum that is the attention aggregation process. Memory requirements in Transformers are also driven by - The high parameter counts of the dense layers within the network, which account for a large share of the parameters since they use more parameters per unit of computation than attention does, and - The fact that passing forward one representation per element in the sequence at each layer necessitates cumulatively keeping all the activations from each layer in the forward pass, so that you can use them to calculate derivatives in the backward pass. This paper, and the "Reformer" architecture it suggests, is less a single idea, and more a suite of solutions targeted at making Transformers more efficient in their use of both compute and memory. 1. The substitution of Locality Sensitive Hashing for normal key-query attention is a strategy for reducing the L^2 compute and memory requirements of the raw attention calculation. The essential idea of attention is "calculate how well the query at position i is matched by the key at every other position, and then use those matching softmax weights to calculate a weighted sum of the representations at each other position". If you consider keys and queries to be in the same space, you can think of this as a similarity calculation between positions, where you want to most highly weight the most similar positions to you in the calculation of your own next-layer value. In this spirit, the authors argue that this weighted sum will be mostly influenced by the highest-similarity positions within the softmax, and so, instead of performing attention over the whole sequence, we can first sort positions into buckets based on the similarity of their key/query vector for a given head, and perform attention weighting within those buckets. https://i.imgur.com/tQJkfGe.png This has the advantage that the first step, of assigning a position's key/query vector to a bucket, can be done for each position individually, rather than with respect to the value at another position. In this case, the bucketing is performed by a Locality Sensitive Hashing algorithm, which works by projecting each position's vector into a lower-dimensional space, and then taking the index at which that projected vector has its max value. 
This is then used as a bucket ID for performing full attention within. This shifts the time complexity of attention from O(L^2) to O(L log L), since for each position in the sequence, you only need to calculate explicit pairwise similarities for the log(L) other elements in its bucket 2. Reversible layers. This addresses the problem of needing to keep activations from each layer around for computing the backward-pass derivatives. It takes an idea used in RevNets, which propose a reversible alternative to the commonly used ResNet architecture. In a normal ResNet scenario, Y = X + F(X), where F(X) is the computation of a single layer or block, and Y are the resulting activations passed to the next layer. In this setup, you can't go back from Y to get the value of X if you discard X for memory reasons, because the difference between the two is a function of X, which you don't have. As an alternative, RevNets define a sort of odd crosswise residual structure, which starts by partitioning X into two components, X1 and X2, and the output Y into Y1 and Y2, and performs the calculation shown below. https://i.imgur.com/EK2vBkK.png This allows you to work backward, getting X2 from Y1 and Y2 (both of which you have as outputs), and then getting X1 from knowing the other three parts. https://i.imgur.com/uLTrdyf.png This means that as soon as you have the activations at a given layer, you can discard earlier layer activations, which makes things a lot more memory efficient. 3. There's also a proposal to do (what I *think* is) a pretty basic chunking of the feedforward calculations across the sequence length, performing them on parts of the sequence rather than the whole thing at once. Processing the whole sequence would be faster with vectorized computing, for parallelization reasons, but chunking is more memory efficient. |
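The reversible coupling itself is simple enough to sketch directly; this is the RevNet-style structure the Reformer borrows, though the actual memory savings depend on a custom backward pass that discards and recomputes activations rather than storing them, which I omit here.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style coupling: inputs can be recomputed exactly from outputs,
    so earlier activations don't need to be kept around for the backward pass."""
    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recover the inputs from the outputs alone
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```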
[link]
I found this paper a bit difficult to fully understand. Its premise, as far as I can follow, is that we may want to use genetic algorithms (GA), where we make modifications to elements in a population, and keep elements around at a rate proportional to some set of their desirable properties. In particular, we might want to use this approach for constructing molecules that have properties (or predicted properties) we want. However, a downside of GA is that it's easy to end up in local optima, where a single molecule, or small modifications to that molecule, end up dominating your population, because everything else gets removed for having less-high reward. The authors' proposed fix for this is to train a discriminator to tell the difference between molecules from the GA population and those from a reference dataset, and then use that discriminator loss, GAN-style, as part of the "fitness" term that's used to determine if elements stay in the population. The rest of the "fitness" term is made up of chemical desiderata - solubility, how easy a molecule is to synthesize, binding efficacy, etc. I think the intuition here is that if the GA produces the same molecule (or similar ones) over and over again, the discriminator will have an easy time telling the difference between the GA molecules and the reference ones. One confusion I had with this paper is that it only really seems to have one experiment supporting its idea of using a discriminator as part of the loss - one where the discriminator wasn't used at all unless the chemical fitness terms plateaued for some defined period (shown below). https://i.imgur.com/sTO0Asc.png The other constrained optimization experiments in section 4.4 (generating a molecule with specific properties, improving a desired property while staying similar to a reference molecule, and drug discovery) don't seem to directly test the value of the discriminator term. They also specifically say that they'd like it to be the case that the beta parameter - which controls the weight of the discriminator relative to the chemical fitness properties - lets you smoothly interpolate between prioritizing properties and prioritizing diversity/realness of molecules, but they find that's not the case, and that, in fact, there's a point at which you move beta a small amount and switch sharply to a regime where chemical fitness values are a lot lower. Plots of the eventual chemical fitness found over time seem to be the highest for models with beta set to 0, which isn't what you'd expect if the discriminator was in fact useful for getting you out of plateaus and into long-term better solutions. Overall, I found this paper an interesting idea, but, especially since it was accepted into ICLR, found it had confusingly little empirical support behind it. |
[link]
This preprint is a bit rambling, and I don't know that I fully followed what it was doing, but here's my best guess: https://i.imgur.com/xC2ryzp.png - We think it's probably the case that SARS-COV2 (COVID19) uses a protease (an enzyme involved in its reproduction) that isn't available to be co-opted from the human body, and that is also quite similar to the comparable protease protein in the original SARS virus. Therefore, it is hoped that we might be able to take inhibitors that bind to SARS, and modify them in small ways to make them bind to SARS-COV2 - The paper notes that it's specifically interested in targeted covalent inhibitors. These are drugs that inhibit the function of a protein by actually covalently binding with the relevant binding pocket, as opposed to most drugs, which by default just fit neatly inside the pocket and occupy it much of the time in equilibrium, but don't actually form permanent, stable covalent bonds with the protein. This class of drugs can be more effective, because its binding is stronger and more permanent, but it can also be more dangerous, because its potential side effects, if it binds with a non-intended protein pocket, can be more severe. - In order to get a covalently-binding drug that fits with the pocket of SARS-COV2, the authors start with a known inhibitor from SARS, and then use reinforcement learning to make modifications to it. The allowed modification actions consist of adding or removing "fragments" rather than atoms, where "fragments" here refers to coherent subcomponents of other drugs from similar families that were broken up according to hand-coded chemical rules. This approach is more stable than just adding or removing individual atoms, because at every stage in generation, the generated molecule will be coherent and chemically sound. - The part I don't fully follow is what they use for the reward function for compounds that are in the process of being made. They specify that they do reward intermediate compounds, rather than just ones at the end of generation, but don't specify what goes into the reward. If I had to guess, I'd imagine it consists of (1) a molecular docking simulation that can't be differentiated through, and thus can't be used directly as a loss function, and/or (2) hand-coded heuristics from chemists for what makes a stable binding |
[link]
This paper, presented this week at ICLR 2020, builds on existing applications of message-passing Graph Neural Networks (GNNs) for molecular modeling (specifically: for predicting quantum properties of molecules), and extends them by introducing a way to represent angles between atoms, rather than just the distances between them that current methods are limited to. The basic version of GNNs on molecule data works by creating features attached to atoms at each level (starting at level 0 with the element-specific embedding of that atom), and constructing "messages" between neighboring atoms that are a function of the neighbor atom's feature vector and the distance between the two neighbors. (This is the minimal version; some methods also include the bond type along with the distance as part of the edge-specific features). At a given layer, an atom's features are updated by applying an update function to both its own prior value and the sum of all the messages it receives from neighbors. The trouble with this method is that it doesn't account for angular relationships between sets of atoms, which are physically important to the quantum properties of a molecule. The naive way you might imagine representing angle is by situating the molecule in a fixed spatial grid, and applying spherical convolutions, so your contribution to a neighbor's features would be based on your spherical distance away. However, this doesn't work, because molecules don't have a canonical frame of reference - there is no fixed left or right, or up and down - and operating in this way would mean that a molecule and its horizontal flip would have different representations. Instead, the authors propose an interesting inversion of the existing approaches: rather than feature vectors being attached to atoms and updated using the features of other atoms, features live on "messages" between pairs of atoms, and are updated by incorporating information from all messages within some local distance window. Importantly, each pair of atoms has a vector associated with their relationship in the molecule, and so when you combine two such messages together, you can calculate the angle between the associated vectors. This angle is invariant to flipping or rotation, because it's defined based on reference points internal to the molecule, which move together when the molecule is moved. https://i.imgur.com/mw46gWz.png Messages are updated from other messages using a combination of the distance between the non-shared endpoints of the messages (that is, if both message vectors share an endpoint i, and go to j and k respectively, this would be the distance between j and k), and the angle between the (i-j) vector and the (i-k) vector. For physics-based reasons I admit I didn't fully follow, these two pieces of information are embedded in a spherical basis function, so messages will update from each other differently based on their relative positions in a sphere. https://i.imgur.com/Tvc7Gex.png The representation of a given atom is then simply the sum of all its incoming messages, conditioned by the distance between the reference atom and the paired neighbor for which the message is defined. A concatenation of atom representations across layers is used to create a final atom representation, which is used for final quantum property prediction. The authors tested on two datasets, and found dramatic improvements, with an average of 31% relative gain over the prior state of the art across different quantum property targets. |
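The geometric inputs themselves are easy to write down; here's a small numpy sketch of the distance and angle, as described above, that get fed (after the spherical-basis embedding, which I omit) into the message update.

```python
import numpy as np

def directional_edge_features(pos_i, pos_j, pos_k):
    """Geometric features for a pair of messages sharing endpoint i, one going
    to j and one going to k: the distance between the far endpoints, and the
    angle between the (i-j) and (i-k) vectors, which is invariant to rotating
    or flipping the whole molecule.
    """
    v_ij = pos_j - pos_i
    v_ik = pos_k - pos_i
    d_jk = np.linalg.norm(pos_k - pos_j)
    cos_angle = v_ij @ v_ik / (np.linalg.norm(v_ij) * np.linalg.norm(v_ik))
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return d_jk, angle
```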
[link]
In the last three years, Transformers, or models based entirely on attention for aggregating information from across multiple places in a sequence, have taken over the world of NLP. In this paper, the authors propose using a Transformer to learn a molecular representation, and then building a model to predict drug/target interaction on top of that learned representation. A drug/target interaction model takes in two inputs - a protein involved in a disease pathway, and a (typically small) molecule being considered as a candidate drug - and predicts the binding affinity they'll have for one another. If binding takes place between the two, that protein will be prevented from playing its role in the chain of disease progression, and the drug will be effective. The mechanics of the proposed Molecular Transformer DTI (or MT-DTI) model work as follows: https://i.imgur.com/ehfjMK3.png - Molecules are represented as SMILES strings, a character-based representation of atoms and their structural connections. Proteins are represented as sequences of amino acids. - A Transformer network is constructed over the characters in the SMILES string, where, at each layer, the representation of each character comes from taking an attention-weighted average of the representations at other positions in the character string at that layer. At the final layer, the aggregated molecular representation is taken from a special "REP" token. - The molecular transformer is pre-trained in a BERT-like way: by taking a large corpus (97M molecules) of unsupervised molecular representations, masking or randomizing tokens within the strings, and training the model to predict the true correct token at each point. The hope is that this task will force the representations to encode information relevant to the global structures and chemical constraints of the molecule in question - This pre-trained Transformer is then plugged into the DTI model, alongside a protein representation model in the form of multiple layers of convolutions acting on embedded representations of amino acids. The authors noted that they considered a similar pretrained transformer architecture for the protein representation side of the model, but that they chose not to because (1) the calculations involved in attention are N^2 in the length of the sequence, and proteins are meaningfully longer than the small molecules being studied, and (2) there's no comparably large dataset of protein sequences that could be effectively used for pretraining - The protein and molecule representations are combined with multiple dense layers, and then produce a final affinity score. Although the molecular representation model starts with a set of pretrained weights, it also fine-tunes on top of them. https://i.imgur.com/qybLKvf.png When evaluated empirically on two DTI datasets, this attention-based model outperforms the prior SOTA, which used a convolutional architecture, by a small but consistent amount across all metrics. Interestingly, their model is reasonably competitive even if they don't fine-tune the molecular representation for the DTI task, but only pretraining and fine-tuning together get the MT-DTI model over the threshold of prior work. |
[link]
In January of this year (2020), DeepMind released a model called AlphaFold, which uses convolutional networks atop sequence-based and evolutionary features to predict protein folding structure. In particular, their model was designed to predict a distribution over how far away each pair of amino acids will be from one another in the final folded structure. Given such a trained model, you can score a candidate structure according to how likely it is under the model, and - if your process for generating candidates is differentiable, as it is in this case - you can directly optimize the structure to increase its probability. https://i.imgur.com/9ZBhqRo.png The distance-prediction model takes as input two main categories of feature: 1. Per-residue features characterizing which amino acid that residue is, based on different techniques that produce either a one-hot amino acid type or a distribution over amino acids. 2. Residue-pair features based on parameters of Multiple Sequence Alignment (MSA) models. I don't deeply understand the details of how the specific models here work, but at a high level: MSA features are based on the evolutionary intuition that residues that make contact within a protein will likely evolve in correlation with one another, and that you can estimate these evolutionary correlations by comparing highly similar proteins (which were likely close in evolutionary time). https://i.imgur.com/h16lPwU.png These features are stacked in an LxL grid, with the per-residue-pair features differing at each point in the grid, and the per-residue features staying constant for a full row or column (since they correspond to a given i for all j). One relevant note here is that proteins can be hundreds or thousands of residues long, and so you can't actually construct a full LxL matrix, either on the input or output end. Instead, the notional full LxL grid is subdivided into a coarser grid of 64-residue square regions, and a single one of these 64x64 regions (which could be either adjacent or far away in the protein) is passed into the model at a time. Given this 64x64x<features> input, the model performs several layers of dilated convolutions - which allow features at a given point in the grid to be informed by information farther away - still in a 2D arrangement. The model then outputs a 64x64 grid (one element for each [i, j] amino acid pair), where each element in the grid is a 64-deep discretized probability distribution over the distance between those two residues. When I say "discretized probability distribution," what I actually mean is "histogram". This discretization of an output distribution, where you predict how much probability mass will be in each possible distance bin, allows for more flexible and finer-grained predicted distributions than, for example, you could get with a continuous Gaussian model centered around a single point. Amusingly, because the model predicts distance histograms for each residue pair, the authors term the output a "distogram". During training, the next-to-last layer of the model is also used to predict per-residue auxiliary features: the accessible surface area of the residue in the folded structure, and the secondary structure type (helix, strand, etc.) that the residue will be a part of. However, these are just used to provide more signal during training, and aren't used for either protein structure optimization or calculation of test scores. 
To actually generate predicted fold structures, the authors construct a generative model of fold structure where each amino acid is assigned two torsion angles that govern its connection to its neighbor. By setting these torsion angles to different values, you can twist and reshape the protein as a whole. Given this generative model, things proceed as you might expect: you generate a candidate, calculate the resulting inter-residue distances, calculate a likelihood of those distances under the model you've learned, and send back a gradient to change your torsion angles to make that likelihood higher. Empirically, the DeepMind authors evaluated on a competition dataset, and specifically compared themselves against other approaches that (like theirs) didn't make predictions for a new protein by comparing against similar templates (Template Modeling, or TM) but instead modeled from raw features (Free Modeling, or FM). AlphaFold was able to achieve the highest accuracy on 24 out of the 43 test domains (where a domain is a distinct structural unit within a protein), compared to the next best method, which only got 14 out of the 43. Definitely still not perfect, since almost half of the test domains were out of its reach, but fairly compelling evidence that there's value to DeepMind's approach. |
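The optimization loop, at least, is easy to sketch; here the `fold` function (torsion angles to coordinates) and the distogram-derived `potential` are assumed to be given, since those are the parts I can't reproduce from the summary alone.

```python
import torch

def optimize_structure(fold, potential, num_residues, steps=1000, lr=1e-2):
    """Gradient-based structure realization, following the loop described above.

    `fold` is a differentiable function from per-residue torsion angles to (L, 3)
    coordinates, and `potential` maps the (L, L) matrix of pairwise distances to
    a scalar score derived from the predicted distogram (in practice a smooth fit
    to the negative log-probabilities, so it's differentiable in the distances).
    Both are assumptions here, not implementations.
    """
    torsion_angles = torch.zeros(num_residues, 2, requires_grad=True)
    optimizer = torch.optim.Adam([torsion_angles], lr=lr)

    for _ in range(steps):
        coords = fold(torsion_angles)              # (L, 3) backbone coordinates
        dists = torch.cdist(coords, coords)        # (L, L) pairwise distances
        loss = potential(dists)                    # low when distances are likely under the model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return torsion_angles.detach()
```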
[link]
Most of the interesting mechanics within living things are mediated by interactions between proteins, making it important and useful to have good predictive models of whether proteins will interact with one another, for validating possible interaction graph structures. Prior methods for this problem - which takes as its input sequence representations of two proteins, and outputs a probability of interaction - have pursued different ideas for how to combine information from the two proteins. On a basic level, you need your method to be invariant to the ordering of the proteins, since the property we care about is a property of a pair of proteins, not a property of a particular ordering of proteins. Some examples of those have included: - A kernel function between some representation of proteins - Representing a protein pair according to whether and how often given k-mer sequences co-occur in both proteins This paper's DPPI method is built on a Siamese network, which applies a single shared set of convolutional layers to each of the two proteins, and then calculates a "binding score" that structurally acts a bit like a similarity score, but with allowances for proteins to be complementary, rather than just similar. In more detail: https://i.imgur.com/8ruY9es.png 1. Crop each protein into multiple overlapping subsequences of length 512 amino acids. Perform all following steps for every combination of cropped subsequences between the two proteins. (If A is divided into A1 and A2, you'd do the following steps for A1 x B and A2 x B and take the max score out of the two) 2. Each cropped protein is represented as a probabilistic sequence. Since we can't get fully certain sequences of what amino acid is at each point in the chain, we instead pass in a 20x512 representation, where at each of the 512 locations we have a distribution over 20 possible amino acids. This tensor is passed into multiple layers of convolutional network, with the same network weights applied to each of the two proteins. 3. A random projection is applied to the outputs of the convolutional network. The features that come out of the projection are conceptually similar to feature maps that might come out of a neural network layer, except that the weights aren't learned. This random projection has a specialized structure, in that it's composed of two (randomly-weighted) networks, A and B, each of which results in feature maps A1...AK and B1...BK. For protein 1, the outputs of the network are ordered A1...AK B1...BK, whereas for protein 2, the weights are swapped, and so the outputs are ordered B1...BK A1...AK. 4. A Hadamard product is taken between the two random projection outputs. This is an element-wise product between matrices (you multiply each element in one matrix by the corresponding element in the other matrix). This is basically like calculating a similarity score between the randomly projected features of the two proteins. One benefit of doing the odd reordering in the prior step is that it breaks symmetry: if we took a dot product between features calculated by a fully shared-weight network, then we'd be looking explicitly for similarity between sequence features, which might not be sufficient to know if proteins interact in a complementary way. Another benefit is that it makes the final fully connected layer (which predicts interaction) agnostic to the order of inputs. (A small numerical sketch of this swap trick appears after this summary.) 
(Caveat: I only follow about 70% of the logic of this.) In the example above, the 1st element will end up being A1(Prot1) x B1(Prot2), and the K+1th element will end up being B1(Prot1) x A1(Prot2). Since multiplication is order-independent, both values 1 and K+1 represent the similarity between the proteins according to features A/B1. 5. Once you have this final representation, feed it into a fully connected layer. https://i.imgur.com/3LsgZNn.png The authors show superior performance to past methods, and even show that they can get 96% accuracy on protein interactions within humans from training on a non-human species, showing that a lot of the biomechanical logic transfers. https://i.imgur.com/REoU3Ab.png They did an ablation test and showed that the random projection layer added value, but also that it was better to have the weights be random than learned, which was surprising to me, and suggests the model as a whole is prone to overfit. |
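Here is a tiny numerical sketch of the swapped-ordering trick from steps 3-4, with made-up dimensions and the convolutional outputs collapsed to single vectors for brevity. The point it illustrates is that swapping the two input proteins only swaps the two halves of the resulting feature vector, so the same information reaches the final layer regardless of input order.

```python
import torch

torch.manual_seed(0)
K, D = 4, 16                      # K random feature maps per branch, D-dim conv output (made up)
A = torch.randn(K, D)             # random projection branch A (weights fixed, not learned)
B = torch.randn(K, D)             # random projection branch B

def pair_features(x1, x2):
    # Protein 1 gets the [A; B] ordering, protein 2 gets [B; A], then a Hadamard product.
    p1 = torch.cat([A @ x1, B @ x1])
    p2 = torch.cat([B @ x2, A @ x2])
    return p1 * p2                # position i: A_i(x1)*B_i(x2); position K+i: B_i(x1)*A_i(x2)

x1, x2 = torch.randn(D), torch.randn(D)    # stand-ins for the two proteins' conv outputs
f12, f21 = pair_features(x1, x2), pair_features(x2, x1)
print(torch.allclose(f12[:K], f21[K:]), torch.allclose(f12[K:], f21[:K]))  # True True
```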
[link]
Prior to this paper, most methods that used machine learning to generate molecular blueprints did so using SMILES representations - a string format with characters representing different atoms and bond types. This preference came about because ML had existing methods for generating strings that could be built on for generating SMILES (a particular syntax of string). However, an arguably more accurate and fundamental way of representing molecules is as graphs (with atoms as nodes and bonds as edges). Dealing with molecules as graphs avoids the problem of a given molecule having many potential SMILES representations (because there's no canonical atom to start working your way around the molecule on), and, hopefully, would have an inductive bias that somewhat more closely matches the actual biomechanical interactions within a molecule. One way you could imagine generating a graph structure is by adding on single components (atoms or bonds) at a time. However, the authors of this paper argue that this approach is harder to constrain to only construct valid molecular graphs, since, in the course of sampling out a molecule, you'd have to go through intermediate stages that you expect to be invalid (for example, bonds with no attached atoms), making it hard to add in explicit validity checks. The alternate approach proposed here works as follows: - Atoms within molecules are grouped into valid substructures, based on a combination of chemically-motivated rules (like treating aromatic rings as a single substructure) and computational heuristics. For the purpose of this paper, substructures are generally either 1) a ring, 2) two particular atoms on either end of a bond, or 3) a "tail" with a bond and an atom (a rough sketch of this kind of decomposition appears after this summary). Importantly, these substructures are designed to be overlapping - if you had an N bonded with O, and then O with C (this example is entirely made up, and I expect chemically incoherent), then you could have "N-O" as one substructure, and "O-C" as another. https://i.imgur.com/yGzRPjT.png - Using these substructures (or clusters), you can form a simplified representation of a molecule, as a connected, non-cyclic junction tree of clusters connected together. This doesn't give you all the information you'd need to construct the molecule - since there could be multiple different ways, on a molecular level, to connect two substructures - but it does give a blueprint of what the molecule will look like. - Given these two representations, the paper proposes a two-step encoding and decoding process. For a given molecule, we encode both the full molecular graph and the simplified junction tree, getting out vectors Zg and Zt respectively. - The first step of decoding generates a tree given the Zt representation. This generation process works via graph message-passing, taking in the Zt vector in addition to whatever part of the tree exists, and predicting a probability for whether that node has a child, and, if it exists, a probability for what cluster is at that child node. Given this parametrized set of probabilities, we can calculate the probability of the junction tree representation of whatever ground truth molecule we're decoding, and train the tree decoder to increase that model likelihood. (Importantly, although we frame this step as "reconstruction," during training, we're not actually sampling discrete nodes and edges, because we couldn't backprop through that; we're just defining a probability distribution and trying to increase the probability of our real data under it). 
- The second step of decoding takes in a tree - which at this point is a set of cluster labels with connections specified between one another - as well as the Zg vector, and generates a full, atom-level graph. This is done by enumerating all the ways that two substructures could be attached (this number is typically small, ≤4), and learning a parametrized function that scores each possible type of connection, based on the full tree "blueprint", the Zg embedding, and the molecule that has been generated so far. - When you want to sample a new molecule, you can draw a sample from the prior distributions of Zg and Zt, and run the decoding process in a sampling mode, actually making discrete draws from the distributions given by your model. https://i.imgur.com/QdSY25u.png The authors perform three empirical tests: ability to successfully sample-reconstruct a given molecule, ability to optimize for a desired chemical property by finding a Z that scores high on that property according to an auxiliary predictive model, and ability to optimize for a property while staying within a given similarity radius to an original anchor molecule. The Junction Tree approach outperforms on all three tasks. On reconstruction, it matches the best alternative method on reconstruction reliability, but with 100% valid molecules, rather than 43.5% on the competing method. Overall, I found this paper really enjoyable and satisfying to read. Occasionally, ML-for-bio papers err on the side of too little domain thought (just throwing the most generic-for-images model structure at a problem), or too little machine learning thought (taking hand-designed features and throwing them at a whole range of models), whereas I think this one struck a nice balance: some amount of domain knowledge (around what makes for valid substructures), embedded in a complex and thoughtfully designed neural network framework. |
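As a rough illustration of the "cluster" vocabulary (not the paper's actual tree-decomposition algorithm, which has additional merging rules), here is a small RDKit sketch that treats rings and non-ring bonds as overlapping substructures:

```python
from rdkit import Chem  # assumes RDKit is installed

def simple_clusters(smiles):
    # Very rough stand-in for the paper's decomposition: each ring is a cluster,
    # and each bond outside a ring is a two-atom cluster; clusters that share an
    # atom overlap, which is what lets them be stitched into a junction tree.
    mol = Chem.MolFromSmiles(smiles)
    clusters = [set(ring) for ring in mol.GetRingInfo().AtomRings()]
    for bond in mol.GetBonds():
        if not bond.IsInRing():
            clusters.append({bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()})
    return clusters

print(simple_clusters("c1ccccc1O"))  # the benzene ring, plus the ring-carbon/oxygen bond
```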
[link]
This paper's proposed method, the cleverly named ORGAN, combines techniques from GANs and reinforcement learning to generate candidate molecular sequences that exhibit desirable properties while still remaining plausibly on-distribution. Prior papers I've read on molecular generation have by and large used approaches based in maximum likelihood estimation (MLE) - where you construct some distribution over molecular representations, and maximize the probability of your true data under that distribution. However, MLE methods can be less powerful when it comes to incentivizing your model to precisely conform to structural details of your target data distribution. Generative Adversarial Networks (GANs), on the other hand, use a discriminator loss that directly penalizes your generator for being recognizably different from the true data. However, GANs have historically been difficult to use on data like the string-based molecular representations used in this paper. That's because strings are made up of discrete characters, which need to be sampled from underlying distributions, and we don't naively have good ways of making sampling from discrete distributions a differentiable process. SeqGAN was proposed to remedy this: instead of using the discriminator loss directly as the generator's loss - which would require backpropagating through the sampling operation - the generator is trained with reinforcement learning, using the discriminator score as a reward function. Each addition of an element to the sequence - or, in our case, each addition of a character to our molecular representation string - represents an action, and full sequences are rewarded based on the extent to which they resemble true sequences. https://i.imgur.com/dqtcJDU.png This paper proposes taking that model as a base, but then adding a more explicitly goal-oriented reward on top of it, incentivizing the model to produce molecules that have certain predicted properties, as determined by an (also decidedly non-differentiable) external molecular simulator. So: just take a weighted sum of discriminator score and property reward, and use that as your RL reward. After all, if you're already using the policy gradient structures of RL to train the underlying generator, you might as well add on some more traditional-looking RL rewards. The central intuition behind using RL in both of these cases is that it provides a way of using training signals that are unknown or - more to the point - non-differentiable functions of model output. In their empirical tests focusing on molecules, the authors target the RL to optimize for one of solubility, synthesizability, and druggability (three well-defined properties within molecular simulator RDKit), as well as for uniqueness, penalizing any molecule that had been generated already. https://i.imgur.com/WszVd1M.png For all that this is an interesting mechanic, the empirical results are more equivocal. Compared to Naive RL, which directly optimizes for reward without the discriminator loss, ORGAN (or ORWGAN, the better-performing variant using a Wasserstein GAN) doesn't produce valid strings at a notably higher rate, and (as you would expect) performs comparably or slightly worse when it comes to optimizing the underlying reward. It does exhibit higher diversity than naive RL on two of the three tasks, but it's hard to get an intuition for the scales involved, and how much that scale of diversity increase would impact real results. |
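A minimal sketch of the combined objective, with hypothetical names and a made-up weighting: the per-sequence reward is a convex combination of the discriminator's "realness" score and an external property score, and the generator is updated with a plain REINFORCE-style policy-gradient loss.

```python
import torch

lam = 0.5  # trade-off between "look like real molecules" and "score well" (made up)

def organ_reward(discriminator_prob, property_score):
    # Both inputs are per-sequence scalars in [0, 1]; neither needs to be differentiable
    # with respect to the generator, which is the point of using RL here.
    return lam * discriminator_prob + (1 - lam) * property_score

def policy_gradient_loss(seq_log_probs, rewards):
    # seq_log_probs: (batch,) log-probabilities of sampled SMILES strings under the generator;
    # rewards: (batch,) combined rewards. Subtracting a mean baseline reduces variance.
    advantages = rewards - rewards.mean()
    return -(seq_log_probs * advantages).mean()
```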
[link]
Over the past few days, I've been reading about different generative neural networks being tried out for molecular generation. So far this has mostly focused on latent variable space models like autoencoders, but today I shifted attention to a different approach rooted in reinforcement learning. The goal of most of these methods is 1) to build a generative model that can sample plausible molecular structures, but more saliently 2) specifically generate molecules optimized to exhibit some property of interest. The two autoencoder methods I read about did this by building a model to predict properties from latent space, and then optimizing the latent space vector to push up the value of those predictions. A central difficulty here, and something that was a challenge for the autoencoder methods I read about, is explicitly incentivizing structurally valid molecular representations when going "off distribution" in search of molecules not in your training set that you predict will be better along some axis. Optimizing in any direction - particularly a direction governed by an imperfect predictive model - without constraints is likely to lead the search toward easy edge cases of your property-prediction model, rather than toward more difficult, truly valid and novel structures. https://i.imgur.com/NafoeDr.png An advantage of using reinforcement learning as a framework here is that, because your loss doesn't need to be a continuous analytic function of your outputs, you can explicitly add molecular validity, as calculated by some external program, as part of your reward signal. This allows you to penalize a model for optimizing away from valid outputs. The specific approach proposed by the authors of this paper has two phases of training. 1) An RNN sequence model trained to do character-by-character prediction of SMILES strings (a character-based molecular representation). This is just a probability distribution over SMILES strings, with no RL involved yet, and is referred to as the Prior. 2) Taking that pretrained sequence model, caching it, and then fine-tuning on top with a hybrid RL and maximum likelihood loss. As seen in the equation below, this loss creates a hybrid, posterior-esque likelihood that combines the probability of an action sequence (where an action is "the choice of next character given currently generated string") under the prior with the reward (or "score", S(A)) of the action sequence, and tries to push the probability under the learned policy to be closer to that hybrid likelihood. (A small sketch of this loss appears at the end of this summary.) https://i.imgur.com/U4ZvKsJ.png https://i.imgur.com/b28Ea7m.png The notion here is that by including the prior in your RL loss, you keep your generated molecules closer to the learned molecular distribution, rather than letting it push towards edge cases that are improbable, but not in ways you were able to directly disincentivize with your reward function. This is an interesting framing of this problem as having two failure modes: generating molecules that are structurally invalid, in that they don't correspond to syntactically feasible representations, and generating molecules that are technically feasible but are unlikely under the actual distribution of molecules, which captures more nuanced and hard-to-hardcode facts about energetic feasibility. The authors experiment with three tasks: - Learning to avoid structures that contain sulphur (with a reward function that penalizes both invalid molecules and the presence of sulphur). 
On this task, they show that methods that make use of the prior (compared to ablations that are only incentivized to increase reward, or that are pulled towards the prior in a less direct way) do a better job of solving the problem in realistic ways rather than overly simplified ones. - Learning to generate structures with high similarity to a particular reference molecule. Here, they perform an interesting test where they remove the reference molecule and things similar to it from the training set of the Prior, which leads to the model not immediately falling into the easy solution of just generating exact copies of the reference molecule, but instead more interesting similar-but-not-identical analogues - Learning to generate structures that are predicted - by a separate predictive model - to be active against a target of interest. A similar Prior-limitation test was performed, where all the true positives from the bioactivity model are removed from sequence training, and this led to a more diverse set of solutions that did less of just mimicking the existing known positives Overall, while this paper was relatively straightforward from a machine learning perspective, I enjoyed it, thought the methods were a sensible improvement over prior work I'd read, and that the evaluations performed were an interesting test of some of the paper's ideas. |
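Here is a minimal sketch of the hybrid loss referenced above, as I understand it: an "augmented" target log-likelihood is built from the frozen Prior plus a scaled score, and the agent is trained to match it. The name and value of sigma are assumptions on my part, so treat this as an approximation rather than the paper's exact formulation.

```python
import torch

sigma = 20.0  # assumed weighting between prior likelihood and score

def augmented_likelihood_loss(prior_logp, agent_logp, score):
    # prior_logp, agent_logp: (batch,) sequence log-likelihoods under the frozen Prior
    # and the fine-tuned Agent; score: (batch,) reward S(A), e.g. in [0, 1].
    augmented = prior_logp.detach() + sigma * score   # hybrid, posterior-esque target
    return ((augmented - agent_logp) ** 2).mean()     # pull the agent toward the target
```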
[link]
I'll admit that I found this paper a bit of a letdown to read, relative to expectations rooted in its high citation count, and my general excitement and interest to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of how the authors' approach works. The method proposed is basically a very straightforward Variational Auto Encoder, or VAE. It takes in a textual SMILES string representation of a molecular structure, uses an encoder to map that into a continuous vector representation, a decoder to map the vector representation back into a SMILES string, and an auxiliary predictor to predict properties of a molecule given the continuous representation. So, the training loss is a combination of the reconstruction loss (log probability of the true molecule under the distribution produced by the decoder) and the semi-supervised predictive loss. The hope with this model is that it would allow you to sample from a space of potential molecules by starting from an existing molecule, and then optimizing the vector representation of that molecule to make it score higher on whatever property you want to optimize for. https://i.imgur.com/WzZsCOB.png The authors acknowledge that, in this setup, you're just producing a probability distribution over characters, and that the continuous vectors sampled from the latent space might not actually map to valid SMILES strings, and beyond that may well not correspond to chemically valid molecules. Empirically, they said that the proportion of valid generated molecules ranged between 1% and 70%. But they argue that it'd be too difficult to enforce those constraints, and instead just sample from the model and run the results through a hand-designed filter for molecular validity. In my view, this is the central weakness of the method proposed in this paper: that they seem to have not tackled the question of either chemical viability or even syntactic correctness of the produced molecules. I found it difficult to nail down from the paper what the ultimate percentage of valid molecules was from points in latent space that were off of the training set. A table reports "percentage of 5000 randomly-selected latent points that decode to valid molecules after 1000 attempts," but I'm confused by what the 1000 attempts means here - does that mean we draw 1000 samples from the distribution given by the decoder, and see if *any* of those samples are valid? That would be a strange metric, if so, and perhaps it means something different, but it's hard to tell. https://i.imgur.com/9sy0MXB.png This paper made me really curious to see whether a GAN could do better in this space, since it would presumably be better at the task of incentivizing syntactic correctness of produced strings (given that any deviation from correctness could be signal for the discriminator), but it might also lead to issues around mode collapse, and when I last checked the literature, GANs on text data in particular were still not great. |
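To make the architecture concrete, here is a heavily simplified PyTorch sketch of a character-level VAE with an auxiliary property head; all module choices, sizes, and loss weights are made up rather than taken from the paper.

```python
import torch
import torch.nn as nn

VOCAB, MAXLEN, LATENT = 35, 120, 56    # made-up vocabulary size, string length, latent size

class MolVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.GRU(VOCAB, 128, batch_first=True)
        self.to_mu = nn.Linear(128, LATENT)
        self.to_logvar = nn.Linear(128, LATENT)
        self.dec_in = nn.Linear(LATENT, 128)
        self.dec = nn.GRU(128, 128, batch_first=True)
        self.to_char = nn.Linear(128, VOCAB)
        self.prop = nn.Linear(LATENT, 1)           # auxiliary property predictor

    def forward(self, x):                          # x: (batch, MAXLEN, VOCAB) one-hot SMILES
        _, h = self.enc(x)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        dec_inp = self.dec_in(z).unsqueeze(1).repeat(1, MAXLEN, 1)
        logits = self.to_char(self.dec(dec_inp)[0])               # per-position character logits
        return logits, self.prop(z), mu, logvar

def loss_fn(logits, x, prop_pred, prop_true, mu, logvar):
    recon = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), x.argmax(-1).reshape(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    prop = nn.functional.mse_loss(prop_pred.squeeze(-1), prop_true)
    return recon + kl + prop                       # loss weightings omitted for brevity
```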
[link]
In the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them. All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bonds are sundered and the energy of electrons at different electron shells. https://i.imgur.com/Mmp8KO6.png An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large of a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network can do it much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism. The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the molecule-summarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in. https://i.imgur.com/J7s0q2e.png Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways (a toy sketch of this decomposition appears at the end of this summary). 1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only order-invariant option). 2. The Update function, U(). 
This function governs how to take the aggregated message vector sent to a particular node, and combine that with the prior-layer representation at that node, to come up with a next-layer representation at that node. Similarly, the same Update function weights are shared across all atoms. 3. The Readout function, R(), which takes the final-layer representation of each atom node and aggregates the representations into a final graph-level representation in an order-invariant way. Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures. - Does the message function being sent from w to v depend on the feature value at both w and v, or just w? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper) seems like a quite valuable addition (in that all subsequent papers include it) - Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge - Is the Readout function something as simple as a summation of all top-level feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses item-to-item attention within the set of final-layer atom representations to construct an aggregated graph-level embedding. The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearest-by atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the free-energy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a non-spatial-information network that can perform reasonably well - reaching chemical accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, raises the question of whether it'd ever be possible to learn representations that can match the performance of spatially-informed ones without explicitly providing that information. |
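Here is a toy sketch of the message/update/readout decomposition with made-up dimensions: messages are computed from sender, receiver, and edge features, summed per receiver, fed through a GRU-style update, and a readout maps the summed final node states to a graph-level prediction. It's meant to show the shape of the framework, not any particular paper's variant.

```python
import torch
import torch.nn as nn

class TinyMPNN(nn.Module):
    def __init__(self, node_dim=16, edge_dim=4, steps=3):
        super().__init__()
        self.steps = steps
        self.message = nn.Sequential(                 # M(h_v, h_w, e_vw)
            nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())
        self.update = nn.GRUCell(node_dim, node_dim)  # U(h_v, m_v)
        self.readout = nn.Linear(node_dim, 1)         # R over summed node states

    def forward(self, h, edge_index, edge_attr):
        # h: (N, node_dim) node features; edge_index: (2, E) long tensor of sender/receiver
        # indices; edge_attr: (E, edge_dim) edge features.
        src, dst = edge_index
        for _ in range(self.steps):
            msg = self.message(torch.cat([h[dst], h[src], edge_attr], dim=-1))
            agg = torch.zeros_like(h).index_add(0, dst, msg)   # sum messages per receiver node
            h = self.update(agg, h)
        return self.readout(h.sum(dim=0))                      # graph-level prediction
```

In the paper's taxonomy, the different published architectures then correspond to different concrete choices for these three modules.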
[link]
The goal of one-shot learning tasks is to design a learning structure that can perform a new task (or, more canonically, add a new class to an existing task) using only a small number of examples of the new task or class. So, as an example: you'd want to be able to take one positive and one negative example of a given task and correctly classify subsequent points as either positive or negative. A common way of achieving this, and the way that the paper builds on, is to learn a parametrized function projecting both your labeled points (your "support set") and your unlabeled point (your "query") into an embedding space, and then assigning a class to your query according to how close it is to the support set points associated with each label. The hope is that, in the course of training on different but similar tasks, you've learned a metric space where nearby things tend to be of similar classes. This method is called a "matching network". This paper has the specific objective of using such one-shot methods for drug discovery, and evaluates on tasks drawn from that domain, but most of the mechanics of the paper can be understood without reference to molecular data in particular. In the simplest version of such a network, the query and support set points are embedded unconditionally - meaning that the query would be embedded in the same way regardless of the values in the support set, and that each point in the support set would be embedded without knowledge of each other. However, given how little data we're giving our model to work with, it might be valuable to allow our query embedder (f(x)) and support set embedder (g(x)) to depend on the values within the support set. Prior work had achieved this by: 1) Creating initial f'(x) and g'(x) query and support embedders. 2) Concatenating the embedded support points g'(x) into a single vector and running a bidirectional LSTM over the concatenation, which results in a representation g(x) of each input that incorporates information from g'(x_i) for all other x_i (albeit in a way that imposes a concatenation ordering that may not correspond to a meaningful order) 3) Calculating f(x) of your query point by using an attention mechanism to combine f'(x) with the contextualized embeddings g(x) The authors of the current paper argue that this approach is suboptimal because of the artificially imposed ordering, and because it calculated g(x) prior to f(x) using asymmetrical model structures (though it's not super clear why this latter point is a problem). Instead, they propose a somewhat elaborate and difficult-to-follow attention-based mechanism. As best as I can understand, this is what they're suggesting: https://i.imgur.com/4DLWh8H.png 1) Update the query embedding f(x) by calculating an attention distribution over the vector of current embeddings of support set points (here referred to as bolded <r>), pooling downward to a single aggregate embedding vector r, and then using an LSTM that takes in that aggregate vector and the prior update to generate a new update. This update, dz, is added to the existing query embedding estimate to get a new one 2) Update the vector of support set embeddings by iteratively calculating an attention mapping between the vector of current support set embeddings and the original features g'(S), and using that attention mapping to create a new <r>, which, similar to the above, is fed into an LSTM to calculate the next update. 
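For reference, here is a minimal sketch of the basic matching-network classification step that all of these variants build on (not the paper's iterative refinement): embed query and support points, attend over the support set by similarity, and take a label-weighted vote. Shapes and the choice of cosine similarity are assumptions.

```python
import torch

def matching_network_predict(query_emb, support_embs, support_labels, n_classes):
    # query_emb: (D,), support_embs: (S, D), support_labels: (S,) long tensor of class ids.
    # Attention over the support set via cosine similarity, then a label-weighted vote.
    sims = torch.nn.functional.cosine_similarity(query_emb.unsqueeze(0), support_embs, dim=-1)
    attn = torch.softmax(sims, dim=0)                             # (S,) attention weights
    one_hot = torch.nn.functional.one_hot(support_labels, n_classes).float()
    return attn @ one_hot                                         # predicted class distribution
```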
Since the model is evaluated on molecular tasks, all of the embedding functions are structured as graph convolutions. Other than the obvious fact that attention is a great way of aggregating information in an order-independent way, the authors give disappointingly little justification of why they would expect their method to work meaningfully better than past approaches. Empirically, they do find that it performs slightly better than prior contextualized matching networks on held-out tasks of predicting toxicity and side effects, with only a small number of examples from the held-out task. However, neither this paper's new method nor previous one-shot learning work is able to perform very well on the challenging MUV dataset, where held-out binding tasks involve structurally dissimilar molecules from those seen during training, suggesting that whatever generalization this method is able to achieve doesn't quite rise to the task of making inferences based on molecules with different structures. |
[link]
This is a paper released by the creators of the DeepChem library/framework, explaining the efforts they've put into facilitating straightforward and reproducible testing of new methods. They advocate for consistency between tests on three main axes. 1. On the most basic level, that methods evaluate on the same datasets 2. That they use canonical train/test splits 3. That they use canonical metrics. To that end, they've integrated a framework they call "MoleculeNet" into DeepChem, containing standardized interfaces to datasets, metrics, and test sets. **Datasets** MoleculeNet contains 17 different datasets, where "dataset" here just means a collection of data labeled for a certain task or set of tasks. The tasks fall into one of four groups: - quantum mechanical prediction (atomization energy, spectra) - prediction of properties of physical chemistry (solubility, lipophilicity) - prediction of biophysical interactions like binding affinity - prediction of human-level physiological properties (toxicity, side effects, whether it passes the blood brain barrier) An interesting thing to note here is that only some datasets contain 3D orientations of molecules, because spatial orientations are properties of *a given conformation* of a molecule, and while some output measures (like binding geometry) depend on 3D arrangement, others (like solubility) don't. **Metrics** The metrics chosen were pretty straightforward - Root Mean Squared Error or Mean Absolute Error for continuous prediction tasks, and ROC-AUC or PRC-AUC for classification ones. The only notable nuance was that the paper argued for PRC-AUC as the standard metric for datasets with a low number of positives, since that metric is the strictest on false positives. **Test/Train Split** Most of these were fairly normal - random split and time-based split - but I found the idea of a scaffold split (where you cluster molecules by structural similarity, and assign each cluster to either train or test) interesting (a rough sketch of such a split appears at the end of this summary). The idea here is that if molecules are similar enough to one another, seeing one of a pair during training might be comparable to seeing an actual shared example between training and test, and have the same propensity for overconfident results. **Models** DeepChem has put together implementations of a number of standard machine learning methods (SVM, Random Forest, XGBoost, Logistic Regression) on molecular features, as well as a number of molecule-specific graph-structured methods. At a high level, these are: https://i.imgur.com/x4yutlp.png - Graph Convolutions, which update atom representations by combining transformations of the features of bonded neighbor atoms - DAGs, which create an "atom-centric" graph for each atom in the molecule and "pull" information inwards from farther away nodes (for the record, I don't fully follow how this one works, since I haven't read the underlying paper) - Weave Model, which maintains both atom representations and pair representations between all pairs of atoms, not just ones bonded to one another, and updates each in a cross-cutting way: updating an atom representation from all of its pairs (as well as itself), and then updating a pair representation from the atoms in its pairing (as well as itself). 
This has the benefit of making information from far-away atoms available immediately, rather than having to propagate through a graph, but is also more computationally taxing - Message Passing Neural Network, which operates like Graph Convolutions except that the feature transform used to pull in information from neighboring atoms changes depending on the type of the bond between atoms - Deep Tensor Neural Network - Instead of bonds, this approach represents atoms in 3D space, and pulls in information based on other atoms nearby in spatial distance **Results** As part of creating its benchmark, MoleculeNet also tested its implementations of its models on all its datasets. It's interesting to see the extent to which the results form a narrative, in terms of which tasks benefit most from flexible structure-based methods (like graph approaches) vs hand-engineered features. https://i.imgur.com/dCAdJac.png Predictions of quantum mechanical properties and properties of physical chemistry do consistently better with graph-based methods, potentially suggesting that the features we've thought to engineer aren't in line with useful features for those tasks. By contrast, on biophysical tasks, hand-engineered features combined with traditional machine learning mostly come out on top, a fact I found a bit surprising, given the extent to which I'd read about deep learning methods claiming strong results on prediction of things like binding affinity. This was a useful pointer to areas where I should do more work to get clarity. And, when it came to physiological properties like toxicity and side effects, results are pretty mixed between graph-based and traditional methods. |
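As a rough sketch of the scaffold-split idea mentioned above (simplified relative to MoleculeNet's actual splitter, and assuming RDKit is available):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    # Group molecules by Bemis-Murcko scaffold, then assign whole groups to train
    # or test, so that near-duplicate ring systems never straddle the split.
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    train, test = [], []
    n_train = int((1 - test_frac) * len(smiles_list))
    for members in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) < n_train else test).extend(members)
    return train, test
```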
[link]
This paper was published after the 2015 Duvenaud et al paper proposing a differentiable alternative to circular fingerprints of molecules: substituting out exact-match random hash functions to identify molecular structures with learned convolutional-esque kernels. As far as I can tell, the Duvenaud paper was the first to propose something we might today recognize as graph convolutions on atoms. I hoped this paper would build on that one, but it seems to be coming from a conceptually different direction, and it seems like it was more or less contemporaneous, for all that it was released later. This paper introduces a structure that allows for more explicit message passing along bonds, by calculating atom features as a function of their incoming bonds, and then bond features as a function of their constituent atoms, and iterating this procedure, so information from an atom can be passed into a bond, then, on the next iteration, pulled in by another atom on the other end of that bond, and then pulled into that atom's bonds, and so forth. This has the effect of, similar to a convolutional or recurrent network, creating representations for each atom in the molecular graph that are informed by context elsewhere in the graph, to different degrees depending on distance from that atom. More specifically, it defines: - A function mapping from a prior layer atom representation to a subsequent layer atom representation, taking into account only information from that atom (Atom to Atom) - A function mapping from a prior layer bond representation (indexed by the two atoms on either side of the bond) to a subsequent layer bond representation, taking into account only information from that bond at the prior layer (Bond to Bond) - A function creating a bond representation by applying a shared function to the atoms at either end of it, and then combining those representations with an aggregator function (Atoms to Bond) - A function creating an atom representation by applying a shared function to all the bonds that atom is a part of, and then combining those results with an aggregator function (Bonds to Atom) At the top of this set of layers, when each atom has had information diffused into it by other parts of the graph, depending on the network depth, the authors aggregate the per-atom representations into histograms (basically, instead of summing or max-pooling feature-wise, creating coarse distributions of each feature), and use that for supervised tasks. One frustration I had with this paper is that it doesn't do a great job of highlighting its differences with and advantages over prior work; in particular, I think it doesn't do a very good job arguing that its performance is superior to the earlier Duvenaud work. That said, for all that the presentation wasn't ideal, the idea of message-passing is an important one in graph convolutions, and will end up becoming more standard in later works. |
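A compact sketch of the alternating atom/bond updates described above, with made-up dimensions and parameterizations (the paper's actual functions and aggregators differ in detail):

```python
import torch
import torch.nn as nn

class AtomBondLayer(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.atom_to_atom = nn.Linear(dim, dim)     # Atom -> Atom
        self.bond_to_bond = nn.Linear(dim, dim)     # Bond -> Bond
        self.atom_to_bond = nn.Linear(dim, dim)     # shared per-endpoint transform, summed
        self.bond_to_atom = nn.Linear(dim, dim)     # shared per-bond transform, summed

    def forward(self, atom_h, bond_h, bonds):
        # atom_h: (N, dim) atom states, bond_h: (E, dim) bond states,
        # bonds: (E, 2) long tensor giving the two atom indices for each bond.
        endpoint_sum = self.atom_to_bond(atom_h[bonds[:, 0]]) + self.atom_to_bond(atom_h[bonds[:, 1]])
        new_bond = torch.relu(self.bond_to_bond(bond_h) + endpoint_sum)
        incoming = torch.zeros_like(atom_h)
        for end in (0, 1):                          # each bond sends a message to both endpoints
            incoming = incoming.index_add(0, bonds[:, end], self.bond_to_atom(new_bond))
        new_atom = torch.relu(self.atom_to_atom(atom_h) + incoming)
        return new_atom, new_bond
```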
[link]
If you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of the lineage of the technique within this domain. This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, message-passing networks to the state of the art standard at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach, and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible and more general version of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity. Circular Fingerprinting (or, more properly, Extended-Connectivity Circular Fingerprints) is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by: 1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with bonded neighbors ordered quasi-randomly) 2) Calculating next-level features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer 3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2) 4) Iterating this process at progressively higher layers, using the hashed features from the layer below as the new inputs. The effect of this is to assign each index of the vector to a binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a structure within a given atom's neighborhood. Its main downside is that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lower-level feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying. https://i.imgur.com/V8FpfVE.png Duvenaud et al present their alternative in terms of keeping a similar structure, but swapping out fixed and binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an order-invariant operation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors. This will no longer look for exact matches, but will activate to the extent a structure within an atom's neighborhood matches against a kernel pattern. Then, these features are put into a softmax, which, instead of setting an index of a vector to a sharp one value, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations for each atom. 
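Here is a minimal sketch of one such differentiable fingerprint layer, with made-up sizes: neighbors are summed (order-invariant), a shared learned transform replaces the hash function, and a softmax replaces the hard index assignment.

```python
import torch
import torch.nn as nn

class NeuralFingerprintLayer(nn.Module):
    def __init__(self, atom_dim=16, fp_dim=32):
        super().__init__()
        self.transform = nn.Linear(atom_dim, atom_dim)     # plays the role of the hash function
        self.to_fp = nn.Linear(atom_dim, fp_dim)           # soft "index" assignment

    def forward(self, atom_h, adjacency):
        # atom_h: (N, atom_dim) atom features; adjacency: (N, N) 0/1 matrix of bonds.
        neighborhood = atom_h + adjacency @ atom_h          # self plus sum of neighbors (order-invariant)
        new_h = torch.relu(self.transform(neighborhood))    # updated atom representations
        fingerprint = torch.softmax(self.to_fp(new_h), dim=-1).sum(dim=0)  # (fp_dim,) molecule fingerprint
        return new_h, fingerprint
```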
The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between atoms according to either the circular or neural fingerprint (which had a high correlation), and confirming that it performs similarly to circular fingerprints on a set of supervised learning tasks on molecules. When they trained weights to be better than random on three such supervised tasks, they found that their model was comparable to or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error). This really is the simplest possible version of a message-passing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighbor-connection according to the features of each node, etc), but it's really satisfying to see it laid out as a next-step alternative that offered value just by stepping away from exact-match feature dynamics and fixed random hash functions, even without all the sophisticated additions that would later be added to such models.
|
[link]
My objective in reading this paper was to gain another perspective on, and thus a more well-grounded view of, machine learning scoring functions for docking-based prediction of ligand/protein binding affinity. As quick background context, these models are useful because many therapeutic compounds act by binding to a target protein, and it can be valuable to prioritize doing wet lab testing on compounds that are predicted to have a stronger binding affinity. Docking systems work by predicting the pose in which a compound (or ligand) would bind to a protein, and then scoring prospective poses based on how likely such a pose would be to have high binding affinity. It's important to note that there are two predictive components in such a pipeline, and thus two sources of potential error: the searching over possible binding poses, done by physics-based systems, and scoring of the affinity of a given pose, assuming that were actually the correct one. Therefore, in the second kind of modeling, which this paper focuses on, you take in features *of a particular binding pose*, which includes information like which atoms of the compound are nearby to which atoms of the protein. The actual neural network structure used here was admittedly a bit underwhelming (though, to be fair, many of the ideas it seems to be gesturing at wouldn't be properly formalized until Graph Convolutional Networks came around). I'll describe the network mechanically first, and then offer some commentary on the design choices. https://i.imgur.com/w9wKS10.png 1. For each atom (a) in the compound, a set of neighborhood features are defined. The neighborhood is based on two hyperparameters, one for "how many atoms from the protein should be included," and one for "how many atoms from the compound should be included". In both cases, you start by adding the closest atom from either the compound or protein, and as hyperparameter values of each increase, you add in farther-away atoms. The neighborhood features here are (i) What are the types of the atoms? (ii) What are the partial charges of the atoms? (iii) How far are the atoms from the reference atom? (iv) Which amino acid within the protein do the protein atoms come from? 2. All of these features are turned into embeddings. Yes, all of them, even the ones (distance and charge) that are continuous values. Coming from a machine learning perspective, this is... pretty weird as a design choice. The authors straight-up discretize the distance values, and then use those as discrete values for the purpose of looking up embeddings. (So, you'd have one embedding vector for distances 0.25-0.5, and a different one for distances 0.0-0.25, say.) 3. The embeddings are concatenated together into a single "atom neighborhood vector" based on a predetermined ordering of the neighbor atoms and their property vectors. We now have one atom neighborhood vector for each atom in the compound. 4. The authors then do what they call a convolution over the atom neighborhood vectors. But it doesn't act like a normal convolution in the sense of mixing information from nearby regions of atom space. It's basically a fully connected layer that's applied to each atom neighborhood vector separately, but with shared weights, so the same layer is applied to each neighborhood vector (a small sketch of this step appears after this summary). They then do a feature-wise max pool across the layer-transformed version of neighborhood vectors, getting you one vector for the full compound. 5. 
This single vector is then put into a softmax, which predicts whether this ligand (in this particular pose) will have strong binding with the protein. Some thoughts on what's going on here. First, I really don't have a good explanation for why they'd have needed to embed a discretized version of the continuous variables, and since they don't do an ablation test of that design choice, it's hard to know if it mattered. Second, it's interesting to see, in their "convolution" (which I think is more accurately described as a Siamese Network, since it's only convolution-like insofar as there are shared weights), the beginning intuitions of what would become Graph Convolutions. The authors knew that they needed methods to aggregate information from arbitrary numbers of atoms, and also that they should learn representations that have visibility onto neighborhoods of atoms, rather than single ones, but they do so in an entirely hand-engineered way: manually specifying a fixed neighborhood and pulling in information from all those neighbors equally, in a big concatenated vector. By contrast, when Graph Convolutions come along, they act by defining a "message-passing" function for features to aggregate across graph edges (here: molecular bonds or binaries on being "near enough" to another atom), which similarly allows information to be combined across neighborhoods. And, then, the 'convolution' is basically just a simple aggregation: necessary because there's no canonical ordering of elements within a graph, so you need an order-agnostic aggregation like a sum or max pool. The authors find that their method is able to improve on the hand-designed scoring functions within the docking programs. However, they also find (similar to another paper I read recently) that their model is able to do quite well without even considering structural relationships of the binding pose with the protein, which suggests the dataset (DUD - a dataset of 40 proteins with ~4K correctly binding ligands, and ~35K ligands paired with proteins they don't bind to) and the problem given to the model are too easy. It's also hard to tell how I should consider AUCs within this problem - it's one thing to be better than an existing method, but how much value do you get from a given unit of AUC improvement, when it comes to actually meaningfully reducing wet lab time used on testing compounds? I don't know that there's much to take away from this paper in terms of useful techniques, but it is interesting to see the evolution of ideas that would later be more cleanly formalized in other works. |
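For concreteness, here is a stripped-down sketch of the shared-weight "convolution" plus max-pool from steps 4-5, with made-up sizes; the real model has the embedding machinery in front of this and more layers.

```python
import torch
import torch.nn as nn

class NeighborhoodScorer(nn.Module):
    def __init__(self, neigh_dim=64, hidden=128):
        super().__init__()
        self.shared = nn.Linear(neigh_dim, hidden)    # same weights applied to every atom's vector
        self.out = nn.Linear(hidden, 2)               # active / inactive

    def forward(self, neighborhoods):
        # neighborhoods: (num_compound_atoms, neigh_dim) concatenated neighborhood embeddings.
        per_atom = torch.relu(self.shared(neighborhoods))
        pooled, _ = per_atom.max(dim=0)               # feature-wise max pool over atoms
        return torch.softmax(self.out(pooled), dim=-1)
```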
[link]
This paper focuses on the application of deep learning to the docking problem within rational drug design. The overall objective of drug design or discovery is to build predictive models of how well a candidate compound (or "ligand") will bind with a target protein, to help inform the decision of what compounds are promising enough to be worth testing in a wet lab. Protein binding prediction is important because many small-molecule drugs, which are designed to be small enough to get through cell membranes, act by binding to a specific protein within a disease pathway, and thus blocking that protein's mechanism. The formulation of the docking problem, as best I understand it, is: 1. A "docking program," which is generally some model based on physical and chemical interactions, takes in a (ligand, target protein) pair, searches over a space of ways the ligand could orient itself within the binding pocket of the protein (which way is it facing, where is it twisted, where does it interact with the protein, etc), and ranks them according to plausibility 2. A scoring function takes in the binding poses (otherwise known as binding modes) ranked the highest, and tries to predict the affinity strength of the resulting bond, or the binary of whether a bond is "active". The goal of this paper was to interpose modern machine learning into the second step, as alternative scoring functions to be applied after the pose generation step. Given the complex data structure that is a highly-ranked binding pose, the hope was that deep learning would facilitate learning from such a complicated raw data structure, rather than requiring hand-summarized features. They also tested a similar model structure on the problem of predicting whether a highly ranked binding pose was actually the empirically correct one, as determined by some epsilon ball around the spatial coordinates of the true binding pose. Both of these were binary tasks, which I understand to be 1. Does this ranked binding pose in this protein have sufficiently high binding affinity to be "active"? This is known as the "virtual screening" task, because it's the relevant task if you want to screen compounds in silico, or virtually, before doing wet lab testing. 2. Is this ranked binding pose the one that would actually be empirically observed? This is known as the "binding mode prediction" task. The goal of this second task was to better understand biases the researchers suspected existed in the underlying dataset, which I'll explain later in this post. The researchers used a graph convolution architecture. At a (very) high level, graph convolution works in a way similar to normal convolution - in that it captures hierarchies of local patterns, in ways that gradually expand to have visibility over larger areas of the input data. The distinction is that normal convolution defines kernels over a fixed set of nearby spatial coordinates, in a context where direction (the pixel on top vs the pixel on bottom, etc) is meaningful, because photos have meaningful direction and orientation. By contrast, in a graph, there is no "up" or "down", and a given node doesn't have a fixed number of neighbors (whereas a fixed pixel in 2D space does), so neighbor-summarization kernels have to be defined in ways that allow you to aggregate information from 1) an arbitrary number of neighbors, in 2) a manner that is agnostic to orientation. 
Graph convolutions are useful in this problem because both the ligand itself, and the interaction of the posed ligand with the protein, can be summarized in terms of graphs of chemical bonds or interaction sites. Using this as an architectural foundation, the authors test both solo versions and ensembles of networks: https://i.imgur.com/Oc2LACW.png 1. "L" - A network that uses graph convolution to summarize the ligand itself, with no reference to the protein it's being tested for binding affinity with 2. "LP" - A network that uses graph convolution on the interaction points between the ligand and protein under the binding pose currently being scored or predicted 3. "R" - A simple network that takes into account the rank assigned to the binding pose by the original docking program (generally used in combination with one of the above). The authors came to a few interesting findings by trying different combinations of the above model modules. First, they found evidence supporting an earlier claim that, in the dataset being used for training, there was a bias in the positive and negative samples chosen such that you could predict activity of a ligand/protein binding using *ligand information alone.* This shouldn't be possible if we were sampling in an unbiased way over possible ligand/protein pairs, since even ligands that are quite effective with one protein will fail to bind with another, and it shouldn't be informationally possible to distinguish the two cases without protein information. Furthermore, a random forest on hand-designed features was able to perform comparably to deep learning, suggesting that only simple features are necessary to perform the task on this (biased, and thus over-simplified) dataset. Specifically, they found that L+LP models did no better than models of L alone on the virtual screening task. However, the binding mode prediction task offered an interesting contrast, in that, on this task, it's impossible to predict the output from ligand information alone, because by construction each ligand will have some set of binding modes that are not the empirically correct one, and one that is, and you can't distinguish between these based on ligand information alone, without looking at the actual protein relationship under consideration. In this case, the LP network did quite well, suggesting that deep learning is able to learn from ligand-protein interactions when it's incentivized to do so. Interestingly, the authors were only able to improve on the baseline model by incorporating the rank output by the original docking program, which you can think of as an ensemble of sorts between the docking program and the machine learning model. Overall, the authors' takeaways from this paper were that (1) we need to be more careful about constructing datasets, so as to not leak information through biases, and (2) that graph convolutional models are able to perform well, but (3) seem to be capturing different things than physics-based models, since ensembling the two together provides additional value. |
[link]
This paper is a bit provocative (especially in the light of the recent DeepMind MuZero paper), and poses some interesting questions about the value of model-based planning. I'm not sure I agree with the overall argument it's making, but I think the experience of reading it made me hone my intuitions around why and when model-based planning should be useful. The overall argument of the paper is: rather than learning a dynamics model of the environment and then using that model to plan and learn a value/policy function from, we could instead just keep a large replay buffer of actual past transitions, and use that in lieu of model-sampled transitions to further update our reward estimators without having to acquire more actual experience. In this paper's framing, the central value of having a learned model is this ability to update our policy without needing more actual experience, and it argues that actual real transitions from the environment are more reliable and less likely to diverge than transitions from a learned parametric model. It basically sees a big buffer of transitions as an empirical environment model that it can sample from, in a roughly equivalent way to being able to sample transitions from a learnt model. An obvious counter-argument to this is the value of models in being able to simulate particular arbitrary trajectories (for example, potential actions you could take from your current point, as is needed for Monte Carlo Tree Search). Simply keeping around a big stock of historical transitions doesn't serve the use case of being able to get a probable next state *for a particular transition*, both because we might not have that state in our data, and because we don't have any way, just given a replay buffer, of knowing that an available state comes after an action if we haven't seen that exact combination before. (And, even if we had, we'd have to have some indexing/lookup mechanism atop the data). I didn't feel like the paper's response to this was all that convincing. It basically just argues that planning with model transitions can theoretically diverge (though acknowledges it empirically often doesn't), and that it's dangerous to update off of "fictional" modeled transitions that aren't grounded in real data. While it's obviously definitionally true that model transitions are in some sense fictional, that's just the basic trade-off of how modeling works: some ability to extrapolate, but a realization that there's a risk you extrapolate poorly. https://i.imgur.com/8jp22M3.png The paper's empirical contribution to its argument was to argue that in a low-data setting, model-free RL (in the form of the "everything but the kitchen sink" Rainbow RL algorithm) with experience replay can outperform a model-based SimPLe system on Atari. This strikes me as fairly weak support for the paper's overall claim, especially since historically Atari has been difficult to learn good models of when they're learnt in actual-observation pixel space. Nonetheless, I think this push against the utility of model-based learning is a useful thing to consider if you do think models are useful, because it will help clarify the reasons why you think that's the case. |
[link]
Arguably, the central achievement of the deep learning era is multi-layer neural networks' ability to learn useful intermediate feature representations using a supervised learning signal. In a supervised task, it's easy to define what makes a feature representation useful: it makes it easier for a subsequent layer to produce the final class prediction. When we want to learn features in an unsupervised way, things get a bit trickier. There's the obvious problem of what kinds of problem structures and architectures work to extract representations at all. But there's also a deeper problem: when we ask for a good feature representation, outside of the context of any given task, what are we asking for? Are there some inherent aspects of a representation that can be analyzed without ground truth labels to tell you whether the representations you've learned are good or not? The notion of "disentangled" features is one answer to that question: it suggests that a representation is good when the underlying "factors of variation" (things that are independently variable in the underlying generative process of the data) are captured in independent dimensions of the feature representation. That is, if your representation is a ten-dimensional vector, and it just so happens that there are ten independent factors along which datapoints differ (color, shape, rotation, etc), you'd ideally want each dimension to correspond to exactly one factor. This criterion has an elegance to it, and it's previously been shown useful in predicting when the representations learned by a model will be useful in predicting the values of the factors of variation. This paper goes one step further, and tests the value of representations for solving a visual reasoning task that involves the factors of variation, but doesn't just involve predicting them. In particular, the authors use learned representations to solve a task patterned on a human IQ test, where some factors stay fixed across a row in a grid, and some vary, and the model needs to generate the image that "fits the pattern". https://i.imgur.com/O1aZzcN.png To test the value of disentanglement, they looked at a few canonical metrics of disentanglement, including scores that represent "how many factors are captured in each dimension" and "how many dimensions is a factor spread across". They measured the correlation of these metrics with task performance, and compared that with the correlation between simple autoencoder reconstruction error and performance. They found that at early stages of training on top of the representations, the disentanglement metrics were more predictive of performance than reconstruction accuracy. This distinction went away as the model learning on top of the representations had more time to train. It makes reasonable sense that you'd mostly see value from disentangled features in a low-data regime, since after long enough the fine-tuning network can learn its own features regardless. But, this paper does appear to contribute to evidence that disentangled features are predictive of task performance, at least when that task directly involves manipulation of specific, known, underlying factors of variation.
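To give a rough flavor of what such metrics are getting at, here is a crude correlation-based proxy of my own - far simpler than the canonical disentanglement metrics the paper evaluates, but it captures the "one factor per dimension" intuition:

```python
import numpy as np

def crude_disentanglement_score(latents, factors):
    """Toy proxy for disentanglement: for each latent dimension, compute its
    |correlation| with every ground-truth factor, then measure how concentrated
    that profile is on a single factor (1.0 = perfectly one-factor-per-dimension).
    This is a simplified stand-in, not the metrics used in the paper."""
    d, k = latents.shape[1], factors.shape[1]
    corr = np.zeros((d, k))
    for i in range(d):
        for j in range(k):
            corr[i, j] = abs(np.corrcoef(latents[:, i], factors[:, j])[0, 1])
    profile = corr / (corr.sum(axis=1, keepdims=True) + 1e-8)
    return profile.max(axis=1).mean()   # average "peakedness" across dimensions

# Toy data: 3 factors; a disentangled latent is a noisy permutation of them
rng = np.random.default_rng(0)
factors = rng.normal(size=(1000, 3))
disentangled = factors[:, [2, 0, 1]] + 0.1 * rng.normal(size=(1000, 3))
entangled = factors @ rng.normal(size=(3, 3))      # every dimension mixes all factors
print(crude_disentanglement_score(disentangled, factors))  # close to 1
print(crude_disentanglement_score(entangled, factors))     # noticeably lower
```

This is only meant to convey the flavor of the criterion; the paper's actual metrics are more careful about information content and factor coverage. |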
[link]
Summary: An odd thing about machine learning these days is how far you can get in a line of research while only ever testing your method on image classification and image datasets in general. This leads one occasionally to wonder whether a given phenomenon or advance is a discovery of the field generally, or whether it's just a fact about the informatics and learning dynamics inherent in image data. This paper, part of a set of recent papers released by Facebook centering around the Lottery Ticket Hypothesis, exists in the noble tradition of "let's try <thing> on some non-image datasets, and see if it still works". This can feel a bit silly in the cases where the ideas or approaches do transfer, but I think it's still an important impulse for the field to have, lest we become too captured by ImageNet and its various descendants. This paper tests the Lottery Ticket Hypothesis - the idea that there is a small subset of weights in a trained network whose lucky initializations promoted learning, such that if you reset those weights to their initializations and train only them you get comparable or near-comparable performance to the full network - on reinforcement learning and NLP datasets. In particular, within RL, they tested on both simple continuous control (where the observation state is a vector of meaningful numbers) and Atari from pixels (where the observation is a full from-pixels image). In NLP, they trained on language modeling and translation, with an LSTM and a Transformer respectively. (Prior work had found that Transformers didn't exhibit lottery-ticket-like phenomena, but this paper found a circumstance where they appear to.) Some high level interesting results: https://i.imgur.com/kd03bQ4.png https://i.imgur.com/rZTH7FJ.png - So as to not bury the lede: by and large, "winning" tickets retrained at their original initializations outperform random initializations of the same size and configuration on both NLP and Reinforcement Learning problems - There is wide variability in how much pruning in general (a necessary prerequisite operation) impacts reinforcement learning. On some games, pruning at all crashes performance; on others, it actually improves it. This leads to some inherent variability in results https://i.imgur.com/4o71XPt.png - One thing that prior researchers in this area have found is that pruning weights all at once at the end of training tends to crash performance for complex models, and that in order to find pruned models that have Lottery Ticket-esque high-performing properties, you need to do "iterative pruning" (sketched in code after this list). This works by training a model for a period, then pruning some proportion of weights, then training again from the beginning, and then pruning again, and so on, until you prune down to the full percentage you want to prune. The idea is that this lets the model adapt gradually to a drop in weights, and "train around" the capacity reduction, rather than it just happening all at once. In this paper, the authors find that this is strongly necessary for Lottery Tickets to be found for either Transformers or many RL problems. On a surface level, this makes sense, since Reinforcement Learning is a notoriously tricky and non-stationary learning problem, and Transformers are complex models with lots of parameters, and so dramatically reducing parameters can handicap the model. A weird wrinkle, though, is that the authors find that lottery tickets found without iterative pruning actually perform worse than "random tickets" (i.e. subnetworks with random topology and randomly initialized weights). This is pretty interesting, since it implies that the topology and weights you get if you prune all at once are actually counterproductive to learning. I don't have a good intuition as to why, but would love to hear if anyone does. https://i.imgur.com/9LnJe6j.png - For the Transformer specifically, there was an interesting divergence in the impact of weight pruning between the weights of the embedding matrix and the weights of the rest of the network machinery. If you include embeddings in the set of weights being pruned, there's essentially no difference in performance between random and winning tickets, whereas if you exclude them, winning tickets exhibit the more typical pattern of outperforming random ones. This implies that whatever phenomenon makes winning tickets better is more strongly (or perhaps only) present in weights for feature calculation on top of embeddings, and not very present for the embeddings themselves.
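Here is a minimal sketch of that iterative magnitude-pruning loop; the `train_fn` hook and the hyperparameters are placeholders of mine, not details from the paper:

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, rounds=5, prune_frac=0.2):
    """Sketch of the train -> prune -> rewind cycle described above: train, prune the
    smallest-magnitude surviving weights, rewind survivors to their initial values,
    and repeat until the target sparsity is reached."""
    init_state = copy.deepcopy(model.state_dict())           # values at initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                                # assumed to train while applying masks
        for n, p in model.named_parameters():
            alive = p.detach().abs()[masks[n].bool()]
            if alive.numel() == 0:
                continue
            thresh = torch.quantile(alive, prune_frac)        # drop lowest-magnitude survivors
            masks[n] = masks[n] * (p.detach().abs() > thresh).float()
        model.load_state_dict(init_state)                     # rewind to the original init
        for n, p in model.named_parameters():
            p.data.mul_(masks[n])                             # the candidate "winning ticket"
    return model, masks
```

The returned masked, rewound network is the ticket; in the papers' setting it then gets one final full training run to measure how well it does relative to a random ticket of the same sparsity. |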
[link]
In my view, the Lottery Ticket Hypothesis is one of the weirder and more mysterious phenomena of the last few years of Machine Learning. We've known for awhile that we can take trained networks and prune them down to a small fraction of their weights (keeping those weights with the highest magnitudes) and maintain test performance using only those learned weights. That seemed somewhat surprising, in that there were a lot of weights that weren't actually necessary to encoding the learned function, but, the thinking went, possibly having many times more weights than that was helpful for training, even if not necessary once a model is trained. The authors of the original Lottery Ticket paper came to the surprising realization that they could take the weights that survived pruning in the final network, reset them (and only them) to the values they had at initialization, and perform almost as well as the final pruned model that had all weights active during training. And, performance using the specific weights and their particular initialization values is much higher than training a comparable topology of weights with random initial values. This paper out of Facebook AI adds another fascinating experiment to the pile of odd evidence around lottery tickets: they test whether lottery tickets transfer *between datasets*, and they find that they often do (at least when the dataset on which the lottery ticket is found is more complex - in terms of size, input complexity, or number of classes - than the dataset the ticket is being transferred to). Even more interestingly, they find that for sufficiently simple datasets, the "ticket" initialization pattern learned on a more complex dataset actually does *better* than ones learned on the simple dataset itself. They also find that tickets by and large transfer between SGD and Adam, so whatever kind of inductive bias or value they provide is general across optimizers in addition to at least partially general across datasets. https://i.imgur.com/H0aPjRN.png I find this result fun to think about through a few frames. The first is to remember that figuring out heuristics for initializing networks (as a function of their topology) was an important step in getting them to train at all, so while this result may at first seem strange and arcane, in that context it feels less surprising that there are still-better initialization heuristics out there, possibly with some kind of interesting theoretical justification to them, that humans simply haven't been clever enough to formalize yet, and have only discovered empirically through methods like this. This result is also interesting in terms of transfer: we've known for awhile that the representations learned on more complex datasets can convey general information back to smaller ones, but it's less easy to think about what information is conveyed by the topology and connectivity of a network. This paper suggests that the information is there, and has prompted me to think more about the slightly mind-bending question of how training models could lead to information compressed in this form, and how this information could be better understood. |
[link]
VQ-VAE is a Variational AutoEncoder that uses as its information bottleneck a discrete set of codes, rather than a continuous vector. That is: the encoder creates a downsampled spatial representation of the image, where in each grid cell of the downsampled image, the cell is represented by a vector. But, before that vector is passed to the decoder, it's discretized, by (effectively) clustering the vectors the network has historically seen, and substituting each vector with the center of the cluster it's closest to. This has the effect of reducing the capacity of your information bottleneck, but without just pushing your encoded representation closer to an uninformed prior. (If you're wondering how the gradient survives this very much not continuous operation, the answer is: we just pretend that operation didn't exist, and imagine that the encoder produced the cluster-center "codebook" vector that the decoder sees). The part of the model that got a (small) upgrade in this paper is the prior distribution model that's learned on top of these latent representations. The goal of this prior is to be able to just sample images, unprompted, from the distribution of latent codes. Once we have a trained decoder, if we give it a grid of such codes, it can produce an image. But these codes aren't one-per-image, but rather a grid of many codes representing features in different parts of the image. In order to generate a set of codes corresponding to a reasonable image, we can either generate them all at once, or else (as this paper does) use an autoregressive approach, where some parts of the code grid are generated, and then subsequent ones conditioned on those. In the original VQ-VAE paper, the autoregressive model used was a PixelCNN (I don't have the space to fully explain that here, but, at a high level: a model that uses convolutions over previously generated regions to generate a new region). In this paper, the authors took inspiration from the huge rise of self-attention in recent years, and swapped that operation in, in lieu of the convolutions. Self-attention has the nice benefit that you can easily have a global receptive range (each region being generated can see all previously generated regions), which you'd otherwise need multiple layers of convolutions to accomplish. In addition, the authors add an additional layer of granularity: generating both a 32x32 and a 64x64 grid, and using both to generate the decoded reconstruction. They argue that this allows one representation to focus on more global details, and the other on more precise ones. https://i.imgur.com/zD78Pp4.png The final result is the ability to generate quite realistic-looking images, which are at least claimed to be more diverse than those generated by GANs (examples above). I'm always a bit cautious of claims of better performance in the image-generation area, because it's all squinting at pixels and making up somewhat-reasonable but still arbitrary metrics. That said, it seems interesting and useful to be aware of the current relative capabilities of two of the main forms of generative modeling, and so I'd recommend this paper on that front, even if it's hard for me personally to confidently assess the improvements on prior art.
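A minimal sketch of the discretization step and the "pretend it didn't happen" gradient trick (the straight-through estimator); the shapes and codebook size here are arbitrary choices of mine:

```python
import torch

def vector_quantize(z_e, codebook):
    """Snap each spatial encoder vector to its nearest codebook entry, and copy
    gradients straight through the non-differentiable lookup (z_q is treated as
    z_e for backprop)."""
    # z_e: (batch, height, width, dim); codebook: (num_codes, dim)
    flat = z_e.reshape(-1, z_e.shape[-1])
    dists = torch.cdist(flat, codebook)       # distance to every codebook vector
    codes = dists.argmin(dim=1)               # index of the nearest code per grid cell
    z_q = codebook[codes].reshape(z_e.shape)
    # straight-through estimator: forward pass uses z_q, gradient flows as if z_q == z_e
    z_q = z_e + (z_q - z_e).detach()
    return z_q, codes

z_e = torch.randn(2, 8, 8, 64, requires_grad=True)
codebook = torch.randn(512, 64)
z_q, codes = vector_quantize(z_e, codebook)
print(z_q.shape, codes.shape)   # torch.Size([2, 8, 8, 64]) torch.Size([128])
```

The real model also uses codebook and commitment losses (or EMA updates) to keep the codebook entries close to the encoder outputs; those are omitted here for brevity. |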
[link]
When talking about modern machine learning, particularly on images, it can feel like deep neural networks are a world unto themselves when it comes to complexity. On one hand, there are straightforward things like hand-designed features and linear classifiers, and then on the other, there are these deep, heavily-interacting networks that dazzle us with their performance but seem almost unavoidably difficult to hold in our heads or interpret. This paper, from ICLR 2019 earlier this year, investigates another point along this trade-off curve of complexity: a model that uses deep layers of convolutions, but limits the receptive field of those convolutions so that each feature is calculated using only a small spatial area of the image. https://i.imgur.com/NR0vFbN.png This approach, termed BagNet, essentially predicts class logits off of a small area of the image, without using information from anywhere else. Then, to aggregate the local predictions, a few simple and linear steps are performed: the predictions from each spatial area are averaged together into one vector containing the "aggregate information" for each class, and then that class information vector is passed into a linear (non-interacting!) model to predict final class probabilities. This is quite nice for interpretability, because you can directly identify the areas of the image that contributed evidence to the prediction, and you can know that the impact of those areas wasn't in fact amplified by feature values elsewhere, because there are no interaction effects outside of these small regions. Now, it's fairly obvious that you're not going to get any state-of-the-art results off of this: the entire point is to handicap a network in ways believed to make it more interpretable. So the interesting question is instead what degree of performance loss comes from such a (fairly drastic) limitation of model capacity and receptive field? And the answer of the paper is: less than you might think. (Or, at least, less than *they* think you think). If you only use features calculated from 33x33 pixel chunks of ImageNet images, and aggregate their evidence together in a purely linear way, you can get to 87.6% top-5 accuracy on ImageNet, which is about where we were with AlexNet in 2012. The authors also do some comparisons of their network to more common neural networks, to try to argue that even fully nonlinear neural nets don't use spatial information very much in their predictions. One way they did this was by masking different areas of the image, and comparing the effect of masking each individually to the effect of masking all areas together. In a purely linear model like BagNet, where the effects of different areas are just aggregated together, these would sum together perfectly, and the performance loss of all areas at once would be equal to the sum of each individually. To measure "effective spatial linearity" of each network, they measured the correlation between the sum of the individual effects and the joint effect. For VGG, they found a correlation of 0.75 here (compared to 1.0 for BagNet), which they use to argue that VGG doesn't use very much spatial information. I found this result hard to really get a grounding on, since I don't have a good intuitive grasp for what differences in this correlation value would mean. Is a difference of 0.25 a small difference, or a dramatic one? https://i.imgur.com/hA58AKM.png That aside, I found this paper interesting, and I'm quite pleased it was written.
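Before getting to that debate, here is a toy sketch of the aggregation scheme described above - per-patch class evidence, averaged with no cross-location interactions - using made-up dimensions rather than the paper's exact network:

```python
import torch
import torch.nn.functional as F

def bagnet_style_logits(patch_features, linear_weight, linear_bias):
    """Toy version of the aggregation: class evidence is computed independently for
    each small spatial patch, then the per-patch evidence is simply averaged, with
    no interactions across locations."""
    # patch_features: (batch, num_patches, feat_dim) from a limited-receptive-field CNN
    patch_logits = patch_features @ linear_weight.T + linear_bias  # (batch, patches, classes)
    return patch_logits.mean(dim=1)                                # average evidence per class

feats = torch.randn(4, 196, 512)          # e.g. a 14x14 grid of 33x33-pixel patch features
W, b = torch.randn(1000, 512), torch.zeros(1000)
probs = F.softmax(bagnet_style_logits(feats, W, b), dim=-1)
print(probs.shape)                         # (4, 1000)
```

Because the readout is linear in the per-patch evidence, each patch's contribution to the final score can be inspected on its own, which is exactly the interpretability argument the paper is making.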
On one hand, you can say: well, obviously, we've done a lot of work in 7 years to build ResNet and DenseNet and whatnot, so of course if you apply those more advanced architectures, even on a small region of image space, you'll get good performance. That said, I still think this is an interesting finding, because it helps us understand how much of the added value in recent research requires a high (and uninterpretable) interaction complexity, and what proportion of the overall performance can be achieved with a simpler-to-understand model. Machine learning is used in a lot of settings, and it always practically exists on a trade-off curve, where performance is important, but it's often worth trading off performance to do better on other considerations, and this paper does a good job of illustrating that trade-off curve more fully. |
[link]
The successes of deep learning on complex strategic games like Chess and Go have been largely driven by the ability to do tree search: that is, simulating sequences of actions in the environment, and then training policy and value functions to more speedily approximate the results that more exhaustive search reveals. However, this relies on having a good simulator that can predict the next state of the world, given your action. In some games, with straightforward rules, this is easy to explicitly code, but in many RL tasks like Atari, and in many contexts in the real world, having a good model of how the world responds to your actions is in fact a major part of the difficulty of RL. A response to this within the literature has been systems that learn models of the world from trajectories, and then use those models to do this kind of simulated planning. Historically these have been done by designing models that predict the next observation, given past observations and a passed-in action. This lets you "roll out" observations from actions in a way similar to how a simulator could. However, in high-dimensional observation spaces it takes a lot of model capacity to accurately model the full observation, and many parts of a given observation space will often be irrelevant. https://i.imgur.com/wKK8cnj.png To address this difficulty, the MuZero architecture uses an approach from Value Prediction Networks, and learns an internal model that can predict transitions between abstract states (which don't need to match the actual observation state of the world) and then predict a policy, value, and next-step reward from the abstract state. So, we can plan in latent space, by simulating transitions from state to state through actions, and the training signal for that space representation and transition model comes from being able to accurately predict the reward, the empirical future value at a state (discovered through Monte Carlo rollouts) and the policy action that the rollout search would have taken at that point. If two observations are identical in terms of their implications for these quantities, the transition model doesn't need to differentiate them, making it more straightforward to learn. (Apologies for the long caption in above screenshot; I feel like it's quite useful to gain intuition, especially if you're less recently familiar with the MCTS deep learning architectures DeepMind typically uses) https://i.imgur.com/4nepG6o.png The most impressive empirical aspect of this paper is the fact that it claims (from what I can tell credibly) to be able to perform as well as planning algorithms with access to a real simulator in games like Chess and Go, and as well as model-free models in games like Atari where MFRL has typically been the state of the art (because world models have been difficult to learn). I feel like I've read a lot recently that suggests to me that the distinction between model-free and model-based RL is becoming increasingly blurred, and I'm really curious to see how that trajectory evolves in future. |
[link]
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work. A goal of this paper is to design a model-based RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditionally, model-based RL has worked by learning a dynamics model of the environment - predicting the next observation state given the current one and an action, and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that model-based systems generalize better to new environments, because they predict one-step changes in local dynamics in a way that can be more easily separated from long-term dynamics or reward patterns. However, a downside of MBRL is that it can be hard to train, especially when your observation space is high-dimensional, and learning a straight model of your environment will lead to you learning details that aren't actually important for planning or creating policies. The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting observed state at the next step, it learns a transition model in latent space, and then learns to predict next-step reward and future value from that latent space vector. Because it learns to encode latent-space state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent states. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both latent → reward/value/discount predictions, and latent → latent transitions comes from using this pipeline to predict reward values. This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail. https://i.imgur.com/4bJylms.png Once this model exists, it can be used for generating a policy through a tree-search planning approach: simulating future trajectories and aggregating the predicted reward along those trajectories, and then taking the highest-value one. The authors find that their model is able to do better than both model-free and model-based methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a next-step observation prediction can work. My main question coming into this paper was: how is this different from simply a value estimator like those used in DQN or A2C? My impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state.
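Here is a minimal sketch of that structure - encode to an abstract latent state, transition in latent space given actions, and predict reward/value directly from the latent - with toy dimensions and heads of my own choosing rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class LatentModel(nn.Module):
    """Encode the observation to an abstract state, transition in latent space given
    an action, and read reward/value off the latent state directly."""
    def __init__(self, obs_dim=32, latent_dim=16, n_actions=4):
        super().__init__()
        self.encode = nn.Linear(obs_dim, latent_dim)
        self.transition = nn.Linear(latent_dim + n_actions, latent_dim)
        self.reward = nn.Linear(latent_dim, 1)
        self.value = nn.Linear(latent_dim, 1)
        self.n_actions = n_actions

    def rollout(self, obs, actions):
        """Simulate a trajectory entirely in latent space; the training signal comes
        from matching predicted rewards/values, never from reconstructing observations."""
        s = torch.tanh(self.encode(obs))
        rewards = []
        for a in actions:
            a_onehot = nn.functional.one_hot(a, self.n_actions).float()
            s = torch.tanh(self.transition(torch.cat([s, a_onehot], dim=-1)))
            rewards.append(self.reward(s))
        return torch.stack(rewards), self.value(s)

model = LatentModel()
obs = torch.randn(1, 32)
acts = [torch.tensor([2]), torch.tensor([0]), torch.tensor([1])]
pred_rewards, final_value = model.rollout(obs, acts)
print(pred_rewards.shape, final_value.shape)   # torch.Size([3, 1, 1]) torch.Size([1, 1])
```

The planning step then just runs this kind of rollout over candidate action sequences and picks the one with the best aggregated reward-plus-value. |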
[link]
Given the tasks that RL is typically used to perform, it can be easy to equate the problem of reinforcement learning with "learning dynamically, online, as you take actions in an environment". And while this does represent most RL problems in the literature, it is possible to learn a reinforcement learning system in an off-policy way (read: trained off of data that the policy itself didn't collect), and there can be compelling reasons to prefer this approach. In this paper, which seeks to train a chatbot to learn from implicit human feedback in text interactions, the authors note prior bad experiences with Microsoft's Tay bot, and highlight the value of being able to test and validate a learned model offline, rather than have it continue to learn in a deployment setting. This problem, of learning an RL model off of pre-collected data, is known as batch RL. In this setting, the batch is collected by simply using a pretrained language model to generate interactions with a human, and then extracting reward from these interactions to train a Q learning system once the data has been collected. If naively applied, Q learning (a good approach for off-policy problems, since it directly estimates the value of states and actions rather than of a policy) can lead to some undesirable results in a batch setting. An interesting one, that hadn't occurred to me, was the fact that Q learning translates its (state, action) value model into a policy by taking the action associated with the highest estimated value. This is a generally sensible thing to do if you've been able to gather data on all or most of a state space, but it can also bias the model toward taking actions that it has less data for, because high-variance estimates will tend to make up a disproportionate share of the maximum values of any estimated distribution. One approach to this is to learn two separate Q functions, take the minimum over them, and then take the max of that across actions (in this case: words in a sentence being generated). The idea here is that low-data, high-variance parts of state space might have one estimate be high, but might have the other be low, because of that high variance. However, it's costly to train and run two separate models. Instead, the authors here propose the simpler solution of training a single model with dropout, and using multiple "draws" from that model to simulate a distribution over Q value estimates. This will have a similar effect of penalizing actions whose estimate varies across different dropout masks (which can be hand-wavily thought of as different models). The authors also add a term to their RL training that penalizes divergence from the initial language model that they used to collect the data, and that also serves as the initialization point for the parameters of the model. This is done via KL-divergence control: the model is penalized for outputting a distribution over words that is different in distributional-metric terms from what the language model would have output. This makes it costlier for the model to diverge from the pretrained model, and should lead to divergence only happening in cases of convincingly high reward. Out of these two approaches, it seems like the former is more convincing to me as a general-purpose method to use in batch RL settings.
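A minimal sketch of that first, dropout-based approach (the network and the "min over draws" statistic here are my own toy stand-ins for the paper's setup):

```python
import torch
import torch.nn as nn

class DropoutQ(nn.Module):
    """Keep dropout active at estimation time, draw several stochastic Q estimates,
    and act on a conservative statistic of them rather than a single point estimate."""
    def __init__(self, state_dim=64, n_actions=10, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, n_actions),
        )

    def conservative_q(self, state, n_samples=8):
        self.train()   # keep dropout stochastic even when "evaluating"
        samples = torch.stack([self.net(state) for _ in range(n_samples)])
        return samples.min(dim=0).values   # penalize actions whose estimate varies across masks

q = DropoutQ()
state = torch.randn(1, 64)
action = q.conservative_q(state).argmax(dim=-1)   # best worst-case estimate, not best single guess
print(action)
```

The point of taking a pessimistic statistic is exactly the one described above: actions with little supporting data tend to have high-variance estimates, so at least one dropout draw will usually be low for them, and they stop winning the argmax by luck.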
The latter is definitely something I would have expected to work well (and, indeed, KL-controlled models performed much better in empirical tests in the paper!), but more simply because language modeling is hard, and I would expect it to be good to constrain a model to be close to realistic outputs, since the sentiment-based reward signal won't reward realism directly. This seems more like something generally useful for avoiding catastrophic forgetting when switching from an old task to a new one (language modeling to sentiment modeling), rather than a particularly batch-RL-centric innovation. https://i.imgur.com/EmInxOJ.png An interesting empirical observation of this paper is that models without language-model control end up drifting away from realism, and repeatedly exploit part of the reward function that, in addition to sentiment, gave points for asking questions. By contrast, the KL-controlled models appear to have avoided falling into this local minimum, and instead generated realistic language that was polite and empathetic. (Obviously this is still a simplified approximation of what makes a good chat bot, but it's at least a higher degree of complexity in its response to reward). Overall, I quite enjoyed this paper, both for its thoughtfulness and its clever application of engineering to use RL for a problem well outside of its more typical domain. |
[link]
At a high level, this paper is a massive (34 pgs!) and highly-resourced study of many nuanced variations of language pretraining tasks, to see which of those variants produce models that transfer the best to new tasks. As a result, it doesn't lend itself *that* well to being summarized into a central kernel of understanding. So, I'm going to do my best to pull out some high-level insights, and recommend you read the paper in more depth if you're working particularly in language pretraining and want to get the details. The goals here are simple: create a standardized task structure and a big dataset, so that you can use the same architecture across a wide range of objectives and subsequent transfer tasks, and thus actually compare tasks on equal footing. To that end, the authors created a huge dataset by scraping internet text, and filtering it according to a few common sense criteria. This is an important and laudable task, but not one with a ton of conceptual nuance to it. https://i.imgur.com/5z6bN8d.png A more interesting structural choice was to adopt a unified text-to-text framework for all of the tasks they might want their pretrained model to transfer to. This means that the input to the model is always a sequence of tokens, and so is the output. If the task is translation, the input sequence might be "translate English to German: build a bed" and the desired output would be that sentence in German. This gets particularly interesting as a change when it comes to tasks where you're predicting relationships of words within sentences, and would typically have a categorical classification loss, which is changed here to predicting the word of the correct class. This restructuring doesn't seem to hurt performance, and has the nice side effect that you can directly use the same model as a transfer starting point for all tasks, without having to add additional layers. Some of the transfer tasks include: translation, sentiment analysis, summarization, grammatical checking of a sentence, and checking the logical relationship between claims. All tested models followed a transformer (i.e. fully attentional) architecture. The authors tested performance along many different axes. A structural variation was the difference between an encoder-decoder architecture and a language model one. https://i.imgur.com/x4AOkLz.png In both cases, you take in text and predict text, but in an encoder-decoder, you have separate models that operate on the input and output, whereas in a language model, it's all seen as part of a single continuous sequence. They also tested variations in what pretraining objective is used. The most common is simple language modeling, where you predict words in a sentence given prior or surrounding ones, but, inspired by the success of BERT, they also tried a number of denoising objectives, where an original sentence was corrupted in some way (by dropping tokens and replacing them with either masks, nothing, or random tokens, dropping individual words vs contiguous spans of words) and then the model had to predict the actual original sentence. https://i.imgur.com/b5Eowl0.png Finally, they performed testing as to the effect of dataset size and number of training steps. Some interesting takeaways: - In almost all tests, the encoder-decoder architecture, where you separately build representations of your input and output text, performs better than a language model structure.
This is still generally (though not as consistently) true if you halve the number of parameters in the encoder-decoder, suggesting that there's some structural advantage there beyond just additional parameter count. - A denoising, BERT-style objective works consistently better than a language modeling one. Within the set of different kinds of corruption, none work obviously and consistently better across tasks, though some have a particular advantage at a given task, and some are faster to train with due to different lengths of output text. - Unsurprisingly, more data and bigger models both lead to better performance. Somewhat interestingly, training with less data but the same number of training iterations (such that you see the same data multiple times) seems to be fine up to a point. This potentially gestures at an ability to train over a dataset a higher number of times without being as worried about overfitting. - Also somewhat unsurprisingly, training on a dataset that filters out HTML, random lorem-ipsum web text, and bad words performs meaningfully better than training on one that doesn't.
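To make the unified text-to-text framing described earlier concrete, here is a toy formatting function; the task prefixes are illustrative paraphrases of the idea, not necessarily the paper's exact strings:

```python
def to_text_to_text(task, example):
    """Every task becomes 'input string -> target string', including classification
    tasks, whose labels are emitted as words rather than class indices."""
    if task == "translate_en_de":
        return f"translate English to German: {example['text']}", example["translation"]
    if task == "sentiment":
        return f"sentiment: {example['text']}", example["label"]   # e.g. "positive"
    if task == "summarize":
        return f"summarize: {example['text']}", example["summary"]
    raise ValueError(task)

src, tgt = to_text_to_text("sentiment", {"text": "That movie was wonderful.", "label": "positive"})
print(src)  # "sentiment: That movie was wonderful."
print(tgt)  # "positive"
```

The payoff of this framing is that a single pretrained sequence-to-sequence model, with no task-specific heads, can be fine-tuned on any of these tasks directly. |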
[link]
Coming from the perspective of the rest of machine learning, a somewhat odd thing about reinforcement learning that often goes unnoticed is the fact that, in basically all reinforcement learning, performance of an algorithm is judged by its performance on the same environment it was trained on. In the parlance of ML writ large: training on the test set. In RL, most of the focus has historically been on whether automatic systems would be able to learn a policy from the state distribution of a single environment, already a fairly hard task. But, now that RL has had more success in the single-environment case, there comes the question: how can we train reinforcement learning algorithms that don't just perform well on a single environment, but over a range of environments? One lens onto this question is that of meta-learning, but this paper takes a different approach, and looks at how straightforward regularization techniques pulled from the land of supervised learning can (or can't straightforwardly) be applied to reinforcement learning. In general, the regularization techniques discussed here are all ways of reducing the capacity of the model, and preventing it from overfitting. Some ways to reduce capacity are: - Apply L2 weight penalization - Apply dropout, which handicaps the model by randomly zeroing out neurons - Use Batch Norm, which uses noisy batch statistics, and increases randomness in a way that, similar to above, deteriorates performance - Use an information bottleneck: similar to a VAE, this approach works by learning some compressed representation of your input, p(z|x), and then predicting your output off of that z, in a way that incentivizes your z to be informative (because you want to be able to predict y well) but also penalizes too much information being put in it (because you penalize differences between your learned p(z|x) distribution and an unconditional prior p(z)). This pushes your model to use its conditional-on-x capacity wisely, and only learn features if they're quite valuable in predicting y. However, the paper points out that there are some complications in straightforwardly applying these techniques to RL. The central one is the fact that in (most) RL, the distribution of transitions you train on comes from prior iterations of your policy. This means that a noisier and less competent policy will also leave you with less data to train on. Additionally, using a noisy policy can increase variance, both by making your trained policy more different from your rollout policy (in an off-policy setting) and by making your estimate of the value function higher-variance, which is problematic because that's what you're using as a target training signal in a temporal difference framework. The paper is a bit disconnected between its theoretical justification and its methods, and makes two broad, mostly distinct proposals: 1. The most successful (though also the one least directly justified by the earlier-discussed theoretical difficulties of applying regularization in RL) is an information bottleneck ported into a RL setting. It works almost the same as the classification-model one, except that you're trying to increase the value of your actions given a compressed-from-state representation z, rather than trying to increase your ability to correctly predict y. The justification given here is that it's good to incentivize RL algorithms in particular to learn simpler, more compressible features, because they often have such poor data and training signal earlier in training. 2.
SNI (Selective Noise Injection) works by only applying stochastic aspects of regularization (sampling from z in an information bottleneck, applying different dropout masks, etc) to certain parts of the training procedure. In particular, the rollout used to collect data is non-stochastic, removing the issue of noisiness impacting the data that's collected. They then do an interesting thing where they calculate a weighted mixture of the policy update with a deterministic model, and the update with a stochastic one. The best performing of these that they tested seems to have been a 50/50 split. This is essentially just a knob you can turn on stochasticity, to trade off between the regularizing effect of noise and the variance-increasing-negative effect of it. https://i.imgur.com/fi0dHgf.png https://i.imgur.com/LLbDaRw.png Based on my read of the experiments in the paper, the most impressive thing here is how well their information bottleneck mechanism works as a way to improve generalization, compared to both the baseline and other regularization approaches. It does look like there's some additional benefit to SNI, particularly in the CoinRun setting, but very little in the MultiRoom setting, and in general the difference is less dramatic than the difference from using the information bottleneck. |
[link]
Domain translation - for example, mapping from a summer to a winter scene, or from a photorealistic image to an object segmentation map - is often performed by GANs through something called cycle consistency loss. This model works by having a generator for each direction (domain A into domain B, and B back into A), and, for each domain, a discriminator to differentiate between real images from that domain and those that were constructed through the cross-domain generator. With a given image in domain A, training happens by using the A→B generator to map it into domain B, and then the B→A generator to map it back to the original domain. These generators are then trained using two losses: one based on the B-domain discriminator, to push the generated image to look like it belongs to that domain, and another based on the L2 loss between the original domain A image, and the image you get on the other end when you translate it into B and back again. This paper addresses an effect (identified originally in an earlier paper) where, in settings with a many-to-one mapping between domains (for example, mapping a realistic scene into a segmentation map, where information is inherently lost by translating pixels to object outlines), the cycle loss incentivizes the model to operate in a strange, steganographic way, where it saves the information that would otherwise be lost in the form of low-amplitude random noise in the translated image. This low-amplitude information can't be isolated, but can be detected in a few ways. First, we can simply examine images and notice that information that could not have been captured in the lower-information domain is being perfectly reconstructed. Second, if you add noise to the translation in the lower-information domain, in such a way as to not perceptibly change the translation to human eyes, this can cause the predicted image off of that translation to deteriorate considerably, suggesting that the model was using information that could be modified by such small additions of noise to do its reconstruction. https://i.imgur.com/08i1j0J.png The authors of this paper ask whether it's possible to train models that don't perform this steganographic information-storing (which they call "self adversarial examples"). A typical approach to such a problem would be to train generators to perform translations with and without the steganographic information, but even though we can prove the existence of the information, we can't isolate it in a way that would allow us to remove it, and thus create these kinds of training pairs. The two tactics the paper uses are: 1) Simply training the generators to be able to translate a domain-mapped image with noise added as well as one without noise, in the hope that this would train them not to use information that can be interfered with by the application of such noise. 2) In addition to an L2 cycle loss, adding a discriminator to differentiate between the back-translated image and the original one. I believe the idea here is that if both of the generators are adding in noise as a kind of secret signal, this would give the discriminator a way to distinguish between the original and reconstructed image, and the behavior would thus be penalized.
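Here is a rough sketch of the first tactic, with hypothetical stand-ins for the generators and discriminator - noise is injected into the translated image before back-translation, so any low-amplitude hidden signal gets destroyed before the cycle loss is computed:

```python
import torch

def cycle_losses(x_a, gen_ab, gen_ba, disc_b, noise_std=0.05):
    """Corrupt the translated image with small noise before translating back, so the
    generators can't rely on low-amplitude hidden signals to reconstruct the original."""
    fake_b = gen_ab(x_a)                                     # A -> B translation
    noisy_b = fake_b + noise_std * torch.randn_like(fake_b)  # destroy any steganographic signal
    recon_a = gen_ba(noisy_b)                                # B -> A back-translation
    cycle_loss = (recon_a - x_a).pow(2).mean()               # reconstruct despite the noise
    adv_loss = -torch.log(disc_b(fake_b) + 1e-8).mean()      # look like a real domain-B image
    return cycle_loss, adv_loss

# Toy stand-ins so the sketch runs end to end
gen_ab = gen_ba = lambda x: torch.tanh(x)
disc_b = lambda x: torch.sigmoid(x.mean(dim=[1, 2, 3]))
x_a = torch.randn(2, 3, 64, 64)
print(cycle_losses(x_a, gen_ab, gen_ba, disc_b))
```

The second tactic would add a further discriminator term comparing recon_a against x_a, penalizing any tell-tale difference between originals and back-translations.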
They find that both of these methods reduce the use of steganographic information, as determined both by sensitivity to noise (where less sensitivity of reconstruction to noise means less use of coded information) and by reconstruction honesty (which constrains the accuracy of reconstruction in many-to-one domains to be no greater than what a supervised predictor could achieve given only the image from the compressed domain). |
[link]
In Machine Learning, our models are lazy: they're only ever as good as the datasets we train them on. If a task doesn't require a given capability in order for a model to solve it, then the model won't gain that capability. This fact motivates a desire on the part of researchers to construct new datasets, to provide both a source of signal and a not-yet-met standard against which models can be measured. This paper focuses on the domain of reasoning about videos and the objects within them across frames. It observes that, on many tasks that ostensibly require a model to follow what's happening in a video, models that simply aggregate some set of features across frames can do as well as models that actually track and account for temporal evolution from one frame to another. They argue that this shows that, on these tasks, which often involve real-world scenes, the model can predict what's happening within a frame simply based on expectations of the world that can be gleaned from single frames - for example, if you see a swimming pool, you can guess that swimming is likely to take place there. As an example of the kind of task they'd like to get a model to solve, they showed a scene from the Godfather where a character leaves the room, puts a gun in his pocket, and returns to the room. Any human viewer could infer that the gun is in his pocket when he returns, but there doesn't exist any single individual frame that could give evidence of that, so it requires reasoning across frames. https://i.imgur.com/F2Ngsgw.png To get around this inherent layer of bias in real-world scenes, the authors decide to artificially construct their own dataset, where objects are moved, and some objects are moved to be contained and obscured within others, in an entirely neutral environment, where the model can't generally get useful information from single frames. This is done using the same animation environment as is used in CLEVR, which contains simple objects that have color, texture, and shape, and that can be moved around a scene. Within this environment, called CATER, the benchmark is made up of three tasks: - Simply predicting what action ("slide cone" or "pick up and place box") is happening in a given frame. For actions like sliding, where in a given frame a sliding cone is indistinguishable from a static one, this requires a model to actually track prior position in order to correctly predict an action taking place - Being able to correctly identify the order in which a given pair of actions occurs - Watching a single golden object that can be moved and contained within other objects (entertainingly enough, for Harry Potter fans, called the snitch), and guessing where it ends up at the end of the scene. This is basically just the "put a ball in a cup and move it around" party trick, but as a learning task https://i.imgur.com/bBhPnFZ.png The authors do show that the "frame aggregation/pooling" methods that worked well on previous datasets don't work well on this dataset - which accords with both expectations and the authors' goals. Obviously, it's still a fairly simplified environment, but they hope CATER can still be a useful shared benchmark for people working in the space to solve a task that is known to require more explicit spatiotemporal reasoning. |
[link]
A common critique of deep learning is its brittleness off-distribution, combined with its tendency to give confident predictions for off-distribution inputs, as is seen in the case of adversarial examples. In response to this critique, a number of different methods have cropped up in recent years that try to capture a model's uncertainty as well as its overall prediction. This paper tries to do a broad evaluation of uncertainty methods, and, particularly, to test how they perform on out of distribution data, including both data that is perturbed from its original values, and fully OOD data from ground-truth categories never seen during training. Ideally, we would want an uncertainty method that is less confident in its predictions as data is made more dissimilar from the distribution that the model is trained on. Some metrics the paper uses for capturing this are: - Brier Score (the squared difference between the predicted probability and the ground-truth 0/1 label, averaged over all examples; sketched in code at the end of this summary) - Negative Log Likelihood - Expected Calibration Error (within a given bucket, this is calculated as the difference between accuracy to ground truth labels, and the average predicted score in that bucket, capturing that you'd ideally want to have a lower predicted score in cases where you have low accuracy, and vice versa) - Entropy - For labels that are fully out of distribution, and don't map to any of the model's categories, you can't directly calculate ground truth accuracy, but you can ideally ask for a model that has high entropy (close to uniform) probabilities over the classes it knows about when the image is drawn from an entirely different class. The authors test over both small (MNIST) and large (CIFAR10 and ImageNet) image datasets, as well as a categorical ad-click-prediction dataset. They came up with some interesting findings. https://i.imgur.com/EVnjS1R.png 1. More fully principled Bayesian estimation of posteriors over parameters, in the form of Stochastic Variational Inference, works well on MNIST, but quite poorly on either categorical data or higher dimensional image datasets https://i.imgur.com/3emTYNP.png 2. Temperature scaling, which basically performs a second supervised calibration using a hold-out set to push your probabilities towards true probabilities, performs well in-distribution but collapses fairly quickly off-distribution (which sort of makes sense given that it too is just another supervised method that can do poorly when off-distribution) 3. In general, ensemble methods, where you train different models on different subsets of the data and take their variance as uncertainty, perform the best across the bigger image models as well as the ad click model, likely because SVI (along with many other Bayesian methods) is too computationally intensive to get to work well on higher-dimensional data 4. Overall, none of the methods worked particularly well, and even the best-performing ones were often confidently wrong off-distribution. I think it's fair to say that we're far from where we wish we were when it comes to models that "know when they don't know," and this paper does a good job of highlighting that in specific fashion.
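For reference, here is a small sketch of the Brier score and a binary-classification version of Expected Calibration Error - my own simplified implementations, not the paper's evaluation code:

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return np.mean((probs - labels) ** 2)

def expected_calibration_error(probs, labels, n_bins=10):
    """Bucket predictions by confidence and compare each bucket's average confidence
    to its empirical accuracy; a well-calibrated model has a small gap in every bucket.
    (Binary sketch; the multi-class version bins the max-class probability instead.)"""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap        # weight by the fraction of points in the bucket
    return ece

rng = np.random.default_rng(0)
true_p = rng.uniform(0, 1, size=5000)
labels = rng.binomial(1, true_p)
print(expected_calibration_error(true_p, labels))                              # small: calibrated
print(expected_calibration_error(np.where(true_p > 0.5, 0.99, 0.01), labels))  # large: overconfident
print(brier_score(true_p, labels))
```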
|