[link]
In the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them.

All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bonds are sundered and the energy of electrons at different electron shells.

https://i.imgur.com/Mmp8KO6.png

An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network could do it much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism.

The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the molecule-summarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in.

https://i.imgur.com/J7s0q2e.png

Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways (a minimal code sketch of all three appears at the end of this summary):

1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the layer below, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a plain concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only order-invariant option).
2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the prior-layer representation at that node, to come up with a next-layer representation at that node. Similarly, the same Update function weights are shared across all atoms.
3. The Readout function, R(), which takes the final-layer representation of each atom node and aggregates those representations into a final graph-level representation in an order-invariant way.

Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures.

- Does the message function being sent from w to v depend on the feature values at both w and v, or just w? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper) seems like a quite valuable addition (in that all subsequent papers include it).
- Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different edge categories.
- Is the Readout function something as simple as a summation of all top-level feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses item-to-item attention within the set of final-layer atom representations to construct an aggregated graph-level embedding.

The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearest-by atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the free-energy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a non-spatial-information network that can perform reasonably well - reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, raises the question of whether it'd ever be possible to learn representations that can match the performance of spatially-informed ones without explicitly providing that information.
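To make the taxonomy above concrete, here's a minimal numpy sketch of one message-passing round plus a readout. The concatenate-then-dense-layer choices for M(), U(), and R() are my own illustrative stand-ins, not the specific parameterization of any architecture the paper reviews.

```python
import numpy as np

def dense_relu(x, W, b):
    """A single dense layer with ReLU, standing in for any learned network."""
    return np.maximum(0.0, x @ W + b)

def message_passing_step(h, edges, edge_feats, Wm, bm, Wu, bu):
    """One round of message passing.

    h          : (n_atoms, d) current node (atom) feature vectors
    edges      : list of directed pairs (w, v); the message flows w -> v
    edge_feats : dict mapping (w, v) to an edge feature vector
    Wm, bm     : shared weights of the Message function M()
    Wu, bu     : shared weights of the Update function U()
    """
    aggregated = np.zeros_like(h)
    for (w, v) in edges:
        # M(): the same weights for every (w, v) pair; here a concat + dense layer
        m_input = np.concatenate([h[w], h[v], edge_feats[(w, v)]])
        aggregated[v] += dense_relu(m_input, Wm, bm)  # sum is the order-invariant aggregator
    # U(): combine each node's prior representation with its aggregated message
    return dense_relu(np.concatenate([h, aggregated], axis=1), Wu, bu)

def readout(h_final, Wr, br):
    """R(): a simple order-invariant readout, summing final node vectors
    before a dense layer to get a graph-level embedding."""
    return dense_relu(h_final.sum(axis=0), Wr, br)
```

Stacking several such steps before the readout is what lets each atom's representation become contextualized by progressively larger neighborhoods of the molecule.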
[link]
This is an interesting - and refreshing - paper, in that, instead of trying to go all-in on a particular theoretical point, the authors instead run a battery of empirical investigations, all centered around the question of how to explain what happens to make transfer learning work. The experiments don't all line up to support a single point, but they do illustrate different interesting facets of the transfer process.

- An initial experiment tries to understand how much of the performance of fine-tuned models can be explained by (higher-level, and thus larger-scale) features, and how much is driven by lower-level (and thus smaller-scale) image statistics. To start with, the authors compare transfer performance from ImageNet onto three different datasets - clip art, sketches, and real images. As expected, transfer performance is highest with real datasets, which are the most similar to the training domain. However, there still *is* positive transfer in terms of final performance across all domains, as well as a benefit in optimization speed.
- To try to further tease out the difference between the transfer benefits of high- and low-level features, the authors run an experiment where blocks of pixels are shuffled around within the image on downstream tasks. The larger the size of the blocks being shuffled, the more that large-scale features of the image are preserved. As predicted, accuracy drops dramatically when pixel block size is small, for both randomly initialized and pretrained models. In addition, the relative value added by pretraining drops, for all datasets except quickdraw (the dataset of sketches). This suggests that in most datasets, the value brought by fine-tuning was mostly concentrated in large-scale features. One interesting tangent of this experiment was the examination of optimization speed (in the form of mean training accuracy over initial epochs). Even at block sizes too small for pretraining to offer a benefit to final accuracy, it did still contribute to faster training. (See transparent bars in right-hand plot below.)

https://i.imgur.com/Y8sO1da.png

- On a somewhat different front, the authors look into how similar pretrained-plus-finetuned models are to one another, compared to models trained on the same dataset from random initializations. First, they look at a measure of feature similarity, and find that the features learned by two pretrained networks are more similar to each other than a pretrained network is to a randomly initialized network, and also more than two randomly initialized networks are to one another. Randomly initialized networks are closest to one another in their final-layer features, but even there the similarity is a factor of 4 or 5 lower than the similarity between the pretrained networks.
- Looking at things from the perspective of optimization, the paper measures how much performance drops when you linearly interpolate between different solutions found by both randomly initialized and pretrained networks (a minimal sketch of this check appears at the end of this summary). For randomly initialized networks, interpolation requires traversing a region where test accuracy drops to 0%. However, for pretrained networks, this isn't the case, with test accuracy staying high throughout. This suggests that pretraining gets networks into a basin of the loss landscape, and that future training stays within that basin. There were also some experiments on module criticality that I believe were in a similar vein to these, but which I didn't fully follow.
- Finally, the paper looks at the relationship between accuracy on the original pretraining task and both accuracy and optimization speed on the downstream task. They find that higher original-task accuracy moves in the same direction as higher downstream-task accuracy, though this is less true when the downstream task is less related (as with quickdraw). Perhaps more interestingly, they find that the benefits of transfer to optimization speed happen and plateau quite early in training. The Clip Art and Real transfer tasks are much more similar in the optimization speed benefits they get from ImageNet training, whereas on the accuracy front, the Real domain did dramatically better.

https://i.imgur.com/jBCJcLc.png

While there's a lot to dig into in these results overall, the things I think are most interesting are the reinforcing of the idea that even very random and noisy pretraining can be beneficial to optimization speed (this seems reminiscent of another paper I read from this year's NeurIPS, examining why pretraining on random labels can help downstream training), and the observation that pretraining deposits weights in a low-loss bucket, from which they can learn more efficiently (though, perhaps, if the task is too divergent from the pretraining task, this difficulty in leaving the basin becomes a disadvantage). This feels consistent with some work on the Lottery Ticket Hypothesis, which has recently suggested that, after a short duration of training, you can rewind a network to a checkpoint saved after that duration, and successfully train it to low loss again.
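As referenced above, here is a minimal sketch of the linear-interpolation check, assuming two trained networks whose parameters are available as dicts of numpy arrays and an `evaluate` callable supplied by the caller; this is my own simplification, not the paper's exact protocol.

```python
import numpy as np

def interpolate_and_evaluate(params_a, params_b, evaluate, num_points=11):
    """Evaluate test accuracy along the straight line between two solutions.

    params_a, params_b : dicts mapping parameter names to numpy arrays
                         (e.g. two fine-tuned runs, or two randomly initialized runs)
    evaluate           : callable taking a parameter dict and returning test accuracy
    """
    accuracies = []
    for alpha in np.linspace(0.0, 1.0, num_points):
        # Convex combination of the two parameter sets
        mixed = {name: (1 - alpha) * params_a[name] + alpha * params_b[name]
                 for name in params_a}
        accuracies.append(evaluate(mixed))
    # A dip toward 0% along the path suggests the two solutions sit in separate basins
    return accuracies
```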
[link]
This work attempts to use meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta-learning methods optimize parameters governing an agent's learning, and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher.

To be more concrete, the agent in a given environment learns two things:

- A policy, that is, a distribution over predicted actions given a state.
- A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something more optimal. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction.

At each step, after acting in the environment, the agent passes to an LSTM:

- The reward at the step
- A binary indicator of whether the trajectory is done
- The discount factor
- The probability of the action that was taken at state t
- The prediction vector evaluated at state t
- The prediction vector evaluated at state t+1

Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things:

- A scalar, pi-hat
- A prediction vector, y-hat

These two quantities are used to update the existing policy and prediction model according to the rule below (a small sketch of how these pieces fit together appears at the end of this summary).

https://i.imgur.com/xx1W9SU.png

Conceptually, the scalar governs whether to increase or decrease the probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned, it can (hopefully) generalize to new environments.

Given this, the basic meta-learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector.) These include:

- An entropy bonus, pushing the meta-learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)
- An L2 penalty on both pi-hat and y-hat
- A removal of the softmax that had originally been taken over the k-dimensional prediction vector categorical, and a switch of that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1-of-k categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it more natural to think of it as k separate binaries? This was something I was definitely confused about in the paper overall.

https://i.imgur.com/EL8R1yd.png

With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was being trained in. However, the two most interesting aspects of the evaluation were:

1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about what states were high value. Beyond that, they found that the meta-learned vector could be used to predict values calculated with discount rates different than the one used in the meta-learned training, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate.
2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to Reinforcement Learning that it's capturing.
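Since the exact update rule lives in the figure above, the following is only a hedged sketch: it shows the environment-agnostic input vector the LSTM consumes at each step and the roles the two outputs play in a plain SGD-style inner loop. The function names and the specific update form are my own assumptions, not the paper's equations.

```python
import numpy as np

def build_lstm_input(reward, done, discount, action_prob, y_t, y_tp1):
    """Per-step input to the meta-learned LSTM. None of these quantities depend
    on the environment's observation or action space, which is what lets the
    learned rule transfer across environments."""
    return np.concatenate([[reward, float(done), discount, action_prob], y_t, y_tp1])

def agent_update(log_pi_grad, y_t, y_hat, pi_hat, lr=1e-3):
    """Hypothetical inner-loop update using the LSTM's two outputs (an
    assumption standing in for the rule in the paper's figure).

    log_pi_grad : gradient of log pi(a_t | s_t) w.r.t. the policy parameters
    pi_hat      : scalar controlling how much to reinforce the taken action
    y_hat       : target that the prediction vector y_t is pulled toward
    """
    policy_step = lr * pi_hat * log_pi_grad        # push the taken action's probability up or down
    prediction_step = lr * (y_hat - y_t)           # move the prediction vector toward y_hat
    return policy_step, prediction_step
```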
[link]
I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.

Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines.

As some background context, the previously canonical Masked Language Modeling (MLM) task works by:

- Replacing some percentage of tokens with a [MASK] indicator
- Using the final-layer representation at the locations of those [MASK]s to predict the true input token
- Using as a training signal the likelihood of that prediction, i.e. how high a probability the model assigns to the true input token

The ELECTRA authors argue that there are a few notable disadvantages to this structure, if your goal is to train useful representations for downstream tasks. Firstly, your loss only consists of information (i.e. the true token) from the tokens you randomly masked, so a good amount of the data goes, in some sense, unused (except as context). Secondly, learning a full generative model of language requires a lot of data and training time, and it may not be all that beneficial for performance on your downstream tasks of interest.

As an alternative, they propose (a compact sketch of the resulting two losses appears at the end of this summary):

- Co-learning a (small) generator, trained in typical MLM fashion, alongside a discriminator. Randomly select tokens from the input to replace with fake tokens drawn from the generator's distribution.
- The goal of the discriminator is to distinguish the true tokens from the fake ones. (Minor note: if the generator happens to get lucky and generate the real token, that's counted as a "real" rather than "fake" token, even though it was produced by the generator.) This uses more of the training data in the loss, since you can ask "real or fake" for every token in the input data, not (obviously) just the ones that are actually fake.
- An important note for those familiar with GANs is that the generator isn't trained to confuse the discriminator (as is GAN-standard), but is simply trained with its own maximum likelihood loss, independent of the discriminator's performance.

They argue, and show fairly convincingly, that ELECTRA is able to reach a better efficiency-to-performance trade-off curve compared to BERT - matching the performance of previous models with notably less training, and outperforming them with comparable amounts of training.

They go on to perform a few ablations, some of which felt more convincing than others. The most confusing ablation, which I'm not sure if I just misunderstood, was meant to ask how much of the value of ELECTRA came from calculating its loss over all the tokens in the training data, rather than just the masked ones. So, they tried calculating the loss for only the masked/replaced tokens. The resulting discriminator performs very poorly downstream. But I find this a little odd as a design choice, since couldn't the discriminator learn to almost always predict that a replaced token was fake, given that the only way it could be otherwise would be if the generator got lucky and produced the true word? They also did the (more sensible, to me) experiment of calculating the loss on a similarly sized percentage of tokens, but not fully overlapping with the replacement mask, and that performed more similarly to base ELECTRA.

They also tested training a combined MLM/ELECTRA loss, where generated tokens were used in lieu of masking, and the full-sized MLM generator predicts the true token at every point in the sequence (which could be the token it gets as input, or could not be, in the case of a replacement). That model performed more closely to ELECTRA than BERT, which suggests that the efficiency gain of calculating a loss on every element in the training set was more important in practice than the gain from focusing a discriminator more directly on what was valuable for downstream tasks, rather than on generating.
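As referenced above, here's a compressed numpy sketch of how the two losses fit together for a single sequence. The array layout, the variable names, and the absence of a loss weighting are my own simplifications rather than the paper's implementation.

```python
import numpy as np

def electra_step_losses(gen_probs, disc_real_prob, original_tokens,
                        masked_positions, corrupted_tokens):
    """Two losses for one corrupted sequence.

    gen_probs        : (seq_len, vocab) generator's predicted token probabilities
    disc_real_prob   : (seq_len,) discriminator's probability that each token is real
    original_tokens  : (seq_len,) true token ids
    masked_positions : indices that were masked and handed to the generator
    corrupted_tokens : (seq_len,) the sequence the discriminator actually sees,
                       with generator samples substituted at masked_positions
    """
    # Generator: ordinary MLM maximum-likelihood loss, only at the masked positions.
    gen_loss = -np.mean(
        np.log(gen_probs[masked_positions, original_tokens[masked_positions]]))

    # Discriminator: binary real/fake loss at *every* position. A sampled token
    # that happens to equal the original counts as "real".
    is_real = (corrupted_tokens == original_tokens).astype(float)
    disc_loss = -np.mean(is_real * np.log(disc_real_prob)
                         + (1.0 - is_real) * np.log(1.0 - disc_real_prob))

    # The generator is trained only on gen_loss (no adversarial signal); the paper
    # combines the two losses with a weighting before backpropagating.
    return gen_loss, disc_loss
```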
[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process regression, focusing on the definition, hyperparameter learning, and future research directions.

A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution, whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure:

1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$
2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$
3. Draw a function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$

Applying this procedure to regression means that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$, yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations, with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}') = k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1} \pmb{\Sigma}(\pmb{X},\pmb{x}')$, with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being the vector of covariances between every training case in $\pmb{X}$ and $\pmb{x}$.

Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$, resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observations (no variance at training points).

https://i.imgur.com/BWvsB7T.png

From the machine learning perspective, the mean and the covariance function are parametrised by hyperparameters and thus provide a way to include prior knowledge, e.g. knowing that the mean function is a second-order polynomial. To find the optimal hyperparameters $\pmb{\theta}$:

1. determine the log marginal likelihood $L = \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$,
2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and
3. apply an optimization algorithm.

It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term, and the tradeoff between data fit and penalty is performed automatically.

Gaussian Processes provide a very flexible way of finding a suitable regression model (a small numerical sketch of the posterior computation follows this summary). However, they come with a high computational complexity of $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated.
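To ground the posterior equations above, here is a minimal numpy sketch that computes the noisy-observation posterior mean and covariance, assuming a zero prior mean and a squared-exponential kernel chosen purely for illustration (the tutorial itself covers general mean and covariance functions).

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance function k(x, x')."""
    sq_dists = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and covariance at the test inputs (zero prior mean assumed).

    The O(n^3) cost mentioned above lives in factorizing the n x n training
    covariance matrix.
    """
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)       # Sigma(X, x*)
    K_ss = rbf_kernel(X_test, X_test)

    L = np.linalg.cholesky(K)                # the cubic-cost step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                     # posterior mean m_D(x*)
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                     # posterior covariance k_D(x*, x*')
    return mean, cov
```

A posterior function sample at the test points can then be drawn with `np.random.multivariate_normal(mean, cov)`, mirroring the rejection-free conditioning described above.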