Welcome to ShortScience.org!
[link]
I'll admit that I found this paper a bit of a letdown to read, relative to expectations rooted in its high citation count and my general excitement to see how deep learning could be brought to bear on molecular design. But before a critique, let's first walk through the mechanics of how the authors' approach works.

The method proposed is basically a very straightforward Variational Auto Encoder, or VAE. It takes in a textual SMILES string representation of a molecular structure, uses an encoder to map that into a continuous vector representation, a decoder to map the vector representation back into a SMILES string, and an auxiliary predictor to predict properties of a molecule given the continuous representation. The training loss is thus a combination of the reconstruction loss (log probability of the true molecule under the distribution produced by the decoder) and the semi-supervised predictive loss. The hope with this model is that it would allow you to sample from a space of potential molecules by starting from an existing molecule, and then optimizing the vector representation of that molecule to make it score higher on whatever property you want to optimize for.

https://i.imgur.com/WzZsCOB.png

The authors acknowledge that, in this setup, you're just producing a probability distribution over characters, and that the continuous vectors sampled from the latent space might not actually map to valid SMILES strings, and beyond that may well not correspond to chemically valid molecules. Empirically, they report that the proportion of valid generated molecules ranged between 1% and 70%. But they argue that it would be too difficult to enforce those constraints, and instead just sample from the model and run the results through a hand-designed filter for molecular validity. In my view, this is the central weakness of the method proposed in this paper: they seem to have not tackled the question of either chemical viability or even syntactic correctness of the produced molecules.

I found it difficult to nail down from the paper what the ultimate percentage of valid molecules was for points in latent space that were away from the training data. A table reports the "percentage of 5000 randomly-selected latent points that decode to valid molecules after 1000 attempts," but I'm confused by what the 1000 attempts means here: does that mean we draw 1000 samples from the distribution given by the decoder, and see if *any* of those samples are valid? That would be a strange metric, if so, and perhaps it means something different, but it's hard to tell.

https://i.imgur.com/9sy0MXB.png

This paper made me really curious to see whether a GAN could do better in this space, since it would presumably be better at the task of incentivizing syntactic correctness of produced strings (given that any deviation from correctness could be signal for the discriminator), but it might also lead to issues around mode collapse, and when I last checked the literature, GANs on text data in particular were still not great.
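To make the setup above concrete, here is a minimal sketch of the combined training loss in PyTorch. The architecture (a small MLP encoder/decoder over one-hot SMILES, latent dimension, layer sizes) is an illustrative assumption, not the paper's actual model, which uses convolutional and recurrent components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmilesVAE(nn.Module):
    """Sketch of the joint objective: reconstruction + KL + property prediction.
    Layer sizes and the MLP encoder/decoder are illustrative assumptions."""
    def __init__(self, vocab_size, seq_len, latent_dim=56):
        super().__init__()
        self.seq_len, self.vocab_size = seq_len, vocab_size
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * vocab_size, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, seq_len * vocab_size))
        self.predictor = nn.Linear(latent_dim, 1)  # auxiliary property regressor

    def loss(self, x_onehot, y):
        # x_onehot: (batch, seq_len, vocab_size) one-hot SMILES; y: (batch,) property values
        h = self.encoder(x_onehot)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()                  # reparameterization
        logits = self.decoder(z).view(-1, self.seq_len, self.vocab_size)
        recon = F.cross_entropy(logits.transpose(1, 2), x_onehot.argmax(-1))  # char-level NLL
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())         # VAE prior term
        pred = F.mse_loss(self.predictor(z).squeeze(-1), y)                   # property loss
        return recon + kl + pred
```

Property optimization then amounts to gradient ascent on the predictor's output with respect to `z`, starting from the encoding of a known molecule and decoding the result, which is exactly where the validity problem discussed above bites.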
[link]
The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer's inputs changes. This phenomenon is called internal covariate shift (ICS).

#### What is BN?

Normalize each (scalar) feature independently with respect to the mean and variance of the mini-batch, then scale and shift the normalized values with two new parameters (per activation) that are learned (a minimal sketch of this transformation follows this summary). BN thus makes normalization part of the model architecture.

#### What do we gain?

According to the authors, BN provides a great speed-up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN acts as a regularizer for the model, which allows using less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows using sigmoids as activation functions without the saturation problem.

#### What follows?

This seems especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in iterated transformations that scale the activations up or down in certain directions (eigenvectors). This normalization seems especially useful in that context, since it would allow the gradient to flow more easily, and when we unroll RNNs we usually end up with ultra-deep networks.

#### Like

* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike

* Does not work with pure stochastic gradient descent (mini-batch size = 1).
* Could reduce the parallelism of the algorithm, since all the examples in a mini-batch are now tied.
* Results on an ensemble of networks for ImageNet make it harder to evaluate the relevance of BN by itself (although they do mention the performance of a single model).
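As a reading aid, here is a minimal sketch of the training-time BN transformation described under "What is BN?". At inference time the mini-batch statistics are replaced by running estimates collected during training, which this sketch omits.

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features). Normalize each feature over the mini-batch,
    # then scale and shift with the learned per-feature parameters gamma, beta.
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
```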
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind the weights to early in the training process and retrain, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis," after the idea that there's some small effective subnetwork you can find if you sample enough networks.

Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that (the strategies are sketched in code after this summary):

1. Weight rewinding, where you rewind weights to *near* their starting value and then retrain using the learning rates of early in training, outperforms fine-tuning from the place the weights were when you pruned, but also
2. Learning rate rewinding, where you keep weights as they are but rewind learning rates to what they were early in training, is actually the most effective for a given amount of training time/search cost.

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
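Here is a minimal sketch of magnitude pruning together with the retraining strategies being compared. The helper names and scheduling details are illustrative assumptions, not the paper's code.

```python
import torch

def global_magnitude_masks(model, prune_fraction):
    """Masks that zero out the smallest-magnitude weights across the whole model."""
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(prune_fraction * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters()}

# Retraining strategies after pruning, assuming lr_schedule(t) is the original
# learning-rate schedule over T total steps and t_rewind is an early step:
#   fine-tuning:              keep current weights, train with lr_schedule(T) (small final LR)
#   weight rewinding:         reset weights to their values at step t_rewind, apply the masks,
#                             then replay lr_schedule(t) for t = t_rewind..T
#   learning-rate rewinding:  keep current weights, apply the masks, but still replay
#                             lr_schedule(t) for t = t_rewind..T
```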
[link]
This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require from the ("outside") user an explicit definition of rewards. Specifically, the learning objective, for a current state $s$, is the mutual information between the sequence of K actions $a$ proposed by an exploration distribution $w(a|s)$ and the final state $s'$ of the agent after performing these actions.

To understand the properties of this objective, it is useful to consider this mutual information written as a difference of conditional entropies:

$$I(a,s'|s) = H(a|s) - H(a|s',s)$$

where $I(\cdot,\cdot|\cdot)$ is the (conditional) mutual information and $H(\cdot|\cdot)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high entropy $H(a|s)$) but is such that these actions have predictable consequences (i.e. lead to a predictable state $s'$, so that $H(a|s',s)$ is low). So one could think of the agent as trying to learn to have control over as much of the environment as possible; this objective has accordingly also been coined "empowerment".

The main contribution of this work is to show how to train with this objective on a large scale (i.e. larger state space and action space), using neural networks. They build on a variational lower bound on the mutual information and derive from it a stochastic variational training algorithm. The procedure has 3 components: the exploration distribution $w(a|s)$, the environment $p(s'|s,a)$ (which can be thought of as an encoder, but which isn't modeled and is only interacted with/sampled from), and the planning model $p(a|s',s)$ (which is modeled and can be thought of as a decoder). The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details).

This approach exploits neural networks of various forms. Neural autoregressive generative models are used for the exploration distribution as well as for the decoder or planning distribution. Interestingly, the framework also allows learning the state representation $s$ as a function of some "raw" representation $x$ of states. For raw states corresponding to images (e.g. the pixels of the screen image in a game), CNNs are used.
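As a reading aid (this is the standard variational bound that such methods build on, not the paper's exact notation), replacing the intractable posterior over actions with a learned planning model $q(a|s',s)$ gives a tractable lower bound on the objective above:

$$ I(a,s'|s) = H(a|s) - H(a|s',s) \geq H(a|s) + \mathbb{E}_{w(a|s)\,p(s'|s,a)}\left[\log q(a|s',s)\right] $$

with equality when $q(a|s',s)$ matches the true posterior $p(a|s',s)$. Maximizing the bound therefore simultaneously pushes for a high-entropy exploration distribution and a planning model that can recover the actions from the resulting state.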
[link]
The authors introduce a new, sampling-free method for training and evaluating energy-based models (aka EBMs, aka unnormalized density models). There are two broad approaches for training EBMs. Sampling-based approaches like contrastive divergence try to estimate the likelihood with MCMC, but can be biased if the chain is not sufficiently long, and the speed of training depends heavily on the sampling parameters. Other approaches, like score matching, avoid sampling by solving a surrogate objective that approximates the likelihood; however, using a surrogate objective also introduces bias in the solution. In any case, comparing the goodness of fit of different models is challenging, regardless of how the models were trained.

The authors introduce a measure of distance between distributions $p$ and $q$ called the Learned Stein Discrepancy ($LSD$):

$$ LSD(f_{\phi}, p, q) = \mathbb{E}_{p(x)} \left[\nabla_x \log q(x)^T f_{\phi}(x) + Tr(\nabla_x f_{\phi} (x))\right] $$

This measure is derived from the Stein Discrepancy $SD(p,q)$. Note that like the $SD$, the $LSD$ is 0 iff $p = q$. Typically, $p$ is the data distribution and $q$ is the learned approximate distribution (an EBM), although this doesn't have to be the case. Note also that this objective only requires a differentiable unnormalized distribution $\tilde{q}$, and does not require MCMC sampling or computation of the normalizing constant $Z$, since $\nabla_x \log q(x) = \nabla_x \log \tilde{q}(x) - \nabla_x \log Z = \nabla_x \log \tilde{q}(x)$.

$f_\phi$ is known as the critic function, and optimizing the $LSD$ with respect to $\phi$ (i.e. with gradient descent) over a bounded space of functions $\mathcal{F}$ can approximate the $SD$ over that space. The authors choose the function space $\mathcal{F} = \{ f: \mathbb{E}_{p(x)} [f(x)^T f(x)] < \infty \}$, which is convenient because it can be enforced by introducing a simple L2 regularizer on the critic's output: $\mathcal{R}_\lambda (f_\phi) = \lambda \mathbb{E}_{p(x)} [f_\phi(x)^T f_\phi(x)]$. Since the trace of a matrix is expensive to backpropagate through, the authors use a single-sample Monte Carlo estimate $Tr(\nabla_x f_\phi(x)) \approx \mathbb{E}_{\mathcal{N}(\epsilon|0,1)} [\epsilon^T \nabla_x f_\phi(x) \epsilon]$, which is efficient since $\epsilon^T \nabla_x f_\phi(x)$ is a vector-Jacobian product. The overall objective is thus the following:

$$ \arg\max_\phi \mathbb{E}_{p(x)} \left[\nabla_x \log q(x)^T f_{\phi}(x) + \mathbb{E}_{\epsilon} [\epsilon^T \nabla_x f_{\phi} (x) \epsilon] - \lambda f_\phi(x)^T f_\phi(x)\right] $$

It is possible to compare two different EBMs $q_1$ and $q_2$ by optimizing the above objective for two different critic parameters $\phi_1$ and $\phi_2$, using the training and validation data for critic optimization (then evaluating on the held-out test set). Note that when computing the $LSD$ on the test set, the exact trace can be computed instead of the Monte Carlo approximation to reduce variance, since gradients are no longer required. The model whose $LSD$ is closer to 0 has achieved a better fit. Similarly, a hypothesis test using the $LSD$ can be used to test whether $p = q$ for the data distribution $p$ and model distribution $q$.

The authors then show how EBM parameters $\theta$ can actually be optimized by gradient descent on the $LSD$ objective, in a minimax problem similar to that of optimizing a generative adversarial network (GAN). For a given $\theta$, you first optimize the critic $f_\phi$ with respect to $\phi$ to try to get $LSD(f_\phi, p, q_\theta)$ close to its theoretical optimum for the current $q_\theta$, then you take a single gradient step $\nabla_\theta LSD$ to minimize the $LSD$. They show some experiments indicating that this works pretty well.

One thing that was not clear to me when reading this paper is whether the $LSD(f_\phi,p,q)$ should be minimized or maximized with respect to $\phi$ to get it close to the true $SD(p,q)$. Although it is possible for the $LSD$ to be above or below 0 for a given choice of $q$ and $f_\phi$, the problem can always be formulated as minimization by simply changing the sign of $f_\phi$ at the beginning such that the $LSD$ is positive (or as maximization by making it negative).
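For concreteness, here is a minimal sketch of the regularized critic objective with the single-sample trace estimate, in PyTorch. The function and argument names are mine, and `score_q(x)` is assumed to return $\nabla_x \log \tilde{q}(x)$ (which for an EBM can itself be obtained by autograd); this is a sketch of the objective, not the authors' implementation.

```python
import torch

def lsd_objective(score_q, critic, x, lam=1.0):
    # x: (batch, dim) samples from p. The critic maps R^dim -> R^dim.
    x = x.clone().requires_grad_(True)
    f = critic(x)                                   # f_phi(x), shape (batch, dim)
    sq = score_q(x)                                 # grad_x log q~(x), shape (batch, dim)
    term1 = (sq * f).sum(dim=1)                     # score-critic inner product
    eps = torch.randn_like(x)                       # Hutchinson probe vector
    # eps^T (df/dx) via a vector-Jacobian product, then dotted with eps ~ Tr(grad_x f(x))
    vjp = torch.autograd.grad(f, x, grad_outputs=eps, create_graph=True)[0]
    term2 = (vjp * eps).sum(dim=1)
    reg = lam * (f * f).sum(dim=1)                  # L2 penalty keeping the critic bounded
    return (term1 + term2 - reg).mean()

# Critic step (push the LSD toward the SD over the regularized function class):
#   loss = -lsd_objective(score_q, critic, x_batch); loss.backward(); critic_opt.step()
# EBM step (minimize w.r.t. theta, gradients flowing through score_q):
#   loss = lsd_objective(score_q, critic, x_batch); loss.backward(); ebm_opt.step()
```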