Welcome to ShortScience.org!
[link]
I have a lot of fondness for this paper as a result of its impulse towards clear explanations, simplicity, and pushing back against complexity for complexity’s sake. The goal of the paper is pretty straightforward. Long Short Term Memory networks (LSTMs) work by having a memory vector, and pulling information into and out of that vector through a gating system. These gates take as input the context of the network at a given timestep (the prior hidden state, and the current input), apply weight matrices and a sigmoid activation, and produce “mask” vectors with values between 0 and 1. A typical LSTM learns three separate gates: a “forget” gate that controls how much of the old memory vector is remembered, an “input” gate that controls how much new contextual information is added to the memory, and an “output” gate that controls how much of the output (a sum of the gated memory information and the gated input information) is passed outward into a hidden state context that’s visible to the rest of the network. Note that “hidden” is an unfortunate word here, since this is actually the state that is visible to the rest of the network, whereas the “memory” vector is only visible to the next-step memory updating calculations. Also note that “forget gate” is an awkward name insofar as the higher the value of the forget gate, the more the model *remembers* of its past memory. This is confusing, but we appear to be stuck with this terminology.

The Gated Recurrent Unit, or GRU, did away with the output gate. In this system, the difference between “hidden” and “memory” vectors is removed, so the network no longer has separate channels for communicating with subsequent layers and for memory passed to future timesteps. On a wide range of problems, the GRU has performed comparably to the LSTM. This makes the authors ask: if a two-gate model can do as well, can a single-gate model? In particular: how well does an LSTM-style model perform if it only has a forget gate? The answer, to not bury the probably-obvious lede, is: quite well. Models that only have a forget gate perform comparably to or better than traditional LSTM models on the tasks at which they were tried.

On a mechanical level, not having an input gate means that, instead of scaling “how much old memory do you keep” and “how much new context do you take in” independently (so that those values could be, for example, 0.2 and 0.15), the two weights are tied together through a single value: the forget gate. That’s a fancy way of saying: we calculate some f between 0 and 1, use f as the weight on the old memory, and (1 - f) as the weight on the new input, so the updated memory is a convex combination of the two. This model, for reasons that are entirely unjustified, and obviously the result of some In Joke, is called JANET, because with a single gate, it’s Just Another NETwork. Image is attached to prove I’m Not Making This Shit Up.

The authors go down a few pathways of explaining why this forget-only model performs well, of which the most compelling is that it gives the model an easier and more efficient way to learn a skip connection, where information is passed down more or less intact to a future point in the model. It’s more straightforward to learn because the “skip-ness” of the connection, or how strongly the information wants to propagate into the future, is controlled by just one set of parameters, and not a complex interaction of input, forget, and output.
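To make that update concrete, here is a minimal NumPy sketch of a forget-gate-only recurrent step. The parameter names (W_f, U_f, b_f, W_c, U_c, b_c) and the exact candidate computation are my own illustrative choices rather than the paper’s precise parameterization; the point is just that a single gate f forms the convex combination between old memory and new candidate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_only_step(x_t, h_prev, p):
    """One step of a forget-gate-only ("JANET"-style) recurrent cell.

    A single gate f both retains the old memory (weight f) and admits the
    new candidate (weight 1 - f), so the update is a convex combination.
    Parameter names (W_f, U_f, b_f, W_c, U_c, b_c) are illustrative.
    """
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    return f * h_prev + (1.0 - f) * c_tilde  # no input or output gate
```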
An interesting side investigation they perform is how the initialization of the bias term in the forget gate (which is calculated by applying weights to the input and former hidden state, and then adding a constant bias term) affects a model’s ability to learn long-term dependencies. In particular, they discuss the situation where the model gets some signal, and then a long string of 0 values. If the bias term is quite low, then with all of those 0-valued inputs the forget gate is determined almost entirely by the bias, and since the resulting gate value is well below 1, the memory gets multiplied by it at every timestep and shrinks closer and closer to 0. The paper suggests initializing the bias of the forget gate according to the longest dependencies you expect the model to have, with the idea that you should more strongly bias your model towards remembering old information, regardless of what new information comes in, if you expect long-term dependencies to be strongly relevant.
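As a rough illustration of that idea, here is one chrono-style way to set the forget-gate bias for dependencies up to an expected length t_max. The specific formula and the uniform distribution are assumptions on my part, not necessarily the paper’s exact prescription.

```python
import numpy as np

def init_forget_bias(hidden_size, t_max, rng=np.random.default_rng(0)):
    """Initialize the forget-gate bias so sigmoid(b_f) starts close to 1,
    biasing the cell toward retaining memory over roughly 1..t_max steps.
    (A chrono-style scheme; the paper's exact prescription may differ.)"""
    return np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
```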
[link]
The general goal of meta-learning systems is to learn useful shared structure across a broad distribution of tasks, in such a way that learning on a new task can be faster. Some of the ways this has historically been done are through initializations (i.e. initializing the network at a point from which it is easy to further optimize on each individual task drawn from some distribution of tasks), and through recurrent network structures (where you treat the multiple timesteps of a recurrent network as the training iterations on a single task, and train the recurrent weights of the network based on generalization performance across a wide range of tasks). This paper proposes a different approach: a learned proxy loss function. The idea here is that, often, early in the learning process, hand-coded rewards aren’t the best or most valuable signal to use to guide a network, both because they may be high variance, and because they might not natively incentivize things like exploration rather than just exploitation. A better situation would be if we had some more far-sighted loss function we could use, one that had proved to be a good proxy over a variety of different rewards. This is exactly what this method proposes to give us.

Training consists of an inner loop and an outer loop. Each instantiation of the inner loop corresponds to a single RL task, drawn from a distribution over tasks (for example, all tasks involving the robot walking to a position, with a single instantiated task being the task of walking to one specific position). Within the inner loop, we apply a typical policy gradient procedure for optimizing the parameters of our policy, except that, instead of expected rewards, we optimize our policy parameters according to a loss function we specifically parametrize. Within the outer loop, we take as signal the final reward of the trained policy on this task, and use that to update our parametrized loss. This parametrized loss is itself a neural network that takes in the agent’s most recent states, actions, and rewards over a rolling window of recent timesteps, performs temporal convolutions on those, and produces a final loss value out the other side. In short, this auxiliary network takes in information about the agent’s recent behavior, and outputs an assessment of how well the agent is doing according to this longer-view loss criterion.

Because it’s not possible to directly formulate the test performance of a policy in terms of the loss function that was used to train the policy (which would be necessary for backprop), the weights of this loss-calculating network are instead learned via evolution strategies. At a zoomed-out level, this means: making small random perturbations to the current parameters of the network, and moving in the direction of the random changes that work the best (as sketched below). So, ultimately, you end up with a loss network that takes in recent environmental states and the behavior of the agent, and returns an estimate of the proxy loss value, hopefully trained such that it captures environmental factors that indicate progress on the task, over a wide variety of similar tasks. Then, during testing, the RL agent can use that loss function to adapt its behavior.
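Here is a minimal sketch of that outer-loop evolution-strategies update. The function evaluate_final_return and the hyperparameters are stand-ins I’ve invented for illustration: it is assumed to run the whole inner loop (train a fresh policy under the loss network parameterized by phi, then report that trained policy’s final task reward). The paper’s actual estimator and normalization details may differ.

```python
import numpy as np

def es_update(phi, evaluate_final_return, sigma=0.05, lr=0.01, n_samples=16,
              rng=np.random.default_rng(0)):
    """One evolution-strategies step on the loss network's parameters phi.

    evaluate_final_return(phi) is a stand-in that runs the whole inner loop:
    train a fresh policy using the loss network parameterized by phi, then
    return that trained policy's final task reward.
    """
    eps = rng.standard_normal((n_samples,) + phi.shape)        # random perturbations
    returns = np.array([evaluate_final_return(phi + sigma * e) for e in eps])
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize returns
    grad = (adv[:, None] * eps.reshape(n_samples, -1)).sum(0) / (n_samples * sigma)
    return phi + lr * grad.reshape(phi.shape)                  # move toward better perturbations
```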
An interesting note here is that, for tasks where the parameters of the task being learned are inferable from the environment - for example, where the goal is “move towards the green dot” - you don’t actually need to give the agent the rewards from a new task; ideally, it will have learned how to infer the task from the environment. One of the examples they use to prove their method has done something useful is to train their model entirely on tasks where an ant-agent’s goal is to move towards various different targets on the right, and then shift it to a scenario where its target is towards the left. In the EPG case, the ant was able to quickly learn to move left, because its loss function was able to adapt to the new environment where the target had moved. By contrast, RL^2 (a trained learning algorithm implemented as a recurrent network) kept moving right as its initial strategy, and seemed unable to learn the specifics of a task outside its original task distribution of “always move right”. I think this paper could benefit from being a little more concrete about what its expected use cases are (like: what kinds of environments lend themselves to having proxy loss functions inferred from environmental data? Which don’t?), but overall, I find the kernel of the idea this model introduces interesting, and will be interested to see if other researchers run with it.
[link]
Meta-learning is an area sparking a lot of research curiosity these days. It’s framed in different ways: models that can adapt, models that learn to learn, models that can learn a new task quickly. This paper uses a somewhat different lens: that of neural plasticity, and argues that applying the concept to modern neural networks will give us an effective, biologically inspired way of building adaptable models. The basic premise of plasticity from a neurobiology perspective (at least as it was framed in the paper: I’m not a neuroscientist myself, and may be misunderstanding) is that plasticity acts as a kind of gate on how strongly a neural link gets upregulated by experience. The more plastic a connection is, the more quickly it can get modified by new data; the less plastic, the more fixed it is.

In concrete terms, this is implemented by subdividing the weight on each connection in the network into two parts: the “fixed” component and the “plastic” component (see picture). The fixed component acts like a typical weight: it gets modified during training, but stays fixed once training is done. The plastic component is composed of an alpha weight multiplied by a term H. H is basically a decaying running average of the past input*output activations of this connection. Activations that are high in magnitude, and of the same sign, for both the input and the output will push H higher. Note that this H can continue to be updated even after the model is done training, because it builds up information whenever you pass a new input X through the network. The plastic component’s learned weight, alpha, controls how strong the influence of this is on the model. If alpha is near zero, then the connection behaves basically identically to a “typical” neural network connection, with weights that don’t change as a function of activation values. If alpha is positive, strong co-activation within H will tend to make the connection weight higher. If alpha is negative, the opposite is true, and strong co-activation will make the connection weight more negative. (As an aside, I’d be really interested to see the distribution over alpha values in a trained model, relative to the weight values, and look at how often they go in the same direction as the weights and increase magnitude, and how often they go in the opposite direction and attenuate the weight towards zero.)

These models are trained by running them for fixed-size “episodes” during which the H value gets iteratively changed, and then the alpha parameters get updated in the way that would have reduced error over the episode. One area in which they seem to show strong performance is memorization (where the network is shown an image once, and needs to reconstruct it later). The theory for why this works is that the weights are able to store short-term information about which pixels are in the images the network sees, by temporarily boosting themselves for inputs and activations they’ve recently seen.

There are definitely some intuitional gaps for me in this paper. The core one is: this framework just makes weights able to update themselves as a function of the values of their activations, not as a function of an actual loss function. That is to say: it seems like a potentially better analogy to neural plasticity is just a network that periodically gets more training data, and has some amount of connection plasticity with which to update as a result of that.
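To pin down the mechanics, here is a small NumPy sketch of one plastic layer as I understand it. The layer shape, the tanh nonlinearity, and the decay rate eta are illustrative assumptions rather than the paper’s exact equations.

```python
import numpy as np

def plastic_forward(x, w, alpha, hebb, eta=0.1):
    """Forward pass through one plastic layer (a sketch, not the paper's
    exact equations).

    Effective weight = fixed component w plus learned plasticity coefficient
    alpha times the Hebbian trace hebb. The trace is a decaying running
    average of input/output co-activations, so it keeps changing even after
    w and alpha are frozen at the end of training.
    """
    y = np.tanh(x @ (w + alpha * hebb))                # plastic effective weights
    hebb = (1.0 - eta) * hebb + eta * np.outer(x, y)   # Hebbian trace update
    return y, hebb
```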
[link]
DeepMind’s recently released paper (one of a boatload coming out in the wake of ICLR, which just finished in Vancouver) addresses the problem of building an algorithm that can perform well on tasks that don’t just stay fixed in their definition, but instead evolve and change, without giving the agent a chance to re-train in the middle. An example used at various points in the paper: an agent trying to run East finds two of its legs (a different two each time) slowly becoming less functional.

The theoretical framework they use to approach this problem is that of meta-learning. Meta-learning is typically formulated as: how can I learn to do well on a new task, given only a small number of examples of that task? That’s why it’s called “meta”: it’s an extra, higher-level optimization loop applied around the process of learning. Typical learning learns the parameters of some task; meta-learning learns longer-scale parameters that make the short-scale, typical learning work better. Here, the task that evolves and changes over time (i.e. a nonstationary task) is seen as a close variant of the multi-task problem. And so the hope is that a model that can quickly adapt to arbitrary new tasks can also be used to learn the ability to adapt to a gradually changing task environment.

The meta-learning algorithm that got most directly adapted for this paper is MAML: Model Agnostic Meta Learning. This algorithm works by, for a large number of tasks, initializing the model at some parameter set theta, evaluating the loss for a few examples on that task, and following the gradients from the initialization theta to a task-specific parameter set phi. Then, it calculates the “test set” performance of the one-step phi parameters on the task. But then - the crucial thing here - the meta-learning model updates its initialization parameters, theta. So the meta-learning model is learning a set of parameters that provides a good jumping-off point for any given task within the distribution of tasks the model is trained on. In order to do this well, the theta parameters need to both 1) learn any general information shared across all tasks, and 2) position the parameters such that an initial update step moves the model in a profitable direction.

They adapted this idea, of training a model that could quickly update to multiple tasks, to the setting of a slowly/continuously changing environment, where certain parameters of the task the agent is facing shift gradually over time. In this formulation, our set of tasks is no longer made up of random draws from the distribution of possible tasks, but is a smooth, Markov-walk progression over tasks. The main change the authors made to the original MAML algorithm was to say that each overall task would start at theta, but then, as that task gradually evolved, it would perform multiple updates: theta to phi1, phi1 to phi2, and so on. The original theta parameters would then be updated according to a similar principle as the MAML parameters: so as to make the loss, summed over the full non-stationary task (notionally composed of many little sub-tasks), as low as possible.
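For reference, here is a rough first-order sketch of the vanilla MAML update that this paper builds on. The task interface (grad_train / grad_test) and the learning rates are placeholders I’ve chosen for illustration; full MAML also differentiates through the inner step rather than using the first-order shortcut shown here.

```python
import numpy as np

def fomaml_outer_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One first-order MAML-style meta-update (a rough sketch).

    Each task is assumed to expose grad_train(params) and grad_test(params),
    returning gradients of its training and held-out losses at those
    parameters. Full MAML differentiates through the inner step; the
    first-order shortcut below skips that for brevity.
    """
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        phi = theta - inner_lr * task.grad_train(theta)  # inner adaptation: theta -> phi
        meta_grad += task.grad_test(phi)                 # evaluate adapted parameters
    return theta - outer_lr * meta_grad / len(tasks)     # update the shared initialization
```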
[link]
The problem setting of the paper is the desire to perform translation in a monolingual setting, where datasets exist for each language independently, but there is little or no paired sentence data (paired here meaning that you know you have the same sentence or text in both languages). The paper outlines the prior methods in this area as being, first, training a single-language language model (i.e. training a model to take in a sentence and return how coherent a sentence it is in a given language) and using that to supplement a machine translation system. The authors honestly don’t go into this much, so I can’t tell exactly what they mean by it. The second baseline they talk about is bootstrapping additional training data: training a model using a small amount of training data, then using that mediocre model to translate additional sentences, which are used as additional training data to push the mediocre model to higher performance. It doesn’t seem like this should work, but I’ve seen this or similar approaches used in a few cases, and it typically does add benefit. But, the authors claim, they can do better.

The core intuition of this paper is pretty simple, and will be familiar to anyone who read my summary of CycleGAN, lo these many weeks ago. Their approach rests on the idea that, even if you can’t push translation models to be objectively correct in a paired sense, you can push translation models to be symmetric with one another, insofar as translating from language A to B (let’s say English to French), and then back from French to English, gets you something in English that looks like your original input. This forces the model to maintain an informative mapping, so that enough information about the English sentence is stored to allow it to be reconstructed. However, unconstrained, the model could just develop a 1:1 word mapping that gives you information about the English input, but doesn’t actually map to the translation in French. If you can additionally confirm that the translation into French looks like a coherent French sentence (which, recall, we can do with a language model trained on French independently), we can get closer to generating a mapping that is hopefully more coherent.

One interesting aspect of this paper is the fact that the model they describe is trained with reinforcement learning. Typically, reinforcement learning is used for scenarios where you don’t have direct visibility into how the actions you take impact your loss function. Compare this to a supervised network (where you can take the derivative of your loss with respect to the last layer, and backpropagate that back through to your inputs), or even a GAN, where you can take the derivative of the discriminator-created loss back through the input to the discriminator, and through into the generator that created it. This model treats the translation models that are learned as policies; that is, probability distributions over sets of words. It samples multiple A -> B translations using something called beam search, which, as it generates the sequence of words, keeps several candidate words at each timestep, and keeps those chains alive by continuing to sample along them. This helps the sequential translation avoid falling into a pit where it samples one highly probable word that, as more words are added, doesn’t lead towards a good sentence.
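For readers unfamiliar with beam search, here is a tiny generic sketch of the idea (not the paper’s implementation): rather than greedily committing to one next word, keep the beam_width best-scoring partial hypotheses at every step. The step_logprobs function, the eos token, and the hyperparameters are placeholders.

```python
import numpy as np

def beam_search(step_logprobs, beam_width=12, max_len=20, eos=0):
    """Generic beam search sketch: keep the beam_width best partial
    hypotheses at each timestep instead of greedily committing to one word.

    step_logprobs(prefix) is a stand-in for the translation model's decoder:
    it returns an array of log-probabilities over the vocabulary for the
    next token, given the tokens generated so far.
    """
    beams = [([], 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:
                candidates.append((prefix, score))       # finished hypothesis
                continue
            logp = step_logprobs(prefix)
            for tok in np.argsort(logp)[-beam_width:]:   # top-scoring next tokens
                candidates.append((prefix + [int(tok)], score + float(logp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```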
This ultimately results in taking multiple (k=12, in this case) samples from each translation distribution, and so the model uses as its objective the expected reward over these samples, where the reward is constructed as a combination of an in-language coherence score (the log likelihood under a trained single-language model) and a reconstruction score (how close the A -> B -> A’ round trip is to the original A).

My confusion about the use of reinforcement learning here mostly comes from the question of whether it just wasn’t possible to build an end-to-end model, where, like a GAN, you backpropagated through a constructed input, back to the model that constructed it (in this case, through both translator models). Is the issue just that a sequence of words is fundamentally discrete, in a way that images aren’t, and in a way that impedes backprop? That seems possible, but also, I think it’s the case that a typical encoder-decoder model, which outputs softmaxes over words, is able to be backpropagated through. Overall, it’s hard for me to tell if I’m missing something basic about why reinforcement learning is the obvious choice here, or if other more GAN-like approaches were an option, but that meme hadn’t spread into the literature yet, and RL was the more historically canonical choice.

One other minor disappointing note: it looks like their results are based on a scenario that does use a small number of bilingual training pairs, as a way to pretrain the translation models to a reasonable, non-random starting point. It’s not clear whether this method would have worked with an actual cold start, i.e. a translation model that has no idea what it’s doing, and is using only this as signal. That said, they used a much smaller number of bilingual pairs than a true supervised method, and so even with a need for a warm start, a method like this could still give you leverage over language pairs where there exists some paired data, but not enough to build a full, sophisticated model on top of.
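As a concrete reading of that objective, here is a small sketch of how the per-sample reward might be combined and averaged over the k sampled translations. The dictionary fields and the mixing weight alpha are my own illustrative stand-ins, not the paper’s exact formulation.

```python
import numpy as np

def expected_dual_reward(samples, alpha=0.5):
    """Monte-Carlo estimate of the dual objective over k sampled translations.

    Each sample is assumed to carry two pre-computed scores:
      "lm_logprob":    log-likelihood of the sampled B-language translation
                       under a pretrained B language model (coherence term),
      "recon_logprob": log P(original A sentence | sampled translation)
                       under the B -> A model (reconstruction term).
    alpha is an assumed mixing weight; the paper's exact weighting and
    normalization may differ.
    """
    rewards = [alpha * s["lm_logprob"] + (1.0 - alpha) * s["recon_logprob"]
               for s in samples]
    return float(np.mean(rewards))  # averaged over the k beam-search samples
```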