Summary by CodyWild
Meta-learning is an area sparking a lot of research curiosity these days. It’s framed in different ways: models that can adapt, models that learn to learn, models that can learn a new task quickly. This paper uses a somewhat different lens, that of neural plasticity, and argues that applying the concept to modern neural networks gives us an effective and biologically inspired way of building adaptable models. The basic premise of plasticity from a neurobiology perspective (at least as it’s framed in the paper: I’m not a neuroscientist myself, and may be misunderstanding) is that plasticity acts as a kind of gate on how strongly a neural connection gets upregulated by experience. The more plastic a connection is, the more quickly it can be modified by new data; the less plastic, the more fixed it is.
In concrete terms, this is implemented by subdividing the weight on each connection in the network into two parts: the “fixed” component and the “plastic” component (see picture). The fixed component acts like a typical weight: it gets modified during training, but stays fixed once training is done. The plastic component is composed of a learned weight, alpha, multiplied by a term H. H is basically a decaying running average of the past input*output activations across this connection. Activations that are high in magnitude, and of the same sign, for both the input and the output will push H higher. Note that H can continue to be updated even after the model is done training, because it builds up information whenever you pass a new input X through the network. Alpha controls how strongly this trace influences the connection. If alpha is near zero, the connection behaves basically identically to a “typical” neural network connection, with a weight that doesn’t change as a function of activation values. If alpha is positive, strong co-activation within H will tend to make the connection weight higher; if alpha is negative, the opposite is true, and strong co-activation will push the connection weight more negative. (As an aside, I’d be really interested to see the distribution over alpha values in a trained model, relative to the fixed weight values: how often do they point in the same direction as the weights and amplify them, and how often do they point the opposite way and attenuate the weight towards zero?)
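To make that mechanism concrete, here’s a minimal sketch of a single plastic layer as I understand the description; the names (w, alpha, hebb, eta), the dimensions, and the tanh nonlinearity are my own stand-ins rather than details confirmed from the paper:

```python
import torch

# A minimal sketch of one "plastic" layer as described above; names and
# dimensions are placeholders, not necessarily what the paper uses.

n_in, n_out = 4, 3
w = 0.1 * torch.randn(n_in, n_out)      # "fixed" component: trained, then frozen
alpha = 0.1 * torch.randn(n_in, n_out)  # plasticity coefficient: also trained, then frozen
hebb = torch.zeros(n_in, n_out)         # H: Hebbian trace, starts at zero
eta = 0.1                               # decay rate of the running average

def forward(x, hebb):
    # Effective weight = fixed part + alpha-scaled plastic part.
    y = torch.tanh(x @ (w + alpha * hebb))
    # H is a decaying running average of input*output co-activation,
    # so it keeps changing even after w and alpha are frozen.
    hebb = (1 - eta) * hebb + eta * torch.outer(x, y)
    return y, hebb

y, hebb = forward(torch.randn(n_in), hebb)  # H accumulates with every new input
```

The important bit is that hebb keeps accumulating with every forward pass, while w and alpha are only ever touched by training.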
These models are trained by running them over fixed-size “episodes,” during which the H values get iteratively updated, and then the alpha parameters get updated in the direction that would have reduced error over the episode. One area in which they seem to show strong performance is memorization (where the network is shown an image once, and needs to reconstruct it later). The theory for why this works is that the plastic weights can store short-term information about which pixels were in recently seen images, by temporarily boosting themselves for inputs and activations they’ve recently seen.
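Here’s a rough sketch of what that episode structure might look like in PyTorch; the task, loss, and hyperparameters below are placeholders, not the paper’s actual experimental setup:

```python
import torch

# Sketch of episode-based training: H evolves within an episode, while
# gradient descent only ever updates w and alpha using the episode's error.

n_in, n_out, eta = 8, 8, 0.1
w = torch.nn.Parameter(0.01 * torch.randn(n_in, n_out))      # fixed component
alpha = torch.nn.Parameter(0.01 * torch.randn(n_in, n_out))  # plasticity coefficients
opt = torch.optim.Adam([w, alpha], lr=1e-3)

for episode in range(100):
    hebb = torch.zeros(n_in, n_out)   # H is reset at the start of each episode
    inputs = torch.randn(20, n_in)    # stand-in for the episode's data
    loss = torch.tensor(0.0)
    for x in inputs:
        y = torch.tanh(x @ (w + alpha * hebb))
        hebb = (1 - eta) * hebb + eta * torch.outer(x, y)  # H changes within the episode
        loss = loss + ((y - x) ** 2).mean()                # toy reconstruction-style loss
    opt.zero_grad()
    loss.backward()   # backprop through the whole episode...
    opt.step()        # ...updates w and alpha, but never H itself
```

The key point the sketch tries to capture is that H is never trained directly: it just accumulates activity, and the learning signal only shapes how much each connection listens to it.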
There are definitely some gaps in intuition for me in this paper. The core one is: this framework only makes weights able to update themselves as a function of the values of their activations, not as a function of an actual loss. Put differently, it seems like a potentially better analogy to neural plasticity might just be a network that periodically gets more training data, and has some amount of connection plasticity with which to update as a result.