Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, Gerald Tesauro
arXiv e-Print archive - 2018
Keywords:
cs.LG, cs.AI, stat.ML
First published: 2018/10/29
Abstract: Lack of performance when it comes to continual learning over non-stationary
distributions of data remains a major challenge in scaling neural network
learning to more human realistic settings. In this work we propose a new
conceptualization of the continual learning problem in terms of a trade-off
between transfer and interference. We then propose a new algorithm,
Meta-Experience Replay (MER), that directly exploits this view by combining
experience replay with optimization based meta-learning. This method learns
parameters that make interference based on future gradients less likely and
transfer based on future gradients more likely. We conduct experiments across
continual lifelong supervised learning benchmarks and non-stationary
reinforcement learning environments demonstrating that our approach
consistently outperforms recently proposed baselines for continual learning.
Our experiments show that the gap between the performance of MER and baseline
algorithms grows both as the environment gets more non-stationary and as the
fraction of the total experiences stored gets smaller.
Catastrophic forgetting is the tendency of a neural network to forget previously learned information when learning new information. This paper combats it by keeping a buffer of past experience and applying meta-learning to replay from that buffer. They call the resulting algorithm Meta-Experience Replay (MER).
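The buffer is deliberately small; to decide what stays in it, the paper relies on reservoir sampling, so the stored items approximate a uniform sample of everything seen so far. Below is a minimal sketch of such a buffer (the class name and interface are illustrative, not taken from the paper's code):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity experience buffer filled by reservoir sampling, so
    every example seen so far has an equal chance of being retained."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0

    def add(self, example):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Overwrite a random slot with probability capacity / num_seen.
            slot = random.randrange(self.num_seen)
            if slot < self.capacity:
                self.items[slot] = example

    def sample(self, k):
        # Draw up to k stored examples uniformly at random for replay.
        return random.sample(self.items, min(k, len(self.items)))
```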
How does this work? At each update the model takes several SGD steps: one on the new batch of data and several more on batches of past experience sampled from the buffer. A Reptile-style meta-update is then applied, moving the original weights only partway toward where those inner steps ended. The analysis of Reptile shows that this implicitly maximizes the expected dot product between the gradients of different examples, so an update driven by new data is encouraged to also help (transfer to) the old data rather than interfere with it. https://i.imgur.com/TG4mZOn.png
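The sketch below shows one such update in the spirit of MER's Reptile variant (PyTorch; the function name, step count, and learning rates are illustrative assumptions, not the paper's exact Algorithm 1): take a few SGD steps on the new example interleaved with replayed ones, then interpolate the starting weights toward the result.

```python
import torch

def mer_update(model, loss_fn, optimizer, new_example, buffer, meta_lr=0.3):
    """One Reptile-style meta-replay update: inner SGD on a mix of the new
    example and replayed ones, then interpolate back toward the old weights."""
    # Snapshot the weights before the inner SGD steps.
    theta_before = [p.detach().clone() for p in model.parameters()]

    # Inner loop: SGD on the new example interleaved with replayed examples.
    for x, y in [new_example] + buffer.sample(4):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    # Reptile meta-update: theta <- theta_before + meta_lr * (theta_after - theta_before).
    # This implicitly rewards parameter regions where gradients of different
    # examples have positive dot products (transfer rather than interference).
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), theta_before):
            p.copy_(p0 + meta_lr * (p - p0))

    # Keep the new example around for future replay.
    buffer.add(new_example)
```

The partial interpolation is the key design choice: rather than committing to the full sequence of inner updates, the weights only move in the direction that the mixed new-plus-replayed steps agreed on, which is what keeps interference with old experience low.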
Does it work? Yes. While it may take longer to train, the results show that MER generalizes better and needs a much smaller experience buffer than the standard experience-replay baselines.