Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Massimo Caccia and Pau Rodriguez and Oleksiy Ostapenko and Fabrice Normandin and Min Lin and Lucas Caccia and Issam Laradji and Irina Rish and Alexande Lacoste and David Vazquez and Laurent Charlin

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.AI, cs.LG

**First published:** 2020/03/12 (4 years ago)

**Abstract:** Learning from non-stationary data remains a great challenge for machine
learning. Continual learning addresses this problem in scenarios where the
learning agent faces a stream of changing tasks. In these scenarios, the agent
is expected to retain its highest performance on previous tasks without
revisiting them while adapting well to the new tasks. Two new recent
continual-learning scenarios have been proposed. In meta-continual learning,
the model is pre-trained to minimize catastrophic forgetting when trained on a
sequence of tasks. In continual-meta learning, the goal is faster remembering,
i.e., focusing on how quickly the agent recovers performance rather than
measuring the agent's performance without any adaptation. Both scenarios have
the potential to propel the field forward. Yet in their original formulations,
they each have limitations. As a remedy, we propose a more general scenario
where an agent must quickly solve (new) out-of-distribution tasks, while also
requiring fast remembering. We show that current continual learning, meta
learning, meta-continual learning, and continual-meta learning techniques fail
in this new scenario. Accordingly, we propose a strong baseline:
Continual-MAML, an online extension of the popular MAML algorithm. In our
empirical experiments, we show that our method is better suited to the new
scenario than the methodologies mentioned above, as well as standard continual
learning and meta learning approaches.
more
less

Massimo Caccia and Pau Rodriguez and Oleksiy Ostapenko and Fabrice Normandin and Min Lin and Lucas Caccia and Issam Laradji and Irina Rish and Alexande Lacoste and David Vazquez and Laurent Charlin

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.AI, cs.LG

[link]
disclaimer: I'm the first author of the paper ## TL;DR We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e. sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time a new approach to Continual Learning (CL), coined OSAKA, which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning. main modifications we propose: - bring CL closer to Online learning i.e. at test time, the model is continually learning and evaluated on its online predictions - it's fine to forget, as long as you can quickly remember (just like we humans do) - we allow pretraining, (because you wouldn't deploy an untrained CL system, right?) but at test time, the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises) - the tasks distribution is actually a hidden Markov chain. This implies: - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want to get a good total performance! - tasks have different lengths - and the tasks boundaries are unknown (task agnostic setting) ### Bonus: We provide a unifying framework explaining the space of machine learning setting {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} in case it was starting to get confusing :p ## Motivation We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or in manufactures where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user’s life to fit its personal needs. Further motivations can be found in time series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, autonomous driving, to name a few. In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime. Differently, standard CL reports the agent’s final performance on all tasks at the end of its life. In order to succeed in this scenario, agents need the ability to learn new tasks as well as quickly remembering old ones. ## Unifying Framework We propose a unifying framework explaining the space of machine learning setting {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta learning terminology. https://i.imgur.com/U16kHXk.png (easier to digest with accompanying text) ## OSAKA The main features of the evaluation framework are - task agnosticism - pre-training is allowed, but OoD tasks at test time - task revisiting - controllable non-stationarity - online evaluation (see paper for the motivations of the features) ## Continual-MAML: an initial baseline A simple extension of MAML that is better suited than previous methods in the proposed setting. https://i.imgur.com/C86WUc8.png Features are: - Fast Adapatation - Dynamic Representation - Task boundary detection - Computational efficiency ## Experiments We provide a suite of 3 benchmarks to test algorithms in the new setting. The first includes the Omniglot, MNIST and FashionMNIST dataset. The second and third use the Synbols (Lacoste et al. 2018) and TieredImageNet datasets, respectively. The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning, in the new setting. https://i.imgur.com/IQ1WYTp.png The second and third experiments lead us to similar conclusions code: https://github.com/ElementAI/osaka |

Uncertainty-guided Continual Learning with Bayesian Neural Networks

Ebrahimi, Sayna and Elhoseiny, Mohamed and Darrell, Trevor and Rohrbach, Marcus

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

Ebrahimi, Sayna and Elhoseiny, Mohamed and Darrell, Trevor and Rohrbach, Marcus

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

[link]
## Introduction Bayesian Neural Networks (BNN): intrinsic importance model based on weight uncertainty; variational inference can approximate posterior distributions using Monte Carlo sampling for gradient estimation; acts like an ensemble method in that they reduce the prediction variance but only uses 2x the number of parameters. The idea is to use BNN's uncertainty to guide gradient descent to not update the important weight when learning new tasks. ## Bayes by Backprop (BBB): https://i.imgur.com/7o4gQMI.png Where $q(w|\theta)$ is our approximation of the posterior $p(w|x)$. $q$ is most probably gaussian with diagonal covariance. We can optimize this via the ELBO: https://i.imgur.com/OwGm20b.png ## Uncertainty-guided CL with BNN (UCB): UCB the regularizing is performed with the learning rate such that the learning rate of each parameter and hence its gradient update becomes a function of its importance. They set the importance to be inversely proportional to the standard deviation $\sigma$ of $q(w|\theta)$ Simply put, the more confident the posterior is about a certain weight, the less is this weight going to be updated. You can also use the importance for weight pruning (sort of a hard version of the first idea) ## Cartoon https://i.imgur.com/6Ld79BS.png |

Online Meta-Learning

Finn, Chelsea and Rajeswaran, Aravind and Kakade, Sham M. and Levine, Sergey

International Conference on Machine Learning - 2019 via Local Bibsonomy

Keywords: dblp

Finn, Chelsea and Rajeswaran, Aravind and Kakade, Sham M. and Levine, Sergey

International Conference on Machine Learning - 2019 via Local Bibsonomy

Keywords: dblp

[link]
## Introduction Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning. * Meta Learning: past experience is used to acquire a prior over model parameters or a learning procedure, and typically studies a setting where a set of meta-training tasks are made available together upfront * Online learning : a sequential setting where tasks are revealed one after another, but aims to attain zero-shot generalization without any task-specific adaptation. We argue that neither setting is ideal for studying continual lifelong learning. Meta-learning deals with learning to learn, but neglects the sequential and non-stationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task. ## Online Learning Online learning focuses on regret minimization. Most standard notion of regret is to compare to the cumulative loss of the best fixed model in hindsight: https://i.imgur.com/pbZG4kK.png One way minimize regret is with Follow the Leader (FTL): https://i.imgur.com/NCs73vG.png ## Online Meta-learning Setting: let $U_t$ be the update procedure for task $t$ e.g. in MAML: https://i.imgur.com/Q4I4HkD.png The overall protocol for the setting is as follows: 1. At round t, the agent chooses a model defined by $w_t$ 2. The world simultaneously chooses task defined by $f_t$ 3. The agent obtains access to the update procedure $U_t$, and uses it to update parameters as $\tilde w_t = U_t(w_t)$ 4. The agent incurs loss $f_t(\tilde w_t )$. Advance to round t + 1. the goal for the agent is to minimize regrets over rounds. Achieving sublinear regrets means you're improving and converging to upper bound (joint training on all tasks) ## Algorithm and Analysis: Follow the meta-leader (FTML): https://i.imgur.com/qWb9g8Q.png FTML’s regret is sublinear (under some assumption) |

Online Continual Learning with Maximally Interfered Retrieval

Rahaf Aljundi and Lucas Caccia and Eugene Belilovsky and Massimo Caccia and Laurent Charlin and Tinne Tuytelaars

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2019/08/11 (5 years ago)

**Abstract:** Continual learning, the setting where a learning agent is faced with a never
ending stream of data, continues to be a great challenge for modern machine
learning systems. In particular the online or "single-pass through the data"
setting has gained attention recently as a natural setting that is difficult to
tackle. Methods based on replay, either generative or from a stored memory,
have been shown to be effective approaches for continual learning, matching or
exceeding the state of the art in a number of standard benchmarks. These
approaches typically rely on randomly selecting samples from the replay memory
or from a generative model, which is suboptimal. In this work we consider a
controlled sampling of memories for replay. We retrieve the samples which are
most interfered, i.e. whose prediction will be most negatively impacted by the
foreseen parameters update. We show a formulation for this sampling criterion
in both the generative replay and the experience replay setting, producing
consistent gains in performance and greatly reduced forgetting.
more
less

Rahaf Aljundi and Lucas Caccia and Eugene Belilovsky and Massimo Caccia and Laurent Charlin and Tinne Tuytelaars

arXiv e-Print archive - 2019 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
Disclaimer: I am an author # Intro Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner. Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks! In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. The method is called Maximally Interfered Retrieval (MIR). ## Cartoon for explanation https://i.imgur.com/5F3jT36.png Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy. # Method 1) incoming data: $(X_t,Y_t)$ 2) foreseen parameter update: $\theta^v= \theta-\alpha\nabla\mathcal{L}(f_\theta(X_t),Y_t)$ ### applied to ER (ER-MIR) 3) Search for the top-$k$ values $x$ in the stored memories using the criterion $$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x),y) -\mathcal{L}(f_{\theta}(x),y)$$ ### or applied to GEN (GEN-MIR) 3) $$ \underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)),Y^*\big) -\mathcal{L}\big(f_{\theta}(g_\gamma(Z)),Y^*\big) $$ $$ \text{s.t.} \quad ||z_i-z_j||_2^2 > \epsilon \forall z_i,z_j \in Z \,\text{with} \, z_i\neq z_j $$ i.e. search in the latent space of a generative model $g_\gamma$ for samples that are the most forgotten given the foreseen update. 4) Then add theses memories to incoming data $X_t$ and train $f_\theta$ # Results ### qualitative https://i.imgur.com/ZRNTWXe.png Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows) which are similar to 8s and 9s respectively. ### quantitative GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases. ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases. # Other stuff ### (for avid readers) We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder dataset like e.g. CIFAR-10. |

WAIC, but Why? Generative Ensembles for Robust Anomaly Detection

Hyunsun Choi and Eric Jang and Alexander A. Alemi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/02 (6 years ago)

**Abstract:** Machine learning models encounter Out-of-Distribution (OoD) errors when the
data seen at test time are generated from a different stochastic generator than
the one used to generate the training data. One proposal to scale OoD detection
to high-dimensional data is to learn a tractable likelihood approximation of
the training distribution, and use it to reject unlikely inputs. However,
likelihood models on natural data are themselves susceptible to OoD errors, and
even assign large likelihoods to samples from other datasets. To mitigate this
problem, we propose Generative Ensembles, which robustify density-based OoD
detection by way of estimating epistemic uncertainty of the likelihood model.
We present a puzzling observation in need of an explanation -- although
likelihood measures cannot account for the typical set of a distribution, and
therefore should not be suitable on their own for OoD detection, WAIC performs
surprisingly well in practice.
more
less

Hyunsun Choi and Eric Jang and Alexander A. Alemi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
### Summary Knowing when a model is qualified to make a prediction is critical to safe deployment of ML technology. Model-independent / Unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which lower likelihoods are OoD (assigned by a Likelihood Model), but the intuition that In-Distribution (ID) inputs should have highest likelihoods _does not hold in higher dimension_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach. ### Counterintuitive Properties of Likelihood Models: https://i.imgur.com/4vo0Ff5.png So a GLOW model with Gaussian prior maps SVHN closer to the origin than Cifar (but never actually generates SVHN because Gaussian samples are on the shell). This is bad news for OoD detection. ### Proposed Methodology: Use the WAIC criterion for OoD detection which gives an asymptotically correct estimate of the gap between the training set and test set expectations: https://i.imgur.com/vasSxuk.png Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This acts to robustify the estimate, ensuring that points that are sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples i.e. the ensembles acts as approximate posterior samples. Now, OoD can be detected with a Likelihood Model: https://i.imgur.com/M3CDKOA.png ### Discussion Interestingly, GLOW maps Cifar and other datasets INSIDE the gaussian shell (which is an annulus of radius $\sqrt{dim} = \sqrt{3072} \approx 55.4$ https://i.imgur.com/ERdgOaz.png This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space. |

About