[link]
disclaimer: I'm the first author of the paper

## TL;DR
We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e., sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time for a new approach to Continual Learning (CL), which we coin OSAKA, and which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.

The main modifications we propose:
- bring CL closer to Online Learning, i.e., at test time the model is continually learning and evaluated on its online predictions
- it's fine to forget, as long as you can quickly remember (just like we humans do)
- we allow pretraining (because you wouldn't deploy an untrained CL system, right?), but at test time the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises)
- the task distribution is actually a hidden Markov chain. This implies:
    - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want a good total performance!
    - tasks have different lengths
    - task boundaries are unknown (task-agnostic setting)

### Bonus:
We provide a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} in case it was starting to get confusing :p

## Motivation
We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or manufacturing plants, where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user's life to fit their personal needs. Further motivations can be found in time-series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, and autonomous driving, to name a few.

In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime. In contrast, standard CL reports the agent's final performance on all tasks at the end of its life. To succeed in this scenario, agents need the ability to learn new tasks as well as to quickly remember old ones.

## Unifying Framework
We propose a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta-learning terminology.

https://i.imgur.com/U16kHXk.png

(easier to digest with the accompanying text)

## OSAKA
The main features of the evaluation framework are:
- task agnosticism
- pre-training is allowed, but OoD tasks appear at test time
- task revisiting
- controllable non-stationarity
- online evaluation

(see the paper for the motivations behind these features)

## Continual-MAML: an initial baseline
A simple extension of MAML that is better suited than previous methods to the proposed setting.

https://i.imgur.com/C86WUc8.png

Its features are:
- fast adaptation
- dynamic representation
- task boundary detection
- computational efficiency

## Experiments
We provide a suite of 3 benchmarks to test algorithms in the new setting. The first includes the Omniglot, MNIST and FashionMNIST datasets. The second and third use the Synbols (Lacoste et al.
2018) and TieredImageNet datasets, respectively. The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, meta-continual learning, and continual-meta learning, in the new setting.

https://i.imgur.com/IQ1WYTp.png

The second and third experiments lead us to similar conclusions.

code: https://github.com/ElementAI/osaka
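To make the protocol concrete, here is a minimal sketch of the online, task-agnostic evaluation loop under the assumptions above (hypothetical interfaces and transition probability, not the released code):

```python
import numpy as np

# Minimal sketch of OSAKA-style online evaluation (hypothetical interfaces,
# not the released code). A hidden Markov chain over tasks decides at every
# step whether to stay on the current task or switch (possibly back to an old
# task, possibly to an OoD one); the boundary is never revealed. The model
# keeps learning at test time and is scored on its online predictions.

def evaluate_online(learner, tasks, n_steps=1000, p_stay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    task = tasks[rng.integers(len(tasks))]      # hidden state of the Markov chain
    cumulative_acc = 0.0
    for _ in range(n_steps):
        if rng.random() > p_stay:               # transition: tasks can re-occur
            task = tasks[rng.integers(len(tasks))]
        x, y = task.sample_batch()
        y_hat = learner.predict(x)              # prediction counted BEFORE the update
        cumulative_acc += float(np.mean(y_hat == y))
        learner.update(x, y)                    # continual adaptation at test time
    return cumulative_acc / n_steps             # cumulative online performance
```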
[link]
## Introduction
Bayesian Neural Networks (BNNs): an intrinsic importance model based on weight uncertainty; variational inference can approximate the posterior distribution, using Monte Carlo sampling for gradient estimation. A BNN acts like an ensemble method in that it reduces the prediction variance, while only using 2x the number of parameters.

The idea is to use the BNN's uncertainty to guide gradient descent so that it does not update the important weights when learning new tasks.

## Bayes by Backprop (BBB):

https://i.imgur.com/7o4gQMI.png

where $q(w|\theta)$ is our approximation of the posterior $p(w|\mathcal{D})$. $q$ is typically a Gaussian with diagonal covariance. We can optimize this via the ELBO:

https://i.imgur.com/OwGm20b.png

## Uncertainty-guided CL with BNN (UCB):
In UCB, the regularization is performed through the learning rate: the learning rate of each parameter, and hence its gradient update, becomes a function of its importance. The importance is set to be inversely proportional to the standard deviation $\sigma$ of $q(w|\theta)$.

Simply put, the more confident the posterior is about a certain weight, the less this weight gets updated. The importance can also be used for weight pruning (sort of a hard version of the first idea).

## Cartoon

https://i.imgur.com/6Ld79BS.png
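A minimal sketch of the uncertainty-guided update, assuming the BBB parameterization $\sigma = \text{softplus}(\rho)$ (the function and names below are illustrative, not the authors' implementation):

```python
import torch

# Sketch of the UCB idea: each mean parameter mu of the variational posterior
# q(w|theta) = N(mu, sigma^2) gets its own learning rate, scaled by sigma.
# Small sigma = high importance (1/sigma) = small update; large sigma =
# uncertain weight = free to move when learning a new task.

def ucb_step(mu, rho, grad_mu, base_lr=0.01):
    """One gradient step on the posterior means with uncertainty-scaled lr.

    mu, rho : tensors parameterizing q; sigma = softplus(rho) as in BBB.
    grad_mu : gradient of the ELBO loss w.r.t. mu.
    """
    sigma = torch.nn.functional.softplus(rho)   # per-weight std of q(w|theta)
    per_weight_lr = base_lr * sigma             # lr inversely proportional to importance 1/sigma
    return mu - per_weight_lr * grad_mu
```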
[link]
## Introduction
Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning.

* Meta learning: past experience is used to acquire a prior over model parameters or a learning procedure. It typically studies a setting where a set of meta-training tasks is made available together upfront.
* Online learning: a sequential setting where tasks are revealed one after another, but which aims to attain zero-shot generalization without any task-specific adaptation.

We argue that neither setting is ideal for studying continual lifelong learning. Meta learning deals with learning to learn, but neglects the sequential and non-stationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task.

## Online Learning
Online learning focuses on regret minimization. The most standard notion of regret compares against the cumulative loss of the best fixed model in hindsight:

https://i.imgur.com/pbZG4kK.png

One way to minimize regret is with Follow the Leader (FTL):

https://i.imgur.com/NCs73vG.png

## Online Meta-Learning
Setting: let $U_t$ be the update procedure for task $t$, e.g. in MAML:

https://i.imgur.com/Q4I4HkD.png

The overall protocol for the setting is as follows:
1. At round $t$, the agent chooses a model defined by $w_t$.
2. The world simultaneously chooses a task defined by $f_t$.
3. The agent obtains access to the update procedure $U_t$, and uses it to update its parameters as $\tilde w_t = U_t(w_t)$.
4. The agent incurs the loss $f_t(\tilde w_t)$. Advance to round $t + 1$.

The goal for the agent is to minimize regret over the rounds. Achieving sublinear regret means the agent keeps improving and converges to the performance of the best comparator (joint training on all tasks).

## Algorithm and Analysis
Follow the Meta-Leader (FTML):

https://i.imgur.com/qWb9g8Q.png

FTML's regret is sublinear (under some assumptions).
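For intuition, here is a minimal sketch of one protocol round with a MAML-style update procedure $U_t$ (hypothetical interfaces, not the paper's code; the loss closures stand in for support/query batches from task $f_t$):

```python
import torch

# One online meta-learning round: U_t adapts the current meta-parameters w
# with a single inner gradient step on task f_t, and the agent is charged
# the loss of the adapted model.

def U_t(w, support_loss_fn, inner_lr=0.01):
    """MAML inner update: w_tilde = w - alpha * grad f_t(w)."""
    loss = support_loss_fn(w)
    (grad,) = torch.autograd.grad(loss, w, create_graph=True)
    return w - inner_lr * grad

def play_round(w, support_loss_fn, query_loss_fn):
    w_tilde = U_t(w, support_loss_fn)   # task-specific adaptation
    return query_loss_fn(w_tilde)       # loss f_t(w_tilde) incurred this round
```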
[link]
Disclaimer: I am an author of the paper.

# Intro
Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner.

Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks! In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose predictions will be most negatively impacted by the foreseen parameter update. The method is called Maximally Interfered Retrieval (MIR).

## Cartoon for explanation

https://i.imgur.com/5F3jT36.png

Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy.

# Method
1) incoming data: $(X_t, Y_t)$
2) foreseen parameter update: $\theta^v = \theta - \alpha \nabla \mathcal{L}(f_\theta(X_t), Y_t)$

### applied to ER (ER-MIR)
3) search the stored memory for the top-$k$ samples $x$ according to the criterion
$$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x), y) - \mathcal{L}(f_{\theta}(x), y)$$

### or applied to GEN (GEN-MIR)
3) $$\underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)), Y^*\big) - \mathcal{L}\big(f_{\theta}(g_\gamma(Z)), Y^*\big)$$
$$\text{s.t.} \quad ||z_i - z_j||_2^2 > \epsilon \quad \forall \, z_i, z_j \in Z \, \text{ with } \, z_i \neq z_j$$
i.e. search in the latent space of a generative model $g_\gamma$ for the samples that are the most forgotten given the foreseen update.

4) Then add these retrieved samples to the incoming data $X_t$ and train $f_\theta$.

# Results
### qualitative

https://i.imgur.com/ZRNTWXe.png

Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows), which are similar to 8s and 9s respectively.

### quantitative
GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases. ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases.

# Other stuff
### (for avid readers)
We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder datasets, e.g., CIFAR-10.
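A minimal ER-MIR retrieval sketch following steps 2)-3) above, for a classification learner (hypothetical helper names, not the official repo):

```python
import copy
import torch
import torch.nn.functional as F

# Take the foreseen ("virtual") SGD step on the incoming batch, then retrieve
# from the replay buffer the k samples whose loss increases the most under
# that virtual update.

def retrieve_mir(model, x_in, y_in, mem_x, mem_y, lr=0.1, k=10):
    # 1) foreseen parameter update theta^v = theta - lr * grad L(f_theta(X_t), Y_t)
    virtual = copy.deepcopy(model)
    F.cross_entropy(virtual(x_in), y_in).backward()
    with torch.no_grad():
        for p in virtual.parameters():
            if p.grad is not None:
                p -= lr * p.grad

    # 2) interference score s_MI(x) = L(f_{theta^v}(x), y) - L(f_theta(x), y)
    with torch.no_grad():
        loss_after = F.cross_entropy(virtual(mem_x), mem_y, reduction="none")
        loss_before = F.cross_entropy(model(mem_x), mem_y, reduction="none")
        scores = loss_after - loss_before

    # 3) replay the top-k maximally interfered memories alongside (x_in, y_in)
    top = scores.topk(min(k, len(mem_x))).indices
    return mem_x[top], mem_y[top]
```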
[link]
### Summary
Knowing when a model is qualified to make a prediction is critical to the safe deployment of ML technology. Model-independent / unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which lower likelihoods are OoD (as assigned by a likelihood model), but the intuition that In-Distribution (ID) inputs should have the highest likelihoods _does not hold in higher dimensions_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach.

### Counterintuitive Properties of Likelihood Models

https://i.imgur.com/4vo0Ff5.png

A GLOW model with a Gaussian prior maps SVHN closer to the origin than CIFAR (but never actually generates SVHN, because Gaussian samples lie on the shell). This is bad news for OoD detection.

### Proposed Methodology
Use the WAIC criterion for OoD detection, which gives an asymptotically correct estimate of the gap between the training-set and test-set expectations:

https://i.imgur.com/vasSxuk.png

Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This robustifies the estimate, ensuring that points whose likelihood is sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples, i.e. the ensemble acts as approximate posterior samples. Now, OoD inputs can be detected with a likelihood model:

https://i.imgur.com/M3CDKOA.png

### Discussion
Interestingly, GLOW maps CIFAR and other datasets INSIDE the Gaussian shell (which is an annulus of radius $\sqrt{dim} = \sqrt{3072} \approx 55.4$):

https://i.imgur.com/ERdgOaz.png

This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space.
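A minimal sketch of the WAIC-based OoD score with an ensemble standing in for posterior samples (hypothetical `log_prob` interface and names, not the authors' code):

```python
import numpy as np

# WAIC(x) = mean_theta[log p_theta(x)] - var_theta[log p_theta(x)], where the
# mean and variance are taken over the ensemble members (approximate posterior
# samples). Inputs with low WAIC are flagged as out-of-distribution.

def waic_score(models, x):
    # per-model log-likelihoods on the batch, shape (n_models, n_examples)
    log_probs = np.stack([m.log_prob(x) for m in models])
    return log_probs.mean(axis=0) - log_probs.var(axis=0)

def is_ood(models, x, threshold):
    # one-tailed test: scores below the threshold are treated as OoD
    return waic_score(models, x) < threshold
```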