[link]
disclaimer: I'm the first author of the paper

## TL;DR
We have made a lot of progress on catastrophic forgetting within the standard evaluation protocol, i.e., sequentially learning a stream of tasks and testing our models' capacity to remember them all. We think it's time for a new approach to Continual Learning (CL), which we coin OSAKA, and which is more aligned with real-life applications of CL. It brings CL closer to Online Learning and Open-World Learning.

The main modifications we propose:
- bring CL closer to Online Learning, i.e., at test time the model is continually learning and evaluated on its online predictions
- it's fine to forget, as long as you can quickly remember (just like we humans do)
- we allow pretraining (because you wouldn't deploy an untrained CL system, right?), but at test time the model will have to quickly learn new out-of-distribution (OoD) tasks (because the world is full of surprises)
- the task distribution is actually a hidden Markov chain. This implies:
    - new and old tasks can re-occur (just like in real life). Better remember them quickly if you want a good total performance!
    - tasks have different lengths
    - task boundaries are unknown (task-agnostic setting)

### Bonus:
We provide a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} in case it was starting to get confusing :p

## Motivation
We imagine an agent, embedded or not, first pre-trained in a controlled environment and later deployed in the real world, where it faces new or unexpected situations. This scenario is relevant for many applications. For instance, in robotics, the agent is pre-trained in a factory and deployed in homes or manufacturing plants, where it will need to adapt to new domains and maybe solve new tasks. Likewise, a virtual assistant can be pre-trained on static datasets and deployed in a user's life to fit their personal needs. Further motivations can be found in time-series forecasting, e.g., market prediction, game playing, autonomous customer service, recommendation systems, and autonomous driving, to name a few.

In this scenario, we are interested in the cumulative performance of the agent throughout its lifetime. In contrast, standard CL reports the agent's final performance on all tasks at the end of its life. To succeed in this scenario, agents need the ability to learn new tasks as well as to quickly remember old ones.

## Unifying Framework
We propose a unifying framework explaining the space of machine learning settings {supervised learning, meta learning, continual learning, meta-continual learning, continual-meta learning} with meta-learning terminology.

https://i.imgur.com/U16kHXk.png

(easier to digest with the accompanying text)

## OSAKA
The main features of the evaluation framework are:
- task agnosticism
- pre-training is allowed, but OoD tasks appear at test time
- task revisiting
- controllable non-stationarity
- online evaluation

(see the paper for the motivations behind these features)

## Continual-MAML: an initial baseline
A simple extension of MAML that is better suited than previous methods to the proposed setting.

https://i.imgur.com/C86WUc8.png

Its features are:
- fast adaptation
- dynamic representation
- task boundary detection
- computational efficiency

## Experiments
We provide a suite of 3 benchmarks to test algorithms in the new setting. The first includes the Omniglot, MNIST and FashionMNIST datasets. The second and third use the Synbols (Lacoste et al.
2018) and TieredImageNet datasets, respectively. The first set of experiments shows that the baseline outperforms previous approaches, i.e., supervised learning, meta learning, continual learning, meta-continual learning, and continual-meta learning, in the new setting.

https://i.imgur.com/IQ1WYTp.png

The second and third experiments lead us to similar conclusions.

code: https://github.com/ElementAI/osaka
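To make the protocol concrete, here is a minimal sketch of the online, task-agnostic evaluation loop under the assumptions above (hypothetical interfaces and transition probability, not the released code):

```python
import numpy as np

# Minimal sketch of OSAKA-style online evaluation (hypothetical interfaces,
# not the released code). A hidden Markov chain over tasks decides at every
# step whether to stay on the current task or switch (possibly back to an old
# task, possibly to an OoD one); the boundary is never revealed. The model
# keeps learning at test time and is scored on its online predictions.

def evaluate_online(learner, tasks, n_steps=1000, p_stay=0.9, seed=0):
    rng = np.random.default_rng(seed)
    task = tasks[rng.integers(len(tasks))]      # hidden state of the Markov chain
    cumulative_acc = 0.0
    for _ in range(n_steps):
        if rng.random() > p_stay:               # transition: tasks can re-occur
            task = tasks[rng.integers(len(tasks))]
        x, y = task.sample_batch()
        y_hat = learner.predict(x)              # prediction counted BEFORE the update
        cumulative_acc += float(np.mean(y_hat == y))
        learner.update(x, y)                    # continual adaptation at test time
    return cumulative_acc / n_steps             # cumulative online performance
```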
[link]
## Introduction
Bayesian Neural Networks (BNNs): an intrinsic importance model based on weight uncertainty; variational inference can approximate the posterior distribution, using Monte Carlo sampling for gradient estimation. A BNN acts like an ensemble method in that it reduces the prediction variance, while only using 2x the number of parameters.

The idea is to use the BNN's uncertainty to guide gradient descent so that it does not update the important weights when learning new tasks.

## Bayes by Backprop (BBB):

https://i.imgur.com/7o4gQMI.png

where $q(w|\theta)$ is our approximation of the posterior $p(w|\mathcal{D})$. $q$ is typically a Gaussian with diagonal covariance. We can optimize this via the ELBO:

https://i.imgur.com/OwGm20b.png

## Uncertainty-guided CL with BNN (UCB):
In UCB, the regularization is performed through the learning rate: the learning rate of each parameter, and hence its gradient update, becomes a function of its importance. The importance is set to be inversely proportional to the standard deviation $\sigma$ of $q(w|\theta)$.

Simply put, the more confident the posterior is about a certain weight, the less this weight gets updated. The importance can also be used for weight pruning (sort of a hard version of the first idea).

## Cartoon

https://i.imgur.com/6Ld79BS.png
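A minimal sketch of the uncertainty-guided update, assuming the BBB parameterization $\sigma = \text{softplus}(\rho)$ (the function and names below are illustrative, not the authors' implementation):

```python
import torch

# Sketch of the UCB idea: each mean parameter mu of the variational posterior
# q(w|theta) = N(mu, sigma^2) gets its own learning rate, scaled by sigma.
# Small sigma = high importance (1/sigma) = small update; large sigma =
# uncertain weight = free to move when learning a new task.

def ucb_step(mu, rho, grad_mu, base_lr=0.01):
    """One gradient step on the posterior means with uncertainty-scaled lr.

    mu, rho : tensors parameterizing q; sigma = softplus(rho) as in BBB.
    grad_mu : gradient of the ELBO loss w.r.t. mu.
    """
    sigma = torch.nn.functional.softplus(rho)   # per-weight std of q(w|theta)
    per_weight_lr = base_lr * sigma             # lr inversely proportional to importance 1/sigma
    return mu - per_weight_lr * grad_mu
```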
[link]
## Introduction
Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning.

* Meta learning: past experience is used to acquire a prior over model parameters or a learning procedure. It typically studies a setting where a set of meta-training tasks is made available together upfront.
* Online learning: a sequential setting where tasks are revealed one after another, but which aims to attain zero-shot generalization without any task-specific adaptation.

We argue that neither setting is ideal for studying continual lifelong learning. Meta learning deals with learning to learn, but neglects the sequential and non-stationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task.

## Online Learning
Online learning focuses on regret minimization. The most standard notion of regret compares against the cumulative loss of the best fixed model in hindsight:

https://i.imgur.com/pbZG4kK.png

One way to minimize regret is with Follow the Leader (FTL):

https://i.imgur.com/NCs73vG.png

## Online Meta-Learning
Setting: let $U_t$ be the update procedure for task $t$, e.g. in MAML:

https://i.imgur.com/Q4I4HkD.png

The overall protocol for the setting is as follows:
1. At round $t$, the agent chooses a model defined by $w_t$.
2. The world simultaneously chooses a task defined by $f_t$.
3. The agent obtains access to the update procedure $U_t$, and uses it to update its parameters as $\tilde w_t = U_t(w_t)$.
4. The agent incurs the loss $f_t(\tilde w_t)$. Advance to round $t + 1$.

The goal for the agent is to minimize regret over the rounds. Achieving sublinear regret means the agent keeps improving and converges to the performance of the best comparator (joint training on all tasks).

## Algorithm and Analysis
Follow the Meta-Leader (FTML):

https://i.imgur.com/qWb9g8Q.png

FTML's regret is sublinear (under some assumptions).
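For intuition, here is a minimal sketch of one protocol round with a MAML-style update procedure $U_t$ (hypothetical interfaces, not the paper's code; the loss closures stand in for support/query batches from task $f_t$):

```python
import torch

# One online meta-learning round: U_t adapts the current meta-parameters w
# with a single inner gradient step on task f_t, and the agent is charged
# the loss of the adapted model.

def U_t(w, support_loss_fn, inner_lr=0.01):
    """MAML inner update: w_tilde = w - alpha * grad f_t(w)."""
    loss = support_loss_fn(w)
    (grad,) = torch.autograd.grad(loss, w, create_graph=True)
    return w - inner_lr * grad

def play_round(w, support_loss_fn, query_loss_fn):
    w_tilde = U_t(w, support_loss_fn)   # task-specific adaptation
    return query_loss_fn(w_tilde)       # loss f_t(w_tilde) incurred this round
```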
[link]
Disclaimer: I am an author of the paper.

# Intro
Experience replay (ER) and generative replay (GEN) are two effective continual learning strategies. In the former, samples from a stored memory are replayed to the continual learner to reduce forgetting. In the latter, old data is compressed with a generative model and generated data is replayed to the continual learner.

Both of these strategies assume a random sampling of the memories. But learning a new task doesn't cause **equal** interference (forgetting) on the previous tasks! In this work, we propose a controlled sampling of the replays. Specifically, we retrieve the samples which are most interfered, i.e. whose predictions will be most negatively impacted by the foreseen parameter update. The method is called Maximally Interfered Retrieval (MIR).

## Cartoon for explanation

https://i.imgur.com/5F3jT36.png

Learning about dogs and horses might cause more interference on lions and zebras than on cars and oranges. Thus, replaying lions and zebras would be a more efficient strategy.

# Method
1) incoming data: $(X_t, Y_t)$
2) foreseen parameter update: $\theta^v = \theta - \alpha \nabla \mathcal{L}(f_\theta(X_t), Y_t)$

### applied to ER (ER-MIR)
3) search the stored memory for the top-$k$ samples $x$ according to the criterion
$$s_{MI}(x) = \mathcal{L}(f_{\theta^v}(x), y) - \mathcal{L}(f_{\theta}(x), y)$$

### or applied to GEN (GEN-MIR)
3) $$\underset{Z}{\max} \, \mathcal{L}\big(f_{\theta^v}(g_\gamma(Z)), Y^*\big) - \mathcal{L}\big(f_{\theta}(g_\gamma(Z)), Y^*\big)$$
$$\text{s.t.} \quad ||z_i - z_j||_2^2 > \epsilon \quad \forall \, z_i, z_j \in Z \, \text{ with } \, z_i \neq z_j$$
i.e. search in the latent space of a generative model $g_\gamma$ for the samples that are the most forgotten given the foreseen update.

4) Then add these retrieved samples to the incoming data $X_t$ and train $f_\theta$.

# Results
### qualitative

https://i.imgur.com/ZRNTWXe.png

Whilst learning 8s and 9s (first row), GEN-MIR mainly retrieves 3s and 4s (bottom two rows), which are similar to 8s and 9s respectively.

### quantitative
GEN-MIR was tested on MNIST SPLIT and Permuted MNIST, outperforming the baselines in both cases. ER-MIR was tested on MNIST SPLIT, Permuted MNIST and Split CIFAR-10, outperforming the baselines in all cases.

# Other stuff
### (for avid readers)
We propose a hybrid method (AE-MIR) in which the generative model is replaced with an autoencoder to facilitate the compression of harder datasets, e.g., CIFAR-10.
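A minimal ER-MIR retrieval sketch following steps 2)-3) above, for a classification learner (hypothetical helper names, not the official repo):

```python
import copy
import torch
import torch.nn.functional as F

# Take the foreseen ("virtual") SGD step on the incoming batch, then retrieve
# from the replay buffer the k samples whose loss increases the most under
# that virtual update.

def retrieve_mir(model, x_in, y_in, mem_x, mem_y, lr=0.1, k=10):
    # 1) foreseen parameter update theta^v = theta - lr * grad L(f_theta(X_t), Y_t)
    virtual = copy.deepcopy(model)
    F.cross_entropy(virtual(x_in), y_in).backward()
    with torch.no_grad():
        for p in virtual.parameters():
            if p.grad is not None:
                p -= lr * p.grad

    # 2) interference score s_MI(x) = L(f_{theta^v}(x), y) - L(f_theta(x), y)
    with torch.no_grad():
        loss_after = F.cross_entropy(virtual(mem_x), mem_y, reduction="none")
        loss_before = F.cross_entropy(model(mem_x), mem_y, reduction="none")
        scores = loss_after - loss_before

    # 3) replay the top-k maximally interfered memories alongside (x_in, y_in)
    top = scores.topk(min(k, len(mem_x))).indices
    return mem_x[top], mem_y[top]
```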
[link]
### Summary
Knowing when a model is qualified to make a prediction is critical to the safe deployment of ML technology. Model-independent / unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which lower likelihoods are OoD (as assigned by a likelihood model), but the intuition that In-Distribution (ID) inputs should have the highest likelihoods _does not hold in higher dimensions_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach.

### Counterintuitive Properties of Likelihood Models

https://i.imgur.com/4vo0Ff5.png

A GLOW model with a Gaussian prior maps SVHN closer to the origin than CIFAR (but never actually generates SVHN, because Gaussian samples lie on the shell). This is bad news for OoD detection.

### Proposed Methodology
Use the WAIC criterion for OoD detection, which gives an asymptotically correct estimate of the gap between the training-set and test-set expectations:

https://i.imgur.com/vasSxuk.png

Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This robustifies the estimate, ensuring that points whose likelihood is sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples, i.e. the ensemble acts as approximate posterior samples. Now, OoD inputs can be detected with a likelihood model:

https://i.imgur.com/M3CDKOA.png

### Discussion
Interestingly, GLOW maps CIFAR and other datasets INSIDE the Gaussian shell (which is an annulus of radius $\sqrt{dim} = \sqrt{3072} \approx 55.4$):

https://i.imgur.com/ERdgOaz.png

This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space.
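A minimal sketch of the WAIC-based OoD score with an ensemble standing in for posterior samples (hypothetical `log_prob` interface and names, not the authors' code):

```python
import numpy as np

# WAIC(x) = mean_theta[log p_theta(x)] - var_theta[log p_theta(x)], where the
# mean and variance are taken over the ensemble members (approximate posterior
# samples). Inputs with low WAIC are flagged as out-of-distribution.

def waic_score(models, x):
    # per-model log-likelihoods on the batch, shape (n_models, n_examples)
    log_probs = np.stack([m.log_prob(x) for m in models])
    return log_probs.mean(axis=0) - log_probs.var(axis=0)

def is_ood(models, x, threshold):
    # one-tailed test: scores below the threshold are treated as OoD
    return waic_score(models, x) < threshold
```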