First published: 2018/05/24
Abstract: Continual learning experiments used in current deep learning papers do not
faithfully assess fundamental challenges of learning continually, masking
weak-points of the suggested approaches instead. We study gaps in such existing
evaluations, proposing essential experimental evaluations that are more
representative of continual learning's challenges, and suggest a
re-prioritization of research efforts in the field. We show that current
approaches fail with our new evaluations and, to analyse these failures, we
propose a variational loss which unifies many existing solutions to continual
learning under a Bayesian framing, as either 'prior-focused' or
'likelihood-focused'. We show that while prior-focused approaches such as EWC
and VCL perform well on existing evaluations, they perform dramatically worse
when compared to likelihood-focused approaches on other simple tasks.
Through a likelihood-focused derivation of a variational inference (VI) loss, Variational Generative Experience Replay (VGER) presents the closest appropriate likelihood-focused alternative to Variational Continual Learning (VCL), the state-of-the-art prior-focused approach to continual learning.
In non-continual learning, the aim is to learn parameters $\omega$ from labelled training data $\mathcal{D}$ in order to infer $p(y|\omega, x)$. In the continual learning context, by contrast, the data as a whole is not independently and identically distributed (i.i.d.), but may be split into separate tasks $\mathcal{D}_t = (X_t, Y_t)$, and only the examples $x_t^{n_t}$ and $y_t^{n_t}$ within each task are assumed to be i.i.d.
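As an illustration of the task structure described above, the following minimal sketch partitions a labelled dataset into tasks $\mathcal{D}_t = (X_t, Y_t)$ by class label. Splitting by class is only one common way to build such tasks and is an assumption here, as are the function name `split_into_tasks`, the `classes_per_task` parameter, and the toy data.

```python
def split_into_tasks(X, Y, classes_per_task=2):
    """Partition (X, Y) into a list of tasks D_t = (X_t, Y_t), grouped by class label.

    Hypothetical helper for illustration; the class-based split is an assumption.
    """
    classes = sorted(set(Y))
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = set(classes[start:start + classes_per_task])
        X_t = [x for x, y in zip(X, Y) if y in task_classes]
        Y_t = [y for y in Y if y in task_classes]
        tasks.append((X_t, Y_t))
    return tasks


# Toy data: 12 two-dimensional examples with labels cycling through 0..5.
X = [[float(i), float(i % 3)] for i in range(12)]
Y = [i % 6 for i in range(12)]

# Three tasks arrive in sequence: classes {0, 1}, then {2, 3}, then {4, 5}.
tasks = split_into_tasks(X, Y)
```

Within each task the examples can be treated as i.i.d., but the sequence of tasks is not: a learner that sees `tasks[0]`, then `tasks[1]`, then `tasks[2]` never sees all classes mixed together.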
In \cite{Farquhar18}, the loss at time $t$ cannot be evaluated on earlier datasets, because they have already been discarded. To approximate the distribution of past datasets $p_t(x,y)$, VGER trains a GAN $q_t(x, y)$ to produce ($\hat{x}, \hat{y}$) pairs for each class in each dataset as it arrives; the generator is kept, while the data is discarded once the dataset has been used. The variational free energy $\mathcal{F}_T$ is then used to train on dataset $\mathcal{D}_T$ augmented with samples generated by the GAN. In this way the prior is set as the posterior approximation from the previous task.
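The data flow of this replay scheme can be sketched as follows. This is not the authors' implementation: the GAN is replaced by a hypothetical per-class Gaussian sampler (`ClassGaussianGenerator`) so that the example runs with the standard library only, and the optimisation of $\mathcal{F}_T$ is abstracted into a user-supplied `train_step` callback. All names and parameters are assumptions made for illustration.

```python
import random
import statistics


class ClassGaussianGenerator:
    """Hypothetical stand-in for the per-task GAN q_t(x, y).

    Fits an independent Gaussian to each feature of each class. A real GAN
    would replace this class; only the fit/sample interface matters here.
    """

    def __init__(self, X_t, Y_t):
        self.stats = {}
        for label in sorted(set(Y_t)):
            rows = [x for x, y in zip(X_t, Y_t) if y == label]
            cols = list(zip(*rows))
            self.stats[label] = [
                (statistics.fmean(c), statistics.pstdev(c)) for c in cols
            ]

    def sample(self, n_per_class, rng):
        """Draw n_per_class generated (x_hat, y_hat) pairs for every class."""
        X_hat, Y_hat = [], []
        for label, col_stats in self.stats.items():
            for _ in range(n_per_class):
                X_hat.append([rng.gauss(mu, sigma) for mu, sigma in col_stats])
                Y_hat.append(label)
        return X_hat, Y_hat


def replay_training(tasks, train_step, n_per_class=5, seed=0):
    """Train on each task augmented with samples from earlier generators."""
    rng = random.Random(seed)
    generators = []
    for X_T, Y_T in tasks:
        # Augment the current dataset with generated samples of past tasks.
        X_aug, Y_aug = list(X_T), list(Y_T)
        for gen in generators:
            X_hat, Y_hat = gen.sample(n_per_class, rng)
            X_aug += X_hat
            Y_aug += Y_hat
        # In VGER this step would minimise the variational free energy F_T.
        train_step(X_aug, Y_aug)
        # Keep only a generator for this task; the real data is not stored.
        generators.append(ClassGaussianGenerator(X_T, Y_T))
    return generators


# Two toy tasks with two classes each.
tasks = [
    ([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]], [0, 0, 1, 1]),
    ([[5.0, 5.0], [5.1, 4.9], [6.0, 6.2], [6.1, 6.0]], [2, 2, 3, 3]),
]

# Record which labels each training step sees instead of fitting a model.
seen = []
generators = replay_training(
    tasks, lambda X, Y: seen.append(sorted(set(Y))), n_per_class=3
)
```

In this sketch the second training step sees labels from both tasks even though the first task's real data is no longer available, which is the point of replaying generated samples.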