Overcoming catastrophic forgetting in neural networks
Kirkpatrick, James; Pascanu, Razvan; Rabinowitz, Neil; Veness, Joel; Desjardins, Guillaume; Rusu, Andrei A.; Milan, Kieran; Quan, John; Ramalho, Tiago; Grabska-Barwinska, Agnieszka; Hassabis, Demis; Clopath, Claudia; Kumaran, Dharshan; Hadsell, Raia
- 2016 via Local Bibsonomy
Keywords:
deep-learning
This paper proposes a simple method for training on tasks sequentially while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning:
$$
\log P(\theta | D) = \log P(D | \theta) + \log P(\theta) - \log P(D)
$$
Splitting the data into previous tasks $D_{prev}$ and the new task $D_{new}$, the prior can be replaced by the posterior over the previous task(s):
$$
\log P(\theta | D) = \log P(D_{new} | \theta) + \log P(\theta | D_{prev}) - \log P(D_{new})
$$
The paper approximates this posterior with a Gaussian (a Laplace approximation) whose precision is the diagonal of the Fisher Information matrix:
$$
P(\theta | D_{prev}) \approx N(\theta_{prev}, \mathrm{diag}(F)^{-1})
$$
where $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x|\theta) (\nabla_\theta \log P(x|\theta))^T]$. Then the resulting objective function is
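The diagonal of this expectation can be estimated from per-example gradients of the log-likelihood. A minimal numpy sketch, using a toy logistic-regression model (the model, data shapes, and the "empirical Fisher" shortcut of plugging in observed labels are all illustrative assumptions, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diag_fisher(theta, X, y):
    """Empirical diagonal Fisher: the mean of squared per-example
    gradients of the log-likelihood, d log P(y|x, theta) / d theta."""
    p = sigmoid(X @ theta)                    # model probabilities
    per_example_grads = (y - p)[:, None] * X  # gradient for logistic log-lik.
    return np.mean(per_example_grads ** 2, axis=0)

# hypothetical toy data and parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([0.5, -1.0, 0.25])
y = (sigmoid(X @ theta) > 0.5).astype(float)
F = diag_fisher(theta, X, y)  # one nonnegative value per parameter
```

Squaring the gradients instead of forming the full outer product keeps only the diagonal, which is what the quadratic penalty below needs.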
$$
L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum F_{ii} (\theta_i - \theta^{prev*}_i)^2
$$
where $L_{new}$ is the loss on the new task and $\theta^{prev*}$ is the best parameter found on the previous task(s). The penalty can be viewed as a squared distance in which the Fisher Information matrix scales each dimension by how important it is to the previous tasks; the experiments show this scaling matters by comparing against a plain $L_2$ penalty, which performs worse.
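The resulting objective is straightforward to implement once $F$ and $\theta^{prev*}$ are stored. A sketch in numpy, with a stand-in quadratic for the new-task loss and made-up Fisher values (all names and numbers here are hypothetical):

```python
import numpy as np

def ewc_loss(theta, new_task_loss, theta_prev, fisher_diag, lam):
    """New-task loss plus the EWC quadratic penalty, where each
    coordinate's deviation is weighted by its diagonal Fisher value."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_prev) ** 2)
    return new_task_loss(theta) + penalty

# toy usage: one "stiff" parameter (large F) and one "free" one (small F)
theta_prev = np.array([1.0, -0.5])
fisher = np.array([2.0, 0.1])
quad = lambda th: np.sum(th ** 2)  # stand-in for the new-task loss
loss = ewc_loss(np.array([0.0, 0.0]), quad, theta_prev, fisher, lam=1.0)
```

Moving a high-Fisher coordinate away from $\theta^{prev*}$ is expensive, while low-Fisher coordinates stay free to fit the new task; setting `fisher` to all ones recovers the plain $L_2$ baseline the paper compares against.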