Overcoming catastrophic forgetting in neural networks
Kirkpatrick, James; Pascanu, Razvan; Rabinowitz, Neil; Veness, Joel; Desjardins, Guillaume; Rusu, Andrei A.; Milan, Kieran; Quan, John; Ramalho, Tiago; Grabska-Barwinska, Agnieszka; Hassabis, Demis; Clopath, Claudia; Kumaran, Dharshan; Hadsell, Raia
- 2016 via Local Bibsonomy
Keywords:
deep-learning
This paper proposes a simple method for training on tasks sequentially while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning:
$$
\log P(\theta | D) = \log P(D | \theta) + \log P(\theta) - \log P(D)
$$
Splitting the data into previous tasks $D_{prev}$ and the new task $D_{new}$, the prior can be replaced by the posterior over the previous task(s):
$$
\log P(\theta | D) = \log P(D_{new} | \theta) + \log P(\theta | D_{prev}) - \log P(D_{new})
$$
The paper approximates this posterior with a Gaussian (a Laplace approximation) whose precision is the diagonal of the Fisher Information matrix:
$$
P(\theta | D_{prev}) \approx N(\theta_{prev}, \mathrm{diag}(F)^{-1})
$$
where $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x|\theta) (\nabla_\theta \log P(x|\theta))^T]$. Then the resulting objective function is
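The diagonal of this expectation can be estimated from per-example gradients of the log-likelihood. A minimal numpy sketch, using a toy logistic-regression model (the model, data shapes, and the "empirical Fisher" shortcut of plugging in observed labels are all illustrative assumptions, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diag_fisher(theta, X, y):
    """Empirical diagonal Fisher: the mean of squared per-example
    gradients of the log-likelihood, d log P(y|x, theta) / d theta."""
    p = sigmoid(X @ theta)                    # model probabilities
    per_example_grads = (y - p)[:, None] * X  # gradient for logistic log-lik.
    return np.mean(per_example_grads ** 2, axis=0)

# hypothetical toy data and parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta = np.array([0.5, -1.0, 0.25])
y = (sigmoid(X @ theta) > 0.5).astype(float)
F = diag_fisher(theta, X, y)  # one nonnegative value per parameter
```

Squaring the gradients instead of forming the full outer product keeps only the diagonal, which is what the quadratic penalty below needs.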
$$
L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum F_{ii} (\theta_i - \theta^{prev*}_i)^2
$$
where $L_{new}$ is the loss on the new task and $\theta^{prev*}$ is the best parameter found on the previous task(s). The penalty can be viewed as a squared distance in which the Fisher Information matrix scales each dimension by how important it is to the previous tasks; the experiments show this scaling matters by comparing against a plain $L_2$ penalty, which performs worse.
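The resulting objective is straightforward to implement once $F$ and $\theta^{prev*}$ are stored. A sketch in numpy, with a stand-in quadratic for the new-task loss and made-up Fisher values (all names and numbers here are hypothetical):

```python
import numpy as np

def ewc_loss(theta, new_task_loss, theta_prev, fisher_diag, lam):
    """New-task loss plus the EWC quadratic penalty, where each
    coordinate's deviation is weighted by its diagonal Fisher value."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_prev) ** 2)
    return new_task_loss(theta) + penalty

# toy usage: one "stiff" parameter (large F) and one "free" one (small F)
theta_prev = np.array([1.0, -0.5])
fisher = np.array([2.0, 0.1])
quad = lambda th: np.sum(th ** 2)  # stand-in for the new-task loss
loss = ewc_loss(np.array([0.0, 0.0]), quad, theta_prev, fisher, lam=1.0)
```

Moving a high-Fisher coordinate away from $\theta^{prev*}$ is expensive, while low-Fisher coordinates stay free to fit the new task; setting `fisher` to all ones recovers the plain $L_2$ baseline the paper compares against.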