This paper proposes a simple method for training on a sequence of tasks while avoiding catastrophic forgetting. It starts from the Bayesian formulation of learning a model:
$$
\log P(\theta | D) = \log P(D | \theta) + \log P(\theta) - \log P(D)
$$
Splitting the data into the previous task(s) $D_{prev}$ and the new task $D_{new}$ (assumed independent given $\theta$), the prior is replaced by the posterior of the previous task(s):
$$
\log P(\theta | D) = \log P(D_{new} | \theta) + \log P(\theta | D_{prev}) - \log P(D_{new})
$$
The paper uses a Laplace approximation for this posterior: a Gaussian centered at the previous optimum, whose precision is the diagonal of the Fisher Information matrix,
$$
P(\theta | D_{prev}) \approx N(\theta^{prev*}, \mathrm{diag}(F)^{-1})
$$
where $F$ is the Fisher Information matrix $E_x[ \nabla_\theta \log P(x|\theta) (\nabla_\theta \log P(x|\theta))^T]$. Then the resulting objective function is
$$
L(\theta) = L_{new}(\theta) + \frac{\lambda}{2}\sum F_{ii} (\theta_i - \theta^{prev*}_i)^2
$$
where $L_{new}$ is the loss on the new task and $\theta^{prev*}$ is the best parameter found on the previous task(s). The penalty can be viewed as a squared distance that uses the Fisher Information matrix to scale each dimension, so parameters that matter for the old task are anchored more strongly. The experiments show the Fisher weighting is important by comparing against a plain $L_2$ penalty (i.e., $F = I$).
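The two ingredients above — estimating the diagonal Fisher at the previous optimum, then adding the quadratic penalty to the new-task loss — can be sketched as follows. This is a minimal illustration, assuming a toy logistic-regression model $P(y{=}1|x) = \sigma(\theta \cdot x)$; the model, data, and $\lambda$ value are hypothetical stand-ins, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_diagonal(theta, X, y):
    """Empirical diagonal Fisher: average of squared per-sample
    gradients of the log-likelihood, evaluated at theta."""
    p = sigmoid(X @ theta)
    grads = (y - p)[:, None] * X        # per-sample d/dtheta log P(y|x, theta)
    return np.mean(grads ** 2, axis=0)  # keep only the diagonal entries

def ewc_loss(theta, theta_prev, F_diag, loss_new, lam=1.0):
    """New-task loss plus the Fisher-weighted quadratic penalty."""
    penalty = 0.5 * lam * np.sum(F_diag * (theta - theta_prev) ** 2)
    return loss_new(theta) + penalty

# Toy data and a hypothetical "previous best" parameter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_prev = np.array([2.0, -1.0, 0.0])
y = (sigmoid(X @ theta_prev) > 0.5).astype(float)

F = fisher_diagonal(theta_prev, X, y)   # shape (3,), all entries >= 0

# Sanity check with a hand-set Fisher: only the first coordinate is
# protected, so a unit step along it costs more than one along the last.
F_fixed = np.array([1.0, 0.0, 0.0])
zero_loss = lambda th: 0.0              # placeholder new-task loss
cost_protected = ewc_loss(theta_prev + np.array([1.0, 0.0, 0.0]),
                          theta_prev, F_fixed, zero_loss)
cost_free = ewc_loss(theta_prev + np.array([0.0, 0.0, 1.0]),
                     theta_prev, F_fixed, zero_loss)
# cost_protected = 0.5, cost_free = 0.0
```

The key design point is that the penalty is anisotropic: for the same Euclidean step size, directions with large Fisher entries (informative for the old task) are expensive, while directions with near-zero entries stay free for learning the new task.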