The paper: "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask" by Zhou et al., 2019 found that by just learning binary masks one can find random subnetworks that do much better than chance on a task. This new paper builds on this method by proposing a strong algorithm than Zhou et al. for finding these high-performing subnetworks.https://i.imgur.com/vxDqCKP.png
The intuition follows: "If a neural network with random weights (center) is sufficiently overparameterized, it will contain a subnetwork (right) that performs as well as a trained neural network (left) with the same number of parameters."
While Zhou et al. learned a probability for each weight this paper learns a score for each weight and takes the top k percent at evaluation. The scores are learned through their primary contribution that they call the edge-popup algorithm:
"In the edge-popup Algorithm, we associate a score with each edge. On the forward pass we choose the top edges by score. On the backward pass we update the scores of all the edges with the straight-through estimator, allowing helpful edges that are “dead” to re-enter the subnetwork. *We never update the value of any weight in the network, only the score associated with each weight.*"
They're able to find higher-performing random subnetworks than Zhou et al.
First published: 2018/05/16 (3 years ago) Abstract: We introduce a conceptually simple and scalable framework for continual
learning domains where tasks are learned sequentially. Our method is constant
in the number of parameters and is designed to preserve performance on
previously encountered tasks while accelerating learning progress on subsequent
problems. This is achieved by training a network with two components: A
knowledge base, capable of solving previously encountered problems, which is
connected to an active column that is employed to efficiently learn the current
task. After learning a new task, the active column is distilled into the
knowledge base, taking care to protect any previously acquired skills. This
cycle of active learning (progression) followed by consolidation (compression)
requires no architecture growth, no access to or storing of previous data or
tasks, and no task-specific parameters. We demonstrate the progress & compress
approach on sequential classification of handwritten alphabets as well as two
reinforcement learning domains: Atari games and 3D maze navigation.
Proposes a two-stage approach for continual learning. An active learning phase and a consolidation phase. The active learning stage optimizes for a specific task that is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phases uses a separate network than the knowledge base, but is not always trained from scratch - authors suggest a heuristic based on task-similarity. Improves EWC by deriving a new online method so parameters don’t increase linearly with the number of tasks.
Desiderata for a continual learning solution:
- A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks.
- It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance.
- It should be scalable, that is, the method should be trainable on a large number of tasks.
- It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant.
- Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries.
- Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset.
- Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pac-man”).
- 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017).