The authors propose an algorithm for meta-learning that is compatible with any model trained with gradient descent, and show that it works on various domains including supervised learning and reinforcement learning. This is done by explicitly training the network such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task.
### Key Points
- MAML is actually finding a good **initialization** of model parameters for several tasks.
- A good initialization of parameters means the model can achieve good performance on several tasks after only a small number of gradient steps.
### Method
- Simultaneously optimize the **initialization** of model parameters across the meta-training tasks, so that the model can quickly adapt to new meta-testing tasks (the objective is written out below).
![](https://cloud.githubusercontent.com/assets/7057863/25161911/46f2721e-24f1-11e7-9fba-8bc2f0782204.png)
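Concretely, for each meta-training task $\mathcal{T}_i$ the adapted parameters come from one (or a few) inner gradient steps, and the initialization $\theta$ is optimized against the post-adaptation losses (this matches Section 2.2 of the paper):

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad \min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$$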
- Training procedure:
![](https://cloud.githubusercontent.com/assets/7057863/25161749/8d00902a-24f0-11e7-93a8-6a9b74386f55.png)
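Below is a minimal PyTorch sketch of this training procedure on the paper's sine-wave regression task. The function names and hyperparameters (`alpha`, `meta_batch`, step counts) are illustrative, not the authors' code (the official implementation is in TensorFlow):

```python
# Minimal MAML sketch on sine regression. Illustrative, not the official code.
import math
import torch
import torch.nn.functional as F

def sample_task():
    """One task: y = A * sin(x + b), with amplitude A and phase b varied across tasks."""
    A = float(torch.empty(1).uniform_(0.1, 5.0))
    b = float(torch.empty(1).uniform_(0.0, math.pi))
    def data(k):  # draw k labelled points from this task
        x = torch.empty(k, 1).uniform_(-5.0, 5.0)
        return x, A * torch.sin(x + b)
    return data

def forward(params, x):
    """Tiny functional MLP, so it can be evaluated at adapted parameters."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

params = [p.requires_grad_() for p in
          (0.1 * torch.randn(1, 40), torch.zeros(40),
           0.1 * torch.randn(40, 1), torch.zeros(1))]
alpha, meta_batch = 0.01, 25                     # inner-loop lr, tasks per meta-step
meta_opt = torch.optim.Adam(params, lr=1e-3)     # outer-loop optimizer

for _ in range(10000):
    meta_loss = 0.0
    for _ in range(meta_batch):
        data = sample_task()
        x_s, y_s = data(10)                      # K support points for adaptation
        loss = F.mse_loss(forward(params, x_s), y_s)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]   # inner step
        x_q, y_q = data(10)                      # held-out batch from the SAME task
        meta_loss = meta_loss + F.mse_loss(forward(adapted, x_q), y_q)
    meta_opt.zero_grad()
    (meta_loss / meta_batch).backward()          # differentiates through the inner step
    meta_opt.step()
```

Note that `create_graph=True` is what makes the outer `backward()` differentiate through the inner gradient step, which is where the second derivatives discussed below come from.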
### Experiments
- It achieved performance comparable to the state of the art on classification, regression, and reinforcement-learning tasks.
### Thoughts
I think the experiments are thorough, since they show that this technique can be applied to both supervised and reinforcement learning. However, the method is not entirely novel, given that [Optimization as a Model for Few-Shot Learning](https://openreview.net/pdf?id=rJY0-Kcll) already proposed learning an initialization of parameters.
## TL;DR
The paper presents a model-agnostic strategy for few-shot learning that takes advantage of prior knowledge acquired during multitask learning. Such prior knowledge consists of priors over general model parameters (e.g. weights or hyperparameters) acquired by the Model-Agnostic Meta-Learning (MAML) algorithm. The strategy can be applied to any model trained with gradient descent (not only neural networks), making it more general, and perhaps more effective, than transfer learning. It can loosely be referred to as "learning to learn".
## Why this is interesting
* Suitable in combination with any technique that uses gradient descent (supervised learning, reinforcement learning)
* Interesting idea: instead of further optimizing existing models for performance, search for a representation that can subsequently be fine-tuned
* When only few, diverse data points are available, multiple tasks can be defined to harness the meta-model's ability to learn while preserving generalization (see Experiments)
## Details
The key idea is to perform the meta-learner update on a different data batch than the one used for the parameter update(s) of a single task (the support/query split visible in the training sketch above). This leads (formally) to the same update procedure for both the learning and meta-learning phases of the algorithm (see the figure below) and provides a general framework for MAML.
![](https://i.imgur.com/xq1wCai.png)
image from [lilianweng post](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html)
As [clearly worked out here](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html), the method requires computing second derivatives for the outer-loop update. Surprisingly enough, omitting them and performing first-order MAML does not noticeably affect the results in the reported experiments. It is hypothesized that this is because ReLU networks are almost locally linear, so the effect likely depends on the actual network architecture.
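The first-order variant amounts to treating the inner-loop gradient as a constant, so no second derivatives flow through the meta-update. In the training sketch above this is a two-line change (again illustrative, reusing the names `loss`, `params`, and `alpha` from that sketch):

```python
# First-order MAML: without create_graph, the inner gradients carry no
# autograd graph, so the outer backward() sees the inner step as a constant
# shift and only computes the first-order meta-gradient.
grads = torch.autograd.grad(loss, params)        # no create_graph=True
adapted = [p - alpha * g for p, g in zip(params, grads)]
```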
## Experiments
### Supervised learning
1. Regression from input to output of a sine wave, with amplitude and phase varied among tasks. MAML leads to good results and generalizes better than fine-tuning under the experimental conditions ("due to the often contradictory outputs on pre-training tasks"; see Figure 2 in the paper). A sketch of meta-test adaptation on this task follows this list.
2. Few-shot image classification on the Omniglot and MiniImageNet datasets (N-way classification of unseen classes, with K instances per class). MAML matches SOTA performance on the first dataset and beats SOTA on the second, where first-order MAML is also tested.
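For the regression experiment, meta-testing is just a handful of plain gradient steps on K points from a new task, starting from the meta-learned initialization. A sketch reusing the names from the training snippet above (the step count is illustrative):

```python
# Meta-test adaptation: a few gradient steps on K samples of a new task,
# starting from the meta-learned initialization `params`.
data = sample_task()                        # unseen amplitude/phase
x_s, y_s = data(10)                         # K-shot support set
theta = [p.detach().clone().requires_grad_() for p in params]
for _ in range(5):                          # a few adaptation steps
    loss = F.mse_loss(forward(theta, x_s), y_s)
    grads = torch.autograd.grad(loss, theta)
    theta = [(t - alpha * g).detach().requires_grad_()
             for t, g in zip(theta, grads)]

# Evaluate the adapted model on fresh points from the same task.
x_q, y_q = data(100)
print(F.mse_loss(forward(theta, x_q), y_q).item())
```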
### Reinforcement learning
1. 2D navigation: a point agent must move to different goal positions (one per task). The model trained with MAML performs better for the same number of gradient steps (Figure 4)
2. Locomotion: two simulated robots are given a set of locomotion tasks. MAML learns a model that adapts much faster to new tasks (a case where standard pretraining is actually detrimental)
## Related work and resources
* official [GitHub repo](https://github.com/cbfinn/maml)
* [videos](https://sites.google.com/view/maml) of the learned policies in MAML paper
* paper appendix: the part on the multi-task baseline is interesting
* [How to train your MAML](https://arxiv.org/abs/1810.09502): discusses various modifications to MAML that stabilize training and improve performance
* [Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML](https://arxiv.org/abs/1909.09157)