TL;DR: The authors show that we can distill the knowledge of a complex ensemble of models into a smaller model by letting the smaller model learn directly from the "soft targets" (softmax output with high temperature) of the ensemble. Intuitively, this works because the errors in probability assignment (e.g. assigning 0.1% to the wrong class) carry a lot of information about what the network has learned. Learning directly from logits (unnormalized scores), as was done in a previous paper, is a special case of the distillation approach. The authors show how distillation works on MNIST and on an ASR dataset.
#### Key Points
- Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice.
- Use softmax with temperature; values from 1 to 10 seem to work well, depending on the problem.
- The MNIST student network learns to recognize digits it has never seen during transfer, solely based on the "errors" that the teacher network makes. (The bias needs to be adjusted.)
- Training on soft targets with less data performs much better than training on hard targets with the same amount of data.
- Breaking up the complex models into specialists didn't really fit into this paper without distilling those experts into one model. Also would've liked to see training of only specialists (without general network) and then distill their knowledge.
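
To make the temperature point above concrete, here is a minimal sketch of a softmax with temperature in plain Python. The function name and the logit values are illustrative assumptions, not taken from the paper; the only thing the sketch relies on is that each logit is divided by the temperature before the softmax is applied.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes (illustrative values only).
logits = [5.0, 2.0, -1.0]

sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=5.0)

print("T=1:", [round(p, 4) for p in sharp])
print("T=5:", [round(p, 4) for p in soft])
```

At the higher temperature the ranking of the classes is unchanged, but the small probabilities on the "wrong" classes grow, which is the signal the student learns from.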
#### Problem addressed:
Traditional classifiers are trained using hard targets. This not only calls for learning a very complex function (due to spikes) but also ignores the relative similarity between classes, e.g., a truck is more likely to be misclassified as a car than as a cat. With hard targets, the classifier is instead forced to assign the same target value to both car and cat. This leads to poor generalization, which is the problem this paper addresses.
To address these problems, the paper proposes generating soft labels for each sample. It first trains a cumbersome (large, complex) classifier, such as a large network regularized with dropout, and evaluates it at a high "temperature" so that it outputs soft probabilities representing each sample's membership in every class. It then trains a vanilla NN on these soft labels at the same high temperature, using either the original training data or a separate transfer set; after training, the student is used at temperature 1. As a result, the simpler (student) model performs similarly to the complex (teacher) model.
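
The training objective described above can be sketched for a single example as follows. This is a simplified illustration, not the authors' code: the names `distillation_loss` and `alpha` and the default values are assumptions. It combines a cross-entropy against the teacher's softened outputs with a cross-entropy against the true label, and scales the soft term by T squared so that its gradient magnitude stays comparable when the temperature changes.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T (T=1 is the ordinary softmax)."""
    scaled = [z / T for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.5):
    """Weighted sum of a soft-target loss (at temperature T) and a hard-label loss (at T=1).

    `alpha` is a hypothetical name for the weight on the soft-target term.
    """
    soft_targets = softmax(teacher_logits, T)   # teacher's soft labels
    student_soft = softmax(student_logits, T)   # student at the same temperature
    soft_loss = cross_entropy(soft_targets, student_soft)

    student_hard = softmax(student_logits, 1.0)  # ordinary prediction
    hard_loss = -math.log(student_hard[true_label])

    return alpha * (T ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# Hypothetical logits for one sample with three classes; class 0 is the true label.
teacher = [5.0, 2.0, -1.0]
matched = distillation_loss([5.0, 2.0, -1.0], teacher, true_label=0)
mismatched = distillation_loss([-1.0, 2.0, 5.0], teacher, true_label=0)
print(matched, mismatched)
```

A student whose logits match the teacher's gets a lower loss than one that ranks the classes in the opposite order. When no labels are available for the transfer set, setting `alpha=1.0` drops the hard-label term entirely.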
The novelty is a technique for generating soft class labels to train a much simpler classifier than the large, complex models currently in use, such as dropout networks or conv-nets.
I believe a major drawback of this paper is that it entails learning a complex classifier for generating soft labels. Another drawback is that it is incapable of using unlabeled data.
Datasets used: MNIST and JFT (an internal Google image dataset).
#### Additional remarks: