Born Again Neural Networks
Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, Anima Anandkumar
arXiv e-Print archive, 2018
Keywords:
stat.ML, cs.AI, cs.LG
First published: 2018/05/12
Abstract: Knowledge distillation (KD) consists of transferring knowledge from one
machine learning model (the teacher) to another (the student). Commonly, the
teacher is a high-capacity model with formidable performance, while the student
is more compact. By transferring knowledge, one hopes to benefit from the
student's compactness: we desire a compact model with performance close to the
teacher's. We study KD from a new perspective: rather than compressing models,
we train students parameterized identically to their teachers. Surprisingly,
these Born-Again Networks (BANs) outperform their teachers significantly,
both on computer vision and language modeling tasks. Our experiments with BANs
based on DenseNets achieve state-of-the-art validation error on CIFAR-10
(3.5%) and CIFAR-100 (15.5%). Additional
experiments explore two distillation objectives: (i) Confidence-Weighted by
Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP).
Both methods elucidate the essential components of KD, demonstrating the role of
the teacher outputs on both predicted and non-predicted classes. We present
experiments with students of various capacities, focusing on the under-explored
case where students overpower teachers. Our experiments show significant
advantages from transferring knowledge between DenseNets and ResNets in either
direction.
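As a concrete illustration of the setup described in the abstract, here is a minimal sketch (not the authors' released code) of a born-again training step in PyTorch: the student shares the teacher's architecture and is trained against both the ground-truth labels and the teacher's soft predictions. The temperature T and mixing weight alpha are illustrative assumptions, not values reported in the paper.

import torch
import torch.nn.functional as F

def born_again_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    # Standard supervised term: cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Distillation term: KL divergence between the student's and teacher's
    # softened output distributions (the teacher's "dark knowledge").
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # alpha balances the two terms; both hyperparameters are illustrative.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

# Usage inside a training loop, with the teacher frozen and the student
# parameterized identically to it:
#   with torch.no_grad():
#       teacher_logits = teacher(x)
#   loss = born_again_loss(student(x), teacher_logits, y)
#   loss.backward()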