On the importance of single directions for generalization
Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz and Matthew Botvinick
arXiv e-Print archive, 2018
Keywords:
stat.ML, cs.AI, cs.LG, cs.NE
First published: 2018/03/19
Abstract: Despite their ability to memorize large datasets, deep neural networks often
achieve good generalization performance. However, the differences between the
learned solutions of networks which generalize and those which do not remain
unclear. Additionally, the tuning properties of single directions (defined as
the activation of a single unit or some linear combination of units in response
to some input) have been highlighted, but their importance has not been
evaluated. Here, we connect these lines of inquiry to demonstrate that a
network's reliance on single directions is a good predictor of its
generalization performance, across networks trained on datasets with different
fractions of corrupted labels, across ensembles of networks trained on datasets
with unmodified labels, across different hyperparameters, and over the course
of training. While dropout only regularizes this quantity up to a point, batch
normalization implicitly discourages single direction reliance, in part by
decreasing the class selectivity of individual units. Finally, we find that
class selectivity is a poor predictor of task importance, suggesting not only
that networks which generalize well minimize their dependence on individual
units by reducing their selectivity, but also that individually selective units
may not be necessary for strong network performance.
Morcos et al. study the effect of ablating single units as a proxy for generalization performance. On CIFAR-10, for example, an 11-layer convolutional network is trained on the clean dataset as well as on versions of CIFAR-10 in which a fraction $p$ of the samples have corrupted labels. In the latter case, the network is forced to memorize examples, as there is no inherent structure in the label assignment. The authors then show experimentally that these memorizing networks are less robust to ablation, i.e., to setting whole feature maps to zero; see Figure 1 and the sketch below it. Based on this result, they argue that the area under the ablation curve (AUC) can be used as a proxy for generalization performance, e.g., for early stopping or hyper-parameter selection. Furthermore, they show that batch normalization discourages networks from relying on these so-called single directions, i.e., single units or feature maps. Specifically, batch normalization seems to favor units holding information about multiple classes/concepts (a sketch of the corresponding selectivity index also follows below).
https://i.imgur.com/h2JwLUF.png
Figure 1: Classification accuracy (y-axis) over the number of ablated units (x-axis) for networks trained on CIFAR-10 with various fractions of corrupted labels. The same experiment is shown (left and right) for MNIST and ImageNet.
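For concreteness, the following is a minimal sketch of such an ablation analysis in PyTorch, not the authors' code; `model`, `layer` (the module whose feature maps are ablated) and `test_loader` are hypothetical placeholders:

```python
# Minimal ablation-curve sketch (hypothetical model/layer/loader).
# Feature maps are clamped to zero via a forward hook, and test
# accuracy is recorded after each additional ablation.
import numpy as np
import torch


@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Top-1 accuracy of `model` on `loader`."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1)
        correct += (pred == y.to(device)).sum().item()
        total += y.numel()
    return correct / total


def ablation_curve(model, layer, n_units, loader, seed=0):
    """Accuracy vs. number of ablated feature maps in `layer`.

    Units are ablated cumulatively in one random order; averaging over
    several orders (seeds) gives a smoother curve.
    """
    order = np.random.default_rng(seed).permutation(n_units)
    ablated = []

    def zero_units(module, inputs, output):
        if ablated:
            output[:, ablated] = 0.0  # zero whole feature maps in place
        return output

    handle = layer.register_forward_hook(zero_units)
    curve = [accuracy(model, loader)]          # 0 units ablated
    for unit in order:
        ablated.append(int(unit))
        curve.append(accuracy(model, loader))  # len(ablated) units ablated
    handle.remove()
    return curve

# The area under this curve (x-axis normalized to [0, 1]) is the AUC
# proxy for generalization, e.g. np.trapz(curve, dx=1.0 / n_units).
```

The resulting AUC can then be compared across checkpoints or hyper-parameter settings, which is how the summary above suggests using it for early stopping or model selection.

The class selectivity mentioned above is quantified in the paper as $(\mu_{\max} - \overline{\mu}_{-\max}) / (\mu_{\max} + \overline{\mu}_{-\max})$, where $\mu_{\max}$ is a unit's largest class-conditional mean activation and $\overline{\mu}_{-\max}$ is its mean activation over all remaining classes. A sketch, assuming non-negative (e.g., post-ReLU) activations and that all classes occur in the sample:

```python
# Class selectivity index sketch; array shapes and names are assumptions.
import numpy as np


def class_selectivity(activations, labels, n_classes, eps=1e-12):
    """Per-unit selectivity index.

    activations: (n_samples, n_units) non-negative unit activities,
                 e.g. feature maps averaged over spatial locations
    labels:      (n_samples,) integer class labels
    Returns an (n_units,) array: 0 = class-agnostic, 1 = single-class.
    """
    class_means = np.stack(
        [activations[labels == c].mean(axis=0) for c in range(n_classes)]
    )                                     # (n_classes, n_units)
    mu_max = class_means.max(axis=0)      # best class per unit
    mu_rest = (class_means.sum(axis=0) - mu_max) / (n_classes - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + eps)
```

Under this measure, batch-normalized networks tend toward lower per-unit selectivity, consistent with units holding information about multiple classes rather than single concepts.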
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).