Summary by CodyWild 4 years ago
This is another paper that was a bit of a personal-growth test for me to parse, since it's definitely heavier on analytical theory than I'm used to. But I think I've been able to get something from it, even if I'll be the first to say I didn't understand it entirely.
The question of this paper is: why does training a neural network on a data distribution - but with the supervised labels randomly sampled - seem to afford some advantage when those randomly-trained networks are later fine-tuned with the correct labels? What is it that these networks learn from random labels that gives them a head start on future training?
To try to answer this, the authors focus on analyzing the first-layer weights of a network, and frame both the input data and the learned weights (after random-label training) as random variables, each with some mean and covariance matrix. The central argument made by the paper is:
After training with random labels, the weights come to have a distributional form whose covariance matrix is "aligned" with the covariance matrix of the data. "Aligned" here means alignment on the level of eigenvectors. Formally, it is defined as a situation where every eigenspace of the data covariance matrix is contained in (i.e. is a subspace of) an eigenspace of the weight covariance matrix. Intuitively, it means that the principal components (the axes that define the principal directions of variation) of the weight space are aligned, in a linear algebra sense, with those of the data, up to a scaling factor (that is, the eigenvalues may differ between the two).
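To make that definition concrete for myself, here's a minimal numpy sketch (my own toy construction, not code from the paper) of two covariance matrices that are "aligned" in this sense: they share an eigenbasis, but have different eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Shared orthonormal eigenbasis Q (columns are eigenvectors).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# "Data" covariance with some spectrum of eigenvalues.
data_eigvals = np.linspace(2.0, 0.1, d)
sigma_data = Q @ np.diag(data_eigvals) @ Q.T

# "Aligned" weight covariance: same eigenvectors, eigenvalues rescaled by an
# (arbitrary, for illustration) function of the data eigenvalues.
sigma_weights = Q @ np.diag(np.sqrt(data_eigvals)) @ Q.T

# Check: the absolute overlap between the two eigenbases should look like a
# permutation matrix (entries near 0 or 1), i.e. every data eigenvector sits
# inside some eigenspace of the weight covariance.
_, V_data = np.linalg.eigh(sigma_data)
_, V_weights = np.linalg.eigh(sigma_weights)
print(np.round(np.abs(V_data.T @ V_weights), 2))
```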
They do show some empirical evidence of this being the case, by calculating the actual covariance matrices of both the data and the learned weights, and showing a high degree of similarity between the eigenvector spaces of the two (though sometimes eigenspaces of the data have to be added together to be equivalent to a single eigenspace of the weights).
https://i.imgur.com/TB5JM6z.png
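If I wanted to reproduce this kind of check myself, I'd do something like the following sketch, based on my own reading of their setup (assuming the rows of the first-layer weight matrix are treated as samples of the weight "random variable"; the function name is my own):

```python
import numpy as np

def eigenbasis_overlap(X, W):
    """Compare the eigenbasis of the data covariance with that of the
    first-layer weight covariance.

    X: (n_samples, d) input data.
    W: (n_hidden, d) first-layer weights; each row is one hidden unit's
       incoming weight vector.
    Returns a (d, d) matrix of absolute cosines between eigenvectors. For
    aligned covariances this should have block-permutation structure, where a
    block corresponds to several data eigenspaces merging into one weight
    eigenspace.
    """
    sigma_data = np.cov(X, rowvar=False)
    sigma_weights = np.cov(W, rowvar=False)
    _, V_data = np.linalg.eigh(sigma_data)
    _, V_weights = np.linalg.eigh(sigma_weights)
    return np.abs(V_data.T @ V_weights)
```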
They also show some indication that this property drives the advantage in fine-tuning. They do this by taking their analytical model of what they believe is happening during training - that the weights come to be drawn from a distribution whose covariance matrix is aligned with the data covariance matrix - and sampling weights from a normal distribution with exactly that property. They show, in the plot below, that this accounts for most of the advantage previously observed when fine-tuning after training on random labels (other than the previously-discovered effect that "all training increases the scale of the weights, which helps in future training," which they control for by normalizing).
https://i.imgur.com/cnT27HI.png
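As I understand it, the experiment is roughly: build a Gaussian whose covariance shares its eigenvectors with the data covariance, sample first-layer weights from it, and fine-tune from there. A minimal sketch of that sampling step (the eigenvalue map and function name are my own placeholders, not the paper's):

```python
import numpy as np

def sample_aligned_weights(X, n_hidden, eigval_fn=np.sqrt, rng=None):
    """Draw first-layer weights from a zero-mean Gaussian whose covariance
    shares its eigenvectors with the data covariance; the eigenvalues are set
    by some function of the data eigenvalues (np.sqrt here is a placeholder)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_data = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(sigma_data)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    aligned_cov = eigvecs @ np.diag(eigval_fn(eigvals)) @ eigvecs.T
    # Each row is one hidden unit's incoming weight vector.
    return rng.multivariate_normal(np.zeros(X.shape[1]), aligned_cov, size=n_hidden)
```

These sampled weights would then initialize the first layer before training on the correct labels (after normalizing for overall weight scale, per the effect mentioned above).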
Unfortunately, beyond this central intuition of covariance matrix alignment, I wasn't able to get much else from the paper. Some other things they mentioned that I didn't follow were:
- The actual proof of why you'd expect this alignment property to arise in the case of random-label training
- An analysis of the way that "the first layer [of a network] effectively learns a function which maps each eigenvalue of the data covariance matrix to the corresponding eigenvalue of the weight covariance matrix." I understand that their notion of alignment predicts that there should be some relationship between these two eigenvalues, but I don't fully follow which part of the first layer of a neural network will *learn* that function, or produce it as an output
- An analysis of how this framework explains both the cases where you get positive transfer from random-label training (i.e. fine-tuned networks training better subsequently) and the cases where you get negative transfer