Summary by CodyWild 4 years ago
This is a really cool paper that posits a relatively simple explanation for the strange phenomenon known as double descent - both the fact of seeing it in the first place, and the difficulty of robustly making it appear. In the classical wisdom of statistics, increasing model complexity too far leads to an increase in variance, and thus an increase in test error (or "test risk"), producing a U-shaped test error curve as a function of model complexity. Double descent is the name given to the observation that, in modern neural networks, we often don't see this U-shape, and, in fact, sometimes see test error first increase but then descend again below its initial minimum. Test error going up, and then back down again: double descent. However, this phenomenon proved to be a bit elusive: often, in order to see it, you had to add artificial noise to your labels.
This paper provides a cohesive theory for both the existence of double descent and the fact that it sometimes can only be elicited with added label noise. The authors empirically estimate the bias and variance components of test error for a range of neural nets on a range of datasets, and show that, estimated this way, bias decreases (or, at least, does not increase) monotonically with model complexity, as expected. Variance, however, rather than strictly increasing with model complexity, is unimodal: it first increases and then decreases as a function of model complexity.
Taking a step back: bias is here understood as the component of your test error that comes from the difference between your expected learned estimator and the true underlying function. Variance is the expected squared difference between that expected estimator (that is, the one you get if you average over estimators trained on different splits of the data) and the estimator learned on any particular split. The actual estimator you end up with is a function of both the average estimator and the particular draw you get from the distribution around that average, whose spread is the variance. The authors measure these quantities empirically by conducting k different N-way splits of their datasets, training an estimator on each of the resulting k*N subsets, and averaging their predictions to approximate the expected estimator. Given that, they can (as shown below) take the variance to be the mean squared difference between the k*N individual estimators' predictions and that average. Since squared-error test error decomposes into bias plus variance, bias can then be calculated as whatever remains of test error once variance has been accounted for.
https://i.imgur.com/VPzujaZ.png
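To make the estimation procedure concrete, here is a minimal sketch of how one might compute these quantities, assuming squared error and a generic `train_model(X, y)` function that returns an object with a `predict(X)` method; the function names, split scheme, and default values are illustrative, not the authors' actual code.

```python
import numpy as np

def estimate_bias_variance(X, y, X_test, y_test, train_model, k=3, n_splits=2, seed=0):
    """Estimate bias^2 and variance of a learner on a held-out test set.

    Performs k rounds of random n_splits-way partitions of the training data,
    trains one model per partition piece, and decomposes mean squared test
    error into variance (spread of individual models around their average
    prediction) and bias^2 (whatever error remains).
    """
    rng = np.random.default_rng(seed)
    preds = []  # predictions of each individually trained estimator on X_test

    for _ in range(k):
        idx = rng.permutation(len(X))
        for part in np.array_split(idx, n_splits):
            model = train_model(X[part], y[part])
            preds.append(model.predict(X_test))

    preds = np.stack(preds)            # shape: (k * n_splits, n_test)
    avg_pred = preds.mean(axis=0)      # prediction of the "expected" estimator

    # Variance: mean squared deviation of the individual estimators from their average.
    variance = np.mean((preds - avg_pred) ** 2)

    # Total squared test error, averaged over the k * n_splits estimators.
    total_error = np.mean((preds - y_test) ** 2)

    # Bias^2 is what remains of test error once variance is accounted for.
    bias_sq = total_error - variance

    return bias_sq, variance
```

For squared error this residual is exactly the squared gap between the average prediction and the targets, so computing bias as "test error minus variance" and computing it directly give the same number.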
This provides an elegant explanation for the different relationships we see between complexity and test error. In regimes where the decrease in bias from additional complexity is much larger than the increase in variance - which the authors argue is the case in modern deep networks - we don't see double descent, because the "bump" due to the variance peak is swamped by the continuing decrease in bias. However, in regimes where the overall scale of variance (at all levels of complexity) is higher, the increasing variance overwhelms the decreasing bias, and test error goes up (before ultimately coming back down once the variance has peaked). This explains why double descent has previously appeared preferentially in cases of injected label noise: more label noise means more irreducible variability in the models learned from different sets of data, which makes the variance peak more pronounced relative to the bias drop. In addition to their empirical work, the authors also analyze a two-layer linear neural network analytically, and show that a peaked variance curve is theoretically expected in that setting.
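To see how a fixed, monotonically decreasing bias curve plus a unimodal variance curve of varying height can produce either a plain descent or a double-descent shape, here is a small toy sketch; the particular curve shapes and constants are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Toy illustration (not from the paper): bias falls monotonically with
# complexity, while variance is unimodal with a peak whose height we vary.
complexity = np.linspace(1, 100, 400)
bias = 1.0 / complexity                                     # monotonically decreasing
variance_shape = np.exp(-((complexity - 30.0) ** 2) / 200)  # unimodal, peaks near 30

for variance_scale in (0.01, 0.5):   # small peak vs. large peak (e.g. heavy label noise)
    test_error = bias + variance_scale * variance_shape
    # Since bias only ever decreases here, any stretch where test error rises
    # (before falling again past the variance peak) is the double-descent bump.
    has_bump = np.any(np.diff(test_error) > 0)
    print(f"variance peak scale {variance_scale}: bump in test error? {has_bump}")
```

With the small variance peak the combined curve just keeps falling; with the large peak the very same bias curve produces the characteristic rise-then-fall.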
In a certain sense, this just pushes the problem down the road, since the paper doesn't explain why, in any conceptual or statistical sense, we would expect variance to be unimodal in this way. (The authors do offer a conjecture, but it wasn't the main thrust of the paper, and I didn't fully follow it.) Still, it brings conceptual clarity to a previously murky empirical phenomenon, and will hopefully let us focus on understanding why variance behaves this way.