Dinh et al. argue that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, based both on local curvature and on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on flatness: parameterizations that yield the same prediction function (i.e., that are indistinguishable based on their test performance) can exhibit widely varying flatness around the obtained minima, as illustrated in Figure 1. For ReLU networks, this follows from the non-negative homogeneity of the ReLU: scaling one layer’s weights by α > 0 and the next layer’s weights by 1/α leaves the prediction function unchanged, yet can make the curvature around the minimum arbitrarily large. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones.
https://i.imgur.com/gHfolEV.jpg
Figure 1: Illustration of the influence of parameterization on the flatness of the obtained minima.
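The α-scaling argument is easy to check numerically. Below is a minimal sketch for a one-hidden-layer ReLU network; the toy data, the network sizes, and the random-perturbation sharpness proxy are illustrative assumptions and only loosely approximate the ε-sharpness measure discussed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
X = rng.normal(size=(5, 3))    # 5 toy inputs of dimension 3 (assumed data)
W1 = rng.normal(size=(4, 3))   # first-layer weights, treated as a minimum
W2 = rng.normal(size=(1, 4))   # second-layer weights

def predict(W1, W2, X):
    return np.maximum(W1 @ X.T, 0.0).T @ W2.T

def sharpness(W1, W2, X, y, eps=1e-3, trials=200):
    """Crude sharpness proxy: largest observed increase in squared loss
    under random parameter perturbations of norm eps (not the paper's
    exact epsilon-sharpness measure)."""
    def loss(a, b):
        return np.mean((predict(a, b, X) - y) ** 2)
    base = loss(W1, W2)
    worst = 0.0
    for _ in range(trials):
        d1 = rng.normal(size=W1.shape)
        d2 = rng.normal(size=W2.shape)
        scale = eps / np.sqrt(np.sum(d1 ** 2) + np.sum(d2 ** 2))
        worst = max(worst, loss(W1 + scale * d1, W2 + scale * d2) - base)
    return worst

y = predict(W1, W2, X)          # treat the current parameters as a perfect fit

alpha = 100.0                   # alpha-scaling: same function, different parameters
W1s, W2s = alpha * W1, W2 / alpha

print(np.allclose(predict(W1, W2, X), predict(W1s, W2s, X)))  # True: identical predictions
print(sharpness(W1, W2, X, y), sharpness(W1s, W2s, X, y))     # sharpness proxies differ
```

Running this, the original and rescaled parameters produce identical predictions, while the sharpness proxy at the rescaled minimum is markedly larger, because perturbations of the down-scaled second layer are amplified by the up-scaled first layer.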
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).