Summary by Csaba Szepesvari
__Edit__: The paper has a newer version at https://openreview.net/pdf?id=B1YfAfcgl, where I also copied this review over with some changes.
__Note__: An earlier version of the review (almost identical to the present one) for an earlier version of the paper (available on arXiv) can be found here: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.01838#csaba
The only change concerns the relation to previous work.
__Problem__: The problem considered is to derive an improved version of SGD for training neural networks (or, more generally, for minimizing an empirical loss) by modifying the loss optimized so that the solution found is more likely to end up in the vicinity of a minimum where the loss changes slowly ("flat minima", as in the 1997 paper of Hochreiter and Schmidhuber).
__Motivation__: It is hypothesized that flat minima "generalize" better.
__Algorithmic approach__: Let $f$ be the (empirical) loss to be minimized. Modify this to $$\overline{f}(x) = \rho f(x) - \log \int \exp\left(-\frac{\gamma}{2}\|z\|^2-f(x+z)\right)\,dz$$ with tunable parameters $\rho,\gamma>0$. For $\|z\|^2\gg \frac{1}{\gamma}$, the factor $\exp(-\frac{\gamma}{2}\|z\|^2)$ becomes very small, so, up to an additive constant, the second term is effectively the negative logarithm of the integral of $\exp(-f)$ over a ball centered at $x$ with radius $\propto \gamma^{-1/2}$. This is a smoothed version of $f$, hence one expects that by making this term more important than the first, a procedure minimizing $\overline{f}$ will be more likely to end up at a flat minimum of $f$. Since the gradient of $\overline{f}$ involves an expectation that cannot be computed in closed form, an MCMC algorithm is used to estimate it ("stochastic gradient Langevin dynamics" from Welling and Teh, 2011).
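To make the estimation step concrete, here is a minimal numpy sketch (my own illustration, not the authors' code; the toy loss, step sizes, and chain length are placeholder choices). It uses the identity $\nabla \overline{f}(x)=\rho\nabla f(x)+\mathbb{E}_{z\sim p_x}[\nabla f(x+z)]$ with $p_x(z)\propto\exp(-\frac{\gamma}{2}\|z\|^2-f(x+z))$, and approximates the expectation by averaging over a Langevin (SGLD-style) chain; in the paper's setting the inner gradients would be minibatch gradients of the network loss, typically with burn-in or exponential averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Toy loss standing in for the empirical loss of a network.
    return 0.5 * np.sum(x ** 2) + np.sum(np.cos(3.0 * x))

def grad_f(x):
    # Gradient of the toy loss above (stand-in for a minibatch gradient).
    return x - 3.0 * np.sin(3.0 * x)

def grad_fbar(x, rho=0.1, gamma=1.0, n_steps=50, eta=1e-2):
    """Estimate grad fbar(x) = rho*grad f(x) + E_{z ~ p_x}[grad f(x+z)],
    where p_x(z) is proportional to exp(-gamma/2*||z||^2 - f(x+z)).
    The expectation is approximated by averaging over a Langevin chain."""
    z = np.zeros_like(x)
    g_avg = np.zeros_like(x)
    for t in range(n_steps):
        # Gradient of the potential U(z) = gamma/2*||z||^2 + f(x+z).
        grad_U = gamma * z + grad_f(x + z)
        # Langevin update: gradient step on U plus Gaussian noise.
        z = z - eta * grad_U + np.sqrt(2.0 * eta) * rng.standard_normal(z.shape)
        # Running average of grad f(x+z) along the chain.
        g_avg += (grad_f(x + z) - g_avg) / (t + 1)
    return rho * grad_f(x) + g_avg

# One SGD-like step on the modified loss fbar.
x = rng.standard_normal(10)
x = x - 0.1 * grad_fbar(x)
```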
__Results__: There is a theoretical result that quantifies the increased smoothness of $\overline{f}$; this is connected to stability, and through that to generalization, by citing a result of Hardt et al. (2015). Empirical results show better validation error on two datasets, MNIST and CIFAR-10 (the respective networks are LeNet and All-CNN-C). The improvement is in terms of reaching the same validation error as with the "original SGD" but with fewer "epochs".
__Soundness, significance__: The proof of the __theoretical result__ relies on an arbitrary assumption that there exists some $c>0$ such that no eigenvalue of the Hessian of $f$ lies in the set $[-2\gamma-c,c]$ (the assumption is needed because otherwise a uniform improvement cannot be shown). For $\rho=0$ the improvement in smoothness (first and second order) is a factor of $1/(1+c/\gamma)$. The proof uses Laplace's method and is more a sketch than a rigorous proof (error terms are dropped; it would be good to make this clear in the statement of the result).
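For concreteness, one way to read this claim (my paraphrase of the summary above, not the paper's exact statement, and ignoring the dropped error terms): if every eigenvalue $\lambda$ of $\nabla^2 f$ satisfies $\lambda\notin[-2\gamma-c,\,c]$, then for $\rho=0$ the first- and second-order smoothness constants shrink by the stated factor, $$\mathrm{Lip}(\overline{f}) \le \frac{1}{1+c/\gamma}\,\mathrm{Lip}(f), \qquad \mathrm{Lip}(\nabla\overline{f}) \le \frac{1}{1+c/\gamma}\,\mathrm{Lip}(\nabla f),$$ where $\mathrm{Lip}(\cdot)$ denotes a Lipschitz constant; it is these constants that enter the stability bound of Hardt et al.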
In the experiments, the modified procedure did not consistently reach a smaller validation error. The authors also do not report running times, hence it is unclear whether the procedure's increased per-iteration computation cost is offset by the faster convergence.
__Evaluation__: It is puzzling why a simpler smoothing, e.g., $$\tilde{f}(x) = \int f(x+z)\, g(z)\, dz$$ (with $g$ the density of a centered probability distribution), is not considered. The authors note that something "like this" may be infeasible in "deep neural networks" (the note is somewhat vague). However, under mild conditions, $\frac{\partial}{\partial x} \tilde{f}(x) = \int \frac{\partial}{\partial x} f(x+z)\, g(z)\, dz$; hence, for $Z\sim g(\cdot)$, $\frac{\partial}{\partial x} f(x+Z)$ is an unbiased estimate of $\frac{\partial}{\partial x} \tilde{f}(x)$, whose calculation is as cheap as that of vanilla SGD. Also, how much the smoothness of $f$ changes under this kind of smoothing is quite well understood.
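To illustrate the point about cost, a minimal sketch of my own (the Gaussian choice for $g$, the toy gradient, and the step size are placeholders, not anything from the paper): each step draws one perturbation and evaluates one ordinary gradient, so the per-step cost matches vanilla SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of a toy loss (stand-in for a minibatch gradient of the network loss).
    return x - 3.0 * np.sin(3.0 * x)

def smoothed_sgd_step(x, lr=0.1, sigma=0.1):
    """One step of SGD on the convolved loss ftilde(x) = E_{Z ~ g}[f(x + Z)]
    with g = N(0, sigma^2 I): sample Z, then use grad f(x + Z), an unbiased
    estimate of grad ftilde(x) that costs exactly one ordinary gradient."""
    z = sigma * rng.standard_normal(x.shape)
    return x - lr * grad_f(x + z)

x = rng.standard_normal(10)
for _ in range(200):
    x = smoothed_sgd_step(x)
```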
__Related work__: Hochreiter and Schmidhuber argued for the benefits of "flat minima" a long time ago, in an identically titled Neural Computation paper that appeared in 1997. This reviewer may not agree with the arguments in that paper, but it is highly relevant and should be cited. It is also strange that the specific modification appearing in this paper was proposed earlier by others (Baldassi et al.), whom this paper also cites, yet without giving these authors credit for introducing local entropy as a smoothing technique.