Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari and Anna Choromanska and Stefano Soatto and Yann LeCun

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/06

**Abstract:** This paper proposes a new optimization algorithm called Entropy-SGD for
training deep neural networks that is motivated by the local geometry of the
energy landscape at solutions found by gradient descent. Local extrema with low
generalization error have a large proportion of almost-zero eigenvalues in the
Hessian with very few positive or negative eigenvalues. We leverage upon this
observation to construct a local entropy based objective that favors
well-generalizable solutions lying in the flat regions of the energy landscape,
while avoiding poorly-generalizable solutions located in the sharp valleys. Our
algorithm resembles two nested loops of SGD, where we use Langevin dynamics to
compute the gradient of local entropy at each update of the weights. We prove
that incorporating local entropy into the objective function results in a
smoother energy landscape and use uniform stability to show improved
generalization bounds over SGD. Our experiments on competitive baselines
demonstrate that Entropy-SGD leads to improved generalization and has the
potential to accelerate training.

__Edit__: The paper has a newer version at https://openreview.net/pdf?id=B1YfAfcgl, where I also copied this review over with some changes.

__Note__: An earlier version of this review (almost identical to the present one), for an earlier version of the paper (available on arXiv), can be found here: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.01838#csaba The only change concerns the relation to previous work.

__Problem__: The problem considered is to derive an improved version of SGD for training neural networks (i.e., minimizing empirical loss) by modifying the objective so that the solution found is more likely to end up in the vicinity of a minimum where the loss changes slowly ("flat minima", as in the 1997 paper of Hochreiter and Schmidhuber).

__Motivation__: It is hypothesized that flat minima "generalize" better.

__Algorithmic approach__: Let $f$ be the (empirical) loss to be minimized. Modify this to $$\overline{f}(x) = \rho f(x) - \log \int \exp\left(-\frac{\gamma}{2}\|z\|^2 - f(x+z)\right)\,dz$$ with tunable parameters $\rho,\gamma>0$. For $\|z\|^2 \gg \frac{1}{\gamma}$, the factor $\exp(-\frac{\gamma}{2}\|z\|^2)$ becomes very small, so effectively the second term is determined by the values of $f$ over a ball centered at $x$ with radius $\propto \gamma^{-1/2}$. This is a smoothed version of $f$; hence one expects that by making this term more important than the first, a procedure minimizing $\overline{f}$ will be more likely to end up at a flat minimum of $f$. Since the gradient is somewhat complicated, an MCMC algorithm is proposed to estimate it ("stochastic gradient Langevin dynamics" from Welling and Teh, 2011).

__Results__: There is a theoretical result that quantifies the increased smoothness of $\overline{f}$, which is connected to stability and ultimately to generalization by citing a result of Hardt et al. (2015).
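To make the inner loop concrete: writing $y = x+z$, the gradient of the second term, $-\log\int \exp(-\frac{\gamma}{2}\|y-x\|^2 - f(y))\,dy$, equals $\gamma(x - \langle y\rangle)$, where $\langle y\rangle$ is the mean under the corresponding Gibbs measure, which the inner SGLD chain approximates. A minimal NumPy sketch (my own illustration, not the authors' code; the step sizes, loop counts, and averaging weight are placeholder choices):

```python
import numpy as np

def local_entropy_grad(grad_f, x, gamma=1.0, eta=0.1, eps=1e-4, L=20, rng=None):
    """Estimate the gradient of the local-entropy term via SGLD.

    Runs a short Langevin chain sampling y ~ exp(-f(y) - gamma/2 ||y - x||^2)
    and returns gamma * (x - <y>), with <y> an exponential running average.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = x.copy()
    mu = x.copy()
    alpha = 0.5  # exponential-averaging weight for the mean estimate (placeholder)
    for _ in range(L):
        g = grad_f(y) + gamma * (y - x)  # gradient of the inner potential
        y = y - eta * g + np.sqrt(eta) * eps * rng.standard_normal(x.shape)
        mu = (1 - alpha) * mu + alpha * y  # running estimate of <y>
    return gamma * (x - mu)

def entropy_sgd(grad_f, x0, steps=100, lr=0.1, rho=1.0, gamma=1.0):
    """Outer SGD loop on f_bar = rho * f - log-integral (toy full-gradient version)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        g = rho * grad_f(x) + local_entropy_grad(grad_f, x, gamma=gamma)
        x = x - lr * g
    return x
```

On a toy quadratic loss such as $f(x) = \frac{1}{2}\|x\|^2$ the iterates contract toward the (flat, unique) minimum at the origin; the point of the sketch is only to show that each outer step costs $L$ inner gradient evaluations.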
Empirical results show better validation error on two datasets, MNIST and CIFAR-10 (with the LeNet and All-CNN-C networks, respectively). The improvement is in terms of reaching the same validation error as "original SGD" but in fewer epochs.

__Soundness, significance__: The proof of the theoretical result relies on an arbitrary assumption that there exists some $c>0$ such that no eigenvalue of the Hessian of $f$ lies in the set $[-2\gamma-c,c]$ (the assumption is needed because otherwise a uniform improvement cannot be shown). For $\rho=0$, the improvement in smoothness (first and second order) is a factor of $1/(1+c/\gamma)$. The proof uses Laplace's method and is more a sketch than a rigorous proof (error terms are dropped; it would be good to make this clear in the statement of the result). In the experiments the modified procedure did not consistently reach a smaller validation error. The authors do not report running times, so it is unclear whether the procedure's increased computational cost is offset by the faster convergence.

__Evaluation__: It is puzzling why a simpler smoothing, e.g., $$\overline{f}(x) = \int f(x+z)\, g(z)\, dz$$ (with $g$ the density of a centered probability distribution) is not considered. The authors note that something "like this" may be infeasible in "deep neural networks" (the note is somewhat vague). However, under mild conditions, $\frac{\partial}{\partial x} \overline{f}(x) = \int \frac{\partial}{\partial x} f(x+z)\, g(z)\, dz$; hence, for $Z\sim g(\cdot)$, $\frac{\partial}{\partial x} f(x+Z)$ is an unbiased estimate of $\frac{\partial}{\partial x} \overline{f}(x)$, whose calculation is as cheap as that of vanilla SGD. Also, how much the smoothness of $f$ changes under this approach is quite well understood.

__Related work__: Hochreiter and Schmidhuber argued for the benefits of "flat minima" long ago, in an identically titled Neural Computation paper from 1997.
This reviewer may not agree with the arguments in that paper, but it is highly relevant and should be cited. It is also strange that the specific modification used in this paper was proposed earlier by others (Baldassi et al.), whom this paper also cites, but without giving those authors credit for introducing local entropy as a smoothing technique.
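The simpler smoothing suggested under __Evaluation__ amounts to a one-line stochastic gradient: sample $Z \sim g$ and step along $\nabla f(x+Z)$. A hypothetical NumPy sketch, taking $g$ to be a centered Gaussian with an assumed width `sigma` (both the name and the value are illustrative):

```python
import numpy as np

def smoothed_sgd(grad_f, x0, steps=500, lr=0.1, sigma=0.1, seed=0):
    """SGD on the smoothed loss f_bar(x) = E_{Z ~ N(0, sigma^2 I)}[f(x + Z)].

    By the interchange of differentiation and integration (mild conditions),
    grad_f(x + Z) is an unbiased estimate of grad f_bar(x), so each step
    costs exactly one gradient evaluation, the same as vanilla SGD.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        z = sigma * rng.standard_normal(x.shape)  # Z ~ g
        x = x - lr * grad_f(x + z)  # unbiased step on the smoothed objective
    return x
```

On a quadratic toy loss $f(x) = \frac{1}{2}\|x\|^2$ the iterates hover in a small noise ball around the minimum, illustrating that the estimator adds variance but no per-step cost over vanilla SGD.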
