# Main Results (tl;dr)

## Deep *Linear* Networks

1. Loss function is **non-convex** and non-concave
2. **Every local minimum is a global minimum**
3. Shallow neural networks *don't* have bad saddle points
4. Deep neural networks *do* have bad saddle points

## Deep *ReLU* Networks

* Same results as above, by reduction to deep linear networks under strong simplifying assumptions
* Strong assumptions:
    * The probability that a path through the ReLU network is active is the same, regardless of which path it is.
    * The activations of the network are independent of the input data and the weights.

## Highlighted Takeaways

* Depth *doesn't* create non-global minima, but depth *does* create bad saddle points.
* This paper moves deep linear networks closer to being a good model for deep ReLU networks by discarding 5 of the 7 previously used assumptions. This gives more "support" for the conjecture that deep ReLU networks don't have bad local minima.
* Deep linear networks don't have bad local minima, so if deep ReLU networks do have bad local minima, it's purely because of the introduction of nonlinear activations. This highlights the importance of the activation function used.
* Shallow linear networks don't have bad saddle points while deep linear networks do, indicating that the saddle point problem is introduced with depth beyond the first hidden layer (see the toy sketch at the end of these notes).

Bad saddle point
: saddle point whose Hessian has no negative eigenvalues (no direction to descend)

Shallow neural network
: single hidden layer

Deep neural network
: more than one hidden layer

Bad local minima
: local minima that aren't global minima

# Position in Research Landscape

* Conjecture from 1989: For deep linear networks, every local minimum is a global minimum: [Neural networks and principal component analysis: Learning from examples without local minima (Neural Networks 1989)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.1839&rep=rep1&type=pdf)
    * This paper proves that conjecture.
* Given 7 strong assumptions, the losses of local minima are concentrated in an exponentially (with dimension) tight band: [The Loss Surfaces of Multilayer Networks (AISTATS 2015)](https://arxiv.org/abs/1412.0233)
* Discarding some of the above assumptions is an open problem: [Open Problem: The landscape of the loss surfaces of multilayer networks (COLT 2015)](http://proceedings.mlr.press/v40/Choromanska15.pdf)
    * This paper discards 5 of those assumptions and proves the result for a strictly more general deep nonlinear model class.

# More Details

## Deep *Linear* Networks

* Main result is Result 2, which proves the conjecture from 1989: every local minimum is a global minimum.
    * This is *not* where the strong assumptions come in.
    * Assumptions (realistic and practically easy to satisfy; a minimal numerical check is sketched below):
        * $XX^T$ and $XY^T$ are full rank
        * $d_y \leq d_x$ (output is lower dimension than input)
        * $\Sigma = YX^T(XX^T)^{-1}XY^T$ has $d_y$ distinct eigenvalues
        * Specific to the squared error loss function
* Essentially gives a comprehensive understanding of the loss surface of deep linear networks
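To make these conditions concrete, here is a minimal numpy sketch (my own, not from the paper) that checks them on randomly generated data. The shapes `d_x`, `d_y`, `m`, the data, and the tolerance are arbitrary choices; the notation follows the notes above ($X \in \mathbb{R}^{d_x \times m}$ inputs, $Y \in \mathbb{R}^{d_y \times m}$ targets).

```python
# Hypothetical check (not from the paper) of the three data assumptions
# behind the deep-linear result, on made-up Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, m = 10, 4, 1000            # d_y <= d_x (assumption 2)
X = rng.standard_normal((d_x, m))
Y = rng.standard_normal((d_y, m))

XXt = X @ X.T
XYt = X @ Y.T

# Assumption 1: XX^T and XY^T have full rank.
print(np.linalg.matrix_rank(XXt) == d_x)
print(np.linalg.matrix_rank(XYt) == d_y)

# Assumption 3: Sigma = Y X^T (X X^T)^{-1} X Y^T has d_y distinct eigenvalues.
Sigma = Y @ X.T @ np.linalg.inv(XXt) @ X @ Y.T
eigvals = np.linalg.eigvalsh(Sigma)              # ascending order
print(np.all(np.diff(eigvals) > 1e-8))           # distinct up to tolerance
```

With generic (e.g., Gaussian) data all three checks pass, which is the sense in which the notes call these assumptions "realistic and practically easy to satisfy".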
## Deep *ReLU* Networks

* Specific to the ReLU activation; makes strong use of its properties.
* Choromanska et al. (2015) relate the loss function to the Hamiltonian of the spherical spin-glass model, using 3 reshaping assumptions. This allows them to apply existing random matrix theory results. This paper drops those reshaping assumptions by performing a completely different analysis.
* Because Choromanska et al. (2015) used random matrix theory, they analyzed a random Hessian, which means they needed to make 2 distributional assumptions. This paper also drops those 2 assumptions and analyzes a deterministic Hessian.
* Remaining Unrealistic Assumptions:
    * The probability that a path through the ReLU network is active is the same, regardless of which path it is.
    * The activations of the network are independent of the input data and the weights.

# Related Resources

* [NIPS Oral Presentation](https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Deep-Learning-without-Poor-Local-Minima)
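Finally, the toy sketch promised in the takeaways above. This is my own illustration, not an example from the paper: take scalar-weight linear "networks" fit to a single data point $(x, y) = (1, 1)$ with squared error, so one hidden layer gives $L(a, b) = (ab - 1)^2$ and two hidden layers give $L(a, b, c) = (abc - 1)^2$. Both losses have a critical point at the origin, but only the deeper one is a bad saddle in the sense defined earlier.

```python
# Toy sketch (my own, not from the paper): bad saddle points appear only
# beyond one hidden layer, even for scalar-weight linear networks.
import sympy as sp

a, b, c = sp.symbols('a b c')

def hessian_eigenvalues_at_origin(loss, weights):
    H = sp.hessian(loss, weights).subs({w: 0 for w in weights})
    return H, H.eigenvals()

# Shallow (one hidden layer): the origin is a saddle, but the Hessian there
# has a negative eigenvalue, so second-order information gives a descent
# direction -- not a "bad" saddle.
print(hessian_eigenvalues_at_origin((a*b - 1)**2, [a, b]))       # eigenvalues -2 and 2

# Deep (two hidden layers): the origin is still a saddle, but the Hessian
# there is identically zero -- no negative eigenvalue, hence a "bad" saddle.
print(hessian_eigenvalues_at_origin((a*b*c - 1)**2, [a, b, c]))  # eigenvalue 0 only
```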