An overview of gradient descent optimization algorithms
Sebastian Ruder
arXiv e-Print archive - 2016
Keywords:
cs.LG
First published: 2016/09/15
Abstract: Gradient descent optimization algorithms, while increasingly popular, are
often used as black-box optimizers, as practical explanations of their
strengths and weaknesses are hard to come by. This article aims to provide the
reader with intuitions with regard to the behaviour of different algorithms
that will allow her to put them to use. In the course of this overview, we look
at different variants of gradient descent, summarize challenges, introduce the
most common optimization algorithms, review architectures in a parallel and
distributed setting, and investigate additional strategies for optimizing
gradient descent.
This is originally from a web post, extended with new content on noisy SGD methods (annealed Gaussian noise added to the gradients); a rough sketch of that update is below.
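For my own reference, a minimal sketch of the noisy SGD update, assuming the annealed Gaussian gradient noise scheme from Neelakantan et al. (2015) that the paper describes; the toy quadratic objective and the concrete hyperparameter values are my own picks for illustration:

```python
import numpy as np

def noisy_sgd(grad_fn, theta0, lr=0.01, eta=0.3, gamma=0.55, steps=1000):
    """SGD with annealed Gaussian gradient noise (Neelakantan et al., 2015).

    At step t, noise drawn from N(0, sigma_t^2) is added to the gradient,
    with variance sigma_t^2 = eta / (1 + t)**gamma decaying over time.
    """
    theta = np.asarray(theta0, dtype=float)
    for t in range(steps):
        sigma2 = eta / (1 + t) ** gamma                # annealed variance
        noise = np.random.normal(0.0, np.sqrt(sigma2), size=theta.shape)
        theta = theta - lr * (grad_fn(theta) + noise)  # noisy gradient step
    return theta

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
# The lr, eta, gamma values above are illustrative picks, not tuned settings.
print(noisy_sgd(lambda th: th, theta0=[2.0, -3.0]))    # ends up near the origin
```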
The demo experiment is interesting, but I would like to reproduce it myself to see the results.
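As a starting point for reproducing that kind of comparison, a small self-contained sketch that races SGD, momentum, and Adam on an ill-conditioned quadratic; the test function, starting point, and step sizes are assumptions of mine, not the paper's actual demo setup:

```python
import numpy as np

# Ill-conditioned quadratic f(p) = 0.5 * (x^2 + 10 * y^2); minimum at the origin.
def f(p):
    return 0.5 * (p[0] ** 2 + 10.0 * p[1] ** 2)

def grad(p):
    return np.array([p[0], 10.0 * p[1]])

def run(update, p0=(10.0, 1.0), steps=200):
    p, state = np.array(p0), {}
    for t in range(1, steps + 1):
        p = update(p, grad(p), state, t)
    return p

def sgd(p, g, state, t, lr=0.1):
    return p - lr * g

def momentum(p, g, state, t, lr=0.1, mu=0.9):
    v = mu * state.get("v", np.zeros_like(p)) + lr * g   # velocity accumulation
    state["v"] = v
    return p - v

def adam(p, g, state, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * state.get("m", np.zeros_like(p)) + (1 - b1) * g
    v = b2 * state.get("v", np.zeros_like(p)) + (1 - b2) * g ** 2
    state["m"], state["v"] = m, v
    mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)    # bias correction
    return p - lr * mhat / (np.sqrt(vhat) + eps)

for name, upd in [("SGD", sgd), ("Momentum", momentum), ("Adam", adam)]:
    p = run(upd)
    print(f"{name:8s} final f = {f(p):.2e} at ({p[0]:+.4f}, {p[1]:+.4f})")
```

Swapping in other update rules from the paper (Adagrad, RMSprop, NAG) only requires adding another update function with the same signature.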