Summary by Hugo Larochelle 7 years ago
SGD is a widely used optimization method for training the parameters of some model f on some given task. Since the convergence of SGD is related to the variance of the stochastic gradient estimate, there's been a lot of work on trying to come up with such stochastic estimates with smaller variance. This paper does it using an importance sampling (IS) Monte Carlo estimate of the gradient, and learning the proposal distribution $q$ of the IS estimate.
The proposal distribution $q$ is parametrized in some way, and is trained to minimize the variance of the gradient estimate. It is trained simultaneously while the model $f$ that SGD (i.e. the SGD that uses IS to get its gradient) is training. To make this whole story more recursive, the proposal distribution $q$ is also trained with SGD :-) This makes sense, since one expects the best proposal to depend on the value of the parameters of model $f$, so the best proposal $q$ should vary as $f$ is trained.
One application of this idea is in optimizing a classification model over a distribution that is imbalanced class-wise (e.g. there are classes with much fewer examples). In this case, the proposal distribution determines how frequently we sample examples from each class (conditioned on the class, training examples are chosen uniformly).
#### My two cents
This is a really cool idea. I particularly like the application to training on an imbalanced classification problem. People have mostly been using heuristics to tackle this problem, such as initially sampling each class equally as often, and then fine-tuning/calibrating the model using the real class proportions. This approach instead proposes a really elegant, coherent, solution to this problem.
I would have liked to see a comparison with that aforementioned heuristic (for mainly selfish reasons :-) ). They instead compare with an importance sampling approach with proposal that assigns the same probability to each class, which is a reasonable alternative (though I don't know if it's used as often as the more heuristic approach).
There are other applications, to matrix factorization and reinforcement learning, that are presented in the paper and seem neat, though I haven't gone through those as much.
Overall, one of my favorite paper this year: it's original, tackles a problem for which I've always hated the heuristic solution I'm using now, proposes an elegant solution to it, and is applicable even more widely than that setting.