Summary by Hugo Larochelle
This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of the network, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and performing a model update with backprop.
The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added to the weights could instead be added directly to the pre-activation vectors (i.e. before the activation function) during forward propagation. This works because computing the pre-activation is a linear operation, so the noise at that level is also Gaussian. The advantage is that, in the context of minibatch training, one can then efficiently add independent noise to the pre-activation vectors of each example in the minibatch. By the nature of the local reparameterization trick, this is equivalent to using a different corrupted version of the weights for each example in the minibatch, something that wouldn't otherwise be practical computationally. This is in fact why, in regular SGVB, previous work would normally use a single corrupted version of the weights for the whole minibatch.
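To make the trick concrete, here is a minimal NumPy sketch (my own illustration, with made-up variable names, assuming a fully factorized Gaussian posterior over the weights of a single linear layer). It contrasts sampling one weight matrix for the whole minibatch with sampling the pre-activations directly, using independent noise per example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 100, 50
X = rng.standard_normal((batch, d_in))           # minibatch of inputs
W_mu = 0.1 * rng.standard_normal((d_in, d_out))  # posterior means of the weights
W_var = 0.01 * np.ones((d_in, d_out))            # posterior variances of the weights

# "Global" reparameterization: sample ONE weight matrix, shared by the whole
# minibatch (what previous SGVB work typically did, for efficiency).
W_sample = W_mu + np.sqrt(W_var) * rng.standard_normal((d_in, d_out))
pre_act_shared = X @ W_sample

# Local reparameterization: X @ W is linear in W, so the pre-activations are
# themselves Gaussian; sample them directly, with independent noise per example.
pre_act_mean = X @ W_mu
pre_act_var = (X ** 2) @ W_var
pre_act_local = pre_act_mean + np.sqrt(pre_act_var) * rng.standard_normal((batch, d_out))
```

The local version gives each of the 32 examples its own effective weight sample at the cost of two matrix multiplications, rather than the 32 separate weight samples that would otherwise be needed.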
The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.
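As a sanity check of this claim, here is a toy Monte Carlo experiment (my own illustration, not reproduced from the paper): it estimates the variance of the minibatch gradient of a squared loss with respect to the posterior means, under shared weight noise versus the local reparameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 64, 100                             # minibatch size, input dimension
X = rng.uniform(0.0, 1.0, size=(M, D))     # fixed minibatch of pixel-like (non-negative) inputs
y = rng.standard_normal(M)                 # fixed targets
W_mu = np.zeros(D)                         # posterior means
W_var = 0.01 * np.ones(D)                  # posterior variances

def grad_wrt_mu(local: bool) -> np.ndarray:
    """Minibatch gradient of the mean squared error w.r.t. the posterior means W_mu."""
    if local:
        # Sample the pre-activations directly, with independent noise per example.
        z = X @ W_mu + np.sqrt((X ** 2) @ W_var) * rng.standard_normal(M)
    else:
        # Sample one weight vector, shared by the whole minibatch.
        w = W_mu + np.sqrt(W_var) * rng.standard_normal(D)
        z = X @ w
    return (-2.0 / M) * X.T @ (y - z)

g_shared = np.stack([grad_wrt_mu(False) for _ in range(2000)])
g_local = np.stack([grad_wrt_mu(True) for _ in range(2000)])
print("avg gradient variance, shared noise:", g_shared.var(axis=0).mean())
print("avg gradient variance, local reparam:", g_local.var(axis=0).mean())
# With correlated (non-negative) inputs, the shared-noise variance stays large
# because the same weight noise hits every example, whereas the per-example
# noise of the local trick averages out across the minibatch.
```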
Then, the authors demonstrate that the Gaussian version of dropout (the one that uses multiplicative Gaussian noise instead of 0-1 masking noise) can be seen as the local reparameterization trick applied to an SGVB objective with a specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate corresponds to a parameter of the variational posterior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate for each layer, or even for each parameter of the model.
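To illustrate the dropout side of the correspondence, here is a hedged sketch (illustrative names, not the paper's code) of multiplicative Gaussian dropout in which the noise variance alpha, related to the usual dropout rate p by alpha = p / (1 - p), is exposed as a tunable quantity; the KL term of the variational lower bound, which regularizes alpha in the paper, is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(h: np.ndarray, log_alpha: np.ndarray) -> np.ndarray:
    """Multiply activations by N(1, alpha) noise, with alpha = exp(log_alpha).

    log_alpha can be a scalar (one rate for the whole layer) or a vector with
    one rate per unit; in the paper's weight-noise formulation it can even be
    one rate per weight. Parameterizing by log_alpha keeps alpha positive when
    it is optimized by gradient descent on the variational lower bound.
    """
    alpha = np.exp(log_alpha)
    noise = 1.0 + np.sqrt(alpha) * rng.standard_normal(h.shape)
    return h * noise

# alpha = p / (1 - p): a standard dropout rate of p = 0.5 corresponds to alpha = 1.
h = rng.standard_normal((32, 50))
h_noisy = gaussian_dropout(h, log_alpha=np.log(np.full(50, 1.0)))
```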
Experiments on MNIST confirm that tuning this parameter works and allows the model to reach good performance across various network sizes, compared to using a default dropout rate.
##### My two cents
This is another thought-provoking connection between Bayesian learning and dropout. Indeed, while deep GPs have been used to make a Bayesian connection with regular (binary) dropout training \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legitimate way of modeling uncertainty in the parameters of neural networks. It's also nice that this connection yields a method for tuning the dropout rate automatically.
I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.
Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.