First published: 2017/05/24

Abstract: Compression and computational efficiency in deep learning have become a
problem of great significance. In this work, we argue that the most principled
and effective way to attack this problem is by taking a Bayesian point of view,
where through sparsity inducing priors we prune large parts of the network. We
introduce two novelties in this paper: 1) we use hierarchical priors to prune
nodes instead of individual weights, and 2) we use the posterior uncertainties
to determine the optimal fixed point precision to encode the weights. Both
factors significantly contribute to achieving the state of the art in terms of
compression rates, while still staying competitive with methods designed to
optimize for speed or energy efficiency.

This paper describes an algorithm that adds noise with learned parameters to the network and applies a variational regulariser, similar to the approach in ["Variational Dropout Sparsifies Deep Neural Networks"][vardrop]. Both have the same goal: make neural networks more efficient by removing parameters (and therefore the computation that uses those parameters). This paper additionally aims to prescribe how many bits to use when storing each parameter.
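
To make the bit-precision idea concrete, here is a rough sketch of how posterior uncertainty could translate into a bit width. It is an illustrative heuristic and not the paper's exact rule: it assumes a uniform fixed-point grid over the layer's weight range and picks the fewest bits whose grid step is no larger than the smallest posterior standard deviation. The function name `bits_for_layer` and the `max_bits` cap are hypothetical.

```python
def bits_for_layer(weights, posterior_stds, max_bits=32):
    """Fewest bits whose uniform grid step fits within the posterior noise.

    Illustrative heuristic only (an assumption, not the paper's procedure).
    weights:        posterior means of the layer's weights
    posterior_stds: posterior standard deviations, one per weight
    """
    w_range = max(weights) - min(weights)
    # The most certain weight sets the resolution the grid has to reach.
    tolerance = min(posterior_stds)
    if w_range == 0.0:
        return 1            # all weights identical: one level is enough
    if tolerance <= 0.0:
        return max_bits     # no uncertainty to exploit
    for bits in range(1, max_bits + 1):
        step = w_range / (2 ** bits - 1)   # spacing of 2**bits grid points
        if step <= tolerance:
            return bits
    return max_bits


print(bits_for_layer([-1.0, 0.0, 1.0], [0.3, 0.5, 0.4]))  # 3
```

The intuition is that a weight whose posterior is wide can be rounded coarsely without changing the network much, so layers with high uncertainty need fewer bits.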

The paper gives a very nice derivation of the hierarchical variational approximation, which I won't try to reproduce here. In practice, the difference from prior work is that the stochastic gradient variational method uses hierarchical samples; i.e., it first samples the scale variables introduced by the hierarchical prior, then incorporates these samples when sampling the weights (both applied through local reparameterization tricks). It's a powerful method, which allows the authors to test two different priors (although they are clearly not limited to just these) and compare both against competing methods. The results are comparable to those methods, and the choice of prior offers some tradeoffs in terms of sparsity versus quantization.
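
The two-level sampling can be sketched as a single stochastic forward pass through a linear layer. This is a minimal sketch under stated assumptions, not the authors' code: it assumes a Gaussian variational posterior over one scale `z` per input node and a Gaussian posterior over each weight, and it samples the pre-activations directly rather than the weights (the local reparameterization trick). All names (`hierarchical_forward`, `z_mu`, `z_sigma`, `w_mu`, `w_sigma`) are hypothetical.

```python
import math
import random


def hierarchical_forward(h, z_mu, z_sigma, w_mu, w_sigma, rng):
    """One noisy forward pass through a linear layer with per-node scales.

    h:                input activations, length n_in
    z_mu, z_sigma:    mean and std of the scale posterior, length n_in
    w_mu, w_sigma:    mean and std of the weight posterior, n_in x n_out
    rng:              a random.Random instance
    """
    n_in, n_out = len(w_mu), len(w_mu[0])

    # Level 1: sample one scale per input node and apply it to the input.
    z = [z_mu[i] + z_sigma[i] * rng.gauss(0.0, 1.0) for i in range(n_in)]
    h_scaled = [h[i] * z[i] for i in range(n_in)]

    # Level 2: sample each pre-activation from its implied Gaussian
    # instead of sampling every weight individually.
    out = []
    for j in range(n_out):
        mean = sum(h_scaled[i] * w_mu[i][j] for i in range(n_in))
        var = sum((h_scaled[i] * w_sigma[i][j]) ** 2 for i in range(n_in))
        out.append(mean + math.sqrt(var) * rng.gauss(0.0, 1.0))
    return out


rng = random.Random(0)
sample = hierarchical_forward(
    h=[1.0, 2.0],
    z_mu=[1.0, 0.0], z_sigma=[0.1, 0.0],   # second node has scale 0
    w_mu=[[0.5, -1.0], [3.0, 4.0]],
    w_sigma=[[0.1, 0.1], [0.1, 0.1]],
    rng=rng,
)
```

Because a scale is shared by every weight leaving a node, a scale that collapses to zero (as for the second input in the usage example) removes that node's whole contribution, which is what makes node-level rather than weight-level pruning possible.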