Summary by Gavin Gray
Normally, a Deep Convolutional Network (DCN) is a conditional probabilistic model: we have some parameterised function $f_{\theta}$ and some distribution over the targets that defines the loss function (categorical for classification, Gaussian for regression). But that conditional function is treated as a black box; probabilistically speaking, it's just fit by maximum likelihood. This paper breaks the entire network up into a number of latent factors.
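To make the contrast concrete, here is the standard setup in my own notation (not necessarily the paper's): the target distribution is parameterised directly by the network output, and training just maximises the conditional likelihood,

$$
p(y \mid x) = \mathrm{Cat}\big(y \mid \mathrm{softmax}(f_{\theta}(x))\big)
\quad \text{or} \quad
p(y \mid x) = \mathcal{N}\big(y \mid f_{\theta}(x),\, \sigma^{2} I\big),
$$

with $\hat{\theta} = \arg\max_{\theta} \sum_n \log p(y_n \mid x_n)$ and no probabilistic structure inside $f_{\theta}$ itself.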
The latent factors are designed to represent familiar parts of DCNs; for example, max-pooling selects from a set of activations, and we can model the uncertainty in this selection with a categorical random variable. To generate an image, you can imagine selecting a set of paths backwards through the network to reproduce each pixel's activation. That's not exactly how their model is parameterised, though, so all I can do is point you to the paper for the real DRMM model definition.
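As a toy illustration of that pooling example (my own sketch, not the paper's parameterisation), you can put a categorical distribution over which activation in a pooling window gets selected; taking the MAP choice recovers ordinary max-pooling, while sampling expresses uncertainty in the selection:

```python
import numpy as np

rng = np.random.default_rng(0)

def pool_with_categorical_latent(window, temperature=1.0, sample=False):
    """window: flat array of activations in one pooling region (e.g. a 2x2 patch)."""
    logits = window / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # categorical distribution over positions
    if sample:
        idx = rng.choice(len(window), p=probs)  # sample the latent selection
    else:
        idx = int(np.argmax(probs))             # MAP selection == standard max-pooling
    return window[idx], probs

activations = np.array([0.1, 2.3, 1.9, -0.5])
value, probs = pool_with_categorical_latent(activations)
print(value, probs)  # 2.3, with most of the probability mass on that position
```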
# Inference
The big trick of this paper is that they design the probabilistic model so that inference by max-sum-product message passing (also introduced in this paper) equates to the forward pass of a DCN. What does that mean? Well, since the network structure is now a hierarchical probabilistic model, we can hope to throw better learning algorithms at it, which is what they do.
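Schematically (glossing over the paper's exact notation), the idea is that max-marginalising out the nuisance latents, an on/off switch $a$ and a local translation $t$ of a template $w$, reproduces the familiar DCN operations:

$$
\max_{t}\, \max_{a \in \{0,1\}}\, a \, \langle w_{t}, x \rangle
\;=\; \mathrm{MaxPool}\big(\mathrm{ReLU}(w \ast x)\big),
$$

so running MAP inference layer by layer is the same computation as forward prop.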
# Learning
Using this probabilistic formulation, you can define a loss function that includes a reconstruction term: the image is generated from responsibilities that you can estimate during the forward pass. Then you have reconstruction gradients _and_ gradients w.r.t. your target loss, so you can train semi-supervised or unsupervised while still working with what is practically a DCN.
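As a rough sketch of how that plays out in practice (my notation, not the paper's exact objective): the same forward pass produces both class logits and a reconstruction, so unlabelled images still contribute a gradient through the reconstruction term, while labelled images also get the usual target loss.

```python
import torch.nn.functional as F

def semi_supervised_loss(model, x, y=None, alpha=1.0):
    # `model` is a hypothetical stand-in: one forward pass returns class logits
    # (the discriminative path) and a reconstruction of the input (the generative path).
    logits, x_hat = model(x)
    loss = F.mse_loss(x_hat, x)                  # reconstruction term, usable for every image
    if y is not None:                            # labelled images also get the target loss
        loss = loss + alpha * F.cross_entropy(logits, y)
    return loss
```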
In addition, it's possible to derive a full EM algorithm in this case, in which the M-step corresponds to solving a generalised least-squares problem. That means gradient-free and more principled training of neural networks.
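For intuition (a generic mixture-model-style sketch, not the paper's exact derivation), an M-step of this kind weights squared reconstruction errors by the E-step responsibilities $\gamma_{nc}$ and solves

$$
\theta^{\text{new}} \;=\; \arg\min_{\theta} \sum_{n}\sum_{c} \gamma_{nc}\,
\big(x_{n} - \mu_{\theta, c}\big)^{\top} \Sigma_{c}^{-1} \big(x_{n} - \mu_{\theta, c}\big),
$$

which is a generalised least-squares problem with a closed-form solution whenever the means are linear in $\theta$.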
# Experimental Results
Not all of the theory presented in this paper is covered in the experiments. They do demonstrate their method working well on semi-supervised MNIST and CIFAR-10 (SOTA with additional variational tweaks). But there are not yet any experiments using the full EM algorithm they describe (they only report that results _appear promising_); all experiments use gradient-based optimisation. They report that their network trains 2-3x faster, but only demonstrate this on MNIST (and how much of that is down to the chosen optimizer?).
Also, the model can be sampled from, but we don't have any visualisations of the kind of images we would get.