This paper presents an unsupervised generative model, based on the variational autoencoder framework, but where the encoder is a recurrent neural network that sequentially infers the identity, pose and number of objects in some input scene (2D image or 3D scene).
In short, this is done by extending the DRAW model to incorporate discrete latent variables that determine whether an additional object is present or not. Since the reparametrization trick cannot be used for discrete variables, the authors estimate the gradient through the sampling operation using a likelihood ratio estimator.
Another innovation over DRAW is the application to 3D scenes, in which the decoder is a graphics renderer. Since it is not possible to backpropagate through the renderer, gradients are estimated using finite-difference estimates (which require going through the renderer several times).
Experiments are presented where the evaluation is focused on the ability of the model to detect and count the number of objects in the image or scene.
**My two cents**
This is a nice, natural extension of DRAW. I'm particularly impressed by the results for the 3D scene setting. Despite the fact that setup is obviously synthetic and simplistic, I really surprised that estimating the decoder gradients using finite-differences worked at all. It's also interesting to see that the proposed model does surprisingly well compared to a CNN supervised approach that directly predicts the objects identity and pose. Quite cool!
To see the model in action, see [this cute video][1].
[1]: https://www.youtube.com/watch?v=4tc84kKdpY4