Adversarially Learned Inference on ShortScience.org

Adversarially Learned Inference
Vincent Dumoulin and Ishmael Belghazi and Ben Poole and Alex Lamb and Martin Arjovsky and Olivier Mastropietro and Aaron Courville
arXiv e-Print archive - 2016 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by Alexander Jung 7 years ago

  * They suggest a new architecture for GANs.
  * Their architecture adds another Generator for a reverse branch (from images to noise vector `z`).
  * Their architecture takes some ideas from VAEs/variational neural nets.
  * Overall they can improve on the previous state of the art (DCGAN).

### How
  * Architecture
    * Usually, in GANs one feeds a noise vector `z` into a Generator (G), which then generates an image (`x`) from that noise.
    * They add a reverse branch (G2), in which another Generator takes a real image (`x`) and generates a noise vector `z` from that.
      * The noise vector can now be viewed as a latent space vector.
    * Instead of letting G2 output a single *deterministic* value for each entry of `z`, they take the approach commonly used in VAEs and make `z` *stochastic*.
      * That is, if `z` represents `N` latent variables, they let G2 generate `N` means and `N` variances of Gaussian distributions, with each distribution representing one value of `z`.
      * So the model could e.g. represent something along the lines of "this face looks a lot like a female, but with very low probability could also be male".
  * Training
    * The Discriminator (D) is now trained on pairs of either `(real image, generated latent space vector)` or `(generated image, randomly sampled latent space vector)` and has to tell them apart from each other.
    * Both Generators are trained to maximally confuse D.
      * G1 (from `z` to `x`) confuses D maximally if it generates new images that (a) look real and (b) fit the latent variables in `z` well (e.g. if `z` says "image contains a cat", then the image should contain a cat).
      * G2 (from `x` to `z`) confuses D maximally if it generates latent variables `z` that fit the image `x` well.
    * Continuous variables
      * The variables in `z` follow Gaussian distributions, which makes the training more complicated, as you can't trivially backpropagate through a random sampling step.
      * When training G1 (from `z` to `x`) the situation is easy: You draw a random `z`-vector following a gaussian distribution (`N(0, I)`). (This is basically the same as in "normal" GANs. They just often use uniform distributions instead.)
      * When training G2 (from `x` to `z`) the situation is a bit harder.
        * Here we need to use the reparameterization trick.
        * That roughly means that G2 predicts the means and variances of the Gaussian variables in `z`, and we then compute a sample as `z = mean + standard_deviation * epsilon`, where `epsilon` is drawn from `N(0, I)`.
        * Because the randomness now sits only in `epsilon`, the sample is a differentiable function of the predicted means and variances, so we get concrete values that we can backpropagate through.
        * If we do that sampling often enough, we get a good approximation of the true gradient (of the expectation over the random `z`). (Monte Carlo approximation.)
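
The sampling step for G2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the linear encoder, the function names `encode` and `sample_z`, and the choice to predict a log-variance (rather than a variance) are all made up for the example.

```python
import numpy as np

def encode(x, w_mu, w_logvar):
    """Hypothetical linear stand-in for G2: per-dimension mean and log-variance of z."""
    return x @ w_mu, x @ w_logvar

def sample_z(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)   # all randomness lives in eps
    sigma = np.exp(0.5 * logvar)          # log-variance -> standard deviation
    return mu + sigma * eps               # deterministic in mu and logvar

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))               # 4 flattened toy "images", 8 features each
w_mu = 0.1 * rng.standard_normal((8, 3))      # N = 3 latent variables (assumed)
w_logvar = 0.1 * rng.standard_normal((8, 3))

mu, logvar = encode(x, w_mu, w_logvar)
z = sample_z(mu, logvar, rng)                 # one Monte Carlo sample of z per image
```

In an autodiff framework, gradients with respect to the encoder weights would flow through `mu` and `sigma`, while `eps` is treated as a constant input.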

### Results
  * Images generated based on Celeb-A dataset:
    * ![Celeb-A samples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adversarially_Learned_Inference__celeba-samples.png?raw=true "Celeb-A samples")
  * Left column per pair: Real image, right column per pair: reconstruction (`x -> z` via G2, then `z -> x` via G1)
    * ![Celeb-A reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adversarially_Learned_Inference__celeba-reconstructions.png?raw=true "Celeb-A reconstructions")
  * Reconstructions of SVHN, notice how the digits often stay the same, while the font changes:
    * ![SVHN reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adversarially_Learned_Inference__svhn-reconstructions.png?raw=true "SVHN reconstructions")
  * CIFAR-10 samples, still lots of errors, but some quite correct:
    * ![CIFAR10 samples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Adversarially_Learned_Inference__cifar10-samples.png?raw=true "CIFAR10 samples")

[link] Summary by CodyWild 7 years ago

Despite their difficulties in training, Generative Adversarial Networks are still one of the most exciting recent ideas in machine learning; a way to generate data without the fuzziness and averaging of earlier methods. However, up until recently, there had been one major way in which the GAN’s primary competitor in the field, the Variational Autoencoder, was superior: it could do inference.

Intuitively, inference is the inverse of generation. Whereas generation works by taking some source of randomness - a random vector, the setting of some latent code - and transforming that recipe into an observation, an inference process tries to work in reverse, taking in the observation as input and trying to guess what “recipe” was used to generate it. (As a note: in real world data, it’s generally not the case that there were explicit numerical factors used to generate data; this framing is a simplified model meant to represent the way a small set of latent settings of an object jointly cause a lot of that object’s feature values). The authors of this paper proposed the BiGAN to fix that deficiency in GAN literature.

https://i.imgur.com/vZZzWH5.png

The BiGAN - short for Bidirectional GAN - works by having two generators, not one. One generator works in the typical fashion of a GAN: taking in a random vector z, and transforming that into G(z) = x. The second generator works in reverse, taking as input data from the underlying dataset, and transforming it into a code z, E(x) = z. Once these generators are in place, the discriminator works, not by trying to differentiate the x and z values separately, but by looking at them jointly. That works by giving the discriminator a pair, (x, z), and asking it to decide whether that pair came from the z -> x decoder, or the x -> z encoder. If this model fully converges, it becomes the case that G(z) and E(x) are inverse transformations, giving us a way to take in a new input x, and infer its underlying factors z. This is valuable because it’s been shown that, in typical GANs, changes in z often correspond to latent values we care about, and it would be useful to be able to generate z from x for purposes of representation learning.
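
To make the pairing concrete, here is a minimal NumPy sketch of how the two kinds of (x, z) pairs are built and scored. Everything in it is an assumption for illustration: the generators and the discriminator are toy linear maps, the names `discriminator` and `discriminator_loss` are hypothetical, and labeling encoder pairs as 1 and decoder pairs as 0 is just one possible convention.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator(x, z, w_x, w_z):
    """Toy linear discriminator on the joint pair (x, z): P(pair came from the encoder)."""
    return sigmoid(x @ w_x + z @ w_z)

def discriminator_loss(d_enc, d_dec, eps=1e-8):
    """Binary cross-entropy with encoder pairs labeled 1 and decoder pairs labeled 0."""
    return -np.mean(np.log(d_enc + eps)) - np.mean(np.log(1.0 - d_dec + eps))

rng = np.random.default_rng(0)
x_dim, z_dim, batch = 6, 2, 5
w_g = rng.standard_normal((z_dim, x_dim))      # toy decoder G: z -> x
w_e = rng.standard_normal((x_dim, z_dim))      # toy encoder E: x -> z
w_x = rng.standard_normal(x_dim)               # discriminator weights on x
w_z = rng.standard_normal(z_dim)               # discriminator weights on z

x_real = rng.standard_normal((batch, x_dim))   # stand-in for samples from the dataset
z_prior = rng.standard_normal((batch, z_dim))  # z drawn from the prior N(0, I)

enc_pairs = (x_real, x_real @ w_e)             # (x, E(x)): real x, inferred z
dec_pairs = (z_prior @ w_g, z_prior)           # (G(z), z): generated x, sampled z

d_enc = discriminator(*enc_pairs, w_x, w_z)
d_dec = discriminator(*dec_pairs, w_x, w_z)
loss_d = discriminator_loss(d_enc, d_dec)      # the discriminator minimizes this
loss_ge = -loss_d                              # minimax form: G and E push the other way
```

Note that the discriminator never sees x or z alone; it only ever scores the concatenated pair, which is what forces the two generators to agree with each other.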

The authors offer quite a nice intuitive proof for why the model learns this inverse mapping. For each pair of (x, z), it’s either the case that E(x) = z (if the pair came from the encoder), or that G(z) = x (if the pair came from the decoder). But if only one of those is the case, then it’s easy for the discriminator to tell which generation process produced the pair. So, in order to fool the discriminator, G(z) and E(x) need to synchronize their decoding and encoding processes.
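
This argument can be written a little more formally. The following is a sketch in the notation of this summary, assuming a deterministic encoder and decoder; $p_E(x, z)$ and $p_G(x, z)$ are labels introduced here for the joint distributions of encoder pairs $(x, E(x))$ and decoder pairs $(G(z), z)$. The adversarial game is

$$\min_{G, E} \max_{D} \; V(D, E, G) = \mathbb{E}_{x \sim p(x)}\big[\log D(x, E(x))\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z), z)\big)\big].$$

For fixed $G$ and $E$, the optimal discriminator is

$$D^{*}(x, z) = \frac{p_E(x, z)}{p_E(x, z) + p_G(x, z)},$$

and substituting it back gives

$$V(D^{*}, E, G) = 2\,\mathrm{JSD}\big(p_E \,\|\, p_G\big) - \log 4,$$

which is minimized exactly when $p_E = p_G$. Once the two joint distributions match, (almost) every pair $(x, z)$ they produce satisfies both $z = E(x)$ and $x = G(z)$, so $G(E(x)) = x$ and $E(G(z)) = z$ on the support of the data and the prior.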

The authors also tried a method where, instead of having this bidirectional GAN structure, they simply built a network on top of the generated samples that tries to predict the original z used, taking the generated x as input. They show that this performs less well on subjective quality measures of the learned representation, which they attribute to the fact that GANs notoriously only learn some modes of the data, and thus an x -> z encoder that is trained only on generated x will not have good coverage over the full distribution of x.
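
A toy version of that baseline, under loud assumptions: the generator is a fixed linear map, the "network on top" is an ordinary least-squares fit, and all names (`w_g`, `w_reg`, `x_gen`) are hypothetical. It only illustrates the setup, in which the regressor is fit on pairs (G(z), z) and never sees real data.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim, n = 2, 6, 200

w_g = rng.standard_normal((z_dim, x_dim))   # fixed toy generator G: z -> x
z = rng.standard_normal((n, z_dim))         # codes sampled from the prior
x_gen = z @ w_g                             # generated samples G(z)

# Fit the regressor on generated samples only: find w_reg with x_gen @ w_reg ~ z.
# rcond is set explicitly because x_gen has rank z_dim, not x_dim.
w_reg, *_ = np.linalg.lstsq(x_gen, z, rcond=1e-10)

z_hat = x_gen @ w_reg                       # recovers z on the generator's own outputs
max_err = np.max(np.abs(z_hat - z))
```

The fit is essentially exact on generated samples, but nothing in the training signal constrains what the regressor does on inputs the generator never produces, which is the coverage problem described above.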
