Summary by Alexander Jung 7 years ago
* GANs are based on adversarial training.
* Adversarial training is a general technique for training generative models (here, primarily models that create new images).
* In adversarial training, one model (G, generator) generates things (e.g. images). Another model (D, discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two.
* Neural Networks are models that can be trained in an adversarial way (and are the only models discussed here).
### How
* G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output.
* D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid).
* You need a training set of things to be generated, e.g. images of human faces.
* Let the batch size be B.
* G is trained the following way:
  * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (The number of values per vector depends on the chosen input size of G.)
  * Feed forward the vectors through G to create new images.
  * Feed forward the images through D to create ratings.
  * Use a cross entropy loss on these ratings. D should view all of these (fake) images as label=0, but for G's update the target is label=1: if D rates them close to 1, the error will be low (G did a good job).
  * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel.
  * Perform a backward pass of these errors through G to train G.
* D is trained the following way:
  * Create B/2 images using G (again, B/2 random vectors, feed forward through G).
  * Choose B/2 images from the training set. Real images get label=1.
  * Merge the fake and real images into one batch. Fake images get label=0.
  * Feed forward the batch through D.
  * Measure the error using cross entropy.
  * Perform a backward pass with the error through D.
* Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with G; then you need more iterations of D per batch of G.
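The alternating procedure above can be sketched in a few lines of plain Python. This is only a toy illustration, not the paper's setup: instead of MLPs on images it assumes a hypothetical 1-D generator `G(z) = a*z + b` and a logistic-regression discriminator `D(x) = sigmoid(w*x + c)`, so that the backward passes can be written by hand. The class `ToyGAN`, the function `train`, the learning rate and the Gaussian "training set" are all made up for this example.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def cross_entropy(p, label, eps=1e-7):
    """Binary cross entropy between a rating p in (0, 1) and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


class ToyGAN:
    """Hypothetical 1-D stand-in: G(z) = a*z + b, D(x) = sigmoid(w*x + c)."""

    def __init__(self, lr=0.05, seed=0):
        self.rng = random.Random(seed)
        self.g = {"a": 1.0, "b": 0.0}
        self.d = {"w": 0.1, "c": 0.0}
        self.lr = lr

    def noise(self, n):
        return [self.rng.uniform(-1.0, 1.0) for _ in range(n)]

    def generate(self, zs):
        return [self.g["a"] * z + self.g["b"] for z in zs]

    def rate(self, xs):
        return [sigmoid(self.d["w"] * x + self.d["c"]) for x in xs]

    def train_g(self, batch_size):
        zs = self.noise(batch_size)
        ratings = self.rate(self.generate(zs))
        # Target is label=1: G wants D to rate its samples as real.
        loss = sum(cross_entropy(p, 1) for p in ratings) / batch_size
        grad_a = grad_b = 0.0
        for z, p in zip(zs, ratings):
            # Backward pass through D (D is not updated): dloss/dx = (p - 1) * w.
            dx = (p - 1.0) * self.d["w"] / batch_size
            # Backward pass through G: dx/da = z, dx/db = 1.
            grad_a += dx * z
            grad_b += dx
        self.g["a"] -= self.lr * grad_a
        self.g["b"] -= self.lr * grad_b
        return loss

    def train_d(self, real_batch):
        # As many fakes as reals, merged into one batch (real=1, fake=0).
        fakes = self.generate(self.noise(len(real_batch)))
        xs = list(real_batch) + fakes
        labels = [1] * len(real_batch) + [0] * len(fakes)
        ratings = self.rate(xs)
        loss = sum(cross_entropy(p, y) for p, y in zip(ratings, labels)) / len(xs)
        grad_w = grad_c = 0.0
        for x, p, y in zip(xs, ratings, labels):
            dlogit = (p - y) / len(xs)  # gradient of the loss w.r.t. D's logit
            grad_w += dlogit * x
            grad_c += dlogit
        self.d["w"] -= self.lr * grad_w
        self.d["c"] -= self.lr * grad_c
        return loss


def train(gan, sample_real, steps=100, batch_size=16, d_steps=1):
    """Alternate one G batch with d_steps (>= 1) D batches; returns loss pairs."""
    history = []
    for _ in range(steps):
        g_loss = gan.train_g(batch_size)
        for _ in range(d_steps):
            d_loss = gan.train_d(sample_real(batch_size // 2))
        history.append((g_loss, d_loss))
    return history


if __name__ == "__main__":
    data_rng = random.Random(1)
    gan = ToyGAN()
    # Hypothetical "training set": scalars drawn from a Gaussian around 3.0.
    history = train(gan, lambda n: [data_rng.gauss(3.0, 0.5) for _ in range(n)])
    print("last losses:", history[-1], "G params:", gan.g)
```

Raising `d_steps` corresponds to giving D more iterations per batch of G when it cannot keep up. With real MLPs the hand-written gradients would be replaced by an autodiff framework, but the order of the steps stays the same.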
### Results
* Good-looking images of MNIST digits and human faces. (Grayscale, rather homogeneous datasets.)
* Not-so-good-looking images of CIFAR-10. (Color, rather heterogeneous dataset.)
![Generated Faces](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Adversarial_Networks__faces.jpg?raw=true "Generated Faces")
*Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)*
-------------------------
### Rough chapter-wise notes
* Introduction
  * Discriminative models performed well so far, generative models not so much.
  * Their suggested new architecture involves a generator and a discriminator.
    * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content.
    * Analogy: Generator produces counterfeit art, discriminator's job is to judge whether a piece of art is a counterfeit.
  * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator as well as the discriminator.
* Adversarial Nets
  * They have a Generator G (simple neural net)
    * G takes a random vector as input (e.g. vector of 100 random values between -1 and +1).
    * G creates an image as output.
  * They have a Discriminator D (simple neural net)
    * D takes an image as input (can be real or generated by G).
    * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake").
  * Outputs from G are fed into D. The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), so to create a high value of D(image).
  * D is trained to produce 1s for real images and 0s for images from G.
  * Both are trained in alternation, i.e. one batch for G, then one batch for D, then one batch for G...
    * D can also be trained multiple times in a row. That allows it to catch up with G.
* Theoretical Results
  * Let
    * pd(x): Probability that image `x` appears in the training set.
    * pg(x): Probability that image `x` appears in the images generated by G.
  * If G is now fixed then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))`
  * It is provable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set probability distribution. (Assuming unlimited capacity of the models and unlimited training time.)
  * It is provable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.)
  * Note that these things are proven for the general principle of GANs. Implementing GANs with neural nets can then introduce problems typical for neural nets (e.g. getting stuck in saddle points).
* Experiments
  * They tested on MNIST, Toronto Face Database (TFD) and CIFAR-10.
  * They used MLPs for G and D.
    * G contained ReLUs and Sigmoids.
    * D contained Maxouts.
    * D had Dropout, G didn't.
  * They use a Parzen Window Estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images.
    * They note that KDE is not really a great technique for such high dimensional spaces, but it's the best one known to them.
  * Results on MNIST and TFD are great. (Note: both grayscale)
  * CIFAR-10 seems to match more the texture but not really the structure.
  * Noise is noticeable in CIFAR-10 (a bit in TFD too). Comes from MLPs (no convolutions).
  * Their KDE score for MNIST and TFD is competitive with or better than other approaches.
* Advantages and Disadvantages
  * Advantages
    * No Markov Chains, only backprop
    * Inference-free training
    * Wide variety of functions can be incorporated into the model (?)
    * Generator never sees any real example. It only gets gradients. (Prevents overfitting?)
    * Can represent a wide variety of distributions, including sharp ones (methods based on Markov chains need somewhat blurry distributions so that the chains can mix between modes).
  * Disadvantages
    * No explicit representation of the distribution modeled by G (?)
    * D and G must be well synchronized during training
      * If G is trained too much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output (the "Helvetica scenario")
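The optimal-discriminator formula from the theoretical results can be checked numerically on a tiny discrete example. The sketch below assumes the value function from the paper, `V(D, G) = E_pd[log D(x)] + E_pg[log(1 - D(x))]`, written as a finite sum over three hypothetical "images". The function names `value` and `optimal_d` and the probability vectors are made up for illustration.

```python
import math


def value(d, pd, pg):
    """V(D, G) = sum_x pd(x) * log D(x) + pg(x) * log(1 - D(x)) over finitely many x."""
    return sum(p * math.log(dx) + q * math.log(1.0 - dx)
               for dx, p, q in zip(d, pd, pg))


def optimal_d(pd, pg):
    """Best response of D to a fixed G: D(x) = pd(x) / (pd(x) + pg(x))."""
    return [p / (p + q) for p, q in zip(pd, pg)]


if __name__ == "__main__":
    pd = [0.5, 0.3, 0.2]  # hypothetical training set distribution
    pg = [0.2, 0.3, 0.5]  # hypothetical generator distribution

    d_star = optimal_d(pd, pg)
    best = value(d_star, pd, pg)

    # Shifting D away from d_star in either direction lowers V.
    for delta in (-0.05, 0.05):
        shifted = [dx + delta for dx in d_star]
        print(delta, value(shifted, pd, pg) < best)

    # If G matches the data exactly, D* is 0.5 everywhere and V equals -log 4.
    matched = value(optimal_d(pd, pd), pd, pd)
    print(matched, -math.log(4))
```

For any G that does not match the data distribution, the value at the optimal D stays above `-log 4`, which is why the global optimum is reached only when pg equals pd.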