[link]
* GANs are based on adversarial training. * Adversarial training is a basic technique to train generative models (so here primarily models that create new images). * In an adversarial training one model (G, Generator) generates things (e.g. images). Another model (D, discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two. * Neural Networks are models that can be trained in an adversarial way (and are the only models discussed here). ### How * G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output. * D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid). * You need a training set of things to be generated, e.g. images of human faces. * Let the batch size be B. * G is trained the following way: * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (Number of values per components depends on the chosen input size of G.) * Feed forward the vectors through G to create new images. * Feed forward the images through D to create ratings. * Use a cross entropy loss on these ratings. All of these (fake) images should be viewed as label=0 by D. If D gives them label=1, the error will be low (G did a good job). * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel. * Perform a backward pass of these errors through G to train G. * D is trained the following way: * Create B/2 images using G (again, B/2 random vectors, feed forward through G). * Chose B/2 images from the training set. Real images get label=1. * Merge the fake and real images to one batch. Fake images get label=0. * Feed forward the batch through D. * Measure the error using cross entropy. * Perform a backward pass with the error through D. * Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with D, then you need more iterations of D per batch of G. ### Results * Good looking images MNIST-numbers and human faces. (Grayscale, rather homogeneous datasets.) * Not so good looking images of CIFAR-10. (Color, rather heterogeneous datasets.) ![Generated Faces](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Adversarial_Networks__faces.jpg?raw=true "Generated Faces") *Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)* ------------------------- ### Rough chapter-wise notes * Introduction * Discriminative models performed well so far, generative models not so much. * Their suggested new architecture involves a generator and a discriminator. * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content. * Analogy: Generator produces counterfeit art, discriminator's job is to judge whether a piece of art is a counterfeit. * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator as well as the discriminator. * Adversarial Nets * They have a Generator G (simple neural net) * G takes a random vector as input (e.g. vector of 100 random values between -1 and +1). * G creates an image as output. * They have a Discriminator D (simple neural net) * D takes an image as input (can be real or generated by G). * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake"). * Outputs from G are fed into D. The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), so to create a high value of D(image). * D is trained to produce only 1s for images from G. * Both are trained simultaneously, i.e. one batch for G, then one batch for D, then one batch for G... * D can also be trained multiple times in a row. That allows it to catch up with G. * Theoretical Results * Let * pd(x): Probability that image `x` appears in the training set. * pg(x): Probability that image `x` appears in the images generated by G. * If G is now fixed then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))` * It is proofable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set probability distribution. (Assuming unlimited capacity of the models and unlimited training time.) * It is proofable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.) * Note that these things are proofed for the general principle for GANs. Implementing GANs with neural nets can then introduce problems typical for neural nets (e.g. getting stuck in saddle points). * Experiments * They tested on MNIST, Toronto Face Database (TFD) and CIFAR-10. * They used MLPs for G and D. * G contained ReLUs and Sigmoids. * D contained Maxouts. * D had Dropout, G didn't. * They use a Parzen Window Estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images. * They note that KDE is not really a great technique for such high dimensional spaces, but its the only one known. * Results on MNIST and TDF are great. (Note: both grayscale) * CIFAR-10 seems to match more the texture but not really the structure. * Noise is noticeable in CIFAR-10 (a bit in TFD too). Comes from MLPs (no convolutions). * Their KDE score for MNIST and TFD is competitive or better than other approaches. * Advantages and Disadvantages * Advantages * No Markov Chains, only backprob * Inference-free training * Wide variety of functions can be incorporated into the model (?) * Generator never sees any real example. It only gets gradients. (Prevents overfitting?) * Can represent a wide variety of distributions, including sharp ones (Markov chains only work with blurry images). * Disadvantages * No explicit representation of the distribution modeled by G (?) * D and G must be well synchronized during training * If G is trained to much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output ("Helvetica")
Your comment:
|
[link]
# Generative Adversarial Nets ## Introduction * The paper proposes an adversarial approach for estimating generative models where one model (generative model) tries to learn a data distribution and another model (discriminative model) tries to distinguish between samples from the generative model and original data distribution. * [Link to the paper](https://arxiv.org/abs/1406.2661) ## Adversarial Net * Two models - Generative Model(*G*) and Discriminative Model(*D*) * Both are multi-layer perceptrons. * *G* takes as input a noise variable *z* and outputs data sample *x(=G(z))*. * *D* takes as input a data sample *x* and predicts whether it came from true data or from *G*. * *G* tries to minimise *log(1-D(G(z)))* while *D* tries to maximise the probability of correct classification. * Think of it as a minimax game between 2 players and the global optimum would be when *G* generates perfect samples and *D* can not distinguish between the samples (thereby always returning 0.5 as the probability of sample coming from true data). * Alternate between *k* steps of training *D* and 1 step of training *G* so that *D* is maintained near its optimal solution. * When starting training, the loss *log(1-D(G(z)))* would saturate as *G* would be weak. Instead maximise *log(D(G(z)))* * The paper contains the theoretical proof for global optimum of the minimax game. ## Experiments * Datasets * MNIST, Toronto Face Database, CIFAR-10 * Generator model uses RELU and sigmoid activations. * Discriminator model uses maxout and dropout. * Evaluation Metric * Fit Gaussian Parzen window to samples obtained from *G* and compare log-likelihood. ## Strengths * Computational advantages * Backprop is sufficient for training with no need for Markov chains or performing inference. * A variety of functions can be used in the model. * Since *G* is trained only using the gradients from *D*, fewer chances of directly copying features from the true data. * Can represent sharp (even degenerate) distributions. ## Weakness * *D* must be well synchronised with *G*. * While *G* may learn to sample data points that are indistinguishable from true data, no explicit representation can be obtained. ## Possible Extensions * Conditional generative models. * Inference network to predict *z* given *x*. * Implement a stochastic extension of the deterministic [Multi-Prediction Deep Boltzmann Machines](https://papers.nips.cc/paper/5024-multi-prediction-deep-boltzmann-machines.pdf) * Using discriminator net or inference net for feature selection. * Accelerating training by ensuring better coordination between *G* and *D* or by determining better distributions to sample *z* from during training. |
[link]
#### Problem addressed: learning data distribution using non-parametric models #### Summary: The authors propose learning generative model as playing an adversarial game, where one player is the generative model and the other is a discriminative model. The discriminative model learns to differentiate samples obtained from the generative model with true samples from the dataset, and the generative model tries to improve its data likelihood. They provide theoretical result that when one use such adversarial objective the global optimum of generative model will be the data distribution. Good generative performance is presented both quantitatively and qualitatively. #### Novelty: Treating learning generative model as playing a two player adversarial game, although such idea is kind of there in the training of energy based models, this paper is the first to successfully demonstrate this power. #### Drawbacks: As with any of the non-parametric generative models, the evaluation of data probability is not very easy. #### Datasets: MNIST, TFD, CIFAR-10 (only qualitative) #### Additional remarks: Presentation video available on cedar server #### Presenter: Yingbo Zhou |