[link]
This post is a comment on the Laplacian pyramid-based generative model proposed by researchers from NYU/Facebook AI Research. Let me start by saying that I really like this model, and I think - looking at the samples drawn - it represents a nice big step towards convincing generative models of natural images.

To summarise the model, the authors use the Laplacian pyramid representation of images, where you recursively decompose the image into a lower resolution, subsampled component and the high-frequency residual. The reason this decomposition is favoured in image processing is that the high-frequency residuals tend to be very sparse, so they are relatively easy to compress and encode. In this paper the authors propose using convolutional neural networks at each layer of the Laplacian pyramid representation to generate an image sequentially, increasing the resolution at each step. The convnet at each layer is conditioned on the lower resolution image and a noise component $z_k$, and generates a random higher resolution image. The process continues recursively until the desired resolution is reached. For training they use the adversarial objective function. Below is the main figure that explains how the generative model works; I encourage everyone to have a look at the paper for more details:

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-23-at-11-15-17.png)

#### An argument about Conditional Entropies

What I think is weird about the model is the precise amount of noise that is injected at each layer/resolution. In the schematic above, these are the $z_k$ variables. Adding the noise is crucial to defining a probabilistic generative process; this is how the model defines a probability distribution.

I think it's useful to think about entropies of natural images at different resolutions. When doing generative modelling or unsupervised learning, we want to capture the distribution of data. One important aspect of a probability distribution is its entropy, which measures the variability of the random quantity. In this case, we want to describe the statistics of the full resolution observed natural image $I_0$. (I borrow the authors' notation, where $I_0$ represents the highest resolution image and $I_k$ represents the $k$-times subsampled version.) Using the Laplacian pyramid representation, we can decompose the entropy of an image in the following way:

$$\mathbb{H}[I_0] = \mathbb{H}[I_{K}] + \sum_{k=0}^{K-1} \mathbb{H}[I_k\vert I_{k+1}].$$

The reason why the above decomposition holds is very simple. Because $I_{k+1}$ is a deterministic function of $I_{k}$ (subsampling), the conditional entropy $\mathbb{H}[I_{k+1}\vert I_{k}] = 0$. Therefore the joint entropy of the two variables is simply the entropy of the higher resolution image $I_{k}$, that is $\mathbb{H}[I_{k},I_{k+1}] = \mathbb{H}[I_{k}] + \mathbb{H}[I_{k+1}\vert I_{k}] = \mathbb{H}[I_{k}]$. So by induction, the joint entropy of all images $I_{k}$ is just the marginal entropy of the highest resolution image $I_0$. Applying the chain rule for joint entropies we get the expression above.
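As a sanity check of that decomposition, here is a toy numerical example (my own illustration, not from the post or the paper) for a two-level pyramid: $I_0$ is a 2x2 binary image and $I_1$ is its deterministically pooled 1x1 version, so $\mathbb{H}[I_1\vert I_0] = 0$ and the chain rule gives $\mathbb{H}[I_0] = \mathbb{H}[I_1] + \mathbb{H}[I_0\vert I_1]$:

```python
# Toy check of the entropy decomposition for K = 1: I_0 is a 2x2 binary image,
# I_1 = round(mean(I_0)) is a deterministic function of I_0.
from collections import Counter
from itertools import product
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# All 2x2 binary images, assumed equally likely.
images = list(product([0, 1], repeat=4))

def subsample(i0):
    """Deterministic subsampling: average-pool to 1x1 and round."""
    return round(sum(i0) / 4)

joint   = Counter((i0, subsample(i0)) for i0 in images)
marg_i0 = Counter(i0 for i0 in images)
marg_i1 = Counter(subsample(i0) for i0 in images)

H_i0 = entropy(marg_i0.values())      # 4 bits: 16 equally likely images
H_i1 = entropy(marg_i1.values())
H_joint = entropy(joint.values())     # equals H_i0, since I_1 is a function of I_0
H_i0_given_i1 = H_joint - H_i1        # chain rule: H[I_0|I_1] = H[I_0,I_1] - H[I_1]

print(H_i0, H_i1 + H_i0_given_i1)     # both print 4.0
```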
Now, the interesting bit is how the conditional entropies $\mathbb{H}[I_k\vert I_{k+1}]$ are 'achieved' in the Laplacian pyramid generative model paper. These entropies are provided by the injected random noise variables $z_k$: because $I_k$ is a deterministic function of $I_{k+1}$ and $z_k$, the information processing lemma gives $\mathbb{H}[I_k\vert I_{k+1}] \leq \mathbb{H}[z_k]$. The authors choose $z_k$ to be uniform random variables whose dimensionality grows with the resolution of $I_k$. To quote them: "The noise input $z_k$ to $G_k$ is presented as a 4th color plane to low-pass $l_k$, hence its dimensionality varies with the pyramid level." Therefore $\mathbb{H}[z_k] \propto 4^{-k}$, assuming that the pixel count quadruples at each layer. So the conditional entropy $\mathbb{H}[I_k\vert I_{k+1}]$ is allowed to grow exponentially with resolution, at the same rate it would grow if the images contained pure white noise. In other words, their model allows the per-pixel conditional entropy $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ to be constant across resolutions.

To me, this seems undesirable. My intuition is that, for natural images, $\mathbb{H}[I_k\vert I_{k+1}]$ may grow as $k$ decreases (because the dimensionality grows), but the per-pixel value $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ should decrease or converge to $0$ as the resolution increases. Very low-resolution subsampled natural images behave a little bit like white noise; there is a lot of variability in them. But as you increase the resolution, the probability distribution of the high-res image given the low-res image becomes a lot sharper.

In terms of model capacity this is not a problem, inasmuch as the convolutional models $G_{k}$ can choose to ignore some variance in $z_k$ and learn a more deterministic superresolution process. However, adding unnecessarily high entropy will almost certainly make fitting such a model harder. For example, the adversarial training process relies on sampling from $z_k$, and the procedure is pretty sensitive to sampling noise. If you make the distribution of $z_k$ unnecessarily high entropy, you will end up doing a lot of extra work during training until the network figures out to ignore the extra variance.

To solve this problem, I propose to keep the entropy of the noise vectors constant, or to make it grow sub-linearly with the number of pixels in the image. This perhaps makes the generative convnets harder to implement. Another quick solution would be to introduce dependence between components of $z_k$ via a low-rank covariance matrix, or some sort of a hashing trick.

#### Adversarial training vs superresolution autoencoders

Another weird thing is that the adversarial objective function forgets the identity of the image. For example, you would want your model to behave so that

`"if at the previous layer you have a low-resolution parrot, the next layer should be a higher-resolution parrot"`

Instead, what you get with the adversarial objective is

`"if at the previous layer you have a low-resolution parrot, the next layer should output a higher-resolution image that looks like a plausible natural image"`

So there is nothing in the objective function that enforces dependency between subsequent layers of the pyramid. I think if you made $G_k$ very complex, it could just learn to model natural images by itself, so that $I_{k}$ is in essence independent of $I_{k+1}$ and is purely driven by the noise $z_{k}$. You could sidestep this problem by restricting the complexity of the generative nets, or, again, by restricting the entropy of the noise. Overall, I think the approach would benefit from a combination of the adversarial and a supervised (superresolution autoencoder) objective function.
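To make that last suggestion concrete, here is a minimal sketch of what such a combined per-level generator loss could look like. This is my own PyTorch-flavoured illustration, not the paper's implementation: `G`, `D` and the weighting `lambda_rec` are placeholders, and the blur/upsample operator is approximated with bilinear interpolation.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, low_res, high_res, lambda_rec=10.0):
    """Adversarial + supervised superresolution loss for one pyramid level."""
    # Upsample the conditioning image to the target resolution (it will be blurry).
    blurry = F.interpolate(low_res, scale_factor=2, mode="bilinear",
                           align_corners=False)
    # Noise enters as a single extra plane, uniform in [-1, 1].
    z = torch.rand_like(blurry[:, :1]) * 2 - 1
    # G is conditioned on the blurry image plus the noise plane and predicts
    # the high-frequency residual.
    residual = G(torch.cat([blurry, z], dim=1))
    fake = blurry + residual

    # Adversarial term: make D (which sees the blurry image and the residual)
    # rate the generated residual as real.
    logits = D(torch.cat([blurry, residual], dim=1))
    adversarial = F.binary_cross_entropy_with_logits(logits,
                                                     torch.ones_like(logits))

    # Supervised "superresolution autoencoder" term: the sharpened output
    # should match the true high-resolution image it came from.
    reconstruction = F.mse_loss(fake, high_res)

    return adversarial + lambda_rec * reconstruction
```

The reconstruction term ties each output to the specific low/high-resolution pair it came from, which is exactly the parrot-to-parrot dependency that the adversarial term alone does not enforce.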
[link]
#### Problem addressed: Learn to generate sharp high resolution images

#### Summary: The authors apply generative adversarial networks (GANs) to image modeling. Instead of learning one model for a high resolution image directly, they learn it in a hierarchical way using a pyramid decomposition of the image. The image is decomposed into smaller versions that contain all the low frequency information, together with the corresponding high frequency residuals. This can also be regarded as a similar idea to lossless compression, where one has both the compressed version of the image and the corresponding error, so that once both are given the image can be reconstructed perfectly (see the short sketch after these notes). The quantitative and qualitative performance is impressive, and it is so far the best image generative model for high resolution images.

#### Novelty: Using the Laplacian pyramid decomposition, which makes the generation of sharp high resolution images possible

#### Drawbacks: Training is stagewise and slow, and the number of decomposition levels is fixed for all kinds of images.

#### Datasets: CIFAR-10, STL, LSUN

#### Additional remarks: Presentation video available on cedar server

#### Resources: Other relevant work to this one: Conditional Generative Adversarial Nets; Conditional generative adversarial nets for convolutional face generation

#### Presenter: Yingbo Zhou
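Below is a toy NumPy sketch of the decompose-and-reconstruct idea mentioned in the summary (my own illustration; simple block-mean downsampling and nearest-neighbour upsampling stand in for the paper's actual blur/subsample operators). Because the residual is defined as the difference to the upsampled low-pass image, reconstruction is exact:

```python
import numpy as np

def downsample(img):
    """Average 2x2 blocks (assumes even height/width)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour upsampling by a factor of 2."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels):
    """Return ([residual_0, ..., residual_{K-1}], coarsest image)."""
    residuals = []
    for _ in range(levels):
        low = downsample(img)
        residuals.append(img - upsample(low))
        img = low
    return residuals, img

def reconstruct(residuals, coarse):
    img = coarse
    for res in reversed(residuals):
        img = upsample(img) + res
    return img

img = np.random.rand(32, 32)
residuals, coarse = laplacian_pyramid(img, levels=3)
assert np.allclose(reconstruct(residuals, coarse), img)  # lossless reconstruction
```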
[link]
* The original GAN approach used one Generator (G) to generate images and one Discriminator (D) to rate these images.
* The laplacian pyramid GAN uses multiple pairs of G and D.
* It starts with an ordinary GAN that generates small images (say, 4x4).
* Each following pair learns to generate plausible upscalings of the image, usually by a factor of 2. (So e.g. from 4x4 to 8x8.)
* This scaling from coarse to fine resembles a laplacian pyramid, hence the name.

### How

* The first pair of G and D is just like an ordinary GAN.
* For each pair afterwards, G receives the output of the previous step, upscaled to the desired size. Due to the upscaling, the image will be blurry.
* G has to learn to generate a plausible sharpening of that blurry image.
* G outputs a difference image, not the full sharpened image.
* D receives the upscaled/blurry image. D also receives either the optimal difference image (for images from the training set) or G's generated difference image.
* D adds the difference image to the blurry image as its first step. Afterwards it applies convolutions to the image and ends in one sigmoid unit.
* The training procedure is just like in the ordinary GAN setting. Each upscaling pair of G and D can be trained on its own.
* The first G receives a "normal" noise vector, just like in the ordinary GAN setting. Later Gs receive noise as one plane, so each image has four channels: R, G, B, noise.

### Results

* Images are rated as looking more realistic than the ones from ordinary GANs.
* The approximated log likelihood is significantly improved compared to ordinary GANs.
* The generated images do however still look distorted compared to real images.
* They also tried to add class conditional information to G and D (just a one hot vector for the desired class of the image). G and D learned successfully to adapt to that information (e.g. to only generate images that seem to show birds).

![Sampling Process](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Deep_Generative_Image_Models_using_a_Laplacian_Pyramid_of_Adversarial_Networks__pyramid.png?raw=true "Sampling process")

*Basic training and sampling process. The first image is generated directly from noise. Everything afterwards is de-blurring of upscaled images. (A code sketch of this sampling loop appears at the end of these notes.)*

-------------------------

### Rough chapter-wise notes

* Introduction
  * Instead of just one big generative model, they build multiple ones.
  * They start with one model at a small image scale (e.g. 4x4) and then add multiple generative models that increase the image size (e.g. from 4x4 to 8x8).
  * This scaling from coarse to fine (low frequency to high frequency components) resembles a laplacian pyramid, hence the name of the paper.
* Related Works
  * Types of generative image models:
    * Non-Parametric: Models copy patches from training set (e.g. texture synthesis, super-resolution)
    * Parametric: E.g. Deep Boltzmann machines or denoising auto-encoders
    * Novel approaches: e.g. DRAW, diffusion-based processes, LSTMs
  * This work is based on (conditional) GANs
* Approach
  * They start with a Gaussian and a Laplacian pyramid.
  * They build the Gaussian pyramid by repeatedly decreasing the image height/width by 2: [full size image, half size image, quarter size image, ...]
  * They build a Laplacian pyramid by taking pairs of images in the gaussian pyramid, upscaling the smaller one and then taking the difference.
  * In the laplacian GAN approach, an image at scale k is created by first upscaling the image at scale k-1 and then adding a refinement to it (de-blurring). The refinement is created with a GAN that receives the upscaled image as input.
  * Note that the refinement is a difference image (between the upscaled image and the optimal upscaled image).
  * The very first (small scale) image is generated by an ordinary GAN.
  * D receives an upscaled image and a difference image. It then adds them together to create an upscaled and de-blurred image. Then D applies ordinary convolutions to the result and ends in a quality rating (sigmoid).
* Model Architecture and Training
  * Datasets: CIFAR-10 (32x32, 100k images), STL (96x96, 100k), LSUN (64x64, 10M)
  * They use a uniform distribution of [-1, 1] for their noise vectors.
  * For the upscaling Generators they add the noise as a fourth plane (to the RGB image).
  * CIFAR-10: 8->14->28 (height/width), STL: 8->16->32->64->96, LSUN: 4->8->16->32->64
  * CIFAR-10: G=3 layers, D=2 layers, STL: G=3 layers, D=2 layers, LSUN: G=5 layers, D=3 layers.
* Experiments
  * Evaluation methods:
    * Computation of log-likelihood on a held out image set
      * They use a Gaussian window based Parzen estimation to approximate the probability of an image (note: not very accurate).
      * They adapt their estimation method to the special case of the laplacian pyramid.
      * Their laplacian pyramid model seems to perform significantly better than ordinary GANs.
    * Subjective evaluation of generated images
      * Their model seems to learn the rough structure and color correlations of images to generate.
      * They add class conditional information to G and D. G indeed learns to generate different classes of images.
      * All images still have noticeable distortions.
    * Subjective evaluation of generated images by other people
      * 15 volunteers.
      * They show generated or real images in an interface for 50-2000ms. The volunteer then has to decide whether the image is fake or real.
      * 10k ratings were collected.
      * At 2000ms, around 50% of the generated images were considered real, ~90% of the true real ones and <10% of the images generated by an ordinary GAN.
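As referenced in the figure caption above, here is a rough sketch of the coarse-to-fine sampling loop these notes describe. It is PyTorch-flavoured and illustrative only: `G0`, `Gs`, the noise shapes and image sizes are placeholders for whatever the trained models expect, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sample(G0, Gs, batch_size=16, z_dim=100, device="cpu"):
    """Draw images by walking up the pyramid, coarse to fine."""
    # The coarsest image comes from an ordinary GAN generator fed a noise vector.
    z0 = torch.rand(batch_size, z_dim, device=device) * 2 - 1   # uniform in [-1, 1]
    img = G0(z0)                                                # e.g. (N, 3, 4, 4)

    # Every later generator sharpens an upscaled (blurry) version of the
    # previous image; noise is appended as a fourth plane next to R, G, B.
    for G in Gs:
        blurry = F.interpolate(img, scale_factor=2, mode="bilinear",
                               align_corners=False)
        z = torch.rand_like(blurry[:, :1]) * 2 - 1
        residual = G(torch.cat([blurry, z], dim=1))  # predicted difference image
        img = blurry + residual
    return img
```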