Conditional Image Generation with PixelCNN Decoders
Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu
arXiv e-Print archive, 2016
Keywords:
cs.CV, cs.LG
First published: 2016/06/16
Abstract: This work explores conditional image generation with a new image density
model based on the PixelCNN architecture. The model can be conditioned on any
vector, including descriptive labels or tags, or latent embeddings created by
other networks. When conditioned on class labels from the ImageNet database,
the model is able to generate diverse, realistic scenes representing distinct
animals, objects, landscapes and structures. When conditioned on an embedding
produced by a convolutional network given a single image of an unseen face, it
generates a variety of new portraits of the same person with different facial
expressions, poses and lighting conditions. We also show that conditional
PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally,
the gated convolutional layers in the proposed model improve the log-likelihood
of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet,
with greatly reduced computational cost.
#### Introduction
* The paper explores conditional image generation by adapting and improving the PixelCNN architecture.
* [Link to the paper](https://arxiv.org/abs/1606.05328)
#### Based on PixelRNN and PixelCNN
* Models the image pixel by pixel, decomposing the joint image distribution as a product of per-pixel conditionals: $p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$ for an $n \times n$ image (see the sampling sketch after this list).
* PixelRNN uses two-dimensional LSTM layers while PixelCNN uses masked convolutions.
* PixelRNN gives better results but PixelCNN is faster to train.
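This factorization dictates how sampling works: pixels are generated one at a time in raster-scan order, each drawn from its conditional given the pixels produced so far. Below is a minimal sketch of that loop, assuming a hypothetical trained `model` that maps a partially generated image to per-pixel logits over 256 intensity values (names and shapes are illustrative, not from the paper):

```python
import torch

@torch.no_grad()
def sample(model, batch=1, height=32, width=32):
    """Generate images pixel by pixel in raster-scan order.

    Each pixel x_i is drawn from p(x_i | x_1, ..., x_{i-1}); the model's
    masked convolutions ensure the logits at (r, c) depend only on the
    pixels that have already been generated.
    """
    img = torch.zeros(batch, 1, height, width)
    for r in range(height):
        for c in range(width):
            logits = model(img)                      # (batch, 256, H, W)
            probs = torch.softmax(logits[:, :, r, c], dim=-1)
            pixel = torch.multinomial(probs, num_samples=1)
            img[:, 0, r, c] = pixel.squeeze(-1).float() / 255.0
    return img
```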
#### Gated PixelCNN
* PixelRNN outperforms PixelCNN because its recurrent layers have an effectively unbounded receptive field and because they contain multiplicative units (LSTM gates) that allow modelling more complex interactions.
* To compensate for these two advantages, the paper uses deeper networks and gated activation units $y = \tanh(W_{k,f} \ast x) \odot \sigma(W_{k,g} \ast x)$ (equation 2 in the [paper](https://arxiv.org/abs/1606.05328)), respectively.
* The masked convolutions of the original PixelCNN also create a blind spot: a region of pixels above and to the right of the current pixel that never influences its prediction.
* This blind spot can be removed by combining two convolutional network stacks:
* Horizontal stack - conditions on the pixels to the left of the current pixel within the current row.
* Vertical stack - conditions on all rows above the current row.
* Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
* Residual connections are used in the horizontal stack but not in the vertical stack, as they did not improve results in the authors' initial experiments. A layer-level sketch of the two stacks follows this list.
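Below is a minimal sketch of one such gated layer, combining the two stacks with the gated activation of equation 2. Causality is implemented with asymmetric padding and cropping rather than explicit masks; the kernel size, module names, and strictly-causal (first-layer) masking are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPixelCNNLayer(nn.Module):
    """One gated layer with a vertical and a horizontal stack.

    The vertical stack sees all rows strictly above the current pixel; the
    horizontal stack sees pixels strictly to its left (mask-A behaviour, as
    in a first layer; inner layers would also include the current pixel).
    """
    def __init__(self, channels, k=5):
        super().__init__()
        self.k, self.kh = k, k // 2 + 1
        # Each stack outputs 2*channels: one half for tanh, one for sigmoid.
        self.v_conv = nn.Conv2d(channels, 2 * channels, (self.kh, k))
        self.v_to_h = nn.Conv2d(2 * channels, 2 * channels, 1)  # vertical -> horizontal link
        self.h_conv = nn.Conv2d(channels, 2 * channels, (1, self.kh))
        self.h_out = nn.Conv2d(channels, channels, 1)

    @staticmethod
    def gate(x):
        # Gated activation unit: y = tanh(x_f) * sigmoid(x_g)   (equation 2)
        f, g = x.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

    def forward(self, v_in, h_in):
        H, W = v_in.shape[-2:]
        # Vertical stack: pad the top so output row r only sees rows < r.
        v = self.v_conv(F.pad(v_in, (self.k // 2, self.k // 2, self.kh, 0)))[:, :, :H, :]
        # Horizontal stack: pad the left so output column c only sees columns < c,
        # then mix in the vertical stack's features.
        h = self.h_conv(F.pad(h_in, (self.kh, 0, 0, 0)))[:, :, :, :W]
        h = h + self.v_to_h(v)
        # Residual connection on the horizontal stack only.
        return self.gate(v), h_in + self.h_out(self.gate(h))
```

Because the vertical stack's context is a full rectangle of rows above, and the horizontal stack covers the current row, stacking such layers leaves no blind spot.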
#### Conditional PixelCNN
* Models the conditional distribution $p(\mathbf{x} \mid \mathbf{h}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1}, \mathbf{h})$ of the image given a high-level description encoded as a latent vector $\mathbf{h}$, which is added to the gated activations (equation 4 in the [paper](https://arxiv.org/abs/1606.05328)).
* This conditioning adds the same bias $V_k^T \mathbf{h}$ at every position, so it does not depend on the location of the pixel in the image.
* To consider the location as well, map $\mathbf{h}$ to a spatial representation $s = m(\mathbf{h})$ with a deconvolutional network $m$, and add it through unmasked 1×1 convolutions instead (equation 5 in the [paper](https://arxiv.org/abs/1606.05328)); a sketch of both variants follows.
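A minimal sketch of the conditional gated activation (equations 4 and 5), with illustrative names: the latent vector is projected with a linear map $V_k^T \mathbf{h}$ and broadcast over all spatial positions; swapping the linear layer for an unmasked 1×1 convolution applied to $s = m(\mathbf{h})$ gives the location-dependent variant:

```python
import torch
import torch.nn as nn

class ConditionalGate(nn.Module):
    """Gated activation with global conditioning on a latent vector h."""

    def __init__(self, channels, h_dim):
        super().__init__()
        self.proj = nn.Linear(h_dim, 2 * channels)  # V^T h, one bias per feature map

    def forward(self, x, h):
        # x: (B, 2*channels, H, W) pre-activation from a masked convolution.
        # The projected h is added identically at every spatial location;
        # a 1x1 Conv2d over s = m(h) would make it location-dependent.
        cond = self.proj(h)[:, :, None, None]
        f, g = (x + cond).chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)
```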
#### PixelCNN Auto-Encoders
* Start from a traditional auto-encoder architecture, replace the deconvolutional decoder with a conditional PixelCNN, and train the network end-to-end (see the sketch below).
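A wiring sketch of that idea, assuming `encoder` and `decoder` modules with the interfaces of the earlier snippets (both names are placeholders):

```python
import torch.nn as nn

class PixelCNNAutoEncoder(nn.Module):
    """Auto-encoder whose decoder is a conditional PixelCNN.

    The encoder compresses the image into a latent vector h; instead of a
    deconvolutional stack, a conditional PixelCNN reconstructs the image
    from h, trained end-to-end with the usual per-pixel cross-entropy.
    """
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder  # image -> latent vector h
        self.decoder = decoder  # (image, h) -> per-pixel logits

    def forward(self, x):
        h = self.encoder(x)
        # Teacher forcing: the decoder sees the ground-truth pixels through
        # its masked convolutions while being conditioned on h.
        return self.decoder(x, h)
```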
#### Experiments
* For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as well, while taking much less time to train.
* When conditioning on ImageNet class labels, the log-likelihood did not improve much, but the visual quality of the generated samples improved significantly.
* The paper also includes sample images generated by conditioning on embeddings of human portraits and by training a PixelCNN auto-encoder on ImageNet patches.