Pixel Recurrent Neural Networks on ShortScience.org

5

[link] Summary by Hugo Larochelle 9 years ago

This paper explores the use of convolutional (PixelCNN) and recurrent units (PixelRNN) for modeling the distribution of images, in the framework of autoregression distribution estimation. In this framework, the input distribution $p(x)$ is factorized into a product of conditionals $\Pi p(x_i | x_i-1)$. Previous work has shown that very good models can be obtained by using a neural network parametrization of the conditionals (e.g. see our work on NADE \cite{journals/jmlr/LarochelleM11}). Moreover, unlike other approaches based on latent stochastic units that are directed or undirected, the autoregressive approach is able to compute log-probabilities tractably.

So in this paper, by considering the specific case of x being an image, they exploit the topology of pixels and investigate appropriate architectures for this.

Among the paper's contributions are:

1. They propose Diagonal BiLSTM units for the PixelRNN, which are efficient (thanks to the use of convolutions) while making it possible to, in effect, condition a pixel's distribution on all the pixels above it (see Figure 2 for an illustration).

2. They demonstrate that the use of residual connections (a form of skip connections, from hidden layer i-1 to layer $i+1$) are very effective at learning very deep distribution estimators (they go as deep as 12 layers).

3. They show that it is possible to successfully model the distribution over the pixel intensities (effectively an integer between 0 and 255) using a softmax of 256 units.

4. They propose a multi-scale extension of their model, that they apply to larger 64x64 images.

The experiments show that the PixelRNN model based on Diagonal BiLSTM units achieves state-of-the-art performance on the binarized MNIST benchmark, in terms of log-likelihood. They also report excellent log-likelihood on the CIFAR-10 dataset, comparing to previous work based on real-valued density models. Finally, they show that their model is able to generate high quality image samples.

Your comment:

4

[link] Summary by Shagun Sodhani 9 years ago

#### Introduction

* Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
* [Link to the paper](https://arxiv.org/abs/1601.06759)

#### Model

* Scan the image, one row at a time and one pixel at a time (within each row).
* Given the scanned content, predict the distribution over the possible values for the next pixel.
* Joint distribution over the pixel values is factorised into a product of conditional distributions thus causing the problem as a sequence problem.
* Parameters used in prediction are shared across all the pixel positions.
* Since each pixel is jointly determined by 3 values (3 colour channels), each channel may be conditioned on other channels as well.

##### Pixel as discrete value

* The conditional distributions are multinomial (with channel variable taking 1 of 256 discrete values).
* This discrete representation is simpler and easier to learn.

#### Pixel RNN

##### Row LSTM

* Undirectional layer that processed image row by row.
* Uses one-dimensional convolution (kernel of size kx1, k>=3).
* Refer image 2 in the [paper](https://arxiv.org/abs/1601.06759).
* Weight sharing in convolution ensures translation invariance of computed feature along each row.
* For LSTM, the input-to-state component is computed for the entire 2-d input map and then is masked to include only the valid context.
* For equations related to state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759)

##### Diagonal BiLSTM

* Bidirectional layer that processes the image in the diagonal fashion.
* Input map skewed by offsetting each row of the image by one position with respect to the previous row.
* Refer image 3 in the [paper](https://arxiv.org/abs/1601.06759)
* For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with column wise convolution using kernel size 2x1.
* Kernel size of 2x1 processes minimal information yielding a highly non-linear computation.
* Output map is skewed back by removing the offset positions.
* To prevent layers from seeing further pixels, the right output map is shifted down by one row and added to left output map.

##### Residual Connections

* Residual connections (or skip connections) are used to increase convergence speed and to propagate signals more explicitly.
* Refer image 4 in the [paper](https://arxiv.org/abs/1601.06759)

##### Masked Convolutions

* Masks are used to enforce certain restrictions on the connections in the network (eg when predicting values for R channel, values of B channel can not be used).
* Mask A is applied to first convolution layer and restricts connections to only those neighbouring pixels and colour channels that have already been seen.
* Mask B is applied to all subsequent input-to-state convolution transactions and allows connections from a colour channel to itself.
* Refer image 4 in the [paper](https://arxiv.org/abs/1601.06759)

##### PixelCNN

* Uses multiple convolution layers that preserve spatial resolution.
* Makes receptive field large but not unbounded.
* Mask used to avoid seeing the future context.
* Faster that PixelRNN at training or evaluation time (as convolutions can be parallelized easily).

##### Multi-Scale PixelRNN

* Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
* Unconditional network generates a smaller s x s image which is fed as input to the conditional PixelRNN. (n is a multiple of s)
* Conditional PixelRNN is a standard PixelRNN with layers biased with an upsampled version of the s x s image.
* For upsampling, a convolution network with deconvolution layers constructs an enlarged feature map of size c x n x n.
* For biasing, the c x n x n map is mapped to 4hxnxn map (using 1x1 unmasked convolution) and added to input-to-state map.

#### Training and Evaluation

* Pixel values are dequantized using real-valued noise and log likelihood of continuous and discrete models are compared.
* Update rule - RMSProp
* Batch size - 16 for MNIST and CIFAR 10 and 32(or 64) for IMAGENET.
* Residual connections are as effective as Skip connections, in fact, the 2 can be used together as well.
* PixelRNN outperforms other models for Binary MNIST and CIFAR10.
* For CIFAR10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field for the 3 architectures and the observation underlines the importance of having a large receptive field.
* The paper also provides new benchmarks for generative image modelling on IMAGENET dataset.

Your comment:

4

[link] Summary by Alexander Jung 8 years ago

*Note*: This paper felt rather hard to read. The summary might not have hit exactly what the authors tried to explain.

* The authors describe multiple architectures that can model the distributions of images.
* These networks can be used to generate new images or to complete existing ones.
* The networks are mostly based on RNNs.

### How
* They define three architectures:
* Row LSTM:
* Predicts a pixel value based on all previous pixels in the image.
* It applies 1D convolutions (with kernel size 3) to the current and previous rows of the image.
* It uses the convolution results as features to predict a pixel value.
* Diagonal BiLSTM:
* Predicts a pixel value based on all previous pixels in the image.
* Instead of applying convolutions in a row-wise fashion, they apply them to the diagonals towards the top left and top right of the pixel.
* Diagonal convolutions can be applied by padding the n-th row with `n-1` pixels from the left (diagonal towards top left) or from the right (diagonal towards the top right), then apply a 3x1 column convolution.
* PixelCNN:
* Applies convolutions to the region around a pixel to predict its values.
* Uses masks to zero out pixels that follow after the target pixel.
* They use no pooling layers.
* While for the LSTMs each pixel is conditioned on all previous pixels, the dependency range of the CNN is bounded.
* They use up to 12 LSTM layers.
* They use residual connections between their LSTM layers.
* All architectures predict pixel values as a softmax over 255 distinct values (per channel). According to the authors that leads to better results than just using one continuous output (i.e. sigmoid) per channel.
* They also try a multi-scale approach: First, one network generates a small image. Then a second networks generates the full scale image while being conditioned on the small image.

### Results
* The softmax layers learn reasonable distributions. E.g. neighboring colors end up with similar probabilities. Values 0 and 255 tend to have higher probabilities than others, especially for the very first pixel.
* In the 12-layer LSTM row model, residual and skip connections seem to have roughly the same effect on the network's results. Using both yields a tiny improvement over just using one of the techniques alone.
* They achieve a slightly better result on MNIST than DRAW did.
* Their negative log likelihood results for CIFAR-10 improve upon previous models. The diagonal BiLSTM model performs best, followed by the row LSTM model, followed by PixelCNN.
* Their generated images for CIFAR-10 and Imagenet capture real local spatial dependencies. The multi-scale model produces better looking results. The images do not appear blurry. Overall they still look very unreal.

![Generated ImageNet images](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Pixel_Recurrent_Neural_Networks__imagenet_multiscale.png?raw=true "Generated ImageNet images")

*Generated ImageNet 64x64 images.*

![Image completion](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Pixel_Recurrent_Neural_Networks__occlusion.png?raw=true "Image completion")

*Completing partially occluded images.*

Your comment: