#### Introduction

* Problem: Building an expressive, tractable and scalable image model which can be used in downstream tasks like image generation, reconstruction, compression etc.
* [Link to the paper](https://arxiv.org/abs/1601.06759)

#### Model

* Scan the image one row at a time and, within each row, one pixel at a time.
* Given the scanned content, predict the distribution over the possible values for the next pixel.
* The joint distribution over the pixel values is factorised into a product of conditional distributions, thus casting the problem as a sequence problem.
* Parameters used in prediction are shared across all the pixel positions.
* Since each pixel is jointly determined by 3 values (3 colour channels), each channel is conditioned on the previously generated channels of the same pixel (R, then G, then B) as well as on all previously generated pixels.

##### Pixel as discrete value

* The conditional distributions are multinomial (with each channel variable taking 1 of 256 discrete values).
* This discrete representation is simpler and easier to learn.

#### Pixel RNN

##### Row LSTM

* Unidirectional layer that processes the image row by row.
* Uses a one-dimensional convolution (kernel of size k x 1, k >= 3).
* Refer image 2 in the [paper](https://arxiv.org/abs/1601.06759).
* Weight sharing in the convolution ensures translation invariance of the computed features along each row.
* For the LSTM, the input-to-state component is computed for the entire 2-d input map and is then masked to include only the valid context.
* For equations related to the state-to-state component, refer to equation 3 in the [paper](https://arxiv.org/abs/1601.06759).

##### Diagonal BiLSTM

* Bidirectional layer that processes the image in a diagonal fashion.
* The input map is skewed by offsetting each row of the image by one position with respect to the previous row.
* Refer image 3 in the [paper](https://arxiv.org/abs/1601.06759)
* For both directions, the input-to-state component is a 1 x 1 convolution while the state-to-state recurrent component is computed with a column-wise convolution using a kernel of size 2 x 1.
* A kernel of size 2 x 1 processes minimal information at each step, yielding a highly non-linear computation.
* The output map is skewed back by removing the offset positions.
* To prevent the layer from seeing future pixels, the right output map is shifted down by one row and added to the left output map.

##### Residual Connections

* Residual connections are used to increase convergence speed and to propagate signals more directly through the network; skip connections from each layer to the output can be used in addition.
* Refer image 4 in the [paper](https://arxiv.org/abs/1601.06759)

##### Masked Convolutions

* Masks are used to enforce certain restrictions on the connections in the network (e.g. when predicting the value of the R channel of a pixel, the values of the G and B channels of that pixel cannot be used).
* Mask A is applied to the first convolution layer and restricts connections to only those neighbouring pixels and colour channels that have already been predicted.
* Mask B is applied to all subsequent input-to-state convolutional transitions and relaxes mask A by also allowing connections from a colour channel to itself.
* Refer image 4 in the [paper](https://arxiv.org/abs/1601.06759)

##### PixelCNN

* Uses multiple convolution layers that preserve spatial resolution.
* Makes the receptive field large but not unbounded.
* Masks are used to avoid seeing the future context.
* Faster than PixelRNN at training or evaluation time (as convolutions can be parallelized easily); image generation is still sequential.

##### Multi-Scale PixelRNN

* Composed of one unconditional PixelRNN and multiple conditional PixelRNNs.
* The unconditional network generates a smaller s x s image, subsampled from the original n x n image (n is a multiple of s), which is fed as input to the conditional PixelRNN.
* The conditional PixelRNN is a standard PixelRNN whose layers are biased with an upsampled version of the s x s image.
* For upsampling, a convolutional network with deconvolution layers constructs an enlarged feature map of size c x n x n.
* For biasing, the c x n x n map is mapped to a 4h x n x n map (using a 1 x 1 unmasked convolution) and added to the input-to-state map.

#### Training and Evaluation

* Pixel values are dequantized using real-valued noise so that the log-likelihoods of continuous and discrete models can be compared.
* Update rule - RMSProp
* Batch size - 16 for MNIST and CIFAR-10 and 32 (or 64) for ImageNet.
* Residual connections are as effective as skip connections; the two can also be used together.
* PixelRNN outperforms other models on Binary MNIST and CIFAR-10.
* For CIFAR-10, Diagonal BiLSTM > Row LSTM > PixelCNN. This is also the order of receptive field size for the 3 architectures, and the observation underlines the importance of having a large receptive field.
* The paper also provides new benchmarks for generative image modelling on the ImageNet dataset.
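
The masking scheme described under Masked Convolutions can be made concrete with a small sketch. The following is a minimal NumPy illustration, not the paper's implementation: the function name `build_mask`, its arguments, the weight layout `(out_channels, in_channels, k, k)` and the choice to split channels into three contiguous colour groups (R, G, B) are all assumptions made for this example.

```python
import numpy as np


def build_mask(mask_type, out_channels, in_channels, kernel_size, n_groups=3):
    """Return a 0/1 mask for a conv weight of shape (out, in, k, k).

    Hypothetical helper: assumes channels are split into `n_groups`
    contiguous colour groups ordered R, G, B.
    """
    if mask_type not in ("A", "B"):
        raise ValueError("mask_type must be 'A' or 'B'")
    if kernel_size % 2 != 1:
        raise ValueError("kernel_size must be odd")

    mask = np.zeros(
        (out_channels, in_channels, kernel_size, kernel_size), dtype=np.float32
    )
    centre = kernel_size // 2

    # Pixels in rows above the current one have already been generated.
    mask[:, :, :centre, :] = 1.0
    # Pixels to the left in the current row have already been generated.
    mask[:, :, centre, :centre] = 1.0

    # Colour group (0 = R, 1 = G, 2 = B) of every output and input channel.
    out_group = np.arange(out_channels) * n_groups // out_channels
    in_group = np.arange(in_channels) * n_groups // in_channels

    # At the centre pixel, mask A only allows strictly earlier colours,
    # while mask B also allows a colour to connect to itself.
    if mask_type == "A":
        allowed = in_group[None, :] < out_group[:, None]
    else:
        allowed = in_group[None, :] <= out_group[:, None]
    mask[:, :, centre, centre] = allowed.astype(np.float32)
    return mask


mask_a = build_mask("A", out_channels=3, in_channels=3, kernel_size=3)
mask_b = build_mask("B", out_channels=3, in_channels=3, kernel_size=3)
```

In a masked convolution layer, the weight tensor would be multiplied element-wise by such a mask before every convolution, so the blocked connections never contribute to the output.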