ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

A Neural Algorithm of Artistic Style
Leon A. Gatys and Alexander S. Ecker and Matthias Bethge
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.NE, q-bio.NC
more

[link] Summary by Alexander Jung 8 years ago

* The paper describes a method to separate content and style from each other in an image.
* The style can then be transfered to a new image.
* Examples:
* Let a photograph look like a painting of van Gogh.
* Improve a dark beach photo by taking the style from a sunny beach photo.

### How
* They use the pretrained 19-layer VGG net as their base network.
* They assume that two images are provided: One with the *content*, one with the desired *style*.
* They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*.
* They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*.
* How to calculate a *Gram Matrix*:
* Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations.
* Convert each filter's activations to a (1-dimensional) vector.
* Pick all pairs of filters. Calculate the scalar product of both filter's vectors.
* Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128).
* Repeat that for every pair to get the Gram Matrix.
* The Gram Matrix roughly represents the *texture* of the image.
* Now you have the content representation (activations of a layer) and the style representation (Gram Matrices).
* Create a new image of the size of the content image. Fill it with random white noise.
* Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.)
* Make changes to the new image using gradient descent to optimize a loss function.
* The loss function has two components:
* The mean squared error between the new image's content representation and the previously extracted content representation.
* The mean squared error between the new image's style representation and the previously extracted style representation.
* Add up both components to get the total loss.
* Give both components a weight to alter for more/less style matching (at the expense of content matching).

![Examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples")

*One example input image with different styles added to it.*

-------------------------

### Rough chapter-wise notes

* Page 1
* A painted image can be decomposed in its content and its artistic style.
* Here they use a neural network to separate content and style from each other (and to apply that style to an existing image).

* Page 2
* Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style).
* They call the feature responses in higher layers *content representation*.
* To capture style information, they use a method that was originally designed to capture texture information.
* They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multi-scale representation of the style.

* Page 3
* They use VGG as their base CNN.

* Page 4
* Based on the extracted style features, they can generate a new image, which has equal activations in these style features.
* The new image should match the style (texture, color, localized structures) of the artistic image.
* The style features become more and more abtstract with higher layers. They call that multi-scale the *style representation*.
* The key contribution of the paper is a method to separate style and content representation from each other.
* These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork).

* Page 6
* The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.)
* Content and style can't be separated perfectly.
* Their loss function has two terms, one for content matching and one for style matching.
* The terms can be increased/decreased to match content or style more.

* Page 8
* Previous techniques work only on limited or simple domains or used non-parametric approaches (see non-photorealistic rendering).
* Previously neural networks have been used to classify the time period of paintings (based on their style).
* They argue that separating content from style might be useful and many other domains (other than transfering style of paintings to images).

* Page 9
* The style representation is gathered by measuring correlations between activations of neurons.
* They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do.
* They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for style-invariant classification.

* Page 9, Methods
* They use the 19 layer VGG net as their basis.
* They use only its convolutional layers, not the linear ones.
* They use average pooling instead of max pooling, as that produced slightly better results.

* Page 10, Methods
* The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features.
* The build a style representation by calculating Gram Matrices for each layer.

* Page 11, Methods
* The Gram Matrix is generated in the following way:
* Convert each filter of a convolutional layer to a 1-dimensional vector.
* For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters.
* Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix.
* Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices).
* That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting.

* Page 12, Methods
* To transfer the style of a painting to an existing image, proceed as follows:
* Start with a white noise image.
* Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting).
* Each distance (content, style) can be weighted to have more or less influence on the loss function.

arxiv.org
arxiv-vanity.com
scholar.google.com

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford and Luke Metz and Soumith Chintala
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV
more

[link] Summary by Shagun Sodhani 8 years ago

# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN) - a topologically constrained variant of conditional GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train
* Very useful to learn unsupervised image representations.

## Model

* GANs difficult to scale using CNNs.
* Paper proposes following changes to GANs:
* Replace any pooling layers with strided convolutions (for discriminator) and fractional strided convolutions (for generators).
* Remove fully connected hidden layers.
* Use batch normalisation in both generator (all layers except output layer) and discriminator (all layers except input layer).
* Use LeakyReLU in all layers of the discriminator.
* Use ReLU activation in all layers of the generator (except output layer which uses Tanh).

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with minibatch size of 128.
* Weights initialized with 0 centered Normal distribution with standard deviation = 0.02
* Adam Optimizer
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5

## Observations

* Large-Scale Scene Understanding data
* Demonstrates that model scales with more data and higher resolution generation.
* Even though it is unlikely that model would have memorized images (due to low learning rate of minibatch SGD).
* Classifying CIFAR-10 dataset
* Features
* Train in Imagenet-1K and test on CIFAR-10.
* Max pool discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
* Flatten and concatenate to get a 28672-dimensional vector.
* Linear L2-SVM classifier trained over the feature vector.
* 82.8% accuracy, outperforms K-means (80.6%)
* Street View House Number Classifier
* Similar pipeline as CIFAR-10
* 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Images in the latent space do not show sharp transitions indicating that network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* Networks seems to have some success in disentangling image representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.

arxiv.org
arxiv-vanity.com
scholar.google.com

Systematic evaluation of CNN advances on the ImageNet
Dmytro Mishkin and Nikolay Sergievskiy and Jiri Matas
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG
more

[link] Summary by Dmytro Mishkin 9 years ago

Authors test different variant of CNN architectures, non-linearities, poolings, etc. on ImageNet.

Summary:
-  use ELU non-linearity without batchnorm or ReLU with it.
-  apply a learned colorspace transformation of RGB (2 layers of 1x1 convolution ).
-  use the linear learning rate decay policy.
-  use a sum of the average and max pooling layers.
-  use mini-batch size around 128 or 256. If this is too big for your GPU,
decrease the learning rate proportionally to the batch size.
- use fully-connected layers as convolutional and average the predictions for
the final decision.
- when investing in increasing training set size, check if a plateau has not
been reach.
- cleanliness of the data is more important then the size.
- if you cannot increase the input image size, reduce the stride in the consequent
layers, it has roughly the same effect.
- if your network has a complex and highly optimized architecture, like e.g.
GoogLeNet, be careful with modifications.

arxiv.org
arxiv-vanity.com
scholar.google.com

Learning to learn by gradient descent by gradient descent
Marcin Andrychowicz and Misha Denil and Sergio Gomez and Matthew W. Hoffman and David Pfau and Tom Schaul and Nando de Freitas
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.NE, cs.LG
more

[link] Summary by Peter O'Connor 7 years ago

# Very Short

The authors propose **learning** an optimizer **to** optimally **learn** a function (the *optimizee*) which is being trained **by gradient descent**.  This optimizer, a recurrent neural network, is trained to make optimal parameter updates to the optimizee **by gradient descent**.

# Short

Let's suppose we have a stochastic function $f: \mathbb R^{\text{dim}(\theta)} \rightarrow \mathbb R^+$, (the *optimizee*) which we wish to minimize with respect to $\theta$.  Note that this is the typical situation we encounter when training a neural network with Stochastic Gradient Descent - where the stochasticity comes from sampling random minibatches of the data (the data is omitted as an argument here).  

The "vanilla" gradient descent update is: $\theta_{t+1} = \theta_t - \alpha_t \nabla_{\theta_t} f(\theta_t)$, where $\alpha_t$ is some learning rate.  Other optimizers (Adam, RMSProp, etc) replace the multiplication of the gradient by $-\alpha_t$ with some sort of weighted sum of the history of gradients.

This paper proposes to apply an optimization step $\theta_{t+1} = \theta_t + g_t$, where the update $g_t \in \mathbb R^{\text{dim}(\theta)}$ is defined by a recurrent network $m_\phi$: 

$$(g_t, h_{t+1}) := m_\phi (\nabla_{\theta_t} f(\theta_t), h_t)$$

Where in their implementation, $h_t \in \mathbb R^{\text{dim}(\theta)}$ is the hidden state of the recurrent network.  To make the number of parameters in the optimizer manageable, they implement their recurrent network $m$ as a *coordinatewise* LSTM (i.e. A set of $\text{dim}(\theta)$ small LSTMs that share parameters $\phi$).   They train the optimizer networks's parameters $\phi$ by "unrolling" T subsequent steps of optimization, and minimizing:

$$\mathcal L(\phi) := \mathbb E_f[f(\theta^*(f, \phi))]  \approx \frac1T \sum_{t=1}^T f(\theta_t)$$

Where $\theta^*(f, \phi)$ are the final optimizee parameters.  In order to avoid computing second derivatives while calculating $\frac{\partial \mathcal L(\phi)}{\partial \phi}$, they make the approximation $\frac{\partial}{\partial \phi}  \nabla_{\theta_t}f(\theta_t) \approx 0$ (corresponding to the dotted lines in the figure, along which gradients are not backpropagated).  

https://i.imgur.com/HMaCeip.png
**The computational graph of the optimization of the optimizer, unrolled across 3 time-steps.  Note that $\nabla_t := \nabla_{\theta_t}f(\theta_t)$.  The dotted line indicates that we do not backpropagate across this path.**

The authors demonstrate that their method usually outperforms traditional optimizers (ADAM, RMSProp, SGD, NAG), on a synthetic dataset, MNIST, CIFAR-10, and Neural Style Transfer.  They argue that their algorithm constitutes a form of transfer learning, since a pre-trained optimizer can be applied to accelerate training of a newly initialized network.

arxiv.org
arxiv-vanity.com
scholar.google.com

Conditional Image Generation with PixelCNN Decoders
Aaron van den Oord and Nal Kalchbrenner and Oriol Vinyals and Lasse Espeholt and Alex Graves and Koray Kavukcuoglu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV, cs.LG
more

[link] Summary by Shagun Sodhani 9 years ago

#### Introduction

* The paper explores the domain of conditional image generation by adopting and improving PixelCNN architecture.
* [Link to the paper](https://arxiv.org/abs/1606.05328)

#### Based on PixelRNN and PixelCNN

* Models image pixel by pixel by decomposing the joint image distribution as a product of conditionals.
* PixelRNN uses two-dimensional LSTM while PixelCNN uses convolutional networks.
* PixelRNN gives better results but PixelCNN is faster to train.

#### Gated PixelCNN

* PixelRNN outperforms PixelCNN due to the larger receptive field and because they contain multiplicative units, LSTM gates, which allow modelling more complex interactions.
* To account for these, deeper models and gated activation units (equation 2 in the [paper](https://arxiv.org/abs/1606.05328)) can be used respectively.
* Masked convolutions can lead to blind spots in the receptive fields.
* These can be removed by combining 2 convolutional network stacks:
* Horizontal stack - conditions on the current row.
* Vertical stack - conditions on all rows above the current row.
* Every layer in the horizontal stack takes as input the output of the previous layer as well as that of the vertical stack.
* Residual connections are used in the horizontal stack and not in the vertical stack (as they did not seem to improve results in the initial settings).

#### Conditional PixelCNN

* Model conditional distribution of image, given the high-level description of the image, represented using the latent vector h (equation 4 in the [paper](https://arxiv.org/abs/1606.05328))
* This conditioning does not depend on the location of the pixel in the image.
* To consider the location as well, map h to spatial representation $s = m(h)$ (equation 5 in the the [paper](https://arxiv.org/abs/1606.05328))

#### PixelCNN Auto-Encoders

* Start with a traditional auto-encoder architecture and replace the deconvolutional decoder with PixelCNN and train the network end-to-end.

#### Experiments

* For unconditional modelling, Gated PixelCNN either outperforms PixelRNN or performs almost as good and takes much less time to train.
* In the case of conditioning on ImageNet classes, the log likelihood measure did not improve a lot but the visual quality of the generated sampled was significantly improved.
* Paper also included sample images generated by conditioning on human portraits and by training a PixelCNN auto-encoder on ImageNet patches.