ShortScience.org - Making Science Accessible!


Wasserstein GAN
Martin Arjovsky and Soumith Chintala and Léon Bottou
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by MarvMind 8 years ago

This very new paper is currently receiving quite a bit of attention from the [community](https://www.reddit.com/r/MachineLearning/comments/5qxoaz/r_170107875_wasserstein_gan/).

The paper describes a new training approach, which solves the two major practical problems with current GAN training:

1) The training process comes with a meaningful loss. This can be used as a (soft) performance metric and helps with debugging, parameter tuning and so on.

2) The training process does not suffer from the usual instability problems. In particular, the paper reduces mode collapse significantly.

On top of that, the paper comes with quite a bit of mathematical theory, explaining why their approach works and why other approaches have failed. This paper is a must-read for anyone interested in GANs.
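To make the "meaningful loss" concrete, here is a minimal sketch of the quantity one would monitor, assuming a toy linear critic. The names `critic`, `wasserstein_estimate` and `clip_weights` are illustrative and do not come from the summary. The weight clipping step and its threshold follow the paper's algorithm rather than anything stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    # Toy linear critic f_w(x) = x . w (hypothetical stand-in for a neural network).
    return x @ w

def wasserstein_estimate(real, fake, w):
    # Mean critic score on real samples minus mean score on generated samples.
    # With a well-trained critic this tracks the Wasserstein distance up to a constant.
    return critic(real, w).mean() - critic(fake, w).mean()

def clip_weights(w, clip_value=0.01):
    # Weight clipping keeps the critic weights in a bounded box.
    return np.clip(w, -clip_value, clip_value)

real = rng.normal(loc=1.0, size=(256, 2))   # stand-in for real data
fake = rng.normal(loc=0.0, size=(256, 2))   # stand-in for generator samples
w = clip_weights(np.array([0.5, -0.3]))
estimate = wasserstein_estimate(real, fake, w)
```

Because the estimate is a difference of means, it is exactly zero when both batches are identical, which is part of what makes it usable as a rough progress indicator.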


Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Simonyan, Karen and Vedaldi, Andrea and Zisserman, Andrew
arXiv e-Print archive - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* The paper presents gradient computation based techniques to visualise image classification models.
* [Link to the paper](https://arxiv.org/abs/1312.6034)

#### Experimental Setup

* Single deep convNet trained on ILSVRC-2013 dataset (1.2M training images and 1000 classes).
* Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.

#### Class Model Visualisation

* Given a learnt ConvNet and a class (of interest), start with the zero image and perform optimisation by back propagating with respect to the input image (keeping the ConvNet weights constant).
* Add the mean image (for training set) to the resulting image.
* The paper used unnormalised class scores so that optimisation focuses on increasing the score of target class and not decreasing the score of other classes.
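The procedure above can be sketched as gradient ascent on the input. This is a minimal illustration, assuming a fixed linear scorer in place of the trained ConvNet. `W`, `b`, `mean_image`, the step size and the L2 weight `l2` are all made-up values for the example. The L2 penalty on the image is taken from the paper's objective, not from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_pixels = 3, 16

# Hypothetical stand-in for a trained ConvNet: a fixed linear scorer S(x) = W x + b.
W = rng.normal(size=(num_classes, num_pixels))
b = rng.normal(size=num_classes)
mean_image = rng.uniform(size=num_pixels)  # stand-in for the training-set mean image

def class_score(x, c):
    # Unnormalised class score (no softmax).
    return W[c] @ x + b[c]

def objective_gradient(x, c, l2=0.1):
    # Gradient of S_c(x) - l2 * ||x||^2 with respect to the input x.
    return W[c] - 2.0 * l2 * x

target = 1
x = np.zeros(num_pixels)                 # start from the zero image
for _ in range(200):
    # Gradient ascent on the input only; the model weights W, b stay fixed.
    x += 0.1 * objective_gradient(x, target)

visualisation = x + mean_image           # add the mean image back
```

With a real network the gradient would come from backpropagation rather than a closed form, but the loop structure is the same.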

#### Image-Specific Class Saliency Visualisation

* Given an image, class of interest, and trained ConvNet, rank the pixels of the input image based on their influence on class scores.
* Derivative of the class score with respect to image gives an estimate of the importance of different pixels for the class.
* The magnitude of the derivative also indicates which pixels need to be changed the least to affect the class score the most.

##### Class Saliency Extraction

* Find the derivative of the class score with respect to the input image.
* This yields one saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of derivative across all colour channels.
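The channel reduction in the last step can be sketched as follows, assuming the gradient has already been computed and is stored as an array of shape `(H, W, C)`. The function name `saliency_map` and the example values are illustrative.

```python
import numpy as np

def saliency_map(grad):
    # grad: derivative of the class score w.r.t. the image, shape (H, W, C).
    # Reduce to shape (H, W) by taking the maximum absolute value over channels.
    return np.abs(grad).max(axis=-1)

# Made-up 2x2 RGB gradient, purely for illustration.
grad = np.array([[[0.1, -0.7, 0.2], [0.0, 0.3, -0.1]],
                 [[-0.5, 0.4, 0.0], [0.2, 0.2, -0.9]]])
smap = saliency_map(grad)
```

Taking the absolute value first matters: a strongly negative derivative marks a pixel as just as influential as a strongly positive one.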

##### Weakly Supervised Object Localisation

* The saliency map for an image provides a rough encoding of the location of the object of the class of interest. 
* Given an image and its saliency map, an object segmentation map can be computed using GraphCut colour segmentation.
* Color continuity cues are needed as saliency maps might capture only the most dominant part of the object in the image.
* This weakly supervised approach achieves 46.4% top-5 error on the test set of ILSVRC-2013.

#### Relation to Deconvolutional Networks

* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in the way ReLU neurons are treated:
    * In DeconvNet, the sign indicator (for the derivative of ReLU) is computed on the output reconstruction, while in this paper the sign indicator is computed on the layer input.


Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by Alexander Jung 7 years ago

### What is BN:
  * Batch Normalization (BN) is a normalization method/layer for neural networks.
  * Usually inputs to neural networks are normalized either to the range [0, 1] or [-1, 1], or to mean=0 and variance=1. The latter is often loosely called *Whitening* (strictly speaking, whitening also decorrelates the components).
  * BN essentially applies this normalization to the intermediate layers of the network.

### How it's calculated:
  * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and $\text{var}(x)$ is its variance within a batch.
  * BN extends that formula further to $x^{**} = \gamma x^* + \beta$, where $x^{**}$ is the final normalized value. $\gamma$ and $\beta$ are learned parameters (one pair per component; per feature map for convolutions). They make sure that BN can learn the identity function, which is needed in a few cases.
  * For convolutions, every filter/kernel (i.e. every feature map) is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance).
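The formulas above can be sketched in a few lines of NumPy. This is a minimal training-time forward pass only, under some assumptions: the function name `batch_norm` and the `axes` argument are illustrative, and the small constant `eps` (added to the variance for numerical stability) does not appear in the formulas above.

```python
import numpy as np

def batch_norm(x, gamma, beta, axes, eps=1e-5):
    # Mean and variance are computed over `axes`, i.e. over everything
    # that counts as an "example" for a given component.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # x* in the notation above
    return gamma * x_hat + beta               # x** in the notation above

rng = np.random.default_rng(0)

# Linear layer: N=8 examples, 4 components; statistics over the N examples.
dense = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
dense_out = batch_norm(dense, gamma=np.ones(4), beta=np.zeros(4), axes=0)

# Convolution: N=8, 3 feature maps of size P x Q = 5 x 5;
# statistics over N*P*Q = 200 values per feature map.
conv = rng.normal(loc=-1.0, scale=0.5, size=(8, 3, 5, 5))
gamma = np.ones((1, 3, 1, 1))    # one gamma per feature map
beta = np.zeros((1, 3, 1, 1))    # one beta per feature map
conv_out = batch_norm(conv, gamma, beta, axes=(0, 2, 3))
```

With `gamma=1` and `beta=0` the output of each component has roughly zero mean and unit variance. Other values of `gamma` and `beta` rescale and shift it again, which is how the layer can represent the identity function.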

### Theoretical effects:
  * BN reduces *Covariate Shift*, i.e. the change in the distribution of a component's activations during training. With BN, each neuron's activation becomes (more or less) Gaussian-distributed, i.e. it's usually not active, sometimes a bit active, and rarely very active.
  * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions).
  * BN reduces effects of exploding and vanishing gradients, because every activation becomes roughly normally distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on.

### Practical effects:
  * BN reduces training times. (Because of less Covariate Shift, less exploding/vanishing gradients.)
  * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.)
  * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.)
  * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.)


![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations")

*BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 85th percentile and the bottom line is the 15th percentile.*

-------------------------

### Rough chapter-wise notes

* (2) Towards Reducing Covariate Shift
  * Batch Normalization (*BN*) is a special normalization method for neural networks.
  * In neural networks, the inputs to each layer depend on the outputs of all previous layers.
  * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*.
  * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions).
  * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network.
  * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance).
  * That accomplishes:
    * No more covariate shift.
    * Fixes problems with vanishing gradients due to saturation.
  * Effects:
    * Networks learn faster. (As they don't have to adjust to covariate shift any more.)
    * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.)
    * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.)
    * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.)
    * BN reduces the need for dropout. (As it has a regularizing effect.)
  * How BN works:
    * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*.
    * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change.
    * A proper method has to include the current example *and* all previous examples in the normalization step.
    * This leads to calculating a covariance matrix and its inverse square root. That's expensive. The authors found a faster way.

* (3) Normalization via Mini-Batch Statistics
  * Each feature (component) is normalized individually. (Due to cost, differentiability.)
  * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))`
  * Normalizing each component can reduce the expressivity of nonlinearities. Hence the formula is changed so that it can also learn the identity function.
  * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component)
  * E and Var are estimated for each mini batch.
  * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left).

* (3.1) Training and Inference with Batch-Normalized Networks
  * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training.
  * During test time, the BN formulas can be simplified to a single linear transformation.

* (3.2) Batch-Normalized Convolutional Networks
  * The authors recommend placing BN layers after linear/fully-connected layers and before the nonlinearities.
  * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian.
  * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason).
  * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta).
  * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*p\*q, where m is the number of examples in the batch and p and q are the feature map dimensions (height, width). BN for linear layers has a batch size of m.
  * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.)

* (3.3) Batch Normalization enables higher learning rates
  * BN normalizes activations.
  * Result: Changes to early layers don't amplify towards the end.
  * BN makes it less likely to get stuck in the saturating parts of nonlinearities.
  * BN makes training more resilient to parameter scales.
  * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions.
  * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used.
  * (something about singular values and the Jacobian)

* (3.4) Batch Normalization regularizes the model
  * Usually: Examples are seen on their own by the network.
  * With BN: Examples are seen in conjunction with other examples (mean, variance).
  * Result: Network can't easily memorize the examples any more.
  * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength.

* (4) Experiments
* (4.1) Activations over time
  * They tested BN on MNIST with a small fully connected network (three hidden layers of 100 units each, followed by a 10-way output). (One network with BN before each nonlinearity, another network without BN for comparison.)
  * Batch Size was 60.
  * The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training.
  * Generalization of the BN network seemed to be better.

* (4.2) ImageNet classification
  * They applied BN to the Inception network.
  * Batch Size was 32.
  * During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation.
  * They shuffle the data during training (i.e. each batch contains different examples).
  * Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate).
  * BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU).
  * An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet.
* (5) Conclusion
  * BN is similar to a normalization layer suggested by Gülçehre and Bengio. However, they applied it to the outputs of nonlinearities.
  * They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function).
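The note in (3.1) that BN reduces to a single linear transformation at test time can be sketched as follows. This is a minimal illustration, assuming the statistics have already been estimated. The names `fold_batch_norm`, `running_mean` and `running_var`, the constant `eps` and all numeric values are made up for the example.

```python
import numpy as np

def fold_batch_norm(gamma, beta, running_mean, running_var, eps=1e-5):
    # With fixed statistics, gamma * (x - mean) / sqrt(var + eps) + beta
    # is just scale * x + shift with the constants below.
    scale = gamma / np.sqrt(running_var + eps)
    shift = beta - scale * running_mean
    return scale, shift

# Made-up per-component parameters and statistics.
gamma = np.array([1.5, 0.5])
beta = np.array([0.1, -0.2])
running_mean = np.array([2.0, -1.0])
running_var = np.array([4.0, 0.25])

scale, shift = fold_batch_norm(gamma, beta, running_mean, running_var)

x = np.array([[0.3, -1.2], [2.0, 0.7]])   # two test examples
y = scale * x + shift                      # single linear transformation
```

Since `scale` and `shift` are constants at test time, they can also be merged into the weights and bias of the preceding linear layer, so the BN layer adds no inference cost.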


SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
Lantao Yu and Weinan Zhang and Jun Wang and Yong Yu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.AI

[link] Summary by Jon Gauthier 8 years ago

Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the first-thing-to-try:

1. Build a recurrent generator model which samples from its softmax outputs at each timestep.
2. Pass sampled sequences to a discriminator model (a convolutional network in the paper) which distinguishes between sampled sequences and real-data sequences.
3. Train the discriminator under the standard GAN loss.
4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator.

Sounds hacky, right? We're learning a generator with a high-variance model-free reinforcement learning algorithm, in a very seriously non-stationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.)

There's just one trick in this paper on top of that setup: for non-terminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: (under stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$ and discriminator $D(\tau)$)

$$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$

The rewards are estimated via Monte Carlo — i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.
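The Monte Carlo estimate of $r_t$ can be sketched as below. This is a toy illustration, not the paper's setup: the uniform `policy_sample`, the token-counting `discriminator`, the vocabulary and the sequence length are all hypothetical stand-ins chosen so that the expected reward is easy to check by hand.

```python
import random

VOCAB = [0, 1]
SEQ_LEN = 6

def policy_sample(prefix, rng):
    # Hypothetical generator policy: uniform over the vocabulary,
    # ignoring the prefix. A real policy would condition on it.
    return rng.choice(VOCAB)

def discriminator(seq):
    # Hypothetical discriminator score in [0, 1]: fraction of 1-tokens.
    return sum(seq) / len(seq)

def rollout(prefix, rng):
    # Complete the sequence by sampling from the policy until length SEQ_LEN.
    seq = list(prefix)
    while len(seq) < SEQ_LEN:
        seq.append(policy_sample(seq, rng))
    return seq

def mc_reward(prefix, n_rollouts, rng):
    # Terminal state: the reward is the discriminator score itself.
    if len(prefix) == SEQ_LEN:
        return discriminator(prefix)
    # Non-terminal state: average discriminator score over N rollouts.
    scores = [discriminator(rollout(prefix, rng)) for _ in range(n_rollouts)]
    return sum(scores) / n_rollouts

rng = random.Random(0)
reward = mc_reward([1, 1, 1], n_rollouts=2000, rng=rng)
```

For this toy setup the true expectation from the prefix `[1, 1, 1]` is (3 + 1.5) / 6 = 0.75, so the estimate should land close to that. A smaller `n_rollouts` gives a noisier estimate, which is the variance effect discussed above.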

---

Yep, so it turns out that this sort of works, with a big caveat:

## The big caveat

Graph from appendix:

![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%202016-09-27%20um%2021.34.44.png?raw=1)

SeqGANs don't work without supervised pretraining. Makes sense — with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1).

I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.



Universal representations: The missing link between faces, text, planktons, and cat breeds
Hakan Bilen and Andrea Vedaldi
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.CV, stat.ML

[link] Summary by Martin Thoma 8 years ago

This paper is about transfer learning for computer vision tasks.

## Contributions
* Before this paper, people focused on similar datasets (e.g. ImageNet-like images) or even the same dataset but a different task (classification -> segmentation). In this paper, the authors look at extremely different datasets (ImageNet-like vs text) but only one task (classification). They show that all layers can be shared (including the last classification layer) between datasets such as MNIST and CIFAR-10.
* Normalizing information is necessary for sharing models between datasets in order to compensate for dataset-specific differences. Domain-specific scaling parameters work well.

## Evaluation

* Used datasets:
  1. MNIST (10 classes: handwritten digits 0-9),
  2. SVHN (10 classes: house number digits, 0-9),
  3. [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) (10 classes: airplane, automobile, bird, ...)
  4. Daimler Mono Pedestrian Classification Benchmark (18 × 36 pixels)
  5. Human Sketch dataset (20000 human sketches of every day objects such as “book”, “car”, “house”, “sun”)
  6. German Traffic Sign Recognition (GTSR) Benchmark (43 traffic signs)
  7. Plankton imagery data (classification benchmark that contains 30336 images of various organisms ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies)
  8. Animals with Attributes (AwA): 30475 images of 50 animal species (for zero-shot learning)
  9. Caltech-256: object classification benchmark (256 object categories and an additional background class)
  10. Omniglot: 1623 different handwritten characters from 50 different alphabets (one shot learning)
* Images are resized to 64 × 64 pixels; greyscale ones are converted into RGB by setting the three channels to the same value
* Each dataset is also whitened, by subtracting its mean and dividing it by its standard deviation per channel
* **Architecture**: ResNet + Global Average Pooling + FC with Softmax
* "As the majority of the datasets have a different number of classes, we use a dataset-specific fully connected layer in our experiments unless otherwise stated."
* **Data augmentation**: The authors follow the data augmentation strategy of [[18]](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeZRS15): the 64 × 64 whitened image is padded with 8 pixels on all sides, and a 64 × 64 patch is randomly sampled from the padded image or its horizontal flip (except for MNIST / Omniglot / SVHN, as those contain text)
* **Training**: stochastic gradient descent with momentum
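The pad-crop-flip augmentation described in the list above can be sketched as follows. This is a minimal illustration under some assumptions: the function name `augment` and the `allow_flip` switch are made up, zero padding is assumed because the summary does not say which padding value is used, and the exception for text datasets is read as disabling the flip only.

```python
import numpy as np

def augment(image, rng, pad=8, allow_flip=True):
    # image: whitened image of shape (H, W, C).
    h, w, _ = image.shape
    # Pad `pad` pixels on all four sides (zero padding is an assumption).
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Optionally use the horizontal flip; disable for datasets containing text.
    if allow_flip and rng.random() < 0.5:
        padded = padded[:, ::-1, :]
    # Randomly sample an H x W patch from the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64, 3))
patch = augment(image, rng)                         # e.g. CIFAR-10
text_patch = augment(image, rng, allow_flip=False)  # e.g. MNIST / SVHN
```

The output always has the same shape as the input, so the augmented patches can be fed to the network without further resizing.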

Sharing strategies:

1. Baseline: Train networks for each dataset independently
2. Full sharing: For MNIST / SVHN / CIFAR-10, group classes randomly together so that Node 2 might be digit "7" for MNIST, digit "3" for SVHN and "aeroplane" for CIFAR-10. They are trained together in one network.
3. Deep sharing: Share all layers except the last one. Use all 10 datasets for this.
4. Partial sharing: Have a dataset-specific first part to compensate for different image statistics, but share the middle of the network.

The results seem to be inconclusive to me.


## Follow-up / related work