Summaries from International Conference on Learning Representations on ShortScience.org


Unrolled Generative Adversarial Networks
Luke Metz and Ben Poole and David Pfau and Jascha Sohl-Dickstein
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, stat.ML

Summary by CodyWild 7 years ago

If you’ve ever read a paper on Generative Adversarial Networks (from now on: GANs), you’ve almost certainly seen the authors refer to the scourge upon the land of GANs that is mode collapse. When a generator succumbs to mode collapse, instead of modeling the full distribution of input data, it chooses one region where there is a high density of data and puts all of its generated probability mass there. Then, on the next round, the discriminator pushes strongly away from that region (since it is now majority-occupied by fake data), and the generator finds a new mode.

In the view of the authors of the Unrolled GANs paper, one reason why this happens is that, in the typical GAN, at each round the generator implicitly assumes that it’s optimizing itself against the final and optimal discriminator. And so it makes its best move given that assumption, which is to put all its mass on a region the discriminator assigns high probability. Unfortunately for our short-sighted robot friend, this isn’t a one-round game, and this mass-concentrating strategy gives the discriminator a really good way to find fake data during the next round: just dramatically downweight how likely you think data is in the generator’s prior-round sweet spot, which its heavy concentration allows you to do without impacting your assessment of other data. Unrolled GANs operate on this key question: what if we could give the generator the ability to be less short-sighted, and make moves that aren’t just optimizing for the present but are also defensive against the future, in ways that will hopefully tamp down on the running-around-in-circles dynamic described above? If the generator were incentivized not only to make moves that fool the current discriminator, but also to make moves that make it harder for the next-step discriminator to tell its samples apart from real data, the hope is that it would spread out its mass more and be less likely to fall into the hole of mode collapse.

This intuition is realized in Unrolled GANs through a mathematical approach that is admittedly a little complex for this discussion format. Essentially, in addition to the typical GAN loss (which is based on the current values of the generator and discriminator), this model also takes one “step forward” of the discriminator (it calculates what the new parameters of the discriminator would be if it took one update step) and backpropagates through that step. The loss under the next-step discriminator parameters is a function of both the current generator and the next-step parameters, which themselves depend on how the discriminator reacts to the current generator. When you take the gradient with respect to the generator of both of these things, you get something very like the ideal we described earlier: a generator that is trying to put its mass into areas the current discriminator sees as high-probability, but that also changes its parameters so as to give the discriminator a less effective response strategy.
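To make the "backpropagate through the discriminator step" idea concrete, here is a minimal sketch on a toy two-parameter game. The objective `f`, the scalar parameters `theta` and `phi`, and the learning rate are all assumptions chosen for illustration; none of them come from the paper. The point is only that the unrolled gradient has two parts: the usual direct term, plus a term that flows through the discriminator's update.

```python
# Toy game (an assumption for illustration, not the paper's model):
#   f(theta, phi) = theta * phi - 0.5 * phi**2
# The "discriminator" ascends f in phi; the "generator" descends f in theta.

def f(theta, phi):
    return theta * phi - 0.5 * phi ** 2

def disc_step(theta, phi, lr):
    # One gradient-ascent step of the discriminator: df/dphi = theta - phi.
    return phi + lr * (theta - phi)

def standard_grad(theta, phi):
    # Generator gradient with the discriminator held fixed: df/dtheta = phi.
    return phi

def unrolled_grad(theta, phi, lr):
    # Gradient of f(theta, phi1(theta)), where phi1 is the discriminator
    # after one look-ahead step that itself depends on theta.
    phi1 = disc_step(theta, phi, lr)
    direct = phi1                        # df/dtheta at fixed phi1
    through_step = (theta - phi1) * lr   # (df/dphi1) * (dphi1/dtheta)
    return direct + through_step

def numeric_unrolled_grad(theta, phi, lr, h=1e-6):
    # Central finite difference of the unrolled objective, as a sanity check.
    up = f(theta + h, disc_step(theta + h, phi, lr))
    down = f(theta - h, disc_step(theta - h, phi, lr))
    return (up - down) / (2 * h)
```

In this toy setting the extra `through_step` term is what lets the generator account for the discriminator's reaction; in a real GAN the same term would be obtained by automatic differentiation through the unrolled optimizer step rather than by hand.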

https://i.imgur.com/0eEjm0g.png

Empirically, Unrolled GANs do quite a good job at their stated aim of reducing mode collapse, and the unrolled training procedure is now a common building-block technique used in other papers.


Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning
Lotter, William and Kreiman, Gabriel and Cox, David D.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Peter O'Connor 7 years ago

# Very Short

The authors propose a deep, recurrent, convolutional architecture called PredNet, inspired by the idea of predictive coding from neuroscience.  In PredNet, the first layer attempts to predict the input frame, based on past frames and input from higher layers.  The next layer then attempts to predict the *prediction error* of the first layer, and so on.  The authors show that such an architecture can predict future frames of video, and predict the parameters of synthetically generated video, better than a conventional recurrent autoencoder.

# Short

## The Model
PredNet has the following architecture:

https://i.imgur.com/7vOcGwI.png

where the R blocks are recurrent neural networks and the A blocks are convolutional layers.  $E_l$ indicates the prediction error at layer $l$.  The network is trained on snippets of video, and the loss is given as:

$L_{train} = \sum_{t=1}^T \sum_l \frac{\lambda_l}{n_l} \sum_{i=1}^{2n_l} [E_l^t]_i$

where $t$ indexes the time step, $l$ indexes the layer, $n_l$ is the number of units in the layer, $E_l^t = [ReLU(A_l^t-\hat A_l^t) ; ReLU(\hat A_l^t - A_l^t) ]$ is the concatenation of the positive and negative components of the error, and $\lambda_l$ is a hyperparameter determining the effect that the layer-$l$ error should have on the loss.

In the experiments, they use two settings for the $\lambda_l$ hyperparameters.  In the "$L_0$" setting, they set $\lambda_0=1, \lambda_{>0}=0$, which ends up being optimal when trying to optimize next-frame L1 error.  In the "$L_{all}$" setting, they use $\lambda_0=1, \lambda_{>0}=0.1$, which, in the synthetic-images experiment, seems to be better at predicting the parameters of the synthetic-image generator. 
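As a rough illustration of the loss above, the following sketch computes the rectified error units and the weighted sum for plain Python lists. The function names and the list-of-lists layout (`targets[t][l]` holding the activations $A_l^t$) are assumptions made for this example; a real implementation would operate on convolutional feature maps.

```python
def relu(x):
    return max(0.0, x)

def error_units(a, a_hat):
    # E = [ReLU(A - A_hat); ReLU(A_hat - A)]: positive part, then negative part.
    positive = [relu(x - y) for x, y in zip(a, a_hat)]
    negative = [relu(y - x) for x, y in zip(a, a_hat)]
    return positive + negative

def prednet_loss(targets, predictions, lambdas):
    # targets[t][l] and predictions[t][l] are flat lists of unit activations
    # for time step t and layer l; lambdas[l] weights layer l.
    total = 0.0
    for target_t, prediction_t in zip(targets, predictions):
        for a, a_hat, lam in zip(target_t, prediction_t, lambdas):
            n_l = len(a)
            total += lam / n_l * sum(error_units(a, a_hat))
    return total

# One time step, two layers (toy numbers).
targets = [[[1.0, 0.0], [1.0]]]
predictions = [[[0.0, 2.0], [0.0]]]
loss_l0 = prednet_loss(targets, predictions, lambdas=[1.0, 0.0])   # "L_0" setting
loss_all = prednet_loss(targets, predictions, lambdas=[1.0, 0.1])  # "L_all" setting
```

Because the two halves of `error_units` are both non-negative, their sum is just the L1 error at that layer, which is why the $L_0$ setting lines up with next-frame L1 error.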

## Results

They apply the model on two tasks: 

1) Predicting the future frames of a synthetic video generated by a graphics engine.  

Here they predict both the next frame (on which their $L_0$ model does best) and the parameters (face characteristics, rotation, angle) of the program that generates the synthetic faces (on which their $L_{all}$ model does best).  They predict the face-generating parameters by first training the model, then freezing the weights and regressing from the learned representations at a given layer to the parameters.  They show that both the $L_0$ and $L_{all}$ models outperform a more conventional recurrent autoencoder.

https://i.imgur.com/S8PpJnf.png
**Next-frame predictions on a sequence of faces (note: here, predictions are *not* fed back into the model to generate the next frame)**

2) Predicting future frames of video from dashboard cameras.  

https://i.imgur.com/Zus34Vm.png
**Next-frame predictions of dashboard-camera images**

The authors conclude that allowing higher layers to model *prediction errors*, instead of *abstract representations* can lead to better modeling of video.


Deep Information Propagation
Schoenholz, Samuel S. and Gilmer, Justin and Ganguli, Surya and Sohl-Dickstein, Jascha
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Léo Paillier 7 years ago

_Objective:_ Fundamental analysis of random networks using mean-field theory. Introduces two scales controlling network behavior.

## Results:

A guide for choosing hyper-parameters so that random networks are nearly critical (in between order and chaos). This in turn implies that information can propagate forward and backward, and thus that the network is trainable (no vanishing or exploding gradients).

Basically, for any given number of layers and initialization variances for weights and biases, it tells you whether the network will be trainable or not: a kind of architecture validation tool.

**To be noted:** any amount of dropout removes the critical point and therefore implies an upper bound on trainable network depth.
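The order/chaos criterion can be sketched numerically. The code below is a minimal illustration for a fully connected tanh network, assuming the standard mean-field recursion for the pre-activation variance, $q \leftarrow \sigma_w^2\,\mathbb{E}_z[\tanh^2(\sqrt{q}\,z)] + \sigma_b^2$ with $z \sim \mathcal{N}(0,1)$, and the quantity $\chi_1 = \sigma_w^2\,\mathbb{E}_z[\tanh'(\sqrt{q^*}\,z)^2]$ evaluated at the fixed point $q^*$. The function names, integration grid and iteration count are choices made for this example, not details from the paper or the summary.

```python
import math

def gaussian_expectation(fn, half_width=8.0, steps=1600):
    # Riemann-sum approximation of E[fn(z)] for z ~ N(0, 1).
    dz = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * dz
        total += fn(z) * math.exp(-0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)

def fixed_point_variance(sigma_w2, sigma_b2, iterations=200, q0=1.0):
    # Iterate the variance map layer by layer until it settles.
    q = q0
    for _ in range(iterations):
        s = math.sqrt(q)
        q = sigma_w2 * gaussian_expectation(lambda z: math.tanh(s * z) ** 2) + sigma_b2
    return q

def chi_1(sigma_w2, sigma_b2):
    # chi_1 < 1: ordered phase (signals and gradients shrink with depth);
    # chi_1 > 1: chaotic phase (they grow); chi_1 = 1: critical.
    s = math.sqrt(fixed_point_variance(sigma_w2, sigma_b2))
    # d/dx tanh(x) = 1 - tanh(x)^2
    return sigma_w2 * gaussian_expectation(lambda z: (1.0 - math.tanh(s * z) ** 2) ** 2)

ordered = chi_1(0.5, 0.0)   # small weight variance
chaotic = chi_1(4.0, 0.0)   # large weight variance
```

Scanning `chi_1` over a grid of weight and bias variances and keeping the settings where it is close to 1 gives the kind of "validation tool" the summary describes, under the assumptions stated above.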

## Caveats:

*   Considers only bounded activation units: no ReLU, etc.
*   Applies directly only to fully connected feed-forward networks: no convnets, etc.


FractalNet: Ultra-Deep Neural Networks without Residuals
Larsson, Gustav and Maire, Michael and Shakhnarovich, Gregory
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Alexander Jung 7 years ago

* They describe an architecture for deep CNNs that contains short and long paths. (Short = few convolutions between input and output, long = many convolutions between input and output)
* They achieve comparable accuracy to residual networks, without using residuals.

### How
* Basic principle:
    * They start with two branches. The left branch contains one convolutional layer, the right branch contains a subnetwork.
    * That subnetwork again contains a left branch (one convolutional layer) and a right branch (a subnetwork).
    * This creates a recursion.
    * At the last step of the recursion they simply insert two convolutional layers as the subnetwork.
    * Each pair of branches (left and right) is merged using a pair-wise mean. (Result: One of the branches can be skipped or removed and the result after the merge will still be sound.)
    * Their recursive expansion rule (left) and architecture (middle and right) visualized:
    ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/FractalNet_Ultra-Deep_Networks_without_Residuals__architecture.png?raw=true "Architecture")
* Blocks:
    * Each of the recursively generated networks is one block.
    * They chain five blocks in total to create the network that they use for their experiments.
    * After each block they add a max pooling layer.
    * Their first block uses 64 filters per convolutional layer, the second one 128, followed by 256, 512 and again 512.
* Drop-path:
    * They randomly drop out whole convolutional layers between merge-layers.
    * They define two methods for that:
        * Local drop-path: Drops each input to each merge layer with a fixed probability, but at least one always survives. (See image, first three examples.)
        * Global drop-path: Drops convolutional layers so that only a single column (and thereby path) in the whole network survives. (See image, right.)
    * Visualization:
    ![Drop-path](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/FractalNet_Ultra-Deep_Networks_without_Residuals__drop_path.png?raw=true "Drop-path")
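The merge-by-mean and local drop-path rules described above can be sketched in a few lines. This is a simplified illustration on flat lists of numbers, not the authors' implementation; the function name and the way the surviving branch is chosen when everything is dropped are assumptions.

```python
import random

def local_drop_path_join(branch_outputs, drop_prob, rng):
    # branch_outputs: one equally sized list of activations per incoming branch.
    # Each branch is dropped independently with probability drop_prob.
    keep = [rng.random() >= drop_prob for _ in branch_outputs]
    if not any(keep):
        # Guarantee that at least one input to the merge survives.
        keep[rng.randrange(len(branch_outputs))] = True
    survivors = [out for out, k in zip(branch_outputs, keep) if k]
    # Element-wise mean over the surviving branches only.
    return [sum(values) / len(survivors) for values in zip(*survivors)]

rng = random.Random(0)
left = [1.0, 2.0]    # e.g. output of the single-convolution branch
right = [3.0, 6.0]   # e.g. output of the deeper subnetwork
merged_all = local_drop_path_join([left, right], drop_prob=0.0, rng=rng)
merged_one = local_drop_path_join([left, right], drop_prob=1.0, rng=rng)
```

Because the merge is a mean rather than a sum, the output stays on the same scale whether one or both branches survive, which is what makes dropping a branch "sound".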

### Results
* They test on CIFAR-10, CIFAR-100 and SVHN with no or mild (crops, flips) augmentation.
* They add dropout at the start of each block (probabilities: 0%, 10%, 20%, 30%, 40%).
* For 50% of the batches they use local drop-path at 15%, and for the other 50% global drop-path.
* They achieve comparable accuracy to ResNets (a bit behind them actually).
    * Note: The best ResNet that they compare to is "ResNet with Identity Mappings". They don't compare to Wide ResNets, even though they perform best.
* If they use image augmentations, dropout and drop-path don't seem to provide much benefit (only small improvement).
* If they extract the deepest column and test on that one alone, they achieve nearly the same performance as with the whole network.
    * They conclude from this that their fractal architecture mainly serves to help that deepest column learn anything at all. (Without shorter paths it would just learn nothing due to vanishing gradients.)


Sample Efficient Actor-Critic with Experience Replay
Wang, Ziyu and Bapst, Victor and Heess, Nicolas and Mnih, Volodymyr and Munos, Rémi and Kavukcuoglu, Koray and de Freitas, Nando
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by abhishm 8 years ago

In many policy gradient algorithms, we update the parameters in an online fashion. We collect trajectories from a policy, use the trajectories to compute the gradient of the long-term cumulative reward with respect to the policy parameters, and update the policy parameters using this gradient. Note that we do not use these samples again after updating the policy. The main reason is that reusing them requires **importance sampling**, which suffers from high variance and can make learning unstable.

This paper proposes an update on **Asynchronous Advantage Actor Critic (A3C)** to incorporate off-line data (the trajectories collected using previous policies). 

**Incorporating offline data in Policy Gradient**
The offline data is incorporated using importance sampling. Specifically, let $J(\theta)$ denote the total reward under policy $\pi_\theta$; then, using the Policy Gradient Theorem,
$$
\nabla_{\theta} J(\theta) \propto \mathbb{E}_{x_t \sim \beta_\mu, a_t \sim \mu}[\rho_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] 
$$
where $\rho_t = \frac{\pi(a_t | x_t)}{\mu(a_t|x_t)}$ is called the importance sampling term and $\beta_\mu$ is the stationary probability distribution of states under the behavior policy $\mu$.

**Estimating $Q^{\pi}(x_t, a_t)$ in the above equation:** The authors used a *Retrace($\lambda$)* approach to estimate $Q^{\pi}$. Specifically, the action-values were computed using the following recursive equation:
$$
Q^{\text{ret}}(x_t, a_t) = r_t + \gamma \bar{\rho}_{t+1}\left(Q^{\text{ret}}(x_{t+1}, a_{t+1}) - Q(x_{t+1}, a_{t+1})\right) + \gamma V(x_{t+1})
$$
where $\bar{\rho}_t = \min\{c, \rho_t\}$ and $\rho_t$ is the importance sampling term. $Q$ and $V$ in the above equation are the current estimates of the action-value and state-value functions respectively.

To estimate $Q$, the authors used an architecture similar to A3C, except that the final layer outputs $Q$-values instead of state-values $V$.

To train $Q$, the authors used $Q^{\text{ret}}$ as the target.
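The recursion for $Q^{\text{ret}}$ is easiest to read when computed backward along a stored trajectory. The sketch below does that for plain Python lists; the function name, argument layout and the `bootstrap_value` argument (standing in for $V$ at the state after the last step, or 0 at a terminal state) are assumptions made for this example.

```python
def retrace_targets(rewards, rhos, q_values, v_values, bootstrap_value,
                    gamma=0.99, c=10.0):
    # rewards[t]  : r_t
    # rhos[t]     : importance weight pi(a_t|x_t) / mu(a_t|x_t)
    # q_values[t] : current estimate Q(x_t, a_t)
    # v_values[t] : current estimate V(x_t)
    targets = [0.0] * len(rewards)
    # 'carry' holds rho_bar_{t+1} * (Q_ret_{t+1} - Q_{t+1}) + V_{t+1};
    # past the end of the trajectory it is just the bootstrap value.
    carry = bootstrap_value
    for t in reversed(range(len(rewards))):
        targets[t] = rewards[t] + gamma * carry
        rho_bar = min(c, rhos[t])  # truncated importance weight
        carry = rho_bar * (targets[t] - q_values[t]) + v_values[t]
    return targets

# On-policy data (rho = 1) recovers the plain discounted return.
on_policy = retrace_targets([1.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.5, 0.5],
                            bootstrap_value=0.0, gamma=1.0)
# With rho = 0 the correction vanishes and the target falls back to r + gamma * V.
cut_off = retrace_targets([1.0, 1.0], [0.0, 0.0], [0.5, 0.5], [0.5, 0.5],
                          bootstrap_value=0.0, gamma=1.0)
```

The two toy calls show the two extremes: full trust in the sampled return when the behavior and target policies agree, and a one-step bootstrap when the sampled action is impossible under the target policy.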

**Reducing the variance due to importance sampling in the policy gradient above:** The authors used a technique called *importance weight truncation with bias correction* to keep the variance bounded in the policy gradient equation. Specifically, they use the following identity:
$$
\begin{array}{ccc}
&&\mathbb{E}_{x_t \sim \beta_\mu, a_t \sim \mu}[\rho_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] \\
&=& \mathbb{E}_{x_t \sim \beta_\mu}\Big[ \mathbb{E}_{a_t \sim \mu}[\bar{\rho}_t \nabla_{\theta} \log \pi(a_t | x_t) Q^{\pi}(x_t, a_t)] \\
&+& \mathbb{E}_{a\sim \pi}\left[\left[\frac{\rho_t(a) - c}{\rho_t(a)}\right]_+ \nabla_{\theta} \log\pi(a | x_t) Q^{\pi}(x_t, a)\right] \Big]
\end{array}
$$
where $[x]_+ = \max(0, x)$. Note that in the above identity the weights in both terms on the right-hand side are bounded (by $c$ in the first term and by 1 in the second), which keeps the variance bounded.

**Results:** The authors showed that by using the off-line data, they were able to match the performance of the best DQN agent with less data and the same amount of computation.

**Continuous tasks:** The authors used a stochastic dueling architecture for tasks with continuous action spaces, while reusing the innovations from the discrete case.


Pointer Sentinel Mixture Models
Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.AI

Summary by Denny Britz 8 years ago

TLDR; The authors combine a standard LSTM softmax with [Pointer Networks](https://arxiv.org/abs/1506.03134) in a mixture model called Pointer-Sentinel LSTM (PS-LSTM). The pointer network helps with rare words and long-term dependencies but is unable to refer to words that are not in the input. The opposite is the case for the standard softmax. By combining the two approaches we get the best of both worlds. The probability of an output word is defined as a mixture of the pointer and softmax models, and the mixture coefficient is calculated as part of the pointer attention. The authors evaluate their architecture on the PTB Language Modeling dataset, where they achieve state of the art. They also present a novel WikiText dataset that is larger and more realistic than PTB.

### Key Points:

- Standard RNNs with softmax struggle with rare and unseen words, even when adding attention.
- Use a window of the most recent `L` words to match against.
- Probability of output with gating: `p(y|x) = g * p_vocab(y|x) + (1 - g) * p_ptr(y|x)`.
- The gate `g` is calculated as an extra element in the attention module. Probabilities for the pointer network are then normalized accordingly.
- Integrating the gating function computation into the pointer network is crucial: It needs to have access to the pointer network state, not just the RNN state (which can't hold long-term info).
- WikiText-2 dataset: 2M train tokens, 217k validation tokens, 245k test tokens. 33k vocab, 2.6% OOV. 2x larger than PTB.
- WikiText-103 dataset: 103M train tokens, 217k validation tokens, 245k test tokens. 267k vocab, 2.4% OOV. 100x larger than PTB.
- Pointer Sentinel Model leads to stronger improvements for rare words - that makes intuitive sense.
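A minimal sketch of the gated mixture described in the key points, assuming the attention scores over the window and the sentinel score are already given as plain numbers (in the model they would come from the LSTM state). The function names and the dictionary representation of `p_vocab` are assumptions made for this example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_sentinel_mixture(p_vocab, window_words, attention_scores, sentinel_score):
    # One softmax over the window positions plus the sentinel as an extra element.
    weights = softmax(list(attention_scores) + [sentinel_score])
    g = weights[-1]  # gate: mass handed to the vocabulary softmax
    mixed = {word: g * p for word, p in p_vocab.items()}
    # The remaining mass (1 - g) is already spread over window positions, so
    # adding weights[i] directly equals (1 - g) * p_ptr for that position.
    for word, weight in zip(window_words, weights[:-1]):
        mixed[word] = mixed.get(word, 0.0) + weight
    return mixed, g

p_vocab = {"the": 0.6, "cat": 0.3, "<unk>": 0.1}
window = ["Fed", "chair", "the", "Fed"]   # "Fed" is outside the toy vocabulary
mixed, gate = pointer_sentinel_mixture(p_vocab, window, [2.0, 0.1, 0.3, 1.5], 1.0)
```

In this toy call the rare word "Fed" receives probability purely through the pointer, while in-vocabulary words that also appear in the window ("the") get contributions from both components.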


An Actor-Critic Algorithm for Sequence Prediction
Dzmitry Bahdanau and Philemon Brakel and Kelvin Xu and Anirudh Goyal and Ryan Lowe and Joelle Pineau and Aaron Courville and Yoshua Bengio
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

Summary by Denny Britz 8 years ago

TLDR; The authors propose to use the Actor-Critic framework from Reinforcement Learning for sequence prediction. They train an actor (policy) network to generate a sequence together with a critic (value) network that estimates the q-value function. Crucially, the actor network does not see the ground-truth output, but the critic does. This is different from LL (log likelihood) training, where errors are likely to cascade. The authors evaluate their framework on an artificial spelling correction task and a real-world German-English Machine Translation task, beating baselines and competing approaches in both cases.

#### Key Points

- In LL training, the model is conditioned on its own guesses during search, leading to error compounding.
- The critic is allowed to see the ground truth, but the actor isn't
- The reward is a task-specific score, e.g. BLEU
- Use bidirectional RNN for both actor and critic. Actor uses a soft attention mechanism.
- The reward is partially received at each intermediate step, not just at the end
- Framework is analogous to TD-Learning in RL
- Trick: Use additional target network to compute q_t (see Deep-Q paper) for stability
- Trick: Use delayed actor (as in Deep Q paper) for stability
- Trick: Put constraint on critic to deal with large action spaces (is this analogous to advantage functions?)
- Pre-train actor and critic to encourage exploration of the right space
- Task 1: Correct corrupt character sequence. AC outperforms LL training. Longer sequences lead to stronger lift.
- Task 2: GER-ENG Machine Translation: Beats LL and Reinforce models
- Qualitatively, critic assigns high values to words that make sense
- BLEU scores during training are lower than those of the LL model - Why? Strong regularization? Can't overfit the training data.

#### Notes

- Why does the sequence length for spelling prediction only go up to 30? This seems very short to me and something that an LSTM should be able to handle quite easily. Would've liked to see much longer sequences.


Reinforcement Learning with Unsupervised Auxiliary Tasks
Max Jaderberg and Volodymyr Mnih and Wojciech Marian Czarnecki and Tom Schaul and Joel Z Leibo and David Silver and Koray Kavukcuoglu
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.NE

Summary by Denny Britz 8 years ago

TLDR; The authors augment the A3C (Asynchronous Advantage Actor-Critic) algorithm with auxiliary tasks. These tasks share some of the network parameters, but value functions for them are learned off-policy using n-step Q-Learning. The auxiliary tasks are only used to learn a better representation and don't directly influence the main policy control. The technique, called UNREAL (Unsupervised Reinforcement and Auxiliary Learning), outperforms A3C on both the Atari and Labyrinth domains in terms of performance and training efficiency.

#### Key Points

- Environments contain a wide variety of possible training signals, not just cumulative reward
- Base A3C agent uses CNN + RNN
- Auxiliary Control and Prediction tasks share the convolutional and LSTM network of the "base agent". This forces the agent to balance improvement on the base and aux. tasks.
- Auxiliary Tasks
    - Use off-policy RL algorithms (e.g. n-step Q-Learning) so that the same stream of experience from the base agent can be used for maximizing all tasks. Experience is sampled from a replay buffer.
    - Pixel Changes (Auxiliary Control): Learn a policy for maximally changing the pixels in a grid of cells overlaid over the images
    - Network Features (Auxiliary Control): Learn a policy for maximally activating units in a specific hidden layer
    - Reward Prediction (Auxiliary Reward): Predict the next reward given some historical context. Crucially, because rewards tend to be sparse, histories are sampled in a skewed manner from the replay buffer so that P(r!=0) = 0.5. Convolutional features are shared with the base agent.
    - Value Function Replay: Value function regression for the base agent with varying window for n-step returns.
- UNREAL
    - Base agent is optimized on-policy (A3C) and aux. tasks are optimized off-policy.
- Experiments
    - Agent is trained with 20-step returns and aux. tasks are performed every 20 steps.
    - Replay buffer stores the most recent 2k observations, actions and rewards
    - UNREAL tends to be more robust to hyperparameter settings than A3C
    - Labyrinth
        - 38% -> 83% human-normalized score. Each aux. task independently adds to the performance.
        - Significantly faster learning, 11x across all levels
        - Compared to input reconstruction technique: Input reconstruction hurts final performance b/c it puts too much focus on reconstructing irrelevant parts.
    - Atari
        - Not all experiments are completed yet, but UNREAL already surpasses state of the art agents and is more robust.
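The skewed sampling used for the reward prediction task (so that P(r != 0) = 0.5) can be sketched as follows. This is only an illustration of the sampling rule on a list of stored rewards; the function name and the fallback to uniform sampling when one class is empty are assumptions, and a real replay buffer would return the preceding frames as context rather than a bare index.

```python
import random

def sample_reward_prediction_index(rewards, rng):
    # Split buffer positions by whether the stored reward is zero.
    nonzero = [i for i, r in enumerate(rewards) if r != 0]
    zero = [i for i, r in enumerate(rewards) if r == 0]
    if not nonzero or not zero:
        # Only one class is available: fall back to uniform sampling.
        return rng.randrange(len(rewards))
    # Pick the class with equal probability, then a position within it.
    pool = nonzero if rng.random() < 0.5 else zero
    return rng.choice(pool)

rng = random.Random(0)
buffer_rewards = [0] * 95 + [1] * 5   # sparse rewards: 5% non-zero
draws = [sample_reward_prediction_index(buffer_rewards, rng) for _ in range(2000)]
nonzero_fraction = sum(buffer_rewards[i] != 0 for i in draws) / len(draws)
```

With uniform sampling the predictor would see a rewarding example only about 5% of the time in this toy buffer; the skewed rule brings that to roughly half.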

#### Thoughts

- I want an algorithm box please :)


The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
Chris J. Maddison and Andriy Mnih and Yee Whye Teh
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, stat.ML

Summary by Gavin Gray 8 years ago

This paper presents a way to differentiate through discrete random variables by replacing them with continuous random variables. Say you have a discrete [categorical variable][cat] and you're sampling it with the [Gumbel trick][gumbel] like this ($G_k$ is a Gumbel distributed variable and $\boldsymbol{\alpha}/\sum_k \alpha_k$ are our categorical probabilities):

$$
z = \text{one_hot} \left( \underset{k}{\text{arg max}} [ G_k + \log \alpha_k ] \right)
$$

This paper replaces the one hot and argmax with a softmax, and they add a $\lambda$ variable to control the "temperature". As $\lambda$ tends to zero it will equal the above equation.

$$
z = \text{softmax} \left( \frac{  G_k + \log \alpha_k }{\lambda} \right)
$$
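The two formulas above can be compared directly by feeding the same Gumbel noise to both. The sketch below is a plain-Python illustration; the function names are made up for this example, and the clamping of the uniform sample is just a guard against `log(0)`.

```python
import math
import random

def sample_gumbel(rng):
    # G = -log(-log(U)) with U ~ Uniform(0, 1), clamped away from 0 and 1.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_max_index(alphas, gumbels):
    # Discrete sample: index of the largest perturbed log-weight.
    scores = [g + math.log(a) for g, a in zip(gumbels, alphas)]
    return scores.index(max(scores))

def concrete_sample(alphas, gumbels, temperature):
    # Relaxed sample: softmax of the same perturbed log-weights over lambda.
    logits = [(g + math.log(a)) / temperature for g, a in zip(gumbels, alphas)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
alphas = [1.0, 2.0, 3.0]                # unnormalized categorical weights
noise = [sample_gumbel(rng) for _ in alphas]
hard = gumbel_max_index(alphas, noise)
soft_warm = concrete_sample(alphas, noise, temperature=1.0)
soft_cold = concrete_sample(alphas, noise, temperature=0.1)
```

Lowering the temperature pushes the relaxed sample toward the one-hot vector selected by the argmax, while every component stays a differentiable function of `alphas`.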

I made [some notes][nb] on how this process works, if you'd like more intuition.

Comparison to [Gumbel-softmax][gs]
--------------------------------------------

The two papers propose precisely the same distribution with notation changes ([noted there][gs]). Both papers also reference each other and discuss the differences; in particular, this paper mentions differences between its variational objectives and those of the Gumbel-softmax paper. This paper also compares to [VIMCO][], which is probably a harder benchmark to compare against (multi-sample versus single sample).

The results in both papers compare to SOTA score function based estimators and both report high scoring results (often the best). There are some details about implementations to consider though, such as scheduling and exactly how to define the variational objective.

[cat]: https://en.wikipedia.org/wiki/Categorical_distribution
[gumbel]: https://hips.seas.harvard.edu/blog/2013/04/06/the-gumbel-max-trick-for-discrete-distributions/
[gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16
[nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748
[vimco]: https://arxiv.org/abs/1602.06725


Categorical Reparameterization with Gumbel-Softmax
Jang, Eric and Gu, Shixiang and Poole, Ben
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Gavin Gray 8 years ago

In [stochastic computation graphs][scg], like [variational autoencoders][vae], using discrete variables is hard because we can't just differentiate through Monte Carlo estimates. This paper introduces a distribution that is a smoothed version of the [categorical distribution][cat] and has a parameter that, as it goes to zero, will make it equal the categorical distribution. This distribution is continuous and can be reparameterised.

In other words, the Gumbel trick way to sample a categorical $z$ looks like this ($g_i$ is Gumbel-distributed and $\boldsymbol{\pi}/\sum_j \pi_j$ are the categorical probabilities):

$$
z = \text{one_hot} \left( \underset{i}{\text{arg max}} [ g_i + \log \pi_i ] \right)
$$

This paper replaces the one hot and argmax with a [softmax][], and they introduce $\tau$ to control the "discreteness":

$$
z = \text{softmax} \left(  \frac{ g_i + \log \pi_i}{\tau} \right)
$$
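As a sanity check on the Gumbel trick that both formulas build on, the following sketch draws many arg-max samples and compares their empirical frequencies with $\boldsymbol{\pi}/\sum_j \pi_j$. The function names, seed and sample count are arbitrary choices for this example.

```python
import math
import random

def gumbel_noise(rng):
    # g = -log(-log(u)), u ~ Uniform(0, 1), clamped to avoid log(0).
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def gumbel_max_sample(weights, rng):
    # Index of the arg max of g_i + log(pi_i); the one-hot vector follows from it.
    scores = [gumbel_noise(rng) + math.log(w) for w in weights]
    return scores.index(max(scores))

def empirical_frequencies(weights, num_samples, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(num_samples):
        counts[gumbel_max_sample(weights, rng)] += 1
    return [c / num_samples for c in counts]

weights = [1.0, 2.0, 3.0]  # unnormalized pi
freqs = empirical_frequencies(weights, num_samples=20000)
expected = [w / sum(weights) for w in weights]
```

The frequencies should land close to 1/6, 1/3 and 1/2, which is the property that lets the softmax relaxation stand in for a true categorical sample as $\tau$ shrinks.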

I made a [notebook that illustrates this][nb] while looking at another paper that came out at the same time, which I should probably compare against here.

Comparison with [Concrete Distribution][concrete]
---------------------------------------------------------------

The concrete and Gumbel-softmax distributions are exactly the same (notation switch: $\tau \to \lambda$, $\pi_i \to \alpha_k$, $G_k \to g_i$). Both papers have structured output prediction experiments (predict one half of MNIST digits from the other half). This paper shows Gumbel-softmax always being better, but doesn't compare to VIMCO, which is sometimes better at test time in the concrete distribution paper.

Sidenote - blog post
----------------------------

The authors posted a [nice blog post][blog] that is also a good short summary and explanation.

[blog]: http://blog.evjang.com/2016/11/tutorial-categorical-variational.html
[scg]: https://arxiv.org/abs/1506.05254
[vae]: https://arxiv.org/abs/1312.6114
[cat]: https://en.wikipedia.org/wiki/Categorical_distribution
[softmax]: https://en.wikipedia.org/wiki/Softmax_function
[concrete]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.00712
[nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748


Density estimation using Real NVP
Laurent Dinh and Jascha Sohl-Dickstein and Samy Bengio
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG

Summary by Hugo Larochelle 9 years ago

This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{-1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$:

$p_X(x) = p_Z(z) \left|{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right|^{-1}$

Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the determinant of the Jacobian of the function $g$, we can also write:

$p_X(x) = p_Z(f(x)) \left|{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right|$

where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$.

So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion) that have the properties of being easily invertible and having an easy-to-compute determinant of the Jacobian. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allow the construction of highly non-linear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first sampling $z\sim p_Z$ and then computing $x=g(z)$.

The building blocks for constructing $f$ are the following:

**Coupling layers**: This is perhaps the most important piece. It simply computes as its output $b\odot x + (1-b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. 

In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are non-linear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the non-transformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1-b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that.
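A small sketch of a single coupling layer makes the invertibility and the log-determinant concrete. The two "networks" `scale_net` and `shift_net` below are hypothetical stand-ins for $l$ and $m$ (in the paper these are deep networks); everything else follows the formula above, applied to plain Python lists.

```python
import math

def scale_net(v):
    # Hypothetical stand-in for l(.): any function of the masked input works.
    s = math.tanh(sum(v))
    return [s for _ in v]

def shift_net(v):
    # Hypothetical stand-in for m(.).
    s = 0.5 * sum(v)
    return [s for _ in v]

def coupling_forward(x, b):
    masked = [bi * xi for bi, xi in zip(b, x)]
    l, m = scale_net(masked), shift_net(masked)
    y = [bi * xi + (1 - bi) * (xi * math.exp(li) + mi)
         for bi, xi, li, mi in zip(b, x, l, m)]
    # log |det Jacobian| = sum over transformed dimensions of l(b * x)_i
    log_det = sum((1 - bi) * li for bi, li in zip(b, l))
    return y, log_det

def coupling_inverse(y, b):
    # Masked dimensions were copied, so b * y equals b * x and the same
    # scale and shift can be recomputed without inverting l or m.
    masked = [bi * yi for bi, yi in zip(b, y)]
    l, m = scale_net(masked), shift_net(masked)
    return [bi * yi + (1 - bi) * ((yi - mi) * math.exp(-li))
            for bi, yi, li, mi in zip(b, y, l, m)]

x = [0.3, -1.2, 0.7, 2.0]
b = [1, 0, 1, 0]
y, log_det = coupling_forward(x, b)
x_back = coupling_inverse(y, b)
```

Note that `coupling_inverse` never needs the inverse of `scale_net` or `shift_net`, which is why $l$ and $m$ can be arbitrarily complex.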

**Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layers, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then, after at least 2 layers, all dimensions have been transformed at least once.

**Squeezing operations**: A squeezing operation corresponds to a reorganization of a 2D spatial layout of dimensions into 4 sets of feature maps with spatial resolutions reduced by half (see Figure 3). This exposes multiple scales of resolution to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per-channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration.

Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkerboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channel-wise masks. Since the output of each layers-chunk must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each layers-chunk, adding another chunk of layers only on the other half. This is illustrated in Figure 4b.

Experiments on CIFAR-10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the real-valued non-volume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN.

**My Two Cents**

The proposed approach is quite unique and thought provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / one-pass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful!

I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as real-valued observations. First, it implies that its estimate of bits/dimensions is an upper bound on what it could be if the uniform sub-pixel noise was integrated out (see Equations 3-4-5 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a non-linear transformation (${\rm logit}(\alpha + (1-\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage.

I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient descent on the log-likelihood with respect to the value of remaining dimensions. This could be interesting for image completion, or for structured output prediction with real-valued outputs in general. I also wonder how expensive that would be.

In all cases, I'm looking forward to seeing interesting applications and variations of this model in the future!


Recurrent Batch Normalization
Cooijmans, Tim and Ballas, Nicolas and Laurent, César and Courville, Aaron
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Hugo Larochelle 9 years ago

This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the 3 following ideas:

**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.

**2) Separate BN for the "hiddens to hiddens" pre-activation and for the "inputs to hiddens" pre-activation.** In other words, 2 separate BN operators are applied, one on each contribution to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities.

**3) Use of largest time-step BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how to deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.

Among these ideas, I believe the most impactful idea is 1). The paper mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks.

Experiments on 4 datasets confirm the method's success.
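The three ideas can be put together in a short sketch. It normalizes a single pre-activation unit over a batch, uses separate normalizations for the recurrent and input contributions with a scale of 0.1, and reuses the last available statistics for time steps beyond the training length. All function names and the list-based layout are assumptions made for illustration; the learned shift is omitted in favor of a single shared bias.

```python
import math

def batch_norm(values, gamma, beta=0.0, eps=1e-5):
    # Normalize one unit's values across the batch, then rescale by gamma.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def bn_lstm_preactivation(hidden_contrib, input_contrib, bias, gamma=0.1):
    # Idea 2: separate BN for the hidden-to-hidden and input-to-hidden terms.
    # Idea 1: gamma starts at 0.1 instead of 1.
    h_norm = batch_norm(hidden_contrib, gamma)
    x_norm = batch_norm(input_contrib, gamma)
    return [h + x + bias for h, x in zip(h_norm, x_norm)]

def statistics_for_step(per_step_stats, t):
    # Idea 3: for time steps past the longest training sequence,
    # fall back to the statistics of the last training time step.
    return per_step_stats[min(t, len(per_step_stats) - 1)]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=0.1)
pre_act = bn_lstm_preactivation([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0], bias=0.5)
late_stats = statistics_for_step(["t0", "t1", "t2"], t=10)
```

After normalization the unit has (approximately) zero mean and standard deviation `gamma` across the batch, so a scale of 0.1 keeps the pre-activations in the near-linear region of tanh.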

**My two cents**

This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.