Summaries from International Conference on Machine Learning on ShortScience.org

jmlr.org
scholar.google.com

Noisy Activation Functions
Gülçehre, Çaglar and Moczulski, Marcin and Denil, Misha and Bengio, Yoshua
International Conference on Machine Learning - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Alexander Jung 6 years ago

  * Certain activation functions, mainly sigmoid, tanh, hard-sigmoid and hard-tanh can saturate.
  * That means that their gradient is either flat 0 after threshold values (e.g. -1 and +1) or that it approaches zero for high/low values.
  * If there's no gradient, training becomes slow or stops completely.
  * That's a problem, because sigmoid, tanh, hard-sigmoid and hard-tanh are still often used in some models, like LSTMs, GRUs or Neural Turing Machines.
  * To fix the saturation problem, they add noise to the output of the activation functions.
  * The noise increases as the unit saturates.
  * Intuitively, once the unit is saturating, it will occasionally "test" an activation in the non-saturating regime to see if that output performs better.

### How
  * The basic formula is: `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon`
  * Variables in that formula:
    * Non-linear part `alpha*h(x)`:
      * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values below 1.0 let the noise point away from the unsaturated regime. Values <=1.0 let it point towards the unsaturated regime (higher alpha = stronger noise).
      * `h(x)`: The original activation function.
    * Linear part `(1-alpha)u(x)`:
      * `u(x)`: First-order Taylor expansion of h(x).
        * For sigmoid: `u(x) = 0.25x + 0.5`
        * For tanh: `u(x) = x`
        * For hard-sigmoid: `u(x) = max(min(0.25x+0.5, 1), 0)`
        * For hard-tanh: `u(x) = max(min(x, 1), -1)`
    * Noise/Stochastic part `d(x)std(x)epsilon`:
      * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise.
      * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2`
        * `c` is a hyperparameter that controls the scale of the standard deviation of the noise.
        * `p` controls the magnitude of the noise. Due to the `sigmoid(y)-0.5` this can influence the sign. `p` is learned.
      * `epsilon`: A noise creating random variable. Usually either a Gaussian or the positive half of a Gaussian (i.e. `z` or `|z|`).
  * The hyperparameter `c` can be initialized at a high value and then gradually decreased over time. That would be comparable to simulated annealing.
  * Noise could also be applied to the input, i.e. `h(x)` becomes `h(x + noise)`.

### Results
  * They replaced sigmoid/tanh/hard-sigmoid/hard-tanh units in various experiments (without further optimizations).
  * The experiments were:
    * Learn to execute source code (LSTM?)
    * Language model from Penntreebank (2-layer LSTM)
    * Neural Machine Translation engine trained on Europarl (LSTM?)
    * Image caption generation with soft attention trained on Flickr8k (LSTM)
    * Counting unique integers in a sequence of integers (LSTM)
    * Associative recall (Neural Turing Machine)
  * Noisy activations practically always led to a small or moderate improvement in resulting accuracy/NLL/BLEU.
  * In one experiment annealed noise significantly outperformed unannealed noise, even beating careful curriculum learning. (Somehow there are not more experiments about that.)
  * The Neural Turing Machine learned far faster with noisy activations and also converged to a much better solution. 


![Influence of alphas](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Noisy_Activation_Functions__alphas.png?raw=true "Influence of alphas.")

*Hard-tanh with noise for various alphas. Noise increases in different ways in the saturing regimes.*


![Neural Turing Machine results](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Noisy_Activation_Functions__ntm.png?raw=true "Neural Turing Machine results.")

*Performance during training of a Neural Turing Machine with and without noisy activation units.*

--------------------

# Rough chapter-wise notes

* (1) Introduction
  * ReLU and Maxout activation functions have improved the capabilities of training deep networks.
  * Previously, tanh and sigmoid were used, which were only suited for shallow networks, because they saturate, which kills the gradient.
  * They suggest a different avenue: Use saturating nonlinearities, but inject noise when they start to saturate (and let the network learn how much noise is "good").
  * The noise allows to train deep networks with saturating activation functions.
  * Many current architectures (LSTMs, GRUs, Neural Turing Machines, ...) require "hard" decisions (yes/no). But they use "soft" activation functions to implement those, because hard functions lack gradient.
  * The soft activation functions can still saturate (no more gradient) and don't match the nature of the binary decision problem. So it would be good to replace them with something better.
  * They instead use hard activation functions and compensate for the lack of gradient by using noise (during training).
  * Networks with hard activation functions outperform those with soft ones.

* (2) Saturating Activation Functions
  * Activation Function = A function that maps a real value to a new real value and is differentiable almost everywhere.
  * Right saturation = The gradient of an activation function becomes 0 if the input value goes towards infinity.
  * Left saturation = The gradient of an activation function becomes 0 if the input value goes towards -infinity.
  * Saturation = A activation function saturates if it right-saturates and left-saturates.
  * Hard saturation = If there is a constant c for which for which the gradient becomes 0.
  * Soft saturation = If there is no constant, i.e. the input value must become +/- infinity.
  * Soft saturating activation functions can be converted to hard saturating ones by using a first-order Taylor expansion and then clipping the values to the required range (e.g. 0 to 1).
  * A hard activating tanh is just `f(x) = x`. With clipping to [-1, 1]: `max(min(f(x), 1), -1)`.
  * The gradient for hard activation functions is 0 above/below certain constants, which will make training significantly more challenging.
  * hard-sigmoid, sigmoid and tanh are contractive mappings, hard-tanh for some reason only when it's greater than the threshold.
  * The fixed-point for tanh is 0, for the others !=0. That can have influences on the training performance.

* (3) Annealing with Noisy Activation Functions
  * Suppose that there is an activation function like hard-sigmoid or hard-tanh with additional noise (iid, mean=0, variance=std^2).
  * If the noise's `std` is 0 then the activation function is the original, deterministic one.
  * If the noise's `std` is very high then the derivatives and gradient become high too. The noise then "drowns" signal and the optimizer just moves randomly through the parameter space.
  * Let the signal to noise ratio be `SNR = std_signal / std_noise`. So if SNR is low then noise drowns the signal and exploration is random.
  * By letting SNR grow (i.e. decreaseing `std_noise`) we switch the model to fine tuning mode (less coarse exploration).
  * That is similar to simulated annealing, where noise is also gradually decreased to focus on better and better regions of the parameter space.

* (4) Adding Noise when the Unit Saturate
  * This approach does not always add the same noise. Instead, noise is added proportinally to the saturation magnitude. More saturation, more noise.
  * That results in a clean signal in "good" regimes (non-saturation, strong gradients) and a noisy signal in "bad" regimes (saturation).
  * Basic activation function with noise: `phi(x, z) = h(x) + (mu + std(x)*z)`, where `h(x)` is the saturating activation function, `mu` is the mean of the noise, `std` is the standard deviation of the noise and `z` is a random variable.
  * Ideally the noise is unbiased so that the expectation values of `phi(x,z)` and `h(x)` are the same.
  * `std(x)` should take higher values as h(x) enters the saturating regime.
  * To calculate how "saturating" a activation function is, one can `v(x) = h(x) - u(x)`, where `u(x)` is the first-order Taylor expansion of `h(x)`.
  * Empirically they found that a good choice is `std(x) = c(sigmoid(p*v(x)) - 0.5)^2` where `c` is a hyperparameter and `p` is learned.
  * (4.1) Derivatives in the Saturated Regime
    * For values below the threshold, the gradient of the noisy activation function is identical to the normal activation function.
    * For values above the threshold, the gradient of the noisy activation function is `phi'(x,z) = std'(x)*z`. (Assuming that z is unbiased so that mu=0.)
  * (4.2) Pushing Activations towards the Linear Regime
    * In saturated regimes, one would like to have more of the noise point towards the unsaturated regimes than away from them (i.e. let the model try often whether the unsaturated regimes might be better).
    * To achieve this they use the formula `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon`
      * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values below 1.0 let the noise point away from the unsaturated regime. Values <=1.0 let it point towards the unsaturated regime (higher alpha = stronger noise).
      * `h(x)`: The original activation function.
      * `u(x)`: First-order Taylor expansion of h(x).
      * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise.
      * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2` with `c` being a hyperparameter and `p` learned.
      * `epsilon`: Either `z` or `|z|`. If `z` is a Gaussian, then `|z|` is called "half-normal" while just `z` is called "normal". Half-normal lets the noise only point towards one "direction" (towards the unsaturated regime or away from it), while normal noise lets it point in both directions (with the slope being influenced by `alpha`).
    * The formula can be split into three parts:
      * `alpha*h(x)`: Nonlinear part.
      * `(1-alpha)u(x)`: Linear part.
      * `d(x)std(x)epsilon`: Stochastic part.
    * Each of these parts resembles a path along which gradient can flow through the network.
    * During test time the activation function is made deterministic by using its expectation value: `E[phi(x,z)] = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)E[epsilon]`.
    * If `z` is half-normal then `E[epsilon] = sqrt(2/pi)`. If `z` is normal then `E[epsilon] = 0`.

* (5) Adding Noise to Input of the Function
  * Noise can also be added to the input of an activation function, i.e. `h(x)` becomes `h(x + noise)`.
  * The noise can either always be applied or only once the input passes a threshold.

* (6) Experimental Results
  * They applied noise only during training.
  * They used existing setups and just changed the activation functions to noisy ones. No further optimizations.
  * `p` was initialized uniformly to [-1,1].
  * Basic experiment settings:
    * NAN: Normal noise applied to the outputs.
    * NAH: Half-normal noise, i.e. `|z|`, i.e. noise is "directed" towards the unsaturated or satured regime.
    * NANI: Normal noise applied to the *input*, i.e. `h(x+noise)`.
    * NANIL: Normal noise applied to the input with learned variance.
    * NANIS: Normal noise applied to the input, but only if the unit saturates (i.e. above/below thresholds).
  * (6.1) Exploratory analysis
    * A very simple MNIST network performed slightly better with noisy activations than without. But comparison was only to tanh and hard-tanh, not ReLU or similar.
    * In an experiment with a simple GRU, NANI (noisy input) and NAN (noisy output) performed practically identical. NANIS (noisy input, only when saturated) performed significantly worse.
  * (6.2) Learning to Execute
    * Problem setting: Predict the output of some lines of code.
    * They replaced sigmoids and tanhs with their noisy counterparts (NAH, i.e. half-normal noise on output). The model learned faster.
  * (6.3) Penntreebank Experiments
    * They trained a standard 2-layer LSTM language model on Penntreebank.
    * Their model used noisy activations, as opposed to the usually non-noisy ones.
    * They could improve upon the previously best value. Normal noise and half-normal noise performed roughly the same.
  * (6.4) Neural Machine Translation Experiments
    * They replaced all sigmoids and tanh units in the Neural Attention Model with noisy ones. Then they trained on the Europarl corpus.
    * They improved upon the previously best score.
  * (6.5) Image Caption Generation Experiments
    * They train a network with soft attention to generate captions for the Flickr8k dataset.
    * Using noisy activation units improved the result over normal sigmoids and tanhs.
  * (6.6) Experiments with Continuation
    * They build an LSTM and train it to predict how many unique integers there are in a sequence of random integers.
    * Instead of using a constant value for hyperparameter `c` of the noisy activations (scale of the standard deviation of the noise), they start at `c=30` and anneal down to `c=0.5`.
    * Annealed noise performed significantly better then unannealed noise.
    * Noise applied to the output (NAN) significantly beat noise applied to the input (NANIL).
    * In a second experiment they trained a Neural Turing Machine on the associative recall task.
    * Again they used annealed noise.
    * The NTM with annealed noise learned by far faster than the one without annealed noise and converged to a perfect solution.

arxiv.org
scholar.google.com

Associative Long Short-Term Memory
Danihelka, Ivo and Wayne, Greg and Uria, Benigno and Kalchbrenner, Nal and Graves, Alex
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 7 years ago

TLDR; The authors propose Associate LSTMs, a combination of external memory based on Holographic Reduced Representations and LSTMs. The memory provides noisy key-value lookup based on matrix multiplications without introducing additional parameters. The authors evaluate their model on various sequence copying and memorization tasks, where it outperforms vanilla LSTMs and competing models with a similar number of parameters.

#### Key Points

- Two limitations of LSTMs: 1. N cells require NxN weight matrices. 2. Lacks mechanism to index memory
- Idea of memory comes from "Holographic Reduced Representations" (Plate, 2003), but authors add multiple redundant memory copies to reduce noise.
- More copies of memory => Less noise during retrieval
- In the LSTM update equations input and output keys to the memory are computed
- Compared to: LSTM, Permutation LSTM, Unitary LSTM, Multiplicative Unitary LSTM
- Tasks: Episodic Copy, XML modeling, variable assignment, arithmetic, sequence prediction

#### Notes

- Only brief comparison with Neural Turing Machines in appendix. Probably NTMs outperform this and are simpler. No comparison with attention mechanisms, memory networks, etc. Why?
- It's surprising to me that deep LSTM without any bells and whistles actually perform pretty well on many of the tasks. Is the additional complexity really worth it?

arxiv.org
arxiv-vanity.com
scholar.google.com

Language Modeling with Gated Convolutional Networks
Yann N. Dauphin and Angela Fan and Michael Auli and David Grangier
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL
more

[link] Summary by Martin Thoma 7 years ago

This paper is about a new model for language which uses a convolutional approach instead of LSTMs.


## General Language modeling

Statistical language models estimate the probability distribution of a sequence of words. They are important for ASR (automatic speech recognition) and translation. The usual approach is to embedd words into $\mathbb{R}^n$ and then apply RNNs to the vector sequences.


## Evaluation

* [WikiText-103](http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/): [Perplexity](https://en.wikipedia.org/wiki/Perplexity) of 44.9 (lower is better)
* new best single-GPU result on the Google Billion Word benchmark: Perplexity of 43.9


## Idea

* uses Gated Linear Units (GLU)
* uses pre-activation residual blocks
* adaptive softmax
* no tanh in the gating mechanism
* use gradient clipping

## See also

* [Reddit](https://www.reddit.com/r/MachineLearning/comments/5kbsjb/r_161208083_language_modeling_with_gated/)
* [Improving Neural Language Models with a Continuous Cache](https://arxiv.org/abs/1612.04426): Test perplexity of **40.8 on WikiText-103**

jmlr.org
scholar.google.com

Learning End-to-end Video Classification with Rank-Pooling
Fernando, Basura and Gould, Stephen
International Conference on Machine Learning - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 7 years ago

This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an end-to-end trained neural network architecture.

Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters $\bf{u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling.

While previous work has focused on using rank pooling on hand-designed and fixed representations, this paper proposes to use ConvNet features (pre-trained on ImageNet) for the representation and backpropagate through rank pooling to fine-tune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21). Finally, since its gradient requires inverting a matrix (corresponding to a hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rank-one matrices (see Lemma 3) or to simply use an approximation based on using a diagonal hessian.

In experiments on two small scale video activity recognition datasets (UCF-Sports and Hollywood2), the authors show that fine-tuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling.

**My two cents**

This paper was eye opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought provoking approach to pooling representations in a way that preserves some organizational information about the original representations.

I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guaranties on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability.

I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available. 

Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).

2 Comments

arxiv.org
arxiv-vanity.com
scholar.google.com

Recurrent Highway Networks
Julian Georg Zilly and Rupesh Kumar Srivastava and Jan Koutník and Jürgen Schmidhuber
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.LG, cs.CL, cs.NE
more

[link] Summary by Udibr 7 years ago

multi layer RNN in which first layer is LSTM, following layers $l$ have $t$,$c$ gates that control whether the state of the layer is carried from previous state or transferred previous layer:
$s_l^{[t]} = h_l^{[t]} \cdot t_l^{[t]} + s_{l-1}^{[t]} \cdot c
_l^{[t]}$

arxiv.org
scholar.google.com

Asynchronous Methods for Deep Reinforcement Learning
Mnih, Volodymyr and Badia, Adrià Puigdomènech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy P. and Harley, Tim and Silver, David and Kavukcuoglu, Koray
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by fabianboth 7 years ago

The main contribution of [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783v1.pdf) by Mnih et al. is a ligthweight framework for reinforcement learning agents. 

They propose a training procedure which utilizes asynchronous gradient decent updates from multiple agents at once. Instead of training one single agent who interacts with its environment, multiple agents are interacting with their own version of the environment simultaneously.

After a certain amount of timesteps, accumulated gradient updates from an agent are applied to a global model, e.g. a Deep Q-Network. These updates are asynchronous and lock free. 

Effects of training speed and quality are analyzed for various reinforcement learning methods. No replay memory is need to decorrelate successive game states, since all agents are already exploring different game states in real time. Also, on-policy algorithms like actor-critic can be applied.

They show that asynchronous updates have a stabilizing effect on policy and value updates. Also, their best method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU.