Summaries from International Conference on Machine Learning on ShortScience.org

arxiv.org
scholar.google.com

Neural Message Passing for Quantum Chemistry
Gilmer, Justin and Schoenholz, Samuel S. and Riley, Patrick F. and Vinyals, Oriol and Dahl, George E.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 4 years ago

In the years before this paper came out in 2017, a number of different graph convolution architectures - which use weight-sharing and order-invariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph - had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them. All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bombs are sundered and the energy of electrons at different electron shells. 

https://i.imgur.com/Mmp8KO6.png

An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large of a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network would do much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism.

The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the molecule-summarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in.  

https://i.imgur.com/J7s0q2e.png

Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways. 

1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the  edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors.  The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only order-invariant option). 
2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the prior-layer representation at that node, to come up with a next-layer representation at that node. Similarly, the same Update function weights are shared across all atoms. 
3. The Readout function, R(), which takes the final-layer representation of each atom node and aggregates the representations into a final graph-level representation an order-invariant way 

Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures. 

- Does the message function being sent from w to v depend on the feature value at both w and v, or just v? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper), seems like a quite valuable addition (in that all subsequent papers include it)
- Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge
- Is the Readout function something as simple as a summation of all top-level feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses item-to-item attention within the set of final-layer atom representations to construct an aggregated grap-level embedding

The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearest-by atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. 

This is particularly worth considering because getting spatial distance data for a molecule requires solving the free-energy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a non-spatial-information network that can perform reasonably well - reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, begs the question of whether it'd ever be possible to learn representations that can match the performance of spatially-informed ones without explicitly providing that information.

proceedings.mlr.press
scholar.google.com

Sharp Minima Can Generalize For Deep Nets
Dinh, Laurent and Pascanu, Razvan and Bengio, Samy and Bengio, Yoshua
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 4 years ago

Dinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varying flatness around the obtained minima, as illustrated in Figure 1. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones.

https://i.imgur.com/gHfolEV.jpg
Figure 1: Illustration of the influence of parameterization on the flatness of the obtained minima.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

arxiv.org
arxiv-vanity.com
scholar.google.com

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
Been Kim and Martin Wattenberg and Justin Gilmer and Carrie Cai and James Wexler and Fernanda Viegas and Rory Sayres
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML
more

[link] Summary by David Stutz 5 years ago

Kim et al. propose Concept Activation Vectors (CAV) that represent the direction of features corresponding to specific human-interpretable concepts. In particular, given a network for a classification task, a concept is defined as a set of images with that concept. A linear classifier is then trained to distinguish images with concept from random images without the concept based on a chosen feature layer. The normal of the obtained linear classification boundary corresponds to the learned Concept Activation Vector (CAV). By considering the directional derivative along this direction for a given input allows to quantify how well the input aligns with the chosen concept. This way, images can be ranked and the model’ sensitivity to particular concepts can be quantified. The idea is also illustrated in Figure 1.

https://i.imgur.com/KOqPeag.png
Figure 1: Process of constructing Concept Activation Vectors (CAVs).

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

proceedings.mlr.press
scholar.google.com

Understanding Black-box Predictions via Influence Functions
Koh, Pang Wei and Liang, Percy
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by kangcheng 5 years ago

**Goal**: identifying training points most responsible for a given prediction.

Given training points $z_1, \dots, z_n$, let loss function be $\frac{1}{n}\sum_{i=1}^nL(z_i, \theta)$ 

A function called influence function let us compute the parameter change if $z$ were upweighted by some small $\epsilon$. 
$$\hat{\theta}_{\epsilon, z} := \arg \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$$

$$\mathcal{I}_{\text{up, params}}(z) := \frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

$\mathcal{I}_{\text{up, params}}(z)$ shows how uplifting one point $z$ affect the estimate of the parameters $\theta$. 

Furthermore, we could determine how uplifting $z$ affect the loss estimate of a test point through chain rule. 
$$\mathcal{I}_{\text{up, loss}}(z, z_{\text{test}}) = \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \mathcal{I}_{\text{up, params}}(z)$$ 

Apart from lifting one training point, change of the parameters with the change of a training point could also be estimated. 
$$\frac{d\hat{\theta}_{\epsilon, z_\delta, -z}}{d\epsilon} = \mathcal{I}_{\text{up, params}}(z_\delta) - \mathcal{I}_{\text{up, params}}(z)$$
This measures how purturbation $\delta$ to training point $z$ affect the parameter estimation $\theta$.

Section 3 describes some practicals about efficient implementing.

This set of tool could be used for some interpretable machine learning tasks.

arxiv.org
arxiv-vanity.com
scholar.google.com

On Calibration of Modern Neural Networks
Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG
more

[link] Summary by elbaro 5 years ago

## Task

A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this probability does not represent the **confidence**. If the average confidence (average of max probs) for a dataset matches the accuracy, it is called **well-calibrated**. Old models like LeNet (1998) was well-calibrated, but modern networks like ResNet (2016) are no longer well-calibrated. This paper explains what caused this and compares various calibration methods.

## Figure - Confidence Histogram

https://i.imgur.com/dMtdWsL.png

The bottom row: group the samples by confidence (max probailities) into bins, and calculates the accuracy (# correct / # bin size) within each bin.

- ECE (Expected Calibration Error): average of |accuracy-confidence| of bins
- MCE (Maximum Calibration Error): max of |accuracy-confidence| of bins

## Analysis - What
The paper experiments how models are mis-calibrated with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL.

## Solution - Calibration Methods

Many calibration methods for binary classification and multi-class classification are evaluated. The method that performed the best is **temperature scailing**, which simply multiplies logits before the softmax by some constant. The paper used the validation set to choose the best constant.

proceedings.mlr.press
scholar.google.com

Parseval Networks: Improving Robustness to Adversarial Examples
Cissé, Moustapha and Bojanowski, Piotr and Grave, Edouard and Dauphin, Yann and Usunier, Nicolas
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by David Stutz 6 years ago

Cisse et al. propose parseval networks, deep neural networks regularized to learn orthonormal weight matrices. Similar to the work by Hein et al. [1], the mean idea is to constrain the Lipschitz constant of the network – which essentially means constraining the Lipschitz constants of each layer independently. For weight matrices, this can be achieved by constraining the matrix-norm. However, this (depending on the norm used) is often intractable during gradient descent training. Therefore, Cisse et al. propose to use a per-layer regularizer of the form:

$R(W) = \|W^TW – I\|$

where $I$ is the identity matrix. During training, this regularizer is supposed to ensure that the learned weigth matrices are orthonormal – an efficient alternative to regular matrix manifold optimization techniques (see the paper).

[1] Matthias Hein, Maksym Andriushchenko: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. CoRR abs/1705.08475 (2017)

Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

proceedings.mlr.press
scholar.google.com

Efficient softmax approximation for GPUs
Grave, Edouard and Joulin, Armand and Cissé, Moustapha and Grangier, David and Jégou, Hervé
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 6 years ago

Modification of the 2-level hierarchical softmax for better efficiency. An equation of computational complexity is used to find the optimal number of words in each class. In addition, the most common words are considered on the same level as other classes.

https://i.imgur.com/dbKS3gh.png

arxiv.org
scholar.google.com

Optimizing the Latent Space of Generative Networks
Bojanowski, Piotr and Joulin, Armand and Lopez-Paz, David and Szlam, Arthur
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Min Lin 6 years ago

An algorithm named GLO is proposed in this paper. The objective function of GLO:

$$\min_{\theta}\frac{1}{N}\sum_{i=1}^N\[\min_{z_i}l(g^\theta(z_i),x_i)\]$$

This idea dates back to [Dictionary Learning](https://en.wikipedia.org/wiki/Sparse_dictionary_learning). ![](https://wikimedia.org/api/rest_v1/media/math/render/svg/81449a31e07ad388801379c804b73e6d1f044ce2)
It can be viewed as a nonlinear version of the dictionary learning by 
1. replace the dictionary $D$ with the function $g^{\theta}$. 
2. replace $r$ with $z$.
3. use $l_2$ loss function.

Although in this way, the generator could be learned without the hassles caused by GAN objective, there could be problems. 

With this method, the space of the latent vector $z$ could be structured. Although the author project $z$ to a unit ball if it falls outside, there is no guarantee that the trained $z$ would remain a Gaussian distribution that it was originally initialized from. 

This could cause the problem that not every sampled noise could reconstruct a valid image, and the linear interpolation could be problematic if the support of the marginalized $p(z)$ is not a convex set.

proceedings.mlr.press
scholar.google.com

Neural Episodic Control
Pritzel, Alexander and Uria, Benigno and Srinivasan, Sriram and Badia, Adrià Puigdomènech and Vinyals, Oriol and Hassabis, Demis and Wierstra, Daan and Blundell, Charles
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Léo Paillier 6 years ago

 _Objective:_ Reduce learning time for [DQN](https://deepmind.com/research/dqn/)-type architectures.

They introduce a new network element, called DND (Differentiable Neural Dictionary) which is basically a dictionary that uses any key (especially embeddings) and computes the value by using kernel between keys. Plus it's differentiable.

## Architecture:

They use basically a network in two steps:

1.  A classical CNN network that computes and embedding for every image.
2.  A DND for all possible actions (controller input) that stores the embedding as key and estimated reward as value.

Also they use a buffer to store all tuples (previous image, action, reward, next image) and for training basic technique is used.

[![screen shot 2017-04-12 at 11 23 32 am](https://cloud.githubusercontent.com/assets/17261080/24951103/92930022-1f73-11e7-97d2-628e2f4b5a33.png)](https://cloud.githubusercontent.com/assets/17261080/24951103/92930022-1f73-11e7-97d2-628e2f4b5a33.png)

## Results:

Clearly improves learning speed but in the end other techniques catchup and it gets outperformed.

proceedings.mlr.press
scholar.google.com

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Finn, Chelsea and Abbeel, Pieter and Levine, Sergey
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by yenchenlin 6 years ago

The authors propose an algorithm for meta-learning that is compatible with any model trained with gradient descent, and show that it works on various domain including supervised learning and reinforcement learning. This is done by explicitly train the network such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task.

### Key Points

- MAML is actually finding a good **initialization** of model parameters for several tasks.
- Good initialization of parameters means that it can achieve good performance on several tasks with small number of gradient steps.

### Method
- Simultaneously optimize the **initialization** of model parameters of different meta-training tasks, hoping that it can quickly adapt to new meta-testing tasks.

![](https://cloud.githubusercontent.com/assets/7057863/25161911/46f2721e-24f1-11e7-9fba-8bc2f0782204.png)

- Training procedure:

![](https://cloud.githubusercontent.com/assets/7057863/25161749/8d00902a-24f0-11e7-93a8-6a9b74386f55.png)



### Exp

- It acheived performance that is comparable to the state-of-the-art on classification/regression/reinforcement learning tasks.

### Thought
I think the experiments are thorough since they proved that this technique can be applied to both supervised and reinforcement learning. However, the method is not novel provided that [Optimization a A Midel For Few-shot Learning](https://openreview.net/pdf?id=rJY0-Kcll) already proposed to learn initialization of parameters.

proceedings.mlr.press
scholar.google.com

Learning Important Features Through Propagating Activation Differences
Shrikumar, Avanti and Greenside, Peyton and Kundaje, Anshul
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by sam 6 years ago

### Summary

The motivation of this paper is to make neural network interpretable so that they can be adopted in fields where interpretability is essential (ie: medical field). Thus, this paper present _DeepLIFT_, a method to interpret neural networks by **decomposing** the output prediction given a specific input by backpropagating the _contribution_ of all the neurons to every input features. The _contribution_ of a neuron is determined by comparing the activation of this neuron to a _reference activation_. This _reference activation_ is determined arbitrarily by a domain expert. Moreover, the authors argue that in some case, giving separate consideration to positive and negative contributions can reveal dependencies that are missed by other approaches. The authors show that their approaches can capture some dependencies that a gradient-based method cannot.

### Computing the contribution of a neuron

Given the following notation:
* $t$: Target output neuron
* $t^0$: Reference activation of $t$
* $x_1, x_2, ..., x_n$: Set of neurons
* $\Delta t$: The difference-from-reference of a target
* $\Delta x$: The difference-from-reference of an input
* $C_{\Delta x_i,\Delta t}$: Contributions scores of a neuron

$$\Delta t = t - t^0$$
$$\Delta t = \sum_{i=1}^n C_{\Delta x_i \Delta t}$$

The advantage of the _difference from reference_ against purely gradient method is that the _diference from reference_ avoid all discontinuities as seen in the following figure

https://i.imgur.com/vLZytJT.png

### "Backpropagating" the contribution to the input

To compute the contribution to the input, the authors use a concept similar to the chain rule. Given a _multiplier_ $m_\Delta x _\Delta t$ computed as following:

$$m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}} {\Delta x}$$

Given $z$ the output of a neuron, $y_j$ one neuron in the hidden layer before $z$ and $x_i$ one neuron at the input, before $y_j$. We can compute $m_{\Delta x_i \Delta z}$ as following:

$$m_{\Delta x_i \Delta z}=\sum_j m_{\Delta x_i \Delta y_j} m_{\Delta y_j \Delta z}$$

### Computing the contribution score

The authors argues that it can be beneficial in some case to separate the positive and negative contributions. ie:
$$\Delta _{x_i} = \Delta _{x_i}^+ + \Delta _{x_i}^-$$
$$C_{\Delta _{x_i} \Delta _t} = C_{\Delta _{x_i}^+ \Delta _t} + C_{\Delta _{x_i}^- \Delta _t}$$

The authors propose three similars techniques to compute the contribution score

1. A linear rule where one does not take into consideration the nonlinearity function such that $C_{\Delta _{x_i} \Delta _t} = w_i \Delta _{x_i}$
2. The _rescale rule_ applied to nonlinear function (ie: $y=f(x)$). If $\Delta _y = 0$ or is very close (less than $10^{-7}$), then the authors use the gradient instead of the multiplier.
3. The _Reveal Cancel rule_ is similar than the _rescale rule_, but threat the positive and negative example differently. This allows to capture dependencies (ie: min/AND) that cannot be captured by _rescale rule_ or other method. The difference from reference can be computed as follow:
$$\Delta y^+ = \frac{1}{2}(f(x^0 + \Delta x^+) - f(x^0)) + \frac{1}{2}(f(x^0 + \Delta x^+ + \Delta x^-) - f(x^0+ \Delta x^-)$$
$$\Delta y^- = \frac{1}{2}(f(x^0 + \Delta x^-) - f(x^0)) + \frac{1}{2}(f(x^0 + \Delta x^+ + \Delta x^-) - f(x^0+ \Delta x^+)$$

1 Comments

proceedings.mlr.press
scholar.google.com

The Shattered Gradients Problem: If resnets are the answer, then what is the question?
Balduzzi, David and Frean, Marcus and Leary, Lennox and Lewis, J. P. and Ma, Kurt Wan-Duo and McWilliams, Brian
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

[link] Summary by Gavin Gray 6 years ago

Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about - and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise.

![](http://i.imgur.com/KMzfzMZ.png)

However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**.

Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(-x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(-x)`. Then:

$$
\mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases}
$$

Doing this allows them to train deep networks without skip connections, and they show results on CIFAR-10 with depths of up to 200 exceeding (slightly) a similar resnet.

[crelu]: https://arxiv.org/abs/1603.05201

arxiv.org
arxiv-vanity.com
scholar.google.com

On orthogonality and learning recurrent networks with long term dependencies
Eugene Vorontsov and Chiheb Trabelsi and Samuel Kadoury and Chris Pal
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, cs.NE
more

3	[link] Summary by Joseph Paul Cohen 6 years ago Here is a video overview: https://www.youtube.com/watch?v=t-fow6GJepQ Here is an image of the poster: https://i.imgur.com/Ti9btj9.png more less 1 Comments

proceedings.mlr.press
scholar.google.com

Unimodal Probability Distributions for Deep Ordinal Classification
Beckham, Christopher and Pal, Christopher J.
International Conference on Machine Learning - 2017 via Local Bibsonomy
Keywords: dblp

1	[link] Summary by Joseph Paul Cohen 6 years ago Short overview from ICML: https://youtube.com/watch?v=GMG5bFciuIA Long overview from ICML: https://youtu.be/o6dtDuldsEo more less

arxiv.org
arxiv-vanity.com
scholar.google.com

Variational Dropout Sparsifies Deep Neural Networks
Dmitry Molchanov and Arsenii Ashukha and Dmitry Vetrov
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by Gavin Gray 7 years ago

The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma] which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity you can say that the weight is effectively being removed, and in their implementation this is what they do.

There are some important details differing from the [original algorithm][kingma] on per-weight variational dropout. For both methods we have the following initialization for each dense layer:

```
theta = initialize weight matrix with shape (number of input units, number of hidden units)
log_alpha = initialize zero matrix with shape (number of input units, number of hidden units)
b = biases initialized to zero with length the number of hidden units
```

Where `log_alpha` is going to parameterise the variational posterior variance.

In the original paper the algorithm was the following:

```
mean = dot(input, theta) + b # standard dense layer
# marginal variance over activations (eq. 10 in [original paper][kingma])
variance = dot(input^2, theta^2 * exp(log_alpha)) 
# sample from marginal distribution by scaling Normal 
activations = mean + sqrt(variance)*unit_normal(number of output units) 
```

The final step is a standard [reparameterization trick][shakir], but since it is a marginal distribution this is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]).

The sparsifying algorithm starts with an alternative parameterisation for `log_alpha`

```
log_sigma2 = matrix filled with negative constant (default -8) with size (number of input units, number of hidden units)
log_alpha = log_sigma2 - log(theta^2)
log_alpha = log_alpha clipped between 8 and -8
```

The authors discuss this in section 4.1, the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha` the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counter-intuitive trick, so much so I can't quite believe it works.

They then define a mask removing contributions to units where the noise variance has gone too high:

```
clip_mask = matrix shape of log_alpha, equals 1 if log_alpha is greater than thresh (default 3)
```

The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper.

The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point.

[kingma]: https://arxiv.org/abs/1506.02557
[shakir]: http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
[fast]: http://proceedings.mlr.press/v28/wang13a.html