Summaries from arXiv e-Print archive on ShortScience.org

arxiv.org
scholar.google.com

Convolutional Networks on Graphs for Learning Molecular Fingerprints
Duvenaud, David and Maclaurin, Dougal and Aguilera-Iparraguirre, Jorge and Gómez-Bombarelli, Rafael and Hirzel, Timothy and Aspuru-Guzik, Alán and Adams, Ryan P.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

If you read modern (that is, 2018-2020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of the lineage of the technique within this domain.

This 2015 paper, by Duvenaud et al., is the earliest one I can find. It devotes the entire paper to comparing differentiable, message-passing networks to the standard approach at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible generalization of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity.

Circular Fingerprinting (or, more properly, Extended-Connectivity Circular Fingerprints) is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by:
1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with bonded neighbors ordered quasi-randomly)

2) Calculating next-level features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer 

3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2) 

4) Iterating this process at progressively higher layers, using the hash 

The effect of this is to assign each index of the vector to a binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a structure around a given atom. Its main downsides are that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lower-level feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying.
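To make the four steps concrete, here is a minimal sketch of a circular-fingerprint-style hashing loop. It is not the reference ECFP implementation: the function names, the use of `zlib.crc32` as the fixed hash, integer atom features, and sorting neighbor identifiers (instead of the quasi-random ordering described above) are all assumptions made for illustration.

```python
import zlib
import numpy as np

def stable_hash(values):
    """Deterministic hash of a tuple of ints (stand-in for the fixed hash function)."""
    return zlib.crc32(repr(tuple(values)).encode("utf-8"))

def circular_fingerprint(atom_features, neighbors, radius=2, size=64):
    """atom_features: one int per atom; neighbors: list of neighbor-index lists."""
    fingerprint = np.zeros(size, dtype=np.int8)
    # Layer 0: each atom's identifier is a hash of its own basic feature.
    ids = [stable_hash((f,)) for f in atom_features]
    for atom_id in ids:
        fingerprint[atom_id % size] = 1
    for _ in range(radius):
        new_ids = []
        for a, atom_id in enumerate(ids):
            # Assumption: sort neighbor identifiers so the result is order-independent.
            neighborhood = (atom_id, *sorted(ids[n] for n in neighbors[a]))
            new_ids.append(stable_hash(neighborhood))
        ids = new_ids
        # Set the bit at the index implied by each hashed neighborhood.
        for atom_id in ids:
            fingerprint[atom_id % size] = 1
    return fingerprint

# Hypothetical 3-atom chain C-C-O, with atomic numbers as the basic features.
fp = circular_fingerprint([6, 6, 8], [[1], [0, 2], [1]])
```

Because every bit comes from a hard hash, changing a single atom flips bits rather than shifting activations smoothly, which is downside (b) above.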

https://i.imgur.com/V8FpfVE.png

Duvenaud et al. present their alternative in terms of keeping a similar structure, but swapping out fixed and binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an order-invariant operation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors. This will no longer look for exact matches, but will activate to the extent a structure around an atom matches against a kernel pattern. Then, these features are put into a softmax, which, instead of setting a single index of a vector to a hard value of one, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations for each atom.
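The substitutions described here can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' code: the function names, the `tanh` nonlinearity, the dense adjacency matrix, and the absence of bond features and degree-specific weights are all choices made for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def neural_fingerprint(atom_features, adjacency, hidden_weights, output_weights):
    """
    atom_features: (n_atoms, d) array; adjacency: (n_atoms, n_atoms) 0/1 matrix;
    hidden_weights: list of (d, d) arrays; output_weights: list of (d, fp_size) arrays.
    """
    h = atom_features
    fingerprint = np.zeros(output_weights[0].shape[1])
    for W_h, W_out in zip(hidden_weights, output_weights):
        # Sum each atom's features with its neighbors' (order-invariant aggregation).
        v = h + adjacency @ h
        # A shared, trainable "kernel" replaces the fixed hash function.
        h = np.tanh(v @ W_h)
        # A softmax replaces writing a hard 1 at a hashed index.
        fingerprint += softmax(h @ W_out).sum(axis=0)
    return fingerprint

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                      # 4 atoms, 3 features each
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)  # a 4-atom chain
W_hidden = [rng.normal(size=(3, 3)) for _ in range(2)]
W_output = [rng.normal(size=(3, 8)) for _ in range(2)]
fp = neural_fingerprint(X, A, W_hidden, W_output)
```

Since every operation is differentiable, the weights can be trained end to end on a downstream task, and relabeling the atoms leaves the fingerprint unchanged.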

The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between molecules according to either the circular or neural fingerprint (which had a high correlation), and confirming that it performs similarly to circular fingerprints on a set of supervised learning tasks on molecules. When they trained weights to be better than random on three such supervised tasks, they found that their model was comparable to or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error).

This really is the simplest possible version of a message-passing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighbor-connection according to the features of each node, etc), but it's really satisfying to see it laid out as a next-step alternative that offered value just by stepping away from exact-match feature dynamics and fixed random hash functions, even without all the sophisticated additions that would later be added to such models.



Metadata Embeddings for User and Item Cold-start Recommendations
Kula, Maciej
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 5 years ago

The idea is to combine collaborative filtering with content-based recommenders to mitigate the user and item coldstart problems.

The author distinguishes between positive and negative interactions.

The representation of a user and of items is the sum of all their latent representations. This sounds similar to "**Asymmetric factor models**" as described in [the BellKor Netflix Prize solution](https://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf). **The key idea is to encode the latent user (or item) vector as a sum of latent attribute vectors.**
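The key idea can be illustrated with a small sketch. It is not the LightFM API: the names `represent` and `score`, the embedding sizes, and the plain dot-product score (no bias terms or link function) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_features, n_item_features, dim = 5, 6, 4
# One latent vector per metadata feature (hypothetical sizes, random values).
user_feature_emb = rng.normal(size=(n_user_features, dim))
item_feature_emb = rng.normal(size=(n_item_features, dim))

def represent(feature_indicator, feature_embeddings):
    """Latent vector = sum of the embeddings of the active metadata features."""
    return feature_indicator @ feature_embeddings

def score(user_features, item_features):
    """Raw affinity: dot product of the summed user and item representations."""
    q_u = represent(user_features, user_feature_emb)
    p_i = represent(item_features, item_feature_emb)
    return q_u @ p_i

# A brand-new item with no interactions still gets a vector from its metadata.
user = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
new_item = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
affinity = score(user, new_item)
```

This is why the approach helps with cold start: an unseen user or item only needs metadata features whose embeddings were learned from other users or items.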

Adagrad / asynchronous stochastic gradient descent was used for optimization.


## See also

* [Code on GitHub](https://lyst.github.io/lightfm/docs/index.html#)
* [Paper on ArXiv](https://arxiv.org/pdf/1507.08439.pdf)


Adding Gradient Noise Improves Learning for Very Deep Networks
Arvind Neelakantan and Luke Vilnis and Quoc V. Le and Ilya Sutskever and Lukasz Kaiser and Karol Kurach and James Martens
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by David Stutz 5 years ago

Neelakantan et al. study gradient noise for improving neural network training. In particular, they add Gaussian noise to the gradients in each iteration:

$\tilde{\nabla}f = \nabla f + \mathcal{N}(0, \sigma^2)$

where the variance $\sigma^2$ is adapted throughout training as follows:

$\sigma^2 = \frac{\eta}{(1 + t)^\gamma}$

where $\eta$ and $\gamma$ are hyper-parameters and $t$ the current iteration. In experiments, the authors show that gradient noise has the potential to improve accuracy, especially given optimization.
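As a quick illustration of the two formulas above, the following sketch adds annealed Gaussian noise to a gradient. The function names and the default values of `eta` and `gamma` are placeholders chosen for this example, not values taken from the summary.

```python
import numpy as np

def noise_variance(t, eta=0.3, gamma=0.55):
    """sigma^2 = eta / (1 + t)^gamma, decaying with the iteration t."""
    return eta / (1.0 + t) ** gamma

def noisy_gradient(grad, t, rng, eta=0.3, gamma=0.55):
    """Return grad + N(0, sigma^2) noise, sampled independently per entry."""
    sigma = np.sqrt(noise_variance(t, eta, gamma))
    return grad + rng.normal(0.0, sigma, size=grad.shape)

# Toy usage: gradient descent on f(x) = ||x||^2 with noisy gradients.
rng = np.random.default_rng(0)
x = np.ones(3)
for t in range(100):
    grad = 2.0 * x
    x = x - 0.1 * noisy_gradient(grad, t, rng)
```

The noise is largest at the start of training and shrinks over time, so it perturbs early exploration more than late convergence.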

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


End-to-end Learning of Action Detection from Frame Glimpses in Videos
Serena Yeung and Olga Russakovsky and Greg Mori and Li Fei-Fei
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.LG

[link] Summary by shiyu 6 years ago

### **Keyword**: RNN, serialized model; non-differentiable backpropagation; action detection in video


**Abstract**: This paper uses an end-to-end model, a recurrent neural network trained with REINFORCE, to directly predict the temporal bounds of actions. The intuition is that people observe moments in a video and decide where to look next in order to predict when an action is occurring. After training, Yeung et al. achieve state-of-the-art results while observing only 2% of the video frames.

**Model**: To take a long video and output all instances of a given action, the model has two parts: an observation network and a recurrent network.
* observation network: encodes the visual representation of video frames.
     * input: $l_n$ -- the normalized location of the frame + frame $v_{l_n}$
     * network: fc7 features of a fine-tuned VGG16 network
     * output: $o_n$, a 1024-dimensional vector encoding time and frame features
     
* recurrent network: sequentially processes the visual representations and decides where to look next and whether to emit a detection.
##### for each timestep:
     * input: $o_n$ -- the representation of the frame + previous state $h_{n-1}$
     * network: $d_n = f_d(h_n; \theta_d)$, $p_n = f_p(h_n; \theta_p)$, where $f_d$ is an fc layer and $f_p$ is fc + sigmoid
     * output: $d_n = (s_n, e_n, c_n)$ as the candidate detection, where $s_n$ and $e_n$ are the start and end of the detection and $c_n$ is the confidence level; $p_n$, whether $d_n$ is a valid detection; $l_{n+1}$, where to observe next. All outputs fall in [0,1].
https://i.imgur.com/SeFianV.png


**Training**: to learn from the supervision annotations in long videos and handle the non-differentiable components, the authors use backpropagation to train $d_n$ and REINFORCE to train $p_n$ and $l_{n+1}$
* for $d_n$: $L(D) = \sum_n L_{cls}(d_n) + \sum_n \sum_m 1[y_{mn} = 1] L_{loc}(d_n,g_m)$
* for $p_n,l_{n+1}$: reward $J(\theta) = \sum_{a\in A} p(a) r(a)$, where $p(a)$ is the distribution over actions and $r(a)$ is the reward for action $a$; training maximizes this.
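To see how REINFORCE can maximize $J(\theta)$ when the sampled action itself is not differentiable, here is a toy sketch with a single Bernoulli decision (emit or do not emit). It is not the paper's training code: the one-parameter policy, the reward function, and all names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_gradient(theta, reward_fn, rng, n_samples=10000):
    """Monte-Carlo estimate of dJ/dtheta for a ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    actions = (rng.random(n_samples) < p).astype(float)  # sampling is non-differentiable
    rewards = reward_fn(actions)
    score = actions - p  # d/dtheta of log Bernoulli(a; sigmoid(theta))
    return np.mean(rewards * score)

def exact_gradient(theta, r1, r0):
    """Closed form for comparison: J = p * r1 + (1 - p) * r0."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (r1 - r0)

rng = np.random.default_rng(0)
# Hypothetical reward: 1 for emitting (a = 1), 0 otherwise.
estimate = reinforce_gradient(0.3, lambda a: a, rng, n_samples=200000)
reference = exact_gradient(0.3, r1=1.0, r0=0.0)
```

The estimator only needs the gradient of the log-probability of the sampled action, which is why the policy outputs parameterize a distribution instead of being used directly.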


**Summary**:
This paper uses a serialized model that first extracts features from each frame, then uses the frame features and previous state to generate the next observation location, the detection, and the detection indicator. Specifically, to use previous information, they use an RNN to store it, and use REINFORCE to train $p_n$ and $l_n$, where the goal is to maximize the reward for an action sequence, using Monte-Carlo sampling to numerically estimate the gradient of the high-dimensional objective.

**questions**:
1. why are $p_n$ and $l_n$ non-differentiable components?
2. if $p_n$ and $l_n$ are indeed non-differentiable components, how does REINFORCE compute the gradient?
3. why don't we get $p_n$ from $p_n = f_p(h_n, \theta_p)$ directly, rather than using $f_p$ as the parameter of a Bernoulli distribution? A similar question applies to the calculation of $l_{n+1}$ at training time.


Variational Inference with Normalizing Flows
Danilo Jimenez Rezende and Shakir Mohamed
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.AI, cs.LG, stat.CO, stat.ME

[link] Summary by CodyWild 6 years ago

This paper argues for the use of normalizing flows - a way of building up new probability distributions by applying multiple sets of invertible transformations to existing distributions - as a way of building more flexible variational inference models. 

The central premise of a variational autoencoder is that of learning an approximation to the posterior distribution of latent variables - p(z|x) - and parameterizing that distribution according to values produced by a neural network. In typical practice, this has meant that VAEs are limited in terms of the complexity of latent variable distributions they can encode, since using an analytically specified distribution tends to limit you to simpler distributional shapes - Gaussians, uniform, and the like. Normalizing flows are here proposed as a way to allow for the model to learn more complex forms of posterior distribution. 

Normalizing flows work off of a fairly simple intuition: if you take samples from a distribution p(x), and then apply a function f(x) to each x in that sample, you can calculate the expected value of your new distribution f(x) by calculating the expectation of f(x) under the old distribution p(x). That is to say: 
https://i.imgur.com/NStm7zN.png
This mathematical transformation has a pretty delightful name - The Law of the Unconscious Statistician - that came from the fact that so many statisticians just treated this identity as a definitional fact, rather than something actually in need of proving (I very much fall into this bucket as well). The implication of this is that if you apply many transformations in sequence to the draws from some simple distribution, you can work with the resulting distribution without explicitly knowing its analytical formulation, just by being able to evaluate - and, importantly, invert - the function. The ability to invert the function is key, because of the way you calculate the transformed density: by multiplying the original density by the inverse of the absolute value of the determinant of the derivative (the Jacobian) of your function f(z) with respect to z. (Note here that q(z) is the original distribution you sampled under, and q’(z) is the implicit density you’re trying to estimate, after your function has been applied).

https://i.imgur.com/8LmA0rc.png

Combining these ideas together: a variational flow autoencoder works by having an encoder network define the parameters of a simple distribution (Gaussian or Uniform), and then running the samples from that distribution through a series of k transformation layers. This final transformed density over z is then given to the decoder to work with. Some important limitations are in place here, the most salient of which is that in order to evaluate the transformed density, you have to be able to calculate the determinant of the derivative of a given transformation. Due to this constraint, the paper only tests a few transformations where this is easy to calculate analytically - the planar transformation and radial transformation. If you think about transformations of density functions as fundamentally stretching or compressing regions of density, the planar transformation works by stretching along an axis perpendicular to some parametrically defined plane, and the radial transformation works by stretching outward in a radial way around some parametrically defined point. Even though these transformations are individually fairly simple, when combined, they can give you a lot more flexibility in distributional space than a simple Gaussian or Uniform could.
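As a sketch of why the planar transformation is convenient, the snippet below implements one planar layer, $f(z) = z + u \tanh(w^\top z + b)$, together with its log-determinant, which has the closed form $\log|1 + u^\top \psi(z)|$ with $\psi(z) = (1 - \tanh^2(w^\top z + b))\,w$. The function name, the choice of `tanh`, and the parameter values are assumptions for illustration; the constraint on `u` and `w` needed for invertibility is not enforced here.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply f(z) = z + u * tanh(w.z + b) to a batch z of shape (n, d).

    Returns f(z) and log|det df/dz| for each row.
    """
    a = z @ w + b                                  # (n,)
    f = z + np.outer(np.tanh(a), u)                # (n, d)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w     # (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))        # (n,)
    return f, log_det

# Hypothetical parameters; w.u > -1 here, so the map is invertible.
u = np.array([0.5, -0.2, 0.3])
w = np.array([1.0, 0.4, -0.6])
b = 0.1

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 3))                       # samples from a standard Gaussian
log_q0 = -0.5 * np.sum(z0 ** 2, axis=1) - 1.5 * np.log(2 * np.pi)
z1, log_det = planar_flow(z0, u, w, b)
log_q1 = log_q0 - log_det                          # density of the transformed samples
```

Stacking k such layers just subtracts k log-determinant terms from the base log-density, so the cost stays linear in the dimension instead of requiring a full determinant.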

https://i.imgur.com/Xf8HgHl.png


Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization
Uri Shaham and Yutaro Yamada and Sahand Negahban
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.LG, cs.NE

[link] Summary by David Stutz 7 years ago

Shaham et al. provide an interpretation of adversarial training in the context of robust optimization. In particular, adversarial training is posed as min-max problem (similar to other related work, as I found):

$\min_\theta \sum_i \max_{r \in U_i} J(\theta, x_i + r, y_i)$

where $U_i$ is called the uncertainty set corresponding to sample $x_i$ – in the context of adversarial examples, this might be an $\epsilon$-ball around the sample quantifying the maximum perturbation allowed; $(x_i, y_i)$ are training samples, $\theta$ the parameters and $J$ the training objective. In practice, when the overall minimization problem is tackled using gradient descent, the inner maximization problem cannot be solved exactly (as this would be inefficient). Instead, Shaham et al. propose to alternate single steps for the minimization and the maximization problems – in the spirit of generative adversarial network training.
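The alternating scheme can be sketched on a toy logistic-regression model. This is an illustration under assumptions, not the authors' implementation: the model, the $L_\infty$ uncertainty set, the sign-of-gradient ascent step, and all names and hyper-parameters are chosen for this example.

```python
import numpy as np

def loss_and_grads(theta, x, y):
    """Logistic loss J for labels y in {0, 1}; returns J, dJ/dtheta and dJ/dx."""
    p = 1.0 / (1.0 + np.exp(-(x @ theta)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    err = p - y
    grad_theta = x.T @ err / len(y)
    grad_x = np.outer(err, theta) / len(y)
    return loss, grad_theta, grad_x

def adversarial_training_step(theta, x, y, epsilon=0.1, lr=0.5):
    # Maximization: one ascent step on r, kept inside the L-infinity ball of radius epsilon.
    _, _, grad_x = loss_and_grads(theta, x, y)
    r = epsilon * np.sign(grad_x)
    # Minimization: one descent step on theta using the perturbed samples.
    _, grad_theta, _ = loss_and_grads(theta, x + r, y)
    return theta - lr * grad_theta, r

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = (x[:, 0] > 0).astype(float)
theta = np.zeros(2)
for _ in range(200):
    theta, r = adversarial_training_step(theta, x, y)
```

Each iteration costs roughly two gradient evaluations, which is what makes the approach practical compared with solving the inner maximization exactly.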

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


Learning with a Strong Adversary
Ruitong Huang and Bing Xu and Dale Schuurmans and Csaba Szepesvari
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by David Stutz 7 years ago

Huang et al. propose a variant of adversarial training called “learning with a strong adversary”. In spirit the idea is also similar to related work [1]. In particular, the authors consider the min-max objective

$\min_g \sum_i \max_{\|r^{(i)}\|\leq c} l(g(x_i + r^{(i)}), y_i)$

where $g$ ranges over expressible functions and $(x_i, y_i)$ is a training sample. In the remainder of the paper, Huang et al. address the problem of efficiently computing $r^{(i)}$ – i.e. a strong adversarial example based on the current state of the network – and subsequently updating the weights of the network by computing the gradient of the augmented loss. Details can be found in the paper.

[1] T. Miyato, S. Maeda, M. Koyama, K. Nakae, S. Ishii. Distributional Smoothing by Virtual Adversarial Training. ArXiv:1507.00677, 2015.

Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


Distributional Smoothing with Virtual Adversarial Training
Takeru Miyato and Shin-ichi Maeda and Masanori Koyama and Ken Nakae and Shin Ishii
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.LG

[link] Summary by David Stutz 7 years ago

Miyato et al. propose distributional smoothing (or virtual adversarial training) as a defense against adversarial examples. However, I think that neither term gives a good intuition of what is actually done. Essentially, a regularization term is introduced. Letting $p(y|x,\theta)$ be the learned model, the regularizer is expressed as

$\text{KL}(p(y|x,\theta) \,\|\, p(y|x+r,\theta))$

where $r$ is the perturbation that maximizes the Kullback-Leibler divergence above, i.e.

$r = \arg\max_r \{\text{KL}(p(y|x,\theta) \,\|\, p(y|x+r,\theta)) : \|r\|_2 \leq \epsilon\}$

with hyper-parameter $\epsilon$. Essentially, the regularizer is supposed to “simulate” adversarial training – thus, the method is also called virtual adversarial training.

The discussed implementation, however, is somewhat cumbersome. In particular, $r$ cannot be computed using first-order methods as the gradient of $\text{KL}$ is $0$ for $r = 0$. So a second-order method is used – for which the Hessian needs to be approximated and the corresponding eigenvectors need to be computed. For me it is unclear why $r$ cannot be initialized randomly to solve this issue … Then, the derivative of the regularizer needs to be computed during training. Here, the authors make several simplifications (such as fixing $\theta$ in the first part of the Kullback-Leibler divergence and ignoring the derivative of $r$ w.r.t. $\theta$).

Overall, however, I like the idea of “virtual” adversarial training as it avoids the need of explicitly using attacks during training to craft adversarial examples. With explicit attacks, the trained model is often robust against the chosen attacks, but new adversarial examples can be found easily through novel attacks.

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).


The Limitations of Deep Learning in Adversarial Settings
Nicolas Papernot and Patrick McDaniel and Somesh Jha and Matt Fredrikson and Z. Berkay Celik and Ananthram Swami
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CR, cs.LG, cs.NE, stat.ML

[link] Summary by David Stutz 7 years ago

Papernot et al. introduce a novel attack on deep networks based on so-called adversarial saliency maps that are computed independently of a loss. Specifically, they consider – for a given network $F(X)$ – the forward derivative

$\nabla F = \frac{\partial F}{\partial X} = \left[\frac{\partial F_j(X)}{\partial x_i}\right]_{i,j}$.

Essentially, this is the regular derivative of $F$ with respect to its input; Papernot et al. seem to refer to it as the “forward” derivative as it stands in contrast with regular backpropagation, where the derivative of the loss with respect to the parameters is considered. They define an adversarial saliency map by considering

$S(X, t)_i = \begin{cases}0 & \text{ if } \frac{\partial F_t(X)}{\partial X_i} < 0 \text{ or } \sum_{j\neq t} \frac{\partial F_j(X)}{\partial X_i} > 0\\ \left(\frac{\partial F_t(X)}{\partial X_i}\right) \left| \sum_{j \neq t} \frac{\partial F_j(X)}{\partial X_i}\right| & \text{ otherwise}\end{cases}$

where $t$ is the target class of the attack. The intuition of this definition is the following: The partial derivative of $F_t$ with respect to $X$ at location $i$ indicates how $X_i$ can be changed in order to increase $F_t$ (which is the goal). At the same time, $F_j$ for all $j \neq t$ is supposed to decrease for the targeted attack; this is implemented using the second (absolute) term. If, at a specific feature $X_i$, an increase of $X_i$ will not lead to an increase of $F_t$, or an increase will also lead to an increase in the other $F_j$, the saliency map is zero – indicating that feature $i$ is useless. Note that here, only increases in $X_i$ are considered; Papernot et al. have an analogous formulation for considering decreases of $X_i$.
Based on the concept of adversarial saliency maps, a simple attack is implemented as illustrated in Algorithm 1. In particular, the feature $X_i$ for which the saliency map $S(X, t)$ is maximized is chosen and increased by a fixed amount until the network $F$ changes the label to $t$ or a maximum perturbation is reached (in which case the attack fails).

https://i.imgur.com/PvJv9yS.png
Algorithm 1: The proposed algorithm for generating adversarial examples, see text for details.

In experiments on MNIST they show the effectiveness of the proposed attack. Additionally, they attempt to quantify the robustness (called “hardness”) of specific classes. In particular, they show that some classes are harder to attack than others. To this end they derive the so-called adversarial distance

$A(X, t) = 1 - \frac{1}{M}\sum_i 1_{[S(X, t)_i > 0]}$

which is one minus the fraction of features in the adversarial saliency map that are greater than zero (i.e. that can be perturbed during the attack in Algorithm 1), so classes with fewer usable features are harder to attack. Personally, I find this “hardness” measure quite interesting because it is independent of a specific loss, but directly takes statistics of the learned model into account.
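Given a forward derivative, the saliency map and the adversarial distance are only a few array operations. The sketch below assumes the Jacobian is already available as a NumPy array and that $M$ is the number of input features; the function names and the example numbers are hypothetical.

```python
import numpy as np

def adversarial_saliency_map(jacobian, target):
    """jacobian[j, i] = dF_j / dX_i, shape (n_classes, n_features)."""
    d_target = jacobian[target]
    d_others = np.delete(jacobian, target, axis=0).sum(axis=0)
    # A feature is useless if increasing it lowers F_t or raises the other classes.
    useless = (d_target < 0) | (d_others > 0)
    return np.where(useless, 0.0, d_target * np.abs(d_others))

def adversarial_distance(jacobian, target):
    """A(X, t) = 1 - (fraction of features with positive saliency)."""
    saliency = adversarial_saliency_map(jacobian, target)
    return 1.0 - np.mean(saliency > 0)

# Hypothetical forward derivative for 3 classes and 3 input features.
J = np.array([[0.5, -0.2, 0.3],
              [-0.4, 0.1, 0.2],
              [-0.1, 0.3, -0.1]])
S = adversarial_saliency_map(J, target=0)   # only feature 0 is usable
A = adversarial_distance(J, target=0)
```

In the attack itself, the feature with the largest saliency value would be increased and the map recomputed, repeating until the predicted label becomes $t$ or the perturbation budget is exhausted.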

Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/).


Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
Nicolas Papernot and Patrick McDaniel and Xi Wu and Somesh Jha and Ananthram Swami
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CR, cs.LG, cs.NE, stat.ML

[link] Summary by David Stutz 7 years ago

Papernot et al. build upon the idea of network distillation [1] and propose a simple mechanism to defend networks against adversarial attacks. The main idea of distillation – originally introduced to “distill” the knowledge of very deep networks into smaller ones – is to train a second, possibly smaller network, with the probability distributions of the original, possibly larger network as supervision. Papernot et al. as well as the authors of [1] argue that the probability distributions, i.e. the activations of the final softmax layer (also referred to as “soft” labels), contain rich information about the task in contrast to the true “hard” labels. This allows the network to achieve similar performance while using fewer parameters or a different architecture.

However, Papernot et al. do not distill a network's knowledge into a smaller one; instead they use distillation to make networks robust against adversarial attacks. They argue that most algorithms to generate adversarial examples make use of the “adversarial gradient”; i.e. the gradient of the network's cost w.r.t. its input. The adversarial gradient then guides perturbation of the input image in the direction of wrong classes (the authors consider a simple classification task for simplicity). Therefore, Papernot et al. argue, the gradient around training samples needs to be reduced – in other words, the model needs to be smoothed.

https://i.imgur.com/jXIhIGz.png

The proposed approach is very simple: they just distill the knowledge of the network into another network with the same architecture and hyper-parameters. By using the probability distributions as “soft” labels instead of the hard labels for training, the network is essentially smoothed. The full procedure is illustrated in Figure 1.

Despite the simplicity of the approach, I want to highlight some additional key observations:
- Distillation is also supposed to help generalization by avoiding overly confident networks.
- The success rate of adversarial attacks can be reduced significantly as shown in quantitative experiments.
- The amplitude of adversarial gradients can be reduced, which means that the network has been smoothed and is less sensitive to variations in the input samples.

Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/).


Autoencoding beyond pixels using a learned similarity metric
Anders Boesen Lindbo Larsen and Søren Kaae Sønderby and Hugo Larochelle and Ole Winther
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV, stat.ML

[link] Summary by CodyWild 7 years ago

Variational Autoencoders are a type of generative model that seek to learn how to generate new data by incentivizing the model to be able to reconstruct input data, after compressing it to a low-dimensional space. Typically, the way that the reconstruction is scored against the original is by comparing the pixel by pixel values: a reconstruction gets a high score if it is able to place pixels of color in the same places that the original did. However, there are compelling reasons why this is a sub-par way of scoring images.

The central one is: it focuses on and penalizes superficial differences, so if the model accurately reproduces the focal object of the image, but does so, say, 10 pixels to the right of where it was previously, that will incur a penalty we might not actually want to apply. The flip side of this is that a direct pixel-comparison loss doesn’t differentiate between pixel differences that do or don’t change the fundamental substance of the image. For instance, having 100 pixels wrong around the border of a dog, making it seem very slightly larger, would be the same amount of error as having 100 pixels concentrated in a weird bulb that appears to be growing out of a dog’s ear, even though the former does a better job of being recognizable as a dog.

The authors of the VAE/GAN paper have a clever approach to solving this problem, that involves taking the typical pixel loss, and breaking it up into two conceptual parts.
The first focuses on aligning the conceptual features of the reconstructed image with the conceptual features of the input image. It does so by running both the input and the reconstruction through a discriminative convolutional model which - in the typical way of deep learning - learns ever more abstract features at each layer of the network. These “conceptual features” abstract out the precise pixel values, and instead capture the higher level features of the image. So, instead of calculating the pixelwise squared loss between the specific input x, and its after-bottleneck reconstruction x~, you take the squared loss between the feature maps at some layer for both x and x~, and push them to be closer together, so that the reconstruction shares the same features as the original.
The second focuses on detail-level specifics of images, but, cleverly, does so in a general, rather than an observation-specific way. This is done by training a GAN-style discriminator to tell the difference between generated images and original images, and then using that loss to train the decoder part of the VAE. The cleverness of this comes from the fact that they are still enforcing that the details and structural features of the reconstructed image are not distinguishable from real images, but doing so in a general sense, rather than requiring the details to be an exact match to the details found in a given input x.

https://i.imgur.com/Bmtmac2.png

The authors freely admit that existing metrics of scoring images (which themselves *use* pixelwise similarity) rate their method as being worse than existing VAEs. However, they argue, that’s inherently a flawed metric, that doesn’t capture the aspects of clean visual quality we want in generated images. A metric they propose instead involves using a dataset where a list of attributes is attached to each image (old, black, blond, etc). They add these as additional input while training the network, so that whatever signals the decoder part of the model needs to turn someone blonde, it gets those from the externally-given attribute vector, rather than a learned representation. This means that, once the model is trained, we can set some value of the attribute vector, and have the decoder generate samples conditional on that. The metric is constructed by taking the decoded samples conditioned on some attribute set, and then taking a classifier model that is trained on the real images to detect attribute values from the images. The generated images are then scored by how closely the predictions from the classifier model match the true values of the attributes. If the generator model were working perfectly, this error rate would be as low as for real data. By this metric (which: grain of salt, since they invented it), the VAE/GAN model is superior to both GANs and vanilla VAEs.


Deep Linear Discriminant Analysis
Matthias Dorfer and Rainer Kelz and Gerhard Widmer
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by Anonymous 7 years ago

There are 2 implementations for the paper:

 1. [Reference Implementation of Deep Linear Discriminant Analysis (DeepLDA)](https://github.com/CPJKU/deep_lda).
 2. [VahidooX/DeepLDA](https://github.com/VahidooX/DeepLDA).

[It seems something is wrong with the cost function implemented](https://github.com/VahidooX/DeepLDA/issues/1#issuecomment-392261355).
Also, while the authors derive the gradient, they did not verify it, and the implementation relies on Theano's automatic differentiation instead (other automatic differentiation tools reportedly cannot handle it).


Prioritized Experience Replay
Tom Schaul and John Quan and Ioannis Antonoglou and David Silver
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by Tianxiao Zhao 7 years ago

this paper: develop a framework to replay important transitions more frequently -> learn efficiently

prior work: uniformly sample a replay memory to get experience transitions
 
evaluate: DQN + PER outperforms DQN on 41 out of 49 Atari games

## Introduction

**issues with online RL:** (solution: experience replay) 

1. strongly correlated updates that break the i.i.d. assumption
2. rapid forgetting of rare experiences that could be useful later

**key idea:** 

more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error

**issues with prioritization:**

1. loss of diversity -> alleviate with stochastic prioritization
2. introduce bias -> correct with importance sampling

## Prioritized Replay

**criterion:**

- the amount the RL agent can learn from a transition in its current state (expected learning progress) -> not directly accessible
- proxy: the magnitude of a transition’s TD error ~= how far the value is from its next-step bootstrap estimate

**stochastic sampling:**

$$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

*p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used.

*two variants:*

1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (small positive constant to avoid zero prob)
2. rank-based prioritization: *p_i* = 1/rank(i); **more robust as it is insensitive to outliers**

https://i.imgur.com/T8je5wj.png

**importance sampling:**

IS weights: 

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta $$

- weights can be folded into the Q-learning update by using $w_i*\delta_i$ instead of $\delta_i$
- weights normalized by $\frac{1}{\max w_i}$
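The sampling rule and the importance-sampling correction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names, the example TD errors and the values of `alpha`, `beta` and `epsilon` are assumptions chosen for the demo, and a real replay buffer would use a sum-tree rather than recomputing probabilities.

```python
import numpy as np

def proportional_priorities(td_errors, epsilon=1e-6):
    # p_i = |TD error_i| + epsilon (epsilon value is an illustrative assumption)
    return np.abs(td_errors) + epsilon

def rank_based_priorities(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the transition with the largest |TD error|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(td_errors), dtype=int)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling
    scaled = np.asarray(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

def importance_weights(probs, beta):
    # w_i = (1/N * 1/P(i))^beta, then normalized by 1 / max_i w_i
    n = len(probs)
    weights = (1.0 / (n * probs)) ** beta
    return weights / weights.max()

td_errors = np.array([0.1, -2.0, 0.5, 0.0])
probs = sampling_probabilities(proportional_priorities(td_errors), alpha=0.6)
weights = importance_weights(probs, beta=0.4)

# Draw a minibatch of transition indices according to P(i)
rng = np.random.default_rng(0)
batch = rng.choice(len(td_errors), size=2, p=probs)
```

The transition with the largest absolute TD error gets the highest sampling probability and, correspondingly, the smallest importance weight.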


Early Inference in Energy-Based Models Approximates Back-Propagation
Yoshua Bengio and Asja Fischer
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by Peter O'Connor 7 years ago

# Very Short

The authors define a neural network as a nonlinear dynamical system whose fixed points correspond to the minima of some **energy function**.  They then show that if one were to start at a fixed point and *perturb* the output units in the direction that minimizes a loss, the initial perturbation that would flow back through the network would be proportional to the gradient of this loss with respect to the neural activations.  Thus, the initial propagation of those perturbations (i.e. **early inference**) **approximates** the **backpropagated** gradients of the loss.


Trust Region Policy Optimization
Schulman, John and Levine, Sergey and Moritz, Philipp and Jordan, Michael I. and Abbeel, Pieter
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by biedenka 7 years ago

The authors present an iterative approach for optimizing policies with guaranteed monotonic improvement.
TRPO is similar to natural policy gradient methods and can be applied effectively in optimization of large nonlinear policies.

\cite{KakadeL02} gave monotonic improvement guarantees for mixture policies $\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s) + \alpha\pi'(a|s)$, where $\pi'=\mathrm{arg}\max_{\pi'}L_{\pi_{old}}(\pi')$ and $L_{\pi_{old}}(\pi')$ is the approximated expected return of a policy $\pi'$ in terms of the advantage over $\pi_{old}$. The guarantee reads $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2$, with $\eta$ the true expected return and $\epsilon$ the maximum expected advantage.

The authors extend this approach to be applicable for all stochastic policy classes by replacing $\alpha$ with a distance measure between two policies $\pi_{new}$ and $\pi_{old}$.
As distance measure they use the maximal Kullback–Leibler divergence $D_{KL}^{\max}(\pi_{new},\pi_{old})$ and show that $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new}) -CD_{KL}^{\max}(\pi_{new},\pi_{old})$, with $C= \frac{4\epsilon\gamma}{(1-\gamma)^2}$.

From this it follows that one is guaranteed to improve the true objective $\eta$ when performing the following maximization: $\mathrm{maximize}_\pi\left[L_{\pi_{old}}(\pi)-CD_{KL}^{\max}(\pi,\pi_{old})\right]$. In practice, however, the penalty coefficient $C$ would only allow for small steps. Constraining the problem instead, i.e. $\mathrm{maximize}_\pi L_{\pi_{old}}(\pi)$ subject to $D_{KL}^{\max}(\pi,\pi_{old}) \leq \delta$, allows for larger steps within a **Trust Region**.

Due to the large number of constraints, this problem is impractical to solve, which is why the authors replace the maximum KL divergence with an approximate average KL divergence.

TRPO then works as follows:
 1. Use a rollout procedure to collect a set of state-action pairs with Monte Carlo estimates of their $Q$-values.
 2. Average over the samples to construct the estimated objective $L_{\pi}$ as well as the constraint.
 3. Approximately solve the constrained optimization problem to update the policy parameters.
 They use the conjugate gradient algorithm followed by a line search.
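As a rough illustration of step 3, conjugate gradient can solve $Fx = g$ for a search direction using only matrix-vector products, where $F$ stands in for the (approximate) Hessian of the average KL and $g$ for the gradient of the objective. The sketch below is generic and not the authors' code: the 2x2 matrix, the gradient vector and `delta = 0.01` are made-up values for the demo, and the final scaling uses the quadratic approximation $\frac{1}{2}s^\top F s = \delta$ to obtain the largest step that a subsequent line search would then shrink if needed.

```python
import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b, given only the product v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x (x starts at zero)
    p = r.copy()          # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy stand-ins (assumptions): a small positive-definite "Fisher" matrix and a gradient
F = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
direction = conjugate_gradient(lambda v: F @ v, g)

# Scale the direction so the quadratic KL approximation equals delta
delta = 0.01
full_step = np.sqrt(2.0 * delta / (direction @ F @ direction)) * direction
```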

Their experiments support the claim that TRPO is able to effectively optimize large nonlinear policies.


Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess and Greg Wayne and David Silver and Timothy Lillicrap and Yuval Tassa and Tom Erez
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.NE

[link] Summary by tom89 7 years ago

This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and deal with stochastic environment models.

Value gradients are a type of policy gradient algorithm which represent a value function either by:
* A learned Q-function (a critic)
* Linking together a policy, an environment model and reward function to define a recursive function to simulate the trajectory and the total return from a given state.

By backpropagating though these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (like REINFORCE for example) which are model-free and sample returns from the real environment.

Applying value gradients to stochastic problems requires differentiating the stochastic Bellman equation:
\begin{equation} 
V^t (s) = \int \left[ r^t + \gamma \int  V^{t+1} (s') p(s' | s, a) ds'  \right] p(a|s; \theta) da
\end{equation}

To do that, the authors use a trick called re-parameterisation to express the stochastic Bellman equation as a deterministic function which takes a noise variable as an input. To differentiate a re-parameterised function, one simply samples the noise variable and then computes the derivative as if the function were deterministic. This can be repeated $M$ times and averaged to arrive at a Monte Carlo estimate for the derivative of the stochastic function.
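A toy example may help make the re-parameterisation trick concrete. Assume (purely for illustration, this is not from the paper) a scalar random variable $x = \theta + \sigma\eta$ with $\eta \sim \mathcal{N}(0, 1)$ and a function $f(x) = x^2$. Then $\mathbb{E}[f(x)] = \theta^2 + \sigma^2$, so the true derivative with respect to $\theta$ is $2\theta$, and the Monte Carlo estimate should land close to it.

```python
import numpy as np

def reparam_gradient(theta, sigma, num_samples, rng):
    """Monte Carlo estimate of d/dtheta E[f(theta + sigma * noise)] for f(x) = x^2."""
    noise = rng.standard_normal(num_samples)   # sample the noise variable
    x = theta + sigma * noise                  # deterministic function of (theta, noise)
    per_sample_grad = 2.0 * x                  # df/dtheta with the noise held fixed
    return per_sample_grad.mean()              # average over the M samples

rng = np.random.default_rng(0)
estimate = reparam_gradient(theta=1.5, sigma=0.5, num_samples=100_000, rng=rng)
true_gradient = 2.0 * 1.5
```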

The re-parameterised Bellman equation is: 

$ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) }  \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right]  \right] $

Its derivatives with respect to the current state and the policy parameters are:

$ V_\textbf{s} = \mathbb{E}_{\rho(\eta)} \left[ r_\textbf{s} + r_\textbf{a} \pi_\textbf{s} + \gamma \mathbb{E}_{\rho(\xi)} V'_{\textbf{s}'} (\textbf{f}_\textbf{s} + \textbf{f}_\textbf{a} \pi_\textbf{s}) \right] $

$ V_\theta = \mathbb{E}_{\rho(\eta)} \left[ r_\textbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{\textbf{s}'} \textbf{f}_\textbf{a} \pi_\theta + V'_\theta \right] \right] $

Based on these relationships the authors define two algorithms: SVG(∞) and SVG(1).

* SVG(∞) takes the trajectory from an entire episode and, starting at the terminal state, accumulates the gradients $V_{\textbf{s}} $ and $ V_{\theta} $ using the expressions above to arrive at a policy gradient. SVG(∞) is on-policy and only works with finite-horizon environments.

* SVG(1) trains a value function then uses its gradient as an estimate for $ V_{\textbf{s}} $ above. SVG(1) also uses importance weighting so as to be off-policy and can work with infinite-horizon environments.

Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0), which is similar to SVG(1) but is model-free.

SVG was analysed using several MuJoCo environments and it was found that:
* SVG(∞) outperformed a BBPT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments


On Using Monolingual Corpora in Neural Machine Translation
Gülçehre, Çaglar and Firat, Orhan and Xu, Kelvin and Cho, Kyunghyun and Barrault, Loïc and Lin, Huei-Chi and Bougares, Fethi and Schwenk, Holger and Bengio, Yoshua
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors extend a seq2seq model for MT with a language model. They first pre-train a seq2seq model and a neural language model, then train a separate feedforward component that takes the hidden states from both and combines them together to make a prediction. They compare to simply combining the output probabilities from both models (shallow fusion) and show improvement on different MT datasets.

https://i.imgur.com/zD9jb4K.png


Simultaneous Deep Transfer Across Domains and Tasks
Tzeng, Eric and Hoffman, Judy and Darrell, Trevor and Saenko, Kate
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by robromijnders 7 years ago

# Simultaneous Deep transfer across domains and tasks
## Tzeng, Hoffman, Darrell, Saenko, 2015

* The paper aims to exploit unlabeled and sparsely labeled data from the target domain.
* As a baseline, they mention that one could match feature distributions between source and target domain. This work will also explore correlation between categories, such as _bottle_ and _mug._
* The authors derive inspiration from the _Name the dataset_ game by Torralba and Efros. In this game, you train a classifier to predict which dataset an image originates from. This idea carries over into the domain confusion loss. The domain classifier measures the confusion between learned features from source and target domain. The image classifier learns a feature representation that makes the domains indistinguishable, as measured by the domain confusion.
* The second idea also learns the similarity structure between objects in the target domain. This works as follows. _We first compute the average output probability distribution, or “softlabel,” over the source training examples in each category. Then, for each target labeled example, we directly optimize our model to match the distribution over classes to the soft label. In this way we are able to perform task adaptation by transferring information to categories with no explicit labels in the target domain._
* The experiments take place in two situations. The _supervised_ case, where only a few labels are present in the target domain. The _semi supervised_ case, where only a few labels for a subset of the classes are present.
* In the final section, the authors analyse their own results. They show how the image classifier correctly labeled monitor, while no labels for monitor were present in the target domain.


Learning to Diagnose with LSTM Recurrent Neural Networks
Lipton, Zachary Chase and Kale, David C. and Elkan, Charles and Wetzel, Randall C.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Tiago Vinhoza 7 years ago

#### Goal
+ Predict 128 diagnoses for intensive pediatric care patients.

#### Dataset:

+ Children's Hospital LA.
+ Episode is a multivariate time series that describes the stay of one patient in the intensive care unit.

Dataset properties | Value
---------|----------
Number of episodes | 10,401
Duration of episodes | From 12h to several months
Time series variables | Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output.

+ Resampling and missing values:
    + Irregularly sampled time-series that is resampled to an hourly rate.
    + Mean measurement within each hour window is taken.
    + Forward- and back-filling are used to fill gaps created by the resampling.
    + When variable time series is missing entirely: imputation with a clinically *normal* value defined by domain experts.
    + This paper is followed by [Modeling Missing Data in Clinical Time Series with RNNs](http://www.shortscience.org/paper?bibtexKey=journals/corr/LiptonKW16) from the same research group.

+ Labels:
    + Each episode is associated with 0 or more diagnoses. (in-house taxonomy, ICD-9 based).
    + Dataset contains 429 diagnoses. The paper focuses on the 128 most frequent diagnoses that appear 50 or more times in the dataset.

#### Architecture:

+ LSTM with Target Replication:

![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_target.png?raw=true "Target Replication")

+ Loss function:
    + For the model with target replication, an output y is generated at every sequence step. The loss function is then a convex combination of the final loss (log-loss in the case of this paper) and the average of the losses over all steps, where T is the number of sequence steps and alpha is a hyperparameter.

![Loss function](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_loss.png?raw=true "Loss function")
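The convex combination described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names and example numbers are invented, and it assumes that `alpha` weights the average over all steps while `1 - alpha` weights the final step.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    # Mean binary cross-entropy over the labels of one episode
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)))

def target_replication_loss(y_true, step_probs, alpha):
    # step_probs holds one prediction vector per sequence step (T entries);
    # assumption: alpha weights the step average, (1 - alpha) the final step
    step_losses = [log_loss(y_true, probs) for probs in step_probs]
    return alpha * float(np.mean(step_losses)) + (1.0 - alpha) * step_losses[-1]

y_true = [1, 0]                                  # two diagnosis labels
step_probs = [[0.5, 0.5], [0.9, 0.1]]            # predictions at T = 2 steps
loss = target_replication_loss(y_true, step_probs, alpha=0.5)
```

With `alpha=0` the loss reduces to the usual final-step loss, and with `alpha=1` it is the plain average over all steps.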

#### Experiments and Results:

**Methodology**:
+ Split dataset: 80% training, 10% validation, 10% test
+ LSTM trained for 100 epochs via stochastic gradient descent (with momentum).
+ Regularization L2: 1e-6, obtained via validation dataset.

+ LSTM: 2 hidden layers with 64 cells or 128 cells (and 50% dropout)
+ Multiple combinations: target replication / auxiliary target variables (trained using the other 301 diagnoses and other clinical information as targets). Inferences are made only for the 128 major diagnoses.

+ Baselines for comparison:
    + Logistic Regression - L2 regularized
    + MLP with 3 hidden layers - ReLU - dropout 50%.
    + Baselines tested on the raw time series and on a feature-engineered version made by domain experts.

*Metrics*:
+ Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes.
+ Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes.
+ Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model.
+ The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient.
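To make the difference between the micro and macro variants concrete, here is a small sketch for F1 and precision at k on multilabel predictions. The function names and the toy label matrices are assumptions for illustration; the last line just reproduces the arithmetic behind the stated upper bound (2.281 diagnoses on average out of 10 predicted slots).

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    # Rows are episodes, columns are diagnosis labels (binary indicators)
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())                       # pool counts over all classes
    macro = float(np.mean([f1(a, b, c) for a, b, c in zip(tp, fp, fn)]))  # average per-class F1
    return micro, macro

def precision_at_k(y_true_row, scores_row, k=10):
    # Fraction of correct labels among the k highest-scoring predictions
    top = np.argsort(scores_row)[::-1][:k]
    return float(np.asarray(y_true_row)[top].sum() / k)

y_true = [[1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 1], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)

# Upper bound on precision at 10 when episodes have 2.281 diagnoses on average
upper_bound = 2.281 / 10
```

In the toy example the frequent class is predicted perfectly and the rare class is always missed, so the macro score (0.5) is pulled down much more than the micro score (about 0.67).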

*Results*:

![All Results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_allresults.png?raw=true "Performance metrics across all labels")

*Results for selected diagnoses*:

![Results for Selected Diseases](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/clinical-data/images/Lipton2016a_selected.png?raw=true "Performance for selected diagnoses")

#### Discussion:

+ Auxiliary outputs improve performance at the expense of increased training time. Some of the remaining 301 labels are so unbalanced that the model spends an entire epoch only learning that one of the target variables can take values other than 0.

+ Real-Time Predictions: In the future, the authors expect that the proposed solution could be used to make continuously updated real-time alerts and diagnoses.


U-Net: Convolutional Networks for Biomedical Image Segmentation
Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas
Medical Image Computing and Computer Assisted Interventions Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by nandinics 7 years ago

1. U-Net learns segmentation in an end-to-end setting.
2. The challenges they address are:
* Very few annotated images (approx. 30 per application).
* Touching objects of the same class.
# How:
* The input image is fed into the network, the data is propagated through the network along all possible paths, and at the end the segmentation map comes out.
* In U-net architecture, each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
https://i.imgur.com/Usxmv6r.png
* Each step of the contracting path applies two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step they double the number of feature channels.
* The contracting path (left side, top to bottom) increases the number of feature channels and reduces the spatial resolution, while the expansive path (right side, bottom to top) consists of a sequence of up-convolutions and concatenations with the corresponding high-resolution features from the contracting path.
* The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image.
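Because only the valid part of each convolution is used, the output map is smaller than the input tile. The following sketch tracks the spatial size through the network. It assumes the layout from the original paper's architecture figure (four downsampling steps, two unpadded 3x3 convolutions per level, 2x2 up-convolutions), which this summary does not spell out, and the function name is made up for the example.

```python
def unet_output_size(input_size, depth=4):
    """Spatial size of the segmentation map for a square input tile."""
    size = input_size
    for _ in range(depth):
        size -= 4                      # two unpadded 3x3 convs remove 2 pixels each
        if size % 2:
            raise ValueError("size before 2x2 pooling must be even")
        size //= 2                     # 2x2 max pooling with stride 2
    size -= 4                          # two convs at the bottom of the "U"
    for _ in range(depth):
        size *= 2                      # 2x2 up-convolution doubles the resolution
        size -= 4                      # two unpadded 3x3 convs
    return size

out = unet_output_size(572)
```

For a 572x572 input tile this gives a 388x388 output, which is why the overlap-tile strategy needs extra context around each tile.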

## Challenges:
1. Overlap-tile strategy for seamless segmentation of arbitrarily large images:
* To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.
* In the figure, segmentation of the yellow area uses input data of the blue area, with missing input data extrapolated by mirroring.
https://i.imgur.com/NUbBRUG.png
2. Augment training data using deformation:
* They use excessive data augmentation by applying elastic deformations to the available training images.
* This allows the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus.
* Deformation is the most common variation in tissue, and realistic deformations can be simulated efficiently.
https://i.imgur.com/CyC8Hmd.png
3. Segmentation of touching objects of the same class:
* They propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.
* To ensure separation of touching objects, the training segmentation mask (with background inserted between touching objects) is used to compute a loss weight for each pixel.
https://i.imgur.com/ds7psDB.png
4. Segmentation of neuronal structures in electron microscopy (EM):
* This challenge has been ongoing since ISBI 2012; the dataset contains structures with low contrast, fuzzy membranes and other cell components.
* The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black).
* An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the warping error, the Rand error and the pixel error.

### Results:
* The u-net (averaged over 7 rotated versions of the input data) achieves, without any further pre- or post-processing, a warping error of 0.0003529, a Rand error of 0.0382 and a pixel error of 0.0611.
https://i.imgur.com/6BDrByI.png
* In the ISBI cell tracking challenge 2015, one of the datasets (cells in phase contrast microscopy) has strong shape variations, weak outer borders, strong irrelevant inner borders, and cytoplasm with the same structure as the background.
https://i.imgur.com/vDflYEH.png
* The first data set, PHC-U373, contains Glioblastoma-astrocytoma U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy. It contains 35 partially annotated training images. Here the u-net achieves an average IOU ("intersection over union") of 92%, which is significantly better than the second best algorithm with 83%.
https://i.imgur.com/of4rAYP.png
* The second data set, DIC-HeLa, contains HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy. It contains 20 partially annotated training images. Here the u-net achieves an average IOU of 77.5%, which is significantly better than the second best algorithm with 46%.
https://i.imgur.com/Y9wY6Lc.png


FlowNet: Learning Optical Flow with Convolutional Networks
Dosovitskiy, Alexey and Fischer, Philipp and Ilg, Eddy and Häusser, Philip and Hazirbas, Caner and Golkov, Vladimir and van der Smagt, Patrick and Cremers, Daniel and Brox, Thomas
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Anonymous 7 years ago

**Summary**:
A CNN is employed to estimate optical flow. The task is defined as a supervised learning problem. 2 architectures are proposed: a generic one (FlowNetSimple); and another including a correlation layer for feature vectors at different image locations (FlowNetCorr). Networks consist of contracting and expanding parts and are trained as a whole using back-propagation. The correlation layer in FlowNetCorr finds correspondences between feature representations of 2 images instead of following the standard matching approach of extracting features from patches of both images and then comparing them.

https://i.imgur.com/iUe8ir3.png

**Approach**:
1. *Contracting part*: Their first choice is to stack both input images together and feed them through a rather generic network, allowing the network to decide itself how to process the image pair to extract the motion information. This is called 'FlowNetSimple' and consists only of convolutional layers.
The second approach 'FlowNetCorr' is to create two separate, identical processing streams for the two images and to combine them at a later stage. The two architectures are illustrated above. The 'correlation layer' performs multiplicative patch comparisons between two feature maps. The correlation of two patches centered at $x_1$ in the first map and $x_2$ in the second map is then defined as:
$$c(x_1,x_2) =\sum_{o\in[-k,k]\times[-k,k]} \langle f_{1}(x_{1}+o),f_{2}(x_{2}+o) \rangle $$
for a square patch of size $K = 2k+1$.
2. *Expanding part*: It consists of upconvolutional layers (a combination of unpooling and convolution). 'Upconvolution' is applied to feature maps, and the result is concatenated with the corresponding feature maps from the 'contractive' part of the network.
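The patch correlation from the contracting part can be written out directly. The sketch below evaluates $c(x_1, x_2)$ for a single pair of patch centres with plain loops. It is an illustration, not the layer used in the paper: the `(H, W, C)` array layout and the function name are assumptions, and both centres are assumed to lie at least `k` pixels away from the border.

```python
import numpy as np

def patch_correlation(f1, f2, x1, x2, k):
    """Sum of feature dot products over a (2k+1) x (2k+1) patch.

    f1, f2: feature maps of shape (H, W, C); x1, x2: (row, col) patch centres.
    """
    total = 0.0
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            a = f1[x1[0] + dy, x1[1] + dx]   # feature vector f1(x1 + o)
            b = f2[x2[0] + dy, x2[1] + dx]   # feature vector f2(x2 + o)
            total += float(a @ b)
    return total

f1 = np.ones((5, 5, 2))
f2 = 2.0 * np.ones((5, 5, 2))
c = patch_correlation(f1, f2, x1=(2, 2), x2=(2, 2), k=1)
```

Note that the operation has no trainable weights; it only compares the two feature maps.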

**Experiments**:
A new dataset called 'Flying Chairs' is created with $22,872$ image pairs and flow fields. It is created by applying affine transformations to images collected from Flickr and a publicly available set of renderings of 3D chair models [1]. Results are reported on Sintel, KITTI, Middlebury datasets, as well as on their synthetic Flying Chairs dataset. The proposed method is compared with different methods: EpicFlow [2], DeepFlow [3], EDPM [4], and LDOF [5].
The authors inferred that even though the number of parameters of the two networks (FlowNetS and FlowNetC) is virtually the same, FlowNetC overfits slightly more to the training data.

The architecture has nine convolutional layers with a stride of $2$ in six of them and a $ReLU$ nonlinearity after each layer. As training loss, the endpoint error (EPE) is used, which is the standard error measure for optical flow estimation. The figure below shows examples of optical flow prediction on the Sintel dataset. The endpoint error is also shown.

https://i.imgur.com/xIRpUZQ.png
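The endpoint error is the Euclidean distance between the predicted and the ground-truth flow vector, averaged over all pixels. A minimal sketch, assuming flow fields stored as `(H, W, 2)` arrays (the layout and function name are choices made for this example):

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    # Per-pixel Euclidean distance between flow vectors, averaged over the image
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

flow_pred = np.zeros((2, 2, 2))
flow_gt = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
epe = endpoint_error(flow_pred, flow_gt)
```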

**Scope for Improvement**: It would be interesting to see the performance of the network on more realistic data.

**References**:

[1] Aubry, Mathieu, et al. "Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

[2] Revaud, Jerome, et al. "Epicflow: Edge-preserving interpolation of correspondences for optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[3] Weinzaepfel, Philippe, et al. "DeepFlow: Large displacement optical flow with deep matching." Proceedings of the IEEE International Conference on Computer Vision. 2013.

[4] Bao, Linchao, Qingxiong Yang, and Hailin Jin. "Fast edge-preserving patchmatch for large displacement optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.

[5] Brox, Thomas, and Jitendra Malik. "Large displacement optical flow: descriptor matching in variational motion estimation." IEEE transactions on pattern analysis and machine intelligence 33.3 (2011): 500-513.


Multi-view Face Detection Using Deep Convolutional Neural Networks
Farfade, Sachin Sudhakar and Saberian, Mohammad J. and Li, Li-Jia
ACM ICMR - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Alexander Jung 7 years ago

* They propose a CNN-based approach to detect faces in a wide range of orientations using a single model. However, since the training set is skewed, the network is more confident about up-right faces.
* The model does not require additional components such as segmentation, bounding-box regression, or SVM classifiers.

### How
* __Data augmentation__: to increase the number of positive samples (24K face annotations), the authors used randomly sampled sub-windows of the images with IOU > 50% and also randomly flipped these images. In total, there were 20K positive and 20M negative training samples.
* __CNN Architecture__: 5 convolutional layers followed by 3 fully-connected. The fully-connected layers were converted to convolutional layers. Non-Maximal Suppression is applied to merge predicted bounding boxes.
* __Training__: the CNN was trained using the Caffe library on the AFLW dataset with the following parameters:
    * Fine-tuning with AlexNet model
    * Input image size = 227x227
    * Batch size = 128 (32+, 96-)
    * Stride = 32
* __Test__: the model was evaluated on the PASCAL FACE, AFW, and FDDB datasets.
* __Running time__: since the fully-connected layers were converted to convolutional layers, the input image at test time may be of any size, producing a heat map as output. To detect faces of different sizes, though, the image is scaled up/down and new heatmaps are obtained. The authors found that rescaling the image 3 times per octave gives reasonably good performance.
![DDFD heatmap](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/DDFD__heatmap.png?raw=true "DDFD heatmap")
* The authors realized that the model is more confident about up-right faces than rotated/occluded ones. This is due to the lack of good training examples representing such faces in the training process. Better results can be achieved by using better sampling strategies and more sophisticated data augmentation techniques.
![DDFD example](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/DDFD__example.png?raw=true "DDFD example")
* The authors tested different strategies for NMS and the effect of bounding-box regression for improving face detection. NMS-avg had better performance than NMS-max in terms of average precision. On the other hand, adding a bounding-box regressor degraded the performance for both NMS strategies due to the mismatch between annotations of the training set and the test set. This mismatch is mostly for side-view faces.
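For readers unfamiliar with non-maximal suppression, the sketch below shows a generic greedy variant built on IOU: the highest-scoring box is kept and boxes overlapping it too much are dropped. It roughly corresponds to the "max" flavour mentioned above, but it is not the authors' implementation, and the box format `(x1, y1, x2, y2)`, the function names and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_max(boxes, scores, iou_threshold=0.5):
    # Visit boxes from highest to lowest score; keep a box only if it does not
    # overlap an already kept box by more than the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_max(boxes, scores)
```

An averaging variant would instead merge each group of overlapping boxes into one box, which is where the two strategies differ.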

### Results:
* In comparison to R-CNN, the proposed face detector had significantly better performance independent of the NMS strategy. The authors attribute the inferior performance of R-CNN to a loss of recall, since selective search may miss some of the face regions, and to a loss in localization, since bounding-box regression is not perfect and may not be able to fully align the bounding boxes provided by selective search with the ground truth.
* In comparison to other state-of-the-art methods such as the structural model, TSM and cascade-based methods, DDFD achieves similar or better results. However, this comparison is not completely fair, since most of these methods use extra information such as pose annotations or facial landmarks during training.


Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arné Clevert and Thomas Unterthiner and Sepp Hochreiter
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG

[link] Summary by Alexander Jung 7 years ago

  * ELUs are an activation function
  * They are most similar to LeakyReLUs and PReLUs

### How (formula)
  * f(x):
    * `if x >= 0: x`
    * `else: alpha(exp(x)-1)`
  * f'(x) / Derivative:
    * `if x >= 0: 1`
    * `else: f(x) + alpha`
  * `alpha` defines at which negative value the ELU saturates.
  * E. g. `alpha=1.0` means that the minimum value that the ELU can reach is `-1.0`
  * LeakyReLUs, however, can go to `-Infinity`, while ReLUs can't go below 0.
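The formula and its derivative translate directly into NumPy. This is a small sketch with assumed function names; `expm1` is used for `exp(x) - 1`, and the input is clamped before exponentiation only to avoid overflow warnings for large positive values.

```python
import numpy as np

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    # x for x >= 0, alpha * (exp(x) - 1) otherwise
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    # 1 for x >= 0, f(x) + alpha otherwise (which equals alpha * exp(x))
    return np.where(x >= 0, 1.0, elu(x, alpha) + alpha)

values = elu(np.array([-1000.0, -1.0, 0.0, 2.0]))
grads = elu_grad(np.array([-1.0, 3.0]))
```

For very negative inputs the output approaches `-alpha`, which is the saturation value described above.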

![ELUs vs LeakyReLUs vs ReLUs](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/ELUs__slopes.png?raw=true "ELUs vs LeakyReLUs vs ReLUs")

*Form of ELUs(alpha=1.0) vs LeakyReLUs vs ReLUs.*


### Why
  * They derive from the unit natural gradient that a network learns faster, if the mean activation of each neuron is close to zero.
  * ReLUs can go above 0, but never below. So their mean activation will usually be quite a bit above 0, which should slow down learning.
  * ELUs, LeakyReLUs and PReLUs all have negative slopes, so their mean activations should be closer to 0.
  * In contrast to LeakyReLUs and PReLUs, ELUs saturate at a negative value (usually -1.0).
  * The authors think that is good, because it lets ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence.
  * So ELUs can measure the presence of concepts quantitatively, but the absence only qualitatively.
  * They think that this makes ELUs more robust to noise.

### Results
  * In their tests on MNIST, CIFAR-10, CIFAR-100 and ImageNet, ELUs perform (nearly always) better than ReLUs and LeakyReLUs.
  * However, they don't test PReLUs at all and use an alpha of 0.1 for LeakyReLUs (even though 0.33 is afaik standard) and don't test LeakyReLUs on ImageNet (only ReLUs).

![CIFAR-100](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/ELUs__cifar100.png?raw=true "CIFAR-100")

*Comparison of ELUs, LeakyReLUs, ReLUs on CIFAR-100. ELUs end up with the best values, but are beaten during the early epochs by LeakyReLUs. (Learning rates were optimized for ReLUs.)*

-------------------------

### Rough chapter-wise notes

* Introduction
  * Currently popular choice: ReLUs
  * ReLU: max(0, x)
  * ReLUs are sparse and avoid the vanishing gradient problem, because their derivative is 1 when they are active.
  * ReLUs have a mean activation larger than zero.
  * Non-zero mean activation causes a bias shift in the next layer, especially if multiple of them are correlated.
  * The natural gradient (?) corrects for the bias shift by adjusting the weight update.
  * Having less bias shift would bring the standard gradient closer to the natural gradient, which would lead to faster learning.
  * Suggested solutions:
    * Centering activation functions at zero, which would keep the off-diagonal entries of the Fisher information matrix small.
    * Batch Normalization
    * Projected Natural Gradient Descent (implicitly whitens the activations)
  * These solutions have the problem, that they might end up taking away previous learning steps, which would slow down learning unnecessarily.
  * Choosing a good activation function would be a better solution.
  * Previously, tanh was preferred over sigmoid for that reason (pushed mean towards zero).
  * Recent new activation functions:
    * LeakyReLUs: x if x > 0, else alpha*x
    * PReLUs: Like LeakyReLUs, but alpha is learned
    * RReLUs: Slope of part < 0 is sampled randomly
  * Such activation functions with non-zero slopes for negative values seemed to improve results.
  * The deactivation state of such units is not very robust to noise, can get very negative.
  * They suggest an activation function that can return negative values, but quickly saturates (for negative values, not for positive ones).
  * So the model can make a quantitative assessment for positive statements (there is an amount X of A in the image), but only a qualitative negative one (something indicates that B is not in the image).
  * They argue that this makes their activation function more robust to noise.
  * Their activation function still has activations with a mean close to zero.

* Zero Mean Activations Speed Up Learning
  * Natural Gradient = Update direction which corrects the gradient direction with the Fisher Information Matrix
  * Hessian-Free Optimization techniques use an extended Gauss-Newton approximation of Hessians and therefore can be interpreted as versions of natural gradient descent.
  * Computing the Fisher matrix is too expensive for neural networks.
  * Methods to approximate the Fisher matrix or to perform natural gradient descent have been developed.
  * Natural gradient = inverse(FisherMatrix) * gradientOfWeights
  * Lots of formulas. Apparently first explaining how natural gradient descent works, then proving that natural gradient descent can deal well with non-zero-mean activations.
  * Natural gradient descent auto-corrects bias shift (i.e. non-zero-mean activations).
  * If that auto-correction does not exist, oscillations (?) can occur, which slow down learning.
  * Two ways to push means towards zero:
    * Unit zero mean normalization (e.g. Batch Normalization)
    * Activation functions with negative parts

* Exponential Linear Units (ELUs)
  * *Formula*
    * f(x):
      * if x >= 0: x
      * else: alpha(exp(x)-1)
    * f'(x) / Derivative:
      * if x >= 0: 1
      * else: f(x) + alpha
    * `alpha` defines at which negative value the ELU saturates.
    * `alpha=0.5` => the output saturates towards -0.5 for very negative inputs
  * ELUs avoid the vanishing gradient problem, because their positive part is the identity function (like e.g. ReLUs)
  * The negative values of ELUs push the mean activation towards zero.
  * Mean activations closer to zero resemble more the natural gradient, therefore they should speed up learning.
  * ELUs are more noise robust than PReLUs and LeakyReLUs, because their negative values saturate and thus should create a small gradient.
  * "ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence"

* Experiments Using ELUs
  * They compare ELUs to ReLUs and LeakyReLUs, but not to PReLUs (no explanation why).
  * They seem to use a negative slope of 0.1 for LeakyReLUs, even though 0.33 is standard afaik.
  * They use an alpha of 1.0 for their ELUs (i.e. minimum value is -1.0).
  * MNIST classification:
    * ELUs achieved lower mean activations than ReLU/LeakyReLU
    * ELUs achieved lower cross entropy loss than ReLU/LeakyReLU (and also seemed to learn faster)
    * They used 5 hidden layers of 256 units each (no explanation why so many)
    * (No convolutions)
  * MNIST Autoencoder:
    * ELUs performed consistently best (at different learning rates)
    * Usually ELU > LeakyReLU > ReLU
    * LeakyReLUs not far off, so if they had used a 0.33 value maybe these would have won
  * CIFAR-100 classification:
    * Convolutional network, 11 conv layers
    * LeakyReLUs performed better during the first ~50 epochs, ReLUs mostly on par with ELUs
    * LeakyReLUs about on par for epochs 50-100
    * ELUs win in the end (the learning rates used might not be optimal for ELUs, were designed for ReLUs)
  * CIFAR-100, CIFAR-10 (big convnet):
    * 6.55% error on CIFAR-10, 24.28% on CIFAR-100
    * No comparison with ReLUs and LeakyReLUs for same architecture
  * ImageNet
    * Big convnet with spatial pyramid pooling (?) before the fully connected layers
    * Network with ELUs performed better than ReLU network (better score at end, faster learning)
    * Networks were still learning at the end, they didn't run till convergence
    * No comparison to LeakyReLUs
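The ELU formula and its derivative from the notes above can be written out in a few lines. This is a minimal NumPy sketch, not the authors' implementation; the function names `elu` and `elu_grad` are chosen here for illustration.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # Clamp the exponent argument so exp() cannot overflow for large positive x.
    negative_part = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return np.where(x >= 0, x, negative_part)

def elu_grad(x, alpha=1.0):
    """Derivative: 1 for x >= 0, f(x) + alpha (equal to alpha * exp(x)) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0, elu(x, alpha) + alpha)
```

For very negative inputs the output approaches `-alpha`, which is the saturation value the notes refer to, and the gradient approaches zero there.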

arxiv.org
arxiv-vanity.com
scholar.google.com

A Neural Algorithm of Artistic Style
Leon A. Gatys and Alexander S. Ecker and Matthias Bethge
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.NE, q-bio.NC

[link] Summary by Alexander Jung 7 years ago

* The paper describes a method to separate content and style from each other in an image.
* The style can then be transferred to a new image.
* Examples:
* Let a photograph look like a painting of van Gogh.
* Improve a dark beach photo by taking the style from a sunny beach photo.

### How
* They use the pretrained 19-layer VGG net as their base network.
* They assume that two images are provided: One with the *content*, one with the desired *style*.
* They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*.
* They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*.
* How to calculate a *Gram Matrix*:
* Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations.
* Convert each filter's activations to a (1-dimensional) vector.
* Pick all pairs of filters. Calculate the scalar product of both filters' vectors.
* Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128).
* Repeat that for every pair to get the Gram Matrix.
* The Gram Matrix roughly represents the *texture* of the image.
* Now you have the content representation (activations of a layer) and the style representation (Gram Matrices).
* Create a new image of the size of the content image. Fill it with random white noise.
* Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.)
* Make changes to the new image using gradient descent to optimize a loss function.
* The loss function has two components:
* The mean squared error between the new image's content representation and the previously extracted content representation.
* The mean squared error between the new image's style representation and the previously extracted style representation.
* Add up both components to get the total loss.
* Give both components a weight to alter for more/less style matching (at the expense of content matching).
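The Gram Matrix steps and the two-part loss described above can be sketched as follows. This is an illustrative NumPy version under simplifying assumptions: activations are plain arrays of shape `(n_filters, height, width)`, the normalization constants used in the paper are omitted, and the names `gram_matrix` and `total_loss` are made up for this example.

```python
import numpy as np

def gram_matrix(activations):
    """Gram matrix of one layer; activations has shape (n_filters, height, width)."""
    n_filters = activations.shape[0]
    flat = activations.reshape(n_filters, -1)  # one 1-D vector per filter
    return flat @ flat.T                       # scalar product of every filter pair

def total_loss(gen_content, ref_content, gen_style_layers, ref_style_layers,
               content_weight=1.0, style_weight=1.0):
    """Weighted sum of content MSE and per-layer Gram matrix MSE."""
    content_loss = np.mean((gen_content - ref_content) ** 2)
    style_loss = sum(
        np.mean((gram_matrix(g) - gram_matrix(r)) ** 2)
        for g, r in zip(gen_style_layers, ref_style_layers)
    )
    return content_weight * content_loss + style_weight * style_loss
```

In the actual method the generated image would be updated by gradient descent on this loss, which needs an autodiff framework; the sketch only shows how the loss value is assembled.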

![Examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples")

*One example input image with different styles added to it.*

-------------------------

### Rough chapter-wise notes

* Page 1
* A painted image can be decomposed in its content and its artistic style.
* Here they use a neural network to separate content and style from each other (and to apply that style to an existing image).

* Page 2
* Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style).
* They call the feature responses in higher layers *content representation*.
* To capture style information, they use a method that was originally designed to capture texture information.
* They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multi-scale representation of the style.

* Page 3
* They use VGG as their base CNN.

* Page 4
* Based on the extracted style features, they can generate a new image, which has equal activations in these style features.
* The new image should match the style (texture, color, localized structures) of the artistic image.
* The style features become more and more abstract with higher layers. They call that multi-scale representation the *style representation*.
* The key contribution of the paper is a method to separate style and content representation from each other.
* These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork).

* Page 6
* The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.)
* Content and style can't be separated perfectly.
* Their loss function has two terms, one for content matching and one for style matching.
* The terms can be increased/decreased to match content or style more.

* Page 8
* Previous techniques work only on limited or simple domains or used non-parametric approaches (see non-photorealistic rendering).
* Previously neural networks have been used to classify the time period of paintings (based on their style).
* They argue that separating content from style might be useful in many other domains (other than transferring the style of paintings to images).

* Page 9
* The style representation is gathered by measuring correlations between activations of neurons.
* They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do.
* They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for style-invariant classification.

* Page 9, Methods
* They use the 19 layer VGG net as their basis.
* They use only its convolutional layers, not the linear ones.
* They use average pooling instead of max pooling, as that produced slightly better results.

* Page 10, Methods
* The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features.
* They build a style representation by calculating Gram Matrices for each layer.

* Page 11, Methods
* The Gram Matrix is generated in the following way:
* Convert each filter of a convolutional layer to a 1-dimensional vector.
* For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters.
* Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix.
* Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices).
* That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting.

* Page 12, Methods
* To transfer the style of a painting to an existing image, proceed as follows:
* Start with a white noise image.
* Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting).
* Each distance (content, style) can be weighted to have more or less influence on the loss function.

jmlr.org
scholar.google.com

Generative Moment Matching Networks
Li, Yujia and Swersky, Kevin and Zemel, Richard S.
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Alexander Jung 7 years ago

* Generative Moment Matching Networks (GMMN) are generative models that use maximum mean discrepancy (MMD) for their objective function.
* MMD is a measure of how similar two datasets are (here: generated dataset and training set).
* GMMNs are similar to GANs, but they replace the Discriminator with the MMD measure, making their optimization more stable.

### How
* MMD calculates a similarity measure by comparing statistics of two datasets with each other.
* MMD is calculated based on samples from the training set and the generated dataset.
* A kernel function is applied to pairs of these samples (thus the statistics are actually calculated in high-dimensional spaces). The authors use Gaussian kernels.
* MMD can be approximated using a small number of samples.
* MMD is differentiable and therefore can be used as a standard loss function.
* They train two models:
* GMMN: Noise vector input (as in GANs), several ReLU layers into one sigmoid layer. MMD as the loss function.
* GMMN+AE: Same as GMMN, but the sigmoid output is not an image, but instead the code that gets fed into an autoencoder's (AE) decoder. The AE is trained separately on the dataset. MMD is backpropagated through the decoder and then the GMMN. I.e. the GMMN learns to produce codes that let the decoder generate good looking images.
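The sample-based MMD estimate described above can be sketched in NumPy. This is a rough illustration, not the authors' code: it uses the simple (biased) estimator with a single Gaussian kernel, and the `bandwidth` parameterization as well as the names `gaussian_kernel` and `mmd2` are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Pairwise Gaussian kernel values between rows of a (N, D) and b (M, D)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Squared MMD between training samples x and generated samples y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx - 2.0 * k_xy + k_yy
```

The value is close to zero when both sample sets come from the same distribution and grows as they move apart, which is what makes it usable as a loss for the generator.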

![Formula](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Moment_Matching_Networks__formula.png?raw=true "Formula")

*MMD formula, where $x_i$ is a training set example and $y_i$ a generated example.*

![Architectures](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Moment_Matching_Networks__architectures.png?raw=true "Architectures")

*Architectures of GMMN (left) and GMMN+AE (right).*

### Results
* They tested only on MNIST and TFD (i.e. datasets that are well suited for AEs...).
* Their GMMN achieves similar log likelihoods compared to other models.
* Their GMMN+AE achieves better log likelihoods than other models.
* GMMN+AE produces good looking images.
* GMMN+AE produces smooth interpolations between images.

![Interpolations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Generative_Moment_Matching_Networks__interpolations.png?raw=true "Interpolations")

*Generated TFD images and interpolations between them.*

--------------------

### Rough chapter-wise notes

* (1) Introduction
* Sampling in GMMNs is fast.
* GMMNs are similar to GANs.
* While the training objective in GANs is a minimax problem, in GMMNs it is a simple loss function.
* GMMNs are based on maximum mean discrepancy. They use that (implemented via the kernel trick) as the loss function.
* GMMNs try to generate data so that the moments in the generated data are as similar as possible to the moments in the training data.
* They combine GMMNs with autoencoders. That is, they first train an autoencoder to generate images. Then they train a GMMN to produce sound code inputs to the decoder of the autoencoder.

* (2) Maximum Mean Discrepancy
* Maximum mean discrepancy (MMD) is a frequentist estimator to tell whether two datasets X and Y come from the same probability distribution.
* MMD estimates basic statistics values (i.e. mean and higher order statistics) of both datasets and compares them with each other.
* MMD can be formulated so that examples from the datasets are only used for scalar products. Then the kernel trick can be applied.
* It can be shown that minimizing MMD with gaussian kernels is equivalent to matching all moments between the probability distributions of the datasets.

* (4) Generative Moment Matching Networks
* Data Space Networks
* Just like GANs, GMMNs start with a noise vector that has N values sampled uniformly from [-1, 1].
* The noise vector is then fed forward through several fully connected ReLU layers.
* The MMD is differentiable and therefore can be used for backpropagation.
* Auto-Encoder Code Space Networks
* AEs can be used to reconstruct high-dimensional data, which is a simpler task than to learn to generate new data from scratch.
* Advantages of using the AE code space:
* Dimensionality can be explicitly chosen.
* Disentangling factors of variation.
* They suggest a combination of GMMN and AE. They first train an AE, then they train a GMMN to generate good codes for the AE's decoder (based on MMD loss).
* For some reason they use greedy layer-wise pretraining with later fine-tuning for the AE, but don't explain why. (That training method is outdated?)
* They add dropout to their AE's encoder to get a smoother code manifold.
* Practical Considerations
* MMD has a bandwidth parameter (as it's based on RBFs). Instead of choosing a single fixed bandwidth, they use multiple kernels with different bandwidths (1, 5, 10, ...), apply them all and then sum the results.
* Instead of $MMD^2$ loss they use $\sqrt{MMD^2}$, which does not go to zero as fast as $MMD^2$, thereby creating stronger gradients.
* Per minibatch they generate a small number of samples and they pick a small number of samples from the training set. They then compute MMD for these samples. I.e. they don't run MMD over the whole training set as that would be computationally prohibitive.

* (5) Experiments
* They trained on MNIST and TFD.
* They used a GMMN with 4 ReLU layers and autoencoders with either 2/2 (encoder, decoder) hidden sigmoid layers (MNIST) or 3/3 (TFD).
* They used dropout on the encoder layers.
* They used layer-wise pretraining and finetuning for the AEs.
* They tuned most of the hyperparameters using bayesian optimization.
* They use minibatch sizes of 1000 and compute MMD based on those (i.e. based on 2000 points total).
* Their GMMN+AE model achieves better log likelihood values than all competitors. The raw GMMN model performs roughly on par with the competitors.
* Nearest neighbor evaluation indicates that it did not just memorize the training set.
* The model learns smooth interpolations between digits (MNIST) and faces (TFD).

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Conditional Random Fields as Recurrent Neural Networks
Zheng, Shuai and Jayasumana, Sadeep and Romera-Paredes, Bernardino and Vineet, Vibhav and Su, Zhizhong and Du, Dalong and Huang, Chang and Torr, Philip H. S.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Cubs Reading Group 7 years ago

#### Problem addressed: 
Image Segmentation, Pixel labelling, Object recognition

#### Summary: 
The authors approximate the CRF inference procedure using the mean field approximation. They use a specific set of unary and binary potentials. Each step in the mean field inference is modelled as a convolutional layer with appropriate filter sizes and channels. The mean field inference procedure requires multiple iterations (over time) to achieve convergence. This is exploited to model the whole procedure as CNN-RNN. The unary potentials and initial pixel labels are learnt using an FCN. The authors train the FCN and CNN-RNN separately and jointly and find that joint training gives the better performance of the two on the VOC2012 dataset.

#### Novelty:
Formulating the mean field CRF inference procedure as a combination of CNN and RNN. Joint training procedure of a fully convolutional network (FCN) + CRF as RNN to perform pixel labelling tasks

#### Drawbacks:
Does not scale with number of classes. No theoretical justification for success of joint training, only empirical justification

#### Datasets:
VOC2012, COCO

#### Additional remarks:
Presentation video available on cedar server

#### Resources:
http://www.robots.ox.ac.uk/~szheng/papers/CRFasRNN.pdf

#### Presenter:
Bhargava U. Kota

arxiv.org
scholar.google.com

Deep Compositional Question Answering with Neural Module Networks
Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Klein, Dan
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 7 years ago

This paper presents an approach to visual question answering by dynamically composing networks of independent neural modules based on the semantic parsing of the question. Main contributions:

- Independent neural modules that can be combined together and jointly trained.
- Attention: Convolutional layer, with different filters for different instances. For example, attend[dog], attend[cat], etc.
- Re-attention: FC-ReLU-FC-ReLU, weights are different for different instances. For example, re-attend[above], re-attend[not], etc.
- Combination: Stacks two attention maps, followed by conv-ReLU to map to a single attention map. For example, combine[and], combine[except], etc.
- Classification: Combines attention map and image, followed by FC-Softmax to map to answer. For example, classify[colors].
- Measurement: FC-ReLU-FC-Softmax, takes attention map as input. For example, measure[exists].

- Structured representations are extracted from questions and these are then mapped to network layouts, including the connections between them.
- All leaves become attend modules, all internal nodes become re-attend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.
- Networks with the same structure but different instantiations can be processed in the same batch. For example, classify[color](attend[cat]), classify[where](attend[truck]).

- Predictions from the module network are combined with LSTM representations to get the final answer.
- Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network.
- Semantic regularities: 'green' is an implausible answer for 'what color is the bear?'.

- Experiments are performed on the synthetic SHAPES dataset and VQA dataset.
- Performance on the SHAPES dataset is better as it is designed to benefit from compositionality.
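The rule for turning a parsed question into a network layout (leaves become attend modules, internal nodes become re-attend or combine modules depending on arity, the root becomes measure or classify) can be illustrated with a small recursive function. This is a toy sketch: the tree encoding as nested `(label, children)` tuples and the function name `layout_to_modules` are assumptions for this example, and it only produces a textual layout rather than real neural modules.

```python
def layout_to_modules(tree, is_yes_no):
    """Map a parse tree (label, [children]) to a module layout string."""
    def build(node, is_root):
        label, children = node
        args = [build(child, False) for child in children]
        if is_root:
            # Root: measure for yes/no questions, classify for everything else.
            kind = "measure" if is_yes_no else "classify"
        elif not children:
            kind = "attend"       # leaves
        elif len(children) == 1:
            kind = "re-attend"    # unary internal node
        else:
            kind = "combine"      # internal node with two or more children
        name = f"{kind}[{label}]"
        return f"{name}({', '.join(args)})" if args else name

    return build(tree, True)
```

For example, `layout_to_modules(("color", [("cat", [])]), is_yes_no=False)` yields the layout `classify[color](attend[cat])`.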

## Strengths

- This model takes advantage of the inherently compositional property of language, which makes a lot of sense. VQA is an extremely complex task and breaking it up into separate functions/modules is an excellent approach.

## Weaknesses / Notes

- Mapping from syntactic structure to module network is hand-designed. Ideally, the model should learn this too to generalize.

- Due to its compositional nature, this kind of model can possibly be used in the zero-shot learning setting, i.e. generalize to novel question types that the network hasn't seen before.

arxiv.org
arxiv-vanity.com
scholar.google.com

Delving Deeper into Convolutional Networks for Learning Video Representations
Nicolas Ballas and Li Yao and Chris Pal and Aaron Courville
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.LG, cs.NE

[link] Summary by Abhishek Das 7 years ago

This paper presents a neat method for learning spatio-temporal representations from videos. Convolutional features from intermediate layers of a CNN are extracted, to preserve spatial resolution, and fed into a modified GRU that can (in theory) learn infinite temporal dependencies. Main contributions:

- Their variant of GRU (called GRU-RCN) uses convolution operations instead of fully-connected units.
- This exploits the local correlation in image frames across spatial locations.
- Features from pool2, pool3, pool4, pool5 are extracted and fed into independent GRU-RCNs. Hidden states at the last time step are now feature volumes, which are average pooled to reduce to 1x1 spatially, and fed into a linear + softmax classifier. Outputs from these classifiers are averaged to get the final prediction.

- Other variants that they experiment with are bidirectional GRU-RCNs and stacked GRU-RCNs i.e. GRU-RCNs with connections between them (with maxpool operations for dimensionality reduction).
- Bidirectional GRU-RCNs perform the best.
- Stacked GRU-RCNs perform worse than the other variants, probably because of limited data.

- They evaluate their method on action recognition and video captioning, and show significant improvements over a CNN+RNN baseline, comparing favorably with other state-of-the-art methods (like C3D).

## Strengths

- The idea is simple and elegant. Earlier methods for learning video representations typically used 3D convolutions (k x k x T filters), which suffered from finite temporal capacity, or RNNs sitting on top of last-layer CNN features, which are unable to capture finer spatial resolution. In theory, this formulation solves both.

- Changing fully-connected operations to convolutions has the additional advantage of requiring fewer parameters (n\_input x n\_output x input\_width x input\_height vs. n\_input x n\_output x k\_width x k\_height).
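To make the parameter comparison concrete, the two counts from the bullet above can be computed directly. The layer sizes used in the usage note are hypothetical and only serve to show the scale of the difference.

```python
def fc_params(n_input, n_output, input_width, input_height):
    """Parameter count of the fully-connected variant, as given in the summary."""
    return n_input * n_output * input_width * input_height

def conv_params(n_input, n_output, k_width, k_height):
    """Parameter count of the convolutional variant."""
    return n_input * n_output * k_width * k_height
```

With 512 input channels, 256 output channels, a 14x14 feature map and 3x3 kernels, `fc_params(512, 256, 14, 14)` gives 25,690,112 parameters while `conv_params(512, 256, 3, 3)` gives 1,179,648, roughly a 22x reduction.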

arxiv.org
arxiv-vanity.com
scholar.google.com

Dynamic Capacity Networks
Amjad Almahairi and Nicolas Ballas and Tim Cooijmans and Yin Zheng and Hugo Larochelle and Aaron Courville
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.NE

[link] Summary by Abhishek Das 7 years ago

This paper presents a model that can dynamically split computation across coarse, low-capacity sub-networks and fine, high-capacity sub-networks. The coarse model processes the entire input data and is typically shallow while the fine model focuses on a few important regions of the input and is deeper. For images as input, this is a hard attention mechanism that can be trained with stochastic gradient descent and doesn't require a task-specific attention policy trained by reinforcement learning. Key ideas:

- A deep network h can be decomposed into bottom layers f and top layers g such that $h(x) = g(f(x))$. Further, f consists of two alternate sub-networks $f_c$ and $f_f$. $f_c$ is a low-capacity sub-network while $f_f$ is a high-capacity sub-network.

- g should be able to use representations from $f_c$ and $f_f$ dynamically. $f_c$ processes the entire input while $f_f$ processes only a few important regions of the input.

- The coarse model processes the entire input, and the norm of the gradient of the entropy with respect to the coarse vector at each spatial region is computed as a measure of saliency. Using the entropy gradient as a saliency measure encourages selecting the input regions that could affect the uncertainty in the model’s predictions the most.

- The top-k input regions with highest saliency values are processed by the fine model. The refined representation for input to the top layers consists of both coarse and fine vectors. During backpropagation, gradients are computed for the refined model, i.e. propagating gradients at each position into either the coarse or fine features, depending on which was used.

- To make sure $f_c$ and $f_f$ representations are interchangeable and the input to the top layers has smooth transitions, an additional objective term minimizes the squared distance between coarse and fine representations. This additional term is used only to optimize the coarse layers, not the fine layers.

- Experiments are run on cluttered MNIST and SVHN, with comparisons against RAM and DRAW and a study of different numbers of patches for fine processing.
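The top-k selection step described above, where the most salient positions get fine features and the rest keep coarse ones, can be sketched as follows. This is a simplified illustration: the saliency map is taken as a given input (computing the entropy gradient needs an autodiff framework), the fine features are assumed to be precomputed for every position although the real model computes them only for the selected regions, and the name `refine_top_k` is hypothetical.

```python
import numpy as np

def refine_top_k(coarse, fine, saliency, k):
    """Replace coarse features by fine features at the k most salient positions.

    coarse, fine: feature maps of shape (H, W, D); saliency: array of shape (H, W).
    Returns the refined feature map and the list of selected (row, col) positions.
    """
    h, w, _ = coarse.shape
    top = np.argsort(saliency.ravel())[-k:]  # flat indices of the k largest values
    rows, cols = np.unravel_index(top, (h, w))
    refined = coarse.copy()
    refined[rows, cols] = fine[rows, cols]
    return refined, list(zip(rows.tolist(), cols.tolist()))
```

The refined map mixes coarse and fine vectors, which is why the summary mentions an extra objective that keeps both representations close to each other.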

## Strengths

- Neat, general way to split computation based on importance of input; a hard-attention mechanism that can be trained with SGD, unlike RAM.

- Entropy gradient as a measure of saliency is an interesting idea, and it doesn't need labels i.e. can be used at test time.

arxiv.org
arxiv-vanity.com
scholar.google.com

DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson and Andrej Karpathy and Li Fei-Fei
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.LG

[link] Summary by Abhishek Das 7 years ago

This paper introduces the task of dense captioning and proposes
a network architecture that processes an image and produces region descriptions
in a single pass and can be trained end-to-end. Main contributions:

- Dense captioning
    - Generalization of object detection (caption consists of single word)
    and image captioning (region consists of whole image).

- Fully convolutional localization network
    - Fully differentiable, can be trained jointly with the rest of the network
    - Consists of a region proposal network, box regression (similar to Faster R-CNN)
    and bilinear interpolation (similar to Spatial Transformer Networks) for
    sampling.

- Network details
    - Convolutional layer features are extracted for image
    - For each element in the feature map, k anchor boxes of different aspect ratios
    are selected in the input image space.
    - For each of these, the localization layer predicts offsets and confidence.
    - The region proposals are projected on the convolutional feature map and a sampling
    grid is computed from output feature map to input (bilinear sampling).
    - The computed feature map is passed through an MLP to compute representations
    corresponding to each region.
    - These are passed (in a batch) as the first word to an LSTM (Show and Tell) which
    is trained to predict each word of the caption.

## Strengths

- Fully differentiable 'spatial attention' mechanism (bilinear interpolation)
in place of RoI pooling as in the case of Faster R-CNN.
    - RoI pooling is not differentiable with respect to the input proposal coordinates.

- Fast, and impressive qualitative results.

## Weaknesses / Notes

The model is very well engineered together from different works (Faster R-CNN +
Spatial Transformer Networks + Show & Tell).

arxiv.org
scholar.google.com

Deep multi-scale video prediction beyond mean square error
Mathieu, Michaël and Couprie, Camille and LeCun, Yann
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Kirill Pevzner 7 years ago

Predict frames of a video using 3 newly proposed and complementary methods:
1. Multi scale cnn
2. GAN
3. Image gradient difference loss


Datasets:
-----------
* UCF101
* Sports1M


GAN
------
Generator:
   * Input: several frames of video from dataset
   * output: next frame of video

Discriminator:
   * input: original and last frame
   * output: is the last frame from dataset or generated

Problem: Edges of moving objects are still blurry.
Solution: Image gradient difference loss
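
The summary does not spell out the image gradient difference loss, so the following is only a sketch of one common formulation: compare the absolute differences between neighbouring pixels in the predicted and the true frame. The exponent `alpha`, the single-channel 2-D input and the function name are assumptions made for this example.

```python
import numpy as np

def gradient_difference_loss(pred, target, alpha=1.0):
    """Penalize mismatched image gradients between a predicted and a true frame.

    pred, target: 2-D arrays (height, width) holding one image channel.
    """
    # Absolute intensity differences between vertically and horizontally adjacent pixels.
    pred_dy = np.abs(pred[1:, :] - pred[:-1, :])
    pred_dx = np.abs(pred[:, 1:] - pred[:, :-1])
    target_dy = np.abs(target[1:, :] - target[:-1, :])
    target_dx = np.abs(target[:, 1:] - target[:, :-1])
    return (np.sum(np.abs(target_dy - pred_dy) ** alpha)
            + np.sum(np.abs(target_dx - pred_dx) ** alpha))
```

A prediction that smooths out a sharp edge has smaller gradients than the target and is penalized, while a constant brightness offset is not, which is why such a term is combined with other losses.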

arxiv.org
scholar.google.com

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding
Kendall, Alex and Badrinarayanan, Vijay and Cipolla, Roberto
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 7 years ago

**Contributions**: 
* Use dropout to get segmentation with a measure of model uncertainty.

**Explanation**:

We can consider dropout as a way of getting samples from a posterior distribution of models (see these papers: [1](https://arxiv.org/abs/1506.02142), [2](https://arxiv.org/abs/1506.02158)), so it can be used to do Bayesian inference.

This amounts to using dropout during both training and test time and getting multiple outputs (i.e. sampling from the model distribution) at test time. The mean of these outputs is taken as the final segmentation and their variation as the model uncertainty.

Sample averaging performs better than weight averaging (i.e. the usual test time method for dropout) if averaged over more than 6 samples. The paper used 50 samples.

General technique which can be applied to any segmentation model. Sample averaging alone improves scores by 2-3%.
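
The test-time sampling procedure can be sketched independently of any particular segmentation network. In this illustration `stochastic_forward` stands for any model that keeps dropout active at test time; the toy per-pixel classifier `toy_forward`, its random weights and the way uncertainty is reduced to one number per pixel are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(8, 3))  # hypothetical: 8 features -> 3 classes

def toy_forward(features, drop_prob=0.5):
    """Toy per-pixel softmax classifier with dropout kept active."""
    mask = rng.random(features.shape) >= drop_prob
    logits = (features * mask / (1.0 - drop_prob)) @ toy_weights
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def mc_dropout_predict(stochastic_forward, image, n_samples=50):
    """Average several stochastic passes; return labels and per-pixel uncertainty."""
    samples = np.stack([stochastic_forward(image) for _ in range(n_samples)])
    mean_probs = samples.mean(axis=0)                # sample averaging
    uncertainty = samples.var(axis=0).mean(axis=-1)  # variance, averaged over classes
    return mean_probs.argmax(axis=-1), uncertainty
```

Calling `mc_dropout_predict(toy_forward, features)` on an array of shape `(H, W, 8)` returns an `(H, W)` label map and an `(H, W)` uncertainty map.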

**Benchmarks**:

<score without dropout -> score with dropout>

*VOC2012*

Dilation Network: 71.3 -> 73.1 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20Dilation%20Network)

FCN8:    62.2 -> 65.4 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20FCN)

SegNet: 59.1 -> 60.5 (Source: reported in the paper)

*CamVid* 

SegNet: 71.20 -> 76.3 (Source: reported in the paper)

**My comments**

Nothing very new. Gotcha is that sample averaging performs better.

arxiv.org
scholar.google.com

Evaluating the visualization of what a Deep Neural Network has learned
Samek, Wojciech and Binder, Alexander and Montavon, Grégoire and Bach, Sebastian and Müller, Klaus-Robert
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 8 years ago

Layer-wise Relevance Propagation (LRP) is a novel technique that the authors have used in multiple use-cases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are crucial for increasing the interpretability of Deep Learning models. Apart from LRP relevance, the authors also discuss quantitative ways to measure the accuracy of the generated heatmap.

### LRP & Alternatives

What is LRP ?

LRP is a principled approach to decompose a classification decision into pixel-wise relevances indicating the contributions of a pixel to the overall classification score. The approach is derived from a layer-wise conservation principle, which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers.

Denoting by $R_i^{(l)}$ the relevance associated with the $i$th neuron of layer $l$ and by $R_j^{(l+1)}$ the relevance associated with the $j$th neuron in the next layer, the conservation principle requires that

![](https://i.imgur.com/GQxrnCT.png)

where $R_i^{(l)}$ is given as
![](https://i.imgur.com/FD7AAfF.png)

where $z_{ij}$ is the activation of the $j$th neuron due to input from the $i$th neuron.

As per the authors, this is not necessarily the only relevance function that is conserved. The intuition behind using such a function is that lower-layer neurons that contribute most to the activation of the higher-layer neuron receive a larger share of the relevance $R_j$ of neuron $j$.

A downside of this propagation rule (at least if *epsilon* = 0) is that the denominator may tend to zero if lower-level contributions to neuron j cancel each other out. The numerical instability can be overcome by setting *epsilon* > 0. However in that case, the conservation idea is relaxated in order to gain better numerical properties. To conserve relevance, it can be formulated as sum of positive and negative activations
![](https://i.imgur.com/lo7f8AI.png)
such that *alpha* - *beta* = 1
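
As a rough illustration of the epsilon-stabilised rule discussed above, here is a minimal sketch for a single dense layer. It assumes the commonly used formulation z[i,j] = x[i] * w[i,j] with z[j] = sum over i of z[i,j] + b[j], which may differ in detail from the formula in the figure; the function name and the toy numbers are made up for the example.

```python
def lrp_epsilon(x, W, b, R_out, eps=1e-6):
    """Redistribute the relevance R_out of one dense layer onto its inputs.

    x: input activations (length n), W: n x m weights, b: m biases,
    R_out: relevance of the m output neurons.
    """
    n, m = len(x), len(b)
    R_in = [0.0] * n
    for j in range(m):
        # z[i] is the contribution of input i to output neuron j.
        z = [x[i] * W[i][j] for i in range(n)]
        z_j = sum(z) + b[j]
        # Stabiliser: push the denominator away from zero, keeping its sign.
        denom = z_j + (eps if z_j >= 0 else -eps)
        for i in range(n):
            R_in[i] += z[i] / denom * R_out[j]
    return R_in


# Hypothetical toy layer with zero biases.
x = [1.0, 2.0, -1.0]
W = [[0.5, -1.0], [0.25, 1.0], [0.5, 0.0]]
b = [0.0, 0.0]
R_out = [1.0, 2.0]
R_in = lrp_epsilon(x, W, b, R_out)
print(R_in, sum(R_in), sum(R_out))
```

With zero biases and a small *epsilon*, the input relevances sum to almost the same total as the output relevances; increasing *epsilon* makes the gap larger, which is the relaxation of conservation mentioned above.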

#### Alternatives to LRP for heatmap

**Sensitivity measurement**

In such methods of generating heatmaps, the gradient of the output with respect to the input is used to generate the heatmap. This quantity measures how much small changes in the pixel value locally affect the network output.
##### Disadvantages
Given that most models use ReLU as the activation function, the gradient flows only through activations with positive output, which makes the backward mapping discontinuous and consequently strongly local. The same applies to max-pooling, where gradients flow only through the neuron with maximum intensity in a local neighbourhood.

Also, since most of these methods use the absolute impact on the prediction caused by changes in pixel intensities, the information about whether a pixel was evidence for or against the prediction is lost.

**Deconvolutional Networks**

##### Disadvantages

Here the backward discontinuity problem of sensitivity-based methods is absent, hence global features can be captured. However, since the method only takes in activations from the final layer (which mostly learns the presence or absence of features), using it to generate heatmaps is likely to yield average maps that lack image-specific localisation effects.

LRP is able to counter these effects because of the way it uses relevance.

#### Performance of heatmaps

A few concerns that the authors raise are
- A heatmap is not a segmentation mask. On the contrary, missing evidence or the context may be very important for classification.
- Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of a bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct).

The authors propose a novel method (MoRF - *Most Relevant First*) for objectively quantifying the quality of a heatmap. A detailed description of the measure is best obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in non-relevant areas).

The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC).
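
To make the AOPC idea concrete, here is a small sketch for a single input. It assumes the usual definition (the average drop in classifier score as features are perturbed in order of decreasing relevance) and, as a simplification, perturbs a feature by setting it to zero rather than by random resampling. The linear scorer and both heatmaps are hypothetical.

```python
def aopc_morf(f, x, relevance, steps, perturb=lambda v: 0.0):
    """Area over the MoRF perturbation curve for one input.

    f: scoring function, x: list of features, relevance: heatmap scores,
    steps: number of perturbation steps L.
    """
    # Most Relevant First: visit features in order of decreasing relevance.
    order = sorted(range(len(x)), key=lambda i: relevance[i], reverse=True)
    current = list(x)
    base = f(current)
    total = 0.0  # the k = 0 term is f(x) - f(x) = 0
    for k in range(steps):
        current[order[k]] = perturb(current[order[k]])
        total += base - f(current)
    return total / (steps + 1)


weights = [3.0, -1.0, 0.5, 2.0]  # hypothetical linear classifier


def score(v):
    return sum(w * xi for w, xi in zip(weights, v))


x = [1.0, 1.0, 1.0, 1.0]
good_heatmap = [w * xi for w, xi in zip(weights, x)]  # true contributions
bad_heatmap = [-r for r in good_heatmap]              # reversed ranking

print(aopc_morf(score, x, good_heatmap, steps=2))
print(aopc_morf(score, x, bad_heatmap, steps=2))
```

A heatmap that ranks the truly important features first makes the score drop quickly and therefore gets the larger AOPC.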

#### Observation

Most sensitivity-based methods answer the question *what change would make the image more or less belong to the category car*, which isn't really the classifier's question. LRP aims to answer the real classifier question: *what speaks for the presence of a car in the image*.

The image below is a good example of how LRP can denoise heatmaps generated on the basis of sensitivity.

![](https://i.imgur.com/Sq0b5yg.png)

arxiv.org
arxiv-vanity.com
scholar.google.com

Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images
Aravindh Mahendran and Andrea Vedaldi
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, 68T45

[link] Summary by Martin Thoma 8 years ago

This paper is about finding natural-looking images for the analysis of machine learning models in computer vision. There are 3 techniques:

* **inversion**: the aim is to reconstruct an image from its representation
* **activation maximization**: search for patterns that maximally stimulate a representation component (deep dream). This does NOT use an initial natural image.
* **caricaturization**: exaggerate the visual patterns that a representation detects in an image

The introduction is nice.


## Code

The paper comes with code: [robots.ox.ac.uk/~vgg/research/invrep](http://www.robots.ox.ac.uk/~vgg/research/invrep/index.html) ([GitHub: aravindhm/deep-goggle](https://github.com/aravindhm/deep-goggle))


## Related

* 2013, Zeiler & Fergus: [Visualizing and Understanding Convolutional Networks ](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma)

aclweb.org
scholar.google.com

Addressing the Rare Word Problem in Neural Machine Translation
Luong, Thang and Sutskever, Ilya and Le, Quoc V. and Vinyals, Oriol and Zaremba, Wojciech
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Shagun Sodhani 8 years ago

# Addressing the Rare Word Problem in Neural Machine Translation

## Introduction

* NMT (Neural Machine Translation) systems perform poorly on OOV (out-of-vocabulary) or rare words.
* The paper presents a word-alignment based technique for translating such rare words.
* [Link to the paper](https://arxiv.org/abs/1410.8206)

## Technique

* Annotate the training corpus with information about which source words the different OOV words in the target sentence correspond to.
* NMT learns to track the alignment of rare words across source and target sentences and emits such alignments for the test sentences.
* As a post-processing step, use a dictionary to map rare words from the source language to target language.

## Annotating the Corpus

### Copy Model

* Annotate the OOV words in the source sentence with tokens *unk1*, *unk2*, etc. such that repeated words get the same token.
* In the target language, each OOV word that is aligned to some OOV word in the source language is assigned the same token as that source word.
* An OOV word in the target language that has no alignment, or is aligned with a known word in the source language, is assigned the null token.
* Pros
    * Very straightforward
* Cons
    * Misses out on words which are not labelled as OOV in the source language.
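
The copy-model annotation can be sketched roughly as follows. This is only an illustration: the vocabularies, the alignment dictionary (target position to source position) and the `unk_null` spelling of the null token are assumptions made for the example, not the paper's exact format.

```python
def annotate_copy(source, target, alignment, src_vocab, tgt_vocab):
    """Annotate one sentence pair in the style of the copy model.

    alignment maps a target position to its aligned source position.
    """
    unk_ids, src_out = {}, []
    for word in source:
        if word in src_vocab:
            src_out.append(word)
        else:
            # Repeated OOV words reuse the same numbered token.
            unk_ids.setdefault(word, len(unk_ids) + 1)
            src_out.append(f"unk{unk_ids[word]}")
    tgt_out = []
    for j, word in enumerate(target):
        if word in tgt_vocab:
            tgt_out.append(word)
            continue
        src_word = source[alignment[j]] if j in alignment else None
        # Unaligned, or aligned to a known source word: use the null token.
        tgt_out.append(f"unk{unk_ids[src_word]}" if src_word in unk_ids else "unk_null")
    return src_out, tgt_out


source = ["the", "ecotax", "portico", "in", "Pont-de-Buis"]
target = ["le", "portique", "écotaxe", "de", "Pont-de-Buis"]
alignment = {0: 0, 1: 2, 2: 1, 3: 3, 4: 4}
src_out, tgt_out = annotate_copy(source, target, alignment, {"the", "in"}, {"le", "de"})
print(src_out)
print(tgt_out)
```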

### PosAll - Positional All Model

* All OOV words in the source language are assigned a single *unk* token.
* All words in the target sentences are assigned positional tokens which denote that the *jth* word in the target sentence is aligned to the *ith* word in the source sentence.
* Aligned words that are too far apart, or are unaligned, are assigned a null token.
* Pros
    * Captures complete alignment between source and target sentences.
* Cons
    * It doubles the length of target sentences.

### PosUnk - Positional Unknown Model

* All OOV words in the source language are assigned a single *unk* token.
* All OOV words in the target sentences are assigned an *unk* token annotated with the relative position of the word with respect to its aligned source word.
* Pros
    * Faster than PosAll model.
* Cons
    * Does not capture alignment for all words.
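
The post-processing step for PosUnk output might look like the sketch below. It assumes tokens of the form `unkpos_d` and the convention d = j - i (target position minus source position); the token spelling, the helper name and the tiny dictionary are hypothetical.

```python
import re


def restore_rare_words(source_tokens, target_tokens, dictionary):
    """Replace unkpos_d tokens in the target using the aligned source word."""
    output = []
    for j, token in enumerate(target_tokens):
        match = re.fullmatch(r"unkpos_(-?\d+)", token)
        if match is None:
            output.append(token)
            continue
        i = j - int(match.group(1))  # assumed convention: d = j - i
        if 0 <= i < len(source_tokens):
            word = source_tokens[i]
            # Translate with the dictionary, or copy the source word verbatim.
            output.append(dictionary.get(word, word))
        else:
            output.append(token)  # alignment points outside the sentence
    return output


source = ["The", "ecotax", "portico", "in", "Pont-de-Buis"]
target = ["Le", "unkpos_-1", "unkpos_1", "de", "unkpos_0"]
dictionary = {"ecotax": "écotaxe", "portico": "portique"}
restored = restore_rare_words(source, target, dictionary)
print(restored)
```

Words without a dictionary entry (here the place name) are simply copied from the source, which is often the right choice for names and numbers.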

## Experiments

* Dataset
    * Subset of WMT'14 dataset
    * Alignment computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/)
* Used architecture from [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f).

## Results

* All 3 approaches improve the performance of existing NMT systems, in the order PosUnk > PosAll > Copy.
* Ensemble models benefit more than individual models as the ensemble of NMT models works better at aligning the OOV words.
* Performance gains are larger when using a smaller vocabulary.
* Rare word analysis shows that performance gains are larger when the proportion of OOV words is higher.

arxiv.org
arxiv-vanity.com
scholar.google.com

Generating Sentences from a Continuous Space
Samuel R. Bowman and Luke Vilnis and Oriol Vinyals and Andrew M. Dai and Rafal Jozefowicz and Samy Bengio
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CL

[link] Summary by Denny Britz 8 years ago

TLDR; The authors present an RNN-based variational autoencoder that can learn a latent sentence representation while learning to decode. A linear layer that predicts the parameters of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence with the prior (Gaussian), as in the "standard" VAE. The authors evaluate the model on Language Modeling and Imputation (Inserting Missing Words) tasks and also present a qualitative analysis of the latent space.

#### Key Points

- Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero.
- Training Trick 1: KL Cost Annealing. During training, increase weight on the KL term of the cost to anneal from vanilla to VAE.
- Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation.
- Results on Language Modeling: Standard model (without cost annealing and word dropout) trails Vanilla RNNLM model, but not by much. KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously).
- Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNLM has nothing to condition on and relies on unigram distribution for the first token.
- Qualitative: Can use higher word dropout to get more diverse sentences
- Qualitative: Can walk the latent space and get grammatical and meaningful sentences.
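
The two training tricks can be sketched in a few lines. The sigmoid shape of the annealing schedule and its `midpoint` and `steepness` values are assumptions chosen for illustration, as is the `<unk>` token name.

```python
import math
import random


def kl_weight(step, midpoint=2500, steepness=0.002):
    """Sigmoid schedule that moves the KL weight from about 0 to about 1."""
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))


def word_dropout(tokens, keep_rate, rng, unk="<unk>"):
    """Replace each decoder input token by unk with probability 1 - keep_rate."""
    return [tok if rng.random() < keep_rate else unk for tok in tokens]


def vae_loss(reconstruction_loss, kl_divergence, step):
    """Annealed objective: reconstruction term plus weighted KL term."""
    return reconstruction_loss + kl_weight(step) * kl_divergence


rng = random.Random(0)
print(kl_weight(0), kl_weight(2500), kl_weight(10000))
print(word_dropout("the cat sat on the mat".split(), keep_rate=0.5, rng=rng))
```

Early in training the KL term is almost ignored, so the model behaves like a plain autoencoder. With a keep rate of 0 the decoder sees only `<unk>` inputs and has to rely entirely on the latent vector.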

arxiv.org
scholar.google.com

Deep Reinforcement Learning with a Natural Language Action Space
Ji He and Jianshu Chen and Xiaodong He and Jianfeng Gao and Lihong Li and Li Deng and Mari Ostendorf
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.AI, cs.CL, cs.LG

[link] Summary by Denny Britz 8 years ago

TLDR; The authors train a DQN on text-based games. The main difference is that their Q-value function embeds the state (textual context) and action (text-based choice) separately and then takes the dot product between them. The authors call this a Deep Reinforcement Relevance Network (DRRN). Basically, just a different Q function implementation. Empirically, the authors show that their network can learn to solve the "Saving John" and "Machine of Death" text games.
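
The core idea, separate embeddings for state and action combined by a dot product, fits in a short sketch. The bag-of-words `embed` function stands in for the paper's deep networks, and the word vectors and game text are invented for the example.

```python
def embed(tokens, table, dim):
    """Bag-of-words embedding: sum of word vectors (stand-in for the deep networks)."""
    vec = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(table.get(tok, [0.0] * dim)):
            vec[k] += value
    return vec


def q_value(state_vec, action_vec):
    """Q(s, a) as the inner product of the state and action embeddings."""
    return sum(s * a for s, a in zip(state_vec, action_vec))


# Hypothetical embedding tables; state and action text use separate parameters.
state_table = {"john": [1.0, 0.0], "drowning": [0.0, 2.0]}
action_table = {"swim": [0.1, 1.0], "walk": [0.5, -1.0], "away": [0.0, -0.5]}

state = embed(["john", "is", "drowning"], state_table, dim=2)
actions = {"swim to john": ["swim", "to", "john"], "walk away": ["walk", "away"]}
scores = {name: q_value(state, embed(toks, action_table, dim=2))
          for name, toks in actions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because each action is embedded from its text, the set of available actions can change from state to state without changing the network's output layer.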

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Exploiting local features from deep networks for image retrieval
Ng, Joe Yue-Hei and Yang, Fan and Davis, Larry S.
Conference and Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Vivek Gandhi 8 years ago

In this paper, the authors raise a very important point for instance-based image retrieval. For a task like image recognition, features extracted from the higher layers of deep networks work really well in general, but for a task like instance-based image retrieval, features extracted from higher layers don't prove to be that useful. The authors therefore suggest taking features from lower layers and applying [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf) to them. On top of the VLAD encoding, post-processing steps such as intra-normalisation are performed, after which PCA reduces the encoding to 128 dimensions. The authors performed their experiments using [GoogLeNet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf): they tried Inception 3a, Inception 4a and Inception 4e on GoogLeNet and conv4_2, conv5_1 and conv5_2 on VGG-16. These layers have almost similar performance on the dataset they used. The performance metric used by the authors is Mean Average Precision (MAP).
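
For readers unfamiliar with VLAD, the following is a minimal sketch of the encoding with intra-normalisation. The cluster centres would normally come from k-means over local CNN features and the result would then be reduced by PCA; both steps are omitted here, and the toy descriptors are made up.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def vlad(descriptors, centers):
    """VLAD encoding with intra-normalisation of each cluster's residual block."""
    dim = len(centers[0])
    blocks = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Hard-assign the descriptor to its nearest cluster centre.
        k = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centers[c])))
        for t in range(dim):
            blocks[k][t] += d[t] - centers[k][t]
    # Intra-normalisation: L2-normalise each block, then the whole vector.
    flat = [a for block in blocks for a in l2_normalize(block)]
    return l2_normalize(flat)


centers = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]
code = vlad(descriptors, centers)
print(len(code), code)
```

The encoding has one block of residuals per cluster, so its length is (number of clusters) x (descriptor dimension) regardless of how many local descriptors the image has.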

arxiv.org
arxiv-vanity.com
scholar.google.com

Learning both Weights and Connections for Efficient Neural Networks
Song Han and Jeff Pool and John Tran and William J. Dally
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.NE, cs.CV, cs.LG

[link] Summary by Martin Thoma 8 years ago

This paper is about pruning a neural network to reduce the FLOPs and memory necessary to use it. This method reduces AlexNet parameters to 1/9  and VGG-16 to 1/13 of the original size.

## Recipe

1. Train a network
2. Prune network: For each weight $w$: if $|w| <$ threshold, then $w \leftarrow 0$.
3. Train pruned network
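
A minimal sketch of steps 2 and 3, assuming pruning by weight magnitude and a retraining step that keeps pruned connections at zero via a mask. The weight matrix, gradients and learning rate are hypothetical.

```python
def prune(weights, threshold):
    """Zero every weight whose magnitude is below the threshold; return weights and mask."""
    mask = [[abs(w) >= threshold for w in row] for row in weights]
    pruned = [[w if keep else 0.0 for w, keep in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


def masked_update(weights, grads, mask, lr):
    """One retraining step: pruned connections stay at zero."""
    return [[w - lr * g if keep else 0.0 for w, g, keep in zip(wr, gr, mr)]
            for wr, gr, mr in zip(weights, grads, mask)]


weights = [[0.8, -0.05, 0.3], [-0.02, -0.6, 0.01]]
pruned, mask = prune(weights, threshold=0.1)
sparsity = sum(not k for row in mask for k in row) / 6
grads = [[0.1] * 3, [0.1] * 3]
retrained = masked_update(pruned, grads, mask, lr=0.5)
print(pruned, sparsity, retrained)
```

Keeping the mask fixed during retraining is what lets the surviving weights compensate for the removed connections without the removed ones growing back.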

## See also

* [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89)

arxiv.org
arxiv-vanity.com
scholar.google.com

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford and Luke Metz and Soumith Chintala
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV

[link] Summary by Shagun Sodhani 8 years ago

# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN) - an architecturally constrained variant of GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train
* Very useful to learn unsupervised image representations.

## Model

* GANs are difficult to scale using CNNs.
* Paper proposes following changes to GANs:
    * Replace any pooling layers with strided convolutions (for discriminator) and fractional strided convolutions (for generators).
    * Remove fully connected hidden layers.
    * Use batch normalisation in both generator (all layers except output layer) and discriminator (all layers except input layer).
    * Use LeakyReLU in all layers of the discriminator.
    * Use ReLU activation in all layers of the generator (except output layer which uses Tanh).
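
To see how strided and fractionally strided convolutions replace pooling, it helps to trace the spatial sizes through the stack. The sketch below uses the standard output-size formulas; the kernel size 4, stride 2 and padding 1 are assumptions borrowed from common implementations, not values stated in this summary.

```python
def transposed_conv_size(size, kernel, stride, padding):
    """Output size of a fractionally strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_size(size, kernel, stride, padding):
    """Output size of an ordinary strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# Generator: a 4x4 map upsampled by four stride-2 layers.
g_sizes = [4]
for _ in range(4):
    g_sizes.append(transposed_conv_size(g_sizes[-1], kernel=4, stride=2, padding=1))

# Discriminator: the mirror image, downsampling by strided convolutions.
d_sizes = [64]
for _ in range(4):
    d_sizes.append(conv_size(d_sizes[-1], kernel=4, stride=2, padding=1))

print(g_sizes)
print(d_sizes)
```

Each stride-2 layer doubles (generator) or halves (discriminator) the resolution, so the network learns its own up- and downsampling instead of relying on fixed pooling.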

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with minibatch size of 128.
* Weights initialized with 0 centered Normal distribution with standard deviation = 0.02
* Adam Optimizer
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5

## Observations

* Large-Scale Scene Understanding data
    * Demonstrates that the model scales with more data and higher resolution generation.
    * Memorization of training images is unlikely, given the low learning rate and minibatch SGD.
* Classifying CIFAR-10 dataset
    * Features
        * Train on Imagenet-1K and test on CIFAR-10.
        * Max pool discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
        * Flatten and concatenate to get a 28672-dimensional vector.
    * Linear L2-SVM classifier trained over the feature vector.
    * 82.8% accuracy, outperforms K-means (80.6%)
* Street View House Number Classifier
    * Similar pipeline as CIFAR-10
    * 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Images in the latent space do not show sharp transitions indicating that network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* The network seems to have some success in disentangling scene representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.

arxiv.org
arxiv-vanity.com
scholar.google.com

Towards Dropout Training for Convolutional Neural Networks
Haibing Wu and Xiaodong Gu
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV, cs.NE

[link] Summary by Martin Thoma 8 years ago

Probabilistic weighted pooling is proposed in this paper. It is based on max-pooling and dropout.

arxiv.org
scholar.google.com

A Roadmap towards Machine Intelligence
Mikolov, Tomas and Joulin, Armand and Baroni, Marco
- 2015 via Local Bibsonomy
Keywords: deep, learning, SYMBOL

[link] Summary by Shagun Sodhani 8 years ago

# A Roadmap towards Machine Intelligence

## Introduction

* The paper presents some general characteristics that intelligent machines should possess and a roadmap to develop such intelligent machines in small, realistic steps.
* [Link to the paper](https://arxiv.org/abs/1511.08130)

## Ability to Communicate

* The intelligent agents should be able to communicate with humans, preferably using language as the medium.
* Such systems can be programmed through natural language and can access much of the human knowledge which is encoded using natural language.
* The learning environment should facilitate interactive communication and the machine should have a minimalistic bit interface for IO to keep the interface simple.
* Further, the machine should be free to use any internal representation for learning tasks.

## Ability to Learn

* Learning allows the machine to adapt to the external environment and correct its mistakes.
* Users should be able to control the motivation of the machine via a communication channel. This is similar to the notion of rewards in reinforcement learning.

## A simulated ecosystem to educate communication-based intelligent machines

* Simulated environment to teach basic linguistic interactions and know-how to operate in the world.
* Though the environment should be challenging enough to force the machine to "learn how to learn", its complexity should be manageable.
* Unlike classic AI block worlds, the simulated environment is not intended to teach an exhaustive set of functionality to the agent. The aim is to teach the machine how to learn efficiently by combining already acquired skills.

### Description

#### Agent

* Learner or actor
* Teacher
    * Assigns tasks and rewards to the learner and provides helpful information.
    * Aim is to kick start the learner's efficient learning capabilities without providing enough direct information.
* Environment
    * Learner explores the environment by giving orders, asking questions and receiving feedback.
    * Environment uses a controlled language which is more explicit and restricted.

Think of learner as a high-level programming language, the teacher as the programmer and the environment as the compiler.

#### Interface Channels

* Generic input and output channels.
* Teacher and environment write to the input channel.
* Reward is written to input channel.
* Learner writes to the output channel and learns to use unambiguous prefixes to address the agents and services it needs to interact with.

#### Reward

* Way to provide feedback to the learner.
* Rewards should become sparse as the learner's intelligence grows and "curiosity" should be a learnt strategy.
* Learner should maximise average reward over time so that faster strategies are preferred in case of equal rewards.

#### Incremental Structure

* Think of learner progressing through different levels where skills from earlier levels can be used in later levels.
* Tasks need not be ordered within a level.
* Learner starts by performing basic tasks like repeating characters then learns to associate linguistic strings to action sequences. Further, the learner learns to ask questions and "read" natural text.

#### Time Off

* Learner is given time to either explore the environment or to interact with the Teacher or to update its internal structure by replaying the previous experience.

#### Evaluation

* Evaluating the learning agent on its final behaviour alone is not sufficient, as it overlooks the number of attempts needed to reach the optimal behaviour.
* A better approach would be to conduct a public competition where developers have access to a preprogrammed environment for a fixed amount of time and learners are evaluated on tasks that are considerably different from the tasks encountered during training.

#### Tasks

A brief overview of the type of tasks is provided [here](https://github.com/facebookresearch/CommAI-env/blob/master/TASKS.md)

## Types of Learning

* Concept of positive and negative rewards.
* Discovery of algorithms.
* Remember facts, skills, and learning strategies.

## Long term memory

* To store facts, algorithms and even ability to learn.

## Compositional Learning Skills

* Producing new structures by combining together known facts and skills.
* Understanding new concepts should not always require training examples.

## Computational properties of intelligent machines

* Computational model should be able to represent any pattern in data (alternatively, represent any algorithm in fixed length).
* Among the various Turing-complete computational systems available, the most natural choice would be a compositional system that can perform computations in parallel.
* Alternatively, a non-growing model with immensely large capacity could be used.
* In a growing model, new cells are connected to ones that spawned them leading to topological structures that can contribute to learning.
* But it is not clear if such topological structures can arise in a large-capacity unstructured model.

arxiv.org
arxiv-vanity.com
scholar.google.com

FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff and Dmitry Kalenichenko and James Philbin
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV

[link] Summary by Martin Thoma 8 years ago

FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative).

The embedding is learned with a loss function inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.

## LMNN

Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric

$$d(x, y) = (x -y) M  (x -y)^T$$

where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ need not hold.
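
A small numeric sketch shows why this is only a pseudo-metric. It assumes a matrix $M$ that is positive semi-definite but not full rank (here a hypothetical rank-1 matrix), in which case two distinct points can have distance zero.

```python
def mahalanobis_sq(x, y, M):
    """d(x, y) = (x - y) M (x - y)^T for row vectors x and y."""
    diff = [a - b for a, b in zip(x, y)]
    Md = [sum(M[r][c] * diff[c] for c in range(len(diff))) for r in range(len(diff))]
    return sum(d * m for d, m in zip(diff, Md))


# Hypothetical M = L^T L with L = [[1, 1]]: positive semi-definite, rank 1.
M = [[1.0, 1.0], [1.0, 1.0]]
x, y = [1.0, 0.0], [0.0, 1.0]
print(mahalanobis_sq(x, y, M))           # distinct points, distance 0
print(mahalanobis_sq(x, [0.0, 0.0], M))  # non-zero distance
```

Here $M$ only measures differences along the direction (1, 1), so any two points that differ orthogonally to it are indistinguishable.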

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low.

They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$.
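
The constraint above is usually turned into a hinge-style loss. Below is a minimal sketch, assuming embeddings are L2-normalised onto the unit sphere before distances are computed; the 2-dimensional vectors and the margin value are invented for the example.

```python
import math


def normalize(v):
    """Project an embedding onto the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha):
    """Hinge on the margin constraint: zero once the triplet satisfies it."""
    a, p, n = normalize(anchor), normalize(positive), normalize(negative)
    return max(0.0, sq_dist(a, p) + alpha - sq_dist(a, n))


anchor, positive, negative = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(triplet_loss(anchor, positive, negative, alpha=0.2))  # constraint satisfied
print(triplet_loss(anchor, negative, positive, alpha=0.2))  # constraint violated
# Rescaling the raw embeddings has no effect once they are normalised.
print(triplet_loss([2.0, 0.0], [1.8, 0.2], [0.0, 2.0], alpha=0.2))
```

Triplets that already satisfy the margin contribute zero loss and zero gradient, which is why selecting *hard* triplets matters for training speed.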

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?

## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13)  and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)

arxiv.org
arxiv-vanity.com
scholar.google.com

Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems
Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CL, cs.LG

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent.
* [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/)

#### Dataset

* Created using large-scale real-world sources - OMDB (Open Movie Database), MovieLens and Reddit.
* Consists of ~75K movie entities and ~3.5M training examples.

#### Tasks

##### QA Task

* Answering Factoid Questions without relation to the previous dialogue.
* KB(Knowledge Base) created using OMDB and stored as triplets of the form (Entity, Relation, Entity).
* Question (in Natural Language Form) generated by creating templates using [SimpleQuestions](https://arxiv.org/abs/1506.02075)
* Instead of giving out just 1 response, the system ranks all the answers in order of their relevance.

##### Recommendation Task

* Providing personalised responses to the user via recommendation instead of providing universal facts as in case 1.
* MovieLens dataset with a *user x item* matrix of ratings.
* Statements (for any user) are generated by sampling highly ranked movies by the user and forming a statement about these movies using natural language templates.
* Like the previous case, a list of ranked responses is generated.

##### QA + Recommendation Task

* Maintaining short dialogues involving both factoid and personalised content.
* Dataset consists of short conversations of 3 exchanges (3 from each participant).

##### Reddit Discussion Task

* Identify the most likely response in discussions on Reddit.
* Data processed to flatten the potential conversation so that it appears to be a two participant conversation.

##### Joint Task

* Combines all the previous tasks into one single task to test all the skills at once.

#### Models Tested

* **Memory Networks** - Comprise a memory component that includes both long term memory and short term context.

* **Supervised Embedding Models** - Sum the word embeddings of the input and the target independently and compare them with a similarity metric.

* **Recurrent Language Models** - RNN, LSTM, SeqToSeq

* **Question Answering Systems** - Systems that answer natural language questions by converting them into search queries over a KB.

* **SVD(Singular Value Decomposition)** - Standard benchmark for recommendation.

* **Information Retrieval Models** - Given a message, find the most similar message in the training dataset and report its output or find a most similar response to input directly.

#### Result

##### QA Task

* QA System > Memory Networks > Supervised Embeddings > LSTM

##### Recommendation Task

* Supervised Embeddings > Memory Networks > LSTM > SVD

##### Task Involving Dialog History

* QA + Recommendation Task and Reddit Discussion Task
* Memory Networks > Supervised Embeddings > LSTM

##### Joint Task

* Supervised word embeddings perform very poorly even when using a large number of dimensions (2000 dimensions).
* Memory Networks perform better than embedding models as they can utilise the local context and the long-term memory. But they do not perform as well on standalone QA tasks.

arxiv.org
scholar.google.com

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
Jason Weston and Antoine Bordes and Sumit Chopra and Alexander M. Rush and Bart van Merriënboer and Armand Joulin and Tomas Mikolov
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.AI, cs.CL, stat.ML

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

The [paper](http://arxiv.org/pdf/1502.05698v10) presents a framework and a set of synthetic toy tasks (classified into skill sets) for analyzing the performance of different machine learning algorithms.

#### Tasks

* **Single/Two/Three Supporting Facts**: Questions where a single supporting fact (or multiple supporting facts) provides the answer. The more supporting facts are needed, the tougher the task.
* **Two/Three Argument Relations**: Requires differentiation between objects and subjects.
* **Yes/No Questions**: True/False questions.
* **Counting/List/Set Questions**: Requires ability to count or list objects having a certain property.
* **Simple Negation and Indefinite Knowledge**: Tests the ability to handle negation constructs and model sentences that describe a possibility and not a certainty.
* **Basic Coreference, Conjunctions, and Compound Coreference**: Requires ability to handle different levels of coreference.
* **Time Reasoning**: Requires understanding the use of time expressions in sentences.
* **Basic Deduction and Induction**: Tests basic deduction and induction via inheritance of properties.
* **Position and Size Reasoning**
* **Path Finding**: Find path between locations.
* **Agent's Motivation**: Why an agent performs an action, i.e. what is the state of the agent.

#### Dataset

* The dataset is available [here](https://research.facebook.com/research/-babi/) and the source code to generate the tasks is available [here](https://github.com/facebook/bAbI-tasks).
* The different tasks are independent of each other.
* For supervised training, the set of relevant statements is provided along with questions and answers.
* The tasks are available in English, Hindi and shuffled English words.

#### Data Simulation

* Simulated world consists of entities of various types (locations, objects, persons etc) and of various actions that operate on these entities.
* These entities have their internal state and follow certain rules as to how they interact with other entities.
* Basic simulations are of the form `<actor> <action> <object>`, e.g. `Bob go school`.
* To add variations, synonyms are used for entities and actions.
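
A toy version of such a simulation, loosely in the spirit of the single-supporting-fact task, could look like this. The actor and place names, the fixed `go` action and the question wording are all assumptions for the example; the real generator is considerably richer.

```python
import random


def simulate(actors, places, steps, rng):
    """Generate a toy story of moves plus a 'where is X' question."""
    story, location = [], {}
    for _ in range(steps):
        actor, place = rng.choice(actors), rng.choice(places)
        story.append(f"{actor} go {place}")
        location[actor] = (place, len(story) - 1)  # remember the supporting fact
    asked = rng.choice(sorted(location))
    answer, fact_index = location[asked]
    return story, f"where is {asked}", answer, fact_index


rng = random.Random(1)
story, question, answer, fact_index = simulate(
    ["Bob", "Mary"], ["school", "kitchen", "garden"], steps=4, rng=rng)
print(story, question, answer, fact_index)
```

The returned `fact_index` plays the role of the supporting-fact label that is available for strongly supervised training.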

#### Experiments

##### Methods

* N-gram classifier baseline
* LSTMs
* Memory Networks (MemNNs)
* Structured SVM incorporating externally labeled data

##### Extensions to Memory Networks

* **Adaptive Memories** - learn the number of hops to be performed instead of using the fixed value of 2 hops.
* **N-grams** - Use a bag of 3-grams instead of a bag-of-words.
* **Nonlinearity** - Apply 2-layer neural network with *tanh* nonlinearity in the matching function.

##### Structured SVM

* Uses coreference resolution and semantic role labeling (SRL) which are themselves trained on a large amount of data.
* First train with strong supervision to find supporting statements and then use a similar SVM to find the response.

##### Results

* Standard MemNNs outperform N-gram and LSTM baselines but still fail on a number of tasks.
* MemNNs with Adaptive Memory improve the performance for multiple supporting facts task and basic induction task.
* MemNNs with N-gram modeling improve results when word order matters.
* MemNNs with Nonlinearity perform well on Yes/No tasks and indefinite knowledge tasks.
* Structured SVM outperforms vanilla MemNNs but is not as good as MemNNs with modifications.
* Structured SVM performs very well on path finding task due to its non-greedy search approach.

arxiv.org
arxiv-vanity.com
scholar.google.com

Generative Image Modeling Using Spatial LSTMs
Lucas Theis and Matthias Bethge
arXiv e-Print archive - 2015 via Local arXiv
Keywords: stat.ML, cs.CV, cs.LG

[link] Summary by Liew Jun Hao 8 years ago

#### Introduction
This [paper](https://github.com/lucastheis/ride) introduces the *recurrent image density estimator* (RIDE), a generative model that combines a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural images. In this work, the authors use *spatial* LSTMs (SLSTM) to capture the semantics in the form of hidden states; these hidden vectors are then fed into *factorized mixtures of conditional Gaussian scale mixtures* (MCGSMs) to predict the state of the corresponding pixels.

##### __1. Spatial long short-term memory (SLSTM)__
This is a straightforward extension of the multidimensional RNN in order to capture long range interaction. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ be the intensity of the pixel at location ${ij}$. At each location $ij$, each LSTM unit performs the following operations:

$\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j-1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i-1,j} \odot \mathbf{f}^r_{ij} $ 

$\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$

$\begin{pmatrix} 
\mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{f}_{ij}^r\\ \mathbf{f}_{ij}^c
\end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma\\ \sigma  \end{pmatrix} T_{\mathbf{A,b}}  \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j-1} \\ \mathbf{h}_{i-1,j} \end{pmatrix} $

where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are the memory units and hidden units respectively. Note that there are two different forget gates, $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$, for the two preceding memory states $\mathbf{c}_{i,j-1}$ and $\mathbf{c}_{i-1,j}$. Also note that $\mathbf{x}_{<ij}$ here denotes a *causal neighborhood*, obtained by applying the Markov assumption.
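
To make the update concrete, here is a minimal NumPy sketch of a single SLSTM step at one location. It assumes five gate pre-activations ($\mathbf{g}, \mathbf{o}, \mathbf{i}, \mathbf{f}^r, \mathbf{f}^c$) stacked in that order in the output of the affine map $T_{\mathbf{A,b}}$; the function name `slstm_step`, the layer sizes and the random initialisation are illustrative assumptions, not taken from the RIDE code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def slstm_step(x_nbhd, h_left, h_up, c_left, c_up, A, b):
    """One spatial LSTM update at location (i, j).

    x_nbhd         : causal neighbourhood x_{<ij}, shape (n_in,)
    h_left, c_left : hidden / memory state at (i, j-1), shape (n_hid,)
    h_up, c_up     : hidden / memory state at (i-1, j), shape (n_hid,)
    A, b           : affine map T_{A,b}, shapes (5*n_hid, n_in + 2*n_hid) and (5*n_hid,)
    """
    z = A @ np.concatenate([x_nbhd, h_left, h_up]) + b
    g, o, i, f_r, f_c = np.split(z, 5)
    g = np.tanh(g)
    o, i, f_r, f_c = sigmoid(o), sigmoid(i), sigmoid(f_r), sigmoid(f_c)
    # Two forget gates: f_c for the state to the left, f_r for the state above.
    c = g * i + c_left * f_c + c_up * f_r
    h = np.tanh(c * o)
    return h, c

# Illustrative sizes and random parameters (assumptions for this sketch).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
A = 0.1 * rng.standard_normal((5 * n_hid, n_in + 2 * n_hid))
b = np.zeros(5 * n_hid)
h, c = slstm_step(rng.standard_normal(n_in),
                  np.zeros(n_hid), np.zeros(n_hid),
                  np.zeros(n_hid), np.zeros(n_hid), A, b)
print(h, c)
```

Sweeping this step over the image in raster order, feeding each location the states from its left and upper neighbours, gives the full two-dimensional recurrence.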

![ride_1](http://i.imgur.com/W8ugGvl.png)
As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections.

##### __2. Factorized mixtures of conditional Gaussian scale mixtures__
A generative model can usually be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta})$ using the chain rule. One way to improve the representational power of a model is to introduce a different set of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}|\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying shared parameters leads to a drastic increase in the number of parameters. Therefore, the authors apply two simple, commonly used assumptions:
1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to a small neighborhood around $x_{ij}$ (the causal neighborhood)
2. __Stationarity and shift invariance__: the same set of parameters $\mathbf{\theta}_{ij}$ is used for every location ${ij}$, which corresponds to the recurrent structure of an RNN.

Therefore, the hidden vector from the SLSTMs can be fed into the MCGSM to predict the state of the corresponding pixel, i.e. $p(x_{ij} | \textbf{x}_{<ij}) = p(x_{ij} | \textbf{h}_{ij})$.

The conditional distribution in MCGSM is represented as a mixture of experts:

$p(x_{ij} | \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s | \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij} | \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$.

where the first and second terms correspond to the gates and the experts respectively. To further reduce the number of parameters, the authors propose a *factorized* MCGSM, which makes it possible to use larger neighborhoods and more mixture components. (*__Remarks__: I am not too sure about the exact training of MCGSM, but as far as I understand, the MCGSM is first trained end-to-end with the SLSTM using SGD with momentum and then finetuned using L-BFGS after each epoch with the SLSTM parameters fixed.*)

* For training:

```
for n in range(num_epochs):
	for b in range(0, inputs.shape[0] - batch_size + 1, batch_size):
		# compute gradients
		f, df = f_df(params, b)

		loss.append(f / log(2.) / self.num_channels)

		# update SLSTM parameters
		for l in train_layers:
			for key in params['slstm'][l]:
				diff['slstm'][l][key] = momentum * diff['slstm'][l][key] - df['slstm'][l][key]
				params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key]

		# update MCGSM parameters
		diff['mcgsm'] = momentum * diff['mcgsm'] - df['mcgsm']
		params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm']
```

* Finetuning (part of the code)

```
for l in range(self.num_layers):
	self.slstm[l] = SLSTM(
		num_rows=hiddens.shape[1],
		num_cols=hiddens.shape[2],
		num_channels=hiddens.shape[3],
		num_hiddens=self.num_hiddens,
		batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]),
		nonlinearity=self.nonlinearity,
		extended=self.extended,
		slstm=self.slstm[l],
		verbosity=self.verbosity)

	hiddens = self.slstm[l].forward(hiddens)

# finetune with early stopping based on validation performance
return self.mcgsm.train(
	hiddens_train, outputs_train,
	hiddens_valid, outputs_valid,
	parameters={
		'verbosity': self.verbosity,
		'train_means': train_means,
		'max_iter': max_iter})
```

arxiv.org
scholar.google.com

Learning to Segment Object Candidates
Pinheiro, Pedro H. O. and Collobert, Ronan and Dollár, Piotr
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Qure.ai 8 years ago

Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.

This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):

1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.

The current paper improves step 1, i.e., region/object proposals.

Most object proposals approaches fall into three categories:
* Objectness scoring
* Seed Segmentation
* Superpixel Merging

The current method differs from all three.
It shares similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN.
The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.

## Model and Training

Both mask and score predictions are produced by a single convolutional network with multiple outputs. All the convolutional layers except the last few come from a pretrained VGG-A model.

Each training sample is a triplet: an RGB input patch, the binary mask corresponding to that patch, and a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:
* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range

Note that the network must output a mask for a single object at the center even when multiple objects are present.

Figure 1 shows the architecture and sampling for training.
![figure1](https://i.imgur.com/zSyP0ij.png)

The model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation.

## Inference

During full image inference, the model is applied densely at multiple locations and scales.
This can be done efficiently since all computations are convolutional, as in a fully convolutional network (FCN).

![figure2](https://i.imgur.com/dQWfy8R.png)

This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.

arxiv.org
scholar.google.com

Actions ~ Transformations
Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 8 years ago

Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
    - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
    - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to be from [1/3t, 1/2t] and [1/2t, 2/3t] respectively and estimated via brute force search during training.
    - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
    - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) push the distance for incorrect transformations up whenever it is smaller than a chosen margin.
- ACT Dataset
    - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
    - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
    - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories which are the same action under different subjects, objects and scenes.
- Experiments
    - Action recognition on UCF101, HMDB51, ACT.
    - Cross-category generalization on ACT.
- Visualizations
    - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
    - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context.
    - Embedding retrievals based on transformed precondition embeddings.
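
As a rough illustration of the "action as transformation" objective described in the Model bullets, here is a minimal NumPy sketch. It assumes the usual hinge form of a contrastive loss (wrong transformations are penalised only while their distance is below the margin) and uses cosine distance; the function names, the margin value and the toy matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def transformation_loss(f_pre, f_eff, T, label, margin=0.5):
    """Contrastive loss over per-action transformation matrices.

    f_pre, f_eff : precondition / effect embeddings, shape (d,)
    T            : one matrix per action category, shape (n_actions, d, d)
    label        : index of the true action
    """
    loss = 0.0
    for a in range(T.shape[0]):
        dist = cosine_distance(T[a] @ f_pre, f_eff)
        if a == label:
            loss += dist                      # pull the correct transformation towards the effect
        else:
            loss += max(0.0, margin - dist)   # push wrong ones away while inside the margin
    return loss

# Toy check: action 0 is the identity, action 1 flips the embedding.
f = np.array([1.0, 2.0, 3.0])
T = np.stack([np.eye(3), -np.eye(3)])
loss_correct = transformation_loss(f, f, T, label=0)
loss_wrong = transformation_loss(f, f, T, label=1)
print(loss_correct, loss_wrong)
```

At inference time the same distances can be used for classification by picking the action whose transformed precondition embedding lies closest to the effect embedding.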

**Thoughts**

- Modeling action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute force search over (y,z\_p,z\_e) to estimate the action category and segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass.

jmlr.org
scholar.google.com

Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Sohl-Dickstein, Jascha and Weiss, Eric A. and Maheswaranathan, Niru and Ganguli, Surya
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

I spent the week at ICML, and this paper on generative models is one of my favourites so far.

To be clear: this post doesn't add much to the presentation of the paper, but I will attempt to summarise my understanding of it. Also, I want to make clear that this is not my work.

Unsupervised learning has been one of the most interesting areas of machine learning in the last decades, but it is in the spotlight again since the deep learning crowd started to care about it. Unsupervised learning is hard because evaluating the loss function people want to use (log likelihood) is intractable for most interesting models. Therefore people come up with

- alternative objective functions, such as adversarial training, maximum mean discrepancy, or pseudolikelihood, which can be evaluated for a large class of interesting models
- alternative optimisation methods or approximate inference methods such as contrastive divergence or variational Bayes
- models that have some nice properties. This paper is an example of the last approach

#### The key idea behind the paper

What we typically try to do in representation learning is to map data to a latent representation. While the data can have an arbitrarily complex distribution along some complicated nonlinear manifold, we want the computed latent representations to have a nice distribution, like a multivariate Gaussian.

This paper takes this idea very explicitly, using a stochastic mapping to turn data into a representation: a random diffusion process. If you take any data and apply a Brownian motion-like stochastic process to it, you will end up with a standard Gaussian distributed variable, due to the stationarity of the Brownian motion. The image below shows an example: 2D observations (left) have a complex data distribution along the Swiss roll manifold. If one applies Brownian motion to each datapoint, the complicated structure starts to diffuse, and eventually the data is scrambled to become white noise (right).

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-09-at-13-27-41.png)
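
The forward (data to noise) direction is easy to simulate. The sketch below is a minimal NumPy illustration, assuming a variance-preserving Gaussian step $x_t = \sqrt{1-\beta}\,x_{t-1} + \sqrt{\beta}\,\epsilon$ with a fixed rate; the spiral data, the value of `beta` and the number of steps are illustrative choices, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data on a spiral (a stand-in for the Swiss roll in the figure).
n = 5000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n)
x0 = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / 10.0

def diffuse(x, num_steps, beta, rng):
    """Apply num_steps Gaussian diffusion steps with a fixed rate beta."""
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

x_T = diffuse(x0, num_steps=1000, beta=0.02, rng=rng)
print("mean:", x_T.mean(axis=0), "variance:", x_T.var(axis=0))
```

After enough steps the empirical mean and variance approach those of a standard Gaussian regardless of the shape of `x0`, which is the property the generative model exploits when it learns to run the process in reverse.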

Now the trick the authors use is to train a dynamical system to invert this random walk, so that the original data distribution can be reconstructed from random Gaussian noise. Amazingly, this works, and the training objective becomes very similar to that of variational autoencoders. Below is a figure showing what happens when we try to reconstruct data in the Swiss roll example. The top images read from right to left: we start with a bunch of points drawn from random noise (top right). We apply the inverse nonlinear transformation to these points (top middle). Over time, the points are pushed towards the original Swiss roll manifold (top left).

`The information about the data distribution is encoded in the approximate inverse dynamical system`

The bottom pictures show where this dynamical system tries to push points as time progresses.

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-09-at-13-30-16.png)

This is super cool. Now we have a deep generative process that can turn random noise into something that looks like our datapoints. It can generate roughly natural-looking images like these:

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-09-at-13-33-38.png)

#### Advantages

In this model a lot of things that are otherwise hard to do are easy to do:

1. generating/imagining data is straightforward
2. inference, i.e. calculating the latent representation from data, is simple
3. you can multiply the distribution with another distribution, making Bayesian calculations for stuff like denoising or superresolution possible.

#### Drawbacks and extensions

I think a drawback of the model is that if you run the diffusion process for too long (i.e. make the model deeper), the mutual information between datapoint and its representation is bound to decrease, due to the stationarity of Brownian motion. I guess this is going to be an important limitation to the depth of these models.

Also, the latent representations at each layer are assumed to be of exactly the same dimensionality and type as the data itself. So if we are modeling 100x100 images, then all layers in the resulting network will have 10k nodes. I guess this can be overcome by combining variational autoencoders with this method. Also, you can imagine augmenting your space with extra 'pixels' that are only used for richer representations in the intermediate layers.

Anyway, this is super cool, go read the paper.

jmlr.org
scholar.google.com

MADE: Masked Autoencoder for Distribution Estimation
Germain, Mathieu and Gregor, Karol and Murray, Iain and Larochelle, Hugo
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

This is my second favourite paper from ICML last week, and I think the title really does not do it justice. It is a great idea about training rich, tractable autoregressive generative models of data, and doing so by using standard techniques from autoencoder training with dropout.

Caveat (again): this is not my work, and this blog post does not really add anything new to the paper, only my perspective on it.

#### Unsupervised learning primer (again)

Unsupervised learning is about modelling the probability distribution $p(\mathbf{x})$ of some data, from which we observe independent samples $\mathbf{x}_i$. Often, the vector $\mathbf{x}$ is high dimensional, such as in images, where different components of $\mathbf{x}$ encode pixel intensities.

Typically, a probability model is specified as $q(\mathbf{x};\theta) = \frac{f(\mathbf{x};\theta)}{Z_\theta}$, where $f(\mathbf{x};\theta)$ is some positive function parametrised by $\theta$. The denominator $Z_\theta$ is called the normalisation constant, which makes sure that $q$ is a valid probability model: it has to sum up to one over all possible configurations of $\mathbf{x}$. The central problem in unsupervised learning is that for the most interesting models in high dimensional spaces, calculating $Z_\theta$ is intractable, so crucial quantities such as the model likelihood cannot be calculated and the model cannot be fitted. The community is therefore in search of

- interesting models that have tractable normalisation constants
- fancy methods to deal with intractable models (pseudo-likelihood, adversarial networks, contrastive divergence)

This paper is about the former.

#### Core ingredient: autoregressive models

This paper sidesteps the high dimensional normalisation problem by restricting the class of probability distributions to autoregressive models, which can be written as follows:

$$q(\mathbf{x};\theta) = \prod_{d=1}^{D} q(x_{d}\vert x_{1:d-1};\theta).$$

Here $x_d$ denotes the $d^{th}$ component of the input vector $\mathbf{x}$. In a model like this, we only need to compute the normalisation of each $q(x_{d}\vert x_{1:d-1};\theta)$ term, and we can be sure that the resulting model is a valid model over the whole vector $\mathbf{x}$. But as normalising these one-dimensional probability distributions is a lot easier, we have a whole range of interesting tractable distributions at our disposal.

#### Training multiple models simultaneously

Autoregressive models are used a lot in time series modelling and language modelling: hidden Markov models or recurrent neural networks are examples. There, autoregressive models are a very natural way to model data because the data comes ordered (in time).

What's weird about using autoregressive models in this context is that they are sensitive to the ordering of dimensions, even though that ordering might not mean anything. If $\mathbf{x}$ encodes an image, you can think about multiple orders in which pixel values can be serialised: sweeping left-to-right, top-to-bottom, inside-out etc. For images, none of these orderings is particularly natural, yet each of them specifies a different model above.

But it turns out, you don't have to choose one ordering, you can choose all of them at the same time. The neat trick in the masked autoencoder paper is to train multiple autoregressive models all at the same time, all of them sharing (a subset of) parameters $\theta$, but defined over different orderings of coordinates. This can be achieved by thinking of deep autoregressive models as a special case of an autoencoder, only with a few edges missing.

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-13-at-10-48-54.png)

Consider a fixed ordering of input dimensions. Now take a fully connected autoencoder, which defines a probability distribution $q(\hat{\mathbf{x}}\vert\mathbf{x};\theta)$. You can write this as

$$q(\hat{\mathbf{x}}\vert\mathbf{x};\theta) = \prod_{d=1}^{D} q(\hat{x}_{d}\vert x_{1:D};\theta)$$

Note the similarity to the autoregressive equation above, the only difference being that each coordinate now depends on every other coordinate ($x_{1:D}$), rather than only coordinates that precede it in the ordering ($x_{1:d-1}$). To turn this equation into the autoregressive equation above, we simply have to remove dependencies of each output coordinate $\hat{x}_{d}$ on any input coordinate $x_{e}$, where $e \geq d$. This can be done by removing edges along all paths from the input coordinate $x_{e}$ to the output coordinate $\hat{x}_{d}$. You can achieve this cutting of edges by multiplying the weight matrices $\mathbf{W}^{l}$ of the autoencoder neural network elementwise by binary masking matrices $\mathbf{M}^{\mathbf{W}^{l}}$. Hence the name masked autoencoder.
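
One way to build such masks for a network with a single hidden layer is sketched below in NumPy. It assumes the degree-based construction, as I understand it from the MADE paper: every unit is assigned an integer degree, and an edge is kept only if it cannot carry information from input $e$ to output $d$ with $e \geq d$. The sizes `D` and `H` and the cyclic assignment of hidden degrees are illustrative choices.

```python
import numpy as np

D, H = 4, 6  # input dimension and hidden width (illustrative values)

m_in = np.arange(1, D + 1)             # input / output coordinate d has degree d
m_hid = np.arange(H) % (D - 1) + 1     # hidden degrees cycle through 1..D-1

# Hidden unit k may see input e only if m_hid[k] >= m_in[e].
mask_W = (m_hid[:, None] >= m_in[None, :]).astype(float)   # shape (H, D)
# Output d may see hidden unit k only if m_in[d] > m_hid[k].
mask_V = (m_in[:, None] > m_hid[None, :]).astype(float)    # shape (D, H)

# connectivity[d, e] counts the surviving paths from input e to output d.
connectivity = mask_V @ mask_W
print(connectivity)
```

The product of the two masks is strictly lower triangular, so output $d$ can only depend on inputs that precede it in the ordering. A different ordering corresponds to permuting `m_in`, which yields different masks over the same underlying weights.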

The procedure above considered a fixed ordering of coordinates. You can repeat this process for any arbitrary ordering, for which you obtain different masking matrices but otherwise the same procedure. If you train this autoencoder network with randomly sampled masking matrices, you essentially train a family of autoregressive models, each sharing some parameters via the underlying autoencoder network.

Because masking is similar to the popular dropout training, implementing it is relatively straightforward and requires minimal change to existing autoencoder code. However, now you have a generative model - in fact, a large set of generative models - which has a lot of nice properties for you to enjoy.

#### The slight concern

Of course, this would be all too good to be true: a powerful deep generative model that is easy to evaluate and all. I think the problem with this is the following: If you train just one of these autoregressive models, that's tractable, exact and fine. But you really want to combine all (or many) of these because individually they are weak.

What is the interpretation of training with randomly drawn masking matrices? You can think of it as stochastic gradient descent on the following objective:

$$\mathbb{E}_{\mathbf{x}\sim p}\mathbb{E}_{\pi \sim U} \log q(\mathbf{x},\pi,\theta)$$

Here, I used $\pi$ to denote a permutation of the coordinates, and $\mathbb{E}_{\pi \sim U}$ to take an expectation over a uniform distribution over permutations. The distribution $q(\mathbf{x},\pi,\theta)$ is the autoregressive model defined by $\theta$ and the masking matrices corresponding to permutation $\pi$. $\mathbb{E}_{\mathbf{x}\sim p}$ denotes averaging over the empirical data distribution.

#### Combining as a mixture model

One way to combine autoregressive models is to take a mixture model. In the paper, the authors actually use an ensemble to make predictions, which is analogous to an equal mixture model where the mixture weights are uniform and fixed. The likelihood for this model would be the following:

$$\mathbb{E}_{\mathbf{x}\sim p} \log \mathbb{E}_{\pi \sim U} q(\mathbf{x},\pi,\theta)$$

Notice that the averaging over permutations now takes place inside the logarithm. By Jensen's inequality, we can say that randomly sampling masking matrices during training amounts to optimising a stochastically estimated lower bound to the likelihood of an equal mixture. This raises the question whether actually learning the weights in such a model would be hard using something like an EM algorithm with a sparsity-enforcing regulariser/prior over mixture weights.
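
The direction of the bound is easy to check numerically. The snippet below is a small NumPy sanity check on made-up likelihood values (the array `q` is hypothetical and stands in for $q(\mathbf{x},\pi,\theta)$ evaluated on 100 datapoints and 8 orderings); it compares the objective with the expectation outside the logarithm against the mixture log-likelihood with the expectation inside.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical likelihoods q(x_n, pi_k, theta) for 100 datapoints and 8 orderings.
q = rng.uniform(0.01, 1.0, size=(100, 8))

training_objective = np.log(q).mean(axis=1).mean()   # E_x E_pi log q
mixture_loglik = np.log(q.mean(axis=1)).mean()       # E_x log E_pi q

print(training_objective, mixture_loglik)
```

Because the logarithm is concave, the first number can never exceed the second, whatever values `q` takes.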

#### Combining as a product of experts model

Combining these autoregressive models as a mixture is not ideal. In mixture modeling the sharpness of the mixture distribution is bounded by the sharpness of component distributions. Your combined prediction can never be more confident than your most confident model. In this case, I expect the AR models to be pretty poor models individually, and therefore not to be very sharp, particularly along the first few coordinates in the corresponding ordering.

A better way to combine probabilistic models is via product of experts. You can actually interpret training by random masking matrices as a form of product of experts, but with the global normalisation ignored. I'm not sure if it would be possible/tractable to do anything better than this.

papers.nips.cc
scholar.google.com

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks
Denton, Emily L. and Chintala, Soumith and Szlam, Arthur and Fergus, Rob
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

This post is a comment on the Laplacian pyramid-based generative model proposed by researchers from NYU/Facebook AI Research.

Let me start by saying that I really like this model, and I think - looking at the samples drawn - it represents a nice big step towards convincing generative models of natural images.

To summarise the model, the authors use the Laplacian pyramid representation of images, where you recursively decompose the image to a lower resolution subsampled component and the high-frequency residual. The reason this decomposition is favoured in image processing is the fact that the high-frequency residuals tend to be very sparse, so they are relatively easy to compress and encode.
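
For readers unfamiliar with the representation, here is a minimal NumPy sketch of a Laplacian pyramid. It assumes 2x2 average pooling for subsampling and nearest-neighbour upsampling, which is simpler than the blur-and-decimate filters normally used; the function names are my own.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double the resolution by repeating each pixel."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(img, levels):
    """Return the high-frequency residuals (fine to coarse) and the coarsest image."""
    residuals = []
    current = img
    for _ in range(levels):
        coarse = downsample(current)
        residuals.append(current - upsample(coarse))
        current = coarse
    return residuals, current

def reconstruct(residuals, coarsest):
    """Invert build_pyramid by upsampling and adding residuals back."""
    current = coarsest
    for res in reversed(residuals):
        current = upsample(current) + res
    return current

rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8))
residuals, coarsest = build_pyramid(image, levels=2)
restored = reconstruct(residuals, coarsest)
print(coarsest.shape, np.allclose(restored, image))
```

The decomposition is exactly invertible: starting from the coarsest image and adding back one residual per level recovers the original, which is the structure a coarse-to-fine generative model can follow.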

In this paper the authors propose using convolutional neural networks at each layer of the Laplacian pyramid representation to generate an image sequentially, increasing the resolution at each step. The convnet at each layer is conditioned on the lower resolution image and some noise component $z_k$, and generates a random higher resolution image. The process continues recursively until the desired resolution is reached. For training they use the adversarial objective function. Below is the main figure that explains how the generative model works; I encourage everyone to have a look at the paper for more details:

![](http://www.inference.vc/content/images/2015/07/Screen-Shot-2015-07-23-at-11-15-17.png)

#### An argument about Conditional Entropies

What I think is weird about the model is the precise amount of noise that is injected at each layer/resolution. In the schematic above, these are the $z_k$ variables. Adding the noise is crucial to defining a probabilistic generative process; this is how it defines a probability distribution.

I think it's useful to think about entropies of natural images at different resolutions. When doing generative modelling or unsupervised learning, we want to capture the distribution of data. One important aspect of a probability distribution is its entropy, which measures the variability of the random quantity. In this case, we want to describe the statistics of the full resolution observed natural image $I_0$. (I borrow the authors' notation where $I_0$ represents the highest resolution image, and $I_k$ represents the $k$-times subsampled version.) Using the Laplacian pyramid representation, we can decompose the entropy of an image in the following way:

$$\mathbb{H}[I_0] = \mathbb{H}[I_{K}] + \sum_{k=0}^{K-1} \mathbb{H}[I_k\vert I_{k+1}].$$

The reason why the above decomposition holds is very simple. Because $I_{k+1}$ is a deterministic function of $I_{k}$ (subsampling), the conditional entropy $\mathbb{H}[I_{k+1}\vert I_{k}] = 0$. Therefore the joint entropy of the two variables is simply the entropy of the higher resolution image $I_{k}$, that is $\mathbb{H}[I_{k},I_{k+1}] = \mathbb{H}[I_{k}] + \mathbb{H}[I_{k+1}\vert I_{k}] = \mathbb{H}[I_{k}]$. So by induction, the joint entropy of all images $I_{k}$ is just the marginal entropy of the highest resolution image $I_0$. Applying the chain rule for joint entropies we get the expression above.
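
A tiny worked example may help. The NumPy snippet below checks the single-level case $\mathbb{H}[I_0] = \mathbb{H}[I_1] + \mathbb{H}[I_0\vert I_1]$ on a made-up distribution over two-pixel binary "images", where subsampling is assumed to keep only the first pixel; the probabilities are arbitrary illustrative numbers.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# joint[a, b] is the probability of the two-pixel image I_0 = (a, b).
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

# Subsampling keeps the first pixel, so I_1 = a is a deterministic function of I_0.
p_I1 = joint.sum(axis=1)

H_I0 = entropy(joint)
H_I1 = entropy(p_I1)
H_I0_given_I1 = sum(p_I1[a] * entropy(joint[a] / p_I1[a]) for a in range(2))

print(H_I0, H_I1 + H_I0_given_I1)
```

The two printed numbers agree, and applying the same step level by level gives the full sum over $k$.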

Now, the interesting bit is how the conditional entropies $\mathbb{H}[I_k\vert I_{k+1}]$ are 'achieved' in the Laplacian pyramid generative model paper. These entropies are provided by the injected random noise variables $z_k$. By the information processing lemma $\mathbb{H}[I_k\vert I_{k+1}] \leq \mathbb{H}[z_k]$. The authors choose $z_k$ to be uniform random variables whose dimensionality grows with the resolution of $I_k$. To quote them "The noise input $z_k$ to $G_k$ is presented as a 4th color plane to low-pass $l_k$, hence its dimensionality varies with the pyramid level." Therefore $\mathbb{H}[z_k] \propto 4^{-k}$, assuming that the pixel count quadruples at each layer.

So the conditional entropy $\mathbb{H}[I_k\vert I_{k+1}]$ is allowed to grow exponentially with resolution, at the same rate it would grow if the images contained pure white noise. In their model, they allow the per-pixel conditional entropy $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ to be constant across resolutions. To me, this seems undesirable. My intuition is, for natural images, $\mathbb{H}[I_k\vert I_{k+1}]$ may grow as $k$ decreases (because the dimensionality grows), but the per-pixel value $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ should decrease or converge to $0$ as the resolution increases. Very low-resolution subsampled natural images behave a little bit like white noise, there is a lot of variability in them. But as you increase the resolution, the probability distribution of the high-res image given the low-res image will become a lot sharper.

In terms of model capacity, this is not a problem, inasmuch as the convolutional models $G_{k}$ can choose to ignore some variance in $z_k$ and learn a more deterministic superresolution process. However, adding unnecessarily high entropy will almost certainly make fitting such a model harder. For example, the adversarial training process relies on sampling from $z_k$, and the procedure is pretty sensitive to sampling noise. If you make the distribution of $z_k$ unnecessarily high entropy, you will end up doing a lot of extra work during training until the network figures out to ignore the extra variance.

To solve this problem, I propose to keep the entropy of the noise vectors constant, or make them grow sub-linearly with the number of pixels in the image. This perhaps makes the generative convnets harder to implement. Another quick solution would be to introduce dependence between components of $z_k$ via a low-rank covariance matrix, or some sort of a hashing trick.

#### Adversarial training vs superresolution autoencoders

Another weird thing is that the adversarial objective function forgets the identity of the image. For example, you would want your model so that

`"if at the previous layer you have a low-resolution parrot, the next layer should be a higher-resolution parrot"`

Instead, what you get with the adversarial objective is

`"if at the previous layer you have a low-resolution parrot, the next layer should output a higher-dimensional image that looks like a plausible natural image"`

So, there is nothing in the objective function that enforces dependency between subsequent layers of the pyramid. I think if you made $G_k$ very complex, it could just learn to model natural images by itself, so that $I_{k}$ is in essence independent of $I_{k+1}$ and is purely driven by the noise $z_{k}$. You could sidestep this problem by restricting the complexity of the generative nets, or, again, by restricting the entropy of the noise.

Overall, I think the approach would benefit from a combination of the adversarial and a supervised (superresolution autoencoder) objective function.

papers.nips.cc
scholar.google.com

Learning to Linearize Under Uncertainty
Goroshin, Ross and Mathieu, Michaël and LeCun, Yann
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

I think this paper has two main ideas in there, I see them as independent, for reasons explained below:

- A new penalty function that aims at regularising the second derivative of the trajectory the latent representation traces over time. I see this as a generalisation of the slowness principle or temporal constancy, more about this in the next section.

- A new autoencoder-like method to predict future frames in video. Video is really hard to forward-predict with non-probabilistic models because high level aspects of video are genuinely uncertain. For example, in a football game, you can't really predict whether the ball will hit the goalpost, but the results might look completely different visually. This, combined with L2 penalties often results in overly conservative, blurry predictions. The paper improves things by introducing extra hidden variables, that allow the model to represent uncertainty in its predictions. More on this later.

#### Inductive bias: penalising curvature

The key idea of this paper is to learn good distributed representations of natural images from video in an unsupervised way. Intuitively, there is a lot of information contained in video, which is lost if you scramble the video and look at the statistics of individual frames only. The race is on to develop the right kind of prior and inductive bias that helps us fully exploit this temporal information. This paper presents one way of doing so, which is called learning to linearise (I'm going to call this L2L).

Naturally occurring images are thought to reside on some complex, nonlinear manifold whose intrinsic dimension is substantially lower than the number of pixels in an image. It is then natural to think about video as a journey on this manifold surface, along some smooth path. Therefore, if we aim to learn good generic features that correspond to coordinates on this underlying manifold, we should expect that these features vary in a smooth fashion over time as you play the video.

L2L uses this intuition to motivate their choice of an objective function that penalises a scale-invariant measure of curvature over time. In a way it tries to recover features that transform nearly linearly as time progresses and the video is played.

In their notations, $x_{t}$ denotes the data in frame $t$, which is transformed by a deep network to obtain the latent representation $z_{t}$. The penalty for the latent representation is as follows.

$$-\sum_{t} \frac{(z_t - z_{t-1})^{T}(z_{t+1} - z_{t})}{\|z_t - z_{t-1}\|\|z_{t+1} - z_{t}\|}$$

The expression above has a geometric meaning as the cosine of the angle between the vectors $(z_t - z_{t-1})$ and $(z_{t+1} - z_{t})$. The penalty is minimised if these two vectors are parallel and point in the same direction. In other words the penalty prefers when the latent feature representation keeps its momentum and continues along a linear path - and it does not like sharp turns or jumps. This seems like a sensible prior assumption to build on.
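
To make the penalty concrete, here is a minimal NumPy sketch of the expression above. The function name `curvature_penalty`, the `eps` safeguard and the two toy trajectories are my own assumptions for illustration, not something taken from the paper.

```python
import numpy as np

def curvature_penalty(z, eps=1e-12):
    """Negative sum of cosines between consecutive latent steps.

    z has shape (T, D): one D-dimensional latent vector per frame.
    eps is a hypothetical safeguard against zero-length steps.
    """
    steps = np.diff(z, axis=0)                  # z_t - z_{t-1}, shape (T-1, D)
    prev_step, next_step = steps[:-1], steps[1:]
    dots = np.sum(prev_step * next_step, axis=1)
    norms = np.linalg.norm(prev_step, axis=1) * np.linalg.norm(next_step, axis=1)
    return -np.sum(dots / (norms + eps))

t = np.arange(6, dtype=float)
straight = np.column_stack([t, 2.0 * t])                        # a linear path z_t = a t
zigzag = np.column_stack([t, np.where(t % 2 == 0, 1.0, -1.0)])  # sharp turn at every frame

penalty_straight = curvature_penalty(straight)  # every cosine is 1, so close to -(T - 2)
penalty_zigzag = curvature_penalty(zigzag)      # turns make the penalty larger
```

The straight path reaches the lowest value the penalty can take for six frames, while the zigzag path is penalised for changing direction at every step.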

L2L is very similar to another popular inductive bias used in slow feature analysis: the temporal slowness principle. According to this principle, the most relevant underlying features don't change very quickly. The slowness principle has a long history both in machine learning and as a model of human visual perception. In SFA one would minimise the following penalty on the latent representation:

$$\sum_{t} (z_t - z_{t-1})^{2},$$

where the square is applied component-wise. There are additional constraints in SFA, more about this later. We can understand the connection between SFA and this paper's penalty if we plot the penalty for a single hidden feature $z_{t,f}$ at time $t$, keeping all other features and values at neighbouring timesteps constant. This is plotted in the figure below (scaled and translated so the objectives line up nicely).

![](http://www.inference.vc/content/images/2015/09/-RfX4Dp2Y3YAAAAASUVORK5CYII-.png)

As you can see, both objectives have a minimum at the same location: they both try to force $z_{t,f}$ to linearly interpolate between the neighbouring timesteps. However, while SFA has a quadratic penalty, the learning to linearise objective tapers off at long distances. Compare this to Tukey's loss function used in outlier-resistant robust regression.
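
The comparison can be reproduced numerically with a small sketch. I assume a two-dimensional latent, fixed neighbours `z_prev` and `z_next`, and only the term of each objective that is centred on time $t$; these choices are mine and only meant to mimic the setting of the plot.

```python
import numpy as np

# Neighbouring latents are held fixed; only the second feature of z_t varies.
z_prev = np.array([0.0, 0.0])
z_next = np.array([2.0, 2.0])
v_grid = np.linspace(-4.0, 6.0, 1001)           # candidate values for z_{t,f}

def l2l_term(v):
    z_t = np.array([1.0, v])
    a, b = z_t - z_prev, z_next - z_t
    return -(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def sfa_term(v):
    z_t = np.array([1.0, v])
    return np.sum((z_t - z_prev) ** 2) + np.sum((z_next - z_t) ** 2)

l2l = np.array([l2l_term(v) for v in v_grid])
sfa = np.array([sfa_term(v) for v in v_grid])

v_best_l2l = v_grid[np.argmin(l2l)]   # minimum at the linear interpolation
v_best_sfa = v_grid[np.argmin(sfa)]   # same location for the slowness penalty
```

Both minima land on the midpoint between the neighbours, but the L2L term stays bounded as $z_{t,f}$ moves away while the SFA term keeps growing quadratically.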

Based on this, my prediction is that compared to SFA, this loss function is more tolerant of outliers, which in the temporal domain would mean abrupt jumps in the latent representation. So while SFA is equivalent to assuming that the latent features follow a Brownian-motion-like Ornstein–Uhlenbeck process, I'd imagine this prior corresponds to something like a jump diffusion process (although I don't think the analogy holds mathematically).

Which one of these inductive biases/priors is better at exploiting temporal information in natural video? Slow Brownian motion, or nearly-linear trajectories with potentially a few jumps? Unfortunately, don't expect any empirical answer to that from the paper. All experiments seem to be performed on artificially constructed examples, where the temporal information is synthetically engineered. Nor is there any real comparison to SFA.

#### Representing predictive uncertainty with auxiliary variables

While the encoder network learns to construct smoothly varying features $z_t$, the model also has a decoder network that tries to reconstruct $x_t$ and predict subsequent frames. This, the authors argue, is necessary in order for $z_t$ to contain enough relevant information about the frame $x_t$ (more about whether or not this is necessary later). The precise way this decoding is done has a novel idea as well: minimising over auxiliary variables.

Let's say our task is to predict a future frame $x_{t+k}$ based on the latent representation $z_{t}$. The problem is, this is a very hard problem. In video, just like in real life, anything can happen. Imagine you're modelling soccer footage, and the ball is about to hit the goalpost. In order to predict the next frames, not only do we have to know about natural image statistics, we also have to be able to predict whether the goal is in or not. An optimal predictive model would give a highly multimodal probability distribution as its answer. If you use the L2 loss with a deterministic feed-forward predictive network, it's likely to come up with a very blurry image, which would correspond to the average of this nasty multimodal distribution. This calls for something better, either a smarter objective function, or a better way of representing predictive uncertainty.

The solution the authors gave is to introduce hidden variables $\delta_{t}$, that the decoder network also receives as input in addition to $z_t$. For each frame, $\delta_t$ is optimised so that only the best possible reconstruction is taken into account in the loss function. Thus, the decoder network is allowed to use $\delta$ as a source of non-determinism to hedge its bets as to what the contents of the next frame will be. This is one step closer to the ideal setting where the decoder network is allowed to give a full probability distribution of possibilities and then is evaluated using a strictly proper scoring rule.
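
A rough sketch of the inner minimisation over $\delta$, under heavy simplifications: the decoder here is a hypothetical linear map (`A`, `B` are made-up weights, not the paper's architecture), and the inner loop is plain gradient descent on the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))     # hypothetical decoder weights acting on z_t
B = rng.normal(size=(8, 2))     # hypothetical decoder weights acting on delta_t

def decode(z, delta):
    return A @ z + B @ delta

def best_delta(z, x_target, n_steps=200):
    """Inner loop: gradient descent on delta for a fixed decoder."""
    delta = np.zeros(B.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(B, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_steps):
        residual = x_target - decode(z, delta)
        delta += step * 2.0 * B.T @ residual          # step along minus the gradient
    return delta

z = rng.normal(size=4)
x_future = rng.normal(size=8)                         # stand-in for the frame x_{t+k}

loss_without = np.sum((x_future - decode(z, np.zeros(2))) ** 2)
delta_star = best_delta(z, x_future)
loss_with = np.sum((x_future - decode(z, delta_star)) ** 2)   # only this enters the loss
```

Only the reconstruction at the optimised `delta_star` is scored, which is what lets the decoder hedge its bets.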

This inner loop minimisation (of $\delta$) looks very tedious, and introduces a few more parameters that may be hard to set. The algorithm is reminiscent of the E-step in expectation-maximisation, and also very similar to the iterated closest point algorithm Andrew Fitzgibbon talked about in his tutorial at BMVC this year.

In his tutorial, Andrew gave examples where jointly optimising model parameters and auxiliary variables ($\delta$) is advantageous, and I think the same logic applies here. Replacing the inner loop with simultaneous optimisation helps fix some pathologies, like slow convergence near the optimum. In addition, Andrew advocates exploiting the sparsity structure of the Hessian to implement efficient second-order gradient-based optimisation methods. These tricks are explained in paragraphs around equation 8 in (Prasad et al, 2010).

#### Predictive model: Is it necessary?

On a more fundamental level, I question whether the predictive decoder network is really a necessary addition to make L2L work.

The authors observe that the objective function is minimised by the "trivial" solutions $z_{t} = at + b$, where $a,b$ can be arbitrary constants. They then say that in order to make sure features do something more than just discover some of these trivial solutions, we also have to include a decoder network, that uses $z_t$ to predict future frames. I believe this is not necessary at all.

Because $z_t$ is a deterministic function of $x_t$, and $t$ is not accessible to $z_{t}$ in any other way than through inferring it from $x_t$, as long as $a\neq 0$, the linear solutions are not trivial at all. If the network discovers $z_{t} = at, a\neq 0$, you should in fact be very happy (assuming a single feature). The only problems with trivial solutions occur when $z_{t} = b$ ($z$ doesn't depend on the data at all) or when $z$ is multidimensional and several redundant features are sensitive to exactly the same thing.

These trivial solutions could be avoided the same way they are avoided in SFA, by constraining the overall spatial covariance of $z_{t}$ over the videoclip to be $I$. This would force each feature to vary at least a little bit with the data, hence avoiding the trivial constant solutions. It would also force features to be linearly decorrelated, solving the redundant features problem.

So I wonder if the decoder network is indeed a necessary addition to the model. I would love to encourage the authors to implement their new hypothesis of a prior both with and without the decoder. They may already have tried it without and found it really didn’t work, so it might just be a matter of including those results. This would in turn allow us to see SFA and L2L side-by-side, and learn something about whether and why their prior is better than the sl

arxiv.org
scholar.google.com

How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?
Huszar, Ferenc
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

### Evaluating Generative Models

A key topic I'm very interested in is the choices of objective functions used in unsupervised learning and generative models. The key organising principle should be this: the objective function we use for training a probabilistic model should match the way we ultimately want to use the model. Yet, in unsupervised learning this is often overlooked and I think we lack clarity around what the models are used for and how they should be trained and evaluated. This paper tries to clarify this a bit in the context of generative models. I also want to mention that another ICLR submission this year also deals with this fundamental question: I highly recommend taking a look.

Here, I'm going to consider a narrow definition of generative models: models we actually want to use to generate samples from which are then shown to a human user/observer. This includes use-cases such as image captioning, texture generation, machine translation, speech synthesis and dialogue systems, but excludes things like unsupervised pre-training for supervised learning, semisupervised learning, data compression, denoising and many others. Very often people don't make this distinction clear when talking about generative models which is one of the reasons why there is still no clarity about what different objective functions do.

I argue that when the goal is to train a model that can generate natural-looking samples, maximum likelihood is not a desirable training objective. Maximum likelihood is consistent so it can learn any distribution if it is given infinite data and a perfect model class. However, under model misspecification and finite data (that is, in pretty much every practically interesting scenario), it has a tendency to produce models that overgeneralise.

#### KL divergence as a perceptual loss

Generative modelling is about finding a probabilistic model $Q$ that in some sense approximates the natural distribution of data $P$. When researchers (or users of their product) evaluate generative models for perceptual quality, they draw samples from the model, then - for lack of a better word - eyeball the samples. In visual information processing this is often referred to as no-reference perceptual quality assessment (see e.g. Wang et al., 2002). In the paper, I propose that the KL divergence $KL[Q\| P]$ can be used as an idealised objective function to describe this scenario. This is related to maximum likelihood, which minimises $KL[P\|Q]$, but differs from it in fundamental ways which I will explain later.

Here is why I think $KL[Q\|P]$ should be used: First, we can make the assumption that the perceived quality of each sample is related to the *surprisal* $-\log Q_{human}(x)$ under the human observers' subjective prior of stimuli $Q_{human}(x)$. For those of you not familiar with computational cognitive science, this will seem ad-hoc, but it's a relatively common assumption to make when modelling reaction times in experiments for example. We further assume that the human observer maintains a very accurate model of natural stimuli, thus, $Q_{human}(x) \approx P(x)$. This is a fancy way of saying things like the observer being a native speaker therefore understanding all the nuances in language. These two assumptions suggest that in order to optimise our chances in this Turing test-like scenario, we need to minimise the following cross-entropy or perplexity term:

\begin{equation} - \mathbb{E}_{x\sim Q} \log P(x) \end{equation}

This perplexity is the exact opposite of the average negative log likelihood $- \mathbb{E}_{x\sim P} \log Q(x)$, with the roles of $P$ and $Q$ swapped. However, the perplexity alone would be minimised by a model $Q$ that deterministically picks the most likely stimulus. To enforce diversity one can simultaneously try to maximise the Shannon entropy of $Q$. This leaves us with the following KL divergence to optimise:

\begin{equation} KL[Q\| P] = - \mathbb{E}_{x\sim Q} \log P(x) + \mathbb{E}_{x\sim Q} \log Q(x) \end{equation}

So if we want to train models that produce nice samples, my recommendation is to try to use $KL[Q\|P]$ as an objective function or something that behaves like it. How does maximum likelihood compare?

#### Differences between maximum likelihood and $KL[Q\|P]$

Maximum likelihood is roughly the same as minimising $KL[P\|Q]$. The differences between minimising $KL[P\|Q]$ and $KL[Q\|P]$ are well understood and they frequently come up in the context of Bayesian approximate inference as well. Both divergences ensure consistency: minimising either converges to the true $P$ in the limit of infinite data and a perfect model class. However, they differ fundamentally in the way they deal with finite data and model misspecification (that is, in almost every practical scenario):

- $KL[P\|Q]$ tends to favour approximations $Q$ that overgeneralise $P$. If $P$ is multimodal, the optimal $Q$ will tend to cover all the modes of $P$, even at the cost of introducing probability mass where $P$ has $0$ mass. Practically this means that the model will occasionally produce implausible samples that don't look anything like samples from $P$.
- $KL[Q\|P]$ tends to favour under-generalisation. The optimal $Q$ will typically describe the single largest mode of $P$ well, at the cost of ignoring other modes if they are hard to model without covering low-probability areas as well. Practically this means that $KL[Q\|P]$ will try to avoid introducing implausible samples, sometimes at the cost of missing the majority of plausible samples under $P$.

In other words: $KL[P\|Q]$ is liberal, $KL[Q\|P]$ is conservative. In yet other words: $KL[P\|Q]$ is an optimist, $KL[Q\|P]$ is a pessimist.
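
This difference is easy to see on a toy problem. The sketch below is my own construction, not an experiment from the paper: $P$ is a discretised mixture of two bumps at $\pm 4$, $Q$ is a single Gaussian-shaped bump, and a brute-force grid search over its mean and width finds the minimiser of each divergence.

```python
import numpy as np

x = np.linspace(-15.0, 15.0, 601)

def log_normalise(log_w):
    return log_w - np.log(np.sum(np.exp(log_w)))

# P: an equal mixture of two unit-variance bumps at -4 and +4 (discretised on x).
log_p = log_normalise(np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2))

def kl(log_a, log_b):
    return np.sum(np.exp(log_a) * (log_a - log_b))

best = {"KL[P||Q]": (np.inf, None, None), "KL[Q||P]": (np.inf, None, None)}
for m in np.linspace(-6.0, 6.0, 49):
    for s in np.linspace(0.5, 6.0, 23):
        log_q = log_normalise(-0.5 * ((x - m) / s) ** 2)   # single-bump model Q
        candidates = (("KL[P||Q]", kl(log_p, log_q)), ("KL[Q||P]", kl(log_q, log_p)))
        for name, value in candidates:
            if value < best[name][0]:
                best[name] = (value, m, s)

_, m_forward, s_forward = best["KL[P||Q]"]   # wide Q centred between the modes
_, m_reverse, s_reverse = best["KL[Q||P]"]   # narrow Q sitting on one of the modes
```

The $KL[P\|Q]$ fit spreads mass over the empty region between the modes, while the $KL[Q\|P]$ fit commits to one mode and ignores the other.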

The problem of course is that $KL[Q\|P]$ is super hard to optimise based on a finite sample from $P$. Even harder than maximum likelihood. Not only that, the KL divergence is also not very well behaved, and is not well-defined unless $P$ is positive everywhere where $Q$ is positive. So there is little hope we can turn $KL[Q\|P]$ into a practical training algorithm.

#### Generalised Adversarial Training

Generative Adversarial Networks (GANs) train a generative model jointly with an adversarial discriminative model that tries to differentiate between artificial and real data. The idea is, a generative model is good if it can fool the best discriminative model into thinking the generated samples are real. GANs have produced some of the nicest looking samples you'll find on the Internet and got people very excited about generative models again: human faces, album covers, etc.

How do they come into this picture? It's because they can be understood as approximately minimising the Jensen-Shannon divergence:

\begin{equation} JSD[P\|Q] = JSD[Q\|P] = \frac{1}{2}KL\left[P\middle\|\frac{P+Q}{2}\right] + \frac{1}{2}KL\left[Q\middle\|\frac{P+Q}{2}\right]. 
\end{equation}

Looking at the equation above you can immediately see how it's related to this topic. JS divergence is a bit like a symmetrised version of KL divergence. It's not $KL[P\|Q]$, not $KL[Q\|P]$, but a bit of both. So one can expect that minimising JS divergence would exhibit a behaviour that is kind of halfway between the two extremes explained above. And that means that GANs would generate better samples than methods trained via maximum likelihood and similar objectives.

What's more, one can generalise JS divergence to a whole family of divergences, parametrised by a probability $0<\pi<1$ as follows:

\begin{equation} JS_{\pi}[P\|Q] = \pi \cdot KL[P\|\pi P+(1-\pi)Q] + (1-\pi)KL[Q\|\pi P+(1-\pi)Q]. 
\end{equation}

What I show in the paper is that by varying $\pi$ between the two extremes, one can effectively interpolate between the behaviour of maximum likelihood ($\pi\rightarrow 0$) and minimising $KL[Q\|P]$ ($\pi\rightarrow 1$). See the paper for details. This interpolation between behaviours is illustrated in the main figure below:

![](http://www.inference.vc/content/images/2015/11/Screen-Shot-2015-11-16-at-16-19-10.png)

For any given value of $\pi$, we can optimise $JS_{\pi}$ approximately using an algorithm that is a slightly changed version of the original GAN algorithm. This is because the generalised JS divergence still has an elegant information theoretic interpretation. Consider a communications channel on which we can transmit a single data point of some kind. We toss a coin and with probability $\pi$, we send a sample from $P$, and with probability $1-\pi$ we send a sample from $Q$ instead. The receiver doesn't know the outcome of the coinflip, she only observes the sample. The $JS_{\pi}$ is the mutual information between the observed sample and the coinflip. It is also an upper bound on how well any algorithm can do in guessing the coinflip from the observed sample.
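
Both the mutual information reading and the two limits can be checked numerically. The following sketch uses two made-up discrete distributions `p` and `q` (my assumption, chosen only to have full support) and the definition of $JS_{\pi}$ given earlier.

```python
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

def entropy(a):
    return -np.sum(a * np.log(a))

def js_pi(p, q, pi):
    mix = pi * p + (1.0 - pi) * q
    return pi * kl(p, mix) + (1.0 - pi) * kl(q, mix)

# Two hypothetical discrete distributions with full support.
p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.10, 0.30, 0.40, 0.20])

pi = 0.3
mix = pi * p + (1.0 - pi) * q
# Mutual information between sample and coin flip: H(sample) - H(sample | coin).
mutual_info = entropy(mix) - pi * entropy(p) - (1.0 - pi) * entropy(q)

small = 1e-4
near_ml = js_pi(p, q, small) / small                 # approaches KL[P||Q] as pi -> 0
near_reverse = js_pi(p, q, 1.0 - small) / small      # approaches KL[Q||P] as pi -> 1
```

Note that the divergence itself vanishes at both extremes, so the limits only show up after rescaling by $\pi$ or $1-\pi$, as done here.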

To implement an adversarial training algorithm for $JS_{\pi}$ one simply needs to change the ratio of samples the discriminative network sees from $Q$ vs $P$ (or apply appropriate weights during training). In the original method the discriminator network is faced with a balanced classification problem, i.e. $\pi=\frac{1}{2}$. It is hard to believe, but this irrelevant-looking modification changes the behaviour of the GAN algorithm dramatically, and can in theory allow the GAN algorithm to approximate either maximum likelihood or $KL[Q\|P]$.

This analysis explains why GANs have been so successful in generating very nice looking images, and relatively few weird-looking ones. It is also worth pointing out that the GAN method is still in its infancy and has many issues and limitations. The main issue is that it is based on sampling from $Q$ which doesn't work well in high dimensions. Hopefully some of these limitations can be overcome and then we should have a pretty powerful framework for training good generative models.

arxiv.org
scholar.google.com

Adversarial Autoencoders
Makhzani, Alireza and Shlens, Jonathon and Jaitly, Navdeep and Goodfellow, Ian J.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

#### Summary of this post:

* an overview of the motivation behind adversarial autoencoders and how they work
* a discussion on whether the adversarial training is necessary in the first place. tl;dr: I think it's an overkill and I propose a simpler method along the lines of kernel moment matching.

#### Adversarial Autoencoders

Again, I recommend everyone interested to read the actual paper, but I'll attempt to give a high level overview of the main ideas in the paper. I think the main figure from the paper does a pretty good job explaining how Adversarial Autoencoders are trained:

![](http://www.inference.vc/content/images/2016/01/Screen-Shot-2016-01-08-at-14-48-25.png)

The top part of this image is a probabilistic autoencoder. Given the input $\mathbf{x}$, some latent code $\mathbf{z}$ is generated by sampling from an encoding distribution $q(\mathbf{z}\vert\mathbf{x})$. This distribution is typically modeled as the output of a deep neural network. In normal autoencoders this encoder would be deterministic; here we allow it to be probabilistic.

A decoder network is then trained to decode $\mathbf{z}$ and reconstruct the original input $\mathbf{x}$. Of course, reconstruction will not be perfect, but we train the networks to minimise reconstruction error, which is typically just mean squared error.

The reconstruction cost ensures that the encoding process retains information about the input image, but it doesn't enforce anything else about what these latent representations $\mathbf{z}$ should do. In general, their distribution is described as the aggregate posterior $q(\mathbf{z})=\mathbb{E}_\mathbf{x} q(\mathbf{z}\vert\mathbf{x})$. Often, we would like this distribution to match a certain prior $p(\mathbf{z})$. For example, we may want $\mathbf{z}$ to have independent, Gaussian-distributed components (nonlinear ICA, PCA). Or we may want to force the latent representations to correspond to discrete class labels, or binary factors. Or we may simply want to ensure there are 'no gaps' in the latent space, and any random $\mathbf{z}$ would lead to a viable sample when squashed through the decoder network.

So there are multiple reasons why one might want to control the aggregate posterior $q(\mathbf{z})$ to match a predefined prior $p(\mathbf{z})$. The authors achieve this by introducing an additional term in the autoencoder loss function, one that measures the divergence between $q$ and $p$. The authors chose to do this via adversarial training: they train a discriminator network that constantly learns to discriminate between real code vectors $\mathbf{z}$ produced by encoding real data, and random code vectors sampled from $p$. If $q$ matches $p$ perfectly, the optimal discriminator network should have a large classification error.

#### Is this an overkill?

My main question about this paper was whether the adversarial cost is really needed here, because I think it's an overkill. Let me explain:

Adversarial training is powerful when all else fails to quantify divergence between complicated, potentially degenerate distributions in high dimensions, such as images or video. Our toolkit for dealing with images is limited, CNNs are the best tool we have, so it makes sense to incorporate them in training generative models for images. GANs - when applied directly to images - are a great idea.

However, here adversarial training is applied to an easier problem: to quantify the divergence between a simple, fixed prior (e.g. Gaussian) and an empirical distribution of latents. The latent space is usually lower-dimensional, distributions better behaved. Therefore, matching to $p(\mathbf{z})$ in latent space should be considerably easier than matching distributions over images.

Adversarial training makes no assumptions about the distributions compared, other than that we can sample from them. This comes in very handy when both $p$ and $q$ are nasty, such as in the generative adversarial network scenario: there, $p$ is the distribution of natural images, $q$ is a super complicated, degenerate distribution produced by squashing noise through a deep convnet. The price we pay for this flexibility is this: when $p$ or $q$ are actually easy to work with, adversarial training cannot exploit that, it still has to sample (it would be interesting to see if expectations over $p(\mathbf{z})$ could be computed analytically). So even though in this work $p$ is as simple as a mixture of ten 2D Gaussians, we need to approximate everything by drawing samples.

#### Other things might work: kernel moment matching

Why can’t one use easier divergences? For example, I think moment matching based on kernel MMD would work brilliantly in this scenario. It would have the following advantages over the adversarial cost.

- closed form expressions: Depending on the choice of the prior $p(\mathbf{z})$ and kernel used in MMD, the expectations over $p$ may be available in closed form, without sampling. So for example if we use a squared exponential kernel and a mixture of Gaussians as $p$, the divergence from $p$ can be precomputed in closed form that is easy to evaluate.

- no nasty inner loop: Adversarial training requires the discriminator network to be reoptimised every time the generative model changes. So we end up with a gradient descent in the inner loop of a gradient descent, which is anything but nice to work with. This is why it takes so long to get it working, the whole thing is pretty unstable. In contrast, to evaluate MMD, the inner loop is not needed. In fact, MMD can also be thought of as the solution to a convex maximisation problem, but via the kernel trick the maximum has a closed form solution.

- the problem is well suited for MMD: because the distributions are smooth, and the space is nice and low-dimensional, MMD might work very well. Kernel-based methods struggle with complicated manifold-like structure of natural images, so I wouldn't expect MMD to be competitive with adversarial training if it is applied directly in the image space. Therefore, I actually prefer generative adversarial networks to generative moment matching networks. However, here we have an easier problem, simpler space, simpler distributions where MMD shines, and adversarial training is just not needed.
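
As a rough illustration of how little machinery MMD needs, here is a sample-based sketch with a squared exponential kernel. It is not the closed-form variant mentioned above: both $p$ and $q$ are represented by samples, the bandwidth is an arbitrary choice, and the "codes" are random stand-ins rather than outputs of a trained encoder.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(codes, prior_samples, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    k_cc = rbf_kernel(codes, codes, bandwidth).mean()
    k_pp = rbf_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_cp = rbf_kernel(codes, prior_samples, bandwidth).mean()
    return k_cc + k_pp - 2.0 * k_cp

rng = np.random.default_rng(0)
prior = rng.normal(size=(500, 2))                 # samples from p(z) = N(0, I)
codes_matched = rng.normal(size=(500, 2))         # stand-in for a well-matched q(z)
codes_shifted = rng.normal(size=(500, 2)) + 2.0   # stand-in for a mismatched q(z)

mmd_matched = mmd_squared(codes_matched, prior)
mmd_shifted = mmd_squared(codes_shifted, prior)
```

The whole divergence is a handful of kernel averages with no inner optimisation loop, and it is differentiable with respect to the codes, so in principle it could be dropped into the autoencoder loss in place of the discriminator.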

arxiv.org
scholar.google.com

Multi-Scale Context Aggregation by Dilated Convolutions
Yu, Fisher and Koltun, Vladlen
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by inFERENCe 8 years ago

- I give an overview of the paper which proposes an exponential schedule of dilated convolutional layers as a way to combine local and global knowledge
- I point out the connection between 2D dilated convolutions and Kronecker products
- cascades of exponentially dilated convolutions - as proposed in the paper - can be thought of as parametrising a large convolution kernel as a Kronecker product of small kernels
- the relationship to Kronecker factorisation only holds under particular assumptions, in this sense cascades of exponentially dilated convolutions are a generalisation of the Kronecker layer (Zhou et al. 2015)
- I note that dilated convolutions are equivariant under image translation, a property that other multi-scale architectures often violate.

#### Background

The key application the dilated convolution authors have in mind is dense prediction: vision applications where the predicted object has a similar size and structure to the input image. For example, semantic segmentation with one label per pixel; image super-resolution, denoising, demosaicing, bottom-up saliency, keypoint detection, etc.

In many such applications one wants to integrate information from different spatial scales and balance two properties:

1. local, pixel-level accuracy, such as precise detection of edges, and
2. integrating knowledge of the wider, global context

To address this problem, people often use some kind of multi-scale convolutional neural network, which often relies on spatial pooling. Instead the authors here propose using layers of dilated convolutions, which allow us to address the multi-scale problem efficiently without increasing the number of parameters too much.

#### Dilated Convolutions

It's perhaps useful to first note why vanilla convolutions struggle to integrate global context. Consider a purely convolutional network composed of layers of $k\times k$ convolutions, without pooling. It is easy to see that the size of the receptive field of each unit - the block of pixels which can influence its activation - is $l(k-1)+k$, where $l$ is the layer index (starting from $0$). So the effective receptive field of units can only grow linearly with layers. This is very limiting, especially for high-resolution input images.

Dilated convolutions to the rescue! The dilated convolution between signal $f$ and kernel $k$ with dilation factor $l$ is defined as:

$$ \left(k \ast_{l} f\right)_t = \sum_{\tau=-\infty}^{\infty} k_\tau \cdot f_{t - l\tau} $$

Note that I'm using slightly different notation than the authors. The above formula differs from vanilla convolution in the last subscript $f_{t - l\tau}$. For plain old convolution this would be $f_{t - \tau}$. In the dilated convolution, the kernel only touches the signal at every $l^{th}$ entry. This formula applies to a 1D signal, but it can be straightforwardly extended to 2D convolutions.
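
A direct, deliberately naive implementation of the 1D formula may help. I assume a finite kernel indexed by $\tau = 0, \ldots, K-1$ and treat the signal as zero outside its range; both are conventions I picked for the sketch rather than anything fixed by the paper.

```python
import numpy as np

def dilated_conv1d(f, k, l):
    """(k *_l f)_t = sum_tau k[tau] * f[t - l * tau], with f zero outside its range."""
    out = np.zeros(len(f))
    for t in range(len(f)):
        for tau in range(len(k)):
            src = t - l * tau
            if 0 <= src < len(f):
                out[t] += k[tau] * f[src]
    return out

f = np.arange(10, dtype=float)
k = np.array([1.0, 2.0, 3.0])

y = dilated_conv1d(f, k, l=2)

# Same result from a vanilla convolution with zeros inserted between kernel taps.
k_dilated = np.zeros(2 * (len(k) - 1) + 1)
k_dilated[::2] = k
y_reference = np.convolve(f, k_dilated)[: len(f)]
```

With `l=1` the function reduces to ordinary convolution, and with `l=2` the three kernel taps reach back over five signal entries without any extra parameters.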

The authors then build a network out of multiple layers of dilated convolutions, where the dilation factor $l$ increases exponentially at each layer. When you do that, even though the number of parameters grows only linearly with layers, the effective receptive field of units grows exponentially with layer depth. This is illustrated in the figure below:

![](http://www.inference.vc/content/images/2016/05/Screen-Shot-2016-05-12-at-09-47-12.png)
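
The two growth rates are easy to tabulate. This sketch assumes stride 1, no pooling, kernel size 3, and dilation factors $1, 2, 4, \ldots$; the helper `receptive_field` is a name I made up.

```python
def receptive_field(kernel_size, dilations):
    """Width of the receptive field after stacking convolutions (stride 1, no pooling)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

n_layers = 6
# Layer index l starts at 0, so layer l sits on top of l + 1 convolutions.
vanilla = [receptive_field(3, [1] * (l + 1)) for l in range(n_layers)]
dilated = [receptive_field(3, [2 ** i for i in range(l + 1)]) for l in range(n_layers)]
# vanilla follows l * (k - 1) + k; dilated roughly doubles with every layer
```

Each layer adds the same number of weights in both cases, so the parameter count is identical; only the spacing of the kernel taps differs.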

What this figure doesn't really show is the parameter sharing and parameter dependencies across the receptive field (frankly, it's pretty hard to visualise exactly with more than 2 layers). The receptive field grows at a faster rate than the number of parameters, and it is obvious that this can only be achieved by introducing additional constraints on the parameters across the receptive field. The network won't be able to learn arbitrary receptive field behaviours, so one question is, how severe is that restriction?

#### Relationship to Kronecker Products

To me this whole dilated convolution paper cries Kronecker product, although this connection is never made in the paper itself. It's easy to see that a 2D dilated convolution with matrix/filter $K$ is the same as vanilla convolution with a dilated filter $\hat{K}_{l}$ which can be represented as the following Kronecker product:

$$ \hat{K}_l = K \otimes \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\
0 & 0 & \ddots & & 0 \\
0 & \ddots & \ddots & \ddots & \\
0 & & \ddots & \ddots & 0 \\
0 & 0 & 0 & 0 & 0
\end{bmatrix} $$
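
A quick NumPy check of this identity, assuming the second factor is the $l \times l$ matrix with a single one in its top-left corner (the kernel values and `l = 4` are arbitrary):

```python
import numpy as np

def dilate_kernel(K, l):
    """Insert l - 1 zeros between neighbouring entries of K."""
    out = np.zeros(((K.shape[0] - 1) * l + 1, (K.shape[1] - 1) * l + 1))
    out[::l, ::l] = K
    return out

K = np.arange(1.0, 10.0).reshape(3, 3)
l = 4
E = np.zeros((l, l))
E[0, 0] = 1.0                      # the l x l matrix with a single one in the corner

K_hat = np.kron(K, E)              # shape (3 * l, 3 * l)
n = (K.shape[0] - 1) * l + 1
# K_hat equals the dilated kernel, padded with l - 1 trailing rows and columns of zeros
```

The trailing zeros do not change what the filter computes, they only shift where the output is aligned.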

Using this, and properties of convolutions and Kronecker products (I suggest beginners make extensive use of the matrix cookbook), we can even understand something about exponentially iterated dilated convolutions.

Let's assume we apply several layers of dilated convolutions, without nonlinearity, as in Equation 3 of the paper. For simplicity, I assume that all convolution kernels $K_l, l=0\ldots L$ are $a\times a$ in size, the dilation factor at layer $l$ is $a^{l}$, and we only have a single channel throughout ($C=1$). In this case we can show that:

$$ F_{L+1} = K_L \ast_{a^L} \left( K_{L-1} \ast_{a^{(L-1)}} \left( \cdots K_1 \ast_{a} \left( K_0 \ast F_0 \right) \cdots \right) \right) = \left( K_L \otimes K_{L-1} \otimes \cdots \otimes K_{0} \right) \ast F_0
$$

The left-hand side of this equation is the same construction as in Equation 3 in the paper, but expanded. The right hand side is a single vanilla convolution, but with a convolution kernel that is constructed as the Kronecker product of all the $a\times a$ kernels $K_l$.
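
The identity can be verified numerically for $a = 3$ and three layers. The sketch below uses random kernels and a small hand-written full convolution; because convolution is associative, it composes the dilated kernels with each other instead of pushing an actual image $F_0$ through the cascade.

```python
import numpy as np

def conv2d_full(A, B):
    """Full 2D convolution of A with B in plain NumPy."""
    out = np.zeros((A.shape[0] + B.shape[0] - 1, A.shape[1] + B.shape[1] - 1))
    for i, j in np.ndindex(*B.shape):
        out[i:i + A.shape[0], j:j + A.shape[1]] += B[i, j] * A
    return out

def dilate_kernel(K, d):
    out = np.zeros(((K.shape[0] - 1) * d + 1, (K.shape[1] - 1) * d + 1))
    out[::d, ::d] = K
    return out

rng = np.random.default_rng(0)
a = 3
K0, K1, K2 = (rng.normal(size=(a, a)) for _ in range(3))

# The linear cascade acts on F_0 like one convolution with the composition
# of the dilated kernels (dilation a**level at layer `level`).
composed = K0
for level, K in enumerate((K1, K2), start=1):
    composed = conv2d_full(composed, dilate_kernel(K, a ** level))

kron_kernel = np.kron(K2, np.kron(K1, K0))   # a 27 x 27 kernel built from 27 parameters
```

The composed kernel and the Kronecker product agree entry by entry, which is the right-hand side of the equation for $L = 2$.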

It turns out Kronecker-factored parametrisations of convolution tensors are already used in CNNs, a quick googling revealed this paper:

Shuchang Zhou, Jia-Nan Wu, Yuxin Wu, Xinyu Zhou (2015) Exploiting Local Structures with the Kronecker Layer in Convolutional Networks

#### What can Kronecker-factored filters represent?

Let's look at what kind of kernels we can represent with Kronecker products, and hence what behaviour we should expect from dilated convolutions. Here are a few examples of $27\times 27$ kernels that result from taking the Kronecker product of three random $3\times 3$ kernels:

![](http://www.inference.vc/content/images/2016/05/VzORx0FEfAAAAAElFTkSuQmCC.png)

These look somehow natural, at least to me. They look like pretty plausible texture patches taken from some pixellated video game. You will notice the repeated patterns and the hierarchical structure. Indeed, we can draw cool self-similar fractal-like filters if we keep taking the Kronecker product of the same kernel with itself, some examples of such random fractals:

![](http://www.inference.vc/content/images/2016/05/YSJIkSZIkLYw3bCRJkiRJkhbGGzaSJEmSJEkL8zeSmRmMrhHPQgAAAABJRU5ErkJggg--.png)

I would say these kernels are not entirely unreasonable for a ConvNet, and if you allow for multiple channels ($C>1$) they can represent pretty nice structured patterns and shapes with reasonable number of parameters.

Compare these filters to another common technique for reducing parameters of convolution tensors: low-rank decompositions (see e.g. Lebedev et al, 2014). Spatially, a low-rank approximation to a square 2D convolution filter can be understood as subsequently applying two smaller rectangular filters: one with a limited horizontal extent and one with limited vertical extent. Here are a few random samples of $27\times 27$ filters with a rank of 1. These can be represented using the same number of parameters (27) as the Kronecker samples above.

To me, these don't look so natural. Notice also that for low-rank representations the number of parameters has to scale linearly with the spatial extent of the filter, whereas this scaling can be logarithmic if we use a Kronecker parametrisation. This is the real deal when using Kronecker products or dilated convolutions.

Here is another cool illustration of the naturalness of the Kronecker approximation, taken out of the Kronecker layer paper:

![](http://www.inference.vc/content/images/2016/05/Screen-Shot-2016-05-12-at-14-58-33.png)

So in general, parametrising convolution kernels as Kronecker-products seems like a pretty good idea. The dilated convolutions paper presents a more flexible approach than just Kronecker-factors. Firstly, you can add nonlinearities after each layer of dilated convolution, which would now be different from Kronecker products. Secondly, the Kronecker analogy only holds if the dilation factor and the kernel size are the same. In the paper the authors used a kernel size of $3$ and dilation factor of $2$.

#### Final note on translational equivariance

One desirable property of convolutions is that they are translationally equivariant: if you shift the input image by any amount, the output remains the same, shifted by the same amount. This is a very useful inductive bias/prior assumption to use in a dense prediction task.

One way to introduce multiscale thinking to ConvNets is to use architectures that look like the figure below: we first decrease the spatial extent of feature-maps via pooling, then grow them back again via unpooling/deconvolution. Additional shortcut connections ensure that pixel-level local accuracy can be retained. The example below is from the SegNet paper, but there are multiple other papers such as this one on recombinator networks.

![](http://www.inference.vc/content/images/2016/05/conv-deconv.png)

However, as soon as you include spatial pooling, the translational equivariance property of the whole network might break. For example the SegNet above is not translationally equivariant anymore: the network's predictions are sensitive to small, single-pixel shifts to the input image, which is undesirable. Thankfully, layers of dilated convolutions are still translationally equivariant, which is a good thing.
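
The difference can be seen in a toy 1D setting. This is only a sketch under simplifying assumptions: circular padding (so shifts are exact), a hand-picked kernel, and a pool-then-repeat block standing in for a pooling/unpooling network. The function names are invented for this illustration.

```python
import numpy as np

def dilated_conv1d(x, k, d):
    """Circular 1D convolution with dilation d (circular padding keeps shifts exact)."""
    return sum(w * np.roll(x, i * d) for i, w in enumerate(k))

def pool_unpool(x):
    """Max-pool with stride 2, then upsample back by repetition."""
    pooled = x.reshape(-1, 2).max(axis=1)
    return np.repeat(pooled, 2)

x = np.arange(8.0)
k = np.array([1.0, -2.0, 0.5])
shift = 1  # a single-pixel shift of the input

# Equivariance check: f(shift(x)) == shift(f(x)) ?
conv_equivariant = np.allclose(
    dilated_conv1d(np.roll(x, shift), k, d=2),
    np.roll(dilated_conv1d(x, k, d=2), shift),
)
pool_equivariant = np.allclose(
    pool_unpool(np.roll(x, shift)),
    np.roll(pool_unpool(x), shift),
)
print(conv_equivariant, pool_equivariant)
```

The dilated convolution commutes with the one-pixel shift, while the pooled block only commutes with shifts that are multiples of its stride.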

#### Summary

This dilated convolutions idea is pretty cool, and I think these papers are just scratching the surface of this topic. The dilated convolution architecture generalises Kronecker-factored convolutional filters, it allows for very large receptive fields while only growing the number o

papers.nips.cc
scholar.google.com

Exploring Models and Data for Image Question Answering
Ren, Mengye and Kiros, Ryan and Zemel, Richard S.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 8 years ago

This paper addresses the task of image-based Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions.

The paper is addressing an important and interesting new topic which has seen a recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, well-written, and well-organized. They achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. deep NN ones.

However, there are several weaknesses: their main model is not very different from existing work on image-Q&A (Malinowski2015, who also had a VIS+LSTM style model, although they were also jointly training the CNN and RNN and decoding with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2-way image input helps). Also, as the authors themselves discuss, the dataset in its current form, synthetically created from captions, is a good start but is quite conservative and limited, being restricted to single-word answers, with transformation rules designed only for certain simple syntactic cases.

It is exploration work and will benefit a lot from a bit more progress in terms of new models and a slightly more broad dataset (at least with answers up to 2-3 words).

Regarding new models, e.g., attention-based models are very relevant and intuitive here (and the paper would be much more complete with this), since these models should learn to focus on the right area of the image to answer the given question and it would be very interesting to analyze the results of whether this focusing happens correctly.

Before attention models, since 2-way image input helped (actually, it would be good to ablate 2-way versus bidirectionality in the 2-VIS+BLSTM model), it would be good to also show the model version that feeds the image vector at every time step of the question.

Also, it would be useful to have a nearest neighbor baseline as in Devlin et al., 2015, given their discussion of COCO's properties. Here too, one could imagine copying answers of training questions, for cases where the captions are very similar.

Regarding a broader-scope dataset, the issue with the current approach is that it is too similar to the captioning approach or task, which has the drawback that a major motivation to move to image-Q&A is to move away from single, vague (non-specific), generic, one-event-focused captions to a more complex and detailed understanding of and reasoning over the image; which doesn't happen with this paper's current dataset creation approach, and so this will also not encourage thinking of very different models to handle image-Q&A, since the best captioning models will continue to work well here. Also, having 2-3 word answers will capture more realistic and more diverse scenarios; and though it is true that evaluation is harder, one can start with existing metrics like BLEU, METEOR, CIDEr, and human eval. And since these will not be full sentences but just 2-3 word phrases, such existing metrics will be much more robust and stable already.

Originality:

The task of image-Q&A is very recent with only a couple of prior and concurrent works, and the dataset creation procedure, despite its limitations (discussed above), is novel. The models are mostly not novel, being very similar to Malinowski2015, but the authors add bidirectionality and 2-way image input (but then Malinowski2015 was jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers).

Significance:

As discussed above, the paper shows useful results and ablations on the important, recent task of image-Q&A, based on 2 datasets -- an existing small dataset and a new large dataset; however, the second, new dataset is synthetically created by rule-transforming captions and only to single-word answers, thus keeping the impact of the dataset limited, because it keeps the task too similar to the generic captioning task and because there is no generation of answers or prediction of multi-word answers.

papers.nips.cc
scholar.google.com

Winner-Take-All Autoencoders
Makhzani, Alireza and Frey, Brendan J.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 8 years ago

The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winner-take-all kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm.

The authors present a fully connected auto-encoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the mini-batch, they rank the activation values, keeping only the top-k% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold).

Their encoder is a deep stack of ReLU layers and the decoder is shallow and linear (note that usually non-symmetric auto-encoders lead to worse results). They also show how to apply the approach to RBMs. The effect of sparsity is clearly noticeable in the images depicting the filters.

They extend this auto-encoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before.
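
The two sparsity constraints described above can be sketched as masking operations on a batch of activations. This is a minimal NumPy illustration, not the authors' implementation: the function names, the tie handling and the toy shapes are assumptions, and the training loop (reconstruction loss, backpropagation through the surviving activations) is left out.

```python
import numpy as np

def lifetime_sparsity(h, rate):
    """h: (batch, hidden) activations. For each hidden unit, keep its top
    `rate` fraction of activations across the mini-batch and zero the rest."""
    k = max(1, int(np.ceil(rate * h.shape[0])))
    # k-th largest activation of every hidden unit across the batch
    thresh = np.sort(h, axis=0)[-k]
    return np.where(h >= thresh, h, 0.0)

def spatial_sparsity(fmaps):
    """fmaps: (batch, channels, H, W). Keep only the single largest
    activation in each feature map."""
    b, c, H, W = fmaps.shape
    flat = fmaps.reshape(b, c, -1)
    winners = flat.max(axis=2, keepdims=True)
    return np.where(flat == winners, flat, 0.0).reshape(b, c, H, W)

rng = np.random.default_rng(0)
h = rng.random((10, 4)) + 0.1               # positive toy activations
sparse_h = lifetime_sparsity(h, rate=0.2)
print((sparse_h != 0).sum(axis=0))          # winners per hidden unit

fmaps = rng.random((2, 3, 5, 5)) + 0.1
sparse_maps = spatial_sparsity(fmaps)
print((sparse_maps != 0).sum(axis=(2, 3)))  # winners per feature map
```

Because the ranking is done per hidden unit rather than per example, every unit wins on some examples in each mini-batch, which is why no filter ends up "dead".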

The proposed approach exploits a mechanism close to that of the k-sparse autoencoders proposed by Makhzani et al [14]. The authors extend the idea from [14] to build winner-take-all encoders (and RBMs) that enforce both spatial and lifetime regularization by keeping only a percentage (the biggest) of activations. The lifetime sparsity allows overcoming problems that could arise with k-sparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with larger images than e.g. those of MNIST.

papers.nips.cc
scholar.google.com

End-To-End Memory Networks
Sukhbaatar, Sainbayar and Szlam, Arthur and Weston, Jason and Fergus, Rob
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 8 years ago

This paper presents an end-to-end version of memory networks (Weston et al., 2015) such that the model doesn't train on the intermediate 'supporting facts' strong supervision of which input sentences are the best memory accesses, making it much more realistic. They also have multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and the model achieves strong results.

The paper is a useful extension of memNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined pretty cleanly and simply. The related work section is quite well-written, detailing the various similarities and differences with multiple streams of related work. The discussion about the model's connection to RNNs is also useful.

papers.nips.cc
scholar.google.com

Spatial Transformer Networks
Jaderberg, Max and Simonyan, Karen and Zisserman, Andrew and Kavukcuoglu, Koray
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 8 years ago

This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes re-sampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and non-rigid deformation whose parameters are trained end-to-end with the rest of the model. The resulting re-sampling grid is then used to create a new representation of the underlying signal through bi-linear or nearest neighbor interpolation. This has interesting implications: the network can learn to co-locate objects in a set of images that all contain the same object, the transformation parameters localize the attention area explicitly, and fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous state-of-the-art on a number of tasks.

The layer has one mini neural network that regresses on the parameters of a parametric transformation (e.g. affine), then there is a module that applies the transformation to a regular grid, and a third that more or less "reads off" the values in the transformed positions and maps them to a regular grid, hence under-forming the image or previous layer. Gradients for back-propagation in a few cases are derived. The results are mostly of the classic deep learning variety, including MNIST and SVHN, but there is also the fine-grained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases.
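
The grid and sampling stages can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's implementation: the affine matrix `theta` is fixed by hand instead of being regressed by a localisation network, coordinates are normalised to $[-1, 1]$, samples outside the image read as zero, and the function names are hypothetical.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map a regular output grid in [-1, 1]^2 through a 2x3 affine matrix.
    Returns the source x and y coordinates for every output pixel."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    src = theta @ coords                      # shape (2, out_h * out_w)
    return src[0].reshape(out_h, out_w), src[1].reshape(out_h, out_w)

def bilinear_sample(img, src_x, src_y):
    """Read img at (possibly fractional) source positions, bilinearly."""
    h, w = img.shape
    x = (src_x + 1) * (w - 1) / 2             # normalised -> pixel coordinates
    y = (src_y + 1) * (h - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0

    def at(yy, xx):
        # Pixel lookup that returns 0 outside the image.
        valid = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        vals = img[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)]
        return np.where(valid, vals, 0.0)

    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

img = np.arange(25.0).reshape(5, 5)
identity = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
zoom_in = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])  # attend to the centre

same = bilinear_sample(img, *affine_grid(identity, 5, 5))
zoomed = bilinear_sample(img, *affine_grid(zoom_in, 5, 5))
print(np.allclose(same, img), zoomed[2, 2])
```

Since every step is a differentiable function of `theta` (almost everywhere), gradients can flow back into whatever network produces it.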

papers.nips.cc
scholar.google.com

Teaching Machines to Read and Comprehend
Hermann, Karl Moritz and Kociský, Tomás and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 8 years ago

This paper deals with the formal question of machine reading. It proposes a novel methodology for automatic dataset building for machine reading model evaluation. To do so, the authors leverage news resources that come with a summary to generate a large number of questions about articles by replacing their named entities. Furthermore, an attention-enhanced, LSTM-inspired reading model is proposed and evaluated. The paper is well-written and clear; the originality seems to lie in two aspects. First, an original methodology of question answering dataset creation, where context-query-answer triples are automatically extracted from news feeds. Such a proposition can be considered important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. The empirical results seem to show relevant improvement with respect to an up-to-date list of machine reading models.

Given the lack of an appropriate dataset, the authors provide a new dataset scraped from CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next, the authors present two novel deep long short-term memory models which perform well on the Cloze query task.

arxiv.org
scholar.google.com

Attention with Intention for a Neural Network Conversation Model
Yao, Kaisheng and Zweig, Geoffrey and Peng, Baolin
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose an Attention with Intention (AWI) model for Conversation Modeling. AWI consists of three recurrent networks: An encoder that embeds the source sentence from the user, an intention network that models the intention of the conversation over time, and a decoder that generates responses. The authors show that the network can generate natural responses.

#### Key Points

- Intuition: Intention changes over the course of a conversation, e.g. communicate problem -> resolve issue -> acknowledge.
- Encoder RNN: Depends on last state of the decoder. Reads the input sequence and converts it into a fixed-length vector.
- Intention RNN: Gets encoder representation, previous intention state, and previous decoder state as input and generates new representation of the intention.
- Decoder RNN: Gets current intention state and attention vector over the encoder as an input. Generates a new output.
- Architecture is evaluated on an internal helpdesk chat dataset with 10k dialogs, 100k turns and 2M tokens. Perplexity scores and a sample conversation are reported.

#### Notes/Questions

- It's a pretty short paper and I'm not sure what to make of the results. The PPL scores were not compared to alternative implementations and no other evaluations (e.g. crowdsourced as in Neural Conversational Model) are done.

papers.nips.cc
scholar.google.com

Deep Knowledge Tracing
Piech, Chris and Bassen, Jonathan and Huang, Jonathan and Ganguli, Surya and Sahami, Mehran and Guibas, Leonidas J. and Sohl-Dickstein, Jascha
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors apply an RNN to modeling students' knowledge. The input is an exercise question and answer (correct/incorrect), either as one-hot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvement over previous models, can be used for curriculum optimization, and also discovers the latent structure in exercise concepts.

#### Key Points

- Two encodings tried: One hot, embedded
- RNN/LSTM, 200-dimensional hidden layer, output dropout, NLL. 
- No expert annotation for concepts or question/answers is needed
- Blocking (series of exercises of same type) vs Mixing for curriculum optimization: Blocking seems to perform better
- Lots of cool future direction ideas
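
As an illustration of the one-hot input encoding, here is a minimal sketch. It assumes that each (exercise, correct/incorrect) pair gets its own slot in a vector of length `2 * num_exercises`; the exact index layout and the function name are assumptions made for this example.

```python
import numpy as np

def encode_interaction(exercise_id, correct, num_exercises):
    """One-hot vector for a single (exercise, answer) interaction.
    Slots 0..M-1 mean 'answered incorrectly', slots M..2M-1 'answered correctly'."""
    x = np.zeros(2 * num_exercises)
    x[exercise_id + int(correct) * num_exercises] = 1.0
    return x

# A toy student history: (exercise id, answered correctly?)
history = [(3, True), (1, False), (3, True)]
inputs = np.stack([encode_interaction(e, c, num_exercises=5) for e, c in history])
print(inputs.shape)  # one row per time step, to be fed to the RNN in order
```

For a large number of exercises this vector gets long, which is where the embedded variant mentioned above becomes attractive.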

#### Question / Notes

- Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? May give insight into confusing concepts.

arxiv.org
scholar.google.com

Distilling the Knowledge in a Neural Network
Hinton, Geoffrey E. and Vinyals, Oriol and Dean, Jeffrey
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors show that we can distill the knowledge of a complex ensemble of models into a smaller model by letting the smaller model learn directly from the "soft targets" (softmax output with high temperature) of the ensemble. Intuitively, this works because the errors in probability assignment (e.g. assigning 0.1% to the wrong class) carry a lot of information about what the network learns. Learning directly from logits (unnormalized scores) as was done in a previous paper, is a special case of the distillation approach. The authors show how distillation works on the MNIST and an ASR data set.


#### Key Points

- Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice.
- Use softmax with temperature, values from 1-10 seem to work well, depending on the problem.
- The MNIST networks learn to recognize digits without ever having seen them in the transfer set, solely based on the "errors" that the teacher network makes. (The bias needs to be adjusted.)
- Training on soft targets with less data performs much better than training on hard targets with same amount of data.
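
The soft targets can be sketched as follows. This is a minimal NumPy illustration, assuming a plain cross-entropy between the teacher's and the student's temperature-softened distributions; the function names and the toy logits are made up, and any mixing with a hard-label loss is omitted.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T):
    """Cross-entropy between teacher and student soft targets at temperature T."""
    p = softmax_with_temperature(teacher_logits, T)
    q = softmax_with_temperature(student_logits, T)
    return float(-(p * np.log(q)).sum(axis=-1).mean())

teacher_logits = np.array([[10.0, 2.0, -1.0]])
hard = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=5.0)
print(hard.round(4), soft.round(4))
```

At `T=1` nearly all the mass sits on the top class, while at a higher temperature the relative sizes of the "wrong" classes become visible to the student.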


#### Notes/Question

- Breaking up the complex models into specialists didn't really fit into this paper without distilling those experts into one model. Also would've liked to see training of only specialists (without general network) and then distill their knowledge.

arxiv.org
scholar.google.com

A Diversity-Promoting Objective Function for Neural Conversation Models
Li, Jiwei and Galley, Michel and Brockett, Chris and Gao, Jianfeng and Dolan, Bill
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors use a Maximum Mutual Information (MMI) objective function to generate conversational responses. They still train their models with maximum likelihood, but use MMI to generate responses during decoding. The idea behind MMI is that it promotes more diversity and penalizes trivial responses. The authors evaluate their method using BLEU scores, human evaluators, and qualitative analysis and find that the proposed objective indeed leads to more diverse responses.

#### Key Points

- In practice, NCM (Neural Conversation Models) often generate trivial responses using high-frequency terms partly due to the likelihood objective function.
- Two models: MMI-antiLM and MMI-bidi depending on the formulation of the MMI objective. These objectives are used during response generation, not during training.
- Use Deep 4-layer LSTM with 1000-dimensional hidden state, 1000-dimensional word embeddings.
- Datasets: Twitter triples with 129M context-message-response triples. OpenSubtitles with 70M spoken lines that are noisy and don't include turn information.
- Authors state that perplexity is not a good metric because their objective is to explicitly steer away from the high probability responses.
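
To make the two decoding objectives concrete, here is a small re-ranking sketch. It assumes the simplified forms $\log p(T|S) - \lambda \log p(T)$ for MMI-antiLM and $(1-\lambda)\log p(T|S) + \lambda \log p(S|T)$ for MMI-bidi, and it leaves out further details of the paper's decoding procedure. All log-probabilities below are invented numbers standing in for scores from trained models.

```python
def mmi_antilm_score(log_p_t_given_s, log_p_t, lam):
    """Penalise responses that are likely regardless of the source message."""
    return log_p_t_given_s - lam * log_p_t

def mmi_bidi_score(log_p_t_given_s, log_p_s_given_t, lam):
    """Reward responses from which the source message can be predicted back."""
    return (1 - lam) * log_p_t_given_s + lam * log_p_s_given_t

# Hypothetical candidates: (response, log p(T|S), log p(T), log p(S|T))
candidates = [
    ("i don't know", -2.0, -1.0, -12.0),
    ("the meeting moved to friday", -3.0, -9.0, -4.0),
]

best_ml = max(candidates, key=lambda c: c[1])
best_antilm = max(candidates, key=lambda c: mmi_antilm_score(c[1], c[2], lam=0.5))
best_bidi = max(candidates, key=lambda c: mmi_bidi_score(c[1], c[3], lam=0.5))
print(best_ml[0], "|", best_antilm[0], "|", best_bidi[0])
```

With these toy numbers, plain likelihood prefers the generic reply, while both MMI variants prefer the more specific one.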


#### Notes

- BLEU score seems like a bad metric for this. Shouldn't more diverse responses result in a lower BLEU score?
- Not sure if I like the direction of this. To me it seems wrong to "artificially" promote diversity. Shouldn't diversity come naturally as a function of context and intention?

arxiv.org
scholar.google.com

Document Embedding with Paragraph Vectors
Dai, Andrew M. and Olah, Christopher and Le, Quoc V.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors evaluate Paragraph Vectors on large Wikipedia and arXiv document retrieval tasks and compare the results to LDA, BoW and word vector averaging models. Paragraph Vectors either outperform or match the performance of other models. The authors show how the embedding dimensionality affects the results. Furthermore, the authors find that one can perform arithmetic operations on paragraph vectors and obtain meaningful results and present qualitative analyses in the form of visualizations and document examples.


#### Data Sets

Accuracy is evaluated by constructing triples, where a pair of items are close to each other and the third one is unrelated (or less related). Cosine similarity is used to evaluate semantic closeness.

- Wikipedia (hand-built) PV: 93%
- Wikipedia (hand-built) LDA: 82%
- Wikipedia (distantly supervised) PV: 78.8%
- Wikipedia (distantly supervised) LDA: 67.7%
- arXiv PV: 85%
- arXiv LDA: 85%
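
The triplet evaluation described above can be sketched like this. The embeddings and document names are toy values invented for the example, and the function names are hypothetical; a triplet counts as correct when the related pair is closer in cosine similarity than the unrelated pair.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_accuracy(embeddings, triplets):
    """Fraction of (anchor, related, unrelated) triplets where the anchor is
    closer to the related document than to the unrelated one."""
    hits = sum(
        cosine(embeddings[a], embeddings[b]) > cosine(embeddings[a], embeddings[c])
        for a, b, c in triplets
    )
    return hits / len(triplets)

# Toy document vectors (made up for illustration)
emb = {
    "deep_learning": np.array([1.0, 0.1, 0.0]),
    "neural_network": np.array([0.9, 0.2, 0.1]),
    "french_cuisine": np.array([0.0, 0.1, 1.0]),
}
triplets = [
    ("deep_learning", "neural_network", "french_cuisine"),
    ("neural_network", "deep_learning", "french_cuisine"),
]
print(triplet_accuracy(emb, triplets))
```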


#### Key Points

- Jointly training PV and word vectors seems to improve performance.
- Used Hierarchical Softmax with a Huffman tree for large vocabulary
- They use only the PV-BoW model, because it's more efficient.

#### Questions/Notes

- Why the performance discrepancy between the arXiv and Wikipedia tasks? BoW performs surprisingly well on Wikipedia, but not arXiv. LDA is the opposite.

arxiv.org
scholar.google.com

Larger-Context Language Modelling
Wang, Tian and Cho, Kyunghyun
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose new ways to incorporate context (previous sentences) into a Recurrent Language Model (RLM). They propose 3 ways to model the context, and 2 ways to incorporate the context into the predictions for the current sentence. Context can be modeled with BoW, Sequence BoW (BoW for each sentence), and Sequence BoW with attention. Context can be incorporated using "early fusion", which gives the context as an input to the RNN, or "late fusion", which modifies the LSTM to directly incorporate the context. The authors evaluate their architecture on IMDB, BBC and Penn TreeBank corpora, and show that most approaches perform well (reducing perplexity), with Sequence BoW with attention + late fusion outperforming all others.

#### Key Points:

- Context as BoW: Compress N previous sentences into a single BoW vector
- Context as Sequence BoW: Compress each of the N previous sentences into a BoW vector and use an LSTM to "embed" them. Alternatively, use an attention mechanism.
- Early Fusion: Give the context vector as an input to the LSTM, together with the current word.
- Late Fusion: Add another gate to the LSTM that incorporates the context vector. Helps to combat vanishing gradients.
- Interestingly, the Sequence BoW without attention performs very poorly. The reason here seems to be the same as for seq2seq: it's hard to compress the sentence vectors into a single fixed-length representation using an LSTM.
- LSTM models trained with 1000 units, Adadelta. Only sentences up to 50 words are considered.
- Noun phrases seem to benefit the most from the context, which makes intuitive sense.
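
The context representations and early fusion can be sketched as below. This is a toy illustration only: the vocabulary, the whitespace tokenisation and the choice of concatenation for early fusion are assumptions, and the LSTM, the attention mechanism and the late-fusion gate are not shown.

```python
import numpy as np

def bow_vector(sentence, vocab):
    """Word-count vector of one sentence over a fixed vocabulary."""
    v = np.zeros(len(vocab))
    for w in sentence.split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def context_bow(prev_sentences, vocab):
    """Context as BoW: one vector for all N previous sentences together."""
    return np.sum([bow_vector(s, vocab) for s in prev_sentences], axis=0)

def context_sequence_bow(prev_sentences, vocab):
    """Context as Sequence BoW: one vector per previous sentence, in order."""
    return np.stack([bow_vector(s, vocab) for s in prev_sentences])

def early_fusion_input(word_vec, context_vec):
    """Early fusion (assumed here to be concatenation): the RNN input at each
    step is the current word vector together with the context vector."""
    return np.concatenate([word_vec, context_vec])

vocab = {"the": 0, "cat": 1, "sat": 2}
prev = ["the cat", "the cat sat"]
ctx = context_bow(prev, vocab)
seq_ctx = context_sequence_bow(prev, vocab)
x_t = early_fusion_input(np.ones(4), ctx)   # 4-dim toy word embedding
print(ctx, seq_ctx.shape, x_t.shape)
```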

#### Notes/Questions:

- A problem with current Language Models is that they are corpus-specific. A model trained on one corpus doesn't do well on another corpus because all sentences are treated as being independent. However, if we can correctly incorporate context we may be able to train a general-purpose LM that does well across various corpora. So I think this is important work.
- I am surprised that the authors did not try using a sentence embedding (skip-thought, paragraph-vector) to construct their context vectors. That seems like an obvious choice over using BoW.
- The argument for why the Sequence BoW without attention model performs poorly isn't convincing. In the seq2seq work the argument for attention was based on the length of the sequence. However, here the sequence is very short, so the LSTM should be able to capture all the dependencies. The performance may be poor due to the BoW representation, or due to too little training data.
- Would've been nice to visualize what the attention mechanism is modeling.
- I'm not sure if I agree with the authors that relying on explicit sentence boundaries is an advantage; I see it as a limiting factor.

arxiv.org
scholar.google.com

Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews
Li, Bofang and Liu, Tao and Du, Xiaoyong and Zhang, Deyuan and Zhao, Zhe
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors present DV-ngram, a new method to learn document embeddings. DV-ngram is a variation on Paragraph Vectors with a training objective of predicting words and n-grams solely based on the document vector, forcing the embedding to capture the semantics of the text. The authors evaluate their model on the IMDB data sets, beating both n-gram based and Deep Learning models.

#### Key Points

- When the word vectors are already sufficiently predictive of the next words, the standard PV embedding cannot learn anything useful.
- Training objective: Predict words and n-grams solely based on document vector. Negative Sampling to deal with large vocabulary. In practice, each n-gram is treated as a special token and appended to the document.
- Code will be at https://github.com/libofang/DV-ngram
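
The "n-gram as a special token" step might look like the following. It is a sketch under assumptions: the underscore-joined token format, the `max_n` parameter and the function name are invented here and are not taken from the authors' code.

```python
def doc_tokens_with_ngrams(words, max_n=3):
    """Return the document's words followed by all of its n-grams
    (2 <= n <= max_n), each n-gram encoded as a single special token."""
    tokens = list(words)
    for n in range(2, max_n + 1):
        tokens += ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens

review = "not very good".split()
print(doc_tokens_with_ngrams(review, max_n=2))
```

Each of these tokens then becomes a prediction target for the document vector, so a bigram like `not_good` can carry sentiment that the individual words do not.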


#### Question/Notes

- The argument that PV may not work when the word vectors themselves are predictive enough makes intuitive sense. But what about applying word-level dropout? Wouldn't that also force the PV to learn the document semantics?
- It seems to be that predicting n-grams leads to a huge sparse vocabulary space. I wonder how this method scales, even with negative sampling. I am actually surprised this works well at all.
- The authors mention that they beat "other Deep Learning models", including PV, but neither their model nor PV is "deep learning". The networks are not deep ;)

arxiv.org
scholar.google.com

Multilingual Language Processing From Bytes
Gillick, Dan and Brunk, Cliff and Vinyals, Oriol and Subramanya, Amarnag
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train a deep seq-2-seq LSTM directly on byte-level input of several languages (shuffling the examples of all languages) and apply it to NER and POS tasks, achieving state-of-the-art or close to that. The model outputs spans of the form `[START_POSITION, LENGTH, LABEL]`, where each span element is a separate token prediction. A single model works well for all languages and learns shared high-level representations. The authors also present a novel way to dropout input tokens (bytes in their case), by randomly replacing them with a `DROP` symbol.

#### Data and model performance

Data:

- POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments
- NER: 4 languages, 0.88M tokens, 6M training segments

Results:

- POS CRF Accuracy (average across languages): 95.41
- POS BTS Accuracy (average across languages): 95.85
- NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84
- (See paper for NER comparison models)

#### Key Takeaways

- Surprising to me that the span generation works so well without imposing independence assumptions on it. It's state the LSTM has to keep in memory.
- 0.2-0.3 Dropout, 320-dimensional embeddings, 320 units LSTM, 4 layers seems to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary size of 256 bytes. Changing input sequence order didn't have much of an effect. Dropout and Byte Dropout significantly (74 -> 78 -> 82) improved F1 for NER.
- To limit sequence length the authors split the text into k=60 sized segments, with 50% overlap to avoid splitting mid-span.
- Byte Dropout can be seen as "blurring text". I believe I've seen the same technique applied to words before and labeled word dropout.
- Training examples for all languages are shuffled together. The biggest improvements in scores are observed for low-resource languages.
- Not clear how to tune recall of the model since non-spans are simply not annotated.
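
The input-side preprocessing (overlapping segments and byte dropout) can be sketched as follows. This is a minimal illustration under assumptions: the `DROP` symbol is given the made-up id 256, the handling of the final short segment is a guess, and the function names are hypothetical.

```python
import numpy as np

DROP = 256  # hypothetical id for the DROP symbol, outside the byte range 0..255

def byte_dropout(byte_ids, rate, rng):
    """Randomly replace a fraction `rate` of the input bytes with DROP."""
    byte_ids = np.asarray(byte_ids)
    mask = rng.random(byte_ids.shape) < rate
    return np.where(mask, DROP, byte_ids)

def split_segments(byte_ids, k=60):
    """Split a byte sequence into length-k segments with 50% overlap."""
    stride = k // 2
    n = len(byte_ids)
    return [byte_ids[i:i + k] for i in range(0, max(n - stride, 1), stride)]

text = "Las Naciones Unidas se reunieron en Nueva York. " * 4
ids = list(text.encode("utf-8"))
segments = split_segments(ids, k=60)
noisy = byte_dropout(segments[0], rate=0.3, rng=np.random.default_rng(0))
print(len(ids), len(segments), noisy[:10])
```

Because of the overlap, a span cut off at the end of one segment appears whole in the next one.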

#### Notes / Questions

- I wonder if the fixed-vector embedding of the input sequence is a bottleneck since the decoder LSTM has to carry information not only about the input sequence, but also about the structure that has been produced so far. I wonder if the authors have experimented with varying `k`, or using attention mechanisms to deal with long sequences (I've seen papers dealing with sequences of 2000 tokens?). 60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences.
- What about LSTM initialization? When feeding spans coming from the same document, is the state kept around or re-initialized? I strongly suspect it's kept since 60 bytes probably don't contain enough information for proper labeling, but didn't see an explicit reference.
- Why not a bidirectional LSTM? Seems to be the standard in most other papers.
- How exactly are multiple languages encoded in the LSTM memories? I *kind of* understand the reasoning behind this, but it's unclear what these "high-level" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable.
- Is there a way to easily re-train the model for a new language?

arxiv.org
scholar.google.com

Multi-task Sequence to Sequence Learning
Luong, Minh-Thang and Le, Quoc V. and Sutskever, Ilya and Vinyals, Oriol and Kaiser, Lukasz
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors show that we can improve the performance of a reference task (like translation) by simultaneously training other tasks, like image caption generation or parsing, and vice versa. The authors evaluate 3 MTL (Multi-Task Learning) scenarios: One-to-many, many-to-one and many-to-many. The authors also find that using skip-thought unsupervised training works well for improving translation performance, but sequence autoencoders don't.

#### Key Points

- 4-Layer seq2seq LSTM, 1000-dimensional cells each layer and embedding, batch size 128, dropout 0.2, SGD with LR 0.7 and decay.
- The authors define a mixing ratio for parameter updates that is defined with respect to a reference task. Picking the right mixing ratio is a hyperparameter.
- One-To-Many experiments: Translation (EN -> GER) + Parsing (EN). Improves result for both tasks. Surprising that even a very small amount of parsing updates significantly improves MT result.
- Many-to-One experiments: Captioning + Translation (GER -> EN). Improves result for both tasks (w.r.t. the reference task)
- Many-to-Many experiments: Translation (EN <-> GER) + Autoencoders or Skip-Thought. Skip-Thought vectors improve the result, but autoencoders make it worse.
- No attention mechanism
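
One way to read the mixing ratio is as a sampling weight over tasks. The sketch below assumes that each parameter update uses a batch from a single task, chosen with probability proportional to its mixing ratio (reference task = 1.0); the task names, ratios and function name are illustrative assumptions.

```python
import numpy as np

def sample_task(mixing_ratios, rng):
    """Pick the task for the next parameter update, with probability
    proportional to its mixing ratio."""
    names = list(mixing_ratios)
    weights = np.array([mixing_ratios[n] for n in names], dtype=float)
    return names[rng.choice(len(names), p=weights / weights.sum())]

# Hypothetical ratios: translation is the reference task, parsing gets few updates.
ratios = {"translation": 1.0, "parsing": 0.1}
rng = np.random.default_rng(0)
schedule = [sample_task(ratios, rng) for _ in range(10000)]
parsing_share = schedule.count("parsing") / len(schedule)
print(round(parsing_share, 3))   # expected to be close to 0.1 / 1.1
```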

#### Questions / Notes

- I think this is very promising work. It may allow us to build general-purpose systems for many tasks, even those that are not strictly seq2seq. We can easily substitute classification.
- How do the authors pick the mixing ratios for the parameter updates, and how sensitive are the results to these ratios? It's a new hyperparameter and I would've liked to see graphs for these. Makes me wonder if they picked "just the right" ratio to make their results look good, or if these architectures are robust.
- The authors found that seq2seq autoencoders don't improve translation, but skip-thought does. In fact, autoencoders made translation performance significantly worse. That's very surprising to me. Is there any intuition behind that?

arxiv.org
scholar.google.com

A Neural Attention Model for Abstractive Sentence Summarization
Rush, Alexander M. and Chopra, Sumit and Weston, Jason
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors apply a neural seq2seq model to sentence summarization. The model uses an attention mechanism (soft alignment).


#### Key Points

- Summaries generated on the sentence level, not paragraph level
- Summaries have fixed length output
- Beam search decoder
- Extractive tuning for scoring function to encourage the model to take words from the input sequence
- Training data: Headline + first sentence pair.

arxiv.org
scholar.google.com

A Neural Conversational Model
Vinyals, Oriol and Le, Quoc V.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train a seq2seq model on conversations, building a chat bot. The first data set is an IT Helpdesk dataset with 33M tokens. The trained model can help solve simple IT problems. The second data set is the OpenSubtitles data with ~1.3B tokens (62M sentences). The resulting model learns simple world knowledge, can generalize to new questions, but lacks a coherent personality.

#### Key Points

- IT Helpdesk: 1-layer LSTM, 1024-dimensional cells, 20k vocabulary. Perplexity of 8.
- OpenSubtitles: 2-layer LSTM, 4096-dimensional cells, 100k vocabulary, 2048 affine layer. Attention did not help.
- OpenSubtitles: Treat two consecutive sentences as coming from different speakers. Noisy dataset.
- Model lacks personality, gives different answers to similar questions (What do you do? What's your job?)
- Feed previous context (whole conversation) into encoder, for IT data only.
- In both data sets, the neural models achieve better perplexity than n-gram models.

#### Notes / Questions

- Authors mention that Attention didn't help in OpenSubtitles. It seems like the encoder/decoder context is very short (just two sentences, not a whole conversation). So perhaps attention doesn't help much here, as it's meant for long-range dependencies (or dealing with little data?)
- Can we somehow encode conversation context in a separate vector, similar to paragraph vectors?
- It seems like we need a principled way to deal with long sequences and context. It doesn't really make sense to treat each sentence tuple in OpenSubtitles as a separate conversation. Distant Supervision based on subtitles timestamps could also be interesting, or combine with multimodal learning.
- How we can learn a "personality vector"? Do we need world knowledge or is it learnable from examples?

aclweb.org
scholar.google.com

Neural Responding Machine for Short-Text Conversation
Shang, Lifeng and Lu, Zhengdong and Li, Hang
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors train three variants of a seq2seq model to generate a response to social media posts taken from Weibo. The first variant, NRM-glo, is the standard model without an attention mechanism, using the last state as the decoder input. The second variant, NRM-loc, uses an attention mechanism. The third variant, NRM-hyb, combines both by concatenating local and global state vectors. The authors use human users to evaluate their responses and compare them to retrieval-based and SMT-based systems. The authors find that NRM models generate reasonable responses ~75% of the time.

#### Key Points

- STC: Short-text conversation. Generate only a response to a post. Don't need to keep track of a whole conversation.
- Training data: 200k posts, 4M responses.
- Authors use GRU with 1000 hidden units. 
- Vocabulary: Most frequent 40k words for both input and response.
- Retrieval is done using beam search with beam size 10.
- Hybrid model is difficult to train jointly. The authors train the models individually and then fine-tune the hybrid model.
- Tradeoff with retrieval based methods: Responses are written by a human and don't have grammatical errors, but cannot easily generalize to unseen inputs.

arxiv.org
scholar.google.com

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses
Sordoni, Alessandro and Galley, Michel and Auli, Michael and Brockett, Chris and Ji, Yangfeng and Mitchell, Margaret and Nie, Jian-Yun and Gao, Jianfeng and Dolan, Bill
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose three neural models to generate a response (r) based on a context and message pair (c,m). The context is defined as a single message. The first model, RLMT, is a basic Recurrent Language Model that is fed the whole (c,m,r) triple. The second model, DCGM-1, encodes context and message into a BoW representation, puts it through a feedforward neural network encoder, and then generates the response using an RNN decoder. The last model, DCGM-2, is similar but keeps the representations of context and message separate instead of encoding them into a single BoW vector. The authors train their models on a data set of 29M triples from Twitter and evaluate using BLEU, METEOR and human evaluator scores.

#### Key Points:

- 3 Models: RLMT, DCGM-1, DCGM-2
- Data: 29M triples from Twitter
- Because (c,m) is very long on average the authors expect RLMT to perform poorly.
- Vocabulary: 50k words, trained with NCE loss
- Generated responses degrade with length after ~8 tokens


#### Notes/Questions:

- Limiting the context to a single message kind of defeats the purpose of this. No real conversations have only a single message as context, and who knows how well the approach works with a larger context?
- Authors complain that dealing with long sequences is hard, but they don't even use an LSTM/GRU. Why?

aclweb.org
scholar.google.com

On Using Very Large Target Vocabulary for Neural Machine Translation
Jean, Sébastien and Cho, KyungHyun and Memisevic, Roland and Bengio, Yoshua
Association for Computational Linguistics - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose an importance-sampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only target words occurring in that partition are chosen. To improve decoding speed over the full vocabulary, the authors build a dictionary mapping from source sentence to potential target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than the baseline models with smaller vocabulary.

#### Key Points:

- Computing partition function is the bottleneck. Use sampling-based approach.
- Dealing with large vocabulary during training is separate from dealing with large vocab during decoding. Training is handled with importance sampling. Decoding is handled with source-based candidate list.
- Decoding with candidate list takes around 0.12s (0.05) per token on CPU (GPU). Without target list 0.8s (0.25s).
- Issue: Candidate list depends on the source sentence, so it must be re-computed for each sentence.
- Reshuffling the data set is expensive as new partitions need to be calculated (not necessary, but improved scores).
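
To make the decoding side concrete, here is a minimal sketch of normalising over a source-dependent candidate list instead of the full target vocabulary. The words, logits and function name are made up for illustration, and the importance-sampling training procedure is not reproduced here.

```python
import math

def softmax_over_candidates(logits, candidates):
    """Normalise only over `candidates` instead of the full vocabulary."""
    m = max(logits[w] for w in candidates)  # subtract the max for stability
    exps = {w: math.exp(logits[w] - m) for w in candidates}
    z = sum(exps.values())                  # partition function over the subset
    return {w: e / z for w, e in exps.items()}

# Hypothetical unnormalised scores for a tiny "full" vocabulary.
full_vocab_logits = {"the": 2.0, "cat": 1.0, "dog": 0.5, "house": -1.0, "tree": -2.0}
# Hypothetical candidate list built for one source sentence.
candidate_list = ["the", "cat", "dog"]

probs = softmax_over_candidates(full_vocab_logits, candidate_list)
print(probs)
```

The cost of the partition function now scales with the length of the candidate list rather than with the vocabulary size, which is where the reported per-token speedup comes from.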

#### Notes:

- How is the corpus partitioned? What's the effect of the partitioning strategy?
- The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail what this is. The results show that doing this results in much larger score bump than increasing the vocab does. (The authors do this for all comparison models though).
- Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all these into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I would've liked to see the authors assign a global time budget to train/test and then compare the models based on that.
- The authors only briefly mentioned that re-building the target vocab for each source sentence is an issue and how they solve it, no details given.

papers.nips.cc
scholar.google.com

Pointer Networks
Vinyals, Oriol and Fortunato, Meire and Jaitly, Navdeep
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose a new architecture called "Pointer Network". A Pointer Network is a seq2seq architecture with attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies based on input sequence length, a Pointer Network can generalize to variable-length inputs. The attention method through which this is achieved is O(n^2), and only a slight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well on these, and generalizes to sequences longer than those found in the training data.

#### Key Points

- Similar to standard attention, but don't blend the encoder states, use the attention vector directly.
- Softmax probabilities of outputs can be interpreted as a fuzzy pointer.
- We can solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient.
- 512 unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0.
- In the case of TSP, the "student" network outperforms the "teacher" algorithm.
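
A minimal sketch of the "use the attention vector directly" idea, assuming additive attention scores of the form v · tanh(W1 e_j + W2 d). All weights and states below are toy values chosen for illustration, not trained parameters.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def pointer_distribution(encoder_states, decoder_state, W1, W2, v):
    """Additive attention scores, normalised over the input positions."""
    proj_d = matvec(W2, decoder_state)
    scores = []
    for e in encoder_states:
        proj_e = matvec(W1, e)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return softmax(scores)  # used as the output distribution, not blended

# Toy parameters and states (hypothetical values).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 1.0]
encoder_states = [[0.1, 0.2], [1.0, 1.0], [-1.0, 0.0]]
decoder_state = [0.2, -0.1]

p = pointer_distribution(encoder_states, decoder_state, W1, W2, v)
pointed_index = max(range(len(p)), key=p.__getitem__)
print(p, pointed_index)
```

Because the distribution has one entry per input position, its size follows the input length automatically, which is what lets the output "vocabulary" vary from example to example.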

#### Notes/ Questions

- Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How?

arxiv.org
scholar.google.com

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks
Visin, Francesco and Kastner, Kyle and Cho, Kyunghyun and Matteucci, Matteo and Courville, Aaron C. and Bengio, Yoshua
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose a novel architecture called ReNet, which replaces convolutional and max-pooling layers with RNNs that sweep over the image vertically and horizontally. These RNN layers are then stacked. The authors demonstrate that ReNet architecture is a viable alternative to CNNs. ReNet doesn't outperform CNNs in this paper, but further optimizations and hyperparameter tuning are likely going to lead to improved results in the future.

#### Key Points:

- Split images into patches, feed one patch per time step into RNN, vertically then horizontally. 4 RNNs per layer, 2 vertical and 2 horizontal, one per direction.
- Because the RNNs sweep over the whole image they can see the context of the full image, as opposed to just a local context in the case of conv/pool layers.
- Smooth from end to end.
- In experiments, 2 256-dimensional ReNet layers, 2x2 patches, 4096-dimensional affine layers.
- Flipping and shifting for data augmentation.
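
The patch splitting and the four sweep directions can be sketched as follows. This only illustrates the traversal order on a raw patch grid; as far as I understand, in the actual architecture the horizontal sweep runs over the outputs of the vertical RNNs rather than over raw patches. Function names and the toy image are hypothetical.

```python
def split_into_patches(image, patch_h, patch_w):
    """Return a grid of flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_h == 0 and width % patch_w == 0
    grid = []
    for i in range(0, height, patch_h):
        row = []
        for j in range(0, width, patch_w):
            row.append([image[i + di][j + dj]
                        for di in range(patch_h)
                        for dj in range(patch_w)])
        grid.append(row)
    return grid

def sweep_orders(grid):
    """Sequences each of the four RNNs would read, one sequence per column or row."""
    columns = [list(col) for col in zip(*grid)]
    return {
        "top_to_bottom": columns,
        "bottom_to_top": [col[::-1] for col in columns],
        "left_to_right": [list(row) for row in grid],
        "right_to_left": [row[::-1] for row in grid],
    }

# Toy 4x4 single-channel image with 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
grid = split_into_patches(image, 2, 2)
orders = sweep_orders(grid)
print(grid[0][0], orders["top_to_bottom"][0])
```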

#### Notes/Questions:

- What is the training time/complexity compared to a CNN? 
- Why split the image into patches at all? I wonder if the authors have experimented with various patch sizes, like defining patches that go over the full vertical height. 2x2 patches as used in the experiment seem quite small and like a waste of computational resources.

papers.nips.cc
scholar.google.com

Semi-supervised Sequence Learning
Dai, Andrew M. and Le, Quoc V.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors show that we can pre-train RNNs using unlabeled data by either reconstructing the original sequence (SA-LSTM), or predicting the next token as in a language model (LM-LSTM). We can then fine-tune the weights on a supervised task. Pre-trained RNNs are more stable, generalize better, and achieve state-of-the-art results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data.

#### Data Sets

Error Rates for SA-LSTM, previous best results in parens.

- IMDB: 7.24% (7.42%)
- Rotten Tomatoes 16.7% (18.5%) (using additional unlabeled data)
- 20 Newsgroups: 15.6% (17.1%)
- DBPedia character-level: 1.19% (1.74%)

#### Key Takeaways

- SA-LSTM: Predict sequence based on final hidden state
- LM-LSTM: Language-Model pretraining
- LSTM, 1024-dimensional cell, 512-dimensional embedding, 512-dimensional hidden affine layer + 50% dropout, Truncated backprop 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on dev set.
- Linear Gain: Inject gradient at each step and linearly increase weights of prediction objectives

#### Notes / Questions

- Not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on others it significantly improves performance. Any explanations?
- Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens?
- The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and show results for other data sets.
- All tasks are classification tasks. Does SA-LSTM also improve performance on seq2seq tasks?
- What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with mask?)

arxiv.org
scholar.google.com

A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
Zhang, Ye and Wallace, Byron
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors evaluate the impact of hyperparameters (embeddings, filter region size, number of feature maps, activation function, pooling, dropout and l2 norm constraint) on Kim's (2014) CNN for sentence classification. The authors present empirical findings with variance numbers based on a large number of experiments on 7 classification data sets, and give practical recommendations for architecture decisions.

#### Key Points

- Recommended Baseline configuration: word2vec, (3,4,5) filter regions, 100 feature maps per region size, ReLU activation, 1-max-pooling, 0.5 dropout, l2 norm constraint on weight vector of 3.
- One-hot vectors perform worse than pre-trained embeddings. word2vec outperforms GloVe most of the time.
- Filter region size is dependent on data set in the range of 2-25. Recommended to do a line search over single region size and then combine multiple sizes.
- Increasing the number of feature maps per filter region to more than 600 doesn't seem to help much.
- ReLU almost always best activation function
- Max-pooling almost always best pooling strategy
- Dropout from 0.1 to 0.5 helps, l2 norm constraint not much
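
As a rough illustration of filter regions and 1-max pooling, the sketch below slides each filter over a sentence matrix and keeps only its largest ReLU activation. Biases, dropout and the final softmax layer are omitted, and the tiny 2-dimensional "embeddings" and filters are invented for the example.

```python
def conv_1max(sentence, filt):
    """Slide one filter over the sentence and keep the largest ReLU activation."""
    region = len(filt)  # number of consecutive words the filter covers
    best = 0.0          # ReLU floor
    for start in range(len(sentence) - region + 1):
        window = sentence[start:start + region]
        activation = sum(w * x
                         for f_row, x_row in zip(filt, window)
                         for w, x in zip(f_row, x_row))
        best = max(best, activation)
    return best

def sentence_features(sentence, filters_by_region):
    """One pooled feature per filter, concatenated across region sizes."""
    features = []
    for region in sorted(filters_by_region):
        for filt in filters_by_region[region]:
            features.append(conv_1max(sentence, filt))
    return features

# Toy sentence: 5 words with 2-dimensional embeddings.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.5]]
# Toy filters: one with region size 2, two with region size 3.
filters = {
    2: [[[1.0, 0.0], [0.0, 1.0]]],
    3: [[[1.0, 1.0]] * 3, [[-1.0, -1.0]] * 3],
}
features = sentence_features(sentence, filters)
print(features)  # [2.0, 4.0, 0.0]
```

The feature vector has a fixed length (one entry per filter) regardless of sentence length, which is why 1-max pooling combines naturally with several region sizes.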

#### Notes/Questions

- All datasets analyzed in this paper are rather similar. They have similar average and max sentence length, and even the number of examples is of roughly the same magnitude. It would be interesting to see how the results change with very different datasets, such as long documents, or very large numbers of training examples.

jmlr.org
scholar.google.com

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron C. and Salakhutdinov, Ruslan and Zemel, Richard S. and Bengio, Yoshua
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image. In order to find the correspondence between words and image patches, the RNN uses a lower convolutional layer as its input (before pooling). The authors propose both a "hard" attention (trained using sampling methods) and "soft" attention (trained end-to-end) mechanism, and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attention-based models achieve state of the art on Flickr8k, Flickr30k and MS COCO.

#### Key Points

- To find image correspondence use lower convolutional layers to attend to.
- Two attention mechanisms: Soft and hard. Depending on evaluation metric (BLEU vs. METEOR) one or the other performs better.
- Largest data set (MS COCO) takes 3 days to train on Titan Black GPU. Oxford VGG.
- Soft attention is same as for seq2seq models.
- Attention weights are visualized by upsampling and applying a Gaussian
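
The difference between the two mechanisms can be sketched in a few lines: soft attention blends all annotation vectors with softmax weights, while hard attention samples a single location. The scoring network that produces the scores is left out, and the annotation vectors and scores below are toy values.

```python
import math
import random

def attention_weights(scores):
    """Softmax over one score per image location."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_attention(annotations, weights):
    """Context vector = weighted average of all annotation vectors."""
    dim = len(annotations[0])
    return [sum(w * a[d] for w, a in zip(weights, annotations))
            for d in range(dim)]

def hard_attention(annotations, weights, rng):
    """Context vector = one annotation vector sampled according to the weights."""
    idx = rng.choices(range(len(annotations)), weights=weights, k=1)[0]
    return annotations[idx]

# Toy annotation vectors for two image locations, with equal scores.
annotations = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights([0.0, 0.0])
soft_context = soft_attention(annotations, weights)
hard_context = hard_attention(annotations, weights, random.Random(0))
print(weights, soft_context, hard_context)
```

The soft variant is a differentiable function of the weights, which is why it can be trained end-to-end, whereas the sampling step in the hard variant is what calls for sampling-based training.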

#### Notes/Questions

- Would've liked to see an explanation of when/how soft vs. hard attention does better.
- What is the computational overhead of using the attention mechanism? Is it significant?

papers.nips.cc
scholar.google.com

Skip-Thought Vectors
Kiros, Ryan and Zhu, Yukun and Salakhutdinov, Ruslan and Zemel, Richard S. and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors apply the skip-gram word2vec model to the sentence level, training auto-encoders that predict the previous and next sentences. The resulting general-purpose vector representations are called skip-thought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating fine-tuned models.

#### Key Points

- Code at https://github.com/ryankiros/skip-thoughts
- Training is done on large book corpus (74M sentences, 1B tokens), takes 2 weeks. 
- Two variations: Bidirectional encoder and unidirectional encoder with 1200 and 2400 units per encoder respectively. GRU cell, Adam optimizer, gradient clipping norm 10.
- Vocabulary can be expanded by learning a mapping from a large word2vec vocab to the smaller skip-thought vocab. Could also use sampling/hierarchical softmax during training for larger vocab, or train on characters.

#### Questions/Notes

- Authors clearly state that this is not the goal of the paper, though I'd be curious how more sophisticated (non-linear) classifiers perform with skip-thought vectors. Authors probably tried this but it didn't do well ;)
- The fact that the story generation doesn't seem to work well shows that the model has problems learning or understanding long-term dependencies. I wonder if this can be solved by deeper encoders or attention.

arxiv.org
scholar.google.com

Strategies for Training Large Vocabulary Neural Language Models
Chen, Welin and Grangier, David and Auli, Michael
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors evaluate softmax, hierarchical softmax, target sampling, NCE, self-normalization and differentiated softmax (novel technique presented in the paper) on data sets with varying vocabulary size (10k, 100k, 800k) with a fixed-time training budget. The authors find that techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies.

#### Data and Models

Models:

- Softmax
- Hierarchical Softmax (cross-validation of clustering techniques)
- Differentiated softmax, adjusting capacity based on token frequency (cross-validation of number of frequency bands and size)
- Target Sampling (cross-validation of number of distractors)
- NCE (cross-validation of noise ratio)
- Self-normalization (cross-validation of regularization strength)

Data:

- PTB (1M tokens, 10k vocab)
- Gigaword (5B tokens, 100k vocab)
- billionW (800M tokens, 800k vocab)

#### Key Takeaways

- Techniques that work best for small vocabularies are not necessarily the ones that work best for large vocabularies.
- Differentiated softmax varies the capacity (size of matrix slice in the last layer) based on token frequency. In practice, it's implemented as separate matrices with different sizes.
- Perplexity doesn't seem to improve much after ~500M tokens
- Models are trained for 1 week each
- The competitiveness of softmax diminishes with vocabulary size. It seems to perform relatively well on 10k and 100k, but poorly on 800k since it needs more processing time per example.
- Training time, not training data, is the main factor limiting performance. The authors found that very large models are still making progress after one week and may eventually beat the other models if allowed to run longer.
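
A minimal sketch of the "separate matrices with different sizes" idea behind differentiated softmax. It assumes that each frequency band scores its words using only its own slice of the final hidden vector, so that the output matrix is effectively block-diagonal; the band sizes, weights and names are hypothetical.

```python
import math

def dsoftmax_probs(hidden, bands):
    """bands: list of (slice_size, weight_rows); every row scores one word
    using only the slice of `hidden` that belongs to its band."""
    logits = []
    offset = 0
    for slice_size, rows in bands:
        h_slice = hidden[offset:offset + slice_size]
        for row in rows:
            logits.append(sum(w * h for w, h in zip(row, h_slice)))
        offset += slice_size
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy setup: 4 hidden units, 5 words. The 2 frequent words get 3 units,
# the 3 rare words share a single unit.
hidden = [0.5, -1.0, 2.0, 0.3]
bands = [
    (3, [[0.2, 0.1, 0.4], [-0.3, 0.5, 0.1]]),  # frequent band
    (1, [[1.0], [-1.0], [0.5]]),               # rare band
]
probs = dsoftmax_probs(hidden, bands)

band_params = sum(size * len(rows) for size, rows in bands)   # 3*2 + 1*3
full_params = len(hidden) * sum(len(rows) for _, rows in bands)  # 4*5
print(probs, band_params, full_params)
```

In this toy case the banded output layer has 9 weights instead of 20, and the saving grows with the number of rare words, which is the point of spending less capacity on them.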

#### Questions / Notes

- What about the hyperparameters for Differentiated Softmax? The paper doesn't show an analysis. Also, the fact that this method introduces two additional hyperparameters makes it harder to apply in practice.
- Would've liked to see more comparisons for Softmax, which is the simplest technique of all and doesn't need hyperparameter tuning. It doesn't work well on 800k vocab, but it does for 100k. So, the authors only show how it breaks down for one dataset.

arxiv.org
scholar.google.com

Target-Dependent Sentiment Classification with Long Short Term Memory
Tang, Duyu and Qin, Bing and Feng, Xiaocheng and Liu, Ting
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors propose two LSTM-based models for target-dependent sentiment classification. TD-LSTM uses two LSTM networks running towards the target word from left and right respectively, making a prediction at the target time step. TC-LSTM is the same, but additionally incorporates an averaged target word vector as an input at each time step. The authors evaluate their models with pre-trained word embeddings on a Twitter sentiment classification dataset, achieving state of the art.

#### Key Points

- TD-LSTM: Two LSTM networks, one running left-to-right and one right-to-left, both towards the target. The final states of both networks are concatenated and the prediction is made at the target word.
- TC-LSTM: Same architecture as TD-LSTM, but also incorporates the word vector as an input at each time step. The word vector is the average of the word vectors for the target phrase.
- Embeddings seem to make a huge difference, state of the art is only obtained with 200-dimensional GloVe embeddings.
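
The input preparation for the two models can be sketched without any LSTM code: split the sentence into two sequences that both end at the target, and (for TC-LSTM) average the target's word vectors. The example sentence, the 2-dimensional embeddings and the function names are made up for illustration.

```python
def td_lstm_inputs(tokens, target_start, target_end):
    """Return the two sequences read by the two LSTMs.

    The left sequence is the preceding context plus the target, read left to
    right; the right sequence is the following context plus the target, read
    right to left. Both therefore end at the target phrase.
    """
    left = tokens[:target_end]
    right = tokens[target_start:][::-1]
    return left, right

def averaged_target_vector(tokens, target_start, target_end, embeddings):
    """Average the word vectors of the target phrase (the extra TC-LSTM input)."""
    vectors = [embeddings[t] for t in tokens[target_start:target_end]]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

tokens = ["i", "love", "windows", "phone", "a", "lot"]
left, right = td_lstm_inputs(tokens, target_start=2, target_end=4)
# Hypothetical embeddings for the two target words.
embeddings = {"windows": [1.0, 0.0], "phone": [0.0, 1.0]}
target_vec = averaged_target_vector(tokens, 2, 4, embeddings)
print(left, right, target_vec)
```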

#### Notes/Questions

- A *huge* fraction of the performance improvement comes from pre-trained word embeddings. Without these, the proposed models clearly underperform simpler models. This raises the question of whether incorporating the same embeddings into the simpler models would do just as well.
- Would've liked to see performance without *any* pre-trained embeddings.
- The authors also experimented with attention mechanisms, but weren't able to achieve good results. Small size of training corpus may be the reason for this.

arxiv.org
dblp.org
sci-hub
scholar.google.com

Text Understanding from Scratch
Zhang, Xiang and LeCun, Yann
CoRR - 2015 via Local Bibsonomy
Keywords: thema, deep_learning, thema:convolutional_neural_networks, language_model

[link] Summary by Denny Britz 9 years ago

TLDR; Authors apply 6-layer and 9-layer (+3 affine) convolutional nets to character-level input and evaluate their models on Sentiment Analysis and Categorization tasks using (new) large-scale data sets. The authors don't use pre-trained word-embeddings, or any notion of words, and instead learn directly from character-level input with characters being encoded as one-hot vectors. This means the same model can be applied to any language (provided the vocabulary is small enough). The models presented in this paper beat BoW and word2vec baseline models.

#### Data and model performance

Because existing ones were too small the authors collected several new datasets that don't have standard benchmarks.

- DBpedia Ontology Classification: 560k training, 70k test.
- Amazon Reviews 5-class: 3M train, 650k test
- Amazon Reviews polar: 3.6M train, 400k test
- Yahoo! Answer topics 10-class: 1.4M train, 60k test
- AG news classification 4-class: 120k train, 1.9k test
- Sogou Chinese News 5-class: 450k train, 60k test

Model accuracy for small and large models:

- DBpedia: 98.02 / 98.27
- Amazon 5-class: 59.47 / 58.69
- Amazon 2-class: 94.50 / 94.49
- Yahoo 10-class: 70.16 / 70.45
- AG 4-class: 84.35 / 87.18
- Chinese 5-class: 91.35 / 95.12

#### Key Takeaways

- Pretty standard CNN architecture applied to characters. Conv, ReLU, Maxpool, fully-connected. Filter sizes of 7 and 3. See paper for parameter details.
- Training takes a long time, presumably due to the size of the data. The authors quote 5 days per epoch on the large Amazon data set and large model.
- Authors can't handle large vocabularies, they romanize Chinese.
- Authors experiment with randomly replacing words with synonyms, which seems to give a small improvement.
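
A minimal sketch of the character-level one-hot input described above. The alphabet, the fixed length and the lower-casing step are placeholders chosen for the example, not the paper's exact settings; characters outside the alphabet are encoded as all-zero rows.

```python
def one_hot_encode(text, alphabet, max_len):
    """Encode text as a max_len x len(alphabet) matrix of 0/1 values."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text.lower()[:max_len]:      # truncate long inputs
        row = [0] * len(alphabet)
        if ch in index:                    # unknown characters stay all-zero
            row[index[ch]] = 1
        rows.append(row)
    while len(rows) < max_len:             # pad short inputs with zero rows
        rows.append([0] * len(alphabet))
    return rows

# Hypothetical alphabet and input length.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,!?"
encoded = one_hot_encode("Ab#", ALPHABET, max_len=5)
print(len(encoded), len(encoded[0]))
```

The convolutional layers then operate on this fixed-size matrix exactly as they would on a one-dimensional signal with `len(ALPHABET)` channels.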

#### Notes / Questions

- The authors claim to do "text understanding" and learn representations, but all experiments are on simple classification tasks. There is no evidence that the network actually learns meaningful high-level representations, and doesn't just memorize n-grams for example.
- These data sets are large, and the authors claim that they need large data sets, but there are no experiments in the paper that show this. How does performance vary with data size?
- The comparison with other models is lacking. I would have liked to see some of the other state-of-the-art models being compared, e.g. Kim's CNN. Comparing with BoW doesn't show much. As these models are openly available the comparison should have been easy.
- The romanization of Chinese is an ugly "hack" that goes against what the authors claim: Being language-independent and learning "from scratch".
- It's strange that the authors use a thesaurus as a means for training example augmentation, as a thesaurus is word-level and language-specific, something that the authors explicitly argue against in this paper. Perhaps could have used word (character-level) dropout instead.
- Are there any hyperparameters that were optimized? Authors don't mention any dev sets.
- Have the datasets been made publicly available? The authors complain that "the unfortunate fact in literature is that there are no large openly accessible datasets", but fail to publish their own.
- I'd expect the confusion matrix for the 5-star Amazon reviews to show mistakes coming from negations, but it doesn't, which suggests that the model really learns meaningful representations (such as negation).

arxiv.org
scholar.google.com

A Unified Tagging Solution: Bidirectional LSTM Recurrent Neural Network with Word Embedding
Wang, Peilu and Qian, Yao and Soong, Frank K. and He, Lei and Zhao, Hai
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 9 years ago

TLDR; The authors evaluate the use of a Bidirectional LSTM RNN on POS tagging, chunking and NER tasks. The inputs are task-independent input features: The word and its capitalization. The authors incorporate prior knowledge about the tagging tasks by restricting the decoder to output valid sequences of tags, and also propose a novel way of learning word embeddings: Randomly replacing words in a sequence and using an RNN to predict which words are correct vs. incorrect. The authors show that their model combined with pre-trained word embeddings performs on par with state-of-the-art models.

#### Key Points

- Bidirectional LSTM with 100-dimensional embeddings, and 100-dimensional cells. Both 1 and 2 layers are evaluated. Predict tags at each step. Higher dimensionality of cells results in little improvement.
- Word vector pretraining: Randomly replace words and use LSTM to predict correct/incorrect words.
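
The data side of this pretraining objective can be sketched as follows: corrupt a sentence by randomly swapping in other words and emit a correct/incorrect label per position, which the LSTM is then trained to predict. The replacement probability, vocabulary and function name are assumptions made for the example.

```python
import random

def corrupt(tokens, vocab, replace_prob, rng):
    """Randomly replace words; label 1 = original word kept, 0 = replaced."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Always pick a different word so the label is unambiguous.
            candidates = [w for w in vocab if w != tok]
            corrupted.append(rng.choice(candidates))
            labels.append(0)
        else:
            corrupted.append(tok)
            labels.append(1)
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = corrupt(sentence, vocab, replace_prob=0.3, rng=random.Random(0))
print(corrupted, labels)
```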

#### Notes/Questions

- The fact that we need a task-specific decoder kind of defeats the purpose of this paper. The goal was to create a "task-independent" system. To be fair, the need for this decoder is probably only due to the small size of the training data. Not all tag combinations appear in the training data.
- The comparisons with other state of the art systems are somewhat unfair since the proposed model heavily relies on pre-trained word embeddings from external data (trained on more than 600M words) to achieve good performance. It also relies on external embeddings trained in yet another way.
- I'm surprised that the authors didn't try combining all of the tagging tasks into one model, which seems like an obvious extension.

arxiv.org
scholar.google.com

GradNets: Dynamic Interpolation Between Neural Architectures
Almeida, Diogo and Sauder, Nate
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 9 years ago

A common setting in deep networks is to design the network first, "freeze" the network architecture, and then train the parameters. The paper pointed out a potential dilemma of that, in the sense that complex networks may have better representation power but may be hard to train. To address this issue the paper proposed to train the network in a hybrid fashion where simpler components and more complex components are combined via a weight average, and the weight is updated over the training procedure to introduce the more complex components, while utilizing the fast training capability of simpler ones.

The authors propose to blend any two architectural components as the time of optimisation progresses. As the time progresses, the initial approach, e.g. employed rectifier, is gradually switched off in place of another rectifier. The authors claim that this strategy is good for a fast convergence and they present some experimental results.
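
The blending described above can be sketched as a convex combination whose weight grows during training. The linear schedule, the choice of identity and ReLU as the simple and complex components, and all names below are assumptions made for illustration.

```python
def grad_weight(step, anneal_steps):
    """Interpolation weight g: grows linearly from 0 to 1, then stays at 1."""
    return min(step / anneal_steps, 1.0)

def blended_activation(x, g, simple, complex_):
    """Weighted average of the simple and the complex component."""
    return g * complex_(x) + (1.0 - g) * simple(x)

def identity(x):
    return x

def relu(x):
    return max(0.0, x)

# Early in training the unit behaves like the simple component,
# late in training like the complex one.
for step in (0, 50, 100):
    g = grad_weight(step, anneal_steps=100)
    print(step, g, blended_activation(-2.0, g, identity, relu))
```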

arxiv.org
scholar.google.com

Doctor AI: Predicting Clinical Events via Recurrent Neural Networks
Choi, Edward and Bahadori, Mohammad Taha and Sun, Jimeng
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 9 years ago

This paper presents an application of RNNs to predict "clinical events", such as disease diagnosis and medication prescription and their timing.

The paper proposes/suggests:
1. Applying an RNN to disease diagnosis, medication prescription and timing prediction.

2. "Initializing" the neural net with skipgrams instead of one-hot vectors. However, it seems from the description that the authors are not "initializing", rather just feeding a different feature vector into the RNN.

3. Initializing a model that is to be trained on a small corpus from a model trained on a large corpus works. Concludes: information can be transferred between models (read across hospitals).

arxiv.org
scholar.google.com

Alternative structures for character-level RNNs
Bojanowski, Piotr and Joulin, Armand and Mikolov, Tomas
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 9 years ago

This paper introduces two model extensions to improve character-level recurrent neural network language models. The authors evaluate their approaches on a multilingual language modeling benchmark along with the standard Penn Treebank corpus. Evaluation uses only entropy rather than including the language model in a downstream task, but that's okay for a paper of this scope. The paper is clearly written and definitely a sufficient contribution for the workshop track. It would be really nice to see how well these methods can improve more sophisticated recurrent architectures like GRU or LSTM units. On the PTB corpus it would be nice to include a state-of-the-art or standard n-gram model to use as a reference point for the reported results.

The conditioning-on-words model is an interesting approach. It's unfortunate that such a small word-level vocabulary is used with this model. It seems like the small vocabulary restriction is due to the fact that the word-level model is jointly trained along with the character models. An alternative approach might be to take the hidden representations of an already-trained word-level recurrent model and use them as input features when building the character-level language model. I don't have a good sense for how much joint training of both models matters.

arxiv.org
scholar.google.com

Fixed Point Quantization of Deep Convolutional Networks
Lin, Darryl Dexu and Talathi, Sachin S. and Annapureddy, V. Sreekanth
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 9 years ago

This paper proposes a layer-wise adaptive bit-depth quantization of DCNs, giving a better trade-off between error rate and memory requirement than a fixed bit width across layers.

The authors describe an optimization problem for determining the bit-width for different layers of DCNs for reducing model size and required computation.

This paper builds further upon the line of research that tries to represent neural network weights and outputs with lower bit-depths. This way, NN weights will take less memory/space and can speed up implementations of NNs (on GPUs or more specialized hardware).
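
To make the idea of a lower bit-depth concrete, here is a minimal sketch of uniform signed fixed-point quantization. The function name, the split into `bit_width` and `frac_bits`, and the toy weights are illustrative assumptions; this is not the paper's optimization procedure for choosing per-layer bit widths, only the rounding step such a procedure would rely on.

```python
import numpy as np

def quantize_fixed_point(x, bit_width, frac_bits):
    """Round x to a signed fixed-point grid with `bit_width` total bits,
    `frac_bits` of which sit after the binary point (illustrative layout)."""
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (bit_width - 1))       # most negative integer code
    q_max = 2 ** (bit_width - 1) - 1      # most positive integer code
    codes = np.clip(np.round(x * scale), q_min, q_max)
    return codes / scale

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=1000)

# Hypothetical comparison: the same weights stored at two different bit widths.
err_8 = np.mean((weights - quantize_fixed_point(weights, 8, 6)) ** 2)
err_4 = np.mean((weights - quantize_fixed_point(weights, 4, 2)) ** 2)
```

Fewer bits save memory but raise the quantization error, which is why letting the bit width differ per layer can pay off.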

arxiv.org
scholar.google.com

Diversity Networks
Mariet, Zelda and Sra, Suvrit
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

The goal is to compress a neural network by identifying its most significant neurons. The authors sample from a Determinantal Point Process (DPP) to find a set of neurons with the most dissimilar activations, and then project the remaining neurons onto them to reduce the overall number of neurons.

A DPP assigns each subset of neurons a probability given by the volume of dissimilarity of that subset relative to the volume over all neurons:

$$P(\text{subset } Y) = \frac{\det(L_Y)}{\det(L+I)}$$

More dissimilarity means higher probability. The activations are obtained by recording a simple sample of each neuron's outputs on the training set.
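
As a small worked example of the formula above, the sketch below evaluates the subset probability for three toy neurons. The linear kernel `L = A A^T` and the activation values are assumptions made for illustration; the paper may build its kernel differently.

```python
import itertools
import numpy as np

def dpp_subset_probability(L, subset):
    """P(Y) = det(L_Y) / det(L + I) for an L-ensemble DPP."""
    normalizer = np.linalg.det(L + np.eye(L.shape[0]))
    if len(subset) == 0:
        return 1.0 / normalizer          # determinant of the empty matrix is 1
    idx = np.array(subset)
    return np.linalg.det(L[np.ix_(idx, idx)]) / normalizer

# Toy activations: rows are neurons, columns are training examples.
# Neurons 0 and 1 are nearly identical; neuron 2 is different.
activations = np.array([[1.0, 0.0, 1.0],
                        [1.0, 0.1, 1.0],
                        [0.0, 1.0, 0.0]])
L = activations @ activations.T          # linear kernel (an assumption)

p_similar = dpp_subset_probability(L, [0, 1])
p_dissimilar = dpp_subset_probability(L, [0, 2])

# The probabilities of all 2^3 subsets add up to one.
total = sum(dpp_subset_probability(L, list(s))
            for r in range(4)
            for s in itertools.combinations(range(3), r))
```

The pair of near-duplicate neurons gets a much smaller probability than the dissimilar pair, which is the property the pruning step exploits.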

dx.doi.org
sci-hub
scholar.google.com

WxBS: Wide Baseline Stereo Generalizations
Dmytro Mishkin and Jiri Matas and Michal Perdoch and Karel Lenc
Proceedings of the British Machine Vision Conference 2015 - 2015 via Local CrossRef
Keywords:

[link] Summary by Dmytro Mishkin 9 years ago

- SIFT family is still the best local descriptor, outperforms novel CNN [SiamNet2015] approaches.
- (adaptive) Hessian-Affine is the best detector with broad applicability (not beaten yet)
- Affine view synthesis greatly helps for non-geometrical problems.
- Datasets and WxBS-Matcher available http://cmp.felk.cvut.cz/wbs/
- We need more diverse datasets for learning local descriptors than Yosemite and Liberty

dx.doi.org
sci-hub
scholar.google.com

MODS: Fast and robust method for two-view matching
Mishkin, Dmytro and Matas, Jiri and Perdoch, Michal
Computer Vision and Image Understanding - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Dmytro Mishkin 9 years ago

For robust wide baseline matching:

1) Use combination of MSER and Hessian-Affine with RootSIFT as a descriptor

2) Do iteratively increasing affine view synthesis  - from sparse to dense

So you can match both quickly for easy pairs and reliably for extreme pairs (80 degrees of viewpoint difference) of views of the same object. Works for non-planar objects as well, much better than ASIFT.

arxiv.org
scholar.google.com

All you need is a good init
Mishkin, Dmytro and Matas, Jiri
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Dmytro Mishkin 9 years ago

Mean(input) = 0, var(input) =1 is good for learning. Independent input features are good for learning.
So:

1) Pre-Initialize network weights with (approximate) orthonormal matrices

2) Do forward pass with mini-batch

3) Divide layer weights by $\sqrt{var(Output)}$

4) PROFIT!
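
The steps above can be sketched for a single linear layer as follows. The function names, the tolerance, and the iteration cap are illustrative assumptions; a real network would repeat this layer by layer, feeding each layer the output of the previous one.

```python
import numpy as np

def orthonormal_init(n_in, n_out, rng):
    """Orthonormal weights from the QR decomposition of a Gaussian matrix."""
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, _ = np.linalg.qr(a)               # q has orthonormal columns
    return q if n_in >= n_out else q.T   # shape (n_in, n_out)

def lsuv_layer(x, n_out, rng, tol=0.01, max_iter=10):
    """Rescale one linear layer until its output variance is close to 1."""
    w = orthonormal_init(x.shape[1], n_out, rng)
    for _ in range(max_iter):
        var = np.var(x @ w)              # forward pass on the mini-batch
        if abs(var - 1.0) < tol:
            break
        w = w / np.sqrt(var)             # divide weights by sqrt(var(output))
    return w

rng = np.random.default_rng(0)
batch = rng.normal(0.0, 3.0, size=(256, 64))   # deliberately badly scaled input
w1 = lsuv_layer(batch, 32, rng)
out_var = np.var(batch @ w1)
```

For a purely linear layer a single rescaling already gives unit variance; the loop is kept as a safeguard for layers where that is only approximately true.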

arxiv.org
scholar.google.com

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned
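
A minimal sketch of the first advantage, assuming a toy two-layer residual function with hypothetical weight names `w1` and `w2` (the block in the paper uses convolutions and batch normalization, which are omitted here): when the learned part outputs zero, the block is exactly the identity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x), where F is a small two-layer network."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return x + residual            # parameter-free identity shortcut

x = np.random.default_rng(0).normal(size=(4, 8))
zero_w1 = np.zeros((8, 8))
zero_w2 = np.zeros((8, 8))
y = residual_block(x, zero_w1, zero_w2)   # F(x) = 0, so the block is the identity
```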

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

arxiv.org
scholar.google.com

Neural GPUs Learn Algorithms
Kaiser, Lukasz and Sutskever, Ilya
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by hoqqanen 9 years ago

# Neural GPUs Learn Algorithms

Authors: Łukasz Kaiser, Ilya Sutskever. http://arxiv.org/abs/1511.08228

This originally appeared here: https://github.com/ProofByConstruction/better-explanations/blob/master/summaries/1511.08228.md

## Short Version

Using convolutions on embedded symbols in a recurrent-looking setting allows training of what is essentially a cellular automaton which can perform various algorithms and generalize to sequences of very long lengths.

## Problem Setting

We want to teach neural nets to learn algorithms (e.g. copy, reverse, binary addition or multiplication, etc.). Other approaches to this include sequence to sequence modeling (usually via some form of [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)). Related work of seq-seq approaches includes [Neural Turing Machines](http://arxiv.org/abs/1410.5401), [Stack RNNs](http://arxiv.org/abs/1503.01007), [Pointer Networks](http://arxiv.org/abs/1506.03134), and [Grid LSTM](http://arxiv.org/abs/1507.01526). Related but different, see [Neural Programmer-Interpreters](http://arxiv.org/abs/1511.06279).

These types of problems are particularly difficult for neural approaches because as sequence length grows, so too does the necessary computation. Furthermore generalization can be difficult, since it's possible for networks to effectively learn length-dependent rules that break down on longer sequences. This mostly comes from the fact that it's difficult to encode a symbolic, rule-based algorithmic approach in a neural net (hence all of the different approaches above and the active research in this area).

The Neural GPU is somewhat different in its approach compared to other sequence-sequence models in that it doesn't exactly take in or generate a sequence: it reads in an entire sequence as a static object and generates an output "image" whose length equals that of the input (with special padding characters to indicate gaps). The sizes of these inputs and outputs depend on the problem (longer sequences get bigger representations). In this way, the Neural GPU looks much more like a traditional feedforward convolutional net with variable input/output size (and variable computational depth, but we'll get to that later).

## Architecture

The three stages are
 - an embedder which takes a sequence of symbols to an input "image"
 - a stack of convolutional [GRUs](http://arxiv.org/abs/1412.3555) (Gated Recurrent Units) which at each step progressively process their input
 - a decoder which effectively does the reverse of the embedder (takes output representation to symbols)

### Encoder

The input is a sequence of length n over a set of symbols of size I (e.g. {0, 1, +, PAD (=P)}; the input, in reverse-binary, is 1010+0111 and the output, constrained to be the same size, should be 11001PPPP). We first map the symbols to vectors of length m (which will be the number of channels of our mental image) by looking up each symbol in an embedding matrix of size I*m. Call these embeddings {e_i}. We then create an initial "mental image" which is a rectangular volume with width w (set to be 4), height n (the length of the input sequence) and depth m (the embedding size). In the first column we insert the e_i's depthwise, and set all other cells in the volume to be 0.

Note that the size (height) of the input volume is dependent on sequence length. This could be a problem in other architectures, but since everything is done by convolution (as we'll see shortly) this architecture is able to exploit an adaptive-size input rather than reading in sequences (as I imagine convolutional text models might).
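
The construction of the initial mental image can be sketched as below. The symbol ordering, the embedding size m = 24, and the random embedding matrix are placeholders chosen for illustration; in the model the embedding matrix is learned.

```python
import numpy as np

SYMBOLS = ["0", "1", "+", "P"]            # I = 4 symbols, "P" is padding

def embed_to_mental_image(sequence, embedding, width=4):
    """Build the initial state s_0 of shape (width, n, m)."""
    ids = np.array([SYMBOLS.index(ch) for ch in sequence])
    n, m = len(ids), embedding.shape[1]
    state = np.zeros((width, n, m))
    state[0, :, :] = embedding[ids]       # first column holds the embeddings
    return state

rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(SYMBOLS), 24))   # I x m, m = 24 is arbitrary
s0 = embed_to_mental_image("1010+0111", embedding)
```

A longer input simply produces a taller volume, with no change to the parameters.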

### CGRU

The processing done by the network is over n (the length of the input sequence) time steps using L layers (L=2) of [GRUs](http://arxiv.org/abs/1412.3555), however this is a bit misleading because we only feed input in once (encoded as state), and if we unroll the computation for the fixed length of n time steps, it looks like a purely feedforward net. We might ask why use the GRU architecture at all (instead of only weight sharing across layers) and my guess is that the update and reset gates help in training over long unrollings (ie without them we might experience vanishing gradients -- though it might be worth trying relu activated convolutional layers with shared weights across layers).

Additionally, we structure the operation occurring in the GRU by forcing it to be a convolution (instead of e.g. fully connected). Intuitively what this CGRU (Convolutional GRU) is doing is processing the "mental image" (which remains the same shape over all time steps) in the same way at each point in time, thus bearing resemblance to cellular automata.
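
Here is a rough sketch of one such step, assuming the standard GRU gate equations with every matrix product replaced by a same-padded convolution. The parameter names, the 3x3 kernel size, and the channel count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def conv_same(state, kernel, bias):
    """'Same'-padded 2D convolution over a (w, n, m) state.
    kernel has shape (kw, kh, m, m); kw and kh are assumed odd."""
    kw, kh = kernel.shape[:2]
    padded = np.pad(state, ((kw // 2, kw // 2), (kh // 2, kh // 2), (0, 0)))
    w, n, _ = state.shape
    out = np.zeros_like(state)
    for dx in range(kw):
        for dy in range(kh):
            # every cell mixes its neighbours' channels with the same weights
            out += padded[dx:dx + w, dy:dy + n, :] @ kernel[dx, dy]
    return out + bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cgru_step(state, params):
    """One convolutional GRU step; the state keeps its shape."""
    u = sigmoid(conv_same(state, params["Wu"], params["bu"]))   # update gate
    r = sigmoid(conv_same(state, params["Wr"], params["br"]))   # reset gate
    candidate = np.tanh(conv_same(r * state, params["Wc"], params["bc"]))
    return u * state + (1.0 - u) * candidate

rng = np.random.default_rng(0)
m = 8
params = {name: rng.normal(scale=0.1, size=(3, 3, m, m)) for name in ("Wu", "Wr", "Wc")}
params.update({name: np.zeros(m) for name in ("bu", "br", "bc")})

state = rng.normal(size=(4, 6, m))       # (width, sequence length n, channels)
for _ in range(state.shape[1]):          # run for n steps, same weights each step
    state = cgru_step(state, params)
```

Because the same local rule is applied at every cell and every step, the loop reads like the update of a cellular automaton.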

### Decoder

We consider the output much like the input and read out only the first column (of height n, the output sequence length, and depth m, the embedding size). We use a decoder matrix O, of shape m*I, which takes the m-dimensional character representations and maps them to I logits.

$$l_k = O s_{fin}[0, k, :] = O c_k = [l_k^1, l_k^2, \dots, l_k^I]$$

The output at each slot in the output sequence is then whichever character is most probable. For each character, we take the probability of the target character (where the probability is a softmax over the logits). The loss is then the negative sum of the logs of these probabilities over all characters (equivalently, the negative log of their product):

$$L = - \sum_{k \in [1, n]} \log p(c_k^{target}) = - \sum_{k \in [1, n]} \log \frac{e^{l_k^{target}}}{\sum_j e^{l_k^j}}$$
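
A short sketch of the readout and loss, using random stand-ins for the final state, the decoder matrix O, and the targets (all of these values are placeholders, and the function name is mine):

```python
import numpy as np

def decode_and_loss(final_state, O, target_ids):
    """Read the first column, map it to logits, and sum the negative log-probabilities."""
    first_column = final_state[0]                 # shape (n, m)
    logits = first_column @ O                     # shape (n, I)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    predictions = log_probs.argmax(axis=1)        # most probable symbol per slot
    loss = -log_probs[np.arange(len(target_ids)), target_ids].sum()
    return predictions, loss

rng = np.random.default_rng(0)
n, m, num_symbols = 9, 24, 4
final_state = rng.normal(size=(4, n, m))
O = rng.normal(size=(m, num_symbols))
targets = rng.integers(0, num_symbols, size=n)
predictions, loss = decode_and_loss(final_state, O, targets)
```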

## Training

 There are a bunch of tricks employed in training these.
  - Dropout (between 6%-13.5%), noise added to gradients (gaussian with mean 0, variance ~ 1/sqrt(step number)), gradient clipping
  - Grid search for hyperparameter tuning
  - Curriculum design - the distribution over lengths of presented examples during training shifts towards more difficult problems as mastery of easier ones increases (note that this curriculum is designed by hand)
  - "Parameter sharing relaxation": each GRU is allowed to have different weights at each time step for 6 steps, then cycles that same set of weights (step 7 uses weights at step 1). This allows for more variation in the weights so that the network can achieve a better fit. To get a single weight matrix there's a penalty that progressively increases which is proportional to the differences of weights from the mean.

## Other Notes

  - The number of steps the network is run for is equal to the length of the input sequence (for algorithms which require more computational steps, this architecture is therefore insufficient for perfect calculation). I'm not sure if this is a trick, or just a decision for architectural simplicity. Regardless, it removes the need to learn a stopping mechanism.

arxiv.org
scholar.google.com

Training recurrent networks online without backtracking
Ollivier, Yann and Charpiat, Guillaume
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the [forward method for automatic differentiation](//en.wikipedia.org/wiki/Automatic_differentiation#Forward_accumulation), but since it requires maintaining a large Jacobian matrix (nb. of hidden units times nb. of parameters), they propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing the estimate over time.
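
To give a feel for how a large sum of outer products can be replaced by a cheap unbiased estimate, here is a sketch of a random-sign rank-one construction. I am assuming the estimate takes this general form; whether it matches the paper's equations term by term (for example, any extra scaling factors) is not checked here.

```python
import itertools
import numpy as np

def rank_one_estimate(vs, ws, signs):
    """Rank-one estimate of sum_i outer(vs[i], ws[i]) for a given sign vector."""
    return np.outer(signs @ vs, signs @ ws)

rng = np.random.default_rng(0)
vs = rng.normal(size=(3, 4))     # three vectors v_i
ws = rng.normal(size=(3, 5))     # three vectors w_i
exact = sum(np.outer(v, w) for v, w in zip(vs, ws))

# Averaging over all 2^3 equally likely sign patterns gives the exact expectation.
all_signs = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=3)]
expectation = np.mean([rank_one_estimate(vs, ws, s) for s in all_signs], axis=0)
```

The cross terms cancel in expectation because independent signs have zero mean, so only two vectors need to be stored instead of the full matrix.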

#### My two cents

Online training of RNNs is a big, unsolved problem. The current approach people use is to truncate backprop to only a few steps in the past, which is more of a heuristic.

This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work.

Also, I don't think I buy their argument that the "theory of stochastic gradient descent applies". Here's the reason. The method tracks the Jacobian of the hidden state wrt the parameters, which they denote $G(t)$. It is updated into $G(t+1)$, using a recursion which is based on the chain rule. However, between computing $G(t)$ and $G(t+1)$, a gradient step is performed during training. This means that $G(t)$ is now slightly stale, and corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that $G(t+1)$ (more specifically, its stochastic estimate as proposed in this paper) isn't unbiased anymore. So, unless I'm missing something (which I might!), I don't think we can invoke the theory of SGD as they suggest.

But frankly, that last issue seems pretty unavoidable in the online setting. I suspect this will never be solved, and future research will somehow have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn't one).

So overall, kudos to the authors, and I'm really looking forward to reading more about where this research goes!

arxiv.org
scholar.google.com

Accelerating Stochastic Gradient Descent via Online Learning to Sample
Bouchard, Guillaume and Trouillon, Théo and Perez, Julien and Gaidon, Adrien
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

SGD is a widely used optimization method for training the parameters of some model f on some given task. Since the convergence of SGD is related to the variance of the stochastic gradient estimate, there's been a lot of work on trying to come up with such stochastic estimates with smaller variance. This paper does it using an importance sampling (IS) Monte Carlo estimate of the gradient, and learning the proposal distribution $q$ of the IS estimate. 

The proposal distribution $q$ is parametrized in some way, and is trained to minimize the variance of the gradient estimate. It is trained at the same time as the model $f$, which is itself trained by SGD using the IS gradient estimate. To make this whole story more recursive, the proposal distribution $q$ is also trained with SGD :-) This makes sense, since one expects the best proposal to depend on the value of the parameters of model $f$, so the best proposal $q$ should vary as $f$ is trained.

One application of this idea is in optimizing a classification model over a distribution that is imbalanced class-wise (e.g. there are classes with much fewer examples). In this case, the proposal distribution determines how frequently we sample examples from each class (conditioned on the class, training examples are chosen uniformly).
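
The class-wise sampling scheme can be sketched as follows. The toy labels, the random stand-in gradients, and the fixed proposal `class_proposal` are assumptions for illustration; in the paper the proposal is learned rather than fixed. The point of the sketch is only that reweighting by p(i)/q(i) keeps the gradient estimate unbiased.

```python
import numpy as np

def is_weights(labels, class_proposal):
    """Per-example sampling probability q(i) and importance weight p(i)/q(i)."""
    n = len(labels)
    class_counts = np.bincount(labels, minlength=len(class_proposal))
    q = class_proposal[labels] / class_counts[labels]   # pick class, then uniform within it
    return q, (1.0 / n) / q

rng = np.random.default_rng(0)
labels = np.array([0] * 95 + [1] * 5)            # imbalanced toy dataset
per_example_grads = rng.normal(size=(100, 3))    # stand-in for per-example gradients
full_gradient = per_example_grads.mean(axis=0)

class_proposal = np.array([0.5, 0.5])            # hypothetical q: oversample the rare class
q, weights = is_weights(labels, class_proposal)

# Exact expectation of the one-sample estimator w_i * g_i under i ~ q.
expected_estimate = (q[:, None] * weights[:, None] * per_example_grads).sum(axis=0)

# One stochastic draw, as SGD would use it.
i = rng.choice(len(labels), p=q)
sampled_estimate = weights[i] * per_example_grads[i]
```

Any valid proposal gives the same expectation; what changes with $q$ is the variance, which is the quantity being minimized.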


#### My two cents

This is a really cool idea. I particularly like the application to training on an imbalanced classification problem. People have mostly been using heuristics to tackle this problem, such as initially sampling each class equally as often, and then fine-tuning/calibrating the model using the real class proportions. This approach instead proposes a really elegant, coherent, solution to this problem.

I would have liked to see a comparison with that aforementioned heuristic (for mainly selfish reasons :-) ). They instead compare with an importance sampling approach with proposal that assigns the same probability to each class, which is a reasonable alternative (though I don't know if it's used as often as the more heuristic approach).

There are other applications, to matrix factorization and reinforcement learning, that are presented in the paper and seem neat, though I haven't gone through those as much.

Overall, one of my favorite papers this year: it's original, tackles a problem for which I've always hated the heuristic solution I'm using now, proposes an elegant solution to it, and is applicable even more widely than that setting.

papers.nips.cc
scholar.google.com

Semi-supervised Learning with Ladder Networks
Rasmus, Antti and Berglund, Mathias and Honkala, Mikko and Valpola, Harri and Raiko, Tapani
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a unique unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer cost controlled by hyper-parameters).

In more details:

* The encoding (forward propagation) adds noise (Gaussian) at all layers, while decoding is noise-free.
* The target at each layer is the result of noise-less forward propagation.
* Direct connections (also known as skip-connections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus resembles a ladder (hence the name Ladder Networks).
* Miniature neural networks with a single hidden unit and skip-connections are used to decode the left and top layers into a reconstruction. Each network is applied element-wise (without parameter sharing across reconstructed units).
* The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class log-likelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noise-free forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective).

Batch normalization is used to train the network.

Since the model combines unsupervised and supervised learning, it can be used for semi-supervised learning, where unlabeled examples can be used to update the network using the unsupervised objective only. State of the art results in the semi-supervised setting are presented, for both the MNIST and CIFAR-10 datasets.

#### My two cents

What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders, trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL) that, while recent progress in DL applied on large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in low-labeled data regime, in the semi-supervised setting.

Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on doing effective hyper-parameter search with a small validation set.

Generally, I really hope this paper will stimulate more research on DL methods for the specific case of a small labeled dataset / large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge, which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled datasets for each and every task, on our way to real AI.

arxiv.org
scholar.google.com

Towards Neural Network-based Reasoning
Peng, Baolin and Lu, Zhengdong and Li, Hang and Wong, Kam-Fai
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a neural network architecture that can take as input a question and a sequence of facts expressed in natural language (i.e. a sequence of words) and produce as its output the answer to that question. The main components of the architecture are as follows:

* The question (q) and the facts (f_1, ... , f_K) are each individually transformed into a fixed size vector using the same GRU RNN (with the last hidden layer serving as the vector representation).
* These vectors are each passed through "reasoning layers", where each layer transforms the question q and the facts f_k into a new vector representation. This is done by feeding each question-fact pair (q,f_k) to a neural network that outputs a new representation for the fact f_k (which replaces its old representation in the layer), as well as a new representation for the question. All K new question representations are then pooled to obtain a single question representation that replaces the old one in the layer.
* The last reasoning layer is either fed to a softmax layer for binary questions, or to a scoring layer for questions with multiple and varying candidate answers.
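
A minimal sketch of a stack of reasoning layers, following the experimentally tested variant in which the fact vectors stay fixed and only the question vector is updated. The single tanh layer, the max pooling, the sizes, and the random vectors standing in for the GRU encodings are all assumptions made for illustration.

```python
import numpy as np

def reasoning_layer(q, facts, W, b):
    """One layer of the tested variant: facts stay fixed, only q is updated."""
    pairs = np.concatenate([np.tile(q, (len(facts), 1)), facts], axis=1)  # (K, 2d)
    per_fact_q = np.tanh(pairs @ W + b)   # a new question vector for each (q, f_k)
    return per_fact_q.max(axis=0)         # pool the K candidates (max is an assumption)

rng = np.random.default_rng(0)
d, K = 16, 5
q = rng.normal(size=d)                    # stand-in for the GRU encoding of the question
facts = rng.normal(size=(K, d))           # stand-ins for the GRU encodings of the facts
layers = [(rng.normal(scale=0.1, size=(2 * d, d)), np.zeros(d)) for _ in range(3)]

for W, b in layers:                       # stack of reasoning layers
    q = reasoning_layer(q, facts, W, b)
```

The final `q` would then be passed to the softmax or scoring layer described in the last bullet.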

This so-called Neural Reasoner can be trained by backpropagation, in an end-to-end, supervised way. The authors also suggest the use of auxiliary tasks to improve results. The first ("original") adds an autoencoder reconstruction cost that reproduces the question and facts from their first-layer encodings. The second ("abstract") instead reconstructs a more abstract version of the sentences (e.g. "The triangle is above the pink rectangle." becomes "x is above y").

Importantly, while the Neural Reasoner framework is presented in this paper as covering many different variants, the version that is experimentally tested is one where the fact representations f_k are actually left unchanged throughout the reasoning layers, with only the question representation being changed.

The paper presents experiments on two synthetic reasoning tasks and reports performance that compares favorably with previously published alternatives (based on the general Memory Network architecture). The experiments also show that the auxiliary tasks can substantially improve the performance of the model.

#### My two cents

The proposed Neural Reasoner framework is actually very close to work published on arXiv at about the same time on End-to-End Memory Networks \cite{conf/nips/SukhbaatarSWF15}. In fact, the version tested in the paper, with unchanged fact representations throughout layers, is extremely close to End-to-End Memory Networks.

That said, there are also lots of differences. For instance, this paper proposes the use of multilayer networks within each Reasoning Layer, to produce updated question representations. In fact, experiments suggest that using several layers can be very beneficial for the path finding task. The sentence representation at the first layer is also different, being based on a non-linear RNN instead of being based on linear operations on embeddings as in Memory Networks.

The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.

I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the fact representations are also updated during forward propagation, not just the question's representation (which is the version initially described in the paper... I hope experiments on that variant are eventually presented).

It's unfortunate that the authors only performed experiments on 2 of the 20 synthetic question-answering tasks. I hope a future version of this work can report results on the full benchmark and directly compare with End-to-End Memory Networks.

I was also unable to find out which of the question representation pooling mechanisms (section 3.2.2) was used in the experiments. Perhaps the authors forgot to state it?

Overall, a pretty interesting paper that opens different doors towards reasoning with neural networks.

arxiv.org
scholar.google.com

Importance Weighted Autoencoders
Burda, Yuri and Grosse, Roger B. and Salakhutdinov, Ruslan
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper proposes to train a neural network generative model by optimizing an importance sampling (IS) weighted estimate of the log probability under the model. The authors show that the case of an estimate based on a single sample actually corresponds to the learning objective of variational autoencoders (VAE). Importantly, they exploit this connection by showing that, similarly to VAE, a gradient can be passed through the approximate posterior (the IS proposal) samples, thus yielding an importance weighted autoencoder (IWAE). The authors also show that, by using more samples, this objective, which is a lower bound of the actual log-likelihood, becomes an increasingly tighter approximation to the log-likelihood. In other words, the IWAE is expected to better optimize the real log-likelihood of the neural network, compared to VAE.
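
The relationship between the single-sample (VAE) objective and the multi-sample objective can be checked on a toy model where the true log-likelihood is known in closed form. The model (z ~ N(0, 1), x | z ~ N(z, 1)), the deliberately mismatched proposal, and all names below are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_weights(x, z, q_mean, q_var):
    """log w = log p(z) + log p(x|z) - log q(z|x) for the toy model
    z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_var)

def iwae_bound(log_w):
    """log of the average importance weight, computed stably."""
    shift = log_w.max()
    return shift + np.log(np.mean(np.exp(log_w - shift)))

rng = np.random.default_rng(0)
x = 1.5
true_log_px = log_normal(x, 0.0, 2.0)            # marginal of the toy model is N(0, 2)

# A deliberately imperfect proposal (the exact posterior would be N(x/2, 1/2)).
q_mean, q_var = 0.0, 1.0
z = rng.normal(q_mean, np.sqrt(q_var), size=50)
log_w = log_weights(x, z, q_mean, q_var)

vae_estimate = np.mean(log_w)                    # average of single-sample bounds
iwae_estimate = iwae_bound(log_w)                # one 50-sample bound
```

By Jensen's inequality the multi-sample estimate is never below the average of the single-sample ones on the same draws, and with the exact posterior as proposal every weight equals the true likelihood.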

The experiments presented show that the model achieves competitive performance on a version of the binarized MNIST benchmark and on the Omniglot dataset.

#### My two cents

This is a really neat contribution! While simple (both conceptually and algorithmically), it really seems to be an important step forward for the VAE framework. I really like the theoretical result showing that IWAE provides a better approximation to the real log-likelihood, it's quite neat and provides an excellent motivation for the method.

The results on binarized MNIST are certainly impressive. Unfortunately, it appears that the training setup isn't actually comparable to the majority of published results on this dataset. Indeed, it seems that they didn't use the stochastic but *fixed* binarization of the inputs that other publications on this benchmark have used (since my paper on NADE with Iain Murray, we've made available that fixed training set for everyone to use, along with fixed validation and test sets as well). I believe instead they've re-sampled the binarization for each minibatch, effectively creating a setup with a somewhat larger training set than usual. It's unfortunate that this is the case, since it makes this result effectively impossible to compare directly with previous work.

I'm being picky on this issue only because I'm super interested in this problem (that is of generative modeling with neural networks) and this little issue is pretty much the only thing that stops this paper from being a slam dunk. Hopefully the authors (or perhaps someone interested in reimplementing IWAE) can clarify this question eventually.

Otherwise, it seems quite clear to me that IWAE is an improvement over VAE. The experiments of section 5.2, showing that fine-tuning a VAE model with IWAE training improves performance, while fine-tuning an IWAE model using VAE actually makes things worse, are a further demonstration that IWAE is indeed a good idea.

papers.nips.cc
scholar.google.com

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Bengio, Samy and Vinyals, Oriol and Jaitly, Navdeep and Shazeer, Noam
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper considers the problem of structured output prediction, in the specific case where the output is a sequence and we represent the sequence as a (conditional) directed graphical model that generates from the first token to the last. The paper starts from the observation that training such models by maximum likelihood (ML) does not reflect well how the model is actually used at test time. Indeed, ML training implies that the model is effectively trained to predict each token conditioned on the previous tokens *from the ground truth* sequence (this is known as "teacher forcing"). Yet, when making a prediction for a new input, the model will actually generate a sequence by generating tokens one after another and conditioning on *its own predicted tokens* instead.

So the authors propose a different training procedure, where at training time each *conditioning* ground truth token is sometimes replaced by the model's previous prediction. The choice of replacing the ground truth by the model's prediction is made by "flipping a coin" with some probability, independently for each token. Importantly, the authors propose to start with a high probability of using the ground truth (i.e. start close to ML) and anneal that probability closer to 0, according to some schedule (thus the name Scheduled Sampling).
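
The coin-flipping step and an annealing schedule can be sketched as below. The inverse-sigmoid decay and its constant `k` are one illustrative choice of schedule, and the model predictions are passed in as a fixed list for simplicity; in actual training each prediction would be produced on the fly, conditioned on the inputs chosen so far.

```python
import math
import random

def inverse_sigmoid_decay(step, k=100.0):
    """Probability of feeding the ground-truth token; starts near 1 and decays towards 0."""
    return k / (k + math.exp(step / k))

def choose_inputs(ground_truth, model_predictions, truth_prob, rng):
    """For each position, keep the true previous token with probability truth_prob,
    otherwise substitute the model's own prediction."""
    return [
        truth if rng.random() < truth_prob else predicted
        for truth, predicted in zip(ground_truth, model_predictions)
    ]

rng = random.Random(0)
truth = ["a", "cat", "sat", "down"]
predicted = ["a", "dog", "sat", "up"]       # hypothetical model outputs

early = choose_inputs(truth, predicted, inverse_sigmoid_decay(0), rng)
late = choose_inputs(truth, predicted, inverse_sigmoid_decay(2000), rng)
```

Early in training the conditioning tokens are almost always the ground truth; late in training they are almost always the model's own outputs, which matches test-time conditions.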

Experiments on 3 tasks (image caption generation, constituency parsing and speech recognition) based on neural networks with LSTM units, demonstrate that this approach indeed improves over ML training in terms of the various performance metrics appropriate for each problem, and yields better sequence prediction models.

#### My two cents

Big fan of this paper. It both identifies an important flaw in how sequential prediction models are currently trained and, most importantly, suggests a solution that is simple yet effective. I also believe that this approach played a non-negligible role in Google's winning system for image caption generation, in the Microsoft COCO competition.

My alternative interpretation of why Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make. In terms of ML, it is as bad to put high probability on an output sequence that has just 1 wrong token as it is to put the same amount of probability on a sequence that has all tokens wrong. Yet, say for image caption generation, outputting a sentence that is one word away from the ground truth is clearly preferable to making a mistake on all words (something that is also reflected in the performance metrics, such as BLEU).

By training the model to be robust to its own mistakes, Scheduled Sampling ensures that errors won't accumulate and makes predictions that are entirely off much less likely.

An alternative to Scheduled Sampling is DAgger (Dataset Aggregation: \cite{journals/jmlr/RossGB11}), which, briefly put, alternates between training the model and adding to the training set examples that mix model predictions and the ground truth. However, Scheduled Sampling has the advantage that there is no need to explicitly create and store that increasingly large dataset of sampled examples, something that isn't appealing for online learning or learning on large datasets.

I'm also very curious about one of the directions of future work mentioned in the conclusion: figuring out a way to backprop through the stochastic predictions made by the model. Indeed, as the authors point out, the current algorithm ignores the fact that, by sometimes taking as input its previous prediction, this induces an additional relationship between the model's parameters and its ultimate prediction, a relationship that isn't taken into account during training. To take it into account, you'd need to somehow backpropagate through the stochastic process that generated the previous token prediction. While the work on variational autoencoders has shown that we can backprop through Gaussian samples, backpropagating through the sampling of a discrete multinomial distribution is essentially an open problem. I do believe that there is work that tried to tackle propagating through stochastic binary units however, so perhaps that's a start. Anyways, if the authors could make progress on that specific issue, it could be quite useful not just in the context of Scheduled Sampling, but possibly in the context of training networks with discrete stochastic units in general!

arxiv.org
scholar.google.com

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Gal, Yarin and Ghahramani, Zoubin
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents an interpretation of dropout training as performing approximate Bayesian learning in a deep Gaussian process (DGP) model. This connection suggests a very simple way of obtaining, for networks trained with dropout, estimates of the model's output uncertainty. This estimate is computed from an ensemble of networks, each obtained by sampling a new dropout mask.
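
A rough sketch of that ensemble estimate, assuming a tiny hand-written one-hidden-layer ReLU network (the weights, function names and dropout rate below are made up for illustration). It reports only the sample mean and variance over dropout masks and leaves out any additional noise or precision term a full estimate might include.

```python
import random

def forward_with_dropout(x, w_hidden, w_out, drop_prob, rng):
    """One stochastic forward pass, sampling a fresh dropout mask
    over the hidden units (inverted-dropout scaling)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    keep = 1.0 - drop_prob
    masked = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w_out, masked))

def predictive_mean_and_variance(x, w_hidden, w_out, drop_prob, n_samples, rng):
    """Average many stochastic passes; the spread across passes is the
    uncertainty estimate."""
    outputs = [
        forward_with_dropout(x, w_hidden, w_out, drop_prob, rng)
        for _ in range(n_samples)
    ]
    mean = sum(outputs) / n_samples
    variance = sum((o - mean) ** 2 for o in outputs) / n_samples
    return mean, variance

rng = random.Random(0)
x = [1.0, 2.0]
w_hidden = [[0.5, -0.25], [1.0, 1.0]]  # hypothetical trained weights
w_out = [1.0, -2.0]
mean_mc, var_mc = predictive_mean_and_variance(x, w_hidden, w_out, 0.5, 100, rng)
mean_det, var_det = predictive_mean_and_variance(x, w_hidden, w_out, 0.0, 100, rng)
```

With the dropout probability set to 0 every pass is identical, so the variance collapses to zero, which is a quick sanity check on the sampling code.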

#### My two cents

This is a really nice and thought-provoking contribution to our understanding of dropout. Unfortunately, the paper in fact doesn't provide a lot of comparisons with either other ways of estimating the predictive uncertainty of deep networks, or other approximate inference schemes in deep GPs (actually, see update below). The qualitative examples provided however do suggest that the uncertainty estimate isn't terrible.

Irrespective of the quality of the uncertainty estimate suggested here, I find the observation itself really valuable. Perhaps future research will then shed light on how useful that method is compared to other approaches, including Bayesian dark knowledge \cite{conf/nips/BalanRMW15}.

`Update: On September 27th`, the authors uploaded to arXiv a new version that now includes comparisons with 2 alternative Bayesian learning methods for deep networks, specifically the stochastic variational inference approach of Graves and probabilistic back-propagation of Hernandez-Lobato and Adams. Dropout actually does very well against these baselines and, across datasets, is almost always amongst the best performing methods!

papers.nips.cc
scholar.google.com

Variational Dropout and the Local Reparameterization Trick
Kingma, Diederik P. and Salimans, Tim and Welling, Max
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop.

The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the pre-activation (i.e. before the activation function) vectors during forward propagation. This is due to the fact that computing the pre-activation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can then efficiently add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for the whole minibatch.
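
To make the "noise at that level is also Gaussian" step concrete, here is the identity written out under the assumption of a fully factorized Gaussian posterior over the weights. The symbols $a_{mi}$ (input $i$ of example $m$), $w_{ij}$ and $b_{mj}$ (pre-activation $j$ of example $m$) are notation introduced for this note:

$$w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2), \quad b_{mj} = \sum_i a_{mi} w_{ij} \quad \Rightarrow \quad b_{mj} \sim \mathcal{N}\Big(\sum_i a_{mi} \mu_{ij}, \; \sum_i a_{mi}^2 \sigma_{ij}^2\Big)$$

So instead of sampling a full weight matrix per example, one can sample each $b_{mj}$ directly from its own Gaussian, which costs about as much as an ordinary forward pass.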

The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.

Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of an SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is a hyper-parameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model.

Experiments on MNIST confirm that tuning that parameter works and makes it possible to reach good performance across various network sizes, compared to using a default dropout rate.

#### My two cents

This is another thought-provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have made it possible to draw a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that this connection yielded a method for tuning the dropout rate automatically.

I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.

Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.

arxiv.org
scholar.google.com

Dropout as data augmentation
Konda, Kishore Reddy and Bouthillier, Xavier and Memisevic, Roland and Vincent, Pascal
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper suggests a novel explanation for why dropout training is helpful: because it corresponds to an adaptive data augmentation method. Indeed, the authors point out that, when sampling a mask of the hidden units in a network (effectively setting the corresponding units to 0), the same effect would have been obtained by feeding as input an example tailored to yield activations of 0 for these units and otherwise the same activation for all other units. Since this "ghost" example will have to be different from the original example, and since each different mask would correspond to a different "ghost" example, then effectively mask sampling is similar to data augmentation.

While in practice finding a ghost example that replicates exactly the same dropout hidden activations might not be possible, the authors show that finding an "approximate" ghost example that minimizes a distance between the target dropout activation and the deterministic activation of the ghost example works well. Indeed, they show that training a deep neural net on additional data generated by this procedure yields results that are at least as good as regular dropout on MNIST and CIFAR-10 (actually, the deterministic neural net still uses regular dropout at the input layer, however they do show that the additional ghost examples are necessary to match the neural net trained with dropout at all layers).
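
One way to write the search for an approximate ghost example, assuming a single hidden layer and a squared Euclidean distance (both assumptions made here for illustration; the summary only says "a distance"). With $h(\cdot)$ the deterministic hidden activation, $x_0$ the original example and $m$ the sampled 0-1 mask:

$$x^{*} = \arg\min_{x} \left\| h(x) - m \odot h(x_0) \right\|^2$$

The minimization is over the input $x$ with the network weights held fixed, so it can be attacked with gradient descent on the input.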

Then the authors use that interpretation to justify a variation of dropout where the dropout rate isn't fixed, but itself is randomly sampled in some range for each example. Indeed, if we think of dropout at a fixed rate as a specific class of ghost data being added, varying the dropout rate corresponds to enriching even more the ghost data pool. The experiments show that this can help, though not by much.

Finally, the authors propose an explanation of a property of dropout: that it tends to generate hidden representations that are sparser. Again, the authors rely on their interpretation of dropout as data augmentation. The explanation goes as follows. Training on the ghost data distribution might imply that the classification problem has become significantly harder. Specifically, it is quite possible that the addition of new ghost examples generates new isolated class clusters in input space that the model must now learn to discriminate. And they hypothesize that the generation of such additional clusters would encourage sparsity. To test this hypothesis, the authors synthetically simulate this scenario, by sampling data on a circle, which is clustered in small arcs each assigned to one of 10 possible classes in cycling order. Decreasing the arc length thus increases the number of arcs, i.e. class clusters. They show that training deep networks on datasets with an increasing number of class clusters does yield representations that are increasingly sparse. This thus suggests that dropout might indeed be equivalent to modifying the input distribution by adding such isolated class-specific clusters in input space.

One assumption behind this analysis is that the sparsity patterns (i.e. the set of non-zero dimensions) play an important role in classification and incorporate most of the discriminative class information. This assumption is also confirmed in experiments, where replacing the ReLU activation function with a binary activation (that is 1 if the pre-activation is positive and 0 otherwise) after training still yields a network with good performance (though slightly worse).


#### My two cents

This is a really original and thought-provoking paper. One interpretation I make of these results is that the inductive bias corresponding to using a deep neural network with ReLU activations is more valuable than one might have thought, and that the usefulness of deep neural networks goes beyond just being black boxes that can learn data-dependent representations. Otherwise, it's not clear to me why the ghost data implicitly generated by the architecture would be useful at all. This also suggests an experiment where such ghost samples would be fed to another type of classifier, such as an SVM, to test whether the data augmentation is useful in itself and reflects meaningful structure in the data, as opposed to being somehow useful only for neural nets.

I note that the results are mostly specific to architectures based on ReLU activations (not that this is a problem, but one should keep this in mind).

I'd really like to see what the ghost samples look like. Do they correspond to interpretable images? The authors also mention that exploring how the samples change with training would be interesting to investigate, and I agree.

Finally, I think there might be a typo in Figure 1. While the labels of a) and b) state that the arc length is smaller for a) than b), the plot clearly shows otherwise.

arxiv.org
scholar.google.com

LSTM: A Search Space Odyssey
Greff, Klaus and Srivastava, Rupesh Kumar and Koutník, Jan and Steunebrink, Bas R. and Schmidhuber, Jürgen
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents an extensive evaluation of variants of LSTM networks. Specifically, they start from what they consider to be the vanilla architecture and, from it, also consider 8 variants which correspond to small modifications of the vanilla case. The vanilla architecture is the one described in Graves & Schmidhuber (2005) \cite{journals/nn/GravesS05}, and the variants consider removing single parts of it (input, forget or output gates, or activation functions), coupling the input and forget gate (which is inspired by GRU) or having full recurrence between all gates (which comes from the original LSTM formulation).

In their experimental setup, they consider 3 datasets: TIMIT (speech recognition), IAM Online Handwriting Database (character recognition) and JSB Chorales (polyphonic music modeling). For each, they tune the hyper-parameters of each of the 9 architectures, using random search based on 200 samples. Then, they keep the 20 best hyper-parameters and use the statistics of those as a basis for comparing the architectures.
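
The tuning protocol described above (random search with 200 samples, then keeping the 20 best) can be sketched as follows. The hyper-parameter names, their ranges and the `toy_error` objective are placeholders invented for this example; in the paper each trial would be a full LSTM training run.

```python
import math
import random

def sample_config(rng):
    """Draw one hyper-parameter setting (ranges are illustrative only)."""
    return {
        "learning_rate": 10 ** rng.uniform(-6, -2),  # log-uniform
        "hidden_size": rng.randint(20, 200),
        "momentum": rng.uniform(0.0, 0.99),
    }

def random_search(evaluate, n_trials, n_keep, rng):
    """Evaluate n_trials random settings and return the n_keep best
    (lowest error) as (error, config) pairs."""
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((evaluate(config), config))
    trials.sort(key=lambda pair: pair[0])
    return trials[:n_keep]

def toy_error(config):
    """Hypothetical stand-in for a validation error after training."""
    return abs(math.log10(config["learning_rate"]) + 4) + 1.0 / config["hidden_size"]

best = random_search(toy_error, n_trials=200, n_keep=20, rng=random.Random(0))
```

Comparing architectures on the statistics of the kept trials, rather than on the single best run, makes the comparison less sensitive to one lucky draw.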

#### My two cents

This was a very useful read. I'd make it a required read for anyone who wants to start using LSTMs. First, I found the initial historical description of the developments surrounding LSTMs very interesting and clarifying. But more importantly, it presents a really useful picture of LSTMs that can serve both as a good basis for starting to use LSTMs and as an insightful (backed with data) exposition of the importance of each part in the LSTM.

The analysis based on an fANOVA (which I didn't know about until now) is quite neat. Perhaps the most surprising observation is that momentum actually doesn't seem to help that much. Investigating second order interaction between hyper-parameters was a smart thing to do (showing that tuning the learning rate and hidden layer jointly might not be that important, which is a useful insight). The illustrations in Figure 4, laying out the estimated relationship (with uncertainty) between learning rate / hidden layer size / input noise variance and performance / training time, are also full of useful information.

I won't repeat here the main observations of the paper, which are laid out clearly in the conclusion (section 6).

Additionally, my personal take-away point is that, in an LSTM implementation, it might still be useful to support the removal of peepholes or having coupled input and forget gates, since they both yielded the ultimate best test set performance on at least one of the datasets (I'm assuming it was also best on the validation set, though this might not be the case...)

The fANOVA analysis makes it clear that the learning rate is the most critical hyper-parameter to tune (can be "make or break"). That said, this is already well known. And the fact that it explains so much of the variance might reflect a bias of the analysis towards a situation where the learning rate isn't tuned as well as it could be in practice (this is after all THE hyper-parameter that neural net researchers spend the most time tuning in practice). So, as future work, this suggests perhaps doing another round of the same analysis (which is otherwise really neatly set up), where more effort is always put on tuning the learning rate, individually for each of the other hyper-parameters. In other words, we'd try to ignore the regions of hyper-parameter space that correspond to bad learning rates, in order to "marginalize out" its effect. This would thus explore the perhaps more realistic setup that assumes one always tunes the learning rate as best as possible.

Also, considering a less aggressive gradient clipping in the hyper-parameter search would be interesting since, as the authors admit, clipping within [-1,1] might have been too much and could explain why it didn't help.

Otherwise, a really great and useful read!

arxiv.org
scholar.google.com

Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
Stadie, Bradly C. and Levine, Sergey and Abbeel, Pieter
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

The main idea in this paper is to use the agent's ability to predict observations at the next step as a measure of how much exploration of that action should be encouraged. This prediction is based on a deep architecture, specifically a deep autoencoder representation of observations, and accuracy of prediction is measured at the level of that learned, deep representation. Exploration is encouraged by increasing the reward whenever the model's prediction of the representation at the next time step is bad.
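
As a minimal sketch of the reward shaping described above: the bonus is the squared error between the predicted and the actual next-step representation. The function names and the fixed `bonus_scale` are assumptions for this example, and any normalization or decay of the bonus used in the paper is not reproduced here.

```python
def prediction_error(predicted_code, actual_code):
    """Squared error between the model's predicted representation of the
    next observation and the representation actually observed."""
    return sum((p - a) ** 2 for p, a in zip(predicted_code, actual_code))

def shaped_reward(env_reward, predicted_code, actual_code, bonus_scale=0.1):
    """Environment reward plus an exploration bonus that grows when the
    dynamics model was surprised. bonus_scale is a hypothetical constant."""
    return env_reward + bonus_scale * prediction_error(predicted_code, actual_code)

# Well-predicted transition: no bonus. Surprising transition: extra reward.
familiar = shaped_reward(1.0, [0.2, 0.4], [0.2, 0.4])
novel = shaped_reward(1.0, [0.0, 0.0], [1.0, 2.0])
```

An agent maximizing the shaped reward is thus pulled towards transitions that its predictive model has not yet learned.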

#### My two cents

I'm not sure how novel this idea is in RL, but at the very least it's interesting that it was explored the way it was here, with deep learning. As a non-expert in RL, I certainly enjoyed reading the paper. Also, this implements nicely an idea that just seems like common sense, as an exploration strategy for an agent: actions that merit exploration are those that yield results that are unexpected to you. 

It will be interesting to see if this general approach will be able to exploit upcoming progress in the development of better generative deep learning models, an area that is currently very active.

papers.nips.cc
scholar.google.com

Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets
Vincent, Pascal and de Brébisson, Alexandre and Bouthillier, Xavier
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the so-called spherical softmax. One use case could be for training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and $d$ the hidden layer dimensionality, they show that their trick makes it possible to scale in $O(d^2)$.
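
To give some intuition for why the dependence on $D$ can disappear, here is the loss value written out under notation assumed for this note: a linear output layer $Wh$ with $W$ of size $D \times d$, a hidden vector $h$ of size $d$, and a target $y$ with at most $K$ non-zero entries:

$$\|Wh - y\|^2 = h^\top (W^\top W)\, h - 2\, h^\top (W^\top y) + y^\top y$$

If the $d \times d$ matrix $Q = W^\top W$ is kept available, the first term costs $O(d^2)$, the second $O(Kd)$ (only $K$ rows of $W$ are touched) and the third $O(K)$. This only covers evaluating the loss; keeping $Q$ and $W$ consistent under gradient updates without paying $O(Dd)$ is the harder part that the paper addresses.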

Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.

#### My two cents

This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error.

The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.

But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!

papers.nips.cc
scholar.google.com

Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning
Mohamed, Shakir and Rezende, Danilo Jimenez
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require from the ("outside") user an explicit definition of rewards.

Specifically, the learning objective, for a current state s, is the mutual information between the sequence of K actions a proposed by an exploration distribution $w(a|s)$ and the final state s' of the agent after performing these actions. To understand the properties of this objective, it is useful to consider the form of this mutual information as a difference of conditional entropies:

$$I(a,s'|s) = H(a|s) - H(a|s',s)$$

Where $I(.|.)$ is the (conditional) mutual information and $H(.|.)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high $H(a|s)$ entropy) but is such that these actions have predictable consequences (i.e. lead to predictable state s' so that $H(a|s',s)$ is low). So one could think of the agent as trying to learn to have control of as much of the environment as possible, thus this objective has also been coined as "empowerment".
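
A tiny numerical check of that decomposition, for a single fixed state s with two actions and two possible next states. The joint tables are made up for illustration: in the first, each action leads to its own next state; in the second, the next state is independent of the action.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(a, s' | s) = H(a | s) - H(a | s', s) for one fixed state s,
    where joint[a][s_next] = p(a, s' | s)."""
    p_next = [sum(col) for col in zip(*joint)]
    h_a = entropy([sum(row) for row in joint])
    h_a_given_next = 0.0
    for j, p in enumerate(p_next):
        if p > 0:
            h_a_given_next += p * entropy([row[j] / p for row in joint])
    return h_a - h_a_given_next

controllable = [[0.5, 0.0], [0.0, 0.5]]        # action determines next state
uncontrollable = [[0.25, 0.25], [0.25, 0.25]]  # next state ignores the action
mi_controllable = mutual_information(controllable)
mi_uncontrollable = mutual_information(uncontrollable)
```

Both tables have the same action entropy, but only the first gives the agent any influence over where it ends up, which is what the objective rewards.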

The main contribution of this work is to show how to train, on a large scale (i.e. larger state space and action space) with this objective, using neural networks. They build on a variational lower bound on the mutual information and then derive from it a stochastic variational training algorithm. The procedure has 3 components: the exploration distribution $w(a|s)$, the environment $p(s'|s,a)$ (can be thought of as an encoder, but which isn't modeled and is only interacted with/sampled from) and the planning model $p(a|s',s)$ (which is modeled and can be thought of as a decoder). The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details).

This approach exploits neural networks of various forms. Neural autoregressive generative models are also used as models for the exploration distribution as well as the decoder or planning distribution. Interestingly, the framework also allows learning the state representation s as a function of some "raw" representation x of states. For raw states corresponding to images (e.g. the pixels of the screen image in a game), CNNs are used.
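
For reference, the standard variational lower bound of this kind can be written as follows, where $q(a|s',s)$ denotes the learned planning model (the letter $q$ is notation chosen here to distinguish it from the true, intractable posterior over actions):

$$I(a,s'|s) \geq H(a|s) + \mathbb{E}_{w(a|s)\, p(s'|s,a)}\left[\log q(a|s',s)\right]$$

The bound is tight when $q$ matches the true posterior, and the expectation only requires sampling actions from $w$ and observing the resulting states from the environment.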

papers.nips.cc
scholar.google.com

Bayesian dark knowledge
Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, making it possible to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.
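
A toy sketch of the SGLD update on a one-dimensional problem. To keep it short it uses the exact gradient of a known log-posterior (a unit-variance Gaussian centred at 2) instead of a minibatch estimate, and a constant step size; both are simplifications made for this example.

```python
import math
import random

def sgld_samples(grad_log_post, theta0, step_size, n_steps, rng):
    """Langevin updates: half a step along the gradient of the
    log-posterior, plus Gaussian noise with variance step_size."""
    theta = theta0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(step_size))
        theta = theta + 0.5 * step_size * grad_log_post(theta) + noise
        samples.append(theta)
    return samples

# Toy target: posterior N(2, 1), so grad log p(theta) = -(theta - 2).
rng = random.Random(0)
samples = sgld_samples(lambda t: -(t - 2.0), 0.0, 0.1, 20000, rng)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

Averaging predictions over the kept parameter values is the ensemble (Monte Carlo) prediction mentioned above; here the average of the samples themselves should land near the true posterior mean of 2.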

The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) to replicate the behavior and performance of a much larger model (teacher), by essentially training the student to match the outputs of the teacher.

The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD.

Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model, whose outputs should be calibrated to the Bayesian predictive distribution.

arxiv.org
scholar.google.com

Clustering is Efficient for Approximate Maximum Inner Product Search
Auvolat, Alex and Vincent, Pascal
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

`Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.`

This paper considers the problem of Maximum Inner Product Search (MIPS). In MIPS, given a query $q$ and a set of inputs $x_i$, we want to find the input (or the top n inputs) with highest inner product, i.e. $argmax_i q' x_i$.

Recently, it was shown that a simple transformation to the query and input vectors made it possible to approximately solve MIPS using hashing methods for Maximum Cosine Similarity Search (MCSS), a problem for which solutions are readily available (see section 2.4 for a brief but very clear description of the transformation). 
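
To illustrate how such a reduction can work, here is one known transformation of this kind: append to each input the component needed to bring it to a common norm, and append a zero to the query. Whether this is exactly the transformation used in the paper is an assumption on my part; the helper names are made up for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def augment_inputs(inputs):
    """Add one coordinate so that every input has the same norm
    (the largest norm in the set)."""
    max_sq_norm = max(dot(x, x) for x in inputs)
    return [x + [math.sqrt(max(max_sq_norm - dot(x, x), 0.0))] for x in inputs]

def augment_query(query):
    """Pad the query with a zero so inner products are unchanged."""
    return query + [0.0]

inputs = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5]]
query = [1.0, 3.0]
indices = range(len(inputs))

mips_best = max(indices, key=lambda i: dot(query, inputs[i]))
plain_cosine_best = max(indices, key=lambda i: cosine(query, inputs[i]))
aug_inputs, aug_query = augment_inputs(inputs), augment_query(query)
mcss_best = max(indices, key=lambda i: cosine(aug_query, aug_inputs[i]))
```

After augmentation all inputs share the same norm, so ranking by cosine similarity is the same as ranking by inner product; on the raw vectors the two rankings disagree in this example.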

In this paper, the authors combine this approach with clustering, in order to improve the quality of retrieved inputs. Specifically, they consider the spherical k-means algorithm, which is a variant of k-means in which data points are clustered based on cosine similarity instead of the Euclidean distance (in short, data points are first scaled to be of unit norm, then in the training inner loop points are assigned to the cluster centroid with highest dot product and cluster centroids are updated as usual, except that they are always rescaled to unit norm). Moreover, they consider a bottom-up application of the algorithm to yield a hierarchical clustering tree.
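
The parenthetical description of spherical k-means translates fairly directly into code. This is a flat (non-hierarchical) sketch with caller-supplied initial centroids; the function names are made up, and it does not handle the degenerate case where a cluster's members average to the zero vector.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def spherical_kmeans(points, init_centroids, n_iters=10):
    """k-means on the unit sphere: assign by highest dot product,
    then update each centroid as the renormalized mean of its members."""
    points = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init_centroids]
    assignments = []
    for _ in range(n_iters):
        assignments = [
            max(range(len(centroids)), key=lambda k: dot(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave empty clusters unchanged
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = normalize(mean)
    return assignments, centroids

points = [[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]]
assignments, centroids = spherical_kmeans(points, [points[0], points[2]])
```

Applying the same routine recursively to the centroids is one way to picture the bottom-up hierarchical tree mentioned above.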

They propose to use such a hierarchical clustering tree to find the top-n candidates for MIPS. The key insight here is that, since spherical k-means relies on cosine similarity for finding the best cluster, and since we have a transformation that allows the maximisation of inner product to be approximated by the maximisation of cosine similarity, then a tree to find MIPS candidates could be constructed by running spherical k-means on the inputs transformed by the same transformation used for hashing-based MIPS. 

In order to make the search more robust to border issues when a query is close to the frontier between clusters, at each level of the tree they consider more than one candidate cluster during top-down search, so as to merge the candidates in several leaves of the tree at the very end of a full top down query.

Their experiments using search with word embeddings show that the quality of the top 1, 10 and 100 MIPS candidates using their spherical k-means approach is better than using two hashing-based search methods.

arxiv.org
scholar.google.com

Speed learning on the fly
Massé, Pierre-Yves and Ollivier, Yann
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a method for "learning the learning rate" of a stochastic gradient descent method, in the context of online learning. Indeed, variations on the chosen learning rate or learning rate schedule can have a large impact on the observed performance of stochastic gradient descent. Moreover, in the context of online learning, where we are interested in achieving high performance not only at convergence but every step of the way, the "choosing the learning rate" problem is even more crucial.

The authors present a method which attempts to train the learning rate itself by gradient descent. This is achieved by "unrolling" the parameter updates of our model across the time steps of online learning, which exposes the interaction between the learning rate and the sum of losses of the model across these time steps. The authors then propose a way to approximate the gradient of the sum of losses with respect to the learning rate, so that it can be used to perform gradient updates on the learning rate itself.
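
To convey the flavour of the idea, here is a heavily simplified variant that unrolls only a single step: since $\theta_t = \theta_{t-1} - \eta\, g_{t-1}$, the derivative of the current loss with respect to $\eta$ is $-g_t\, g_{t-1}$. This one-step approximation is not the paper's algorithm (which tracks the dependence across all time steps, with its own approximations); the scalar setting, names and constants are assumptions for the example.

```python
def sgd_with_adaptive_rate(grad, theta0, eta0, meta_rate, n_steps):
    """Scalar SGD where the learning rate eta is itself updated by
    gradient descent, using a one-step-unrolled gradient estimate."""
    theta, eta = theta0, eta0
    prev_grad = None
    for _ in range(n_steps):
        g = grad(theta)
        if prev_grad is not None:
            # d loss(theta_t) / d eta = -g_t * g_{t-1}, so descending on
            # eta means adding meta_rate * g_t * g_{t-1}.
            eta = eta + meta_rate * g * prev_grad
        theta = theta - eta * g
        prev_grad = g
    return theta, eta

# Toy problem: loss(theta) = 0.5 * theta^2, whose gradient is theta.
theta_adapt, eta_adapt = sgd_with_adaptive_rate(lambda t: t, 5.0, 0.01, 0.001, 200)
theta_fixed, eta_fixed = sgd_with_adaptive_rate(lambda t: t, 5.0, 0.01, 0.0, 200)
```

Starting from a learning rate that is too small, successive gradients point the same way, so their product is positive and the rate is pushed up; setting `meta_rate` to 0 recovers plain SGD for comparison.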

The gradient on the learning rate has to be approximated, for essentially the same reason that gradients to train a recurrent neural network online must be approximated (see also my notes on another good paper by Yann Ollivier here: \cite{journals/corr/OllivierC15}). Another approximation is introduced to avoid having to compute a Hessian matrix. Nevertheless, results suggest that the proposed approximation works well and can improve over a fixed learning rate with a reasonable decay schedule.

#### My two cents

I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be greater in the context of training neural networks online, for which few satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about.

At this point, the experiments consider fairly simple learning scenarios, but I don't see any obstacle in applying the same method to neural networks. One interesting observation from the results is that results are fairly robust to variations of "the learning rate of the learning rate", compared to varying and fixing the learning rate itself.

Finally, I haven't had time to entirely digest one of their theoretical results, suggesting that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. However, that result seems quite interesting and would deserve more attention.

jmlr.org
scholar.google.com

Gradient-based Hyperparameter Optimization through Reversible Learning
Maclaurin, Dougal and Duvenaud, David K. and Adams, Ryan P.
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This is another "learning the learning rate" paper, which predates (and might have inspired) the "Speed learning on the fly" paper I recently wrote notes about (see \cite{journals/corr/MasseO15}). In this paper, they consider the off-line training scenario, and propose to do gradient descent on the learning rate by unrolling the *complete* training procedure and treating it all as a function to optimize, with respect to the learning rate. This way, they can optimize directly the validation set loss.
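
The "treat the whole training run as a function of the learning rate" idea can be checked by hand on a toy problem. The sketch below unrolls gradient descent on a scalar quadratic and carries the derivative of the parameter with respect to the learning rate forward through the steps. This forward-mode toy is only an illustration: the paper instead differentiates in reverse mode through SGD with momentum on real networks, and all names and constants here are made up.

```python
def train(eta, theta0, train_target, n_steps):
    """Unrolled gradient descent on 0.5 * (theta - train_target)^2,
    also tracking d theta / d eta through every step."""
    theta, dtheta_deta = theta0, 0.0
    for _ in range(n_steps):
        g = theta - train_target
        # theta_next = theta - eta * g, so by the product rule
        # d theta_next / d eta = (1 - eta) * d theta / d eta - g
        dtheta_deta = (1.0 - eta) * dtheta_deta - g
        theta = theta - eta * g
    return theta, dtheta_deta

def validation_loss_and_hypergradient(eta, theta0, train_target, val_target, n_steps):
    """Validation loss after training, and its derivative w.r.t. eta."""
    theta, dtheta_deta = train(eta, theta0, train_target, n_steps)
    loss = 0.5 * (theta - val_target) ** 2
    return loss, (theta - val_target) * dtheta_deta

loss, hypergrad = validation_loss_and_hypergradient(0.1, 0.0, 1.0, 0.8, 5)

# Sanity check against a central finite difference.
eps = 1e-6
loss_plus, _ = validation_loss_and_hypergradient(0.1 + eps, 0.0, 1.0, 0.8, 5)
loss_minus, _ = validation_loss_and_hypergradient(0.1 - eps, 0.0, 1.0, 0.8, 5)
finite_diff = (loss_plus - loss_minus) / (2 * eps)
```

With this hypergradient in hand, the learning rate can itself be updated by a gradient step on the validation loss, which is the outer loop the paper builds on.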

The paper in fact goes much further and can tune many other hyper-parameters of the gradient descent procedure: momentum, weight initialization distribution parameters, regularization and input preprocessing.

#### My two cents

This is one of my favorite papers of this year. While the method of unrolling several steps of gradient descent (100 iterations in the paper) makes it somewhat impractical for large networks (which is probably why they considered 3-layer networks with only 50 hidden units per layer), it provides an incredibly interesting window on what are good hyper-parameter choices for neural networks. Note that, to substantially reduce the memory requirements of the method, the authors had to be quite creative and smart about how to encode changes in the network's weights.

There are tons of interesting experiments, which I encourage the reader to go check out (see section 3).

One experiment on training the learning rates, separately for each iteration (i.e. learning a learning rate schedule), for each layer and for either weights or biases (800 hyper-parameters total) shows that a good schedule is one where the top layer first learns quickly (large learning rate), then the bottom layer starts training faster, and finally the learning rates of all layers are decayed towards zero. Note that some of the experiments presented actually optimized the training error, instead of the validation set error.

Another looked at finding optimal scales for the weight initialization. Interestingly, the values found weren't that far from an often prescribed scale of $1 / \sqrt{N}$, where $N$ is the number of units in the previous layer.

The experiment on "training the training set", i.e. generating the 10 examples (one per class) that would minimize the validation set loss of a network trained on these examples is a pretty cool idea (it essentially learns prototypical images of the digits from 0 to 9 on MNIST).

Another experiment tried to optimize a multitask regularization matrix, in order to encourage forms of soft-weight-tying across tasks.

Note that approaches like the one in this paper make tools for automatic differentiation incredibly valuable. Python autograd, the authors' automatic differentiation Python library https://github.com/HIPS/autograd (which inspired our own Torch autograd https://github.com/twitter/torch-autograd) was in fact developed in the context of this paper.

Finally, I'll end with a quote from the paper, that I found particularly funny: "The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector."

arxiv.org
scholar.google.com

Infinite Dimensional Word Embeddings
Nalisnick, Eric T. and Ravi, Sachin
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper introduces a version of the skipgram word embeddings learning algorithm that can also learn the size (nb. of dimensions) of these embeddings. The method, coined infinite skipgram (iSG), is inspired by my work with Marc-Alexandre Côté on the infinite RBM, in which we describe a mathematical trick for learning the size of a latent representation. This is done by introducing an additional latent variable $z$ representing the number of dimensions effectively involved in the energy function. Moreover, a term penalizing increasing values for $z$ is also incorporated, such that the infinite sum over $z$ converges.

In this paper, the authors extend the probabilistic model behind skipgram with such a variable $z$, now corresponding to the number of dimensions involved in the dot product between word embeddings. They also propose a few approximations required to allow for an efficient training algorithm. Mainly they optimize an upper bound on the regular skipgram objective (see Section 3.2) and they approximate the computation of the conditional over $z$ for a given word $w$, which requires summing over all possible context words $c$, by summing only over the words observed in the immediate current context of $w$ (thus this sum will vary across training examples of the same word $w$).
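As a rough illustration of what the variable $z$ does, the sketch below computes a distribution over $z$ for one word/context pair, where the score of each $z$ is the dot product over the first $z$ dimensions minus a penalty that grows with $z$. The constant per-dimension `penalty`, the truncation at the stored number of dimensions and the function name are assumptions made for illustration; the exact energy function of the paper may differ.

```python
import math


def posterior_over_dims(w, c, penalty=0.5):
    """Distribution over z = 1..len(w) for one (word, context) pair.

    Score of z = (dot product over the first z dimensions) - penalty * z.
    The sum over z is truncated at the number of stored dimensions.
    """
    scores, partial_dot = [], 0.0
    for z, (wi, ci) in enumerate(zip(w, c), start=1):
        partial_dot += wi * ci
        scores.append(partial_dot - penalty * z)
    # Softmax over z, shifted by the max score for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [x / total for x in weights]


# Only the first dimension carries signal: small z is preferred.
sparse = posterior_over_dims([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
# Every dimension carries signal: large z is preferred.
dense = posterior_over_dims([1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

The penalty is what makes unused dimensions cost something, so a pair that only needs one dimension puts most of its mass on $z = 1$.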

Experiments show that the iSG learns to exploit different dimensions to model different senses of words better than the original skipgram model does. Quantitatively, the iSG seems to assign better probabilities to context words.

arxiv.org
scholar.google.com

Gated Graph Sequence Neural Networks
Li, Yujia and Tarlow, Daniel and Brockschmidt, Marc and Zemel, Richard S.
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a feed-forward neural network architecture for processing graphs as inputs, inspired by previous work on Graph Neural Networks.

In brief, the architecture of the GG-NN corresponds to $T$ steps of GRU-like (gated recurrent unit) updates, where $T$ is a hyper-parameter. At each step, a vector representation is computed for all nodes in the graph, where a node's representation at step $t$ is computed from the representation of nodes at step $t-1$. Specifically, the representation of a node will be updated based on the representation of its neighbors in the graph. Incoming and outgoing edges in the graph are treated differently by the neural network, by using different parameter matrices for each. Moreover, if edges have labels, separate parameters can be learned for the different types of edges (meaning that edge labels determine the configuration of parameter sharing in the model). Finally, GG-NNs can incorporate node-level attributes, by using them in the initialization (time step 0) of the nodes' representations.
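The propagation described above can be sketched as follows. To keep it short, each node carries a single scalar state instead of a vector, so every parameter matrix collapses to one number; the parameter names and values are hypothetical, and edge labels are left out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def ggnn_step(h, edges, p):
    """One GRU-like update of all node states.

    h: list with one scalar state per node.
    edges: list of directed (src, dst) pairs.
    p: dict of scalar parameters (hypothetical names).
    """
    n = len(h)
    msg_in = [0.0] * n   # information arriving along incoming edges
    msg_out = [0.0] * n  # information arriving along outgoing edges
    for src, dst in edges:
        msg_in[dst] += p["w_in"] * h[src]
        msg_out[src] += p["w_out"] * h[dst]
    new_h = []
    for v in range(n):
        a = msg_in[v] + msg_out[v]
        z = sigmoid(p["wz"] * a + p["uz"] * h[v])  # update gate
        r = sigmoid(p["wr"] * a + p["ur"] * h[v])  # reset gate
        cand = math.tanh(p["w"] * a + p["u"] * r * h[v])
        new_h.append((1.0 - z) * h[v] + z * cand)
    return new_h


def run_ggnn(h0, edges, p, steps):
    """Apply the update for a fixed number of steps T."""
    h = list(h0)
    for _ in range(steps):
        h = ggnn_step(h, edges, p)
    return h


PARAMS = {"w_in": 0.9, "w_out": -0.4, "wz": 1.0, "uz": 0.5,
          "wr": 0.7, "ur": -0.3, "w": 1.2, "u": 0.8}
initial = [0.5, -0.2, 0.0]  # node attributes used as the step-0 states
final = run_ggnn(initial, [(0, 1), (1, 2)], PARAMS, steps=3)
```

Using separate `w_in` and `w_out` is the scalar analogue of having different parameter matrices for incoming and outgoing edges.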

GG-NNs can be used to perform a variety of tasks on graphs. The per-node representations can be used to make per-node predictions by feeding them to a neural network (shared across nodes). A graph-level predictor can also be obtained using a soft attention architecture, where per-node outputs are used as scores into a softmax in order to pool the representations across the graph, and feed this graph-level representation to a neural network. The attention mechanism can be conditioned on a "question" (e.g. on a task to predict the shortest path in a graph, the question would be the identity of the beginning and end nodes of the path to find), which is fed to the node scorer of the soft attention mechanism. Moreover, the authors describe how to chain GG-NNs to go beyond predicting individual labels and predict sequences.

Experiments on several datasets are presented. These include tasks where a single output is required (on a few bAbI tasks) as well as tasks where a sequential output is required, such as outputting the shortest path or the Eulerian circuit of a graph. Moreover, experiments on a much more complex and interesting program verification task are presented.

arxiv.org
scholar.google.com

Net2Net: Accelerating Learning via Knowledge Transfer
Chen, Tianqi and Goodfellow, Ian J. and Shlens, Jonathon
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents an approach to initialize a neural network from the parameters of a smaller and previously trained neural network. This is effectively done by increasing the size (in width and/or depth) of the previously trained neural network, in such a way that the function represented by the network doesn't change (i.e. the output of the larger neural network is still the same). The motivation here is that initializing larger neural networks in this way accelerates their training, since at initialization the neural network will already be quite good.

In a nutshell, neural networks are made wider by adding several copies (selected randomly) of the same hidden units to the hidden layer, for each hidden layer. To ensure that the neural network output remains the same, each incoming connection weight must also be divided by the number of replicas that unit is connected to in the previous layer. If not training using dropout, it is also recommended to add some noise to this initialization, in order to break its initial symmetry (though this will actually break the property that the network's output is the same). As for making a deeper network, layers are added by initializing them to be the identity function. For ReLU units, this is achieved using an identity matrix as the connection weight matrix. For units based on sigmoid or tanh activations, unfortunately it isn't possible to add such identity layers.
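Here is a small sketch of the widening operation on a network with one ReLU hidden layer and no biases. The weight layout (one row per unit), the function names and the example values are assumptions made for illustration, and the optional symmetry-breaking noise is left out so that the output is preserved exactly.

```python
import random


def relu(v):
    return max(0.0, v)


def forward(x, W1, W2):
    """W1 has one row per hidden unit, W2 has one row per output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def widen(W1, W2, new_width, rng):
    """Grow the hidden layer to new_width units without changing the output."""
    old_width = len(W1)
    # Keep every original unit, then add randomly selected copies.
    mapping = list(range(old_width))
    mapping += [rng.randrange(old_width) for _ in range(new_width - old_width)]
    counts = [mapping.count(j) for j in range(old_width)]
    # A copy keeps the incoming weights of the unit it replicates.
    wide_W1 = [list(W1[j]) for j in mapping]
    # The next layer divides each weight by the number of replicas it sees.
    wide_W2 = [[row[j] / counts[j] for j in mapping] for row in W2]
    return wide_W1, wide_W2


W1 = [[0.5, -1.0], [1.5, 0.25], [-0.3, 0.8]]
W2 = [[1.0, -2.0, 0.5]]
wide_W1, wide_W2 = widen(W1, W2, 5, random.Random(0))
x = [0.7, -0.4]
before, after = forward(x, W1, W2), forward(x, wide_W1, wide_W2)
```

Since all replicas of a unit compute the same activation, splitting the outgoing weight evenly among them leaves the next layer's input unchanged.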

In their experiments on ImageNet, the authors show that this initialization allows them to train larger networks faster than if trained from random initialization. More importantly, they were able to outperform their previous validation set ImageNet accuracy by initializing a very large network from their best Inception network.

arxiv.org
scholar.google.com

Order-Embeddings of Images and Language
Vendrov, Ivan and Kiros, Ryan and Fidler, Sanja and Urtasun, Raquel
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper proposes to learn embeddings of text and/or images according to a dissimilarity metric that is asymmetric and implements the notion of partial order. For example, we'd like the metric to capture that the sentence "a dog in the yard" is more specific than just "a dog". Similarly, given the image of a scene and a caption describing it, we'd also like to capture that the image is more specific than the caption, since captions only describe the main elements of the scene. We'd also like to capture the hypernym relation between single words, e.g. where "woman" is more specific than "person".

To achieve this, they propose to use the following dissimilarity metric:

$$E(x,y) = \|\max(0, y-x)\|^2$$

where $x$ and $y$ are embedding vectors and the max operation is applied element-wise. The way to use this metric is to learn embeddings such that, for a pair $x,y$ where the object (e.g. "a dog in the yard") represented by $x$ is more specific than the object (e.g. "a dog") represented by $y$, then $E(x,y)$ is as small as possible.

For example, let's assume that $x$ and $y$ are the outputs of a neural network, where each output dimension detects a certain concept, i.e. is non-zero only if the concept associated with that dimension is present in the input. For $x$ representing "a dog in the yard", we could expect having only two dimensions that are non-zero: one detecting the concept "dog" (let's note it $x_j$) and another detecting the concept "yard" ($x_k$). For $y$ representing "a dog", only the dimension associated with "dog" ($y_j$) would be non-zero and have the same value as $x_j$. In this situation, it is easy to see that $E(x,y)$ would be 0, but $E(y,x)$ would be greater than zero, thus capturing appropriately the asymmetric relationship between the two.
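The dog/yard example is easy to check numerically. In the sketch below, the three concept dimensions and their activation values are made up for illustration; only the formula for $E$ comes from the summary above.

```python
def order_violation(x, y):
    """E(x, y) = ||max(0, y - x)||^2, with the max applied element-wise."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical concept dimensions: [dog, yard, cat]
dog_in_yard = [1.0, 0.7, 0.0]
dog = [1.0, 0.0, 0.0]

specific_vs_general = order_violation(dog_in_yard, dog)
general_vs_specific = order_violation(dog, dog_in_yard)
```

The first value is zero because every concept active in "a dog" is at least as active in "a dog in the yard"; the second is positive because the "yard" dimension is missing from the more general embedding.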

The authors show in the paper how to leverage this new asymmetric metric in training losses that are appropriate for 3 problems: hypernym detection, caption-image retrieval and textual entailment. They show that the proposed metric yields superior performance on these problems compared to symmetric metrics that have been used by prior work.

arxiv.org
scholar.google.com

Sequence Level Training with Recurrent Neural Networks
Ranzato, Marc'Aurelio and Chopra, Sumit and Auli, Michael and Zaremba, Wojciech
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper is concerned with the problem of predicting a sequence at the output, e.g. using an RNN. It aims at addressing the issue it refers to as exposure bias, which here refers to the fact that while at training time the RNN producing the output sequence is being fed the ground truth previous tokens (words) when producing the next token (something sometimes referred to as teacher forcing, which really is just maximum likelihood), at test time this RNN makes predictions using recursive generation, i.e. it is instead recursively fed by its own predictions (which might be erroneous).

Moreover, it also proposes a training procedure that can take into account a rich performance measure that can't easily be optimized directly, such as the BLEU score for text outputs.

The key observation is that the REINFORCE algorithm could be used to optimize the expectation of such arbitrarily complicated performance measures, for outputs produced by (stochastic) recursive generation. However, REINFORCE is a notoriously unstable training algorithm, which can often work terribly (in fact, the authors mention that they have tried using REINFORCE only, without success). Thus, they instead propose to gradually go from training according to maximum likelihood / teacher forcing to training using the REINFORCE algorithm on the expected performance measure.

The proposed procedure, dubbed MIXER (Mixed Incremental Cross-Entropy Reinforce), goes as follows:
1. Train model to optimize the likelihood of the target sequence, i.e. minimize the per time-step cross-entropy loss.
2. Then, for a target sequence of size T, optimize the cross-entropy for the T-Δ first time steps of the sequence and use Reinforce to get a gradient on the expected loss (e.g. negative BLEU) for the recursive generation of the rest of the Δ time steps.
3. Increase Δ and go back to 2., until Δ is equal to T.
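The three steps above amount to a schedule over which time steps are trained with cross-entropy and which with REINFORCE. A minimal sketch of that schedule, assuming Δ grows by a fixed increment `delta_step` (the summary does not say how Δ is increased, and the function name is hypothetical):

```python
def mixer_schedule(T, delta_step):
    """Yield (xent_steps, reinforce_steps) for each stage of training."""
    # Stage 1: cross-entropy on every time step.
    yield list(range(T)), []
    delta = delta_step
    while True:
        delta = min(delta, T)
        split = T - delta
        # Cross-entropy on the first T - delta steps, REINFORCE on the rest.
        yield list(range(split)), list(range(split, T))
        if delta == T:
            break
        delta += delta_step


stages = list(mixer_schedule(T=5, delta_step=2))
```

In the final stage the cross-entropy list is empty, i.e. the whole sequence is generated recursively and trained on the expected performance measure.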

Experiments on 3 text benchmarks (summarization, machine translation and image captioning) show that this approach yields models that produce much better outputs when not using beam search (i.e. using greedy recursive generation) to generate an output sequence, compared to other alternatives such as regular maximum likelihood and Data as Demonstrator (DaD). DaD is similar to the scheduled sampling method of Bengio et al. (see my note: \cite{conf/nips/BengioVJS15}), in that at training time, some of the previous tokens fed to the model are predicted tokens instead of ground truths. When using beam search, MIXER is only outperformed by DaD on the machine translation task.

arxiv.org
scholar.google.com

MuProp: Unbiased Backpropagation for Stochastic Neural Networks
Gu, Shixiang and Levine, Sergey and Sutskever, Ilya and Mnih, Andriy
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a method for training feed-forward neural networks with stochastic hidden units (e.g. sigmoid belief networks), to optimize the expectation (over the stochastic units) of some arbitrary loss function. While the proposed method is applicable to any type of stochastic units, it is most interesting for the case of discrete stochastic units, since the reparametrization trick of variational autoencoders cannot be applied to backprop through the sampling step. 

In short, the method builds on the likelihood ratio method (of which REINFORCE is a special case) and proposes a baseline (also known as control variate) which, according to the authors, is such that an unbiased gradient is obtained. Specifically, the baseline corresponds to the first-order Taylor expansion of the loss function around some deterministic value of the hidden units (x̄) that doesn't depend on the stochastic hidden units (noted x in the paper).

For a likelihood ratio method to be unbiased, it is required that the expectation of the baseline (times the gradient of the model's log distribution) with respect to the model's distribution be tractable. For the proposed baseline, it can be shown that computing this expectation requires the gradient of the mean (μ) of each stochastic unit in the network with respect to each parameter. The key idea behind the proposed method is that 1) an estimate of this expectation can be obtained simply using mean-field and 2) since mean-field is estimated by a feedforward deterministic pass over the network, it is thus possible to compute the gradients of μ by backpropagation through the mean-field pass (hence the name of the method, MuProp).
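For a single Bernoulli unit, the estimator can be written out and its expectation checked exactly by enumerating both values of the unit. This is only a sketch of the one-unit case, where the correction term is exact; in a multi-layer network it would come from the mean-field pass described above. The quadratic loss `f`, the logit parameterization and all names are assumptions made for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def f(x):
    """Hypothetical loss on the value of the stochastic unit."""
    return (x - 0.3) ** 2


def df(x):
    return 2.0 * (x - 0.3)


def muprop_sample_gradient(x, theta):
    """Gradient estimate for one sample x of a Bernoulli unit with logit theta."""
    mu = sigmoid(theta)
    x_bar = mu                      # deterministic (mean-field) value
    score = x - mu                  # d/dtheta of log p(x)
    baseline = f(x_bar) + df(x_bar) * (x - x_bar)  # first-order Taylor expansion
    dmu_dtheta = mu * (1.0 - mu)
    # Likelihood-ratio term with the baseline removed, plus the
    # correction that adds back the baseline's expected contribution.
    return (f(x) - baseline) * score + df(x_bar) * dmu_dtheta


def expected_estimate(theta):
    """Exact expectation of the estimator over x in {0, 1}."""
    mu = sigmoid(theta)
    return (mu * muprop_sample_gradient(1.0, theta)
            + (1.0 - mu) * muprop_sample_gradient(0.0, theta))


def true_gradient(theta):
    """d/dtheta of E[f(x)] = mu * f(1) + (1 - mu) * f(0)."""
    mu = sigmoid(theta)
    return (f(1.0) - f(0.0)) * mu * (1.0 - mu)
```

Comparing `expected_estimate(theta)` with `true_gradient(theta)` for any `theta` shows the two agree up to floating-point error, which is what unbiasedness means in this setting.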

Experiments show that this method converges much faster than previously proposed unbiased methods and often performs better. Experiments also show that the method obtains competitive performance compared to biased methods (such as the "straight through" method).

arxiv.org
scholar.google.com

A note on the evaluation of generative models
Theis, Lucas and Oord, Aäron Van Den and Bethge, Matthias
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper presents a variety of issues related to the evaluation of image generative models. Specifically, they provide evidence that evaluations of generative models based on the popular Parzen windows estimator or based on a visual fidelity (qualitative) measure both present serious flaws.

The Parzen windows approach to generative modeling evaluation works by taking a finite set of samples generated from a given model and then using those as the centroids of a Parzen windows Gaussian mixture. The constructed Parzen windows mixture is then used to compute a log-likelihood score on a set of test examples.
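For reference, the estimator itself is only a few lines. The sketch below uses scalar data points and a hand-picked bandwidth `sigma` to stay short; both are simplifications for illustration, since the evaluations discussed here are on images.

```python
import math


def parzen_log_likelihood(samples, test_points, sigma):
    """Mean log-density of test_points under a Gaussian mixture
    with one component of standard deviation sigma per model sample."""
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for x in test_points:
        log_terms = [log_norm - 0.5 * ((x - s) / sigma) ** 2 for s in samples]
        # log-sum-exp over components, then divide by the number of components.
        top = max(log_terms)
        log_mix = top + math.log(sum(math.exp(t - top) for t in log_terms))
        total += log_mix - math.log(len(samples))
    return total / len(test_points)


model_samples = [-1.0, 0.2, 0.9, 1.4]
score = parzen_log_likelihood(model_samples, [0.0, 1.0], sigma=0.5)
```

Note that the score depends on the model only through the finite set of samples and on the choice of `sigma`.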

Some of the key observations made in this paper are:
1. A simple, k-means based approach can obtain better Parzen windows performance than using the original training samples for a given dataset, even though these are samples from the true distribution!
2. Even for the fairly low dimensional space of 6x6 image patches, a Parzen windows estimator would require an extremely large number of samples to come close to the true log-likelihood performance of a model.
3. Visual fidelity is a bad predictor of true log-likelihood performance, as it is possible to:
   * Obtain great visual fidelity and arbitrarily low log-likelihood, with a Parzen windows model made of Gaussians with very small variance.
   * Obtain bad visual fidelity and high log-likelihood, by taking a model with high log-likelihood and mixing it with a white noise model, putting as much as 99% of the mixing probability on the white noise model (i.e. which would produce bad samples 99% of the time).
4. Measuring overfitting of a model by taking samples from the model and making sure their training set nearest neighbors are different is ineffective, since it is actually trivial to generate samples that are each visually almost identical to a training example, yet each have a large Euclidean distance to that (visually similar) training example.
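The white noise mixture in observation 3 works because of a simple bound. Writing $p$ for the good model and $q$ for the white noise model (notation introduced here for illustration), and using natural logarithms:

$$\log\big(0.01\,p(x) + 0.99\,q(x)\big) \geq \log\big(0.01\,p(x)\big) = \log p(x) - \log 100 \approx \log p(x) - 4.61$$

So the mixture loses at most about 4.61 nats per example, whatever the noise model does, even though 99% of its samples are noise.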

jmlr.org
scholar.google.com

DRAW: A Recurrent Neural Network For Image Generation
Gregor, Karol and Danihelka, Ivo and Graves, Alex and Rezende, Danilo Jimenez and Wierstra, Daan
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by José Manuel Rodríguez Sotelo 9 years ago

The paper introduces a sequential variational auto-encoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data.

#### What is DRAW:
The deep recurrent attention writer (DRAW) model has two differences with respect to other variational auto-encoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder.

#### What do we gain?
The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem.

#### What follows?
A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder, although this might be less useful since we are already restricting the input of the network.

#### Like:
* As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way.
* The attention model is fully differentiable.

#### Dislike:
* I think a better exposition of the attention mechanism would improve this paper.

papers.nips.cc
scholar.google.com

Training Very Deep Networks
Srivastava, Rupesh Kumar and Greff, Klaus and Schmidhuber, Jürgen
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

Machine learning researchers frequently find that they get better results by adding more and more layers to their neural networks, but the difficulties of initialization and decaying/exploding gradients have been severely limiting. Indeed, the difficulties of getting information to flow through deep neural networks arguably kept them out of widespread use for 30 years. This paper addresses this problem head on and demonstrates one method for training 100 layer nets.

The paper describes an effective method to train very deep neural networks by means of 'information highways', i.e. direct connections to upper network layers. Although the method is a generalization of prior techniques, such as cross-layer connections, the authors have shown it to be effective by experimentation. The contributions are quite novel and well supported by experimental evidence.

papers.nips.cc
scholar.google.com

Path-SGD: Path-Normalized Optimization in Deep Neural Networks
Neyshabur, Behnam and Salakhutdinov, Ruslan and Srebro, Nathan
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

Deep rectified neural networks are over-parameterized in the sense that scaling of the weights in one layer can be compensated for exactly in the subsequent layer. This paper introduces Path-SGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). Path-SGD is thus principled and shown to yield faster convergence for a standard 2 layer rectifier network, across a variety of datasets (MNIST, CIFAR-10, CIFAR-100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets.
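The rescaling symmetry is easy to see on a tiny rectifier network. The sketch below does not implement the Path-SGD update; it only checks that multiplying the first layer by a positive constant and dividing the second layer by it leaves both the output and a path-based norm (the sum over input-output paths of the squared product of weights) unchanged, while the ordinary squared L2 norm of the weights changes. The network, the values and the function names are illustrative assumptions.

```python
def relu(v):
    return max(0.0, v)


def forward(x, W1, W2):
    """W1 has one row per hidden unit, W2 has one row per output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def rescale(W1, W2, c):
    """Scale layer 1 by c > 0 and compensate in layer 2."""
    return ([[c * w for w in row] for row in W1],
            [[w / c for w in row] for row in W2])


def squared_path_norm(W1, W2):
    """Sum over all input -> hidden -> output paths of (product of weights)^2."""
    return sum((w2 * w1) ** 2
               for row2 in W2
               for w2, row1 in zip(row2, W1)
               for w1 in row1)


def squared_l2_norm(W1, W2):
    return sum(w * w for W in (W1, W2) for row in W for w in row)


W1 = [[0.5, -1.0], [1.5, 0.25], [-0.3, 0.8]]
W2 = [[1.0, -2.0, 0.5]]
S1, S2 = rescale(W1, W2, 10.0)
x = [0.7, -0.4]
```

Here `forward(x, W1, W2)` and `forward(x, S1, S2)` agree, as do the two path norms, whereas `squared_l2_norm` differs between the two parameterizations. Plain SGD works with the raw weights and is therefore sensitive to which of these equivalent parameterizations it starts from.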

At its core, Path-SGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. Path-SGD can thus be cast as an approximation to NG, which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion in my opinion. I also believe NG to be a much more direct way to motivate Path-SGD than the heuristics of max-norm regularization.

papers.nips.cc
scholar.google.com

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets
Joulin, Armand and Mikolov, Tomas
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

Endowing recurrent neural networks with memory is clearly one of the most important topics in deep learning and is crucial for real reasoning. The proposed stack-augmented recurrent nets outperform simple RNNs and LSTMs \cite{journals/neco/HochreiterS97} on a series of synthetic problems (learning simple algorithmic patterns). The complexity of the problems is clearly defined, and the behavior of the resulting stack RNN can be well understood and easily analyzed. However, drawing conclusions only from such synthetic data sets is risky. The relevance of these problems to real sequence modeling tasks is uncertain, and the failures of the other models might be greatly reduced by a broader and denser hyper-parameter search. For instance, in \cite{journals/corr/LeJH15}, a very simple trick lets an RNN work very well on a toy task (an adding problem) that seems to require modeling long-term dependencies.

papers.nips.cc
scholar.google.com

Probabilistic Line Searches for Stochastic Optimization
Mahsereci, Maren and Hennig, Philipp
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The authors propose a probabilistic version of the "line search" procedure that is commonly used as a subroutine in many deterministic optimization algorithms. The new technique can be applied when the evaluations of the objective function and its gradients are corrupted by noise. Therefore, the proposed method can be successfully used in stochastic optimization problems, eliminating the requirement of having to specify a learning rate parameter in this type of problem. The proposed method uses a Gaussian process surrogate model for the objective and its gradients. This allows us to obtain a probabilistic version of the conditions commonly used to terminate line searches in the deterministic scenario. The result is a soft version of those conditions that is used to stop the probabilistic line search process. At each iteration within this process, the next evaluation location is selected using Bayesian optimization methods. A series of experiments with neural networks on the MNIST and CIFAR10 datasets validate the usefulness of the proposed technique.

papers.nips.cc
scholar.google.com

Color Constancy by Learning to Predict Chromaticity from Luminance
Chakrabarti, Ayan
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The algorithm presented here is simple and interesting. Pixel luminance, chrominance, and illumination chrominance are all histogrammed, and then evaluation is simply each pixel's luminance voting on each pixel's true chrominance for each of the "memorized" illuminations. The model can be trained generatively by simply counting pixels in the training set, or can be trained end-to-end for a slight performance boost. This algorithm's simplicity and speed are appealing, and additionally it seems like it may be a useful building block for a more sophisticated spatially-varying illumination model.

papers.nips.cc
scholar.google.com

A Convergent Gradient Descent Algorithm for Rank Minimization and Semidefinite Programming from Random Linear Measurements
Zheng, Qinqing and Lafferty, John D.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper presents results on recovery of low-rank semidefinite matrices from linear measurements, using nonconvex optimization. The approach is inspired by recent work on phase retrieval, and combines spectral initialization with gradient descent. The connection to phase retrieval comes because measurements which are linear in the semidefinite matrix $X = Z Z'$ are quadratic in the factors $Z$. The paper proves recovery results which imply that correct recovery occurs when the number of measurements $m$ is essentially proportional to $n r^2$, where $n$ is the dimensionality and $r$ is the rank. The convergence analysis is based on a form of restricted strong convexity (restricted because there is an $r(r-1)/2$-dimensional set of equivalent solutions along which the objective is flat). This condition also implies linear convergence of the proposed algorithm.

The implementation seems awful. When compared to recent implementations, e.g. http://arxiv.org/abs/1408.2467 the performance seems orders of magnitude away from the state of the art -- and being an order of magnitude faster than a general-purpose SDP solver on the nuclear norm does not make it any better. The authors should acknowledge that and compare the results with other codes on some established benchmark (e.g. Lenna), so as to show that the price in terms of run-time brings about much better performance in terms of objective function values (SNR, RMSE) -- which is plausible, but far from certain.

papers.nips.cc
scholar.google.com

Parallel Correlation Clustering on Big Graphs
Pan, Xinghao and Papailiopoulos, Dimitris S. and Oymak, Samet and Recht, Benjamin and Ramchandran, Kannan and Jordan, Michael I.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This work addresses an important special case of the correlation clustering problem: Given as input a graph with edges labeled -1 (disagreement) or +1 (agreement), the goal is to decompose the graph so as to maximize agreement within components. Building on recent work \cite{conf/kdd/BonchiGL14} \cite{conf/kdd/ChierichettiDK14}, this paper contributes two concurrent algorithms, a proof of their approximation ratio, a run-time analysis as well as a set of experiments which demonstrate convincingly the advantage of the proposed algorithms over the state of the art.

papers.nips.cc
scholar.google.com

Logarithmic Time Online Multiclass prediction
Choromanska, Anna and Langford, John
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper proposes a novel online algorithm for constructing a multiclass classifier that enjoys a time complexity logarithmic in the number of classes k. This is done by constructing online a decision tree which locally maximizes an appropriate novel objective function, which measures the quality of a tree according to a combined "balancedness" and "purity" score. A theoretical analysis (of a probably intractable algorithm) is provided via a boosting argument (assuming weak learnability), essentially extending the work of Kearns and Mansour (1996) \cite{conf/stoc/KearnsM96} to the multiclass setup. A concrete algorithm is given for a relaxed problem (but see below), without any guarantees, but it is quite simple, natural and interesting.

papers.nips.cc
scholar.google.com

Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
Shang, Xiaocheng and Zhu, Zhanxing and Leimkuhler, Benedict J. and Storkey, Amos J.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper presents a new method (the "covariance-controlled adaptive Langevin thermostat") for MCMC posterior sampling for Bayesian inference. Along the lines of previous work in scalable MCMC, this is a stochastic gradient sampling method. The presented method aims to decrease parameter-dependent noise (in order to speed up convergence to the given invariant distribution of the Markov chain, and generate beneficial samples more efficiently), while maintaining the desired invariant distribution of the Markov chain. Similar to existing stochastic gradient MCMC methods, this method aims to find use in large-scale machine learning settings (i.e. Bayesian inference with large numbers of observations). Experiments on three models (a normal-gamma model, Bayesian logistic regression, and a discriminative restricted Boltzmann machine) aim to show that the presented method performs better than Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) \cite{10.1016/0370-2693(87)91197-X} and Stochastic Gradient Nose-Hoover Thermostat (SGNHT), two similar existing methods.

papers.nips.cc
scholar.google.com

Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models
Tsiligkaridis, Theodoros and Forsythe, Keith W.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This paper introduces ASUGS (adaptive sequential updating and greedy search), building on the previous work on SUGS by Wang & Dunson 2011 \cite{10.1198/jcgs.2010.07081}, which is a sequential (i.e. online) MAP inference method for DPMMs.

The main contribution of the paper is to provide online updating for the concentration parameter, $\alpha$.

The paper shows that the posterior distribution on $\alpha$ can be expected to behave as a gamma distribution (that depends on the current number of clusters and on $n$) in the large-scale limit, assuming an exponential prior on $\alpha$.

ASUGS uses the mean of this gamma distribution as the $\alpha$ for updating cluster assignments; the remainder of the algorithm proceeds as in SUGS (i.e., using conjugacy to update model parameters in an online fashion, with hard assignments of data to clusters).

The paper also shows that this choice of $\alpha$ is bounded by $\log^\epsilon n$ for an arbitrarily small $\epsilon$, so that we may expect this process to converge, or at the very least be stable, even in large settings.
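As a rough illustration of the update described above, the sketch below computes the mean of a gamma distribution and plugs it into the standard Chinese-restaurant-process assignment prior of a DPMM. The summary does not give the gamma parameters, so the shape `k + 1` and rate `lam + log(n)` used in `adaptive_alpha` are assumptions made here for illustration, as are all function names. The likelihood term that a SUGS-style method would multiply in is omitted.

```python
import math


def gamma_mean(shape, rate):
    """Mean of a Gamma(shape, rate) distribution."""
    return shape / rate


def adaptive_alpha(k, n, lam=1.0):
    """Hypothetical plug-in estimate of alpha after n points and k clusters.

    ASSUMPTION: the gamma has shape k + 1 and rate lam + log(n), where lam is
    the rate of the exponential prior on alpha. The summary only states that
    the gamma depends on the number of clusters and on n.
    """
    return gamma_mean(k + 1.0, lam + math.log(n))


def assignment_prior(counts, alpha):
    """CRP prior over the existing clusters (by size) plus one new cluster."""
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]


counts = [3, 1]  # sizes of the clusters found so far
alpha = adaptive_alpha(k=len(counts), n=sum(counts))
prior = assignment_prior(counts, alpha)  # last entry: open a new cluster
```

A hard-assignment scheme would multiply each entry of `prior` by that cluster's predictive likelihood for the new point and pick the largest product.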

papers.nips.cc
scholar.google.com

Learning with Symmetric Label Noise: The Importance of Being Unhinged
van Rooyen, Brendan and Menon, Aditya Krishna and Williamson, Robert C.
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

The paper presents a solution to binary classification with symmetric label noise (SLN). The authors show that, in order to obtain consistency (w.r.t. the 0-1 loss in the "noiseless" case) while using a convex surrogate, one must use the loss $\ell(v,y) = 1 - vy$, the "unhinged loss", which is shown to enjoy some useful properties, including robustness to SLN. In a more restricted sense of robustness, it is the only such loss, but in any case it overcomes the limitations of other convex losses for the same problem.

Different implications of using the unhinged loss are discussed; the problem of classification under SLN with the unhinged loss and "linear" classifiers is investigated and solved analytically. The authors also present an empirical evaluation to show that their theoretical considerations have practical impact.
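To make the robustness claim concrete, here is a minimal sketch (not the authors' code). Flipping the label with probability $\rho$ turns the expected unhinged loss into $(1 - 2\rho)(1 - vy) + 2\rho$, an affine function of the clean loss, so the minimiser does not change for $\rho < 1/2$. The function names, the noise rate `rho` and the L2 regulariser `lam` are assumptions introduced for illustration; the closed form in `fit_unhinged_linear` is simply what setting the gradient of the regularised objective to zero gives.

```python
def unhinged_loss(v, y):
    """Unhinged loss l(v, y) = 1 - v * y for a score v and a label y in {-1, +1}."""
    return 1.0 - v * y


def expected_noisy_loss(v, y, rho):
    """Expected unhinged loss when the label y is flipped with probability rho."""
    return (1.0 - rho) * unhinged_loss(v, y) + rho * unhinged_loss(v, -y)


def fit_unhinged_linear(xs, ys, lam=1.0):
    """Minimiser of mean(1 - y * <w, x>) + (lam / 2) * ||w||^2.

    Setting the gradient to zero gives w = mean(y * x) / lam.
    """
    n, dim = len(xs), len(xs[0])
    return [sum(y * x[j] for x, y in zip(xs, ys)) / (n * lam) for j in range(dim)]


xs = [[1.0, 2.0], [3.0, 0.0], [-1.0, -1.0], [-2.0, 1.0]]
ys = [1, 1, -1, -1]
w = fit_unhinged_linear(xs, ys)
```

Because the expected noisy loss is an affine transform of the clean loss, label noise only rescales the expected value of `w` by $(1 - 2\rho)$ and leaves its direction, and hence the decision boundary, unchanged.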

papers.nips.cc
scholar.google.com

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Ren, Shaoqing and He, Kaiming and Girshick, Ross B. and Sun, Jian
Neural Information Processing Systems Conference - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by NIPS Conference Reviews 9 years ago

This work proposes a two-stage object detection algorithm based on a convolutional neural network (CNN). The first stage is region proposal, which follows the traditional sliding-window approach but operates on the top-layer feature map of the CNN (the Region Proposal Network, RPN). In the second stage, Fast R-CNN is applied to the proposed regions. Since the convolutional layers are shared between the RPN and Fast R-CNN, and the computation is sped up on a GPU, the algorithm can achieve near real-time speed (5 fps).
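The sliding-window step can be sketched as follows: every cell of the feature map is mapped back to image coordinates and a fixed set of reference boxes ("anchors") is centred there, which the RPN then scores and refines. This is only an illustrative sketch. The summary does not mention anchors, and the stride of 16 pixels, the three scales, the three aspect ratios, and the convention that a ratio means height divided by width are assumptions chosen for this example.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) boxes for every feature-map cell.

    Each cell gets len(scales) * len(ratios) boxes centred on the cell,
    each with area scale**2 and height/width equal to the ratio.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Centre of the cell, expressed in input-image pixels.
            cy = (i + 0.5) * stride
            cx = (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
```

In a full detector, boxes that fall outside the image would be clipped or ignored, and the scored proposals would be pruned with non-maximum suppression before the second stage.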

dx.doi.org
sci-hub
scholar.google.com

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This method improves on the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}:

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, called the RoI pooling layer, which pools each region of interest to a fixed size so that image regions don't have to be rescaled before being passed as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each incoming gradient to the input position that was the argmax of its sub-window in the forward pass; where sub-windows from different RoIs select the same input value, their gradients are summed.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.
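The RoI max pooling and its backward pass from points 2 and 3 can be sketched for a single RoI on a single-channel feature map. This is a simplified illustration, not the reference implementation: the function names, the `(top, left, h, w)` RoI format and the floor/ceil rounding of sub-window boundaries are assumptions made for this example.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature: list of rows; roi: (top, left, h, w) in feature-map cells.
    Returns the pooled grid and, per output cell, the (row, col) of its max.
    """
    top, left, h, w = roi
    pooled = [[0.0] * out_w for _ in range(out_h)]
    argmax = [[None] * out_w for _ in range(out_h)]
    for i in range(out_h):
        r0 = top + (i * h) // out_h            # floor(i * h / H)
        r1 = top + ceil_div((i + 1) * h, out_h)
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ceil_div((j + 1) * w, out_w)
            best = None
            for r in range(r0, r1):
                for c in range(c0, c1):
                    if best is None or feature[r][c] > feature[best[0]][best[1]]:
                        best = (r, c)
            pooled[i][j] = feature[best[0]][best[1]]
            argmax[i][j] = best
    return pooled, argmax


def roi_max_pool_backward(grad_out, argmax, feat_h, feat_w):
    """Route each output gradient to the input cell that produced the max."""
    grad_in = [[0.0] * feat_w for _ in range(feat_h)]
    for i, row in enumerate(argmax):
        for j, (r, c) in enumerate(row):
            grad_in[r][c] += grad_out[i][j]  # accumulates if cells coincide
    return grad_in


feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled, argmax = roi_max_pool(feature, roi=(0, 0, 4, 4), out_h=2, out_w=2)
grad_in = roi_max_pool_backward([[1.0, 1.0], [1.0, 1.0]], argmax, 4, 4)
```

With several RoIs per image, the backward function would be called once per RoI and the resulting `grad_in` maps added together.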

jmlr.org
scholar.google.com

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

A *Batch Normalization* layer is applied immediately after fully connected layers and adjusts the values of the feedforward output so that they have zero mean and unit variance.

It has been used by well-known convolutional neural networks such as GoogLeNet \cite{journals/corr/SzegedyLJSRAEVR14} and ResNet \cite{journals/corr/HeZRS15}.
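A minimal sketch of the training-time forward pass is given below, assuming a mini-batch stored as a list of feature rows. Each feature is normalised with the mean and variance of the current mini-batch and then passed through a learned scale `gamma` and shift `beta`, which the paper introduces so that the layer can still represent the identity. The function name and the default `eps` are assumptions made here, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Normalise each feature (column) over the mini-batch (rows)."""
    n, d = len(batch), len(batch[0])
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    # Per-feature statistics of the current mini-batch (biased variance).
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(d)]
    return [
        [gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
         for j in range(d)]
        for row in batch
    ]


out = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

Both columns of `out` end up close to -1 and +1 even though the raw features differ by a factor of ten, which is the rescaling effect the summary describes.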

arxiv.org
scholar.google.com

Universum Prescription: Regularization using Unlabeled Data
Zhang, Xiang and LeCun, Yann
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

This paper applies temporal convolutional neural networks to character input to learn abstract text concepts. Depending on the application, the model can output the category of a text or the sentiment of a review. The model is trained at the character level and does not require knowledge of syntax or semantic structure. Therefore, the model can work for various languages, including English and Chinese, with little prior knowledge of the language.
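The character-level input can be pictured with a small sketch: the text is turned into a sequence of one-hot vectors over a fixed alphabet, and a temporal (1-D) convolution slides a kernel along that sequence. This is a toy illustration only. The three-letter alphabet, the hand-set kernel and the function names are assumptions made here, and a real model would learn many kernels and stack several layers.

```python
def one_hot_encode(text, alphabet):
    """One frame per character; characters outside the alphabet map to zeros."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    frames = []
    for ch in text.lower():
        vec = [0.0] * len(alphabet)
        if ch in index:
            vec[index[ch]] = 1.0
        frames.append(vec)
    return frames


def temporal_conv(frames, kernel):
    """Valid 1-D convolution (cross-correlation) along the time axis.

    kernel is a list of `width` weight vectors, one per time offset.
    """
    width, depth = len(kernel), len(frames[0])
    return [
        sum(kernel[i][c] * frames[t + i][c]
            for i in range(width) for c in range(depth))
        for t in range(len(frames) - width + 1)
    ]


frames = one_hot_encode("abca", alphabet="abc")
ab_detector = [[1.0, 0.0, 0.0],   # responds to 'a' at offset 0
               [0.0, 1.0, 0.0]]   # responds to 'b' at offset 1
response = temporal_conv(frames, ab_detector)
```

The response peaks where the bigram "ab" occurs, which is how such a kernel can pick up character patterns without any word-level preprocessing.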

arxiv.org
scholar.google.com

Semi-Supervised Web Wrapper Repair via Recursive Tree Matching
Cohen, Joseph Paul and 0003, Wei Ding and Bagherjeiran, Abraham
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This idea is so badass! It uses Simple Tree Matching \cite{journals/spe/Yang91} and extends it to work with HTML and then recursively searches an unseen document to align it with previously seen examples. An overview of the problem of *shift* can be seen on the left of the figure below and the alignment is shown on the right.

http://i.imgur.com/b8EzP42.png
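
For reference, the classic Simple Tree Matching recursion that the method builds on can be sketched as follows. It counts the largest number of node pairs that can be matched top-down while preserving sibling order, using a dynamic program over the children of each matched pair. This shows only the basic algorithm, not the paper's HTML extension or its recursive search; the `(label, children)` tuple representation is an assumption made for this example.

```python
def simple_tree_matching(a, b):
    """Size of the largest order-preserving top-down match between two trees.

    Each tree is a (label, children) pair, with children a list of trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best match using the first i children of a and j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


old_page = ("html", [("body", [("div", []), ("p", [])])])
new_page = ("html", [("body", [("p", [])])])
score = simple_tree_matching(old_page, new_page)
```

Here `score` counts the matched `html`, `body` and `p` nodes; the unmatched `div` is the kind of structural shift a wrapper repair method has to tolerate.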