[link]
**Summary** Representation (or feature) learning with unsupervised learning has yet to yield the kind of results that many believe to be achievable. For example, we’d like to unleash an unsupervised learning algorithm on all web images and then obtain a representation that captures the various factors of variation we know to be present (e.g. objects and people). One popular approach for this is to train a model that assumes a high-level vector representation with independent components. However, despite a large body of literature on such models by now, this so-called disentangling of the factors of variation still seems beyond our reach.

In this short paper, the authors propose an alternative to this approach. They propose that disentangling might be achievable by learning a representation whose dimensions are each separately **controllable**, i.e. that each have an associated policy which changes the value of that dimension **while leaving the other dimensions fixed**.

Specifically, the authors propose to minimize the following objective:

$\mathop{\mathbb{E}}_s\left[\frac{1}{2}||s-g(f(s))||^2_2 \right] - \lambda \sum_k \mathbb{E}_{a,s}\left[\sum_a \pi_k(a|s) \log sel(s,a,k)\right]$

where

- $s$ is an agent’s state (e.g. frame image) which encoder $f$ and decoder $g$ learn to autoencode
- $k$ iterates over all dimensions of the representation space (output of the encoder)
- $a$ iterates over actions that the agent can take
- $\pi_k(a|s)$ is the policy that is meant to control the $k^{\rm th}$ dimension of the representation space $f(s)_k$
- $sel(s,a,k)$ is the selectivity of $f(s)_k$ relative to other dimensions in the representation, at state $s$: $sel(s,a,k) = \mathop{\mathbb{E}}_{s'\sim {\cal P}_{ss'}^a}\left[\frac{|f_k(s')-f_k(s)|}{\sum_{k'} |f_{k'}(s')-f_{k'}(s)| }\right]$
- ${\cal P}_{ss'}^a$ is the conditional distribution over the next step state $s'$ given that you are at state $s$ and are taking action $a$ (i.e. the environment transition distribution).

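
To make the selectivity term concrete, here is a minimal Python sketch of a single-sample estimate of $sel(s,a,k)$ from one observed transition $s \to s'$. The function names and the small `eps` constant are assumptions of this sketch, not something taken from the paper, and the directed variant assumes that only the numerator loses its absolute value.

```python
def selectivity(f_s, f_next, k, eps=1e-8):
    """Single-sample estimate of sel(s, a, k) for one transition s -> s'.

    f_s and f_next are the encoder outputs f(s) and f(s') as lists of floats.
    eps is an assumption of this sketch, added to avoid dividing by zero.
    """
    changes = [abs(n - c) for c, n in zip(f_s, f_next)]
    return changes[k] / (sum(changes) + eps)


def directed_selectivity(f_s, f_next, k, eps=1e-8):
    """Directed variant: the signed change in dimension k is kept.

    Assumption: the denominator still uses absolute values so that it
    stays positive.
    """
    changes = [n - c for c, n in zip(f_s, f_next)]
    return changes[k] / (sum(abs(d) for d in changes) + eps)
```

If an action changes only dimension 0, `selectivity(..., k=0)` is close to 1; if it changes two dimensions by the same amount, each gets about 0.5.
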

One can see that selectivity is higher when the change $|f_k(s')-f_k(s)|$ in dimension $k$ is much larger than the change $|f_{k'}(s')-f_{k'}(s)|$ in the other dimensions $k'$. A directed version of selectivity is also proposed (and I believe was used in the experiments), where the absolute value function is removed and $\log sel$ is replaced with $\log(1+sel)$ in the objective.

The learning objective will thus encourage the discovery of a representation that is informative of the input (in that you can reconstruct it) and for which there exist policies that separately control these dimensions.

Algorithm 1 in the paper describes a learning procedure for optimizing this objective. In brief, for every update, a state $s$ is sampled from which an update for the autoencoder part of the loss can be made. Then, iterating over each dimension $k$, REINFORCE is used to get a gradient estimate of the selectivity part of the loss, to update both the policy $\pi_k$ and the encoder $f$ by using the policy to reach a next state $s'$.

**My two cents** I find this concept very appealing and thought-provoking. Intuitively, I find the idea that valuable features are features which reflect an aspect of our environment that we can control more sensible and possibly less constraining than an assumption of independent features. It also has an interesting analogy with an infant learning about the world by interacting with it.

The caveat is that, unfortunately, this concept is currently fairly impractical, since it requires an interactive environment where an agent can perform actions, something we can’t easily have short of deploying a robot with sensors. Moreover, the proposed algorithm seems to assume that each state $s$ is sampled independently for each update, whereas a robot would observe a dependent stream of states.

Accordingly, the experiments in this short paper are mostly “proof of concept”, on simplistic synthetic environments. Yet they do a good job at illustrating the idea.
To me this means that there’s more interesting work worth doing in what seems to be a promising direction!

[link]
This paper presents a recurrent neural network architecture in which some of the recurrent weights dynamically change during the forward pass, using a Hebbian-like rule. They correspond to the matrices $A(t)$ in the figure below:

![Fast weights RNN figure](http://i.imgur.com/DCznSf4.png)

These weights $A(t)$ are referred to as *fast weights*. Comparatively, the recurrent weights $W$ are referred to as slow weights, since they only change due to normal training and are otherwise kept constant at test time.

More specifically, the proposed fast weights RNN computes a series of hidden states $h(t)$ over time steps $t$, but, unlike regular RNNs, the transition from $h(t)$ to $h(t+1)$ consists of multiple ($S$) recurrent layers $h_1(t+1), \dots, h_{S-1}(t+1), h_S(t+1)$, defined as follows:

$$h_{s+1}(t+1) = f(W h(t) + C x(t) + A(t) h_s(t+1))$$

where $f$ is an element-wise non-linearity such as the ReLU activation. The next hidden state $h(t+1)$ is simply defined as the last "inner loop" hidden state $h_S(t+1)$, before moving to the next time step.

As for the fast weights $A(t)$, they too change between time steps, using the Hebbian-like rule:

$$A(t+1) = \lambda A(t) + \eta h(t) h(t)^T$$

where $\lambda$ acts as a decay rate (to partially forget some of what's in the past) and $\eta$ as the fast weights' "learning rate" (not to be confused with the learning rate used during backprop). Thus, the role played by the fast weights is to rapidly adjust to the recent hidden states and remember the recent past.

In fact, the authors show an explicit relation between these fast weights and memory-augmented architectures that have recently been popular. Indeed, by recursively applying and expanding the equation for the fast weights, one obtains

$$A(t) = \eta \sum_{\tau = 1}^{\tau = t-1}\lambda^{t-\tau-1} h(\tau) h(\tau)^T$$

*(note the difference with Equation 3 of the paper...
I think there was a typo)*

which implies that when computing the $A(t) h_s(t+1)$ term in the expression to go from $h_s(t+1)$ to $h_{s+1}(t+1)$, this term actually corresponds to

$$A(t) h_s(t+1) = \eta \sum_{\tau =1}^{\tau = t-1} \lambda^{t-\tau-1} h(\tau) (h(\tau)^T h_s(t+1))$$

i.e. $A(t) h_s(t+1)$ is a weighted sum of all previous hidden states $h(\tau)$, with each hidden state weighted by an "attention weight" $h(\tau)^T h_s(t+1)$. The difference with many recent memory-augmented architectures is thus that the attention weights aren't computed using a softmax non-linearity.

Experimentally, they find it beneficial to use [layer normalization](https://arxiv.org/abs/1607.06450). Good values for $\eta$ and $\lambda$ seem to be 0.5 and 0.9 respectively. I'm not 100% sure, but I also understand that using $S=1$, i.e. using the fast weights only once per time step, was usually found to be optimal. Also see Figure 3 for the architecture used on the image classification datasets, which is slightly more involved.

The authors present a series of 4 experiments, comparing with regular RNNs (IRNNs, which are RNNs with ReLU units and whose recurrent weights are initialized to a scaled identity matrix) and LSTMs (as well as an associative LSTM for a synthetic associative retrieval task and ConvNets for the two image datasets). Generally, experiments illustrate that the fast weights RNN tends to train faster (in number of updates) and better than the other recurrent architectures. Surprisingly, the fast weights RNN can even be competitive with a ConvNet on the two image classification benchmarks, where the RNN traverses glimpses from the image using a fixed policy.

**My two cents** This is a very thought-provoking paper which, based on the comparison with LSTMs, suggests that fast weights RNNs might be a very good alternative. I'd be quite curious to see what would happen if one were to replace LSTMs with them in the myriad of papers using LSTMs (e.g. all the Seq2Seq work).
Intuitively, LSTMs seem to be able to do more than just attending to the recent past. But, for a given task, if one were to observe that fast weights RNNs are competitive with LSTMs, it would suggest that the LSTM isn't doing something that much more complex. So it would be interesting to determine what are the tasks where the extra capacity of an LSTM is actually valuable and exploitable. Hopefully the authors will release some code, to facilitate this exploration.

The discussion at the end of Section 3 on how exploiting the "memory augmented" view of fast weights is useful to allow the use of minibatches is interesting. However, it also suggests that computations in the fast weights RNN scale quadratically with the sequence size (since in this view, the RNN technically must attend to all previous hidden states, since the beginning of the sequence). This is something to keep in mind, if one were to consider applying this to very long sequences (i.e. much longer than the hidden state dimensionality).

Also, I don't quite get the argument that the "memory augmented" view of fast weights is more amenable to minibatch training. I understand that having an explicit weight matrix $A(t)$ for each minibatch sequence complicates things. However, in the memory augmented view, we also have a "memory matrix" that is different for each sequence, and yet we can handle that fine. The problem I can imagine is that storing a *sequence of arbitrary weight matrices* for each sequence might be storage demanding (and thus perhaps make it impossible to store a forward/backward pass for more than one sequence at a time), while the implicit memory matrix only requires appending a new row at each time step. Perhaps the argument to be made here is more that there's already minibatch compatible code out there for dealing with the use of a memory matrix of stored previous memory states.

This work bears some (partial) resemblance to other recent work, which may serve as food for thought here.
The use of possibly multiple computation layers between time steps reminds me of [Adaptive Computation Time (ACT) RNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/Graves16). Also, expressing a backprop-able architecture that involves updates to weights (here, Hebbian-like updates) reminds me of recent work that does backprop through the updates of a gradient descent procedure (for instance as in [this work](http://www.shortscience.org/paper?bibtexKey=conf/icml/MaclaurinDA15)).

Finally, while I was familiar with the notion of fast weights from the work on [Using Fast Weights to Improve Persistent Contrastive Divergence](http://people.ee.duke.edu/~lcarin/FastGibbsMixing.pdf), I didn't realize that this concept dated as far back as the late 80s. So, for young researchers out there looking for inspiration for research ideas, this paper confirms that looking at the older neural network literature for inspiration is probably a very good strategy :)

To sum up, this is really nice work, and I'm looking forward to the NIPS 2016 oral presentation of it!
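
Coming back to the relation between the explicit fast weight matrix and the attention-like view from the summary above, the equivalence can be checked numerically. The following is a small pure-Python sketch under stated assumptions: $A(1) = 0$, the hidden states are given as plain lists, and all function names are made up for this illustration (none of this is the authors' code).

```python
import random


def fast_weights_explicit(hs, lam, eta):
    """Build A(t) from hs = [h(1), ..., h(t-1)] with the recursion
    A(tau+1) = lam * A(tau) + eta * h(tau) h(tau)^T, assuming A(1) = 0."""
    d = len(hs[0])
    A = [[0.0] * d for _ in range(d)]
    for h in hs:
        A = [[lam * A[i][j] + eta * h[i] * h[j] for j in range(d)]
             for i in range(d)]
    return A


def apply_explicit(A, q):
    """Matrix-vector product A q."""
    return [sum(a * x for a, x in zip(row, q)) for row in A]


def apply_attention(hs, q, lam, eta):
    """Same product in the 'memory' view:
    eta * sum_tau lam^(t - tau - 1) * h(tau) * (h(tau)^T q)."""
    t = len(hs) + 1
    out = [0.0] * len(q)
    for tau, h in enumerate(hs, start=1):
        score = sum(hi * qi for hi, qi in zip(h, q))  # unnormalized attention weight
        w = eta * lam ** (t - tau - 1) * score
        out = [o + w * hi for o, hi in zip(out, h)]
    return out
```

The explicit version stores a $d \times d$ matrix, while the attention version stores all past hidden states, which is where the quadratic cost in sequence length mentioned above comes from.
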
[link]
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick makes it possible to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportion parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions).

One can think of the derivation as proceeding in 3 steps:

1. Deriving an estimator for the gradient of a sample from a 1-dimensional density $f(x)$ such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable: $\frac{\partial \hat{x}}{\partial \theta} = - \frac{1}{f(\hat{x})}\int_{t=-\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$ where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule.
2. Noticing that one can sample from a $D$-dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d|{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d|{\bf x}_{<d})$ is itself a 1-dimensional mixture (i.e. with a differentiable density and a tractable CDF).
3. Using the 1-dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22).

**My two cents** This is a really surprising and neat result.
The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to reading about whether that can be successfully done in practice.

The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to non-diagonal covariances is doable. That said, ancestral sampling with non-diagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involve a matrix inverse.

Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself, as its application could go beyond that case. This is probably why the paper also derived a sampling-based estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10.

Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gave a gradient that is consistent with what the reparameterization trick provides.
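
As a sanity check of the 1-dimensional estimator from step 1, here is a small Python sketch (not the paper's code). It assumes the sample is obtained by inverting the CDF at a fixed uniform value $u$, uses bisection as a stand-in sampler, and treats the mixing weights as unconstrained (their normalization is ignored). Since $\int_{-\infty}^{\hat{x}} \partial f/\partial\theta \, dt = \partial F(\hat{x})/\partial\theta$, the estimator reduces to $-\frac{1}{f(\hat{x})}\frac{\partial F(\hat{x})}{\partial \theta}$, which is easy to compare against finite differences. All function names are hypothetical.

```python
import math


def norm_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mix_pdf(x, pis, mus, sigmas):
    return sum(p * norm_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))


def mix_cdf(x, pis, mus, sigmas):
    return sum(p * norm_cdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))


def sample_by_inversion(u, pis, mus, sigmas, lo=-50.0, hi=50.0, iters=200):
    """Solve F(x) = u by bisection (a simple stand-in for a real sampler)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mix_cdf(mid, pis, mus, sigmas) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def grad_wrt_mixing_weight(x_hat, j, pis, mus, sigmas):
    """d x_hat / d pi_j = -(1 / f(x_hat)) * dF(x_hat)/d pi_j.
    With unconstrained weights, dF/d pi_j is the CDF of component j."""
    return -norm_cdf(x_hat, mus[j], sigmas[j]) / mix_pdf(x_hat, pis, mus, sigmas)


def grad_wrt_mean(x_hat, j, pis, mus, sigmas):
    """Same estimator for a mean: dF/d mu_j = -pi_j * N(x_hat; mu_j, sigma_j)."""
    return pis[j] * norm_pdf(x_hat, mus[j], sigmas[j]) / mix_pdf(x_hat, pis, mus, sigmas)
```

For a single Gaussian, `grad_wrt_mean` returns 1, which agrees with what the reparameterization $\hat{x} = \mu + \sigma\epsilon$ gives for the mean.
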

[link]
This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an end-to-end trained neural network architecture.

Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters ${\bf u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'} $ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling.

While previous work has focused on using rank pooling on hand-designed and fixed representations, this paper proposes to use ConvNet features (pre-trained on ImageNet) for the representation and to backpropagate through rank pooling to fine-tune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21).
Finally, since its gradient requires inverting a matrix (corresponding to a Hessian), the authors propose to either use an efficient procedure for computing it by exploiting properties of sums of rank-one matrices (see Lemma 3) or to simply use an approximation based on a diagonal Hessian.

In experiments on two small scale video activity recognition datasets (UCF-Sports and Hollywood2), the authors show that fine-tuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling.

**My two cents** This paper was eye-opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought-provoking approach to pooling representations in a way that preserves some organizational information about the original representations.

I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guarantees on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability.

I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available.

Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).
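
To illustrate the idea of differentiating through an argmin, here is a toy Python sketch. It deliberately swaps the SVR objective for a ridge regression objective, $\frac{\lambda}{2}||{\bf u}||^2 + \frac{1}{2}\sum_t ({\bf u}^\top {\bf v}_t - t)^2$, which is twice differentiable and small enough to solve by hand in 2 dimensions; it is therefore not the paper's Equation 21, only an instance of the same implicit-differentiation argument (gradient of the argmin equals minus the inverse Hessian times the mixed second derivative). All names are hypothetical.

```python
def solve2(H, b):
    """Solve the 2x2 linear system H x = b."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * b[0] - H[0][1] * b[1]) / det,
            (H[0][0] * b[1] - H[1][0] * b[0]) / det]


def hessian(vs, lam):
    """Hessian of the ridge objective: lam * I + sum_t v_t v_t^T."""
    H = [[lam, 0.0], [0.0, lam]]
    for v in vs:
        for i in range(2):
            for j in range(2):
                H[i][j] += v[i] * v[j]
    return H


def rank_pool(vs, lam=1.0):
    """u* = argmin_u lam/2 ||u||^2 + 1/2 sum_t (u.v_t - t)^2, with t = 1..T."""
    b = [sum(t * v[i] for t, v in enumerate(vs, start=1)) for i in range(2)]
    return solve2(hessian(vs, lam), b)


def rank_pool_grad(vs, s, j, lam=1.0):
    """d u* / d v[s][j] by implicit differentiation (s is a 0-based index):
    -H^{-1} [ (u.v_s - (s+1)) e_j + u_j v_s ]."""
    u = rank_pool(vs, lam)
    v = vs[s]
    resid = u[0] * v[0] + u[1] * v[1] - (s + 1)
    rhs = [u[j] * v[0], u[j] * v[1]]
    rhs[j] += resid
    d = solve2(hessian(vs, lam), rhs)
    return [-d[0], -d[1]]
```

The gradient only needs the solution ${\bf u}^*$ and a linear solve against the Hessian, which is the matrix inversion the summary above refers to.
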

[link]
This paper presents a novel neural network approach (though see [here](https://www.facebook.com/hugo.larochelle.35/posts/172841743130126?pnref=story) for a discussion on prior work) to density estimation, with a focus on image modeling. At its core, it exploits the following property on the densities of random variables. Let $x$ and $z$ be two random variables of equal dimensionality such that $x = g(z)$, where $g$ is some bijective and deterministic function (we'll note its inverse as $f = g^{-1}$). Then the change of variable formula gives us this relationship between the densities of $x$ and $z$:

$p_X(x) = p_Z(z) \left|{\rm det}\left(\frac{\partial g(z)}{\partial z}\right)\right|^{-1}$

Moreover, since the determinant of the Jacobian matrix of the inverse $f$ of a function $g$ is simply the inverse of the determinant of the Jacobian of the function $g$, we can also write:

$p_X(x) = p_Z(f(x)) \left|{\rm det}\left(\frac{\partial f(x)}{\partial x}\right)\right|$

where we've replaced $z$ by its deterministically inferred value $f(x)$ from $x$.

So, the core of the proposed model is in proposing a design for bijective functions $g$ (actually, they design its inverse $f$, from which $g$ can be derived by inversion) that have the properties of being easily invertible and having an easy-to-compute Jacobian determinant. Specifically, the authors propose to construct $f$ from various modules that all preserve these properties and allow the construction of highly non-linear $f$ functions. Then, assuming a simple choice for the density $p_Z$ (they use a multidimensional Gaussian), it becomes possible to both compute $p_X(x)$ tractably and to sample from that density, by first sampling $z\sim p_Z$ and then computing $x=g(z)$.

The building blocks for constructing $f$ are the following:

**Coupling layers**: This is perhaps the most important piece.
It simply computes as its output $b\odot x + (1-b) \odot (x \odot \exp(l(b\odot x)) + m(b\odot x))$, where $b$ is a binary mask (with half of its values set to 0 and the others to 1) over the input of the layer $x$, while $l$ and $m$ are arbitrarily complex neural networks with input and output layers of equal dimensionality. In brief, for dimensions for which $b_i = 1$ it simply copies the input value into the output. As for the other dimensions (for which $b_i = 0$) it linearly transforms them as $x_i * \exp(l(b\odot x)_i) + m(b\odot x)_i$. Crucially, the bias ($m(b\odot x)_i$) and coefficient ($\exp(l(b\odot x)_i)$) of the linear transformation are non-linear transformations (i.e. the output of neural networks) that only have access to the masked input (i.e. the non-transformed dimensions). While this layer might seem odd, it has the important property that it is invertible and the determinant of its Jacobian is simply $\exp(\sum_i (1-b_i) l(b\odot x)_i)$. See Section 3.3 for more details on that.

**Alternating masks**: One important property of coupling layers is that they can be stacked (i.e. composed), and the resulting composition is still a bijection and is invertible (since each layer is individually a bijection) and has a tractable determinant for its Jacobian (since the Jacobian of the composition of functions is simply the multiplication of each function's Jacobian matrix, and the determinant of the product of square matrices is the product of the determinant of each matrix). This is also true even if the mask $b$ of each layer is different. Thus, the authors propose using masks that alternate across layers, by masking a different subset of (half of) the dimensions. For images, they propose using masks with a checkerboard pattern (see Figure 3). Intuitively, alternating masks are better because then after at least 2 layers, all dimensions have been transformed at least once.
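
The coupling layer described above is simple enough to sketch in a few lines of plain Python. In this sketch, `toy_l` and `toy_m` are made-up stand-ins for the neural networks $l$ and $m$ (any functions of the masked input would do), and the function names are hypothetical; the point is only to show why inversion and the log-determinant are cheap.

```python
import math


def toy_l(masked):
    """Hypothetical stand-in for the log-scale network l."""
    s = sum(masked)
    return [math.tanh(0.5 * s + 0.1 * i) for i in range(len(masked))]


def toy_m(masked):
    """Hypothetical stand-in for the bias network m."""
    s = sum(masked)
    return [0.3 * s - 0.2 * i for i in range(len(masked))]


def coupling_forward(x, b):
    """y = b*x + (1-b)*(x*exp(l(b*x)) + m(b*x)), plus log|det Jacobian|."""
    masked = [bi * xi for bi, xi in zip(b, x)]
    l, m = toy_l(masked), toy_m(masked)
    y = [bi * xi + (1 - bi) * (xi * math.exp(li) + mi)
         for bi, xi, li, mi in zip(b, x, l, m)]
    log_det = sum((1 - bi) * li for bi, li in zip(b, l))
    return y, log_det


def coupling_inverse(y, b):
    """Invert the layer. Masked dimensions are copied, so b*y equals b*x
    and l, m can be recomputed exactly from the output."""
    masked = [bi * yi for bi, yi in zip(b, y)]
    l, m = toy_l(masked), toy_m(masked)
    return [bi * yi + (1 - bi) * ((yi - mi) * math.exp(-li))
            for bi, yi, li, mi in zip(b, y, l, m)]
```

Stacking two such layers with masks `[1, 0, 1, 0]` and `[0, 1, 0, 1]` transforms every dimension once, and the log-determinants of the layers simply add up.
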
**Squeezing operations**: Squeezing operations correspond to a reorganization of a 2D spatial layout of dimensions into 4 sets of feature maps with spatial resolutions reduced by half (see Figure 3). This exposes multiple scales of resolution to the model. Moreover, after a squeezing operation, instead of using a checkerboard pattern for masking, the authors propose to use a per channel masking pattern, so that "the resulting partitioning is not redundant with the previous checkerboard masking". See Figure 3 for an illustration.

Overall, the models used in the experiments usually stack a few of the following "chunks" of layers: 1) a few coupling layers with alternating checkerboard masks, 2) followed by squeezing, 3) followed by a few coupling layers with alternating channel-wise masks. Since the output of each chunk of layers must technically be of the same size as the input image, this could become expensive in terms of computations and space when using a lot of layers. Thus, the authors propose to explicitly pass on (copy) to the very last layer ($z$) half of the dimensions after each chunk of layers, adding another chunk of layers only on the other half. This is illustrated in Figure 4b.

Experiments on CIFAR-10, and 32x32 and 64x64 versions of ImageNet show that the proposed model (coined the real-valued non-volume preserving or Real NVP) has competitive performance (in bits per dimension), though slightly worse than the Pixel RNN.

**My Two Cents** The proposed approach is quite unique and thought-provoking. Most interestingly, it is the only powerful generative model I know that combines A) a tractable likelihood, B) an efficient / one-pass sampling procedure and C) the explicit learning of a latent representation. While achieving this required a model definition that is somewhat unintuitive, it is nonetheless mathematically really beautiful!

I wonder to what extent Real NVP is penalized in its results by the fact that it models pixels as real-valued observations.
First, it implies that its estimate of bits/dimension is an upper bound on what it could be if the uniform sub-pixel noise was integrated out (see Equations 3-4-5 of [this paper](http://arxiv.org/pdf/1511.01844v3.pdf)). Moreover, the authors had to apply a non-linear transformation (${\rm logit}(\alpha + (1-\alpha)\odot x)$) to the pixels, to spread the $[0,255]$ interval further over the reals. Since the Pixel RNN models pixels as discrete observations directly, the Real NVP might be at a disadvantage.

I'm also curious to know how easy it would be to do conditional inference with the Real NVP. One could imagine doing approximate MAP conditional inference, by clamping the observed dimensions and doing gradient ascent on the log-likelihood with respect to the value of the remaining dimensions. This could be interesting for image completion, or for structured output prediction with real-valued outputs in general. I also wonder how expensive that would be.

In all cases, I'm looking forward to seeing interesting applications and variations of this model in the future!
[link]
This paper presents a method to train a neural network to make predictions for *counterfactual* questions. In short, such questions are questions about what the result of an intervention would have been, had a different choice for the intervention been made (e.g. *Would this patient have lower blood sugar had she received a different medication?*).

One approach to tackle this problem is to collect data of the form $(x_i, t_i, y_i^F)$ where $x_i$ describes a situation (e.g. a patient), $t_i$ describes the intervention made (in this paper $t_i$ is binary, e.g. $t_i = 1$ if a new treatment is used while $t_i = 0$ would correspond to using the current treatment) and $y_i^F$ is the factual outcome of the intervention $t_i$ for $x_i$. From this training data, a predictor $h(x,t)$ taking the pair $(x_i, t_i)$ as input and outputting a prediction for $y_i^F$ could be trained.

From this predictor, one could imagine answering counterfactual questions by feeding $(x_i, 1-t_i)$ (i.e. a description of the same situation $x_i$ but with the opposite intervention $1-t_i$) to our predictor and comparing the prediction $h(x_i, 1-t_i)$ with $y_i^F$. This would give us an estimate of the change in the outcome, had a different intervention been made, thus providing an answer to our counterfactual question.

The authors point out that this scenario is related to that of domain adaptation (more specifically to the special case of covariate shift) in which the input training distribution (here represented by inputs $(x_i,t_i)$) is different from the distribution of inputs that will be fed at test time to our predictor (corresponding to the inputs $(x_i, 1-t_i)$). If the choice of intervention $t_i$ is evenly spread and chosen independently from $x_i$, the distributions become the same. However, in observational studies, the choice of $t_i$ for some given $x_i$ is often not independent of $x_i$ and made according to some unknown policy. This is the situation of interest in this paper.
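
The setup described above can be sketched with a few lines of Python. The helper names, the sign convention (outcome under $t=1$ minus outcome under $t=0$) and the toy predictor are all assumptions made for illustration; they are not taken from the paper.

```python
def estimated_effect(h, x, t, y_factual):
    """Estimate outcome(t=1) - outcome(t=0) for one observed unit.

    The factual outcome is observed, while the counterfactual one is
    predicted by feeding the flipped intervention 1 - t to the predictor h.
    """
    y_counterfactual = h(x, 1 - t)
    if t == 1:
        return y_factual - y_counterfactual
    return y_counterfactual - y_factual


def flip_treatments(pairs):
    """Inputs the predictor sees at 'test time': same x, opposite t."""
    return [(x, 1 - t) for x, t in pairs]


def mean_x_given_t(pairs, t):
    """Average covariate among units with intervention t."""
    xs = [x for x, ti in pairs if ti == t]
    return sum(xs) / len(xs)


def toy_h(x, t):
    """Hypothetical predictor, used only to exercise the functions above."""
    return 2.0 * x + 1.5 * t
```

With a treatment policy that depends on $x$, for example `[(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]`, the average $x$ among $t=1$ inputs is 1.5 in the factual data but -1.5 after flipping, which is the covariate shift the authors point out.
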
Thus, the authors propose an approach inspired by the domain adaptation literature. Specifically, they propose to have the predictor $h(x,t)$ learn a representation of $x$ that is indiscriminate of the intervention $t$ (see Figure 2 for the proposed neural network architecture). Indeed, this is a notion that is [well established][1] in the domain adaptation literature and has been exploited previously using regularization terms based on [adversarial learning][2] and [maximum mean discrepancy][3]. In this paper, the authors used instead a regularization (noted in the paper as $disc(\Phi_{t=0},\Phi_{t=1})$) based on the so-called discrepancy distance of [Mansour et al.][4], adapting its use to the case of a neural network.

As an example, imagine that in our dataset, a new treatment ($t=1$) was much more frequently used than not ($t=0$) for men. Thus, for men, relatively insufficient evidence for counterfactual inference is expected to be found in our training dataset. Intuitively, we would thus want our predictor to not rely as much on that "feature" of patients when inferring the impact of the treatment.

In addition to this term, the authors also propose incorporating an additional regularizer where the prediction $h(x_i,1-t_i)$ on counterfactual inputs is pushed to be as close as possible to the target $y_{j}^F$ of the observation $x_j$ that is closest to $x_i$ **and** actually had the counterfactual intervention $t_j = 1-t_i$.

The paper first shows a bound relating the counterfactual generalization error to the discrepancy distance. Moreover, experiments simulating counterfactual inference tasks are presented, in which performance is measured by comparing the predicted treatment effects (as estimated by the difference between the observed effect $y_i^F$ for the observed treatment and the predicted effect $h(x_i, 1-t_i)$ for the opposite treatment) with the real effect (known here because the data is simulated).
The paper shows that the proposed approach using neural networks outperforms several baselines on this task.

**My two cents** The connection with domain adaptation presented here is really clever and enlightening. This sounds like a very compelling approach to counterfactual inference, which can exploit a lot of previous work on domain adaptation.

The paper mentions that selecting the hyper-parameters (such as the weights of the regularization terms) in this scenario is not a trivial task. Indeed, measuring performance here requires knowing the true difference in intervention outcomes, which in practice usually cannot be known (e.g. two treatments usually cannot both be given to the same patient at once). In the paper, they somewhat "cheat" by using the ground truth difference in outcomes to measure out-of-sample performance, which the authors admit is unrealistic. Thus, an interesting avenue for future work would be to design practical hyper-parameter selection procedures for this scenario. I wonder whether the *reverse cross-validation* approach we used in our work on our adversarial approach to domain adaptation (see [Section 5.1.2][5]) could successfully be used here.

Finally, I commend the authors for presenting such a nicely written description of the counterfactual inference problem setup in general. I really enjoyed it!

[1]: https://papers.nips.cc/paper/2983analysisofrepresentationsfordomainadaptation.pdf
[2]: http://arxiv.org/abs/1505.07818
[3]: http://ijcai.org/Proceedings/09/Papers/200.pdf
[4]: http://www.cs.nyu.edu/~mohri/pub/nadap.pdf
[5]: http://arxiv.org/pdf/1505.07818v4.pdf#page=16
[link]
This paper can be thought of as proposing a variational autoencoder applied to a form of meta-learning, i.e. where the input is not a single input but a dataset of inputs. For this, in addition to having to learn an approximate inference network over the latent variable $z_i$ for each input $x_i$ in an input dataset $D$, approximate inference is also learned over a latent variable $c$ that is global to the dataset $D$. By using Gaussian distributions for $z_i$ and $c$, the reparametrization trick can be used to train the variational autoencoder.

The generative model factorizes as

$p(D=(x_1,\dots,x_N), (z_1,\dots,z_N), c) = p(c) \prod_i p(z_i|c) p(x_i|z_i,c)$

and learning is based on the following variational posterior decomposition:

$q((z_1,\dots,z_N), c|D=(x_1,\dots,x_N)) = q(c|D) \prod_i q(z_i|x_i,c)$.

Moreover, latent variable $z_i$ is decomposed into multiple ($L$) layers $z_i = (z_{i,1}, \dots, z_{i,L})$. Each layer in the generative model is directly connected to the input. The layers are generated from $z_{i,L}$ to $z_{i,1}$, each layer being conditioned on the previous (see Figure 1 *Right* for the graphical model), with the approximate posterior following a similar decomposition.

The architecture for the approximate inference network $q(c|D)$ first maps all inputs $x_i\in D$ into a vector representation, then performs mean pooling of these representations to obtain a single vector, followed by a few more layers to produce the parameters of the Gaussian distribution over $c$.

Training is performed by stochastic gradient descent, over minibatches of datasets (i.e. multiple sets $D$).

The model has multiple applications, explored in the experiments. One is of summarizing a dataset $D$ into a smaller subset $S\subseteq D$. This is done by initializing $S\leftarrow D$ and greedily removing elements of $S$, each time minimizing the KL divergence between $q(c|D)$ and $q(c|S)$ (see the experiments on a synthetic Spatial MNIST problem of section 5.3).
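
The greedy summarization procedure just described can be sketched as follows. This is only an illustration under strong assumptions: `toy_posterior` is a made-up stand-in for the learned inference network $q(c|D)$ (it mean-pools the raw inputs and uses a variance that shrinks with the set size), the KL direction $KL(q(c|D)\,||\,q(c|S))$ is assumed, and all names are hypothetical.

```python
import math


def toy_posterior(dataset):
    """Hypothetical stand-in for q(c|D): mean pooling over the set gives the
    mean, and the variance 1 / (1 + N) shrinks as the set grows."""
    n, d = len(dataset), len(dataset[0])
    mean = [sum(x[i] for x in dataset) / n for i in range(d)]
    var = [1.0 / (1.0 + n)] * d
    return mean, var


def kl_diag_gauss(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var) pairs."""
    (m1, v1), (m2, v2) = p, q
    return 0.5 * sum(math.log(b / a) + (a + (x - y) ** 2) / b - 1.0
                     for x, a, y, b in zip(m1, v1, m2, v2))


def greedy_summary(dataset, size):
    """Start from S = D and repeatedly drop the element whose removal keeps
    q(c|S) closest (in KL) to q(c|D), until |S| == size (size >= 1)."""
    target = toy_posterior(dataset)
    summary = list(dataset)
    while len(summary) > size:
        best = min(
            range(len(summary)),
            key=lambda i: kl_diag_gauss(
                target, toy_posterior(summary[:i] + summary[i + 1:])),
        )
        del summary[best]
    return summary
```

For instance, with the toy posterior above, `greedy_summary([[0.0], [10.0], [5.0]], 2)` drops `[5.0]`, since removing it leaves the pooled mean unchanged.
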
Another application is few-shot classification, where very few examples of a number of classes are given, and a new test example $x'$ must be assigned to one of these classes. Classification is performed by treating the small set of examples of each class $k$ as its own dataset $D_k$. Then, test example $x'$ is classified into the class $k$ for which the KL divergence between $q(c|x')$ and $q(c|D_k)$ is smallest. Positive results are reported when training on OMNIGLOT classes and testing on either the MNIST classes or unseen OMNIGLOT datasets, when compared to a 1-nearest neighbor classifier based on the raw input or on a representation learned by a regular autoencoder.

Finally, another application is that of generating new samples from an input dataset of examples. The approximate posterior is used to compute $q(c|D)$. Then, $c$ is assigned to its posterior mean, from which a value for the hidden layers $z$ and finally a sample $x$ can be generated. It is shown that this procedure produces convincing samples that are visually similar to those in the input set $D$.

**My two cents**

Another really nice example of deep learning applied to a form of meta-learning, i.e. learning a model that is trained to take *new* datasets as input and generalize even if confronted with datasets coming from an unseen data distribution. I'm particularly impressed by the many tasks explored successfully with the same approach: few-shot classification and generative sampling, as well as a form of summarization (though this last one probably isn't really meta-learning). Overall, the approach is quite elegant and appealing. The very simple, synthetic experiments of sections 5.1 and 5.2 are also interesting. Section 5.2 presents the notion of a *prior-interpolation layer*, which is well motivated but seems to be used only in that section. I wonder how important it is, outside of the specific case of section 5.2.
Overall, I'm very excited by this work, which further explores the theme of meta-learning in an interesting way.
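The few-shot classification rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_posterior` is a hypothetical stand-in for the learned network $q(c|D)$ (it mean-pools the raw inputs and shrinks the variance with the dataset size), and the direction of the KL divergence is my choice, since the summary does not specify it.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) for diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def toy_posterior(dataset):
    """Hypothetical stand-in for q(c|D): mean-pool the inputs of D.

    A trained model would embed each input, pool, and then output the
    Gaussian parameters with a few more layers. Here the variance simply
    shrinks as the dataset grows (an assumption made for illustration).
    """
    n, dim = len(dataset), len(dataset[0])
    mean = [sum(x[d] for x in dataset) / n for d in range(dim)]
    var = [1.0 / n] * dim
    return mean, var

def classify(x_new, class_datasets):
    """Assign x_new to the class k whose q(c|D_k) is closest in KL to q(c|x_new)."""
    mu_x, var_x = toy_posterior([x_new])
    scores = {
        k: kl_diag_gaussians(mu_x, var_x, *toy_posterior(d_k))
        for k, d_k in class_datasets.items()
    }
    return min(scores, key=scores.get)
```

For example, `classify([0.1, 0.0], {"a": [[0.0, 0.0], [0.2, 0.1]], "b": [[5.0, 5.0], [5.2, 4.9]]})` picks class `"a"`, whose pooled statistics are much closer to the query.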
[link]
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as $Y = X + F(X)$ where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately on both terms of a layer's equation: $Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$ where $\Theta_1$ and $\Theta_2$ are independent dropout masks for each term. The paper shows that this form of dropout is at least as good as, or superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much like in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate (from 0 to 0.5) from the first hidden layer to the last. In addition to this, I note the following empirical observations:

1. At test time, averaging the output layers of multiple dropout mask samples (referred to as stochastic inference) is better than replacing the masks by their expectation (deterministic inference), the latter being the usual standard.
2. Comparable performance is achieved by making the ResNet wider (e.g. 4 times) and with fewer layers (e.g. 32) than in the original ResNet work, which used thin but very deep (more than 1000 layers) ResNets. This would confirm a similar observation from [this paper][2].

Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.

[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146
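To make the Swapout equation and the two inference modes concrete, here is a minimal sketch on flat lists of activations. The function names are mine, not the paper's, and a single layer is shown in isolation. For one such layer the two inference modes agree in expectation; in a deep network with nonlinearities between layers they generally do not, which is why the comparison above is meaningful.

```python
import random

def swapout(x, fx, keep_prob, rng):
    """One stochastic pass: Y = Theta1 * X + Theta2 * F(X), elementwise.

    Theta1 and Theta2 are independent Bernoulli(keep_prob) masks.
    """
    out = []
    for xi, fi in zip(x, fx):
        theta1 = 1.0 if rng.random() < keep_prob else 0.0
        theta2 = 1.0 if rng.random() < keep_prob else 0.0
        out.append(theta1 * xi + theta2 * fi)
    return out

def deterministic_inference(x, fx, keep_prob):
    """Replace both masks by their expectation."""
    return [keep_prob * (xi + fi) for xi, fi in zip(x, fx)]

def stochastic_inference(x, fx, keep_prob, n_samples, rng):
    """Average the outputs obtained with several sampled masks."""
    total = [0.0] * len(x)
    for _ in range(n_samples):
        for i, yi in enumerate(swapout(x, fx, keep_prob, rng)):
            total[i] += yi
    return [t / n_samples for t in total]

def linear_drop_schedule(n_layers, final_rate=0.5):
    """Dropout rate per layer, rising linearly from 0 to final_rate."""
    if n_layers == 1:
        return [0.0]
    return [final_rate * layer / (n_layers - 1) for layer in range(n_layers)]
```

The keep probability of a layer is one minus its dropout rate, so the first layer in the schedule keeps everything and the last one keeps half of the units on average.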
[link]
This paper tests the following hypothesis, about features learned by a deep network trained on the ImageNet dataset: *Object features and anticausal features are closely related. Context features and causal features are not necessarily related.*

First, some definitions. Let $X$ be a visual feature (i.e. value of a hidden unit) and $Y$ be information about a label (e.g. the log-odds of the probability of different objects appearing in the image). A causal feature would be one for which the causal direction is $X \rightarrow Y$. An anticausal feature would be the opposite case, $X \leftarrow Y$. As for object features, in this paper they are features whose value tends to change a lot when computed on a complete original image versus when computed on an image whose regions *falling inside* object bounding boxes have been blacked out (see Figure 4). Contextual features are the opposite, i.e. values change a lot when blacking out the regions *outside* object bounding boxes. See section 4.2.1 for how "object scores" and "context scores" are computed following this description, to quantitatively measure to what extent a feature is an "object feature" or a "context feature".

Thus, the paper investigates whether 1) for object features, their relationship with object appearance information is anticausal (i.e. whether the object feature's value seems to be caused by the presence of the object) and whether 2) context features are not clearly causal or anticausal. To perform this investigation, the paper first proposes a generic neural network model (dubbed the Neural Causation Coefficient architecture or NCC) to predict a score of whether the relationship between an input variable $X$ and target variable $Y$ is causal. This model is trained by taking as input datasets of $X$ and $Y$ pairs synthetically generated in such a way that we know whether $X$ caused $Y$ or the opposite.
The NCC architecture first embeds each individual $X$,$Y$ instance pair into some hidden representation, performs mean pooling of these representations and then feeds the result to fully connected layers (see Figure 3). The paper shows that the proposed NCC model actually achieves SOTA performance on the Tübingen dataset, a collection of real-world cause-effect observational samples. Then, the proposed NCC model is used to measure the average object score of features of a deep residual CNN identified as being most causal and most anticausal by NCC. The same is done with the context score. What is found is that indeed, the object score is always higher for the top anticausal features than for the top causal features. However, for the context score, no such clear trend is observed (see Figure 5).

**My two cents**

I haven't been following the growing literature on machine learning for causal inference, so it was a real pleasure to read this paper and catch up a little bit on that. Just for that I would recommend reading this paper. The paper does a really good job at explaining the notion of *observational causal inference*, which in short builds on the observation that if we assume IID noise on top of a causal (or anticausal) phenomenon, then causation can possibly be inferred by verifying in which direction of causation the IID assumption on the noise seems to hold best (see Figure 2 for a nice illustration, where in (a) the noise is clearly IID, but isn't in (b)). Also, irrespective of the study of causal phenomena in images, the NCC architecture, which achieves SOTA causal prediction performance, is in itself a nice contribution. Regarding the application to image features, one thing that is hard to wrap your head around is that, for the $Y$ variable, the log-odds at the output layer are used in the study instead of the true image label.
The paper justifies this choice by highlighting that the NCC network was trained on examples where $Y$ is continuous, not discrete. On one hand, that justification makes sense. On the other, this is odd since the log-odds were in fact computed directly from the visual features, meaning that technically the values of the log-odds are directly caused by all the features (which goes against the hypothesis being tested). My best guess is that this isn't an issue only because NCC makes a causal prediction between *a single feature* and $Y$, not *from all features* to $Y$. I'd be curious to read the authors' perspective on this. Still, this paper at this point is certainly just scratching the surface on this topic. For instance, the paper mentions that NCC could be used to encourage the learning of causal or anticausal features, providing a new and intriguing type of regularization. This sounds like a very interesting future direction for research, which I'm looking forward to.
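The embed, mean-pool, then classify structure of NCC described above can be sketched as follows. This is a toy forward pass only: the single ReLU embedding layer, the logistic output and all weight values are hypothetical choices for illustration, not the architecture or parameters from the paper. The point it shows is that mean pooling makes the score independent of the order of the $(X, Y)$ samples.

```python
import math

def embed(x, y, weights, biases):
    """Per-pair embedding: ReLU(W [x, y] + b), one output per hidden unit."""
    return [max(0.0, w[0] * x + w[1] * y + b) for w, b in zip(weights, biases)]

def ncc_score(pairs, weights, biases, out_weights, out_bias):
    """Toy NCC-style score in (0, 1) for a dataset of (x, y) pairs.

    Each pair is embedded separately, the embeddings are mean-pooled into
    one vector, and a logistic unit turns that vector into a score.
    """
    embeddings = [embed(x, y, weights, biases) for x, y in pairs]
    pooled = [
        sum(e[j] for e in embeddings) / len(embeddings)
        for j in range(len(biases))
    ]
    logit = sum(v * p for v, p in zip(out_weights, pooled)) + out_bias
    return 1.0 / (1.0 + math.exp(-logit))
```

In a trained model the score would be read as the predicted probability of one causal direction; with the arbitrary weights used here it carries no such meaning.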
[link]
This paper proposes a variant of the Neural Turing Machine (NTM) for meta-learning or "learning to learn", in the specific context of few-shot learning (i.e. learning from few examples). Specifically, the proposed model is trained to ingest as input a training set of examples and improve its output predictions as examples are processed, in a purely feed-forward way. This is a form of meta-learning because the model is trained so that its forward pass effectively executes a form of "learning" from the examples it is fed as input.

During training, the model is fed multiple sequences (referred to as episodes) of labeled examples $({\bf x}_1, {\rm null}), ({\bf x}_2, y_1), \dots, ({\bf x}_T, y_{T-1})$, where $T$ is the size of the episode. For instance, if the model is trained to learn how to do 5-class classification from 10 examples per class, $T$ would be $5 \times 10 = 50$. Mainly, the paper presents experiments on the Omniglot dataset, which has 1623 classes. In these experiments, classes are separated into 1200 "training classes" and 423 "test classes", and each episode is generated by randomly selecting 5 classes (each assigned some arbitrary vector representation, e.g. a one-hot vector that is consistent within the episode, but not across episodes) and constructing a randomly ordered sequence of 50 examples from within the chosen 5 classes. Moreover, the correct label $y_t$ of a given input ${\bf x}_t$ is always provided only at the next time step, but the model is trained to be good at its prediction of the label of ${\bf x}_t$ at the current time step. This is akin to the scenario of online learning on a stream of examples, where the label of an example is revealed only once the model has made a prediction.

The proposed NTM is different from the original NTM of Alex Graves, mostly in how it writes into its memory. The authors propose to focus writing to either the least recently used memory location or the most recently used memory location.
Moreover, the least recently used memory location is reset to zero before every write (an operation that seems to be ignored when backpropagating gradients). Intuitively, the proposed NTM should learn a strategy by which, given a new input, it looks into its memory for information from other examples earlier in the episode (perhaps similarly to what a nearest neighbor classifier would do) to predict the class of the new input. The paper presents experiments in learning to do multiclass classification on the Omniglot dataset and regression based on functions synthetically generated by a GP. The highlights are that:

1. The proposed model performs much better than an LSTM and better than an NTM with the original write mechanism of Alex Graves (for classification).
2. The proposed model even performs better than a 1-nearest neighbor classifier.
3. The proposed model is even shown to outperform human performance, for the 5-class scenario.
4. The proposed model has decent performance on the regression task, compared to GP predictions using the ground-truth kernel.

**My two cents**

This is probably one of my favorite ICML 2016 papers. I really think meta-learning is a problem that deserves more attention, and this paper presents both an interesting proposal for how to do it and an interesting empirical investigation of it. Much like previous work [\[1\]][1] [\[2\]][2], learning is based on automatically generating a meta-learning training set. This is clever I think, since a very large number of such "meta-learning" examples (the episodes) can be constructed, thus transforming what is normally a "small data problem" (few-shot learning) into a "big data problem", for which deep learning is more effective. I'm particularly impressed by how the proposed model outperforms a 1-nearest neighbor classifier. That said, the proposed NTM actually performs 4 reads at each time step, which suggests that a fairer comparison might be with a 4-nearest neighbor classifier.
I do wonder how this baseline would compare. I'm also impressed with the observation that the proposed model surpassed humans. The paper also proposes to use 5-letter words to describe classes, instead of one-hot vectors. The motivation is that this should make it easier for the model to scale to many more than 5 classes. However, I don't entirely follow the logic as to why one-hot vectors are problematic. In fact, I would think that arbitrarily assigning 5-letter words to classes would instead imply some similarity between classes that share letters, a similarity that is arbitrary and doesn't reflect true class similarity. Also, while I find it encouraging that the performance for regression of the proposed model is decent, I'm curious about how it would compare with a GP approach that incrementally learns the kernel's hyperparameters (instead of using the ground-truth values, which makes this baseline unrealistically strong).

Finally, I'm still not 100% sure how exactly the NTM is able to implement the type of feed-forward inference I'd expect to be required. I would expect it to learn a memory representation of examples that combines information from the input vector ${\bf x}_t$ *and* its label $y_t$. However, since the label of an input is presented at the following time step in an episode, it is not intuitive to me how the read/write mechanisms are able to deal with this misalignment. My only guess is that since the controller is an LSTM, it can somehow remember ${\bf x}_t$ until it gets $y_t$ and appropriately include the combined information into the memory. This could be supported by the fact that using a non-recurrent feed-forward controller is much worse than using an LSTM controller. But I'm not 100% sure of this either.

All the above being said, this is still a really great paper, which I hope will help stimulate more research on meta-learning. Hopefully code for this paper can eventually be released, which would help in popularizing the topic.
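The episode construction described in this review, including the one-step label offset, can be sketched as below. `make_episode` and its arguments are hypothetical names of mine; integer labels stand in for the one-hot vectors, and `None` plays the role of the null label at the first step.

```python
import random

def make_episode(examples_by_class, n_classes=5, per_class=10, rng=None):
    """Build one episode of (input, previous label) pairs plus targets.

    Classes are drawn at random and relabelled 0..n_classes-1, so a label
    is consistent within an episode but not across episodes.
    """
    if rng is None:
        rng = random.Random()
    classes = rng.sample(sorted(examples_by_class), n_classes)
    label_of = {c: i for i, c in enumerate(classes)}
    stream = [
        (x, label_of[c])
        for c in classes
        for x in rng.sample(examples_by_class[c], per_class)
    ]
    rng.shuffle(stream)

    inputs, targets, previous_label = [], [], None
    for x, y in stream:
        inputs.append((x, previous_label))  # label of x arrives one step later
        targets.append(y)                   # but the model must predict it now
        previous_label = y
    return inputs, targets
```

With the default arguments an episode has $T = 5 \times 10 = 50$ steps, matching the example above.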
[1]: http://snowedin.net/tmp/Hochreiter2001.pdf [2]: http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf 
[link]
This paper presents an unsupervised generative model, based on the variational autoencoder framework, but where the encoder is a recurrent neural network that sequentially infers the identity, pose and number of objects in some input scene (2D image or 3D scene). In short, this is done by extending the DRAW model to incorporate discrete latent variables that determine whether an additional object is present or not. Since the reparametrization trick cannot be used for discrete variables, the authors estimate the gradient through the sampling operation using a likelihood ratio estimator. Another innovation over DRAW is the application to 3D scenes, in which the decoder is a graphics renderer. Since it is not possible to backpropagate through the renderer, gradients are estimated using finite-difference estimates (which require going through the renderer several times). Experiments are presented where the evaluation is focused on the ability of the model to detect and count the number of objects in the image or scene.

**My two cents**

This is a nice, natural extension of DRAW. I'm particularly impressed by the results for the 3D scene setting. Despite the fact that the setup is obviously synthetic and simplistic, I was really surprised that estimating the decoder gradients using finite differences worked at all. It's also interesting to see that the proposed model does surprisingly well compared to a supervised CNN approach that directly predicts the objects' identity and pose. Quite cool! To see the model in action, see [this cute video][1].

[1]: https://www.youtube.com/watch?v=4tc84kKdpY4
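As an illustration of why finite differences "require going through the renderer several times", here is a small sketch that treats a renderer as a black box. `toy_renderer` is a made-up two-pixel function, and the use of central differences is my assumption; the summary does not say which variant the paper uses.

```python
def finite_difference_grad(loss_fn, params, eps=1e-4):
    """Central-difference estimate of d loss / d params.

    Needs two evaluations of loss_fn (hence two renderer calls) per parameter.
    """
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * eps))
    return grad

def toy_renderer(pose):
    """Hypothetical black-box renderer mapping a 2-D pose to a 2-pixel image."""
    x, y = pose
    return [x + y, x * y]

def reconstruction_loss(pose, target=(3.0, 2.0)):
    """Squared error between the rendered image and a target image."""
    image = toy_renderer(pose)
    return sum((a - b) ** 2 for a, b in zip(image, target))
```

Calling `finite_difference_grad(reconstruction_loss, [1.0, 1.0])` gives an estimate close to the analytic gradient of this toy loss, without ever differentiating through `toy_renderer`.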
[link]
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the following 3 ideas:

**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.

**2) Separate BN for the "hiddens to hiddens" preactivation and for the "inputs to hiddens" preactivation.** In other words, 2 separate BN operators are applied, one to each contribution to the preactivation, before summing and passing through the tanh and sigmoid nonlinearities.

**3) Use of the largest time step's BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how to deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the preactivation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.

Among these ideas, I believe the most impactful is 1). The paper mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks. Experiments on 4 datasets confirm the method's success.

**My two cents**

This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.
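The three ideas can be sketched together on plain lists. This is only a schematic of the preactivation computation, assuming the products $W_x x_t$ and $W_h h_{t-1}$ have already been computed for a minibatch; the single shared bias with no per-BN shift is my assumption, and all function names are hypothetical.

```python
import math

def batch_norm(batch, gamma, eps=1e-5):
    """Normalise each feature over the batch, then scale by gamma (no shift)."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) / math.sqrt(var + eps)
    return out

def lstm_preactivation(wx_batch, wh_batch, bias, gamma=0.1):
    """Idea 2: normalise the two contributions separately, then sum.

    Idea 1: gamma defaults to 0.1 rather than the usual 1.
    """
    bn_x = batch_norm(wx_batch, gamma)
    bn_h = batch_norm(wh_batch, gamma)
    return [
        [a + b + c for a, b, c in zip(row_x, row_h, bias)]
        for row_x, row_h in zip(bn_x, bn_h)
    ]

def stats_index(t, n_train_steps):
    """Idea 3: at test time step t (0-based), reuse the statistics of the
    last time step seen in training once t goes past it."""
    return min(t, n_train_steps - 1)
```

At test time one would look up stored population statistics with `stats_index` instead of recomputing them from the batch; that lookup table is left out of this sketch.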
[link]
This paper investigates different paradigms for learning how to answer natural language queries through various forms of feedback. Most interestingly, it investigates whether a model can learn to answer questions correctly when the feedback is presented purely in the form of a sentence (e.g. "Yes, that's right", "Yes, that's correct", "No, that's incorrect", etc.). This latter form of feedback is particularly hard to leverage, since the model has to somehow learn that the word "Yes" is a sign of positive feedback, but not the word "No".

Normally, we'd train a model to directly predict the correct answer to questions based on feedback provided by an expert that always answers correctly. "Imitating" this expert just corresponds to regular supervised learning. The paper however explores other variations on this learning scenario. Specifically, they consider 3 dimensions of variation.

The first dimension of variation is who is providing the answers. Instead of an expert (who is always right), the paper considers the case where the model is instead observing a different, "imperfect" expert whose answers come from a fixed policy that answers correctly only a fraction of the time (the paper looked at 0.5, 0.1 and 0.01). Note that the paper refers to these answers as coming from "the learner" (which should be the model), but since the policy is fixed and actually doesn't depend on the model, I think one can also think of it as coming from another agent, which I'll refer to as the imperfect expert (I think this is also known as "off-policy learning" in the RL world).

The second dimension of variation on the learning scenario that is explored is in the nature of the "supervision type" (i.e. nature of the labels). There are 10 of them (see Figure 1 for a nice illustration). In addition to the real expert's answers only (Type 1), the paper considers other types that instead involve the imperfect expert and fall in one of the two categories below:

1. Explicit positive / negative rewards based on whether the imperfect expert's answer is correct.
2. Various forms of natural language responses to the imperfect expert's answers, which vary from worded positive/negative feedback, to hints, to mentions of the supporting fact for the correct answer.

Also, mixtures of the above are considered.

Finally, the third dimension of variation is how the model learns from the observed data. In addition to the regular supervised learning approach of imitating the observed answers (whether they come from the real expert or the imperfect expert), two other distinct approaches are considered, each inspired by one of the two categories of feedback mentioned above:

1. Reward-based imitation: this simply corresponds to ignoring answers from the imperfect expert for which the reward is not positive (as for when the answers come from the regular expert, they are always used I believe).
2. Forward prediction: this consists in predicting the natural language feedback to the answer of the imperfect expert. This is essentially treated as a classification problem over possible feedback (with negative sampling, since there are many possible feedback responses), that leverages a soft-attention architecture over the answers the expert could have given, which is also informed by the actual answer that was given (see Equation 2).

Also, a mixture of both of these learning approaches is considered.

The paper thoroughly explores experimentally all these dimensions, on two question-answering datasets (single supporting fact bAbI dataset and MovieQA). The neural net model architectures used are all based on memory networks. Without much surprise, imitating the true expert performs best. But quite surprisingly, forward prediction leveraging only natural language feedback to an imperfect expert often performs competitively compared to reward-based imitation.

#### My two cents

This is a very thought-provoking paper!
I very much like the idea of exploring how a model could learn a task based on instructions in natural language. This makes me think of this work \cite{conf/iccv/BaSFS15} on using zero-shot learning to learn a model that can produce a visual classifier based on a description of what must be recognized. Another component that is interesting here is studying how a model can learn without knowing a priori whether a piece of feedback is positive or negative. This sort of makes me think of [this work](http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf) (which is also close to this work \cite{conf/icann/HochreiterYC01}) where a recurrent network is trained to process a training set (inputs and targets) to later produce another model that's applied on a test set, without the RNN explicitly knowing what the training gradients are on this other model's parameters. In other words, it has to effectively learn to execute (presumably a form of) gradient descent on the other model's parameters. I find all such forms of "learning to learn" incredibly interesting.

Coming back to this paper, unfortunately I've yet to really understand why forward prediction actually works. An explanation is given, namely that "this is because there is a natural coherence to predicting true answers that leads to greater accuracy in forward prediction" (see paragraph before conclusion). I can sort of understand what is meant by that, but it would be nice to somehow dig deeper into this hypothesis. Or I might be misunderstanding something here, since the paper mentions that changing how wrong answers are sampled yields a "worse" accuracy of 80% on Task 2 for the bAbI dataset and a policy accuracy of 0.1, but Table 1 reports an accuracy of 54% for this case (which is not better, but worse). Similarly, I'd like to better understand Equation 2, specifically the β* term, and why exactly this is an appropriate form of incorporating which answer was given and why it works.
I really was unable to form an intuition around Equation 2. In any case, I really like that there's work investigating this theme and hope there can be more in the future! 
[link]
This paper suggests a method (NoBackTrack) for training recurrent neural networks in an online way, i.e. without having to do backprop through time. One way of understanding the method is that it applies the [forward method for automatic differentiation](//en.wikipedia.org/wiki/Automatic_differentiation#Forward_accumulation), but since it requires maintaining a large Jacobian matrix (nb. of hidden units times nb. of parameters), they propose a way of obtaining a stochastic (but unbiased!) estimate of that matrix. Moreover, the method is improved by using Kalman filtering on that estimate, effectively smoothing the estimate over time.

#### My two cents

Online training of RNNs is a big, unsolved problem. The current approach people use is to truncate backprop to only a few steps in the past, which is more of a heuristic. This paper makes progress towards a more principled approach. I really like the "rank-one trick" of Equation 7, really cute! And it is quite central to this method too, so good job on connecting those dots!

The authors present this work as being preliminary, and indeed they do not compare with truncated backprop. I really hope they do in a future version of this work. Also, I don't think I buy their argument that the "theory of stochastic gradient descent applies". Here's the reason. The method tracks the Jacobian of the hidden state with respect to the parameters, which they denote $G(t)$. It is updated into $G(t+1)$, using a recursion which is based on the chain rule. However, between computing $G(t)$ and $G(t+1)$, a gradient step is performed during training. This means that $G(t)$ is now slightly stale, and corresponds to the gradient with respect to the old value of the parameters, not the current value. As far as I understand, this implies that $G(t+1)$ (more specifically, its stochastic estimate as proposed in this paper) isn't unbiased anymore. So, unless I'm missing something (which I might!), I don't think we can invoke the theory of SGD as they suggest.
But frankly, that last issue seems pretty unavoidable in the online setting. I suspect this will never be solved, and future research will somehow have to design learning algorithms that are robust to this issue (or develop new theory that shows it isn't one). So overall, kudos to the authors, and I'm really looking forward to reading more about where this research goes!
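The idea behind the "rank-one trick" mentioned above can be checked numerically. As I understand it, a matrix written as a sum of outer products $A = \sum_i v_i w_i^\top$ is replaced by the single outer product $(\sum_i \varepsilon_i v_i)(\sum_i \varepsilon_i w_i)^\top$ with independent random signs $\varepsilon_i$; the cross terms vanish in expectation. The sketch below omits the variance-reducing scaling factors, and the function names are mine, so treat it as an illustration of the principle rather than the paper's exact Equation 7.

```python
import itertools

def outer(v, w):
    """Outer product v w^T as a list of rows."""
    return [[vi * wj for wj in w] for vi in v]

def exact_sum(vs, ws):
    """A = sum_i v_i w_i^T, the matrix we would like to avoid storing."""
    rows, cols = len(vs[0]), len(ws[0])
    total = [[0.0] * cols for _ in range(rows)]
    for v, w in zip(vs, ws):
        for i in range(rows):
            for j in range(cols):
                total[i][j] += v[i] * w[j]
    return total

def rank_one_estimate(vs, ws, signs):
    """Rank-one estimate (sum_i s_i v_i)(sum_i s_i w_i)^T for signs s_i = +-1."""
    v_mix = [sum(s * v[i] for s, v in zip(signs, vs)) for i in range(len(vs[0]))]
    w_mix = [sum(s * w[j] for s, w in zip(signs, ws)) for j in range(len(ws[0]))]
    return outer(v_mix, w_mix)

def expected_estimate(vs, ws):
    """Exact expectation: average the estimate over all 2^k sign patterns."""
    patterns = list(itertools.product([-1, 1], repeat=len(vs)))
    rows, cols = len(vs[0]), len(ws[0])
    mean = [[0.0] * cols for _ in range(rows)]
    for signs in patterns:
        estimate = rank_one_estimate(vs, ws, signs)
        for i in range(rows):
            for j in range(cols):
                mean[i][j] += estimate[i][j] / len(patterns)
    return mean
```

Comparing `expected_estimate(vs, ws)` with `exact_sum(vs, ws)` on small inputs confirms that the estimate is unbiased, even though each individual sample only requires storing two vectors.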
[link]
SGD is a widely used optimization method for training the parameters of some model $f$ on some given task. Since the convergence of SGD is related to the variance of the stochastic gradient estimate, there's been a lot of work on trying to come up with such stochastic estimates with smaller variance. This paper does it using an importance sampling (IS) Monte Carlo estimate of the gradient, and learning the proposal distribution $q$ of the IS estimate. The proposal distribution $q$ is parametrized in some way, and is trained to minimize the variance of the gradient estimate. It is trained simultaneously with the model $f$ that SGD (i.e. the SGD that uses IS to get its gradient) is training. To make this whole story more recursive, the proposal distribution $q$ is also trained with SGD :) This makes sense, since one expects the best proposal to depend on the value of the parameters of model $f$, so the best proposal $q$ should vary as $f$ is trained.

One application of this idea is in optimizing a classification model over a distribution that is imbalanced class-wise (e.g. there are classes with far fewer examples). In this case, the proposal distribution determines how frequently we sample examples from each class (conditioned on the class, training examples are chosen uniformly).

#### My two cents

This is a really cool idea. I particularly like the application to training on an imbalanced classification problem. People have mostly been using heuristics to tackle this problem, such as initially sampling each class equally often, and then fine-tuning/calibrating the model using the real class proportions. This approach instead proposes a really elegant, coherent solution to this problem. I would have liked to see a comparison with that aforementioned heuristic (for mainly selfish reasons :) ).
They instead compare with an importance sampling approach with a proposal that assigns the same probability to each class, which is a reasonable alternative (though I don't know if it's used as often as the more heuristic approach). There are other applications, to matrix factorization and reinforcement learning, that are presented in the paper and seem neat, though I haven't gone through those as much. Overall, one of my favorite papers this year: it's original, tackles a problem for which I've always hated the heuristic solution I'm using now, proposes an elegant solution to it, and is applicable even more widely than that setting.
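A small worked example may help to see why the choice of proposal matters in the imbalanced-class setting. The sketch below makes a strong simplification that is mine, not the paper's: each class $k$ has a single scalar gradient $g_k$, so the true gradient is $\sum_k p_k g_k$ and the IS estimate is $g_k p_k / q_k$ with $k \sim q$. It computes the mean and variance of that estimate exactly, and uses the classic result that $q_k \propto p_k |g_k|$ is the best proposal for a scalar quantity.

```python
def is_mean(class_grads, p, q):
    """Exact expectation of the IS estimate g_k * p_k / q_k with k ~ q."""
    return sum(qk * (g * pk / qk) for g, pk, qk in zip(class_grads, p, q))

def is_variance(class_grads, p, q):
    """Exact variance of the IS estimate under proposal q."""
    second_moment = sum(
        qk * (g * pk / qk) ** 2 for g, pk, qk in zip(class_grads, p, q)
    )
    return second_moment - is_mean(class_grads, p, q) ** 2

def optimal_proposal(class_grads, p):
    """Variance-minimising proposal for scalar gradients: q_k ~ p_k * |g_k|."""
    weights = [pk * abs(g) for g, pk in zip(class_grads, p)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_is_gradient(class_grads, p, q, rng):
    """Draw one class from q and return the reweighted gradient."""
    k = rng.choices(range(len(q)), weights=q)[0]
    return class_grads[k] * p[k] / q[k]
```

With class proportions `p = [0.9, 0.1]` and gradients `[1.0, 5.0]`, every proposal gives the same mean, but sampling from `p` itself has a larger variance than the uniform proposal, and `optimal_proposal` drives the variance to zero in this scalar toy case. In the paper the proposal is learned with SGD rather than computed in closed form.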
[link]
This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a unique unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer cost controlled by hyperparameters). In more detail:

* The encoding (forward propagation) adds noise (Gaussian) at all layers, while decoding is noise-free.
* The target at each layer is the result of noiseless forward propagation.
* Direct connections (also known as skip-connections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus resembles a ladder (hence the name Ladder Networks).
* Miniature neural networks with a single hidden unit and skip-connections are used to decode the left and top layers into a reconstruction. Each network is applied element-wise (without parameter sharing across reconstructed units).
* The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class log-likelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noise-free forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective).

Batch normalization is used to train the network. Since the model combines unsupervised and supervised learning, it can be used for semi-supervised learning, where unlabeled examples can be used to update the network using the unsupervised objective only. State of the art results in the semi-supervised setting are presented, for both the MNIST and CIFAR-10 datasets.

#### My two cents

What I find most exciting about this paper is its performance.
On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders, trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL) that, while recent progress in DL applied on large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in the low-labeled-data regime, in the semi-supervised setting.

Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on doing effective hyperparameter search with a small validation set.

Generally, I really hope this paper will stimulate more research on DL methods for the specific case of small labeled dataset / large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge, which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled datasets for each and every task, on our way to real AI.
[link]
This paper presents a neural network architecture that takes as input a question and a sequence of facts expressed in natural language (i.e. a sequence of words) and produces as its output the answer to that question. The main components of the architecture are as follows:

* The question (q) and the facts (f_1, ... , f_K) are each individually transformed into a fixed-size vector using the same GRU RNN (with the last hidden layer serving as the vector representation).
* These vectors are each passed through "reasoning layers", where each layer transforms the question q and the facts f_k into a new vector representation. This is done by feeding each question-fact pair (q,f_k) to a neural network that outputs a new representation for the fact f_k (which replaces its old representation in the layer), as well as a new representation for the question. All K new question representations are then pooled to obtain a single question representation that replaces the old one in the layer.
* The last reasoning layer is either fed to a softmax layer for binary questions, or to a scoring layer for questions with multiple and varying candidate answers.

This so-called Neural Reasoner can be trained by backpropagation, in an end-to-end, supervised way. The authors also suggest the use of auxiliary tasks to improve results. The first ("original") adds an autoencoder reconstruction cost, which reproduces the question and facts from their first-layer encoding. The second ("abstract") instead reconstructs a more abstract version of the sentences (e.g. "The triangle is above the pink rectangle." becomes "x is above y"). Importantly, while the Neural Reasoner framework is presented in this paper as covering many different variants, the version that is experimentally tested is one where the fact representations f_k are actually left unchanged throughout the reasoning layers, with only the question representation being changed.
The paper presents experiments on two synthetic reasoning tasks and reports performance that compares favorably with previously published alternatives (based on the general Memory Network architecture). The experiments also show that the auxiliary tasks can substantially improve the performance of the model.

#### My two cents

The proposed Neural Reasoner framework is actually very close to work published on arXiv at about the same time on End-to-End Memory Networks \cite{conf/nips/SukhbaatarSWF15}. In fact, the version tested in the paper, with unchanged fact representations throughout layers, is extremely close to End-to-End Memory Networks. That said, there are also lots of differences. For instance, this paper proposes the use of multilayer networks within each Reasoning Layer, to produce updated question representations. In fact, experiments suggest that using several layers can be very beneficial for the path finding task. The sentence representation at the first layer is also different, being based on a non-linear RNN instead of linear operations on embeddings as in Memory Networks.

The most interesting aspect of this paper to me is probably the demonstration that the use of an auxiliary task such as "original", which is unsupervised, can substantially improve the performance, again for the path finding task. That is, to me, probably the most exciting direction of future research that this paper highlights as promising.

I also liked how the model is presented. It didn't take me much time to understand the model, and I actually found it easier to absorb than the Memory Network model, despite both being very similar. I think this model is indeed a bit simpler than Memory Networks, which is a good thing. It also suggests a different approach to the problem, one where the fact representations are also updated during forward propagation, not just the question's representation (which is the version initially described in the paper...
I hope experiments on that variant are eventually presented).

It's unfortunate that the authors only performed experiments on 2 of the 20 synthetic question-answering tasks. I hope a future version of this work can report results on the full benchmark and directly compare with End-to-End Memory Networks. I was also unable to find out which of the question representation pooling mechanisms (section 3.2.2) was used in the experiments. Perhaps the authors forgot to state it? Overall, a pretty interesting paper that opens different doors towards reasoning with neural networks.
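To make the reasoning layer described in this review more concrete, here is a minimal NumPy sketch of one layer in the tested variant, where the facts are left unchanged and only the question is updated. The two-layer `tanh` network, the weight names `W1`, `b1`, `W2`, `b2`, and the choice of max or mean pooling are all assumptions for illustration; as noted above, the paper does not make clear which pooling was used in the experiments.

```python
import numpy as np

def reasoning_layer(q, facts, W1, b1, W2, b2, pool="max"):
    """One hypothetical reasoning layer.

    q:     (d,)   current question representation
    facts: (K, d) fact representations (returned unchanged, as in the tested variant)
    """
    K = facts.shape[0]
    # Build the K question-fact pairs (q, f_k) as concatenated vectors.
    pairs = np.concatenate([np.tile(q, (K, 1)), facts], axis=1)   # (K, 2d)
    hidden = np.tanh(pairs @ W1 + b1)                              # (K, h)
    q_candidates = np.tanh(hidden @ W2 + b2)                       # (K, d)
    # Pool the K candidate question updates into a single new question vector.
    if pool == "max":
        q_new = q_candidates.max(axis=0)
    else:
        q_new = q_candidates.mean(axis=0)
    return q_new, facts
```

Because the pooling is symmetric, the updated question does not depend on the order in which the facts are presented within a layer, which is one property worth keeping in mind when comparing against Memory Networks.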
[link]
This paper proposes to train a neural network generative model by optimizing an importance sampling (IS) weighted estimate of the log probability under the model. The authors show that the case of an estimate based on a single sample actually corresponds to the learning objective of variational autoencoders (VAE). Importantly, they exploit this connection by showing that, similarly to VAE, a gradient can be passed through the approximate posterior (the IS proposal) samples, thus yielding an importance weighted autoencoder (IWAE). The authors also show that, by using more samples, this objective, which is a lower bound on the actual log-likelihood, becomes an increasingly tight approximation to the log-likelihood. In other words, the IWAE is expected to better optimize the real log-likelihood of the neural network, compared to VAE. The experiments presented show that the model achieves competitive performance on a version of the binarized MNIST benchmark and on the Omniglot dataset.

#### My two cents

This is a really neat contribution! While simple (both conceptually and algorithmically), it really seems to be an important step forward for the VAE framework. I really like the theoretical result showing that IWAE provides a better approximation to the real log-likelihood. It's quite neat and provides an excellent motivation for the method.

The results on binarized MNIST are certainly impressive. Unfortunately, it appears that the training setup isn't actually comparable to the majority of published results on this dataset. Indeed, it seems that they didn't use the stochastic but *fixed* binarization of the inputs that other publications on this benchmark have used (since my paper on NADE with Iain Murray, we've made available that fixed training set for everyone to use, along with fixed validation and test sets as well). I believe instead they've resampled the binarization for each minibatch, effectively creating a setup with a somewhat larger training set than usual.
It's unfortunate that this is the case, since it makes this result effectively impossible to compare directly with previous work. I'm being picky on this issue only because I'm super interested in this problem (that of generative modeling with neural networks) and this little issue is pretty much the only thing that stops this paper from being a slam dunk. Hopefully the authors (or perhaps someone interested in reimplementing IWAE) can clarify this question eventually.

Otherwise, it seems quite clear to me that IWAE is an improvement over VAE. The experiments of section 5.2, showing that fine-tuning a VAE model with IWAE training improves performance, while fine-tuning an IWAE model using VAE actually makes things worse, are further evidence that IWAE is indeed a good idea.
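The claim that the bound tightens with more samples can be checked numerically on a toy model where the exact log-likelihood is known. The sketch below is not the paper's neural network setup: it assumes a one-dimensional model with prior $z \sim N(0,1)$ and likelihood $x|z \sim N(z,1)$, so that $p(x) = N(0, 2)$, and a deliberately poor proposal $q(z|x) = N(0,1)$. The function names and the choice of proposal are illustrative assumptions.

```python
import numpy as np

def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

def iwae_bound(x, k, n_rep, rng, q_mean=0.0, q_std=1.0):
    """Monte Carlo estimate of E[log (1/k) sum_i w_i], with w_i = p(x, z_i) / q(z_i | x).

    With k = 1 this is the usual VAE lower bound (ELBO).
    """
    z = rng.normal(q_mean, q_std, size=(n_rep, k))
    log_w = (log_normal(z, 0.0, 1.0)          # log p(z)
             + log_normal(x, z, 1.0)          # log p(x | z)
             - log_normal(z, q_mean, q_std))  # log q(z | x)
    # Numerically stable log-mean-exp over the k importance samples.
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    return log_mean_w.mean()

def true_log_px(x):
    """Exact log p(x) for the toy model: x ~ N(0, 2)."""
    return log_normal(x, 0.0, np.sqrt(2.0))
```

Calling `iwae_bound(1.5, k, 20000, np.random.default_rng(0))` for increasing `k` should give values that increase towards `true_log_px(1.5)`, which is the behavior the theoretical result predicts.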
[link]
This paper considers the problem of structured output prediction, in the specific case where the output is a sequence and we represent the sequence as a (conditional) directed graphical model that generates from the first token to the last. The paper starts from the observation that training such models by maximum likelihood (ML) does not reflect well how the model is actually used at test time. Indeed, ML training implies that the model is effectively trained to predict each token conditioned on the previous tokens *from the ground truth* sequence (this is known as "teacher forcing"). Yet, when making a prediction for a new input, the model will actually generate a sequence by generating tokens one after another and conditioning on *its own predicted tokens* instead.

So the authors propose a different training procedure, where at training time each *conditioning* ground truth token is sometimes replaced by the model's previous prediction. The choice of replacing the ground truth by the model's prediction is made by "flipping a coin" with some probability, independently for each token. Importantly, the authors propose to start with a high probability of using the ground truth (i.e. start close to ML) and anneal that probability closer to 0, according to some schedule (thus the name Scheduled Sampling). Experiments on 3 tasks (image caption generation, constituency parsing and speech recognition), based on neural networks with LSTM units, demonstrate that this approach indeed improves over ML training in terms of the various performance metrics appropriate for each problem, and yields better sequence prediction models.

#### My two cents

Big fan of this paper. It both identifies an important flaw in how sequential prediction models are currently trained and, most importantly, suggests a solution that is simple yet effective.
I also believe that this approach played a non-negligible role in Google's winning system for image caption generation, in the Microsoft COCO competition.

My alternative interpretation of why Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make. In terms of ML, it is as bad to put high probability on an output sequence that has just 1 token that's wrong as it is to put the same amount of probability on a sequence that has all tokens wrong. Yet, say for image caption generation, outputting a sentence that is one word away from the ground truth is clearly preferable to making a mistake on all words (something that is also reflected in the performance metrics, such as BLEU). By training the model to be robust to its own mistakes, Scheduled Sampling ensures that errors won't accumulate and makes predictions that are entirely off much less likely.

An alternative to Scheduled Sampling is DAgger (Dataset Aggregation: \cite{journals/jmlr/RossGB11}), which, briefly put, alternates between training the model and adding to the training set examples that mix model predictions and the ground truth. However, Scheduled Sampling has the advantage that there is no need to explicitly create and store that increasingly large dataset of sampled examples, something that isn't appealing for online learning or learning on large datasets.

I'm also very curious about one of the directions of future work mentioned in the conclusion: figuring out a way to backprop through the stochastic predictions made by the model. Indeed, as the authors point out, the current algorithm ignores the fact that, by sometimes taking as input its previous prediction, this induces an additional relationship between the model's parameters and its ultimate prediction, a relationship that isn't taken into account during training.
To take it into account, you'd need to somehow backpropagate through the stochastic process that generated the previous token prediction. While the work on variational autoencoders has shown that we can backprop through Gaussian samples, backpropagating through the sampling of a discrete multinomial distribution is essentially an open problem. I do believe that there is work that tried to tackle propagating through stochastic binary units however, so perhaps that's a start. Anyway, if the authors could make progress on that specific issue, it could be quite useful not just in the context of Scheduled Sampling, but possibly in the context of training networks with discrete stochastic units in general!
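The coin-flipping procedure summarized in this review can be sketched in a few lines. This is only an illustration of how the conditioning tokens are chosen, not the authors' training code: `predict_next` is a hypothetical stand-in for the model (it maps the inputs fed so far to a predicted token), `start_token` is an assumed start-of-sequence symbol, and the linear decay is just one possible annealing schedule.

```python
import numpy as np

def linear_decay(step, total_steps, eps_min=0.0):
    """Probability of feeding the ground-truth token, annealed from 1 towards eps_min."""
    return max(eps_min, 1.0 - step / total_steps)

def scheduled_sampling_inputs(targets, predict_next, eps, rng, start_token=0):
    """Build the sequence of conditioning tokens for one training example.

    At each position, the previous ground-truth token is used with probability eps;
    otherwise the model's own prediction for that token is fed back instead.
    """
    inputs = [start_token]
    for t in range(1, len(targets)):
        predicted = predict_next(inputs)      # model's own guess for targets[t - 1]
        use_truth = rng.random() < eps        # independent coin flip per token
        inputs.append(targets[t - 1] if use_truth else predicted)
    return inputs
```

With `eps = 1.0` this reduces to teacher forcing, and with `eps = 0.0` the model is conditioned only on its own predictions, which matches the two extremes of the schedule. Note that the gradient is not propagated through the choice of `predicted`, in line with the limitation discussed above.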
[link]
This paper presents an interpretation of dropout training as performing approximate Bayesian learning in a deep Gaussian process (DGP) model. This connection suggests a very simple way of obtaining, for networks trained with dropout, estimates of the model's output uncertainty. This estimate is computed from an ensemble of networks, each obtained by sampling a new dropout mask.

#### My two cents

This is a really nice and thought provoking contribution to our understanding of dropout. Unfortunately, the paper in fact doesn't provide a lot of comparisons with either other ways of estimating the predictive uncertainty of deep networks, or with other approximate inference schemes in deep GPs (actually, see update below). The qualitative examples provided however do suggest that the uncertainty estimate isn't terrible.

Irrespective of the quality of the uncertainty estimate suggested here, I find the observation itself really valuable. Perhaps future research will then shed light on how useful that method is compared to other approaches, including Bayesian dark knowledge \cite{conf/nips/BalanRMW15}.

`Update: On September 27th`, the authors uploaded to arXiv a new version that now includes comparisons with 2 alternative Bayesian learning methods for deep networks, specifically the stochastic variational inference approach of Graves and probabilistic backpropagation of Hernández-Lobato and Adams. Dropout actually does very well against these baselines and, across datasets, is almost always amongst the best performing methods!
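The ensemble-over-masks estimate described in this review is easy to sketch. The code below assumes a hypothetical one-hidden-layer ReLU regression network with given weights `W1`, `b1`, `W2`, `b2`, and it reports only the sample mean and sample variance across dropout masks; it omits any additional noise term that a full predictive variance would include.

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, W2, b2, drop_prob, n_samples, rng):
    """Predictive mean and variance from n_samples stochastic forward passes.

    Each pass samples a fresh dropout mask over the hidden units, so the set of
    passes acts as an ensemble of networks sharing the same trained weights.
    """
    outputs = []
    for _ in range(n_samples):
        hidden = np.maximum(0.0, x @ W1 + b1)
        keep = rng.random(hidden.shape) >= drop_prob    # True where the unit is kept
        hidden = hidden * keep / (1.0 - drop_prob)      # inverted-dropout scaling
        outputs.append(hidden @ W2 + b2)
    outputs = np.stack(outputs)                         # (n_samples, batch, out)
    return outputs.mean(axis=0), outputs.var(axis=0)
```

Setting `drop_prob = 0.0` recovers the deterministic network with zero variance, which is a convenient sanity check that the spread really comes from the sampled masks.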
[link]
This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop.

The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added to the weights could instead be added directly to the pre-activation (i.e. before the activation function) vectors during forward propagation. This is because computing the pre-activation is a linear operation, so noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can then efficiently add independent noise to the pre-activation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for the whole minibatch. The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence.

Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 0-1 masking noise) can be seen as the local reparameterization trick version of an SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is a hyperparameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB.
In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model. Experiments on MNIST confirm that tuning that parameter works and allows reaching good performance across various network sizes, compared to using a default dropout rate.

#### My two cents

This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed making a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that this connection yielded a method for tuning the dropout rate automatically.

I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities.

Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done.
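The equivalence behind the local reparameterization trick can be illustrated directly. The sketch below assumes a single linear layer with a factorized Gaussian posterior over weights, with mean matrix `M` and standard deviation matrix `S` (both names are illustrative). The first function samples a separate weight matrix per example, which is the expensive route; the second samples the pre-activations directly from the Gaussian they would follow.

```python
import numpy as np

def preactivation_weight_noise(X, M, S, rng):
    """Expensive route: sample one noisy weight matrix per example, then multiply."""
    n = X.shape[0]
    W = M + S * rng.standard_normal((n,) + M.shape)    # (n, d_in, d_out)
    return np.einsum("ni,nio->no", X, W)

def preactivation_local_noise(X, M, S, rng):
    """Local reparameterization: sample the pre-activations directly.

    For independent Gaussian weights, each pre-activation is Gaussian with
    mean X @ M and variance (X ** 2) @ (S ** 2).
    """
    mean = X @ M
    std = np.sqrt((X ** 2) @ (S ** 2))
    return mean + std * rng.standard_normal(mean.shape)
```

Both functions produce pre-activations with the same distribution, but the second needs noise of size `(n, d_out)` rather than `(n, d_in, d_out)`, which is what makes per-example noise affordable within a minibatch.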
[link]
This paper suggests a novel explanation for why dropout training is helpful: because it corresponds to an adaptive data augmentation method. Indeed, the authors point out that, when sampling a mask of the hidden units in a network (effectively setting the corresponding units to 0), the same effect would have been obtained by feeding as input an example tailored to yield activations of 0 for these units and otherwise the same activation for all other units. Since this "ghost" example will have to be different from the original example, and since each different mask would correspond to a different "ghost" example, mask sampling is effectively similar to data augmentation.

While in practice finding a ghost example that replicates exactly the same dropout hidden activations might not be possible, the authors show that finding an "approximate" ghost example that minimizes a distance between the target dropout activation and the deterministic activation of the ghost example works well. Indeed, they show that training a deep neural net on additional data generated by this procedure yields results that are at least as good as regular dropout on MNIST and CIFAR-10 (actually, the deterministic neural net still uses regular dropout at the input layer, however they do show that the additional ghost examples are necessary to match the neural net trained with dropout at all layers).

Then the authors use that interpretation to justify a variation of dropout where the dropout rate isn't fixed, but itself is randomly sampled in some range for each example. Indeed, if we think of dropout at a fixed rate as a specific class of ghost data being added, varying the dropout rate corresponds to enriching the ghost data pool even more. The experiments show that this can help, though not by much.

Finally, the authors propose an explanation of a property of dropout: that it tends to generate hidden representations that are sparser.
Again, the authors rely on their interpretation of dropout as data augmentation. The explanation goes as follows. Training on the ghost data distribution might imply that the classification problem has become significantly harder. Specifically, it is quite possible that the addition of new ghost examples generates new isolated class clusters in input space that the model must now learn to discriminate. And they hypothesize that the generation of such additional clusters would encourage sparsity. To test this hypothesis, the authors synthetically simulate this scenario, by sampling data on a circle, which is clustered in small arcs each assigned to one of 10 possible classes in cycling order. Decreasing the arc length thus increases the number of arcs, i.e. class clusters. They show that training deep networks on datasets with an increasing number of class clusters does yield representations that are increasingly sparse. This thus suggests that dropout might indeed be equivalent to modifying the input distribution by adding such isolated class-specific clusters in input space.

One assumption behind this analysis is that the sparsity patterns (i.e. the set of non-zero dimensions) play an important role in classification and incorporate most of the discriminative class information. This assumption is also confirmed in experiments, where replacing the ReLU activation function with a binary activation (that is 1 if the pre-activation is positive and 0 otherwise) after training still yields a network with good performance (though slightly worse).

#### My two cents

This is a really original and thought provoking paper. One interpretation I make of these results is that the inductive bias corresponding to using a deep neural network with ReLU activations is more valuable than one might have thought, and that the usefulness of deep neural networks goes beyond just being black boxes that can learn data-dependent representations.
Otherwise, it's not clear to me why the ghost data implicitly generated by the architecture would be useful at all. This also suggests an experiment where such ghost samples would be fed to another type of classifier, such as an SVM, to test whether the data augmentation is useful in itself and reflects meaningful structure in the data, as opposed to being somehow useful only for neural nets.

I note that the results are mostly specific to architectures based on ReLU activations (not that this is a problem, but one should keep this in mind).

I'd really like to see what the ghost samples look like. Do they correspond to interpretable images? The authors also mention that exploring how the samples change with training would be interesting to investigate, and I agree.

Finally, I think there might be a typo in Figure 1. While the labels of a) and b) state that the arc length is smaller for a) than b), the plot clearly shows otherwise.
[link]
This paper presents an extensive evaluation of variants of LSTM networks. Specifically, the authors start from what they consider to be the vanilla architecture and, from it, also consider 8 variants which correspond to small modifications of the vanilla case. The vanilla architecture is the one described in Graves & Schmidhuber (2005) \cite{journals/nn/GravesS05}, and the variants consider removing single parts of it (input, forget, output gates or activation functions), coupling the input and forget gate (which is inspired by GRU) or having full recurrence between all gates (which comes from the original LSTM formulation).

In their experimental setup, they consider 3 datasets: TIMIT (speech recognition), IAM Online Handwriting Database (character recognition) and JSB Chorales (polyphonic music modeling). For each, they tune the hyperparameters of each of the 9 architectures, using random search based on 200 samples. Then, they keep the 20 best hyperparameters and use the statistics of those as a basis for comparing the architectures.

#### My two cents

This was a very useful read. I'd make it a required read for anyone who wants to start using LSTMs. First, I found the initial historical description of the developments surrounding LSTMs very interesting and clarifying. But more importantly, it presents a really useful picture of LSTMs that can serve both as a good basis for starting to use LSTMs and as an insightful (backed with data) exposition of the importance of each part of the LSTM.

The analysis based on an fANOVA (which I didn't know about until now) is quite neat. Perhaps the most surprising observation is that momentum actually doesn't seem to help that much.
Investigating second order interactions between hyperparameters was a smart thing to do (showing that tuning the learning rate and hidden layer size jointly might not be that important, which is a useful insight). The illustration in Figure 4, laying out the estimated relationship (with uncertainty) between learning rate / hidden layer size / input noise variance and performance / training time, is also full of useful information.

I won't repeat here the main observations of the paper, which are laid out clearly in the conclusion (section 6). Additionally, my personal takeaway point is that, in an LSTM implementation, it might still be useful to support removing peepholes or coupling the input and forget gates, since they both yielded the ultimate best test set performance on at least one of the datasets (I'm assuming it was also best on the validation set, though this might not be the case...).

The fANOVA analysis makes it clear that the learning rate is the most critical hyperparameter to tune (can be "make or break"). That said, this is already well known. And the fact that it explains so much of the variance might reflect a bias of the analysis towards a situation where the learning rate isn't tuned as well as it could be in practice (this is after all THE hyperparameter that neural net researchers spend the most time tuning in practice). So, as future work, this suggests perhaps doing another round of the same analysis (which is otherwise really neatly set up), where more effort is always put on tuning the learning rate, individually for each configuration of the other hyperparameters. In other words, we'd try to ignore the regions of hyperparameter space that correspond to bad learning rates, in order to "marginalize out" its effect. This would thus explore the perhaps more realistic setup that assumes one always tunes the learning rate as best as possible.
Also, including a less aggressive gradient clipping in the hyperparameter search would be interesting since, as the authors admit, clipping within [-1,1] might have been too much and could explain why it didn't help.

Otherwise, a really great and useful read!
[link]
This paper explores the problem of question answering based on natural text. While this has been explored recently in the context of Memory Networks, the problems tackled so far have been synthetically generated. In this paper, the authors propose to extract more realistic question answering examples from news sites, by treating the main body of a news article as the content (the "facts") and extracting questions from the article's bullet point summaries. Specifically, by detecting the entities in these bullet points and replacing them with a question placeholder (e.g. "Producer X will not press charges"), they are able to generate queries which, while not grammatically being questions, do require performing a form of question answering. Thanks to this procedure, two large *supervised* datasets are created, with several thousands of questions, based on the CNN and Daily Mail news sites.

Then, the authors investigate neural network based systems for solving this task. They consider a fairly simple Deep LSTM network, which is first fed the article's content and then the query. They also consider two architectures that incorporate an attentional mechanism, based on softmax weighting. The first ("Attentive Reader") attends once over the document (i.e. uses a single softmax weight vector) while the second ("Impatient Reader") attends after every word in the query (akin to the soft attention architecture in the "Show, Attend and Tell" paper).

These neural network architectures are also compared with simpler baselines, which are closer to what a more "classical" statistical NLP solution might look like. Results on both datasets demonstrate that the neural network approaches have superior performance, with the attentional models being significantly better than the simpler Deep LSTM model.

#### My two cents

This is a welcome development in the research on reasoning models based on neural networks.
I've always thought it was unfortunate that the best benchmark available is based on synthetically generated cases. This work fixes this problem in a really clever way, while still being able to generate a large amount of training data. Particularly clever is the random permutation of entity markers when processing each case. Thanks to that, a system cannot simply use general statistics on words to answer questions (e.g. just from the query "The hi-tech bra that helps you beat breast X" it's obvious that "cancer" is an excellent answer). In this setup, the system is forced to exploit the content of the article, thus ensuring that the benchmark is indeed measuring the system's question-answering abilities.

Since the dataset itself is an important contribution of this paper, I hope the authors release it publicly in the near future. The evaluation of the different neural architectures is also really thoroughly done. The non-neural baselines are reasonable and the comparison between the neural nets is itself interesting, bringing more evidence that the softmax weighted attentional mechanism (which has been gaining in popularity) indeed brings something over a regular LSTM approach.
[link]
The main idea in this paper is to use the agent's ability to predict observations at the next step as a measure of how much exploration of that action should be encouraged. This prediction is based on a deep architecture, specifically a deep autoencoder representation of observations, and accuracy of prediction is measured at the level of that learned, deep representation. Exploration is encouraged by increasing the reward whenever the model's prediction of the representation at the next time step is bad.

#### My two cents

I'm not sure how novel this idea is in RL, but at the very least it's interesting that it was explored the way it was here, with deep learning. As a non-expert in RL, I certainly enjoyed reading the paper. Also, this nicely implements an idea that just seems like common sense, as an exploration strategy for an agent: actions that merit exploration are those that yield results that are unexpected to you. It will be interesting to see if this general approach will be able to exploit upcoming progress in the development of better generative deep learning models, an area that is currently very active.
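The reward shaping described in this review can be written down in a couple of lines. This is only a sketch under assumptions: the squared Euclidean error, the scaling factor `beta`, and the function names are illustrative choices, and the paper may normalize or schedule the bonus differently. `predicted_code` and `actual_code` stand for the predicted and observed autoencoder representations of the next observation.

```python
import numpy as np

def exploration_bonus(predicted_code, actual_code, beta=1.0):
    """Bonus proportional to the prediction error in the learned representation space."""
    diff = np.asarray(predicted_code, dtype=float) - np.asarray(actual_code, dtype=float)
    return beta * np.sum(diff ** 2)

def augmented_reward(reward, predicted_code, actual_code, beta=1.0):
    """Environment reward plus the novelty bonus: surprising transitions pay more."""
    return reward + exploration_bonus(predicted_code, actual_code, beta)
```

When the dynamics model predicts the next representation perfectly, the bonus vanishes and the agent sees only the environment reward, so exploration pressure fades as the model improves.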
[link]
This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very high-dimensional target with a (dense) output prediction. Most of the paper covers the specific case of the squared error loss, though the trick can also be applied to some other losses such as the so-called spherical softmax. One use case could be training autoencoders with the squared error on very high-dimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$, where $D$ is the input dimensionality and $d$ the hidden layer dimensionality, they show that their trick allows scaling in $O(d^2)$. Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU.

#### My two cents

This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on $D$ in the complexity would actually be achievable, even for the "simpler" case of the squared error.

The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative log-softmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (SimLex-999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time.

But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors.
Also, the authors mention that it would be possible to adapt the trick to the so-called (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the pre-activation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer!
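To give some intuition for why the dependence on $D$ can disappear, here is a sketch of the loss value only, under the assumption that the output layer is linear with weights `W` of shape `(D, d)` and that the target has `K` non-zero entries. Expanding $\|Wh - y\|^2 = h^\top (W^\top W) h - 2\,(Wh)^\top y + y^\top y$ shows that, given the $d \times d$ matrix `Q = W.T @ W`, only the `K` rows of `W` matching the non-zero targets are ever touched. Keeping `Q` up to date under gradient updates is the harder part handled by the paper and is not covered here; the function names are illustrative.

```python
import numpy as np

def naive_loss(W, h, y_idx, y_val):
    """O(D d): materialize the full D-dimensional prediction and sparse target."""
    y = np.zeros(W.shape[0])
    y[y_idx] = y_val                       # assumes the indices in y_idx are unique
    return np.sum((W @ h - y) ** 2)

def factored_loss(Q, W, h, y_idx, y_val):
    """O(d^2 + K d): uses Q = W^T W and only the K rows of W where y is non-zero."""
    quad = h @ Q @ h                       # ||W h||^2 without forming W h
    cross = (W[y_idx] @ h) @ y_val         # (W h)^T y restricted to non-zero targets
    return quad - 2.0 * cross + y_val @ y_val
```

Both functions return the same value up to floating-point error, but the second never builds a vector of size `D`.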
[link]
This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require from the ("outside") user an explicit definition of rewards. Specifically, the learning objective, for a current state s, is the mutual information between the sequence of K actions a proposed by an exploration distribution $w(as)$ and the final state s' of the agent after performing these actions. To understand what the properties of this objective, it is useful to consider the form of this mutual information as a difference of conditional entropies: $$I(a,s's) = H(as)  H(as',s)$$ Where $I(..)$ is the (conditional) mutual information and $H(..)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high $H(as)$ entropy) but is such that these actions have predictable consequences (i.e. lead to predictable state s' so that $H(as',s)$ is low). So one could think of the agent as trying to learn to have control of as much of the environment as possible, thus this objective has also been coined as "empowerment". The main contribution of this work is to show how to train, on a large scale (i.e. larger state space and action space) with this objective, using neural networks. They build on a variational lower bound on the mutual information and then derive from it a stochastic variational training algorithm for it. The procedure has 3 components: the exploration distribution $w(as)$, the environment $p(s's,a)$ (can be thought as an encoder, but which isn't modeled and is only interacted with/sampled from) and the planning model $p(as',s)$ (which is modeled and can be thought of as a decoder). 
The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details). This approach exploits neural networks of various forms. Neural autoregressive generative models are used as models for the exploration distribution as well as for the decoder or planning distribution. Interestingly, the framework also makes it possible to learn the state representation $s$ as a function of some "raw" representation $x$ of states. For raw states corresponding to images (e.g. the pixels of the screen image in a game), CNNs are used.
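To get a feel for the objective, here is a minimal sketch that computes $I(a,s'|s) = H(a|s) - H(a|s',s)$ exactly for a tiny discrete problem at one fixed state $s$. This is not the paper's variational algorithm (which exists precisely because this exact computation does not scale); the function name and the toy transition tables are my own illustrative assumptions.

```python
import numpy as np

def conditional_mutual_information(w, p_next):
    """Exact I(a; s' | s) for one fixed state s.

    w:      shape (A,), exploration distribution w(a|s)
    p_next: shape (A, S), environment transitions p(s'|s,a)
    """
    joint = w[:, None] * p_next                     # p(a, s' | s)
    p_s_next = joint.sum(axis=0)                    # p(s' | s)
    # Posterior over actions given the outcome, p(a | s', s),
    # i.e. what the planning model is trained to approximate.
    posterior = np.divide(joint, p_s_next,
                          out=np.zeros_like(joint), where=p_s_next > 0)
    # Entropies, using the convention 0 log 0 = 0.
    h_a = -(w * np.log(np.where(w > 0, w, 1.0))).sum()
    h_a_given_next = -(joint * np.log(np.where(posterior > 0, posterior, 1.0))).sum()
    return h_a - h_a_given_next

uniform = np.array([0.5, 0.5])
# Each action leads to its own state: actions are fully "in control".
controllable = np.eye(2)
# Both actions lead to the same coin flip: actions have no influence.
uncontrollable = np.full((2, 2), 0.5)

print(conditional_mutual_information(uniform, controllable))    # log(2)
print(conditional_mutual_information(uniform, uncontrollable))  # 0.0
```

The first environment gives the maximum possible value for two actions, while the second gives zero even though the exploration distribution has the same entropy in both cases.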
[link]
This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, making it possible to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model. The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) to replicate the behavior and performance of a much larger model (teacher), by essentially training the student to match the outputs of the teacher. The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD. Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.
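As a reminder of what the "noisy SGD step" looks like, here is a minimal sketch of SGLD on a toy problem (inferring the mean of a Gaussian). It only covers the teacher's sampling step, not the distillation update, and the toy model, step size and function names are my own assumptions rather than anything taken from the paper.

```python
import numpy as np

def sgld_step(theta, grad_log_posterior, step_size, rng):
    """One SGLD update: half a gradient step plus Gaussian noise of variance step_size."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_posterior + noise

def run_toy_sgld(num_steps=3000, batch_size=50, step_size=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.normal(2.0, 1.0, size=1000)      # x_i ~ N(theta, 1), theta unknown
    n_total = data.size
    theta = np.zeros(1)
    samples = []
    for _ in range(num_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        grad_prior = -theta / 10.0 ** 2          # prior theta ~ N(0, 10^2)
        # Minibatch estimate of the full-data likelihood gradient.
        grad_lik = (n_total / batch_size) * np.sum(batch - theta)
        theta = sgld_step(theta, grad_prior + grad_lik, step_size, rng)
        samples.append(theta[0])
    return data, np.array(samples)

data, samples = run_toy_sgld()
# After a burn-in period, the samples should hover around the posterior mean,
# which here is essentially the data mean.
print(data.mean(), samples[500:].mean())
```

In the setting of the paper, each such sample would play the role of the current teacher, and the student would take one distillation step towards its predictions before the next sample is drawn.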
[link]
This paper proposes to learn features for images using neural networks that predict the relative motion of the camera that captured two successive images. The main motivation for this approach is that such data would be very cheap to collect, as it would not require any labelling by a human and only relies on "egomotion" (and thus readily available) information. More concretely, what must be predicted is the X/Y/Z rotation or translation movements. This is converted into a classification problem by binning each movement into a fixed number of ranges of movement magnitude. The neural network architecture then consists of a Siamese-style CNN (SCNN). First, two Base-CNNs (BCNN) with tied weights process the input image pair (one image per BCNN) to produce features for each image. These features are then concatenated and fed to a Top-CNN (TCNN), which produces a prediction for the relative transformation that relates the two images. The output layer thus contains groups of softmax units, one for each dimension of variation of the transformation (e.g. 3 for X/Y/Z rotation). The experiments show that pretraining on this task is competitive with pretraining a CNN on the same amount of ImageNet classification data.
[link]
`Update 2015/11/23: Since I first wrote this note, I became involved in the next iterations of this work, which became v2 of the arXiv manuscript. The notes below were made based on v1.` This paper considers the problem of Maximum Inner Product Search (MIPS). In MIPS, given a query $q$ and a set of inputs $x_i$, we want to find the input (or the top-n inputs) with highest inner product, i.e. $\arg\max_i q^\top x_i$. Recently, it was shown that a simple transformation of the query and input vectors makes it possible to approximately solve MIPS using hashing methods for Maximum Cosine Similarity Search (MCSS), a problem for which solutions are readily available (see section 2.4 for a brief but very clear description of the transformation). In this paper, the authors combine this approach with clustering, in order to improve the quality of retrieved inputs. Specifically, they consider the spherical k-means algorithm, which is a variant of k-means in which data points are clustered based on cosine similarity instead of Euclidean distance (in short, data points are first scaled to be of unit norm, then in the training inner loop points are assigned to the cluster centroid with highest dot product and cluster centroids are updated as usual, except that they are always rescaled to unit norm). Moreover, they consider a bottom-up application of the algorithm to yield a hierarchical clustering tree. They propose to use such a hierarchical clustering tree to find the top-n candidates for MIPS. The key insight here is that, since spherical k-means relies on cosine similarity for finding the best cluster, and since we have a transformation that allows the maximisation of inner product to be approximated by the maximisation of cosine similarity, a tree to find MIPS candidates can be constructed by running spherical k-means on the inputs transformed by the same transformation used for hashing-based MIPS.
In order to make the search more robust to border issues when a query is close to the frontier between clusters, at each level of the tree they consider more than one candidate cluster during top-down search, so as to merge the candidates in several leaves of the tree at the very end of a full top-down query. Their experiments using search with word embeddings show that the quality of the top 1, 10 and 100 MIPS candidates using their spherical k-means approach is better than using two hashing-based search methods.
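To illustrate how inner product search can be turned into cosine similarity search, here is a minimal sketch of one reduction I am aware of: every input gets an extra coordinate that brings its norm up to the largest input norm $M$, and the query gets an extra zero. I am assuming, without having checked it against section 2.4, that this is in the spirit of the transformation used in the paper; the function names are my own.

```python
import numpy as np

def transform_inputs(X):
    """Append sqrt(M^2 - ||x||^2) so that every transformed input has norm M."""
    norms_sq = np.sum(X ** 2, axis=1)
    extra = np.sqrt(np.maximum(norms_sq.max() - norms_sq, 0.0))
    return np.hstack([X, extra[:, None]])

def transform_query(q):
    """Append a zero, so the extra input coordinate never contributes to the dot product."""
    return np.append(q, 0.0)

def cosine_similarities(q, X):
    return (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))

rng = np.random.default_rng(0)
# Inputs with very different norms, where MIPS and cosine search disagree in general.
X = rng.standard_normal((50, 8)) * rng.uniform(0.1, 5.0, size=(50, 1))
q = rng.standard_normal(8)

best_by_inner_product = np.argmax(X @ q)
best_by_cosine = np.argmax(cosine_similarities(transform_query(q), transform_inputs(X)))
print(best_by_inner_product, best_by_cosine)
```

Since all transformed inputs share the same norm, ranking them by cosine similarity with the transformed query gives the same ordering as ranking the original inputs by inner product, which is what allows spherical k-means (or hashing for MCSS) to be applied to the transformed vectors.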
[link]
This paper presents a method for "learning the learning rate" of a stochastic gradient descent method, in the context of online learning. Indeed, variations on the chosen learning rate or learning rate schedule can have a large impact on the observed performance of stochastic gradient descent. Moreover, in the context of online learning, where we are interested in achieving high performance not only at convergence but every step of the way, the "choosing the learning rate" problem is even more crucial. The authors present a method which attempts to train the learning rate itself by gradient descent. This is achieved by "unrolling" the parameter updates of our model across the time steps of online learning, which exposes the interaction between the learning rate and the sum of losses of the model across these time steps. The authors then propose a way to approximate the gradient of the sum of losses with respect to the learning rate, so that it can be used to perform gradient updates on the learning rate itself. The gradient on the learning rate has to be approximated, for essentially the same reason that gradients to train a recurrent neural network online must be approximated (see also my notes on another good paper by Yann Ollivier here: \cite{journals/corr/OllivierC15}). Another approximation is introduced to avoid having to compute a Hessian matrix. Nevertheless, results suggest that the proposed approximation works well and can improve over a fixed learning rate with a reasonable decay schedule.

#### My two cents
I think the authors are right on the money as to the challenges posed by online learning. I think these challenges are likely to be greater in the context of training neural networks online, for which few satisfactory solutions exist right now. So this is a direction of research I'm particularly excited about.
At this point, the experiments consider fairly simple learning scenarios, but I don't see any obstacle in applying the same method to neural networks. One interesting observation from the results is that they are fairly robust to variations of "the learning rate of the learning rate", compared to varying and fixing the learning rate itself. Finally, I haven't had time to entirely digest one of their theoretical results, suggesting that their approximation actually corresponds to an exact gradient taken "alongside the effective trajectory" of gradient descent. However, that result seems quite interesting and would deserve more attention.
[link]
This is another "learning the learning rate" paper, which predates (and might have inspired) the "Speed learning on the fly" paper I recently wrote notes about (see \cite{journals/corr/MasseO15}). In this paper, they consider the offline training scenario, and propose to do gradient descent on the learning rate by unrolling the *complete* training procedure and treating it all as a function to optimize, with respect to the learning rate. This way, they can directly optimize the validation set loss. The paper in fact goes much further and can tune many other hyperparameters of the gradient descent procedure: momentum, weight initialization distribution parameters, regularization and input preprocessing.

#### My two cents
This is one of my favorite papers of this year. While the method of unrolling several steps of gradient descent (100 iterations in the paper) makes it somewhat impractical for large networks (which is probably why they considered 3-layer networks with only 50 hidden units per layer), it provides an incredibly interesting window on what are good hyperparameter choices for neural networks. Note that, to substantially reduce the memory requirements of the method, the authors had to be quite creative and smart about how to encode changes in the network's weights. There are tons of interesting experiments, which I encourage the reader to go check out (see section 3). One experiment on training the learning rates, separately for each iteration (i.e. learning a learning rate schedule), for each layer and for either weights or biases (800 hyperparameters total) shows that a good schedule is one where the top layer first learns quickly (large learning rate), then the bottom layer starts training faster, and finally the learning rates of all layers are decayed towards zero. Note that some of the experiments presented actually optimized the training error, instead of the validation set error.
Another experiment looked at finding optimal scales for the weight initialization. Interestingly, the values found weren't that far from an often prescribed scale of $1 / \sqrt{N}$, where $N$ is the number of units in the previous layer. The experiment on "training the training set", i.e. generating the 10 examples (one per class) that would minimize the validation set loss of a network trained on these examples, is a pretty cool idea (it essentially learns prototypical images of the digits from 0 to 9 on MNIST). Another experiment tried to optimize a multitask regularization matrix, in order to encourage forms of soft weight-tying across tasks. Note that approaches like the one in this paper make tools for automatic differentiation incredibly valuable. Python autograd, the authors' automatic differentiation Python library https://github.com/HIPS/autograd (which inspired our own Torch autograd https://github.com/twitter/torch-autograd), was in fact developed in the context of this paper. Finally, I'll end with a quote from the paper, that I found particularly funny: "The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector."
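To make the idea of "differentiating through training" concrete, here is a minimal sketch on a 1-D quadratic problem, where the derivative of the final validation loss with respect to the learning rate is accumulated alongside the unrolled gradient descent updates. For simplicity I propagate the derivative forward through the updates, whereas the paper relies on reverse-mode differentiation (with some clever memory savings); the toy losses and all names below are my own assumptions.

```python
def train_and_hypergradient(lr, num_steps=10, curvature=2.0,
                            train_opt=1.0, val_opt=1.2, theta0=0.0):
    """Unrolls gradient descent on L_train(theta) = 0.5 * curvature * (theta - train_opt)^2
    and returns (validation loss, d validation loss / d lr)."""
    theta, dtheta_dlr = theta0, 0.0
    for _ in range(num_steps):
        grad = curvature * (theta - train_opt)            # dL_train / dtheta
        # Derivative of the update theta <- theta - lr * grad with respect to lr.
        # The Hessian of the training loss is simply `curvature` here.
        dtheta_dlr = dtheta_dlr - grad - lr * curvature * dtheta_dlr
        theta = theta - lr * grad
    val_loss = 0.5 * (theta - val_opt) ** 2
    return val_loss, (theta - val_opt) * dtheta_dlr

lr = 0.05
val_loss, hypergrad = train_and_hypergradient(lr)

# Sanity check against a finite-difference estimate.
eps = 1e-6
finite_diff = (train_and_hypergradient(lr + eps)[0]
               - train_and_hypergradient(lr - eps)[0]) / (2 * eps)
print(hypergrad, finite_diff)

# One step of gradient descent on the learning rate itself.
lr_new = lr - 0.01 * hypergrad
```

Here the hypergradient is negative, so the learning rate gets increased: with only 10 steps, a larger learning rate brings the parameter closer to the validation optimum.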
[link]
This paper introduces a version of the skip-gram word embeddings learning algorithm that can also learn the size (nb. of dimensions) of these embeddings. The method, coined infinite skip-gram (iSG), is inspired by my work with Marc-Alexandre Côté on the infinite RBM, in which we describe a mathematical trick for learning the size of a latent representation. This is done by introducing an additional latent variable $z$ representing the number of dimensions effectively involved in the energy function. Moreover, a term penalizing increasing values of $z$ is also incorporated, such that the infinite sum over $z$ converges. In this paper, the authors extend the probabilistic model behind skip-gram with such a variable $z$, now corresponding to the number of dimensions involved in the dot product between word embeddings. They also propose a few approximations required to allow for an efficient training algorithm. Mainly, they optimize an upper bound on the regular skip-gram objective (see Section 3.2) and they approximate the computation of the conditional over $z$ for a given word $w$, which requires summing over all possible context words $c$, by summing only over the words observed in the immediate current context of $w$ (thus this sum will vary across training examples of the same word $w$). Experiments show that the iSG learns to exploit different dimensions to model different senses of words better than the original skip-gram model does. Quantitatively, the iSG seems to provide better probabilities to context words.
[link]
This paper presents a feedforward neural network architecture for processing graphs as inputs, inspired by previous work on Graph Neural Networks. In brief, the architecture of the GGNN corresponds to $T$ steps of GRU-like (gated recurrent units) updates, where $T$ is a hyperparameter. At each step, a vector representation is computed for all nodes in the graph, where a node's representation at step $t$ is computed from the representation of nodes at step $t-1$. Specifically, the representation of a node will be updated based on the representation of its neighbors in the graph. Incoming and outgoing edges in the graph are treated differently by the neural network, by using different parameter matrices for each. Moreover, if edges have labels, separate parameters can be learned for the different types of edges (meaning that edge labels determine the configuration of parameter sharing in the model). Finally, GGNNs can incorporate node-level attributes, by using them in the initialization (time step 0) of the nodes' representations. GGNNs can be used to perform a variety of tasks on graphs. The per-node representations can be used to make per-node predictions by feeding them to a neural network (shared across nodes). A graph-level predictor can also be obtained using a soft attention architecture, where per-node outputs are used as scores into a softmax in order to pool the representations across the graph, and feed this graph-level representation to a neural network. The attention mechanism can be conditioned on a "question" (e.g. on a task to predict the shortest path in a graph, the question would be the identity of the beginning and end nodes of the path to find), which is fed to the node scorer of the soft attention mechanism. Moreover, the authors describe how to chain GGNNs to go beyond predicting individual labels and predict sequences. Experiments on several datasets are presented.
These include tasks where a single output is required (on a few bAbI tasks) as well as tasks where a sequential output is required, such as outputting the shortest path or the Eulerian circuit of a graph. Moreover, experiments on a much more complex and interesting program verification task are presented. 
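Here is a minimal sketch of what the $T$ steps of GRU-like propagation described above could look like, for a graph with a single edge type. The exact parametrization (how incoming and outgoing messages are combined, the parameter names, the zero-padding of node attributes at step 0) is my own assumption for illustration and may differ from the paper's equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, rng, scale=0.1):
    shapes = {"W_in": (d, d), "W_out": (d, d),
              "W_z": (2 * d, d), "U_z": (d, d),
              "W_r": (2 * d, d), "U_r": (d, d),
              "W_h": (2 * d, d), "U_h": (d, d)}
    return {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}

def ggnn_step(h, adj, params):
    """One propagation step.

    h:   (num_nodes, d) node representations at step t-1
    adj: (num_nodes, num_nodes), adj[i, j] = 1 if there is an edge i -> j
    """
    # Incoming and outgoing edges use different parameter matrices.
    msg_in = adj.T @ (h @ params["W_in"])      # messages from predecessors
    msg_out = adj @ (h @ params["W_out"])      # messages from successors
    a = np.concatenate([msg_in, msg_out], axis=1)
    z = sigmoid(a @ params["W_z"] + h @ params["U_z"])               # update gate
    r = sigmoid(a @ params["W_r"] + h @ params["U_r"])               # reset gate
    h_tilde = np.tanh(a @ params["W_h"] + (r * h) @ params["U_h"])   # candidate state
    return (1.0 - z) * h + z * h_tilde

def run_ggnn(node_features, adj, params, num_steps):
    d = params["U_z"].shape[0]
    # Step 0: node attributes, zero-padded up to the hidden size d.
    h = np.zeros((node_features.shape[0], d))
    h[:, :node_features.shape[1]] = node_features
    for _ in range(num_steps):
        h = ggnn_step(h, adj, params)
    return h

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)   # chain 0 -> 1 -> 2
features = np.eye(3)[:, :2]                                      # toy node attributes
h_final = run_ggnn(features, adj, init_params(4, rng), num_steps=5)
print(h_final.shape)
```

The per-node rows of `h_final` are what would then be fed to a per-node output network, or pooled with soft attention for a graph-level prediction.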
[link]
This paper presents an approach to initialize a neural network from the parameters of a smaller and previously trained neural network. This is effectively done by increasing the size (in width and/or depth) of the previously trained neural network, in such a way that the function represented by the network doesn't change (i.e. the output of the larger neural network is still the same). The motivation here is that initializing larger neural networks in this way makes it possible to accelerate their training, since at initialization the neural network will already be quite good. In a nutshell, neural networks are made wider by adding several copies (selected randomly) of the same hidden units to the hidden layer, for each hidden layer. To ensure that the neural network output remains the same, each incoming connection weight of the next layer must also be divided by the number of replicas of the unit it is connected to in the previous layer. If not training using dropout, it is also recommended to add some noise to this initialization, in order to break its initial symmetry (though this will actually break the property that the network's output is the same). As for making a deeper network, layers are added by initializing them to be the identity function. For ReLU units, this is achieved using an identity matrix as the connection weight matrix. For units based on sigmoid or tanh activations, unfortunately it isn't possible to add such identity layers. In their experiments on ImageNet, the authors show that this initialization allows them to train larger networks faster than if trained from random initialization. More importantly, they were able to outperform their previous validation set ImageNet accuracy by initializing a very large network from their best Inception network.
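The widening and deepening operations are easy to check numerically. Below is a minimal sketch for a one-hidden-layer ReLU network, without the symmetry-breaking noise; the function names and the tiny network are my own illustrative choices, not code from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, W1, b1, W2, b2):
    return relu(x @ W1 + b1) @ W2 + b2

def widen(W1, b1, W2, new_width, rng):
    """Function-preserving widening of the hidden layer.

    W1: (n_in, n_hidden), b1: (n_hidden,), W2: (n_hidden, n_out)
    """
    n_hidden = W1.shape[1]
    # Keep every original unit, then add randomly selected copies.
    mapping = np.concatenate([np.arange(n_hidden),
                              rng.integers(0, n_hidden, size=new_width - n_hidden)])
    counts = np.bincount(mapping, minlength=n_hidden)    # replicas per original unit
    W1_new, b1_new = W1[:, mapping], b1[mapping]
    # Outgoing weights of each replica are divided by the number of replicas.
    W2_new = W2[mapping, :] / counts[mapping][:, None]
    return W1_new, b1_new, W2_new

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 3)), rng.standard_normal(3)
x = rng.standard_normal((10, 5))

W1_wide, b1_wide, W2_wide = widen(W1, b1, W2, new_width=7, rng=rng)
out_small = forward(x, W1, b1, W2, b2)
out_wide = forward(x, W1_wide, b1_wide, W2_wide, b2)

# Deepening: an extra ReLU layer initialized to the identity leaves the output
# unchanged, because the hidden activations it receives are already non-negative.
hidden = relu(x @ W1 + b1)
out_deep = relu(hidden @ np.eye(4) + np.zeros(4)) @ W2 + b2
print(np.allclose(out_small, out_wide), np.allclose(out_small, out_deep))
```

The same identity trick fails for sigmoid or tanh units, since applying the nonlinearity a second time changes the activations.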
[link]
This paper proposes to learn embeddings of text and/or images according to a dissimilarity metric that is asymmetric and implements the notion of partial order. For example, we'd like the metric to capture that the sentence "a dog in the yard" is more specific than just "a dog". Similarly, given the image of a scene and a caption describing it, we'd also like to capture that the image is more specific than the caption, since captions only describe the main elements of the scene. We'd also like to capture the hypernym relation between single words, e.g. where "woman" is more specific than "person". To achieve this, they propose to use the following dissimilarity metric: $$E(x,y) = ||\max(0,y-x)||^2$$ where $x$ and $y$ are embedding vectors and the max operation is applied element-wise. The way to use this metric is to learn embeddings such that, for a pair $x,y$ where the object (e.g. "a dog in the yard") represented by $x$ is more specific than the object (e.g. "a dog") represented by $y$, $E(x,y)$ is as small as possible. For example, let's assume that $x$ and $y$ are the output of a neural network, where each output dimension detects a certain concept, i.e. is non-zero only if the concept associated with that dimension is present in the input. For $x$ representing "a dog in the yard", we could expect having only two dimensions that are non-zero: one detecting the concept "dog" (let's note it $x_j$) and another detecting the concept "yard" ($x_k$). For $y$ representing "a dog", only the dimension associated with "dog" ($y_j$) would be non-zero and have the same value as $x_j$. In this situation, it is easy to see that $E(x,y)$ would be 0, but $E(y,x)$ would be greater than zero, thus capturing appropriately the asymmetric relationship between the two. The authors show in the paper how to leverage this new asymmetric metric in training losses that are appropriate for 3 problems: hypernym detection, caption-image retrieval and textual entailment.
They show that the proposed metric yields superior performance on these problems compared to symmetric metrics that have been used by prior work. 
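The "dog in the yard" example from these notes can be checked in a couple of lines. This is just a sketch of the dissimilarity $E(x,y) = ||\max(0,y-x)||^2$ on hand-made 2-dimensional embeddings (one "dog" detector, one "yard" detector); the vectors are made up for illustration.

```python
import numpy as np

def order_violation(x, y):
    """E(x, y) = ||max(0, y - x)||^2, which is zero when x >= y in every dimension."""
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

# Dimensions: [dog, yard]
dog_in_yard = np.array([1.0, 1.0])   # x: the more specific object
dog = np.array([1.0, 0.0])           # y: the more general object

print(order_violation(dog_in_yard, dog))   # 0.0
print(order_violation(dog, dog_in_yard))   # 1.0
```

Swapping the arguments changes the value, which is exactly the asymmetry a symmetric distance cannot express.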
[link]
This paper is concerned with the problem of predicting a sequence at the output, e.g. using an RNN. It aims at addressing the issue it refers to as exposure bias, which here refers to the fact that while at training time the RNN producing the output sequence is being fed the ground truth previous tokens (words) when producing the next token (something sometimes referred to as teacher forcing, which really is just maximum likelihood), at test time this RNN makes predictions using recursive generation, i.e. it is instead recursively fed by its own predictions (which might be erroneous). Moreover, it also proposes a training procedure that can take into account a rich performance measure that can't easily be optimized directly, such as the BLEU score for text outputs. The key observation is that the REINFORCE algorithm could be used to optimize the expectation of such arbitrarily complicated performance measures, for outputs produced by (stochastic) recursive generation. However, REINFORCE is a notoriously unstable training algorithm, which can often work terribly (in fact, the authors mention that they have tried using REINFORCE only, without success). Thus, they instead propose to gradually go from training according to maximum likelihood / teacher forcing to training using the REINFORCE algorithm on the expected performance measure. The proposed procedure, dubbed MIXER (Mixed Incremental Cross-Entropy Reinforce), goes as follows:

1. Train the model to optimize the likelihood of the target sequence, i.e. minimize the per time-step cross-entropy loss.
2. Then, for a target sequence of size $T$, optimize the cross-entropy for the first $T-\Delta$ time steps of the sequence and use REINFORCE to get a gradient on the expected loss (e.g. negative BLEU) for the recursive generation of the remaining $\Delta$ time steps.
3. Increase $\Delta$ and go back to 2., until $\Delta$ is equal to $T$.
Experiments on 3 text benchmarks (summarization, machine translation and image captioning) show that this approach yields models that produce much better outputs when not using beam search (i.e. using greedy recursive generation) to generate an output sequence, compared to other alternatives such as regular maximum likelihood and Data as Demonstrator (DaD). DaD is similar to the scheduled sampling method of Bengio et al. (see my note: \cite{conf/nips/BengioVJS15}), in that at training time, some of the previous tokens fed to the model are predicted tokens instead of ground truths. When using beam search, MIXER is only outperformed by DaD on the machine translation task.
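The annealing schedule of the MIXER procedure can be sketched as a small generator that tells, for each training stage, how many leading time steps use the cross-entropy loss and how many trailing ones use REINFORCE. The increment `delta_step` is a hypothetical parameter of mine, since these notes don't say by how much Δ is increased at each stage.

```python
def mixer_schedule(seq_len, delta_step):
    """Yields (num_cross_entropy_steps, num_reinforce_steps) for each training stage."""
    yield seq_len, 0                     # stage 1: cross-entropy on the whole sequence
    delta = delta_step
    while delta < seq_len:
        yield seq_len - delta, delta     # cross-entropy first, REINFORCE on the rest
        delta += delta_step
    yield 0, seq_len                     # last stage: REINFORCE on the whole sequence

for num_xent, num_reinforce in mixer_schedule(seq_len=10, delta_step=3):
    print(num_xent, num_reinforce)
```

Each stage always covers the full sequence, with the boundary between the two losses moving towards the start of the sequence as training progresses.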
[link]
This paper presents a method for training feedforward neural networks with stochastic hidden units (e.g. sigmoid belief networks), to optimize the expectation (over the stochastic units) of some arbitrary loss function. While the proposed method is applicable to any type of stochastic units, it is most interesting for the case of discrete stochastic units, since the reparametrization trick of variational autoencoders cannot be applied to backprop through the sampling step. In short, the method builds on the likelihood ratio method (of which REINFORCE is a special case) and proposes a baseline (also known as control variate) which, according to the authors, is such that an unbiased gradient is obtained. Specifically, the baseline corresponds to the first-order Taylor expansion of the loss function around some deterministic value of the hidden units (x̄) that doesn't depend on the stochastic hidden units (noted x in the paper). For a likelihood ratio method to be unbiased, it is required that the expectation of the baseline (times the gradient of the model's log distribution) with respect to the model's distribution be tractable. For the proposed baseline, it can be shown that computing this expectation requires the gradient of the mean (μ) of each stochastic unit in the network with respect to each parameter. The key idea behind the proposed method is that 1) an estimate of this expectation can be obtained simply using mean-field and 2) since mean-field is estimated by a feedforward deterministic pass over the network, it is thus possible to compute the gradients of μ by backpropagation through the mean-field pass (hence the name of the method, MuProp). Experiments show that this method converges much faster than previously proposed unbiased methods and often performs better. Experiments also show that the method obtains competitive performance compared to biased methods (such as the "straight through" method).
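For intuition, here is a minimal sketch of this kind of estimator in the simplest possible case: a single Bernoulli unit $x$ with mean $\mu = \mathrm{sigmoid}(\theta)$ and an arbitrary loss $f(x)$. With a single unit the mean-field pass is exact, so unbiasedness can be verified by enumerating both values of $x$. The loss function and all names are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def muprop_estimate(x, theta, f, f_prime):
    """Single-sample estimate of d E[f(x)] / d theta for x ~ Bernoulli(sigmoid(theta))."""
    mu = sigmoid(theta)
    x_bar = mu                                            # deterministic mean-field value
    baseline = f(x_bar) + f_prime(x_bar) * (x - x_bar)    # first-order Taylor expansion
    score = x - mu                                        # d log p(x) / d theta
    dmu_dtheta = mu * (1.0 - mu)                          # backprop through the mean
    # Likelihood-ratio term with the baseline removed, plus the correction term
    # that adds back the expected contribution of the baseline.
    return (f(x) - baseline) * score + f_prime(x_bar) * dmu_dtheta

def expected_estimate(theta, f, f_prime):
    mu = sigmoid(theta)
    return (mu * muprop_estimate(1.0, theta, f, f_prime)
            + (1.0 - mu) * muprop_estimate(0.0, theta, f, f_prime))

def true_gradient(theta, f):
    mu = sigmoid(theta)
    return (f(1.0) - f(0.0)) * mu * (1.0 - mu)

f = lambda x: np.sin(3.0 * x) + x ** 2
f_prime = lambda x: 3.0 * np.cos(3.0 * x) + 2.0 * x

print(expected_estimate(0.7, f, f_prime), true_gradient(0.7, f))
```

In a deep network with several stochastic layers, the mean-field value of the upper layers is only an approximation, which is where the subtleties of the method lie.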
[link]
This paper presents a variety of issues related to the evaluation of image generative models. Specifically, they provide evidence that evaluations of generative models based on the popular Parzen windows estimator or based on a visual fidelity (qualitative) measure both present serious flaws. The Parzen windows approach to generative modeling evaluation works by taking a finite set of samples generated from a given model and then using those as the centroids of a Parzen windows Gaussian mixture. The constructed Parzen windows mixture is then used to compute a log-likelihood score on a set of test examples. Some of the key observations made in this paper are:

1. A simple, k-means based approach can obtain better Parzen windows performance than using the original training samples for a given dataset, even though these are samples from the true distribution!
2. Even for the fairly low dimensional space of 6x6 image patches, a Parzen windows estimator would require an extremely large number of samples to come close to the true log-likelihood performance of a model.
3. Visual fidelity is a bad predictor of true log-likelihood performance, as it is possible to:
    * obtain great visual fidelity and arbitrarily low log-likelihood, with a Parzen windows model made of Gaussians with very small variance;
    * obtain bad visual fidelity and high log-likelihood, by taking a model with high log-likelihood and mixing it with a white noise model, putting as much as 99% of the mixing probability on the white noise model (which would thus produce bad samples 99% of the time).
4. Measuring overfitting of a model by taking samples from the model and making sure their training set nearest neighbors are different is ineffective, since it is actually trivial to generate samples that are each visually almost identical to a training example, yet each have a large Euclidean distance to their corresponding (visually similar) training example.
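For reference, the Parzen windows score discussed in these notes can be sketched in a few lines: model samples become the centers of an isotropic Gaussian mixture, under which the average log-density of test points is computed. The bandwidth `sigma` is normally tuned on a validation set; the toy data and the function name are my own assumptions.

```python
import numpy as np

def parzen_log_likelihood(test_points, samples, sigma):
    """Mean log-density of test_points under a Gaussian mixture with one
    isotropic component of standard deviation sigma per model sample."""
    n, d = samples.shape
    sq_dists = np.sum((test_points[:, None, :] - samples[None, :, :]) ** 2, axis=2)
    log_kernels = (-sq_dists / (2.0 * sigma ** 2)
                   - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2))
    # Log of the mixture average, computed stably with the log-sum-exp trick.
    m = log_kernels.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_kernels - m).sum(axis=1)) - np.log(n)
    return float(log_mix.mean())

rng = np.random.default_rng(0)
model_samples = rng.standard_normal((200, 2))
test_points = rng.standard_normal((100, 2))

score_reasonable = parzen_log_likelihood(test_points, model_samples, sigma=0.5)
score_tiny_sigma = parzen_log_likelihood(test_points, model_samples, sigma=1e-3)
print(score_reasonable, score_tiny_sigma)
```

The second score illustrates the first half of observation 3: with a very small variance, the mixture's own samples look essentially identical to the centers it was built from, yet held-out points get an extremely low log-likelihood.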
[link]
This paper explores the use of convolutional (PixelCNN) and recurrent units (PixelRNN) for modeling the distribution of images, in the framework of autoregressive distribution estimation. In this framework, the input distribution $p(x)$ is factorized into a product of conditionals $\prod_i p(x_i | x_{<i})$. Previous work has shown that very good models can be obtained by using a neural network parametrization of the conditionals (e.g. see our work on NADE \cite{journals/jmlr/LarochelleM11}). Moreover, unlike other approaches based on latent stochastic units that are directed or undirected, the autoregressive approach is able to compute log-probabilities tractably. So in this paper, by considering the specific case of $x$ being an image, they exploit the topology of pixels and investigate appropriate architectures for this. Among the paper's contributions are:

1. They propose Diagonal BiLSTM units for the PixelRNN, which are efficient (thanks to the use of convolutions) while making it possible to, in effect, condition a pixel's distribution on all the pixels above it (see Figure 2 for an illustration).
2. They demonstrate that the use of residual connections (a form of skip connections, from hidden layer $i-1$ to layer $i+1$) is very effective at learning very deep distribution estimators (they go as deep as 12 layers).
3. They show that it is possible to successfully model the distribution over the pixel intensities (effectively an integer between 0 and 255) using a softmax of 256 units.
4. They propose a multi-scale extension of their model, that they apply to larger 64x64 images.

The experiments show that the PixelRNN model based on Diagonal BiLSTM units achieves state-of-the-art performance on the binarized MNIST benchmark, in terms of log-likelihood. They also report excellent log-likelihood on the CIFAR-10 dataset, compared to previous work based on real-valued density models. Finally, they show that their model is able to generate high quality image samples.
[link]
This paper explores the use of so-called Monte Carlo objectives for training directed generative models with latent variables. Monte Carlo objectives take the form of the logarithm of a Monte Carlo estimate (i.e. an average over samples) of the marginal probability $P(x)$. One important motivation for using Monte Carlo objectives is that they can be shown (see the Importance Weighted Variational Autoencoder paper \cite{journals/corr/BurdaGS15} and my notes on it) to correspond to bounds on the true likelihood of the model, and one can tighten the bound simply by drawing more samples in the Monte Carlo objective. Currently, the most successful application of Monte Carlo objectives is based on an importance sampling estimate, which involves training a proposal distribution $Q(h|x)$ in addition to the model $P(x,h)$. This paper considers the problem of training with gradient descent on such objectives, in the context of a model to which the reparametrization trick cannot be applied (e.g. for discrete latent variables). They analyze the sources of variance in the estimation of the gradients (see Equation 5) and propose a very simple approach to reducing the variance of a sampling-based estimator of these gradients. First, they argue that gradients with respect to the $P(x,h)$ parameters are less susceptible to problems due to high variance gradients. Second, and most importantly, they derive a multi-sample estimate of the gradient that is meant to reduce the variance of gradients on the parameters of the proposal distribution $Q(h|x)$. The end result is the gradient estimate of Equations 10-11. It is based on the observation that the first term of the gradient of Equation 5 doesn't distinguish between the contribution of each sampled latent $h^i$. The key contribution is this: they notice that one can incorporate a variance reducing baseline for each sample $h^i$, corresponding to the Monte Carlo estimate of the log-likelihood when removing $h^i$ from the estimate (see Equation 10).
The authors show that this is a proper baseline, in that using it doesn't introduce a bias in the estimation for the gradients. Experiments show that this approach yields better performance than training based on Reweighted Wake Sleep \cite{journals/corr/BornscheinB14} or the use of NVIL baselines \cite{conf/icml/MnihG14}, when training sigmoid belief networks as generative models or as structured output prediction (image completion) models on binarized MNIST. 
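Here is a minimal sketch of the per-sample baselines as I described them above: the Monte Carlo objective is computed from $K$ log importance weights, and the baseline for sample $i$ is the same estimate computed from the other $K-1$ samples only. The paper's Equation 10 may define the leave-one-out estimate slightly differently, so treat this as an illustration of the idea rather than the exact estimator; all names are mine.

```python
import numpy as np

def log_mean_exp(v):
    m = np.max(v)
    return m + np.log(np.mean(np.exp(v - m)))

def monte_carlo_objective(log_w):
    """log (1/K) sum_i w_i, where log_w[i] = log P(x, h_i) - log Q(h_i | x)."""
    return log_mean_exp(log_w)

def leave_one_out_baselines(log_w):
    """Baseline for sample i: the same estimate, computed without sample i."""
    return np.array([log_mean_exp(np.delete(log_w, i)) for i in range(len(log_w))])

def learning_signals(log_w):
    """Per-sample signals that would multiply grad log Q(h_i | x)."""
    return monte_carlo_objective(log_w) - leave_one_out_baselines(log_w)

log_w = np.log(np.array([1.0, 3.0]))
print(monte_carlo_objective(log_w))   # log(2)
print(learning_signals(log_w))        # [log(2) - log(3), log(2)]
```

Because the baseline for sample $i$ does not depend on that sample, subtracting it leaves the expected gradient unchanged, while giving each sample a signal that reflects how much it individually improved the estimate.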
[link]
This paper presents the theoretical notion of ensemble robustness and how it might provide an explanation for the success of deep learning algorithms. This work is an extension of some of the authors' previous work (see Definition 2), demonstrating a theoretical relationship between a notion of robustness to adversarial examples and generalization performance. One initial observation made in this work is that this previous notion of robustness cannot explain the good performance of deep neural networks, since they have been shown to in fact not be robust to adversarial examples. So in this paper, the authors propose to study a notion of ensemble robustness (see Definition 3), and show that it can also be linked to generalization performance (see Theorem 1 and Corollary 1). The "ensemble" part comes from taking into account the stochasticity of the learning algorithm, i.e. the fact that the models it produces can vary from one run to another, even if applied on the same training set. The stochasticity here can come from the use of dropout, of SGD with random ordering of the training examples or from the random parameter initialization. Other theoretical results are also presented, such as one relating the variance of the robustness to generalization performance and another specific to the use of dropout. Finally, the paper also proposes a semi-supervised learning algorithm inspired by their definition of ensemble robustness, in which a model is trained to classify the perturbed (adversarial) version of an example in the same class as the original (non-perturbed) example. On MNIST, they achieve excellent results, matching the performance of the state-of-the-art Ladder Networks.
[link]
This paper presents a conditional generative model of text, where text can be generated either one character at a time or by copying whole chunks of characters directly from the input into the output. At each step of the generation, the model can decide which of these two modes of generation to use, mixing them as needed to generate a correct output. They refer to this structure for generation as Latent Predictor Networks. The character-level generation part of the model is based on a simple output softmax over characters, while the generation-by-copy component is based on a Pointer Network architecture \cite{conf/nips/VinyalsFJ15}. Critically, the authors highlight that it is possible to marginalize over the use of either type of component by dynamic programming, as used in semi-Markov models \cite{conf/nips/SarawagiC04}. One motivating application is machine translation, where the input might contain some named entities that should just be directly copied to the output. However, the authors experiment on a different problem, that of generating code that would implement the action of a card in the trading card games Magic: The Gathering and Hearthstone. In this application, copying is useful to do things such as copy the name of the card or its numerically-valued effects. In addition to the Latent Predictor Network structure, the proposed model for this application includes a slightly adapted form of soft attention as well as character-aware word embeddings as in \cite{conf/emnlp/LingDBTFAML15}. Also, the authors experiment with a compression procedure on the target programs, which can help in reducing the size of the output space. Experiments show that the proposed neural network approach outperforms a variety of strong baselines (including systems based on machine translation or information retrieval). 
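To make the marginalization over generation modes concrete, here is a minimal sketch of a semi-Markov-style forward recursion. It is not the authors' implementation: the function name `marginal_likelihood`, the fixed probability tables and the single mode-selection probability `p_copy_mode` are all assumptions made for illustration. In the actual model these probabilities would come from a neural network conditioned on the input and on the text generated so far.

```python
def marginal_likelihood(target, chunks, p_copy_mode, char_probs, chunk_probs):
    """Sum the probability of `target` over all generate/copy segmentations.

    Hypothetical toy setting: probabilities are fixed tables rather than
    network outputs conditioned on the generated prefix.
    """
    T = len(target)
    # alpha[j] = total probability of producing target[:j] by any mix of modes
    alpha = [0.0] * (T + 1)
    alpha[0] = 1.0
    for i in range(T):
        if alpha[i] == 0.0:
            continue
        # Mode 1: generate a single character from the softmax
        p_char = char_probs.get(target[i], 0.0)
        alpha[i + 1] += alpha[i] * (1.0 - p_copy_mode) * p_char
        # Mode 2: copy a whole chunk from the input, if it matches at position i
        for chunk in chunks:
            if chunk and target.startswith(chunk, i):
                alpha[i + len(chunk)] += alpha[i] * p_copy_mode * chunk_probs[chunk]
    return alpha[T]


# "ab" can be produced by copying the chunk "ab" or by emitting "a" then "b"
print(marginal_likelihood("ab", ["ab"], 0.5, {"a": 0.5, "b": 0.5}, {"ab": 1.0}))  # 0.5625
```

The recursion is linear in the target length (times the number of candidate chunks), whereas enumerating every segmentation explicitly would be exponential. This only works because the probability of the next segment depends on the text produced so far and not on how it was segmented.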
[link]
This paper proposes a neural architecture that allows gradients to be backpropagated through a procedure that can go through a variable and adaptive number of iterations. These "iterations" could for instance be the number of times computations are passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper. This is essentially achieved by pooling the recurrent states and respective outputs computed by each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units, which are sigmoidal units computed at each iteration and which give a soft weight on whether the computation should stop at the current iteration. Crucially, the paper introduces a new ponder cost $P(x)$, which is a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below). The paper presents experiments on RNNs applied to sequences where, at each time step $t$ in the sequence (not to be confused with what I'm calling computation iterations, which are indexed by $n$), the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled, to produce a single recurrent state and output for time step $t$. During each of the $N(t)$ iterations at time step $t$, the intermediate states are connected to the same input, that of time step $t$. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves to the next time step $t+1$. To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration of each time step. Results are presented on a variety of synthetic problems and a character prediction problem. 
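The halting and pooling step for a single time step can be sketched as follows. This is a minimal illustration under my reading of the paper's halting rule: iterate until the cumulative halting value reaches $1-\epsilon$, give the last iteration the remainder $R(t)$ as its weight, and charge $N(t) + R(t)$ as that step's ponder contribution. The function name `act_pool`, the scalar outputs and the default `eps` are assumptions for illustration. In the real model the outputs and states are vectors, and each halting value is computed from the intermediate state of its iteration.

```python
def act_pool(halting_probs, outputs, eps=0.01):
    """Pool per-iteration outputs of one time step with halting weights.

    halting_probs: sigmoid halting values h_1..h_M, one per iteration
                   (M acts as a cap on the number of iterations).
    outputs:       scalar outputs y_1..y_M of the same iterations.
    Returns (pooled_output, weights, ponder_contribution).
    """
    if not halting_probs:
        raise ValueError("need at least one iteration")
    total = 0.0
    weights = []
    remainder = 1.0
    for n, h in enumerate(halting_probs):
        is_last = n == len(halting_probs) - 1
        if total + h >= 1.0 - eps or is_last:
            # Final iteration: its weight is whatever probability mass is left
            remainder = 1.0 - total
            weights.append(remainder)
            break
        weights.append(h)
        total += h
    n_iters = len(weights)  # N(t)
    pooled = sum(w * y for w, y in zip(weights, outputs))
    ponder = n_iters + remainder  # differentiable through the remainder
    return pooled, weights, ponder


pooled, weights, ponder = act_pool([0.25, 0.25, 0.75, 0.9], [1.0, 2.0, 4.0, 8.0])
print(weights, pooled, ponder)  # [0.25, 0.25, 0.5] 2.75 3.5
```

The weights always sum to 1, so the pooled output is a convex combination of the intermediate outputs. The count $N(t)$ itself is piecewise constant, so any gradient from the ponder term reaches the halting units only through the remainder.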