[link]
If you read modern (that is, 20182020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of lineage of the technique within this domain. This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, messagepassing networks to the state of the art standard at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach, and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible and super version of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity. Circular Fingerprinting (or, more properly, ExtendedConnectivity Circular Fingerprints), is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by: 1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with bonded neighbors quasirandomly) 2) Calculating nextlevel features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer 3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2) 4) Iterating this process at progressively higher layers, using the hash The effect of this is to assign each index of the vector to an binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a structure within a given atom. Its main downside is that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lowerlevel feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying. https://i.imgur.com/V8FpfVE.png Duvenaud et al present their alternative in terms of keeping a similar structure, but swapping out fixed and binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an orderinvariantoperation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors . This will no longer look for exact matches, but will activate to the extent a structure within an atom matches against a kernel pattern. Then, these features are put into a softmax, which instead setting an index of a vector to a sharp one value, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations for each atom. The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between atoms according to either the circular or neural footprint (which had a high correlation), and confirming that that performs similarly to circular fingerprints on a set of supervised learning tasks on modules. When they trained weights to be better than random on three such supervised tasks, they found that their model was comparable or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error) This really is the simplest possible version of a messagepassing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighborconnection according to the features of each node, etc), but it's really satisfying to see it laid out as a nextstep alternative that offered value just by stepping away from exactmatch feature dynamics and nonrandom functions, even without all the sophisticated additions that would later be added to such models.
2 Comments

[link]
This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and deal with stochastic environment models. Value gradients are a type of policy gradient algorithm which represent a value function either by: * A learned Qfunction (a critic) * Linking together a policy, an environment model and reward function to define a recursive function to simulate the trajectory and the total return from a given state. By backpropagating though these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (like REINFORCE for example) which are modelfree and sample returns from the real environment. Applying value gradients to stochastic problems requires differentiating the stochastic bellman equation: \begin{equation} V ^t (s) = \int \left[ r^t + γ \int V^{t+1} (s) p(s'  s, a) ds' \right] p(as; θ) da \end{equation} To do that, the authors use a trick called reparameterisation to express the stochastic bellman equation as a deterministic function which takes a noise variable as an input. To differentiate a reparameterised function, one simply samples the noise variable then computes the derivative as if the function were deterministic. This can then be repeated $ M $ times and averaged to arrive at a Monte Carlo estimate for the derivative of the stochastic function. The reparameterised bellman equation is: $ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $ It's derivative with respect to the current state and the policy parameters is: $ V_s = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{s} + r_\textbf{a} \pi_\textbf{s} + \gamma \mathbb{E}_{\rho(\xi)} V'_{s'} (\textbf{f}_\textbf{s} + \textbf{f}_\textbf{a} \pi_\textbf{s}) \] $ $ V_\theta = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \[ V'_{\textbf{s'}} \textbf{f}_\textbf{a} \pi_\textbf{s} + V'_\theta\] \] $ Based on these relationships the authors define two algorithms; SVG(∞), SVG(1) * SVG(∞) takes the trajectory from an entire episode and starting at the terminal state accumulates a gradients $V_{\textbf{s}} $ and $ V_{\theta} $ using the expressions above to arrive at a policy gradient. SVG(∞) is onpolicy and only works with finitehorizon environments * SVG(1) trains a value function then uses its gradient as an estimate for $ V_{\textbf{s}} $ above. SVG(1) also uses importance weighting so as to be offpolicy and can work with infinitehorizon environments. Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0) which is a similar to SVG(1), but is modelfree. SVG was analysed using several MuJoCo environments and it was found that: * SVG(∞) outperformed a BBPT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments * SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞) * SVG(1) was able to solve several complex environments 
[link]
This paper introduces an endtoend trainable neural model capable of performing analogical reasoning in image representations followed by decoding back to image space. Specifically, given a 4tuple A:B::C:D, the task is to apply the transformation A:B to C. The motivation is clear — humans are excellent at generalizing to hypothetical transformations about images ("what if this chair were rotated 30 degrees clockwise?").  The objective function follows directly from vector addition: $MSE(d  g(f(b)  f(a) + f(c)))$ where $f$ and $g$ are convolutional neural networks.  In case of rotation, a purely additive transformation is not optimal because repeated application of this transformation to the same query image will never return to the original point. Instead, multiplicative interactions or MLPs are used to condition the transformation on $c$ as well.  Analogymaking is also performed on disentangled representations, which separate factors of variation to separate coordinates and are learnt from distinct images $a,b, c$ such that the objective is $MSE(c  g(s . f(a) + (1s) . f(b)))$ where $s$ are switch variables to disentangle features. Disentangled image features allow the analogymaking model to traverse the manifold of a given factor or subset of factors.  Experiments on transforming shapes, generating 2D video game sprites and 3D car renderings. ## Strengths  Neat idea, wellpresented 
[link]
#### Introduction This [paper](https://github.com/lucastheis/ride) introduces *recurrent image density estimator* (RIDE), a generative model by combining a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural image. In this work, the authors used *spatial* LSTMs (SLSTM) to capture the semantics in the form of hidden states where these hidden vectors are then fed into a factorized *factorized mixtures of conditional Gaussian scale mixtures* (MCGSMs) to predict the state of the corresponding pixels. ##### __1. Spatial long shortterm memory (SLSTM)__ This is a straightforward extension of the multidimensional RNN in order to capture long range interaction. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ be the intensity of pixel at location ${ij}$. At each location $ij$, each LSTM unit perform the following operations: $\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i1,j} \odot \mathbf{f}^r_{ij} $ $\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$ $\begin{pmatrix} \mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{g}_{ij}\\ \mathbf{f}_{ij}^r\\ \mathbf{f}_{ij}^c \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma\\ \sigma \end{pmatrix} T_{\mathbf{A,b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j1} \\ \mathbf{h}_{i1,j} \end{pmatrix} $ where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are memory units and hidden units respectively. Note that, there are 2 different forget gates $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$ for the 2 preceding memory states $\mathbf{c}_{i,j1}$ and $\mathbf{c}_{i1,j}$. Also note that $\mathbf{x}_{<ij}$ here denotes a set of *causal neighborhood* by applying Markov assumption. ![ride_1](http://i.imgur.com/W8ugGvl.png) As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections. ##### __2. Factorized mixtures of conditional Gaussian scale mixtures__ A generative model can usually be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}\mathbf{x}_{<ij}; \mathbf{\theta})$ using chain rule. One way to improve the representational power of a model is to introduce different sets of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying shared parameters will lead to drastic increase of parameters. Therefore, the author applied 2 simple common used assumptions: 1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to small neighborhood around $x_{ij}$ (causal neighborhood) 2. __Stationary and shift invariance__: the same set of $\mathbf{\theta}_{ij}$ is used for every location ${ij}$ which corresponds to recurrent structure in RNN. Therefore, the hidden vector from SLSTMs can be fed into the MCGSM to predict the state of corresponding label, i.e. $p(x_{ij}  \textbf{x}_{<ij}) = p(x_{ij}  \textbf{h}_{ij})$. The conditional distribution distribution in MCGSM is represented as a mixture of experts: $p(x_{ij}  \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s  \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij}  \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$. where the first and second term correspond to gate and experts respectively. To further reduce the number of parameters, the authors proposed using a *factorized* MCGSM in order to use larger neighborhoods and more mixture components. (*__Remarks__: I am not too sure about the exact training of MCGSM, but as far as I understand, the MCGSM is firstly trained endtoend with SLSTM using SGD with momentum and then finetuned using LBFGS after each epoch by fixing the parameters of SLSTM.*) * For training: ``` for n in range(num_epochs): for b in range(0, inputs.shape[0]  batch_size + 1, batch_size): # compute gradients f, df = f_df(params, b) loss.append(f / log(2.) / self.num_channels) # update SLSTM parameters for l in train_layers: for key in params['slstm'][l]: diff['slstm'][l][key] = momentum * diff['slstm'][l][key]  df['slstm'][l][key] params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key] # update MCGSM parameters diff['mcgsm'] = momentum * diff['mcgsm']  df['mcgsm'] params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm'] ``` * Finetuning (part of the code) ``` for l in range(self.num_layers): self.slstm[l] = SLSTM( num_rows=hiddens.shape[1], num_cols=hiddens.shape[2], num_channels=hiddens.shape[3], num_hiddens=self.num_hiddens, batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]), nonlinearity=self.nonlinearity, extended=self.extended, slstm=self.slstm[l], verbosity=self.verbosity) hiddens = self.slstm[l].forward(hiddens) # finetune with early stopping based on validation performance return self.mcgsm.train( hiddens_train, outputs_train, hiddens_valid, outputs_valid, parameters={ 'verbosity': self.verbosity, 'train_means': train_means, 'max_iter': max_iter}) ``` 
[link]
Facebook has [released a series of papers](https://research.facebook.com/blog/learningtosegment/) for object segmentation and detection. This paper is the first in that series. This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)): 1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm. 2. A CNN classifier is applied on each of the proposals. The current paper improves the step 1, i.e., region/object proposals. Most object proposals approaches fall into three categories: * Objectness scoring * Seed Segmentation * Superpixel Merging Current method is different from these three. It share similarities with [Faster RCNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object. ## Model and Training Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGGA pretrained model. Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints: * the patch contains an object roughly centered in the input patch * the object is fully contained in the patch and in a given scale range Note that the network must output a mask for a single object at the center even when multiple objects are present. Figure 1 shows the architecture and sampling for training. ![figure1](https://i.imgur.com/zSyP0ij.png) Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation. ## Inference During full image inference, model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN). ![figure2](https://i.imgur.com/dQWfy8R.png) This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation. 
[link]
This post is a comment on the Laplacian pyramidbased generative model proposed by researchers from NYU/Facebook AI Research. Let me start by saying that I really like this model, and I think  looking at the samples drawn  it represents a nice big step towards convincing generative models of natural images. To summarise the model, the authors use the Laplacian pyramid representation of images, where you recursively decompose the image to a lower resolution subsampled component and the highfrequency residual. The reason this decomposition is favoured in image processing is the fact that the highfrequency residuals tend to be very sparse, so they are relatively easy to compress and encode. In this paper the authors propose using convolutional neural networks at each layer of the Laplacian pyramid representation to generate an image sequentially, increasing the resolution at each step. The convnet at each layer is conditioned on the lower resolution image, and some noise component $z_k$, and generates a random higher resolution image. The process continues recursively until the desired resilution is reached. For training they use the adversarial objective function. Below is the main figure that explains how the generative model works, I encourage everyone to have a look at the paper for more details: ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150723at111517.png) #### An argument about Conditional Entropies What I think is weird about the model is the precise amount of noise that is injected at each layer/resolution. In the schematic above, these are the $z_k$ variables. Adding the noise is crucial to defining a probabilistic generative process; this is how it defines a probability distribution. I think it's useful to think about entropies of natural images at different resolutions. When doing generative modelling or unsuperised learning, we want to capture the distribution of data. One important aspect of a probability distribution is its entropy, which measures the variability of the random quantity. In this case, we want to describe the statistics of the full resolution observed natural image $I_0$. (I borrow the authors' notation where $I_0$ represents the highest resolution image, and $I_k$ represent the $k$times subsampled version. Using the Laplacian pyramid representation, we can decompose the entropy of an image in the following way: $$\mathbb{H}[I_0] = \mathbb{H}[I_{K}] + \sum_{k=0}^{K1} \mathbb{H}[I_k\vert I_{k+1}].$$ The reason why the above decomposition holds is very simple. Because $I_{k+1}$ is a deterministic function of $I_{k}$ (subsampling), the conditional entropy $\mathbb{H}[I_{k+1}\vert I_{k}] = 0$. Therefore the joint entropy of the two variables is simply the entropy of the higher resolution image $I_{k}$, that is $\mathbb{H}[I_{k},I_{k+1}] = \mathbb{H}[I_{k}] + \mathbb{H}[I_{k+1}\vert I_{k}] = \mathbb{H}[I_{k}]$. So by induction, the join entropy of all images $I_{k}$ is just the marginal entropy of the highest resolution image $I_0$. Applying the chain rule for joint entropies we get the expression above. Now, the interesting bit is how the conditional entropies $\mathbb{H}[I_k\vert I_{k+1}]$ are 'achieved' in the Laplacian pyramid generative model paper. These entropies are provided by the injected random noise variables $z_k$. By the information processing lemma $\mathbb{H}[I_k\vert I_{k+1}] \leq \mathbb{H}[z_k]$. The authors choose $z_k$ to be uniform random variables whose dimensionality grows with the resolution of $I_k$. To quote them "The noise input $z_k$ to $G_k$ is presented as a 4th color plane to lowpass $l_k$, hence its dimensionality varies with the pyramid level." Therefore $\mathbb{H}[z_k] \propto 4^{k}$, assuming that the pixel count quadruples at each layer. So the conditional entropy $\mathbb{H}[I_k\vert I_{k+1}]$ is allowed to grow exponentially with resolution, at the same rate it would grow if the images contained pure white noise. In their model, they allow the perpixel conditional entropy $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ to be constant across resolutions. To me, this seems undesirable. My intuition is, for natural images, $\mathbb{H}[I_k\vert I_{k+1}]$ may grow as $k$ decreases (because the dimensionality gorws), but the perpixel value $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ should decrease or converge to $0$ as the resolution increases. Very low lowresolution subsampled natural images behave a little bit like white noise, there is a lot of variability in them. But as you increase the resolution, the probability distribution of the highres image given the lowres image will become a lot sharper. In terms of model capacity, this is not a problem, inasmuch as the convolutional models $G_{k}$ can choose to ignore some variance in $z_k$ and learn a more deterministic superresolution process. However, adding unnecessarily high entropy will almost certainly make the fitting of such model harder. For example, the adversarial training process relies on sampling from $z_k$, and the procedure is pretty sensitive to sampling noise. If you make the distribution of $z_k$ unneccessarily high entropy, you will end up doing a lot of extra work during training until the network figures out to ignore the extra variance. To solve this problem, I propose to keep the entropy of the noise vectors constant, or make them grow sublinearly with the number of pixels in the image. This mperhaps akes the generative convnets harder to implement. Another quick solution would be to introduce dependence between components of $z_k$ via a lowrank covariance matrix, or some sort of a hashing trick. #### Adversarial training vs superresolution autoencoders Another weird thing is that the adversarial objective function forgets the identity of the image. For example, you would want your model so that `"if at the previous layer you have a lowresolution parrot, the next layer should be a higherresolution parrot"` Instead, what you get with the adversarial objective is `"if at the previous layer you have a lowresolution parrot, the next layer should output a higherdimensional image that looks like a plausible natural image"` So, there is nothing in the objective function that enforces dependency between subsequent layers of the pyramid. I think if you made $G_k$ very complex, it could just learn to model natural images by itself, so that $I_{k}$ is in essence independent of $I_{k+1}$ and is purely driven by the noise $z_{k}$. You could sidestep this problem by restricting the complexity of the generative nets, or, again, to restrict the entropy of the noise. Overall, I think the approach would benefit from a combination of the adversarial and a supervised (superresolution autoencoder) objective function. 
[link]
I think this paper has two main ideas in there, I see them as independent, for reasons explained below:  A new penalty function that aims at regularising the second derivative of the trajectory the latent representation traces over time. I see this as a generalisation of the slowness principle or temporal constancy, more about this in the next section.  A new autoencoderlike method to predict future frames in video. Video is really hard to forwardpredict with nonprobabilistic models because high level aspects of video are genuinely uncertain. For example, in a football game, you can't really predict whether the ball will hit the goalpost, but the results might look completely different visually. This, combined with L2 penalties often results in overly conservative, blurry predictions. The paper improves things by introducing extra hidden variables, that allow the model to represent uncertainty in its predictions. More on this later. #### Inductive bias: penalising curvature The key idea of this paper is to learn good distributed representations of natural images from video in an unsupervised way. Intuitively, there is a lot of information contained in video, which is lost if you scramble the video and look at statistics individual frames only. The race is on to develop the right kind of prior and inductive bias that helps us fully exploit this temporal information. This paper presents a way, which is called learning to linearise (I'm going to call this L2L). Naturally occurring images are thought to reside on some complex, nonlinear manifold whose intrinsic dimension is substantially lower than the number of pixels in an image. It is then natural to think about video as a journey on this manifold surface, along some smooth path. Therefore, if we aim to learn good generic features that correspond to coordinates on this underlying manifold, we should expect that these features vary in a smooth fashion over time as you play the video. L2L uses this intuition to motivate their choice of an objective function that penalises a scaleinvariant measure of curvature over time. In a way it tries to recover features that transform nearly linearly as time progresses and the video is played. In their notations, $x_{t}$ denotes the data in frame $t$, which is transformed by a deep network to obtain the latent representation $z_{t}$. The penalty for the latent representation is as follows. $$\sum_{t} \frac{(z_t  z_{t1})^{T}(z_{t+1}  z_{t})}{\z_t  z_{t1}\\z_{t+1}  z_{t}\}$$ The expression above has a geometric meaning as the cosine of the angle between the vectors $(z_t  z_{t1})$ and $(z_{t+1}  z_{t})$. The penalty is minimised if these two vectors are parallel and point in the same direction. In other words the penalty prefers when the latent feature representation keeps its momentum and continues along a linear path  and it does not like sharp turns or jumps. This seems like a sensible prior assumption to build on. L2L is very similar to another popular inductive bias used in slow feature analysis: the temporal slowness principle. According to this principle, the most relevant underlying features don't change very quickly. The slowness principle has a long history both in machine learning and as a model of human visual perception. In SFA one would minimise the following penalty on the latent representation: $$\sum_{t} (z_t  z_{t1})^{2},$$ where the square is applied componentwise. There are additional constraints in SFA, more about this later. We can understand the connection between SFA and this paper's penalty if we plot the penalty for a single hidden feature $z_{t,f}$ at time $t$, keeping all other features and values at neighbouring timesteps constant. This is plotted in the figure below (scaled and translated so the objectives line up nicely). ![](http://www.inference.vc/content/images/2015/09/RfX4Dp2Y3YAAAAASUVORK5CYII.png) As you can see, both objectives have a minimum at the same location: they both try to force $z_{t,f}$ to linearly interpolate between the neighbouring timesteps. However, while SFA has a quadratic penalty, the learning to linearise objective tapers off at long distances. Compare this to Tukey's loss function used in outlierresistant robust regression. Based on this, my prediction is that compared to SFA, this loss function is more tolerant of outliers, which in the temporal domain would mean abrupt jumps in the latent representation. So while SFA is equivalent to assuming that the latent features follow a Brownianmotionlike Ornstein–Uhlenbeck process, I'd imagine this prior corresponds to something like a jump diffusion process (although I don't think the analogy holds mathematically). Which one of these inductive biases/priors are better at exploiting temporal information in natural video? Slow Brownian motion, or nearlylinear trajectories with potentially a few jumps Unfortunately, don't expect any empirical answer to that from the paper. All experiments seem to be performed on artificially constructed examples, where the temporal information is synthetically engineered. Nor there is any real comparison to SFA. #### Representing predictive uncertainty with auxillary variables While the encoder network learns to construct smoothly varrying features $z_t$, the model also has a decoder network that tries to reconstruct $x_t$ and predict subsequent frames. This, the authors agree, is necessary in order for $z_t$ to contain enough relevant information about the frame $x_t$ (more about whether or not this is necessary later). The precise way this decoding is done has a novel idea as well: minimising over auxillary variables. Let's say our task is to predict a future frame $x_{t+k}$ based on the latent representation $z_{t}$. The problem is, this is a very hard problem. In video, just like in real life, anything can happen. Imagine you're modelling soccer footage, and the ball is about to hit the goalpost. In order to predict the next frames, not only do we have to know about natural image statistics, we also have to be able to predict whether the goal is in or not. An optimal predictive model would give a highly multimodal probability distribution as its answer. If you use the L2 loss with a deterministic feedforward predictive network, it's likely to come up with a very blurry image, which would correspont to the average of this nasty multimodal distribution. This calls for something better, either a smarter objective function, or a better way of representing predictive uncertainty. The solution the authors gave is to introduce hidden variables $\delta_{t}$, that the decoder network also receives as input in addition to $z_t$. For each frame, $\delta_t$ is optimised so that only the best possible reconstruction is taken into account in the loss function. Thus, the decoder network is allowed to use $\delta$ as a source of nondeterminism to hedge its bets as to what the contents of the next frame will be. This is one step closer to the ideal setting where the decoder network is allowed to give a full probability distribution of possibilities and then is evaluated using a strictly proper scoring rule. This inner loop minimisation (of $\delta$) looks very tedious, and introduces a few more parameters that may be hard to set. The algorithm is reminiscent of the Estep in expectationmaximisation, and also very similar to the iterated closest point algorithm Andrew Fitzgibbon talked about in his tutorial at BMVC this year. In his tutorial, Andrew gave examples where jointly optimising model parameters and auxiliary variables ($\delta$) is advantageous, and I think the same logic applies here. Instead of the inner loop, simultaneous optimisation helps fixing some pathologies, like slow convergence near the optimum. In addition, Andrew advocates exploiting the sparsity structure of the Hessian to implement efficient secondorder gradientbased optimisation methods. These tricks are explained in paragraphs around equation 8 in (Prasad et al, 2010). #### Predictive model: Is it necessary? On a more fundamental level, I question whether the predictive decoder network is really a necessary addition to make L2L work. The authors observe that the objective function is minimised by the "trivial" solutions $z_{t} = at + b$, where $a,b$ can be arbitrary constants. They then say that in order to make sure features do something more than just discover some of these trivial solutions, we also have to include a decoder network, that uses $z_t$ to predict future frames. I believe this is not necessary at all. Because $z_t$ is a deterministic function of $x_t$, and $t$ is not accessible to $z_{t}$ in any other way than through inferring it from $x_t$, as long as $a\neq 0$, the linear solutions are not trivial at all. If the network discovers $z_{t} = at, a\neq 0$, you should in fact be very happy (assuming a single feature). The only problems with trivial solutions occur when $z_{t} = b$ ($z$ doesn't depend on the data at all) or when $z$ is multidimensional and several redundant features are sensitive to exactly the same thing. These trivial solutions could be avoided the same way they are avoided in SFA, by constraining the overall spatial covariance of $z_{t}$ over the videoclip to be $I$. This would force each feature to vary at least a little bit with data hence avoiding the trivial constant solutions. It would also force features to be linearly decorrelated  solving the redundant features problem. So I wonder if the decoder network is indeed a necessary addition to the model. I would love to encourage the authors to implement their new hypothesis of a prior both with and without the decoder. They may already have tried it without and found it really didn’t work, so it might just be a matter of including those results. This would in turn allow us to see SFA and L2L sidebyside, and learn something about whether and why their prior is better than the sl 
[link]
This paper addresses the task of imagebased Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions. The paper is addressing an important and interesting new topic which has seen recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, wellwritten, and wellorganized. They achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. deep NN ones. However, there are several weaknesses: their main model is not very different from existing work on imageQ&A (Malinowski2015, who also had a VIS+LSTM style model (but they were also jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2way image input helps). Also, as the authors themselves discuss, the dataset in its current form, synthetically created from captions, is a good start but is quite conservative and limited, being singleword answers, and the transformation rules only designed for certain simple syntactic cases. It is exploration work and will benefit a lot from a bit more progress in terms of new models and a slightly more broad dataset (at least with answers up to 23 words). Regarding new models, e.g., attentionbased models are very relevant and intuitive here (and the paper would be much more complete with this), since these models should learn to focus on the right area of the image to answer the given question and it would be very interesting to analyze the results of whether this focusing happens correctly. Before attention models, since 2way image input helped (actually, it would be good to ablate 2way versus bidirectionality in the 2VIS+BLSTM model), it would be good to also show the model version that feeds the image vector at every time step of the question. Also, it would be useful to have a nearest neighbor baseline as in Devlin et al., 2015, given their discussion of COCO's properties. Here too, one could imagine copying answers of training questions, for cases where the captions are very similar. Regarding a broaderscope dataset, the issue with the current approach is that it is too similar to the captioning approach or task, which has the drawback that a major motivation to move to imageQ&A is to move away from single, vague (nonspecific), generic, oneeventfocused captions to a more complex and detailed understanding of and reasoning over the image; which doesn't happen with this paper's current dataset creation approach, and so this will also not encourage thinking of very different models to handle imageQ&A, since the best captioning models will continue to work well here. Also, having 23 word answers will capture more realistic and more diverse scenarios; and though it is true that evaluation is harder, one can start with existing metrics like BLEU, METEOR, CIDEr, and human eval. And since these will not be full sentences but just 23 word phrases, such existing metrics will be much more robust and stable already. Originality: The task of imageQ&A is very recent with only a couple of prior and concurrent work, and the dataset creation procedure, despite its limitations (discussed above) is novel. The models are mostly not novel, being very similar to Malinowski2015, but the authors add bidirectionality and 2way image input (but then Malinowski2015 was jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers). Significance: As discussed above, the paper show useful results and ablations on the important, recent task of imageQ&A, based on 2 datasets  an existing small dataset and a new large dataset; however, the second, new dataset is synthetically created by ruletransforming captions and only to singleword answers, thus keeping the impact of the dataset limited, because it keeps the task too similar to the generic captioning task and because there is no generation of answers or prediction of multiword answers. 
[link]
The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winnertakeall kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm. The authors present a fully connected autoencoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the minibatch, they rank the activation values, keeping only the topk% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold). Their encoder is a deep stack of ReLu and the decoder is shallow and linear (note that usually nonsymmetric autoencoders lead to worse results). They also show how to apply to RBM. The effect of sparsity is very effective and noticeable on the images depicting the filters. They extend this autoencoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before. The proposed approach exploits on a mechanism close to the one of ksparse autoencoders proposed by Makkhzani et al [14]. The authors extend the idea from [14] to build winnertakeall encoders (and RBMs), that enforce both spatial and lifetime regularization by keeping only a percentage (the biggest) of activations. The lifetime sparsity allows overcoming problems that could arise with ksparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with larger images than e.g. those of mnist. 
[link]
This paper presents an endtoend version of memory networks (Weston et al., 2015) such that the model doesn't train on the intermediate 'supporting facts' strong supervision of which input sentences are the best memory accesses, making it much more realistic. They also have multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and achieves strong results. The paper is a useful extension of memNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined pretty cleanly and simply. The related work section is quite wellwritten, detailing the various similarities and differences with multiple streams of related work. The discussion about the model's connection to RNNs is also useful. 
[link]
This paper extends the stochastic optimization algorithm SVRG proposed in recent years. These modifications mainly includes: the convergence analysis of SVRG with corrupted full gradient; Mix the iteration of SGD and SVRG; the strategy of minibatch; Using support vectors etc. For each modification, the author makes clear proofs and achieves linear convergence under smooth and strong convex assumptions. However, this paper's novelty is not big enough. The improvement of convergence rate is not obvious and the proof outline is very similar to the original SVRG. The key problem such as the support for nonstrongly convex loss is still unsolved. This paper starts with a key proposition showing that SVRG does not require a very accurate approximation of the total gradient of the objective function needed by SVRG algorithm. The authors use this proposition to derive a batching SVRG algorithm with the same convergence rate as that of original SVRG. Then, the authors propose a mixed stochastic gradient/SVRG approach and give a convergence proof for such a scheme. As a different approach of speeding up, the authors proposed a speedup technique for Huberized hingeloss support vector machine. 
[link]
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes resampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and nonrigid deformation whose paramerters are trained endtoend with the rest of the model. The resulting resampling grid is then used to create a new representation of the underlying signal through bilinear or nearest neighbor interpolation. This has interesting implications: the network can learn to colocate objects in a set of images that all contain the same object, the transformation parameter localize the attention area explicitly, fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous stateoftheart on a number of tasks. The layer has one mini neural network that regresses on the parameters of a parametric transformation, e.g. affine), then there is a module that applies the transformation to a regular grid and a third more or less "reads off" the values in the transformed positions and maps them to a regular grid, hence underforming the image or previous layer. Gradients for backpropagation in a few cases are derived. The results are mostly of the classic deep learning variety, including mnist and svhn, but there is also the finegrained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases. 
[link]
This paper addresses the problem of inverse reinforcement learning when the agent can change it's objective during the recording of trajectories. This results in a transition between several reward functions that explain only locally the trajectory of the observed agent. Transition probabilities between reward functions are unknown. The author propose a cascade of an EM and Viterbi algorithms to discover the reward functions and the segments on which they are valid. Their algorithm consists in maximizing the loglikelihood of the expert's demonstrated trajectories depending on some parameters which are the original distributions of states and rewards, the local rewards and the transition function between rewards. To do so, they use the expectationmaximisation (EM) method. Then, via the Viterbi algorithm, they are able to partition the trajectories into segments with local consistent rewards. Strengths of the paper: 1. The authors leverage existing and classical methods from the machine learning and optimization fields such as EM, Viterbi, Value iteration and gradient ascent in order to build their algorithm. This will allow the community to easily reproduce their results. 2. The experiments are conducted on synthetic and realworld data. They compare their method to MLIRL which does not use locally consistent rewards and which is the canonical choice to compare to as their algorithm is a generalization of MLIRL. The results presented show the superiority of their method over MLIRL. 3. The idea presented by the authors is original as far as I know. Weaknesses of the paper: 1. The paper is very dense ( the figures are incorporated in the text) which makes the reading difficult. 2. The algorithm proposed needs the knowledge of the dynamics and the number of rewards. The authors, as future works, plan to extend their algorithm to unknown number of rewards, however they do not mention to get rid off the knowledge of the dynamics. Could the authors comment on that as some IRL algorithms do not need a perfect knowledge of the dynamics? 3. The method needs to solve iteratively MDPs when learning the reward functions. For each theta in the gradient ascent a MDP needs to be solved. Is this prohibitive for huge MDPs? Is there a way to avoid that step? The actionvalue function Q is defined via a softmax operator in order to have a derivable policy, does it allow to solve more efficiently the MDP? 4. The authors are using gradient ascent in the EM method, could they comment on the concavity of their criteria? 5. In the experiments (gridworlds), the number of features for the states is very small and thus it is understandable that a reward which is linear on the features will perform badly. Do the authors consider comparing their method to an IRL method where the number of features defining the states is greater? This is the main problem that I have with the experiments, the features used are not expressive enough to consider using a classical IRL method and this can explain why MLIRL performs badly and that its performance does not improve when the number of expert trajectories grows. 6. The performance is measured by the average loglikelihood of the expert's demonstrated trajectories which is the criterion maximized by the algorithm. I think that a more pertinent measure would be the value function of the policy produced by the optimization of the reward obtained by the algorithm. Could the authors comment on that and explain why their performance metric is more appropriate? 
[link]
This paper deals with the formal question of machine reading. It proposes a novel methodology for automatic dataset building for machine reading model evaluation. To do so, the authors leverage on news resources that are equipped with a summary to generate a large number of questions about articles by replacing the named entities of it. Furthermore a attention enhanced LSTM inspired reading model is proposed and evaluated. The paper is wellwritten and clear, the originality seems to lie on two aspects. First, an original methodology of question answering dataset creation, where contextqueryanswer triples are automatically extracted from news feeds. Such proposition can be considered as important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. the empirical results seem to show relevant improvement with respect to an uptodate list of machine reading models. Given the lack of an appropriate dataset, the author provides a new dataset which scraped CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next the author presents a two novel Deep longshort term memory models which perform well on the Cloze query task. 
[link]
The paper "Bandits with unobs. confounders: a causal approach" addresses the problem of bandit learning. It is assumed that in the observational setting, the player's decision is influenced by some unobserved context. If we randomize the player's decision, however, this intention is lost. The key idea is now that, using the available data from both scenarios, one can infer whether one should overrule the player's intention. Ultimately, this leads to the following strategy: observe the player's intention and then decide whether he should act accordingly or pull the other arm. The author showed that the current MAB algorithms actually attempt to maximize rewards according to the experimental distribution, which is not optimal in the confounding case, and proposed to make use of the effect of the treatment on the treated (ETT), i.e., by comparing the average payouts obtained by players for going in favor of or against their intuition. To me, the paper is interesting because it addresses the confounding issue in MAB and proposed a way to estimate some properties of the confounder (related to the casino's payout strategy in the given example) based on ETT. At first glance, one might think that the blinking light on the slot machines (B) and the drunkenness of the patron (D) could be either modified or observed in lines 153159, where we read about a hypothetical attempt to optimize reward using traditional Thompson sampling. If those factors were observable or subject to intervention  and I'd think they would be, in reality  then it would be straightforward to do better than the 30% reward rate that's given. The paper eventually makes it clear that both of these variables are unobserved and unalterable. It would help if this were explicit early in the example, or if the cover story were modified to make this aspect more intuitive. 
[link]
TLDR; The authors evaluate the use for 9layer deep CNNs on largescale data sets for text classification, operating directly on onehot encodings of characters. The architecture achieves competitive performance across datasets. #### Key Points  9 Layers, 6 conv/ppol layers, 3 affine layers. 1024dimensional input features for large model, 256dimensional input features for small model.  Authors optionally use English thesaurus for training data augmentation  Fixed input length l: 1014 characters  Simple ngram models performs very well on these data sets and beats other models and the smaller data sets (<= 500k examples). CNN wins on the larger data sets (>1M examples) #### Notes / Questions  Comparing the CNN with input restricted to 1014 characters to models that operate on words seems unfair. Also, how long is the average document? Would've liked to see some dataset statistics. The fixed input length doesn't make a lot of sense to me.  Contribution of this paper is that the architecture works without word knowledge and for any language, but at the same time the authors use a wordlevel English thesaurus to improve their performance? To be fair, the thesaurus doesn't seem to make a huge difference.  The reason this architecture requires so much data is probably because it's very deep (How many parameters?). Did the authors experiment with fewer layers? Did they perform much worse?  What about unsupervised pretraining? Can that reduce the amount of data required to achieve good performance. Currently this model doesn't seem very useful in practice as there are very few datasets of such size out there. 
[link]
TLDR; The authors apply an RNN to modeling the students knowledge. The input is an exercise question and answer (correct/incorrect), either as onehot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvement over previous models, can be used for curriculum optimization, and also discovers the latent structure in exercise concepts. #### Key Points  Two encodings tried: One hot, embedded  RNN/LSTM, 200dimensional hidden layer, output dropout, NLL.  No expert annotation for concepts or question/answers are needed  Blocking (series of exercises of same type) vs Mixing for curriculum optimization: Blocking seems to perform better  Lots of cool future direction ideas #### Question / Notes  Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? My give insight into confusing concepts. 
[link]
TLDR; The authors propose a new architecture called "Pointer Network". A Pointer Network is a seq2seq architecture with attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies based on input sequence length, a Pointer Network can generalize to variablelength inputs. The attention method trough which this is achieved is O(n^2), and only a sight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well these, and generalizes to sequences longer than those found in the training data. #### Key Points  Similar to standard attention, but don't blend the encoder states, use the attention vector directory.  Softmax probabilities of outputs can be interpreted as a fuzzy pointer.  We can solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient.  512 unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0.  In the case of TSP, the "student" networks outperforms the "teacher" algorithm. #### Notes/ Questions  Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How? 
[link]
TLDR; The authors show that we can pretrain RNNs using unlabeled data by either reconstructing the original sequence (SALSTM), or predicting the next token as in a language model (LMLSTM). We can then finetune the weights on a supervised task. Pretrained RNNs are more stable, generalize better, and achieve stateoftheart results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data. #### Data Sets Error Rates for SALSTM, previous best results in parens.  IMDB: 7.24% (7.42%)  Rotten Tomatoes 16.7% (18.5%) (using additional unlabeled data)  20 Newsgroups: 15.6% (17.1%)  DBPedia characterlevel: 1.19% (1.74%) #### Key Takeaways  SALSTM: Predict sequence based on final hidden state  LMLSTM: LanguageModel pretraining  LSTM, 1024dimensional cell, 512dimensional embedding, 512dimensional hidden affine layer + 50% dropout, Truncated backprop 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on dev set.  Linear Gain: Inject gradient at each step and linearly increase weights of prediction objectives #### Notes / Questions  Not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on other it significantly improves performance. Any explanations?  Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens?  The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and show results for other data sets.  All tasks are classification tasks. Does SALSTM also improve performance on seq2seq tasks?  What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with mask?) 
[link]
TLDR; The authors apply the skipthoguth word2vec model to the sentence level, training autoencoders that predict the previous and next sentences. The resulting generalpurpose vector representations are called skipthought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating finetuned models. #### Key Points  Code at https://github.com/ryankiros/skipthoughts  Training is done on large book corpus (74M sentences, 1B tokens), takes 2 weeks.  Two variations: Bidirectional encoder and unidirectional encoder with 1200 and 2400 units per encoder respectively. GRU cell, Adam optimizer, gradient clipping norm 10.  Vocabulary can be expanded by learning a mapping from a large word2vec voab to the smaller skipthought vocab. Could also used sampling/hierarchical softmax during training for larger vocab, or train on characters. #### Questions/Notes  Authors clearly state that this is not the goal of the paper, though I'd be curious how more sophisticated (nonlinear) classifiers perform with skipthought vectors. Authors probably tried this but it didn't do well ;)  The fact that the story generation doesn't seem work well shows that the model has problems learning or understanding longterm dependencies. I wonder if this can be solved by deeper encoders or attention. 
[link]
The authors perform theoretical analysis about faster convergence with multiplayer normalform games by generalizing techniques for twoplayer zerosum games. They also perform empirical evaluation by using the 4bidder simultaneous auction game. The paper is concerned with two problems: 1. How does the social welfare of players using regret minimization algorithms compare to the optimal welfare. 2. Can one obtain better regret bounds when all players use a regret minimization algorithm The paper deals with bounds on regret minimization algorithms in games. The usual regret bounds on these algorithms is in $O(\sqrt{T})$. However, this assumes that the learner faces a completely adversarial opponent. However, it is natural to assume that on a game everyone will play a regret minimization algorithm and the question is whether or not one can obtain better rates in this scenario. The authors show that regret in $O(T^{1/4})$ is achievable for general games. 
[link]
The paper gives justification for the widespread use of the GoodTuring estimator for discrete distribution estimation through minimax regret analysis with two comparator classes. The paper obtains competitive regret bounds that lead to a more accurate characterization of the performance of the the GoodTuring estimators and in some cases is much better than the best known risk bounds. The comparator classes considered are estimators with knowledge of the distribution up to permutation, and estimators with full knowledge of the distribution, but with the constraint that the must assign the same probability mass to symbols appearing with the same frequencies. 
[link]
This paper describes a learning algorithm for deep neural networks that can be understood as an extension of stacked denoising autoencoders. In short, instead of reconstructing one layer at a time and greedily stacking, a unique unsupervised objective involving the reconstruction of all layers is optimized jointly by all parameters (with the relative importance of each layer cost controlled by hyperparameters). In more details: * The encoding (forward propagation) adds noise (Gaussian) at all layers, while decoding is noisefree. * The target at each layer is the result of noiseless forward propagation. * Direct connections (also known as skipconnections) between a layer and its decoded reconstruction are used. The resulting encoder/decoder architecture thus ressembles a ladder (hence the name Ladder Networks). * Miniature neural networks with a single hidden unit and skipconnections are used to decode the left and top layers into a reconstruction. Each network is applied elementwise (without parameter sharing across reconstructed units). * The unsupervised objective is combined with a supervised objective, corresponding to the regular negative class loglikelihood objective (using an output softmax layer). Two losses are used for each input/target pair: one based on the noisefree forward propagation (which also provides the target of the denoising objective) and one with the noise added (which also corresponds to the encoding stage of the unsupervised autoencoder objective). Batch normalization is used to train the network. Since the model combines unsupervised and supervised learning, it can be used for semisupervised learning, where unlabeled examples can be used to update the network using the unsupervised objective only. State of the art results in the semisupervised setting are presented, for both the MNIST and CIFAR10 datasets. #### My two cents What I find most exciting about this paper is its performance. On MNIST, with only 100 labeled examples, it achieves 1.13% error! That is essentially the performance of stacked denoising autoencoders, trained on the entire training set (though that was before ReLUs and batch normalization, which this paper uses)! This confirms a current line of thought in Deep Learning (DL) that, while recent progress in DL applied on large labeled datasets does not rely on any unsupervised learning (unlike at the "beginning" of DL in the mid 2000s), unsupervised learning might instead be crucial for success in lowlabeled data regime, in the semisupervised setting. Unfortunately, there is one little issue in the experiments, disclosed by the authors: while they used few labeled examples for training, model selection did use all 10k labels in the validation set. This is of course unrealistic. But model selection in the low data regime is arguably, in itself, an open problem. So I like to think that this doesn't invalidate the progress made in this paper, and only suggests that some research needs to be done on doing effective hyperparameter search with a small validation set. Generally, I really hope this paper will stimulate more research on DL methods to the specific case of small labeled dataset / large unlabeled dataset. While this isn't a problem that is as "flashy" as tasks such as the ImageNet Challenge which comes with lots of labeled data, I think this is a crucial research direction for AI in general. Indeed, it seems naive to me to expect that we will be able to collect large labeled dataset for each and every task, on our way to real AI. 
[link]
This paper considers the problem of structured output prediction, in the specific case where the output is a sequence and we represent the sequence as a (conditional) directed graphical model that generates from the first token to the last. The paper starts from the observation that training such models by maximum likelihood (ML) does not reflect well how the model is actually used at test time. Indeed, ML training implies that the model is effectively trained to predict each token conditioned on the previous tokens *from the ground truth* sequence (this is known as "teacher forcing"). Yet, when making a prediction for a new input, the model will actually generate a sequence by generating tokens one after another and conditioning on *its own predicted tokens* instead. So the authors propose a different training procedure, where at training time each *conditioning* ground truth token is sometimes replaced by the model's previous prediction. The choice of replacing the ground truth by the model's prediction is made by "flipping a coin" with some probability, independently for each token. Importantly, the authors propose to start with a high probability of using the ground truth (i.e. start close to ML) and anneal that probability closer to 0, according to some schedule (thus the name Schedule Sampling). Experiments on 3 tasks (image caption generation, constituency parsing and speech recognition) based on neural networks with LSTM units, demonstrate that this approach indeed improves over ML training in terms of the various performance metrics appropriate for each problem, and yields better sequence prediction models. #### My two cents Big fan of this paper. It both identifies an important flaw in how sequential prediction models are currently trained and, most importantly, suggests a solution that is simple yet effective. I also believe that this approach played a nonnegligible role in Google's winner system for image caption generation, in the Microsoft COCO competition. My alternative interpretation of why Scheduled Sampling helps is that ML training does not inform the model about the relative quality of the errors it can make. In terms of ML, it is as bad to put high probability on an output sequence that has just 1 token that's wrong, than it is to put the same amount of probability on a sequence that has all tokens wrong. Yet, say for image caption generation, outputting a sentence that is one word away from the ground truth is clearly preferable from making a mistake on a words (something that is also reflected in the performance metrics, such as BLEU). By training the model to be robust to its own mistakes, Scheduled Sampling ensures that errors won't accumulate and makes predictions that are entirely off much less likely. An alternative to Scheduled Sampling is DAgger (Dataset Aggregation: \cite{journals/jmlr/RossGB11}), which briefly put alternates between training the model and adding to the training set examples that mix model predictions and the ground truth. However, Scheduled Sampling has the advantage that there is no need to explicitly create and store that increasingly large dataset of sampled examples, something that isn't appealing for online learning or learning on large datasets. I'm also very curious and interested by one of the direction of future work mentioned in the conclusion: figuring out a way to backprop through the stochastic predictions made by the model. Indeed, as the authors point out, the current algorithm ignores the fact that, by sometimes taking as input its previous prediction, this induces an additional relationship between the model's parameters and its ultimate prediction, a relationship that isn't taken into account during training. To take it into account, you'd need to somehow backpropagate through the stochastic process that generated the previous token prediction. While the work on variational autoencoders has shown that we can backprop through gaussian samples, backpropagating through the sampling of a discrete multinomial distribution is essentially an open problem. I do believe that there is work that tried to tackle propagating through stochastic binary units however, so perhaps that's a start. Anyways, if the authors could make progress on that specific issue, it could be quite useful not just in the context of Schedule Sampling, but possibly in the context of training networks with discrete stochastic units in general! 
[link]
This paper starts by introducing a trick to reduce the variance of stochastic gradient variational Bayes (SGVB) estimators. In neural networks, SGVB consists in learning a variational (e.g. diagonal Gaussian) posterior over the weights and biases of neural networks, through a procedure that (for the most part) alternates between adding (Gaussian) noise to the model's parameters and then performing a model update with backprop. The authors present a local reparameterization trick, which exploits the fact that the Gaussian noise added into the weights could instead be added directly into the preactivation (i.e. before the activation fonction) vectors during forward propagation. This is due to the fact that computing the preactivation is a linear operation, thus noise at that level is also Gaussian. The advantage of doing so is that, in the context of minibatch training, one can efficiently then add independent noise to the preactivation vectors for each example of the minibatch. The nature of the local reparameterization trick implies that this is equivalent to using one corrupted version of the weights for each example in the minibatch, something that wouldn't be practical computationally otherwise. This is in fact why, in normal SGVB, previous work would normally use a single corrupted version of the weights for all the minibatch. The authors demonstrate that using the local reparameterization trick yields stochastic gradients with lower variance, which should improve the speed of convergence. Then, the authors demonstrate that the Gaussian version of dropout (one that uses multiplicative Gaussian noise, instead of 01 masking noise) can be seen as the local reparameterization trick version of a SGVB objective, with some specific prior and variational posterior. In this SGVB view of Gaussian dropout, the dropout rate is an hyperparameter of this prior, which can now be tuned by optimizing the variational lower bound of SGVB. In other words, we now have a method to also train the dropout rate! Moreover, it becomes possible to tune an individual dropout rate parameter for each layer, or even each parameter of the model. Experiments on MNIST confirm that tuning that parameter works and allows to reach good performance of various network sizes, compared to using a default dropout rate. ##### My two cents This is another thought provoking connection between Bayesian learning and dropout. Indeed, while Deep GPs have allowed to make a Bayesian connection with regular (binary) dropout learning \cite{journals/corr/GalG15}, this paper sheds light on a neat Bayesian connection for the Gaussian version of dropout. This is great, because it suggests that Gaussian dropout training is another legit way of modeling uncertainty in the parameters of neural networks. It's also nice that that connection also yielded a method for tuning the dropout rate automatically. I hope future work (by the authors or by others) can evaluate the quality of the corresponding variational posterior in terms of estimating uncertainty in the network and, in particular, in obtaining calibrated output probabilities. Little detail: I couldn't figure out whether the authors tuned a single dropout rate for the whole network, or used many rates, for instance one per parameter, as they suggest can be done. 
[link]
This paper presents a linear algebraic trick for computing both the value and the gradient update for a loss function that compares a very highdimensional target with a (dense) output prediction. Most of the paper exposes the specific case of the squared error loss, though it can also be applied to some other losses such as the socalled spherical softmax. One use case could be for training autoencoders with the squared error on very highdimensional but sparse inputs. While a naive (i.e. what most people currently do) implementation would scale in $O(Dd)$ where $D$ is the input dimensionality and d the hidden layer dimensionality, they show that their trick allows to scale in $O(d^2)$. Their experiments show that they can achieve speedup factors of over 500 on the CPU, and over 1500 on the GPU. #### My two cents This is a really neat, and frankly really surprising, mathematical contribution. I did not suspect getting rid of the dependence on D in the complexity would actually be achievable, even for the "simpler" case of the squared error. The jury is still out as to whether we can leverage the full power of this trick in practice. Indeed, the squared error over sparse targets isn't the most natural choice in most situations. The authors did try to use this trick in the context of a version of the neural network language model that uses the squared error instead of the negative logsoftmax (or at least I think that's what was done... I couldn't confirm this with 100% confidence). They showed that good measures of word similarity (Simlex999) could be achieved in this way, though using the hierarchical softmax actually achieves better performance in about the same time. But as far as I'm concerned, that doesn't make the trick less impressive. It's still a neat piece of new knowledge to have about reconstruction errors. Also, the authors mention that it would be possible to adapt the trick to the socalled (negative log) spherical softmax, which is like the softmax but where the numerator is the square of the preactivation, instead of the exponential. I hope someone tries this out in the future, as perhaps it could be key to making this trick a real game changer! 
[link]
This paper presents a variational approach to the maximisation of mutual information in the context of a reinforcement learning agent. Mutual information in this context can provide a learning signal to the agent that is "intrinsically motivated", because it relies solely on the agent's state/beliefs and does not require from the ("outside") user an explicit definition of rewards. Specifically, the learning objective, for a current state s, is the mutual information between the sequence of K actions a proposed by an exploration distribution $w(as)$ and the final state s' of the agent after performing these actions. To understand what the properties of this objective, it is useful to consider the form of this mutual information as a difference of conditional entropies: $$I(a,s's) = H(as)  H(as',s)$$ Where $I(..)$ is the (conditional) mutual information and $H(..)$ is the (conditional) entropy. This objective thus asks that the agent find an exploration distribution that explores as much as possible (i.e. has high $H(as)$ entropy) but is such that these actions have predictable consequences (i.e. lead to predictable state s' so that $H(as',s)$ is low). So one could think of the agent as trying to learn to have control of as much of the environment as possible, thus this objective has also been coined as "empowerment". The main contribution of this work is to show how to train, on a large scale (i.e. larger state space and action space) with this objective, using neural networks. They build on a variational lower bound on the mutual information and then derive from it a stochastic variational training algorithm for it. The procedure has 3 components: the exploration distribution $w(as)$, the environment $p(s's,a)$ (can be thought as an encoder, but which isn't modeled and is only interacted with/sampled from) and the planning model $p(as',s)$ (which is modeled and can be thought of as a decoder). The main technical contribution is in how to update the exploration distribution (see section 4.2.2 for the technical details). This approach exploits neural networks of various forms. Neural autoregressive generative models are also used as models for the exploration distribution as well as the decoder or planning distribution. Interestingly, the framework allows to also learn the state representation s as a function of some "raw" representation x of states. For raw states corresponding to images (e.g. the pixels of the screen image in a game), CNNs are used. 
[link]
This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets, allowing to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to Monte Carlo estimate of the posterior predictive distribution of the model. The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) in replicating the behavior and performance of a much larger model (teacher), by essentially training the student to match the outputs of the teacher. The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict to output of ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD. Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model, whose outputs should be calibrated to the bayesian predictive distribution. 
[link]
The authors introduce a new method for actively selecting the model that best fits a dataset. Contrary to active learning, where the next learning point is chosen to get a better estimate of the model hyperparameters, this methods selects the next point to better distinguish between a set of models. Similar active model selection techniques exist, but they need to retrain each model for each new data point to evaluate. The strength of the author's method is that is only requires to evaluate the predictive distributions of models, without retraining. They propose to apply this method to detect noiseinduced hearing loss. The traditional way of screening for NIHL involves testing a wide range of intensities and frequencies, which is time consuming. The authors show that with their method, the number of tests to be run could be drastically decreased, reducing the cost of largescale screenings for NIHL. 
[link]
Machine learning researchers frequently find that they get better results by adding more and more layers to their neural networks, but the difficulties of initialization and decaying/exploding gradients have been severely limiting. Indeed, the difficulties of getting information to flow through deep neural networks arguably kept them out of widespread use for 30 years. This paper addresses this problem head on and demonstrates one method for training 100 layer nets. The paper describes an affective method to train very deep neural networks by means of 'information highways', or building direct connections to upper network layers. Although a generalization of prior techniques, such as crosslayer connections, the authors have shown this method to be effective by experimentation. The contributions are quite novel and well supported by experimental evidence. 
[link]
The paper proposes a sampler for iHMMs, which the authors show has improved mixing properties and performs better in posterior inference problems when compared to the existing stateoftheart sampling methods. An existing Gibbs sampler is turned into a particle Gibbs sampler by using a conditional SMC step to sample the latent sequence of states. The paper uses conjugacy to derive optimal SMC proposals and ancestor sampling to improve the performance of the conditional SMC step. The result is more efficient sampling of the latent states, making the sampler robust to spurious states and yielding faster convergence. 
[link]
The authors' model confidence data from two experiments (conducted by others and previously published in the scientific literature) using a POMDP. In both experiments, subjects saw a randomdot kinematogram on each trial and made a binary choice about the dominant motion direction. The first experiment used monkeys as subjects and stimuli had a fixed duration. The second experiment used people as subjects and stimuli continued until a subject made a response. The paper reports that the POMDP model does a good job of fitting the experimental data, both the accuracy data and the confidence data. 
[link]
Deep rectified neural networks are overparameterized in the sense that scaling of the weights in one layer, can be compensated for exactly in the subsequent layer. This paper introduces PathSGD, a simple modification to the SGD update rule, whose update is invariant to such rescaling. The method is derived from the proximal form of gradient descent, whereby a constraint term is added which preserves the norm of the "product weight" formed along each path in the network (from input to output node). PathSGD is thus principled and shown to yield faster convergence for a standard 2 layer rectifier network, across a variety of dataset (MNIST, CIFAR10, CIFAR100, SVHN). As the method implicitly regularizes the neural weights, this also translates to better generalization performance on half of the datasets. At its core, PathSGD belongs to the family of learning algorithms which aim to be invariant to model reparametrizations. This is the central tenet of Amari's natural gradient (NG) \cite{amari_natural_1998}, whose importance has resurfaced in the area of deep learning. PathSGD can thus be cast an approximation to NG, which focuses on a particular type of rescaling between neighboring layers. The paper would greatly benefit from such a discussion in my opinion. I also believe NG to be a much more direct way to motivate PathSGD, than the heuristics of maxnorm regularization. 
[link]
Endowing memory to recurrent neural networks is clearly one of the most important topics of deep learning and crucial to do real reasoning. The proposed stackaugmented recurrent nets outperform simple RNN and LSTM \cite{journals/neco/HochreiterS97} on a series of synthetic problems (learning simple algorithmic patterns). The complexity of problems is clearly defined and the behavior of resulting stack RNN could be well understood and easily analyzed. However, the conclusions merely depending on those synthetic data set may take a risk. The importance of the problems to real sequence modeling task could be uncertain and the failures of other models could be greatly improved by more and dense hyperparameter searching. Like in \cite{journals/corr/LeJH15}, by a very simple trick a RNN works very well on a toy task (a adding problem) which seems to need to model long term dependencies. 
[link]
The authors propose a probabilistic version of the "line search" procedure that is commonly used as a subroutine in many deterministic optimization algorithms. The new technique can be applied when the evaluations of the objective function and its gradients are corrupted by noise. Therefore, the proposed method can be successfully used in stochastic optimization problems, eliminating the requirement of having to specify a learning rate parameter in this type of problems. The proposed method uses a Gaussian process surrogate model for the objective and its gradients. This allows us to obtain a probabilistic version of the conditions commonly used to terminate line searches in the deterministic scenario. The result is a soft version of those conditions that is used to stop the probabilistic line search process. At each iteration within such process, the next evaluation location is collected by using Bayesian optimization methods. A series of experiments with neural networks on the MNIST and CIFAR10 datasets validate the usefulness of the proposed technique. 
[link]
This paper propose a new inference mechanism for the PlackettLuce model based on the preliminary observation that the ML estimate can be seen as the stationary distribution of a certain Markov chain. In fact, two inferences mechanisms are proposed, one is approximate and consistent, the other converges to the ML estimate but is slower. The authors then debate on the application settings (pairwise preferences, partial rankings). Finally, the authors exhibit three sets of experiments. The first one compares the proposed algorithm to other approximate inference mechanisms for the PL model in terms of statistical efficiency. Then on realworld datasets, one experiment compares the empirical performance of the approximate methods and a second the speed of exact methods to reach a certain level of optimality. 
[link]
The algorithm presented here is simple and interesting. Pixel luminance, chrominance, and illumination chrominance are all histogrammed, and then evaluation is simply each pixel's luminance voting on each pixel's true chrominance for each of the "memorized" illuminations. The model can be trained generative by simply counting pixels in the training set, or can be trained endtoend for a slight performance boost. This algorithm's simplicity and speed are appealing, and additionally it seems like it may be a useful building block for a more sophisticated spatiallyvarying illumination model. 
[link]
This paper describes using an additional time scale over trials to model (slow) nonstationarities. It adds to the successful PLDS model, another gain vector matching the latent dimensions that is constant during each trial. Many neuroscientific datasets indeed show such slow drifts, which could very well be captured by such modeling effort. 
[link]
This paper addresses the problem of learning reserve prices that approximately maximize revenue, using sample draws from an unknown distribution over bidder valuations. The authors introduce tlevel auctions, in which (roughly speaking) each bidder's bid space is effectively discretized into levels, and the bidder whose bid falls on the highest level wins and pays the lowest value that falls on its lowest level required to win. The authors bound the number of samples needed to find an approximately revenuemaximizing auction from all auctions in a set C (e.g., from the set of 10level auctions). They bound the difference in revenue between the revenuemaximizing tlevel auction and the optimal auction. Results are presented for singleitem auctions but are generalized to matroid settings and singleparameter settings. 
[link]
The authors introduce a novel approach for inferring hidden physical properties of objects (mass and friction), which also allows the system to make subsequent predictions that depend on these properties. They use a blackbox generative model (a physics simulator), to perform samplingbased inference, and leverage a tracking algorithm to transform the data into more suitable latent variables (and reduce its dimensionality) as well as a deep model to improve the sampler. The authors assume priors over the hidden physical properties, and make point estimates of the geometry and velocities of objects using a tracking algorithm, which comprise a full specification of the scene that can be input to a physics engine to generate simulated velocities. These simulated velocities then support inference of the hidden properties within an MCMC sampler: the properties' values are proposed and their consequent simulated velocities are generated, which are then scored against the estimated velocities, similar to ABC. A deep network can be trained as a recognition model, from the inferences of the generative model, and also from the Physics 101 dataset directly. Its predictions of the mass and friction can be used to initialize the MCMC sampler. 
[link]
This paper considers a generalization of the Interactive Submodular Set Cover (ISSC) problem \cite{conf/icml/GuilloryB10}. In ISSC, the goal is to interactively collect elements until the value of the set of elements, represented by an unknown submodular function, reaches some threshold. In the original ISSC there is a single correct submodular function, which can be revealed using responses to each selected element, and a single desired threshold. This paper proposes to simultaneously require reaching some threshold for all the possible submodular functions. The threshold value is determined as a convex function of a submodular agreement measure between the given function and the responses to all elements. Each element has a cost, and so the goal is to efficiently decide which elements to collect to satisfy the goal at a small cost. 
[link]
The paper presents results on recovery of lowrank semidefinite matrices from linear measurements, using nonconvex optimization. The approach is inspired by recent work on phase retrieval, and combines spectral initialization with gradient descent. The connection to phase retrieval comes because measurements which are linear in the semidefinite matrix $X = Z Z'$ are quadratic in the factors $Z$. The paper proves recovery results which imply that correct recovery occurs when the number of measurements m is essentially proportional to n $r^2$, where n is the dimensionality and r is the rank. The convergence analysis is based on a form of restricted strong convexity (restricted because there is an $r(r1)/2$dimensional set of equivalent solutions along which the objective is flat). This condition also implies linear convergence of the proposed algorithm. The implementation seems awful. When compared to recent implementations, e.g. http://arxiv.org/abs/1408.2467 the performance seems orders of magnitude away from the state of the art  and being an order of magnitude faster than generalpurpose SDP solver on the nuclear norm does not make it any better. The authors should acknowledge that and compare the results with other codes on some established benchmark (e.g. Lenna), so as to show that the price in terms of runtime brings about much better performance in terms of objective function values (SNR, RMSE)  which is plausible, but far from certain. 
[link]
The paper presents a data visualisation method based on the concept of spacetime. The spacetime representation is capable of showing a broader family of proximities than an Euclidean space with the same dimensionality. Based on the KL measure, the authors argue that the lower dimensional representation of the high dimensional data using the spacetime local embedding method can keep more information than Euclidean embeddings. I am quite convinced, but there is one question about interpretability of the visualised data in spacetime. 
[link]
This work addresses an important special case of the correlation clustering problem: Given as input a graph with edges labeled 1 (disagreement) or +1 (agreement), the goal is to decompose the graph so as to maximize agreement within components. Building on recent work \cite{conf/kdd/BonchiGL14} \cite{conf/kdd/ChierichettiDK14}, this paper contributes two concurrent algorithms, a proof of their approximation ratio, a runtime analysis as well as a set of experiments which demonstrate convincingly the advantage of the proposed algorithms over the state of the art. 
[link]
The paper attacks the problem of describing a sequence of images from blogposts with a sequence of consistent sentences. For this the paper proposes to first retrieve the K=5 most similar images and associated sentences from the training set for each query image. The main contribution of the paper lies in defining a way to select the most relevant sentences for the query image sequence, providing a coherent description. For this sentences are first embedded in a vector and then the sequence of sentences is modeled with a bidirectional LSTM. The output of the bidirectional LSTM is first fed through a relu \cite{conf/icml/NairH10} and fully connected layer and then scored with a compatibility score between image and sentence. Additionally a local coherence model \cite{journals/coling/BarzilayL08} is included to enforce the compatibility between sentences. 
[link]
The paper presents a method to obtain a hierarchical clustering of a planar graph by posing the problem as that of approximating a set of edge weights using an ultrametric. This is accomplished by minimizing the $\ell_2$ norm between the given edge weights and the learnt ultrametric. Learning the ultrametric amounts to estimating a collection of multicuts that satisfies a hierarchical partitioning constraint. An efficient algorithm is presented that solves an approximation based on a finding a linear combination of a subset of possible twoway cuts of the graph. 
[link]
This paper proposes a novel online algorithm for constructing a multiclass classifier that enjoys a time complexity logarithmic in the number of classes k. This is done by constructing online a decision tree which locally maximizes an appropriate novel objective function, which measures the quality of a tree according to a combined "balancedness" and "purity" score. A theoretical analysis (of a probably intractable algorithm) is provided via a boosting argument (assuming weak learnability), essentially extending the work of Kearns and Mansour (1996) \cite{conf/stoc/KearnsM96} to the multiclass setup. A concrete algorithm is given to a relaxed problem (but see below) without any guarantees, but quite simple, natural and interesting. 
[link]
The authors derive an estimator of a "proxy" of the covariance matrix of a stationary stochastic process (in their case asset returns) which is robust to data outliers and does not make assumptions on the tails of the distribution. They show that for elliptical distributions, which includes Gaussians, this proxy is consistent with true covariance matrix up to a scaling factor; and that their proposed estimator of the proxy has bounded error. 
[link]
This paper presents a new method (the "covariancecontrolled adaptive Langevin thermostat") for MCMC posterior sampling for Bayesian inference. Along the lines of previous work in scalable MCMC, this is a stochastic gradient sampling method. The presented method aims to decrease parameterdependent noise (in order to speedup convergence to the given invariant distribution of the Markov chain, and generate beneficial samples more efficiently), while maintaining the desired invariant distribution of the Markov chain. Similar to existing stochastic gradient MCMC methods, this method aims to find use in largescale machine learning settings (i.e. Bayesian inference with large numbers of observations). Experiments on three models (a normalgamma model, Bayesian logistic regression, and a discriminative restricted Boltzmann machine) aim to show that the presented method performs better than Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) \cite{10.1016/03702693(87)91197X} and Stochastic Gradient NoseHoover Thermostat (SGNHT), two similar existing methods. 
[link]
This paper introduces ASUGS (adaptive sequential updating and greedy search), building on the previous work on SUGS by Wang & Dunson 2011 \cite{10.1198/jcgs.2010.07081}, which is a sequential (ie online) MAP inference method for DPMMs. The main contribution of the paper is to provide online updating for the concentration parameter, $\alpha$. The paper shows that the posterior distribution on $\alpha$ can be expected to behave has a gamma distribution (that depends on the current number of clusters and on n) in the largescale limit, assuming an exponential prior on $\alpha$. ASUGS uses the mean of this gamma distribution as the $\alpha$ for updating cluster assignments, the remainder of the algorithm proceeding as in SUGS (ie using conjugacy to update model parameters in an online fashion, with hard assignments of data to clusters.) The paper also shows that this choice of \alpha is bounded by $\log^\epsilon n$ for an arbitrarily small $\epsilon$, so that we may expect this process to converge, or at the very least be stable even in large settings. 
[link]
The paper seeks to establish a connection between algorithmic stability and generalization performance. Notions of algorithmic stability have been proposed before and linked to the generalization performance of learning algorithms \cite{conf/uai/KutinN02} \cite{journals/neco/KearnsR99} and have also been shown to be crucial for learnability \cite{journals/jmlr/ShalevShwartzSSS10}. \cite{PoggioETAL:04} proved that for bounded loss functions, the generalization of ERM is equivalent to the probabilistic leaveoneout stability of the learning algorithm. \cite{journals/jmlr/ShalevShwartzSSS10} then showed that a problem is learnable in Vapnik's general setting of learning iff there exists an asymptotically stability ERM procedure. This paper first establishes that for Vapnik's general setting of learning, a probabilistic notion of stability, is necessary and sufficient for the training losses to converge to test losses uniformly for all distributions. The paper then presents some discussions on how this notion of stability can be interpreted to give results in terms of the capacity of the function class or the size of the population. 
[link]
The paper presents a solution to binary classification with symmetric label noise (SLN). They show that, in order to obtain consistency (w.r.t. to the 01 loss in the "noiseless" case) while using a convex surrogate, one must use the loss $\ell(v,y) = 1  vy$  the "unhinged loss"  , which is shown to enjoy some useful properties, including robustness to SLN. In a more restricted sense of robustness, it is the only such loss, but in any case it overcomes the limitations of other convex losses for the same problem. Different implications of using the unhinged loss are discussed; the problem of classification with SLN with the unhinged loss and "linear" classifiers is investigated and solved analytically. The authors also present an empirical evaluation to motivate that their theoretical considerations have practical impact. 
[link]
The paper proposes a payment rule for crowdsourced tasks. This rule is intended to incentivize workers to accurately report their confidence (e.g. by skipping a task when they have low confidence), and to pay little to spammers. Payment is based on the product of the evaluations of a worker's responses to a set of goldstandard tasks; if the worker gets a single gold standard task wrong and asserts high confidence, the overall payment is zero. 
[link]
This work proposes a two stage object detection algorithm based on convolutional neural network (CNN). The first stage is region proposal, which is based on the traditional sliding window method but working on the top layer feature map of CNN (RPN). In the second stage, a fast RCNN is applied to the proposed regions. Since the convolution layers are shared between RPN and RCNN, and the calculation is speeded up using GPU, the algorithm can achieve near realtime (5fps). 