[link]
If you read modern (that is, 20182020) papers using deep learning on molecular inputs, almost all of them use some variant of graph convolution. So, I decided to go back through the citation chain and read the earliest papers that thought to apply this technique to molecules, to get an idea of lineage of the technique within this domain. This 2015 paper, by Duvenaud et al, is the earliest one I can find. It focuses the entire paper on comparing differentiable, messagepassing networks to the state of the art standard at the time, circular fingerprints (more on that in a bit). I really appreciated this approach, which, rather than trying to claim an unrealistic level of novelty, goes into detail on the prior approach, and carves out specific areas of difference. At a high level, the authors' claim is: our model is, in its simplest case, a more flexible and super version of existing work. The unspoken corollary, which ended up being proven true, is that the flexibility of the neural network structure makes it easy to go beyond this initial level of simplicity. Circular Fingerprinting (or, more properly, ExtendedConnectivity Circular Fingerprints), is a fascinating algorithm that captures many of the elements of convolution: shared weights, a hierarchy of kernels that match patterns at different scales, and a clever way of aggregating information across an arbitrary number of input nodes. Mechanistically, Circular Fingerprints work by: 1) Taking each atom, and creating a concatenated vector of its basic features, along with the basic features of each atom it's bonded to (with bonded neighbors quasirandomly) 2) Calculating nextlevel features by applying some number of hash functions (roughly equivalent to convolutional kernels) to the neighborhood feature vector at the lower level to produce an integer 3) For each feature, setting the value of the fingerprint vector to 1 at the index implied by the integer in step (2) 4) Iterating this process at progressively higher layers, using the hash The effect of this is to assign each index of the vector to an binary feature (modulo hash collisions), where that feature is activated if an exact match is found to a structure within a given atom. Its main downside is that (a) its "kernel" equivalents are fixed and not trainable, since they're just random hashes, and (b) its features represent *exact* matches to lowerlevel feature patterns, which means you can't have one feature activated to different degrees by variations on a pattern it's identifying. https://i.imgur.com/V8FpfVE.png Duvenaud et al present their alternative in terms of keeping a similar structure, but swapping out fixed and binary components for trainable (because differentiable) and continuous ones. Instead of concatenating a random sorting of atom neighbors to enforce invariance to sorting, they simply sum feature vectors across neighbors, which is also an orderinvariantoperation. Instead of applying hash functions, they apply parametrized kernel functions, with the same parameters used across all aggregated neighborhood vectors . This will no longer look for exact matches, but will activate to the extent a structure within an atom matches against a kernel pattern. Then, these features are put into a softmax, which instead setting an index of a vector to a sharp one value, activates different feature indices in the final vector to differing degrees. The final fingerprint is simply the sum of these softmax feature activations for each atom. The authors do a few tests to confirm their substitution is working well, including starting out with a random network (to better approximate the random hash functions), comparing distances between atoms according to either the circular or neural footprint (which had a high correlation), and confirming that that performs similarly to circular fingerprints on a set of supervised learning tasks on modules. When they trained weights to be better than random on three such supervised tasks, they found that their model was comparable or better than circular fingerprints on all three (to break that down: it was basically equivalent on one, and notably better on the other two, according to mean squared error) This really is the simplest possible version of a messagepassing or graph convolutional network (it doesn't use edge features, it doesn't calculate features of a neighborconnection according to the features of each node, etc), but it's really satisfying to see it laid out as a nextstep alternative that offered value just by stepping away from exactmatch feature dynamics and nonrandom functions, even without all the sophisticated additions that would later be added to such models.
2 Comments

[link]
The idea is to combine collaborative filtering with contentbased recommenders to mitigate the user and item coldstart problems. The author distinguishes between positive and negative interactions. The representation of a user and of items is the sum of all their latent representations. This sounds similar to "**Asymmetric factor models**" as described in [the BellKor Netflix price solution](https://www.netflixprize.com/assets/ProgressPrize2007_KorBell.pdf). **The key idea is to encode the latent user (or item) vector as a sum of latent attribute vectors.** Adagrad / asynchronous stochastic gradient descent was used for optimization. ## See also * [Code on GitHub](https://lyst.github.io/lightfm/docs/index.html#) * [Paper on ArXiv](https://arxiv.org/pdf/1507.08439.pdf) 
[link]
Neelakantan et al. study gradient noise for improving neural network training. In particular, they add Gaussian noise to the gradients in each iteration: $\tilde{\nabla}f = \nabla f + \mathcal{N}(0, \sigma^2)$ where the variance $\sigma^2$ is adapted throughout training as follows: $\sigma^2 = \frac{\eta}{(1 + t)^\gamma}$ where $\eta$ and $\gamma$ are hyperparameters and $t$ the current iteration. In experiments, the authors show that gradient noise has the potential to improve accuracy, especially given optimization. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
### **Keyword**: RNN, serialized model; nondifferentiable backpropogarion; action detection in video **Abstract**: This paper uses an endtoend model which is a recurrent neural network trained by REINFORCE to directly predict the temporal bounds of actions. The intuition is that people will observe moments in video and decide where to look to predict when an action is occurring. After training, Serena et al manage to achieve the stateofart result by only observing 2% of the video frames. **Model**: In order to take a long video and output all the instances of given action, they use two parts including an observation network and recurrent network. * observation network: encode the visual representation of video frames. * input: $ln$  the normalized location of the frame + frame $v_{ln}$ * network: fc7 feature of finetuned VGG16 network * output: $on$ of 1024 dimension indicate time and frame feature * recurrent network: sequentially process the visual representations and decide where to watch next and whether to emit detection. ##### for each timestep: * input: $on$  the representation of the frame + previous state $h_{n1}$ * network: $d_n = f_d(h_n; \theta_d)$. $pn = fp(h_n,\theta_p)$,$fd$ is fc. $fp$ is fc+sigmoid * output: $d_n = (s_n,e_n,c_n )$as the candidate detection, where $s_n$,$e_n$ is the start and end of the detection, $c_n$ is confidence level; $p_n$ whether $d_n$ is a valid detection. $l_{n+1}$ where to observe next. all the parameter falls in [0,1] https://i.imgur.com/SeFianV.png **Training**: in order to learn the supervision annotation in long videos and handle the nondifferentiable components, authors use BP to train $d_n$ while use REINFORCE to train $p_n$ and $l_{n+1}$ * for $d_n$: $L(D) = \sum_n L_{cls}(d_n) + \sum_n \sum_m 1[y_{mn} = 1] L_{loc}(d_n,g_m)$ * for $p_n,l_{n+1}$: reward $J(\theta) = \sum_{a\in A} p(a) r(a)$ p(a) is the distribution of action and r(a) is the reward for the action. so training needs to maximize this. **Summary**: This paper uses a serialized model which first extract the feature from each frame, then use the frame feature and previous state info to generate the next observation time, detection and detection indicator. Specifically, in order to use previous information, they use RNN to store information and use REINFORCE to train $p_n$ and $l_n$, where the goal is to maximize reward for an action sequence and use MonteCarlo sampling to numerically calculate the gradient for high dimension function. **questions**: 1. why $p_n$ and $l_n$ are nondifferentiable components? 2. if $p_n$ and $l_n$ are nondifferentiable components indeed, how do we come up with REINFORCE to compute the gradient? 3. why don't we get $p_n$ from $p_n = f_p(h_n, \theta_p)$ directly but rather use fp as the parameter in bernoulli distribution, similar question can be applied to calculation for $l_{n+1}$ in trainning time. 
[link]
This paper argues for the use of normalizing flows  a way of building up new probability distributions by applying multiple sets of invertible transformations to existing distributions  as a way of building more flexible variational inference models. The central premise of a variational autoencoder is that of learning an approximation to the posterior distribution of latent variables  p(zx)  and parameterizing that distribution according to values produced by a neural network. In typical practice, this has meant that VAEs are limited in terms of the complexity of latent variable distributions they can encode, since using an analytically specified distribution tends to limit you to simpler distributional shapes  Gaussians, uniform, and the like. Normalizing flows are here proposed as a way to allow for the model to learn more complex forms of posterior distribution. Normalizing flows work off of a fairly simple intuition: if you take samples from a distribution p(x), and then apply a function f(x) to each x in that sample, you can calculate the expected value of your new distribution f(x) by calculating the expectation of f(x) under the old distribution p(x). That is to say: https://i.imgur.com/NStm7zN.png This mathematical transformation has a pretty delightful name  The Law of the Unconscious Statistician  that came from the fact that so many statisticians just treated this identity as a definitional fact, rather than something actually in need of proving (I very much fall into this bucket as well). The implication of this is that if you apply many transformations in sequence to the draws from some simple distribution, you can work with that distribution without explicitly knowing its analytical formulation, just by being able to evaluate  and, importantly  invert the function. The ability to invert the function is key, because of the way you calculate the derivative: by taking the inverse of the determinant of the derivative of your function f(z) with respect to z. (Note here that q(z) is the original distribution you sampled under, and q’(z) is the implicit density you’re trying to estimate, after your function has been applied). https://i.imgur.com/8LmA0rc.png Combining these ideas together: a variational flow autoencoder works by having an encoder network define the parameters of a simple distribution (Gaussian or Uniform), and then running the samples from that distribution through a series of k transformation layers. This final transformed density over z is then given to the decoder to work with. Some important limitations are in place here, the most salient of which is that in order to calculate derivatives, you have to be able to calculate the determinant of the derivative of a given transformation. Due to this constraint, the paper only tests a few transformations where this is easy to calculate analytically  the planar transformation and radial transformation. If you think about transformations of density functions as fundamentally stretching or compressing regions of density, the planar transformation works by stretching along an axis perpendicular to some parametrically defined plane, and the radial transformation works by stretching outward in a radial way around some parametrically defined point. Even though these transformations are individually fairly simple, when combined, they can give you a lot more flexibility in distributional space than a simple Gaussian or Uniform could. https://i.imgur.com/Xf8HgHl.png 
[link]
Shaham et al. provide an interpretation of adversarial training in the context of robust optimization. In particular, adversarial training is posed as minmax problem (similar to other related work, as I found): $\min_\theta \sum_i \max_{r \in U_i} J(\theta, x_i + r, y_i)$ where $U_i$ is called the uncertainty set corresponding to sample $x_i$ – in the context of adversarial examples, this might be an $\epsilon$ball around the sample quantifying the maximum perturbation allowed; $(x_i, y_i)$ are training samples, $\theta$ the parameters and $J$ the trianing objective. In practice, when the overall minimization problem is tackled using gradient descent, the inner maximization problem cannot be solved exactly (as this would be inefficient). Instead Shaham et al. Propose to alternatingly make single steps both for the minimization and the maximization problems – in the spirit of generative adversarial network training. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Huang et al. propose a variant of adversarial training called “learning with a strong adversary”. In spirit the idea is also similar to related work [1]. In particular, the authors consider the minmax objective $\min_g \sum_i \max_{\r^{(i)}\\leq c} l(g(x_i + r^{(i)}), y_i)$ where $g$ ranges over expressible functions and $(x_i, y_i)$ is a training sample. In the remainder of the paper, Huang et al. Address the problem of efficiently computing $r^{(i)}$ – i.e. a strong adversarial example based on the current state of the network – and subsequently updating the weights of the network by computing the gradient of the augmented loss. Details can be found in the paper. [1] T. Miyato, S. Maeda, M. Koyama, K. Nakae, S. Ishii. Distributional Smoothing by Virtual Adversarial Training. ArXiv:1507.00677, 2015. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Miyato et al. propose distributional smoothing (or virtual adversarial training) as defense against adversarial examples. However, I think that both terms do not give a good intuition of what is actually done. Essentially, a regularization term is introduced. Letting $p(yx,\theta)$ be the learned model, the regularizer is expressed as $\text{KL}(p(yx,\theta)p(yx+r,\theta)$ where $r$ is the perturbation that maximizes the KullbackLeibler divergence above, i.e. $r = \arg\max_r \{\text{KL}(p(yx,\theta)p(yx+r,\theta)  \r\_2 \leq \epsilon\}$ with hyperparameter $\epsilon$. Essentially, the regularizer is supposed to “simulate” adversarial training – thus, the method is also called virtual adversarial training. The discussed implementation, however, is somewhat cumbersome. In particular, $r$ cannot be computed using firstorder methods as the gradient of $\text{KL}$ is $0$ for $r = 0$. So a secondorder method is used – for which the Hessian needs to be approximated and the corresponding eigenvectors need to be computed. For me it is unclear why $r$ cannot be initialized randomly to solve this issue … Then, the derivative of the regularizer needs to be computed during training. Here, the authors make several simplifications (such as fixing $\theta$ in the first part of the KullbackLeibler divergence and ignoring the derivative of $r$ w.r.t. $\theta$). Overall, however, I like the idea of “virtual” adversarial training as it avoids the need of explicitly using attacks during training to craft adversarial examples. Then, the trained model is often robust against the chosen attacks, but new adversarial examples can be found easily through novel attacks. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Papernot et al. Introduce a novel attack on deep networks based on socalled adversarial saliency maps that are computed independently of a loss. Specifically, they consider – for a given network $F(X)$ – the forward derivative $\nabla F = \frac{\partial F}{\partial X} = \left[\frac{\partial F_j(X)}{\partial x_i}\right]_{i,j}$. Essentially, this is the regular derivative of $F$ with respect to its input; Papernot et al. seem to refer to is as “forward” derivative as it stands in contrast with regular backpropagation where the derivative of the loss with respect to the parameters is considered. They define an adversarial saliency map by considering $S(X, t)_i = \begin{cases}0 & \text{ if } \frac{\partial F_t(X)}{\partial X_i} < 0 \text{ or } \sum_{j\neq t} \frac{\partial F_j(X)}{\partial X_i} > 0\\ \left(\frac{\partial F_t(X)}{\partial X_i}\right) \left \sum_{j \neq t} \frac{\partial F_j(X)}{\partial X_i}\right & \text{ otherwise}\end{cases}$ where $t$ is the target class of the attack. The intuition of this definition is the following: The partial derivative of $F_t$ with respect to $X$ at location $i$ indicates how $X_i$ can be changed in order to increase $F_t$ (which is the goal). At the same time, $F_j$ for all $t \neq j$ is supposed to decrease for the targeted attack, this is implemented using the second (absolute) term. If, at a specific feature $X_i$, not increase of $X_i$ will lead to an increase of $F_t$, or an increase will also lead to an increase in the other $F_j$, the saliency map is zero – indicating that feature $i$ is useless. Note that here, only increases in $X_i$ are considered; Papernot et al. have a analogous formulation for considering decreases of $X_i$. Based on the concept of adversarial saliency maps, a simple attack is implemented as illustrated in Algorithm 1. In particular, the feature $X_i$ for which the saliency map $S(X, t)$ is maximized is chosen and increased by a fixed amount until the network $F$ changes the label to $t$ or a maximum perturbation is reached (in which case the attack fails). https://i.imgur.com/PvJv9yS.png Algorithm 1: The proposed algorithm for generating adversarial examples, see text for details. In experiments on MNIST they show the effectiveness of the proposed attack. Additionally, they attempt to quantify the robustness (called “hardness”) of specific classes. In particular, they show that some classes are harder to attack than others. To this end they derive the socalled adversarial distance $A(X, t) = 1  \frac{1}{M}\sum_i 1_{[S(X, t)_i > 0]}$ which counts the number of features in the adversarial saliency map that are greater than zero (i.e. can be perturbed during the attack in Algorithm 1). Personally, I find this “hardness” measure quite interesting because it is independent of a specific loss, but directly takes statistics of the learned model into account. Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Papernot et al. build upon the idea of network distillation [1] and propose a simple mechanism to defend networks against adversarial attacks. The main idea of distillation – originally introduced to “distill” the knowledge of very deep networks into smaller ones – is to train a second, possibly smaller network, with the probability distributions of the original, possibly larger network as supervision. Papernot et al. as well as the authors of [1] argue that the probability distributions, i.e. the activations of the final softmax layer (also referred to as “soft” labels), contain rich information about the task in contrast to the true “hard” labels. This allows the network to achieve similar performance while using less parameters or a different architecture. However, Papernot et al. do not distill a network's knowledge into a smaller one; instead they use distillation to make networks robust against adversarial attacks. They argue that most algorithms to generate adversarial examples make use of the “adversarial gradient”; i.e. the gradient of the network's cost w.r.t. its input. The adversarial gradient then guides perturbation of the input image in the direction of wrong classes (the authors consider a simple classification task for simplicity). Therefore, Papernot et al. Argure, the gradient around training samples needs to be reduced – in other words, the model needs to be smoothed. https://i.imgur.com/jXIhIGz.png The proposed approach is very simple, they just distill the knowledge of the network into another network with same architectures and hyper parameters. By using the probability distributions as “soft” labels instead of the hard labels for training, the network is essentially smoothed. The full procedure is illustrated in Figure 1. Despite the simplicity of the approach, I want to highlight some additional key observations:  Distillation is also supposed to help generalization by avoiding overly confident networks.  The success rate of adversarial attacks can be reduced significantly as shown in quantitative experiments.  The amplitude of adversarial gradients can be reduced, which means that the network has been smoothed and is less sensitive to variations in the input samples. Also see this summary on [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Variational Autoencoders are a type of generative model that seek to learn how to generate new data by incentivizing the model to be able to reconstruct input data, after compressing it to a lowdimensional space. Typically, the way that the reconstruction is scored against the original is by comparing the pixel by pixel values: a reconstruction gets a high score if it is able to place pixels of color in the same places that the original did. However, there are compelling reasons why this is a subpar way of scoring images. The central one is: it focuses on and penalizes superficial differences, so if the model accurately reproduces the focal object of the image, but does so, say, 10 pixels to the right of where it was previously, that will incur a penalty we might not actually want to apply. The flip side of this is that a direct pixelcomparison loss doesn’t differentiate between pixel differences that do or don’t change the fundamental substance of the image. For instance, having 100 pixels wrong around the border of a dog, making it seem very slightly larger, would be the same amount of error as having 100 pixels concentrated in a weird bulb that appears to be growing out of a dog’s ear, even though the former does a better job of being recognizable as a dog. The authors of the VAE/GAN paper have a clever approach to solving this problem, that involves taking the typical pixel loss, and breaking it up into two conceptual parts. The first focuses on aligning the conceptual features of the reconstructed image with the conceptual features of the input image. It does so by running both the input and the reconstruction through a discriminative convolutional model which  in the typical way of deep learning  learns ever more abstract features at each layer of the network. These “conceptual features” abstract out the precise pixel values, and instead capture the higher level features of the image. So, instead of calculating the pixelwise squared loss between the specific input x, and its afterbottleneck reconstruction x~, you take the squared loss between the feature maps at some layer for both x and x~, and push them to be closer together, so that the reconstruction shares the same features as the original. The second focuses on detaillevel specifics of images, but, cleverly, does so in a general, rather than a observationspecific way. This is done by training a GANstyle discriminator to tell the difference between generated images* and original image, and then using that loss to train the decoder part of the VAE. The cleverness of this comes from the fact that they are still enforcing that the details and structural features of the reconstructed image are not distinguishable from real images, but doing so in a general sense, rather than requiring the details to be an exact match to the details found in a given input x. https://i.imgur.com/Bmtmac2.png The authors freely admit that existing metrics of scoring images (which themselves *use* pixelwise similarity) rate their method as being worse than existing VAEs. However, they argue, that’s inherently a flawed metric, that doesn’t capture the aspects of clean visual quality we want in generated image. A metric they propose instead involves using an dataset where a list of attributes are attached to each image (old, black, blond, etc). They add these as additional input while training the network, so that whatever signals the decoder part of the model needs to turn someone blonde, it gets those from the externallygiven attribute vector, rather than a learned representation. This means that, once the model is trained, we can set some value of the attribute vector, and have the decoder generate samples conditional on that. The metric is constructed by taking the decoded samples conditioned on some attribute set, and then taking a classifier model that is trained on the real images to detect attribute values from the images. The generated images are then scored by how closely the predictions from the classifier model match the true values of the attributes. If the generator model were working perfectly, this error rate would as low as for real data. By this metric (which: grain of salt, since they invented), the VAE/GAN model is superior to both GANs and vanilla VAEs. 
[link]
There are 2 implementations for the paper: 1. [Reference Implementation of Deep Linear Discriminant Analysis (DeepLDA)](https://github.com/CPJKU/deep_lda). 2. [VahidooX/DeepLDA](https://github.com/VahidooX/DeepLDA). [It seems something is wrong with the cost function implemented](https://github.com/VahidooX/DeepLDA/issues/1#issuecomment392261355). Also, while they derive the Gradient they didn't verify it and in the implementation use Theano's Auto Grad (While other Auto Grad can't work it out). 
[link]
this paper: develop a framework to replay important transitions more frequently > learn efficienty prior work: uniformly sample a replay memory to get experience transitions evaluate: DQN + PER outperform DQN on 41 out of 49 Atari games ## Introduction **issues with online RL:** (solution: experience replay) 1. strongly correlated updates that break the i.i.d. assumption 2. rapid forgetting of rare experiences that could be useful later **key idea:** more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporaldifference (TD) error **issues with prioritization:** 1. loss of diversity > alleviate with stochastic prioritization 2. introduce bias > correct with importance sampling ## Prioritized Replay **criterion:**  the amount the RL agent can learn from a transition in its current state (expected learning progress) > not directly accessible  proxy: the magnitude of a transition’s TD error ~= how far the value is from its nextstep bootstrap estimate **stochastic sampling:** $$P(i)=\frac{p_i^\alpha}{\sum_k p_k^\alpha}$$ *p_i* > 0: priority of transition *i*; 0 <= *alpha* <= 1 determines how much prioritization is used. *two variants:* 1. proportional prioritization: *p_i* = abs(TD\_error\_i) + epsilon (small positive constant to avoid zero prob) 2. rankbased prioritization: *p_i* = 1/rank(i); **more robust as it is insensitive to outliers** https://i.imgur.com/T8je5wj.png **importance sampling:** IS weights: $$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta $$  weights can be folded into the Qlearning update by using $w_i*\delta_i$ instead of $\delta_i$  weights normalized by $\frac{1}{\max w_i}$ 
[link]
# Very Short The authors define a neural network as a nonlinear dynamical system whose fixed points correspond to the minima of some **energy function**. They then show that if one were to start at a fixedpoint and *perturb* the output units in the direction that minimizes a loss, the initial perturbation that would flow back through the network would be proportional to the gradient of the neural activations with respect to this loss. Thus, the initial propagation of those propagations (i.e. **early inference**) **approximates** the **backpropagated** gradients of the loss. 
[link]
The authors present an iterative approach for optimizing policies with guaranteed monotonic improvement. TRPO is similar to natural policy gradient methods and can be applied effectively in optimization of large nonlinear policies. \cite{KakadeL02} gave monotonic improvement guarantees for mixture of policies $\pi_{new}(as)=(1\alpha)\pi_{old}(as) + \alpha\pi'(as)$ where $\pi'=\mathrm{arg}\max_{\pi'}L_{\pi_{old}}(\pi')$ is the approximated expected return of a policy $\pi'$ in terms of the advantage over $\pi_{old}$, as $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new})  \frac{2\epsilon\gamma}{(1\gamma)^2}\alpha^2$ with $\eta$ the true expected return and $\epsilon$ the maximally expected advantage. The authors extend this approach to be applicable for all stochastic policy classes by replacing $\alpha$ with a distance measure between two policies $\pi_{new}$ and $\pi_{old}$. As distance measure they use the maximal Kullback–Leibler divergence $D_{KL}^{\max}(\pi_{new},\pi_{old})$ and show that $\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new}) CD_{KL}^{\max}(\pi_{new},\pi_{old})$, with $C= \frac{4\epsilon\gamma}{(1\gamma)^2}$. From this follows, that one is guaranteed to improve the true objective $\eta$ when performing the following maximaization $\mathrm{maximize}_\pi\left[L_{\pi_{old}}(\pi)CD_{KL}^{\max}(\pi,\pi_{old})\right]$. In practice however $C$ would only allow for small steps. Thus constraining $\mathrm{maximize}_\pi L_{\pi_{old}}(\pi)$ subject to $D_{KL}^{\max}(\pi,\pi_{old}) \leq \delta$ allows for larger steps in a **Trust Region** Due to the large number of constraints this problem is impractical to solve, which is why the authors replace the maximum KL divergence with approximated average KL. TRPO then works as follows: 1. Use a rollout procedure to collet a set of stateactionpairs wit Monte Carlo estimates of their $Q$Values 2. Average over the samples to construct the estimate objective $L_{\pi}$ as well as the constraint 3. Approximately solve the constrained optimization problem to update the policy parameters. They use the conjugate gradient algorithm followed by a linesearch. Their experiments support the claim that TRPO is able to effectively optimize large nonlinear policies. 
[link]
This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and deal with stochastic environment models. Value gradients are a type of policy gradient algorithm which represent a value function either by: * A learned Qfunction (a critic) * Linking together a policy, an environment model and reward function to define a recursive function to simulate the trajectory and the total return from a given state. By backpropagating though these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (like REINFORCE for example) which are modelfree and sample returns from the real environment. Applying value gradients to stochastic problems requires differentiating the stochastic bellman equation: \begin{equation} V ^t (s) = \int \left[ r^t + γ \int V^{t+1} (s) p(s'  s, a) ds' \right] p(as; θ) da \end{equation} To do that, the authors use a trick called reparameterisation to express the stochastic bellman equation as a deterministic function which takes a noise variable as an input. To differentiate a reparameterised function, one simply samples the noise variable then computes the derivative as if the function were deterministic. This can then be repeated $ M $ times and averaged to arrive at a Monte Carlo estimate for the derivative of the stochastic function. The reparameterised bellman equation is: $ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $ It's derivative with respect to the current state and the policy parameters is: $ V_s = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{s} + r_\textbf{a} \pi_\textbf{s} + \gamma \mathbb{E}_{\rho(\xi)} V'_{s'} (\textbf{f}_\textbf{s} + \textbf{f}_\textbf{a} \pi_\textbf{s}) \] $ $ V_\theta = \mathbb{E}_{\rho(\eta)} \[ r_\textbf{a} \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \[ V'_{\textbf{s'}} \textbf{f}_\textbf{a} \pi_\textbf{s} + V'_\theta\] \] $ Based on these relationships the authors define two algorithms; SVG(∞), SVG(1) * SVG(∞) takes the trajectory from an entire episode and starting at the terminal state accumulates a gradients $V_{\textbf{s}} $ and $ V_{\theta} $ using the expressions above to arrive at a policy gradient. SVG(∞) is onpolicy and only works with finitehorizon environments * SVG(1) trains a value function then uses its gradient as an estimate for $ V_{\textbf{s}} $ above. SVG(1) also uses importance weighting so as to be offpolicy and can work with infinitehorizon environments. Both algorithms use an environment model which is trained using an experience replay database. The paper also introduces SVG(0) which is a similar to SVG(1), but is modelfree. SVG was analysed using several MuJoCo environments and it was found that: * SVG(∞) outperformed a BBPT planner on a control problem with a stochastic model, indicating that gradient evaluation using real trajectories is more effective than planning for stochastic environments * SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞) * SVG(1) was able to solve several complex environments 
[link]
The authors extend a seq2seq model for MT with a language model. They first pretrain a seq2seq model and a neural language model, then train a separate feedforward component that takes the hidden states from both and combines them together to make a prediction. They compare to simply combining the output probabilities from both models (shallow fusion) and show improvement on different MT datasets. https://i.imgur.com/zD9jb4K.png 
[link]
# Simultaneous Deep transfer across domains and tasks ## Tzeng, Hoffman, Saenko, 2015 * The paper aims to exploit unlabeled and sparsely labeled data from the target domain. * As a baseline, they mention that one could match feature distributions between source and target domain. This work will also explore correlation between categories, such as _bottle_ and _mug._ * The authors derive inspiration from the _Name the dataset_ game by Torralbe and Efros. In this game, you train a classifier to predict which dataset an image originates from. This idea transpires into the domain confusion loss. The domain classifier measures the confusion between learned features from source and target domain. The image classifier learns a feature representation that makes the domain inditinguishable, as measured by the domain confusion. * The second idea also learns the similarity structure between objects in the target domain. This works as follows. _We first compute the average output probability distribution, or “softlabel,” over the source training examples in each category. Then, for each target labeled example, we directly optimize our model to match the distribution over classes to the soft label. In this way we are able to perform task adaptation by transferring information to categories with no explicit labels in the target domain._ * The experiments take place in two situations. The _supervised_ case, where only few labels are present in the target domain. The _semi supervised_ case, where only few labels of a subset of the classes are present. * In the final section, the authors perform analysis on theis own result. They show how the image classifier correctly labeled monitor, while no labels for monitor were present in the target domain. 
[link]
#### Goal + Predict 128 diagnoses for intensive pediatric care patients. #### Dataset: + Children's Hospital LA. + Episode is a multivariate time series that describes the stay of one patient in the intensive care unit. Dataset properties  Value  Number of episodes  10,401 Duration of episodes  From 12h to several months Time series variables  Systolic blood pressure, Diastolic blood pressure, Peripheral capillary refill rate, End tidal CO2, Fraction of inspired O2, Glasgow coma scale, Blood glucose, Heart rate, pH, Respiratory rate, Blood O2 Saturation, Body temperature, Urine output. + Resampling and missing values: + Irregularly sampled timeseries that is resampled to an hourly rate. + Mean measurement within each hour window is taken. + Forward and backfilling are used to fill gaps created by the resampling. + When variable time series is missing entirely: imputation with a clinically *normal* value defined by domain experts. + This paper is followed by [Modeling Missing Data in Clinical Time Series with RNNs](http://www.shortscience.org/paper?bibtexKey=journals/corr/LiptonKW16) from the same research group. + Labels: + Each episode is associated with 0 or more diagnoses. (inhouse taxonomy, ICD9 based). + Dataset contains 429 diagnoses. The paper focuses on the 128 most frequent diagnoses that appear 50 or more times in the dataset. #### Architecture: + LSTM with Target Replication: ![Architecture](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016a_target.png?raw=true "Target Replication") + Loss function: + For the model with target replication, output y is generated at every sequence step. The loss function is then a convex combination of the final loss (logloss in the case of this paper) and the average of the losses over all steps where T is the number of sequence steps and alpha is a hyperparameter. ![Loss function](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016a_loss.png?raw=true "Loss function") #### Experiments and Results: **Methodology**: + Split dataset: 80% training, 10% validation, 10% test + LTSM trained for 100 epochs via gradient stochastic gradient (with momentum). + Regularization L2: 1e6, obtained via validation dataset. + LSTM: 2 hidden layers with 64 cells or 128 cells (and 50% dropout) + Multiple combinations: target replication / auxiliary target variables (trained using the other 301 diagnoses and other clinical information as a target. Inferences are made only for the 128 major diagnoses. + Baselines for comparison: + Logistic Regression  L2 regularized + MLP with 3 hidden layers  ReLU  dropout 50%. + Baselines tested in the raw timeseries and in a feature engineering version made by domain experts. *Metrics*: + Micro AUC, Micro F1: calculated by adding the TPs, FPs, TNs and FNs for the entire dataset and for all classes. + Macro AUC, Macro F1: Arithmetic mean of AUCs and F1 scores for each of the classes. + Precision at 10: Fraction of correct diagnoses among the top 10 predictions of the model. + The upper bound for precision at 10 is 0.2281 since in the test set there are on average 2.281 diagnoses per patient. *Results*: ![All Results](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016a_allresults.png?raw=true "Performance metrics across all labels") *Results for selected diagnoses*: ![Results for Selected Diseases](https://raw.githubusercontent.com/tiagotvv/mlpapers/master/clinicaldata/images/Lipton2016a_selected.png?raw=true "Performance for selected diagnoses") #### Discussion: + Auxiliary outputs improve performance at the expense of increased training time. Very unbalanced dataset for some of the remaining 301 labels makes it spend an entire epoch only to learn that one of the target variables can take values other than 0. + RealTime Predictions: In the future, the authors expect that the proposed solution could be used to make continuously updated realtime alerts and diagnoses. 
[link]
1. UNET learns segmentation in an end to end images. 2. They solved Challenges are * Very few annotated images (approx. 30 per application). * Touching objects of the same class. # How: * Input image is fed in to the network, then the data is propagated through the network along all possible path at the end segmentation maps comes out. * In Unet architecture, each blue box corresponds to a multichannel feature map. The number of channels is denoted on top of the box. The xysize is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations. https://i.imgur.com/Usxmv6r.png * In two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2for down sampling. At each down sampling step they double the number of feature channels. * Contracting path (left side from up to down) is increases the feature channel and reduces the steps and an expansive path (right side from down to up) consists of sequence of up convolution and concatenation with the corresponds high resolution features from contracting path. * The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. ## Challenges: 1. Overlaptile strategy for seamless segmentation of arbitrary large images: * To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. * In fig, segmentation of the yellow area uses input data of the blue area and the raw data extrapolation by mirroring. https://i.imgur.com/NUbBRUG.png 2. Augment training data using deformation: * They use excessive data augmentation by applying elastic deformations to the available training images. * Then the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus. * Deformation used to be the most common variation in tissue and realistic deformations can be simulated efficiently. https://i.imgur.com/CyC8Hmd.png 3. Segmentation of touching object of the same class: * They propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function. * Ensure separation of touching objects, in that segmentation mask for training (inserted background between touching objects) get the loss weights for each pixel. https://i.imgur.com/ds7psDB.png 4. Segmentation of neural structure in electromicroscopy(EM): * Ongoing challenge since ISBI 2012 in this dataset structures with low contrast, fuzzy membranes and other cell components. * The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with corresponding fully annotated ground truth segmentation map for cells(white) and membranes (black). * An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the warping error, the Rand error and the pixel error. ### Results: * The unet (averaged over 7 rotated versions of the input data) achieves without any further pre or postprocessing a warping error of 0.0003529, a randerror of 0.0382 and a pixel error of 0.0611. https://i.imgur.com/6BDrByI.png * ISBI cell tracking challenge 2015, one of the dataset contains cell phase contrast microscopy has strong shape variations,weak outer borders, strong irrelevant inner borders and cytoplasm has same structure like background. https://i.imgur.com/vDflYEH.png * The first data set PHCU373 contains Glioblastomaastrocytoma U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy It contains 35 partially annotated training images. Here we achieve an average IOU ("intersection over union") of 92%,which is significantly better than the second best algorithm with 83%. https://i.imgur.com/of4rAYP.png * The second data set DICHeLa are HeLa cells on a flat glass recorded by differential interference contrast (DIC) microscopy  It contains 20 partially annotated training images. Here we achieve an average IOU of 77.5% which is significantly better than the second best algorithm with 46%. https://i.imgur.com/Y9wY6Lc.png 
[link]
**Summary**: A CNN is employed to estimate optical flow. The task is defined as a supervised learning problem. 2 architectures are proposed: a generic one (FlowNetSimple); and another including a correlation layer for feature vectors at different image locations (FlowNetCorr). Networks consist of contracting and expanding parts and are trained as a whole using backpropagation. The correlation layer in FlowNetCorr finds correspondences between feature representations of 2 images instead of following the standard matching approach of extracting features from patches of both images and then comparing them. https://i.imgur.com/iUe8ir3.png **Approach**: 1. *Contracting part*: Their first choice is to stack both input images together and feed them through a rather generic network, allowing the network to decide itself how to process the image pair to extract the motion information. This is called 'FlowNetSimple' and consists only of convolutional layers. The second approach 'FlowNetCorr' is to create two separate, identical processing streams for the two images and to combine them at a later stage. The two architectures are illustrated above. The 'correlation layer' performs multiplicative patch comparisons between two feature maps. The correlation of two patches centered at $x_1$ in the first map and $x_2$ in the second map is then defined as: $$c(x_1,x_2) =\sum_{o\in[k,k]*[k,k]} \langle f_{1}(x_{1}+o),f_{2}(x_{2},o) \rangle $$ for a square patch of size $K = 2k+1$. 2. *Expanding part*: It consists of the upconvolutional layers  combination of unpooling and convolution. 'Upconvolution' is applied to feature maps and concatenated with corresponding feature maps from the 'contractive' part of the network. **Experiments**: A new dataset called 'Flying Chairs' is created with $22,872$ image pairs and flow fields. It is created by applying affine transformations to images collected from Flickr and a publicly available set of renderings of 3D chair models [1]. Results are reported on Sintel, KITTI, Middlebury datasets, as well as on their synthetic Flying Chairs dataset. The proposed method is compared with different methods: EpicFlow [2], DeepFlow [3], EDPM [4], and LDOF [5]. The authors inferred that even though the number of parameters of the two networks (FlowNetC, FlowNetCorr) is virtually the same, the FlowNetC slightly more overfits to the training data. The architecture has nine convolutional layers with stride of $2$ in six of them and a $ReLU$ nonlinearity after each layer. As training loss, endpoint error (EPE) is used which is the standard error measure for optical flow estimation. Below figure shows examples of optical flow prediction on the Sintel dataset. Endpoint error is also shown. https://i.imgur.com/xIRpUZQ.png **Scope for Improvement**: It would be interesting to see the performance of the network on more realistic data. **References**: [1] Aubry, Mathieu, et al. "Seeing 3d chairs: exemplar partbased 2d3d alignment using a large dataset of cad models." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. [2] Revaud, Jerome, et al. "Epicflow: Edgepreserving interpolation of correspondences for optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. [3] Weinzaepfel, Philippe, et al. "DeepFlow: Large displacement optical flow with deep matching." Proceedings of the IEEE International Conference on Computer Vision. 2013. [4] Bao, Linchao, Qingxiong Yang, and Hailin Jin. "Fast edgepreserving patchmatch for large displacement optical flow." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014. [5] Brox, Thomas, and Jitendra Malik. "Large displacement optical flow: descriptor matching in variational motion estimation." IEEE transactions on pattern analysis and machine intelligence 33.3 (2011): 500513. 
[link]
* They propose a CNNbased approach to detect faces in a wide range of orientations using a single model. However, since the training set is skewed, the network is more confident about upright faces. * The model does not require additional components such as segmentation, boundingbox regression, segmentation, or SVM classifiers ### How * __Data augmentation__: to increase the number of positive samples (24K face annotations), the authors used randomly sampled subwindows of the images with IOU > 50% and also randomly flipped these images. In total, there were 20K positive and 20M negative training samples. * __CNN Architecture__: 5 convolutional layers followed by 3 fullyconnected. The fullyconnected layers were converted to convolutional layers. NonMaximal Suppression is applied to merge predicted bounding boxes. * __Training__: the CNN was trained using Caffe Library in the AFLW dataset with the following parameters: * Finetuning with AlexNet model * Input image size = 227x227 * Batch size = 128 (32+, 96) * Stride = 32 * __Test__: the model was evaluated on PASCAL FACE, AFW, and FDDB dataset. * __Running time__: since the fullyconnected layers were converted to convolutional layers, the input image in running time may be of any size, obtaining a heat map as output. To detect faces of different sizes though, the image is scaled up/down and new heatmaps are obtained. The authors found that rescaling image 3 times per octave gives reasonable good performance. ![DDFD heatmap](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DDFD__heatmap.png?raw=true "DDFD heatmap") * The authors realized that the model is more confident about upright faces than rotated/occluded ones. This trend is because the lack of good training examples to represent such faces in the training process. Better results can be achieved by using better sampling strategies and more sophisticated data augmentation techniques. ![DDFD example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DDFD__example.png?raw=true "DDFD example") * The authors tested different strategies for NMS and the effect of boundingbox regression for improving face detection. They NMSavg had better performance compared to NMSmax in terms of average precision. On the other hand, adding a boundingbox regressor degraded the performance for both NMS strategies due to the mismatch between annotations of the training set and the test set. This mismatch is mostly for sideview faces. ### Results: * In comparison to RCNN, the proposed face detector had significantly better performance independent of the NMS strategy. The authors believe the inferior performance of RCNN due to the loss of recall since selective search may miss some of the face regions; and loss in localization since boundingbox regression is not perfect and may not be able to fully align the segmentation boundingboxes, provided by selective search, with the ground truth. * In comparison to other stateofart methods like structural model, TSM and cascadebased methods the DDFD achieve similar or better results. However, this comparison is not completely fair since the most of methods use extra information of pose annotation or information about facial landmarks during the training. 
[link]
* ELUs are an activation function * The are most similar to LeakyReLUs and PReLUs ### How (formula) * f(x): * `if x >= 0: x` * `else: alpha(exp(x)1)` * f'(x) / Derivative: * `if x >= 0: 1` * `else: f(x) + alpha` * `alpha` defines at which negative value the ELU saturates. * E. g. `alpha=1.0` means that the minimum value that the ELU can reach is `1.0` * LeakyReLUs however can go to `Infinity`, ReLUs can't go below 0. ![ELUs vs LeakyReLUs vs ReLUs](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/ELUs__slopes.png?raw=true "ELUs vs LeakyReLUs vs ReLUs") *Form of ELUs(alpha=1.0) vs LeakyReLUs vs ReLUs.* ### Why * They derive from the unit natural gradient that a network learns faster, if the mean activation of each neuron is close to zero. * ReLUs can go above 0, but never below. So their mean activation will usually be quite a bit above 0, which should slow down learning. * ELUs, LeakyReLUs and PReLUs all have negative slopes, so their mean activations should be closer to 0. * In contrast to LeakyReLUs and PReLUs, ELUs saturate at a negative value (usually 1.0). * The authors think that is good, because it lets ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence. * So ELUs can measure the presence of concepts quantitatively, but the absence only qualitatively. * They think that this makes ELUs more robust to noise. ### Results * In their tests on MNIST, CIFAR10, CIFAR100 and ImageNet, ELUs perform (nearly always) better than ReLUs and LeakyReLUs. * However, they don't test PReLUs at all and use an alpha of 0.1 for LeakyReLUs (even though 0.33 is afaik standard) and don't test LeakyReLUs on ImageNet (only ReLUs). ![CIFAR100](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/ELUs__cifar100.png?raw=true "CIFAR100") *Comparison of ELUs, LeakyReLUs, ReLUs on CIFAR100. ELUs ends up with best values, beaten during the early epochs by LeakyReLUs. (Learning rates were optimized for ReLUs.)*  ### Rough chapterwise notes * Introduction * Currently popular choice: ReLUs * ReLU: max(0, x) * ReLUs are sparse and avoid the vanishing gradient problem, because their derivate is 1 when they are active. * ReLUs have a mean activation larger than zero. * Nonzero mean activation causes a bias shift in the next layer, especially if multiple of them are correlated. * The natural gradient (?) corrects for the bias shift by adjusting the weight update. * Having less bias shift would bring the standard gradient closer to the natural gradient, which would lead to faster learning. * Suggested solutions: * Centering activation functions at zero, which would keep the offdiagonal entries of the Fisher information matrix small. * Batch Normalization * Projected Natural Gradient Descent (implicitly whitens the activations) * These solutions have the problem, that they might end up taking away previous learning steps, which would slow down learning unnecessarily. * Chosing a good activation function would be a better solution. * Previously, tanh was prefered over sigmoid for that reason (pushed mean towards zero). * Recent new activation functions: * LeakyReLUs: x if x > 0, else alpha*x * PReLUs: Like LeakyReLUs, but alpha is learned * RReLUs: Slope of part < 0 is sampled randomly * Such activation functions with nonzero slopes for negative values seemed to improve results. * The deactivation state of such units is not very robust to noise, can get very negative. * They suggest an activation function that can return negative values, but quickly saturates (for negative values, not for positive ones). * So the model can make a quantitative assessment for positive statements (there is an amount X of A in the image), but only a qualitative negative one (something indicates that B is not in the image). * They argue that this makes their activation function more robust to noise. * Their activation function still has activations with a mean close to zero. * Zero Mean Activations Speed Up Learning * Natural Gradient = Update direction which corrects the gradient direction with the Fisher Information Matrix * HessianFree Optimization techniques use an extended GaussNewton approximation of Hessians and therefore can be interpreted as versions of natural gradient descent. * Computing the Fisher matrix is too expensive for neural networks. * Methods to approximate the Fisher matrix or to perform natural gradient descent have been developed. * Natural gradient = inverse(FisherMatrix) * gradientOfWeights * Lots of formulas. Apparently first explaining how the natural gradient descent works, then proofing that natural gradient descent can deal well with nonzeromean activations. * Natural gradient descent autocorrects bias shift (i.e. nonzeromean activations). * If that autocorrection does not exist, oscillations (?) can occur, which slow down learning. * Two ways to push means towards zero: * Unit zero mean normalization (e.g. Batch Normalization) * Activation functions with negative parts * Exponential Linear Units (ELUs) * *Formula* * f(x): * if x >= 0: x * else: alpha(exp(x)1) * f'(x) / Derivative: * if x >= 0: 1 * else: f(x) + alpha * `alpha` defines at which negative value the ELU saturates. * `alpha=0.5` => minimum value is 0.5 (?) * ELUs avoid the vanishing gradient problem, because their positive part is the identity function (like e.g. ReLUs) * The negative values of ELUs push the mean activation towards zero. * Mean activations closer to zero resemble more the natural gradient, therefore they should speed up learning. * ELUs are more noise robust than PReLUs and LeakyReLUs, because their negative values saturate and thus should create a small gradient. * "ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence" * Experiments Using ELUs * They compare ELUs to ReLUs and LeakyReLUs, but not to PReLUs (no explanation why). * They seem to use a negative slope of 0.1 for LeakyReLUs, even though 0.33 is standard afaik. * They use an alpha of 1.0 for their ELUs (i.e. minimum value is 1.0). * MNIST classification: * ELUs achieved lower mean activations than ReLU/LeakyReLU * ELUs achieved lower cross entropy loss than ReLU/LeakyReLU (and also seemed to learn faster) * They used 5 hidden layers of 256 units each (no explanation why so many) * (No convolutions) * MNIST Autoencoder: * ELUs performed consistently best (at different learning rates) * Usually ELU > LeakyReLU > ReLU * LeakyReLUs not far off, so if they had used a 0.33 value maybe these would have won * CIFAR100 classification: * Convolutional network, 11 conv layers * LeakyReLUs performed better during the first ~50 epochs, ReLUs mostly on par with ELUs * LeakyReLUs about on par for epochs 50100 * ELUs win in the end (the learning rates used might not be optimal for ELUs, were designed for ReLUs) * CIFER100, CIFAR10 (big convnet): * 6.55% error on CIFAR10, 24.28% on CIFAR100 * No comparison with ReLUs and LeakyReLUs for same architecture * ImageNet * Big convnet with spatial pyramid pooling (?) before the fully connected layers * Network with ELUs performed better than ReLU network (better score at end, faster learning) * Networks were still learning at the end, they didn't run till convergence * No comparison to LeakyReLUs 
[link]
* The paper describes a method to separate content and style from each other in an image. * The style can then be transfered to a new image. * Examples: * Let a photograph look like a painting of van Gogh. * Improve a dark beach photo by taking the style from a sunny beach photo. ### How * They use the pretrained 19layer VGG net as their base network. * They assume that two images are provided: One with the *content*, one with the desired *style*. * They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*. * They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*. * How to calculate a *Gram Matrix*: * Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations. * Convert each filter's activations to a (1dimensional) vector. * Pick all pairs of filters. Calculate the scalar product of both filter's vectors. * Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128). * Repeat that for every pair to get the Gram Matrix. * The Gram Matrix roughly represents the *texture* of the image. * Now you have the content representation (activations of a layer) and the style representation (Gram Matrices). * Create a new image of the size of the content image. Fill it with random white noise. * Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.) * Make changes to the new image using gradient descent to optimize a loss function. * The loss function has two components: * The mean squared error between the new image's content representation and the previously extracted content representation. * The mean squared error between the new image's style representation and the previously extracted style representation. * Add up both components to get the total loss. * Give both components a weight to alter for more/less style matching (at the expense of content matching). ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples") *One example input image with different styles added to it.*  ### Rough chapterwise notes * Page 1 * A painted image can be decomposed in its content and its artistic style. * Here they use a neural network to separate content and style from each other (and to apply that style to an existing image). * Page 2 * Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style). * They call the feature responses in higher layers *content representation*. * To capture style information, they use a method that was originally designed to capture texture information. * They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multiscale representation of the style. * Page 3 * They use VGG as their base CNN. * Page 4 * Based on the extracted style features, they can generate a new image, which has equal activations in these style features. * The new image should match the style (texture, color, localized structures) of the artistic image. * The style features become more and more abtstract with higher layers. They call that multiscale the *style representation*. * The key contribution of the paper is a method to separate style and content representation from each other. * These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork). * Page 6 * The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.) * Content and style can't be separated perfectly. * Their loss function has two terms, one for content matching and one for style matching. * The terms can be increased/decreased to match content or style more. * Page 8 * Previous techniques work only on limited or simple domains or used nonparametric approaches (see nonphotorealistic rendering). * Previously neural networks have been used to classify the time period of paintings (based on their style). * They argue that separating content from style might be useful and many other domains (other than transfering style of paintings to images). * Page 9 * The style representation is gathered by measuring correlations between activations of neurons. * They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do. * They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for styleinvariant classification. * Page 9, Methods * They use the 19 layer VGG net as their basis. * They use only its convolutional layers, not the linear ones. * They use average pooling instead of max pooling, as that produced slightly better results. * Page 10, Methods * The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features. * The build a style representation by calculating Gram Matrices for each layer. * Page 11, Methods * The Gram Matrix is generated in the following way: * Convert each filter of a convolutional layer to a 1dimensional vector. * For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters. * Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix. * Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices). * That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting. * Page 12, Methods * To transfer the style of a painting to an existing image, proceed as follows: * Start with a white noise image. * Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting). * Each distance (content, style) can be weighted to have more or less influence on the loss function. 
[link]
* Generative Moment Matching Networks (GMMN) are generative models that use maximum mean discrepancy (MMD) for their objective function. * MMD is a measure of how similar two datasets are (here: generated dataset and training set). * GMMNs are similar to GANs, but they replace the Discriminator with the MMD measure, making their optimization more stable. ### How * MMD calculates a similarity measure by comparing statistics of two datasets with each other. * MMD is calculated based on samples from the training set and the generated dataset. * A kernel function is applied to pairs of these samples (thus the statistics are acutally calculated in highdimensional spaces). The authors use Gaussian kernels. * MMD can be approximated using a small number of samples. * MMD is differentiable and therefor can be used as a standard loss function. * They train two models: * GMMN: Noise vector input (as in GANs), several ReLU layers into one sigmoid layer. MMD as the loss function. * GMMN+AE: Same as GMMN, but the sigmoid output is not an image, but instead the code that gets fed into an autoencoder's (AE) decoder. The AE is trained separately on the dataset. MMD is backpropagated through the decoder and then the GMMN. I.e. the GMMN learns to produce codes that let the decoder generate good looking images. ![Formula](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generative_Moment_Matching_Networks__formula.png?raw=true "Formula") *MMD formula, where $x_i$ is a training set example and $y_i$ a generated example.* ![Architectures](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generative_Moment_Matching_Networks__architectures.png?raw=true "Architectures") *Architectures of GMMN (left) and GMMN+AE (right).* ### Results * They tested only on MNIST and TFD (i.e. datasets that are well suited for AEs...). * Their GMMN achieves similar log likelihoods compared to other models. * Their GMMN+AE achieves better log likelihoods than other models. * GMMN+AE produces good looking images. * GMMN+AE produces smooth interpolations between images. ![Interpolations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generative_Moment_Matching_Networks__interpolations.png?raw=true "Interpolations") *Generated TFD images and interpolations between them.*  ### Rough chapterwise notes * (1) Introduction * Sampling in GMMNs is fast. * GMMNs are similar to GANs. * While the training objective in GANs is a minimax problem, in GMMNs it is a simple loss function. * GMMNs are based on maximum mean discrepancy. They use that (implemented via the kernel trick) as the loss function. * GMMNs try to generate data so that the moments in the generated data are as similar as possible to the moments in the training data. * They combine GMMNs with autoencoders. That is, they first train an autoencoder to generate images. Then they train a GMMN to produce sound code inputs to the decoder of the autoencoder. * (2) Maximum Mean Discrepancy * Maximum mean discrepancy (MMD) is a frequentist estimator to tell whether two datasets X and Y come from the same probability distribution. * MMD estimates basic statistics values (i.e. mean and higher order statistics) of both datasets and compares them with each other. * MMD can be formulated so that examples from the datasets are only used for scalar products. Then the kernel trick can be applied. * It can be shown that minimizing MMD with gaussian kernels is equivalent to matching all moments between the probability distributions of the datasets. * (4) Generative Moment Matching Networks * Data Space Networks * Just like GANs, GMMNs start with a noise vector that has N values sampled uniformly from [1, 1]. * The noise vector is then fed forward through several fully connected ReLU layers. * The MMD is differentiable and therefor can be used for backpropagation. * AutoEncoder Code Sparse Networks * AEs can be used to reconstruct highdimensional data, which is a simpler task than to learn to generate new data from scratch. * Advantages of using the AE code space: * Dimensionality can be explicitly chosen. * Disentangling factors of variation. * They suggest a combination of GMMN and AE. They first train an AE, then they train a GMMN to generate good codes for the AE's decoder (based on MMD loss). * For some reason they use greedy layerwise pretraining with later finetuning for the AE, but don't explain why. (That training method is outdated?) * They add dropout to their AE's encoder to get a smoother code manifold. * Practical Considerations * MMD has a bandwidth parameter (as its based on RBFs). Instead of chosing a single fixed bandwidth, they instead use multiple kernels with different bandwidths (1, 5, 10, ...), apply them all and then sum the results. * Instead of $MMD^2$ loss they use $\sqrt{MMD^2}$, which does not go as fast to zero as raw MMD, thereby creating stronger gradients. * Per minibatch they generate a small number of samples und they pick a small number of samples from the training set. They then compute MMD for these samples. I.e. they don't run MMD over the whole training set as that would be computationally prohibitive. * (5) Experiments * They trained on MNIST and TFD. * They used an GMMN with 4 ReLU layers and autoencoders with either 2/2 (encoder, decoder) hidden sigmoid layers (MNIST) or 3/3 (TFD). * They used dropout on the encoder layers. * They used layerwise pretraining and finetuning for the AEs. * They tuned most of the hyperparameters using bayesian optimization. * They use minibatch sizes of 1000 and compute MMD based on those (i.e. based on 2000 points total). * Their GMMN+AE model achieves better log likelihood values than all competitors. The raw GMMN model performs roughly on par with the competitors. * Nearest neighbor evaluation indicates that it did not just memorize the training set. * The model learns smooth interpolations between digits (MNIST) and faces (TFD). 
[link]
#### Problem addressed: Image Segmentation, Pixel labelling, Object recognition #### Summary: The authors approximate the CRF inference procedure using the mean field approximation. They use a specific set of unary and binary potentials. Each step in the mean field inference is modelled as a convolutional layer with appropriate filter sizes and channels. The mean field inference procedure requires multiple iterations (over time) to achieve convergence. This is exploited to model the whole procedure as CNNRNN. The unary potentials and initial pixel labels are learnt using a FCN. The authors train the FCN and CNNRNN separately and jointly and find that joint training gives the better performance of the two on the VOC2007 dataset. #### Novelty: Formulating the mean field CRF inference procedure as a combination of CNN and RNN. Joint training procedure of a fully convolutional network (FCN) + CRF as RNN to perform pixel labelling tasks #### Drawbacks: Does not scale with number of classes. No theoretical justification for success of joint training, only empirical justification #### Datasets: VOC2012, COCO #### Additional remarks: Presentation video available on cedar server #### Resources: http://www.robots.ox.ac.uk/~szheng/papers/CRFasRNN.pdf #### Presenter: Bhargava U. Kota 
[link]
This paper presents an approach to visual question answering by dynamically composing networks of independent neural modules based on the semantic parsing of the question. Main contributions:  Independent neural modules that can be combined together and jointly trained.  Attention: Convolutional layer, with different filters for different instances. For example, attend[dog], attend[cat], etc.  Reattention: FCReLUFCReLU, weights are different for different instances. For example, reattend[above], reattend[not], etc.  Combination: Stacks two attention maps, followed by convReLU to map to a single attention map. For example, combine[and], combine[except], etc.  Classification: Combines attention map and image, followed by FCSoftmax to map to answer. For example, classify[colors].  Measurement: FCReLUFCSoftmax, takes attention map as input. For example, measure[exists].  Structured representations are extracted from questions and these are then mapped to network layouts, including the connections between them.  All leaves become attend modules, all internal nodes become reattend or combine modules dependent on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.  Networks with the same structure but different instantiations can be processed in the same batch. For example, classify[color]\(attend[cat]\), classify[where]\(attend[truck]\).  Predictions from the module network are combined with LSTM representations to get the final answer.  Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network.  Semantic regularities: 'green' is an implausible answer for 'what color is the bear?'.  Experiments are performed on the synthetic SHAPES dataset and VQA dataset.  Performance on the SHAPES dataset is better as it is designed to benefit from compositionality. ## Strengths  This model takes advantage of the inherently compositional property of language, which makes a lot of sense. VQA is an extremely complex task and breaking it up into separate functions/modules is an excellent approach. ## Weaknesses / Notes  Mapping from syntactic structure to module network is handdesigned. Ideally, the model should learn this too to generalize.  Due to its compositional nature, this kind of model can possibly be used in the zeroshot learning setting, i.e. generalize to novel question types that the network hasn't seen before. 
[link]
This paper presents a neat method for learning spatiotemporal representations from videos. Convolutional features from intermediate layers of a CNN are extracted, to preserve spatial resolution, and fed into a modified GRU that can (in theory) learn infinite temporal dependencies. Main contributions:  Their variant of GRU (called GRURCN) uses convolution operations instead of fullyconnected units.  This exploits the local correlation in image frames across spatial locations.  Features from pool2, pool3, pool4, pool5 are extracted and fed into independent GRURCNs. Hidden states at last time step are now feature volumes, which are average pooled to reduce to 1x1 spatially, and fed into a linear + softmax classifier. Outputs from each of these classifiers is averaged to get the final prediction.  Other variants that they experiment with are bidirectional GRURCNs and stacked GRURCNs i.e. GRURCNs with connections between them (with maxpool operations for dimensionality reduction).  Bidirectional GRURCNs perform the best.  Stacked GRURCNs perform worse than the other variants, probably because of limited data.  They evaluate their method on action recognition and video captioning, and show significant improvements on a CNN+RNN baseline, comparing favorably with other stateoftheart methods (like C3D). ## Strengths  The idea is simple and elegant. Earlier methods for learning video representations typically used 3D convolutions (k x k x T filters), which suffered from finite temporal capacity, or RNNs sitting on top of lastlayer CNN features, which is unable to capture finer spatial resolution. In theory, this formulation solves both.  Changing fullyconnected operations to convolutions has the additional advantage of requiring lesser parameters (n\_input x n\_output x input\_width x input\_height v/s n\_input x n\_output x k\_width x k\_height). 
[link]
This paper presents a model that can dynamically split computation across coarse, lowcapacity subnetworks and fine, highcapacity subnetworks. The coarse model processes the entire input data and is typically shallow while the fine model focuses on a few important regions of the input and is deeper. For images as input, this is a hard attention mechanism that can be trained with stochastic gradient descent and doesn't require a taskspecific attention policy trained by reinforcement learning. Key ideas:  A deep network h can be decomposed into bottom layers f and top layers g such that $h(x) = g(f(x))$. Further, f consists of two alternate subnetworks $f\_c$ and $f\_f$. $f\_c$ is a lowcapacity subnetwork while $f\_f$ is a highcapacity subnetwork.  g should be able to use representations from $f\_c$ and $f\_f$ dynamically. $f\_c$ processes the entire input while $f\_f$ only a few important regions of the input.  The coarse model processes the entire input and the norm of the gradient of the entropy with respect to the coarse vector at each spatial region is computed which is a measure of saliency. The use of the entropy gradient as a saliency measure encourages selecting input regions that could affect the uncertainty in the model’s predictions the most.  The topk input regions with highest saliency values are processed by the fine model. The refined representation for input to the top layers consists of both coarse and fine vectors. During backpropagation, gradients are computed for the refined model, i.e. propagating gradients at each position into either the coarse or fine features, depending on which was used.  To make sure $f\_c$ and $f\_f$ representations are interchangeable and input to the top layers has smooth transitions, an additional objective term minimizes the squared distance between coarse and fine representations and this additional term is used only to optimize the coarse layers, not the fine layers.  Experiments on cluttered MNIST, SVHN and comparison with RAM, DRAW and study with various values of number of patches for fine processing. ## Strengths  Neat, general way to split computation based on importance of input; a hardattention mechanism that can be trained with SGD, unlike RAM.  Entropy gradient as a measure of saliency is an interesting idea, and it doesn't need labels i.e. can be used at test time. 
[link]
This paper introduces the task of dense captioning and proposes a network architecture that processes an image and produce region descriptions in a single pass and can be trained endtoend. Main contributions:  Dense captioning  Generalization of object detection (caption consists of single word) and image captioning (region consists of whole image).  Fully convolution localization network  Fully differentiable, can be trained jointly with the rest of the network  Consists of a region proposal network, box regression (similar to Faster RCNN) and bilinear interpolation (similar to Spatial Transformer Networks) for sampling.  Network details  Convolutional layer features are extracted for image  For each element in the feature map, k anchor boxes of different aspect ratios are selected in the input image space.  For each of these, the localization layer predicts offsets and confidence.  The region proposals are projected on the convolutional feature map and a sampling grid is computed from output feature map to input (bilinear sampling).  The computed feature map is passed through an MLP to compute representations corresponding to each region.  These are passed (in a batch) as the first word to an LSTM (Show and Tell) which is trained to predict each word of the caption. ## Strengths  Fully differentiable 'spatial attention' mechanism (bilinear interpolation) in place of RoI pooling as in the case of Faster RCNN.  RoI pooling is not differentiable with respect to the input proposal coordinates.  Fast, and impressive qualitative results. ## Weaknesses / Notes The model is very well engineered together from different works (Faster RCNN + Spatial Transformer Networks + Show & Tell). 
[link]
Predict frames of a video using 3 newly proposed and complementary methods: 1. Multi scale cnn 2. GAN 3. Image gradient difference loss Datasets:  * UCF101 * Sports1M GAN  Generator: * Input: several frames of video from dataset * output: next frame of video Discriminator: * input: original and last frame * output: is the last frame from dataset or generated Problem: Still blurry on edges on moving object. Solution: Image gradient difference loss 
[link]
**Contributions**: * Use dropout to get segmentation with a measure of model uncertainty. **Explanation**: We can consider dropout as a way of getting samples from a posterior distribution of models [see these papers: [1]( https://arxiv.org/abs/1506.02142), [2](https://arxiv.org/abs/1506.02158)) and thus can be be used to do Bayesian inference. This amounts to using dropout both during train and test time and getting multiple outputs (i.e sampling from model distribution) in test time. Mean of these outputs is taken as final segmentation and variation as model uncertainty. Sample averaging performs better than weight averaging (i.e usual test time method for dropout) if averaged over more than 6 samples. Paper used 50 samples. General technique which can be applied to any segmentation model. Sample averaging alone improves scores by 23%. **Benchmarks**: <score without dropout > score with dropout> *VOC2012* Dilation Network: 71.3 > 73.1 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20Dilation%20Network) FCN8: 62.2 > 65.4 [Source](http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_Bayesian%20FCN) SegNet: 59.1 > 60.5 (Source: reported in the paper) *CamVid* SegNet: 71.20 > 76.3 (Source: reported in the paper) **My comments** Nothing very new. Gotcha is that sample averaging performs better. 
[link]
Layerwise Relevance Propagation (LRP) is a novel technique has been used by authors in multiple usecases (apart from this publication) to demonstrate the robustness and advantage of a *decomposition* method over other heatmap generation methods. Such heatmap generation methods are very crucial for increasing interpretability of Deep Learning models as such. Apart from LRP relevance, authors also discuss quantitative ways to measure the accuracy of the heatmap generated. ### LRP & Alternatives What is LRP ? LRP is a principled approach to decompose a classification decision into pixelwise relevances indicating the contributions of a pixel to the overall classification score. The approach is derived from a layerwise conservation principle , which forces the propagated quantity (e.g. evidence for a predicted class) to be preserved between neurons of two adjacent layers. Denoting by R(l) [i] the relevance associated to the ith neuron of layer and by R (l+1) [j] the relevance associated to the jth neuron in the next layer, the conservation principle requires that ![](https://i.imgur.com/GQxrnCT.png) where R(l) [i] is given as ![](https://i.imgur.com/FD7AAfF.png) where z[i,j] is the activation of jth neuron because of input from ith neuron As per authors this is not necssarily the only relevance funtion which is conserved. The intuition behind using such a function is that lowerlayer neurons that mostly contribute to the activation of the higherlayer neuron receive a larger share of the relevance Rj of the neuron j. A downside of this propagation rule (at least if *epsilon* = 0) is that the denominator may tend to zero if lowerlevel contributions to neuron j cancel each other out. The numerical instability can be overcome by setting *epsilon* > 0. However in that case, the conservation idea is relaxated in order to gain better numerical properties. To conserve relevance, it can be formulated as sum of positive and negative activations ![](https://i.imgur.com/lo7f8AI.png) such that *alpha*  *beta* = 1 #### Alternatives to LRP for heatmap **Senstiivity measurement** In such methods of generating heamaps, gradient of the output with respect to input is used for generating heatmap. This quantity measures how much small changes in the pixel value locally affect the network output. ##### Disadvantages Given most models use ReLU as activation function, the gradient flows only through activation with positive output  thereby making makes the backward mapping discontinuous, and consequently strongly local. Also same applies for maxpool activations  wherein gradients only flow through neurons with maximum intensity in local neighbourhood. Also, given most of these methods use absolute impact on prediction cause by changes in pixel intensities, the granularity of whether the pixel intensity was in favour or against evidence is lost. **Deconvolutional Networks** ##### Disadvantages Here the backward discontinuity problem of sensitivity based methods are absent, hence global features can be captured. However, since the method only takes in activation from final layer (which learns the presence or absence of features mostly) , using this for generating heatmaps is likely to yield avergae maps, lacking image specific localisation effects LRP is able to counter the effects nicely because of the way it uses relevance #### Performance of heatmaps Few concerns that the authors raise are  A heatmap is not a segmentation mask on the contrary missing evidence or the context may be very important for classification  Salient features represent average explanations of what distinguishes one image category from another. For individual images these explanations may be meaningless or even wrong. For instance, salient features for the class ‘bicycle’ may be the wheels and the handlebar. However, in some images a bicycle may be partly occluded so that these parts of a bike are not visible. In these images salient features fail to explain the classifier’s decision (which still may be correct). Authors propose a novel method (MoRF  *Most Relevant First* ) of objectively quantifying quality of a heatmap. A good detailed idea of the measure can best be obtained from the paper. To give an idea, the most reliable method should ideally rank the most relevant regions in the same order even if small perturbations in pixel intensities are observed (in nonrelevant areas. The quantity of interest in this case is the area over the MoRF perturbation curve (AOPC). #### Observation Most of the sensitivity based methods answer to the question  *what change would make the image more or less belong to the category car* which isn't really the classifier's question. LRP plans to answer the real classifier question *what speaks for the presence of a car in the image* An image below would be a good example of how LRPs can denoise heatmaps generated on the basis of sensitivity. ![](https://i.imgur.com/Sq0b5yg.png) 
[link]
This paper is about finding naturally looking images for the analysis of machine learning models in computer vision. There are 3 techniques: * **inversion**: the aim is to reconstruct an image from its representation * **activation maximization**: search for patterns that maximally stimulate a representation component (deep dream). This does NOT use an initial natural image. * **caricaturization**: exaggerate the visual patterns that a representation detects in an image The introduction is nice. ## Code The paper comes with code: [robots.ox.ac.uk/~vgg/research/invrep](http://www.robots.ox.ac.uk/~vgg/research/invrep/index.html) ([GitHub: aravindhm/deepgoggle](https://github.com/aravindhm/deepgoggle)) ## Related * 2013, Zeiler & Fergus: [Visualizing and Understanding Convolutional Networks ](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma) 
[link]
# Addressing the Rare Word Problem in Neural Machine Translation ## Introduction * NMT(Neural Machine Translation) systems perform poorly with respect to OOV(outofvocabulary) words or rare words. * The paper presents a wordalignment based technique for translating such rare words. * [Link to the paper](https://arxiv.org/abs/1410.8206) ## Technique * Annotate the training corpus with information about what do different OOV words (in the target sentence) correspond to in the source sentence. * NMT learns to track the alignment of rare words across source and target sentences and emits such alignments for the test sentences. * As a postprocessing step, use a dictionary to map rare words from the source language to target language. ## Annotating the Corpus ### Copy Model * Annotate the OOV words in the source sentence with tokens *unk1*, *unk2*,..., etc such that repeated words get the same token. * In target language, each OOV word, that is aligned to some OOV word in the source language, is assigned the same token as the word in the source language. * The OOV word in the target language, which has no alignment or is aligned with a known word in the source language. is assigned the null token. * Pros * Very straightforward * Cons * Misses out on words which are not labelled as OOV in the source language. ### PosAll  Positional All Model * All OOV words in the source language are assigned a single *unk* token. * All words in the target sentences are assigned positional tokens which denote that the *jth* word in the target sentence is aligned to the *ith* word in the source sentence. * Aligned words that are too far apart, or are unaligned, are assigned a null token. * Pros * Captures complete alignment between source and target sentences. * Cons * It doubles the length of target sentences. ### PosUnk  Positional Unknown Model * All OOV words in the source language are assigned a single *unk* token. * All OOV words in the target sentences are assigned *unk* token with the position which gives the relative position of the word in the target language with respect to its aligned source word. * Pros: * Faster than PosAll model. * Cons * Does not capture alignment for all words. ## Experiments * Dataset * Subset of WMT'14 dataset * Alignment computed using the [Berkeley Aligner](https://code.google.com/archive/p/berkeleyaligner/) * Used architecture from [Sequence to Sequence Learning with Neural Networks paper](https://gist.github.com/shagunsodhani/a2915921d7d0ac5cfd0e379025acfb9f). ## Results * All the 3 approaches (more specifically the PosUnk approach) improve the performance of existing NMTs in the order PosUnk > PosAll > Copy. * Ensemble models benefit more than individual models as the ensemble of NMT models works better at aligning the OOV words. * Performance gains are more when using smaller vocabulary. * Rare word analysis shows that performance gains are more when proposition of OOV words is higher. 
[link]
TLDR; The authors present an RNNbased variational autoencoder that can learn a latent sentence representation while learning to decode. A linear layer that predicts the parameter of a Gaussian distribution is inserted between encoder and decoder. The loss is a combination of the reconstruction objective and the KL divergence with the prior (Gaussian)  similar to the "standard" VAE does. The authors evaluate the model on Language Modeling and Impution (Inserting Missing Words) tasks and also present a qualitative analysis of the latent space. #### Key Points  Training is tricky. Vanilla training results in the decoder ignoring the encoder and the KL error term becoming zero.  Training Trick 1: KL Cost Annealing. During training, increase weight on the KL term of the cost to anneal from vanilla to VAE.  Training Trick 2: Word dropout using a word keep rate hyperparameter. This forces the decoder to rely more on the global representation.  Results on Language Modeling: Standard model (without cost annealing and word dropout) trails Vanilla RNNLM model, but not by much. KL cost term goes to zero in this setting. In an inputless decoder setting (word keep prob = 0) the VAE outperforms the RNNLM (obviously)  Results on Imputing Missing Words: Benchmarked using an adversarial error classifier. VAE significantly outperforms RNNLM. However, the comparison is somewhat unfair since the RNNML has nothing to condition on and relies on unigram distribution for the first token.  Qualitative: Can use higher word dropout to get more diverse sentences  Qualitative: Can walk the latent space and get grammatical and meaningful sentences. 
[link]
TLDR; The authors train a DQN on textbased games. The main difference is that their QValue functions embeds the state (textual context) and action (textbased choice) separately and then takes the dot product between them. The authors call this a Deep Reinforcement Learning Relevance network. Basically, just a different Q function implementation. Empirically, the authors show that their network can learn to solve "Saving John" and "Machine of Death" text games. 
[link]
In this paper, the authors raise a very important point for instance based image retrieval. For a task like an image recognition features extracted from higher layer of deep networks works really well in general, but for task like instance based image retrieval features extracted from higher layers don't prove to be that useful, so the authors suggest that we take features from lower layer and on those features, apply [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf). On top of the VLAD encoding as part of post processing, we perform steps like intranormalisation and then apply PCA and reduce the encoding to a size of 128 Dimension. The authors have performed their experiments using [Googlenet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG16](https://arxiv.org/pdf/1409.1556v6.pdf), and they tried Inception 3a, Inception 4a and Inception 4e on GoogleNet and conv4_2, conv5_1 and conv5_2 on VGG16. The above mentioned layers has almost similar performance on the dataset they have used. The performance metric used by the authors is Mean Average Precision(MAP). 
[link]
This paper is about pruning a neural network to reduce the FLOPs and memory necessary to use it. This method reduces AlexNet parameters to 1/9 and VGG16 to 1/13 of the original size. ## Receipt 1. Train a network 2. Prune network: For each weight $w$: if w < threshold, then w < 0. 3. Train pruned network ## See also * [Optimal Brain Damage](http://www.shortscience.org/paper?bibtexKey=conf/nips/CunDS89) 
[link]
# Deep Convolutional Generative Adversarial Nets ## Introduction * The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN)  a topologically constrained variant of conditional GAN. * [Link to the paper](https://arxiv.org/abs/1511.06434) ## Benefits * Stable to train * Very useful to learn unsupervised image representations. ## Model * GANs difficult to scale using CNNs. * Paper proposes following changes to GANs: * Replace any pooling layers with strided convolutions (for discriminator) and fractional strided convolutions (for generators). * Remove fully connected hidden layers. * Use batch normalisation in both generator (all layers except output layer) and discriminator (all layers except input layer). * Use LeakyReLU in all layers of the discriminator. * Use ReLU activation in all layers of the generator (except output layer which uses Tanh). ## Datasets * LargeScale Scene Understanding. * Imagenet1K. * Faces dataset. ## Hyperparameters * Minibatch SGD with minibatch size of 128. * Weights initialized with 0 centered Normal distribution with standard deviation = 0.02 * Adam Optimizer * Slope of leak = 0.2 for LeakyReLU. * Learning rate = 0.0002, β1 = 0.5 ## Observations * LargeScale Scene Understanding data * Demonstrates that model scales with more data and higher resolution generation. * Even though it is unlikely that model would have memorized images (due to low learning rate of minibatch SGD). * Classifying CIFAR10 dataset * Features * Train in Imagenet1K and test on CIFAR10. * Max pool discriminator's convolutional features (from all layers) to get 4x4 spatial grids. * Flatten and concatenate to get a 28672dimensional vector. * Linear L2SVM classifier trained over the feature vector. * 82.8% accuracy, outperforms Kmeans (80.6%) * Street View House Number Classifier * Similar pipeline as CIFAR10 * 22.48% test error. * The paper contains many examples of images generated by final and intermediate layers of the network. * Images in the latent space do not show sharp transitions indicating that network did not memorize images. * DCGAN can learn an interesting hierarchy of features. * Networks seems to have some success in disentangling image representation from object representation. * Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman  normal woman + normal man = smiling man` visually. 
[link]
*Probabilistic weighted pooling* is proposed in this paper. It is based on maxpooling and dropout. 
[link]
# A Roadmap towards Machine Intelligence ## Introduction * The paper presents some general characteristics that intelligent machines should possess and a roadmap to develop such intelligent machines in small, realistic steps. * [Link to the paper](https://arxiv.org/abs/1511.08130) ## Ability to Communicate * The intelligent agents should be able to communicate with humans, preferably using language as the medium. * Such systems can be programmed through natural language and can access much of the human knowledge which is encoded using natural language. * The learning environment should facilitate interactive communication and the machine should have a minimalistic bit interface for IO to keep the interface simple. * Further, the machine should be free to use any internal representation for learning tasks. ## Ability to Learn * Learning allows the machine to adapt to the external environment and correct their mistakes. * Users should be able to control the motivation of the machine via a communication channel. This is similar to the notion of rewards in reinforcement learning. ## A simulated ecosystem to educate communicationbased intelligent machines * Simulated environment to teach basic linguistic interactions and knowhow to operate in the world. * Though the environment should be challenging enough to force the machine to "learn how to learn", its complexity should be manageable. * Unlike class AI block worlds, the simulated environment is not intended to teach an exhaustive set of functionality to the agent. The aim is to teach the machine how to learn efficiently by combining already acquired skills. ### Description #### Agent * Learner or actor * Teacher * Assigns tasks and rewards to the learner and provides helpful information. * Aim is to kick start the learner's efficient learning capabilities without providing enough direct information. * Environment * Learner explores the environment by giving orders, asking questions and receiving feedback. * Environment uses a controlled language which is more explicit and restricted. Think of learner as a highlevel programming language, the teacher as the programmer and the environment as the compiler. #### Interface Channels * Generic input and output channels. * Teacher and environment write to the input channel. * Reward is written to input channel. * Learner writes to the output channel and learns to use ambigous prefixes to address the agents and services it needs to interact with. #### Reward * Way to provide feedback to the learner. * Rewards should become sparse as the learner's intelligence grows and "curiosity" should be a learnt strategy. * Learner should maximise average reward over time so that faster strategies are preferred in case of equal rewards. #### Incremental Structure * Think of learner progressing through different levels where skills from earlier levels can be used in later levels. * Tasks need not be ordered within a level. * Learner starts by performing basic tasks like repeating characters then learns to associate linguistic strings to action sequences. Further, the learner learns to ask questions and "read" natural text. #### Time Off * Learner is given time to either explore the environment or to interact with the Teacher or to update its internal structure by replaying the previous experience. #### Evaluation * Evaluating the learning agent on only the final behaviour only is not sufficient as it overlooks the number of attempts to reach the optimal behaviour. * Better approach would be to conduct public competition where developers have access to preprogrammed environment for fixed amount of time and learners are evaluated on tasks that are considerably different from the tasks encountered during training. #### Tasks A brief overview of the type of tasks is provided [here](https://github.com/facebookresearch/CommAIenv/blob/master/TASKS.md) ## Types of Learning * Concept of positive and negative rewards. * Discovery of algorithms. * Remember facts, skills, and learning strategies. ## Long term memory * To store facts, algorithms and even ability to learn. ## Compositional Learning Skills * Producing new structures by combining together known facts and skills. * Understanding new concepts should not always require training examples. ## Computational properties of intelligent machines * Computational model should be able to represent any pattern in data (alternatively, represent any algorithm in fixed length). * Among the various Turningcomplete computational systems available, the most natural choice would be a compositional system that can perform computations in parallel. * Alternatively, a nongrowing model with immensely large capacity could be used. * In a growing model, new cells are connected to ones that spawned them leading to topological structures that can contribute to learning. * But it is not clear if such topological structures can arise in a largecapacity unstructured model. 
[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative). The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image. ## LMNN Large Margin Nearest Neighbor (LMNN) is learning a pseudometric $$d(x, y) = (x y) M (x y)^T$$ where $M$ is a positivedefinite matrix. The only difference between a pseudometric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold. ## Curriculum Learning: Triplet selection Show simple examples first, then increase the difficulty. This is done by selecting the triplets. They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low. They want to have $$f(x_i^a)  f(x_i^p)_2^2 + \alpha < f(x_i^a)  f(x_i^n)_2^2$$ where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$. ## Tasks * **Face verification**: Is this the same person? * **Face recognition**: Who is this person? ## Datasets * 99.63% accuracy on Labeled FAces in the Wild (LFW) * 95.12% accuracy on YouTube Faces DB ## Network Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14). ## See also * [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma) 
[link]
#### Introduction * The paper presents a suite of benchmark tasks to evaluate endtoend dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent. * [Link to the paper](https://research.facebook.com/publications/evaluatingprerequisitequalitiesforlearningendtoenddialogsystems/) #### Dataset * Created using largescale realworld sources  OMDB (Open Movie Database), MovieLens and Reddit. * Consists of ~75K movie entities and ~3.5M training examples. #### Tasks ##### QA Task * Answering Factoid Questions without relation to the previous dialogue. * KB(Knowledge Base) created using OMDB and stored as triplets of the form (Entity, Relation, Entity). * Question (in Natural Language Form) generated by creating templates using [SimpleQuestions](https://arxiv.org/abs/1506.02075) * Instead of giving out just 1 response, the system ranks all the answers in order of their relevance. ##### Recommendation Task * Providing personalised responses to the user via recommendation instead of providing universal facts as in case 1. * MovieLens dataset with a *user x item* matrix of ratings. * Statements (for any user) are generated by sampling highly ranked movies by the user and forming a statement about these movies using natural language templates. * Like the previous case, a list of ranked responses is generated. ##### QA + Recommendation Task * Maintaining short dialogues involving both factoid and personalised content. * Dataset consists of short conversations of 3 exchanges (3 from each participant). ##### Reddit Discussion Task * Identify most likely response is discussions on Reddit. * Data processed to flatten the potential conversation so that it appears to be a two participant conversation. ##### Joint Task * Combines all the previous tasks into one single task to test all the skills at once. #### Models Tested * **Memory Networks**  Comprises of a memory component that includes both long term memory and short term context. * **Supervised Embedding Models**  Sum the word embeddings of the input and the target independently and compare them with a similarity metric. * **Recurrent Language Models**  RNN, LSTM, SeqToSeq * **Question Answering Systems**  Systems that answer natural language questions by converting them into search queries over a KB. * **SVD(Singular Value Decomposition)**  Standard benchmark for recommendation. * **Information Retrieval Models**  Given a message, find the most similar message in the training dataset and report its output or find a most similar response to input directly. #### Result ##### QA Task * QA System > Memory Networks > Supervised Embeddings > LSTM ##### Recommendation Task * Supervised Embeddings > Memory Networks > LSTM > SVD ##### Task Involving Dialog History * QA + Recommendation Task and Reddit Discussion Task * Memory Networks > Supervised Embeddings > LSTM ##### Joint Task * Supervised word embeddings perform very poorly even when using a large number of dimensions (2000 dimensions). * Memory Networks perform better than embedding models as they can utilise the local context and the longterm memory. But they do not perform as well on standalone QA tasks. 
[link]
#### Introduction The [paper](http://arxiv.org/pdf/1502.05698v10) presents a framework and a set of synthetic toy tasks (classified into skill sets) for analyzing the performance of different machine learning algorithms. #### Tasks * **Single/Two/Three Supporting Facts**: Questions where a single(or multiple) supporting facts provide the answer. More is the number of supporting facts, tougher is the task. * **Two/Three Supporting Facts**: Requires differentiation between objects and subjects. * **Yes/No Questions**: True/False questions. * **Counting/List/Set Questions**: Requires ability to count or list objects having a certain property. * **Simple Negation and Indefinite Knowledge**: Tests the ability to handle negation constructs and model sentences that describe a possibility and not a certainty. * **Basic Coreference, Conjunctions, and Compound Coreference**: Requires ability to handle different levels of coreference. * **Time Reasoning**: Requires understanding the use of time expressions in sentences. * **Basic Deduction and Induction**: Tests basic deduction and induction via inheritance of properties. * **Position and Size Reasoning** * **Path Finding**: Find path between locations. * **Agent's Motivation**: Why an agent performs an action ie what is the state of the agent. #### Dataset * The dataset is available [here](https://research.facebook.com/research/babi/) and the source code to generate the tasks is available [here](https://github.com/facebook/bAbItasks). * The different tasks are independent of each other. * For supervised training, the set of relevant statements is provided along with questions and answers. * The tasks are available in English, Hindi and shuffled English words. #### Data Simulation * Simulated world consists of entities of various types (locations, objects, persons etc) and of various actions that operate on these entities. * These entities have their internal state and follow certain rules as to how they interact with other entities. * Basic simulations are of the form: <actor> <action> <object> eg Bob go school. * To add variations, synonyms are used for entities and actions. #### Experiments ##### Methods * Ngram classifier baseline * LSTMs * Memory Networks (MemNNs) * Structured SVM incorporating externally labeled data ##### Extensions to Memory Networks * **Adaptive Memories**  learn the number of hops to be performed instead of using the fixed value of 2 hops. * **Ngrams**  Use a bag of 3grams instead of a bagofwords. * **Nonlinearity**  Apply 2layer neural network with *tanh* nonlinearity in the matching function. ##### Structured SVM * Uses coreference resolution and semantic role labeling (SRL) which are themselves trained on a large amount of data. * First train with strong supervision to find supporting statements and then use a similar SVM to find the response. ##### Results * Standard MemNN outperform Ngram and LSTM but still fail on a number of tasks. * MemNNs with Adaptive Memory improve the performance for multiple supporting facts task and basic induction task. * MemNNs with Ngram modeling improves results when word order matters. * MemNNs with Nonlinearity performs well on Yes/No tasks and indefinite knowledge tasks. * Structured SVM outperforms vanilla MemNNs but not as good as MemNNs with modifications. * Structured SVM performs very well on path finding task due to its nongreedy search approach. 
[link]
#### Introduction This [paper](https://github.com/lucastheis/ride) introduces *recurrent image density estimator* (RIDE), a generative model by combining a *multidimensional* recurrent neural network with mixtures of experts to model the distribution of natural image. In this work, the authors used *spatial* LSTMs (SLSTM) to capture the semantics in the form of hidden states where these hidden vectors are then fed into a factorized *factorized mixtures of conditional Gaussian scale mixtures* (MCGSMs) to predict the state of the corresponding pixels. ##### __1. Spatial long shortterm memory (SLSTM)__ This is a straightforward extension of the multidimensional RNN in order to capture long range interaction. Let $\mathbf{x}$ be a grayscale image patch and $x_{ij}$ be the intensity of pixel at location ${ij}$. At each location $ij$, each LSTM unit perform the following operations: $\mathbf{c}_{ij} = \mathbf{g}_{ij} \odot \mathbf{i}_{ij} + \mathbf{c}_{i,j1} \odot \mathbf{f}^c_{ij} + \mathbf{c}_{i1,j} \odot \mathbf{f}^r_{ij} $ $\mathbf{h}_{ij} = \tanh(\mathbf{c}_{ij} \odot \mathbf{o}_{ij})$ $\begin{pmatrix} \mathbf{g}_{ij} \\ \mathbf{o}_{ij} \\ \mathbf{i}_{ij} \\ \mathbf{g}_{ij}\\ \mathbf{f}_{ij}^r\\ \mathbf{f}_{ij}^c \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma\\ \sigma \end{pmatrix} T_{\mathbf{A,b}} \begin{pmatrix} \mathbf{x}_{<ij} \\ \mathbf{h}_{i,j1} \\ \mathbf{h}_{i1,j} \end{pmatrix} $ where $\mathbf{c}_{ij}$ and $\mathbf{h}_{ij}$ are memory units and hidden units respectively. Note that, there are 2 different forget gates $\mathbf{f}^c_{ij}$ and $\mathbf{f}^r_{ij}$ for the 2 preceding memory states $\mathbf{c}_{i,j1}$ and $\mathbf{c}_{i1,j}$. Also note that $\mathbf{x}_{<ij}$ here denotes a set of *causal neighborhood* by applying Markov assumption. ![ride_1](http://i.imgur.com/W8ugGvl.png) As shown in Fig. C, although the prediction of a pixel depends only on its neighborhood (green) through feedforward connections, there is an indirect connection to a much larger region (red) via recurrent connections. ##### __2. Factorized mixtures of conditional Gaussian scale mixtures__ A generative model can usually be expressed as $p(\mathbf{x};\mathbf{\theta}) = \prod_{i,j} p(x_{ij}\mathbf{x}_{<ij}; \mathbf{\theta})$ using chain rule. One way to improve the representational power of a model is to introduce different sets of parameters for each pixel, i.e. $p(\mathbf{x}; \{ \mathbf{\theta} \}) = \prod_{i,j} p(x_{ij}\mathbf{x}_{<ij}; \mathbf{\theta}_{ij})$. However, untying shared parameters will lead to drastic increase of parameters. Therefore, the author applied 2 simple common used assumptions: 1. __Markov assumption__: $\mathbf{x}_{<ij}$ is limited to small neighborhood around $x_{ij}$ (causal neighborhood) 2. __Stationary and shift invariance__: the same set of $\mathbf{\theta}_{ij}$ is used for every location ${ij}$ which corresponds to recurrent structure in RNN. Therefore, the hidden vector from SLSTMs can be fed into the MCGSM to predict the state of corresponding label, i.e. $p(x_{ij}  \textbf{x}_{<ij}) = p(x_{ij}  \textbf{h}_{ij})$. The conditional distribution distribution in MCGSM is represented as a mixture of experts: $p(x_{ij}  \mathbf{x}_{<ij}; \mathbf{\theta}_{ij}) = \sum_{c,s} p(c, s  \mathbf{x}_{<ij}, \mathbf{\theta}_{ij}) p (x_{ij}  \mathbf{x}_{<ij}, c, s, \mathbf{\theta}_{ij})$. where the first and second term correspond to gate and experts respectively. To further reduce the number of parameters, the authors proposed using a *factorized* MCGSM in order to use larger neighborhoods and more mixture components. (*__Remarks__: I am not too sure about the exact training of MCGSM, but as far as I understand, the MCGSM is firstly trained endtoend with SLSTM using SGD with momentum and then finetuned using LBFGS after each epoch by fixing the parameters of SLSTM.*) * For training: ``` for n in range(num_epochs): for b in range(0, inputs.shape[0]  batch_size + 1, batch_size): # compute gradients f, df = f_df(params, b) loss.append(f / log(2.) / self.num_channels) # update SLSTM parameters for l in train_layers: for key in params['slstm'][l]: diff['slstm'][l][key] = momentum * diff['slstm'][l][key]  df['slstm'][l][key] params['slstm'][l][key] = params['slstm'][l][key] + learning_rate * diff['slstm'][l][key] # update MCGSM parameters diff['mcgsm'] = momentum * diff['mcgsm']  df['mcgsm'] params['mcgsm'] = params['mcgsm'] + learning_rate * diff['mcgsm'] ``` * Finetuning (part of the code) ``` for l in range(self.num_layers): self.slstm[l] = SLSTM( num_rows=hiddens.shape[1], num_cols=hiddens.shape[2], num_channels=hiddens.shape[3], num_hiddens=self.num_hiddens, batch_size=min([hiddens.shape[0], self.MAX_BATCH_SIZE]), nonlinearity=self.nonlinearity, extended=self.extended, slstm=self.slstm[l], verbosity=self.verbosity) hiddens = self.slstm[l].forward(hiddens) # finetune with early stopping based on validation performance return self.mcgsm.train( hiddens_train, outputs_train, hiddens_valid, outputs_valid, parameters={ 'verbosity': self.verbosity, 'train_means': train_means, 'max_iter': max_iter}) ``` 
[link]
Facebook has [released a series of papers](https://research.facebook.com/blog/learningtosegment/) for object segmentation and detection. This paper is the first in that series. This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)): 1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm. 2. A CNN classifier is applied on each of the proposals. The current paper improves the step 1, i.e., region/object proposals. Most object proposals approaches fall into three categories: * Objectness scoring * Seed Segmentation * Superpixel Merging Current method is different from these three. It share similarities with [Faster RCNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN. The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object. ## Model and Training Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGGA pretrained model. Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints: * the patch contains an object roughly centered in the input patch * the object is fully contained in the patch and in a given scale range Note that the network must output a mask for a single object at the center even when multiple objects are present. Figure 1 shows the architecture and sampling for training. ![figure1](https://i.imgur.com/zSyP0ij.png) Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation. ## Inference During full image inference, model is applied densely at multiple locations and scales. This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN). ![figure2](https://i.imgur.com/dQWfy8R.png) This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation. 
[link]
Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions~transformations.md). This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).  Model  The model utilizes a Siamese architecture with each head having convolutional and fullyconnected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fullyconnected layer.  The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to be from [1/3t, 1/2t] and [1/2t, 2/3t] respectively and estimated via brute force search during training.  The action is represented as a linear transformation between the final fullyconnected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.  The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) maximize distance for incorrect transformations if greater than a chosen margin.  ACT Dataset  50 keywords, 43 classes, ~500 YouTube videos per keyword.  The authors collect the ACT dataset primarily for the task of crosscategory generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?  The ACT dataset has class and superclass annotations from human workers. Each superclass has different subcategories which are the same action under different subjects, objects and scenes.  Experiments  Action recognition on UCF101, HMDB51, ACT.  Crosscategory generalization on ACT.  Visualizations  Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.  Gradient visualizations (Simonyan et al. 2014): model focuses on changes in scene (human + object) than context.  Embedding retrievals based on transformed precondition embeddings. ** Thoughts **  Modeling action as a transformation from precondition to effect is a very neat idea.  The exact formulation and supporting experiments and ablation studies are thorough.  During inference, the model first extracts features for all frames and then does a brute force search over (y,z\_p,z\_e) to estimate the action category and segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass. 
[link]
I spend the week at ICML, and this paper on generative models is one of my favourites so far: To be clear: this post doesn't add much to the presentation of the paper, but I will attempt to summarise my understanding of it. Also, I want to make clear that this is not my work. Unsupervised learning has been one of the most interesting areas of machine learning in the last decades, but it is in the spotlight again since the deep learning crowd started to care about it. Unsupervised learning is hard because evaluating the loss function people want to use (log likelihood) is intractable for most interesting models. Therefore people come up with  alternative objective functions, such as adversarial training, maximum mean discrepancy, or pseudolikelihood, which can be evaluated for a large class of interesting models  alternative optimisation methods or approximate inference methods such as contrastive divergence or variational Bayes  models that have some nice properties. This paper is an example of the latter #### The key idea behind the paper What we typically try to do in representation learning is to map data to a latent representation. While the Data can have arbitrarily complex distribution along some complicated nonlinear manifold, we want the computed latent representations to have a nice distribution, like a multivariate Gaussian. This paper takes this idea very explicitly using a stochastic mapping to turn data into a representation: a random diffusion process. If you take any data, and apply Brownian motionlike stochastic process to this, you will end up with a standard Gaussian distributed variable, due to the stationarity of the Brownian motion. Below image shows an example: 2D observations (left) have a complex data distribution along the Swiss roll manifold. If one applies Brownian motion to each datapoint, the complicated structure starts to diffuse, and eventually the data is scrambled to become white noise (right). ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150709at132741.png) Now the trick the authors used is to train a dynamical system to inverts this random walk, to be able to reconstruct the original data distribution from the random Gaussian noise. Amazingly, this works, and the traninig objective becomes very similar to variational autoencoders. Below is a figure showing what happens when we try to reconstruct data in the Swiss roll example: The top images from right to left: we start with a bunch of points drawn from random noise (top right). We apply the inverse nonlinear transformation to these points (top middle). Over time points will be pushed towards the original Swiss roll manifold (top left). `The information about the data distribution is encoded in the approximate inverse dynamical system` The bottom pictures show where this dynamical system tries to push points as time progresses. ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150709at133016.png) This is super cool. Now we have a deep generative process that can turn random noise into something that looks like our datapoints. It can generate roughly naturallooking images like these: ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150709at133338.png) #### Advantages In this model a lot of things that are otherwise hard to do are easy to do: 1. generating/imagining data is straightforward 2. inference, i.e. calculating the latent representation from data, is simple 3. you can multiply the distribution with another distribution, making Bayesian calculations for stuff like denoising or superresolution possible. #### Drawbacks and extensions I think a drawback of the model is that if you run the diffusion process for too long (i.e. make the model deeper), the mutual information between datapoint and its representation is bound to decrease, due to the stationarity of Brownian motion. I guess this is going to be an important limitation to the depth of these models. Also, the latent representations at each layer are assumed to be exactly if the same dimensionality and type as the data itsef. So if we are modeling 100x100 images, then all layers in the resulting network will have 100k nodes. I guess this can be overcome by combining variational autoencoders with this method. Also, you can imagine augmenting your space with extra 'pixels' that are only used for richer representations in the intermediate layers. Anyway, this is super cool, go read the paper. 
[link]
This is my second favourite paper from ICML last week, and I think the title really does not do it justice. It is a great idea about training rich, tractable autoregressive generative models of data, and doing so by using standard techniques from autoencoder training with dropout. Caveat (again): this is not my work, and this blog post does not really add anything new to the paper, only my perspective on it. #### Unsupervised learning primer (again) Unsupervised learning is about modelling the probability distribution $p(\mathbf{x})$ of some data, from which we observe independent samples $\mathbf{x}_i$. Often, the vector $\mathbf{x}$ is high dimensional, such as in images, where different components of $\mathbf{x}$ encode pixel intensities. Typically, a probability model is specified as $q(\mathbf{x};\theta) = \frac{f(\mathbf{x};\theta)}{Z_\theta}$, where $f(\mathbf{x};\theta)$ is some positive function parametrised by $\theta$. The denominator $Z_\theta$ is called the normalisation constant which makes sure that $q$ is a valid probability model: it has to sum up to one over all possible configurations of $\mathbf{x}$. The central problem in unsupervised learning is that for the most interesting models in high dimensional spaces, calculating $Z_\theta$ is intractable, so crucial quantities such as the model likelihood cannot be calculated and the model cannot be fitted. The community is therefore in search for  interesting models that have tractable normalisation constants  fancy methods to deal with intractable models (pseudolikelihood, adversarial networks, contrastive divergence) This paper is about the former. #### Core ingredient: autoregressive models This paper sidesteps the high dimensional normalisation problem by restricting the class of probability distributions to autoregressive models, which can be written as follows: $$q(\mathbf{x};\theta) = \prod_{d=1}^{D} q(x_{d}\vert x_{1:d1};\theta).$$ Here $x_d$ denotes the $d^{th}$ component of the input vector $\mathbf{x}$. In a model like this, we only need to compute the normalisation of each $q(x_{d}\vert x_{1:d1};\theta)$ term, and we can be sure that the resulting model is a valid model over the whole vector $\mathbf{x}$. But as normalising these onedimensional probability distributions is a lot easier, we have a whole range of interesting tractable distributions at our disposal. #### Training multiple models simultaneously Autoregressive models are used a lot in time series modelling and language modelling: hidden Markov models or recurrent neural networks are examples. There, autoregressive models are a very natural way to model data because the data comes ordered (in time). What's weird about using autoregressive models in this context is that it is sensitive to ordering of dimensions, even though that ordering might not mean anything. If $\mathbf{x}$ encodes an image, you can think about multiple orders in which pixel values can be serialised: sweeping lefttoright, toptobottom, insideout etc. For images, neither of these orderings is particularly natural, yet all of these different ordering specifies a different model above. But it turns out, you don't have to choose one ordering, you can choose all of them at the same time. The neat trick in the masking autoencoder paper is to train multiple autoregressive models all at the same time, all of them sharing (a subset of) parameters $\theta$, but defined over different ordering of coordinates. This can be achieved by thinking of deep autoregressive models as a special cases of an autoencoder, only with a few edges missing. ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150713at104854.png) Consider a fixed ordering of input dimensions. Now take a fully connected autoencoder, which defines a probability distribution $q(\hat{\mathbf{x}}\vert\mathbf{x};\theta)$. You can write this as $$q(\hat{\mathbf{x}}\vert\mathbf{x};\theta) = \prod_{d=1}^{D} q(\hat{x}_{d}\vert x_{1:D};\theta)$$ Note the similarity to the autoregressive equation above, the only difference being that each coordinate now depends on every other coordinate ($x_{1:D}$), rather than only coordinates that precede it in the ordering ($x_{1:d1}$). To turn this equation into autoregressive equation above, we simply have to remove dependencies of each output coordinate $\hat{x}_{d}$ on any input coordinate $\hat{x}_{e}$, where $e>=d$. This can be done by removing edges along all paths from the input coordinate $\hat{x}_{e}$ to output coordinate $\hat{x}_{d}$. You can achieve this cutting of edges by multiplying the weight matrices $\mathbf{W}^{l}$of the autoencoder neural network elementwise by binary masking matrices $\mathbf{M}^{\mathbf{W}^{l}}$. Hence the name masked autoencoder. The procedure above considered a fixed ordering of coordinates. You can repeat this process for any arbitrary ordering, for which you obtain different masking matrices but otherwise the same procedure. If you train this autoencoder network with randomly sampled masking matrices, you essentially train a family of autoregressive models, each sharing some parameters via the underlying autoencoder network. Because masking is similar to the popular dropout training, implementing it is relatively straightforward and requires minimal change to existing autoencoder code. However, now you have a generative model  in fact, a large set of generative models  which has a lot of nice properties for you to enjoy. The slight concern Of course, this would be all too good to be true: a powerful deep generative model that is easy to evaluate and all. I think the problem with this is the following: If you train just one of these autoregressive models, that's tractable, exact and fine. But you really want to combine all (or many) of these becuause individually they are weak. What is the interpretation of training with randomly drawn masking matrices? You can think of it as stochastic gradient descent on the following objective: $$\mathbb{E}_{\mathbf{x}\sim p}\mathbb{E}_{\pi \sim U} \log q(\mathbf{x},\pi,\theta)$$ Here, I used $\pi$ to denote a permutation of the coordinates, and $\mathbb{E}_{\pi \sim U}$ to take an expectation over a uniform distribution over permutations. The distribution $q(\mathbf{x},\pi,\theta)$ is the autoregressive model defined by $\theta$ and the masking matrices corresponding to permutation $\pi$. $\mathbb{E}_{\mathbf{x}\sim p}$ denotes averaging over the empirical data distribution. Combining as a mixture model One way to combine autoregressive models is to take a mixture model. In the paper, the authors actually use an ensemble to make predictions, which is analogous to an equal mixture model where the mixture weights are uniform and fixed. The likelihood for this model would be the following: $$\mathbb{E}_{\mathbf{x}\sim p} \log \mathbb{E}_{\pi \sim U} q(\mathbf{x},\pi,\theta)$$ Notice that the averaging over permutations now takes place inside the logarithm. By Jensen's inequality, we can say that randomly sampling masking matrices during training amounts to optimising a stochastically estimated lower bound to the likelihood of an equal mixture. This raises the question whether actually learning the weights in such a model would be hard using something like an EM algorithm with a sparsityenforcing regulariser/prior over mixture weights. Combining as a product of experts model Combining these autoregressive models as a mixture is not ideal. In mixture modeling the sharpness of the mixture distribution is bounded by the sharpness of component distributions. Your combined prediction can never be more confident than the your most confident model. In this case, I expect the AR models to be pretty poor models individually, and therefore not to be very sharp, particularly along the first few coordinates in the corresponding ordering. A better way to combine probabilistic models is via product of experts. You can actually interpret training by random masking matrices as a form of product of experts, but with the global normalisation ignored. I'm not sure if it would be possible/tractable to do anything better than this. 
[link]
This post is a comment on the Laplacian pyramidbased generative model proposed by researchers from NYU/Facebook AI Research. Let me start by saying that I really like this model, and I think  looking at the samples drawn  it represents a nice big step towards convincing generative models of natural images. To summarise the model, the authors use the Laplacian pyramid representation of images, where you recursively decompose the image to a lower resolution subsampled component and the highfrequency residual. The reason this decomposition is favoured in image processing is the fact that the highfrequency residuals tend to be very sparse, so they are relatively easy to compress and encode. In this paper the authors propose using convolutional neural networks at each layer of the Laplacian pyramid representation to generate an image sequentially, increasing the resolution at each step. The convnet at each layer is conditioned on the lower resolution image, and some noise component $z_k$, and generates a random higher resolution image. The process continues recursively until the desired resilution is reached. For training they use the adversarial objective function. Below is the main figure that explains how the generative model works, I encourage everyone to have a look at the paper for more details: ![](http://www.inference.vc/content/images/2015/07/ScreenShot20150723at111517.png) #### An argument about Conditional Entropies What I think is weird about the model is the precise amount of noise that is injected at each layer/resolution. In the schematic above, these are the $z_k$ variables. Adding the noise is crucial to defining a probabilistic generative process; this is how it defines a probability distribution. I think it's useful to think about entropies of natural images at different resolutions. When doing generative modelling or unsuperised learning, we want to capture the distribution of data. One important aspect of a probability distribution is its entropy, which measures the variability of the random quantity. In this case, we want to describe the statistics of the full resolution observed natural image $I_0$. (I borrow the authors' notation where $I_0$ represents the highest resolution image, and $I_k$ represent the $k$times subsampled version. Using the Laplacian pyramid representation, we can decompose the entropy of an image in the following way: $$\mathbb{H}[I_0] = \mathbb{H}[I_{K}] + \sum_{k=0}^{K1} \mathbb{H}[I_k\vert I_{k+1}].$$ The reason why the above decomposition holds is very simple. Because $I_{k+1}$ is a deterministic function of $I_{k}$ (subsampling), the conditional entropy $\mathbb{H}[I_{k+1}\vert I_{k}] = 0$. Therefore the joint entropy of the two variables is simply the entropy of the higher resolution image $I_{k}$, that is $\mathbb{H}[I_{k},I_{k+1}] = \mathbb{H}[I_{k}] + \mathbb{H}[I_{k+1}\vert I_{k}] = \mathbb{H}[I_{k}]$. So by induction, the join entropy of all images $I_{k}$ is just the marginal entropy of the highest resolution image $I_0$. Applying the chain rule for joint entropies we get the expression above. Now, the interesting bit is how the conditional entropies $\mathbb{H}[I_k\vert I_{k+1}]$ are 'achieved' in the Laplacian pyramid generative model paper. These entropies are provided by the injected random noise variables $z_k$. By the information processing lemma $\mathbb{H}[I_k\vert I_{k+1}] \leq \mathbb{H}[z_k]$. The authors choose $z_k$ to be uniform random variables whose dimensionality grows with the resolution of $I_k$. To quote them "The noise input $z_k$ to $G_k$ is presented as a 4th color plane to lowpass $l_k$, hence its dimensionality varies with the pyramid level." Therefore $\mathbb{H}[z_k] \propto 4^{k}$, assuming that the pixel count quadruples at each layer. So the conditional entropy $\mathbb{H}[I_k\vert I_{k+1}]$ is allowed to grow exponentially with resolution, at the same rate it would grow if the images contained pure white noise. In their model, they allow the perpixel conditional entropy $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ to be constant across resolutions. To me, this seems undesirable. My intuition is, for natural images, $\mathbb{H}[I_k\vert I_{k+1}]$ may grow as $k$ decreases (because the dimensionality gorws), but the perpixel value $c\cdot 4^{k}\cdot \mathbb{H}[I_k\vert I_{k+1}]$ should decrease or converge to $0$ as the resolution increases. Very low lowresolution subsampled natural images behave a little bit like white noise, there is a lot of variability in them. But as you increase the resolution, the probability distribution of the highres image given the lowres image will become a lot sharper. In terms of model capacity, this is not a problem, inasmuch as the convolutional models $G_{k}$ can choose to ignore some variance in $z_k$ and learn a more deterministic superresolution process. However, adding unnecessarily high entropy will almost certainly make the fitting of such model harder. For example, the adversarial training process relies on sampling from $z_k$, and the procedure is pretty sensitive to sampling noise. If you make the distribution of $z_k$ unneccessarily high entropy, you will end up doing a lot of extra work during training until the network figures out to ignore the extra variance. To solve this problem, I propose to keep the entropy of the noise vectors constant, or make them grow sublinearly with the number of pixels in the image. This mperhaps akes the generative convnets harder to implement. Another quick solution would be to introduce dependence between components of $z_k$ via a lowrank covariance matrix, or some sort of a hashing trick. #### Adversarial training vs superresolution autoencoders Another weird thing is that the adversarial objective function forgets the identity of the image. For example, you would want your model so that `"if at the previous layer you have a lowresolution parrot, the next layer should be a higherresolution parrot"` Instead, what you get with the adversarial objective is `"if at the previous layer you have a lowresolution parrot, the next layer should output a higherdimensional image that looks like a plausible natural image"` So, there is nothing in the objective function that enforces dependency between subsequent layers of the pyramid. I think if you made $G_k$ very complex, it could just learn to model natural images by itself, so that $I_{k}$ is in essence independent of $I_{k+1}$ and is purely driven by the noise $z_{k}$. You could sidestep this problem by restricting the complexity of the generative nets, or, again, to restrict the entropy of the noise. Overall, I think the approach would benefit from a combination of the adversarial and a supervised (superresolution autoencoder) objective function. 
[link]
I think this paper has two main ideas in there, I see them as independent, for reasons explained below:  A new penalty function that aims at regularising the second derivative of the trajectory the latent representation traces over time. I see this as a generalisation of the slowness principle or temporal constancy, more about this in the next section.  A new autoencoderlike method to predict future frames in video. Video is really hard to forwardpredict with nonprobabilistic models because high level aspects of video are genuinely uncertain. For example, in a football game, you can't really predict whether the ball will hit the goalpost, but the results might look completely different visually. This, combined with L2 penalties often results in overly conservative, blurry predictions. The paper improves things by introducing extra hidden variables, that allow the model to represent uncertainty in its predictions. More on this later. #### Inductive bias: penalising curvature The key idea of this paper is to learn good distributed representations of natural images from video in an unsupervised way. Intuitively, there is a lot of information contained in video, which is lost if you scramble the video and look at statistics individual frames only. The race is on to develop the right kind of prior and inductive bias that helps us fully exploit this temporal information. This paper presents a way, which is called learning to linearise (I'm going to call this L2L). Naturally occurring images are thought to reside on some complex, nonlinear manifold whose intrinsic dimension is substantially lower than the number of pixels in an image. It is then natural to think about video as a journey on this manifold surface, along some smooth path. Therefore, if we aim to learn good generic features that correspond to coordinates on this underlying manifold, we should expect that these features vary in a smooth fashion over time as you play the video. L2L uses this intuition to motivate their choice of an objective function that penalises a scaleinvariant measure of curvature over time. In a way it tries to recover features that transform nearly linearly as time progresses and the video is played. In their notations, $x_{t}$ denotes the data in frame $t$, which is transformed by a deep network to obtain the latent representation $z_{t}$. The penalty for the latent representation is as follows. $$\sum_{t} \frac{(z_t  z_{t1})^{T}(z_{t+1}  z_{t})}{\z_t  z_{t1}\\z_{t+1}  z_{t}\}$$ The expression above has a geometric meaning as the cosine of the angle between the vectors $(z_t  z_{t1})$ and $(z_{t+1}  z_{t})$. The penalty is minimised if these two vectors are parallel and point in the same direction. In other words the penalty prefers when the latent feature representation keeps its momentum and continues along a linear path  and it does not like sharp turns or jumps. This seems like a sensible prior assumption to build on. L2L is very similar to another popular inductive bias used in slow feature analysis: the temporal slowness principle. According to this principle, the most relevant underlying features don't change very quickly. The slowness principle has a long history both in machine learning and as a model of human visual perception. In SFA one would minimise the following penalty on the latent representation: $$\sum_{t} (z_t  z_{t1})^{2},$$ where the square is applied componentwise. There are additional constraints in SFA, more about this later. We can understand the connection between SFA and this paper's penalty if we plot the penalty for a single hidden feature $z_{t,f}$ at time $t$, keeping all other features and values at neighbouring timesteps constant. This is plotted in the figure below (scaled and translated so the objectives line up nicely). ![](http://www.inference.vc/content/images/2015/09/RfX4Dp2Y3YAAAAASUVORK5CYII.png) As you can see, both objectives have a minimum at the same location: they both try to force $z_{t,f}$ to linearly interpolate between the neighbouring timesteps. However, while SFA has a quadratic penalty, the learning to linearise objective tapers off at long distances. Compare this to Tukey's loss function used in outlierresistant robust regression. Based on this, my prediction is that compared to SFA, this loss function is more tolerant of outliers, which in the temporal domain would mean abrupt jumps in the latent representation. So while SFA is equivalent to assuming that the latent features follow a Brownianmotionlike Ornstein–Uhlenbeck process, I'd imagine this prior corresponds to something like a jump diffusion process (although I don't think the analogy holds mathematically). Which one of these inductive biases/priors are better at exploiting temporal information in natural video? Slow Brownian motion, or nearlylinear trajectories with potentially a few jumps Unfortunately, don't expect any empirical answer to that from the paper. All experiments seem to be performed on artificially constructed examples, where the temporal information is synthetically engineered. Nor there is any real comparison to SFA. #### Representing predictive uncertainty with auxillary variables While the encoder network learns to construct smoothly varrying features $z_t$, the model also has a decoder network that tries to reconstruct $x_t$ and predict subsequent frames. This, the authors agree, is necessary in order for $z_t$ to contain enough relevant information about the frame $x_t$ (more about whether or not this is necessary later). The precise way this decoding is done has a novel idea as well: minimising over auxillary variables. Let's say our task is to predict a future frame $x_{t+k}$ based on the latent representation $z_{t}$. The problem is, this is a very hard problem. In video, just like in real life, anything can happen. Imagine you're modelling soccer footage, and the ball is about to hit the goalpost. In order to predict the next frames, not only do we have to know about natural image statistics, we also have to be able to predict whether the goal is in or not. An optimal predictive model would give a highly multimodal probability distribution as its answer. If you use the L2 loss with a deterministic feedforward predictive network, it's likely to come up with a very blurry image, which would correspont to the average of this nasty multimodal distribution. This calls for something better, either a smarter objective function, or a better way of representing predictive uncertainty. The solution the authors gave is to introduce hidden variables $\delta_{t}$, that the decoder network also receives as input in addition to $z_t$. For each frame, $\delta_t$ is optimised so that only the best possible reconstruction is taken into account in the loss function. Thus, the decoder network is allowed to use $\delta$ as a source of nondeterminism to hedge its bets as to what the contents of the next frame will be. This is one step closer to the ideal setting where the decoder network is allowed to give a full probability distribution of possibilities and then is evaluated using a strictly proper scoring rule. This inner loop minimisation (of $\delta$) looks very tedious, and introduces a few more parameters that may be hard to set. The algorithm is reminiscent of the Estep in expectationmaximisation, and also very similar to the iterated closest point algorithm Andrew Fitzgibbon talked about in his tutorial at BMVC this year. In his tutorial, Andrew gave examples where jointly optimising model parameters and auxiliary variables ($\delta$) is advantageous, and I think the same logic applies here. Instead of the inner loop, simultaneous optimisation helps fixing some pathologies, like slow convergence near the optimum. In addition, Andrew advocates exploiting the sparsity structure of the Hessian to implement efficient secondorder gradientbased optimisation methods. These tricks are explained in paragraphs around equation 8 in (Prasad et al, 2010). #### Predictive model: Is it necessary? On a more fundamental level, I question whether the predictive decoder network is really a necessary addition to make L2L work. The authors observe that the objective function is minimised by the "trivial" solutions $z_{t} = at + b$, where $a,b$ can be arbitrary constants. They then say that in order to make sure features do something more than just discover some of these trivial solutions, we also have to include a decoder network, that uses $z_t$ to predict future frames. I believe this is not necessary at all. Because $z_t$ is a deterministic function of $x_t$, and $t$ is not accessible to $z_{t}$ in any other way than through inferring it from $x_t$, as long as $a\neq 0$, the linear solutions are not trivial at all. If the network discovers $z_{t} = at, a\neq 0$, you should in fact be very happy (assuming a single feature). The only problems with trivial solutions occur when $z_{t} = b$ ($z$ doesn't depend on the data at all) or when $z$ is multidimensional and several redundant features are sensitive to exactly the same thing. These trivial solutions could be avoided the same way they are avoided in SFA, by constraining the overall spatial covariance of $z_{t}$ over the videoclip to be $I$. This would force each feature to vary at least a little bit with data hence avoiding the trivial constant solutions. It would also force features to be linearly decorrelated  solving the redundant features problem. So I wonder if the decoder network is indeed a necessary addition to the model. I would love to encourage the authors to implement their new hypothesis of a prior both with and without the decoder. They may already have tried it without and found it really didn’t work, so it might just be a matter of including those results. This would in turn allow us to see SFA and L2L sidebyside, and learn something about whether and why their prior is better than the sl 
[link]
### Evaluating Generative Models A key topic I'm very interested in is the choices of objective functions used in unsupervised learning and generative models. The key organising principle should be this: the objective function we use for training a probabilistic model should match the way we ultimately want to use the model. Yet, in unsupervised learning this is often overlooked and I think we lack clarity around what the models are used for and how they should be trained and evaluated. This paper tries to clarify this a bit in the context of generative models. I also want to mention that another ICLR submission this year also deals with this fundamental question: I highly recommend taking a look. Here, I'm going to consider a narrow definition of generative models: models we actually want to use to generate samples from which are then shown to a human user/observer. This includes usecases such as image captioning, texture generation, machine translation, speech synthesis and dialogue systems, but excludes things like unsupervised pretraining for supervised learning, semisupervised learning, data compression, denoising and many others. Very often people don't make this distinction clear when talking about generative models which is one of the reasons why there is still no clarity about what different objective functions do. I argue that when the goal is to train a model that can generate naturallooking samples, maximum likelihood is not a desirable training objective. Maximum likelihood is consistent so it can learn any distribution if it is given infinite data and a perfect model class. However, under model misspecification and finite data (that is, in pretty much every practically interesting scenario), it has a tendency to produce models that overgeneralise. #### KL divergence as a perceptual loss Generative modelling is about finding a probabilistic model $Q$ that in some sense approximates the natural distribution of data $P$. When researchers (or users of their product) evaluate generative models for perceptual quality, they draw samples from it, then  for lack of a better word  eyeball the samples. In visual information processing this is often referred to as noreference perceptual quality assessment \citep[see e.,g.\ ][]{wang2002noreference}. In the paper, I propose that the KL divergence $KL[Q\ P]$ can be used as an idealised objective function to describe this scenario. This related to maximum likelihood which minimises $KL[P\Q]$, but different in fundamental ways which I will explain later. Here is why I think $KL[Q\P]$ should be used: First, we can make the assumption that the perceived quality of each sample is related to the \emph{surprisal} $\log Q_{human}(x)$ under the human observers' subjective prior of stimuli $Q_{human}(x)$. For those of you not familiar with computational cognitive science, this will seem adhoc, but it's a relatively common assumption to make when modelling reaction times in experiments for example. We further assume that the human observer maintains a very accurate model of natural stimuli, thus, $Q_{human}(x) \approx P(x)$. This is a fancy way of saying things like the observer being a native speaker therefore understanding all the nuances in language. These two assumptions suggest that in order to optimise our chances in this Turing testlike scenario, we need to minimise the following crossentropy or perplexity term: \begin{equation}  \mathbb{E}_{x\sim Q} \log P(x) \end{equation} This perplexity is the exact opposite average negative log likelihood $ \mathbb{E}_{x\sim P} \log Q(x)$, with the role of $P$ and $Q$ changed. However, the perplexity alone would be maximised by a model $Q$ that deterministically picks the most likely stimulus. To enforce diversity one can simultaneously try to maximise the Shannon entropy of $Q$. This leaves us with the following KL divergence to optimise: \begin{equation} KL[Q\ P] =  \mathbb{E}{x\sim Q} \log P(x) + \mathbb{E}{x\sim Q} \log Q(x) \end{equation} So if we want to train models that produce nice samples, my recommendation is to try to use $KL[Q\P]$ as an objective function or something that behaves like it. How does maximum likelihood compare? #### Differences between maximum likelihood and $KL[Q\P]$ Maximum likelihood is roughly the same as minimising $KL[P\Q]$. The differences between minimising $KL[P\Q]$ and $K[Q\P]$ are well understood and it frequently comes up in the context of Bayesian approximate inference as well. Both divergences ensure consistency, minimising either converges to the true $P$ in the limit of infinite data and a perfect model class. However, they differ fundamentally in the way they deal with finite data and model misspecification (in almost every practical scenario): $KL[P\Q]$ tends to favour approximations $Q$ that overgeneralise $P$. If P is multimodal, the optimal $Q$ will tend to cover all the modes of $P$, even at the cost of introducing probability mass where $P$ has $0$ mass. Practically this means that the model will occasionally sample unplausible samples that don't look anything like samples from $P$. $KL[Q\P]$ tends to favour undergeneralisation. The optimal $Q$ will typically describe the single largest mode of $P$ well, at the cost of ignoring other modes if they are hard to model without covering lowprobability areas as well. Practically this means that $KL[Q\P]$ will try to avoid introducing unplausible samples, sometimes at the cost of missing the majority of plausible samples under $P$. In other words: $KL[P\Q]$ is liberal, $KL[Q\P]$ is conservative. In yet other words: $KL[P\Q]$ is an optimist, $KL[Q\P]$ is a pessimist. The problem of course is that $KL[Q\P]$ is super hard to optimise beased on a finite sample from $P$. Even harder than maximum likelihood. Not only that, the KL divergence is also not very well behaved, and is not welldefined unless $P$ is positive everywhere where $Q$ is positive. So there is little hope we can turn $KL[Q\P]$ into a practical training algorithm. #### Generalised Adversarial Training Generative Adversarial Networks(GANs) train a generative model jointly with an adversarial discriminative model that tries to differentiate between artificial and real data. The idea is, a generative model is good if it can fool the best discriminative model into thinking the generated samples are real. GANs have produced some of the nicest looking samples you'll find on the Internet and got people very excited about generative models again: human faces, album covers, etc. How do they come into this picture? It's because they can be understood as approximately minimising the JensenShannon divergence: \begin{equation} JSD[P\Q] = JSD[P\Q] = \frac{1}{2}KL\left[P\middle\\frac{P+Q}{2}\right] + \frac{1}{2}KL\left[Q\middle\\frac{P+Q}{2}\right]. \end{equation} Looking at the equation above you can immediately see how it's related to this topic. JS divergence is a bit like a symmetrised version of KL divergence. It's not $KL[P\Q]$, not $KL[Q\P]$, but a bit of both. So one can expect that minimising JS divergence would exhibit a behaviour that is kind of halfway between the two extremes explained above. And that means that they would generate better samples than methods trained via maximum likelihood and similar objectives. What's more, one can generalise JS divergence to a whole family of divergences, parametrised by a probability $0<\pi<1$ as follows: \begin{equation} JS_{\pi}[P\Q] = \pi \cdot KL[P\\pi P+(1\pi)Q] + (1\pi)KL[Q\\pi P+(1\pi)Q]. \end{equation} What I show in the paper is that by varrying $\pi$ between the two extremes, one can effectively interpolate between the behaviour of maximum likelihood ($\pi\rightarrow 0$) and minimising $KL[Q\P]$ ($\pi\rightarrow 1$). See the paper for details. This interpolation between behaviours is explained in this main figure below: ![](http://www.inference.vc/content/images/2015/11/ScreenShot20151116at161910.png) For any given value of $\pi$, we can optimise $JS_{\pi}$ approximately using an algorithm that is a slightly changed version of the original GAN algorithm. This is because the generalised JS divergence still has an elegant information theoretic interpretation. Consider a communications channel on which we can transmit a single data point of some kind. We toss a coin and with probability $\pi$, we send a sample from $P$, and with probability $1\pi$ we send a sample from $Q$ instead. The receiver doesn't know the outcome of the coinflip, she only observes the sample. The $JS_{\pi}$ is the mutual information between the observed sample and the coinflip. It is also an upper bound on how well any algorithm can do in guessing the coinflip from the observed sample. To implement an adversarial training algorithm for $JS_{\pi}$ one simply needs to change the ratio of samples the discriminative network sees from $Q$ vs $P$ (or apply appropriate weights during training). In the original method the discriminator network is faced with a balanced classification problem, i.e. $\pi=\frac{1}{2}$. It is hard to believe, but this irrelevantlooking modification changes the behaviour of the GAN algorithm dramatically, and can in theory allow the GAN algorithm to approximate both maximum likelihood or $KL[Q\P]$. This analysis explains why GANs have been so successful in generating very nice looking images, and relatively few weirdlooking ones. It is also worth pointing out that the GAN method is still in its infantcy and has many issues and limitations. The main issue is that it is based on sampling from $Q$ which doesn't work well in high dimensions. Hopefully some of these limitations can be overcome and then we should have a pretty powerful framework for training good generative models. 
[link]
#### Summary of this post: * an overview the motivation behind adversarial autoencoders and how they work * a discussion on whether the adversarial training is necessary in the first place. tl;dr: I think it's an overkill and I propose a simpler method along the lines of kernel moment matching. #### Adversarial Autoencoders Again, I recommend everyone interested to read the actual paper, but I'll attempt to give a high level overview the main ideas in the paper. I think the main figure from the paper does a pretty good job explaining how Adversarial Autoencoders are trained: ![](http://www.inference.vc/content/images/2016/01/ScreenShot20160108at144825.png) The top part of this image is a probabilistic autoencoder. Given the input $\mathbf{x}$, some latent code $\mathbf{z}$ is generated by sampling from an encoding distribution $q(\mathbf{z}\vert\mathbf{x})$. This distribution is typically modeled as the output a deep neural network. In normal autoencoders this encoder would be deterministic, now we allow it to be probabilistic. A decoder network is then trained to decode $\mathbf{z}$ and reconstruct the original input $\mathbf{x}$. Of course, reconstruction will not be perfect, but we train the networks to minimise reconstruction error, this is typically just mean squared error. The reconstruction cost ensures that the encoding process retains information about the input image, but it doesn't enforce anything else about what these latent representations $\mathbf{z}$ should do. In general, their distribution is described as the aggregate posterior $q(\mathbf{z})=\mathbb{E}_\mathbf{x} q(\mathbf{z}\vert\mathbf{x})$. Often, we would like this distribution to match a certain prior $p(\mathbf{z})$. For example. we may want $\mathbf{z}$ to have independent components and Gaussian distributed (nonlinear ICA,PCA). Or we may want to force the latent representations to correspond to discrete class labels, or binary factors. Or we may simply want to ensure there are 'no gaps' in the latent space, and any random $\mathbf{z}$ would lead to a viable sample when squashed through the decoder network. So there are multiple reasons why one might want to control the aggregate posterior $q(\mathbf{z})$ to match a predefined prior $p(\mathbf{z})$. The authors achieve this by introducing an additional term in the autoencoder loss function, one that measures the divergence between $q$ and $p$. The authors chose to do this via adversarial training: they train a discriminator network that constantly learns to discriminate between real code vectors $\mathbb{z}$ produced by encoding real data, and random code vectors sampled from $p$. If $q$ matches $p$ perfectly, the optimal discriminator network should have a large classification error. #### Is this an overkill? My main question about this paper was whether the adversarial cost is really needed here, because I think it's an overkill. Let me explain: Adversarial training is powerful when all else fails to quantify divergence between complicated, potentially degenerate distributions in high dimensions, such as images or video. Our toolkit for dealing with images is limited, CNNs are the best tool we have, so it makes sense to incorporate them in training generative models for images. GANs  when applied directly to images  are a great idea. However, here adversarial training is applied to an easier problem: to quantify the divergence between a simple, fixed prior (e.g. Gaussian) and an empirical distribution of latents. The latent space is usually lowerdimensional, distributions better behaved. Therefore, matching to $p(\mathbf{z})$ in latent space should be considerably easier than matching distributions over images. Adversarial training makes no assumptions about the distributions compared, other than sampling from them. This comes very handy when both $p$ and $q$ are nasty such as in the generative adversarial network scenario: there, $p$ is the distribution of natural images, $q$ is a super complicated, degenerate distribution produced by squashing noise through a deep convnet. The price we pay for this flexibility is this: when $p$ or $q$ are actually easy to work with, adversarial training cannot exploit that, it still has to sample. (it would be interesting to see if expectations over $p(\mathbf{z})$ could be computed analytically). So even though in this work $p$ is as simple as a mixture of ten 2D Gaussians, we need to approximate everything by drawing samples. #### Other things might work: kernel moment matching Why can’t one use easier divergences? For example, I think moment matching based on kernel MMD would work brilliantly in this scenario. It would have the following advantages over the adversarial cost.  closed form expressions: Depending on the choice of the prior $p(\mathbf{z})$ and kernel used in MMD, the expectations over $p$ may be available in closed form, without sampling. So for example if we use a squared exponential kernel and a mixture of Gaussians as $p$, the divergence from $p$ can be precomputed in closed form that is easy to evaluate.  no nasty inner loop: Adversarial training requires the discriminator network to be reoptimised every time the generative model changes. So we end up with a gradient descent in the inner loop of a gradient descent, which is anything but nice to work with. This is why it takes so long to get it working, the whole thing is pretty unstable. In contrast, to evaluate MMD, the inner loop is not needed. In fact, MMD can also be thought of as the solution to a convex maximisation problem, but via the kernel trick the maximum has a closed form solution.  the problem is well suited for MMD: because the distributions are smooth, and the space is nice and lowdimensional, MMD might work very well. Kernelbased methods struggle with complicated manifoldlike structure of natural images, so I wouldn't expect MMD to be competitive with adversarial training if it is applied directly in the image space. Therefore, I actually prefer generative adversarial networks to generative moment matching networks. However, here we have an easier problem, simpler space, simpler distributions where MMD shines, and adversarial training is just not needed. 
[link]
 I give an overview of the paper which proposes an exponential schedule of dilated convolutional layers as a way to combine local and global knowledge  I point out the connection between 2D dilated convolutions and Kronecker products  cascades of exponentially dilated convolutions  as proposed in the paper  can be thought of as parametrising a large convolution kernel as a Kronecker product of small kernels  the relationship to Kronecker factorisation only holds under particular assumptions, in this sense cascades of exponenetially diluted convolutions are a generalisation of the Kronecker layer (Zhou et al. 2015)  I note that dilated convolutions are equivariant under image translation, a property that other multiscale architectures often violate. #### Background The key application the dilated convolution authors have in mind is dense prediction: vision applications where the predicted object that has similar size and structure to the input image. For example, semantic segmentation with one label per pixel; image superresolution, denoising, demosaicing, bottomup saliency, keypoint detection, etc. In many such applications one wants to integrate information from different spatial scales and balance two properties: 1. local, pixellevel accuracy, such as precise detection of edges, and 2. integrating knowledge of the wider, global context To address this problem, people often use some kind of multiscale convolutional neural networks, which often relies on spatial pooling. Instead the authors here propose using layers dilated convolutions, which allow us to address the multiscale problem efficiently without increasing the number of parameters too much. #### Dilated Convolutions It's perhaps useful to first note why vanilla convolutions struggle to integrate global context. Consider a purely convolutional network composed of layers of $k\times k$ convolutions, without pooling. It is easy to see that size of the receptive field of each unit  the block of pixels which can influence its activation  is $l*(k1)+k$, where $l$ is the layer index. So the effective receptive field of units can only grow linearly with layers. This is very limiting, especially for highresolution input images. Dilated convolutions to the rescue! The dilated convolution between signal $f$ and kernel $k$ and dilution factor $l$ is defined as: $$ \left(k \ast_{l} f\right)_t = \sum_{\tau=\infty}^{\infty} k_\tau \cdot f_{t  l\tau} $$ Note that I'm using slightly different notation than the authors. The above formula differs from vanilla convolution in last subscript $f_{t  l\tau}$. For plain old convolution this would be $f_{t  \tau}$. In the dilated convolution, the kernel only touches the signal at every $l^{th}$ entry. This formula applies to a 1D signal, but it can be straightforwardly extended to 2D convolutions. The authors then build a network out of multiple layers of diluted convolutions, where the dilation factor $l$ increases exponentially at each layer. When you do that, even though the number of parameters grows only linearly with layers, the effective receptive field of units grows exponentially with layer depth. This is illustrated in the figure below: ![](http://www.inference.vc/content/images/2016/05/ScreenShot20160512at094712.png) What this figure doesn't really show is the parameter sharing and parameter dependencies across the receptive field (frankly, it's pretty hard to visualise exactly with more than 2 layers). The receptive field grows at a faster rate than the number of parameters, and it is obvious that this can only be achieved by introducing additional constraints on the parameters across the receptive field. The network won't be able to learn arbitrary receptive field behaviours, so one question is, how severe is that restriction? #### Relationship to Kronecker Products To me this whole dilated convolution paper cries Kronecker product, although this connection is never made in the paper itself. It's easy to see that a 2D dilated convolution with matrix/filter $K$ is the same as vanilla convolution with a diluted filter $\hat{K}_{l}$ which can be represented as the following Kronecker product: $$ \hat{K}_l = K \otimes \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & \ddots & & 0 \\ 0 & \ddots & \ddots & \ddots & \\ 0 & & \ddots & \ddots & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} $$ Using this, and properties of convolutions and Kronecker products (I suggest beginners to make extensive use of the matrix cookbook) we can even understand something about exponentially iterated dilated convolutions. Let's assume we apply several layers of dilated convolutions, without nonlinearity, as in Equation 3 of the paper. For simplicity, I assume that that all convolution kernels $K_l, L=1\ldots L$ are $a\times a$ in size, the dilation factor at layer $l$ is $a^{l}$, and we only have a single channel throughout ($C=1$). In this case we can show that: $$ F_{L+1} = K_L \ast_{a^L} \left( K_{L1} \ast_{a^{(L1)}} \left( \cdots K_1 \ast_{a} \left( K_0 \ast F_0 \right) \cdots \right) \right) = \left( K_L \otimes K_{L1} \otimes \cdots \otimes K_{0} \right) \ast F_0 $$ The lefthand side of this equation is the same construction as in Equation 3 in the paper, but expanded. The right hand side is a single vanilla convolution, but with a convolution kernel that is constructed as the Kronecker product of all the $a\times a$ kernels $K_l$. It turns out Kroneckerfactored parametrisations of convolution tensors are already used in CNNs, a quick googling revealed this paper: Shuchang Zhou, JiaNan Wu, Yuxin Wu, Xinyu Zhou (2015) Exploiting Local Structures with the Kronecker Layer in Convolutional Networks What can Kroneckerfactored filters represent? Let's look at what kind of kernels can we represent with Kronecker products, and hence what behaviour should we expect from dilated convolutions. Here are a few examples of $27\times 27$ kernels that result from taking the Kronecker product of three random $3\times 3$ kernels: ![](http://www.inference.vc/content/images/2016/05/VzORx0FEfAAAAAElFTkSuQmCC.png) These look somehow natural, at least to me. They look like pretty plausible texture patches taken from some pixellated video game. You will notice the repeated patterns and the hierarchical structure. Indeed, we can draw cool selfsimilar fractallike filters if we keep taking the Kronecker product of the same kernel with itself, some examples of such random fractals: ![](http://www.inference.vc/content/images/2016/05/YSJIkSZIkLYw3bCRJkiRJkhbGGzaSJEmSJEkL8zeSmRmMrhHPQgAAAABJRU5ErkJggg.png) I would say these kernels are not entirely unreasonable for a ConvNet, and if you allow for multiple channels ($C>1$) they can represent pretty nice structured patterns and shapes with reasonable number of parameters. Compare these filters to another common technique for reducing parameters of convolution tensors: lowrank decompositions (see e.g. Lebedev et al, 2014). Spatially, a lowrank approximation to a square 2D convolution filter can be understood as subsequently applying two smaller rectangular filters: one with a limited horizontal extent and one with limited vertical extent. Here are a few random samples of $27\times 27$ filters with a rank of 1. These can be represented using the same number of parameters (27) as the Kronecker samples above. To me, these don't look so natural. Notice also that for lowrank representations the number of parameters has to scale linearly with the spatial extent of the filter, whereas this scaling can be logarithmic if we use a Kronecker parametrisation. This is the real deal when using Kronecker products or dilated convolutions. Here is another cool illustration of the naturalness of the Kronecker approximation, taken out of the Kronecker layer paper: ![](http://www.inference.vc/content/images/2016/05/ScreenShot20160512at145833.png) So in general, parametrising convolution kernels as Kroneckerproducts seems like a pretty good idea. The dilated convolutions paper presents a more flexible approach than just Kroneckerfactors. Firstly, you can add nonlinearities after each layer of dilated convolution, which would now be different from Kronecker products. Secondly, the Kronecker analogy only holds if the dilation factor and the kernel size are the same. In the paper the authors used a kernel size of $3$ and dilation factor of $2$. #### Final note on translational equivariance One desirable property of convolutions is that they are translationally equivariant: if you shift the input image by any amount, the output remains the same, shifted by the same amount. This is a very useful inductive bias/prior assumtion to use in a dense prediction task. One way to introduce multiscale thinking to ConvNets is to use architectures that look like the figure below: we first decrease the spatial extent of featuremaps via pooling, then grow them back again via unpooling/deconvolution. Additional shortcut connections ensure that pixellevel local accuracy can be retained. The example below is from the SegNet paper, but there are multiple other papers such as this one on recombinator networks. ![](http://www.inference.vc/content/images/2016/05/convdeconv.png) However, as soon as you include spatial pooling, the translational equivariance property of the whole network might break. For example the SegNet above is not translationally equivariant anymore: the network's predictions are sensitive to small, singlepixel shifts to the input image, which is undesirable. Thankfully, layers of dilated convolutions are still translationally equivariant, which is a good thing. #### Summary This dilated convolutions idea is pretty cool, and I think these papers are just scratching the surface of this topic. The dilated convolution architecture generalises Kroneckerfactored convolutional filters, it allows for very large receptive fields while only growing the number o 
[link]
This paper addresses the task of imagebased Q&A on 2 axes: comparison of different models on 2 datasets and creation of a new dataset based on existing captions. The paper is addressing an important and interesting new topic which has seen recent surge of interest (Malinowski2014, Malinowski2015, Antol2015, Gao2015, etc.). The paper is technically sound, wellwritten, and wellorganized. They achieve good results on both datasets and the baselines are useful to understand important ablations. The new dataset is also much larger than previous work, allowing training of stronger models, esp. deep NN ones. However, there are several weaknesses: their main model is not very different from existing work on imageQ&A (Malinowski2015, who also had a VIS+LSTM style model (but they were also jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers) and achieves similar performance (except that adding bidirectionality and 2way image input helps). Also, as the authors themselves discuss, the dataset in its current form, synthetically created from captions, is a good start but is quite conservative and limited, being singleword answers, and the transformation rules only designed for certain simple syntactic cases. It is exploration work and will benefit a lot from a bit more progress in terms of new models and a slightly more broad dataset (at least with answers up to 23 words). Regarding new models, e.g., attentionbased models are very relevant and intuitive here (and the paper would be much more complete with this), since these models should learn to focus on the right area of the image to answer the given question and it would be very interesting to analyze the results of whether this focusing happens correctly. Before attention models, since 2way image input helped (actually, it would be good to ablate 2way versus bidirectionality in the 2VIS+BLSTM model), it would be good to also show the model version that feeds the image vector at every time step of the question. Also, it would be useful to have a nearest neighbor baseline as in Devlin et al., 2015, given their discussion of COCO's properties. Here too, one could imagine copying answers of training questions, for cases where the captions are very similar. Regarding a broaderscope dataset, the issue with the current approach is that it is too similar to the captioning approach or task, which has the drawback that a major motivation to move to imageQ&A is to move away from single, vague (nonspecific), generic, oneeventfocused captions to a more complex and detailed understanding of and reasoning over the image; which doesn't happen with this paper's current dataset creation approach, and so this will also not encourage thinking of very different models to handle imageQ&A, since the best captioning models will continue to work well here. Also, having 23 word answers will capture more realistic and more diverse scenarios; and though it is true that evaluation is harder, one can start with existing metrics like BLEU, METEOR, CIDEr, and human eval. And since these will not be full sentences but just 23 word phrases, such existing metrics will be much more robust and stable already. Originality: The task of imageQ&A is very recent with only a couple of prior and concurrent work, and the dataset creation procedure, despite its limitations (discussed above) is novel. The models are mostly not novel, being very similar to Malinowski2015, but the authors add bidirectionality and 2way image input (but then Malinowski2015 was jointly training the CNN and RNN, and also decoding with RNNs to produce longer answers). Significance: As discussed above, the paper show useful results and ablations on the important, recent task of imageQ&A, based on 2 datasets  an existing small dataset and a new large dataset; however, the second, new dataset is synthetically created by ruletransforming captions and only to singleword answers, thus keeping the impact of the dataset limited, because it keeps the task too similar to the generic captioning task and because there is no generation of answers or prediction of multiword answers. 
[link]
The paper proposes a novel way to train a sparse autoencoder where the hidden unit sparsity is governed by a winnertakeall kind of selection scheme. This is a convincing way to achieve a sparse autoencoder, while the paper could have included some more details about their training strategy and the complexity of the algorithm. The authors present a fully connected autoencoder with a new sparsity constraint called the lifetime sparsity. For each hidden unit across the minibatch, they rank the activation values, keeping only the topk% for reconstruction. The approach is appealing because they don't need to find a hard threshold and it makes sure every hidden unit/filter is updated (no dead filters because their activation was below the threshold). Their encoder is a deep stack of ReLu and the decoder is shallow and linear (note that usually nonsymmetric autoencoders lead to worse results). They also show how to apply to RBM. The effect of sparsity is very effective and noticeable on the images depicting the filters. They extend this autoencoder in a convolutional/deconvolutional framework, making it possible to train on larger images than MNIST or TFD. They add a spatial sparsity, keeping the top activation per feature map for the reconstruction and combine it with the lifetime sparsity presented before. The proposed approach exploits on a mechanism close to the one of ksparse autoencoders proposed by Makkhzani et al [14]. The authors extend the idea from [14] to build winnertakeall encoders (and RBMs), that enforce both spatial and lifetime regularization by keeping only a percentage (the biggest) of activations. The lifetime sparsity allows overcoming problems that could arise with ksparse autoencoders. The authors next propose to embed their modeling framework in convolutional neural nets to deal with larger images than e.g. those of mnist. 
[link]
This paper presents an endtoend version of memory networks (Weston et al., 2015) such that the model doesn't train on the intermediate 'supporting facts' strong supervision of which input sentences are the best memory accesses, making it much more realistic. They also have multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and achieves strong results. The paper is a useful extension of memNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined pretty cleanly and simply. The related work section is quite wellwritten, detailing the various similarities and differences with multiple streams of related work. The discussion about the model's connection to RNNs is also useful. 
[link]
This paper presents a novel layer that can be used in convolutional neural networks. A spatial transformer layer computes resampling points of the signal based on another neural network. The suggested transformations include scaling, cropping, rotations and nonrigid deformation whose paramerters are trained endtoend with the rest of the model. The resulting resampling grid is then used to create a new representation of the underlying signal through bilinear or nearest neighbor interpolation. This has interesting implications: the network can learn to colocate objects in a set of images that all contain the same object, the transformation parameter localize the attention area explicitly, fine data resolution is restricted to areas important for the task. Furthermore, the model improves over previous stateoftheart on a number of tasks. The layer has one mini neural network that regresses on the parameters of a parametric transformation, e.g. affine), then there is a module that applies the transformation to a regular grid and a third more or less "reads off" the values in the transformed positions and maps them to a regular grid, hence underforming the image or previous layer. Gradients for backpropagation in a few cases are derived. The results are mostly of the classic deep learning variety, including mnist and svhn, but there is also the finegrained birds dataset. The networks with spatial transformers seem to lead to improved results in all cases. 
[link]
This paper deals with the formal question of machine reading. It proposes a novel methodology for automatic dataset building for machine reading model evaluation. To do so, the authors leverage on news resources that are equipped with a summary to generate a large number of questions about articles by replacing the named entities of it. Furthermore a attention enhanced LSTM inspired reading model is proposed and evaluated. The paper is wellwritten and clear, the originality seems to lie on two aspects. First, an original methodology of question answering dataset creation, where contextqueryanswer triples are automatically extracted from news feeds. Such proposition can be considered as important because it opens the way for large model learning and evaluation. The second contribution is the addition of an attention mechanism to an LSTM reading model. the empirical results seem to show relevant improvement with respect to an uptodate list of machine reading models. Given the lack of an appropriate dataset, the author provides a new dataset which scraped CNN and Daily Mail, using both the full text and abstract summaries/bullet points. The dataset was then anonymised (i.e. entity names removed). Next the author presents a two novel Deep longshort term memory models which perform well on the Cloze query task. 
[link]
TLDR; The authors propose an Attention with Intention (AWI) model for Conversation Modeling. AWI consists of three recurrent networks: An encoder that embeds the source sentence from the user, an intention network that models the intention of the conversation over time, and a decoder that generates responses. The authors show that the network can general natural responses. #### Key Points  Intuition: Intention changes over the course of a conversation, e.g. communicate problem > resolve issue > acknowledge.  Encoder RNN: Depends on last state of the decoder. Reads the input sequence and converts it into a fixedlength vector.  Intention RNN: Gets encoder representation, previous intention state, and previous decoder state as input and generates new representation of the intention.  Decoder RNN: Gets current intention state and attention vector over the encoder as an input. Generates a new output.  Architecture is evaluated on an internal helpdesk chat dataset with 10k dialogs, 100k turns and 2M tokens. Perplexity scores and a sample conversation are reported. #### Notes/Questions  It's a pretty short paper and not sure what to make of the results. The PPL scores were not compared to alternative implementations and no other evaluations (e.g. crowdsourced as in Neural Conversational Model) are done. 
[link]
TLDR; The authors apply an RNN to modeling the students knowledge. The input is an exercise question and answer (correct/incorrect), either as onehot vectors or embedded. The network then predicts whether or not the student can answer a future question correctly. The authors show that the RNN approach results in significant improvement over previous models, can be used for curriculum optimization, and also discovers the latent structure in exercise concepts. #### Key Points  Two encodings tried: One hot, embedded  RNN/LSTM, 200dimensional hidden layer, output dropout, NLL.  No expert annotation for concepts or question/answers are needed  Blocking (series of exercises of same type) vs Mixing for curriculum optimization: Blocking seems to perform better  Lots of cool future direction ideas #### Question / Notes  Can we not only predict whether an exercise is answered correctly, but also what the most likely student answer would be? My give insight into confusing concepts. 
[link]
TLDR; The authors show that we can distill the knowledge of a complex ensemble of models into a smaller model by letting the smaller model learn directly from the "soft targets" (softmax output with high temperature) of the ensemble. Intuitively, this works because the errors in probability assignment (e.g. assigning 0.1% to the wrong class) carry a lot of information about what the network learns. Learning directly from logits (unnormalized scores) as was done in a previous paper, is a special case of the distillation approach. The authors show how distillation works on the MNIST and an ASR data set. #### Key Points  Can use unlabeled data to transfer knowledge, but using the same training data seems to work well in practice.  Use softmax with temperature, values from 110 seem to work well, depending on the problem.  The MNIST networks learn to recognize digits without ever having seen base, solely based on the "errors" that the teacher network makes. (Bias needs to be adjusted)  Training on soft targets with less data performs much better than training on hard targets with same amount of data. #### Notes/Question  Breaking up the complex models into specialists didn't really fit into this paper without distilling those experts into one model. Also would've liked to see training of only specialists (without general network) and then distill their knowledge. 
[link]
TLDR; The authors use a Maximum Mutual Information (MMI) objective function to generate conversational responses. They still train their models with maximum likelihood, but use MMI to generate responses during decoding. The idea behind MMI is that it promotes more diversity and penalizes trivial responses. The authors evaluate their method using BLEU scores, human evaluators, and qualitative analysis and find that the proposed metric indeed leads to more diverse responses. #### Key Points  In practice, NCM (Neural Conversation Models) often generate trivial responses using highfrequency terms partly due to the likelihood objective function.  Two models: MMIantiLM and MMIbidi depending on the formulation of the MMI objective. These objectives are used during response generation, not during training.  Use Deep 4layer LSTM with 1000dimensional hidden state, 1000dimensional word embeddings.  Datasets: Twitter triples with 129M contextmessageresponse triples. OpenSubtitles with 70M spoken lines that are noisy and don't include turn information.  Authors state that perplexity is not a good metric because their objective is to explicitly steer away from the high probability responses. #### Notes  BLEU score seems like a bad metric for this. Shouldn't more diverse responses result in a lower BLEU score?  Not sure if I like the direction of this. To me it seems wrong to "artificially" promote diversity. Shouldn't diversity come naturally as a function of context and intention? 
[link]
TLDR; The authors evaluate Paragraph Vectors on large Wikipedia and arXiv document retrieval tasks and compare the results to LDA, BoW and word vector averaging models. Paragraph Vectors either outperform or match the performance of other models. The authors show how the embedding dimensionality affects the results. Furthermore, the authors find that one can perform arithemetic operations on paragraph vectors and obtain meaningful results and present qualitative analyses in the form of visualizations and document examples. #### Data Sets Accuracy is evaluated by constructing triples, where a pair of items are close to each other and the third one is unrelated (or less related). Cosine similarity is used to evaluate semantic closeness. Wikipedia (handbuilt) PV: 93% Wikipedia (handbuilt) LDA: 82% Wikipedia (distantly supervised) PV: 78.8% Wikipedia (distantly supervised) LDA: 67.7% arXiv PV: 85% arXiv LDA: 85% #### Key Points  Jointly training PV and word vectors seems to improve performance.  Used Hierarchical Softmax as Huffman tree for large vocabulary  The use only the PVBoW model, because it's more efficient. #### Questions/Notes  Why the performance discrepancy between the arXiv and Wikipedia tasks? BoW performs surprisingly well on Wikipedia, but not arXiv. LDA is the opposite. 
[link]
TLDR; The authors propose new ways to incorporate context (previous sentences) into a Recurrent Language Model (RLM). They propose 3 ways to model the context, and 2 ways to incorporate the context into the predictions for the current sentence. Context can be modeled with BoW, Sequence BoW (BoW for each sentence), and Sequence BoW with attention. Context can be incorporated using "early fusion", which gives the context as an input to the RNN, or "late fusion", which modifies the LSTM to directly incorporate the context. The authors evaluate their architecture on IMDB, BBC and Penn TreeBank corpora, and show that most approaches perform well (reducing perplexity), with Sequence BoW with attention + late fusion outperforming all others. #### Key Points:  Context as BoW: Compress N previous sentences into a single BoW vector  Context as Sequential Bow: Compress each of the N previous sentences into a BoW vector and use an LSTM to "embed" them. Alternatively, use an attention mechanism.  Early Fusion: Give the context vector as an input to the LSTM, together with the current word.  Late Fusion: Add another gate to the LSTM that incorporates the context vector. Helps to combat vanishing gradients.  Interestingly the Sequence BoW without attention performs very poorly. The reason here seems to be the same as for seq2seq, it's hard to compress the sentence vectors into a single fixedlength representation using an LSTM.  LSTM models trained with 1000 units, Adadelta. Only sentences up to 50 words are considered.  Noun phrases seem to benefit the most from the context, which makes intuitive sense. #### Notes/Questions:  A problem with current Language Models is that they are corpusspecific. A model trained on one corpus doesn't do well on another corpus because all sentences are treated as being independent. However, if we can correctly incorporate context we may be able to train a generalpurpose LM that does well across various corpora. So I think this is important work.  I am surprised that the authors did not try using a sentence embedding (skipthought, paragraphvector) to construct their context vectors. That seems like an obvious choice over using BoW.  The argument for why the Sequence BoW without attention model performs poorly isn't convincing. In the seq2seq work the argument for attention was based on the length of the sequence. However, here the sequence is very short, so the LSTM should be able to capture all the dependencies. The performance may be poor due to the BoW representation, or due too little training data.  Would've been nice to visualize what the attention mechanism is modeling.  I'm not sure if I agree with the authors that relying explicit sentence boundaries is an advantage, I see it as a limiting factor. 
[link]
TLDR; The authors present DVngram, a new method to learn document embeddings. DVngrams is a variation on Paragraph Vectors with a training objective of predicting words and ngrams solely based on the document vector, forcing the embedding to capture the semantics of the text. The authors evaluate their model on the IMDB data sets, beating both ngram based and Deep Learning models. #### Key Points  When the word vectors are already sufficiently predictive of the next words, the standard PV embedding cannot learn anything useful.  Training objective: Predict words and ngrams solely based on document vector. Negative Sampling to deal with large vocabulary. In practice, each ngram is treated as a special token and appended to the document.  Code will be at https://github.com/libofang/DVngram #### Question/Notes  The argument that PV may not work when the word vectors themselves are predictive enough makes intuitive sense. But what about applying wordlevel dropout? Wouldn't that also force the PV to learn the document semantics?  It seems to be that predicting ngrams leads to a huge sparse vocabulary space. I wonder how this method scales, even with negative sampling. I am actually surprised this works well at all.  The authors mention that they beat "other Deep Learning models, including PV, but neither their model nor PV are "deep learning". The networks are not deep ;) 
[link]
TLDR; The authors train a deep seq2seq LSTM directly on bytelevel input of several langauges (shuffling the examples of all languages) and apply it to NER and POS tasks, achieving stateoftheart or close to that. The model outputs spans of the form `[START_POSITION, LENGTH, LABEL]`, where each span element is a separate token prediction. A single model works well for all languages and learns shared highlevel representations. The authors also present a novel way to dropout input tokens (bytes in their case), by randomly replacing them with a `DROP` symbol. #### Data and model performance Data:  POS Tagging: 13 languages, 2.87M tokens, 25.3M training segments  NER: 4 languags, 0.88M tokens, 6M training segments Results:  POS CRF Accuracy (average across languages): 95.41  POS BTS Accuracy (average across languages): 95.85  NER BTS en/de/es/nl F1: 86.50/76.22/82.95/82.84  (See paper for NER comparsion models) #### Key Takeaways  Surprising to me that the span generations works so well without imposing independence assumptions on it. It's state the LSTM has to keep in memory.  0.20.3 Dropout, 320dimensional embeddings, 320 units LSTM, 4 layers seems to perform well. The resulting model is surprisingly compact (~1M parameters) due to the small vocabulary size of 256 bytes. Changing input sequence order didn't have much of an effect. Dropout and Byte Dropout significantly (74 > 78 > 82) improved F1 for NER.  To limit sequence length the authors split the text into k=60 sized segment, with 50% overlap to avoid splitting midspan.  Byte Dropout can be seen as "blurring text". I believe I've seen the same technique applied to words before and labeled word dropout.  Training examples for all languages are shuffled together. The biggest improvements in scores are seen observed for lowresource languages.  Not clear how to tune recall of the model since nonspans are simply not annotated. #### Notes / Questions  I wonder if the fixedvector embedding of the input sequence is a bottleneck since the decoder LSTM has to carry information not only about the input sequence, but also about the structure that has been produced so far. I wonder if the authors have experimented with varying `k`, or using attention mechanisms to deal with long sequences (I've seen papers dealing with sequences of 2000 tokens?). 60 seems quite short to me. Of course, output vocabulary size is also a concern with longer sequences.  What about LSTM initialization? When feeding spans coming from the same document, is the state kept around or reinitialized? I strongly suspect it's kept since 60 bytes probably don't contain enough information for proper labeling, but didn't see an explicit reference.  Why not a bidirectional LSTM? Seems to be the standard in most other papers.  How exactly are multiple languages encoded in the LSTM memories? I *kind of* understand the reasoning behind this, but it's unclear what these "highlevel" representations are. Experiments that demonstrate what the LSTM cells represent would be valuable.  Is there a way to easily retrain the model for a new language? 
[link]
TLDR; The authors show that we can improve the performance of a reference task (like translation) by simultaneously training other tasks, like image caption generation or parsing, and vice versa. The authors evaluate 3 MLT (MultiTask Learning) scenarios: Onetomany, manytoone and manytomany. The authors also find that using skipthought unsupervised training works well for improving translation performance, but sequence autoencoders don't. #### Key Points  4Layer seq2seq LSTM, 1000dimensional cells each layer and embedding, batch size 128, dropout 0.2, SGD wit LR 0.7 and decay.  The authors define a mixing ratio for parameter updates that is defined with respect to a reference tasks. Picking the right mixing ratio is a hyperparameter.  OneToMany experiments: Translation (EN > GER) + Parsing (EN). Improves result for both tasks. Surprising that even a very small amount of parsing updates significantly improves MT result.  ManytoOne experiments: Captioning + Translation (GER > EN). Improves result for both tasks (wrt. to reference task)  ManytoMany experiments: Translation (EN <> GER) + Autoencoders or SkipThought. SkipThought vectors improve the result, but autoencoders make it worse.  No attention mechanism #### Questions / Notes  I think this is very promising work. it may allow us to build generalpurpose systems for many tasks, even those that are not strictly seq2seq. We can easily substitute classification.  How do the authors pick the mixing ratios for the parameter updates, and how sensitive are the results to these ratios? It's a new hyperparameter and I would've liked to see graphs for these. Makes me wonder if they picked "just the right" ratio to make their results look good, or if these architectures are robust.  The authors found that seq2seq autoencoders don't improve translation, but skipthought does. In fact, autoencoders made translation performance significantly worse. That's very surprising to me. Is there any intuition behind that? 
[link]
TLDR; The authors apply a neural seq2seq model to sentence summarization. The model uses an attention mechanism (soft alignment). #### Key Points  Summaries generated on the sentence level, not paragraph level  Summaries have fixed length output  Beam search decoder  Extractive tuning for scoring function to encourage the model to take words from the input sequence  Training data: Headline + first sentence pair. 
[link]
TLDR; The authors train a seq2seq model on conversations, building a chat bot. The first data set is an IT Helpdesk dataset with 33M tokens. The trained model can help solve simple IT problems. The second data set is the OpenSubtitles data with ~1.3B tokens (62M sentences). The resulting model learns simple world knowledge, can generalize to new questions, but lacks a coherent personality. #### Key Points  IT Helpdesk: 1layer LSTM, 1024dimensional cells, 20k vocabulary. Perplexity of 8.  OpenSubtitles: 2layer LSTM, 4096dimensional cells, 100k vocabulary, 2048 affine layer. Attention did not help.  OpenSubtitles: Treat two consecutive sentences as coming from different speakers. Noisy dataset.  Model lacks personality, gives different answers to similar questions (What do you do? What's your job?)  Feed previous context (whole conversation) into encoder, for IT data only.  In both data sets, the neural models achieve better perplexity than ngram models. #### Notes / Questions  Authors mention that Attention didn't help in OpenSubtitles. It seems like the encoder/decoder context is very short (just two sentences, not a whole conversation). So perhaps attention doesn't help much here, as it's meant for longrange dependencies (or dealing with little data?)  Can we somehow encode conversation context in a separate vector, similar to paragraph vectors?  It seems like we need a principled way to deal with long sequences and context. It doesn't really make sense to treat each sentence tuple in OpenSubtitles as a separate conversation. Distant Supervision based on subtitles timestamps could also be interesting, or combine with multimodal learning.  How we can learn a "personality vector"? Do we need world knowledge or is it learnable from examples? 
[link]
TLDR; The author train a three variants of a seq2seq model to generate a response to social media posts taken from Weibo. The first variant, NRMglo is the standard model without attention mechanism using the last state as the decoder input. The second variant, NRMloc, uses an attention mechanism. The third variant, NRMhyb combines both by concatenating local and global state vectors. The authors use human users to evaluate their responses and compare them to retrievelbased and SMTbased systems. The authors find that SRM models generate reasonable responses ~75% of the time. #### Key Points  STC: Shorttext conversation. Generate only a response to a post. Don't need to keep track of a whole conversation.  Training data: 200k posts, 4M responses.  Authors use GRU with 1000 hidden units.  Vocabulary: Most frequent 40k words for both input and response.  Retrieval is done using beam search with beam size 10.  Hybrid model is difficult to train jointly. The authors train the model individually and then finetune the hybrid model.  Tradeoff with retrieval based methods: Responses are written by a human and don't have grammatical errors, but cannot easily generalize to unseen inputs. 
[link]
TLDR; The authors propose three neural models to generate a response (r) based on a context and message pair (c,m). The context is defined as a single message. The first model, RLMT, is a basic Recurrent Language Model that is fed the whole (c,m,r) triple. The second model, DCGM1, encodes context and message into a BoW representation, put it through a feedforward neural network encoder, and then generates the response using an RNN decoder. The last model, DCGM2, is similar but keeps the representations of context and message separate instead of encoding them into a single BoW vector. The authors train their models on 29M triple data set from Twitter and evaluate using BLEU, METEOR and human evaluator scores. #### Key Points:  3 Models: RLMT, DCGM1, DCGM2  Data: 29M triples from Twitter  Because (c,m) is very long on average the authors expect RLMT to perform poorly.  Vocabulary: 50k words, trained with NCE loss  Generates responses degrade with length after ~8 tokens #### Notes/Questions:  Limiting the context to a single message kind of defeats the purpose of this. No real conversations have only a single message as context, and who knows how well the approach works with a larger context?  Authors complain that dealing with long sequences is hard, but they don't even use an LSTM/GRU. Why? 
[link]
TLDR; The authors propose an importancesampling approach to deal with large vocabularies in NMT models. During training, the corpus is partitioned, and for each partition only target words occurring in that partition are chosen. To improve decoding speed over the full vocabulary, the authors build a dictionary mapping from source sentence to potential target vocabulary. The authors evaluate their approach on standard MT tasks and perform better than the baseline models with smaller vocabulary. #### Key Points:  Computing partition function is the bottleneck. Use samplingbased approach.  Dealing with large vocabulary during training is separate from dealing with large vocab during decoding. Training is handled with importance sampling. Decoding is handled with sourcebased candidate list.  Decoding with candidate list takes around 0.12s (0.05) per token on CPU (GPU). Without target list 0.8s (0.25s).  Issue: Candidate list is depended on source sentence, so it must be recomputed for each sentence.  Reshuffling the data set is expensive as new partitions need to be calculated (not necessary, but improved scores). #### Notes:  How is the corpus partitioned? What's the effect of the partitioning strategy?  The authors say that they replace UNK tokens using "another word alignment model" but don't go into detail what this is. The results show that doing this results in much larger score bump than increasing the vocab does. (The authors do this for all comparison models though).  Reshuffling the dataset also results in a significant performance bump, but this operation is expensive. IMO the authors should take all these into account when reporting performance numbers. A single training update may be a lot faster, but the setup time increases. I'd would've like to see the authors assign a global time budget to train/test and then compare the models based on that.  The authors only briefly mentioned that rebuilding the target vocab for each source sentence is an issue and how they solve it, no details given. 
[link]
TLDR; The authors propose a new architecture called "Pointer Network". A Pointer Network is a seq2seq architecture with attention mechanism where the output vocabulary is the set of input indices. Since the output vocabulary varies based on input sequence length, a Pointer Network can generalize to variablelength inputs. The attention method trough which this is achieved is O(n^2), and only a sight variation of the standard seq2seq attention mechanism. The authors evaluate the architecture on tasks where the outputs correspond to positions of the inputs: Convex Hull, Delaunay Triangulation and Traveling Salesman problems. The architecture performs well these, and generalizes to sequences longer than those found in the training data. #### Key Points  Similar to standard attention, but don't blend the encoder states, use the attention vector directory.  Softmax probabilities of outputs can be interpreted as a fuzzy pointer.  We can solve the same problem artificially using seq2seq and outputting "coordinates", but that ignores the output constraints and would be less efficient.  512 unit LSTM, SGD with LR 1.0, batch size of 128, L2 gradient clipping of 2.0.  In the case of TSP, the "student" networks outperforms the "teacher" algorithm. #### Notes/ Questions  Seems like this architecture could be applied to generating spans (as in the newer "Text Processing From Bytes" paper), for POS tagging for example. That would require outputting classes in addition to input pointers. How? 
[link]
TLDR; The authors propose a novel architecture called ReNet, which replaces convolutional and maxpooling layers with RNNs that sweep over the image vertically and horizontally. These RNN layers are then stacked. The authors demonstrate that ReNet architecture is a viable alternative to CNNs. ReNet doesn't outperform CNNs in this paper, but further optimizations and hyperparameter tuning are likely going to lead to improved results in the future. #### Key Points:  Split images into patches, feed one patch per time step into RNN, vertically then horizontally. 4 RNNs per layer, 2 vertical and 2 horizontal, one per diretion.  Because the RNNs sweep over the whole image they can see the context of the full image, as opposed to just a local context in the case of conv/pool layers.  Smooth from endend to end.  In experiments, 2 256dimensional ReNet layers, 2x2 patches, 4096dimensional affine layers.  Flipping and shifting for data augmentation. #### Notes/Questions:  What is the training time/complexity compared to a CNN?  Why split the image into patches at all? I wonder if the authors have experimented with various patch sizes, like defining patches that go over the full vertical height. 2x2 patches as used in the experiment seem quite small and like a waste of computational resources. 
[link]
TLDR; The authors show that we can pretrain RNNs using unlabeled data by either reconstructing the original sequence (SALSTM), or predicting the next token as in a language model (LMLSTM). We can then finetune the weights on a supervised task. Pretrained RNNs are more stable, generalize better, and achieve stateoftheart results on various text classification tasks. The authors show that unlabeled data can compensate for a lack of labeled data. #### Data Sets Error Rates for SALSTM, previous best results in parens.  IMDB: 7.24% (7.42%)  Rotten Tomatoes 16.7% (18.5%) (using additional unlabeled data)  20 Newsgroups: 15.6% (17.1%)  DBPedia characterlevel: 1.19% (1.74%) #### Key Takeaways  SALSTM: Predict sequence based on final hidden state  LMLSTM: LanguageModel pretraining  LSTM, 1024dimensional cell, 512dimensional embedding, 512dimensional hidden affine layer + 50% dropout, Truncated backprop 400 steps. Clipped cell outputs and gradients. Word and input embedding dropout tuned on dev set.  Linear Gain: Inject gradient at each step and linearly increase weights of prediction objectives #### Notes / Questions  Not clear when/how linear gain yields improvements. On some data sets it significantly reduces performance, on other it significantly improves performance. Any explanations?  Word dropout is used in the paper but not explained. I'm assuming it's replacing random words with `DROP` tokens?  The authors mention a joint training model, but it's only evaluated on the IMDB data set. I'm assuming the authors didn't evaluate it further because it performed badly, but it would be nice to get an intuition for why it doesn't work, and show results for other data sets.  All tasks are classification tasks. Does SALSTM also improve performance on seq2seq tasks?  What is the training time? :) (I also wonder how the batching is done, are texts padded to the same length with mask?) 
[link]
TLDR; The authors evaluate the impact of hyperparameters (embeddings, filter region size, number of feature maps, activation function, pooling, dropout and l2 norm constraint) on Kim's (2014) CNN for sentence classification. The authors present empirical findings with variance nunbers based on a large number of experiments on 7 classification data sets, and give practical recommendation for architecture decisions. #### Key Points  Recommended Baseline configuration: word2vec, (3,4,5) filter regions, 100 feature maps per region size, ReLU activation, 1maxpooling, 0.5 dropout, l2 norm constraint on weight vector of 3.  Onehot vectors perform worse than pretrained embeddings. word2vec outperforms GloVe most of the time.  Filter region size is dependent on data set in the range of 225. Recommended to do a line search over single region size and then combine multiple sizes.  Increasing the number of feature maps per filter region to more than 600 doesn't seem to help much.  ReLU almost always best activation function  Maxpooling almost always best pooling strategy  Dropout from 0.1 to 0.5 helps, l2 norm constraint not much #### Notes/Questions  All datasets analyzed in this paper are rather similar. They have similar average and max sentence length, and even the number of examples is of roughly the same magnitude. It would be interesting to see how the result change with very different datasets, such as long documents, or very large numbers of training examples. 
[link]
TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN focus on specific parts of the image. In order find the correspondence between words and image patches, the RNN uses a lower convolutional layer as its input (before pooling). The authors propose both a "hard" attention (trained using sampling methods) and "soft" attention (trained endtoend) mechanism, and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attentionbased models achieve stateofthe art on Flickr8k, Flickr30 and MS Coco. #### Key Points  To find image correspondence use lower convolutional layers to attend to.  Two attention mechanisms: Soft and hard. Depending on evaluation metric (BLEU vs. METERO) one or the other performs better.  Largest data set (MS COCO) takes 3 days to train on Titan Black GPU. Oxford VGG.  Soft attention is same as for seq2seq models.  Attention weights are visualized by upsampling and applying a Gaussian #### Notes/Questions  Would've liked to see an explanation of when/how soft vs. hard attention does better.  What is the computational overhead of using the attention mechanism? Is it significant? 
[link]
TLDR; The authors apply the skipthoguth word2vec model to the sentence level, training autoencoders that predict the previous and next sentences. The resulting generalpurpose vector representations are called skipthought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating finetuned models. #### Key Points  Code at https://github.com/ryankiros/skipthoughts  Training is done on large book corpus (74M sentences, 1B tokens), takes 2 weeks.  Two variations: Bidirectional encoder and unidirectional encoder with 1200 and 2400 units per encoder respectively. GRU cell, Adam optimizer, gradient clipping norm 10.  Vocabulary can be expanded by learning a mapping from a large word2vec voab to the smaller skipthought vocab. Could also used sampling/hierarchical softmax during training for larger vocab, or train on characters. #### Questions/Notes  Authors clearly state that this is not the goal of the paper, though I'd be curious how more sophisticated (nonlinear) classifiers perform with skipthought vectors. Authors probably tried this but it didn't do well ;)  The fact that the story generation doesn't seem work well shows that the model has problems learning or understanding longterm dependencies. I wonder if this can be solved by deeper encoders or attention. 
[link]
TLDR; The authors evaluate softmax, hierarchical softmax, target sampling, NCE, selfnormalization and differentiated softmax (novel technique presented in the paper) on data sets with varying vocabulary size (10k, 100k, 800k) with a fixedtime training budget. The authors find that techniques that work best for small vocabluaries are not necessarily the ones that work best for large vocabularies. #### Data and Models Models:  Sotmax  Hierarchical Softmax (crossvalidation of clustering techniques)  Differentiated softmax, adjusting capacity based on token frequency (crossvalidation of number of frequency bands and size)  Target Sampling (crossvalidation of number of distractors)  NCE (crossvalidation of noise ratio)  Selfnormalization (crossvalidation of regularization strenth) Data:  PTB (1M tokens, 10k vocab)  Gigaword (5B tokens, 100k vocab)  billionW (800M tokens, 800k vocab) #### Key Takeaways  Techniques that work best for small vocabluaries are not necessarily the ones that work best for large vocabularies.  Differentiated softmax varies the capacity (size of matrix slice in the last layer) based on token frequency. In practice, it's implemented as separate matrices with different sizes.  Perplexity doesn't seem to improve much after ~500M tokens  Models are trained for 1 week each  The competitiveness of softmax diminishes with vocabulary sizes. It seems to perform relatively well on 10k and 100k, but poorly on 800k since it need more processing time per example.  Traning time, not training data, is the main factor of limiting performance. The authors found that very large models are still making progress after one week and may eventually beat if the other models if allowed to run longer. #### Questions / Notes  What about the hyperparameters for Differentiated Softmax? The paper doesn't show an analysis. Also, the fact that this method introduces two additional hyperparameters makes it harder to apply in practice.  Would've liked to see more comparisons for Softmax, which is the simplest technique of all and doesn't need hyperparameter tuning. It doesn't work well on 800k vocab, but it does for 100k. So, the authors only show how it breaks down for one dataset. 