[link]
Xception Net, or Extreme Inception Net, offers a new way of looking at Inception Nets. Inception Nets, as first published (as GoogLeNet), consisted of Network-in-Network modules like this:

![Inception Modules](http://i.imgur.com/jwYhi8t.png)

The idea behind Inception modules was to look at cross-channel correlations (via 1x1 convolutions) and spatial correlations (via 3x3 convolutions) separately. The underlying hypothesis is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly. This idea is the genesis of Xception Net, which uses depthwise separable convolutions (a convolution that looks at spatial correlations within each channel independently, followed by a pointwise convolution that projects to the requisite channel space, leveraging inter-channel correlations). Chollet does a wonderful job of explaining how regular convolution (looking at channel and spatial correlations simultaneously) and depthwise separable convolution (looking at channel and spatial correlations independently in successive steps) are the end points of a spectrum, with the original Inception Nets lying in between.

![Extreme version of Inception Net](http://i.imgur.com/kylzfIQ.png)

*For Xception Net, though, Chollet uses depthwise separable layers that perform a 3x3 convolution on each channel first and then a 1x1 convolution on the output of the 3x3 convolutions (the opposite order of operations to that depicted in the image above).*

##### Input

The input is a set of images used for classification, along with the corresponding labels.

##### Architecture

The Xception architecture follows that of VGG-16, with the convolution-maxpool blocks replaced by residual blocks of depthwise separable convolution layers. The architecture looks like this:

![architecture of Xception Net](http://i.imgur.com/9hfdyNA.png)

##### Results

Xception Net was trained using hyperparameters tuned for the best performance of Inception V3.
On both the internal dataset and the ImageNet dataset, Xception outperformed Inception V3. Points to note:

- Xception and Inception V3 have a roughly similar number of parameters (~24 M), so the improvement in performance can't be attributed to network size.
- Xception trains marginally slower than Inception V3 (fewer steps per second), though better-optimized depthwise convolution operations could reverse this in future.
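To make the depthwise separable convolution described in this summary concrete, here is a minimal NumPy sketch, assuming "valid" padding, stride 1 and no bias or nonlinearity. The function names `depthwise_separable_conv` and `param_counts` are made up for illustration and are not taken from the paper or from any library.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Illustrative sketch (hypothetical helper, not the paper's code).

    x:           (H, W, C_in) input feature map
    depthwise_k: (k, k, C_in) one spatial kernel per input channel
    pointwise_w: (C_in, C_out) weights of the 1x1 convolution
    Assumes 'valid' padding and stride 1.
    """
    H, W, C = x.shape
    k = depthwise_k.shape[0]
    out_h, out_w = H - k + 1, W - k + 1
    spatial = np.zeros((out_h, out_w, C))
    # Depthwise step: each channel is filtered by its own k x k kernel,
    # so only spatial correlations are modelled here.
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]                     # (k, k, C_in)
            spatial[i, j, :] = (patch * depthwise_k).sum(axis=(0, 1))
    # Pointwise step: a 1x1 convolution mixes channels at every position,
    # so only cross-channel correlations are modelled here.
    return spatial @ pointwise_w                               # (out_h, out_w, C_out)

def param_counts(k, c_in, c_out):
    """Weights (no biases) of a regular vs. a depthwise separable convolution."""
    regular = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return regular, separable
```

For a 3x3 layer mapping 128 channels to 256, `param_counts(3, 128, 256)` gives 294,912 weights for the regular convolution against 33,920 for the separable one, which is why a separable network of the same parameter budget can afford more layers or channels.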
[link]
This paper describes how rank pooling, a very recent approach for pooling representations organized in a sequence $\\{{\bf v}_t\\}_{t=1}^T$, can be used in an end-to-end trained neural network architecture.

Rank pooling is an alternative to average and max pooling for sequences, but with the distinctive advantage of maintaining some order information from the sequence. Rank pooling first solves a regularized (linear) support vector regression (SVR) problem where the inputs are the vector representations ${\bf v}_t$ in the sequence and the target is the corresponding index $t$ of that representation in the sequence (see Equation 5). The output of rank pooling is then simply the linear regression parameters ${\bf u}$ learned for that sequence. Because of the way ${\bf u}$ is trained, we can see that ${\bf u}$ will capture order information, as successful training would imply that ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'}$ if $t < t'$. See [this paper](https://www.robots.ox.ac.uk/~vgg/rg/papers/videoDarwin.pdf) for more on rank pooling.

While previous work has focused on using rank pooling on hand-designed and fixed representations, this paper proposes to use ConvNet features (pre-trained on ImageNet) for the representation and backpropagate through rank pooling to fine-tune the ConvNet features. Since the output of rank pooling corresponds to an argmin operation, passing gradients through this operation is not as straightforward as for average or max pooling. However, it turns out that if the objective being minimized (in our case regularized SVR) is twice differentiable, gradients with respect to its argmin can be computed (see Lemmas 1 and 2). The authors derive the gradient for rank pooling (Equation 21).
Finally, since its gradient requires inverting a matrix (corresponding to a Hessian), the authors propose either to use an efficient procedure for computing it by exploiting properties of sums of rank-one matrices (see Lemma 3) or simply to use an approximation based on a diagonal Hessian.

In experiments on two small-scale video activity recognition datasets (UCF-Sports and Hollywood2), the authors show that fine-tuning the ConvNet features significantly improves the performance of rank pooling and makes it superior to max and average pooling.

**My two cents**

This paper was eye-opening for me, first because I did not realize that one could backpropagate through an operation corresponding to an argmin that doesn't have a closed-form solution (though apparently this paper isn't the first to make that observation). Moreover, I did not know about rank pooling, which itself is a really thought-provoking approach to pooling representations in a way that preserves some organizational information about the original representations.

I wonder how sensitive the results are to the value of the regularization constant of the SVR problem. The authors mention some theoretical guarantees on the stability of the solution found by SVR in general, but intuitively I would expect that the regularization constant would play a large role in the stability.

I'll be looking forward to any future attempts to increase the speed of rank pooling (or any similar method). Indeed, as the authors mention, it is currently too slow to be used on the larger video datasets that are currently available.

Code for computing rank pooling (though not for computing its gradients) seems to be available [here](https://bitbucket.org/bfernando/videodarwin).
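As a rough illustration of the pooling step described in this summary, the sketch below regresses the time index $t$ on the vectors ${\bf v}_t$ and returns the learned parameters ${\bf u}$. It deliberately swaps the paper's support vector regression for ridge regression (squared loss), because that has a closed-form solution; it is therefore a simplified stand-in and not the authors' method. The name `rank_pool_ridge` and the parameter `lam` are assumptions made for this example.

```python
import numpy as np

def rank_pool_ridge(V, lam=1e-3):
    """Simplified rank pooling (hypothetical helper).

    V:   (T, D) array, one feature vector v_t per time step
    lam: regularization constant (assumed value)
    Returns u of shape (D,), the minimizer of
        sum_t (u . v_t - t)^2 + lam * ||u||^2
    Note: the paper uses SVR; ridge regression is substituted here
    only because it can be solved in closed form.
    """
    T, D = V.shape
    t = np.arange(1, T + 1, dtype=float)      # regression targets: time indices
    A = V.T @ V + lam * np.eye(D)             # regularized normal equations
    return np.linalg.solve(A, V.T @ t)

# Usage: a sequence whose first feature grows with time.
steps = np.arange(1, 11, dtype=float)
V = np.stack([steps, np.sin(steps)], axis=1)  # (10, 2)
u = rank_pool_ridge(V)
scores = V @ u                                # should increase with t
```

If the fit succeeds, `scores` is increasing in $t$, which is the ordering property ${\bf u}^\top {\bf v}_t < {\bf u}^\top {\bf v}_{t'}$ for $t < t'$ mentioned above.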

[link]
Originally posted on my Github repo [papernotes](https://github.com/karpathy/papernotes/blob/master/vin.md).

# Value Iteration Networks

By Berkeley group: Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel

This paper introduces a policy network architecture for RL tasks that has an embedded differentiable *planning module*, trained end-to-end. It hence falls into a category of fun papers that take explicit algorithms, make them differentiable, embed them in a larger neural net, and train everything end-to-end.

**Observation**: in most RL approaches the policy is a "reactive" controller that internalizes into its weights actions that historically led to high rewards.

**Insight**: To improve the inductive bias of the model, embed a specifically-structured neural net planner into the policy. In particular, the planner runs the Value Iteration algorithm, which can be implemented with a ConvNet. So this is kind of like a model-based approach trained with model-free RL, or something. Lol.

NOTE: This is very different from the more standard/obvious approach of learning a separate neural network environment dynamics model (e.g. with regression), fixing it, and then using a planning algorithm over this intermediate representation. This would not be end-to-end because we're not backpropagating the end objective through the full model but rely on auxiliary objectives (e.g. log prob of a state given previous state and action when training a dynamics model), and in practice also does not work well.

NOTE2: A recurrent agent (e.g. with an LSTM policy), or a feedforward agent with a sufficiently deep network trained in a model-free setting has some capacity to learn planning-like computation in its hidden states. However, this is nowhere near as explicit as in this paper, since here we're directly "baking" the planning compute into the architecture. It's exciting.
## Value Iteration

Value Iteration is an algorithm for computing the optimal value function/policy $V^*, \pi^*$ and involves turning the Bellman equation into a recurrence:

![Screen Shot 20160813 at 3.26.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%203.26.04%20PM.png)

This iteration converges to $V^*$ as $n \rightarrow \infty$, which we can use to behave optimally (i.e. the optimal policy takes actions that lead to the most rewarding states, according to $V^*$).

## Gridworld domain

The paper ends up running the model on several domains, but for the sake of an effective example consider the gridworld task where the agent is at some particular position in a 2D grid and has to reach a specific goal state while also avoiding obstacles. Here is an example of the toy task:

![Screen Shot 20160813 at 4.43.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%204.43.04%20PM.png)

The agent gets a reward of +1 in the goal state, -1 in obstacles (black), and -0.01 for each step (so that the shortest path to the goal is an optimal solution).

## VIN model

The agent is implemented in a very straightforward manner as a single neural network trained with TRPO (Policy Gradients with a KL constraint on predictive action distributions over a batch of trajectories). So the only loss function used is to maximize expected reward, as is standard in model-free RL. However, the policy network of the agent has a very specific structure since it (internally) runs value iteration.
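As a point of reference for the recurrence in the Value Iteration section, here is a minimal sketch of plain tabular value iteration on a small deterministic gridworld. It assumes the rewards and dynamics are known, which is exactly what the VIN agent does not have, so this is the classical algorithm and not the VIN module itself. The function name `value_iteration`, the "stay in place when walking off the grid" rule, and the convention that the reward is collected on entering a cell are all assumptions made for this example.

```python
import numpy as np

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def value_iteration(reward, goal, gamma=0.99, n_iters=50):
    """Tabular value iteration on a deterministic grid (illustrative sketch).

    reward:  (H, W) array, reward collected on entering each cell (assumption)
    goal:    (row, col) of the terminal cell, whose value is pinned to 0
    n_iters: number of recurrence steps, playing the role of K in the VIN
    Implements V_{n+1}(s) = max_a [ R(s') + gamma * V_n(s') ],
    where s' is the cell reached from s by action a.
    """
    H, W = reward.shape
    V = np.zeros((H, W))
    for _ in range(n_iters):
        Q = np.empty((len(ACTIONS), H, W))
        for a, (dr, dc) in enumerate(ACTIONS):
            for r in range(H):
                for c in range(W):
                    # Moving off the grid leaves the agent where it is.
                    nr = min(max(r + dr, 0), H - 1)
                    nc = min(max(c + dc, 0), W - 1)
                    Q[a, r, c] = reward[nr, nc] + gamma * V[nr, nc]
        V = Q.max(axis=0)      # max over actions
        V[goal] = 0.0          # no future reward after reaching the goal
    return V

# Usage: 3x3 grid with the reward values quoted in the text.
reward = np.full((3, 3), -0.01)   # step cost
reward[1, 1] = -1.0               # obstacle
reward[2, 2] = 1.0                # goal
V = value_iteration(reward, goal=(2, 2))
```

Each pass spreads reward information one cell further from the goal, which is why the number of iterations needs to be at least on the order of the grid diameter.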
First, there's the core Value Iteration **(VI) Module** which runs the recurrence formula (reproducing again):

![Screen Shot 20160813 at 3.26.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%203.26.04%20PM.png)

The inputs to this recurrence are the two arrays R (the reward array, reward for each state) and P (the dynamics array, the probabilities of transitioning to nearby states with each action), which are of course unknown to the agent, but can be predicted with neural networks as a function of the current state. This is a little funny because the networks take a _particular_ state **s** and are internally (during the forward pass) predicting the rewards and dynamics for all states and actions in the entire environment. Notice, extremely importantly and once again, that at no point are the reward and dynamics functions explicitly regressed to the observed transitions in the environment. They are just arrays of numbers that plug into the value iteration recurrence module.

But anyway, once we have the **R,P** arrays, in the Gridworld above due to the local connectivity, value iteration can be implemented with a repeated application of convolving **P** over **R**, as these filters effectively *diffuse* the estimated reward function (**R**) through the dynamics model (**P**), followed by max pooling across the actions. If **P** is not a function of the state, it would simply be the filters in the Conv layer. Notice that posing this as convolution also assumes that the env dynamics are position-invariant. See the diagram below on the right:

![Screen Shot 20160813 at 4.58.42 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%204.58.42%20PM.png)

Once the array of numbers that we interpret as holding the estimated $V^*$ is computed after running **K** steps of the recurrence (K is fixed beforehand. For example, for a 16x16 map it is 20, since that's a bit more than the number of steps needed to diffuse rewards across the entire map), we "pluck out" the state-action values $Q(s, \cdot)$ at the state the agent happens to currently be in (by an "attention" operator $\psi$), (optionally) append these Q values to the feedforward representation of the current state $\phi(s)$, and finally predict the action distribution.

## Experiments

**Baseline 1**: A vanilla ConvNet policy trained with TRPO. [(50 3x3 filters)\*2, 2x2 max pool, (100 3x3 filters)\*3, 2x2 max pool, FC(100), FC(4), Softmax].

**Baseline 2**: A fully convolutional network (FCN), 3 layers (with a filter that spans the whole image), of 150, 100, 10 filters, i.e. a slightly different and perhaps a bit more domain-appropriate ConvNet architecture.

**Curriculum** is used during training where easier environments are trained on first. This is claimed to work better but not quantified in tables. Models are trained with TRPO, RMSProp, implemented in Theano.

Results when training on **5000** random gridworld instances (hey isn't that quite a bit low?):

![Screen Shot 20160813 at 5.47.23 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%205.47.23%20PM.png)

TLDR VIN generalizes better.

The authors also run the model on the **Mars Rover Navigation** dataset (wait what?), a **Continuous Control** 2D path planning dataset, and the **WebNav Challenge**, a language-based search task on a graph (of a subset of Wikipedia). Skipping these because they don't add _too_ much to the core cool idea of the paper.

## Misc

**The good**: I really like this paper because the core idea is cute (the planner is *embedded* in the policy and trained end-to-end), novel (I don't think I saw this idea executed on so far elsewhere), the paper is well-written and clear, and the supplementary materials are thorough.
**On the approach**: Significant challenges remain to make this approach more practically viable, but it also seems that much more exciting follow-up work can be done in this framework. I wish the authors discussed this more in the conclusion. In particular, it seems that one has to explicitly encode the environment connectivity structure in the internal model $\bar{M}$. How much of a problem is this and what could be done about it? Or how could we do the planning in higher-level abstract spaces instead of the actual low-level state space of the problem? Also, it seems that a potentially nice feature of this approach is that the agent could dynamically "decide" on a reward function at runtime, and the VI module can diffuse it through the dynamics and hence do the planning. A potentially interesting outcome is that the agent could utilize this kind of computation so that an LSTM controller could learn to "emit" reward function subgoals and the VI planner computes how to meet them. A nice/clean division of labor one could hope for in principle.

**The experiments**. Unfortunately, I'm not sure why the authors preferred breadth of experiments and sacrificed depth of experiments. I would have much preferred a more in-depth analysis of the gridworld environment. For instance:

- Only 5,000 training examples are used for training, which seems very little. Presumably, the baselines get stronger as you increase the number of training examples?
- Lack of visualizations: Does the model actually learn the "correct" rewards **R** and dynamics **P**? The authors could inspect these manually and correlate them to the actual model. This would have been reaaaallllyy cool. I also wouldn't expect the model to exactly learn these, but who knows.
- How does the model compare to the baselines in the number of parameters? or FLOPS? It seems that doing VI for 30 steps at each single iteration of the algorithm should be quite expensive.
- The authors should study the performance as a function of the number of recurrences **K**. A particularly funny experiment would be K = 1, where the model would be effectively predicting $V^*$ directly, without planning. What happens?
- If the output of VI $\psi(s)$ is concatenated to the state parameters, are these Q values actually used? What if all the weights to these numbers are zero in the trained models?
- Why do the authors only evaluate success rate when the training criterion is expected reward?

Overall a very cute idea, well executed as a first step and well explained, with a bit of unsatisfying lack of depth in the experiments in favor of breadth that doesn't add all that much.

[link]
- This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. RN is a plug-and-play module and, although it expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors, entity embeddings from text, or something else. The feedforward network is free to discover relations between objects (as opposed to being hand-assigned specific relations).
- At its core, RN has two parts:
    - a feedforward network `g` that operates on pairs of object representations, for all possible pairs, with all pairwise computations pooled via element-wise addition
    - a feedforward network `f` that operates on the pooled features for the downstream task, everything being trained end-to-end
- When dealing with pixels (as in the CLEVR experiment), individual object representations are spatially distinct convolutional layer features (say, 196 512-d object representations for VGG conv5). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.
- For bAbI, object representations are LSTM encodings of supporting sentences.
- For VQA tasks, `g` conditions its processing on the question encoding as well, as the relations that are relevant for figuring out the answer would be question-dependent.

## Strengths

- Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before.

## Weaknesses / Notes

Fairly simple idea: let a feedforward network operate on all pairs of object representations and figure out the relations necessary for the downstream task with end-to-end training. It is also fairly general in its design; relations aren't hand-designed and neither are object representations. For RGB images, these are spatially distinct convolutional layer features; for text, these are LSTM encodings of supporting facts, and so on.
This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR that have an explicit notion of programs or modules with specialized roles (which need to be predefined). Instead, RNs let these relations emerge, reducing the dependency on hand-designed modules while adding inductive biases, from an architectural point of view, for the network to reason about relations (earlier end-to-end VQA models didn't have the capacity to figure out relations).
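The two-part structure of the RN module described in this summary (a pairwise network `g`, sum pooling, then a network `f`) can be sketched in a few lines of NumPy. This is a forward pass only, with small randomly initialized MLPs standing in for trained networks; the helper names `mlp`, `init_mlp` and `relation_network`, and all layer sizes, are assumptions made for this example.

```python
import numpy as np

def mlp(x, weights):
    """Tiny feedforward net: ReLU after every layer except the last."""
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def init_mlp(sizes, rng):
    """Random weights for an MLP with the given layer sizes (hypothetical helper)."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def relation_network(objects, question, g_weights, f_weights):
    """Forward pass of an RN-style module (illustrative sketch).

    objects:  (N, D) object representations
    question: (Q,) question encoding that conditions g
    Computes f( sum over all pairs (i, j) of g([o_i, o_j, q]) ).
    """
    N = objects.shape[0]
    left = np.repeat(objects, N, axis=0)        # o_i for every pair (i, j)
    right = np.tile(objects, (N, 1))            # o_j for every pair (i, j)
    q = np.tile(question, (N * N, 1))           # same question for every pair
    pairs = np.concatenate([left, right, q], axis=1)   # (N*N, 2D + Q)
    pooled = mlp(pairs, g_weights).sum(axis=0)  # element-wise sum over pairs
    return mlp(pooled, f_weights)

# Usage with assumed sizes: 4 objects of dim 3, question of dim 2, 5 outputs.
rng = np.random.default_rng(0)
objects = rng.normal(size=(4, 3))
question = rng.normal(size=2)
g_weights = init_mlp([3 + 3 + 2, 16, 16], rng)
f_weights = init_mlp([16, 16, 5], rng)
logits = relation_network(objects, question, g_weights, f_weights)
```

Because the pairwise outputs are summed, the result does not depend on the order in which the objects are listed, which is what lets the module treat its input as a set.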
[link]
This paper describes how to apply the idea of batch normalization (BN) successfully to recurrent neural networks, specifically to LSTM networks. The technique involves the following 3 ideas:

**1) Careful initialization of the BN scaling parameter.** While standard practice is to initialize it to 1 (to have unit variance), they show that this situation creates problems with the gradient flow through time, which vanishes quickly. A value around 0.1 (used in the experiments) preserves gradient flow much better.

**2) Separate BN for the "hiddens to hiddens" pre-activation and for the "inputs to hiddens" pre-activation.** In other words, 2 separate BN operators are applied to each contribution to the pre-activation, before summing and passing through the tanh and sigmoid non-linearities.

**3) Use of the largest time-step BN statistics for longer test-time sequences.** Indeed, one issue with applying BN to RNNs is that if the input sequences have varying length, and if one uses per-time-step mean/variance statistics in the BN transformation (which is the natural thing to do), it hasn't been clear how to deal with the last time steps of longer sequences seen at test time, for which BN has no statistics from the training set. The paper shows evidence that the pre-activation statistics tend to gradually converge to stationary values over time steps, which supports the idea of simply using the training set's last time step statistics.

Among these ideas, I believe the most impactful is 1). The paper mentions towards the end that improper initialization of the BN scaling parameter probably explains previous failed attempts to apply BN to recurrent networks.

Experiments on 4 datasets confirm the method's success.

**My two cents**

This is an excellent development for LSTMs. BN has had an important impact on our success in training deep neural networks, and this approach might very well have a similar impact on the success of LSTMs in practice.
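Ideas 1) and 2) from this summary can be illustrated with a single LSTM step in NumPy. This is a simplified sketch, assuming training-mode batch statistics only (no per-time-step running averages, so idea 3 is not covered) and no learned BN shift; it includes only the two normalizations the summary describes. The function names, the gate ordering and the tensor shapes are assumptions made for this example.

```python
import numpy as np

def batch_norm(x, gamma, eps=1e-5):
    """Normalize each feature over the batch axis, then scale by gamma."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bn_lstm_step(x, h, c, Wx, Wh, b, gamma_x=0.1, gamma_h=0.1):
    """One LSTM step with separate BN on each pre-activation term (sketch).

    x: (B, D) inputs, h and c: (B, H) previous hidden and cell states
    Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    gamma_x, gamma_h: BN scales, set to 0.1 as the summary suggests
    Gate order i, f, o, g is an arbitrary choice for this example.
    """
    # Idea 2: input-to-hidden and hidden-to-hidden terms are normalized
    # separately, and only then summed.
    pre = batch_norm(x @ Wx, gamma_x) + batch_norm(h @ Wh, gamma_h) + b
    i, f, o, g = np.split(pre, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

# Usage with assumed sizes: batch 32, input dim 5, hidden dim 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))
h = rng.normal(size=(32, 4))
c = rng.normal(size=(32, 4))
Wx = rng.normal(size=(5, 16))
Wh = rng.normal(size=(4, 16))
h_new, c_new = bn_lstm_step(x, h, c, Wx, Wh, np.zeros(16))
```

With a scale of 0.1, each normalized term has a standard deviation of about 0.1 per feature, which keeps the gate pre-activations in the region where tanh and sigmoid are not saturated; that is the intuition behind the initialization in idea 1.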