Welcome to ShortScience.org! 
[link]
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. RN is a plugandplay module and although expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors or entity embeddings from text, or something else. And the feedforward network is free to discover relations between objects (as opposed to being handassigned specific relations).  At its core, RN has two parts:  a feedforward network `g` that operates on pairs of object representations, for all possible pairs, all pairwise computations pooled via elementwise addition  a feedforward network `f` that operates on pooled features for downstream task, everything being trained endtoend  When dealing with pixels (as in CLEVR experiment), individual object representations are spatially distinct convolutional layer features (196 512d object representations for VGG conv5 say). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.  For bAbI, object representations are LSTM encodings of supporting sentences.  For VQA tasks, `g` conditions its processing on question encoding as well, as relations that are relevant for figuring out the answer would be questiondependent. ## Strengths  Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before. ## Weaknesses / Notes Fairly simple idea — let a feedforward network operate on all pairs of object representations and figure out relations necessary for downstream task with endtoend training. And it is fairly general in its design, relations aren't handdesigned and neither are object representations — for RGB images, these are spatially distinct convolutional layer features, for text, these are LSTM encodings of supporting facts, and so on. This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR, that have this explicit notion of programs or modules with specialized roles (that need to be predefined), as opposed to letting these relations emerge, reducing dependency on handdesigning modules and adding in inductive biases from an architectural pointofview for the network to reason about relations (earlier endtoend VQA models didn't have the capacity to figure out relations). 
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixelwise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207deepneuralnetworksforobjectdetection.pdf)). The paper introduces RPNs (Region Proposal Networks). They are endtoend trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid. RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score. ## See also * [RCNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Fast RCNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Faster RCNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma) * [Mask RCNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17) 
[link]
Originally posted on my Github repo [papernotes](https://github.com/karpathy/papernotes/blob/master/vin.md). # Value Iteration Networks By Berkeley group: Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel This paper introduces a poliy network architecture for RL tasks that has an embedded differentiable *planning module*, trained endtoend. It hence falls into a category of fun papers that take explicit algorithms, make them differentiable, embed them in a larger neural net, and train everything endtoend. **Observation**: in most RL approaches the policy is a "reactive" controller that internalizes into its weights actions that historically led to high rewards. **Insight**: To improve the inductive bias of the model, embed a specificallystructured neural net planner into the policy. In particular, the planner runs the value Iteration algorithm, which can be implemented with a ConvNet. So this is kind of like a modelbased approach trained with modelfree RL, or something. Lol. NOTE: This is very different from the more standard/obvious approach of learning a separate neural network environment dynamics model (e.g. with regression), fixing it, and then using a planning algorithm over this intermediate representation. This would not be endtoend because we're not backpropagating the end objective through the full model but rely on auxiliary objectives (e.g. log prob of a state given previous state and action when training a dynamics model), and in practice also does not work well. NOTE2: A recurrent agent (e.g. with an LSTM policy), or a feedforward agent with a sufficiently deep network trained in a modelfree setting has some capacity to learn planninglike computation in its hidden states. However, this is nowhere near as explicit as in this paper, since here we're directly "baking" the planning compute into the architecture. It's exciting. ## Value Iteration Value Iteration is an algorithm for computing the optimal value function/policy $V^*, \pi^*$ and involves turning the Bellman equation into a recurrence: ![Screen Shot 20160813 at 3.26.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%203.26.04%20PM.png) This iteration converges to $V^*$ as $n \rightarrow \infty$, which we can use to behave optimally (i.e. the optimal policy takes actions that lead to the most rewarding states, according to $V^*$). ## Gridworld domain The paper ends up running the model on several domains, but for the sake of an effective example consider the gridworld task where the agent is at some particular position in a 2D grid and has to reach a specific goal state while also avoiding obstacles. Here is an example of the toy task: ![Screen Shot 20160813 at 4.43.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%204.43.04%20PM.png) The agent gets a reward +1 in the goal state, 1 in obstacles (black), and 0.01 for each step (so that the shortest path to the goal is an optimal solution). ## VIN model The agent is implemented in a very straightforward manner as a single neural network trained with TRPO (Policy Gradients with a KL constraint on predictive action distributions over a batch of trajectories). So the only loss function used is to maximize expected reward, as is standard in modelfree RL. However, the policy network of the agent has a very specific structure since it (internally) runs value iteration. First, there's the core Value Iteration **(VI) Module** which runs the recurrence formula (reproducing again): ![Screen Shot 20160813 at 3.26.04 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%203.26.04%20PM.png) The input to this recurrence are the two arrays R (the reward array, reward for each state) and P (the dynamics array, the probabilities of transitioning to nearby states with each action), which are of course unknown to the agent, but can be predicted with neural networks as a function of the current state. This is a little funny because the networks take a _particular_ state **s** and are internally (during the forward pass) predicting the rewards and dynamics for all states and actions in the entire environment. Notice, extremely importantly and once again, that at no point are the reward and dynamics functions explicitly regressed to the observed transitions in the environment. They are just arrays of numbers that plug into value iteration recurrence module. But anyway, once we have **R,P** arrays, in the Gridworld above due to the local connectivity, value iteration can be implemented with a repeated application of convolving **P** over **R**, as these filters effectively *diffuse* the estimated reward function (**R**) through the dynamics model (**P**), followed by max pooling across the actions. If **P** is a not a function of the state, it would simply be the filters in the Conv layer. Notice that posing this as convolution also assumes that the env dynamics are positioninvariant. See the diagram below on the right:![Screen Shot 20160813 at 4.58.42 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%204.58.42%20PM.png) Once the array of numbers that we interpret as holding the estimated $V^*$ is computed after running **K** steps of the recurrence (K is fixed beforehand. For example for a 16x16 map it is 20, since that's a bit more than the amount of steps needed to diffuse rewards across the entire map), we "pluck out" the stateaction values $Q(s,.)$ at the state the agent happens to currently be in (by an "attention" operator $\psi$), and (optionally) append these Q values to the feedforward representation of the current state $\phi(s)$, and finally predicting the action distribution. ## Experiments **Baseline 1**: A vanilla ConvNet policy trained with TRPO. [(50 3x3 filters)\*2, 2x2 max pool, (100 3x3 filters)\*3, 2x2 max pool, FC(100), FC(4), Softmax]. **Baseline 2**: A fully convolutional network (FCN), 3 layers (with a filter that spans the whole image), of 150, 100, 10 filters. i.e. slightly different and perhaps a bit more domainappropriate ConvNet architecture. **Curriculum** is used during training where easier environments are trained on first. This is claimed to work better but not quantified in tables. Models are trained with TRPO, RMSProp, implemented in Theano. Results when training on **5000** random gridworld instances (hey isn't that quite a bit low?):![Screen Shot 20160813 at 5.47.23 PM](https://raw.githubusercontent.com/karpathy/papernotes/master/img/vin/Screen%20Shot%2020160813%20at%205.47.23%20PM.png) TLDR VIN generalizes better. The authors also run the model on the **Mars Rover Navigation** dataset (wait what?), a **Continuous Control** 2D path planning dataset, and the **WebNav Challenge**, a languagebased search task on a graph (of a subset of Wikipedia). Skipping these because they don't add _too_ much to the core cool idea of the paper. ## Misc **The good**: I really like this paper because the core idea is cute (the planner is *embedded* in the policy and trained endtoend), novel (I don't think I saw this idea executed on so far elsewhere), the paper is wellwritten and clear, and the supplementary materials are thorough. **On the approach**: Significant challenges remain to make this approach more practicaly viable, but it also seems that much more exciting followup work can be done in this framework. I wish the authors discussed this more in the conclusion. In particular, it seems that one has to explicitly encode the environment connectivity structure in the internal model $\bar{M}$. How much of a problem is this and what could be done about it? Or how could we do the planning in more higherlevel abstract spaces instead of the actual lowlevel state space of the problem? Also, it seems that a potentially nice feature of this approach is that the agent could dynamically "decide" on a reward function at runtime, and the VI module can diffuse it through the dynamics and hence do the planning. A potentially interesting outcome is that the agent could utilize this kind of computation so that an LSTM controller could learn to "emit" reward function subgoals and the VI planner computes how to meet them. A nice/clean division of labor one could hope for in principle. **The experiments**. Unfortunately, I'm not sure why the authors preferred breadth of experiments and sacrificed depth of experiments. I would have much preferred a more indepth analysis of the gridworld environment. For instance:  Only 5,000 training examples are used for training, which seems very little. Presumable, the baselines get stronger as you increase the number of training examples?  Lack of visualizations: Does the model actually learn the "correct" rewards **R** and dynamics **P**? The authors could inspect these manually and correlate them to the actual model. This would have been reaaaallllyy cool. I also wouldn't expect the model to exactly learn these, but who knows.  How does the model compare to the baselines in the number of parameters? or FLOPS? It seems that doing VI for 30 steps at each single iteration of the algorithm should be quite expensive.  The authors should study the performance as a function of the number of recurrences **K**. A particularly funny experiment would be K = 1, where the model would be effectively predicting **V*** directly, without planning. What happens?  If the output of VI $\psi(s)$ is concatenated to the state parameters, are these Q values actually used? What if all the weights to these numbers are zero in the trained models?  Why do the authors only evaluate success rate when the training criterion is expected reward? Overall a very cute idea, well executed as a first step and well explained, with a bit of unsatisfying lack of depth in the experiments in favor of breadth that doesn't add all that much.
2 Comments

[link]
Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the firstthingtotry: 1. Build a recurrent generator model which samples from its softmax outputs at each timestep. 2. Pass sampled sequences to a recurrent discriminator model which distinguishes between sampled sequences and realdata sequences. 3. Train the discriminator under the standard GAN loss. 4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator. Sounds hacky, right? We're learning a generator with a highvariance modelfree reinforcement learning algorithm, in a very seriously nonstationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.) There's just one trick in this paper on top of that setup: for nonterminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: (under stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$ and discriminator $D(\tau)$) $$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$ The rewards are estimated via Monte Carlo — i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.  Yep, so it turns out that this sort of works.. with a big caveat: ## The big caveat Graph from appendix: ![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%2020160927%20um%2021.34.44.png?raw=1) SeqGANs don't work without supervised pretraining. Makes sense — with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1). I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.
4 Comments

[link]
[Batch Normalization Ioffe et. al 2015](Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift) is one of the remarkable ideas in the era of deep learning that sits with the likes of Dropout and Residual Connections. Nonetheless, last few years have shown a few shortcomings of the idea, which two years later Ioffe has tried to solve through the concept that he calls Batch Renormalization. Issues with Batch Normalization  Different parameters used to compute normalized output during training and inference  Using Batch Norm with small minibatches  Noni.i.d minibatches can have a detrimental effect on models with batchnorm. For e.g. in a metric learning scenario, for a minibatch of size 32, we may randomly select 16 labels then choose 2 examples for each of these labels, the examples interact at every layer and may cause model to overfit to the specific distribution of minibatches and suffer when used on individual examples. The problem with using moving averages in training, is that it causes gradient optimization and normalization in opposite direction and leads to model blowing up. Idea of Batch Renormalization We know that, ${\frac{x_i  \mu}{\sigma} = \frac{x_i  \mu_B}{\sigma_B}.r + d}$ where, ${r = \frac{\sigma_B}{\sigma}, d = \frac{\mu_B  \mu}{\sigma}}$ So the batch renormalization algorithm is defined as follows ![Batch Renorm Algo](https://fractalanalyticmy.sharepoint.com/personal/shubham_jain_fractalanalytics_com/_layouts/15/guestaccess.aspx?docid=0c2c627424786442f8de65367755e1fd1&authkey=ARSCi3QfpM_uBVuWCYARKNg) Ioffe writes further that for practical purposes, > In practice, it is beneficial to train the model for a certain number of iterations with batchnorm alone, without the correction, then ramp up the amount of allowed correction. We do this by imposing bounds on r and d, which initially constrain them to 1 and 0, respectively, and then are gradually relaxed. In experiments, For Batch Renorm, author used $r_{max}$ = 1, $d_{max}$ = 0 (i.e. simply batchnorm) for the first 5000 training steps, after which these were gradually relaxed to reach $r_{max}$ = 3 at 40k steps, and $d_{max}$ = 5 at 25k steps. A training step means, an update to the model.
2 Comments
