Welcome to ShortScience.org! 
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network, replace the last layer with a regression layer which essentially predicts pixelwise if the object is there or not. An bounding box inference algorithm is added at last to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207deepneuralnetworksforobjectdetection.pdf)). The paper introduces RPNs (Region Proposal Networks). They are endtoend trained to generate region proposals.They simoultaneously regress region bounds and bjectness scores at each location on a regular grid. RPNs are one type of fully convolutional networks. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score. ## See also * [RCNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Fast RCNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen) * [Faster RCNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma) * [Mask RCNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17) 
[link]
* They describe a variation of convolutions that have a differently structured receptive field. * They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling). ### How * One can image the input into a convolutional layer as a 3dgrid. Each cell is a "pixel" generated by a filter. * Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other. * In dilated convolutions, the cells are not right next to each other. E.g. 2dilated convolutions skip 1 cell between each input cell, 3dilated convolutions skip 2 cells etc. (Similar to striding.) * Normal convolutions are simply 1dilated convolutions (skipping 0 cells). * One can use a 1dilated convolution and then a 2dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing. * Increasing the dilation factor by 2 per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field will still be part in the computation of at least one convolution. * They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using resdiual connections would have been easier.) ![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/MultiScale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field") *Receptive fields of a 1dilated convolution (1st image), followed by a 2dilated conv. (2nd image), followed by a 4dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.* ### Results * They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept). * They then used the network to segment images. * Their results were significantly better than previous methods. * They also added another network with more dilated convolutions in front of the VGG one, again improving the results. ![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/MultiScale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance") *Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.* 
[link]
_Objective:_ Design FeedForward Neural Network (fully connected) that can be trained even with very deep architectures. * _Dataset:_ [MNIST](yann.lecun.com/exdb/mnist/), [CIFAR10](https://www.cs.toronto.edu/%7Ekriz/cifar.html), [Tox21](https://tripod.nih.gov/tox21/challenge/) and [UCI tasks](https://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits). * _Code:_ [here](https://github.com/bioinfjku/SNNs) ## Innerworkings: They introduce a new activation functio the Scaled Exponential Linear Unit (SELU) which has the nice property of making neuron activations converge to a fixed point with zeromean and unitvariance. They also demonstrate that upper and lower bounds and the variance and mean for very mild conditions which basically means that there will be no exploding or vanishing gradients. The activation function is: [![screen shot 20170614 at 11 38 27 am](https://userimages.githubusercontent.com/17261080/271259011a4f727650f611e7857debad1ac94789.png)](https://userimages.githubusercontent.com/17261080/271259011a4f727650f611e7857debad1ac94789.png) With specific parameters for alpha and lambda to ensure the previous properties. The tensorflow impementation is: def selu(x): alpha = 1.6732632423543772848170429916717 scale = 1.0507009873554804934193349852946 return scale*np.where(x>=0.0, x, alpha*np.exp(x)alpha) They also introduce a new dropout (alphadropout) to compensate for the fact that [![screen shot 20170614 at 11 44 42 am](https://userimages.githubusercontent.com/17261080/27126174e67d212c50f611e78952acad98b850be.png)](https://userimages.githubusercontent.com/17261080/27126174e67d212c50f611e78952acad98b850be.png) ## Results: Batch norm becomes obsolete and they are also able to train deeper architectures. This becomes a good choice to replace shallow architectures where random forest or SVM used to be the best results. They outperform most other techniques on small datasets. [![screen shot 20170614 at 11 36 30 am](https://userimages.githubusercontent.com/17261080/27125798bd04c25650f511e78a74b3b6a3fe82ee.png)](https://userimages.githubusercontent.com/17261080/27125798bd04c25650f511e78a74b3b6a3fe82ee.png) Might become a new standard for fullyconnected activations in the future. 
[link]
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities. RN is a plugandplay module and although expects object representations as input, the semantics of what an object is need not be specified, so object representations can be convolutional layer feature vectors or entity embeddings from text, or something else. And the feedforward network is free to discover relations between objects (as opposed to being handassigned specific relations).  At its core, RN has two parts:  a feedforward network `g` that operates on pairs of object representations, for all possible pairs, all pairwise computations pooled via elementwise addition  a feedforward network `f` that operates on pooled features for downstream task, everything being trained endtoend  When dealing with pixels (as in CLEVR experiment), individual object representations are spatially distinct convolutional layer features (196 512d object representations for VGG conv5 say). The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates, shape, material, color, size.  For bAbI, object representations are LSTM encodings of supporting sentences.  For VQA tasks, `g` conditions its processing on question encoding as well, as relations that are relevant for figuring out the answer would be questiondependent. ## Strengths  Very simple idea, clearly explained, performs well. Somewhat shocked that it hasn't been tried before. ## Weaknesses / Notes Fairly simple idea — let a feedforward network operate on all pairs of object representations and figure out relations necessary for downstream task with endtoend training. And it is fairly general in its design, relations aren't handdesigned and neither are object representations — for RGB images, these are spatially distinct convolutional layer features, for text, these are LSTM encodings of supporting facts, and so on. This module can be dropped in and combined with more sophisticated networks to improve performance at VQA. RNs also offer an alternative design choice to prior works on CLEVR, that have this explicit notion of programs or modules with specialized roles (that need to be predefined), as opposed to letting these relations emerge, reducing dependency on handdesigning modules and adding in inductive biases from an architectural pointofview for the network to reason about relations (earlier endtoend VQA models didn't have the capacity to figure out relations). 
[link]
Everyone has been thinking about how to apply GANs to discrete sequence data for the past year or so. This paper presents the model that I would guess most people thought of as the firstthingtotry: 1. Build a recurrent generator model which samples from its softmax outputs at each timestep. 2. Pass sampled sequences to a recurrent discriminator model which distinguishes between sampled sequences and realdata sequences. 3. Train the discriminator under the standard GAN loss. 4. Train the generator with a REINFORCE (policy gradient) objective, where each trajectory is assigned a single episodic reward: the score assigned to the generated sequence by the discriminator. Sounds hacky, right? We're learning a generator with a highvariance modelfree reinforcement learning algorithm, in a very seriously nonstationary environment. (Here the "environment" is a discriminator being jointly learned with the generator.) There's just one trick in this paper on top of that setup: for nonterminal states, the reward is defined as the *expectation* of the discriminator score after stochastically generating from that state forward. To restate using standard (somewhat sloppy) RL syntax, in different terms than the paper: (under stochastic sequential policy $\pi$, with current state $s_t$, trajectory $\tau_{1:T}$ and discriminator $D(\tau)$) $$r_t = \mathbb E_{\tau_{t+1:T} \sim \pi(s_t)} \left[ D(\tau_{1:T}) \right]$$ The rewards are estimated via Monte Carlo — i.e., just take the mean of $N$ rollouts from each intermediate state. They claim this helps to reduce variance. That makes intuitive sense, but I don't see any results in the paper demonstrating the effect of varying $N$.  Yep, so it turns out that this sort of works.. with a big caveat: ## The big caveat Graph from appendix: ![](https://www.dropbox.com/s/5fqh6my63sgv5y4/Bildschirmfoto%2020160927%20um%2021.34.44.png?raw=1) SeqGANs don't work without supervised pretraining. Makes sense — with a cold start, the generator just samples a bunch of nonsense and the discriminator overfits. Both the generator and discriminator are pretrained on supervised data in this paper (see Algorithm 1). I think it must be possible to overcome this with the proper training tricks and enough sweat. But it's probably more worth our time to address the fundamental problem here of developing better RL for structured prediction tasks.
4 Comments
