Welcome to ShortScience.org! 
[link]
In object detection, gains in speed and accuracy have mostly come from changes to the network architecture. This paper takes a different route towards that goal: the authors introduce a new loss function called focal loss. They identify class imbalance as the main obstacle preventing one-stage detectors from matching the accuracy of two-stage detectors. The loss they introduce is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.

They add a modulating factor, shown in the image below, to the cross-entropy loss:

https://i.imgur.com/N7R3M9J.png

which ends up looking like this:

https://i.imgur.com/kxC8NCB.png

In their experiments, though, they add an additional alpha term to it because it gives them better results.

**RetinaNet**

The detector is a single unified network composed of a backbone network and two task-specific subnetworks. The backbone computes feature maps for the input image. The first subnetwork classifies objects from the backbone's output, and the second performs bounding box regression. The backbone is a Feature Pyramid Network built on top of ResNet.
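As a rough illustration of the focal loss described in the summary above, here is a minimal sketch for the binary case, using the form $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$. The function names and the defaults `gamma=2.0` and `alpha=0.25` are assumptions made for this example, and the input probability is assumed to lie strictly between 0 and 1.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross entropy for one example.

    p: predicted probability of the positive class (assumed 0 < p < 1)
    y: ground-truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross entropy scaled by the modulating factor (1 - p_t) ** gamma.

    alpha weights the positive class and (1 - alpha) the negative class.
    The default values are assumptions for this sketch.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy example (p_t = 0.9) keeps far less of its cross-entropy
    # loss than a hard example (p_t = 0.1).
    for p in (0.9, 0.1):
        print(p, cross_entropy(p, 1), focal_loss(p, 1))
```

With `gamma=0` and `alpha=1` the positive-class loss reduces to ordinary cross entropy, which is a quick sanity check on the scaling.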
[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the extra layers should "simply" learn identities. It seems this is not so easy for conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss in information flow in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use minibatches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
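To make the residual idea from the summary above concrete, here is a minimal sketch of one residual block on plain Python lists, computing $y = x + F(x)$. The two-layer fully connected form of $F$ and all function names are assumptions for illustration; the networks in the paper use convolutions, batch normalization and a ReLU after the addition, all omitted here.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def residual_block(x, w1, w2):
    """Return x + F(x), where F is a small two-layer network.

    The skip connection adds the input back unchanged, so F only
    has to model the residual between input and desired output.
    """
    f = linear(w2, relu(linear(w1, x)))
    return [xi + fi for xi, fi in zip(x, f)]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    zeros = [[0.0] * 3 for _ in range(3)]
    # With all weights at zero, F(x) = 0 and the block is the identity.
    print(residual_block(x, zeros, zeros))
```

This shows why "learning the identity becomes learning 0": the block passes its input through unchanged as soon as the residual branch outputs zeros.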
[link]
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot and Bengio is a **normalized weight initialization**

$$W \sim U \left [ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and the networks were trained for many minibatch epochs. Batch normalization / dropout might change the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
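The normalized initialization above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names, the list-of-lists weight layout and the optional `rng` argument are assumptions for this example, not part of the paper.

```python
import math
import random


def glorot_limit(n_in, n_out):
    """Bound of the uniform distribution: sqrt(6) / sqrt(n_j + n_{j+1})."""
    return math.sqrt(6.0) / math.sqrt(n_in + n_out)


def glorot_uniform(n_in, n_out, rng=None):
    """Sample an n_in x n_out weight matrix from U[-limit, limit]."""
    if rng is None:
        rng = random.Random()
    limit = glorot_limit(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


if __name__ == "__main__":
    weights = glorot_uniform(300, 200, random.Random(0))
    print(glorot_limit(300, 200), len(weights), len(weights[0]))
```

The bound shrinks as the layers get wider, which is what keeps the scale of activations and gradients roughly comparable from layer to layer.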

[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer, which pools each region of interest to a fixed size so the image regions don't have to be rescaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backpropagation through the RoI pooling layer routes each output gradient back to the input value that was selected as the argmax of its sub-window, summing over all RoIs that selected that value.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}
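The quoted description of RoI max pooling can be sketched as follows. This is a minimal single-channel version on nested lists; the function name, the `(top, left, h, w)` RoI format and the floor/ceil rule for sub-window boundaries are assumptions for this example, and the RoI is assumed to be at least as large as the output grid.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of interest into a fixed out_h x out_w grid.

    feature_map: 2D list of numbers (a single channel)
    roi: (top, left, h, w) in feature-map coordinates (assumed format)
    """
    top, left, h, w = roi
    assert h >= out_h and w >= out_w, "RoI smaller than the output grid"
    pooled = []
    for i in range(out_h):
        # Rows of sub-window i: floor(i*h/H) up to ceil((i+1)*h/H).
        r0 = top + (i * h) // out_h
        r1 = top + -(-((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[0, 1, 2, 3],
            [4, 5, 6, 7],
            [8, 9, 10, 11],
            [12, 13, 14, 15]]
    print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
```

Whatever the size of the RoI, the output always has the same shape, which is what lets regions of different sizes share the same fully connected layers.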
[link]
Recently, DeepMind released a new paper showing strong performance on board game tasks using a mechanism similar to the Value Prediction Network one in this paper, which inspired me to go back and get a grounding in this earlier work.

A goal of this paper is to design a model-based RL approach that can scale to complex environment spaces, but can still be used to run simulations and do explicit planning. Traditionally, model-based RL has worked by learning a dynamics model of the environment (predicting the next observation state given the current one and an action) and then using that model of the world to learn values and plan with. In addition to the advantages of explicit planning, a hope is that model-based systems generalize better to new environments, because they predict one-step changes in local dynamics in a way that can be more easily separated from long-term dynamics or reward patterns. However, a downside of MBRL is that it can be hard to train, especially when your observation space is high-dimensional, and learning a straight model of your environment will lead to you learning details that aren't actually important for planning or creating policies.

The synthesis proposed by this paper is the Value Prediction Network. Rather than predicting the observed state at the next step, it learns a transition model in latent space, and then learns to predict next-step reward and future value from that latent space vector. Because it learns to encode latent-space state from observations, and also learns a transition model from one latent state to another, the model can be used for planning, by simulating multiple transitions between latent states. However, unlike a normal dynamics model, whose training signal comes from a loss against observational prediction, the signal for training both the latent → reward/value/discount predictions and the latent → latent transitions comes from using this pipeline to predict reward values.
This means that if an aspect of the environment isn't useful for predicting reward, it won't generally be encoded into latent state, meaning you don't waste model capacity predicting irrelevant detail.

https://i.imgur.com/4bJylms.png

Once this model exists, it can be used for generating a policy through a tree-search planning approach: simulating future trajectories, aggregating the predicted reward along those trajectories, and then taking the highest-value one. The authors find that their model is able to do better than both model-free and model-based methods on the tasks they tested on. In particular, they find that it has many of the benefits of a model that predicts full observations, but that the Value Prediction Network learns more quickly, and is more robust to stochastic environments where there's an inherent ceiling on how well a next-step observation prediction can work.

My main question coming into this paper was: how is this different from simply a value estimator like those used in DQN or A2C? My impression is that the difference comes from this model's ability to do explicit state simulation in latent space, and then predict a value off of the *latent* state, whereas a value network predicts value from observational state.
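The planning loop described above can be sketched with toy stand-ins for the learned components. Everything here is hypothetical: the integer latent state, the hand-written `encode`, `transition`, `predict_reward` and `predict_value` functions, the goal position and the discount are placeholders for learned networks. The backup is a plain depth-limited max over simulated rewards with a value bootstrap at the leaves, a simplification rather than the paper's exact planning rule.

```python
ACTIONS = (-1, +1)   # hypothetical action set: step left or right
GOAL = 2             # hypothetical rewarding latent state
GAMMA = 0.9          # hypothetical fixed discount


def encode(observation):
    """Stand-in for the learned observation -> latent encoder."""
    return int(observation)


def transition(state, action):
    """Stand-in for the learned latent -> latent transition model."""
    return state + action


def predict_reward(state, action):
    """Stand-in for the learned next-step reward predictor."""
    return 1.0 if state + action == GOAL else 0.0


def predict_value(state):
    """Stand-in for the learned value predictor on latent states."""
    return 0.0


def q_value(state, action, depth):
    """Simulate `depth` steps in latent space and back up the best return."""
    reward = predict_reward(state, action)
    next_state = transition(state, action)
    if depth == 1:
        # Leaf: bootstrap from the value predicted off the latent state.
        return reward + GAMMA * predict_value(next_state)
    best = max(q_value(next_state, a, depth - 1) for a in ACTIONS)
    return reward + GAMMA * best


def plan(observation, depth):
    """Pick the action whose simulated trajectory looks best."""
    state = encode(observation)
    return max(ACTIONS, key=lambda a: q_value(state, a, depth))


if __name__ == "__main__":
    print(plan(0, depth=2))
```

Note that no observation is ever predicted: the search only touches latent states, rewards and values, which is the distinction from a conventional dynamics model that the summary draws.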