ShortScience.org - Making Science Accessible!

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the additional layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network then only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned
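
The first advantage can be made concrete with a minimal sketch of a residual block, $y = x + F(x)$. The two-layer form of $F$, the use of dense layers instead of convolutions, and the names `residual_block`, `w1` and `w2` are assumptions for illustration; the activation that the paper applies after the addition is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = x + F(x), with F a small two-layer transform (illustrative only)."""
    residual = relu(x @ w1) @ w2   # F(x): the part the block has to learn
    return x + residual            # identity shortcut, no parameters

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 8))
w2_zero = np.zeros((8, 8))

# With the last layer's weights at zero, F(x) = 0 and the block is the identity.
y = residual_block(x, w1, w2_zero)
print(np.allclose(y, x))  # True
```

Driving the weights towards zero is enough to recover the identity mapping, whereas a plain stack of layers would have to fit the identity through its non-linearities.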

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. They use a weight decay of 0.0001 ($10^{-4}$) and a momentum of 0.9. Mini-batches have size 256 on ImageNet and 128 on CIFAR-10.
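
As a rough illustration of "divided by 10 when the error plateaus", here is a minimal sketch of such a schedule. The summary does not say how a plateau is detected, so the `patience` and `min_delta` parameters and the function name `plateau_schedule` are assumptions, not the authors' procedure.

```python
def plateau_schedule(errors, lr=0.1, divisor=10.0, patience=2, min_delta=1e-3):
    """Return the learning rate used at each epoch, given per-epoch errors.

    The rate is divided by `divisor` after `patience` consecutive epochs
    without an improvement of at least `min_delta` (assumed plateau test).
    """
    best = float("inf")
    stale = 0
    rates = []
    for err in errors:
        rates.append(lr)
        if err < best - min_delta:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr /= divisor
                stale = 0
    return rates

# The error stalls at 0.6 for two epochs, so the rate drops before epoch 5.
rates = plateau_schedule([0.9, 0.7, 0.6, 0.6, 0.6, 0.55, 0.55, 0.55])
```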

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

Adaptive Computation Time for Recurrent Neural Networks
Graves, Alex
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Hugo Larochelle 9 years ago

This paper proposes a neural architecture that allows gradients to be backpropagated through a procedure that can go through a variable and adaptive number of iterations. These "iterations" could for instance be the number of times computations are passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper.

This is essentially achieved by pooling the recurrent states and respective outputs computed by each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units, which are sigmoidal units computed at each iteration and which give a soft weight on whether the computation should stop at the current iteration.

Crucially, the paper introduces a new ponder cost $P(x)$, which is a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below).

The paper presents experiments on RNNs applied to sequences where, at each time step t (not to be confused with what I'm calling computation iterations, which are indexed by n) in the sequence, the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled, to produce a single recurrent state and output for the time step t. During each of the $N(t)$ iterations at time step t, the intermediate states are connected to the same time-step-t input. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves to the next time step $t+1$. To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration.
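
The pooling for a single time step can be sketched as follows, under my reading of the paper: iterations stop once the cumulative halting value reaches $1 - \epsilon$, the last iteration receives the remainder so that the weights sum to one, and the per-step ponder cost is $N(t)$ plus that remainder. The function name `act_pool` and the choice to pass all candidate states at once (instead of stopping the loop early) are simplifications made for this example.

```python
import numpy as np

def act_pool(halting, states, eps=0.01):
    """Pool the intermediate states of one time step with halting weights.

    halting: shape (M,), sigmoid halting values for iterations 1..M
    states:  shape (M, D), intermediate states for iterations 1..M
    """
    cum = np.cumsum(halting)
    # N(t): first iteration where the cumulative halting value reaches 1 - eps,
    # or the last available iteration if it never does.
    reached = np.nonzero(cum >= 1.0 - eps)[0]
    n = int(reached[0]) + 1 if reached.size else len(halting)
    # The final iteration gets the remainder, so the weights sum to one.
    remainder = 1.0 - (cum[n - 2] if n > 1 else 0.0)
    weights = np.append(halting[: n - 1], remainder)
    pooled = weights @ states[:n]
    ponder = n + remainder  # per-step ponder cost
    return pooled, weights, ponder

halting = np.array([0.3, 0.4, 0.5, 0.9])
states = np.eye(4)
pooled, weights, ponder = act_pool(halting, states)
# Three iterations are used here, with weights close to [0.3, 0.4, 0.3].
```

Because the weights are differentiable functions of the halting units, gradients can flow through the pooled state even though $N(t)$ itself is discrete.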

Results are presented on a variety of synthetic problems and a character prediction problem.

RandomOut: Using a convolutional gradient norm to win The Filter Lottery
Cohen, Joseph Paul and Lo, Henry Z. and Ding, Wei
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

Summary by Joseph Paul Cohen 9 years ago

Basically they observe a pattern they call The Filter Lottery (TFL) where the random seed causes a high variance in the training accuracy:

![](http://i.imgur.com/5rWig0H.png)

They use the convolutional gradient norm ($CGN$) \cite{conf/fgr/LoC015} to determine how much impact a filter has on the overall classification loss function by taking the derivative of the loss function with respect to each weight in the filter.

$$CGN(k) = \sum_{i} \left|\frac{\partial L}{\partial w^k_i}\right|$$

They use the CGN to evaluate the impact of a filter on the error, and re-initialize a filter when the gradient norm of its weights falls below a specific threshold.
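
A minimal sketch of the CGN formula and the re-initialization rule, assuming filter gradients are already available as an array of shape `(K, C, H, W)`. The Gaussian re-initialization with `scale=0.01`, the threshold value and the function names are placeholders for illustration, not the paper's settings.

```python
import numpy as np

def cgn(filter_grads):
    """CGN(k) = sum_i |dL/dw_i^k| for gradients of shape (K, C, H, W)."""
    return np.abs(filter_grads).reshape(filter_grads.shape[0], -1).sum(axis=1)

def reinit_weak_filters(weights, filter_grads, threshold, rng, scale=0.01):
    """Re-draw every filter whose CGN is below `threshold`.

    Returns the new weights and a boolean mask of the re-initialized filters.
    """
    weak = cgn(filter_grads) < threshold
    new_weights = weights.copy()
    new_weights[weak] = rng.normal(scale=scale, size=new_weights[weak].shape)
    return new_weights, weak

rng = np.random.default_rng(0)
weights = np.ones((3, 1, 2, 2))
grads = np.stack([
    np.full((1, 2, 2), 1.0),    # CGN = 4
    np.full((1, 2, 2), 1e-4),   # CGN = 4e-4, below the threshold
    np.full((1, 2, 2), -0.5),   # CGN = 2
])
new_weights, weak = reinit_weak_filters(weights, grads, threshold=1e-2, rng=rng)
# Only the second filter is re-drawn; the other two are left untouched.
```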

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

Summary by Joseph Paul Cohen 9 years ago

This method improves the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}:

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, called the RoI pooling layer, which scales the input so the images don't have to be scaled before being sent as an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each output gradient back to the input position that was the argmax of its sub-window. Gradients are summed when several sub-windows or RoIs selected the same input value.
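
Points 2 and 3 can be sketched for a single-channel feature map and a single RoI. The bin boundaries (floor for the start, ceiling for the end), the `(y0, x0, y1, x1)` RoI convention with exclusive end coordinates, and the function names are assumptions made for this example.

```python
import numpy as np

def _ceil_div(a, b):
    return -(-a // b)

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the RoI (y0, x0, y1, x1) of a 2-D map into an out_h x out_w grid.

    Also returns the argmax position of every output cell for the backward pass.
    """
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    out = np.empty((out_h, out_w), dtype=feature.dtype)
    argmax = np.empty((out_h, out_w, 2), dtype=np.int64)
    for i in range(out_h):
        ys, ye = y0 + (i * h) // out_h, y0 + _ceil_div((i + 1) * h, out_h)
        for j in range(out_w):
            xs, xe = x0 + (j * w) // out_w, x0 + _ceil_div((j + 1) * w, out_w)
            window = feature[ys:ye, xs:xe]
            dy, dx = np.unravel_index(np.argmax(window), window.shape)
            out[i, j] = window[dy, dx]
            argmax[i, j] = (ys + dy, xs + dx)
    return out, argmax

def roi_max_pool_backward(grad_out, argmax, feature_shape):
    """Route each output gradient to its argmax input, summing on collisions."""
    grad_in = np.zeros(feature_shape, dtype=grad_out.dtype)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            grad_in[argmax[i, j, 0], argmax[i, j, 1]] += grad_out[i, j]
    return grad_in

feature = np.arange(16, dtype=np.float64).reshape(4, 4)
pooled, argmax = roi_max_pool(feature, roi=(0, 0, 4, 4), out_h=2, out_w=2)
print(pooled)  # [[ 5.  7.] [13. 15.]]
grad_in = roi_max_pool_backward(np.ones((2, 2)), argmax, feature.shape)
```

The output size is fixed by `out_h` and `out_w` regardless of the RoI size, which is what lets RoIs of different shapes feed the same fully connected layers.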

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.

Curiosity-driven reinforcement learning with homeostatic regulation
Ildefons Magrans de Abril and Ryota Kanai
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.AI

Summary by Natalia Diaz Rodriguez, PhD 7 years ago

Exploring an environment with non-linearities in a continuous action space can be optimized by regulating the agent's curiosity with a homeostatic drive. This means that a heterostatic drive to move away from habitual states is blended with a homeostatic motivation to encourage actions that lead to states where the agent is familiar with a state-action pair.

This approach improves upon forward models and ICM (Pathak et al. 17) with an enhanced information gain that basically consists of the following: while the reward in \cite{Pathak17} is formulated as the forward model prediction error, the reward in this paper subtracts from the forward model prediction error the error obtained when knowing not only $s_t$ and $a_t$, but also $a_{t+1}$.
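
Read literally, the description above gives a reward of the following shape. This is only a sketch: the squared-error measure, the unweighted difference between the two terms, and the names `curiosity_reward`, `pred_forward` and `pred_extended` are assumptions, and the paper may scale or weight the terms differently.

```python
import numpy as np

def curiosity_reward(s_next, pred_forward, pred_extended):
    """Forward-model error minus the error of a model that also sees a_{t+1}.

    pred_forward:  prediction of s_{t+1} from (s_t, a_t)
    pred_extended: prediction of s_{t+1} from (s_t, a_t, a_{t+1})
    """
    err_forward = np.sum((s_next - pred_forward) ** 2)
    err_extended = np.sum((s_next - pred_extended) ** 2)
    return err_forward - err_extended

s_next = np.array([1.0, 2.0])
reward = curiosity_reward(
    s_next,
    pred_forward=np.array([0.0, 0.0]),   # squared error 5
    pred_extended=np.array([1.0, 1.0]),  # squared error 1
)
print(reward)  # 4.0
```

The reward is largest when the plain forward model is surprised but knowing the next action explains most of that surprise, which matches the "familiar state-action pair" intuition in the summary.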

Curiosity-driven reinforcement learning shows that an additional homeostatic drive enhances the information gain of a classical curious/heterostatic agent.

Implementation: They take advantage of a new Bellman-like equation of information gain and simplify the computation of the local rewards. It could help by prioritizing the exploration of the state-action space according to how hard it is to learn each region.

Background: The concept of homeostatic regulation in social robots was first proposed in Breazeal et al. 04. The authors extend existing approaches by compensating the heterostatic drive encouraged by the curiosity reward with an additional homeostatic drive. 1) The first component implements the heterostatic drive (the same as in Pathak et al. 17). In other words, it refers to the tendency to push the agent away from its habitual state. 2) Homeostatic motivation: the second component is the paper's novel contribution. It encourages taking actions $a_t$ that lead to future states $s_{t+1}$ where the corresponding future action $a_{t+1}$ gives additional information about $s_{t+1}$. This situation happens when the agent is "familiar" with the state-action pair $\{s_{t+1}, a_{t+1}\}$.

The article lacks an exact comparison with Pathak et al. on a common task. In this paper, the task consists of a 3-room navigation map that is used to measure exploration.