ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. 

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAp@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAp@0.5
* PASCAL VOC 2012: 83.8% mAp@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)

dx.doi.org
sci-hub
scholar.google.com

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015} called the RoI pooling layer that scales the input so the images don't have to be scaled before being set an an input image to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop is performed for the RoI pooling layer by taking the argmax of the incoming gradients that overlap the incoming values.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}

arxiv.org
scholar.google.com

RandomOut: Using a convolutional gradient norm to win The Filter Lottery
Cohen, Joseph Paul and Lo, Henry Z. and Ding, Wei
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 9 years ago

Basically they observe a pattern they call The Filter Lottery (TFL) where the random seed causes a high variance  in the training accuracy:

![](http://i.imgur.com/5rWig0H.png)

They use the convolutional gradient norm ($CGN$) \cite{conf/fgr/LoC015} to determine how much impact a filter has on the overall classification loss function by taking the derivative of the loss function with respect each weight in the filter.

$$CGN(k) = \sum_{i} \left|\frac{\partial L}{\partial w^k_i}\right|$$

They use the CGN to evaluate the impact of a filter on error, and re-initialize filters when the gradient norm of its weights falls below a specific threshold.

arxiv.org
scholar.google.com

Adaptive Computation Time for Recurrent Neural Networks
Graves, Alex
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Hugo Larochelle 9 years ago

This paper proposes a neural architecture that allows to backpropagate gradients though a procedure that can go through a variable and adaptive number of iterations. These "iterations" for instance could be the number of times computations are passed through the same recurrent layer (connected to the same input) before producing an output, which is the case considered in this paper.

This is essentially achieved by pooling the recurrent states and respective outputs computed by each iteration. The pooling mechanism is essentially the same as that used in the really cool Neural Stack architecture of Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman and Phil Blunsom \cite{conf/nips/GrefenstetteHSB15}. It relies on the introduction of halting units, which are sigmoidal units computed at each iteration and which gives a soft weight on whether the computation should stop at the current iteration.

Crucially, the paper introduces a new ponder cost $P(x)$, which is a regularization cost that penalizes what is meant to be a smooth upper bound on the number of iterations $N(t)$ (more on that below).

The paper presents experiment on RNNs applied on sequences where, at each time step t (not to be confused with what I'm calling computation iterations, which are indexed by n) in the sequence the RNN can produce a variable number $N(t)$ of intermediate states and outputs. These are the states and outputs that are pooled, to produce a single recurrent state and output for the time step t. During each of the $N(t)$ iterations at time step t, the intermediate states are connected to the same time-step-t input. After the $N(t)$ iterations, the RNN pools the $N(t)$ intermediate states and outputs, and then moves to the next time step $t+1$. To mark the transitions between time steps, an extra binary input is appended, which is 1 only for the first intermediate computation iteration.

Results are presented on a variety of synthetic problems and a character prediction problem.

arxiv.org
arxiv-vanity.com
scholar.google.com

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford and Luke Metz and Soumith Chintala
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV
more

[link] Summary by Shagun Sodhani 8 years ago

# Deep Convolutional Generative Adversarial Nets

## Introduction

* The paper presents Deep Convolutional Generative Adversarial Nets (DCGAN) - a topologically constrained variant of conditional GAN.
* [Link to the paper](https://arxiv.org/abs/1511.06434)

## Benefits

* Stable to train
* Very useful to learn unsupervised image representations.

## Model

* GANs difficult to scale using CNNs.
* Paper proposes following changes to GANs:
* Replace any pooling layers with strided convolutions (for discriminator) and fractional strided convolutions (for generators).
* Remove fully connected hidden layers.
* Use batch normalisation in both generator (all layers except output layer) and discriminator (all layers except input layer).
* Use LeakyReLU in all layers of the discriminator.
* Use ReLU activation in all layers of the generator (except output layer which uses Tanh).

## Datasets

* Large-Scale Scene Understanding.
* Imagenet-1K.
* Faces dataset.

## Hyperparameters

* Minibatch SGD with minibatch size of 128.
* Weights initialized with 0 centered Normal distribution with standard deviation = 0.02
* Adam Optimizer
* Slope of leak = 0.2 for LeakyReLU.
* Learning rate = 0.0002, β1 = 0.5

## Observations

* Large-Scale Scene Understanding data
* Demonstrates that model scales with more data and higher resolution generation.
* Even though it is unlikely that model would have memorized images (due to low learning rate of minibatch SGD).
* Classifying CIFAR-10 dataset
* Features
* Train in Imagenet-1K and test on CIFAR-10.
* Max pool discriminator's convolutional features (from all layers) to get 4x4 spatial grids.
* Flatten and concatenate to get a 28672-dimensional vector.
* Linear L2-SVM classifier trained over the feature vector.
* 82.8% accuracy, outperforms K-means (80.6%)
* Street View House Number Classifier
* Similar pipeline as CIFAR-10
* 22.48% test error.
* The paper contains many examples of images generated by final and intermediate layers of the network.
* Images in the latent space do not show sharp transitions indicating that network did not memorize images.
* DCGAN can learn an interesting hierarchy of features.
* Networks seems to have some success in disentangling image representation from object representation.
* Vector arithmetic can be performed on the Z vectors corresponding to the face samples to get results like `smiling woman - normal woman + normal man = smiling man` visually.