Summary by Tiago Vinhoza
#### Goal:
+ Reformulate the neural network architecture to address the degradation problem that appears when the number of layers becomes very large.
#### Motivation:
![Motivation](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/He2015_motivation.png?raw=true "Motivation")
+ Degradation problem:
+ Increasing the depth of the network: accuracy first saturates and then starts degrading.
+ As the number of layers increases, the training error gets higher.
+ In theory, this problem should not occur: given a trained network, one can always build a deeper one by adding layers that perform identity mappings, so the deeper network should do no worse. In practice, optimization algorithms apparently have difficulty finding these solutions in feasible time.
#### Residual Block:
+ Layers are reformulated as learning residual functions with reference to the layer inputs.
![Residual block](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/He2015_framework.png?raw=true "Residual block")
+ Residual mapping: the stacked layers fit F(x) = H(x) - x instead of fitting the desired mapping H(x) directly.
+ Shortcut connections across pairs of layers perform the identity mappings.
+ Identity shortcut connections do not add extra parameters or complexity to the network.
+ The idea can be seen as follows. Given the activation a[L] of layer L of a neural net, the activation at layer L+2, with a shortcut connection from layer L, is
a[L+2] = ReLU(W[L+2] * a[L+1] + b[L+2] + a[L])
where W[L+2] is the weight matrix and b[L+2] is the bias vector at layer L+2.
+ Learning an identity mapping becomes easy in this formulation: if weight decay drives W[L+2] and b[L+2] to zero, then a[L+2] = ReLU(a[L]) = a[L], since a[L] is already non-negative.
+ One should take care to match the dimensions: a linear projection of the previous activation can be applied before the sum.
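A minimal sketch of such a residual block in PyTorch (illustrative, not the authors' implementation; the 1x1 projection shortcut corresponds to the dimension-matching case mentioned above):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: out = ReLU(F(x) + shortcut(x))."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # When dimensions change, a 1x1 linear projection matches them;
        # otherwise the shortcut is a parameter-free identity.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # a[L+2] = ReLU(residual + a[L]): the skipped layers learn the residual.
        return F.relu(out + self.shortcut(x))
```

For example, `ResidualBlock(64, 64)` keeps the parameter-free identity shortcut, while `ResidualBlock(64, 128, stride=2)` halves the spatial size and uses the projection.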
#### Datasets:
+ For the image classification task, two datasets were used: ImageNet and CIFAR-10
 | ImageNet | CIFAR-10
----|-----|-----
Training images | 1.2M | 50K
Validation images | 50K | (*)
Testing images | 100K | 10K
Number of classes | 1000 | 10
(*) In the experiments with CIFAR-10, the 50K training images are split into 45K/5K training/validation sets.
#### Experiments and Results:
**ImageNet Dataset**
+ Input images:
+ Scale jittering as in [Simonyan2015](https://github.com/tiagotvv/ml-papers/blob/master/convolutional/Very_Deep_Convolutional_Networks_for_Large_Scale_Image_Recognition.md): the image is resized with its shorter side sampled from [256, 480].
+ 224x224 crop from image is used.
+ Data augmentation following the [Krizhevsky2012](https://github.com/tiagotvv/ml-papers/blob/master/convolutional/ImageNet_Classification_with_Deep_Convolutional_Neural_Networks.md) methodology: horizontal flips and altering RGB intensities (a sketch of the pipeline follows).
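A rough sketch of this input pipeline, assuming torchvision (`ColorJitter` is only a stand-in for the PCA-based RGB augmentation of Krizhevsky2012, which torchvision does not provide directly):

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

class RandomShorterSideResize:
    """Scale jittering: resize so the shorter side is sampled from [256, 480]."""
    def __call__(self, img):
        return TF.resize(img, random.randint(256, 480))

train_transform = transforms.Compose([
    RandomShorterSideResize(),              # scale jittering (Simonyan2015)
    transforms.RandomCrop(224),             # random 224x224 crop
    transforms.RandomHorizontalFlip(),      # horizontal flips (Krizhevsky2012)
    transforms.ColorJitter(0.4, 0.4, 0.4),  # stand-in for PCA-based RGB jitter
    transforms.ToTensor(),
])
```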
+ Training
+ Weight initialization: follows previous work by the authors.
+ SGD with momentum = 0.9 and weight decay = 0.0001; batch normalization is used throughout the network.
+ Mini-batch size = 256
+ Learning rate starts at 0.1 and is divided by 10 when the validation accuracy stops improving (see the sketch below).
+ Dropout is not employed
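A minimal sketch of these optimizer settings in PyTorch (the stand-in model, epoch count, and validation-accuracy plumbing are placeholders, not from the paper):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for a ResNet; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 when validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)

for epoch in range(90):    # epoch count is illustrative
    # ... run one training epoch, then measure validation accuracy ...
    val_acc = 0.0          # placeholder for the measured accuracy
    scheduler.step(val_acc)  # the scheduler watches the monitored metric
```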
+ Testing
+ Multi-crop procedure from [Krizhevsky2012](https://github.com/tiagotvv/ml-papers/blob/master/convolutional/ImageNet_Classification_with_Deep_Convolutional_Neural_Networks.md) is employed: 10 crops.
+ Fully connected layers are converted into convolutional layers.
+ Class scores are averaged over multiple testing scales, with the shorter side in {224, 256, 384, 480, 640} (see the sketch below).
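A sketch of the multi-scale score averaging, assuming a network already converted to fully convolutional form (the stand-in model and image below are placeholders):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms import functional as TF

# Stand-ins, for illustration only: a real run would use the trained ResNet
# with its final fully connected layer converted to a 1x1 convolution.
model = torch.nn.Conv2d(3, 1000, kernel_size=1)
img = Image.new("RGB", (640, 480))

model.eval()
scales = [224, 256, 384, 480, 640]
with torch.no_grad():
    per_scale = []
    for s in scales:
        x = TF.to_tensor(TF.resize(img, s)).unsqueeze(0)  # shorter side -> s
        score_map = model(x)                   # N x 1000 x H' x W' score map
        scores = score_map.mean(dim=(-2, -1))  # average over spatial positions
        per_scale.append(F.softmax(scores, dim=1))
    final = torch.stack(per_scale).mean(dim=0)  # average across the 5 scales
print(final.argmax(dim=1))  # predicted class index
```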
+ Configurations tested on ImageNet dataset
![Architectures](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/He2015_architectures.png?raw=true "Architectures")
+ Single Model Results (validation set):
Architecture | top-1 error (%) | top-5 error (%)
----|:-----:|:-----:
VGG (ILSVRC'14) | - | 8.43
GoogLeNet (ILSVRC'14) |- | 7.89
VGG (v5) | 24.4 | 7.1
PReLU-net | 21.59 | 5.71
BN-inception | 21.99 | 5.81
ResNet-34 B (projection shortcuts for increasing dimensions, identities otherwise) | 21.84 | 5.71
ResNet-34 C (all shortcuts are projections) | 21.53 | 5.60
ResNet-50 | 20.74 | 5.25
ResNet-101| 19.87 | 4.60
ResNet-152 | **19.38** | **4.49**
+ Ensemble Models Results (test set):
Architecture | top-5 error (%)
----|:-----:
VGG (ILSVRC'14) | 7.32
GoogLeNet (ILSVRC'14) | 6.66
VGG (v5) | 6.8
PReLU-net | 4.94
BN-Inception | 4.82
ResNet (ILSVRC'15) | **3.57**
**CIFAR-10 Dataset**
+ Input images: 32x32
+ Configurations tested on this dataset:
output map size | 32x32 | 16x16 | 8x8
----------------|-------|-------|----
num. layers | 1+2n | 2n | 2n
num. filters | 16 | 32 | 64
+ Shortcuts are connected to pairs of 3x3 layers (3n shortcuts in total); the total depth is 6n+2 weighted layers (see the sketch after the results below).
+ Training
+ Weight initialization: follows previous work by the authors.
+ SGD with momentum = 0.9 and weight decay = 0.0001; batch normalization is used throughout the network.
+ Mini-batch size = 128, trained on 2 GPUs.
+ Learning rate starts at 0.1 and is divided by 10 at 32k and 48k iterations; training stops at 64k iterations (see the sketch below).
+ Dropout is not employed
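A sketch of this iteration-based schedule in PyTorch (the stand-in model is a placeholder; note the schedule steps per mini-batch, not per epoch):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the CIFAR ResNet; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Milestones are counted in iterations (mini-batches), not epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[32000, 48000], gamma=0.1)

for iteration in range(64000):  # stop at 64k iterations
    # ... one SGD step on a mini-batch of 128 images ...
    scheduler.step()            # advance the schedule once per iteration
```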
+ Testing
+ Single 32x32 image
+ Results
![CIFAR-10 results](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/He2015_CIFAR.png?raw=true "CIFAR-10 results")
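The depth bookkeeping implied by the configuration table above: each map size contributes 2n convolutional layers, and the initial convolution (the "1" in 1+2n) plus the final fully connected layer add 2 more, giving 6n+2 weighted layers in total. A quick check in Python:

```python
# Total depth of the CIFAR-10 ResNets: 6n + 2 weighted layers
# (1 initial conv + 3 stages of 2n conv layers each + 1 final FC layer).
for n in [3, 5, 7, 9, 18]:
    print(f"n = {n:2d} -> {6 * n + 2}-layer ResNet")
# n =  3 -> 20-layer ResNet
# n =  5 -> 32-layer ResNet
# n =  7 -> 44-layer ResNet
# n =  9 -> 56-layer ResNet
# n = 18 -> 110-layer ResNet
```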