[link]
This paper is about convolutional neural networks for computer vision. It was the first breakthrough result on the ImageNet classification challenge (ILSVRC, 1000 classes; results are reported on ILSVRC-2010 and ILSVRC-2012). ReLU activations were a key ingredient that had not been widely used before. The paper also applies dropout in the first two fully connected layers.

## Training details

* Momentum of 0.9
* Learning rate of $\varepsilon$ (initialized at 0.01)
* Weight decay of $0.0005 \cdot \varepsilon$
* Batch size of 128
* Training took 5 to 6 days on two NVIDIA GTX 580 3GB GPUs.

## See also

* [Stanford presentation](http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf)
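The momentum, weight decay, and learning rate listed above combine into the single weight update given in the paper ($v \leftarrow 0.9\,v - 0.0005\,\varepsilon\,w - \varepsilon\,\partial L/\partial w$, then $w \leftarrow w + v$). Below is a minimal NumPy sketch of that rule; the function name `sgd_step` and the toy quadratic loss are illustrative, not from the paper.

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One step of the paper's update rule:
    v <- 0.9*v - 0.0005*lr*w - lr*grad,  then  w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

# Toy usage on a scalar weight with a dummy loss L(w) = w**2.
w, v = np.array(1.0), np.array(0.0)
for _ in range(5):
    grad = 2.0 * w          # dL/dw for the dummy loss
    w, v = sgd_step(w, v, grad)
```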
[link]
Deep convolutional neural networks (DCNNs) have been a popular model for image classification over the last few years. This paper proposes a DCNN architecture, also known as AlexNet, for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). To train AlexNet, which has 60 million parameters, the paper uses Rectified Linear Units (ReLU) and multiple GPUs to accelerate training. It also reports that local response normalization and overlapping pooling reduce the error rate. To prevent overfitting, the authors use data augmentation and apply dropout in the fully connected layers.

## Technical details

The following figure shows the architecture of AlexNet. It contains five convolutional and three fully connected layers. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow the first and second response-normalization layers as well as the fifth convolutional layer.

![](http://i.imgur.com/2iqwCq1.png)
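As a companion to the figure, here is a minimal single-tower PyTorch sketch of the layer stack just described (96/256/384/384/256 convolutional kernels with response normalization and overlapping max-pooling where noted, then three fully connected layers). The class name `AlexNetSketch` is mine, and the original two-GPU split and weight initialization are omitted, so this is a sketch rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Single-tower sketch of the eight-layer stack described above.
    Filter counts follow the paper (96-256-384-384-256 convolutional
    kernels, two 4096-unit fully connected layers, 1000-way output)."""

    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),            # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# The 224x224 crop size quoted in the paper needs 227x227 inputs (or extra
# padding) for the arithmetic above to yield a 256 x 6 x 6 feature map.
logits = AlexNetSketch()(torch.randn(2, 3, 227, 227))     # shape: (2, 1000)
```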
[link]
This paper introduces a deep convolutional neural network (CNN) architecture that achieved record-breaking performance in the 2012 ImageNet LSVRC. Notably, it brings together a number of neat ideas in an end-to-end, trainable model.

Main contributions:

- Achieves state-of-the-art performance in ILSVRC-2012.
- Makes available an efficient, parallelized GPU implementation of the model.
- Describes in detail the features of the model that help improve performance and reduce training time, along with extensive ablative studies.
- Uses data augmentation and dropout to prevent overfitting.

## Strengths

- Uses (and popularizes) ReLUs instead of tanh as the non-linear activation, which makes training about six times faster.
- Uses local response normalization and overlapping pooling.
- Data augmentation
    - Extracts random crops and performs image translations and horizontal reflections while maintaining the label distribution.
    - Alters RGB pixel values by performing PCA on the training-set pixels and adding multiples of the principal components, scaled by the corresponding eigenvalues times a random variable drawn from a Gaussian, to each image. This provides invariance to changes in the intensity and color of illumination.
- Dropout prevents overfitting. Randomly drops half of the neurons in the fully connected layers, and can be interpreted as averaging over exponentially many dropout networks.

## Weaknesses / Notes

- Lacks theoretical insight. Design decisions are motivated solely by results.
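The PCA-based color augmentation mentioned above is simple enough to sketch. Below is a NumPy version, assuming the 3x3 RGB covariance is estimated once from a sample of training-set pixels; the function names `rgb_pca` and `pca_color_jitter` are mine, while the Gaussian standard deviation of 0.1 follows the paper.

```python
import numpy as np

def rgb_pca(pixels):
    """Estimate RGB principal components from an N x 3 array of training pixels."""
    cov = np.cov(pixels, rowvar=False)           # 3x3 RGB covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs = components
    return eigvecs, eigvals

def pca_color_jitter(image, eigvecs, eigvals, sigma=0.1, rng=None):
    """Add the per-image color shift described in the paper:
    every pixel gets + eigvecs @ (alpha * eigvals), alpha ~ N(0, sigma),
    drawn once per image. `image` is an H x W x 3 float array."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.normal(0.0, sigma, size=3)
    shift = eigvecs @ (alpha * eigvals)          # 3-vector added to every pixel
    return image + shift

# Usage (illustrative): compute the statistics once, then jitter each image.
# eigvecs, eigvals = rgb_pca(all_training_pixels)
# augmented = pca_color_jitter(img.astype(float), eigvecs, eigvals)
```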
[link]
#### Goal:

+ Train a deep convolutional neural network to classify 1.2 million images into 1000 different categories.

#### Convolutional Neural Networks:

+ Make strong and mostly correct assumptions about the nature of images (stationarity of statistics, locality of pixel dependencies).
+ Much fewer connections and parameters: easier to train than fully connected neural networks.

#### Dataset

+ ImageNet: 15 million labeled high-resolution images from 22000 categories, labeled manually using Amazon Mechanical Turk.
+ ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
    + 1.2 million training images, 50000 validation images, 150000 test images.
    + 1000 categories.
    + Variable-resolution images, downsampled to a fixed resolution of 256 x 256.

#### Architecture:

+ 8 layers: 5 convolutional and 3 fully-connected, with a 1000-way softmax at the output.

![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_architecture.png?raw=true "Architecture")

**Methodology**

+ ReLU activation function: trains several times faster than tanh units.
    + Faster learning has a large influence on the performance of large models trained on large datasets.
+ Training on multiple GPUs.
+ Local Response Normalization (a NumPy sketch appears at the end of this summary).
    + Mimics a form of lateral inhibition found in real neurons.
    + Applied after ReLU in the 1st and 2nd convolutional layers.
    + Improves top-1 and top-5 error rates by 1.4% and 1.2%.
+ Overlapping pooling
    + Neighborhood z = 3 and stride s = 2.
    + Max-pooling employed after response normalization in the 1st and 2nd convolutional layers, as well as after the 5th convolutional layer.
+ Reducing overfitting
    + Data augmentation
        + Generate image translations and horizontal reflections.
        + Alter the intensities of the RGB channels.
    + Dropout
        + Used in the first two fully-connected layers, with p(keep) = 0.5.
+ Learning
    + Stochastic gradient descent, batch size = 128, momentum = 0.9, weight decay = 0.0005.
    + Weights initialized from a Gaussian distribution with mean = 0 and standard deviation = 0.01.
    + Biases in the 2nd, 4th, and 5th convolutional layers initialized as 1. This accelerated learning because the ReLUs were fed positive inputs from the start.
    + Biases in the remaining layers initialized as 0.
    + Learning rate ($\epsilon$)
        + Equal for all layers.
        + Adjusted manually (divided by 10 when the validation error stopped decreasing).
        + Initialized at 0.01 and reduced 3 times during training.
    + Update equations: ![Update equations](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_update.png?raw=true "Update equations")
+ Trained for 90 epochs (5-6 days on two NVIDIA GTX 580 3GB GPUs).

#### Results

+ Results on ILSVRC-2010 images
    + Baselines: sparse coding and Fisher vectors

Model | Top-1 | Top-5
------|-------|-------
Sparse Coding | 47.1% | 28.2%
SIFT + FVs | 45.7% | 25.7%
CNN | 37.5% | 17.0%

+ Results on ILSVRC-2012

Model | Top-1 (val) | Top-5 (val) | Top-5 (test)
------|-------------|-------------|-------------
SIFT + FVs | -- | -- | 26.2%
1 CNN | 40.7% | 18.2% | --
5 CNNs | 38.1% | 16.4% | 16.4%
1 CNN* | 39.0% | 16.6% | --
7 CNNs* | 36.7% | 15.4% | 15.3%

CNN* are convolutional neural networks pretrained on the ImageNet 2011 Fall release and fine-tuned on ILSVRC-2012 training data.
+ Qualitative assessment
    + Convolutional kernels showed *specialization*: ![Kernels](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_weights.png?raw=true "Convolutional kernels from 1st layer")
    + Most of the top-5 labels were reasonable.
    + Image similarity based on the feature activations induced at the last fully connected layer: ![Qualitative Assessment](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_qualitative.png?raw=true "Qualitative assessment")

#### Caveat:

+ Most of the choices made in the paper were based on experimental results. There is not much theory behind them.
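As referenced in the response-normalization bullet above, here is a minimal NumPy sketch of cross-channel local response normalization, using the hyperparameters reported in the paper (n = 5, k = 2, alpha = 1e-4, beta = 0.75); the function name is illustrative.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel response normalization as defined in the paper:
    b[i] = a[i] / (k + alpha * sum over n neighboring channels of a[j]**2) ** beta.
    `a` is an activation map of shape (C, H, W)."""
    C = a.shape[0]
    b = np.empty_like(a, dtype=float)
    half = n // 2
    for i in range(C):
        lo, hi = max(0, i - half), min(C, i + half + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Example on a random 96-channel activation map (e.g. after the 1st conv layer).
b = local_response_norm(np.random.rand(96, 55, 55))
```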