#### Goal:
+ Train a deep convolutional neural network to classify 1.2 million images into 1000 different categories.
#### Convolutional Neural Networks:
+ Make strong and mostly correct assumptions about the nature of images (stationarity of statistics, locality of pixel dependencies).
+ Far fewer connections and parameters, which makes them easier to train than fully-connected neural networks of similar size.
#### Dataset
+ ImageNet: 15 million labeled high-resolution images in roughly 22,000 categories, labeled manually via Amazon Mechanical Turk.
+ ImageNet Large-Scale Visual Recognition Challenge (ILSVRC): subset of ImageNet
+ 1.2 million training images, 50,000 validation images, 150,000 test images.
+ 1000 categories
+ Variable resolution images:
+ Images downsampled to a fixed resolution of 256 x 256.
#### Architecture:
+ 8 layers: 5 convolutional and 3 fully-connected, 1000-way softmax at the output.
![Architecture](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_architecture.png?raw=true "Architecture")
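A minimal single-GPU PyTorch sketch of this layout (an assumed re-implementation, not the authors' code: filter counts and the conv/norm/pool ordering follow the paper, but the padding values are chosen to reproduce the published feature-map sizes, and the two-GPU split of the layers is ignored):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """5 convolutional layers + 3 fully-connected layers, 1000-way output."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling (z=3, s=2)
            nn.Conv2d(96, 256, kernel_size=5, padding=2),             # conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),            # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),            # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),            # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),                             # fc6
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),                                    # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                             # fc8 (softmax applied by the loss)
        )

    def forward(self, x):                         # x: (batch, 3, 224, 224)
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```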
#### Methodology
+ ReLU activation function: networks with ReLUs train several times faster than equivalent networks with tanh units.
+ Faster learning has a large influence on the performance of large models trained on large datasets.
+ Training on multiple GPUs: the network is split across two GPUs, which communicate only in certain layers.
+ Local Response Normalization
+ mimics a form of lateral inhibition found in real neurons.
+ applied after the ReLU in the 1st and 2nd convolutional layers.
+ reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively; the normalization expression is reproduced below.
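As given in the paper, the normalization divides each ReLU output $a^i_{x,y}$ (kernel $i$, position $(x,y)$) by a sum over $n$ adjacent kernel maps out of the layer's $N$ maps, with constants $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$:

$$b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} \big(a^j_{x,y}\big)^2 \right)^{\beta}$$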
+ Overlapping pooling
+ Neighborhood z = 3 and stride s = 2.
+ Max-pooling is employed in the 1st and 2nd convolutional layers (after the response normalization) as well as after the 5th convolutional layer.
+ Reducing Overfitting
+ Data Augmentation
+ Generate image translations (random 224 x 224 crops of the 256 x 256 images) and their horizontal reflections.
+ Alter the intensities of the RGB channels by adding multiples of their principal components (PCA over the training-set pixel values); both schemes are sketched after this list.
+ Dropout
+ Applied in the first two fully-connected layers with p(keep) = 0.5.
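A sketch of the two augmentation schemes using PyTorch/torchvision (an assumed re-implementation; `eigvals` and `eigvecs` would come from a PCA over the RGB pixel values of the training set, and the paper's test-time averaging over ten 224 x 224 patches is not shown):

```python
import torch
import torchvision.transforms as T

# Random 224x224 crops of the 256x256 images plus horizontal reflections.
train_transform = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def pca_color_jitter(img, eigvals, eigvecs, sigma=0.1):
    """Add random multiples of the RGB principal components to every pixel.

    img: float tensor (3, H, W); eigvals: (3,); eigvecs: (3, 3) with the
    principal components as columns. Magnitudes are eigenvalue-scaled and
    drawn once per image from N(0, sigma^2), as described in the paper.
    """
    alpha = torch.randn(3) * sigma
    rgb_shift = eigvecs @ (alpha * eigvals)      # per-channel offset, shape (3,)
    return img + rgb_shift.view(3, 1, 1)
```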
+ Learning
+ Stochastic Gradient Descent, batch size = 128, momentum = 0.9, weight decay = 0.0005
+ Weights initialized from a zero-mean Gaussian distribution with standard deviation 0.01.
+ Biases in the 2nd, 4th, and 5th convolutional layers (and in the fully-connected hidden layers) initialized to 1; this accelerated early learning by feeding the ReLUs positive inputs from the start.
+ Biases in the remaining layers initialized to 0.
+ Learning rate ($\epsilon$)
+ Equal for all layers
+ Adjusted manually (divided by 10 when the validation error stopped decreasing).
+ Initialized at 0.01 and reduced 3 times during training.
![Update equations](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_update.png?raw=true "Update equations")
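Transcribed from the figure ($v$ is the momentum variable, $\epsilon$ the learning rate, $i$ the iteration index, and $D_i$ the $i$-th batch of 128 examples):

$$v_{i+1} := 0.9\,v_i \;-\; 0.0005\,\epsilon\,w_i \;-\; \epsilon \left\langle \frac{\partial L}{\partial w}\bigg|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} := w_i + v_{i+1}$$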
+ Trained for roughly 90 epochs (5 to 6 days on two NVIDIA GTX 580 3GB GPUs).
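The recipe above, sketched with standard PyTorch components (assumed, not the authors' code; `AlexNetSketch` refers to the architecture sketch earlier, and PyTorch's SGD couples weight decay to the gradient slightly differently from the paper's update rule):

```python
import torch
import torch.nn as nn

model = AlexNetSketch()

# Weights ~ N(0, 0.01); all biases zeroed here for brevity (the paper sets the
# biases of conv2, conv4, conv5 and the fully-connected hidden layers to 1).
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

# SGD: batch size 128, momentum 0.9, weight decay 0.0005, initial lr 0.01,
# divided by 10 whenever the validation error stops improving.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
# after each epoch: scheduler.step(validation_error)
```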
#### Results
+ Results on ILSVRC-2010 images
+ Baselines: sparse coding and SIFT + Fisher vectors (FVs)
Model | Top-1 error | Top-5 error
------|-------|-------
Sparse Coding | 47.1% | 28.2%
SIFT + FVs | 45.7% | 25.7%
CNN | 37.5% | 17.0%
+ Results on ILSVRC-2012
Model | Top-1 error (val) | Top-5 error (val) | Top-5 error (test)
------|-------|-------|-------
SIFT + FVs | -- | -- | 26.2%
1 CNN | 40.7% | 18.2% | --
5 CNNs | 38.1% | 16.4% | 16.4%
1 CNN* | 39.0% | 16.6% | --
7 CNNs* | 36.7% | 15.4% | 15.3%
CNN* denotes convolutional neural networks pre-trained on the ImageNet Fall 2011 release and fine-tuned on the ILSVRC-2012 training data.
+ Qualitative assessment
+ First-layer convolutional kernels showed *specialization*: one GPU learned largely color-agnostic, frequency- and orientation-selective kernels, the other largely color-specific blobs.
![Kernels](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_weights.png?raw=true "Convolutional kernels from 1st layer")
+ Most of the top-5 labels were reasonable.
+ Image similarity was assessed via the Euclidean distance between the feature activations induced at the last hidden fully-connected layer (a minimal retrieval sketch follows the figure):
![Qualitative Assessment](https://raw.githubusercontent.com/tiagotvv/ml-papers/master/convolutional/images/Krizhevsky2012_qualitative.png?raw=true "Qualitative assessment")
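A minimal sketch of that similarity measure (assumed helper, not from the paper; `gallery_feats` would hold the 4096-dimensional last-hidden-layer activations for a set of images):

```python
import torch

def retrieve_similar(query_feat, gallery_feats, k=5):
    """Return the indices of the k images whose last-hidden-layer activations
    are closest (in Euclidean distance) to the query's.

    query_feat: (4096,) tensor; gallery_feats: (N, 4096) tensor.
    """
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.topk(dists, k, largest=False).indices
```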
#### Caveat:
+ Most of the choices made in the paper are justified by experimental results; there is not much theory behind them.