Summary by Martin Thoma
The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) performing occlusion sensitivity analysis.
## Analysis techniques
### Visualization
A multi-layer deconvolutional network (deconvnet) is used to project feature activations back into pixel space, showing which input pattern originally caused a given activation in the feature maps. A deconvnet attached to a layer $L_i$ maps that layer's activations back to the input of $L_i$ by unpooling, rectifying, and applying transposed versions of the convnet's own learned filters, so no separate training is required. This is repeated layer by layer until pixel space is reached.
The deconvnet needs a special **unpooling layer**: during the forward pass, each max-pooling layer records the locations of its maxima in *switch* variables, and unpooling uses these switches to place the reconstructed activations back at the correct positions.
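A minimal sketch of one such layer pair in PyTorch (tensor shapes and filter count are illustrative; `max_pool2d` with `return_indices=True` plays the role of the switches):

```python
import torch
import torch.nn.functional as F

# One convnet layer: convolution -> ReLU -> max-pooling with switches.
x = torch.randn(1, 3, 224, 224)   # input image (illustrative size)
w = torch.randn(96, 3, 7, 7)      # stands in for the layer's learned filters
a = F.relu(F.conv2d(x, w, stride=2, padding=1))
pooled, switches = F.max_pool2d(a, kernel_size=2, stride=2,
                                return_indices=True)

# The matching deconvnet step projects activations back toward pixel
# space: unpool using the switches, rectify, then apply transposed
# versions of the same filters -- no separate training is involved.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2, stride=2)
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, w, stride=2, padding=1)
```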
### Occlusion sensitivity analysis
* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
  * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride)
  * At each $(x, y)$, either ...
    * (d) ... place a pixel which color-encodes the probability of the correct class, or
    * (e) ... place a pixel which color-encodes the most probable class

The letters match panels (d) and (e) of the corresponding figure in the paper; a code sketch of variant (d) follows.
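A minimal sketch of the occlusion sweep, assuming a PyTorch classifier `model` that returns logits; the square size, stride, and gray value here are illustrative choices, not the paper's exact settings:

```python
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class, square=32, stride=8, gray=0.5):
    """Slide a gray square over `image` (C x H x W) and record the
    probability of `target_class` for every occluder position."""
    _, h, w = image.shape
    ys = list(range(0, h - square + 1, stride))
    xs = list(range(0, w - square + 1, stride))
    heatmap = torch.zeros(len(ys), len(xs))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = image.clone()
            occluded[:, y:y + square, x:x + square] = gray  # Occlude(I, x, y)
            probs = model(occluded.unsqueeze(0)).softmax(dim=1)
            heatmap[i, j] = probs[0, target_class]          # variant (d)
    return heatmap  # low values mark regions the classifier relies on
```

Variant (e) would store `probs[0].argmax()` per position instead of the probability of the correct class.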
The following images from the Zeiler & Fergus paper visualize this pretty well:
If the dog's face is occluded, the probability of the correct class drops sharply:
![Imgur](http://i.imgur.com/Q1Ama2z.png)
If the dog's face is occluded, the most likely class suddenly becomes "tennis ball" instead of "Pomeranian":
![Imgur](http://i.imgur.com/5QYKh7b.png)
See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).
## How visualization helped to construct ZF-Net
* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> Lower filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> Lower stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct.
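In PyTorch terms, the first-layer change looks roughly like this (a sketch; padding and the rest of the network are omitted):

```python
import torch.nn as nn

# AlexNet's first layer: 96 large 11x11 filters applied with stride 4.
conv1_alexnet = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# ZF-Net's first layer: 96 smaller 7x7 filters with stride 2, chosen
# after the visualizations revealed missing mid frequencies and aliasing.
conv1_zfnet = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```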
## ZF-Net
Zeiler and Fergus also created a new network for ImageNet.
The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers.
Training setup (sketched in code after this list):
* **Preprocessing**: Resize the smallest dimension to 256, subtract the per-pixel mean, crop a $224 \times 224$ region
* **Optimization**: Mini-Batch SGD, learning rate $= 10^{-2}$, momentum = $0.9$, 70 epochs
* **Resources**: took around 12 days on a single GTX580 GPU
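A minimal sketch of this setup, assuming PyTorch/torchvision; the per-channel normalization approximates the paper's per-pixel mean subtraction, and AlexNet stands in for ZF-Net, which torchvision does not ship:

```python
import torch
import torchvision
import torchvision.transforms as T

# Preprocessing as described above (per-channel mean subtraction is a
# common simplification of the paper's per-pixel mean image).
preprocess = T.Compose([
    T.Resize(256),        # resize the smallest dimension to 256
    T.CenterCrop(224),    # crop a 224 x 224 region
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),
])

# Optimization as described above; AlexNet is only a stand-in network.
model = torchvision.models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```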
The network was evaluated on
* ImageNet 2012: 14.8% top-5 error
* Caltech-101: $86.5\% \pm 0.5$ accuracy (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ accuracy (pretrained on ImageNet)
## Minor errors
* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)