Summary by Martin Thoma
The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) performing occlusion sensitivity analysis.
## Analysis techniques
### Visualization
A multi-layer deconvolutional network (deconvnet) is used to project feature activations back into pixel space, showing which input pattern originally caused a given activation in the feature maps. A deconvnet attached to a layer $L_i$ maps that layer's activations back to the input of $L_i$ by unpooling, rectifying, and applying transposed versions of the convnet's own learned filters, so no separate training is required. This is repeated layer by layer until pixel space is reached.
The deconvnet needs a special **unpooling layer**: during the forward pass, each max-pooling layer records the locations of its maxima in *switch* variables, and unpooling uses these switches to place the reconstructed activations back at the correct positions.
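A minimal sketch of one such layer pair in PyTorch (tensor shapes and filter count are illustrative; `max_pool2d` with `return_indices=True` plays the role of the switches):

```python
import torch
import torch.nn.functional as F

# One convnet layer: convolution -> ReLU -> max-pooling with switches.
x = torch.randn(1, 3, 224, 224)   # input image (illustrative size)
w = torch.randn(96, 3, 7, 7)      # stands in for the layer's learned filters
a = F.relu(F.conv2d(x, w, stride=2, padding=1))
pooled, switches = F.max_pool2d(a, kernel_size=2, stride=2,
                                return_indices=True)

# The matching deconvnet step projects activations back toward pixel
# space: unpool using the switches, rectify, then apply transposed
# versions of the same filters -- no separate training is involved.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2, stride=2)
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, w, stride=2, padding=1)
```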
### Occlusion sensitivity analysis
* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
  * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride)
  * At each $(x, y)$, either ...
    * (d) ... place a pixel which color-encodes the probability of the correct class, or
    * (e) ... place a pixel which color-encodes the most probable class

The letters match panels (d) and (e) of the corresponding figure in the paper; a code sketch of variant (d) follows.
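A minimal sketch of the occlusion sweep, assuming a PyTorch classifier `model` that returns logits; the square size, stride, and gray value here are illustrative choices, not the paper's exact settings:

```python
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class, square=32, stride=8, gray=0.5):
    """Slide a gray square over `image` (C x H x W) and record the
    probability of `target_class` for every occluder position."""
    _, h, w = image.shape
    ys = list(range(0, h - square + 1, stride))
    xs = list(range(0, w - square + 1, stride))
    heatmap = torch.zeros(len(ys), len(xs))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            occluded = image.clone()
            occluded[:, y:y + square, x:x + square] = gray  # Occlude(I, x, y)
            probs = model(occluded.unsqueeze(0)).softmax(dim=1)
            heatmap[i, j] = probs[0, target_class]          # variant (d)
    return heatmap  # low values mark regions the classifier relies on
```

Variant (e) would store `probs[0].argmax()` per position instead of the probability of the correct class.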
The following images from the Zeiler & Fergus paper visualize this pretty well:
If the dog's face is occluded, the probability of the correct class drops sharply:
![Imgur](http://i.imgur.com/Q1Ama2z.png)
If the dog's face is occluded, the most likely class suddenly becomes "tennis ball" instead of "Pomeranian":
![Imgur](http://i.imgur.com/5QYKh7b.png)
See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).
## How visualization helped to construct ZF-Net
* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> Lower filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> Lower stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct.
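In PyTorch terms, the first-layer change looks roughly like this (a sketch; padding and the rest of the network are omitted):

```python
import torch.nn as nn

# AlexNet's first layer: 96 large 11x11 filters applied with stride 4.
conv1_alexnet = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# ZF-Net's first layer: 96 smaller 7x7 filters with stride 2, chosen
# after the visualizations revealed missing mid frequencies and aliasing.
conv1_zfnet = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```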
## ZF-Net
Zeiler and Fergus also created a new network for ImageNet.
The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max pooling layers.
Training setup (sketched in code after this list):
* **Preprocessing**: Resize the smallest dimension to 256, subtract the per-pixel mean, crop a $224 \times 224$ region
* **Optimization**: Mini-Batch SGD, learning rate $= 10^{-2}$, momentum = $0.9$, 70 epochs
* **Resources**: took around 12 days on a single GTX580 GPU
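A minimal sketch of this setup, assuming PyTorch/torchvision; the per-channel normalization approximates the paper's per-pixel mean subtraction, and AlexNet stands in for ZF-Net, which torchvision does not ship:

```python
import torch
import torchvision
import torchvision.transforms as T

# Preprocessing as described above (per-channel mean subtraction is a
# common simplification of the paper's per-pixel mean image).
preprocess = T.Compose([
    T.Resize(256),        # resize the smallest dimension to 256
    T.CenterCrop(224),    # crop a 224 x 224 region
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),
])

# Optimization as described above; AlexNet is only a stand-in network.
model = torchvision.models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```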
The network was evaluated on
* ImageNet 2012: 14.8% top-5 error
* Caltech-101: $86.5\% \pm 0.5$ accuracy (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ accuracy (pretrained on ImageNet)
## Minor errors
* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)