The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analysis techniques

### Visualization

A multi-layer deconvolutional network is used to project the feature activations back into pixel space, showing which input pattern originally caused a given activation in the feature maps. The idea is to train a network which is given the output of a layer $L_i$ and has to reconstruct the input feature map of $L_i$. This is repeated until the input image is reached. The deconv-net has a special **unpooling layer**: the max-pooling layers have to record where each maximum activation came from and store those positions in *switch* variables, which are used during unpooling.

### Occlusion sensitivity analysis

* Occlude(I, x, y): Put a gray square centered at $(x, y)$ over a part of the image $I$. Run the classifier.
* Create an image like this:
    * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride)
    * At $(x, y)$, either ...
        * (d) ... place a pixel which color-encodes the probability of the correct class
        * (e) ... place a pixel which color-encodes the most probable class

The following images from the Zeiler & Fergus paper visualize this pretty well.

If the dog's face is occluded, the probability of the correct class drops a lot:

![Imgur](http://i.imgur.com/Q1Ama2z.png)

If the dog's face is occluded, the most likely class suddenly is "tennis ball" and no longer "Pomeranian":

![Imgur](http://i.imgur.com/5QYKh7b.png)

See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).
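The occlusion procedure above can be sketched in a few lines. This is a minimal numpy sketch, not the paper's implementation: `classify` is a hypothetical function mapping an image to a vector of class probabilities, and the gray fill value assumes pixel intensities in $[0, 1]$.

```python
import numpy as np

def occlusion_map(image, classify, true_class, size=8, stride=4, fill=0.5):
    """Slide a gray square over the image; at each position record the
    classifier's probability for the true class (variant (d) above).

    `classify` is a hypothetical callable: image -> probability vector.
    `fill=0.5` assumes pixel values in [0, 1]."""
    h, w = image.shape[:2]
    rows = (h - size) // stride + 1
    cols = (w - size) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + size, x:x + size] = fill  # gray square
            heat[i, j] = classify(occluded)[true_class]
    return heat
```

Low values in the returned map mark regions whose occlusion hurts the correct class most, i.e. the regions the classifier actually relies on. Variant (e) is the same loop with `argmax` instead of indexing the true class.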
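The switch-based unpooling from the visualization section can also be made concrete. A minimal numpy sketch, assuming $2 \times 2$ non-overlapping pooling (the paper's pooling regions and layer structure differ; this only illustrates the switch mechanism):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """k x k non-overlapping max pooling that also records, per pooling
    region, the position ("switch") of the maximum activation."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            block = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            switches[i, j] = (i * k + r, j * k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each pooled activation back at its recorded switch
    position; all other positions stay zero."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out
```

Because only the switch positions receive values, unpooling is an approximate inverse: it preserves the location of the strongest stimuli but discards the rest of the pooling region.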
## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> lower the filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> lower the stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned are actually correct.

## ZF-Net

Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max-pooling layers.

Training setup:

* **Preprocessing**: Resize the smallest dimension to 256, per-pixel mean subtraction per channel, crop a $224\text{px} \times 224\text{px}$ region
* **Optimization**: Mini-batch SGD, learning rate $= 10^{-2}$, momentum $= 0.9$, 70 epochs
* **Resources**: Training took around 12 days on a single GTX 580 GPU

The network was evaluated on:

* ImageNet 2012: 14.8% error
* Caltech-101: $86.5\% \pm 0.5$ (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)
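The preprocessing step from the training setup above can be sketched as follows. A minimal numpy sketch under stated assumptions: the image has already been resized so its smallest dimension is 256, a center crop is taken (the paper also uses random crops and flips at training time), and `mean_image` is assumed to be the per-pixel training-set mean.

```python
import numpy as np

def preprocess(image, mean_image, crop=224):
    """Center-crop a `crop` x `crop` region and subtract the matching
    region of the per-pixel (per-channel) mean image.

    Assumes `image` was already resized so min(h, w) == 256."""
    h, w = image.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    patch = image[top:top + crop, left:left + crop]
    mean_patch = mean_image[top:top + crop, left:left + crop]
    return patch - mean_patch
```

Subtracting the per-pixel mean centers the inputs around zero, which helps the SGD setup listed above converge.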