The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analysis techniques

### Visualization

A multi-layer deconvolutional network (deconvnet) is used to project feature activations back into pixel space, showing which input pattern originally caused a given activation in the feature maps. The deconvnet takes the output of a layer $L_i$ and reconstructs the input of $L_i$, reusing the (transposed) filters of the already trained network, so no additional training is needed. This is repeated layer by layer until the input image is reached. The deconvnet has a special **unpooling layer**: each max-pooling layer records where each maximum came from and stores these locations in *switch* variables, which are then used for unpooling.

### Occlusion sensitivity analysis

* Occlude(I, x, y): put a gray square centered at $(x, y)$ over a part of the image $I$, then run the classifier.
* Create an image like this:
    * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride).
    * At $(x, y)$, either ...
        * (d) ... place a pixel which color-encodes the probability of the correct class, or
        * (e) ... place a pixel which color-encodes the most probable class.

(A minimal code sketch of this procedure is given at the end of this summary.)

The following images from the Zeiler & Fergus paper visualize this pretty well. If the dog's face is occluded, the probability of the correct class drops a lot:

![Imgur](http://i.imgur.com/Q1Ama2z.png)

If the dog's face is occluded, the most likely class suddenly is "tennis ball" and no longer "Pomeranian":

![Imgur](http://i.imgur.com/5QYKh7b.png)

See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).

## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> reduce the filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> reduce the stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned is actually correct.

## ZF-Net

Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max-pooling layers.

Training setup:

* **Preprocessing**: resize the smallest dimension to 256, subtract the per-pixel mean (per channel), crop a $224\text{px} \times 224\text{px}$ region
* **Optimization**: mini-batch SGD, learning rate $10^{-2}$, momentum $0.9$, 70 epochs
* **Resources**: training took around 12 days on a single GTX 580 GPU

The network was evaluated on:

* ImageNet 2012: 14.8% error
* Caltech-101: $86.5\% \pm 0.5$ accuracy (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ accuracy (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)
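The occlusion sensitivity procedure described above is easy to sketch in code. The following is a minimal sketch in PyTorch; `model`, the patch size, stride and gray value are illustrative assumptions, not the authors' exact settings or implementation.

```python
import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=8, gray=0.5):
    """Slide a gray square over `image` and record P(target_class) at every position.

    Assumes `model` maps a (1, C, H, W) tensor to class logits and `image` is an
    already preprocessed (C, H, W) tensor; patch/stride/gray are illustrative.
    """
    model.eval()
    _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                y, x = i * stride, j * stride
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = gray  # Occlude(I, x, y)
                probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
                heatmap[i, j] = probs[0, target_class]
    return heatmap  # low values mark regions the classifier relies on
```

Plotting `heatmap` gives variant (d) from above; storing `probs[0].argmax()` instead of the probability gives variant (e).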
---

This paper introduces a novel visualization technique, the DeconvNet, to understand the representations learnt by intermediate layers of a deep convolutional neural network. Using DeconvNet visualizations as a diagnostic tool in different settings, the authors propose changes to the model of Alex Krizhevsky; the modified model performs slightly better and generalizes well to other datasets.

Key contributions:

- Deconvolutional network
    - Feature activations are mapped back to input pixel space by setting all other activations in the layer to zero and successively unpooling, rectifying and filtering (using the same parameters as the trained network); see the sketch after the notes below.
    - Unpooling is approximated by using switch variables to remember the location of the highest input activation (hence these visualizations are image-specific).
    - Rectification involves passing the signal through a ReLU non-linearity.
    - Filtering involves convolving the reconstructed signal with the transpose of the convolutional layer's filters.
- Well-designed experiments to provide insights

## Strengths

- Observation of the evolution of features
    - Visualizations clearly demonstrate that lower layers converge within a few epochs, while upper layers only develop after a considerable number of epochs (40-50).
- Feature invariance
    - Visualizations show that small transformations have a dramatic effect on lower layers and a smaller impact on higher layers. The model is fairly stable to translation and scaling, but much less so to rotation.
- Occlusion sensitivity analysis
    - Parts of the image are occluded, and posteriors and activations are visualized. These clearly show that activations drop when the object is occluded.
- Correspondence analysis
    - The intuition is that CNNs implicitly learn the correspondence between different object parts.
    - To verify this, dog images with a frontal pose are taken and the same part of the face is occluded in each of them. The difference between the feature maps of each occluded image and its original is calculated, and the consistency of this difference across all image pairs is measured via Hamming distance (see the second sketch after the notes below). Lower scores compared to random occlusions show that the model does learn correspondences.
- The proposed model performs better than Alex Krizhevsky's model and generalizes well to other datasets.

## Weaknesses / Notes

- The justification / intuition for the choice of smaller filters wasn't convincing enough.
- Why does removing layer 7 give a better top-1 error rate on train and val?
- Rotation invariance might be something worth looking into.
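As referenced in the deconvolutional-network bullet above, one reconstruction step (unpool, rectify, filter) could look roughly like this in PyTorch. It assumes the switches were recorded during the forward pass with `max_pool2d(..., return_indices=True)`; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def deconv_step(act, conv, pool_indices, pool_kernel, pool_stride):
    """Project activations of one conv/ReLU/max-pool block back one level.

    `act`          - activations with all but the feature of interest zeroed out
    `conv`         - the trained nn.Conv2d of this block (its filters are reused)
    `pool_indices` - the "switches" recorded by max_pool2d(..., return_indices=True)
    """
    # Unpool: place each value back at the location the max originally came from.
    x = F.max_unpool2d(act, pool_indices, pool_kernel, stride=pool_stride)
    # Rectify: keep the reconstruction non-negative, as in the forward pass.
    x = F.relu(x)
    # Filter: convolve with the transposed filters of the same (trained) conv layer.
    x = F.conv_transpose2d(x, conv.weight, stride=conv.stride, padding=conv.padding)
    return x  # repeat for the block below until pixel space is reached
```

Repeating this step for every block below, with all other feature maps zeroed, yields the pixel-space patterns shown in the paper's visualizations.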
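The correspondence analysis can be sketched similarly, following the description above: the sign of the feature change caused by occluding the same part is compared across image pairs via Hamming distance. The shapes and the normalized distance below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def correspondence_score(feats_original, feats_occluded):
    """Consistency of the feature change caused by occluding the same part
    in several images (lower = more consistent = stronger correspondence).

    Both inputs: arrays of shape (num_images, feature_dim), e.g. flattened
    feature vectors of the original and occluded version of each image.
    """
    # Sign pattern of the change each occlusion causes in the features.
    eps = np.sign(feats_original - feats_occluded)
    # Sum of pairwise (normalized) Hamming distances between the sign patterns.
    return sum(np.mean(eps[i] != eps[j]) for i, j in combinations(range(len(eps)), 2))
```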