The main contribution of this paper is a new way to analyze CNNs by (a) visualizing intermediate learned features and (b) occlusion sensitivity analysis.

## Analysis techniques

### Visualization

A multi-layer deconvolutional network (deconvnet) is used to project feature activations back into pixel space, showing which input pattern originally caused a given activation in the feature maps. The deconvnet takes the output of a layer $L_i$ and reconstructs the input of $L_i$, reusing the (transposed) filters of the already trained network, so no additional training is needed. This is repeated layer by layer until the input image is reached. The deconvnet has a special **unpooling layer**: each max-pooling layer records where each maximum came from and stores these locations in *switch* variables, which are then used for unpooling.

### Occlusion sensitivity analysis

* Occlude(I, x, y): put a gray square centered at $(x, y)$ over a part of the image $I$, then run the classifier.
* Create an image like this:
    * Run Occlude(I, x, y) for all $(x, y)$ (possibly with a stride).
    * At $(x, y)$, either ...
        * (d) ... place a pixel which color-encodes the probability of the correct class, or
        * (e) ... place a pixel which color-encodes the most probable class.

(A minimal code sketch of this procedure is given at the end of this summary.)

The following images from the Zeiler & Fergus paper visualize this pretty well. If the dog's face is occluded, the probability of the correct class drops a lot:

![Imgur](http://i.imgur.com/Q1Ama2z.png)

If the dog's face is occluded, the most likely class suddenly is "tennis ball" and no longer "Pomeranian":

![Imgur](http://i.imgur.com/5QYKh7b.png)

See [LIME](http://www.shortscience.org/paper?bibtexKey=journals/corr/1602.04938#martinthoma).

## How visualization helped to construct ZF-Net

* "The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies" -> reduce the filter size from $11 \times 11$ to $7 \times 7$
* "the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions" -> reduce the stride from 4 to 2
* The occlusion analysis helps to boost confidence that the kind of features being learned is actually correct.

## ZF-Net

Zeiler and Fergus also created a new network for ImageNet. The network consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations and max-pooling layers.

Training setup:

* **Preprocessing**: resize the smallest dimension to 256, subtract the per-pixel mean (per channel), crop a $224\text{px} \times 224\text{px}$ region
* **Optimization**: mini-batch SGD, learning rate $10^{-2}$, momentum $0.9$, 70 epochs
* **Resources**: training took around 12 days on a single GTX 580 GPU

The network was evaluated on:

* ImageNet 2012: 14.8% error
* Caltech-101: $86.5\% \pm 0.5$ accuracy (pretrained on ImageNet)
* Caltech-256: $74.2\% \pm 0.3$ accuracy (pretrained on ImageNet)

## Minor errors

* typo: "goes give" (also: something went wrong with the link there - the whole block is a link)
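The occlusion sensitivity procedure described above is easy to sketch in code. The following is a minimal sketch in PyTorch; `model`, the patch size, stride and gray value are illustrative assumptions, not the authors' exact settings or implementation.

```python
import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=8, gray=0.5):
    """Slide a gray square over `image` and record P(target_class) at every position.

    Assumes `model` maps a (1, C, H, W) tensor to class logits and `image` is an
    already preprocessed (C, H, W) tensor; patch/stride/gray are illustrative.
    """
    model.eval()
    _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                y, x = i * stride, j * stride
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = gray  # Occlude(I, x, y)
                probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
                heatmap[i, j] = probs[0, target_class]
    return heatmap  # low values mark regions the classifier relies on
```

Plotting `heatmap` gives variant (d) from above; storing `probs[0].argmax()` instead of the probability gives variant (e).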
---

This paper introduces a novel visualization technique, the DeconvNet, to understand the representations learnt by intermediate layers of a deep convolutional neural network. Using DeconvNet visualizations as a diagnostic tool in different settings, the authors propose changes to the model of Alex Krizhevsky; the modified model performs slightly better and generalizes well to other datasets.

Key contributions:

- Deconvolutional network
    - Feature activations are mapped back to input pixel space by setting all other activations in the layer to zero and successively unpooling, rectifying and filtering (using the same parameters as the trained network); see the sketch after the notes below.
    - Unpooling is approximated by using switch variables to remember the location of the highest input activation (hence these visualizations are image-specific).
    - Rectification involves passing the signal through a ReLU non-linearity.
    - Filtering involves convolving the reconstructed signal with the transpose of the convolutional layer's filters.
- Well-designed experiments to provide insights

## Strengths

- Observation of the evolution of features
    - Visualizations clearly demonstrate that lower layers converge within a few epochs, while upper layers only develop after a considerable number of epochs (40-50).
- Feature invariance
    - Visualizations show that small transformations have a dramatic effect on lower layers and a smaller impact on higher layers. The model is fairly stable to translation and scaling, but much less so to rotation.
- Occlusion sensitivity analysis
    - Parts of the image are occluded, and posteriors and activations are visualized. These clearly show that activations drop when the object is occluded.
- Correspondence analysis
    - The intuition is that CNNs implicitly learn the correspondence between different object parts.
    - To verify this, dog images with a frontal pose are taken and the same part of the face is occluded in each of them. The difference between the feature maps of each occluded image and its original is calculated, and the consistency of this difference across all image pairs is measured via Hamming distance (see the second sketch after the notes below). Lower scores compared to random occlusions show that the model does learn correspondences.
- The proposed model performs better than Alex Krizhevsky's model and generalizes well to other datasets.

## Weaknesses / Notes

- The justification / intuition for the choice of smaller filters wasn't convincing enough.
- Why does removing layer 7 give a better top-1 error rate on train and val?
- Rotation invariance might be something worth looking into.
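As referenced in the deconvolutional-network bullet above, one reconstruction step (unpool, rectify, filter) could look roughly like this in PyTorch. It assumes the switches were recorded during the forward pass with `max_pool2d(..., return_indices=True)`; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def deconv_step(act, conv, pool_indices, pool_kernel, pool_stride):
    """Project activations of one conv/ReLU/max-pool block back one level.

    `act`          - activations with all but the feature of interest zeroed out
    `conv`         - the trained nn.Conv2d of this block (its filters are reused)
    `pool_indices` - the "switches" recorded by max_pool2d(..., return_indices=True)
    """
    # Unpool: place each value back at the location the max originally came from.
    x = F.max_unpool2d(act, pool_indices, pool_kernel, stride=pool_stride)
    # Rectify: keep the reconstruction non-negative, as in the forward pass.
    x = F.relu(x)
    # Filter: convolve with the transposed filters of the same (trained) conv layer.
    x = F.conv_transpose2d(x, conv.weight, stride=conv.stride, padding=conv.padding)
    return x  # repeat for the block below until pixel space is reached
```

Repeating this step for every block below, with all other feature maps zeroed, yields the pixel-space patterns shown in the paper's visualizations.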
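The correspondence analysis can be sketched similarly, following the description above: the sign of the feature change caused by occluding the same part is compared across image pairs via Hamming distance. The shapes and the normalized distance below are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def correspondence_score(feats_original, feats_occluded):
    """Consistency of the feature change caused by occluding the same part
    in several images (lower = more consistent = stronger correspondence).

    Both inputs: arrays of shape (num_images, feature_dim), e.g. flattened
    feature vectors of the original and occluded version of each image.
    """
    # Sign pattern of the change each occlusion causes in the features.
    eps = np.sign(feats_original - feats_occluded)
    # Sum of pairwise (normalized) Hamming distances between the sign patterns.
    return sum(np.mean(eps[i] != eps[j]) for i, j in combinations(range(len(eps)), 2))
```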