Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman
arXiv e-Print archive, 2013
#### Introduction
* The paper presents gradient-based techniques for visualising image classification models: class model visualisation and image-specific class saliency maps.
* [Link to the paper](https://arxiv.org/abs/1312.6034)
#### Experimental Setup
* A single deep ConvNet trained on the ILSVRC-2013 dataset (1.2M training images, 1000 classes).
* Weight layer configuration is: conv64-conv256-conv256-conv256-conv256-full4096-full4096-full1000.
#### Class Model Visualisation
* Given a learnt ConvNet and a class of interest, start from the zero image and optimise the image by gradient ascent, back-propagating with respect to the input image while keeping the ConvNet weights fixed. Formally, the objective is to maximise $S_c(I) - \lambda \lVert I \rVert_2^2$, where $S_c(I)$ is the class score (a minimal sketch follows this list).
* Add the training-set mean image to the resulting image for display.
* The paper uses unnormalised class scores $S_c$ rather than softmax posteriors, so that the optimisation focuses on increasing the score of the target class rather than on decreasing the scores of the other classes.
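A minimal PyTorch sketch of this gradient-ascent procedure, for illustration only: the pretrained network (torchvision's AlexNet stands in for the paper's ConvNet), the target class index, the regularisation weight, the learning rate, and the iteration count are all assumed values, not the authors' settings.

```python
import torch
import torchvision.models as models

# Freeze a pretrained classifier; only the input image is optimised.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 130                 # hypothetical ImageNet class index
lam = 1e-4                         # L2 regularisation weight (assumed value)
img = torch.zeros(1, 3, 224, 224, requires_grad=True)  # start from the zero image

optimizer = torch.optim.SGD([img], lr=1.0)
for _ in range(200):               # iteration count is an arbitrary choice
    optimizer.zero_grad()
    scores = model(img)            # unnormalised class scores (no softmax)
    # Gradient *ascent* on S_c(I) - lam * ||I||^2, done as descent on its negation.
    loss = -scores[0, target_class] + lam * img.pow(2).sum()
    loss.backward()                # back-propagate w.r.t. the input image only
    optimizer.step()

# The paper then adds the training-set mean image to the result for display.
```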
#### Image-Specific Class Saliency Visualisation
* Given an image, a class of interest, and a trained ConvNet, rank the pixels of the input image by their influence on the class score.
* The derivative of the class score with respect to the input image estimates how important each pixel is for the class.
* The magnitude of the derivative also indicates how much each pixel needs to be changed to affect the class score the most.
##### Class Saliency Extraction
* Find the derivative of the class score with respect to the input image.
* This yields one saliency map per colour channel.
* To obtain a single saliency map, take the maximum magnitude of the derivative across the colour channels (see the sketch after this list).
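A minimal sketch of this extraction, assuming a PyTorch classifier `model` that returns unnormalised class scores (the function name and shapes are illustrative):

```python
import torch

def class_saliency(model, img, target_class):
    """Per-pixel saliency: max over colour channels of |dS_c / dI|."""
    img = img.clone().detach().requires_grad_(True)
    scores = model(img)                      # unnormalised class scores, shape (1, C)
    scores[0, target_class].backward()       # one backward pass gives dS_c/dI
    saliency, _ = img.grad.abs().max(dim=1)  # max magnitude across the 3 channels
    return saliency[0]                       # single (H, W) saliency map
```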
##### Weakly Supervised Object Localisation
* The saliency map for an image provides a rough encoding of the location of the object belonging to the class of interest.
* Given an image and its saliency map, an object segmentation mask can be computed with GraphCut colour segmentation.
* Colour continuity cues are needed because the saliency map may capture only the most discriminative part of the object.
* This weakly supervised approach achieves 46.4% top-5 error on the ILSVRC-2013 localisation test set (a sketch of the seeding step follows this list).
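One way to approximate this step uses OpenCV's grabCut as a stand-in for the paper's GraphCut colour segmentation, with foreground/background seeds taken from the saliency map. The helper, the 0.95/0.30 seed thresholds, and the bounding-box step are assumptions for illustration, not the authors' exact setup.

```python
import cv2
import numpy as np

def localise(img_bgr, saliency):
    """Seed a colour segmentation from the saliency map; img_bgr is uint8 HxWx3."""
    s = saliency / saliency.max()
    # Unseeded pixels default to "probably background"; saliency seeds the rest.
    mask = np.full(s.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    mask[s > 0.95] = cv2.GC_FGD    # strongest responses seed the foreground
    mask[s < 0.30] = cv2.GC_BGD    # weakest responses seed the background
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img_bgr, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
    fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
    ys, xs = np.nonzero(fg)        # bounding box of the segmented foreground
    return xs.min(), ys.min(), xs.max(), ys.max()
```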
#### Relation to Deconvolutional Networks
* DeconvNet-based reconstruction of the $n^{th}$ layer input is similar to computing the gradient of the visualised neuron activity $f$ with respect to the input layer.
* One difference is in how the ReLU non-linearity is treated:
* In a DeconvNet, the sign indicator (the "derivative" of the ReLU) is computed on the output reconstruction, while in gradient back-propagation it is computed on the layer input, as sketched below.
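A two-line NumPy illustration of the difference, where `x` is the forward-pass input to the ReLU at layer $n$ and `g` is the backward signal arriving from the layer above (both arrays are placeholders):

```python
import numpy as np

x = np.random.randn(5)         # forward-pass input to the ReLU at layer n
g = np.random.randn(5)         # backward signal arriving from layer n+1

grad_backprop = g * (x > 0)    # this paper: gate on the sign of the layer *input*
deconvnet     = g * (g > 0)    # DeconvNet: gate on the sign of the *output reconstruction*
```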