Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
arXiv e-Print archive, 2014
Keywords: cs.CV
First published: 2014/06/18
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size
(e.g., 224x224) input image. This requirement is "artificial" and may reduce
the recognition accuracy for the images or sub-images of an arbitrary
size/scale. In this work, we equip the networks with another pooling strategy,
"spatial pyramid pooling", to eliminate the above requirement. The new network
structure, called SPP-net, can generate a fixed-length representation
regardless of image size/scale. Pyramid pooling is also robust to object
deformations. With these advantages, SPP-net should in general improve all
CNN-based image classification methods. On the ImageNet 2012 dataset, we
demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures
despite their different designs. On the Pascal VOC 2007 and Caltech101
datasets, SPP-net achieves state-of-the-art classification results using a
single full-image representation and no fine-tuning.
The power of SPP-net is also significant in object detection. Using SPP-net,
we compute the feature maps from the entire image only once, and then pool
features in arbitrary regions (sub-images) to generate fixed-length
representations for training the detectors. This method avoids repeatedly
computing the convolutional features. In processing test images, our method is
24-102x faster than the R-CNN method, while achieving better or comparable
accuracy on Pascal VOC 2007.
In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our
methods rank #2 in object detection and #3 in image classification among all 38
teams. This manuscript also introduces the improvement made for this
competition.
Spatial Pyramid Pooling (SPP) is a technique that allows Convolutional Neural Networks (CNNs) to accept input images of any size, instead of only $224\,\text{px} \times 224\,\text{px}$ as most architectures require. (There is, however, a lower bound on the input size.)
## Idea
* Convolutional layers operate on inputs of any spatial size, but fully connected layers need fixed-size inputs
* Solution:
* Add a new SPP layer on top of the last convolutional layer, before the fully connected layer
* Use an approach similar to bag of words (BoW), but keep the spatial information. (BoW is used in text classification: the order of the words is discarded and only the number of occurrences of each word is kept.)
* The SPP layer operates on each feature map independently.
* The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer receives as input and $M$ is the number of bins.
Example: We could use spatial pyramid pooling with 21 bins (sketched in code below):
* 1 bin which is the max of the complete feature map
* 4 bins which divide the feature map into a $2 \times 2$ grid of equal-sized rectangular regions (region sizes depend on the input size). Each bin takes the max of its region.
* 16 bins which divide the feature map into a $4 \times 4$ grid of equal-sized rectangular regions. Each bin takes the max of its region.
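A minimal NumPy sketch of this 21-bin pooling (function and parameter names are my own; the window/stride rule follows the paper's ceil/floor scheme):

```python
import numpy as np

def spatial_pyramid_pool(fmaps, levels=(1, 2, 4)):
    """Pool k feature maps of any spatial size into a fixed-length vector.

    fmaps:  array of shape (k, h, w)
    levels: n means an n x n grid of bins, so (1, 2, 4)
            gives 1 + 4 + 16 = 21 bins and an output of length k * 21
    """
    k, h, w = fmaps.shape
    out = []
    for n in levels:
        # Window and stride as in the paper: win = ceil(size / n),
        # stride = floor(size / n). This needs h >= n and w >= n,
        # which is where the lower bound on the input size comes from.
        win_h, str_h = -(-h // n), h // n
        win_w, str_w = -(-w // n), w // n
        for i in range(n):
            for j in range(n):
                region = fmaps[:,
                               i * str_h : i * str_h + win_h,
                               j * str_w : j * str_w + win_w]
                out.append(region.max(axis=(1, 2)))  # one max per feature map
    return np.concatenate(out)
```

The fixed-length property is easy to check: two inputs of different spatial sizes yield the same output length.

```python
# 256 feature maps, two different spatial sizes:
a = spatial_pyramid_pool(np.random.rand(256, 13, 13))
b = spatial_pyramid_pool(np.random.rand(256, 24, 36))
assert a.shape == b.shape == (256 * 21,)
```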
## Evaluation
* Pascal VOC 2007, Caltech101: state-of-the-art, without finetuning
* ImageNet 2012: Boosts accuracy for various CNN architectures
* ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2
## Code
The paper claims that the code is [here](http://research.microsoft.com/en-us/um/people/kahe/), but it no longer appears to be available there.
People have tried to implement it with Tensorflow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but as of this writing no working public implementation is available.
## Related papers
* [Atrous Convolution](https://arxiv.org/abs/1606.00915)