Summary by Martin Thoma
Spatial Pyramid Pooling (SPP) is a technique that allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures require. (There is, however, a lower bound on the size of the input image.)
## Idea
* Convolutional layers operate on any size, but fully connected layers need fixed-size inputs
* Solution:
* Add a new SPP layer on top of the last convolutional layer, before the fully connected layer
* Use an approach similar to bag of words (BoW), but maintain the spatial information. The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurrences is kept.
* The SPP layer operates on each feature map independently.
* The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer receives as input and $M$ is the total number of bins in the pyramid.
Example: We could use spatial pyramid pooling with 21 bins:
* 1 bin which is the max of the complete feature map
* 4 bins which divide the feature map into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.
* 16 bins which divide the feature map into 16 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region (see the sketch below).
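To make the binning concrete, here is a minimal NumPy sketch of this 21-bin pooling scheme (not the authors' code; the function name and the `levels` parameter are illustrative). It splits each feature map into $1 \times 1$, $2 \times 2$, and $4 \times 4$ grids whose boundaries depend on the input size, takes the max in each cell, and concatenates everything into a fixed-length vector.

```python
import numpy as np

def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Max-pool each feature map into a fixed-length vector.

    feature_maps: array of shape (k, h, w) -- k feature maps of any spatial size.
    levels: pyramid levels; level n splits each map into an n x n grid,
            so (1, 2, 4) gives 1 + 4 + 16 = 21 bins per feature map.
    Returns a vector of length k * M, where M = sum(n * n for n in levels).
    """
    k, h, w = feature_maps.shape
    pooled = []
    for n in levels:
        # Bin boundaries depend on the input size; the number of bins does not.
        h_edges = np.linspace(0, h, n + 1, dtype=int)
        w_edges = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                region = feature_maps[:, h_edges[i]:h_edges[i + 1],
                                         w_edges[j]:w_edges[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # one max per feature map
    return np.concatenate(pooled)

# Two inputs of different spatial size yield vectors of the same length:
for h, w in [(13, 13), (10, 17)]:
    maps = np.random.rand(256, h, w)         # e.g. 256 feature maps from the last conv layer
    print(spatial_pyramid_pool(maps).shape)  # (5376,) = 256 * 21 in both cases
```

Because the output length depends only on $k$ and the pyramid levels, the fully connected layers that follow can always be fed the same fixed-size vector, regardless of the input image size.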
## Evaluation
* Pascal VOC 2007, Caltech101: state-of-the-art results, without fine-tuning
* ImageNet 2012: Boosts accuracy for various CNN architectures
* ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2
## Code
The paper claims that the code is [here](http://research.microsoft.com/en-us/um/people/kahe/), but this no longer seems to be the case.
People have tried to implement it in TensorFlow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but so far no public working implementation is available.
## Related papers
* [Atrous Convolution](https://arxiv.org/abs/1606.00915)