[link]
* They compare the results of various models for pedestrian detection.
* The various models were developed over the course of ~10 years (2003-2014).
* They analyze which factors seemed to improve the results.
* They derive new models for pedestrian detection from that.

### Comparison: Datasets

* Available datasets:
  * INRIA: Small dataset. Diverse images.
  * ETH: Video dataset. Stereo images.
  * TUD-Brussels: Video dataset.
  * Daimler: No color channel.
  * Daimler stereo: Stereo images.
  * Caltech-USA: Most often used. Large dataset.
  * KITTI: Often used. Large dataset. Stereo images.
* All datasets except KITTI are part of the "unified evaluation toolbox", which allows authors to easily test on all of these datasets.
* Evaluation was initially done per window (FPPW, false positives per window) and later changed to per image (FPPI, false positives per image), because the per-window protocol skewed the results.
* Common evaluation metrics:
  * MR: Log-average miss rate (lower is better; a sketch of how it is computed follows below).
  * AUC: Area under the precision-recall curve (higher is better).

### Comparison: Methods

* Families
  * They identify three families of methods: Deformable Parts Models, Deep Neural Networks and Decision Forests.
  * Decision Forests were the most popular family.
  * No family seemed to perform clearly better than the others.
  * There was no evidence that non-linear kernels were needed (given sufficiently sophisticated features).
* Additional data
  * Adding (coarse) optical flow data to each image seemed to consistently improve results (sketched below).
  * There was some indication that adding stereo data to each image improves the results.
* Context
  * For sliding-window detectors, adding context from around the window seemed to improve the results.
  * E.g. context can indicate whether there are detections next to the window, as people tend to walk in groups.
* Deformable parts
  * They saw no evidence that deformable parts models outperformed other models.
* Multi-scale models
  * Training separate models for each sliding-window scale seemed to improve results slightly.
* Deep architectures
  * They saw no evidence that deep neural networks outperformed other models. (Note: The paper is from 2014; this might have changed since.)
* Features
  * The best performance was usually achieved with simple HOG+LUV features (sketched below), i.e. by converting each window into 10 channels:
    * 6 channels of gradient orientations,
    * 1 channel of gradient magnitude,
    * 3 channels of the LUV color space.
  * Some models use significantly more channels for gradient orientations, but there was no evidence that this is necessary for good accuracy.
  * However, using more different features (and more sophisticated ones) seemed to improve results.

### Their new model

* They choose Decision Forests as their model framework (2048 level-2 trees, i.e. 3 thresholds per tree; a rough stand-in is sketched below).
* They use features from the [Integral Channel Features framework](http://pages.ucsd.edu/~ztu/publication/dollarBMVC09ChnFtrs_0.pdf). (Basically just a mixture of common/simple features per window.)
* They add optical flow as a feature.
* They add context around the window as a feature (a second detector that detects windows containing two persons).
* Their model significantly improves upon the state of the art (from 34% to 22% MR on the Caltech dataset).

*Overview of models developed over the years, starting with Viola-Jones (VJ) and ending with their suggested model (Katamari-v1). (DF = Decision Forest, DPM = Deformable Parts Model, DN = Deep Neural Network; I = INRIA dataset, C = Caltech dataset)*
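As a reference for the MR metric above: a minimal numpy sketch of the log-average miss rate, assuming the Caltech-style protocol (the miss rate is sampled at nine FPPI points spaced evenly in log-space between $10^{-2}$ and $10^{0}$ and the geometric mean is taken):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate (MR) of a miss-rate-vs-FPPI curve.

    fppi, miss_rate: 1-D arrays describing the detector's curve,
    with fppi sorted in ascending order.
    """
    refs = np.logspace(-2.0, 0.0, num=9)  # 9 reference FPPI points
    samples = []
    for ref in refs:
        # take the miss rate at the largest FPPI <= ref; if the curve
        # starts above ref, fall back to its first point
        idx = np.searchsorted(fppi, ref, side="right") - 1
        samples.append(miss_rate[max(idx, 0)])
    samples = np.maximum(samples, 1e-10)  # avoid log(0)
    return float(np.exp(np.mean(np.log(samples))))
```

The geometric mean matches the log-spaced sampling, so a detector has to achieve low miss rates across the whole FPPI range rather than at a single operating point.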
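For the optical flow cue: the surveyed detectors use various flow methods, so the following is only an illustrative sketch that turns a pair of consecutive frames into two coarse flow channels (Farneback flow via OpenCV; not necessarily what any of the surveyed models used):

```python
import cv2
import numpy as np

def flow_channels(prev_gray, curr_gray):
    """Dense optical flow between two consecutive 8-bit grayscale
    frames, returned as two extra feature channels (horizontal and
    vertical displacement per pixel)."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return np.moveaxis(flow, 2, 0)  # shape: (2, H, W)
```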
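A simplified sketch of the 10 HOG+LUV channels (without the smoothing, block normalization and downsampling that real implementations add; OpenCV and numpy assumed):

```python
import cv2
import numpy as np

def hog_luv_channels(bgr, n_orientations=6):
    """Compute the 10 HOG+LUV feature channels of an image:
    n_orientations gradient-orientation channels, 1 gradient-magnitude
    channel and 3 LUV color channels.

    bgr: uint8 BGR image (as loaded by cv2.imread)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    dx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    dy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    # orientation in [0, pi), quantized into n_orientations bins
    orientation = np.arctan2(dy, dx) % np.pi
    bin_idx = np.minimum((orientation / np.pi * n_orientations).astype(int),
                         n_orientations - 1)
    channels = []
    for b in range(n_orientations):
        # each orientation channel holds the gradient magnitude of the
        # pixels whose orientation falls into bin b
        channels.append(np.where(bin_idx == b, magnitude, 0.0))
    channels.append(magnitude)
    luv = cv2.cvtColor(bgr, cv2.COLOR_BGR2LUV).astype(np.float32)
    channels.extend(np.moveaxis(luv, 2, 0))  # L, U, V as separate channels
    return np.stack(channels)  # shape: (10, H, W)
```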
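Finally, their classifier is a boosted forest of 2048 depth-2 trees over pooled channel features. A rough scikit-learn stand-in (the paper's exact boosting variant and its feature pooling are not reproduced here):

```python
from sklearn.ensemble import GradientBoostingClassifier

# 2048 boosted trees of depth 2 (a level-2 tree has 3 thresholds);
# X would hold one channel-feature vector per window, y its 0/1 label
# (both hypothetical placeholders here).
forest = GradientBoostingClassifier(n_estimators=2048, max_depth=2)
# forest.fit(X, y)
# scores = forest.decision_function(candidate_windows)
```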
[link]
Spatial Pyramid Pooling (SPP) is a technique which allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures do. (However, there is a lower bound for the size of the input image.)

## Idea

* Convolutional layers operate on inputs of any size, but fully connected layers need fixed-size inputs.
* Solution:
  * Add a new SPP layer on top of the last convolutional layer, before the fully connected layers.
  * Use an approach similar to bag of words (BoW), but maintain the spatial information. (The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurrences is kept.)
* The SPP layer operates on each feature map independently.
* The output of the SPP layer has dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer received as input and $M$ is the number of bins.

Example: We could use spatial pyramid pooling with 21 bins (this configuration is also used in the sketch at the end of these notes):

* 1 bin which is the max of the complete feature map,
* 4 bins which divide the feature map into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.
* 16 bins which divide the feature map into 16 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region.

## Evaluation

* Pascal VOC 2007, Caltech101: State of the art, without finetuning.
* ImageNet 2012: Boosts accuracy for various CNN architectures.
* ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2.

## Code

The paper claims that the code is [here](http://research.microsoft.com/en-us/um/people/kahe/), but this no longer seems to be the case. People have tried to implement it in TensorFlow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but so far no public working implementation is available.

## Related papers

* [Atrous Convolution](https://arxiv.org/abs/1606.00915)
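As referenced above: a minimal numpy sketch of the pooling operation itself, assuming max pooling and the 21-bin pyramid from the example (levels $1 \times 1$, $2 \times 2$ and $4 \times 4$). The bin boundaries use simple proportional rounding; the paper's exact floor/ceil windowing is only approximated:

```python
import numpy as np

def spp_layer(feature_maps, levels=(1, 2, 4)):
    """Spatial pyramid pooling over a stack of feature maps.

    feature_maps: array of shape (k, H, W) -- k maps from the last
    conv layer; H and W may vary per input image.
    levels: pyramid levels; level n contributes an n x n grid of bins.
    Returns a vector of length k * sum(n * n for n in levels),
    i.e. k * 21 for the default levels."""
    k, h, w = feature_maps.shape
    pooled = []
    for n in levels:
        # proportional integer grid boundaries for an n x n grid
        ys = (np.arange(n + 1) * h) // n
        xs = (np.arange(n + 1) * w) // n
        for i in range(n):
            for j in range(n):
                region = feature_maps[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # max per map
    return np.concatenate(pooled)

# Example: 256 feature maps of size 13x15 -> vector of length 256 * 21
vec = spp_layer(np.random.rand(256, 13, 15))
assert vec.shape == (256 * 21,)
```

Because the number of bins is fixed while their size adapts to the feature map, the output length $k \cdot M$ is the same for every input image size, which is exactly what the fully connected layers need.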