# Object detection system overview

![Object detection system overview](https://i.imgur.com/vd2YUy3.png)

The system:

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN's mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. The localization problem
2. Labeling the data

### 1. The localization problem

* One approach frames localization as a regression problem; this reportedly reaches a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. However, with five convolutional layers, units high up in the network have very large receptive fields (195x195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

### 2. Labeling the data

* The conventional solution to scarce labeled data is unsupervised pre-training followed by supervised fine-tuning.
* R-CNN instead uses supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* Fine-tuning for detection improves mAP performance by 8 percentage points.
* Stochastic gradient descent via backpropagation proved effective for training the CNN.

## Object detection with R-CNN

The system consists of three modules:

* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.

### Module design

1. Region proposals
* R-CNN is agnostic to the particular region proposal method; prior work has even detected mitotic cells by applying a CNN to regularly spaced square crops, a special case of region proposals.
* The authors use the selective search method in "fast mode" (selective search is designed to capture all scales, use diverse grouping strategies, and be fast to compute).
* The time spent computing region proposals and features is 13 s/image on a GPU or 53 s/image on a CPU.

2. Feature extraction
* A 4096-dimensional feature vector is extracted from each region proposal using the Caffe implementation of the CNN.
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* Regardless of a proposal's size or aspect ratio, all pixels in a tight bounding box around it are warped to the required size.
* The feature matrix for an image is typically 2000x4096.

3. Test-time detection
* At test time, selective search is run on the test image to extract around 2000 region proposals ("fast mode" is used in all experiments).
* Each proposal is warped and forward propagated through the CNN to compute features. Then, for each class, each extracted feature vector is scored using the SVM trained for that class.
* Given all scored regions in an image, a greedy non-maximum suppression is applied (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (a minimal sketch follows below).
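A minimal sketch of this greedy per-class NMS, assuming `[x1, y1, x2, y2]` boxes and using 0.3 as a stand-in for the learned per-class threshold; names are illustrative, not the paper's code:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, thresh=0.3):
    """Keep the best box, drop boxes overlapping it above thresh, repeat."""
    order = np.argsort(scores)[::-1]   # indices, highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= thresh]
    return keep
```

At test time this would be run once per class over that class's scored proposals.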
## Training

1. Supervised pre-training
* The CNN is pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).

2. Domain-specific fine-tuning
* Stochastic gradient descent (SGD) training of the CNN parameters is continued using only warped region proposals, with a learning rate of 0.001.

3. Object category classifiers
* Regions are labeled using an intersection-over-union (IoU) overlap threshold of 0.3: proposals below it count as negatives, while ground-truth boxes are the positives.
* Once features are extracted and training labels are applied, one linear SVM is optimized per class.
* Since the training data is too large to fit in memory, the standard hard negative mining method is adopted.

### Results

1. VOC 2010
* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* R-CNN achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.

![Results on PASCAL VOC 2010](https://i.imgur.com/0dGX9b7.png)

2. ILSVRC2013 detection
* R-CNN was run on the 200-class ILSVRC2013 detection dataset.
* R-CNN achieves a mAP of 31.4%.

![Results on ILSVRC2013 detection](https://i.imgur.com/GFbULx3.png)

#### Performance layer-by-layer, without fine-tuning

1. Layer pool5
* The max-pooled output of the network's fifth and final convolutional layer.
* The pool5 feature map is 6x6x256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227-pixel input.

2. Layer fc6
* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.

3. Layer fc7
* Implemented by multiplying the features computed by fc6 by a 4096x4096 weight matrix, similarly adding a vector of biases and applying half-wave rectification.

#### Performance layer-by-layer, with fine-tuning

* The CNN's parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.

### Network architectures

* A 16-layer deep network, consisting of thirteen layers of 3x3 convolution kernels with five max-pooling layers interspersed, topped with three fully connected layers. This network is referred to as "O-Net" for OxfordNet, and the baseline as "T-Net" for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: a forward pass through O-Net takes roughly 7 times longer than through T-Net.

### The ILSVRC2013 detection dataset

* The dataset is split into three sets: train (395,918 images), val (20,121), and test (40,152).

#### CNN features for segmentation

* Full R-CNN: the first strategy (full) ignores the region's shape and computes CNN features directly on the warped bounding-box window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: the second strategy (fg) computes CNN features only on a region's foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: the third strategy (full+fg) simply concatenates the full and fg features.

![Segmentation results](https://i.imgur.com/n1bhmKo.png)
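A minimal sketch of the fg masking trick described above, assuming a warped window, its boolean foreground mask, and the mean image are already available (array names are illustrative):

```python
import numpy as np

def fg_input(window, mask, mean_image):
    """window: HxWx3 warped crop; mask: HxW bool foreground; mean_image: HxWx3."""
    out = np.where(mask[..., None], window, mean_image)  # background -> mean
    return out - mean_image                              # background is now exactly 0
```

The full+fg strategy then simply concatenates the 4096-dimensional CNN features computed from the unmasked and masked inputs.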
---
The R-CNN method is a way to localize objects in an image.

1. Regions are generated by any proposal method, including brute-force sliding windows.
2. Each region is classified using AlexNet.
3. For each label, the scored regions are searched to find the locations that express that label most strongly; overlapping detections of the same class are pruned with non-maximum suppression.
---
The [R-CNN](http://arxiv.org/abs/1311.2524) paper presents a method based on convolutional neural networks (CNNs) for object detection. It does so via region proposals (hence the "R"). The key insight was to train CNNs on a classification task and reuse the learned features to classify region proposals. They do *not* use a sliding-window approach such as OverFeat. Instead, they create around 2000 category-independent region proposals, crop the corresponding part of the image for each proposal, resize the crop to fit the CNN's input, and classify it. Notable follow-ups are:

* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15)
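Once every crop has been pushed through the CNN, classifying all proposals reduces to a single matrix product: stacking the feature vectors gives a roughly 2000x4096 matrix that is multiplied with the per-class SVM weights. A minimal, runnable sketch with random placeholder values standing in for learned weights and real features:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096))  # one CNN feature row per proposal
weights = rng.standard_normal((4096, 20))     # one learned SVM column per class
biases = rng.standard_normal(20)              # one bias per class

scores = features @ weights + biases          # (2000, 20) proposal-vs-class scores
print(scores.shape)
```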
---
This paper presents R-CNN, an approach to object detection using CNNs pre-trained for image classification. Object proposals are extracted from the image using Selective Search, dilated by a few pixels, warped to the CNN input size, and fed into the CNN to extract features (they experiment with pool5, fc6, fc7). These extracted feature vectors are scored using SVMs, one per class. Bounding-box regression, where they predict parameters to move the proposal closer to the ground truth, further boosts localization (a sketch of the regression targets follows below). The authors use AlexNet, pre-trained on ImageNet and fine-tuned for detection. Object proposals with IoU overlap greater than 0.5 are treated as positive examples, and the others as negative, and a 21-way classification (20 object categories + background) is set up to fine-tune the CNN. After fine-tuning, SVMs are trained per class, taking only the ground-truth boxes as positives and IoU <= 0.3 as negatives. R-CNN achieves major performance improvements on the PASCAL VOC 2007/2010 and ILSVRC2013 detection datasets. Finally, the method is extended to semantic segmentation and achieves competitive results.

## Strengths

- The method is simple and effective.
- Extensive ablation studies show why R-CNN works:
  - fc7 is the best feature to use (versus pool5, fc6).
  - Fine-tuning provides a large boost in performance.
  - VGG performs better than AlexNet.
  - Bounding-box regression further improves localization.

## Weaknesses / Notes

- Each region proposal is treated independently, which drives up compute time.
- There are lots of different parts; the network can't be trained end-to-end.
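The bounding-box regression step learns, per class, a linear map from pool5 features to four transformation targets. A minimal sketch of the target definitions, following the paper's formulation; the `(cx, cy, w, h)` box convention and the function name are just illustrative:

```python
import numpy as np

def regression_targets(proposal, gt):
    """proposal, gt: (center_x, center_y, width, height) boxes."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        (gx - px) / pw,    # t_x: scale-invariant x shift
        (gy - py) / ph,    # t_y: scale-invariant y shift
        np.log(gw / pw),   # t_w: log-space width change
        np.log(gh / ph),   # t_h: log-space height change
    ])
```

At test time the predicted targets are inverted to shift and rescale the proposal toward the object.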
---
* Previously, methods to detect bounding boxes in images were often based on the combination of manual feature extraction with SVMs.
* They replace the manual feature extraction with a CNN, leading to significantly higher accuracy.
* They use supervised pre-training on auxiliary datasets to deal with the small amount of labeled data (instead of the sometimes used unsupervised pre-training).
* They call their method R-CNN ("Regions with CNN features").

### How

* Their system has three modules: 1) region proposal generation, 2) CNN-based feature extraction per region proposal, 3) classification.
* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__architecture.jpg?raw=true "Architecture")
* Region proposal generation
  * A region proposal is a bounding box candidate that *might* contain an object.
  * By default they generate 2000 region proposals per image.
  * They suggest "simple" (i.e. not learned) algorithms for this step (e.g. objectness, selective search, CPMC).
  * They use selective search (which makes the system comparable to previous ones).
* CNN features
  * A CNN extracts features from each region proposal (replacing the previously used manual feature extraction).
  * So each region proposal is turned into a fixed-length vector.
  * They use AlexNet by Krizhevsky et al. as their base CNN (takes 227x227 RGB images, converts them into 4096-dimensional vectors).
  * They add `p=16` pixels to each side of every region proposal, extract the pixels and then simply resize the crop to 227x227 (ignoring the aspect ratio, so images might end up distorted); a minimal sketch of this warp appears at the end of this section.
  * They generate one 4096-dimensional vector per region proposal, which is lower-dimensional than what some previous manual feature extraction methods used. That enables faster classification, lower memory usage and thus more possible classes.
* Classification
  * A classifier receives the extracted feature vectors (one per region proposal) and classifies them into a predefined set of available classes (e.g. "person", "car", "bike", "background / no object").
  * They use one SVM per available class.
  * The regions that were not classified as background might overlap (multiple bounding boxes on the same object).
  * They use greedy non-maximum suppression to fix that problem (for each class individually).
  * That method simply rejects regions if they overlap strongly with another region that has a higher score.
  * Overlap is determined via Intersection over Union (IoU).
* Training method
  * Pre-training of the CNN
    * They use AlexNet pretrained on ImageNet (1000 classes).
    * They replace the last fully connected layer with a randomly initialized one that leads to `C+1` classes (`C` object classes, `+1` for background).
  * Fine-tuning of the CNN
    * They use SGD with learning rate `0.001`.
    * Batch size is 128 (32 positive windows, 96 background windows).
    * A region proposal is considered positive if its IoU with any ground-truth bounding box is `>=0.5`.
  * SVM
    * They train one SVM per class via hard negative mining.
    * Here only the ground-truth boxes count as positives, and proposals with IoU `<0.3` count as negatives; this 0.3 threshold performed better than 0.5.
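As referenced above, a minimal sketch of the warp-with-context preprocessing, assuming Pillow and clipping the padded box at the image border (the paper handles pixels beyond the border by filling with the image mean, so this clipping is a simplification):

```python
from PIL import Image

def warp_proposal(image: Image.Image, box, p=16, size=227):
    """Pad an (x1, y1, x2, y2) proposal by p pixels, then resize to size x size."""
    x1, y1, x2, y2 = box
    x1 = max(x1 - p, 0)                 # dilate the box by p pixels per side,
    y1 = max(y1 - p, 0)                 # clipped at the image border
    x2 = min(x2 + p, image.width)
    y2 = min(y2 + p, image.height)
    # anisotropic resize: aspect ratio is deliberately ignored
    return image.crop((x1, y1, x2, y2)).resize((size, size))
```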
### Results

* Pascal VOC 2010
  * They: 53.7% mAP.
  * Closest competitor (SegDPM): 40.4% mAP.
  * Closest competitor that uses the same region proposal method (UVA): 35.1% mAP.
  * ![Scores on Pascal VOC 2010](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__scores.jpg?raw=true "Scores on Pascal VOC 2010")
* ILSVRC2013 detection
  * They: 31.4% mAP.
  * Closest competitor (OverFeat): 24.3% mAP.
* They feed a large number of region proposals through the network and record, for each filter in the last conv layer, which images activated it the most:
  * ![Activations](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__activations.jpg?raw=true "Activations")
* Usefulness of layers:
  * They remove later layers of the network and retrain in order to find out which layers are the most useful ones.
  * Their result is that both fully connected layers of AlexNet are very domain-specific and profit most from fine-tuning.
* Using VGG16:
  * Using VGG16 instead of AlexNet increases mAP from 58.5% to 66.0% on Pascal VOC 2007.
  * Computation time is about 7 times higher.
* They train a linear regression model that improves the bounding-box dimensions based on the extracted features of the last pooling layer. That improves their mAP by 3-4 percentage points.
* The region proposals generated by selective search have a recall of 98% on Pascal VOC and 91.6% on ILSVRC2013 (measured at an IoU threshold of `>=0.5`).
---
This paper presents an object detection algorithm that improves mAP on the PASCAL VOC dataset by over 20% relative to the previous state of the art. Unlike image classification, which takes an image (or its center part) as input, the object detection task requires an algorithm to detect bounding boxes of objects in an image. To use high-capacity CNN features in object detection, the proposed algorithm first generates region proposals. CNN features are extracted from those region proposals and fed to a set of class-specific linear SVMs, which tell whether objects are detected in those regions.

### Technical details

The figure below shows the object detection system in this paper.

![](http://3.bp.blogspot.com/-O6e43qcpcYA/VWapFWyXt5I/AAAAAAAAA8c/rcjlQJAQ35s/s320/system.png)

Because the PASCAL VOC dataset is not large enough for training high-capacity CNN features, this paper uses supervised pre-training on a large auxiliary dataset (ILSVRC 2012). The CNN is then fine-tuned with a portion of the PASCAL VOC dataset (a minimal sketch of this recipe follows below).

### Results

The following table shows the detection mAP on VOC 2007 test.

![](http://1.bp.blogspot.com/-AmEd1cI6iWs/VWaqnD1YYwI/AAAAAAAAA8o/07pfZpdvwck/s400/table%2B2.png)
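A hedged sketch of the pre-train-then-fine-tune recipe referenced above, using torchvision's AlexNet as a stand-in for the paper's original Caffe model; the momentum value is a conventional choice, not taken from the summary:

```python
import torch
import torchvision

num_classes = 20  # PASCAL VOC object categories

# Start from an ImageNet-pretrained AlexNet (supervised pre-training).
model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

# Swap the 1000-way ImageNet classifier for a randomly initialized
# (num_classes + 1)-way layer, where +1 is the background class.
model.classifier[6] = torch.nn.Linear(4096, num_classes + 1)

# Fine-tune all parameters with SGD at the 0.001 learning rate from the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```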