Feature Pyramid Networks for Object Detection
Lin, Tsung-Yi; Dollár, Piotr; Girshick, Ross B.; He, Kaiming; Hariharan, Bharath; Belongie, Serge J.
arXiv e-Print archive - 2016 via Local Bibsonomy
Feature Pyramid Networks (FPNs) build on top of Faster R-CNN, the state-of-the-art framework for object detection. Faster R-CNN faces a major problem in training for scale invariance: computing features over multi-scale image pyramids is memory-intensive and extremely slow, so it applies the multi-scale approach only at test time.
Feature pyramids, on the other hand, were mainstream in the era of hand-engineered features, used primarily to achieve scale invariance. A feature pyramid is a collection of features computed at multiple scaled versions of the same image. Improving on a similar idea presented in *DeepMask*, FPN brings feature pyramids back by reusing the feature maps of conv layers at different spatial resolutions, with predictions happening at all levels of the pyramid. Using these feature maps directly as-is would be problematic: early layers contain low-level representations with poor semantics but good localisation, whereas deeper layers contain high-level representations with rich semantics but poor localisation due to repeated subsampling.
##### Methodology
FPN can be used with any standard conv architecture designed for classification. In such an architecture, the stages produce feature maps of progressively decreasing spatial resolution (say C1, C2, ..., C5). FPN takes C5 and convolves it with a 1x1 kernel to reduce the number of channels, giving P5. Next, P5 is upsampled and merged with C4 (which is itself convolved with a 1x1 kernel so that its channel count matches that of the upsampled P5) by element-wise addition to produce P4. Similarly, P4 is upsampled and merged with C3 in the same way to give P3, and so on. The final set of feature maps, in this case {P2, ..., P5}, is used as the feature pyramid.
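To make the top-down pathway concrete, here is a minimal PyTorch sketch. The class name `FPNTopDown`, the ResNet-style default channel counts, and the output width `d=256` are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 lateral convs bring each C_i down to the same channel count d.
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # 3x3 convs smooth the outputs; the paper appends these to the merged
        # maps to reduce upsampling aliasing (applying one to P5 as well is a
        # common simplification).
        self.smooth = nn.ModuleList(nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # P5 is simply a 1x1 conv of C5, the coarsest map.
        p5 = self.lateral[3](c5)
        # Upsample 2x (nearest neighbour), add the lateral conv of the next
        # finer map, and repeat down to P2.
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2.0, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2.0, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2.0, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p2, p3, p4, p5))]
```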
This is what the pyramid looks like:
![](https://i.imgur.com/oHFmpww.png)
*Why use the combination {P2, ..., P5} rather than only P2?* P2 yields the highest-resolution, most semantic features and might seem like the obvious default choice. However, the weights shared across the remaining pyramid levels and the scale invariance learned through them make the pyramidal variant more robust, generating fewer false RoIs.
For the next stage, whether RPN or R-CNN, the box regressor and classifier share weights across all *anchors* (of varying aspect ratios) at every level of the feature pyramid. This step is similar to [Single Shot Detector (SSD) Networks](http://www.shortscience.org/paper?bibtexKey=conf/eccv/LiuAESRFB16). A sketch of such a shared head follows.
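A hedged sketch of a head shared across pyramid levels, reusing the PyTorch imports from the snippet above (`SharedRPNHead` and `num_anchors` are illustrative names; `num_anchors` counts aspect ratios only, since scale is handled by the pyramid level):

```python
class SharedRPNHead(nn.Module):
    """One head, applied with the same weights to every pyramid level."""
    def __init__(self, d=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(d, d, 3, padding=1)
        self.cls = nn.Conv2d(d, num_anchors, 1)      # objectness score per anchor
        self.reg = nn.Conv2d(d, num_anchors * 4, 1)  # box deltas per anchor

    def forward(self, pyramid):
        outs = []
        for p in pyramid:  # e.g. [P2, P3, P4, P5]
            h = F.relu(self.conv(p))
            outs.append((self.cls(h), self.reg(h)))
        return outs
```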
##### Observation
FPN was plugged into Faster R-CNN, first into the RPN and R-CNN parts separately and then into both combined, producing state-of-the-art mAP results on the MS COCO challenges and bettering the COCO '15 and '16 winning models (Faster R-CNN+++ and G-RMI). FPN can also be used for instance segmentation by running fully convolutional layers on top of the feature pyramid, where it outperforms *DeepMask*, *SharpMask*, and *InstanceFCN*.
* They suggest a modified network architecture for object detectors (i.e. bounding box detectors).
* The architecture aggregates features from many scales (i.e. from before each pooling layer) to detect both small and large objects.
* The network is shaped similar to an hourglass.
### How
* Architecture
* They have two branches.
* The first one is similar to any normal classification network:
convolutions and pooling.
The exact number of convolutions and pooling steps is determined by the base network used (e.g. ~50 convolutions with 5 downsampling steps in ResNet-50).
* The second branch starts at the first one's output.
It uses nearest neighbour upsampling to step the resolution back up, mirroring the first branch's feature map sizes.
Apart from a 3x3 convolution applied to each merged map (to reduce the aliasing introduced by upsampling), it contains no further convolutions.
All of its layers have 256 channels.
* There are connections between the layers of the first and second branch.
These connections are simply 1x1 convolutions followed by an element-wise addition (similar to residual connections).
Only layers with matching height and width are connected. (A quick shape check follows the visualization below.)
* Visualization:
* ![architecture](https://github.com/aleju/papers/blob/master/neural-nets/images/Feature_Pyramid_Networks_for_Object_Detection/architecture.jpg?raw=true "architecture")
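To make those shape claims concrete (256 channels everywhere, merges only between maps of matching resolution), here is a quick check reusing the `FPNTopDown` sketch from the first summary, with hypothetical tensors standing in for ResNet-50 features of a 224x224 input:

```python
# Dummy stand-ins for C2..C5 of ResNet-50 on a 224x224 image.
c2 = torch.randn(1, 256, 56, 56)
c3 = torch.randn(1, 512, 28, 28)
c4 = torch.randn(1, 1024, 14, 14)
c5 = torch.randn(1, 2048, 7, 7)

for p in FPNTopDown()(c2, c3, c4, c5):
    print(tuple(p.shape))
# (1, 256, 56, 56), (1, 256, 28, 28), (1, 256, 14, 14), (1, 256, 7, 7):
# every level ends up with 256 channels, resolutions mirror C2..C5.
```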
* Integration with Faster R-CNN
* They base the RPN on their second branch.
* While usually an RPN is applied to a single feature map of one scale, in their case it is applied to many feature maps of varying scales.
* The RPN uses the same parameters for all scales.
* They use anchor boxes, but only of different aspect ratios, not of different scales (scales are already covered by the pyramid levels' feature map heights/widths; see the sketch below).
* Ground truth bounding boxes are associated with the best matching anchor box (i.e. one box among all scales).
* Everything else is the same as in Faster R-CNN.
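A small sketch of that anchor scheme: the paper assigns anchor areas of 32², 64², 128², 256², and 512² pixels to levels P2..P6 (P6 is an extra coarse map used only in the RPN), each with aspect ratios {1:2, 1:1, 2:1}. The helper name `anchors_for_level` is hypothetical:

```python
def anchors_for_level(base_size, ratios=(0.5, 1.0, 2.0)):
    """Return (w, h) anchor shapes for one pyramid level.

    The area stays fixed at base_size**2; only the aspect ratio varies,
    because object scale is covered by the choice of pyramid level.
    """
    return [(base_size / r ** 0.5, base_size * r ** 0.5) for r in ratios]

# One anchor scale per level:
for level, base in zip(("P2", "P3", "P4", "P5", "P6"), (32, 64, 128, 256, 512)):
    print(level, [(round(w), round(h)) for w, h in anchors_for_level(base)])
```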
* Integration with Fast R-CNN
* Fast R-CNN does not use an RPN, but instead usually uses Selective Search to find region proposals (and applies RoI-Pooling to them).
* Here, they simply RoI-Pool from the FPN's output of the second branch.
* They do not pool over all scales. Instead, they pick only the scale/layer that matches the region proposal's size (based on its height/width; see the sketch after this list).
* They process each pooled RoI using two 1024-dimensional fully connected layers (initialized randomly).
* Everything else is the same as in Fast R-CNN.
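The level-picking rule is the paper's equation: an RoI of width $w$ and height $h$ is assigned to level $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ with $k_0 = 4$ (224 being the canonical ImageNet pre-training size). A minimal sketch, where clamping to [2, 5] keeps $k$ inside {P2, ..., P5}:

```python
import math

def roi_level(w, h, k0=4, k_min=2, k_max=5):
    """Pyramid level an RoI is pooled from, per the paper's assignment rule."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

print(roi_level(224, 224))  # -> 4: a 224x224 proposal pools from P4
print(roi_level(112, 112))  # -> 3: half the size pools one level finer
```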
### Results
* Faster R-CNN
* FPN improves recall on COCO by about 8 points, compared to using a standard RPN.
* The improvement is stronger for small objects (about 12 points).
* For some reason only recall values are reported here, no AP.
* The RPN uses a few convolutions to transform each feature map into region proposals.
Sharing these convolutions' parameters across pyramid levels marginally improves results.
* Fast R-CNN
* FPN improves AP on COCO by about 2 points.
* The improvement is stronger for small objects (about 2.1 points).