Feature Pyramid Networks for Object Detection
Lin, Tsung-Yi; Dollár, Piotr; Girshick, Ross B.; He, Kaiming; Hariharan, Bharath; Belongie, Serge J.
arXiv e-Print archive - 2016 via Local Bibsonomy
Feature Pyramid Networks (FPNs) build on top of the state-of-the-art object detection network, Faster RCNN. Faster RCNN has a major problem when training for scale invariance: processing every image at multiple scales is memory-intensive and extremely slow. As a result, FRCNN applies the multi-scale approach only at test time.
On the other hand, feature pyramids were mainstream when hand-engineered features were used, primarily to cope with scale variation. A feature pyramid is a collection of features computed on multi-scale versions of the same image. Improving on a similar idea presented in *DeepMask*, FPN brings back feature pyramids by using the feature maps of conv layers with differing spatial resolutions, with predictions made at every level of the pyramid. Using the feature maps directly as they are would be difficult: the initial layers contain lower-level representations with poor semantics but good localisation, whereas deeper layers contain higher-level representations with rich semantics but poor localisation because of repeated subsampling.
##### Methodology
FPN can be used with any standard conv architecture that is used for classification. In such an architecture, successive stages have progressively decreasing spatial resolutions (say C1, C2, ..., C5). FPN takes C5 and convolves it with a 1x1 kernel to reduce the number of channels, giving P5. Next, P5 is upsampled and merged with C4 by element-wise addition to produce P4 (C4 is first convolved with a 1x1 kernel so that its channel count matches that of the upsampled P5). Similarly, P4 is upsampled and merged with C3 in the same way to give P3, and so on. The final set of feature maps, in this case {P2, ..., P5}, is used as the feature pyramid.
This is what the pyramid looks like:
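The top-down merge described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`conv1x1`, `upsample2x`, `build_top_down`), the toy channel counts, the random weights and the choice of nearest-neighbour upsampling are all made up for the example, and any further processing of the merged maps is left out.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: mixes channels independently at each spatial position.
    x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (assumed upsampling mode)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def build_top_down(c_maps, lateral_weights):
    """c_maps is [C2, C3, C4, C5], finest to coarsest. Returns [P2, P3, P4, P5]."""
    # The coarsest level has nothing above it: P5 is just the 1x1 conv of C5.
    p = conv1x1(c_maps[-1], lateral_weights[-1])
    pyramid = [p]
    # Walk towards finer levels: upsample the previous P and add the lateral map.
    for c, w in zip(reversed(c_maps[:-1]), reversed(lateral_weights[:-1])):
        p = upsample2x(p) + conv1x1(c, w)
        pyramid.append(p)
    return pyramid[::-1]

rng = np.random.default_rng(0)
d = 8                          # hypothetical shared channel width of the pyramid
channels = [16, 32, 64, 128]   # toy channel counts for C2..C5
sizes = [32, 16, 8, 4]         # spatial size halves at each stage
c_maps = [rng.standard_normal((ch, s, s)) for ch, s in zip(channels, sizes)]
lateral_weights = [rng.standard_normal((d, ch)) for ch in channels]

pyramid = build_top_down(c_maps, lateral_weights)
shapes = [p.shape for p in pyramid]
```

Every level ends up with the same channel width `d` while keeping the spatial size of its corresponding C map, which is what allows the same prediction head to be reused on each level.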
![](https://i.imgur.com/oHFmpww.png)
*Using the combination {P2, ..., P5} rather than only P2*: P2 has the highest resolution and, after the top-down merging, also carries strong semantics, so it could seem like the default choice. However, sharing weights across the remaining feature levels, and the scale invariance learned as a result, makes the pyramidal variant more robust and less prone to generating false ROIs.
The next stage can be either RPN or RCNN. In both cases the regressor and classifier share weights across all levels of the feature pyramid and are applied to the *anchors* (of varying aspect ratios) at each level. Predicting from several feature maps in this way is similar to [Single Shot Detector (SSD) Networks](http://www.shortscience.org/paper?bibtexKey=conf/eccv/LiuAESRFB16).
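To make the anchor setup concrete, here is a small sketch of generating anchors with several aspect ratios for each pyramid level. The level-to-scale mapping and the ratio set below are illustrative assumptions, since the summary does not give the actual values, and `anchors_for_level` is a hypothetical helper.

```python
import math

def anchors_for_level(scale, ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs whose area is scale**2, one per aspect
    ratio, where ratio is defined here as height / width."""
    out = []
    for r in ratios:
        w = scale / math.sqrt(r)
        h = scale * math.sqrt(r)
        out.append((w, h))
    return out

# Assumed assignment: one anchor scale per level, doubling as resolution halves.
level_scales = {"P2": 32, "P3": 64, "P4": 128, "P5": 256}
anchors = {lvl: anchors_for_level(s) for lvl, s in level_scales.items()}
```

Because each level handles its own scale, the same classifier and regressor weights can be applied unchanged to every level.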
##### Observation
FPN was added to FRCNN in the RPN part and in the RCNN part separately, and then in both parts together. The combined version produced state-of-the-art results on the MS COCO challenge, beating the COCO '15 and '16 winning models (Faster RCNN+++ and G-RMI) on mAP. FPN can also be used for instance segmentation by applying fully convolutional layers on top of the feature pyramids, where it outperforms *DeepMask*, *SharpMask* and *InstanceFCN*.