Training Region-based Object Detectors with Online Hard Example Mining
Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick
arXiv e-Print archive - 2016
Keywords:
cs.CV, cs.LG
First published: 2016/04/12
Abstract: The field of object detection has made significant advances riding on the
wave of region-based ConvNets, but their training procedure still includes many
heuristics and hyperparameters that are costly to tune. We present a simple yet
surprisingly effective online hard example mining (OHEM) algorithm for training
region-based ConvNet detectors. Our motivation is the same as it has always
been -- detection datasets contain an overwhelming number of easy examples and
a small number of hard examples. Automatic selection of these hard examples can
make training more effective and efficient. OHEM is a simple and intuitive
algorithm that eliminates several heuristics and hyperparameters in common use.
But more importantly, it yields consistent and significant boosts in detection
performance on benchmarks like PASCAL VOC 2007 and 2012. Its effectiveness
increases as datasets become larger and more difficult, as demonstrated by the
results on the MS COCO dataset. Moreover, combined with complementary advances
in the field, OHEM leads to state-of-the-art results of 78.9% and 76.3% mAP on
PASCAL VOC 2007 and 2012 respectively.
The problem this paper addresses is that detection training sets exhibit a large imbalance between the number of foreground examples and background examples. To make the point concrete: for sliding-window object detectors like the deformable parts model (DPM), the imbalance may be as extreme as 100,000 background examples to one annotated foreground example.
Before proceeding to the details of hard example mining (HEM), I just want to note that HEM in essence means that, while training, you sort your losses and train the model on the most difficult examples, which mostly means the ones with the highest loss (an extension of this idea can be found in the Focal Loss paper). This is a simple but powerful technique.
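Here is a minimal NumPy sketch of that core idea, selecting the top-k losses and averaging only those (the function name `topk_hard_example_loss` and the toy numbers are my own illustration, not from the paper):

```python
import numpy as np

def topk_hard_example_loss(losses, k):
    """Average only the k highest per-example losses; easy examples
    contribute nothing. This is the essence of hard example mining."""
    hard_idx = np.argsort(losses)[::-1][:k]  # indices of the k largest losses
    return losses[hard_idx].mean(), hard_idx

# Toy usage: most examples are easy (low loss), a few are hard.
losses = np.array([0.01, 0.02, 2.5, 0.03, 1.8, 0.02])
mean_hard_loss, hard_idx = topk_hard_example_loss(losses, k=2)
print(hard_idx, mean_hard_loss)  # -> [2 4] 2.15
```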
Taking this as our background, the authors propose a simple but effective method to train Fast R-CNN. Their approach is as follows:
1. For an input image at SGD iteration t, they first compute a convolutional feature map using the conv network.
2. The RoI network uses this feature map and all the input RoIs to do a forward pass.
3. Hard examples are selected by sorting the RoIs by loss and taking the B/N examples on which the current network performs worst (here B is the batch size and N is the number of images per batch, so B/N hard RoIs are selected per image).
4. While doing this, the authors notice that co-located RoIs with high overlap are likely to have correlated losses. Also, overlapping RoIs project onto mostly the same region of the conv feature map, because the feature map is a coarser/smaller representation of the input image. This could lead to loss double counting. To deal with it, they use standard non-maximum suppression (NMS).
5. NMS works here by iteratively selecting the RoI with the highest loss and removing all lower-loss RoIs that have high overlap with the selected region, using an IoU threshold of 0.7 (see the sketch after this list).
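The following is a self-contained NumPy sketch of this hard-RoI selection (steps 3-5): NMS performed on losses rather than detection scores. The function names (`ohem_select`, `iou`), the box format, and the toy values are my own illustrative choices under these assumptions; the paper's actual implementation runs this selection on the GPU inside the RoI network.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + areas - inter)

def ohem_select(rois, losses, num_keep, iou_thresh=0.7):
    """NMS by loss: repeatedly keep the highest-loss RoI and suppress
    lower-loss RoIs overlapping it by more than iou_thresh, until
    num_keep hard examples are selected (num_keep plays the role of B/N)."""
    order = np.argsort(losses)[::-1]  # RoI indices, highest loss first
    keep = []
    while order.size > 0 and len(keep) < num_keep:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(rois[i], rois[rest])
        order = rest[overlaps < iou_thresh]  # drop near-duplicate RoIs
    return np.array(keep)

# Toy usage: RoI 1 nearly duplicates RoI 0 (IoU = 0.81), so it is suppressed
# to avoid counting essentially the same loss twice.
rois = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 60, 60]], dtype=float)
losses = np.array([2.0, 1.9, 0.5])
print(ohem_select(rois, losses, num_keep=2))  # -> [0 2]
```

Only the RoIs returned by this selection are fed through the backward pass, so the gradient is dominated by the examples the network currently gets most wrong.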