Focal Loss for Dense Object Detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár
arXiv e-Print archive - 2017
Keywords:
cs.CV
First published: 2017/08/07
Abstract: The highest accuracy object detectors to date are based on a two-stage
approach popularized by R-CNN, where a classifier is applied to a sparse set of
candidate object locations. In contrast, one-stage detectors that are applied
over a regular, dense sampling of possible object locations have the potential
to be faster and simpler, but have trailed the accuracy of two-stage detectors
thus far. In this paper, we investigate why this is the case. We discover that
the extreme foreground-background class imbalance encountered during training
of dense detectors is the central cause. We propose to address this class
imbalance by reshaping the standard cross entropy loss such that it
down-weights the loss assigned to well-classified examples. Our novel Focal
Loss focuses training on a sparse set of hard examples and prevents the vast
number of easy negatives from overwhelming the detector during training. To
evaluate the effectiveness of our loss, we design and train a simple dense
detector we call RetinaNet. Our results show that when trained with the focal
loss, RetinaNet is able to match the speed of previous one-stage detectors
while surpassing the accuracy of all existing state-of-the-art two-stage
detectors. Code is at: https://github.com/facebookresearch/Detectron.
In object detection, gains in speed and accuracy have mostly come from changes to the network architecture. This paper takes a different route toward that goal: the authors introduce a new loss function called the focal loss.
The authors identify extreme foreground-background class imbalance as the main obstacle preventing one-stage detectors from matching the accuracy of two-stage detectors.
The loss function they introduce is a dynamically scaled cross-entropy loss, where the scaling factor decays to zero as confidence in the correct class increases.
They add a modulating factor $(1 - p_t)^\gamma$ to the cross-entropy loss, as shown in the image below: https://i.imgur.com/N7R3M9J.png
The resulting focal loss, $FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$, looks like this: https://i.imgur.com/kxC8NCB.png
In experiments, though, they use an $\alpha$-balanced variant, $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$, because it gives slightly better results.
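To make the definition concrete, here is a minimal sketch of the loss in PyTorch (my own code, not the authors'), assuming one-vs-all sigmoid classification per anchor as used in the paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits  -- raw (pre-sigmoid) predictions, any shape
    targets -- same shape, float values in {0, 1}
    alpha=0.25, gamma=2.0 are the defaults reported in the paper.
    """
    # Plain per-element cross entropy: -log(p_t)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t: the model's estimated probability of the ground-truth class
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t: weight positives by alpha, negatives by (1 - alpha)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # The (1 - p_t)^gamma factor shrinks the loss for well-classified examples
    loss = alpha_t * (1 - p_t) ** gamma * ce
    # The paper sums over all anchors and normalizes by the number of
    # positive (foreground) anchors; here we simply sum.
    return loss.sum()
```

A handy sanity check: with $\gamma = 0$ this reduces to ordinary ($\alpha$-weighted) cross entropy.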
**RetinaNet**
RetinaNet is a single unified network composed of a backbone network and two task-specific subnetworks. The backbone computes the feature maps for the input image. The first subnetwork performs object classification on the backbone's output, and the second performs bounding-box regression.
The backbone they use is a Feature Pyramid Network (FPN), built on top of ResNet.
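For intuition, a rough PyTorch sketch of the two subnetworks, using the head design reported in the paper (four 3x3 conv+ReLU layers with 256 channels, then a final 3x3 conv producing per-anchor outputs); the FPN backbone and anchor machinery are omitted, and the helper name `make_subnet` is mine:

```python
import torch.nn as nn

def make_subnet(out_channels, in_channels=256, feats=256):
    """Shared head design: four 3x3 conv + ReLU layers, then a final
    3x3 conv producing the per-anchor outputs."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, feats, 3, padding=1), nn.ReLU()]
        in_channels = feats
    layers.append(nn.Conv2d(feats, out_channels, 3, padding=1))
    return nn.Sequential(*layers)

num_classes, num_anchors = 80, 9  # COCO classes; 9 anchors per location
cls_subnet = make_subnet(num_classes * num_anchors)  # object classification
box_subnet = make_subnet(4 * num_anchors)            # bounding-box regression
# The same two subnets (with shared weights) are applied to every FPN level.
```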