Learning to Segment Object Candidates
Pinheiro, Pedro H. O.
and
Collobert, Ronan
and
Dollár, Piotr
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords:
dblp
Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.
This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):
1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.
The current paper improves the step 1, i.e., region/object proposals.
Most object proposals approaches fall into three categories:
* Objectness scoring
* Seed Segmentation
* Superpixel Merging
Current method is different from these three.
It share similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN.
The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.
## Model and Training
Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGG-A pretrained model.
Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:
* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range
Note that the network must output a mask for a single object at the center even when multiple objects are present.
Figure 1 shows the architecture and sampling for training.
![figure1](https://i.imgur.com/zSyP0ij.png)
Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation.
## Inference
During full image inference, model is applied densely at multiple locations and scales.
This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN).
![figure2](https://i.imgur.com/dQWfy8R.png)
This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.