Facebook has [released a series of papers](https://research.facebook.com/blog/learning-to-segment/) for object segmentation and detection. This paper is the first in that series.
This is how modern object detection works (think [RCNN](https://arxiv.org/abs/1311.2524), [Fast RCNN](http://arxiv.org/abs/1504.08083)):
1. A rich set of object proposals (i.e., a set of image regions which are likely to contain an object) is generated using a fast (but possibly imprecise) algorithm.
2. A CNN classifier is applied on each of the proposals.
The current paper improves the step 1, i.e., region/object proposals.
Most object proposals approaches fall into three categories:
* Objectness scoring
* Seed Segmentation
* Superpixel Merging
Current method is different from these three.
It share similarities with [Faster R-CNN](https://arxiv.org/abs/1506.01497) in that proposals are generated using a CNN.
The method predicts a segmentation mask given an input *patch* and assigns a score corresponding to how likely the patch is to contain an object.
## Model and Training
Both mask and score predictions are achieved with a single convolutional network but with multiple outputs. All the convolutional layers except the last few are from VGG-A pretrained model.
Each training sample is a triplet of RGB input patch, the binary mask corresponding to the input patch, a label which specifies whether the patch contains an object. A patch is given label 1 only if it satisfies the following constraints:
* the patch contains an object roughly centered in the input patch
* the object is fully contained in the patch and in a given scale range
Note that the network must output a mask for a single object at the center even when multiple objects are present.
Figure 1 shows the architecture and sampling for training.
Model is then jointly trained for segmentation and objectness. Negative samples are not used for segmentation.
During full image inference, model is applied densely at multiple locations and scales.
This can be done efficiently since all computations are convolutional like in a fully convolutional network (FCN).
This approach surpasses the previous state of the art by a large margin in both box and segmentation proposal generation.