Deep Extreme Cut: From Extreme Points to Object Segmentation
Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, Luc Van Gool
arXiv e-Print archive, 2017
Keywords:
cs.CV
First published: 2017/11/24
Abstract: This paper explores the use of extreme points in an object (left-most, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos. We do so by adding an extra channel to the image in the input of a convolutional neural network (CNN), which contains a Gaussian centered in each of the extreme points. The CNN learns to transform this information into a segmentation of an object that matches those extreme points. We demonstrate the usefulness of this approach for guided segmentation (grabcut-style), interactive segmentation, video object segmentation, and dense segmentation annotation. We show that we obtain the most precise results to date, also with less user input, in an extensive and varied selection of benchmarks and datasets. All our models and code are publicly available on http://www.vision.ee.ethz.ch/~cvlsegmentation/dextr/.
This paper introduces a CNN-based method that segments an object specified by the user through its four extreme points (left-most, right-most, top, and bottom pixels), which implicitly define a bounding box. Interestingly, related work has shown that clicking the four extreme points is about 5 times faster than drawing a bounding box.
https://i.imgur.com/9GJvf17.png
The extreme points serve two purposes in this work. First, they act as a bounding box used to crop the object of interest. Second, they are used to create a heatmap with activations at the extreme-point locations: a 2D Gaussian is centered around each of the four points. The heatmap is resized to match the resized crop (512x512) and concatenated with the crop's RGB channels.
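As a concrete illustration, here is a minimal NumPy/OpenCV sketch of how the 4-channel input could be assembled from four clicked extreme points. The Gaussian standard deviation (`sigma=10`), the use of `cv2` for resizing, and the absence of any crop margin are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np
import cv2  # assumed here for resizing; any image library would do

def make_extreme_heatmap(extreme_points, shape, sigma=10.0):
    """Render a 2D Gaussian around each extreme point (sigma is an assumed value)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float32)
    for (px, py) in extreme_points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # combine the four Gaussians
    return heatmap

def build_input(image, extreme_points, out_size=512):
    """Crop around the extreme points, resize, and stack RGB + heatmap -> 4 channels."""
    xs, ys = zip(*extreme_points)  # points given as (x, y) pairs
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    crop = image[y0:y1 + 1, x0:x1 + 1]
    # shift the points into crop coordinates before rendering the heatmap
    pts = [(px - x0, py - y0) for (px, py) in extreme_points]
    heatmap = make_extreme_heatmap(pts, crop.shape[:2])
    crop = cv2.resize(crop, (out_size, out_size))
    heatmap = cv2.resize(heatmap, (out_size, out_size))
    return np.dstack([crop.astype(np.float32), heatmap])  # shape (512, 512, 4)
```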
The concatenated input of channel depth 4 is fed to the network, a ResNet-101 with the fully connected layers and the last two max-pooling layers removed. To maintain the same receptive field despite the removed strides, atrous convolutions are used. A pyramid scene parsing module from PSPNet aggregates global context. The network is trained with a cross-entropy loss in which each pixel is weighted by a normalization factor based on its class frequency, so that the dominant background class does not swamp the foreground.
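Under that reading of the loss, a minimal PyTorch sketch of a class-balanced binary cross-entropy would look as follows; the exact weighting scheme (each class weighted by the opposite class's pixel frequency) is an assumption based on the description above, not a verified reproduction of the paper's implementation.

```python
import torch
import torch.nn.functional as F

def balanced_bce_loss(logits, target):
    """Binary cross-entropy where each class is weighted by the other class's
    pixel frequency, so the (dominant) background does not swamp the loss.
    logits, target: float tensors of shape (B, 1, H, W), target values in {0, 1}.
    NOTE: the weighting scheme is an assumed interpretation of the summary above."""
    num_pos = target.sum()
    num_total = target.numel()
    w_pos = 1.0 - num_pos / num_total   # weight for foreground pixels
    w_neg = num_pos / num_total         # weight for background pixels
    weights = target * w_pos + (1 - target) * w_neg
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)
```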
How does it compare to the "Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++" paper in terms of accuracy? Specifically, if a predicted polygon is wrong, it is easy to correct the offending polygon vertices. Here, however, it is unclear how to obtain the preferred segmentation when, no matter how many (more than four) extreme points are clicked, the object of interest is still not segmented properly.