Summary by Hadrien Bertrand
The paper proposes a method for joint instance and semantic segmentation. The method is fast because it is designed to run in an embedded environment (such as a robot). While the semantic map may seem redundant given the instance map, it is not: semantic segmentation is a key step in obtaining the instance map.
# Architecture
![image](https://user-images.githubusercontent.com/8659132/63187959-24cdb380-c02e-11e9-9121-77e0923e91c6.png)
The image is first put through a typical CNN encoder (specifically a ResNet derivative), followed by 3 separate decoders. The decoder outputs are at a low resolution for faster processing.
Decoders:
- Semantic segmentation: coupled with the encoder, it's U-Net-like. The output is a segmentation map.
- Instance center: for each pixel, outputs the confidence that it is the center of an object.
- Embedding: for each pixel, computes a 32-dimensional embedding. This embedding must have a low distance to the embeddings of other pixels of the same instance, and a high distance to the embeddings of pixels from other instances.
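As a rough illustration of this layout, here is a minimal PyTorch sketch of a shared encoder feeding three task-specific heads (the encoder stand-in, channel counts, and module names are my assumptions, not the paper's):

```python
import torch
import torch.nn as nn

class ThreeHeadNet(nn.Module):
    """Shared encoder followed by three task-specific decoders (sketch)."""
    def __init__(self, num_classes, embed_dim=32):
        super().__init__()
        # Stand-in encoder; the paper uses a ResNet derivative.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, out_ch, 1),
            )
        self.semantic_head = head(num_classes)  # per-pixel class logits
        self.center_head = head(1)              # per-pixel center confidence
        self.embedding_head = head(embed_dim)   # per-pixel 32-d embedding

    def forward(self, x):
        f = self.encoder(x)
        return (self.semantic_head(f),
                torch.sigmoid(self.center_head(f)),
                self.embedding_head(f))
```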
To obtain the instance map, the segmentation map is used to mask the other 2 decoder outputs, separating the embeddings and centers of each class. Centers are thresholded at 0.7, and centers whose embeddings are closer than a set threshold to an already-kept center are discarded as duplicates.
Then for each class, a similarity matrix is computed between all pixels from that class and centers from that class. Pixels are assigned to their closest centers, which represent different instances of the class.
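A sketch of this assignment step for a single class, assuming L2 distances between embeddings (the function name, tensor layout, and duplicate threshold value are placeholders; only the 0.7 center cutoff comes from the paper):

```python
import torch

def assign_instances(sem_mask, center_conf, embeddings,
                     center_thresh=0.7, dup_thresh=1.0):
    """Pick centers within one class mask and assign pixels to the nearest one.

    sem_mask:    (H, W) bool mask of pixels predicted as this class
    center_conf: (H, W) center confidence in [0, 1]
    embeddings:  (H, W, D) per-pixel embeddings
    Returns an (H, W) long tensor of instance ids (0 = background).
    """
    instance_map = torch.zeros(sem_mask.shape, dtype=torch.long)
    # Candidate centers: confident pixels inside the class mask.
    cand_emb = embeddings[(center_conf > center_thresh) & sem_mask]  # (C, D)
    # Greedy de-duplication: drop centers whose embedding is too close
    # to an already accepted center.
    centers = []
    for e in cand_emb:
        if all(torch.norm(e - c) > dup_thresh for c in centers):
            centers.append(e)
    if not centers:
        return instance_map
    centers = torch.stack(centers)                   # (K, D)
    # Distance matrix between every class pixel and every kept center.
    pix_emb = embeddings[sem_mask]                   # (N, D)
    dists = torch.cdist(pix_emb, centers)            # (N, K)
    instance_map[sem_mask] = dists.argmin(dim=1) + 1  # ids 1..K
    return instance_map
```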
Finally, the segmentation and instance maps are upsampled using the SLIC algorithm.
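The summary doesn't detail the exact upsampling scheme; one plausible way to use SLIC superpixels for this is to majority-vote the coarse labels inside each full-resolution superpixel (a sketch; the `n_segments` value and the voting rule are my assumptions):

```python
import numpy as np
from skimage.segmentation import slic
from skimage.transform import resize

def slic_upsample(image, lowres_labels, n_segments=2000):
    """Upsample a low-res label map by voting inside SLIC superpixels."""
    h, w = image.shape[:2]
    # Nearest-neighbour upsampling of the coarse labels to full resolution.
    coarse = resize(lowres_labels, (h, w), order=0, preserve_range=True,
                    anti_aliasing=False).astype(lowres_labels.dtype)
    # Superpixels computed on the full-resolution image.
    segments = slic(image, n_segments=n_segments, start_label=0)
    out = np.empty_like(coarse)
    for s in np.unique(segments):
        mask = segments == s
        # Majority vote of the coarse labels inside each superpixel.
        vals, counts = np.unique(coarse[mask], return_counts=True)
        out[mask] = vals[counts.argmax()]
    return out
```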
# Loss
There is one loss for each decoder head.
- Semantic segmentation: weighted cross-entropy
- Instance center: a cross-entropy term modulated by a $\gamma$ parameter to counter the over-representation of the background relative to the target classes.
![image](https://user-images.githubusercontent.com/8659132/63286485-22659680-c286-11e9-9134-f1b823a34217.png)
- Embedding: composed of 3 parts: an attracting force between embeddings of the same instance, a repelling force between embeddings of different instances, and an L2 regularization on the embeddings.
![image](https://user-images.githubusercontent.com/8659132/63286399-f1856180-c285-11e9-9136-feb6c4a555e5.png)
![image](https://user-images.githubusercontent.com/8659132/63286411-fcd88d00-c285-11e9-939f-0771579d8263.png)
$\hat{e}$ are the embeddings, $\delta_a$ is a hyper-parameter defining "close enough", and $\delta_b$ defines "far enough".
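As a sketch of a loss with this structure (hinged attract/repel terms plus L2 regularization, in the style of discriminative embedding losses; the paper's exact form and weights may differ):

```python
import torch

def embedding_loss(embeddings, instance_ids, delta_a=0.5, delta_b=1.5,
                   w_reg=1e-3):
    """embeddings: (N, D) pixel embeddings; instance_ids: (N,) labels."""
    ids = instance_ids.unique()
    # Mean embedding per instance.
    means = torch.stack([embeddings[instance_ids == i].mean(0) for i in ids])
    # Attract: pull pixels to within delta_a of their instance mean.
    attract = 0.0
    for k, i in enumerate(ids):
        d = torch.norm(embeddings[instance_ids == i] - means[k], dim=1)
        attract = attract + torch.clamp(d - delta_a, min=0).pow(2).mean()
    attract = attract / len(ids)
    # Repel: push instance means at least delta_b apart.
    if len(ids) > 1:
        dmat = torch.cdist(means, means)
        off_diag = ~torch.eye(len(ids), dtype=torch.bool)
        repel = torch.clamp(delta_b - dmat[off_diag], min=0).pow(2).mean()
    else:
        repel = torch.tensor(0.0)
    # L2 regularization on the embeddings themselves.
    reg = embeddings.pow(2).sum(dim=1).mean()
    return attract + repel + w_reg * reg
```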
The whole model is trained jointly using a weighted sum of the 3 losses.
# Experiments and results
The authors test their method on the Cityscapes dataset, which is composed of 5000 annotated images and 8 instance classes. They compare their method to others on both semantic segmentation and instance segmentation.
![image](https://user-images.githubusercontent.com/8659132/63287573-a882dc80-c288-11e9-83e0-b352e43bdf28.png)
For semantic segmentation, their method is decent, though ENet, for example, performs better on average and is much faster.
![image](https://user-images.githubusercontent.com/8659132/63287643-d700b780-c288-11e9-9d40-5bcaf695a744.png)
On the other hand, for instance segmentation, their method is much faster than the others while still performing well. It is not SOTA on accuracy, but considering the real-time constraint, the trade-off is much better.
# Comments
- Most instance segmentation methods tend to be sluggish and overly complicated. This approach is much more elegant in my opinion.
- If they removed the aggressive down/up-sampling, I wonder if they would beat Mask R-CNN and PANet.
- I'm not sure what the point is of upsampling the semantic map, given that we already have the instance map.