Open-vocabulary semantic segmentation is the task of segmenting an image into semantic regions described by arbitrary text. Because the classes are specified by text descriptions, such a model can segment objects that were never seen during training.
Some works adopt two-stage methods that first generate class-agnostic segments and then use CLIP to assign each segment to a text phrase.
https://i.imgur.com/eyME6i1.png
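A minimal sketch of this two-stage pipeline in PyTorch. The callables `mask_proposal_net`, `clip_image_encoder`, and `clip_text_encoder`, the prompt template, and all shapes are illustrative assumptions, not the authors' implementation; the score ensemble described below is omitted here and only a plain argmax over CLIP similarities is shown.

```python
import torch.nn.functional as F

# Hypothetical components (assumptions, not the authors' code):
#   mask_proposal_net(image)   -> (N, H, W) class-agnostic binary masks
#   clip_image_encoder(images) -> (N, D) visual embeddings
#   clip_text_encoder(prompts) -> (K, D) text embeddings

def two_stage_segmentation(image, class_names,
                           mask_proposal_net, clip_image_encoder, clip_text_encoder):
    """Stage 1: propose class-agnostic masks; Stage 2: label each mask with CLIP."""
    masks = mask_proposal_net(image)                            # (N, H, W), values in {0, 1}
    # Blank out everything outside each mask before feeding it to CLIP
    masked_images = image.unsqueeze(0) * masks.unsqueeze(1)     # (N, 3, H, W)
    v = F.normalize(clip_image_encoder(masked_images), dim=-1)  # (N, D)
    prompts = [f"a photo of a {name}" for name in class_names]
    t = F.normalize(clip_text_encoder(prompts), dim=-1)         # (K, D)
    scores = v @ t.T                                            # cosine similarities, (N, K)
    labels = scores.argmax(dim=-1)                              # one class index per mask
    return masks, labels
```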
To produce the final prediction, they ensemble two types of prediction scores.
(1) To classify a mask into $K$ classes, first encode the $K$ class names into $K$ phrase embeddings, where the $k$-th phrase embedding is denoted $t_{k}$, and encode the mask into a visual embedding $v_{i}$. The score $p_{k}$ is then a softmax over the similarities between $v_{i}$ and each phrase embedding:

$$p_{k} = \frac{\exp\left(\sigma(v_{i}, t_{k}) / \tau\right)}{\sum_{k'=1}^{K} \exp\left(\sigma(v_{i}, t_{k'}) / \tau\right)}$$

where $\sigma(\cdot, \cdot)$ is the cosine similarity and $\tau$ is a temperature.
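A small sketch of this scoring step in PyTorch; the function name and the temperature value are illustrative assumptions, not the paper's setting.

```python
import torch.nn.functional as F

def mask_class_scores(v_i, text_embeds, temperature=0.01):
    """Softmax over cosine similarities between one mask embedding and K phrase embeddings.

    v_i:         (D,)   visual embedding of the mask
    text_embeds: (K, D) phrase embeddings t_1 .. t_K
    Returns:     (K,)   scores p_k that sum to 1
    """
    v = F.normalize(v_i, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sims = t @ v                                   # sigma(v_i, t_k) for every k, shape (K,)
    return F.softmax(sims / temperature, dim=-1)
```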
(2) Another way to classify a mask into $K$ classes is to feed the masked image into the CLIP image encoder and compare the resulting embedding with the $K$ phrase embeddings in the same way, giving a second score $p^{'}_{k}$.
Then, the final prediction is the geometric ensemble of these two scores, $p = p_{k}^{(1-\lambda)} \cdot (p^{'}_{k})^{\lambda}$, where $\lambda \in [0, 1]$.
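The ensemble itself is a one-liner; `lam` corresponds to $\lambda$ and the default value below is only illustrative.

```python
def ensemble_scores(p_k, p_prime_k, lam=0.5):
    """Geometric ensemble of the two score vectors: p = p_k**(1 - lam) * (p'_k)**lam."""
    return p_k ** (1.0 - lam) * p_prime_k ** lam
```

Setting $\lambda = 0$ trusts only the first score, while $\lambda = 1$ trusts only the CLIP score computed on the masked image.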
But CLIP does not work well on masked images (segments), because CLIP was trained on full natural images. A critical problem with masked images is that they contain blank areas; when these areas are fed into CLIP, they become zero tokens, and according to the paper, these tokens not only carry no information but also introduce a distribution shift relative to CLIP's training data.
In this work, they make CLIP work well on masked images by replacing these zero tokens with learnable tokens, which they call mask prompts.
https://i.imgur.com/muhdGxP.png
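A rough sketch of the mask prompt idea: after CLIP's ViT patch embedding, tokens whose patches fall entirely inside the blank area are swapped for learnable parameters. The class name, the per-position prompt layout, and how `patch_is_blank` is computed are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskPrompt(nn.Module):
    """Learnable tokens that stand in for the zero tokens of blank patches (a sketch)."""

    def __init__(self, num_patches, dim):
        super().__init__()
        # One learnable prompt token per patch position (assumption: per-position prompts)
        self.prompt = nn.Parameter(torch.empty(num_patches, dim))
        nn.init.normal_(self.prompt, std=0.02)

    def forward(self, patch_tokens, patch_is_blank):
        """
        patch_tokens:   (B, num_patches, dim) tokens after CLIP's patch embedding
        patch_is_blank: (B, num_patches) bool, True where the patch lies in the blank area
        """
        blank = patch_is_blank.unsqueeze(-1)   # (B, num_patches, 1) for broadcasting
        # Keep real image tokens, substitute learnable prompts for the zero tokens
        return torch.where(blank, self.prompt.expand_as(patch_tokens), patch_tokens)
```

Because only these prompt tokens are trained, CLIP's own weights can stay untouched, which is what lets the model keep its open-vocabulary knowledge while adapting to masked inputs.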