Open-vocabulary semantic segmentation is the task of segmenting an image into semantic regions described by arbitrary text. Because the classes are specified by text descriptions, such a model can segment objects that were never seen during training.
Some works adopt two-stage methods that first generate class-agnostic segments and then use CLIP to assign each segment to a text phrase.
https://i.imgur.com/eyME6i1.png
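A minimal sketch of this two-stage pipeline in PyTorch. The callables `mask_proposal_net`, `clip_image_encoder`, and `clip_text_encoder`, the prompt template, and all shapes are illustrative assumptions, not the authors' implementation; the score ensemble described below is omitted here and only a plain argmax over CLIP similarities is shown.

```python
import torch.nn.functional as F

# Hypothetical components (assumptions, not the authors' code):
#   mask_proposal_net(image)   -> (N, H, W) class-agnostic binary masks
#   clip_image_encoder(images) -> (N, D) visual embeddings
#   clip_text_encoder(prompts) -> (K, D) text embeddings

def two_stage_segmentation(image, class_names,
                           mask_proposal_net, clip_image_encoder, clip_text_encoder):
    """Stage 1: propose class-agnostic masks; Stage 2: label each mask with CLIP."""
    masks = mask_proposal_net(image)                            # (N, H, W), values in {0, 1}
    # Blank out everything outside each mask before feeding it to CLIP
    masked_images = image.unsqueeze(0) * masks.unsqueeze(1)     # (N, 3, H, W)
    v = F.normalize(clip_image_encoder(masked_images), dim=-1)  # (N, D)
    prompts = [f"a photo of a {name}" for name in class_names]
    t = F.normalize(clip_text_encoder(prompts), dim=-1)         # (K, D)
    scores = v @ t.T                                            # cosine similarities, (N, K)
    labels = scores.argmax(dim=-1)                              # one class index per mask
    return masks, labels
```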
To produce the final prediction, they ensemble two types of prediction scores.
(1) To classify a mask into $K$ classes, first encode the $K$ class names into $K$ phrase embeddings, where the $k$-th phrase embedding is denoted $t_{k}$, and encode the mask into a visual embedding $v_{i}$. The score $p_{k}$ is then a softmax over the similarities between $v_{i}$ and each phrase embedding:

$$p_{k} = \frac{\exp\left(\sigma(v_{i}, t_{k}) / \tau\right)}{\sum_{k'=1}^{K} \exp\left(\sigma(v_{i}, t_{k'}) / \tau\right)}$$

where $\sigma(\cdot, \cdot)$ is the cosine similarity and $\tau$ is a temperature.
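A small sketch of this scoring step in PyTorch; the function name and the temperature value are illustrative assumptions, not the paper's setting.

```python
import torch.nn.functional as F

def mask_class_scores(v_i, text_embeds, temperature=0.01):
    """Softmax over cosine similarities between one mask embedding and K phrase embeddings.

    v_i:         (D,)   visual embedding of the mask
    text_embeds: (K, D) phrase embeddings t_1 .. t_K
    Returns:     (K,)   scores p_k that sum to 1
    """
    v = F.normalize(v_i, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sims = t @ v                                   # sigma(v_i, t_k) for every k, shape (K,)
    return F.softmax(sims / temperature, dim=-1)
```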
(2) Another way to classify a mask into $K$ classes is to feed the masked image into the CLIP image encoder and compare the resulting embedding with the $K$ phrase embeddings in the same way, giving a second score $p^{'}_{k}$.
Then, the final prediction is the geometric ensemble of these two scores, $p = p_{k}^{(1-\lambda)} \cdot (p^{'}_{k})^{\lambda}$, where $\lambda \in [0, 1]$.
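The ensemble itself is a one-liner; `lam` corresponds to $\lambda$ and the default value below is only illustrative.

```python
def ensemble_scores(p_k, p_prime_k, lam=0.5):
    """Geometric ensemble of the two score vectors: p = p_k**(1 - lam) * (p'_k)**lam."""
    return p_k ** (1.0 - lam) * p_prime_k ** lam
```

Setting $\lambda = 0$ trusts only the first score, while $\lambda = 1$ trusts only the CLIP score computed on the masked image.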
But CLIP does not work well on masked images (segments), because CLIP was trained on full natural images. A critical problem with masked images is that they contain blank areas; when these areas are fed into CLIP, they become zero tokens, and according to the paper, these tokens not only carry no information but also introduce a distribution shift relative to CLIP's training data.
In this work, they make CLIP work well on masked images by replacing these zero tokens with learnable tokens, which they call mask prompts.
https://i.imgur.com/muhdGxP.png
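A rough sketch of the mask prompt idea: after CLIP's ViT patch embedding, tokens whose patches fall entirely inside the blank area are swapped for learnable parameters. The class name, the per-position prompt layout, and how `patch_is_blank` is computed are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskPrompt(nn.Module):
    """Learnable tokens that stand in for the zero tokens of blank patches (a sketch)."""

    def __init__(self, num_patches, dim):
        super().__init__()
        # One learnable prompt token per patch position (assumption: per-position prompts)
        self.prompt = nn.Parameter(torch.empty(num_patches, dim))
        nn.init.normal_(self.prompt, std=0.02)

    def forward(self, patch_tokens, patch_is_blank):
        """
        patch_tokens:   (B, num_patches, dim) tokens after CLIP's patch embedding
        patch_is_blank: (B, num_patches) bool, True where the patch lies in the blank area
        """
        blank = patch_is_blank.unsqueeze(-1)   # (B, num_patches, 1) for broadcasting
        # Keep real image tokens, substitute learnable prompts for the zero tokens
        return torch.where(blank, self.prompt.expand_as(patch_tokens), patch_tokens)
```

Because only these prompt tokens are trained, CLIP's own weights can stay untouched, which is what lets the model keep its open-vocabulary knowledge while adapting to masked inputs.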