Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? on ShortScience.org

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Tian Yun and Usha Bhalla and Ellie Pavlick and Chen Sun
arXiv e-Print archive - 2022 via Local arXiv
Keywords: cs.CV, cs.AI

[link] Summary by ngthanhtinqn 2 years ago

This paper proposes a way to classify images using primitive concepts such as color, shape, and texture.

The framework is simple and consists of two sub-models:

(1) The first is a pretrained vision-language (VL) model such as CLIP, ViLT, or ALBEF. It takes an image and a set of primitive (attribute) concepts as input and outputs a score for each concept.

(2) The second is a linear model that uses the concepts and their scores to perform classification. This model is trained in a supervised manner.

https://i.imgur.com/7WMmGyv.png
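
The two stages described above can be sketched in a few lines. This is only an illustration under loud assumptions: random vectors stand in for the image and text embeddings that a VL model such as CLIP would produce, cosine similarity is assumed as the concept score, and the concept names, toy classes, and helper functions (`concept_scores`, `train_linear`, `predict`) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical primitive concepts; in practice these would be text prompts.
concepts = ["red", "blue", "round", "striped"]
dim = 64

# ASSUMPTION: random vectors replace the VL model's text embeddings.
concept_embs = rng.normal(size=(len(concepts), dim))

# Toy classes composed of concepts: class 0 = red + round, class 1 = blue + striped.
class_concepts = [[0, 2], [1, 3]]
labels = np.repeat([0, 1], 20)

# ASSUMPTION: fake image embeddings = sum of the class's concept vectors + noise.
prototypes = np.stack([concept_embs[idx].sum(axis=0) for idx in class_concepts])
image_embs = prototypes[labels] + 0.1 * rng.normal(size=(len(labels), dim))


def concept_scores(image_embs, concept_embs):
    """Stage 1: cosine similarity between every image and every concept."""
    img = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    con = concept_embs / np.linalg.norm(concept_embs, axis=-1, keepdims=True)
    return img @ con.T  # shape: (n_images, n_concepts)


def train_linear(scores, labels, n_classes, lr=0.5, epochs=500):
    """Stage 2: softmax regression on concept scores via gradient descent."""
    n, d = scores.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = scores @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * scores.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(scores, W, b):
    """Pick the class with the highest linear score."""
    return (scores @ W + b).argmax(axis=1)


scores = concept_scores(image_embs, concept_embs)
W, b = train_linear(scores, labels, n_classes=2)
accuracy = (predict(scores, W, b) == labels).mean()
print("training accuracy:", accuracy)
```

With a real VL model, `image_embs` and `concept_embs` would come from its image and text encoders instead of the random generator. Because the second stage is linear, each column of `W` can be read as how strongly each primitive concept counts toward a class.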
