Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
arXiv e-Print archive - 2022 via Local arXiv
First published: 2024/02/26 (just now) Abstract: In this paper, we study whether representations of primitive concepts--such
as colors and shapes of object parts--emerge automatically within these
pretrained VL models. We propose a two-step framework, Compositional Concept
Mapping (CompMap), to investigate this. CompMap asks a VL model to generate
concept activations with text prompts from a predefined list of primitive
concepts, and then learns to construct an explicit composition model that maps
the primitive concept activations (e.g. the likelihood of black tail or red
wing) to composite concepts (e.g. a red-winged blackbird). We demonstrate that
a composition model can be designed as a set operation, and show that a
composition model is straightforward for machines to learn from ground truth
primitive concepts (as a linear classifier). We thus hypothesize that if
primitive concepts indeed emerge in a VL pretrained model, its primitive
concept activations can be used to learn a composition model similar to the one
designed by experts. We propose a quantitative metric to measure the degree of
similarity, and refer to the metric as the interpretability of the learned
primitive concept representations of VL models. We also measure the
classification accuracy when using the primitive concept activations and the
learned composition model to predict the composite concepts, and refer to it as
the usefulness metric. Our study reveals that state-of-the-art VL pretrained
models learn primitive concepts that are highly useful for fine-grained visual
recognition on the CUB dataset, and compositional generalization tasks on the
MIT-States dataset. However, we observe that the learned composition models
have low interpretability in our qualitative analyses. Our results reveal the
limitations of existing VL models, and the necessity of pretraining objectives
that encourage the acquisition of primitive concepts.
This paper proposed a way to do classification using primitive concepts such as color, shape, texture, etc.
The framework is simple, they have two sub-models:
(1) the first one is a trained VL model such as CLIP, ViLT, and ALBEF. The input of this step is the primitive concepts or let's say, attribute concepts and an image, then the output will be the scores for each concept.
(2) the second one is a linear model that uses the concepts and their scores to do classification. This model is trained in a supervised manner.