Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when the data needed to train models exists on users' devices and would require a lot of bandwidth to transfer to a centralized server.

Historically, the default way to do Federated Learning was with an algorithm called FedSGD, which worked by:

- Sending a copy of the current model to each device/client
- Calculating a gradient update to be applied on top of that current model, given a batch of data sampled from the client's device
- Sending that gradient back to the central server
- Averaging those gradients and applying them all at once to a central model

The authors note that this approach is equivalent to one where a single device performs a step of gradient descent locally, sends the resulting *model* back to the central server, and the server performs model averaging by averaging the parameter vectors there. Given that, and given their observation that, in federated learning, communication of gradients and models is generally much more costly than the computation itself (since the computation happens across so many machines), they ask whether the communication required to get to a certain accuracy could be better optimized by performing multiple steps of gradient calculation and update on a given device, before sending the resulting model back to a central server to be averaged with other clients' models. Specifically, their algorithm, FedAvg, works by:

- Dividing the data on a given device into batches of size B
- Calculating an update on each batch and applying them sequentially to the starting model sent over the wire from the server
- Repeating this for E epochs

(A minimal sketch of this client/server loop appears at the end of this summary.)

Conceptually, this should work perfectly well in the world where data from each batch is IID - independently drawn from the same distribution. But that is especially unlikely to be true in the case of federated learning, where a given user and device might cover very specialized parts of the data space, and prior work has shown that there exist pathological cases where averaged models can perform worse than either model independently, even *when* the IID condition is met.

The authors empirically test whether these sorts of pathological cases arise by simulating a federated learning procedure over MNIST and over a language model trained on Shakespeare, sweeping over a range of hyperparameters (specifically B and E), and testing the case where data is heavily non-IID (in their case: where different "devices" had non-overlapping sets of digits).

https://i.imgur.com/xq9vi8S.png

They show that, in both the IID and non-IID settings, they are able to reach their target accuracy, and are able to do so with many fewer rounds of communication than are required by FedSGD (where an update is sent over the wire, and a model sent back, for each round of calculation done on the device). The authors argue that this shows the practical usefulness of a Federated Learning approach that does more computation on individual devices before updating, even in the face of theoretical pathological cases.
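Here is a minimal, self-contained sketch of the FedAvg round structure described above, using a toy linear-regression model and made-up data. The model, data, and hyperparameter values are illustrative assumptions, not the paper's setup; the point is the structure: local SGD for E epochs with batch size B, then parameter averaging weighted by each client's data size.

```python
# Sketch of FedAvg on a toy linear-regression problem (illustrative only).
import numpy as np

def local_update(w, X, y, B, E, lr):
    """Run E epochs of mini-batch SGD (batch size B) on one client's data, starting from w."""
    w = w.copy()
    n = len(X)
    for _ in range(E):
        order = np.random.permutation(n)
        for start in range(0, n, B):
            idx = order[start:start + B]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # squared-error gradient
            w -= lr * grad
    return w

def fedavg_round(w_global, clients, B=10, E=5, lr=0.005):
    """One communication round: broadcast w_global, train locally, average weighted by data size."""
    updates, sizes = [], []
    for X, y in clients:                      # each element is one device's local dataset
        updates.append(local_update(w_global, X, y, B, E, lr))
        sizes.append(len(X))
    weights = np.array(sizes) / sum(sizes)
    return sum(wk * uk for wk, uk in zip(weights, updates))

# Toy usage: three "devices" with differently-shifted input distributions (non-IID-ish).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for shift in (0.0, 1.0, 5.0):
    X = rng.normal(shift, 1.0, size=(100, 2))
    clients.append((X, X @ true_w + rng.normal(0, 0.1, size=100)))

w = np.zeros(2)
for _ in range(20):                           # 20 communication rounds
    w = fedavg_round(w, clients)
print(w)                                      # should move toward true_w
```

Note how the server never sees raw data, only locally-trained parameter vectors; FedSGD would instead ship one gradient per round, requiring many more rounds of communication for the same progress.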
In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symbols, to use them in an effective way.

This paper suggests the straightforward and clever approach of, instead of just having agents communicate using arbitrary vectors produced as part of a policy, having those communication vectors be directly linked to the content of an agent's observations. Specifically, this is done by taking the encoding of the image that is used for making policy decisions, and passing that encoding through an autoencoder, using the bottleneck at the middle of the autoencoder as the communication vector sent to other agents. This structure incentivizes the agent to generate communication vectors that are intrinsically grounded in the observation, enforcing a certain level of consistency that the authors hope makes it easier for the other agent to follow and interpret the communication.

https://i.imgur.com/u9OAZm8.png

Empirically, there seems to be fairly compelling evidence that this autoencoder-based form of grounding is more stable, and thus more mutually learnable, than learning from RL alone. The authors even found that adding RL training to the autoencoder-based training deteriorated performance.
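Roughly, the mechanism could look like the following PyTorch-style sketch. The layer sizes, the single-linear-layer encoder/decoder, and detaching the reconstruction target are my own assumptions for illustration, not the paper's exact architecture; the essential idea is that the message is the autoencoder bottleneck over the same observation encoding the policy uses.

```python
# Sketch of autoencoder-grounded communication (illustrative architecture, not the paper's).
import torch
import torch.nn as nn

class GroundedCommAgent(nn.Module):
    def __init__(self, obs_dim, enc_dim=128, msg_dim=16, n_actions=5):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, enc_dim), nn.ReLU())
        # Autoencoder over the observation encoding; the bottleneck is the outgoing message.
        self.ae_encoder = nn.Linear(enc_dim, msg_dim)
        self.ae_decoder = nn.Linear(msg_dim, enc_dim)
        # Policy conditions on the agent's own encoding plus the other agent's message.
        self.policy = nn.Linear(enc_dim + msg_dim, n_actions)

    def forward(self, obs, incoming_msg):
        h = self.obs_encoder(obs)
        msg = self.ae_encoder(h)                          # communication vector sent to the other agent
        recon = self.ae_decoder(msg)
        ae_loss = ((recon - h.detach()) ** 2).mean()      # reconstruction loss grounds the message
        logits = self.policy(torch.cat([h, incoming_msg], dim=-1))
        return logits, msg, ae_loss

# Hypothetical usage: each step, agents exchange bottleneck messages.
agent = GroundedCommAgent(obs_dim=64)
obs = torch.randn(8, 64)                                  # batch of (flattened) observations
logits, msg, ae_loss = agent(obs, incoming_msg=torch.zeros(8, 16))
```

Because the message is trained to reconstruct the observation encoding rather than to maximize reward, it stays informative and consistent even early in training, which is what makes it easier for the listener to learn to interpret it.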
This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. The basic premise is:

- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, use a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding.
- Run all of these embeddings through a Transformer with shared weights for all three modalities
- Take the final projected CLS representation for each of the video patches, and perform contrastive learning against both an aligned audio patch and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with pair-specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text

One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just says: "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence-length cost?" This obviously leads to some performance cost, but they found it not very dramatic.

Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.
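The DropToken idea is simple enough that a short sketch may help. The tensor shapes, drop rate, and the choice to keep the surviving tokens in their original order are my own assumptions for illustration.

```python
# Sketch of DropToken: randomly drop a fraction of input tokens to cut the quadratic attention cost.
import torch

def drop_tokens(tokens, drop_rate=0.5):
    """tokens: (batch, seq_len, dim). Keep a random subset of token positions per example."""
    batch, seq_len, dim = tokens.shape
    n_keep = max(1, int(seq_len * (1.0 - drop_rate)))
    # Sample, per example, which positions to keep (without replacement), then restore order.
    keep_idx = torch.rand(batch, seq_len).argsort(dim=1)[:, :n_keep]
    keep_idx = keep_idx.sort(dim=1).values
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))

# Hypothetical usage on a batch of video patch embeddings.
tokens = torch.randn(2, 196, 768)
kept = drop_tokens(tokens, drop_rate=0.5)
print(kept.shape)                       # torch.Size([2, 98, 768])
```

Halving the sequence length roughly quarters the self-attention cost, which is why the accuracy hit the authors report can be a worthwhile trade at high input resolutions.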
This work expands on prior techniques for designing models that can both be stored using fewer parameters, and also execute using fewer operations and less memory, both of which are key desiderata for having trained machine learning models be usable on phones and other personal devices. The main contribution of the original MobileNets paper was to introduce the idea of using "factored" decompositions of Depthwise and Pointwise convolutions, which separate the procedures of "pull information from a spatial range" and "mix information across channels" into two distinct steps. In this paper, they continue to use this basic Depthwise infrastructure, but also add a new design element: the inverted-residual linear bottleneck.

The reasoning behind this new layer type comes from the observation that, often, the set of relevant points in a high-dimensional space (such as the 'per-pixel' activations inside a conv net) actually lives on a lower-dimensional manifold. So, theoretically, and naively, one could just try to use lower-dimensional internal representations that match the dimensionality of that assumed manifold. However, the authors argue that ReLU non-linearities kill information (because of the region where all inputs are mapped to zero), and so having layers contain only the number of dimensions needed for the manifold would mean that you end up with too few dimensions after the ReLU information loss. However, you need to have non-linearities somewhere in the network in order to be able to learn complex, non-linear functions. So, the authors suggest a method to mostly use smaller-dimensional representations internally, but still maintain ReLUs and the network's needed complexity.

https://i.imgur.com/pN4d9Wi.png

- A lower-dimensional output is "projected up" into a higher-dimensional space
- A ReLU is applied on this higher-dimensional layer
- That layer is then projected down into a smaller-dimensional layer, which uses a linear activation to avoid information loss
- A residual connection is added between the lower-dimensional representations at the beginning and end of the expansion

This way, we still maintain the network's non-linearity, but also replace some of the network's higher-dimensional layers with lower-dimensional linear ones.
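For concreteness, here is a rough PyTorch sketch of a stride-1 inverted-residual block along the lines described above. The expansion factor, ReLU6 activations, and the depthwise 3x3 convolution in the expanded space follow the general MobileNetV2 design, but treat the exact sizes here as illustrative assumptions rather than the paper's configuration.

```python
# Sketch of an inverted-residual linear-bottleneck block (stride-1, equal-channel case).
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1) Project the low-dimensional input up to a higher-dimensional space
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                       # 2) non-linearity applied in the wide space
            # Depthwise 3x3 conv in the expanded space (the MobileNet "spatial" step)
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3) Project back down with a *linear* activation (no ReLU) to avoid information loss
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # 4) Residual connection between the narrow input and narrow output
        return x + self.block(x)

# Hypothetical usage.
block = InvertedResidual(channels=32)
x = torch.randn(1, 32, 56, 56)
print(block(x).shape)                                     # torch.Size([1, 32, 56, 56])
```

The "inverted" part is that the residual connects the narrow ends of the block rather than the wide middle, so the expensive high-dimensional tensors never need to be carried between blocks.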
I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.

Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background context, the previously-canonical Masked Language Model (MLM) task works by:

- Replacing some percentage of tokens with a [MASK] indicator
- Using the final-layer representation at the locations of those [MASK]s to predict the true input token
- Using as a training signal the maximum likelihood of that prediction, i.e. how high the model's predicted probability is on the true input

The ELECTRA authors argue that there are a few notable disadvantages to this structure, if your goal is to train useful representations for downstream tasks. Firstly, your loss only consists of information (i.e. the true token) from the tokens you randomly masked, so a good amount of the data goes in some sense unused (except as context). Secondly, learning a full generative model of language requires a lot of data and training time, and it may not be all that beneficial for performance on your downstream tasks of interest. As an alternative, they propose (sketched in code after this summary):

- Co-learning a (small) generator, trained in typical MLM fashion, alongside a discriminator. Randomly select tokens from the input to replace with fake tokens drawn from the distribution of the generator
- The goal of the discriminator is to distinguish the true tokens from the fake ones. (Minor note: if the generator happens to get lucky and generate the real token, that's counted as a "real" rather than "fake" token, even though it was generated by a generator.) This uses more of the training data in the loss, since you can ask "real or fake" for every token in the input data, not (obviously) just the ones that are actually fake
- An important note for those familiar with GANs is that the generator isn't trained to confuse the discriminator (as is GAN-standard), but is simply trained with its own maximum likelihood loss, independent of the discriminator's performance

They argue, and show fairly convincingly, that ELECTRA is able to reach a higher efficiency-to-performance trade-off curve compared to BERT - matching the performance of previous models with notably less training, and outperforming them with comparable amounts of training.

They go on to perform a few ablations, some of which felt more convincing than others. The most confusing ablation, which I'm not sure if I just misunderstood, was meant to ask how much of the value of ELECTRA came from calculating its loss over all the tokens in the training data, rather than just the masked ones. So, they tried just calculating the loss for the masked/replaced tokens. The resulting discriminator performs very poorly downstream. But I find this a little odd as a design choice, since couldn't the discriminator learn to almost always predict that a replaced token was fake, given that the only way it could be otherwise would be if the generator got lucky and produced the true word? They also did the (more sensible, to me) experiment of calculating the loss on a similarly-sized percentage of tokens, but not fully overlapping with the replacement mask, and that performed more similarly to base ELECTRA.
They also tested training a combined MLM/ELECTRA loss, where generated tokens were used in lieu of masking, and the full-sized MLM generator predicts the true token at every point in the sequence (which could be the token it gets as input, or could not be, in the case of a replacement). That model performed more similarly to ELECTRA than to BERT, which suggests that the efficiency gain of calculating a loss on every element in the training set was more important in practice than the gain from focusing a discriminator more directly on what was valuable for downstream tasks, rather than generating.
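As promised above, here is a condensed sketch of the replaced-token-detection setup. It assumes `generator(token_ids)` returns per-token vocabulary logits and `discriminator(token_ids)` returns per-token real/fake logits; the mask id, mask probability, and loss weighting are illustrative placeholders, not the paper's exact values.

```python
# Sketch of one ELECTRA-style training step (interfaces and hyperparameters are assumptions).
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, input_ids, mask_prob=0.15, mask_id=103):
    # 1) Mask a random subset of positions and train the generator with an ordinary MLM loss.
    masked_pos = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    gen_input = input_ids.masked_fill(masked_pos, mask_id)
    gen_logits = generator(gen_input)                     # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(gen_logits[masked_pos], input_ids[masked_pos])

    # 2) Sample replacement tokens from the generator (no gradient flows through the samples,
    #    so the generator is never trained adversarially against the discriminator).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked_pos, sampled, input_ids)

    # 3) The discriminator predicts, for *every* token, whether it matches the original input.
    #    A lucky sample that equals the original token is labeled "real".
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                # (batch, seq) real/fake logits
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    # Generator trains only on its MLM loss; the relative weighting is a hyperparameter.
    return mlm_loss + 50.0 * disc_loss
```

The key efficiency point from the summary is visible in step 3: the binary loss is computed over every position in the sequence, not just the ~15% that were masked.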