BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models on ShortScience.org

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li and Dongxu Li and Silvio Savarese and Steven Hoi
arXiv e-Print archive - 2023 via Local arXiv
Keywords: cs.CV

Summary by ngthanhtinqn 2 years ago

This paper proposes a way to leverage frozen pre-trained image encoders and frozen Large Language Models (LLMs) for vision-language (VL) tasks such as VQA and image captioning.

To build a good VL model, the modality gap between vision and language must be reduced. The paper proposes the Q-Former, a Transformer module that is trained in two stages: first with a frozen image encoder only, then with the same frozen image encoder together with a frozen Large Language Model.

https://i.imgur.com/rQ3V3oQ.png
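
To make the idea of a module sitting on top of a frozen image encoder more concrete, here is a minimal sketch, assuming the Q-Former reads the image through a small set of learnable query vectors that cross-attend to the frozen image features. This is a toy, single-head attention without learned projections, not the paper's implementation; `cross_attend`, `DIM`, `NUM_QUERIES` and `NUM_PATCHES` are hypothetical names and sizes chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, image_feats):
    """Single-head cross-attention: each query becomes a weighted
    average of the image features (keys and values are the same here,
    which is a simplification)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in image_feats
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, image_feats))
            for j in range(dim)
        ])
    return out


random.seed(0)
DIM = 8           # toy feature width (hypothetical)
NUM_QUERIES = 4   # toy number of query vectors (hypothetical)
NUM_PATCHES = 16  # toy number of image patches (hypothetical)

# In training, the queries would be learned; the image features would
# come from the frozen image encoder and receive no gradient updates.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
image_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

visual_tokens = cross_attend(queries, image_feats)
print(len(visual_tokens), len(visual_tokens[0]))  # prints: 4 8
```

The point of the sketch is that the number of output vectors depends only on the number of queries, not on how many patches the image encoder produces, so the downstream language side always sees a short, fixed-length visual summary.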

The Q-Former is trained in two stages for the following reasons:

(1) In the first stage, it is trained with the frozen image encoder so that it learns to extract the most informative visual features.

https://i.imgur.com/gshAy1p.png

(2) In the second stage, it is trained with the frozen LLM so that it learns the visual features that are most closely related to the textual features, in a form the language model can use.

https://i.imgur.com/gPz40GC.png
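
The two stages above can be sketched as a simple freezing schedule. This is only an illustration of which parts are updated in each stage, as described in this summary; the `Module` class, `configure_stage` function and module names are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Stand-in for a network component (hypothetical helper)."""
    name: str
    trainable: bool = True


def configure_stage(stage, image_encoder, q_former, llm):
    """Return the modules used in `stage`, marking frozen ones."""
    image_encoder.trainable = False   # frozen in both stages
    q_former.trainable = True         # the only part being trained here
    if stage == 1:
        return [image_encoder, q_former]
    if stage == 2:
        llm.trainable = False         # the language model is frozen too
        return [image_encoder, q_former, llm]
    raise ValueError(f"unknown stage: {stage}")


image_encoder = Module("image_encoder")
q_former = Module("q_former")
llm = Module("llm")

for stage in (1, 2):
    modules = configure_stage(stage, image_encoder, q_former, llm)
    used = [m.name for m in modules]
    trained = [m.name for m in modules if m.trainable]
    print(f"stage {stage}: uses {used}, trains {trained}")
```

In a framework such as PyTorch, marking a module as frozen would typically correspond to setting `requires_grad = False` on its parameters, so that only the Q-Former receives gradient updates.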
