[link]
This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can perform instruction following, VQA, IQ testing, visual dialogue, etc. https://i.imgur.com/9P3Vuse.png https://i.imgur.com/HcYtbdD.png The model's inputs are image-caption pairs and interleaved image-text data. https://i.imgur.com/LL4HiM3.png The input data is fed into an embedding module that encodes it into vectors, and these vectors are passed to a Transformer decoder, which predicts the next token based on the previous context. I think this large model's downside is that it can only generate text, not images.
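A minimal PyTorch sketch of the pipeline described above (not the authors' implementation): image and text inputs are embedded into one interleaved token sequence, and a causally-masked Transformer stack predicts the next token from the previous context. All class names, dimensions, and the 2048-dim image features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InterleavedMLLMSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=768, n_heads=12, n_layers=12, img_feat_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # stand-in for a vision encoder: projects pre-extracted image features into token space
        self.image_proj = nn.Linear(img_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # a causally-masked encoder stack behaves as a decoder-only language model
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        txt = self.text_embed(text_ids)          # (B, T_txt, d_model)
        img = self.image_proj(image_feats)       # (B, T_img, d_model)
        seq = torch.cat([img, txt], dim=1)       # interleaved sequence (images first here)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.decoder(seq, mask=mask)         # each position sees only previous context
        return self.lm_head(h)                   # (B, T, vocab_size) next-token logits

model = InterleavedMLLMSketch()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 4, 2048))
print(logits.shape)  # torch.Size([1, 20, 32000])
```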
[link]
This paper presents a way to leverage pre-trained vision and language models for vision-language (VL) tasks such as VQA and image captioning. To obtain a good VL model, the modality gap must be reduced. The authors propose the Q-Former, a Transformer module that is trained first with a frozen image encoder, then with both the frozen image encoder and a frozen Large Language Model. https://i.imgur.com/rQ3V3oQ.png The Q-Former is trained in two stages because: (1) training with the frozen image encoder teaches it to extract the most informative visual features https://i.imgur.com/gshAy1p.png and (2) training with the frozen language model teaches it to extract the visual features most relevant to the text. https://i.imgur.com/gPz40GC.png
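A rough sketch of the Q-Former idea (assumed names and sizes, not the BLIP-2 code): a small set of learnable query tokens cross-attends to features from the frozen image encoder, and the resulting query outputs are what get projected into the frozen language model's input space. The two-stage training described above changes the losses, not this basic architecture.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    def __init__(self, n_queries=32, d_model=768, n_heads=12, n_layers=6, llm_dim=4096):
        super().__init__()
        # learnable query tokens that will "summarize" the image for the LLM
        self.queries = nn.Parameter(torch.randn(1, n_queries, d_model) * 0.02)
        self.self_attn = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.to_llm = nn.Linear(d_model, llm_dim)  # projection into the frozen LLM's space

    def forward(self, frozen_image_feats):          # (B, N_patches, d_model) from a frozen ViT
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        for sa, ca in zip(self.self_attn, self.cross_attn):
            q = sa(q)                                            # queries attend to each other
            q, _ = ca(q, frozen_image_feats, frozen_image_feats)  # ...and to the image features
        return self.to_llm(q)                        # (B, n_queries, llm_dim) soft prompts

image_feats = torch.randn(2, 257, 768)   # e.g. frozen ViT patch features (assumed shape)
soft_prompts = QFormerSketch()(image_feats)
print(soft_prompts.shape)                # torch.Size([2, 32, 4096])
```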
[link]
This paper aims to learn a sparse semantic representation of texts and images instead of the dense representations learned by CLIP or ALIGN. The sparse embeddings are built as follows: (1) For an input (image or text), a Transformer extracts features $h$, where $h_{j}$ corresponds to the $j$-th token of the input. (2) Each token feature $h_{j}$ is mapped to $p(h_{j})$ in the vocabulary space $V$ by a mapping function (in this paper, the BERT Masked Language Model (MLM) head), so $p(h_{j})$ assigns a weight to every token in the vocabulary $V$. (3) Max pooling over the sequence is applied to the $p(h_{j})$ to get a single value per vocabulary token. In the end, we have a sparse vector living in $V$-dimensional space. https://i.imgur.com/BTvndLR.png

Training: to achieve the two goals of (1) aligning text and images in the sparse embedding space and (2) grounding the sparse vector in human-understandable words from the vocabulary, they propose a 3-stage training procedure. Stage 1: train the image embedding with masked tokens. Both the image and text encoders are co-trained, and a binary mask is applied to the text embedding. By matching against the masked text embedding, the image encoder learns to ground its image embedding on the tokens of the paired text, so after stage 1 the image embedding lives in the vocabulary's interpretable space. Stage 2: train with a frozen image encoder. This stage grounds the text embedding in the same interpretable space the image embedding learned in stage 1; the key idea is to let the image encoder teach the text encoder as a teacher model. After stage 2, both image and text embeddings lie in the same human-interpretable embedding space constructed from the vocabulary. Stage 3: fine-tune both encoders jointly to boost image-text matching performance. https://i.imgur.com/PWrEbkk.png

To further encourage sparsity, they propose a FLOPs regularization loss so that only a small number of token weights in $V$ are non-zero.
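A minimal sketch of steps (1)-(3) and the FLOPs loss, under assumed shapes and names (the log1p/ReLU saturation is the common SPLADE-style recipe and is an assumption here, not necessarily this paper's exact formulation): per-token hidden states $h_j$ are mapped into vocabulary space by an MLM-style head, max-pooled over the sequence into one $V$-dimensional vector, and regularized toward sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, d_model = 30522, 768        # e.g. BERT vocabulary size and hidden size (assumption)

class SparseHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlm_head = nn.Linear(d_model, V)   # stand-in for the BERT MLM head p(.)

    def forward(self, hidden_states):           # (B, T, d_model) from an image or text Transformer
        token_logits = self.mlm_head(hidden_states)     # (B, T, V): p(h_j) for every position j
        # keep weights non-negative and dampen large activations before pooling
        weights = torch.log1p(F.relu(token_logits))
        sparse_emb = weights.max(dim=1).values          # max-pool over j -> one value per vocab token
        return sparse_emb                               # (B, V) sparse embedding

def flops_regularizer(sparse_emb):
    # FLOPs loss: sum over vocabulary dimensions of the squared mean activation,
    # pushing most dimensions toward zero across the batch
    return (sparse_emb.mean(dim=0) ** 2).sum()

h = torch.randn(4, 12, d_model)        # per-token features for a batch of 4 inputs
emb = SparseHead()(h)
loss = flops_regularizer(emb)
print(emb.shape, loss.item())          # torch.Size([4, 30522]) ...
```

With both modalities mapped through the same vocabulary-sized head, image-text similarity can be computed as a simple dot product of two sparse $V$-dimensional vectors, which is what makes the representation both interpretable and cheap to search.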