Language Is Not All You Need: Aligning Perception with Language Models
Owais Khan Mohammed
arXiv e-Print archive, 2023
First published: 2023/09/24

Abstract: A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can perform instruction following, visual question answering (VQA), IQ testing, visual dialogue, and other tasks.
The model's training input consists of image-caption pairs and arbitrarily interleaved sequences of images and text (plus plain text data); a sketch of the interleaved format is shown below.
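For concreteness, here is a minimal sketch of what one interleaved training example might look like as a flat token stream. The `<s>`, `</s>`, `<image>`, and `</image>` delimiters follow the format described in the paper; everything else (the word-level tokenization, the placeholder string) is an illustrative assumption:

```python
# One interleaved training example as a flat sequence. The <image> ... </image>
# delimiters mark where encoded image embeddings are spliced into the text
# stream; "[img emb]" stands in for the actual image embedding vectors.
interleaved_example = [
    "<s>",
    "A", "web", "page", "may", "mix", "text", "and", "pictures:",
    "<image>", "[img emb]", "</image>",
    "and", "the", "text", "simply", "continues", "afterwards.",
    "</s>",
]
```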
The input is first passed through an embedding module: text tokens are mapped to vectors via an embedding table, while images are encoded into vectors by a vision encoder. The resulting vector sequence is fed into a Transformer decoder, which predicts the next token from the preceding context (a toy end-to-end sketch is given below).
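To make the data flow concrete, below is a toy PyTorch sketch of this pipeline, not the paper's actual implementation: image features are projected into the same embedding space as text token embeddings, concatenated into one sequence, and run through a causally masked Transformer that outputs next-token logits. All module choices and sizes here are illustrative assumptions; the real model uses a pretrained CLIP-style vision encoder and a much larger decoder backbone.

```python
import torch
import torch.nn as nn

class ToyMultimodalDecoder(nn.Module):
    """Toy version of the pipeline: embed text tokens and image features into
    one vector sequence, run a causally masked Transformer, and predict the
    next token at every position. Sizes are illustrative, not the paper's."""

    def __init__(self, vocab_size=32000, d_model=512, img_feat_dim=768,
                 n_heads=8, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in for the vision encoder: project precomputed image
        # features into the decoder's embedding space.
        self.image_proj = nn.Linear(img_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # A TransformerEncoder with a causal mask behaves as a decoder-only LM.
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, image_feats):
        # token_ids: (batch, text_len); image_feats: (batch, img_len, img_feat_dim)
        text_vecs = self.token_emb(token_ids)
        image_vecs = self.image_proj(image_feats)
        # Simplified: place the image embeddings before the text. In the real
        # interleaved format they can appear anywhere in the sequence.
        seq = torch.cat([image_vecs, text_vecs], dim=1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.backbone(seq, mask=causal_mask)
        return self.lm_head(hidden)  # next-token logits for every position

model = ToyMultimodalDecoder()
logits = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 4, 768))
print(logits.shape)  # torch.Size([1, 20, 32000])
```

Because the whole sequence is trained with a single next-token objective, the same decoder handles captioning, VQA, and dialogue simply by conditioning on different mixes of image and text context.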
I think this model's main downside is that it can only generate text; it cannot generate images.