VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong
arXiv e-Print archive - 2021
Keywords:
cs.CV, cs.AI, cs.LG, cs.MM, eess.IV
Abstract:
We present a framework for learning multimodal representations from unlabeled
data using convolution-free Transformer architectures. Specifically, our
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts
multimodal representations that are rich enough to benefit a variety of
downstream tasks. We train VATT end-to-end from scratch using multimodal
contrastive losses and evaluate its performance by the downstream tasks of
video action recognition, audio event classification, image classification, and
text-to-video retrieval. Furthermore, we study a modality-agnostic
single-backbone Transformer by sharing weights among the three modalities. We
show that the convolution-free VATT outperforms state-of-the-art ConvNet-based
architectures in the downstream tasks. In particular, VATT's vision Transformer
achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, and
41.1% on Moments in Time, setting new records while avoiding supervised
pre-training. Transferring to image classification yields 78.7% top-1 accuracy on
ImageNet, compared to 64.7% when training the same Transformer from scratch,
showing the generalizability of our model despite the domain gap between videos
and images.
VATT's audio Transformer also sets a new record on waveform-based audio event
recognition, achieving an mAP of 39.4% on AudioSet without any supervised
pre-training. VATT's source code is publicly available.
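
To make the training setup concrete, below is a minimal sketch (not the authors' released code) of how tokens from the three modalities could be encoded by Transformers and aligned with pairwise contrastive losses, as the abstract describes. The names (ModalityEncoder, info_nce), the projection dimension, the temperature, and the mean-pooling step are illustrative assumptions; VATT's actual tokenization, loss variants, and hyperparameters are given in the paper.

```python
# Minimal sketch (assumptions, not VATT's exact implementation):
# three tokenized streams -> Transformer encoder(s) -> common embedding space,
# trained with InfoNCE-style losses on video-audio and video-text pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Projects pre-tokenized inputs of one modality and encodes them with a Transformer."""

    def __init__(self, input_dim, d_model=512, num_layers=4, nhead=8, backbone=None):
        super().__init__()
        # Stands in for the modality-specific patch/waveform/word tokenization.
        self.tokenizer = nn.Linear(input_dim, d_model)
        if backbone is None:  # per-modality backbone (default setting)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            backbone = nn.TransformerEncoder(layer, num_layers)
        self.backbone = backbone  # pass a shared module for the modality-agnostic variant
        self.proj = nn.Linear(d_model, 256)  # projection into the common contrastive space

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        h = self.backbone(self.tokenizer(x))   # (batch, seq_len, d_model)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)  # mean-pool, then L2-normalize


def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of aligned embeddings."""
    logits = z_a @ z_b.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))    # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    video = torch.randn(8, 32, 768)   # dummy video tokens
    audio = torch.randn(8, 48, 128)   # dummy waveform tokens
    text = torch.randn(8, 16, 300)    # dummy word embeddings

    enc_v, enc_a, enc_t = ModalityEncoder(768), ModalityEncoder(128), ModalityEncoder(300)
    zv, za, zt = enc_v(video), enc_a(audio), enc_t(text)

    loss = info_nce(zv, za) + info_nce(zv, zt)  # video-audio and video-text pairs
    loss.backward()
    print(loss.item())
```

For the modality-agnostic single-backbone variant mentioned in the abstract, one shared TransformerEncoder instance could be passed as the backbone to all three ModalityEncoder objects, so that the Transformer weights are reused across modalities while the tokenizers stay modality-specific.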