This paper designs a generalized multimodal architecture, UNITER, intended to solve a wide range of vision-and-language tasks.
Concretely, the authors pre-train the model on four main tasks, Masked Language Modeling (MLM), Image-Text Matching (ITM), Word-Region Alignment (WRA), and Masked Region Modeling (MRM), and evaluate it on various downstream tasks, including Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and NLVR.
As shown in Fig 1, UNITER first encodes image regions (visual features and bounding-box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder.
Then, a Transformer module is applied to learn generalizable contextualized embeddings for each region and each word.
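A minimal sketch of this two-embedder design (all dimensions, weight matrices, and the bounding-box feature layout here are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64            # shared embedding dimension (illustrative)
N_REGIONS = 4     # detected image regions
N_TOKENS = 6      # word-piece tokens

# Image Embedder: project visual features (e.g. region features from a
# detector) and bounding-box features into the common space, then sum.
visual_feats = rng.normal(size=(N_REGIONS, 2048))
bbox_feats = rng.normal(size=(N_REGIONS, 7))      # assumed 7-d box encoding
W_v = rng.normal(size=(2048, D)) * 0.02           # hypothetical projection
W_b = rng.normal(size=(7, D)) * 0.02
region_emb = visual_feats @ W_v + bbox_feats @ W_b

# Text Embedder: token embeddings plus learned position embeddings.
token_ids = rng.integers(0, 30000, size=N_TOKENS)
tok_table = rng.normal(size=(30000, D)) * 0.02
pos_table = rng.normal(size=(512, D)) * 0.02
word_emb = tok_table[token_ids] + pos_table[np.arange(N_TOKENS)]

# The two sequences live in the same D-dim space and are concatenated,
# so a Transformer can attend jointly over regions and words.
joint_input = np.concatenate([region_emb, word_emb], axis=0)
print(joint_input.shape)  # (10, 64)
```

The key point is that after the two embedders, regions and words are interchangeable sequence elements for the Transformer's self-attention.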
The contributions are two-fold:
(1) Masked language/region modeling is conditioned on the full observation of the other modality (the image for MLM, the text for MRM), rather than applying joint random masking to both modalities.
(2) A novel WRA pre-training task is introduced, using Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions.
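The conditional masking in (1) can be illustrated with a toy sketch (the helper below is hypothetical, not the paper's code): in each pre-training step, positions are masked in only one modality while the other stays fully observed.

```python
import random

random.seed(0)

def conditional_masks(n_tokens, n_regions, mask_text, p=0.15):
    """Illustrative helper: mask positions in ONE modality only.

    mask_text=True  -> MLM-style step (words masked, regions fully visible)
    mask_text=False -> MRM-style step (regions masked, words fully visible)
    """
    text_mask = [mask_text and random.random() < p for _ in range(n_tokens)]
    region_mask = [(not mask_text) and random.random() < p
                   for _ in range(n_regions)]
    return text_mask, region_mask

# MLM step: some words may be masked, no region ever is.
t_mask, r_mask = conditional_masks(8, 5, mask_text=True)
assert not any(r_mask)

# MRM step: some regions may be masked, no word ever is.
t_mask, r_mask = conditional_masks(8, 5, mask_text=False)
assert not any(t_mask)
```

Under joint random masking, by contrast, both lists could contain masked positions in the same step, so a masked word might have to be predicted without its corresponding (also masked) region.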
Intuitively, OT-based learning aims to optimize distribution matching by minimizing the cost of transporting one distribution to another. In this context, the goal is to minimize the cost of transporting the embeddings from image regions to words in a sentence (and vice versa), thus optimizing toward better cross-modal alignment.
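A sketch of this OT objective, using standard entropic-regularization (Sinkhorn) iterations on a cosine-distance cost matrix between word and region embeddings; this is an illustrative stand-in, not the paper's exact solver, and all sizes and hyperparameters below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sinkhorn_cost(words, regions, eps=0.1, n_iters=100):
    """Approximate OT cost between word and region embeddings
    via Sinkhorn iterations with entropic regularization eps."""
    # Cosine-distance cost matrix, entries in [0, 2].
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    C = 1.0 - w @ r.T                      # (n_words, n_regions)

    # Uniform marginals: each word / region carries equal mass.
    a = np.full(len(words), 1.0 / len(words))
    b = np.full(len(regions), 1.0 / len(regions))

    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):               # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]        # transport plan, sums to 1
    return float((T * C).sum())            # transport cost (WRA-style loss)

words = rng.normal(size=(6, 64))
regions = rng.normal(size=(4, 64))
print(sinkhorn_cost(words, regions))       # scalar transport cost
```

Minimizing this cost pushes each word's embedding toward the region(s) the plan couples it with, which is what "fine-grained alignment" means here: the transport plan itself acts as a soft word-region matching.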