Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
arXiv e-Print archive, 2017
Keywords: cs.CV
First published: 2017/07/25
Abstract: Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, improving the best published result in terms of
CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the
broad applicability of the method, applying the same approach to VQA we obtain
first place in the 2017 VQA Challenge.
This paper addresses two tasks: Image Captioning and VQA.
The main idea is to represent the image with Faster R-CNN region features (kx2048, one 2048-dim vector per bounding box) instead of a ResNet grid (14x14x2048), and to apply top-down attention over the k region vectors (a minimal sketch follows).
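A minimal PyTorch sketch of this idea (names and sizes are illustrative, not from the paper's code). Both feature types reduce to "a set of 2048-dim vectors"; the only difference is where the set comes from: a 14x14 ResNet grid (196 vectors) vs. k Faster R-CNN region proposals.

```python
import torch

batch, k, dim = 2, 36, 2048

grid_feats = torch.randn(batch, 2048, 14, 14)        # ResNet grid features
grid_as_set = grid_feats.flatten(2).transpose(1, 2)  # -> (batch, 196, 2048)

region_feats = torch.randn(batch, k, dim)            # Faster R-CNN region features

def top_down_attention(features, query, w_f, w_q, w_a):
    """Soft attention over a set of feature vectors, conditioned on a query."""
    # features: (batch, n, dim), query: (batch, q_dim)
    joint = torch.tanh(features @ w_f + (query @ w_q).unsqueeze(1))  # (batch, n, hid)
    logits = joint @ w_a                                             # (batch, n)
    alpha = torch.softmax(logits, dim=1)                             # attention weights
    return (alpha.unsqueeze(2) * features).sum(dim=1)                # (batch, dim)

hid = 512
w_f = torch.randn(dim, hid) * 0.01
w_q = torch.randn(1024, hid) * 0.01
w_a = torch.randn(hid) * 0.01
query = torch.randn(batch, 1024)   # e.g. a question embedding or LSTM state

attended = top_down_attention(region_feats, query, w_f, w_q, w_a)
print(attended.shape)  # torch.Size([2, 2048])
```

The same attention function works on either set; the paper's point is that the k object-level regions are a more natural basis for attention than a uniform grid.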
For **VQA**, the model is essentially Faster R-CNN features plus the Show, Ask, Attend, and Answer (SAAA) architecture. SAAA computes a 2D attention map from the concatenation of a text vector (2048-dim, from an LSTM over the question) and an image tensor (2048x14x14 from ResNet); the image tensor can be thought of as a collection of 2048-dim feature vectors. This paper instead uses Faster R-CNN to obtain k bounding boxes, each represented by a 2048-dim vector, so the image becomes a kx2048 set of features that is fed to the same SAAA-style attention (see the sketch after the figures below).
**SAAA**:
https://i.imgur.com/2FnPXi0.png
**This paper (VQA)**:
https://i.imgur.com/xib77Iy.png
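A rough PyTorch sketch of the VQA pipeline described above, assuming Faster R-CNN region features are precomputed. Layer sizes, the ReLU MLPs, and the class name are illustrative; the paper itself uses gated tanh layers and other details omitted here.

```python
import torch
import torch.nn as nn

class RegionAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(10000, 300)             # word embeddings
        self.lstm = nn.LSTM(300, q_dim, batch_first=True) # question encoder
        # SAAA-style attention: score each region from [region ; question]
        self.att = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(), nn.Linear(hid, n_answers))

    def forward(self, regions, question_tokens):
        # regions: (batch, k, 2048) Faster R-CNN features
        # question_tokens: (batch, T) token ids
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                          # (batch, q_dim)
        k = regions.size(1)
        q_tiled = q.unsqueeze(1).expand(-1, k, -1)         # (batch, k, q_dim)
        logits = self.att(torch.cat([regions, q_tiled], dim=2)).squeeze(2)
        alpha = torch.softmax(logits, dim=1)               # attention over k regions
        v = (alpha.unsqueeze(2) * regions).sum(dim=1)      # (batch, 2048)
        return self.classifier(torch.cat([v, q], dim=1))   # answer scores

model = RegionAttentionVQA()
regions = torch.randn(4, 36, 2048)
tokens = torch.randint(0, 10000, (4, 14))
print(model(regions, tokens).shape)  # torch.Size([4, 3000])
```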
For **Image Captioning**, the model uses a two-layer LSTM. The first (attention) LSTM takes the mean-pooled image feature (the average of the k 2048-dim vectors), together with the previous word embedding and the previous output of the second LSTM; its output is used to compute attention weights over the k region vectors. The second (language) LSTM takes the attended (weighted-average) 2048-dim image feature and the output of the first layer, and predicts the next word (see the sketch after the figure).
https://i.imgur.com/GeXaC30.png
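A rough PyTorch sketch of this two-layer (attention LSTM + language LSTM) decoder. Dimensions, the single-step interface, and the class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class UpDownCaptioner(nn.Module):
    def __init__(self, img_dim=2048, emb_dim=300, hid=512, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        # Layer 1: attention LSTM sees [h2_prev ; mean image feature ; word]
        self.att_lstm = nn.LSTMCell(hid + img_dim + emb_dim, hid)
        # Soft attention over the k region features, conditioned on h1
        self.att_v = nn.Linear(img_dim, hid)
        self.att_h = nn.Linear(hid, hid)
        self.att_out = nn.Linear(hid, 1)
        # Layer 2: language LSTM sees [attended image feature ; h1]
        self.lang_lstm = nn.LSTMCell(img_dim + hid, hid)
        self.logits = nn.Linear(hid, vocab)

    def step(self, regions, word, state):
        # regions: (batch, k, 2048); word: (batch,) previous token ids
        (h1, c1), (h2, c2) = state
        v_mean = regions.mean(dim=1)                               # (batch, 2048)
        h1, c1 = self.att_lstm(
            torch.cat([h2, v_mean, self.embed(word)], dim=1), (h1, c1))
        scores = self.att_out(torch.tanh(
            self.att_v(regions) + self.att_h(h1).unsqueeze(1))).squeeze(2)
        alpha = torch.softmax(scores, dim=1)                       # (batch, k)
        v_hat = (alpha.unsqueeze(2) * regions).sum(dim=1)          # (batch, 2048)
        h2, c2 = self.lang_lstm(torch.cat([v_hat, h1], dim=1), (h2, c2))
        return self.logits(h2), ((h1, c1), (h2, c2))               # next-word scores

batch, k, hid = 4, 36, 512
model = UpDownCaptioner()
regions = torch.randn(batch, k, 2048)
state = ((torch.zeros(batch, hid), torch.zeros(batch, hid)),
         (torch.zeros(batch, hid), torch.zeros(batch, hid)))
word = torch.zeros(batch, dtype=torch.long)  # <bos> token
scores, state = model.step(regions, word, state)
print(scores.shape)  # torch.Size([4, 10000])
```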