Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang
arXiv e-Print archive, 2017
Keywords: cs.CV
First published: 2017/07/25
Abstract: Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, improving the best published result in terms of
CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the
broad applicability of the method, applying the same approach to VQA we obtain
first place in the 2017 VQA Challenge.
This paper addresses two tasks: Image Captioning and VQA.
The main idea is to represent the image with Faster R-CNN region features (kx2048, one 2048-dim vector per bounding box) instead of a ResNet grid (14x14x2048), and to apply top-down attention over the k region vectors (a minimal sketch follows).
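A minimal PyTorch sketch of this idea (names and sizes are illustrative, not from the paper's code). Both feature types reduce to "a set of 2048-dim vectors"; the only difference is where the set comes from: a 14x14 ResNet grid (196 vectors) vs. k Faster R-CNN region proposals.

```python
import torch

batch, k, dim = 2, 36, 2048

grid_feats = torch.randn(batch, 2048, 14, 14)        # ResNet grid features
grid_as_set = grid_feats.flatten(2).transpose(1, 2)  # -> (batch, 196, 2048)

region_feats = torch.randn(batch, k, dim)            # Faster R-CNN region features

def top_down_attention(features, query, w_f, w_q, w_a):
    """Soft attention over a set of feature vectors, conditioned on a query."""
    # features: (batch, n, dim), query: (batch, q_dim)
    joint = torch.tanh(features @ w_f + (query @ w_q).unsqueeze(1))  # (batch, n, hid)
    logits = joint @ w_a                                             # (batch, n)
    alpha = torch.softmax(logits, dim=1)                             # attention weights
    return (alpha.unsqueeze(2) * features).sum(dim=1)                # (batch, dim)

hid = 512
w_f = torch.randn(dim, hid) * 0.01
w_q = torch.randn(1024, hid) * 0.01
w_a = torch.randn(hid) * 0.01
query = torch.randn(batch, 1024)   # e.g. a question embedding or LSTM state

attended = top_down_attention(region_feats, query, w_f, w_q, w_a)
print(attended.shape)  # torch.Size([2, 2048])
```

The same attention function works on either set; the paper's point is that the k object-level regions are a more natural basis for attention than a uniform grid.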
For **VQA**, the model is essentially Faster R-CNN features plus the Show, Ask, Attend, and Answer (SAAA) architecture. SAAA computes a 2D attention map from the concatenation of a text vector (2048-dim, from an LSTM over the question) and an image tensor (2048x14x14 from ResNet); the image tensor can be thought of as a collection of 2048-dim feature vectors. This paper instead uses Faster R-CNN to obtain k bounding boxes, each represented by a 2048-dim vector, so the image becomes a kx2048 set of features that is fed to the same SAAA-style attention (see the sketch after the figures below).
**SAAA**:
https://i.imgur.com/2FnPXi0.png
**This paper (VQA)**:
https://i.imgur.com/xib77Iy.png
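A rough PyTorch sketch of the VQA pipeline described above, assuming Faster R-CNN region features are precomputed. Layer sizes, the ReLU MLPs, and the class name are illustrative; the paper itself uses gated tanh layers and other details omitted here.

```python
import torch
import torch.nn as nn

class RegionAttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(10000, 300)             # word embeddings
        self.lstm = nn.LSTM(300, q_dim, batch_first=True) # question encoder
        # SAAA-style attention: score each region from [region ; question]
        self.att = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(), nn.Linear(hid, n_answers))

    def forward(self, regions, question_tokens):
        # regions: (batch, k, 2048) Faster R-CNN features
        # question_tokens: (batch, T) token ids
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                          # (batch, q_dim)
        k = regions.size(1)
        q_tiled = q.unsqueeze(1).expand(-1, k, -1)         # (batch, k, q_dim)
        logits = self.att(torch.cat([regions, q_tiled], dim=2)).squeeze(2)
        alpha = torch.softmax(logits, dim=1)               # attention over k regions
        v = (alpha.unsqueeze(2) * regions).sum(dim=1)      # (batch, 2048)
        return self.classifier(torch.cat([v, q], dim=1))   # answer scores

model = RegionAttentionVQA()
regions = torch.randn(4, 36, 2048)
tokens = torch.randint(0, 10000, (4, 14))
print(model(regions, tokens).shape)  # torch.Size([4, 3000])
```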
For **Image Captioning**, the model uses a two-layer LSTM. The first (attention) LSTM takes the mean-pooled image feature (the average of the k 2048-dim vectors), together with the previous word embedding and the previous output of the second LSTM; its output is used to compute attention weights over the k region vectors. The second (language) LSTM takes the attended (weighted-average) 2048-dim image feature and the output of the first layer, and predicts the next word (see the sketch after the figure).
https://i.imgur.com/GeXaC30.png
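A rough PyTorch sketch of this two-layer (attention LSTM + language LSTM) decoder. Dimensions, the single-step interface, and the class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class UpDownCaptioner(nn.Module):
    def __init__(self, img_dim=2048, emb_dim=300, hid=512, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        # Layer 1: attention LSTM sees [h2_prev ; mean image feature ; word]
        self.att_lstm = nn.LSTMCell(hid + img_dim + emb_dim, hid)
        # Soft attention over the k region features, conditioned on h1
        self.att_v = nn.Linear(img_dim, hid)
        self.att_h = nn.Linear(hid, hid)
        self.att_out = nn.Linear(hid, 1)
        # Layer 2: language LSTM sees [attended image feature ; h1]
        self.lang_lstm = nn.LSTMCell(img_dim + hid, hid)
        self.logits = nn.Linear(hid, vocab)

    def step(self, regions, word, state):
        # regions: (batch, k, 2048); word: (batch,) previous token ids
        (h1, c1), (h2, c2) = state
        v_mean = regions.mean(dim=1)                               # (batch, 2048)
        h1, c1 = self.att_lstm(
            torch.cat([h2, v_mean, self.embed(word)], dim=1), (h1, c1))
        scores = self.att_out(torch.tanh(
            self.att_v(regions) + self.att_h(h1).unsqueeze(1))).squeeze(2)
        alpha = torch.softmax(scores, dim=1)                       # (batch, k)
        v_hat = (alpha.unsqueeze(2) * regions).sum(dim=1)          # (batch, 2048)
        h2, c2 = self.lang_lstm(torch.cat([v_hat, h1], dim=1), (h2, c2))
        return self.logits(h2), ((h1, c1), (h2, c2))               # next-word scores

batch, k, hid = 4, 36, 512
model = UpDownCaptioner()
regions = torch.randn(batch, k, 2048)
state = ((torch.zeros(batch, hid), torch.zeros(batch, hid)),
         (torch.zeros(batch, hid), torch.zeros(batch, hid)))
word = torch.zeros(batch, dtype=torch.long)  # <bos> token
scores, state = model.step(regions, word, state)
print(scores.shape)  # torch.Size([4, 10000])
```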