TLDR; The authors use an attention mechanism in image caption generation, allowing the decoder RNN to focus on specific parts of the image. In order to find the correspondence between words and image patches, the RNN attends over a lower convolutional layer of the CNN (before pooling). The authors propose both a "hard" attention mechanism (trained using sampling methods) and a "soft" attention mechanism (trained end-to-end), and show qualitatively that the decoder focuses on sensible regions while generating text, adding an additional layer of interpretability to the model. The attention-based models achieve state of the art on Flickr8k, Flickr30k and MS COCO.

#### Key Points

- To find the correspondence between words and image regions, attend over a lower convolutional layer of the CNN.
- Two attention mechanisms: soft and hard. Depending on the evaluation metric (BLEU vs. METEOR), one or the other performs better.
- The largest dataset (MS COCO) takes 3 days to train on a Titan Black GPU, using the Oxford VGG network as the encoder.
- Soft attention is the same as in seq2seq models (a small sketch follows these notes).
- Attention weights are visualized by upsampling them to the image size and applying a Gaussian filter.

#### Notes/Questions

- Would've liked to see an explanation of when/how soft vs. hard attention does better.
- What is the computational overhead of using the attention mechanism? Is it significant?
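As promised above, here is a minimal NumPy sketch of the seq2seq-style soft attention these notes refer to. It is not the paper's code: the 196x512 annotation shape follows the VGG feature map described below, but the scoring MLP, its hidden size (256) and all weights are hypothetical stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes: 196 annotation vectors (14x14 flattened), each 512-d,
# and a decoder hidden state from the previous timestep.
a = np.random.randn(196, 512)        # CNN annotation vectors a_i
h_prev = np.random.randn(1000)       # previous decoder hidden state

# f_att: a small MLP scoring each location given the previous hidden state
# (weights here are random stand-ins, not the paper's parameters).
W_a = np.random.randn(512, 256)
W_h = np.random.randn(1000, 256)
w = np.random.randn(256)
scores = np.tanh(a @ W_a + h_prev @ W_h) @ w   # one scalar per location, shape (196,)

alpha = softmax(scores)               # attention weights over locations, sum to 1
z = (alpha[:, None] * a).sum(axis=0)  # soft context vector: expected annotation, shape (512,)
```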
---
# Summary

The authors present a way to generate captions describing the content of images using attention-based mechanisms. They present two ways of training the network: one via standard backpropagation techniques and another using stochastic (sampling-based) methods. They also show how their model can selectively "focus" on the relevant parts of an image to generate appropriate captions, as shown in the classic example of the famous woman throwing a frisbee. Finally, they validate their model on Flickr8k, Flickr30k and MS COCO.

![image](https://user-images.githubusercontent.com/18450628/61397054-10639300-a897-11e9-8b4a-f4cd804c3229.png)

# Model

At a very high level, the model takes as input an image I and returns a caption generated from a pre-defined vocabulary:

![image](https://user-images.githubusercontent.com/18450628/61398513-20c93d00-a89a-11e9-8e93-72ccf7a61be1.png)

A high-level overview of the model is presented in Figure 1:

![image](https://user-images.githubusercontent.com/18450628/61398365-de076500-a899-11e9-8413-55ec755f0f83.png)

## Visual extractor

A CNN is used to extract features from the image. The authors experimented with VGG-19 pretrained on ImageNet and not finetuned. They use the features from the last convolutional layer as their representations. Starting from 224x224 images, the last CNN feature map has shape 14x14x512, which they flatten along width and height to obtain a 196x512 representation. These 196 vectors, each of dimension 512, are used as inputs to the language model.

## Sentence generation

An LSTM network is used to generate a sequence of words from a fixed vocabulary of size L. At each timestep, the LSTM receives a weighted sum of the flattened image feature vectors, with the weights given by the attention values, along with the previously generated word. The hidden state from the previous timestep and the CNN feature vectors a_i are fed through an MLP followed by a softmax, producing one attention value per flattened image feature vector; these values sum to one.

![image](https://user-images.githubusercontent.com/18450628/61462206-41e46900-a940-11e9-991d-e3a9e4b98837.png)

![image](https://user-images.githubusercontent.com/18450628/61462544-c9ca7300-a940-11e9-8c31-dbf85bf8301f.png)

The authors propose two ways to compute phi, i.e. the attention, which they refer to as "soft attention" and "hard attention". These are covered in a later section. The context vector z, together with the LSTM hidden state, is then fed to a deep output network to generate the next word. This is detailed in the following figure.

![image](https://user-images.githubusercontent.com/18450628/61408594-6132b600-a8ae-11e9-894c-392396e299b0.png)

## Attention

The paper proposes two methods of attention, a "soft" attention and a "hard" attention.

### Soft attention

Soft attention is the more intuitive of the two and is relatively straightforward. The vector **z** representing the image, used as input to the LSTM, is the expectation of the context vector, computed as a weighted average:

![image](https://user-images.githubusercontent.com/18450628/61408939-34cb6980-a8af-11e9-989e-24308be3ed3c.png)

where alpha are the attention weights and a_i are the vectors of the feature representation. To encourage all image features to be used somewhat equally, a regularization term is added to the loss function:

![image](https://user-images.githubusercontent.com/18450628/61412057-d2c23280-a8b5-11e9-9d9c-7f35edc650ef.png)

This pushes the attention weights at each image location to sum to 1 over time, so that no part of the image is ignored.
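To make the regularization concrete, here is a minimal sketch (not the authors' code) of how such a penalty could be added to the loss, assuming attention weights collected over T decoding steps and a hypothetical penalty coefficient `lam`:

```python
import numpy as np

def doubly_stochastic_penalty(alphas, lam=1.0):
    """alphas: array of shape (T, 196) -- attention weights over T decoding steps.
    The softmax already makes each row sum to 1; this penalty additionally pushes
    each column (image location) to receive total weight close to 1 across time."""
    per_location_total = alphas.sum(axis=0)          # shape (196,)
    return lam * ((1.0 - per_location_total) ** 2).sum()

# Hypothetical usage: add the penalty to the captioning loss.
alphas = np.random.dirichlet(np.ones(196), size=20)  # 20 timesteps of attention weights
loss = 0.0                                           # stand-in for the negative log-likelihood
loss = loss + doubly_stochastic_penalty(alphas, lam=1.0)
```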
### Hard attention

The authors propose an alternative method to compute the attention. Each attention weight is treated as an intermediate latent variable that is one-hot encoded, i.e. each image location is either attended to or not. To do so, they use a multinoulli distribution parametrized by alpha, the softmax output of f_att. They show how the gradient can be approximated using Monte Carlo methods (a rough sketch of the sampling step is given at the end of these notes):

![image](https://user-images.githubusercontent.com/18450628/61413158-d1463980-a8b8-11e9-8aef-b2a9d9bb6bad.png)

Refer to the paper for more mathemagical details. Finally, when training with hard attention, they fall back to soft attention (the expected context vector) with probability 0.5.

## Visualizing features

One of the contributions of this work is showing what the network is "attending" to. To do so, the authors take the 14x14 attention map that corresponds to the final convolutional layer of VGG-19 (14x14x512 features), upsample it to the original image size of 224x224 and apply a Gaussian blur to approximate the receptive fields.

## Results

The authors evaluate their method on 3 captioning datasets: Flickr8k, Flickr30k and MS COCO. For all experiments, they used a fixed vocabulary size of 10,000. They report both BLEU and METEOR scores on the task. As can be seen in the following figure, both the soft and hard attention mechanisms beat all state-of-the-art methods at the time of publishing. Hard attention outperformed soft attention most of the time.

![image](https://user-images.githubusercontent.com/18450628/61412281-77dd0b00-a8b6-11e9-882f-73e86638bc9d.png)

# Comments

Cool paper which led the way in terms of combining text and images and using attention mechanisms. They show an interesting way to visualize what the network is attending to, although it was not clear to me why the final layer of a CNN that was not finetuned on the captioning datasets should be expected to show this in the first place. I would expect this to mean that their method works best on datasets most "similar" to ImageNet. Their hard attention mechanism is a lot more complicated than the soft attention mechanism, and it isn't always clear that it is much better, other than offering stronger regularization and a form of dropout.
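As referenced in the hard attention section, here is a minimal sketch (not the authors' implementation) of the sampling step. It assumes the same 196 attention weights and annotation vectors as the earlier sketches; the baseline, entropy term and other variance-reduction tricks described in the paper are deliberately omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in values: attention distribution over 196 locations and their annotation vectors.
alpha = rng.dirichlet(np.ones(196))     # softmax output of f_att
a = rng.standard_normal((196, 512))     # annotation vectors a_i

def sample_hard_contexts(alpha, a, n_samples=5):
    """Monte Carlo version of the context: sample locations s ~ Multinoulli(alpha)
    and pick the corresponding annotation vectors, instead of averaging them
    (soft attention). Gradients w.r.t. the attention parameters would then be
    estimated with a REINFORCE-style rule, log p(caption | z) * grad log alpha[s],
    plus the baseline and entropy terms from the paper (omitted here)."""
    locations = rng.choice(len(alpha), size=n_samples, p=alpha)
    contexts = a[locations]             # shape (n_samples, 512)
    return locations, contexts

locations, contexts = sample_hard_contexts(alpha, a)

# With probability 0.5 the paper falls back to the soft (expected) context instead:
if rng.random() < 0.5:
    z = alpha @ a                       # expected context vector, shape (512,)
else:
    z = contexts[0]                     # a sampled hard context
```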