DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson, Andrej Karpathy, Li Fei-Fei
arXiv e-Print archive, 2015
Keywords:
cs.CV, cs.LG
First published: 2015/11/24
Abstract: We introduce the dense captioning task, which requires a computer vision
system to both localize and describe salient regions in images in natural
language. The dense captioning task generalizes object detection when the
descriptions consist of a single word, and Image Captioning when one predicted
region covers the full image. To address the localization and description task
jointly we propose a Fully Convolutional Localization Network (FCLN)
architecture that processes an image with a single, efficient forward pass,
requires no external region proposals, and can be trained end-to-end with a
single round of optimization. The architecture is composed of a Convolutional
Network, a novel dense localization layer, and a Recurrent Neural Network
language model that generates the label sequences. We evaluate our network on
the Visual Genome dataset, which comprises 94,000 images and 4,100,000
region-grounded captions. We observe both speed and accuracy improvements over
baselines based on current state of the art approaches in both generation and
retrieval settings.
This paper introduces the task of dense captioning and proposes
a network architecture that processes an image, produces region descriptions
in a single pass, and can be trained end-to-end. Main contributions:
- Dense captioning
- Generalizes object detection (each caption consists of a single word)
and image captioning (a single region covers the whole image).
- Fully convolutional localization network (FCLN)
- Fully differentiable, so it can be trained jointly with the rest of the network.
- Consists of a region proposal network, box regression (similar to Faster R-CNN),
and bilinear interpolation (similar to Spatial Transformer Networks) for
sampling; a sketch of the box-offset decoding follows this list.
- Network details
- Convolutional features are extracted for the image.
- For each element in the feature map, k anchor boxes of different aspect ratios
are selected in the input image space.
- For each of these, the localization layer predicts offsets and confidence.
- The region proposals are projected onto the convolutional feature map, and a sampling
grid maps each output feature location back to input coordinates (bilinear sampling).
- The sampled region features are passed through an MLP to compute a representation
for each region.
- These region codes are fed (in a batch) as the first input to an LSTM (as in Show
and Tell), which is trained to predict the caption word by word; see the forward-pass
sketch below.
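
The box regression uses the Faster R-CNN-style parameterization: an anchor's center is shifted by a fraction of the anchor size, and its width/height are scaled in log space. A minimal NumPy sketch of that decoding step (function and variable names are mine, not from the released code):

```python
import numpy as np

def decode_boxes(anchors, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to anchors given as (xc, yc, w, h)."""
    xa, ya, wa, ha = anchors.T
    tx, ty, tw, th = offsets.T
    x = xa + tx * wa           # shift the center by a fraction of the anchor size
    y = ya + ty * ha
    w = wa * np.exp(tw)        # scale width and height in log space
    h = ha * np.exp(th)
    return np.stack([x, y, w, h], axis=1)

# e.g. one 128x128 anchor centered at (64, 64), nudged right and widened
print(decode_boxes(np.array([[64., 64., 128., 128.]]),
                   np.array([[0.1, 0.0, 0.2, 0.0]])))
```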
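To make the data flow above concrete, here is a schematic PyTorch sketch of the forward pass under simplifying assumptions: the backbone is a single placeholder convolution rather than VGG-16, layer sizes are illustrative, and the step that turns anchors plus offsets into proposals and sampling grids is omitted and passed in as `grids`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCLNSketch(nn.Module):
    """Data-flow sketch: CNN features -> localization layer -> bilinearly
    sampled region features -> MLP region codes -> LSTM caption logits."""

    def __init__(self, k=12, feat_dim=256, code_dim=512, hidden=512, vocab=10000):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, 3, stride=16, padding=1)  # stand-in CNN
        self.loc_head = nn.Conv2d(feat_dim, 256, 3, padding=1)
        self.box_offsets = nn.Conv2d(256, 4 * k, 1)   # (tx, ty, tw, th) per anchor
        self.box_scores = nn.Conv2d(256, k, 1)        # confidence per anchor
        self.mlp = nn.Sequential(nn.Linear(feat_dim * 7 * 7, code_dim), nn.ReLU())
        self.embed = nn.Embedding(vocab, code_dim)
        self.lstm = nn.LSTM(code_dim, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab)

    def forward(self, image, grids, captions):
        # image: (1, 3, H, W); grids: (B, 7, 7, 2) sampling grids, one per region;
        # captions: (B, T) word indices for the B region captions.
        feats = F.relu(self.backbone(image))                 # (1, C, H', W')
        h = F.relu(self.loc_head(feats))
        offsets, scores = self.box_offsets(h), self.box_scores(h)
        regions = F.grid_sample(feats.expand(grids.size(0), -1, -1, -1),
                                grids, align_corners=False)  # bilinear sampling
        codes = self.mlp(regions.flatten(1))                 # one code per region
        # The region code is the first LSTM input, followed by the caption words.
        seq = torch.cat([codes.unsqueeze(1), self.embed(captions)], dim=1)
        out, _ = self.lstm(seq)
        return self.to_vocab(out), offsets, scores
```

At test time, the highest-scoring proposals (after non-maximum suppression) would be the regions that get sampled and captioned.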
## Strengths
- Fully differentiable 'spatial attention' mechanism (bilinear interpolation)
in place of the RoI pooling used in Faster R-CNN; see the gradient sketch after
this list.
- RoI pooling is not differentiable with respect to the input proposal coordinates.
- Fast, and impressive qualitative results.
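
A toy PyTorch check (my own example, using `F.grid_sample` as the bilinear sampler rather than the paper's original Torch code) showing that gradients flow back to the box coordinates through bilinear sampling:

```python
import torch
import torch.nn.functional as F

# Toy check that bilinear sampling is differentiable with respect to the box
# coordinates, unlike RoI pooling. The (xc, yc, w, h) box in normalized
# [-1, 1] coordinates is my own convention for this example.
feats = torch.randn(1, 64, 32, 32)
box = torch.tensor([0.0, 0.0, 0.5, 0.5], requires_grad=True)

# Build a 7x7 sampling grid covering the box, in grid_sample's (x, y) order.
lin = torch.linspace(-1, 1, 7)
xs = lin.view(1, 7).expand(7, 7)   # x varies along columns
ys = lin.view(7, 1).expand(7, 7)   # y varies along rows
grid = torch.stack([box[0] + xs * box[2] / 2,
                    box[1] + ys * box[3] / 2], dim=-1).unsqueeze(0)  # (1, 7, 7, 2)

region = F.grid_sample(feats, grid, align_corners=False)  # (1, 64, 7, 7)
region.sum().backward()
print(box.grad)  # non-zero: a downstream loss can adjust the box through the sampler
```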
## Weaknesses / Notes
The model is largely a careful engineering combination of components from prior
work (Faster R-CNN + Spatial Transformer Networks + Show & Tell).