DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson, Andrej Karpathy, and Li Fei-Fei
arXiv e-Print archive, 2015
Keywords: cs.CV, cs.LG
First published: 2015/11/24
Abstract: We introduce the dense captioning task, which requires a computer vision
system to both localize and describe salient regions in images in natural
language. The dense captioning task generalizes object detection when the
descriptions consist of a single word, and image captioning when one predicted
region covers the full image. To address the localization and description task
jointly we propose a Fully Convolutional Localization Network (FCLN)
architecture that processes an image with a single, efficient forward pass,
requires no external region proposals, and can be trained end-to-end with a
single round of optimization. The architecture is composed of a Convolutional
Network, a novel dense localization layer, and a Recurrent Neural Network
language model that generates the label sequences. We evaluate our network on
the Visual Genome dataset, which comprises 94,000 images and 4,100,000
region-grounded captions. We observe both speed and accuracy improvements over
baselines based on current state-of-the-art approaches in both generation and
retrieval settings.
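
The three-stage design the abstract describes (convolutional backbone, dense localization layer, RNN language model) can be made concrete with a short sketch. The PyTorch code below is a minimal, hypothetical illustration, not the authors' implementation: the layer sizes, the fixed region count, and the global-pooled stand-in for the paper's differentiable region sampler are all assumptions made for brevity.

```python
# Hypothetical sketch of the FCLN data flow: CNN backbone -> localization
# layer (box regression stand-in) -> LSTM language model. Sizes and the
# region-pooling shortcut are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class DenseCapSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hidden_dim=512, num_regions=4):
        super().__init__()
        # Convolutional backbone: a single forward pass over the full image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.num_regions = num_regions
        # Stand-in "localization layer": regress a fixed set of boxes.
        # (The paper instead proposes boxes densely over the feature map
        # and samples a per-box feature with bilinear interpolation.)
        self.box_head = nn.Linear(feat_dim, 4 * num_regions)
        # A region code conditions the RNN language model.
        self.region_proj = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.word_logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, R, T) token ids, R regions each.
        feats = self.backbone(images)               # (B, C, H', W')
        pooled = feats.mean(dim=(2, 3))             # (B, C) global descriptor
        boxes = self.box_head(pooled).view(-1, self.num_regions, 4)
        region_code = self.region_proj(pooled)      # (B, hidden)
        B, R, T = captions.shape
        emb = self.embed(captions.reshape(B * R, T))  # (B*R, T, hidden)
        # Seed the LSTM's hidden state with the (shared) region code.
        h0 = region_code.repeat_interleave(R, dim=0).unsqueeze(0)
        out, _ = self.lstm(emb, (h0, torch.zeros_like(h0)))
        return boxes, self.word_logits(out).view(B, R, T, -1)

model = DenseCapSketch()
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 1000, (2, 4, 5))
boxes, logits = model(images, captions)
print(boxes.shape, logits.shape)  # (2, 4, 4) and (2, 4, 5, 1000)
```

In the actual FCLN, the localization layer proposes boxes densely from the feature map and crops a feature per box via bilinear interpolation, which is what keeps the whole pipeline differentiable and trainable in a single round of optimization; the mean-pooled shortcut above only preserves the overall data flow.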