DeViSE: A Deep Visual-Semantic Embedding Model
Frome, Andrea and Corrado, Gregory S. and Shlens, Jonathon and Bengio, Samy and Dean, Jeffrey and Ranzato, Marc'Aurelio and Mikolov, Tomas
Neural Information Processing Systems Conference - 2013
This computer vision paper uses an unsupervised, neural-network-based semantic embedding of a Wikipedia text corpus, trained with skip-gram coding, to enhance the performance of the Krizhevsky et al. deep network \cite{krizhevsky2012imagenet} that won the 2012 ImageNet large-scale visual recognition challenge, particularly on zero-shot learning problems (i.e. previously unseen classes with some similarity to previously seen ones). The two networks are trained separately; the output layer of \cite{krizhevsky2012imagenet} is then replaced with a linear mapping into the semantic text representation and re-trained on ImageNet 1k using a dot-product ranking loss reminiscent of a structured-output SVM. The text representation itself is not re-trained. The model is tested on ImageNet 1k and 21k. With the semantic-embedding output it does not quite reproduce the ImageNet 1k flat-class hit rates of the original softmax-output model, but it beats the original on hierarchical-class hit rates and on previously unseen classes from ImageNet 21k. For unseen classes the improvements are modest in absolute terms, albeit somewhat larger in relative ones.
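For concreteness, here is a minimal sketch of that ranking loss in PyTorch. The function and tensor names, the shapes, and the margin value are illustrative assumptions, not the authors' code:

```python
import torch

def devise_hinge_loss(visual_feats, proj, label_embs, labels, margin=0.1):
    """Hinge rank loss in the spirit of DeViSE (a sketch, not the paper's code).

    visual_feats: (B, D) core visual model outputs
    proj:         (D, E) trainable linear map into the word-embedding space
    label_embs:   (K, E) fixed skip-gram embeddings of the K training labels
    labels:       (B,)   ground-truth class indices
    """
    mapped = visual_feats @ proj                   # (B, E) images mapped into text space
    scores = mapped @ label_embs.t()               # (B, K) dot-product similarities
    true = scores.gather(1, labels.unsqueeze(1))   # (B, 1) similarity to the true label
    viol = (margin - true + scores).clamp(min=0)   # per-label margin violations
    # Zero out the true-label term so only wrong labels contribute to the loss.
    mask = torch.ones_like(scores).scatter_(1, labels.unsqueeze(1), 0.0)
    return (viol * mask).sum(dim=1).mean()
```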
The approach consists of the following steps:
1. Learn an embedding of a large number of words in a Euclidean space (a skip-gram sketch follows the list).
2. Learn a deep architecture that takes images as input and predicts one of 1,000 object categories; the 1,000 categories are a subset of the 'large number of words' from step (1).
3. Remove the last layer of the visual model, leaving what is referred to as the 'core' visual model. Replace it with the word embeddings and add a trainable layer that maps the core visual model's output into the word-embedding space (see the sketches below).
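A sketch of step 1, using gensim's skip-gram implementation as a stand-in for the paper's own trainer; the corpus filename is hypothetical and the hyperparameters only approximate the reported settings (500-dimensional vectors, a wide context window):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical corpus: one pre-tokenised Wikipedia sentence per line.
sentences = LineSentence("wikipedia_tokens.txt")

# sg=1 selects the skip-gram objective.
w2v = Word2Vec(sentences, vector_size=500, sg=1, window=20,
               min_count=5, workers=4)

label_embedding = w2v.wv["tiger"]  # 500-d vector for a class-label term
```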
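And a sketch of steps 2-3 plus zero-shot prediction in PyTorch, with torchvision's AlexNet standing in for the network of \cite{krizhevsky2012imagenet}; all names and dimensions here are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision import models

core = models.alexnet(weights=None)                # stand-in for the 2012 ImageNet winner
feat_dim = core.classifier[-1].in_features         # 4096-d penultimate features
core.classifier[-1] = nn.Identity()                # step 3: drop the softmax output layer

embed_dim = 500                                    # match the word-embedding dimension
proj = nn.Linear(feat_dim, embed_dim, bias=False)  # trainable map into the text space

def predict(images, vocab_embs):
    """Label images by their nearest word embeddings (dot product).

    vocab_embs: (V, embed_dim) embeddings for an arbitrary vocabulary, e.g. all
    ImageNet 21k label terms; scoring labels never seen during visual training
    is what enables zero-shot prediction.
    """
    with torch.no_grad():
        z = proj(core(images))                     # (B, embed_dim)
        return (z @ vocab_embs.t()).argmax(dim=1)  # index of best-matching label
```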