The authors present a neural model that maps images and sentences into the same space in order to perform cross-modal retrieval: finding images given a sentence, or finding sentences given an image.
https://i.imgur.com/DCFYzN8.png
The image vectors come from a pre-trained VGG image classification network. The sentence vectors are constructed using Fisher vectors, though the authors also explore simpler options such as mean word2vec vectors and TF-IDF. Both modalities are then mapped through nonlinearities and L2-normalised, and Euclidean distance is used to measure vector similarity. They also investigate the task of grounding noun phrases from the image caption to specific regions of the image.
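To make the retrieval pipeline concrete, here is a minimal NumPy sketch of the general idea: a sentence is embedded with the mean-word2vec baseline, both modalities pass through a nonlinear mapping and are L2-normalised, and candidates are ranked by Euclidean distance. The single tanh layer, all names, and the layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def embed(feats, W, b):
    """Hypothetical single-layer mapping into the shared space:
    affine transform, tanh nonlinearity, then L2 normalisation."""
    h = np.tanh(feats @ W + b)
    return h / np.linalg.norm(h, axis=1, keepdims=True)

def mean_word2vec(tokens, w2v):
    """Simple sentence baseline from the review: average the
    word2vec vectors of the in-vocabulary tokens."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0)

def retrieve(query_vec, candidate_vecs, k=5):
    """Rank candidates by Euclidean distance to the query and
    return the indices of the k nearest."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Toy usage with random data standing in for real features:
rng = np.random.default_rng(0)
d_img, d_txt, d_joint = 4096, 300, 256  # hypothetical sizes
W_img = rng.normal(scale=0.01, size=(d_img, d_joint)); b_img = np.zeros(d_joint)
W_txt = rng.normal(scale=0.01, size=(d_txt, d_joint)); b_txt = np.zeros(d_joint)

image_feats = rng.normal(size=(100, d_img))       # stand-in for VGG features
images = embed(image_feats, W_img, b_img)

w2v = {"dog": rng.normal(size=d_txt), "running": rng.normal(size=d_txt)}
sentence = embed(mean_word2vec(["dog", "running"], w2v)[None, :], W_txt, b_txt)

print(retrieve(sentence[0], images))              # indices of nearest images
```

One side note on the distance choice: once both sides are L2-normalised, ranking by Euclidean distance is equivalent to ranking by cosine similarity, since ||a - b||^2 = 2 - 2(a . b) for unit vectors.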