Learning Deep Structure-Preserving Image-Text Embeddings on ShortScience.org

Learning Deep Structure-Preserving Image-Text Embeddings
Wang, Liwei and Li, Yin and Lazebnik, Svetlana
Conference on Computer Vision and Pattern Recognition - 2016 via Local Bibsonomy

Summary by Marek Rei 6 years ago

The authors present a neural model that maps images and sentences into the same space in order to perform cross-modal retrieval: finding images based on a sentence, or finding sentences based on an image.

https://i.imgur.com/DCFYzN8.png

The image vectors come from a pre-trained VGG image classification network. The sentence vectors are constructed using Fisher vectors, although the authors also explore simpler options such as mean word2vec vectors and TF-IDF. Both the image and the sentence vectors are then mapped through nonlinearities and normalised, and Euclidean distance between the resulting vectors is used to measure their similarity. The authors also investigate the task of mapping noun phrases from the image caption to specific areas of the image.
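
The pipeline described above (two branches, a nonlinearity, normalisation, Euclidean distance) can be illustrated with a small sketch. This is not the authors' implementation: the single linear layer per branch, the ReLU nonlinearity, the random weights, the toy dimensions and all function names (`make_branch`, `embed`, `rank_by_distance`) are assumptions made for illustration, and the training objective is not shown at all.

```python
import math
import random


def make_branch(in_dim, out_dim, rng):
    """Random weights for one linear layer (hypothetical stand-in for trained weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def embed(weights, features):
    """Map a feature vector into the joint space: linear map -> ReLU -> L2 normalisation."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weights]
    norm = math.sqrt(sum(h * h for h in hidden))
    if norm == 0.0:
        return hidden  # all activations are zero; avoid dividing by zero
    return [h / norm for h in hidden]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, candidates):
    """Indices of candidates sorted from nearest to farthest from the query."""
    return sorted(range(len(candidates)), key=lambda i: euclidean(query, candidates[i]))


rng = random.Random(0)
IMAGE_DIM, TEXT_DIM, JOINT_DIM = 8, 6, 4  # toy sizes, not the paper's

# One branch per modality; in practice these weights would be learned.
image_branch = make_branch(IMAGE_DIM, JOINT_DIM, rng)
text_branch = make_branch(TEXT_DIM, JOINT_DIM, rng)

# Stand-ins for pre-computed image features and a sentence representation.
image_features = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(5)]
sentence_features = [rng.random() for _ in range(TEXT_DIM)]

image_embeddings = [embed(image_branch, f) for f in image_features]
sentence_embedding = embed(text_branch, sentence_features)

# Sentence-to-image retrieval: rank the images by distance to the sentence.
ranking = rank_by_distance(sentence_embedding, image_embeddings)
print(ranking)
```

Image-to-sentence retrieval works the same way with the roles swapped. For unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by Euclidean distance after normalisation gives the same order as ranking by cosine similarity.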
