Order-Embeddings of Images and Language on ShortScience.org

arxiv.org
scholar.google.com

Order-Embeddings of Images and Language
Vendrov, Ivan and Kiros, Ryan and Fidler, Sanja and Urtasun, Raquel
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

Summaries/Notes 1

[link] Summary by Hugo Larochelle 9 years ago

This paper proposes to learn embeddings of text and/or images according to a dissimilarity metric that is asymmetric and implements the notion of partial order. For example, we'd like the metric to capture that the sentence "a dog in the yard" is more specific than just "a dog". Similarly, given the image of a scene and a caption describing it, we'd also like to capture that the image is more specific than the caption, since captions only describe the main elements of the scene. We'd also like to capture the hypernym relation between single words, e.g. where "woman" is more specific than "person".

To achieve this, they propose to use the following dissimilarity metric:

$$E(x,y) = ||max(0,y-x)||^2$$

where x and y are embedding vectors and the max operation is applied element-wise. The way to use this metric is to learn embeddings such that, for a pair x,y where the object (e.g. "a dog in the yard") represented by $x$ is more specific than the object (e.g. "a dog") represented by $y$, then $E(x,y)$ is as small as possible.

For example, let's assume that $x$ and y are the output of a neural network, where each output dimension detects a certain concept, i.e. is non-zero only if the concept associated with that dimension is present in the input. For x representing "a dog in the yard", we could expect having only two dimensions that are non-zero: one detecting the concept "dog" (let's note it $x_j$) and another detecting the concept "yard" ($x_k$). For y representing "a dog", only the dimension associated with "dog" ($y_j$) would be non-zero and have the same value as $x_j$. In this situation, it is easy to see that $E(x,y)$ would be 0, but $E(y,x)$ would be greater than zero, thus capturing appropriately the asymmetric relationship between the two.

The authors show in the paper how to leverage this new asymmetric metric in training losses that are appropriate for 3 problems: hypernym detection, caption-image retrieval and textual entailment. They show that the proposed metric yields superior performance on these problems compared to symmetric metrics that have been used by prior work.

Your comment: