Summary by nishnik 8 years ago
Here the authors present a model that projects queries and documents into a low-dimensional space, where relevant documents can be fetched by computing a distance between the query vector and the document vectors (*cosine similarity is used here*).
### Model Description
#### Word Hashing Layer
They use a bag of letter tri-grams to represent words (office -> #office# -> {#of, off, ffi, fic, ice, ce#}). This representation generalizes to unseen words and maps morphological variations of the same word to points that are close in the n-gram space.
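
The word hashing step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and lowercasing the word is an assumption made here so that the output matches the `office` example above.

```python
def letter_trigrams(word):
    """Return the letter tri-grams of a word wrapped in '#' boundary marks."""
    # Assumption: lowercase the word before hashing.
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def shared_trigrams(word_a, word_b):
    """Tri-grams that two words have in common (order is not preserved)."""
    return set(letter_trigrams(word_a)) & set(letter_trigrams(word_b))


print(letter_trigrams("office"))
# ['#of', 'off', 'ffi', 'fic', 'ice', 'ce#']
```

For instance, `shared_trigrams("office", "offices")` contains five of the six tri-grams of `office`, which is the sense in which morphological variants end up close to each other.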
#### Context Window Vector
To represent a sentence, they take a window of `Window Size` words around each word and concatenate the word vectors to form a context window vector. If we take `Window Size` = 3:
He is going to Office -> { [vec of 'He', vec of 'is', vec of 'going'], [vec of 'is', vec of 'going', vec of 'to'], [vec of 'going', vec of 'to', vec of 'Office'] }
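
A rough sketch of the sliding window follows. The function names are hypothetical, and no padding is added at the sentence boundaries, which is an assumption chosen to reproduce the three windows listed above.

```python
def context_windows(tokens, window_size=3):
    """Slide a window over the tokens and return every full window."""
    # Assumption: no boundary padding, so a sentence shorter than
    # window_size yields no windows at all.
    return [tokens[i:i + window_size]
            for i in range(len(tokens) - window_size + 1)]


def context_window_vectors(word_vectors, window_size=3):
    """Concatenate the vectors inside each window into one flat vector."""
    windows = []
    for group in context_windows(word_vectors, window_size):
        flat = []
        for vec in group:
            flat.extend(vec)
        windows.append(flat)
    return windows


print(context_windows("He is going to Office".split()))
# [['He', 'is', 'going'], ['is', 'going', 'to'], ['going', 'to', 'Office']]
```

With word vectors of dimension `d`, each context window vector has dimension `window_size * d`.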
#### Convolutional Layer and Max-Pool Layer
Run a convolutional layer over each context window vector (intuitively, these are local features). Max-pool over the resulting features to get global features. The output dimension here is 300.
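
The following toy sketch shows how a shared filter bank followed by max-pooling turns a variable number of windows into one fixed-size vector. The `tanh` activation, the absence of a bias term, and the tiny dimensions are assumptions for illustration; the summary only states that the pooled output is 300-D.

```python
import math


def conv_layer(window_vectors, weights):
    """Apply the same linear map and tanh to every context window vector.

    weights has one row per output feature; each row is as long as a window.
    """
    # Assumption: tanh activation, no bias term.
    return [
        [math.tanh(sum(w * x for w, x in zip(row, window))) for row in weights]
        for window in window_vectors
    ]


def max_pool(local_features):
    """Keep the maximum of each feature across all window positions."""
    return [max(column) for column in zip(*local_features)]


windows = [[1.0, 0.0], [0.0, 2.0]]                # two toy window vectors
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # three toy filters
global_features = max_pool(conv_layer(windows, weights))
print(len(global_features))  # 3
```

The pooled vector has one entry per filter regardless of sentence length, which is what lets a fixed fully connected layer be applied afterwards.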
#### Semantic Layer
Use a fully connected layer and project the 300-D vector to a 128-D vector.
They use two different networks, one for queries and the other for documents. For each query and its documents (we are given labeled documents, one of which is positive and the rest negative), they compute the cosine similarity between the 128-D output vectors. They then learn the weights of the convolutional filters and the fully connected layer by maximizing the conditional likelihood of the positive documents.
My guess is that they use two different networks because there is a significant difference between query length and document length.
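
The cosine scoring and likelihood objective described above can be sketched as follows. This assumes the conditional likelihood is a softmax over the cosine scores of the positive and negative documents, scaled by a smoothing factor `gamma`; both the softmax form and the value of `gamma` are assumptions here, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # raises ZeroDivisionError on zero vectors


def negative_log_likelihood(query, positive_doc, negative_docs, gamma=10.0):
    """-log P(positive_doc | query) under a softmax over cosine scores."""
    # Assumption: gamma is a hand-picked smoothing factor.
    scores = [gamma * cosine(query, positive_doc)]
    scores += [gamma * cosine(query, doc) for doc in negative_docs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[0]


good = negative_log_likelihood([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = negative_log_likelihood([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good < bad)  # True
```

Minimizing this loss over all training queries is equivalent to maximizing the conditional likelihood of the positive documents; in training, the gradient would flow back into both the fully connected layer and the convolutional filters.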
