[link]
Here the authors present a model that projects queries and documents into a low-dimensional space, where relevant documents can be fetched by computing a distance, *here cosine similarity is used*, between the query vector and the document vectors.

### Model Description

#### Word Hashing Layer
They use a bag of letter tri-grams to represent words (office -> #office# -> {#of, off, ffi, fic, ice, ce#}). This generalizes to unseen words and maps morphological variations of the same word to points that are close in tri-gram space.

#### Context Window Vector
To represent a sentence, they take a `Window Size` around each word and concatenate the word vectors to form a context window vector. If we take `Window Size` = 3:

(He is going to Office -> { [vec of 'he', vec of 'is', vec of 'going'], [vec of 'is', vec of 'going', vec of 'to'], [vec of 'going', vec of 'to', vec of 'Office'] })

#### Convolutional Layer and Max-Pool Layer
A convolutional layer is run over each context window vector (intuitively, these are local features). Max-pooling over the resulting features then gives global features. The output dimension is taken here to be 300.

#### Semantic Layer
A fully connected layer projects the 300-D vector to a 128-D vector. They use two different networks, one for queries and the other for documents. For each query and its documents (we are given labelled documents, one positive and the rest negative) the cosine similarity between the 128-D output vectors is computed, and the weights of the convolutional filters and the fully connected layer are learned by maximizing the conditional likelihood of the positive documents (see the sketches below).

My thinking is that they have used two different networks because there is a significant difference between query length and document length.
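As a rough illustration of the word hashing layer, here is a minimal Python sketch of letter tri-gram extraction; the function names and the fixed tri-gram vocabulary are my own assumptions, not part of the paper.

```python
# Minimal sketch of letter tri-gram word hashing, assuming a fixed tri-gram
# vocabulary `trigram_index`; names and representation are illustrative.
from collections import Counter

def letter_trigrams(word: str) -> list[str]:
    """Wrap the word in boundary markers and emit overlapping tri-grams."""
    marked = f"#{word}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def trigram_count_vector(word: str, trigram_index: dict[str, int]) -> Counter:
    """Sparse bag-of-tri-grams representation over the tri-gram vocabulary."""
    counts = Counter()
    for gram in letter_trigrams(word):
        if gram in trigram_index:
            counts[trigram_index[gram]] += 1
    return counts

# Example: 'office' -> ['#of', 'off', 'ffi', 'fic', 'ice', 'ce#']
print(letter_trigrams("office"))
```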
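And here is a hedged PyTorch sketch of the convolution, max-pooling, semantic projection, and training objective described above. The 300-D and 128-D sizes come from the summary; the tri-gram dimension, the smoothing factor `gamma`, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
# A sketch of one query/document tower plus the softmax-over-cosines loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSMTower(nn.Module):
    def __init__(self, trigram_dim=5000, window=3, conv_dim=300, sem_dim=128):
        super().__init__()
        # The "convolution" is a linear map applied to each context window,
        # i.e. the concatenation of `window` tri-gram vectors.
        self.conv = nn.Linear(window * trigram_dim, conv_dim)
        self.semantic = nn.Linear(conv_dim, sem_dim)

    def forward(self, windows):
        # windows: (batch, n_windows, window * trigram_dim)
        local = torch.tanh(self.conv(windows))    # local features per window
        pooled, _ = local.max(dim=1)              # max-pool over windows -> global features
        return torch.tanh(self.semantic(pooled))  # 128-D semantic vector

# Separate towers for queries and documents, as in the summary above.
query_net, doc_net = CLSMTower(), CLSMTower()

def relevance_loss(query_vec, pos_doc_vec, neg_doc_vecs, gamma=10.0):
    """Softmax over smoothed cosine similarities; maximise likelihood of the positive doc."""
    docs = torch.stack([pos_doc_vec] + neg_doc_vecs, dim=1)           # (batch, 1 + n_neg, 128)
    sims = F.cosine_similarity(query_vec.unsqueeze(1), docs, dim=-1)  # (batch, 1 + n_neg)
    targets = torch.zeros(query_vec.size(0), dtype=torch.long)        # positive doc sits at index 0
    return F.cross_entropy(gamma * sims, targets)
```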
[link]
A Dynamic Memory Network has:

1. **Input Module**: Processes the input data about which a question is being asked into a set of vectors termed facts. This module consists of a GRU over the input words.
2. **Question Module**: Represents the question as a vector (the final hidden state of a GRU over the words in the question).
3. **Episodic Memory Module**: Retrieves the information required to answer the question from the input facts (input module). It consists of two parts: an attention mechanism and a memory update mechanism. To make it more intuitive: when we see a question, we only have the question in our memory (i.e. the initial memory vector == question vector); then, based on the question and the previous memory, we pass over the input facts and generate a contextual vector (this is the job of the attention mechanism); the memory is then updated based on the contextual vector and the previous memory, and this is repeated over several passes (a sketch is given below).
4. **Answer Module**: Uses the question vector and the final memory from the episodic memory module to generate the answer (a linear layer with softmax activation for single-word answers, an RNN for more complicated answers).

**Improved DMN+**

The original input module used a single GRU to process the data. This has two shortcomings:

1. The GRU only allows sentences to have context from the sentences before them, not after them, which prevents information from propagating from future sentences. Therefore bi-directional GRUs are used in DMN+.
2. The supporting sentences may be too far apart at the word level for these distant sentences to interact through a word-level GRU. In DMN+ they use sentence embeddings rather than word embeddings, and then use GRUs to let the sentence embeddings interact (the input fusion layer, sketched below).

**For Visual Question Answering**

Split the image into regions and treat them analogously to sentences in the textual input module. A linear layer with tanh activation projects the regional vectors (from the image) into the textual feature space (for text-based question answering they used positional encoding to embed sentences). Bi-directional GRUs are again used to form the facts. From there, the process is the same as for text-based question answering.
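Here is a simplified PyTorch sketch of the episodic memory module's attention and memory update described above; the gate's feature set, the use of a GRU cell for the update, and the number of passes are illustrative assumptions rather than the papers' exact design.

```python
# Sketch of the episodic memory module: attention over facts, then a memory update.
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Attention gate scores each fact from its interactions with question and memory.
        self.gate = nn.Sequential(nn.Linear(4 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))
        # Memory update driven by the episode's context vector.
        self.update = nn.GRUCell(dim, dim)

    def forward(self, facts, question, memory, n_passes=3):
        # facts: (n_facts, dim); question, memory: (dim,)
        for _ in range(n_passes):
            q = question.expand_as(facts)
            m = memory.expand_as(facts)
            features = torch.cat(
                [facts * q, facts * m, (facts - q).abs(), (facts - m).abs()], dim=-1)
            weights = torch.softmax(self.gate(features).squeeze(-1), dim=0)
            context = (weights.unsqueeze(-1) * facts).sum(dim=0)   # attention-weighted episode
            memory = self.update(context.unsqueeze(0), memory.unsqueeze(0)).squeeze(0)
        return memory  # passed with the question vector to the answer module

# Usage: the memory starts out as the question vector.
# final_memory = EpisodicMemory(dim=128)(facts, question, memory=question)
```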
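And a minimal sketch of the DMN+ input fusion layer mentioned above: a bidirectional GRU over sentence embeddings whose forward and backward outputs are summed to form the facts. The embedding size and the dummy input are assumptions for illustration.

```python
# Sketch of the input fusion layer: bi-directional GRU over sentence embeddings.
import torch
import torch.nn as nn

dim = 128                                           # sentence-embedding size (illustrative)
fusion = nn.GRU(dim, dim, bidirectional=True, batch_first=True)

sentences = torch.randn(1, 10, dim)                 # (batch, n_sentences, dim) sentence embeddings
outputs, _ = fusion(sentences)                      # (batch, n_sentences, 2 * dim)
facts = outputs[..., :dim] + outputs[..., dim:]     # sum the forward and backward directions
```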
[link]
Here the author presents a way to retrieve similar images in O(distance) time (where distance can be thought of as 100 minus the correlation, in percent, between two images), at the cost of O(2^n) memory (n being the number of bits in which the image is encoded). This approach is therefore independent of the size of the database. (The value of 'n' is important, though: it acts as a kind of precision measure for the correlation. If we choose a very small 'n', the differences between images similar to the given image can't be scored finely enough, while if we use a very large 'n', semantic similarity won't be captured efficiently enough.)

(But this is only hashing; what is peculiar about this paper? It uses *semantic* hashing.) Every image is fed to an autoencoder, and the code generated at its output is used for hashing. For a query image a code is generated in the same way, and the k nearest images in code space are retrieved.

But what is an autoencoder? It is an artificial neural network whose layers have no connections within a layer, only between layers. It is composed of two parts: an encoder and a decoder. In the paper n is 28, which means the autoencoder has 28 units in its middle layer.

*Training:*
1. The encoder is trained greedily, layer by layer, in the way Restricted Boltzmann Machines are trained.
2. The decoder is then just the inverse of the encoder layers. (*Unsymmetrical autoencoders generally give poor results.)
3. The whole network is then fine-tuned using back-propagation, using the input image itself to compute the reconstruction loss. (*Due to the huge number of weights, this back-propagation converges very slowly.)

In short, the encoder transforms the high-dimensional input into a low-dimensional output, and the decoder maps it back to the high-dimensional space. In the paper this low-dimensional output is used as the code representation of the input image (note that the activations are rounded so that the code is binary), and these codes capture the semantic content of the image. The author also writes that this approach won't be useful if applied to words with respect to documents, as words w.r.t. documents are much more semantically meaningful than pixels w.r.t. images.

They then use one more trick to tackle translation variance: 81 regularly spaced patches are taken from each image, and the algorithm above is applied to each patch to compute its hash. (*This is a kind of convolution, except that nothing is summed; we just iterate over the patches to find their representations. It won't do much against rotational variance.)

Storage works as follows: in an array of size 2^28, an image whose code is `a1,a2..a28` is stored at the index 'i' obtained by interpreting `a1,a2..a28` as a binary number. For a query image, it is broken into 81 overlapping patches, each patch is fed to the network, and its code is computed. All images stored at indexes differing from a query code by fewer than 3 bits are returned and given a score based on the difference in bits; the scores are then summed per image, and the images are returned in descending order of score (see the sketch below).

The author uses a two-stage search: in the first stage, candidate images are returned for the given query using the 28-bit codes; then the 256-bit codes of the query and of the previously returned candidates are compared, and a refined ordering based on the 256-bit codes is returned.
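As a hedged sketch of the first-stage lookup described above: images are stored in buckets indexed by their 28-bit code, and a query retrieves every bucket whose index differs from a query code by at most 3 bits, scoring closer codes higher. The bucket structure and the scoring scheme are illustrative assumptions, not the paper's exact data structures.

```python
# Sketch of semantic-hashing lookup within a small Hamming ball of the query code.
from itertools import combinations
from collections import defaultdict

N_BITS = 28
buckets = defaultdict(list)          # code (int) -> list of image ids stored there

def neighbours_within(code: int, max_flips: int = 3):
    """Yield every code within Hamming distance `max_flips` of `code`, with its distance."""
    yield code, 0
    for dist in range(1, max_flips + 1):
        for bits in combinations(range(N_BITS), dist):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped, dist

def query(code: int, max_flips: int = 3):
    """Score candidate images by how close their bucket's code is to the query code."""
    scores = defaultdict(int)
    for candidate, dist in neighbours_within(code, max_flips):
        for image_id in buckets[candidate]:
            scores[image_id] += max_flips + 1 - dist   # closer codes contribute more
    return sorted(scores, key=scores.get, reverse=True)
```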
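And a rough sketch of the second-stage re-ranking by 256-bit codes; `long_codes` and the integer code representation are hypothetical helpers introduced here for illustration.

```python
# Sketch of the second stage: re-rank coarse candidates by 256-bit Hamming distance.
long_codes = {}                      # image id -> 256-bit code (int), filled at indexing time

def refine(query_long_code: int, candidates: list):
    """Re-rank first-stage candidates by Hamming distance between 256-bit codes."""
    return sorted(candidates,
                  key=lambda image_id: bin(query_long_code ^ long_codes[image_id]).count("1"))
```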
I recommend studying *A Practical Guide to Training Restricted Boltzmann Machines* for a better understanding of the learning procedure used in the training steps above.