Summary by nishnik 8 years ago
Here the author presents a way to retrieve similar images in O(distance) time, where distance can be thought of as 100 minus the correlation (in percent) between two images, at the cost of O(2^n) memory, where n is the number of bits used to encode each image. Retrieval time is therefore independent of the size of the database. The choice of n matters, though: it acts as a precision knob for the correlation. If n is too small, images similar to the query cannot be scored apart from one another; if n is too large, the codes stop capturing semantic similarity well.
(But this is just hashing, so what is special about this paper? It uses semantic hashing.)
Every database image is fed to an autoencoder, which outputs a code that is used for hashing. At query time, a code is generated for the query image in the same way, and the k nearest images are retrieved in this code space.
But what is an autoencoder? It is an artificial neural network whose units have no connections within a layer, only between successive layers. It consists of two parts: an encoder and a decoder.
In the paper n is 28, which means the autoencoder has 28 units in its middle (code) layer.
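A minimal numpy sketch of this encoder/decoder structure is given below. Only the 28-unit code layer comes from the paper; the other layer sizes, the logistic units, and the weight initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes: the 28-unit code layer matches the paper,
# the remaining sizes are assumptions for this sketch.
sizes = [1024, 512, 256, 28]
rng = np.random.default_rng(0)
enc_W = [rng.normal(0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
dec_W = [W.T.copy() for W in reversed(enc_W)]   # decoder mirrors the encoder

def encode(x):
    h = x
    for W in enc_W:
        h = sigmoid(h @ W)
    return h                                    # 28 real-valued code units

def decode(code):
    h = code
    for W in dec_W:
        h = sigmoid(h @ W)
    return h                                    # reconstruction of the input

x = rng.random((1, sizes[0]))
code = encode(x)
binary_code = (code > 0.5).astype(np.uint8)     # rounded to binary for hashing
```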
*Training:*
1. The encoder is trained greedily, layer by layer, with each layer trained the way a Restricted Boltzmann Machine is trained (a minimal sketch of this layer-wise training is given after this list).
2. The decoder is then initialized as the mirror image of the encoder (*unsymmetrical autoencoders generally give poor results).
3. Both parts are then fine-tuned with back-propagation, using the reconstruction error against the input image itself as the loss.
(*Due to the huge number of weights, training with back-propagation alone converges very slowly.)
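The sketch below shows one contrastive-divergence (CD-1) update for a single RBM layer, the building block of step 1. It is a simplification under assumptions (binary units, no biases or momentum, arbitrary learning rate), not the paper's exact training recipe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One CD-1 weight update for an RBM with weight matrix W (visible x hidden).
def cd1_update(W, v0, lr=0.05, rng=np.random.default_rng(0)):
    h0_prob = sigmoid(v0 @ W)                           # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0    # sample hidden states
    v1_prob = sigmoid(h0 @ W.T)                         # reconstruct visibles
    h1_prob = sigmoid(v1_prob @ W)                      # negative phase
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    return W + lr * grad

# Greedy stacking: train the first RBM on the raw data, then use its hidden
# probabilities as the "data" for the next RBM, and so on up to the 28-unit layer.
```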
In short, the encoder maps the high-dimensional input to a low-dimensional code, and the decoder maps that code back to the high-dimensional space.
In the paper, this low-dimensional output is used as the code representing the input image. (Note that the code units are rounded so that the code is binary.) These codes capture the semantic content of the image. The author also writes that this approach won't be as useful if applied to words with respect to documents, since words are already much more semantically meaningful w.r.t. documents than pixels are w.r.t. images.
They also use one more trick to handle translation variance: they take 81 regularly spaced patches from each image and apply the algorithm described above to compute a hash for every patch. (*It is a bit like a convolution, except that nothing is summed; we simply iterate over the patches to find their representations. It won't be very effective against rotational variance.)
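A sketch of the patch extraction is below. The paper only says 81 regularly spaced patches, so the 9 x 9 grid layout and the patch size used here are assumptions.

```python
import numpy as np

# Take 81 regularly spaced, overlapping patches (assumed 9 x 9 grid of positions).
def extract_patches(image, patch_size=16, grid=9):
    H, W = image.shape[:2]
    ys = np.linspace(0, H - patch_size, grid).astype(int)
    xs = np.linspace(0, W - patch_size, grid).astype(int)
    return [image[y:y + patch_size, x:x + patch_size] for y in ys for x in xs]

patches = extract_patches(np.zeros((32, 32)))   # 81 patches per image
```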
The codes index into an array of size 2^28: an image whose code is the bit string `a1,a2..a28` is stored at index 'i', where 'i' is the decimal value of that bit string.
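A minimal sketch of this storage step, with a dict standing in for the length-2^28 array used in the paper:

```python
from collections import defaultdict

buckets = defaultdict(list)

def code_to_index(bits):                 # bits: sequence of 28 zeros/ones
    i = 0
    for b in bits:
        i = (i << 1) | int(b)
    return i                             # the 28-bit code read as an integer

def store(image_id, bits):
    buckets[code_to_index(bits)].append(image_id)
```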
A query image is likewise broken into 81 overlapping patches, each of which is fed through the network to compute a code. All images stored at indexes that differ from a patch's code by fewer than 3 bits are retrieved and given a score based on the bit difference. The scores are then summed for each image, and the images are returned in descending order of score.
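Continuing the storage sketch above (it reuses `buckets` and `code_to_index`), this is one way the lookup could look. The exact scoring function is an assumption; the summary only says scores are based on the bit difference.

```python
from itertools import combinations
from collections import Counter

# For each query-patch code, visit every bucket whose index differs by at
# most `radius` bits, score hits higher when the distance is smaller,
# accumulate per image, and rank by total score.
def query(patch_codes, radius=2, n_bits=28):
    scores = Counter()
    for bits in patch_codes:
        base = code_to_index(bits)
        for d in range(radius + 1):
            for flips in combinations(range(n_bits), d):
                idx = base
                for f in flips:
                    idx ^= (1 << f)                 # flip d of the 28 bits
                for image_id in buckets.get(idx, ()):
                    scores[image_id] += radius + 1 - d
    return [img for img, _ in scores.most_common()]
```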
The author uses a two-stage search: in the first stage, candidate images are retrieved for the query using the 28-bit codes; in the second stage, the 256-bit code of the query is compared with the 256-bit codes of those candidates, and a refined ordering is returned based on the 256-bit comparison.
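A sketch of that second stage, assuming the candidates from stage one are re-ordered by Hamming distance between 256-bit codes (the helper names here are illustrative, not from the paper):

```python
def hamming(a, b):
    # a, b: equal-length sequences of bits
    return sum(x != y for x, y in zip(a, b))

def rerank(query_code_256, candidates, codes_256):
    # codes_256: dict mapping image_id -> its 256-bit code
    return sorted(candidates, key=lambda img: hamming(query_code_256, codes_256[img]))
```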
I recommend reading A Practical Guide to Training Restricted Boltzmann Machines for a better understanding of the learning procedure used in the training stage above.