Evan Su's profile - ShortScience.org


Efficient Visual Search of Videos Cast as Text Retrieval
Sivic, Josef and Zisserman, Andrew
IEEE Transactions on Pattern Analysis and Machine Intelligence - 2009 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

This paper presents an efficient object retrieval approach that employs methods from statistical text retrieval. High-level features (the visual analogue of words) are obtained by vector quantizing low-level features (region descriptors). Using these visual words together with text retrieval techniques significantly reduces the matching cost.
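
As a rough illustration of the visual-word idea, the sketch below quantizes region descriptors against a vocabulary and then applies tf-idf weighting, a standard text retrieval weighting. The function names, the brute-force nearest-word search, and the toy vocabulary are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Assign each descriptor (row) to the index of its nearest visual word."""
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bow_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs in one image or frame."""
    return np.bincount(word_ids, minlength=vocab_size).astype(float)

def tfidf(histograms):
    """Apply tf-idf weights to a (num_images, vocab_size) count matrix
    and L2-normalize each row."""
    n_images = histograms.shape[0]
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1.0)
    doc_freq = np.maximum((histograms > 0).sum(axis=0), 1)
    idf = np.log(n_images / doc_freq)
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.maximum(norms, 1e-12)

# Toy vocabulary with two visual words (hypothetical values).
vocabulary = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [9.0, 10.0], [11.0, 9.0]])
word_ids = quantize(descriptors, vocabulary)
histogram = bow_histogram(word_ids, vocab_size=2)
```

Once every image is a sparse weighted histogram, two images can be compared with a single dot product instead of matching all descriptor pairs, which is where the reduction in matching cost comes from.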


Iterative quantization: A procrustean approach to learning binary codes
Gong, Yunchao and Lazebnik, Svetlana
Conference on Computer Vision and Pattern Recognition - 2011 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

Similarity-preserving binary codes obtained by quantization can greatly accelerate retrieval from large-scale image collections. This paper presents iterative quantization (ITQ), which iteratively minimizes the quantization error by rotating the data before quantization. ITQ can be coupled with any projection of the data onto an orthogonal basis. Experimental results show outstanding retrieval performance on large-scale image collections, especially when the binary code is short.
Technical details and results

Idea of ITQ

Figure 1 illustrates the idea of rotating the data before quantization to reduce the quantization error.

![](http://2.bp.blogspot.com/-riBFQ71aWKk/VRI4Aprwx5I/AAAAAAAAAxw/ggFrB14UGlw/s1600/toy.png)

Figure 1. Toy illustration of the proposed ITQ method
The data points (blue points) are quantized to the closest vertex of the binary cube. By rotating the data points to that shown in Figure 1 (c), the quantization error is reduced and the partitioning respects the structure of the cluster.

ITQ in unsupervised code learning

Given a set of n data points, each of dimension d, the data matrix is denoted by

$X \in \mathbb{R}^{n \times d}$

The binary code matrix can be computed by 

$B = sgn(XW)$ 

$W \in \mathbb{R}^{d \times c}$

where W denotes the projection matrix computed by PCA in this case.

Let R denote any c x c orthogonal matrix. The quantization error after projection and rotation of the data matrix is denoted by

$Q(B,R) = ||B - VR||^{2}_F$

$V = XW$

The ITQ method minimizes the quantization error by seeking the optimal R. Because of the sign operator, the quantization error is not a smooth function, so direct minimization is impractical. The paper instead optimizes B and R alternately, in the manner of the k-means algorithm. When R is fixed, B is computed by

$B = sgn(VR)$

When B is fixed, minimizing over R is an orthogonal Procrustes problem, and R is computed by

$R = \hat{S} S^T$

where $S$ and $\hat{S}$ hold the left-singular and right-singular vectors of the matrix

$B^TV = S \Omega \hat{S}^T$
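
The alternation between B and R can be sketched in NumPy as follows. This is a minimal sketch rather than the authors' implementation: the helper names `pca_project` and `itq`, the random orthogonal initialization via QR, and the default iteration count are assumptions made for illustration.

```python
import numpy as np

def pca_project(X, c):
    """Center X and project it onto its top-c principal directions (V = XW)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:c].T                      # (d, c) projection matrix
    return Xc @ W, W

def itq(V, n_iter=50, seed=0):
    """Alternate B = sgn(VR) with the Procrustes update of R.

    V is the (n, c) projected data. Returns the binary codes B in {-1, +1},
    the rotation R, and the quantization error after each iteration.
    """
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    # Assumed initialization: a random orthogonal matrix from a QR factorization.
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    errors = []
    for _ in range(n_iter):
        # Fix R, update B.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        # Fix B, update R: with B^T V = S Omega S_hat^T, set R = S_hat S^T.
        S, _, S_hat_T = np.linalg.svd(B.T @ V)
        R = S_hat_T.T @ S.T
        errors.append(np.linalg.norm(B - V @ R) ** 2)
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, R, errors
```

Each of the two updates can only lower the quantization error or leave it unchanged, so the recorded error is non-increasing across iterations.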

Figure 2 shows the quantization error for learning a 32-bit ITQ code on the CIFAR dataset.

![](http://4.bp.blogspot.com/-F6wZGT5fl3E/VRJW5U2VvFI/AAAAAAAAAzg/aEt4DIDpV6E/s1600/quantization%2Berror.png)

Results

"PCA-ITQ" in the legends of the figures denotes the proposed method.


![](http://3.bp.blogspot.com/-WbVup__p9CQ/VRJTBh7NKpI/AAAAAAAAAx8/DCUvfo5Ijuw/s1600/result1.png)

![](http://2.bp.blogspot.com/-dj6F3pd5WiM/VRJTBvF6G5I/AAAAAAAAAyA/WyW7He4TJOw/s1600/result2.png)


Nonlinear Dimensionality Reduction by Locally Linear Embedding
Roweis, Sam T. and Saul, Lawrence K.
Science - 2000 via Local Bibsonomy
Keywords: visualization, dimensionality_reduction, dipl_literatur, unsupervised, nldr, ml

[link] Summary by Evan Su 9 years ago

This paper presents locally linear embedding (LLE) for nonlinear dimensionality reduction. LLE can learn the structure of the underlying low-dimensional manifold of the sampled data in high dimensional space. Therefore, LLE can preserve the distance in the manifold space much better than PCA. Unlike PCA which projects high dimensional space to low dimensional space with a global linear matrix, LLE seeks locally linear projections for locally linear patches formed by neighboring data points. Using many locally linear projections instead of a global linear projection is the key to nonlinear dimensionality reduction.
Technical details 

Fig. 1 shows the problem of nonlinear dimensionality reduction.

![](http://1.bp.blogspot.com/-Vpkju74n9w8/VSX3RGAjKtI/AAAAAAAAA08/NqJA7v1MLuQ/s1600/dimreduct.png)


Fig. 2 summarizes the LLE algorithm. The neighbors of each data point can be computed by K-nearest neighbor or by collecting the data points within a radius. The weights in step 2 reflect intrinsic geometric properties of the data that are invariant to locally linear projections. The third step finds new data points projected by locally linear projections in low-dimensional space.

![](http://3.bp.blogspot.com/-kVz0nYKKSQ0/VSX3_Aio4lI/AAAAAAAAA1E/hrltFFQ7njY/s1600/LLEgraph.png)

![](http://3.bp.blogspot.com/-bfgIDOavP5s/VSX23Jko4tI/AAAAAAAAA0s/IsE4Th0SN1Y/s1600/LLE.png)


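Step 2 chooses the weights $W_{ij}$ that best reconstruct each data point from its neighbors by minimizing

$\varepsilon (W) = \displaystyle\sum_i | \vec{X}_i - \sum_j W_{ij} \vec{X}_j|^2$

Step 3 keeps those weights fixed and finds the low-dimensional coordinates $\vec{Y}_i$ by minimizing

$\Phi (Y) = \displaystyle\sum_i | \vec{Y}_i - \sum_j W_{ij} \vec{Y}_j|^2$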
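
The three steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `lle`, the brute-force neighbor search, and the regularization term `reg` (added so the local Gram matrix stays invertible when there are more neighbors than input dimensions) are all assumptions.

```python
import numpy as np

def lle(X, n_neighbors, n_components, reg=1e-3):
    """Locally linear embedding of the rows of X. Returns (Y, W)."""
    n = X.shape[0]
    # Step 1: K nearest neighbors of each point (excluding the point itself).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]

    # Step 2: reconstruction weights that minimize epsilon(W),
    # with each row of W constrained to sum to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]            # neighbors relative to point i
        C = Z @ Z.T                           # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)  # assumed regularizer
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()

    # Step 3: minimize Phi(Y) via the bottom eigenvectors of (I - W)^T (I - W).
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    # Skip the first eigenvector, which is the constant vector for
    # connected data, and keep the next n_components.
    Y = eigvecs[:, 1:n_components + 1]
    return Y, W
```

The returned coordinates have unit-norm columns; rescaling them by the square root of the number of points would give unit covariance.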


Results

Figure 4 shows the results of dimensionality reduction of images of lips using PCA and LLE.

![](http://3.bp.blogspot.com/-STDK5O9TBX4/VSX4TyurcvI/AAAAAAAAA1M/XUoZ-CmnObQ/s1600/lips.png)


To Aggregate or Not to aggregate: Selective Match Kernels for Image Search
Tolias, Giorgos and Avrithis, Yannis S. and Jégou, Hervé
International Conference on Computer Vision - 2013 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

Descriptors and matching kernels are key components of an image search system. This paper presents a framework for matching kernels that covers non-aggregated kernels such as Hamming Embedding (HE) and aggregated kernels such as Bag-of-Words (BoW) and the vector of locally aggregated descriptors (VLAD). To evaluate the effectiveness of aggregation, the paper introduces the selective match kernel (SMK, non-aggregated) and the aggregated selective match kernel (ASMK), both based on the framework. Experimental results show that ASMK outperforms SMK and state-of-the-art methods because ASMK handles burstiness better than SMK.

Technical details 

The framework of matching kernels is described by the following general form.

$$K(\mathcal{X},\mathcal{Y}) = \gamma(\mathcal{X})\gamma(\mathcal{Y})
\displaystyle\sum_{c \in C} w_c M (\mathcal{X}_c,\mathcal{Y}_c)$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the descriptor sets of two images, $\mathcal{X}_c$ and $\mathcal{Y}_c$ are the subsets of descriptors assigned to visual word $c$, $M$ denotes the similarity function, $w_c$ is a scalar weight, and $\gamma$ denotes the normalization factor.

The proposed selective match kernel (SMK) is defined by

$$M_N(\mathcal{X}_c,\mathcal{Y}_c) = 
\displaystyle\sum_{x \in \mathcal{X}_c} 
\displaystyle\sum_{y \in \mathcal{Y}_c} 
\sigma (\phi(x)^T\phi(y))$$

where $\phi$ maps a descriptor to its vector representation and $\sigma$ is a scalar selectivity function. Note that $|\mathcal{X}_c| \times |\mathcal{Y}_c|$ matches (dot products) are needed for each visual word, where $|\cdot|$ denotes the number of descriptors in a set.


The proposed aggregated selective match kernel (ASMK) is denoted by

![](http://1.bp.blogspot.com/-ocG18fK0TMM/VS8_MsfOeSI/AAAAAAAAA2k/BrXcXyF0lfM/s1600/ASMK.png)

Note that only one match (dot product) is needed for each visual word.
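
The difference in cost and in sensitivity to burstiness can be sketched as follows. The sketch assumes that the aggregated kernel applies $\sigma$ to the dot product of the L2-normalized sums of the per-word representations, and that the rows passed in are already the unit-norm representations $\phi(x)$. The thresholded power form of `selectivity` and its parameters `alpha` and `tau` are illustrative assumptions, as are all function names.

```python
import numpy as np

def selectivity(u, alpha=3.0, tau=0.0):
    """Assumed selectivity function: sign(u) * |u|**alpha if u > tau, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(u > tau, np.sign(u) * np.abs(u) ** alpha, 0.0)

def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is left as zeros)."""
    return v / np.maximum(np.linalg.norm(v), 1e-12)

def match_non_aggregated(Xc, Yc, alpha=3.0, tau=0.0):
    """Sum sigma(phi(x)^T phi(y)) over all pairs: |Xc| * |Yc| dot products."""
    sims = Xc @ Yc.T
    return float(selectivity(sims, alpha, tau).sum())

def match_aggregated(Xc, Yc, alpha=3.0, tau=0.0):
    """Aggregate the descriptors of each image first, then one dot product."""
    agg_x = l2_normalize(Xc.sum(axis=0))
    agg_y = l2_normalize(Yc.sum(axis=0))
    return float(selectivity(agg_x @ agg_y, alpha, tau))

# A bursty visual word: the same descriptor repeated five times in each image.
v = np.array([[1.0, 0.0]])
Xc = np.repeat(v, 5, axis=0)
Yc = np.repeat(v, 5, axis=0)
bursty_non_aggregated = match_non_aggregated(Xc, Yc)
bursty_aggregated = match_aggregated(Xc, Yc)
```

In this toy case the non-aggregated score grows with the product of the repetition counts, while the aggregated score counts the repeated pattern once.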

Results

As shown in Figure 5, ASMK outperforms SMK and SMK-BURST. BURST refers to burstiness normalization.

![](http://1.bp.blogspot.com/-PfG69JuQnxk/VS9B6BvrAfI/AAAAAAAAA20/_2eKsEiAdew/s1600/fig5.png)

Table 4 shows that ASMK outperforms state-of-the-art methods.

![](http://2.bp.blogspot.com/-3GtM-bH6C2c/VS9CRKqN3TI/AAAAAAAAA28/Grzn5QT_GhM/s1600/table%2B4.png)

Note that all the results above are from the initial result set. Re-ranking approaches are not included.


Probabilistic latent semantic indexing
Hofmann, Thomas
- 1999 via Local Bibsonomy
Keywords: semantic, latent, probabilistic, indexing

[link] Summary by Evan Su 9 years ago

Probabilistic latent semantic indexing (PLSI) is an approach to document retrieval that models the joint probability of words and documents as a mixture of multinomial distributions conditioned on latent semantic classes. The model rests on two independence assumptions. First, the observed (document, word) pairs are assumed to be generated independently of one another. Second, conditioned on the latent class, words are generated independently of the specific document identity. Because the number of classes is smaller than the number of documents, the latent class acts as a bottleneck variable in predicting the distribution of words conditioned on documents.

Technical details

Given a word w and a document d, their joint probability distribution is modeled as follows.

$$P(d,w) = P(d)P(w|d) \tag{1}$$

where

$$P(w|d) = \displaystyle\sum_{z\in Z} P(w|z)P(z|d) \tag{2}$$

and $z$ denotes a latent class.

Following the likelihood principle, one determines the distributions in (1) and (2) by maximization of the log-likelihood function

$$\mathcal{L} = \displaystyle\sum_{d \in D} \displaystyle\sum_{w \in W} n(d,w) \log P(d,w)$$

where $n(d,w)$ is the number of times word $w$ occurs in document $d$.

The maximization is done by the Expectation Maximization (EM) algorithm.
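
A minimal sketch of the EM updates for this model is given below. It uses plain EM on a small dense count matrix; the function name `plsa`, the random initialization, and the small constants that guard against division by zero are assumptions. The tracked log-likelihood omits the $P(d)$ term, which does not depend on the parameters being iterated.

```python
import numpy as np

def plsa(counts, n_classes, n_iter=100, seed=0):
    """Fit P(w|z) and P(z|d) to a (D, W) count matrix n(d, w) with plain EM.

    Returns P(w|z) with shape (Z, W), P(z|d) with shape (D, Z), and the
    value of sum_{d,w} n(d,w) log P(w|d) at the start of each iteration.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_classes, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_classes))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) proportional to P(w|z) P(z|d), shape (D, Z, W).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        p_w_d = joint.sum(axis=1)                         # equation (2)
        log_likelihoods.append(float((counts * np.log(p_w_d + 1e-12)).sum()))
        posterior = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)

        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * posterior         # n(d,w) P(z|d,w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = expected.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_w_z, p_z_d, log_likelihoods
```

The dense (D, Z, W) posterior array keeps the sketch short but is only practical for small corpora; a real implementation would iterate over the nonzero counts.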

Results

![](http://1.bp.blogspot.com/-eSKjS0950ac/VUHFJ2Vv7fI/AAAAAAAAA4U/bMGoVNh5_O0/s1600/result.png)


Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Girshick, Ross B. and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra
Conference on Computer Vision and Pattern Recognition - 2014 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

This paper presents an object detection algorithm that improves mAP on the PASCAL VOC dataset by over 20% compared with the previous state of the art. Unlike image classification, which takes an image or the center part of an image as input, the object detection task requires an algorithm to detect the bounding boxes of objects in an image. To use high-capacity CNN features for object detection, the proposed algorithm first generates region proposals. CNN features are extracted from those region proposals and fed to a set of class-specific linear SVMs, which decide whether objects are detected in those regions.
Technical details

The figure below shows the object detection system in this paper.
![](http://3.bp.blogspot.com/-O6e43qcpcYA/VWapFWyXt5I/AAAAAAAAA8c/rcjlQJAQ35s/s320/system.png)

Because the PASCAL VOC dataset is not large enough to train a high-capacity CNN, the paper uses supervised pre-training on a large auxiliary dataset (ILSVRC 2012). The CNN is then fine-tuned on a portion of the PASCAL VOC dataset.
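
The test-time flow of region proposals, CNN features, and class-specific linear SVMs can be outlined as below. This is a structural sketch only: `propose_regions` and `extract_features` are hypothetical callables standing in for the region proposal method and the CNN, the SVM parameters are plain arrays, and the region warping and non-maximum suppression steps are omitted.

```python
import numpy as np

def detect(image, propose_regions, extract_features,
           svm_weights, svm_biases, threshold=0.0):
    """Score every proposed region with one linear SVM per class.

    propose_regions(image) -> iterable of (x0, y0, x1, y1) boxes (hypothetical)
    extract_features(crop) -> 1-D feature vector (hypothetical CNN stand-in)
    svm_weights: (n_classes, feature_dim), svm_biases: (n_classes,)
    Returns a list of (box, class_index, score) for scores above threshold.
    """
    detections = []
    for box in propose_regions(image):
        x0, y0, x1, y1 = box
        crop = image[y0:y1, x0:x1]
        features = extract_features(crop)
        scores = svm_weights @ features + svm_biases   # one score per class
        for cls in np.flatnonzero(scores > threshold):
            detections.append((box, int(cls), float(scores[cls])))
    return detections

# Toy usage with stand-in components (all values are made up).
image = np.zeros((10, 10))
image[0:5, 0:5] = 1.0
toy_proposals = lambda img: [(0, 0, 5, 5), (5, 5, 10, 10)]
toy_features = lambda crop: np.array([crop.mean()])
detections = detect(image, toy_proposals, toy_features,
                    svm_weights=np.array([[2.0]]), svm_biases=np.array([-1.0]))
```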

Results

The following table shows the detection mAP on VOC 2007 test.

![](http://1.bp.blogspot.com/-AmEd1cI6iWs/VWaqnD1YYwI/AAAAAAAAA8o/07pfZpdvwck/s400/table%2B2.png)


Universum Prescription: Regularization using Unlabeled Data
Zhang, Xiang and LeCun, Yann
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Evan Su 9 years ago

This paper applies temporal convolutional neural networks to character-level input to learn abstract text concepts. Depending on the application, the model can output the category of a text or the sentiment of a review. The model is trained at the character level and does not require knowledge of syntax or semantic structure. Therefore, the model can work for various languages, including English and Chinese, with little prior knowledge of the language.


Imagenet classification with deep convolutional neural networks
Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E
Neural Information Processing Systems Conference - 2012 via Local Bibsonomy
Keywords: image, imagenet, thema:deepwalk, classification

[link] Summary by Evan Su 9 years ago

Deep convolutional neural networks (DCNNs) have been a popular model for image classification over the last few years. This paper proposes a DCNN architecture, also known as AlexNet, for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). To train AlexNet, which has 60 million parameters, the paper uses Rectified Linear Units (ReLUs) and multiple GPUs to accelerate training. The paper also reports that local response normalization and overlapping pooling reduce the error rate. To prevent overfitting, the authors use data augmentation and apply dropout in the fully connected layers.
Technical details

The following figure shows the architecture of AlexNet. It contains five convolutional and three fully connected layers. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow the first and second response-normalization layers and the fifth convolutional layer.

![](http://i.imgur.com/2iqwCq1.png)
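
Three of the building blocks mentioned above (ReLU, overlapping max pooling, and dropout) can be sketched in NumPy for a single 2-D feature map. This is a minimal sketch, not the network itself: the function names are assumptions, the pooling defaults (window 3, stride 2) and dropout probability 0.5 follow the values commonly quoted for AlexNet, and the loops favor clarity over speed.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x) elementwise."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=3, stride=2):
    """Max pooling over a 2-D map; windows overlap whenever stride < size."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

def dropout(x, p=0.5, training=True, rng=None):
    """Zero each unit with probability p while training;
    at test time keep every unit and scale the outputs by (1 - p)."""
    if not training:
        return x * (1.0 - p)
    rng = np.random.default_rng() if rng is None else rng
    return x * (rng.random(x.shape) >= p)

feature_map = np.arange(25, dtype=float).reshape(5, 5)
pooled = max_pool2d(relu(feature_map))
```

With a window of 3 and a stride of 2, neighboring pooling windows share one row or column, and a 55 x 55 map is reduced to 27 x 27.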

Evan Su

sciscore: 1.4