[link]
In this paper, the authors raise an important point for instance-based image retrieval. Features extracted from the higher layers of deep networks generally work very well for image recognition, but for instance-based image retrieval they turn out to be much less useful. The authors therefore suggest taking features from a lower layer and applying [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf) to them. As post-processing on top of the VLAD encoding, they perform intra-normalisation and then apply PCA to reduce the encoding to 128 dimensions. The experiments use [GoogLeNet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf), trying the Inception 3a, Inception 4a and Inception 4e layers on GoogLeNet and conv4_2, conv5_1 and conv5_2 on VGG-16. All of these layers perform similarly on the datasets they use. The performance metric is mean average precision (mAP). A sketch of this encoding pipeline is given below.
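The following is a minimal, illustrative sketch of the retrieval signature described above, assuming local descriptors have already been extracted from a lower convolutional layer; the codebook size, descriptor dimensions and the use of scikit-learn are assumptions, not details from the paper.

```python
# Hedged sketch: VLAD encoding of low-layer CNN descriptors with
# intra-normalisation and PCA to a 128-D signature. Names are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def vlad_encode(descriptors, centroids):
    """descriptors: (N, D) local CNN activations; centroids: (K, D) visual words."""
    K, D = centroids.shape
    # Hard-assign each descriptor to its nearest visual word.
    assignments = np.argmin(
        ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - centroids[k]).sum(axis=0)  # accumulate residuals
    # Intra-normalisation: L2-normalise each per-centroid residual block.
    vlad /= np.maximum(np.linalg.norm(vlad, axis=1, keepdims=True), 1e-12)
    vlad = vlad.flatten()
    return vlad / max(np.linalg.norm(vlad), 1e-12)  # global L2 normalisation

# Illustrative usage on fake data: one (locations x channels) descriptor set per image.
feature_maps = [np.random.randn(64, 128) for _ in range(200)]
kmeans = KMeans(n_clusters=16, n_init=10).fit(np.vstack(feature_maps))
encodings = np.array([vlad_encode(f, kmeans.cluster_centers_) for f in feature_maps])
signatures = PCA(n_components=128).fit_transform(encodings)  # 128-D image signatures
print(signatures.shape)  # (200, 128)
```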
[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$, where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, another face of person A, face of a person who is not A), later called (anchor, positive, negative). The loss function is inspired by LMNN. The idea is to minimize the distance between two images of the same person and maximize the distance to images of other people.

## LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudo-metric

$$d(x, y) = (x - y) M (x - y)^T$$

where $M$ is a positive-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not have to hold.

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets: they use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high; for the negative example, it means the distance between the anchor and the negative example is low. They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into all of $\mathbb{R}^{128}$, but onto the unit sphere. Otherwise one could satisfy any margin $\alpha$ by simply scaling the embedding, e.g. $f' = 2 \cdot f$.

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?

## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: the [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)
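Below is a minimal sketch of the triplet loss described above, assuming unit-norm 128-D embeddings; the batch size, margin value and NumPy implementation are illustrative choices, not the paper's code.

```python
# Hedged sketch of the FaceNet-style triplet loss, assuming embeddings are
# already L2-normalised onto the unit sphere. Array names are illustrative.
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """anchor, positive, negative: (batch, 128) unit-norm embeddings f(x)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)   # ||f(x^a) - f(x^p)||^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)   # ||f(x^a) - f(x^n)||^2
    # Hinge on the margin: only triplets violating pos + alpha < neg contribute.
    return np.mean(np.maximum(pos_dist - neg_dist + alpha, 0.0))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Illustrative usage with random unit-norm embeddings.
a, p, n = (normalize(np.random.randn(32, 128)) for _ in range(3))
print(triplet_loss(a, p, n))
```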
[link]
This paper discusses some amazing results. The goal is to learn how to count by end-to-end training: the network input is an image and the output is a count of the objects inside it. They do not perform any direct training using the locations of the objects in the image. The reason for avoiding direct training is that labeled data is expensive; employing a surrogate objective, such as the count of items in the image, is much cheaper and makes more sense because it is the actual goal of the system we want to learn. This paper states that it is possible!

They discuss experiments on two datasets: one of MNIST digits placed in an image and one with the UCSD Pedestrian Database. The network description seems to be general and they don't report any special constraints on the design: `"We consider networks of two or more convolutional layers followed by one or more fully connected layers. Each convolutional layer consist of several elements: a set of convolutional filters, ReLU non-linearities, max pooling layers and normalization layers."` and `"We use a five layers architecture CNN with two convolutional layers followed by three fully connected layers"`. They provide these two tables for their designs:

$$\begin{array}{c|c|c|c} Conv1 & Conv2 & FC1 & FC2 \\ \hline 10\text{x}15\text{x}15 & 10\text{x}3\text{x}3 & 32 & 6 \\ \text{x2 pool} & \text{x2 pool} & & \\ \hline \end{array}\\ \text{CNN arch for numbers}$$

$$\begin{array}{c|c|c|c|c} Conv1 & Conv2 & FC1 & FC2 & FC3 \\ \hline 8\text{x}9\text{x}9 & 8\text{x}5\text{x}5 & 128 & 128 & 25 \\ \text{x2 pool} & \text{x2 pool} & & & \\ \hline \end{array}\\ \text{CNN arch for people}$$

They state that they use a method based on hypercolumns \cite{1411.5752}, but the description is not clear at all: `"Starting with the hypercolumn representation on the last layer we cluster the resulting hypercolumns into a set of prototypes using an online k-means algorithm. Then, a MIL approach with positive and negative instances with the concept of interest is used."`

Interesting work, but I wish it were a longer paper with more details. This paper doesn't really give me enough information to reproduce it.
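As one concrete reading of the first table ("CNN arch for numbers"), here is a hedged PyTorch sketch; the 100x100 single-channel input size and the count-classification output are assumptions for illustration, since the paper does not give these details.

```python
# Hedged sketch of the "CNN arch for numbers" table: two conv+pool layers
# (10 filters of 15x15, then 10 of 3x3) followed by FC layers of 32 and 6 units.
# The 100x100 single-channel input is an assumption, not from the paper.
import torch
import torch.nn as nn

class CountingCNN(nn.Module):
    def __init__(self, num_counts=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=15), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(10, 10, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 100x100 input -> 86 -> 43 -> 41 -> 20 spatial size after the layers above.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(10 * 20 * 20, 32), nn.ReLU(),
            nn.Linear(32, num_counts),   # one logit per possible count (0..5)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CountingCNN()
logits = model(torch.randn(4, 1, 100, 100))  # batch of 4 fake collage images
print(logits.shape)                          # torch.Size([4, 6])
```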
[link]
Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
  - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
  - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [t/3, t/2] and [t/2, 2t/3] respectively, and estimated via brute-force search during training.
  - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
  - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) push down the similarity of incorrect transformations whenever they come within a chosen margin (see the sketch at the end of this summary).
- ACT Dataset
  - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
  - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
  - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories, which are the same action under different subjects, objects and scenes.
- Experiments
  - Action recognition on UCF101, HMDB51, ACT.
  - Cross-category generalization on ACT.
- Visualizations
  - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
  - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context.
  - Embedding retrievals based on transformed precondition embeddings.

**Thoughts**

- Modeling an action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute-force search over (y, z\_p, z\_e) to estimate the action category and the segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism over z might be feasible and would reduce computation to a single forward pass.
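Here is a hedged sketch of the per-class transformation scoring with a margin-based loss referenced above; the embedding size, margin value and exact loss form are illustrative guesses, not the paper's formulation.

```python
# Hedged sketch: one learned transformation matrix per action class, scored by
# cosine similarity between the transformed precondition embedding and the
# effect embedding, with a margin penalty on wrong classes. Shapes are illustrative.
import torch
import torch.nn.functional as F

n_classes, d = 43, 128
W = torch.randn(n_classes, d, d, requires_grad=True)  # one matrix per action class

def similarities(pre_emb, eff_emb):
    """pre_emb, eff_emb: (d,) embeddings from the two Siamese heads."""
    transformed = torch.matmul(W, pre_emb)             # (n_classes, d)
    return F.cosine_similarity(transformed, eff_emb.unsqueeze(0), dim=1)

def transformation_loss(pre_emb, eff_emb, label, margin=0.5):
    sims = similarities(pre_emb, eff_emb)
    pos = sims[label]
    # Penalize wrong transformations that come within `margin` of the true one.
    neg = torch.cat([sims[:label], sims[label + 1:]])
    return (1 - pos) + torch.clamp(neg - pos + margin, min=0).sum()

# Illustrative usage on random embeddings for class 7.
loss = transformation_loss(torch.randn(d), torch.randn(d), label=7)
loss.backward()
print(float(loss))
```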
[link]
Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could "simply" learn identities. It seems this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity connections which skip some layers; the network then only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information flow in the forward pass is no longer a problem
* No vanishing / exploding gradients
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
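To make the residual idea concrete, here is a minimal PyTorch sketch of a basic residual block; the channel count, BatchNorm placement and 3x3 kernel sizes follow the common "basic block" pattern rather than any exact configuration quoted in this summary.

```python
# Hedged sketch of a basic residual block: the identity shortcut skips two
# conv layers, so the stacked layers only have to learn the residual.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(x + residual)   # identity shortcut + learned residual

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))  # shape preserved: (1, 64, 32, 32)
print(out.shape)
```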