The program for CVPR consists of high quality contributed papers on all aspects of
computer vision and pattern recognition.

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Justin Johnson and Andrej Karpathy and Li Fei-Fei

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV, cs.LG

**First published:** 2015/11/24

**Abstract:** We introduce the dense captioning task, which requires a computer vision
system to both localize and describe salient regions in images in natural
language. The dense captioning task generalizes object detection when the
descriptions consist of a single word, and Image Captioning when one predicted
region covers the full image. To address the localization and description task
jointly we propose a Fully Convolutional Localization Network (FCLN)
architecture that processes an image with a single, efficient forward pass,
requires no external regions proposals, and can be trained end-to-end with a
single round of optimization. The architecture is composed of a Convolutional
Network, a novel dense localization layer, and Recurrent Neural Network
language model that generates the label sequences. We evaluate our network on
the Visual Genome dataset, which comprises 94,000 images and 4,100,000
region-grounded captions. We observe both speed and accuracy improvements over
baselines based on current state of the art approaches in both generation and
retrieval settings.

This paper introduces the task of dense captioning and proposes a network architecture that processes an image and produces region descriptions in a single pass and can be trained end-to-end.

Main contributions:

- Dense captioning
    - Generalization of object detection (caption consists of a single word) and image captioning (region consists of the whole image).
- Fully convolutional localization network
    - Fully differentiable, can be trained jointly with the rest of the network.
    - Consists of a region proposal network, box regression (similar to Faster R-CNN) and bilinear interpolation (similar to Spatial Transformer Networks) for sampling.
- Network details
    - Convolutional layer features are extracted for the image.
    - For each element in the feature map, k anchor boxes of different aspect ratios are selected in the input image space.
    - For each of these, the localization layer predicts offsets and confidence.
    - The region proposals are projected onto the convolutional feature map and a sampling grid is computed from the output feature map to the input (bilinear sampling).
    - The computed feature map is passed through an MLP to compute representations corresponding to each region.
    - These are passed (in a batch) as the first word to an LSTM (Show and Tell) which is trained to predict each word of the caption.

## Strengths

- Fully differentiable 'spatial attention' mechanism (bilinear interpolation) in place of RoI pooling as in the case of Faster R-CNN.
    - RoI pooling is not differentiable with respect to the input proposal coordinates.
- Fast, and impressive qualitative results.

## Weaknesses / Notes

The model is very well engineered together from different works (Faster R-CNN + Spatial Transformer Networks + Show & Tell).
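
The bilinear sampling step mentioned above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `bilinear_sample`, the `(x0, y0, x1, y1)` box format, and the use of feature-map coordinates are all hypothetical choices made for the example.

```python
import numpy as np

def bilinear_sample(features, box, out_h, out_w):
    """Sample a fixed-size grid from a feature map inside a (fractional) box.

    features: array of shape (C, H, W), a convolutional feature map.
    box: (x0, y0, x1, y1) in feature-map coordinates (assumed format).
    Returns an array of shape (C, out_h, out_w).
    """
    _, H, W = features.shape
    x0, y0, x1, y1 = box
    # Regular sampling grid spanning the box, clipped to the feature map.
    xs = np.clip(np.linspace(x0, x1, out_w), 0, W - 1)
    ys = np.clip(np.linspace(y0, y1, out_h), 0, H - 1)
    gx, gy = np.meshgrid(xs, ys)  # both of shape (out_h, out_w)

    # Integer neighbours and fractional interpolation weights.
    xl = np.floor(gx).astype(int)
    yl = np.floor(gy).astype(int)
    xh = np.minimum(xl + 1, W - 1)
    yh = np.minimum(yl + 1, H - 1)
    wx = gx - xl
    wy = gy - yl

    # Interpolate along x on the two neighbouring rows, then along y.
    top = features[:, yl, xl] * (1 - wx) + features[:, yl, xh] * wx
    bottom = features[:, yh, xl] * (1 - wx) + features[:, yh, xh] * wx
    return top * (1 - wy) + bottom * wy
```

Because every output value is a weighted sum whose weights depend smoothly on the box coordinates, gradients can flow back to the predicted box, which is the property the summary contrasts with RoI pooling.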

Exploiting local features from deep networks for image retrieval

Ng, Joe Yue-Hei and Yang, Fan and Davis, Larry S.

Conference on Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy

Keywords: dblp

In this paper, the authors raise an important point about instance-based image retrieval. For a task like image recognition, features extracted from the higher layers of deep networks generally work very well, but for instance-based image retrieval those features prove less useful. The authors therefore suggest taking features from lower layers and applying [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf) to them. As post-processing on top of the VLAD encoding, they apply intra-normalisation followed by PCA, reducing the encoding to 128 dimensions.

The authors ran their experiments with [GoogLeNet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf), trying Inception 3a, Inception 4a and Inception 4e on GoogLeNet and conv4_2, conv5_1 and conv5_2 on VGG-16. These layers perform almost equally well on the dataset they used. The performance metric used by the authors is mean average precision (mAP).
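
The encoding pipeline described here (VLAD, intra-normalisation, PCA) can be sketched in NumPy. This is a rough sketch under assumptions rather than the authors' code: the helper names `vlad_encode`, `fit_pca` and `apply_pca` are hypothetical, the codebook `centers` is assumed to come from k-means run beforehand, and treating each spatial position of a convolutional feature map as one local descriptor is also an assumption.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD with intra-normalisation.

    descriptors: (N, D) local features of one image.
    centers: (K, D) codebook (assumed to be learned by k-means beforehand).
    Returns a unit-length vector of size K * D.
    """
    K, D = centers.shape
    # Hard-assign each descriptor to its nearest codebook center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    # Accumulate the residuals to the assigned center.
    vlad = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    # Intra-normalisation: L2-normalise each center's residual block.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    vlad = vlad / np.maximum(norms, 1e-12)
    # Final global L2 normalisation.
    vlad = vlad.ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)

def fit_pca(X, n_components):
    """Return the mean and the top principal directions of the rows of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(x, mean, components):
    """Project one encoding (or a batch of them) onto the PCA basis."""
    return (x - mean) @ components.T
```

In the setting of the summary, `n_components` would be 128 and the PCA would be fitted on the VLAD vectors of a training set.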

FaceNet: A Unified Embedding for Face Recognition and Clustering

Florian Schroff and Dmitry Kalenichenko and James Philbin

arXiv e-Print archive - 2015 via Local arXiv

Keywords: cs.CV

**First published:** 2015/03/12

**Abstract:** Despite significant recent advances in the field of face recognition,
implementing face verification and recognition efficiently at scale presents
serious challenges to current approaches. In this paper we present a system,
called FaceNet, that directly learns a mapping from face images to a compact
Euclidean space where distances directly correspond to a measure of face
similarity. Once this space has been produced, tasks such as face recognition,
verification and clustering can be easily implemented using standard techniques
with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the
embedding itself, rather than an intermediate bottleneck layer as in previous
deep learning approaches. To train, we use triplets of roughly aligned matching
/ non-matching face patches generated using a novel online triplet mining
method. The benefit of our approach is much greater representational
efficiency: we achieve state-of-the-art face recognition performance using only
128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system
achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves
95.12%. Our system cuts the error rate in comparison to the best published
result by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet
loss, which describe different versions of face embeddings (produced by
different networks) that are compatible to each other and allow for direct
comparison between each other.

FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of a person who is not A). Later, this is called (anchor, positive, negative).

The loss function is inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.

## LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudo-metric

$$d(x, y) = (x - y) M (x - y)^T$$

where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold.

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example, this means the distance between the anchor and the negative example is low.

They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into the complete $\mathbb{R}^{128}$, but onto the unit sphere. Otherwise the margin could be satisfied trivially by scaling the embedding: $f' = 2 \cdot f$ multiplies all squared distances by 4.

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?

## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)
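
The triplet constraint from the "Triplet selection" part can be turned into a hinge loss, sketched below in NumPy. The function names and the default `alpha=0.2` are illustrative assumptions, not values taken from this summary; the embeddings are explicitly projected onto the unit sphere, as the summary requires.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project embeddings onto the unit sphere (row-wise)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean hinge loss over a batch of (anchor, positive, negative) embeddings.

    Each argument has shape (B, D). `alpha` is the margin; 0.2 is only an
    illustrative default chosen for this sketch.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    d_ap = np.sum((a - p) ** 2, axis=-1)  # squared anchor-positive distance
    d_an = np.sum((a - n) ** 2, axis=-1)  # squared anchor-negative distance
    # A triplet contributes zero once d_ap + alpha <= d_an.
    return np.maximum(d_ap - d_an + alpha, 0.0).mean()
```

Because of the normalisation, rescaling the raw embeddings does not change the loss, so the margin cannot be satisfied by simply making the vectors longer.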

Learning to count with deep object features

Seguí, Santi and Pujol, Oriol and Vitrià, Jordi

Conference on Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy

Keywords: dblp

This paper discusses some amazing results. The goal is to learn how to count by end-to-end training. The network input is an image and the output is a count of the objects inside it. They do not perform any direct training using the locations of the objects in the image. The reason for avoiding direct training is that labeled data is expensive. Employing a surrogate objective, such as the count of items in the image, is much cheaper and makes more sense because it is the goal of the system we want to learn. This paper states that it is possible!

They discuss experiments on two datasets: one of MNIST digits placed in an image and one with the UCSD Pedestrian Database.

The network description seems to be general and they don't report any special constraints on the design: `"We consider networks of two or more convolutional layers followed by one or more fully connected layers. Each convolutional layer consist of several elements: a set of convolutional filters, ReLU non-linearities, max pooling layers and normalization layers."` and `"We use a five layers architecture CNN with two convolutional layers followed by three fully connected layers"`.

They provide these two tables for their designs:

$$\begin{array}{c|c|c|c} Conv1 & Conv2 & FC1 & FC2 \\ \hline 10\text{x}15\text{x}15 & 10\text{x}3\text{x}3 & 32 & 6 \\ \text{x2 pool} & \text{x2 pool} & & \\ \hline \end{array}\\ \text{CNN arch for numbers}$$

$$\begin{array}{c|c|c|c|c} Conv1 & Conv2 & FC1 & FC2 & FC3 \\ \hline 8\text{x}9\text{x}9 & 8\text{x}5\text{x}5 & 128 & 128 & 25 \\ \text{x2 pool} & \text{x2 pool} & & & \\ \hline \end{array}\\ \text{CNN arch for people}$$

They state that they use a method based on hypercolumns \cite{1411.5752} but the description is not clear at all: `"Starting with the hypercolumn representation on the last layer we cluster the resulting hypercolumns into a set of prototypes using an online k-means algorithm. Then, a MIL approach with positive and negative instances with the concept of interest is used."`

![](https://i.imgur.com/x2q3E9Y.png)

Interesting work but I wish it was a longer paper with more details. This paper doesn't really give me enough information to reproduce it.

Actions ~ Transformations

Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
    - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
    - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [t/3, t/2] and [t/2, 2t/3] respectively and estimated via brute force search during training.
    - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
    - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) increase the distance for incorrect transformations whenever it is smaller than a chosen margin.
- ACT Dataset
    - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
    - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
    - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories which are the same action under different subjects, objects and scenes.
- Experiments
    - Action recognition on UCF101, HMDB51, ACT.
    - Cross-category generalization on ACT.
- Visualizations
    - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
    - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context.
    - Embedding retrievals based on transformed precondition embeddings.

**Thoughts**

- Modeling action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute force search over (y, z\_p, z\_e) to estimate the action category and segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass.
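
The idea of one transformation matrix per action category, trained with a contrastive objective, can be sketched as follows. This is a simplified sketch under assumptions, not the paper's exact formulation: the function names, the default `margin=0.5`, and the usual hinge form (incorrect transformations are pushed away only while they are closer than the margin) are all choices made for the example, and the latent search over z\_p and z\_e is left out.

```python
import numpy as np

def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity between two vectors."""
    return 1.0 - (u @ v) / max(np.linalg.norm(u) * np.linalg.norm(v), eps)

def transformation_loss(pre, eff, transforms, label, margin=0.5):
    """Contrastive loss for one (precondition, effect) embedding pair.

    pre, eff: (D,) embeddings from the two heads.
    transforms: (n, D, D), one matrix per action category.
    label: index of the correct action.
    margin: hypothetical value chosen for this sketch.
    """
    loss = 0.0
    for k, T in enumerate(transforms):
        d = cosine_distance(T @ pre, eff)
        if k == label:
            loss += d                     # pull the correct transformation toward the effect
        else:
            loss += max(0.0, margin - d)  # push incorrect ones away only if closer than margin
    return loss

def predict_action(pre, eff, transforms):
    """Pick the action whose transformation best maps precondition to effect."""
    dists = [cosine_distance(T @ pre, eff) for T in transforms]
    return int(np.argmin(dists))
```

At inference time `predict_action` would be called for every candidate split of the video into precondition and effect frames, which is the brute force search the review describes as expensive.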

Deep Residual Learning for Image Recognition

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the extra layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss in information flow in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 256 on ImageNet and 128 on CIFAR-10.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
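
A minimal NumPy sketch of a residual block illustrates why learning the identity becomes learning 0. As a simplifying assumption, dense layers stand in for the convolutions used in the paper, and batch normalization is omitted; the names `relu`, `residual_block`, `w1` and `w2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is a small two-layer transform.

    x: (B, D) input activations; w1: (D, H) and w2: (H, D) weights.
    Dense layers are used here instead of convolutions for brevity.
    """
    residual = relu(x @ w1) @ w2  # F(x): the part the block has to learn
    return relu(residual + x)     # identity shortcut added before the final ReLU
```

With all weights at zero the residual branch outputs 0, so the block passes non-negative inputs through unchanged: the identity mapping is the trivial solution rather than something the layers must fit.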
