Summaries from Conference and Computer Vision and Pattern Recognition on ShortScience.org


DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson and Andrej Karpathy and Li Fei-Fei
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV, cs.LG

[link] Summary by Abhishek Das 7 years ago

This paper introduces the task of dense captioning and proposes
a network architecture that processes an image and produces region descriptions
in a single pass and can be trained end-to-end. Main contributions:

- Dense captioning
    - Generalization of object detection (caption consists of single word)
    and image captioning (region consists of whole image).

- Fully convolutional localization network
    - Fully differentiable, can be trained jointly with the rest of the network
    - Consists of a region proposal network, box regression (similar to Faster R-CNN)
    and bilinear interpolation (similar to Spatial Transformer Networks) for
    sampling.

- Network details
    - Convolutional layer features are extracted for the image.
    - For each element in the feature map, k anchor boxes of different aspect ratios
    are selected in the input image space.
    - For each of these, the localization layer predicts offsets and confidence.
    - The region proposals are projected on the convolutional feature map and a sampling
    grid is computed from output feature map to input (bilinear sampling).
    - The computed feature map is passed through an MLP to compute representations
    corresponding to each region.
    - These are passed (in a batch) as the first word to an LSTM (Show and Tell) which
    is trained to predict each word of the caption.
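
The bilinear sampling step above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the authors' implementation: the feature map has a single channel, the box is already given in feature-map coordinates, and the names `bilinear_sample` and `crop_region` are hypothetical.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Sample a 2-D feature map (list of rows) at a real-valued (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so that the four neighbours always lie inside the map.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom

def crop_region(feature_map, box, out_h, out_w):
    """Resample box = (x_min, y_min, x_max, y_max) onto a fixed out_h x out_w grid."""
    x_min, y_min, x_max, y_max = box
    out = []
    for i in range(out_h):
        # Sampling-grid rows are spread evenly over the box height.
        gy = y_min + (y_max - y_min) * (i / (out_h - 1) if out_h > 1 else 0.5)
        row = []
        for j in range(out_w):
            gx = x_min + (x_max - x_min) * (j / (out_w - 1) if out_w > 1 else 0.5)
            row.append(bilinear_sample(feature_map, gx, gy))
        out.append(row)
    return out
```

Because every output value is a piecewise-linear function of the box coordinates, gradients can flow back to the predicted box, which is the property the summary contrasts with RoI pooling.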

## Strengths

- Fully differentiable 'spatial attention' mechanism (bilinear interpolation)
in place of RoI pooling as in the case of Faster R-CNN.
    - RoI pooling is not differentiable with respect to the input proposal coordinates.

- Fast, and impressive qualitative results.

## Weaknesses / Notes

The model is a well-engineered combination of existing works (Faster R-CNN +
Spatial Transformer Networks + Show & Tell).


Exploiting local features from deep networks for image retrieval
Ng, Joe Yue-Hei and Yang, Fan and Davis, Larry S.
Conference and Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Vivek Gandhi 8 years ago

In this paper, the authors raise an important point about instance-based image retrieval. For a task like image recognition, features extracted from the higher layers of deep networks generally work very well, but for instance-based image retrieval those higher-layer features are much less useful. The authors therefore suggest taking features from lower layers and applying [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf) to them. As post-processing on top of the VLAD encoding, they apply intra-normalisation and then PCA, reducing the encoding to 128 dimensions. The experiments use [GoogLeNet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf): Inception 3a, Inception 4a and Inception 4e on GoogLeNet, and conv4_2, conv5_1 and conv5_2 on VGG-16. These layers perform almost equally well on the dataset used. The performance metric is mean average precision (mAP).
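
As a rough illustration of the encoding step, the sketch below computes a VLAD vector with intra-normalisation from a handful of local descriptors. It assumes the cluster centres are already given (in practice they would come from k-means on training descriptors) and leaves out the PCA reduction to 128 dimensions; the function name `vlad_encode` is hypothetical and this is not the authors' code.

```python
import math

def vlad_encode(descriptors, centers):
    """Sum residuals to the nearest centre, intra-normalise, then L2-normalise."""
    dim = len(centers[0])
    residual_sums = [[0.0] * dim for _ in centers]
    for d in descriptors:
        # Assign the descriptor to its nearest centre (squared Euclidean distance).
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, centers[k])),
        )
        for i in range(dim):
            residual_sums[nearest][i] += d[i] - centers[nearest][i]
    # Intra-normalisation: L2-normalise each centre's residual block separately.
    encoded = []
    for block in residual_sums:
        norm = math.sqrt(sum(v * v for v in block))
        encoded.extend(v / norm if norm > 0 else 0.0 for v in block)
    # Final L2 normalisation of the concatenated vector.
    total = math.sqrt(sum(v * v for v in encoded))
    return [v / total if total > 0 else 0.0 for v in encoded]
```

With k centres and d-dimensional descriptors the result has k * d entries, which is why a dimensionality reduction such as PCA is applied afterwards.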


FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff and Dmitry Kalenichenko and James Philbin
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.CV

[link] Summary by Martin Thoma 8 years ago

FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative).

The embedding is learned with a loss function inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.

## LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudo-metric

$$d(x, y) = (x -y) M  (x -y)^T$$

where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Rightarrow x = y$ need not hold: two distinct points may have distance zero.
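
A small numeric sketch of the formula may help. It assumes plain Python lists for vectors and matrices, the function name `mahalanobis_sq` is hypothetical, and the matrices are toy examples rather than anything learned by LMNN.

```python
def mahalanobis_sq(x, y, M):
    """Compute (x - y) M (x - y)^T for list-based vectors and a square matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    # First M (x - y)^T, then the dot product with (x - y).
    m_diff = [sum(M[i][j] * diff[j] for j in range(n)) for i in range(n)]
    return sum(diff[i] * m_diff[i] for i in range(n))

identity = [[1.0, 0.0], [0.0, 1.0]]      # full rank: behaves like squared Euclidean distance
rank_deficient = [[1.0, 0.0], [0.0, 0.0]]  # semi-definite only: ignores the second coordinate
```

With `rank_deficient`, the points `[0, 0]` and `[0, 5]` are at distance zero although they differ, which is exactly the case where the identity-of-indiscernibles property of a metric fails.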

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low.

They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not to all of $\mathbb{R}^{128}$, but onto the unit sphere. Otherwise the margin could be satisfied trivially by scaling the embedding, e.g. $f' = 2 \cdot f$, which multiplies all squared distances by 4.
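
The inequality above is usually turned into a hinge-style loss that is zero once the margin is satisfied. The following is a minimal sketch under that assumption; the embeddings are toy 2-D vectors instead of 128-D network outputs, the default `alpha=0.2` is only an illustrative value, and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Project an embedding onto the unit sphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Zero when d(a, p) + alpha <= d(a, n); otherwise the amount of violation."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, sq_dist(a, p) - sq_dist(a, n) + alpha)
```

Because the embeddings are normalised inside the loss, multiplying all of them by a constant leaves the loss unchanged, so the margin cannot be gamed by rescaling.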

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?

## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13)  and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)


Learning to count with deep object features
Seguí, Santi and Pujol, Oriol and Vitrià, Jordi
Conference and Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

This paper discusses some amazing results. The goal is to learn how to count by end-to-end training. The network input is an image and the output is a count of the objects inside it. They do not perform any direct training using the locations of the objects in the image. 

The reason for avoiding direct training is that labeled data is expensive. Employing a surrogate objective, such as the count of items in the image, is much cheaper and makes more sense because it is the goal of the system we want to learn. This paper states that it is possible! They discuss experiments on two datasets: one of MNIST digits placed in an image and one based on the UCSD Pedestrian Database.

The network description seems to be general and they don't report any special constraints on the design  `"We consider networks of two or more convolutional layers followed by one or more fully connected layers. Each convolutional layer consist of several elements: a set of convolutional filters, ReLU non-linearities, max pooling layers and normalization layers."` and `"We use a five layers architecture CNN with two convolutional layers followed by three fully connected layers"`. They provide these two tables for their designs:

$$\begin{array}{c|c|c|c}
 Conv1 &  Conv2 & FC1 & FC2  \\ \hline
10\text{x}15\text{x}15 & 10\text{x}3\text{x}3 & 32 & 6 \\
\text{x2 pool} & \text{x2 pool} & & \\ \hline
\end{array}\\
\text{CNN arch for numbers}$$

$$
\begin{array}{c|c|c|c|c}
 Conv1 &  Conv2 & FC1 & FC2 & FC3 \\ \hline
8\text{x}9\text{x}9 & 8\text{x}5\text{x}5 & 128 & 128 & 25 \\
\text{x2 pool} & \text{x2 pool} & & \\ \hline
\end{array}\\
\text{CNN arch for people}$$

They state that they use a method based on hypercolumns \cite{1411.5752} but the description is not clear at all: `" Starting with the hypercolumn representation
on the last layer we cluster the resulting hypercolumns
into a set of prototypes using an online k-means
algorithm. Then, a MIL approach with positive and negative
instances with the concept of interest is used."`

![](https://i.imgur.com/x2q3E9Y.png)

Interesting work but I wish it was a longer paper with more details. This paper doesn't really give me enough information to reproduce it.


Actions ~ Transformations
Wang, Xiaolong and Farhadi, Ali and Gupta, Abhinav
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 8 years ago

Originally posted [here](https://github.com/abhshkdz/papers/blob/master/reviews/actions-~-transformations.md).

This paper introduces a novel representation for actions in videos as transformations that change the state of the environment from what it was before the action (precondition) to what it will be after it (effect).

- Model
    - The model utilizes a Siamese architecture with each head having convolutional and fully-connected layers (similar to VGG16). Each head extracts features for a subset of video frames (precondition or effect) that are aggregated by average pooling and followed by a fully-connected layer.
    - The precondition frames are indexed from 1 to z\_p and the effect frames from z\_e to t. Both z\_p and z\_e are latent variables, constrained to lie in [t/3, t/2] and [t/2, 2t/3] respectively and estimated via brute force search during training.
    - The action is represented as a linear transformation between the final fully-connected layers of the two heads. For n action categories, the transformation layer has n transformation matrices.
    - The model is trained with a contrastive loss function to 1) maximize cosine similarity between the effect embedding and the transformed precondition embedding, and 2) push the distance for incorrect transformations above a chosen margin.
- ACT Dataset
    - 50 keywords, 43 classes, ~500 YouTube videos per keyword.
    - The authors collect the ACT dataset primarily for the task of cross-category generalization (as it doesn't allow models to overfit to contextual information). For example, how would a model learned on "opening a window" generalize to recognize "opening the trunk of the car"? How about generalizing from a model trained on "climbing a cliff" to recognize "climbing a tree"?
    - The ACT dataset has class and super-class annotations from human workers. Each super-class has different sub-categories which are the same action under different subjects, objects and scenes.
- Experiments
    - Action recognition on UCF101, HMDB51, ACT.
    - Cross-category generalization on ACT.
- Visualizations
    - Nearest neighbor: modeling the actions as transformations gives semantically meaningful retrievals that don't just depend on motion and color.
    - Gradient visualizations (Simonyan et al. 2014): the model focuses on changes in the scene (human + object) rather than on context.
    - Embedding retrievals based on transformed precondition embeddings.
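
The transformation idea in the Model section can be sketched as follows. This is a toy illustration, not the authors' code: embeddings are short lists instead of fully-connected layer outputs, non-zero embeddings are assumed, the margin value is arbitrary, and the names `classify_action` and `contrastive_loss` are hypothetical.

```python
import math

def matvec(T, v):
    """Apply a transformation matrix T (list of rows) to a vector v."""
    return [sum(t * a for t, a in zip(row, v)) for row in T]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def classify_action(precondition, effect, transforms):
    """Pick the action whose matrix maps the precondition closest to the effect."""
    distances = [cosine_distance(matvec(T, precondition), effect) for T in transforms]
    best = min(range(len(transforms)), key=distances.__getitem__)
    return best, distances

def contrastive_loss(distances, true_class, margin=0.5):
    """Pull the correct transformation close; penalise wrong ones closer than the margin."""
    loss = distances[true_class]
    for y, d in enumerate(distances):
        if y != true_class:
            loss += max(0.0, margin - d)
    return loss
```

For example, with an identity matrix and a 90-degree rotation as the two "actions", a precondition `[1, 0]` and an effect `[0, 1]` are matched by the rotation, since only that matrix maps one embedding onto the other.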

**Thoughts**

- Modeling action as a transformation from precondition to effect is a very neat idea.
- The exact formulation and supporting experiments and ablation studies are thorough.
- During inference, the model first extracts features for all frames and then does a brute force search over (y,z\_p,z\_e) to estimate the action category and segmentation into precondition and effect. For longer sequences, this seems expensive. Although hard decisions aren't differentiable, a soft attention mechanism on z might be feasible and reduce computation to a single forward pass.


Deep Residual Learning for Image Recognition
He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Martin Thoma 9 years ago

Deeper networks should never have a higher **training** error than shallower ones. In the worst case, the extra layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity shortcuts which skip some layers. The network then only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
    * No vanishing / exploding gradient
* Identities don't have parameters to be learned
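
A minimal sketch of a residual block makes the first point concrete. It assumes a two-layer fully-connected residual branch on plain Python lists, whereas the paper uses convolutions with batch normalization and applies a ReLU after the addition, which is left out here; all names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, b, v):
    """Affine map W v + b with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]

def residual_block(x, W1, b1, W2, b2):
    """y = x + F(x), where F(x) = W2 relu(W1 x + b1) + b2 is the learned residual."""
    f = linear(W2, b2, relu(linear(W1, b1, x)))
    return [a + r for a, r in zip(x, f)]
```

If all weights and biases of the residual branch are zero, `residual_block` returns its input unchanged, so representing the identity only requires driving the branch towards zero.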

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)