Summary by Eddie Smolansky
# Metadata
* **Title**: The Do’s and Don’ts for CNN-based Face Verification
* **Authors**: Ankan Bansal, Carlos Castillo, Rajeev Ranjan, Rama Chellappa (UMIACS, University of Maryland, College Park)
* **Link**: https://arxiv.org/abs/1705.07426
# Abstract
>Convolutional neural networks (CNN) have become the most sought after tools for addressing object recognition problems. Specifically, they have produced state-of-the art results for unconstrained face recognition and verification tasks. While the research community appears to have developed a consensus on the methods of acquiring annotated data, design and training of CNNs, many questions still remain to be answered. In this paper, we explore the following questions that are critical to face recognition research: (i) Can we train on still images and expect the systems to work on videos? (ii) Are deeper datasets better than wider datasets? (iii) Does adding label noise lead to improvement in performance of deep networks? (iv) Is alignment needed for face recognition? We address these questions by training CNNs using CASIA-WebFace, UMDFaces, and a new video dataset and testing on YouTubeFaces, IJBA and a disjoint portion of UMDFaces datasets. Our new data set, which will be made publicly available, has 22,075 videos and 3,735,476 human annotated frames extracted from them.
# Introduction
>We make the following main contributions in this paper:
>• We introduce a large dataset of videos of over 3,000 subjects along with 3,735,476 human annotated bounding boxes in frames extracted from these videos.
>• We conduct a large scale systematic study about the effects of making certain apparently routine decisions about the training procedure. Our experiments clearly show that data variety, number of individuals in the dataset, quality of the dataset, and good alignment are keys to obtaining good performance.
>• We suggest the best practices that could lead to an improvement in the performance of deep face recognition networks. These practices will also guide future data collection efforts.
# How they made the dataset
- collect YouTube videos
- automated filtering with YOLO and facial landmark detection tools
- crowdsource the final filtering on Amazon Mechanical Turk (AMT): show workers batches of 50 face images and ask which ones don't belong
- quality control through sentinels: give workers the same kind of test but with 5 known correct answers, and rank the workers according to how they perform on this ground-truth test. If they score well, trust their answers on the real tests (see the sketch after this list).
- result:
> we have 3,735,476 annotated frames in 22,075 videos. We will publicly release this massive dataset
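A minimal sketch of how the sentinel-based quality control could work (the data structures, threshold, and helper names below are hypothetical illustrations, not the authors' actual pipeline): score each worker by accuracy on the known-answer sentinel questions, then keep only the answers of workers who score above a cutoff.

```python
# Hypothetical sketch of sentinel-based quality control for AMT annotations.
# `sentinel_answers`, `real_answers`, and the 0.8 cutoff are made-up for illustration.

def score_workers(sentinel_answers, ground_truth):
    """Return each worker's accuracy on the sentinel (known-answer) questions."""
    scores = {}
    for worker, answers in sentinel_answers.items():
        correct = sum(1 for q, a in answers.items() if ground_truth.get(q) == a)
        scores[worker] = correct / max(len(answers), 1)
    return scores

def trusted_annotations(real_answers, worker_scores, min_accuracy=0.8):
    """Keep only annotations from workers who passed the sentinel test."""
    return {
        worker: answers
        for worker, answers in real_answers.items()
        if worker_scores.get(worker, 0.0) >= min_accuracy
    }

# Toy usage:
ground_truth = {"q1": "outlier_3", "q2": "outlier_7"}
sentinel_answers = {
    "worker_a": {"q1": "outlier_3", "q2": "outlier_7"},  # 100% on sentinels
    "worker_b": {"q1": "outlier_1", "q2": "outlier_2"},  # 0% on sentinels
}
real_answers = {"worker_a": {"task1": "outlier_5"}, "worker_b": {"task1": "outlier_9"}}

scores = score_workers(sentinel_answers, ground_truth)
print(trusted_annotations(real_answers, scores))  # only worker_a's answers survive
```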
# Questions and experiments
## Do deep recognition networks trained on stills perform well on videos?
> We study the effects of this difference between still images and frames extracted from videos in section 3.1 using our new dataset. We found that mixing both still images and the large number of video frames during training performs better than using just still images or video frames, for testing on any of the test datasets
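As a rough illustration of that finding, here is a minimal sketch of pooling still images and video frames into one training list over a shared subject label space. The CSV file names and their "path,subject_id" layout are assumptions for illustration, not the paper's data format.

```python
# Minimal sketch of mixing still images and video frames for training.
import random

def load_annotations(csv_path):
    """Read (image_path, subject_id) pairs from a simple annotation file."""
    with open(csv_path) as f:
        return [tuple(line.strip().split(",")) for line in f if line.strip()]

stills = load_annotations("umdfaces_stills.csv")        # hypothetical file
frames = load_annotations("umdfaces_video_frames.csv")  # hypothetical file

# Map every subject (from either source) into one shared label space, then mix.
subjects = sorted({s for _, s in stills + frames})
label_of = {s: i for i, s in enumerate(subjects)}

train_list = [(path, label_of[s]) for path, s in stills + frames]
random.shuffle(train_list)  # batches now interleave stills and video frames
```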
## What is better: deeper or wider datasets?
>In section 3.2 we investigate the impact of using a deep dataset against using a wider dataset. For two datasets with the same number of images, we call one deeper than the other if on average it has more images per subject (and hence fewer subjects) than the other. We show that it is important to have a wider dataset than a deeper dataset with the same number of images.
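To make the comparison concrete, here is a sketch of how one might carve out a "wide" and a "deep" training subset with the same total number of images. The helper and the example numbers are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch: build subsets with a fixed total image count but different width/depth.
import random
from collections import defaultdict

def make_subset(pairs, num_subjects, images_per_subject, seed=0):
    """pairs: list of (image_path, subject_id).
    Returns a subset with `num_subjects` identities and `images_per_subject`
    images each, i.e. num_subjects * images_per_subject images in total."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for path, subject in pairs:
        by_subject[subject].append(path)
    # Only subjects with enough images are eligible.
    eligible = [s for s, imgs in by_subject.items() if len(imgs) >= images_per_subject]
    chosen = rng.sample(eligible, num_subjects)
    subset = []
    for s in chosen:
        subset.extend((p, s) for p in rng.sample(by_subject[s], images_per_subject))
    return subset

# Same total of 100,000 images, split differently (illustrative numbers):
# wide = make_subset(pairs, num_subjects=10000, images_per_subject=10)
# deep = make_subset(pairs, num_subjects=2000, images_per_subject=50)
```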
## Does some amount of label noise help improve the performance of deep recognition networks?
>When training any supervised face classification system, each image is first associated with a label. Label noise is the phenomenon of assigning an incorrect label to some images. Label noise is an inherent part of the data collection process. Some authors intentionally leave in some label noise [25, 6, 7] in the dataset in hopes of making the deep networks more robust. In section 3.3 we examine the effect of this label noise on the performance of deep networks for verification trained on these datasets and demonstrate that clean datasets almost always lead to significantly better performance than noisy datasets.
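One way to run this kind of comparison on your own data is to inject synthetic label noise into an otherwise clean training list and train once on each version. The sketch below assumes integer class labels and uniform flipping; it is an illustration, not the paper's noise model.

```python
# Sketch: inject synthetic label noise into a (path, label) training list.
import random

def add_label_noise(pairs, noise_fraction, num_classes, seed=0):
    """Reassign `noise_fraction` of the integer labels to a different class,
    chosen uniformly at random."""
    rng = random.Random(seed)
    noisy = list(pairs)
    n_flip = int(noise_fraction * len(noisy))
    for idx in rng.sample(range(len(noisy)), n_flip):
        path, label = noisy[idx]
        wrong = rng.randrange(num_classes - 1)
        if wrong >= label:
            wrong += 1  # skip the true label so the new one is always incorrect
        noisy[idx] = (path, wrong)
    return noisy

# e.g. train one model on `pairs` (clean) and one on
# add_label_noise(pairs, noise_fraction=0.05, num_classes=10000), then compare.
```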
## Does thumbnail creation method affect performance?
>... This leads to generation of different types of bounding boxes for faces. Verification accuracy can be affected by the type of bounding box used. In addition, most recent face recognition and verification methods [35, 31, 33, 5, 9, 34] use some kind of 2D or 3D alignment procedure [41, 14, 28, 8]. ... In section 3.4 we study the consequences of using different thumbnail generation methods on verification performance of deep networks. We show that using a good keypoint detection method and aligning faces both during training and testing leads to the best performance.
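The keypoint-based alignment step can be approximated with a standard 2D similarity transform that maps detected landmarks onto a fixed template. The sketch below uses OpenCV; the 5-point template coordinates and the 112x112 thumbnail size are common conventions assumed here for illustration, not the paper's exact alignment parameters.

```python
# Sketch: 2D similarity-transform face alignment from 5 detected keypoints.
import cv2
import numpy as np

# Target positions in a 112x112 thumbnail for: left eye, right eye, nose tip,
# left mouth corner, right mouth corner (illustrative template, not the paper's).
TEMPLATE = np.array([
    [38.3, 51.7], [73.5, 51.5], [56.0, 71.7], [41.5, 92.4], [70.7, 92.2]
], dtype=np.float32)

def align_face(image, keypoints, size=(112, 112)):
    """Warp `image` so the 5 detected `keypoints` (5x2 array) land on TEMPLATE."""
    src = np.asarray(keypoints, dtype=np.float32)
    # Estimate rotation + uniform scale + translation (similarity transform).
    matrix, _ = cv2.estimateAffinePartial2D(src, TEMPLATE, method=cv2.LMEDS)
    return cv2.warpAffine(image, matrix, size)

# aligned = align_face(frame, detected_keypoints)  # keypoints from any landmark detector
```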