FaceNet: A Unified Embedding for Face Recognition and Clustering
Florian Schroff
and
Dmitry Kalenichenko
and
James Philbin
arXiv e-Print archive - 2015 via Local arXiv
Keywords:
cs.CV
First published: 2015/03/12 (9 years ago) Abstract: Despite significant recent advances in the field of face recognition,
implementing face verification and recognition efficiently at scale presents
serious challenges to current approaches. In this paper we present a system,
called FaceNet, that directly learns a mapping from face images to a compact
Euclidean space where distances directly correspond to a measure of face
similarity. Once this space has been produced, tasks such as face recognition,
verification and clustering can be easily implemented using standard techniques
with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the
embedding itself, rather than an intermediate bottleneck layer as in previous
deep learning approaches. To train, we use triplets of roughly aligned matching
/ non-matching face patches generated using a novel online triplet mining
method. The benefit of our approach is much greater representational
efficiency: we achieve state-of-the-art face recognition performance using only
128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system
achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves
95.12%. Our system cuts the error rate in comparison to the best published
result by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet
loss, which describe different versions of face embeddings (produced by
different networks) that are compatible to each other and allow for direct
comparison between each other.
FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative).
The loss function is learned and inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other persons image.
## LMNN
Large Margin Nearest Neighbor (LMNN) is learning a pseudo-metric
$$d(x, y) = (x -y) M (x -y)^T$$
where $M$ is a positive-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ does not hold.
## Curriculum Learning: Triplet selection
Show simple examples first, then increase the difficulty. This is done by selecting the triplets.
They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example this means the distance between the anchor and the negative example is low.
They want to have
$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$
where $\alpha$ is a margin and $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not in the complete $\mathbb{R}^{128}$, but on the unit sphere. Otherwise one could double $\alpha$ by simply making $f' = 2 \cdot f$.
## Tasks
* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?
## Datasets
* 99.63% accuracy on Labeled FAces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB
## Network
Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).
## See also
* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)
## Keywords
Triplet-loss , face embedding , harmonic embedding
---
## Summary
### Introduction
**Goal of the paper**
A unified system is given for face verification , recognition and clustering.
Use of a 128 float pose and illumination invariant feature vector or embedding in the euclidean space.
* Face Verification : Same faces of the person gives feature vectors that have a very close L2 distance between them.
* Face recognition : Face recognition becomes a clustering task in the embedding space
**Previous work**
* Previous use of deep learning made use of an bottleneck layer to represent face as an embedding of 1000s dimension vector.
* Some other techniques use PCA to reduce the dimensionality of the embedding for comparison.
**Method**
* This method makes use of inception style CNN to get an embedding of each face.
* The thumbnails of the face image are the tight crop of the face area with only scaling and translation done on them.
**Triplet Loss**
Triplet loss makes use of two matching face thumbnails and a non-matching thumbnail. The loss function tries to reduce the distance between the matching pair while increasing the separation between the the non-matching pair of images.
**Triplet Selection**
* Selection of triplets is done such that samples are hard-positive or hard-negative .
* Hardest negative can lead to local minima early in the training and a collapse model in a few cases
* Use of semi-hard negatives help to improve the convergence speed while at the same time reach nearer to the global minimum.
**Deep Convolutional Network**
* Training is done using SGD (Stochastic gradient descent) with Backpropagation and AdaGrad
* The training is done on two networks :
- Zeiler&Fergus architecture with model depth of 22 and 140 million parameters
- GoogLeNet style inception model with 6.6 to 7.5 million parameters.
**Experiment**
* Study of the following cases are done :
- Quality of the jpeg image : The validation rate of model improves with the JPEG quality upto a certain threshold.
- Embedding dimensionality : The dimension of the embedding increases from 64 to 128,256 and then gradually starts to decrease at 512 dimensions.
- No. of images in the training data set
**Results classification accuracy** :
- LFW(Labelled faces in the wild) dataset : 98.87% 0.15
- Youtube Faces DB : 95.12% .39
On clustering tasks the model was able to work on a wide varieties of face images and is invariant to pose , lighting and also age.
**Conclusion**
* The model can be extended further to improve the overall accuracy.
* Training networks to run on smaller systems like mobile phones.
* There is need for improving the training efficiency.
---
## Notes
* Harmonic embedding is a set of embedding that we get from different models but are compatible to each other. This helps to improve future upgrades and transitions to a newer model
* To make the embeddings compatible with different models , harmonic-triplet loss and the generated triplets must be compatible with each other
## Open research questions
* Better understanding of the error cases.
* Making the model more compact for embedded and mobile use cases.
* Methods to reduce the training times.