## General stuff about face recognition

Face recognition has 4 main tasks:

* **Face detection**: Given an image, draw a rectangle around every face.
* **Face alignment**: Transform a face into a canonical pose.
* **Face representation**: Find a representation of a face which is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes).
* **Face verification**: Images of two faces are given. Decide whether they show the same person.

The face verification task is sometimes replaced by the (simpler) face classification task (given a face, decide which of a fixed set of people it is).

Datasets being used are:

* **LFW** (Labeled Faces in the Wild): 13,233 web photos of 5,749 celebrities (DeepFace reaches 97.35% accuracy on it)
* **YTF** (YouTube Faces): 3,425 YouTube videos of 1,595 subjects
* **SFC** (Social Face Classification): 4.4 million labeled faces from 4,030 people, 800 to 1,200 faces each
* **USF** (Human-ID database): 3D scans of faces

## Ideas in this paper

This paper deals with face alignment and face representation.

**Face Alignment**

They construct an average 3D face from the USF dataset. Then, for each new face, they apply the following procedure:

* Find 6 points in the face (2 eyes, 1 nose tip, 2 mouth corners, 1 middle point of the bottom lip).
* Crop the face according to those points.
* Find 67 points in the face and map them onto a normalized 3D face model.
* Transform (= align) the face to a normalized position.

**Representation**

Train a neural network on 152x152 images of faces to classify 4,030 celebrities. Remove the softmax output layer and use the output of the second-to-last layer as the representation. The network is (a code sketch follows at the end of this summary):

* C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB channels) (returns $142 \times 142$ "images")
* M2 (max pooling): $3 \times 3$, stride of 2 (returns $71 \times 71$ "images")
* C3 (convolution): 16 filters of size $9 \times 9 \times 16$ (returns $63 \times 63$ "images")
* L4 (locally connected): $16 \times 9 \times 9 \times 16$ (returns $55 \times 55$ "images")
* L5 (locally connected): $16 \times 7 \times 7 \times 16$ (returns $25 \times 25$ "images")
* L6 (locally connected): $16 \times 5 \times 5 \times 16$ (returns $21 \times 21$ "images")
* F7 (fully connected): ReLU, 4096 units
* F8 (fully connected): softmax layer with 4030 output neurons

The training was done with:

* Stochastic Gradient Descent (SGD)
* Momentum of 0.9
* Performance scheduling (learning rate starting at 0.01, ending at 0.0001)
* Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$
* ~15 epochs ($\approx$ 3 days) of training

## Evaluation results

* **Quality**:
  * 97.35% accuracy (or mean accuracy?) with an ensemble of DNNs on LFW
  * 91.4% accuracy with a single network on YTF
* **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure at which size). This includes image decoding, face detection and alignment, **the** feed-forward network (why only one? wasn't the ensemble the best-performing model?) and the final classification output.

## See also

* Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI)
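To make the layer shapes concrete, here is a minimal PyTorch sketch of a DeepFace-style network with the sizes listed above. The `LocallyConnected2d` module, the `ceil_mode=True` pooling (to get 71x71 from 142x142) and the stride of 2 in L5 (to get 25x25 from 55x55) are my assumptions to reproduce the reported feature-map sizes; the paper's exact settings may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    """Like a convolution, but with a separate (unshared) filter bank per output location."""
    def __init__(self, in_ch, out_ch, in_size, kernel, stride=1):
        super().__init__()
        self.kernel, self.stride = kernel, stride
        self.out_size = (in_size - kernel) // stride + 1
        n_loc = self.out_size ** 2
        # weight init as described above: w ~ N(0, 0.01), b = 0.5
        self.weight = nn.Parameter(torch.randn(n_loc, out_ch, in_ch * kernel ** 2) * 0.01)
        self.bias = nn.Parameter(torch.full((out_ch, self.out_size, self.out_size), 0.5))

    def forward(self, x):
        # extract all kernel x kernel patches: (N, in_ch*k*k, locations)
        patches = F.unfold(x, self.kernel, stride=self.stride)
        # apply a different filter bank at every location
        out = torch.einsum('ncl,loc->nol', patches, self.weight)
        n, c = out.shape[:2]
        return out.view(n, c, self.out_size, self.out_size) + self.bias

class DeepFaceLike(nn.Module):
    def __init__(self, num_classes=4030):
        super().__init__()
        self.c1 = nn.Conv2d(3, 32, 11)                          # 152 -> 142
        self.m2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)     # 142 -> 71 (assumption: ceil mode)
        self.c3 = nn.Conv2d(32, 16, 9)                          # 71 -> 63
        self.l4 = LocallyConnected2d(16, 16, 63, 9)             # 63 -> 55
        self.l5 = LocallyConnected2d(16, 16, 55, 7, stride=2)   # 55 -> 25 (assumption: stride 2)
        self.l6 = LocallyConnected2d(16, 16, 25, 5)             # 25 -> 21
        self.f7 = nn.Linear(16 * 21 * 21, 4096)
        self.drop = nn.Dropout(0.5)
        self.f8 = nn.Linear(4096, num_classes)                  # softmax over identities

    def forward(self, x):
        x = self.m2(F.relu(self.c1(x)))
        x = F.relu(self.c3(x))
        x = F.relu(self.l4(x))
        x = F.relu(self.l5(x))
        x = F.relu(self.l6(x))
        f = F.relu(self.f7(x.flatten(1)))      # 4096-d face representation
        return self.f8(self.drop(f)), f        # class logits + representation

# Training setup as described: SGD, momentum 0.9, learning rate scheduled from 0.01 to 0.0001.
model = DeepFaceLike()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()                # cross-entropy over the 4030 identities

# One forward pass on a dummy batch of four 152x152 RGB crops:
logits, features = model(torch.randn(4, 3, 152, 152))
loss = loss_fn(logits, torch.tensor([0, 1, 2, 3]))
```

With these sizes the model comes out at roughly 120 million parameters, dominated by the locally connected and fully connected layers, which matches the figure quoted in the second summary below.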
---
They describe a CNN architecture that can be used to identify a person given an image of their face.

### How

* The expected input is the image of a face (i.e. it does not search for faces in images; the faces already have to be extracted by a different method).
* *Face alignment / frontalization*
  * Goal of this step: get rid of variations within the face images, so that every face seems to look straight into the camera ("frontalized").
  * 2D alignment
    * They search for landmarks (fiducial points) on the face.
    * They use SVRs (features: LBPs) for that.
    * After every application of the SVR, the localized landmarks are used to transform/normalize the face. Then the SVR is applied again. By doing this, the locations of the landmarks are gradually refined.
    * They use the detected landmarks to normalize the face images (via scaling, rotation and translation).
  * 3D alignment
    * The 2D alignment only normalizes variations within the 2D plane, not out-of-plane variations (e.g. seeing the face from its left/right side). To normalize out-of-plane variations they need a 3D transformation.
    * They detect an additional 67 landmarks on the faces (again via SVRs).
    * They construct a human face mesh from a dataset (USF Human-ID).
    * They map the 67 landmarks to that mesh.
    * They then use some more complicated steps to recover the frontalized face image.
* *CNN architecture*
  * The CNN receives the frontalized face images (152x152, RGB).
  * It then applies the following steps:
    * Convolution, 32 filters, 11x11, ReLU (-> 32x142x142, CxHxW)
    * Max pooling over 3x3, stride 2 (-> 32x71x71)
    * Convolution, 16 filters, 9x9, ReLU (-> 16x63x63)
    * Local convolution, 16 filters, 9x9, ReLU (-> 16x55x55)
    * Local convolution, 16 filters, 7x7, ReLU (-> 16x25x25)
    * Local convolution, 16 filters, 5x5, ReLU (-> 16x21x21)
    * Fully connected, 4096, ReLU
    * Fully connected, 4030, softmax
  * Local convolutions use a different set of learned weights at every "pixel" (while a normal convolution uses the same set of weights at all locations).
  * They can afford to use local convolutions because of their frontalization, which roughly forces specific landmarks to be at specific locations.
  * They use dropout (apparently only after the first fully connected layer).
  * They normalize "the features" (probably the output of the 4096-unit fully connected layer): each component is divided by its maximum value across a training set, and the whole vector is then L2-normalized. The goal of this step is to make the network less sensitive to illumination changes.
  * The whole network has about 120 million parameters.
  * Visualization of the architecture: *(figure missing)*
* *Training*
  * The network receives images, each showing a face, and is trained to classify the identity of the face (e.g. it gets an image of Obama and has to return "that's Obama").
  * They use cross-entropy as their loss.
* *Face verification*
  * In order to tell whether two images of faces show the same person, they try three different methods (see the sketch after this list).
  * Each of these relies on the vector extracted by the first fully connected layer of the network (4096d).
  * Let these vectors be `f1` (image 1) and `f2` (image 2). The methods are then:
    1. Inner product between `f1` and `f2`. The classification (same person / not same person) is then done by a simple threshold.
    2. Weighted X^2 (chi-squared) distance. Per vector component i: `weight_i * (f1[i] - f2[i])^2 / (f1[i] + f2[i])`. The resulting vector is then fed into an SVM.
    3. Siamese network. Here this simply means that the absolute difference between `f1` and `f2` is calculated (`|f1 - f2|`), each component is weighted by a learned weight and the sum of the components is taken. If the result is above a threshold, the faces are considered to show the same person.
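Below is a small NumPy sketch of the feature normalization and the three comparison methods described above. The helper names, the `eps` term and the placeholder weights/threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalize_features(feats, train_max):
    """Divide each component by its maximum over a training set, then L2-normalize."""
    feats = feats / train_max  # train_max: per-component maximum over the training set
    return feats / np.linalg.norm(feats, axis=-1, keepdims=True)

def inner_product_score(f1, f2):
    """Method 1: inner product, thresholded for the same/not-same decision."""
    return np.dot(f1, f2)

def weighted_chi2_vector(f1, f2, weights, eps=1e-8):
    """Method 2: per-component weighted chi^2 terms; this vector is fed into an SVM."""
    return weights * (f1 - f2) ** 2 / (f1 + f2 + eps)

def siamese_score(f1, f2, weights):
    """Method 3 ("siamese"): weighted sum of absolute differences, thresholded."""
    return np.sum(weights * np.abs(f1 - f2))

# Toy usage with random stand-ins for two 4096-d descriptors.
rng = np.random.default_rng(0)
f1, f2 = rng.random(4096), rng.random(4096)
w = np.ones(4096)                        # learned weights in the paper; constant here
print(inner_product_score(f1, f2))
print(siamese_score(f1, f2, w) < 100.0)  # arbitrary threshold, just for illustration
```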
### Results

* They train their network on the Social Face Classification (SFC) dataset. That seems to be a Facebook-internal dataset (i.e. not public) with 4.4 million faces of about 4,000 people.
* When applied to the LFW dataset:
  * Face recognition ("which person is shown in the image") (apparently they retrained the whole model on LFW for this task?):
    * Simple SVM with LBP features (i.e. not their network): 91.4% mean accuracy.
    * Their model, with frontalization, with 2D alignment: ??? (no value given).
    * Their model, no frontalization (only 2D alignment): 94.3% mean accuracy.
    * Their model, no frontalization, no 2D alignment: 87.9% mean accuracy.
  * Face verification (two images -> same/not same person) (apparently also trained on LFW? unclear; a sketch of this kind of evaluation follows below):
    * Method 1 (inner product + threshold): 95.92% mean accuracy.
    * Method 2 (X^2 vector + SVM): 97.00% mean accuracy.
    * Method 3 (siamese): apparently 96.17% accuracy alone, and 97.25% when used in an ensemble with other methods (under a special training schedule using the SFC dataset).
* When applied to the YTF dataset (YouTube video frames):
  * 92.5% accuracy via the X^2 method.
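The verification numbers above are accuracies over labeled image pairs ("same person" / "different person"). A minimal sketch of such a threshold-based evaluation, assuming a list of descriptor pairs and ground-truth labels (the function name, the inner-product score and the threshold value are illustrative):

```python
import numpy as np

def verification_accuracy(pairs, labels, threshold):
    """Fraction of pairs classified correctly by thresholding a similarity score.

    pairs:  list of (f1, f2) descriptor tuples (e.g. the 4096-d features)
    labels: 1 if the pair shows the same person, else 0
    """
    preds = [int(np.dot(f1, f2) > threshold) for f1, f2 in pairs]  # method 1: inner product
    return np.mean(np.array(preds) == np.array(labels))

# Toy usage with random descriptors; a mean accuracy is typically the average
# of this value over several folds of such pairs.
rng = np.random.default_rng(0)
pairs = [(rng.random(4096), rng.random(4096)) for _ in range(10)]
labels = rng.integers(0, 2, size=10)
print(verification_accuracy(pairs, labels, threshold=1000.0))
```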