[link]
* They describe a model to locate faces in images.
* Their model uses information from suspected face regions *and* from the corresponding suspected body regions to classify whether a region contains a face.
* The intuition is that seeing the region around the face (specifically where the body should be) can help in estimating whether a suspected face really is a face (e.g. it might also be part of a painting, statue or doll).

### How

* Their whole model is called "CMS-RCNN" (Contextual Multi-Scale Region-CNN).
* It is based on the "Faster R-CNN" architecture.
* It uses the VGG network.
* Its subcomponents are MS-RPN and CMS-CNN.
* MS-RPN finds candidate face regions. CMS-CNN refines their bounding boxes and classifies them (face / not face).
* **MS-RPN** (Multi-Scale Region Proposal Network)
  * "Looks" at the feature maps of the network (VGG) at multiple scales (i.e. before/after pooling layers) and suggests regions for possible faces.
  * Steps (a code sketch of the feature fusion appears at the end of this summary):
    * Feed an image through the VGG network.
    * Extract the feature maps of the three last convolutions that come before a pooling layer.
    * Pool these feature maps so that they all have the same heights and widths.
    * Apply L2 normalization to each feature map so that they all have the same scale.
    * Apply a 1x1 convolution to merge them into one feature map.
    * Regress face bounding boxes from that feature map according to the Faster R-CNN technique.
* **CMS-CNN** (Contextual Multi-Scale CNN)
  * "Looks" at the feature maps of the face candidates found by MS-RPN and classifies whether these regions contain faces.
  * It uses the same multi-scale technique (i.e. it takes feature maps from convolutions that come before pooling layers).
  * It uses some area around each face region as additional information (the suspected region of the body).
  * Steps (sketched in code at the end of this summary):
    * Receive the face candidate regions from MS-RPN.
    * For each candidate region:
      * Calculate the suspected coordinates of the body (based only on the x/y-position and size of the face region, i.e. not learned).
      * Extract the feature maps of the *face* region (at multiple scales) and apply RoI pooling to them (i.e. convert them to a fixed height and width).
      * Extract the feature maps of the *body* region (at multiple scales) and apply RoI pooling to them.
      * L2-normalize each feature map.
      * Concatenate the (RoI-pooled and normalized) feature maps of the face (at multiple scales) with each other (creates one tensor).
      * Concatenate the (RoI-pooled and normalized) feature maps of the body with each other (creates another tensor).
      * Apply a 1x1 convolution to the face tensor.
      * Apply a 1x1 convolution to the body tensor.
      * Apply two fully connected layers to the face tensor, creating a vector.
      * Apply two fully connected layers to the body tensor, creating a vector.
      * Concatenate both vectors.
      * Based on that vector, classify whether the region really contains a face.
      * Based on that vector, regress the face's final bounding box coordinates and dimensions.
* Note: They use the multi-scale approach in both networks in order to be able to find small or tiny faces. Without it, such faces would be hard or impossible to detect after the pooling layers.

### Results

* Adding context to the classification (i.e. the body regions) empirically improves the results.
* Their model achieves the highest recall rate on FDDB compared to other models. However, it has lower recall if only very few false positives are accepted.
* Figure: FDDB ROC curves (theirs is bold red).
* Figure: example results on FDDB.
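
### Code sketches

Below is a minimal sketch of the multi-scale feature fusion used by MS-RPN, written in PyTorch (an assumption; the paper does not prescribe a framework). The torchvision layer indices for conv3_3/conv4_3/conv5_3 and the plain channel-wise L2 norm are assumptions as well; implementations of this kind of normalization often add a learned per-channel scale.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.models import vgg16

class MultiScaleFusion(nn.Module):
    def __init__(self, out_channels=512):
        super().__init__()
        features = vgg16(weights=None).features
        # Split VGG16 so we can grab the conv3_3, conv4_3 and conv5_3 outputs
        # (the last convolution before pool3, pool4 and pool5 respectively).
        self.stage3 = features[:16]    # up to and including conv3_3 + ReLU
        self.stage4 = features[16:23]  # pool3 ... conv4_3 + ReLU
        self.stage5 = features[23:30]  # pool4 ... conv5_3 + ReLU
        # 1x1 convolution that merges the concatenated maps into one map.
        self.merge = nn.Conv2d(256 + 512 + 512, out_channels, kernel_size=1)

    def forward(self, image):
        c3 = self.stage3(image)  # stride 4,  256 channels
        c4 = self.stage4(c3)     # stride 8,  512 channels
        c5 = self.stage5(c4)     # stride 16, 512 channels
        # Pool the larger maps down to conv5_3's height/width.
        h, w = c5.shape[-2:]
        c3 = F.adaptive_max_pool2d(c3, (h, w))
        c4 = F.adaptive_max_pool2d(c4, (h, w))
        # L2-normalize each map across channels so they share a common scale.
        c3, c4, c5 = (F.normalize(c, p=2, dim=1) for c in (c3, c4, c5))
        # Merge into one feature map; an RPN head would regress boxes from it.
        return self.merge(torch.cat([c3, c4, c5], dim=1))
```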
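The body coordinates in CMS-CNN are derived from the face box by a fixed geometric rule rather than learned. A minimal sketch; the scale and offset factors below are illustrative placeholders, not the paper's values:

```python
def body_box_from_face(x, y, w, h, width_scale=3.0, height_scale=6.0):
    """Derive a suspected body box from a face box.

    (x, y) is the top-left corner of the face box, (w, h) its size.
    width_scale/height_scale are placeholder ratios (assumptions).
    """
    bw = w * width_scale        # body is wider than the face
    bh = h * height_scale       # and considerably taller
    bx = x - (bw - w) / 2.0     # keep it horizontally centered on the face
    by = y                      # body starts roughly at the top of the face
    return bx, by, bw, bh
```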
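Finally, a sketch of the CMS-CNN head that fuses face and body evidence. The RoI size, channel counts and hidden-layer widths are assumptions, and torchvision's `roi_pool` stands in for the RoI pooling the authors used:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.ops import roi_pool

class ContextHead(nn.Module):
    def __init__(self, in_channels=256 + 512 + 512, roi=7, hidden=4096):
        super().__init__()
        self.roi = roi
        # One 1x1 convolution per branch to shrink the concatenated maps.
        self.face_conv = nn.Conv2d(in_channels, 512, kernel_size=1)
        self.body_conv = nn.Conv2d(in_channels, 512, kernel_size=1)
        def fc_stack():  # two fully connected layers per branch
            return nn.Sequential(
                nn.Flatten(),
                nn.Linear(512 * roi * roi, hidden), nn.ReLU(inplace=True),
                nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            )
        self.face_fc = fc_stack()
        self.body_fc = fc_stack()
        self.cls = nn.Linear(2 * hidden, 2)  # face / not face
        self.reg = nn.Linear(2 * hidden, 4)  # final bounding box refinement

    def pool_and_fuse(self, feature_maps, boxes, strides):
        """RoI-pool one region from each scale, L2-normalize, concatenate.

        boxes is a Tensor[K, 5] of (batch_index, x1, y1, x2, y2) in image
        coordinates; spatial_scale maps them into each feature map's grid.
        """
        pooled = [
            F.normalize(
                roi_pool(fm, boxes, (self.roi, self.roi), spatial_scale=1.0 / s),
                p=2, dim=1)
            for fm, s in zip(feature_maps, strides)
        ]
        return torch.cat(pooled, dim=1)

    def forward(self, feature_maps, face_boxes, body_boxes, strides=(4, 8, 16)):
        # feature_maps are the three VGG maps (conv3_3, conv4_3, conv5_3).
        face = self.face_conv(self.pool_and_fuse(feature_maps, face_boxes, strides))
        body = self.body_conv(self.pool_and_fuse(feature_maps, body_boxes, strides))
        joint = torch.cat([self.face_fc(face), self.body_fc(body)], dim=1)
        return self.cls(joint), self.reg(joint)
```

Note how the concatenated face+body vector feeds both output layers, so the final classification and box regression are computed jointly, in the style of Fast/Faster R-CNN heads.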