DensePose: Dense Human Pose Estimation In The Wild
Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV
First published: 2018/02/01 (6 years ago) Abstract: In this work, we establish dense correspondences between an RGB image and a
surface-based representation of the human body, a task we refer to as dense
human pose estimation. We first gather dense correspondences for 50K persons
appearing in the COCO dataset by introducing an efficient annotation pipeline.
We then use our dataset to train CNN-based systems that deliver dense
correspondence 'in the wild', namely in the presence of background, occlusions
and scale variations. We improve our training set's effectiveness by training
an 'inpainting' network that can fill in missing groundtruth values and report
clear improvements with respect to the best results that would be achievable in
the past. We experiment with fully-convolutional networks and region-based
models and observe a superiority of the latter; we further improve accuracy
through cascading, obtaining a system that delivers highly-accurate results in
real time. Supplementary materials and videos are provided on the project page
http://densepose.org
## Task
They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image.
The body surface is represented on two levels (see the sketch after this list):
- Body part label (24 parts)
- Head, torso, hands, feet, etc.
- Each leg is split into 4 parts (upper/lower × front/back); same for arms.
- 2 coordinates (u,v) within body part
- head, hands, feet: based on SMPL model
- others: determined by Multidimensional Scaling on geodesic distances
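A minimal sketch of what this per-pixel target looks like; array shapes and names are illustrative, not the paper's data format:

```python
import numpy as np

# Per-pixel dense-pose targets for one image (shapes/names are illustrative).
H, W = 240, 320
part_index = np.zeros((H, W), dtype=np.int64)       # 0 = background, 1..24 = body part label
uv_coords = np.zeros((2, H, W), dtype=np.float32)   # (u, v) coordinates within the labelled part

# Example: one pixel assigned to part 1 with chart coordinates (0.37, 0.82)
part_index[120, 160] = 1
uv_coords[:, 120, 160] = (0.37, 0.82)
```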
## Data
* They annotate COCO for this task
- annotation tool: annotators draw a part mask, then, for each of up to 14 points sampled from the mask, click the corresponding point on a 3D rendering of the body surface
- annotator accuracy on synthetic renderings (average geodesic distance)
- small parts (e.g. feet): ~2 cm
- large parts (e.g. torso): ~7 cm
## Method
Fully-convolutional baseline
- ResNet-50/101
- 25-way body part classification head: 24 parts + background (cross-entropy loss)
- Regression head with 24*2 outputs per pixel (Huber / smooth-L1 loss); both heads are sketched below
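A hedged PyTorch sketch of these two heads and their losses; the channel sizes and 1×1 convolutions are assumptions, not the paper's exact layers:

```python
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the FCN baseline's two per-pixel output heads (channel sizes are assumptions).
class DensePoseFCNHeads(nn.Module):
    def __init__(self, in_channels=2048, num_parts=24):
        super().__init__()
        self.part_cls = nn.Conv2d(in_channels, num_parts + 1, kernel_size=1)  # 25-way: 24 parts + background
        self.uv_reg = nn.Conv2d(in_channels, num_parts * 2, kernel_size=1)    # (u, v) for each part

    def forward(self, features):          # features: (N, C, H, W) from the ResNet backbone
        return self.part_cls(features), self.uv_reg(features)

def densepose_loss(part_logits, uv_pred, part_gt, uv_gt, uv_mask):
    # Cross-entropy over the 25 classes, Huber / smooth-L1 on (u, v) only where annotated.
    cls_loss = F.cross_entropy(part_logits, part_gt)
    reg_loss = F.smooth_l1_loss(uv_pred[uv_mask], uv_gt[uv_mask])
    return cls_loss + reg_loss
```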
Region-based approach
- Like Mask-RCNN
- New branch with same architecture as the keypoint branch
- ResNet-50-FPN (Feature Pyramid Net) backbone
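A rough sketch of such a branch operating on RoI-aligned features, analogous to Mask-RCNN's keypoint head; the number of convolutions, channel widths, and upsampling step are guesses, not the paper's exact architecture:

```python
import torch.nn as nn

# Sketch of a dense-pose branch on RoI-aligned features (layer counts/sizes are assumptions).
class DensePoseROIHead(nn.Module):
    def __init__(self, in_channels=256, hidden=512, num_convs=8, num_parts=24):
        super().__init__()
        layers = []
        for i in range(num_convs):
            layers += [nn.Conv2d(in_channels if i == 0 else hidden, hidden, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(hidden, hidden, 2, stride=2)  # enlarge the RoI map
        self.part_cls = nn.Conv2d(hidden, num_parts + 1, 1)
        self.uv_reg = nn.Conv2d(hidden, num_parts * 2, 1)

    def forward(self, roi_features):      # roi_features: (num_rois, 256, 14, 14) from RoIAlign
        x = self.upsample(self.tower(roi_features))
        return self.part_cls(x), self.uv_reg(x)
```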
Enhancements tested:
- Multi-task learning
- Train keypoint/mask and dense pose task at once
- Interaction implicit by sharing backbone net
- Multi-task *cross-cascading*
- Explicit interaction of tasks
- Introduce second stage that depends on the first-stage-output of all tasks
- Ground truth interpolation (distillation)
- Train a "teacher" FCN with the pointwise annotations
- Use its dense predictions as ground truth to train final net
- (To make the teacher as accurate as possible, they use ground-truth mask to remove background)
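A sketch of the cross-cascading idea: a second-stage head that sees the backbone RoI features together with the first-stage outputs of all tasks. The channel counts, the plain concatenation, and the assumption that first-stage outputs are already resized to the RoI resolution are all my own, not the paper's exact design:

```python
import torch
import torch.nn as nn

# Sketch of a cross-cascade refinement stage (names and channel counts are assumptions).
class CrossCascadeStage(nn.Module):
    def __init__(self, feat_channels=256, aux_channels=75, hidden=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(feat_channels + aux_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.part_cls = nn.Conv2d(hidden, 25, 1)   # refined part labels
        self.uv_reg = nn.Conv2d(hidden, 48, 1)     # refined (u, v) per part

    def forward(self, roi_features, mask_out, keypoint_out, densepose_out):
        # First-stage predictions are assumed already resized to the RoI map size.
        aux = torch.cat([mask_out, keypoint_out, densepose_out], dim=1)
        x = self.refine(torch.cat([roi_features, aux], dim=1))
        return self.part_cls(x), self.uv_reg(x)
```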
## Results
**Single-person results (train and test on single-person crops)**
Pointwise eval measure:
- Compute geodesic distance between prediction and ground truth at each annotated point
- For various error thresholds, plot percentage of points with lower error than the threshold
- Compute Area Under this Curve
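A small sketch of this pointwise measure, using a hypothetical helper and a uniform threshold grid (a rectangle-rule area under the curve, normalized to [0, 1]):

```python
import numpy as np

# Ratio of annotated points with geodesic error below a threshold, swept over
# thresholds up to 10 cm, then averaged (normalized AUC, i.e. "AUC10").
def pointwise_auc(geodesic_errors_cm, max_threshold_cm=10.0, num_steps=256):
    thresholds = np.linspace(0.0, max_threshold_cm, num_steps)
    correct_ratio = np.array([(geodesic_errors_cm < t).mean() for t in thresholds])
    return correct_ratio.mean()

errors = np.array([1.2, 3.5, 0.8, 14.0, 6.3])   # per-point geodesic errors in cm
print(pointwise_auc(errors))                     # AUC10 for this toy set of points
```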
Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38.
This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).
**Multi-person results**
- Region-based method outperforms the FCN baseline: AUC10 0.25 -> 0.32
- FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on interpolated ground truth (distillation) instead of sparse points: 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39
Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
- Compute a Gaussian function on the geodesic distances
- Average it within each person instance (=> GPS)
- Compute precision and recall of persons for various thresholds of GPS
- Compute average precision and recall over thresholds
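A sketch of GPS and how it would feed the AP/AR computation; the helper names are made up, while κ = 0.255 is the normalizing constant reported in the paper:

```python
import numpy as np

# Geodesic Point Similarity (GPS): a Gaussian of the geodesic error averaged
# over one person instance's annotated points (kappa = 0.255 as in the paper).
def gps(geodesic_errors, kappa=0.255):
    return np.mean(np.exp(-geodesic_errors ** 2 / (2 * kappa ** 2)))

# Instances with GPS above a threshold count as correct matches; sweeping the
# threshold yields precision/recall curves, which are averaged into AP / AR.
instance_errors = [np.array([0.05, 0.10, 0.30]), np.array([0.60, 0.90])]
gps_per_instance = [gps(e) for e in instance_errors]
print(gps_per_instance)
```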
Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Adding a keypoint branch (AP 53) or a mask branch (AP 52), i.e. multi-task without cross-cascade
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)