Summary by isarandi 7 years ago
## Task
They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image.
The body surface is represented on two levels:
- Body part label (24 parts)
  - Head, torso, hands, feet, etc.
  - Each leg split in 4 parts: upper/lower front/back. Same for arms.
- 2 coordinates (u,v) within body part
  - head, hands, feet: based on SMPL model
  - others: determined by Multidimensional Scaling on geodesic distances
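
To make the Multidimensional Scaling step concrete, here is a minimal sketch of classical (Torgerson) MDS that turns a pairwise distance matrix into 2D coordinates. The summary does not say which MDS variant the paper uses, so the choice of classical MDS, the rescaling of each axis to [0, 1], the toy distance matrix, and the function names `classical_mds` and `normalize_uv` are all assumptions made for illustration.

```python
import numpy as np

def classical_mds(dist, n_dims=2):
    """Embed points in n_dims dimensions from a symmetric pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    # Geodesic distances are not exactly Euclidean, so clip negative eigenvalues.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def normalize_uv(coords):
    """Rescale each axis to [0, 1] (an assumed convention for a (u, v) chart)."""
    lo = coords.min(axis=0)
    hi = coords.max(axis=0)
    return (coords - lo) / np.maximum(hi - lo, 1e-12)

# Toy stand-in for geodesic distances: points on a flat patch, where
# geodesic and Euclidean distances coincide.
rng = np.random.default_rng(0)
points = rng.uniform(size=(50, 2))
toy_dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
uv = normalize_uv(classical_mds(toy_dist))
```

On a real body part, `toy_dist` would be replaced by geodesic distances between surface vertices of that part. The embedding is only defined up to rotation and reflection, which is harmless for a per-part chart.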
## Data
* They annotate COCO for this task
  - annotation tool: draw mask, then click on a 3D rendering for each of up to 14 points sampled from the mask
  - annotator accuracy on synthetic renderings (average geodesic distance)
    - small parts (e.g. feet): ~2 cm
    - large parts (e.g. torso): ~7 cm
## Method
Fully-convolutional baseline
- ResNet-50/101
- 25-way body part classification head (24 parts + background, cross-entropy loss)
- Regression head with 24*2 outputs per pixel (Huber loss)
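
The two heads of the fully-convolutional baseline can be sketched as a per-pixel loss in NumPy. This is an illustration, not the authors' code. It assumes that class 0 is background, that the Huber loss is applied only to the (u, v) channel pair of each pixel's ground-truth part, and that the two terms are summed with equal weight. The function names and the `delta` parameter are hypothetical.

```python
import numpy as np

NUM_PARTS = 24  # body parts; class 0 is assumed to be background (25 classes)

def cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy. logits: (N, 25), labels: (N,) ints in [0, 24]."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear beyond delta."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta, 0.5 * residual ** 2, delta * (abs_r - 0.5 * delta))

def uv_loss(uv_pred, uv_true, labels, delta=1.0):
    """Huber loss on the (u, v) pair of each pixel's ground-truth part only.

    uv_pred: (N, 24, 2) regression outputs, uv_true: (N, 2), labels: (N,).
    Background pixels (label 0) contribute nothing.
    """
    idx = np.flatnonzero(labels > 0)
    if idx.size == 0:
        return 0.0
    picked = uv_pred[idx, labels[idx] - 1]  # (N_fg, 2): channels of the true part
    return float(huber(picked - uv_true[idx], delta).mean())

# Toy batch of 4 annotated pixels with random predictions.
rng = np.random.default_rng(0)
labels = np.array([0, 3, 24, 1])
logits = rng.normal(size=(4, NUM_PARTS + 1))
uv_pred = rng.uniform(size=(4, NUM_PARTS, 2))
uv_true = rng.uniform(size=(4, 2))
total = cross_entropy(logits, labels) + uv_loss(uv_pred, uv_true, labels)
```

Selecting one of the 24 channel pairs per pixel is what makes the 24*2 output layout workable: each part gets its own regressor, and the classification head decides which one to read at test time.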
Region-based approach
- Like Mask-RCNN
- New branch with same architecture as the keypoint branch
- ResNet-50-FPN (Feature Pyramid Net) backbone
Enhancements tested:
- Multi-task learning
  - Train keypoint/mask and dense pose task at once
  - Interaction implicit by sharing backbone net
- Multi-task *cross-cascading*
  - Explicit interaction of tasks
  - Introduce second stage that depends on the first-stage-output of all tasks
- Ground truth interpolation (distillation)
  - Train a "teacher" FCN with the pointwise annotations
  - Use its dense predictions as ground truth to train final net
  - (To make the teacher as accurate as possible, they use ground-truth mask to remove background)
## Results
**Single-person results (train and test on single-person crops)**
Pointwise eval measure:
- Compute geodesic distance between prediction and ground truth at each annotated point
- For various error thresholds, plot percentage of points with lower error than the threshold
- Compute Area Under this Curve
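
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration: it assumes that errors and thresholds share one unit (centimeters for AUC10, if the subscript denotes a 10 cm cutoff) and that the area is normalized by the maximum threshold so the result lies in [0, 1]. The function names and `num_steps` are hypothetical.

```python
import numpy as np

def ratio_correct_curve(geodesic_errors, thresholds):
    """Fraction of annotated points whose error is lower than each threshold."""
    errors = np.asarray(geodesic_errors, dtype=float)
    return np.array([(errors < t).mean() for t in thresholds])

def auc(geodesic_errors, max_threshold, num_steps=101):
    """Area under the curve up to max_threshold, normalized to [0, 1]."""
    thresholds = np.linspace(0.0, max_threshold, num_steps)
    ratios = ratio_correct_curve(geodesic_errors, thresholds)
    # Trapezoidal rule, written out explicitly.
    area = np.sum(0.5 * (ratios[1:] + ratios[:-1]) * np.diff(thresholds))
    return float(area / max_threshold)

# Toy errors in cm for five annotated points.
example_auc10 = auc([1.0, 3.0, 6.0, 12.0, 40.0], max_threshold=10.0)
```

Points with errors above the cutoff never count as correct, so the score is bounded by the fraction of points that fall below it.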
Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38.
This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).
**Multi-person results**
- Region-based method outperforms FCN baseline: 0.25 -> 0.32
  - FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on interpolated ground truth (distillation) instead of points: 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39
Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
- Compute a Gaussian function on the geodesic distances
- Average it within each person instance (=> GPS)
- Compute precision and recall of persons for various thresholds of GPS
- Compute average precision and recall over thresholds
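
A rough sketch of the per-instance measure follows. The Gaussian width `kappa` and its default value are assumptions here, since the summary does not state them. The precision/recall helper is also simplified: it presumes detections have already been matched to ground-truth persons, and it skips the confidence ranking that a COCO-style average precision would use. Both function names are hypothetical.

```python
import numpy as np

def gps(geodesic_distances, kappa=0.255):
    """Geodesic Point Similarity of one person instance.

    geodesic_distances: distances (same unit as kappa) between predicted and
    ground-truth surface points at that person's annotated pixels.
    kappa: assumed Gaussian width; the default is illustrative.
    """
    d = np.asarray(geodesic_distances, dtype=float)
    return float(np.exp(-d ** 2 / (2.0 * kappa ** 2)).mean())

def precision_recall(gps_of_detections, num_ground_truth, threshold):
    """Count a detection as correct if its GPS reaches the threshold.

    gps_of_detections: GPS of each detection against its matched ground-truth
    person (0.0 for unmatched detections). Matching is assumed done elsewhere.
    """
    scores = np.asarray(gps_of_detections, dtype=float)
    true_pos = int((scores >= threshold).sum())
    precision = true_pos / len(scores) if len(scores) else 0.0
    recall = true_pos / num_ground_truth if num_ground_truth else 0.0
    return precision, recall

# Sweep GPS thresholds and average, mirroring the last step of the list.
thresholds = np.linspace(0.5, 0.95, 10)
pairs = [precision_recall([0.9, 0.6, 0.0], 4, t) for t in thresholds]
mean_precision = float(np.mean([p for p, _ in pairs]))
mean_recall = float(np.mean([r for _, r in pairs]))
```

Because the Gaussian maps a zero distance to 1 and large distances toward 0, GPS plays the same role for dense correspondences that object keypoint similarity plays for sparse keypoints.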
Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Adding a keypoint branch (AP 53) or a mask branch (AP 52), i.e. multi-task without cross-cascade
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)
