DensePose: Dense Human Pose Estimation In The Wild on ShortScience.org


DensePose: Dense Human Pose Estimation In The Wild
Rıza Alp Güler and Natalia Neverova and Iasonas Kokkinos
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.CV


Summary by isarandi 6 years ago

## Task
The authors introduce a dense version of the human pose estimation task: for each person pixel in an RGB image, predict the corresponding body surface coordinates.

The body surface is represented on two levels:
- Body part label (24 parts)
    - Head, torso, hands, feet, etc.
    - Each leg split in 4 parts: upper/lower front/back. Same for arms.
- 2 coordinates (u,v) within body part
    - head, hands, feet: based on SMPL model
    - others: determined by Multidimensional Scaling on geodesic distances
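To make the two-level representation concrete, here is a minimal NumPy sketch of a per-pixel target: one part index plus a (u, v) pair. The function names, the use of label 0 for background, and the [0, 1] range for u and v are assumptions made for illustration, not details stated in the summary.

```python
import numpy as np

NUM_PARTS = 24  # number of body parts, as listed above


def make_dense_pose_target(height, width):
    """Allocate an empty target: a part index and (u, v) for every pixel."""
    # Assumption: 0 means background, 1..24 are the body parts.
    part = np.zeros((height, width), dtype=np.int64)
    # Assumption: (u, v) are normalized to [0, 1] within each part.
    uv = np.zeros((height, width, 2), dtype=np.float32)
    return part, uv


def annotate_point(part, uv, x, y, part_id, u, v):
    """Store one image-to-surface correspondence at pixel (x, y)."""
    if not 1 <= part_id <= NUM_PARTS:
        raise ValueError("part_id must be in 1..24")
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    part[y, x] = part_id
    uv[y, x] = (u, v)


part, uv = make_dense_pose_target(4, 6)
annotate_point(part, uv, x=2, y=1, part_id=3, u=0.25, v=0.75)
```

Keeping the part index separate from the coordinates mirrors the split between a classification problem (which part) and a regression problem (where on that part).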

## Data
* They annotate COCO for this task
    - annotation tool: draw mask, then click on a 3D rendering for each of up to 14 points sampled from the mask
    - annotator accuracy on synthetic renderings (average geodesic distance)
       - small parts (e.g. feet): ~2 cm
       - large parts (e.g. torso): ~7 cm

## Method

Fully-convolutional baseline
  - ResNet-50/101
  - 25-way body part classification head (24 parts + background, cross-entropy loss)
  - Regression head with 24*2 outputs per pixel (Huber loss)
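The two heads above can be sketched as a loss computation in plain NumPy. This is only an illustration under stated assumptions: class 0 is background, the regression loss is applied only at foreground pixels and only to the two outputs belonging to the ground-truth part, and the Huber threshold `delta=1.0` is a placeholder. All function names are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy. logits: (N, C), labels: (N,) integer classes."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def huber(residual, delta=1.0):
    """Elementwise Huber loss: quadratic near zero, linear beyond delta."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))


def dense_pose_loss(part_logits, uv_pred, part_gt, uv_gt):
    """part_logits: (N, 25), uv_pred: (N, 24, 2), part_gt: (N,), uv_gt: (N, 2)."""
    cls_loss = softmax_cross_entropy(part_logits, part_gt)
    fg = np.flatnonzero(part_gt > 0)  # assumption: 0 is background
    if fg.size == 0:
        return cls_loss, 0.0
    # Assumption: only the regressor of the ground-truth part is supervised.
    selected = uv_pred[fg, part_gt[fg] - 1]  # (M, 2)
    reg_loss = float(huber(selected - uv_gt[fg]).mean())
    return cls_loss, reg_loss
```

Selecting one (u, v) pair out of the 24*2 outputs keeps the part-specific regressors from being penalized at pixels that belong to other parts.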

Region-based approach
  - Like Mask-RCNN
  - New branch with same architecture as the keypoint branch
  - ResNet-50-FPN (Feature Pyramid Net) backbone

Enhancements tested:

- Multi-task learning
  - Train keypoint/mask and dense pose task at once
  - Interaction implicit by sharing backbone net

- Multi-task *cross-cascading*
  - Explicit interaction of tasks
  - Introduce a second stage that takes the first-stage outputs of all tasks as input

- Ground truth interpolation (distillation)
  - Train a "teacher" FCN with the pointwise annotations
  - Use its dense predictions as ground truth to train final net
  - (To make the teacher as accurate as possible, they use the ground-truth mask to remove the background)
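A rough sketch of how the teacher's dense predictions could be turned into training targets. The `teacher` callable is hypothetical (any function mapping an image to a part map and a UV map), and setting everything outside the mask to background in the output is an assumption of this sketch, not something the summary states.

```python
import numpy as np


def remove_background(image, person_mask):
    """Zero out pixels outside the person mask. image: (H, W, 3), mask: (H, W) bool."""
    return image * person_mask[..., None]


def make_dense_pseudo_labels(image, person_mask, teacher):
    """Run a (hypothetical) teacher on the masked image and keep its dense output.

    teacher: callable, image -> (part (H, W) int, uv (H, W, 2) float).
    """
    part, uv = teacher(remove_background(image, person_mask))
    # Assumption: pixels outside the ground-truth mask are labeled background (0).
    part = np.where(person_mask, part, 0)
    uv = np.where(person_mask[..., None], uv, 0.0)
    return part, uv
```

The returned maps would then stand in for the sparse point annotations when training the final network.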

## Results

**Single-person results (train and test on single-person crops)**

Pointwise eval measure:
   - Compute geodesic distance between prediction and ground truth at each annotated point
   - For various error thresholds, plot percentage of points with lower error than the threshold
   - Compute Area Under this Curve
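The three steps above can be sketched as follows. The reading of "AUC10" as a 10 cm maximum threshold, the number of threshold steps, and the normalization of the area to [0, 1] are assumptions of this sketch; the function names are hypothetical.

```python
import numpy as np


def correct_point_ratio(errors_cm, thresholds_cm):
    """Fraction of annotated points whose geodesic error is below each threshold."""
    errors = np.asarray(errors_cm, dtype=float)
    return np.array([(errors < t).mean() for t in thresholds_cm])


def auc(errors_cm, max_threshold_cm=10.0, num_steps=101):
    """Area under the ratio-vs-threshold curve, normalized to [0, 1]."""
    thresholds = np.linspace(0.0, max_threshold_cm, num_steps)
    ratios = correct_point_ratio(errors_cm, thresholds)
    # Trapezoidal rule written out explicitly.
    area = np.sum(0.5 * (ratios[1:] + ratios[:-1]) * np.diff(thresholds))
    return float(area / max_threshold_cm)
```

With this normalization, a model whose errors all exceed the maximum threshold scores 0, and a model with zero error everywhere scores close to 1.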

Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38.

This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).

**Multi-person results**

- Region-based method outperforms the FCN baseline: AUC10 0.25 -> 0.32
    - FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on points vs. interpolated ground truth (distillation): AUC10 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39

Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
   - Compute a Gaussian function on the geodesic distances
   - Average it within each person instance (=> GPS)
   - Compute precision and recall of persons for various thresholds of GPS
   - Compute average precision and recall over thresholds
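A simplified sketch of the per-instance measure. The Gaussian width `kappa_cm` is a free parameter here; the default below is chosen so that a 30 cm error maps to a similarity of 0.5, which is an assumption of this sketch. The precision/recall helper also assumes detections have already been matched to ground-truth persons and skips the confidence ranking that COCO-style AP uses. All names are hypothetical.

```python
import numpy as np

# Assumption: pick kappa so that a 30 cm geodesic error yields similarity 0.5.
DEFAULT_KAPPA_CM = 30.0 / np.sqrt(2.0 * np.log(2.0))


def geodesic_point_similarity(distances_cm, kappa_cm=DEFAULT_KAPPA_CM):
    """Mean Gaussian similarity over the annotated points of one person."""
    d = np.asarray(distances_cm, dtype=float)
    return float(np.exp(-d ** 2 / (2.0 * kappa_cm ** 2)).mean())


def precision_recall(gps_per_detection, num_gt_persons, threshold):
    """Precision and recall when a detection counts as correct if GPS >= threshold.

    Assumption: each detection is already matched to at most one ground-truth
    person; unmatched detections carry a GPS of 0.
    """
    gps = np.asarray(gps_per_detection, dtype=float)
    true_pos = int((gps >= threshold).sum())
    precision = true_pos / len(gps) if len(gps) else 0.0
    recall = true_pos / num_gt_persons if num_gt_persons else 0.0
    return precision, recall


def average_over_thresholds(gps_per_detection, num_gt_persons,
                            thresholds=np.linspace(0.5, 0.95, 10)):
    """Average precision and recall over a range of GPS thresholds."""
    pairs = [precision_recall(gps_per_detection, num_gt_persons, t)
             for t in thresholds]
    precisions, recalls = zip(*pairs)
    return float(np.mean(precisions)), float(np.mean(recalls))
```

The threshold range 0.5 to 0.95 follows the common COCO convention and is likewise an assumption.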

Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Adding a keypoint branch (AP 53) or a mask branch (AP 52): multi-task without cross-cascade
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)
