## Task

They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image.

The body surface is represented on two levels:
- Body part label (24 parts)
  - Head, torso, hands, feet, etc.
  - Each leg split in 4 parts: upper/lower front/back. Same for arms.
- 2 coordinates (u,v) within body part
  - head, hands, feet: based on SMPL model
  - others: determined by Multidimensional Scaling on geodesic distances

## Data

* They annotate COCO for this task
  - annotation tool: draw mask, then click on a 3D rendering for each of up to 14 points sampled from the mask
  - annotator accuracy on synthetic renderings (average geodesic distance)
    - small parts (e.g. feet): ~2 cm
    - large parts (e.g. torso): ~7 cm

## Method

Fully-convolutional baseline
- ResNet-50/101
- 25-way body part classification head (24 parts + background, cross-entropy loss)
- Regression head with 24*2 outputs per pixel (Huber loss)

Region-based approach
- Like Mask-RCNN
- New branch with same architecture as the keypoint branch
- ResNet-50-FPN (Feature Pyramid Net) backbone

Enhancements tested:
- Multi-task learning
  - Train keypoint/mask and dense pose task at once
  - Interaction implicit by sharing backbone net
- Multi-task *cross-cascading*
  - Explicit interaction of tasks
  - Introduce second stage that depends on the first-stage-output of all tasks
- Ground truth interpolation (distillation)
  - Train a "teacher" FCN with the pointwise annotations
  - Use its dense predictions as ground truth to train final net
  - (To make the teacher as accurate as possible, they use ground-truth mask to remove background)

## Results

**Single-person results (train and test on single-person crops)**

Pointwise eval measure:
- Compute geodesic distance between prediction and ground truth at each annotated point
- For various error thresholds, plot percentage of points with lower error than the threshold
- Compute Area Under this Curve

Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38

This paper's FCN method vs. model-fitting baseline
- Baseline: Estimate body keypoint locations in 2D (usual "pose estimation" task) + fit 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for FCN vs. model-fitting taking 1-3 minutes per frame (!).

**Multi-person results**
- Region-based method outperforms FCN baseline: 0.25 -> 0.32
  - FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on points vs. interpolated ground truth (distillation): 0.32 -> 0.38
- AUC10 with cross-task cascade: 0.39

Also: Per-instance eval ("Geodesic Point Similarity" - GPS)
- Compute a Gaussian function on the geodesic distances
- Average it within each person instance (=> GPS)
- Compute precision and recall of persons for various thresholds of GPS
- Compute average precision and recall over thresholds

Comparison of multi-task approaches:
1. Just dense pose branch (single-task) (AP 51)
2. Adding keypoint (AP 53) OR mask branch (multi-task without cross-cascade) (AP 52)
3. Refinement stage without cross-links (AP 52)
4. Multi-task cross-cascade (keypoints: AP 56, masks: AP 53)
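
To make the two evaluation measures above more concrete, here is a minimal sketch in Python (NumPy). It is an illustration under assumptions, not the paper's evaluation code: the function names, the threshold grid, the reading of "AUC10" as a maximum threshold of 10 (in the same unit as the errors), and the Gaussian width `kappa` are all hypothetical choices made for this example.

```python
import numpy as np


def pointwise_auc(geodesic_errors, max_threshold=10.0, num_steps=100):
    """Normalized area under the 'fraction of points below threshold' curve.

    Assumption: thresholds run uniformly from 0 to max_threshold.
    """
    errors = np.asarray(geodesic_errors, dtype=float)
    thresholds = np.linspace(0.0, max_threshold, num_steps + 1)
    # For each threshold, the fraction of annotated points with lower error.
    fractions = (errors[None, :] < thresholds[:, None]).mean(axis=1)
    # Trapezoidal integration, divided by the range so a perfect curve is ~1.
    area = np.sum((fractions[1:] + fractions[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_threshold)


def geodesic_point_similarity(geodesic_errors, kappa=1.0):
    """Per-instance GPS: mean of a Gaussian applied to the geodesic errors.

    Assumption: kappa is a free width parameter picked for illustration.
    """
    errors = np.asarray(geodesic_errors, dtype=float)
    return float(np.mean(np.exp(-errors ** 2 / (2.0 * kappa ** 2))))


# Hypothetical errors for the annotated points of one person instance.
example_errors = [1.5, 3.0, 8.0, 12.0]
print(pointwise_auc(example_errors))
print(geodesic_point_similarity(example_errors, kappa=5.0))
```

A GPS value per instance can then be thresholded to decide whether a detected person counts as correct, which is what the precision and recall computation described above builds on.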