[link]
Aim: generate realistic-looking synthetic data for training 3D human pose estimation methods. Instead of rendering 3D models, they combine parts of real images.

* Input: RGB images with 2D annotations + a query 3D pose.
* Output: a synthetic image, stitched from patches of the input images, so that it looks like a person in the query 3D pose.

Steps (see the sketch below):

- Project the 3D pose onto a random camera to get 2D coordinates.
- For each joint, find an image in the 2D-annotated dataset whose annotation is locally similar.
- Based on the similarities, decide for each pixel which image is most relevant.
- For each pixel, take the histogram of the chosen images in a neighborhood and use it as blending factors to generate the result.

They also present a method that they trained on this synthetic dataset.
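A toy end-to-end sketch of the pipeline, under loud assumptions: `local_similarity`, the nearest-joint pixel assignment, and the box-filter blending are simplified stand-ins for the paper's actual formulations, and `project` (the random camera) is left to the caller.

```python
import numpy as np

def local_similarity(query_2d, candidate_2d, joint):
    """Toy local similarity: compare offsets from `joint` to all other joints.
    (A simplified stand-in for the paper's local pose-similarity measure.)"""
    q = query_2d - query_2d[joint]
    c = candidate_2d - candidate_2d[joint]
    return -np.linalg.norm(q - c)

def synthesize(query_pose_3d, dataset, project, out_shape=(256, 256), radius=8):
    """Illustrative outline of the image-stitching idea, not the paper's algorithm.

    dataset: list of (image, joints_2d) pairs, images aligned to out_shape.
    project: maps (J, 3) pose coordinates to (J, 2) pixel coordinates
             (a randomly sampled camera).
    """
    # 1. Project the query 3D pose to 2D with a random camera.
    query_2d = project(query_pose_3d)                          # (J, 2)
    num_joints = query_2d.shape[0]

    # 2. For each joint, retrieve the image whose annotation is locally
    #    most similar around that joint.
    source = [max(range(len(dataset)),
                  key=lambda i: local_similarity(query_2d, dataset[i][1], j))
              for j in range(num_joints)]

    # 3. Assign each pixel to the source image of its nearest query joint
    #    (a crude stand-in for the paper's per-pixel relevance decision).
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((xs[..., None] - query_2d[:, 0]) ** 2
          + (ys[..., None] - query_2d[:, 1]) ** 2)             # (h, w, J)
    label = np.array(source)[d2.argmin(axis=-1)]               # (h, w)

    # 4. Blend: the weight of image i at pixel p is the fraction of pixels
    #    in p's neighborhood assigned to i (a local label histogram),
    #    approximated here with a separable box filter.
    result = np.zeros((h, w, 3))
    weight_sum = np.zeros((h, w, 1))
    box = np.ones(2 * radius + 1) / (2 * radius + 1)
    for i in set(source):
        mask = (label == i).astype(float)
        weight = np.apply_along_axis(
            lambda m: np.convolve(m, box, mode='same'), 0, mask)
        weight = np.apply_along_axis(
            lambda m: np.convolve(m, box, mode='same'), 1, weight)
        result += weight[..., None] * dataset[i][0]
        weight_sum += weight[..., None]
    return result / np.maximum(weight_sum, 1e-8)
```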
[link]
## Task

They introduce a dense version of the human pose estimation task: predict body surface coordinates for each pixel in an RGB image. The body surface is represented on two levels:

- Body part label (24 parts)
  - Head, torso, hands, feet, etc.
  - Each leg is split into 4 parts: upper/lower, front/back. Same for the arms.
- 2 coordinates (u, v) within the body part
  - Head, hands, feet: based on the SMPL model
  - Others: determined by multidimensional scaling on geodesic distances

## Data

* They annotate COCO for this task.
  - Annotation tool: draw a mask, then click on a 3D rendering for each of up to 14 points sampled from the mask.
  - Annotator accuracy on synthetic renderings (average geodesic distance):
    - Small parts (e.g. feet): ~2 cm
    - Large parts (e.g. torso): ~7 cm

## Method

Fully-convolutional baseline:

- ResNet-50/101
- 25-way body part classification head (24 parts + background; cross-entropy loss)
- Regression head with 24*2 outputs per pixel (Huber loss)

Region-based approach:

- Like Mask R-CNN
- New branch with the same architecture as the keypoint branch
- ResNet-50-FPN (Feature Pyramid Network) backbone

Enhancements tested:

- Multi-task learning
  - Train the keypoint/mask and dense pose tasks at once
  - Interaction is implicit, via the shared backbone net
- Multi-task *cross-cascading*
  - Explicit interaction of tasks
  - Introduce a second stage that depends on the first-stage output of all tasks
- Ground-truth interpolation (distillation)
  - Train a "teacher" FCN with the pointwise annotations
  - Use its dense predictions as ground truth to train the final net
  - (To make the teacher as accurate as possible, they use the ground-truth mask to remove the background)

## Results

**Single-person results (train and test on single-person crops)**

Pointwise eval measure (see the sketch below):

- Compute the geodesic distance between prediction and ground truth at each annotated point
- For various error thresholds, plot the percentage of points with error below the threshold
- Compute the area under this curve

Training the (non-regional) FCN on the new dataset instead of synthetic data improves AUC10 from 0.20 to 0.38.

This paper's FCN method vs. a model-fitting baseline:

- Baseline: estimate body keypoint locations in 2D (the usual "pose estimation" task) + fit a 3D model
- AUC10 improves from 0.23 to 0.43
- Speed: 4-25 fps for the FCN vs. 1-3 minutes (!) per frame for model fitting

**Multi-person results**

- The region-based method outperforms the FCN baseline: 0.25 -> 0.32
  - The FCN cannot deal well with varying person scales (despite multi-scale testing)
- Training on points vs. interpolated ground truth (distillation): 0.32 -> 0.38
- AUC10 with the cross-task cascade: 0.39

Also: per-instance eval ("Geodesic Point Similarity", GPS; see the sketch below):

- Compute a Gaussian function of the geodesic distances
- Average it within each person instance (=> GPS)
- Compute precision and recall of person instances for various GPS thresholds
- Compute average precision and recall over the thresholds

Comparison of multi-task approaches:

1. Just the dense pose branch (single-task): AP 51
2. Adding the keypoint branch (AP 53) OR the mask branch (AP 52) (multi-task without cross-cascading)
3. Refinement stage without cross-links: AP 52
4. Multi-task cross-cascade: AP 56 (with keypoints), AP 53 (with masks)
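A minimal sketch of the two evaluation measures, assuming the per-point geodesic distances have already been computed; the function names are mine, and the Gaussian bandwidth `kappa` is an assumption, not a value taken from the paper.

```python
import numpy as np

def pointwise_auc(errors, max_threshold=0.10, num_thresholds=256):
    """Area under the 'ratio of correct points' curve.

    errors: geodesic distances (in meters) between predicted and
    ground-truth surface points, over all annotated points.
    max_threshold=0.10 (10 cm) corresponds to AUC10; normalizing by
    max_threshold makes a perfect method score 1.0.
    """
    errors = np.asarray(errors)
    thresholds = np.linspace(0.0, max_threshold, num_thresholds)
    ratios = [(errors < t).mean() for t in thresholds]
    return np.trapz(ratios, thresholds) / max_threshold

def gps(errors_of_instance, kappa=0.255):
    """Geodesic Point Similarity of one person instance: a Gaussian of
    the geodesic errors, averaged over the instance's annotated points.
    (kappa is the Gaussian bandwidth; treat its value here as a guess.)
    """
    d = np.asarray(errors_of_instance)
    return float(np.exp(-d**2 / (2 * kappa**2)).mean())
```

AP and AR are then obtained by sweeping a threshold over the per-instance GPS values and averaging, analogous to how COCO keypoint evaluation uses OKS.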
[link]
* Presents an architecture dubbed ResNeXt.
* They use modules built of (see the sketch below):
  * 1x1 conv
  * 3x3 group conv, keeping the depth constant. This is like a usual conv, except that it is not fully connected along the depth axis, only within groups.
  * 1x1 conv
  * plus a skip connection coming from the module input
* Advantages:
  * Fewer parameters, since the full connections exist only within the groups.
  * Allows more feature channels at the cost of more aggressive grouping.
  * Better performance when keeping the number of parameters constant.
* Questions/disadvantages:
  * Instead of keeping the number of parameters constant, how about aiming for constant memory consumption? More feature channels require more RAM, even if the connections are sparser and hence there are fewer parameters.
  * Not that much improvement over ResNet.
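A minimal PyTorch sketch of one such module, using the channel sizes of the ResNeXt-50 (32x4d) configuration (256 -> 128 grouped -> 256); treat the exact numbers as illustrative.

```python
import torch
from torch import nn

class ResNeXtBlock(nn.Module):
    """1x1 conv -> 3x3 grouped conv -> 1x1 conv, plus a skip connection."""

    def __init__(self, channels=256, bottleneck=128, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            # Grouped conv: each of the `groups` groups sees only
            # bottleneck/groups input channels -> fewer parameters
            # than a fully connected (along depth) 3x3 conv.
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=groups, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection coming from the module input.
        return self.relu(x + self.body(x))
```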
[link]
* Semi-supervised method.
* There is a teacher net and a student net with identical architectures.
* The teacher makes predictions on unlabeled data, and these are used as ground truth for training the student net.
* After each gradient-descent update of the student, the teacher's weights are updated to be an exponential moving average of the student's weights at previous timesteps (see the sketch below). It is called a "mean teacher" because of this moving average.
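A minimal sketch of the EMA update, to be called after every optimizer step on the student; the decay value `alpha` here is a placeholder rather than the paper's setting.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """Make the teacher an exponential moving average of the student.

    teacher, student: nn.Module instances with identical architectures.
    alpha: EMA decay; larger values average over a longer history.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        # t = alpha * t + (1 - alpha) * s
        t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)
```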
[link]
* It's a semi-supervised method (the goal is to make use of unlabeled data in addition to labeled data).
* They first train a neural net normally, in the supervised way, on a labeled dataset.
* Then **they retrain the net using *its own predictions* on the originally unlabeled data as if they were ground truth** (but only when the net is confident enough about the prediction).
  * More precisely, they retrain on the union of the original dataset and the examples labeled by the net itself. (Each minibatch is on average 60% original and 40% self-labeled.)
* When making these predictions (which are subsequently used for training), they use **multi-transform inference** (see the sketch below).
  * They apply the net to differently transformed versions of the image (mirroring, scaling), transform the outputs back accordingly, and combine the results.
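A sketch of multi-transform inference for a net with dense per-pixel outputs; the scale set, the flip handling, and the simple averaging are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_transform_inference(net, image, scales=(0.75, 1.0, 1.25)):
    """Run `net` on transformed copies of `image`, undo the transforms
    on the outputs, and average the results.

    Assumes `net` maps a (1, C, H, W) tensor to a per-pixel output of the
    same spatial size. (For pose/keypoint outputs, mirroring would also
    require swapping left/right channels, which is omitted here.)
    """
    _, _, h, w = image.shape
    outputs = []
    for scale in scales:
        for mirror in (False, True):
            x = F.interpolate(image, scale_factor=scale,
                              mode='bilinear', align_corners=False)
            if mirror:
                x = torch.flip(x, dims=[3])  # horizontal flip
            y = net(x)
            if mirror:
                y = torch.flip(y, dims=[3])  # undo the flip on the output
            # Resize the output back to the original resolution.
            y = F.interpolate(y, size=(h, w),
                              mode='bilinear', align_corners=False)
            outputs.append(y)
    # Combine the per-transform results (here: simple averaging).
    return torch.stack(outputs).mean(dim=0)
```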