Image-based Synthesis for Deep 3D Human Pose Estimation
Grégory Rogez and Cordelia Schmid
arXiv e-Print archive, 2018
Keywords: cs.CV
First published: 2018/02/12
Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A
significant challenge is the lack of training data, i.e., 2D images of humans
annotated with 3D poses. Such data is necessary to train state-of-the-art CNN
architectures. Here, we propose a solution to generate a large set of
photorealistic synthetic images of humans with 3D pose annotations. We
introduce an image-based synthesis engine that artificially augments a dataset
of real images with 2D human pose annotations using 3D motion capture data.
Given a candidate 3D pose, our algorithm selects for each joint an image whose
2D pose locally matches the projected 3D pose. The selected images are then
combined to generate a new synthetic image by stitching local image patches in
a kinematically constrained manner. The resulting images are used to train an
end-to-end CNN for full-body 3D pose estimation. We cluster the training data
into a large number of pose classes and tackle pose estimation as a $K$-way
classification problem. Such an approach is viable only with large training
sets such as ours. Our method outperforms most of the published works in terms
of 3D pose estimation in controlled environments (Human3.6M) and shows
promising results for real-world images (LSP). This demonstrates that CNNs
trained on artificial images generalize well to real images. Compared to data
generated from more classical rendering engines, our synthetic images do not
require any domain adaptation or fine-tuning stage.
Aim: generate realistic-looking synthetic data that can be used to train 3D human pose estimation methods. Instead of rendering 3D models, the authors combine parts of real images.
Input: RGB images with 2D annotations + a query 3D pose.
Output: a synthetic image, stitched together from patches of those images, that depicts a person in the query 3D pose.
Steps:
- Project the 3D pose onto a random virtual camera to obtain 2D joint coordinates.
- For each joint, retrieve an image from the 2D-annotated dataset whose annotated pose is locally similar around that joint.
- Based on these local similarities, decide for each pixel which retrieved image is most relevant.
- For each pixel, take the histogram of images selected in its neighborhood and use it as blending weights to produce the final result (a sketch of the whole pipeline follows this list).
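As a rough illustration of those four steps, here is a minimal NumPy sketch. Everything in it is a simplification rather than the paper's exact procedure: the helper names (`project`, `local_pose_distance`, `synthesize`) are hypothetical, all dataset images are assumed to share one resolution and to be pre-aligned to the query pose (the paper warps local patches into place), and per-pixel relevance is reduced to a nearest-projected-joint rule.

```python
import numpy as np

def project(pose_3d, K, R, t):
    """Project a (J, 3) 3D pose through a pinhole camera to (J, 2) pixels."""
    cam = R @ pose_3d.T + t[:, None]     # (3, J) points in camera coordinates
    uv = K @ cam                         # apply intrinsics
    return (uv[:2] / uv[2]).T            # perspective divide -> (J, 2)

def local_pose_distance(query_2d, cand_2d, joint, neighbors):
    """Compare two 2D poses locally: around one joint and its kinematic
    neighbors, after centering both on that joint."""
    idx = [joint] + neighbors[joint]
    q = query_2d[idx] - query_2d[joint]
    c = cand_2d[idx] - cand_2d[joint]
    return np.linalg.norm(q - c)

def synthesize(query_3d, dataset, K, R, t, neighbors, patch=15):
    """dataset: list of (image (H, W, 3), pose_2d (J, 2)) pairs, assumed
    here to be pre-aligned to the query pose and of identical size."""
    query_2d = project(query_3d, K, R, t)
    J = len(query_2d)

    # Step 2: per joint, retrieve the image whose annotated 2D pose is
    # locally most similar to the projected query pose.
    best = [min(range(len(dataset)),
                key=lambda i: local_pose_distance(query_2d, dataset[i][1],
                                                  j, neighbors))
            for j in range(J)]

    # Step 3 (simplified): assign each pixel to its nearest projected joint,
    # which determines the most relevant source image for that pixel.
    H, W = dataset[0][0].shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.stack([(xs - query_2d[j, 0]) ** 2 + (ys - query_2d[j, 1]) ** 2
                  for j in range(J)])
    owner = d.argmin(axis=0)             # (H, W) joint index per pixel

    # Step 4: the histogram of assignments in a neighborhood around each
    # pixel gives per-image blending weights.
    out = np.zeros((H, W, 3))
    half = patch // 2
    for y in range(H):
        for x in range(W):
            win = owner[max(0, y - half):y + half + 1,
                        max(0, x - half):x + half + 1]
            joints, counts = np.unique(win, return_counts=True)
            for j, c in zip(joints, counts / counts.sum()):
                out[y, x] += c * dataset[best[j]][0][y, x]
    return out.astype(np.uint8)
```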
They also present a 3D pose estimation method trained on this synthetic dataset: the training poses are clustered into K classes, and a CNN is trained to solve the resulting K-way classification problem (see the sketch below).
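To make that classification formulation concrete, here is a minimal sketch of the train/test loop, assuming PyTorch and scikit-learn. The tiny CNN, the value of K, and the random stand-in data are all placeholders, not the paper's actual setup (which clusters into a much larger number of classes and uses a deeper network).

```python
import numpy as np
from sklearn.cluster import KMeans
import torch
import torch.nn as nn

# Stand-in training data: N synthetic images with flattened 3D poses.
N, J, K = 2000, 17, 100                  # the paper uses a far larger K
poses = np.random.randn(N, J * 3).astype(np.float32)
images = torch.randn(N, 3, 64, 64)

# Cluster the 3D poses: each cluster is one class, and its centroid is the
# pose returned whenever that class is predicted.
kmeans = KMeans(n_clusters=K, n_init=4).fit(poses)
labels = torch.from_numpy(kmeans.labels_).long()

# A tiny CNN classifier standing in for the paper's deeper architecture.
net = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, K),
)
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for i in range(0, N, 64):                # one epoch of minibatch SGD
    logits = net(images[i:i + 64])
    loss = loss_fn(logits, labels[i:i + 64])
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inference: predicted class index -> centroid 3D pose.
k = net(images[:1]).argmax(dim=1).item()
pose_estimate = kmeans.cluster_centers_[k].reshape(J, 3)
```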