This paper presents a per-frame image-to-image translation system that copies the motion of a person in a source video onto a target person. For example, the source video might show a professional dancer performing complicated moves, while the target person is you. With this approach, it is possible to generate a video of you dancing like a professional. Check the authors' [video](https://www.youtube.com/watch?v=PCBTZh41Ris) for a visual explanation.

**Data preparation**

The authors manually recorded a high-resolution video (at 120 fps) of a person performing various random moves. The video is decomposed into frames, and the person's pose keypoints (body joints, hands, face) are extracted for each frame. These keypoints are then connected to form a pose stick figure (a minimal rendering sketch is given after the **Training** section below). In practice, pose estimation is performed with the open-source project [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose).

**Training**

![](https://i.imgur.com/VZCXZMa.png)

Once the data is prepared, training is performed in two stages:

1. **Training a pix2pixHD model with temporal smoothing**. The core model is the original [pix2pixHD](https://tcwang0509.github.io/pix2pixHD/) [1] model with temporal smoothing. Specifically, if we were to use vanilla pix2pixHD, the input to the model would be a pose stick image, and the target would be the person's image corresponding to that pose. The objective would be $\min_{G} (Loss_1 + Loss_2 + Loss_3)$, where:

    - $Loss_1 = \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{GAN}(G, D_k)$ is the adversarial loss;
    - $Loss_2 = \lambda_{FM} \sum_{k=1,2,3} \mathcal{L}_{FM}(G, D_k)$ is the feature-matching loss;
    - $Loss_3 = \lambda_{VGG} \mathcal{L}_{VGG}(G(x), y)$ is the VGG perceptual loss.

    However, this objective does not account for the fact that we want to generate a video composed of temporally coherent frames. The authors ensure *temporal smoothing* between adjacent frames by also including the previous step's pose, its corresponding image, and the previously generated image (a zero image for the first frame), as shown in the figure below and in the sketch after this section:

    ![](https://i.imgur.com/0NSeBVt.png)

    Since the generated output $G(x_t, G(x_{t-1}))$ at time step $t$ is now conditioned on the previously generated frame $G(x_{t-1})$ as well as the current stick image $x_t$, better temporal consistency is ensured. Consequently, the discriminator now judges both the realism and the temporal consistency of the fake sequence $(x_{t-1}, x_t, G(x_{t-1}), G(x_t))$.

2. **Training a FaceGAN model**.

    ![](https://i.imgur.com/mV1xuMi.png)

    To improve face generation, the authors use a specialized FaceGAN. In practice, this is another, smaller pix2pixHD model (with a global generator only, instead of local + global) which is fed the cropped face region of the stick image together with the cropped face region of the corresponding image generated in stage 1, and it produces a residual that is added to the previously generated full image (a sketch follows after this section).
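As a minimal illustration of the data-preparation step above, the sketch below renders detected keypoints as a stick figure with OpenCV. The keypoint format (rows of `(x, y, confidence)`) and the `LIMB_PAIRS` connectivity are simplified assumptions for illustration, not the exact OpenPose output layout (and in practice each limb is drawn in a distinct color).

```python
import numpy as np
import cv2  # OpenCV for drawing

# Hypothetical subset of limb connections between keypoint indices
# (the real OpenPose BODY_25 model has more keypoints and pairs).
LIMB_PAIRS = [(0, 1), (1, 2), (2, 3), (3, 4),   # head and right arm
              (1, 5), (5, 6), (6, 7),           # left arm
              (1, 8), (8, 9), (9, 10),          # right leg
              (8, 11), (11, 12)]                # left leg

def draw_stick_figure(keypoints, height, width, conf_thresh=0.1):
    """Render keypoints of shape (num_keypoints, 3) = (x, y, confidence)
    as a stick figure on a black canvas of the given size."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in LIMB_PAIRS:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca < conf_thresh or cb < conf_thresh:
            continue  # skip limbs whose keypoints were missed by the detector
        cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)),
                 color=(255, 255, 255), thickness=4)
    for x, y, c in keypoints:
        if c >= conf_thresh:
            cv2.circle(canvas, (int(x), int(y)), 4, (0, 255, 0), -1)
    return canvas
```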
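The temporal smoothing of training stage 1 amounts to conditioning the generator on the current pose stick and the previously generated frame, and showing the discriminator consecutive pose/frame pairs. The PyTorch-style sketch below illustrates only this conditioning and an LSGAN-style adversarial term; the multi-scale discriminators, feature-matching loss, and VGG perceptual loss are omitted, and `generator` / `discriminator` are assumed to be externally defined networks.

```python
import torch
import torch.nn.functional as F

def temporal_training_step(generator, discriminator, poses, targets):
    """One step over a pair of consecutive frames (t-1, t).

    poses:   (B, 2, C, H, W) pose stick images for steps t-1 and t
    targets: (B, 2, 3, H, W) corresponding real frames
    """
    x_prev, x_t = poses[:, 0], poses[:, 1]

    # The first frame of the pair is conditioned on a zero image,
    # the second on the frame that was just generated.
    zero_img = torch.zeros_like(targets[:, 0])
    g_prev = generator(torch.cat([x_prev, zero_img], dim=1))
    g_t = generator(torch.cat([x_t, g_prev], dim=1))

    # The discriminator sees the whole sequence (poses + frames),
    # so it judges temporal consistency as well as realism.
    fake_seq = torch.cat([x_prev, x_t, g_prev, g_t], dim=1)
    real_seq = torch.cat([x_prev, x_t, targets[:, 0], targets[:, 1]], dim=1)

    pred_fake = discriminator(fake_seq)
    pred_real = discriminator(real_seq)

    g_loss = F.mse_loss(pred_fake, torch.ones_like(pred_fake))
    d_loss = 0.5 * (F.mse_loss(pred_real, torch.ones_like(pred_real))
                    + F.mse_loss(pred_fake.detach(), torch.zeros_like(pred_fake)))
    return g_loss, d_loss
```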
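For training stage 2, the face refinement can be sketched as a residual added to the face crop of the stage-1 output. The function below assumes a square face box derived from the face keypoints and a `face_generator` network (the global-generator-only pix2pixHD mentioned above); both names are placeholders for illustration.

```python
import torch

def refine_face(face_generator, full_fake, pose_stick, face_box):
    """Add a FaceGAN residual to the face region of a generated frame.

    full_fake:  (B, 3, H, W) frame produced by the stage-1 generator
    pose_stick: (B, C, H, W) pose stick image for the same frame
    face_box:   (top, left, size) square crop around the face keypoints
    """
    top, left, size = face_box
    face_fake = full_fake[:, :, top:top + size, left:left + size]
    face_pose = pose_stick[:, :, top:top + size, left:left + size]

    # FaceGAN is fed the cropped pose and the cropped stage-1 output
    # and predicts a residual image for the face region only.
    residual = face_generator(torch.cat([face_pose, face_fake], dim=1))

    refined = full_fake.clone()
    refined[:, :, top:top + size, left:left + size] = face_fake + residual
    return refined
```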
**Testing**

During testing, we extract frames from the input video, obtain a pose stick image for each frame, normalize the stick pose image, and feed it to pix2pixHD (with temporal smoothing) and then to FaceGAN to produce the final generated image with improved face features. Normalization is needed to account for pose variation between the source and target videos.

**Remarks**

While this method produces visually appealing results, it is not perfect. There are several reasons for this:

1. *Quality of the pose stick image*: if the pose detector "misses" a keypoint, the generator may have difficulty rendering the image properly;
2. *Motion blur*: motion blur causes the pose detector to miss keypoints;
3. *Severe scale change*: if the source person is very far away, the keypoint detector might fail to detect the proper keypoints.

Among the remaining video-rendering challenges, the authors mention self-occlusion, cloth texture generation, and video jittering (training/test motion mismatch).

References:

[1] Wang et al., "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", CVPR 2018.