This paper synthesizes a high-quality video of Barack Obama given the audio. Practically, it only synthesizes the region around the mouth, while the rest of the elements (i.e. pixels) come from a video in a database.
The overall pipeline is the following:
- Given a video, an audio and a mouth shape are extracted. Audio is represented as MFCC coefficients; mouth shape - 18 lip markers;
- Train audio to mouth shape mapping with time-delayed unidirectional LSTM.
- Synthesize mouth texture: retrieve a number of video frames in a database where a mouth shape is similar to the output of LSTM; synthesize median texture by applying weighted median on mouth shapes from retrieved video frames; manually select teeth target frame (selection criteria are purely subjected) and enhance teeth median texture with selected teeth target frame.
- Re-timing to avoid situations where Obama is not speaking but his head is moving which looks very unnatural.
- Final composition into the target video involves jaw correction to make it more natural.
The results look ridiculously natural. Authors suggest that one of the applications of this paper is speech summarization, where you summarize a speech not only with selected parts as text and audio but also synthesize a video for it. Personally, this work inspires me to work on a method that is able to generate natural sign language interpreter that takes sound/text as input and produces sign language moves.