This paper synthesizes a high-quality video of Barack Obama speaking, given only the audio. In practice, it synthesizes only the region around the mouth; the remaining pixels come from a target video in a database.
The overall pipeline is as follows:
- From the video, both audio and mouth shape are extracted. Audio is represented as MFCC coefficients; mouth shape as 18 lip markers (a minimal feature-extraction sketch follows the figure below).
- Train an audio-to-mouth-shape mapping with a time-delayed unidirectional LSTM (see the LSTM sketch below).
- Synthesize the mouth texture: retrieve video frames from the database whose mouth shapes are similar to the LSTM output; apply a per-pixel weighted median over the retrieved mouth textures (see the weighted-median sketch below); manually select a teeth target frame (the selection criteria are purely subjective) and use it to enhance the teeth region of the median texture, which would otherwise look blurry.
- Re-time the target video to avoid segments where Obama's head keeps moving while he is not speaking, which looks very unnatural.
- Compose the result into the target video; this final step involves a jaw correction to make the output look more natural.
![Algorithm flow](http://www.kurzweilai.net/images/Obama-lip-Sync-Graphic.jpg)
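For concreteness, here is a minimal sketch of the feature-extraction step using off-the-shelf tools (librosa for MFCCs, dlib for lip landmarks). These are stand-ins, not the paper's pipeline: the paper tracks 18 lip markers and uses a richer audio feature set, while dlib's standard 68-point model exposes 20 mouth points (indices 48-67); aligning audio frames to 30 fps video via the hop length is also my assumption.

```python
import cv2
import dlib
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=13):
    """Audio features: one MFCC vector per (assumed 30 fps) video frame."""
    y, sr = librosa.load(wav_path, sr=16000)
    # hop_length chosen so audio frames roughly align with 30 fps video
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=sr // 30).T  # (frames, n_mfcc)

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth_shape(frame_bgr):
    """Mouth shape: the 20 mouth landmarks from dlib's 68-point model,
    a rough substitute for the paper's 18 lip markers."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y)
                     for i in range(48, 68)], dtype=np.float32)
```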
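A minimal PyTorch sketch of the time-delayed LSTM: the network is unidirectional, so a fixed output delay lets it consume a few future audio frames before committing to a mouth shape. Hidden size, delay length, and the raw 36-D landmark output are illustrative guesses, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """Unidirectional LSTM mapping MFCC frames to lip-landmark vectors."""
    def __init__(self, n_mfcc=13, hidden=60, out_dim=36, delay=20):
        super().__init__()
        self.delay = delay                      # look-ahead, in frames
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)  # 18 (x, y) lip points, flattened

    def forward(self, mfcc):                    # mfcc: (B, T, n_mfcc)
        h, _ = self.lstm(mfcc)
        return self.head(h)                     # (B, T, out_dim)

def delayed_mse(model, mfcc, mouth):
    # Compare the prediction at time t against the mouth shape at time
    # t - delay: by then the LSTM has seen `delay` frames of future audio.
    pred = model(mfcc)
    d = model.delay
    return nn.functional.mse_loss(pred[:, d:], mouth[:, :-d])
```

At inference time the same shift applies: the mouth shape for frame t is read off the output at frame t + delay, so the synthesized sequence trails the audio by the delay.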
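And a NumPy sketch of the per-pixel weighted median used for texture synthesis. The candidate frames are assumed to be already warped to a common reference pose, and the weighting scheme shown is an illustrative choice, not the paper's exact formula.

```python
import numpy as np

def weighted_median_texture(frames, weights):
    """Per-pixel weighted median over N candidate mouth textures.

    frames:  (N, H, W, C) candidate mouth crops, warped to a common pose.
    weights: (N,) similarity weights, e.g. exp(-dist_i / sigma), where
             dist_i measures mouth-shape distance (illustrative choice).
    """
    order = np.argsort(frames, axis=0)          # per-pixel value ordering
    w = weights / weights.sum()
    w_full = np.broadcast_to(w[:, None, None, None], frames.shape)
    w_sorted = np.take_along_axis(w_full, order, axis=0)
    cum = np.cumsum(w_sorted, axis=0)
    idx = (cum >= 0.5).argmax(axis=0)[None]     # first index past half weight
    vals = np.take_along_axis(frames, order, axis=0)
    return np.take_along_axis(vals, idx, axis=0)[0]
```

With equal weights this reduces to an ordinary per-pixel median; the point of a median over a mean is robustness, since averaging the candidates would produce ghosting and blur.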
The results look ridiculously natural. The authors suggest speech summarization as one application: a speech could be summarized not only by selected parts as text and audio, but also with a synthesized video of those parts. Personally, this work inspires me to work on a method that generates a natural sign language interpreter, taking sound/text as input and producing sign language motions.