Synthesizing Obama: learning lip sync from audio on ShortScience.org


Synthesizing Obama: learning lip sync from audio
Suwajanakorn, Supasorn and Seitz, Steven M. and Kemelmacher-Shlizerman, Ira
ACM Special Interest Group on computer GRAPHics - 2017 via Local Bibsonomy


Summary by Oleksandr Bailo 6 years ago

This paper synthesizes a high-quality video of Barack Obama speaking, given only an audio track. In practice, it synthesizes just the region around the mouth, while the remaining pixels come from an existing video in a database.
The overall pipeline is as follows:
- From a video, extract the audio and the mouth shape. Audio is represented as MFCCs; the mouth shape as 18 lip markers.
- Train the audio-to-mouth-shape mapping with a time-delayed unidirectional LSTM.
- Synthesize the mouth texture: retrieve a number of video frames from the database whose mouth shape is similar to the LSTM output; synthesize a median texture by applying a weighted median to the mouth regions of the retrieved frames; manually select a teeth target frame (the selection criteria are purely subjective) and enhance the teeth in the median texture with the selected frame.
- Re-time the target video to avoid situations where Obama is not speaking but his head is moving, which looks very unnatural.
- Compose the result into the target video, with jaw correction to make it look more natural.
![Algorithm flow](http://www.kurzweilai.net/images/Obama-lip-Sync-Graphic.jpg)
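
To make two of the pipeline steps more concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that "time-delayed" means the LSTM output at step t is trained against the mouth shape at step t - delay (so the network hears some future audio first), and that the median texture is a per-pixel weighted median over the retrieved candidate frames. The function names, the `delay` value, and the similarity weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def time_delayed_targets(mouth_shapes, delay):
    """Align targets for a time-delayed sequence model (assumed scheme).

    The output at step t (for t >= delay) is compared with the mouth shape
    at step t - delay. Returns the output steps that carry a loss and the
    matching target shapes.
    """
    mouth_shapes = np.asarray(mouth_shapes)
    n = len(mouth_shapes)
    if not 0 <= delay < n:
        raise ValueError("delay must satisfy 0 <= delay < number of steps")
    output_steps = np.arange(delay, n)
    targets = mouth_shapes[: n - delay]
    return output_steps, targets


def weighted_median(values, weights):
    """Weighted median along axis 0, computed independently per element.

    values:  array of shape (K, ...), e.g. K candidate mouth crops.
    weights: array of shape (K,), non-negative, e.g. shape-similarity scores.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.shape != (values.shape[0],):
        raise ValueError("weights must have one entry per candidate")

    # Sort candidates per element and carry the weights along.
    order = np.argsort(values, axis=0)
    sorted_vals = np.take_along_axis(values, order, axis=0)
    w = weights.reshape((-1,) + (1,) * (values.ndim - 1))
    sorted_w = np.take_along_axis(np.broadcast_to(w, values.shape), order, axis=0)

    # Pick the first sorted value whose cumulative weight reaches half the total.
    cum = np.cumsum(sorted_w, axis=0)
    idx = np.argmax(cum >= 0.5 * cum[-1], axis=0)
    return np.take_along_axis(sorted_vals, np.expand_dims(idx, 0), axis=0)[0]


if __name__ == "__main__":
    shapes = np.arange(5 * 36).reshape(5, 36)  # 5 steps, 18 (x, y) markers each
    steps, targets = time_delayed_targets(shapes, delay=2)
    print(steps, targets.shape)

    crops = np.stack([np.full((2, 2), v) for v in (1.0, 2.0, 10.0)])
    print(weighted_median(crops, np.array([1.0, 1.0, 5.0])))
```

A median (rather than a mean) keeps each output pixel equal to a value actually observed in one of the candidates, which avoids averaging together misaligned frames; the weights let candidates with closer mouth shapes dominate.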

The results look ridiculously natural. The authors suggest that one application of this work is speech summarization, where a speech is summarized not only with selected passages as text and audio but also with a synthesized video. Personally, this work inspires me to work on a method that generates a natural sign language interpreter, one that takes sound or text as input and produces sign language movements.
