Summaries from ACM Special Interest Group on computer GRAPHics on ShortScience.org

Synthesizing Obama: learning lip sync from audio
Suwajanakorn, Supasorn and Seitz, Steven M. and Kemelmacher-Shlizerman, Ira
ACM Special Interest Group on computer GRAPHics - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Oleksandr Bailo 7 years ago

This paper synthesizes a high-quality video of Barack Obama speaking, given only an audio track. In practice, only the region around the mouth is synthesized; all remaining pixels come from existing video footage in a database.
The overall pipeline is as follows:
- From a video, the audio and the mouth shape are extracted. Audio is represented as MFCCs (mel-frequency cepstral coefficients); the mouth shape is represented by 18 lip markers.
- Train a mapping from audio to mouth shape with a time-delayed unidirectional LSTM.
- Synthesize the mouth texture: retrieve a number of database video frames whose mouth shape is similar to the LSTM output; synthesize a median texture by applying a weighted median over the mouth regions of the retrieved frames; manually select a teeth target frame (the selection criteria are purely subjective) and use it to enhance the teeth in the median texture.
- Re-time the target video to avoid situations where Obama is not speaking but his head is moving, which looks very unnatural.
- Final composition into the target video involves jaw correction to make it more natural.
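
The time delay in the LSTM step can be pictured as a shift of the training targets: at step t the network is asked for the mouth shape of frame t - d, so a forward-only LSTM has already heard d frames of "future" audio. Below is a minimal sketch of that alignment in plain Python. The function name, the toy data and the delay value are illustrative assumptions, not taken from the paper.

```python
def delayed_targets(mouth_shapes, delay):
    """Align targets so that step t is supervised with the shape of frame t - delay.

    The first `delay` steps get None (no loss), because the network has not
    yet heard enough audio to describe any frame.
    """
    if not 0 <= delay <= len(mouth_shapes):
        raise ValueError("delay must be between 0 and the sequence length")
    return [None] * delay + list(mouth_shapes[: len(mouth_shapes) - delay])


# Toy data: strings stand in for MFCC vectors and 18-marker lip shapes.
audio = ["a0", "a1", "a2", "a3", "a4"]
shapes = ["m0", "m1", "m2", "m3", "m4"]
targets = delayed_targets(shapes, delay=2)

for step, (feat, target) in enumerate(zip(audio, targets)):
    print(step, feat, target)
```

With `delay=2`, the network reads `a0` to `a2` before it has to produce `m0`.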
![Algorithm flow](http://www.kurzweilai.net/images/Obama-lip-Sync-Graphic.jpg)
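
The weighted-median step of the texture synthesis can be sketched per pixel as below. This is a simplified illustration, not the authors' implementation: tiny grayscale grids stand in for the retrieved mouth regions, and the weights are made-up integers (they are assumed here to grow with how closely a candidate's mouth shape matches the LSTM output).

```python
def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    if len(values) != len(weights) or not values:
        raise ValueError("need equally sized, non-empty values and weights")
    total = sum(weights)
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= total / 2:
            return value
    return max(values)  # only reachable through floating-point round-off


def median_texture(frames, weights):
    """Apply the weighted median independently at every pixel position."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [weighted_median([f[y][x] for f in frames], weights) for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical 2x2 grayscale candidate frames and their weights.
frames = [
    [[10, 200], [30, 40]],
    [[12, 90], [35, 41]],
    [[50, 95], [33, 45]],
]
weights = [2, 5, 3]
texture = median_texture(frames, weights)
print(texture)
```

A median, unlike a weighted mean, always returns a value that actually occurs in one of the candidates, so the outlier `200` in the first frame does not leak into the result.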

The results look ridiculously natural. The authors suggest that one application of this work is speech summarization, where a speech is summarized not only as selected text and audio excerpts but also with a synthesized video to go with them. Personally, this work inspires me to work on a method that generates a natural sign-language interpreter, taking sound or text as input and producing sign-language movements.

Visual attribute transfer through deep image analogy
Liao, Jing and Yao, Yuan and Yuan, Lu and Hua, Gang and Kang, Sing Bing
ACM Special Interest Group on computer GRAPHics - 2017 via Local Bibsonomy
Keywords: dblp

Summary by Léo Paillier 7 years ago

_Objective:_ Transfer visual attributes (color, tone, texture, style, etc.) between two semantically meaningful images, such as a picture and a sketch.

## Inner workings:

### Image analogy

An image analogy A:A′::B:B′ is a relation where:

*   B′ relates to B in the same way as A′ relates to A
*   A and A′ are in pixel-wise correspondence
*   B and B′ are in pixel-wise correspondence

In this paper, only a source image A and an example image B′ are given; both A′ and B are latent images to be estimated.

[![screen shot 2017-05-18 at 10 43 48 am](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png)](https://cloud.githubusercontent.com/assets/17261080/26193907/f080e212-3bb6-11e7-9441-7b255e4219f5.png)

### Dense correspondence

To find dense correspondences between the two images, they use features from a pre-trained CNN (VGG-19), taking the outputs of its ReLU layers.

The mapping is divided into two sub-mappings that are easier to compute: first a visual attribute transformation, then a spatial transformation.

[![screen shot 2017-05-18 at 11 04 58 am](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png)](https://cloud.githubusercontent.com/assets/17261080/26194835/03ccd94a-3bba-11e7-93ca-9420d4d96162.png)

## Architecture:

The algorithm proceeds as follows:

1.  Compute features at each layer for the input images using a pre-trained CNN, and initialize the feature maps of the latent images at the coarsest layer.
2.  For the current layer, compute a forward and a reverse nearest-neighbor field (NNF, basically an offset field).
3.  Use this NNF together with the input features at the current layer to compute the features of the latent images.
4.  Upsample the NNF and use it to initialize the NNF of the next layer.

[![screen shot 2017-05-18 at 11 14 33 am](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png)](https://cloud.githubusercontent.com/assets/17261080/26195178/35277e0e-3bbb-11e7-82ce-037466314640.png)
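
Steps 2 to 4 can be illustrated with a toy sketch in plain Python. It rests on several simplifying assumptions: nested lists stand in for CNN feature maps, a brute-force search over single feature vectors replaces the patch-based matching a real implementation would use, the NNF stores target coordinates and not offsets, and only the forward direction is shown. All function names are hypothetical.

```python
def nearest_neighbor_field(feat_a, feat_b):
    """For each position in feat_a, find the position in feat_b whose
    feature vector is closest (brute force, squared L2 distance)."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

    b_positions = [(y, x) for y in range(len(feat_b)) for x in range(len(feat_b[0]))]
    return [
        [min(b_positions, key=lambda p: sq_dist(vec, feat_b[p[0]][p[1]])) for vec in row]
        for row in feat_a
    ]


def warp(feat_b, nnf):
    """Rearrange feat_b so that position p holds feat_b[nnf[p]]."""
    return [[feat_b[ty][tx] for ty, tx in row] for row in nnf]


def upsample_nnf(nnf, factor=2):
    """Scale a coarse NNF to a grid `factor` times finer, as a starting
    point for the search at the next layer."""
    fine = []
    for y in range(len(nnf) * factor):
        row = []
        for x in range(len(nnf[0]) * factor):
            ty, tx = nnf[y // factor][x // factor]
            row.append((ty * factor + y % factor, tx * factor + x % factor))
        fine.append(row)
    return fine


# Toy 2x2 feature maps with 2-D features; feat_b is feat_a mirrored left-right.
feat_a = [[(1, 0), (0, 1)], [(1, 1), (0, 0)]]
feat_b = [[(0, 1), (1, 0)], [(0, 0), (1, 1)]]

nnf = nearest_neighbor_field(feat_a, feat_b)  # step 2 (forward direction only)
warped = warp(feat_b, nnf)                    # step 3, without any blending
fine_nnf = upsample_nnf(nnf)                  # step 4
print(nnf)
```

In this toy case the NNF recovers the mirroring exactly, so warping `feat_b` reproduces `feat_a`; on real features the match is only approximate and the upsampled field serves as an initial guess to be refined.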

## Results:

Impressive quality on all types of visual transfer, but very slow (~3 min on GPUs for one image).

[![screen shot 2017-05-18 at 11 36 47 am](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png)](https://cloud.githubusercontent.com/assets/17261080/26196151/54ef423c-3bbe-11e7-9433-b29be5091fae.png)