[link]
Summary by CodyWild 5 years ago
https://i.imgur.com/JJFljWo.png
This paper follows in a recent tradition of results out of Samsung: in the wake of StyleGAN’s very impressive generated images, it uses a lot of similar architectural elements, combined with meta-learning and a new discriminator framework, to generate convincing “talking head” animations based on a small number of frames of a person’s face. Previously, models that generated artificial face videos could only do so by training by a large number of frames of each individual speaker that they wanted to simulate. This system instead is able to generate video in a few-shot way: where they only need one or two frames of a new speaker to do convincing generation.
The structure of talking head video generation as a problem relies on the idea of “landmarks,” explicit parametrization of where the nose, the eyes, the lips, the head, are oriented in a given shot. The model is trained to be able to generate frames of a specified person (based on an input frame), and in a specific pose (based on an input landmark set).
While the visual quality of the simulated video generated here is quite stunning, the most centrally impressive fact about this paper is that generation was only conditioned on a few frames of each target person. This is accomplished through a combination of meta-learning (as an overall training procedure/regime) and adaptive instance normalization, a way of dynamically parametrizing models that was earlier used in a StyleGAN paper (also out of the Samsung lab). Meta-learning works by doing simulated few-shot training iterations, where a model is trained for a small number of steps on a given “task” (where here a task is a given target face), and then optimized on the meta-level to be able to get good test set error rates across many such target faces.
https://i.imgur.com/RIkO1am.png
The mechanics of how this meta-learning approach actually work are quite interesting: largely a new application of existing techniques, but with some extensions and innovations worked in.
- A convolutional model produces an embedding given an input image and a pose. An average embedding is calculated by averaging over different frames, with the hopes of capturing information about the video, in a pose-independent way. This embedding, along with a goal set of landmarks (i.e. the desired facial expression of your simulation) is used to parametrize the generator, which is then asked to determine whether the generated image looks like it came from the sequence belonging to the target face, and looks like it corresponds to the target pose
- Adaptive instance normalization works by having certain parameters of the network (typically, per the name, post-normalization rescaling values) that are dependent on the properties of some input data instance. This works by training a network to produce an embedding vector of the image, and then multiplying that embedding by per-layer, per-filter projection matrices to obtain new parameters. This is in particular a reasonable thing to do in the context of conditional GANs, where you want to have parameters of your generator be conditioned on the content of the image you’re trying to simulate
- This model structure gives you a natural way to do few-shot generation: you can train your embedding network, your generator, and your projection matrices over a large dataset, where they’ve hopefully learned how to compress information from any given target image, and generate convincing frames based on it, so that you can just pass in your new target image, have it transformed into an embedding, and have it contain information the rest of the network can work with
- This model uses a relatively new (~mid 2018) formulation of a conditional GAN, called the projection discriminator. I don’t have time to fully explain this here, but at a high level, it frames the problem of a discriminator determining whether a generated image corresponds to a given conditioning class by projecting both the class and the image into vectors, and calculating a similarity-esque dot product.
- During few-shot application of this model, it can get impressively good performance without even training on the new target face at all, simply by projecting the target face into an embedding, and updating the target-specific network parameters that way. However, they do get better performance if they fine-tune to a specific person, which they do by treating the embedding-projection parameters as an initialization, and then taking a few steps of gradient descent from there
more
less