Autoencoding beyond pixels using a learned similarity metric on ShortScience.org

arxiv.org
arxiv-vanity.com
scholar.google.com

Autoencoding beyond pixels using a learned similarity metric
Anders Boesen Lindbo Larsen and Søren Kaae Sønderby and Hugo Larochelle and Ole Winther
arXiv e-Print archive - 2015 via Local arXiv
Keywords: cs.LG, cs.CV, stat.ML
more

Summaries/Notes 1

[link] Summary by CodyWild 6 years ago

Variational Autoencoders are a type of generative model that seek to learn how to generate new data by incentivizing the model to be able to reconstruct input data, after compressing it to a low-dimensional space. Typically, the way that the reconstruction is scored against the original is by comparing the pixel by pixel values: a reconstruction gets a high score if it is able to place pixels of color in the same places that the original did. However, there are compelling reasons why this is a sub-par way of scoring images.

The central one is: it focuses on and penalizes superficial differences, so if the model accurately reproduces the focal object of the image, but does so, say, 10 pixels to the right of where it was previously, that will incur a penalty we might not actually want to apply. The flip side of this is that a direct pixel-comparison loss doesn’t differentiate between pixel differences that do or don’t change the fundamental substance of the image. For instance, having 100 pixels wrong around the border of a dog, making it seem very slightly larger, would be the same amount of error as having 100 pixels concentrated in a weird bulb that appears to be growing out of a dog’s ear, even though the former does a better job of being recognizable as a dog.

The authors of the VAE/GAN paper have a clever approach to solving this problem, that involves taking the typical pixel loss, and breaking it up into two conceptual parts.
The first focuses on aligning the conceptual features of the reconstructed image with the conceptual features of the input image. It does so by running both the input and the reconstruction through a discriminative convolutional model which - in the typical way of deep learning - learns ever more abstract features at each layer of the network. These “conceptual features” abstract out the precise pixel values, and instead capture the higher level features of the image. So, instead of calculating the pixelwise squared loss between the specific input x, and its after-bottleneck reconstruction x~, you take the squared loss between the feature maps at some layer for both x and x~, and push them to be closer together, so that the reconstruction shares the same features as the original.
The second focuses on detail-level specifics of images, but, cleverly, does so in a general, rather than a observation-specific way. This is done by training a GAN-style discriminator to tell the difference between generated images* and original image, and then using that loss to train the decoder part of the VAE. The cleverness of this comes from the fact that they are still enforcing that the details and structural features of the reconstructed image are not distinguishable from real images, but doing so in a general sense, rather than requiring the details to be an exact match to the details found in a given input x.

https://i.imgur.com/Bmtmac2.png

The authors freely admit that existing metrics of scoring images (which themselves *use* pixelwise similarity) rate their method as being worse than existing VAEs. However, they argue, that’s inherently a flawed metric, that doesn’t capture the aspects of clean visual quality we want in generated image. A metric they propose instead involves using an dataset where a list of attributes are attached to each image (old, black, blond, etc). They add these as additional input while training the network, so that whatever signals the decoder part of the model needs to turn someone blonde, it gets those from the externally-given attribute vector, rather than a learned representation. This means that, once the model is trained, we can set some value of the attribute vector, and have the decoder generate samples conditional on that. The metric is constructed by taking the decoded samples conditioned on some attribute set, and then taking a classifier model that is trained on the real images to detect attribute values from the images. The generated images are then scored by how closely the predictions from the classifier model match the true values of the attributes. If the generator model were working perfectly, this error rate would as low as for real data. By this metric (which: grain of salt, since they invented), the VAE/GAN model is superior to both GANs and vanilla VAEs.

Your comment:

Write your summary here (You can use $\LaTeX$ and markdown syntax):

Anon Private