[link]
_Objective:_ Image-to-image translation to perform visual attribute transfer using unpaired images.

_Dataset:_ [Cityscapes](https://www.cityscapes-dataset.com/), [CMP Facade](http://cmp.felk.cvut.cz/%7Etylecr1/facade/), [UT Zappos50k](http://vision.cs.utexas.edu/projects/finegrained/utzap50k/) and [ImageNet](http://www.image-net.org/).

_Code:_ [CycleGAN](https://github.com/junyanz/CycleGAN)

## Inner-workings:

Basically two GANs, one for each domain, each with its own Generator and Discriminator, plus two additional losses (called consistency losses) to make sure that translating to the other domain and then back yields an image close to the original.

[![screen shot 2017-06-02 at 10 24 45 am](https://cloud.githubusercontent.com/assets/17261080/26717449/bcd8a9cc-477d-11e7-9137-fd277a0ec04f.png)](https://cloud.githubusercontent.com/assets/17261080/26717449/bcd8a9cc-477d-11e7-9137-fd277a0ec04f.png)

For the consistency loss they use a pixel-wise L1 norm (a short code sketch of this term is included at the end of this note):

[![screen shot 2017-06-02 at 10 31 22 am](https://cloud.githubusercontent.com/assets/17261080/26717733/bc088cdc-477e-11e7-96af-2defa06a1660.png)](https://cloud.githubusercontent.com/assets/17261080/26717733/bc088cdc-477e-11e7-96af-2defa06a1660.png)

## Architecture:

Based on [Perceptual losses for real-time style transfer and super-resolution](https://arxiv.org/pdf/1603.08155.pdf), code available [here](https://github.com/jcjohnson/fast-neural-style). Training seems to employ several tricks and even uses a batch size of 1.

## Results:

Very impressive, and the really key point is that you don't need paired images, which makes this trainable on any pair of domains that share an underlying structure.

[![screen shot 2017-06-02 at 10 26 29 am](https://cloud.githubusercontent.com/assets/17261080/26717502/f6d1fb7e-477d-11e7-8174-7bdd621cf1b6.png)](https://cloud.githubusercontent.com/assets/17261080/26717502/f6d1fb7e-477d-11e7-8174-7bdd621cf1b6.png)
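For concreteness, the pixel-wise L1 consistency term shown above can be written in a few lines. This is a hypothetical PyTorch-style sketch (the generator names `g_ab`/`g_ba` are assumptions), not code from the linked repo:

```python
import torch

def cycle_consistency_loss(real: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
    # Pixel-wise L1 norm between an image and its round-trip reconstruction.
    return torch.mean(torch.abs(reconstructed - real))

# Round trips in both directions (g_ab: A -> B, g_ba: B -> A), hypothetical names:
# loss_cyc = cycle_consistency_loss(real_a, g_ba(g_ab(real_a))) \
#          + cycle_consistency_loss(real_b, g_ab(g_ba(real_b)))
```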
[link]
Over the last five years, artificial creative generation powered by ML has blossomed. We can now imagine buildings based on a sketch, peer into the dog-tiled "dreams" of a convolutional net, and, as of 2017, turn images of horses into ones of zebras. This last problem, typically termed image-to-image translation, is the one that CycleGAN focuses on. The kinds of transformations that can fall under this category are pretty conceptually broad: zebras to horses, summer scenes to winter ones, photos to Monet paintings. (Note: I switch between using horse/zebra as my explanatory example and using summer/winter. Both have advantages for explaining different conceptual points.) However, the idea is the same: you start with image a, which belongs to set A, and you want to generate a mapping of that image into set B, where the only salient change is that it's now in set B. As a clarifying example: if you started out with a horse and your goal was to translate it into a zebra, you would hope that the animal keeps the same size, relative position, and pose, and that the only element that changed was swapping the quality of "horseness" for the quality of "zebraness".

![](https://i.imgur.com/NCExS7A.png)

The real trick of CycleGAN is that, unlike prior attempts to solve this problem, it doesn't use paired data. This is understandable, given the prior example: while it's possible to take a picture of a scene in both summer and winter, you obviously can't convert a horse into a zebra so that you can take a "paired" picture of it in both forms. When you have paired data, this is a reasonably well-defined problem: you want to learn some mathematical transformation to turn a specific summer image into a specific winter one, and you can use the ground truth winter image as explicit supervision.

Since they lack this per-image cross-domain ground truth, the authors of this paper take what would be one question ("is the winter version of this image that the network generated close to the actual known winter version of this image?") and decompose it into two:

1. Does the winter version of this original summer image look like it belongs to the set of winter images? This is enforced by a GAN-style discriminator, which takes in outputs of the summer -> winter generator and true images of winter, and tries to tell them apart. This loss component pushes generated winter images to have the quality of "winterness". This is the "Adversarial Loss".

2. Does the winter version of this image contain enough information about this specific original summer image to accurately reconstruct it with an inverted (winter -> summer) generator? This constraint pushes the generator to actually translate aspects of this specific image between summer and winter. Without it, as the authors of the paper showed, the model has no incentive to actually do translation, and instead just generates winter images that have nothing to do with the summer image (and frequently suffers mode collapse: generating a single winter image over and over again). This is termed the "Cycle Consistency Loss".

There are actually two versions of both of the above networks; that's what puts the "cycle" in CycleGAN. In addition to a loss ensuring you can map summer -> winter -> summer, there's another ensuring the other direction, winter -> summer -> winter, holds as well. And, for both of those directions, we use the adversarial loss on the middle "translated" image and a cycle consistency loss on the last "reconstructed" image.
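To make this two-part decomposition concrete, here is a rough sketch of the generator objective for one training step, in hypothetical PyTorch. The network names, the least-squares form of the adversarial term, and the weight `lambda_cyc` are illustrative assumptions, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def generator_loss(g_s2w, g_w2s, d_winter, d_summer,
                   real_summer, real_winter, lambda_cyc=10.0):
    """Combined objective for the two generators in the summer/winter example."""
    fake_winter = g_s2w(real_summer)   # "translated" images
    fake_summer = g_w2s(real_winter)

    # 1) Adversarial loss: do translated images look like they belong to the target set?
    #    (least-squares GAN form: push discriminator scores on fakes towards 1)
    pred_w = d_winter(fake_winter)
    pred_s = d_summer(fake_summer)
    adv = F.mse_loss(pred_w, torch.ones_like(pred_w)) + \
          F.mse_loss(pred_s, torch.ones_like(pred_s))

    # 2) Cycle consistency loss: can each translated image be mapped back
    #    to the specific original it came from?
    cyc = F.l1_loss(g_w2s(fake_winter), real_summer) + \
          F.l1_loss(g_s2w(fake_summer), real_winter)

    return adv + lambda_cyc * cyc
```

The discriminators are trained with the usual opposite objective (real images pushed towards 1, translated ones towards 0), which is what supplies the "does this look like winter?" signal to the generators.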
A key point here is that, because the inherent structure of this loss function requires mapping networks going in both directions, training a winter -> summer generator gets you a summer -> winter one for free. (Note: this is a totally different model architecture than most of the "style transfer" applications you have likely seen previously, though when applied to photograph -> painting translation it can have similar results.)