Summary by [brannondorsey](https://gist.github.com/brannondorsey/fb075aac4d5423a75f57fbf7ccc12124):

- Euclidean distance between predicted and ground truth pixels is not a good measure of similarity because minimizing it yields blurry images.
- GANs learn a loss function rather than using a fixed, hand-designed one.
- GANs learn a loss that tries to classify whether the output image is real or fake, while simultaneously training a generative model to minimize this loss.
- Conditional GANs (cGANs) learn a mapping from an observed image `x` and a random noise vector `z` to an output image `y`: `y = G(x, z)`.
- The generator `G` is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator `D`, which is trained to do as well as possible at detecting the generator's "fakes".
- The discriminator `D` learns to classify between real and synthesized pairs; the generator learns to fool the discriminator.
- Unlike an unconditional GAN, both the generator and discriminator observe the input image `x`.
- The objective asks `G` not only to fool the discriminator but also to be near the ground truth output in an `L2` sense.
- In practice, `L1` distance between the output of `G` and the ground truth is used instead of `L2` because it encourages less blurring (see the training sketch after this list).
- Without `z`, the net could still learn a mapping from `x` to `y`, but it would produce deterministic outputs and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise `z` as an input to the generator, in addition to `x`.
- Either a vanilla encoder-decoder or a U-Net can be selected as the model for `G` in this implementation.
- Both the generator and discriminator use modules of the form convolution-BatchNorm-ReLU.
- A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid.
- Input and output images differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
- `L1` loss does very well at low frequencies (I take this to mean general tonal distribution/contrast, color blotches, etc.) but fails at high frequencies (crispness, edges, detail), which is why it yields blurry images. This motivates restricting the GAN discriminator to model only high frequency structure, relying on an `L1` term to force low frequency correctness. To model high frequencies, it is sufficient to restrict attention to structure in local image patches. Therefore, the discriminator architecture – termed a PatchGAN – only penalizes structure at the scale of patches: it tries to classify whether each `NxN` patch in an image is real or fake, and is run convolutionally across the image, averaging all responses to provide the ultimate output of `D` (see the PatchGAN sketch below).
- Because the PatchGAN assumes independence between pixels separated by more than a patch diameter (`N`), it can be thought of as a form of texture/style loss.
- To optimize the networks, training alternates between one gradient descent step on `D` and one step on `G` (minibatch SGD with the Adam solver).
- Batch size `1` is used for certain experiments and `4` for others, with little difference noted between the two conditions.
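A minimal PyTorch-style sketch of the combined objective and the alternating `D`/`G` optimization described above. `G`, `D`, and `loader` are hypothetical stand-ins (any paired-image generator, discriminator, and dataset iterator); the `lambda = 100`, learning rate `2e-4`, and `beta1 = 0.5` values follow the paper, but this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# G, D: hypothetical generator / discriminator modules (see the PatchGAN sketch below).
# loader: hypothetical iterator yielding (x, y) pairs of input and ground-truth images.

bce = nn.BCEWithLogitsLoss()  # GAN loss on the discriminator's (patch) logits
l1 = nn.L1Loss()
lambda_l1 = 100.0             # weight on the L1 term, as in the paper

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

for x, y in loader:
    fake = G(x)

    # One gradient step on D: classify real vs. synthesized pairs.
    opt_d.zero_grad()
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # One gradient step on G: fool D, and stay close to y in an L1 sense.
    opt_g.zero_grad()
    d_fake = D(x, fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lambda_l1 * l1(fake, y)
    loss_g.backward()
    opt_g.step()
```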
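A sketch of a PatchGAN-style discriminator, again assuming PyTorch. The layer choices mirror the common `70x70` configuration (three stride-2 convolutions, then two stride-1 convolutions), but this is an illustration rather than the reference code.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=6):  # input and target images concatenated on channels (3 + 3)
        super().__init__()

        def block(cin, cout, stride=2, norm=True):
            layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x, y):
        # Run convolutionally over the paired images; each output element judges one NxN patch.
        patch_logits = self.net(torch.cat([x, y], dim=1))
        return patch_logits  # average (or apply the GAN loss element-wise) for the final decision
```

Because the network is fully convolutional, the same fixed-size patch discriminator can be run on larger images at test time (e.g. train on `256x256`, generate on `512x512`), which is the property noted in the experiments below.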
- __To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation.__
- Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.
- FCN-score: while quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized images correctly as well.
- cGANs seem to work much better than GANs for this type of image-to-image transformation; with a plain GAN, the generator collapses into producing nearly the same output regardless of the input photograph.
- A `16x16` PatchGAN produces sharp outputs but causes tiling artifacts; a `70x70` PatchGAN alleviates these artifacts. A `256x256` ImageGAN does not appear to improve on the tiling artifacts and yields a lower FCN-score.
- An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. This allows training on, say, `256x256` images and testing/sampling/generating on `512x512`.
- cGANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks.
- When semantic segmentation is required (i.e. going from image to label), `L1` performs better than cGAN. The authors argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than for graphics tasks, and reconstruction losses like `L1` are mostly sufficient.

### Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

### Misc

- Least absolute deviations (`L1`) and least square errors (`L2`) are the two standard loss functions that decide what should be minimized while learning from a dataset. ([source](http://rishy.github.io/ml/2015/04/28/l1-vs-l2-loss/)) A tiny numerical example follows after the Resources section below.
- How, using pix2pix, do you specify a loss of `L1`, `L1+GAN`, and `L1+cGAN`?

### Resources

- [GAN paper](https://arxiv.org/pdf/1406.2661.pdf)
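A tiny numerical illustration of the `L1` vs `L2` losses mentioned under Misc, using plain NumPy and hypothetical per-pixel values:

```python
import numpy as np

# Hypothetical per-pixel ground truth and predicted values.
y_true = np.array([0.0, 0.5, 1.0])
y_pred = np.array([0.1, 0.4, 0.7])

l1 = np.mean(np.abs(y_true - y_pred))   # least absolute deviations: ~0.167
l2 = np.mean((y_true - y_pred) ** 2)    # least square errors: ~0.037
print(l1, l2)
```

`L2` penalizes large errors quadratically, so under uncertainty the safest prediction is an average of plausible answers, which is one intuition for why per-pixel `L2` (Euclidean distance) tends to produce blurry images.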