Summary by [brannondorsey](https://gist.github.com/brannondorsey/fb075aac4d5423a75f57fbf7ccc12124):

- Euclidean distance between predicted and ground truth pixels is not a good measure of similarity because minimizing it yields blurry images.
- GANs learn a loss function rather than using a fixed, hand-designed one.
- GANs learn a loss that tries to classify whether the output image is real or fake, while simultaneously training a generative model to minimize this loss.
- Conditional GANs (cGANs) learn a mapping from an observed image `x` and a random noise vector `z` to an output image `y`: `y = G(x, z)`.
- The generator `G` is trained to produce outputs that cannot be distinguished from "real" images by an adversarially trained discriminator `D`, which is trained to do as well as possible at detecting the generator's "fakes".
- The discriminator `D` learns to classify between real and synthesized pairs; the generator learns to fool the discriminator.
- Unlike an unconditional GAN, both the generator and discriminator observe the input image `x`.
- The objective asks `G` not only to fool the discriminator but also to be near the ground truth output in an `L2` sense.
- In practice, `L1` distance between the output of `G` and the ground truth is used instead of `L2` because it encourages less blurring (see the training sketch after this list).
- Without `z`, the net could still learn a mapping from `x` to `y`, but it would produce deterministic outputs and therefore fail to match any distribution other than a delta function. Past conditional GANs have acknowledged this and provided Gaussian noise `z` as an input to the generator, in addition to `x`.
- Either a vanilla encoder-decoder or a U-Net can be selected as the model for `G` in this implementation.
- Both the generator and discriminator use modules of the form convolution-BatchNorm-ReLU.
- A defining feature of image-to-image translation problems is that they map a high resolution input grid to a high resolution output grid.
- Input and output images differ in surface appearance, but both are renderings of the same underlying structure. Therefore, structure in the input is roughly aligned with structure in the output.
- `L1` loss does very well at low frequencies (I take this to mean general tonal distribution/contrast, color blotches, etc.) but fails at high frequencies (crispness, edges, detail), which is why it yields blurry images. This motivates restricting the GAN discriminator to model only high frequency structure, relying on an `L1` term to force low frequency correctness. To model high frequencies, it is sufficient to restrict attention to structure in local image patches. Therefore, the discriminator architecture – termed a PatchGAN – only penalizes structure at the scale of patches: it tries to classify whether each `NxN` patch in an image is real or fake, and is run convolutionally across the image, averaging all responses to provide the ultimate output of `D` (see the PatchGAN sketch below).
- Because the PatchGAN assumes independence between pixels separated by more than a patch diameter (`N`), it can be thought of as a form of texture/style loss.
- To optimize the networks, training alternates between one gradient descent step on `D` and one step on `G` (minibatch SGD with the Adam solver).
- Batch size `1` is used for certain experiments and `4` for others, with little difference noted between the two conditions.
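A minimal PyTorch-style sketch of the combined objective and the alternating `D`/`G` optimization described above. `G`, `D`, and `loader` are hypothetical stand-ins (any paired-image generator, discriminator, and dataset iterator); the `lambda = 100`, learning rate `2e-4`, and `beta1 = 0.5` values follow the paper, but this is not the authors' implementation.

```python
import torch
import torch.nn as nn

# G, D: hypothetical generator / discriminator modules (see the PatchGAN sketch below).
# loader: hypothetical iterator yielding (x, y) pairs of input and ground-truth images.

bce = nn.BCEWithLogitsLoss()  # GAN loss on the discriminator's (patch) logits
l1 = nn.L1Loss()
lambda_l1 = 100.0             # weight on the L1 term, as in the paper

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

for x, y in loader:
    fake = G(x)

    # One gradient step on D: classify real vs. synthesized pairs.
    opt_d.zero_grad()
    d_real = D(x, y)
    d_fake = D(x, fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # One gradient step on G: fool D, and stay close to y in an L1 sense.
    opt_g.zero_grad()
    d_fake = D(x, fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lambda_l1 * l1(fake, y)
    loss_g.backward()
    opt_g.step()
```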
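A sketch of a PatchGAN-style discriminator, again assuming PyTorch. The layer choices mirror the common `70x70` configuration (three stride-2 convolutions, then two stride-1 convolutions), but this is an illustration rather than the reference code.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=6):  # input and target images concatenated on channels (3 + 3)
        super().__init__()

        def block(cin, cout, stride=2, norm=True):
            layers = [nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.net = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512, stride=1),
            nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, x, y):
        # Run convolutionally over the paired images; each output element judges one NxN patch.
        patch_logits = self.net(torch.cat([x, y], dim=1))
        return patch_logits  # average (or apply the GAN loss element-wise) for the final decision
```

Because the network is fully convolutional, the same fixed-size patch discriminator can be run on larger images at test time (e.g. train on `256x256`, generate on `512x512`), which is the property noted in the experiments below.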
- __To explore the generality of conditional GANs, we test the method on a variety of tasks and datasets, including both graphics tasks, like photo generation, and vision tasks, like semantic segmentation.__
- Evaluating the quality of synthesized images is an open and difficult problem. Traditional metrics such as per-pixel mean-squared error do not assess joint statistics of the result, and therefore do not measure the very structure that structured losses aim to capture.
- FCN-score: while quantitative evaluation of generative models is known to be challenging, recent works have tried using pre-trained semantic classifiers to measure the discriminability of the generated images as a pseudo-metric. The intuition is that if the generated images are realistic, classifiers trained on real images will be able to classify the synthesized images correctly as well.
- cGANs seem to work much better than GANs for this type of image-to-image transformation; with a plain GAN, the generator collapses into producing nearly the same output regardless of the input photograph.
- A `16x16` PatchGAN produces sharp outputs but causes tiling artifacts; a `70x70` PatchGAN alleviates these artifacts. A `256x256` ImageGAN does not appear to improve on the tiling artifacts and yields a lower FCN-score.
- An advantage of the PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large images. This allows training on, say, `256x256` images and testing/sampling/generating on `512x512`.
- cGANs appear to be effective on problems where the output is highly detailed or photographic, as is common in image processing and graphics tasks.
- When semantic segmentation is required (i.e. going from image to label), `L1` performs better than cGAN. The authors argue that for vision problems, the goal (i.e. predicting output close to ground truth) may be less ambiguous than for graphics tasks, and reconstruction losses like `L1` are mostly sufficient.

### Conclusion

The results in this paper suggest that conditional adversarial networks are a promising approach for many image-to-image translation tasks, especially those involving highly structured graphical outputs. These networks learn a loss adapted to the task and data at hand, which makes them applicable in a wide variety of settings.

### Misc

- Least absolute deviations (`L1`) and least square errors (`L2`) are the two standard loss functions that decide what should be minimized while learning from a dataset. ([source](http://rishy.github.io/ml/2015/04/28/l1-vs-l2-loss/)) A tiny numerical example follows after the Resources section below.
- How, using pix2pix, do you specify a loss of `L1`, `L1+GAN`, and `L1+cGAN`?

### Resources

- [GAN paper](https://arxiv.org/pdf/1406.2661.pdf)
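A tiny numerical illustration of the `L1` vs `L2` losses mentioned under Misc, using plain NumPy and hypothetical per-pixel values:

```python
import numpy as np

# Hypothetical per-pixel ground truth and predicted values.
y_true = np.array([0.0, 0.5, 1.0])
y_pred = np.array([0.1, 0.4, 0.7])

l1 = np.mean(np.abs(y_true - y_pred))   # least absolute deviations: ~0.167
l2 = np.mean((y_true - y_pred) ** 2)    # least square errors: ~0.037
print(l1, l2)
```

`L2` penalizes large errors quadratically, so under uncertainty the safest prediction is an average of plausible answers, which is one intuition for why per-pixel `L2` (Euclidean distance) tends to produce blurry images.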