Summary by Sina Honari
This paper generates photographic images from semantic images by progressively growing the resolution of the feature maps. The goal is to generate high-resolution images while maintaining global structure through a coarse-to-fine procedure.
The architecture is composed of several refinement modules (as shown in the figure below), each of which maintains the resolution of its input. The output of each module is doubled in resolution before being passed to the next module. The first module has a resolution of $4 \times 8$ and takes the semantic image downsampled to this resolution; it produces a feature layer $F_0$ as output. This output is then doubled in resolution and passed, together with a correspondingly downsampled semantic image, to the next module, which generates feature layer $F_1$. This process continues, with each module taking feature layer $F_{i-1}$ together with a downsampled semantic image as input and producing $F_i$ as output. The final module outputs 3 channels for the RGB image.
https://i.imgur.com/M3ucgwI.png
This process is used to generate high-resolution images (images of resolution $1024 \times 2048$ are generated on the Cityscapes dataset) while maintaining global coordination in the image through the coarse-to-fine process. For example, if the model generates the left red tail light of a car, the right one should look similar. The global structure can be specified at low resolution, where distant image regions are close in feature space, and then maintained while the resolution of the maps is increased.
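As a concrete illustration, here is a minimal PyTorch sketch of one refinement module following the description above. The class name, channel widths, and normalization/activation choices are my assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    """One refinement module (sketch): two 3x3 convolutions, each followed by
    normalization and a leaky ReLU. GroupNorm with one group is used here as
    a stand-in for layer normalization."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.norm1 = nn.GroupNorm(1, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.norm2 = nn.GroupNorm(1, out_ch)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, semantic_i, prev_features=None):
        # Upsample F_{i-1} to this module's (doubled) resolution and
        # concatenate it with the semantic image downsampled to match.
        if prev_features is not None:
            prev = F.interpolate(prev_features, size=semantic_i.shape[2:],
                                 mode='bilinear', align_corners=False)
            x = torch.cat([semantic_i, prev], dim=1)
        else:
            x = semantic_i  # the first module sees only the 4x8 semantic image
        x = self.act(self.norm1(self.conv1(x)))
        x = self.act(self.norm2(self.conv2(x)))
        return x  # feature layer F_i
```

Stacking several such modules, doubling the spatial size between consecutive ones and ending with a convolution that maps the final feature layer to 3 RGB channels, yields the cascade described above.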
Creating photographic images is a one-to-many mapping: many different outputs can be plausible and correct for the same input. Pixel-wise comparison of the generated image with the ground truth (GT) from the training set can therefore produce high errors even for correct outputs. For example, if the model paints a car black instead of the GT's white, the pixel-wise error is very high although the output is still correct. The authors therefore define the cost by comparing features of a pre-trained VGG network as follows:
https://i.imgur.com/gIflZLM.png
where $l$ indexes layers of the pre-trained VGG model, $\lambda_l$ is the corresponding layer weight, and $\Phi_l(I)$ and $\Phi_l(g(L;\theta))$ are the layer-$l$ features of the GT image and the generated image, respectively.
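From this description, the loss takes the form $\sum_l \lambda_l \, \|\Phi_l(I) - \Phi_l(g(L;\theta))\|$ (Eq. 1 above). A minimal PyTorch sketch of such a feature-matching loss, assuming the VGG-19 from torchvision, an L1 distance, and uniform layer weights; the layer indices below (corresponding to conv1_2 through conv5_2) are illustrative choices, not necessarily the paper's:

```python
import torch
import torchvision.models as models

class VGGFeatureLoss(torch.nn.Module):
    """Weighted L1 distance between VGG-19 features of the GT and
    generated images, summed over a chosen set of layers."""
    def __init__(self, layer_ids=(2, 7, 12, 21, 30), weights=None):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # VGG is fixed; only g(.;theta) is trained
        self.vgg = vgg
        self.layer_ids = set(layer_ids)
        self.weights = weights or [1.0] * len(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated, target):
        loss = 0.0
        for w, f_g, f_t in zip(self.weights,
                               self.features(generated),
                               self.features(target)):
            loss = loss + w * torch.nn.functional.l1_loss(f_g, f_t)
        return loss
```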
The following image shows samples of this model:
https://i.imgur.com/coxsdbU.png
In order to generate more diverse images, another variant of the model is proposed in which the final output layer generates $3k$ channels ($k$ tuples of RGB channels, i.e., $k$ images) instead of $3$. The model then optimizes the following loss:
https://i.imgur.com/wVQwufn.png
where, for each class label $c$, the image $u$ among the $k$ generated images that yields the least error is selected. The rest of the loss is similar to Eq. (1), with the difference that it considers the loss for each feature map $j$, and the difference in features is multiplied (Hadamard product) by $L_p^l$, a binary mask of the same resolution as feature map $\Phi$ that indicates the presence of class label $c$ at the corresponding location. In summary, this loss takes the best synthesized image for each class $c$ and penalizes only the pixels corresponding to class $c$ in the feature maps.
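Here is a sketch of this per-class best-of-$k$ selection, following the summary's notation. The function name, argument layout, and the assumption that the masks $L_p^l$ have already been resized to each feature layer's resolution are mine:

```python
import torch

def diversity_loss(feats_gt, feats_gen, masks, weights):
    """Per-class best-of-k feature loss (Eq. 3, notation from the summary).

    feats_gt:  list over layers l of GT features, each (C_l, H_l, W_l)
    feats_gen: list over layers l of features of the k outputs, each (k, C_l, H_l, W_l)
    masks:     list over layers l of per-class masks L_p^l, each (num_classes, H_l, W_l)
    weights:   per-layer weights lambda_l
    """
    num_classes = masks[0].shape[0]
    k = feats_gen[0].shape[0]
    # errors[c, u] = weighted L1 error of output u, restricted to class c
    errors = feats_gt[0].new_zeros(num_classes, k)
    for lam, f_gt, f_gen, mask in zip(weights, feats_gt, feats_gen, masks):
        # |Phi(I) - Phi(g_u)| summed over feature maps j -> shape (k, H, W)
        diff = (f_gen - f_gt.unsqueeze(0)).abs().sum(dim=1)
        # Hadamard product with each class mask, summed over pixels -> (c, k)
        errors += lam * torch.einsum('chw,khw->ck', mask.float(), diff)
    # for each class, keep only the best of the k outputs, then sum over classes
    return errors.min(dim=1).values.sum()
```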
The following image shows two different samples for the same input:
https://i.imgur.com/TFPWLxa.png
The model (referred to as CRN) is evaluated by pairwise comparison of CRN samples against the following baselines using Amazon Mechanical Turk:
- $\textbf{GAN and semantic segmentation:}$ a model trained with a GAN loss plus a semantic segmentation loss on the generated photographic images.
- $\textbf{Image-to-image translation:}$ a conditional GAN based on an image-to-image translation network.
- $\textbf{Encoder-decoder:}$ a model that uses the CRN loss but replaces the CRN architecture with a U-Net or Recombinator Networks architecture (an encoder-decoder architecture with skip connections).
- $\textbf{Full-resolution network:}$ a model that uses the CRN loss but with a full-resolution network, i.e., a model that maintains the same resolution from input to output.
- $\textbf{Image-space loss:}$ a model that uses the CRN loss but applies it directly to the RGB values rather than to VGG features.
The first two baselines use different losses as well as different architectures; the last three use the same loss as CRN but different architectures. Mechanical Turk raters judged samples from CRN with its proposed loss to be more realistic than those of the other approaches.
Although the paper compares against models that use a GAN loss and/or a semantic segmentation loss, it would have been better to also try these losses on the CRN architecture itself in order to better isolate the impact of the losses.
Also, the paper does not show many diverse samples generated by the model (only two are shown). More samples would better demonstrate the effectiveness of the proposed approach in generating diverse outputs (the impact of using Eq. 3).
In general, I like the proposed coarse-to-fine modular resolution-increment approach and find the defined loss and architecture effective.