[link]
This paper generates photographic images from semantic images by progressively growing the resolution of the feature maps. The goal is to generate high-resolution images while maintaining global structure, in a coarse-to-fine procedure. The architecture is composed of several refinement modules (shown in the figure below), each of which operates at the resolution of its input; the resolution is doubled between consecutive modules. The first module works at a resolution of $4 \times 8$ and takes the semantic image downsampled to that resolution, producing a feature layer $F_0$. This output is upsampled to double resolution and passed, together with a correspondingly downsampled semantic image, to the next module, which produces feature layer $F_1$. The process continues, with each module taking feature layer $F_{i-1}$ and the semantic image as input and producing $F_i$, until the final module outputs 3 channels, i.e. an RGB image. (A minimal sketch of this module stack is given at the end of this summary.)

https://i.imgur.com/M3ucgwI.png

This procedure generates high-resolution images ($1024 \times 2048$ on the Cityscapes dataset) while maintaining global coordination in the image in a coarse-to-fine manner. For example, if the model generates the left rear light of a car in red, the right rear light should look similar. The global structure can be specified at low resolution, where the corresponding features are spatially close, and is then maintained while the resolution of the maps increases.

Creating photographic images from a semantic image is a one-to-many mapping: the model can produce many plausible outputs that are all correct. Pixel-wise comparison of the generated image with the ground-truth (GT) image from the training set can therefore produce high errors even for good outputs; for example, if the model paints a car black instead of white, the pixel error is very high although the output is still correct. The authors therefore define the loss by comparing features of a pre-trained VGG network:

https://i.imgur.com/gIflZLM.png

where $l$ indexes the layers of the pre-trained VGG model, $\lambda_l$ is the corresponding layer weight, and $\Phi_l(I)$ and $\Phi_l(g(L;\theta))$ are the features of the GT image and of the generated image, respectively. The following image shows samples from this model:

https://i.imgur.com/coxsdbU.png

In order to generate more diverse images, another variant of the model is proposed in which the final layer outputs $3k$ channels ($k$ tuples of RGB images) instead of $3$. The model then optimizes the following loss:

https://i.imgur.com/wVQwufn.png

where, for each class label $p$, the image $u$ among the $k$ generated images that yields the smallest error is selected. The rest of the loss is similar to Eq. (1), except that it is computed per feature map $j$ and the feature difference is multiplied (Hadamard product) by $L_p^l$, a binary (0/1) mask at the resolution of feature map $\Phi_l$ that indicates where class $p$ is present. In summary, this loss takes the best synthesized image for each class $p$ and penalizes only the feature locations corresponding to class $p$. The following image shows two different samples for the same input:

https://i.imgur.com/TFPWLxa.png
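To make the feature-matching loss (Eq. 1) and the diversity loss above concrete, here is a minimal PyTorch-style sketch (my own, not the authors' code). The `vgg_features` callable, the choice of an L1 distance with mean reduction, and taking the min over candidates after averaging over the batch are simplifying assumptions; nearest-neighbour downsampling is used to bring the class masks to each feature resolution.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(vgg_features, fake_img, real_img, layer_weights):
    """Eq. (1)-style loss: weighted distance between VGG features of the
    ground-truth image and the generated image."""
    feats_fake = vgg_features(fake_img)          # list of [B, C_l, H_l, W_l]
    feats_real = vgg_features(real_img)
    loss = 0.0
    for lam, pf, pr in zip(layer_weights, feats_fake, feats_real):
        loss = loss + lam * (pf - pr).abs().mean()
    return loss

def diversity_loss(vgg_features, fake_imgs, real_img, class_masks, layer_weights):
    """Diversity-loss sketch: for each semantic class p, keep only the best of
    the k generated images, and penalize features only where class p is present.
    fake_imgs:   list of k tensors [B, 3, H, W] (the 3k-channel output, split)
    class_masks: [B, P, H, W] binary semantic layout, one channel per class."""
    feats_real = vgg_features(real_img)
    feats_fake = [vgg_features(img) for img in fake_imgs]        # k x layers
    total = 0.0
    for p in range(class_masks.shape[1]):
        per_candidate = []
        for feats_u in feats_fake:                               # candidate u
            err_u = 0.0
            for lam, pf, pr in zip(layer_weights, feats_u, feats_real):
                # bring the 0/1 mask of class p to this layer's resolution
                m = F.interpolate(class_masks[:, p:p + 1],
                                  size=pf.shape[-2:], mode='nearest')
                err_u = err_u + lam * (m * (pf - pr)).abs().mean()
            per_candidate.append(err_u)
        total = total + torch.stack(per_candidate).min()         # min over u
    return total
```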
The model (referred to as CRN) is evaluated through pairwise comparisons on Mechanical Turk against the following baselines:

- $\textbf{GAN and semantic segmentation:}$ a model trained with a GAN loss plus a semantic segmentation loss on the generated photographic images.
- $\textbf{Image-to-image translation:}$ a conditional GAN using the image-to-image translation network.
- $\textbf{Encoder-decoder:}$ the CRN loss, but with the architecture replaced by a U-Net or Recombinator Networks architecture (an encoder-decoder with skip connections).
- $\textbf{Full-resolution network:}$ the CRN loss, but with a full-resolution network, i.e. a model that maintains the full resolution from input to output.
- $\textbf{Image-space loss:}$ the CRN architecture, but with the loss computed directly on RGB values rather than on VGG features.

The first two baselines use both a different loss and a different architecture; the encoder-decoder and full-resolution baselines use the same loss as CRN with a different architecture, and the image-space baseline uses the CRN architecture with a different loss. The Mechanical Turk workers rate the samples of CRN with its proposed loss as more realistic than those of all other approaches.

Although the paper compares against a model that uses a GAN loss and/or a semantic segmentation loss, it would have been better to also try these losses on the CRN architecture itself to better evaluate their impact. The paper also shows very few of the diverse samples generated by the model (only two are shown); more samples would better demonstrate the effectiveness of the approach in generating diverse outputs (i.e. the impact of Eq. 3). In general, I like the coarse-to-fine modular resolution increase and find the proposed loss and architecture effective.
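For reference, here is the minimal sketch of the coarse-to-fine module stack mentioned at the top of this summary (my own simplification, not the authors' code: the channel widths, the two-convolution module body, the upsampling modes, and the output nonlinearity are assumptions, and the real model uses more modules to reach $1024 \times 2048$).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    """One refinement module: keeps the resolution of its input."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.body(x)

class CRNSketch(nn.Module):
    def __init__(self, num_classes, widths=(512, 512, 256, 128, 64)):
        super().__init__()
        blocks = []
        for i, w in enumerate(widths):
            in_ch = num_classes if i == 0 else num_classes + widths[i - 1]
            blocks.append(RefinementModule(in_ch, w))
        self.blocks = nn.ModuleList(blocks)
        self.to_rgb = nn.Conv2d(widths[-1], 3, 1)  # 3*k channels in the diverse variant

    def forward(self, layout, base=(4, 8)):
        feat, (h, w) = None, base
        for block in self.blocks:
            # semantic layout downsampled to this module's working resolution
            lay = F.interpolate(layout, size=(h, w), mode='nearest')
            if feat is None:                       # first module: layout only
                inp = lay
            else:                                  # later modules: upsampled features + layout
                up = F.interpolate(feat, size=(h, w), mode='bilinear',
                                   align_corners=False)
                inp = torch.cat([up, lay], dim=1)
            feat = block(inp)
            h, w = h * 2, w * 2                    # next module works at double resolution
        return torch.tanh(self.to_rgb(feat))
```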
[link]
This paper introduces Triangle-GAN ($\triangle$-GAN), which aims at cross-domain joint distribution matching. The model is shown below:

https://i.imgur.com/boIDOMu.png

Given two domains of data, $x$ and $y$, there are two generators:

- $G_x(y)$, which takes $y$ and generates $\tilde{x}$, and
- $G_y(x)$, which takes $x$ and generates $\tilde{y}$.

There are also two discriminators:

- $D_1(x, y)$, which distinguishes the true pair $(x, y)$ from either $(x, \tilde{y})$ or $(\tilde{x}, y)$, and
- $D_2(x, y)$, which distinguishes $(x, \tilde{y})$ from $(\tilde{x}, y)$.

The second discriminator corresponds to ALI and can be trained on unpaired data. The first discriminator is equivalent to a conditional discriminator, comparing the true paired data $(x, y)$ to either $(x, \tilde{y})$ or $(\tilde{x}, y)$, where one element of the pair is generated; it therefore needs paired $(x, y)$ data for training. (A minimal sketch of the two discriminator objectives is given at the end of this summary.) The model can thus be used in semi-supervised settings where only a small set of paired data is provided. In this paper it is used for:

- semi-supervised image classification, where a small subset of CIFAR10 is labelled; $x$ and $y$ are images and class labels here;
- image-to-image translation on the edges2shoes dataset, where only a subset of the dataset is paired;
- attribute-conditional image generation, where the $x$ and $y$ domains are images and attributes; the CelebA and COCO datasets are used here. In one experiment, test-set images are mapped to attributes, and given those attributes new images are generated.

On CelebA: https://i.imgur.com/EX5tDZ0.png

On COCO: https://i.imgur.com/GRpvjGx.png

In another experiment, some attributes are chosen (samples shown in the first row below, with different noise) and then another attribute is added (using the same noise) to generate the samples in the second row:

https://i.imgur.com/KeHL8Ye.png

Triangle-GAN demonstrates improved performance over Triple-GAN in the experiments shown in the paper. It is also compared with DiscoGAN (a model that can be trained on unpaired data) and shows improved performance when some percentage of paired data is provided. In one experiment, each MNIST digit is paired with its transpose (as $x$, $y$ pairs); DiscoGAN cannot learn the correct mapping between them, while Triangle-GAN can, since it leverages the paired data.

https://i.imgur.com/Vz9Zfhu.png

In general, this model is a useful approach for semi-supervised cross-domain matching that can leverage unpaired data (through ALI) as well as paired data (through the conditional discriminator).
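Here is the minimal sketch of the two discriminator objectives referred to above (my own reading of the summary, not the authors' code). `d1`, `d2`, `g_x`, and `g_y` are assumed modules, and the standard BCE formulation as well as the 50/50 split of $D_1$'s fake pairs are assumptions.

```python
import torch
import torch.nn.functional as F

def bce(logits, target_value):
    # binary cross-entropy against a constant real/fake target
    target = torch.full_like(logits, target_value)
    return F.binary_cross_entropy_with_logits(logits, target)

def discriminator_losses(d1, d2, g_x, g_y, x_pair, y_pair, x_unpair, y_unpair):
    # Fake pairs (detached: this function updates the discriminators only)
    y_tilde_p = g_y(x_pair).detach()     # (x, y~) built from paired x
    x_tilde_p = g_x(y_pair).detach()     # (x~, y) built from paired y
    y_tilde_u = g_y(x_unpair).detach()
    x_tilde_u = g_x(y_unpair).detach()

    # D1: true pair (x, y) vs. pairs with one generated element (needs pairs)
    loss_d1 = bce(d1(x_pair, y_pair), 1.0) \
            + 0.5 * bce(d1(x_pair, y_tilde_p), 0.0) \
            + 0.5 * bce(d1(x_tilde_p, y_pair), 0.0)

    # D2 (ALI-style): (x, y~) vs. (x~, y) -- usable on unpaired data
    loss_d2 = bce(d2(x_unpair, y_tilde_u), 1.0) + bce(d2(x_tilde_u, y_unpair), 0.0)

    # The generators are trained with the opposite objective, i.e. to fool
    # both discriminators (not shown here).
    return loss_d1 + loss_d2
```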
[link]
This paper aims at changing the attributes of a face without manipulating other aspects of the image, e.g. adding/removing glasses, making a person look younger/older, or changing the gender; hence the name Fader Networks, after the sliders (faders) of audio mixing tools that can change a value linearly to increase/decrease a feature. The model is shown below:

https://i.imgur.com/fntPmNu.png

An image $x$ is passed to the encoder, and the encoder output $E(x)$ is passed to a discriminator that tries to predict whether the attribute $y$ is present in this latent representation. The encoded features $E(x)$ and the attribute $y$ are passed to the decoder to reconstruct the image $D(E(x), y)$. The auto-encoder therefore has two losses (a minimal sketch of both is given at the end of this summary):

1. the reconstruction loss between $x$ and $D(E(x), y)$, and
2. the adversarial loss that tries to fool the discriminator about the attribute $y$ in the encoded space $E(x)$.

The discriminator tries to predict the attribute $y$ from the encoded space $E(x)$, while the encoder tries to fool it. This process removes the attribute $y$ from $E(x)$: the encoded features end up carrying no information about $y$. However, since the decoder needs to reconstruct the input image, $E(x)$ has to keep all other information, and the decoder has to take the attribute $y$ from its own input. The model is trained on binary attributes $y$ such as male/female, young/old, glasses yes/no, mouth open yes/no, and eyes open yes/no (some test-set samples below):

https://i.imgur.com/bj9wu6B.png

At test time, the attribute values can be changed continuously, showing smooth transitions:

https://i.imgur.com/XUD3ZTu.png

The performance of the model is measured with Mechanical Turk workers on two metrics: the naturalness of the images and the accuracy of swapping attributes in the image. On both, FadNet shows better results than IcGAN. FadNet achieves very good accuracy, but its naturalness score drops when some attributes are swapped. On the Flowers dataset, FadNet can change the color of the flowers:

https://i.imgur.com/7nvBSEY.png

I find the following aspects of FadNet positive:

1. It can change some attributes while maintaining other aspects of the image, such as the identity of the person and the background.
2. The model does not need paired data. In some cases gathering paired data is impossible (e.g. male/female) or very difficult (young/old).
3. The adversarial loss is used to remove an attribute from the latent space, and that attribute can later be specified for reconstruction by the decoder. Since the GAN is applied in the latent space, it can also be used to remove attributes from discrete data, where applying a discriminator directly to the data is not trivial.

I think these aspects need further work:

- When multiple attributes are changed, blurriness shows up in the image: https://i.imgur.com/LD5cVbg.png When only one attribute changes, the blurriness is much less noticeable, despite the L2 reconstruction loss of the auto-encoder. I guess the high resolution of $256 \times 256$ also helps make the blurriness less noticeable.
- The model first has to be trained as a pure auto-encoder (no adversarial loss), with the adversarial weight then increased linearly to remove the attribute, so training it properly requires some care.

Overall, I find it an interesting paper on how to change one attribute of an image while keeping the other attributes unchanged.
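Here is the minimal sketch of the two losses referred to above (my own simplification, not the authors' code). `enc`, `dec`, and `latent_disc` are assumed modules, $y$ is a binary attribute given as a float tensor with the same shape as the discriminator logits, and `lambda_adv` is the adversarial weight that is increased linearly during training.

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(enc, dec, latent_disc, x, y, lambda_adv):
    z = enc(x)                                   # E(x): attribute-invariant code
    x_rec = dec(z, y)                            # D(E(x), y)
    recon = F.mse_loss(x_rec, x)                 # 1) reconstruction loss
    # 2) adversarial loss: the encoder tries to make the latent discriminator
    #    predict the *wrong* attribute (1 - y for a binary attribute)
    logits = latent_disc(z)
    adv = F.binary_cross_entropy_with_logits(logits, 1.0 - y)
    return recon + lambda_adv * adv

def latent_discriminator_loss(enc, latent_disc, x, y):
    with torch.no_grad():
        z = enc(x)                               # do not backprop into the encoder
    logits = latent_disc(z)
    return F.binary_cross_entropy_with_logits(logits, y)
```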
[link]
This paper merges a GAN and a VAE to improve pose estimation on depth images of hands. It uses paired data, where both the depth image ($x$) and the pose ($y$) are provided, and combines it with unlabelled data where only the depth image ($x$) is given. The model is shown below:

https://i.imgur.com/BvjZekU.png

The VAE takes $y$, projects it to a latent space ($z_y$) with its encoder, and reconstructs it back to $\bar y$. ALI is used to map between the latent space of the VAE, $z_y$, and the latent space of the GAN, $z_x$. The depth-image synthesizer takes $z_x$ and generates a depth image $\bar x$. The discriminator does three tasks:

1. $L_{gan}$: distinguishing between true ($x$) and generated ($\bar x$) samples.
2. $L_{pos}$: predicting the pose of the true depth image $x$.
3. $L_{smo}$: a smoothness loss that enforces the difference between two latent codes in the generator to match the difference predicted by the discriminator (see below for details).

$\textbf{Here is how the data flows and the losses are defined:}$ Given a pair of labelled data $(x, y)$, the pose $y$ is projected to the latent code $z_y$ and then projected back to the estimated pose $\bar y$; the VAE defines a reconstruction loss $L_{recons}$ on the pose. Using ALI, the latent variable $z_y$ is mapped to $z_x$ and the depth image is generated as $\bar{x} = Gen(Ali(z_y))$; a reconstruction loss $d_{self}$ is defined between $x$ and $\bar{x}$. A random sample $\hat{z}_y$ is drawn from the pose latent space and mapped to a depth map $\hat{x} = Gen(Ali(\hat{z}_y))$. The discriminator then takes $x$ and $\hat{x}$: it estimates the pose of $x$ ($L_{pos}$), it distinguishes between $x$ and $\hat{x}$ ($L_{gan}$), and it measures the latent-space difference between $x$ and $\hat{x}$, $smo(x, \hat x)$, which should match the distance between $z_y$ and $\hat{z}_y$. The smoothness loss is therefore

$L_{smo} = \| smo(x, \hat x) - (z_y - \hat{z}_y) \|^2 + d_{self}$

The VAE and the depth-image synthesizer together can be considered the generator of the network, and the total losses can be written as

$L_G = L_{recons} + L_{smo} - L_{gan}$

$L_D = L_{pos} + L_{smo} - L_{gan}$

The generator loss contains the pose reconstruction, the smoothness loss, and the GAN loss on generated depth maps; the discriminator loss contains the pose estimation loss, the smoothness loss, and the GAN loss for distinguishing fake from real depth images. (A sketch of how these terms fit together is given at the end of this summary.) Note that all terms except the GAN loss need paired data, so the unlabelled data can be used only for the GAN loss. The unlabelled data still trains the lower layers of the discriminator (shared with pose estimation) and the image-synthesis part of the generator, but for pose estimation itself (the final target of the paper), for training the VAE, and for mapping between the VAE and the GAN with ALI, labelled data is required. Also note that $L_{smo}$ trains both generator and discriminator parameters.

In terms of performance, the model improves the results when only part of the data is labelled; with fully labelled data it shows improvements or comparable results w.r.t. previous models. I find the strongest aspect of the paper to be the semi-supervised setting, where only a small portion of labelled data is provided. However, due to the way the parameters are tied together, the model still needs some labelled data to be trained completely.
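Here is a sketch of how the loss terms above fit together, following my reading of the summary rather than the authors' code. `pose_enc`/`pose_dec` form the pose VAE, `ali` maps the pose latent $z_y$ to the image latent $z_x$, `gen` is the depth-image synthesizer, and `disc` is assumed to return a real/fake logit, a pose estimate, and a latent embedding of the same dimensionality as $z_y$; the VAE's KL term and all weighting factors are omitted.

```python
import torch
import torch.nn.functional as F

def compute_loss_terms(pose_enc, pose_dec, ali, gen, disc, x, y):
    # Pose VAE path (labelled data): y -> z_y -> y_bar
    z_y = pose_enc(y)
    y_bar = pose_dec(z_y)
    l_recons = F.mse_loss(y_bar, y)

    # Reconstruct the paired depth image from the pose latent via ALI + generator
    x_bar = gen(ali(z_y))
    d_self = F.mse_loss(x_bar, x)

    # Sample a second pose latent and synthesize a depth map from it
    z_y_hat = torch.randn_like(z_y)
    x_hat = gen(ali(z_y_hat))

    # Discriminator heads: real/fake logit, pose estimate, latent embedding
    real_logit, pose_real, emb_real = disc(x)
    fake_logit, _,         emb_fake = disc(x_hat)

    l_pos = F.mse_loss(pose_real, y)
    l_gan = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) \
          + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))

    # Smoothness loss: the discriminator's latent difference between x and
    # x_hat should match the difference of the pose latents that produced them
    l_smo = ((emb_real - emb_fake) - (z_y - z_y_hat)).pow(2).mean() + d_self

    # Combined as L_G = L_recons + L_smo - L_gan and
    #             L_D = L_pos   + L_smo - L_gan  (per the equations above)
    return {'recons': l_recons, 'smo': l_smo, 'gan': l_gan, 'pos': l_pos}
```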
[link]
This paper takes a face image and rotates it to any desired pose by passing the target pose as an additional input to the model.

https://i.imgur.com/AGNOag5.png

They use a GAN (named DR-GAN) for face rotation. The generator has an encoder and a decoder. The encoder takes the image and produces a high-level feature representation. The decoder takes the high-level features, the target pose, and some noise, and generates the output image with the rotated face. The generated image is passed to a discriminator that decides whether the image is real or fake. The discriminator has two additional outputs: 1) it estimates the pose of the generated image, and 2) it estimates the identity of the person. No direct loss is applied to the generator; it is trained by the gradients it receives through the discriminator to minimize three objectives: 1) the GAN loss (to fool the discriminator), 2) the pose estimation loss, and 3) the identity estimation loss. (A minimal sketch of these three objectives is given at the end of this summary.)

They use two tricks to improve the model: 1) sharing parameters between the generator's encoder (gen-enc) and the discriminator (they observe this helps identity recognition), and 2) passing two images through gen-enc, interpolating between their high-level features (the gen-enc output), and applying two losses to the result, a GAN loss and a pose loss, again through the discriminator as above. The first trick improves gen-enc and the second improves gen-dec; both help identification. The model can also leverage multiple images of the same identity, if the dataset provides them, to obtain a better latent representation of that identity in gen-enc.

https://i.imgur.com/23Tckqc.png

Some samples of face frontalization:

https://i.imgur.com/zmCODXe.png

and some samples of interpolating different features in the latent space: (sub-fig a) interpolating the features $f(x)$ between two images, (sub-fig b) interpolating the pose, (sub-fig c) interpolating the noise:

https://i.imgur.com/KlkVyp9.png

I find these aspects of the paper positive: 1) face rotation is applied to images in the wild, 2) paired data is not required, 3) multiple source images of the same identity can be used when available, 4) identity and pose are used cleverly in the discriminator to guide the generator, and 5) the model can specify the target pose (it is not limited to face frontalization).

Negative aspects: 1) the faces have many artifacts, similar to those of some other GAN models, and 2) the identity is not well preserved, and the faces sometimes look distorted compared to the original person.

They evaluate the model's performance on identity recognition and face rotation and demonstrate compelling results.
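Here is the minimal sketch of the three objectives that guide the generator through the discriminator (my own simplification, not the authors' code). `enc`/`dec` form the generator and `disc` is assumed to return a real/fake logit plus identity and pose logits; the separate binary real/fake head and the plain cross-entropy losses are assumptions, and `pose_code`/`pose_label` denote the target pose as a code for the decoder and as a class index for the loss.

```python
import torch
import torch.nn.functional as F

def generator_loss(enc, dec, disc, x, pose_code, pose_label, id_label, noise):
    f = enc(x)                                   # identity-preserving features
    x_rot = dec(f, pose_code, noise)             # face rendered at the target pose
    rf_logit, id_logits, pose_logits = disc(x_rot)
    # No direct pixel loss on the generator: it is trained only through the
    # discriminator's three heads.
    adv  = F.binary_cross_entropy_with_logits(rf_logit, torch.ones_like(rf_logit))
    idc  = F.cross_entropy(id_logits, id_label)        # preserve the identity
    pose = F.cross_entropy(pose_logits, pose_label)    # reach the target pose
    return adv + idc + pose

def discriminator_loss(disc, x_real, x_fake, id_label, pose_label):
    rf_r, id_r, pose_r = disc(x_real)
    rf_f, _, _ = disc(x_fake.detach())
    adv = F.binary_cross_entropy_with_logits(rf_r, torch.ones_like(rf_r)) \
        + F.binary_cross_entropy_with_logits(rf_f, torch.zeros_like(rf_f))
    return adv + F.cross_entropy(id_r, id_label) + F.cross_entropy(pose_r, pose_label)
```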