ICCV is the premier international Computer Vision event comprising the main ICCV conference and several co-located workshops and short courses. With its high quality and low cost, it provides an exceptional value for students, academics and industry researchers.

Least Squares Generative Adversarial Networks

Xudong Mao and Qing Li and Haoran Xie and Raymond Y. K. Lau and Zhen Wang and Stephen Paul Smolley

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

**First published:** 2016/11/13 (7 years ago)

**Abstract:** Unsupervised learning with generative adversarial networks (GANs) has proven
hugely successful. Regular GANs hypothesize the discriminator as a classifier
with the sigmoid cross entropy loss function. However, we found that this loss
function may lead to the vanishing gradients problem during the learning
process. To overcome such a problem, we propose in this paper the Least Squares
Generative Adversarial Networks (LSGANs) which adopt the least squares loss
function for the discriminator. We show that minimizing the objective function
of LSGAN yields minimizing the Pearson $\chi^2$ divergence. There are two
benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher
quality images than regular GANs. Second, LSGANs perform more stable during the
learning process. We evaluate LSGANs on five scene datasets and the
experimental results show that the images generated by LSGANs are of better
quality than the ones generated by regular GANs. We also conduct two comparison
experiments between LSGANs and regular GANs to illustrate the stability of
LSGANs.


Generative Adversarial Networks (GANs) are an exciting technique, a kernel of an effective concept that has been shown to overcome many of the problems of previous generative models: particularly the fuzziness of VAEs. But, as I’ve mentioned before, and as you’ve doubtless read if you’ve read any material on the topic, they’re finicky things, difficult to train in a stable way, and particularly difficult to keep from devolving into mode collapse. Mode collapse is a phenomenon where, at each iteration, the generator places all of its mass on one single output or dense cluster of outputs, instead of representing the full distribution of output space, the way we’d like it to.

One proposed solution is the one I discussed yesterday: explicitly optimizing the generator according to not only what the discriminator thinks about its current allocation of probability, but what the discriminator’s next move will be (thus incentivizing the generator not to take indefensible strategies like “put all your mass in one location the discriminator can push down next round”). An orthogonal approach is the one described in LSGANs: change the objective function of the network away from sigmoid cross-entropy, and instead to a least squares loss. While I don’t have the LaTeX capabilities to walk through the exact mathematics in this format, what this means on a conceptual level is that instead of incentivizing the generator to put all of its mass on places the discriminator is sure are “true data” regions, we’re incentivizing the generator to put mass right on the true/fake data decision boundary.

Likely this doesn’t make very much sense yet (it didn’t for me, at this point in reading). Occasionally, delving deeper into the math and theory behind an idea provides you rigor, but without much intuition. I found the opposite to be true in this case, where learning more (for the first time!) about f-divergences actually made this method make more sense. So, bear with me, and hopefully trust me not to take you too deep into the weeds without a good reason.

On a theoretical level, this paper’s loss function means that you end up minimizing a chi-squared divergence between the distributions, instead of a KL divergence. F-divergences are a family of quantities that measure how different two distributions are from one another, and do so by taking an average of the density q, weighted at each point by f, which is some function of the ratio of densities, p/q. (You could also think of this as an average of the function f, weighted by the density q; they’re equivalent statements.) For the KL divergence, this function is x*log(x). For chi-squared, it’s (x-1)^2.

All of this starts to coalesce into meaning with the information that the behavior of a typical GAN looks like minimizing the divergence FROM the generator’s probability mass TO the data’s probability mass. That means we take the ratio of how much mass the generator puts somewhere to how much mass the data has there, and plug it into the x*log(x) function seen here: https://i.imgur.com/BYRfi0u.png

Now, look how much the function value spikes when that ratio goes over 1. Intuitively, this means we heavily punish the generator when it puts mass in a place that’s unrealistic, i.e. where there isn’t representation from the data distribution. But - and this is the important thing - we don’t symmetrically punish it when its mass at a point is far *lower* than the mass the real data puts there; that is, when the ratio is much smaller than one. This means we don’t have a way of punishing mode collapse, the scenario where the generator puts all of its mass on one of the modes of the data; we don’t do a good job of pushing the generator to have mass everywhere the data has mass.

By contrast, the chi-squared divergence pushes the ratio of (generator/data) toward 1 *from both directions*. So, if there’s more generator mass than data mass somewhere, that’s bad, but it’s also bad for there to be more data mass than generator mass. This gives the network a stronger incentive not to learn mode-collapsed solutions.
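To make the argument concrete, here is a small numpy sketch (my own, not the authors’ code) of the LSGAN least squares losses under one common 0/1 target coding, plus the two f-divergence generator functions compared above:

```python
import numpy as np

def d_loss_lsgan(d_real, d_fake):
    # Discriminator: push outputs on real data toward 1, on fakes toward 0.
    return 0.5 * np.mean((d_real - 1) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss_lsgan(d_fake):
    # Generator: push the discriminator's output on fakes toward 1.
    return 0.5 * np.mean((d_fake - 1) ** 2)

def f_kl(x):
    # Generator function behind the KL-style divergence: f(x) = x * log(x)
    return x * np.log(x)

def f_chi2(x):
    # Generator function behind the Pearson chi-squared divergence: (x - 1)^2
    return (x - 1) ** 2

# r = (generator density) / (data density) at a point.
# r >> 1: generator mass where the data has none; r << 1: mode dropping.
for r in [0.01, 0.5, 1.0, 2.0, 5.0]:
    print(f"r={r:>4}: x*log(x)={f_kl(r):+7.3f}   (x-1)^2={f_chi2(r):+7.3f}")
# x*log(x) spikes for r > 1 but barely penalizes r << 1;
# (x-1)^2 penalizes deviations from r = 1 in both directions.
```

Printing the table makes the asymmetry of x*log(x) (and the symmetry of (x-1)^2 around 1) visible at a glance, which is exactly the mode-collapse argument above.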

Rotation equivariant vector field networks

Diego Marcos and Michele Volpi and Nikos Komodakis and Devis Tuia

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

**First published:** 2016/12/29 (7 years ago)

**Abstract:** We propose a method to encode rotation equivariance or invariance into
convolutional neural networks (CNNs). Each convolutional filter is applied with
several orientations and returns a vector field that represents the magnitude
and angle of the highest scoring rotation at the given spatial location. To
propagate information about the main orientation of the different features to
each layer in the network, we propose an enriched orientation pooling, i.e. max
and argmax operators over the orientation space, allowing to keep the
dimensionality of the feature maps low and to propagate only useful
information. We name this approach RotEqNet. We apply RotEqNet to three
datasets: first, a rotation invariant classification problem, the MNIST-rot
benchmark, in which we improve over the state-of-the-art results. Then, a
neuron membrane segmentation benchmark, where we show that RotEqNet can be
applied successfully to obtain equivariance to rotation with a simple fully
convolutional architecture. Finally, we improve significantly the
state-of-the-art on the problem of estimating cars' absolute orientation in
aerial images, a problem where the output is required to be covariant with
respect to the object's orientation.


This work deals with rotation-equivariant convolutional filters. The idea is that when you rotate an image you should not need to relearn new filters to deal with the rotation. First we can look at how convolutions typically handle rotation, and how we would expect a rotation-invariant solution to perform:

| | |
| - | - |
| https://i.imgur.com/cirTi4S.png | https://i.imgur.com/iGpUZDC.png |

The method computes all possible rotations of the filter, which results in a list of activations where each element represents a different rotation. From this list the maximum is taken, which results in a two-dimensional output for every pixel (rotation, magnitude). This happens at the pixel level, so the result is a vector field over the image. https://i.imgur.com/BcnuI1d.png

We can visualize their degree-selection method with a figure from https://arxiv.org/abs/1603.04392, which determined the rotation of a building: https://i.imgur.com/hPI8J6y.png

We can also think of this approach as attention (https://arxiv.org/abs/1409.0473), where they attend over the possible rotations to obtain a score for each possible rotation value to pass on. The network can learn to adjust the rotation value to be whatever value the later layers will need.

------------------------

Results on [Rotated MNIST](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations) show an impressive improvement in training speed and generalization error: https://i.imgur.com/YO3poOO.png
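The rotate-then-max-pool idea can be sketched in a few lines of numpy. This is a toy illustration, not RotEqNet itself: the paper applies filters at many fine-grained orientations, while here I only use the four 90-degree rotations that `np.rot90` gives for free:

```python
import numpy as np

def roteq_conv(image, filt, n_rot=4):
    """Correlate one filter at n_rot orientations (90-degree steps here) and
    keep, per output pixel, the magnitude and angle of the best orientation."""
    fh, fw = filt.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    scores = np.zeros((n_rot, oh, ow))
    for r in range(n_rot):
        rf = np.rot90(filt, k=r)  # rotated copy of the same learned filter
        for i in range(oh):
            for j in range(ow):
                scores[r, i, j] = np.sum(image[i:i + fh, j:j + fw] * rf)
    magnitude = scores.max(axis=0)                   # max over orientations
    angle = scores.argmax(axis=0) * (360.0 / n_rot)  # argmax over orientations
    return magnitude, angle  # together: a 2D vector field over the image

# A horizontal-edge filter fires at 0 degrees on a horizontal edge...
edge = np.array([[1.0, 1.0], [-1.0, -1.0]])
img = np.zeros((4, 4)); img[:2, :] = 1.0  # bright top half
mag, ang = roteq_conv(img, edge)
# ...and the very same filter reports 90 degrees on the transposed (vertical) edge.
mag_t, ang_t = roteq_conv(img.T, edge)
```

The (magnitude, angle) pair is what the summary calls the per-pixel vector field: later layers receive both how strongly a feature fired and at which orientation.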

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

Zhang, Han and Xu, Tao and Li, Hongsheng and Zhang, Shaoting and Huang, Xiaolei and Wang, Xiaogang and Metaxas, Dimitris N.

arXiv e-Print archive - 2016 via Local Bibsonomy

Keywords: dblp


Problem
------------
Text to image

Contributions
-----------------
* Images are more photo-realistic and higher resolution than previous methods
* Stacked generative model

Approach
-------------
2-stage process:
1. Text-to-image: generates a low-resolution image with primitive shape and color.
2. Low-to-high-res: using the low-res image and the text, generates a high-resolution image, adding details and sharpening the edges.

https://pbs.twimg.com/media/Cziw6bfWgAAh3Yg.jpg

Datasets
--------------
* CUB - Birds
* Oxford-102 - Flowers

Results
--------
https://cdn-images-1.medium.com/max/1012/1*sIphVx4tqaXJxtnZNt3JWA.png

Criticism / Questions
-------------------
* Is it possible the resulting images are replicas of images in the original dataset? To what extent does the model "hallucinate" new images?
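The two-stage data flow can be sketched at the shape level. Everything below is a hypothetical placeholder (function names, embedding sizes, and resolutions are illustrative assumptions, not the authors’ code); the point is only how Stage-II consumes both the Stage-I output and the text again:

```python
import numpy as np

rng = np.random.default_rng(0)

def stage1_generator(text_embedding, z):
    # A real Stage-I conditions an upsampling generator on text + noise;
    # here we just return a correctly shaped placeholder 64x64 RGB image.
    return rng.random((64, 64, 3))

def stage2_generator(low_res, text_embedding):
    # A real Stage-II is a conditional GAN that re-reads the text to add
    # detail; here we nearest-neighbor upsample 4x as a placeholder.
    return low_res.repeat(4, axis=0).repeat(4, axis=1)

text_embedding = rng.random(1024)  # stand-in for a sentence embedding
z = rng.random(100)                # noise vector
low = stage1_generator(text_embedding, z)      # coarse shape and color
high = stage2_generator(low, text_embedding)   # refined high-res output
```

Note the design choice the summary highlights: the text conditioning enters twice, so Stage-II can correct details Stage-I missed rather than merely sharpening pixels.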
