[link]
Generative Adversarial Networks (GANs) are an exciting technique, a kernel of an effective concept that has been shown to overcome many of the problems of previous generative models: particularly the fuzziness of VAEs. But, as I've mentioned before, and as you've doubtless read if you've read any material about the topic, they're finicky things, difficult to train in a stable way, and particularly difficult to keep from devolving into mode collapse. Mode collapse is a phenomenon where, at each iteration, the generator places all of its mass on one single output, or a dense cluster of outputs, instead of representing the full distribution of output space the way we'd like it to.

One proposed solution is the one I discussed yesterday: explicitly optimizing the generator according not only to what the discriminator thinks about its current allocation of probability, but to what the discriminator's next move will be (thus incentivizing the generator not to take indefensible strategies like "put all your mass in one location the discriminator can push down next round"). An orthogonal approach is the one described in LSGANs: change the network's objective function away from sigmoid cross-entropy and to a least squares loss instead. While I don't have the latex capabilities to walk through the exact mathematics in this format, what this means on a conceptual level is that instead of incentivizing the generator to put all of its mass on places that the discriminator is sure are "true data" regions, we're incentivizing the generator to put mass right on the true/fake data decision boundary.

Likely this doesn't make very much sense yet (it didn't for me, at this point in reading). Occasionally, delving deeper into the math and theory behind an idea provides you rigor, but without much intuition. I found the opposite to be true in this case, where learning more (for the first time!) about f-divergences actually made this method make more sense. So, bear with me, and hopefully trust me not to take you too deep into the weeds without a good reason.

On a theoretical level, this paper's loss function means that you end up minimizing a chi-squared divergence between the distributions, instead of a KL divergence. An f-divergence is a quantity that measures how different two distributions are from one another, and it does that by taking an average, weighted by the density q, of f(p/q), where f is some function of the ratio of the two densities. For the KL divergence, this function is x*log(x). For chi-squared, it's (x-1)^2.

All of this starts to coalesce into meaning with the information that the behavior of a typical GAN looks like minimizing the divergence FROM the generator's probability mass TO the data's probability mass. That means that we take the ratio of how much mass the generator puts somewhere to how much mass the data has there, and we plug it into the x*log(x) function seen below.

https://i.imgur.com/BYRfi0u.png

Now, look how much the function value spikes when that ratio goes over 1. Intuitively, what this means is that we heavily punish the generator when it puts mass in a place that's unrealistic, i.e. where there isn't representation from the data distribution.
But - and this is the important thing - we don't symmetrically punish it when its mass at a point is far lower than the mass put there by the real data, i.e. when the ratio is much smaller than one. This means that we don't have a way of punishing mode collapse, the scenario where the generator puts all of its mass on one of the modes of the data; we don't do a good job of pushing the generator to have mass everywhere that the data has mass. By contrast, the chi-squared divergence pushes the ratio of (generator/data) to be equal to 1 *from both directions*. So, if there's more generator mass than data mass somewhere, that's bad, but it's also bad for there to be more data mass than generator mass. This gives the network a stronger incentive not to learn mode collapsed solutions.
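To make that asymmetry concrete, here is a tiny numeric check of the two f functions mentioned above (my own illustration, not something from the paper; the sample ratio values are arbitrary):

```python
import numpy as np

def f_kl(x):
    # f(x) = x * log(x), the generator function of the (reverse-direction) KL
    # divergence; 0 * log(0) is taken as 0, its limit.
    x = np.asarray(x, dtype=float)
    return x * np.log(np.where(x > 0, x, 1.0))

def f_chi2(x):
    # f(x) = (x - 1)^2, the generator function of the Pearson chi-squared divergence.
    x = np.asarray(x, dtype=float)
    return (x - 1.0) ** 2

# Ratio of generator mass to data mass at some point in output space.
ratios = np.array([0.0, 0.01, 0.1, 1.0, 10.0, 100.0])

print("ratio  :", ratios)
print("x*logx :", np.round(f_kl(ratios), 3))    # ~0 (even slightly negative) for ratios below 1
print("(x-1)^2:", np.round(f_chi2(ratios), 3))  # nonzero penalty on BOTH sides of 1
```

The thing to notice is the left side of 1: x*log(x) goes to zero as the ratio goes to zero, so the generator pays essentially nothing for ignoring a region the data occupies, while (x-1)^2 still charges a penalty there. (The x*log(x) values dip slightly negative for ratios below 1, but the full divergence, averaged under the data density, is still non-negative.)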
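And, connecting this back to the actual training objective, here is a minimal NumPy sketch contrasting the usual sigmoid cross-entropy losses with the least squares ones, assuming a discriminator that outputs a raw scalar score, and using targets of 1 for real and 0 for fake (one of the labeling schemes discussed in the paper; the placeholder scores are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw discriminator scores on a batch of real and generated samples
# (placeholder numbers; in practice these come out of the network).
d_real = np.array([0.9, 1.2, 0.7])
d_fake = np.array([-0.3, 0.1, 2.5])

# Standard GAN: sigmoid cross-entropy on the scores
# (non-saturating form of the generator loss).
ce_loss_d = -np.mean(np.log(sigmoid(d_real))) - np.mean(np.log(1.0 - sigmoid(d_fake)))
ce_loss_g = -np.mean(np.log(sigmoid(d_fake)))

# LSGAN: plain least squares on the same scores, targets 1 for real and 0 for fake,
# with the generator trying to push fake scores toward 1.
ls_loss_d = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
ls_loss_g = 0.5 * np.mean((d_fake - 1.0) ** 2)

print("cross-entropy:", ce_loss_d, ce_loss_g)
print("least squares:", ls_loss_d, ls_loss_g)
```

Notice the fake score of 2.5: the discriminator is already fooled there, so the cross-entropy generator loss is nearly flat for that sample, but the least squares loss still pulls it back toward the target, which is the "put mass on the decision boundary" behaviour described above.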
[link]
This work deals with rotation equivariant convolutional filters. The idea is that when you rotate an image, you should not need to relearn new filters to deal with this rotation. First we can look at how convolutions typically handle rotation, and how we would expect a rotation invariant solution to perform:

| | |
| - | - |
| https://i.imgur.com/cirTi4S.png | https://i.imgur.com/iGpUZDC.png |

The method computes all possible rotations of the filter, which results in a list of activations where each element represents a different rotation. From this list the maximum is taken, which results in a two dimensional output for every pixel (rotation, magnitude). This happens at the pixel level, so the result is a vector field over the image. (A minimal sketch of this max-over-rotations step appears at the end of this summary.)

https://i.imgur.com/BcnuI1d.png

We can visualize their degree selection method with a figure from https://arxiv.org/abs/1603.04392, which determined the rotation of a building:

https://i.imgur.com/hPI8J6y.png

We can also think of this approach as attention \cite{1409.0473}, where they attend over the possible rotations to obtain a score for each rotation value to pass on. The network can learn to adjust the rotation value to be whatever value the later layers will need.

------------------------

Results on [Rotated MNIST](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations) show an impressive improvement in training speed and generalization error:

https://i.imgur.com/YO3poOO.png
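Here is a minimal NumPy/SciPy sketch of the max-over-rotations step referenced above. It is my own illustration of the idea, not the authors' code; the number of angles, the toy edge filter, and the function name `rotating_conv` are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import correlate2d

def rotating_conv(image, kernel, n_angles=8):
    """Correlate `image` with `kernel` rotated to n_angles orientations,
    then take the per-pixel max over orientations.

    Returns (magnitude, angle) maps, i.e. a vector field over the image."""
    angles = np.arange(n_angles) * (360.0 / n_angles)
    responses = []
    for a in angles:
        k = rotate(kernel, a, reshape=False, order=1)     # rotated copy of the filter
        responses.append(correlate2d(image, k, mode="same"))
    responses = np.stack(responses, axis=0)               # (n_angles, H, W)

    best = np.argmax(responses, axis=0)                   # index of winning rotation per pixel
    magnitude = np.max(responses, axis=0)                 # activation of the winning rotation
    angle = angles[best]                                   # its orientation in degrees
    return magnitude, angle

# Toy example: a vertical-edge filter applied to random noise.
image = np.random.randn(32, 32)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
mag, ang = rotating_conv(image, kernel)
print(mag.shape, ang.shape)   # (32, 32) (32, 32)
```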
[link]
Problem
------------
Text to image

Contributions
-----------------
* Images are more photorealistic and higher resolution than previous methods
* Stacked generative model

Approach
-------------
2 stage process:

1. Text-to-image: generates a low resolution image with primitive shape and color.
2. Low-to-high-res: using the low res image and the text, generates a high res image, adding details and sharpening the edges.

(A minimal sketch of this two-stage pipeline appears at the end of this summary.)

https://pbs.twimg.com/media/Cziw6bfWgAAh3Yg.jpg

Datasets
--------------
* CUB - Birds
* Oxford-102 - Flowers

Results
--------
https://cdn-images-1.medium.com/max/1012/1*sIphVx4tqaXJxtnZNt3JWA.png

Criticism / Questions
-------------------
* Is it possible the resulting images are replicas of images in the original dataset? To what extent does the model "hallucinate" new images?
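Below is a minimal sketch of the two-stage pipeline at inference time, using hypothetical stand-in functions rather than the paper's actual architecture; the resolutions, embedding size, and noise dimension are my assumptions:

```python
import numpy as np

def encode_text(sentence):
    # Hypothetical stand-in for a pretrained text encoder producing a sentence embedding.
    return np.random.randn(1024)

def stage1_generator(text_embedding, z):
    # Hypothetical stand-in for the stage 1 conditional generator:
    # text + noise -> low resolution image with primitive shape and color.
    return np.zeros((64, 64, 3))

def stage2_generator(low_res_image, text_embedding):
    # Hypothetical stand-in for the stage 2 conditional generator:
    # low res image + the same text -> high res image with added detail.
    return np.zeros((256, 256, 3))

text = "a small bird with a red head and white belly"
emb = encode_text(text)
z = np.random.randn(100)                      # noise vector

low_res = stage1_generator(emb, z)            # stage 1: text-to-image
high_res = stage2_generator(low_res, emb)     # stage 2: low-to-high-res refinement
print(low_res.shape, high_res.shape)
```

The structural point the sketch is meant to capture is that stage 2 conditions on both the stage 1 output and the text again, so it can add details the first pass missed.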