Summary by CodyWild 6 years ago
Generative Adversarial Networks (GANs) are an exciting technique, a kernel of an effective concept that has been shown to overcome many of the problems of previous generative models: particularly the fuzziness of VAEs. But, as I’ve mentioned before, and as you’ve doubtless read if you’ve read any material about the topic, they’re finicky things, difficult to train in a stable way, and particularly prone to mode collapse. Mode collapse is a phenomenon where, at each iteration, the generator places all of its mass on one single output or dense cluster of outputs, instead of representing the full distribution of output space the way we’d like it to. One proposed solution is the one I discussed yesterday: explicitly optimize the generator according not only to what the discriminator thinks about its current allocation of probability, but to what the discriminator’s next move will be (thus incentivizing the generator not to take indefensible strategies like “put all your mass in one location the discriminator can push down next round”).
An orthogonal approach is the one described in LSGANs: change the objective function of the network away from sigmoid cross-entropy and to a least squares loss instead. While I don’t have the latex capabilities to walk through the exact mathematics in this format, what this means on a conceptual level is that instead of incentivizing the generator to put all of its mass in places the discriminator is sure are “true data” regions, we’re incentivizing the generator to put mass right on the true/fake decision boundary. Likely this doesn’t make very much sense yet (it didn’t for me, at this point in reading). Occasionally, delving deeper into the math and theory behind an idea gives you rigor, but without much intuition. I found the opposite to be true in this case, where learning more (for the first time!) about f-divergences actually made this method make more sense. So, bear with me, and hopefully trust me not to take you too deep into the weeds without a good reason.
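To make the least-squares idea concrete, here is a minimal sketch of what the two objectives could look like in PyTorch. The function names, the 0/1 target coding, and the framework choice are my own illustration rather than details spelled out in this summary:

```python
import torch

def lsgan_d_loss(d_real_scores, d_fake_scores):
    # Least-squares discriminator loss: push raw (non-sigmoid) scores toward 1
    # on real samples and toward 0 on generated samples (one common label coding).
    real_term = 0.5 * torch.mean((d_real_scores - 1.0) ** 2)
    fake_term = 0.5 * torch.mean(d_fake_scores ** 2)
    return real_term + fake_term

def lsgan_g_loss(d_fake_scores):
    # Least-squares generator loss: push the discriminator's score on generated
    # samples toward the "real" target, penalizing each sample quadratically by
    # how far its score sits from that target.
    return 0.5 * torch.mean((d_fake_scores - 1.0) ** 2)
```

One thing to notice about the quadratic penalty is that it doesn’t saturate the way sigmoid cross-entropy does: even samples the discriminator scores very confidently still feel a gradient pulling them toward the target.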
On a theoretical level, this paper’s loss function means that you end up minimizing a (Pearson) chi squared divergence between the distributions, instead of a KL divergence. f-divergences are a family of quantities that measure how different two distributions are from one another, and they do that by taking an average, weighted by the density q, of some function f of the ratio of densities p/q at each point. (In other words: the expectation under q of f(p/q).) For the KL divergence, this function is x*logx. For chi squared it’s (x-1)^2.
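To see that this weighted-average definition really does reduce to the familiar formulas, here is a quick numeric sketch on two discrete distributions; the specific numbers are just toy values I made up for illustration, not anything from the paper:

```python
import numpy as np

# D_f(P || Q) = sum_x q(x) * f(p(x) / q(x))
def f_divergence(p, q, f):
    r = p / q                   # ratio of densities at each point
    return np.sum(q * f(r))     # average of f(ratio), weighted by q

kl_f   = lambda x: x * np.log(x)    # f for the KL divergence
chi2_f = lambda x: (x - 1.0) ** 2   # f for the (Pearson) chi squared divergence

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

print(f_divergence(p, q, kl_f))      # matches the usual KL formula below
print(np.sum(p * np.log(p / q)))     # sum(p * log(p/q))
print(f_divergence(p, q, chi2_f))    # equals sum((p - q)^2 / q)
```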
All of this starts to coalesce into meaning with the information that the behavior of a typical GAN looks like minimizing the divergence FROM the generator’s probability mass TO the data’s probability mass. That means we take the ratio of how much mass the generator puts somewhere to how much mass the data has there, and plug it into the x*logx function seen below.
https://i.imgur.com/BYRfi0u.png
Now, look how much the function value spikes when that ratio goes over 1. Intuitively, what this means is that we heavily punish the generator when it puts mass in a place that’s unrealistic, i.e. where there isn’t representation from the data distribution. But - and this is the important thing - we don’t symmetrically punish it when its mass at a point is far *lower* than the mass in the real data there, i.e. when the ratio is much smaller than one. This means we don’t have a good way of punishing mode collapse, the scenario where the generator puts all of its mass on one of the modes of the data; we don’t do a good job of pushing the generator to have mass everywhere the data has mass. By contrast, the chi squared divergence pushes the ratio of (generator/data) toward 1 *from both directions*. So, if there’s more generator mass than data mass somewhere, that’s bad, but it’s also bad for there to be more data mass than generator mass. This gives the network a stronger incentive not to learn mode-collapsed solutions.
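To put rough numbers on that asymmetry, here is a quick check of the two functions at a few ratios; the ratios themselves are arbitrary and only meant to be illustrative:

```python
import math

def kl_f(r):
    # f(x) = x*log(x): the function weighted in the KL-style divergence above
    return r * math.log(r)

def chi2_f(r):
    # f(x) = (x-1)^2: the function weighted in the chi squared divergence
    return (r - 1.0) ** 2

for r in (0.01, 1.0, 100.0):
    print(f"ratio {r:>6}: x*log(x) = {kl_f(r):9.3f}   (x-1)^2 = {chi2_f(r):9.3f}")

# ratio 0.01 (generator ignores a region the data covers):
#   x*log(x) is about -0.05, essentially no penalty, so dropping modes is cheap;
#   (x-1)^2 is about 0.98, a real penalty for the missing mass.
# ratio 100 (generator piles mass where the data has almost none):
#   both functions blow up, so unrealistic samples are punished either way.
```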