[link]
These days, a bulk of recent work on Variational AutoEncoders - a type of generative model - focuses on how to attach recently designed, powerful decoders (the part that maps from the compressed information bottleneck to the reconstruction) to VAEs while still causing them to capture high-level, conceptual information within that information bottleneck (also known as the latent code). In the status quo, the decoder can do well enough even without conditioning on conceptual variables stored in the latent code that it’s not worth the model’s while to store information there.

The reason VAEs typically make it costly to store information in latent codes is the inclusion of a term that measures the KL divergence (distributional distance, more or less) between an uninformative unit Gaussian (the prior) and the distribution of latent z codes produced for each individual input x (the posterior). Intuitively, if the distribution for each input x just maps to the prior, that gives the decoder no information about which x was initially passed in: the encoder has learned to ignore the latent code. The question of why this penalty term is included in the VAE has two answers, depending on whether you’re asking from a theoretical or a practical standpoint. Theoretically, it’s because the original VAE objective can be interpreted as a lower bound on the true p(x) distribution. Practically, pulling the individual distributions closer to the prior often has a regularizing effect: it pushes the z codes for individual inputs closer together, and makes closeness in z space translate more reliably into closeness in what gets reconstructed. That happens because the encoder is disincentivized from placing each individual z distribution very far from the prior. The upshot is that there’s a lot of overlap between the distributions learned for various input x values, so it’s in the model’s interest to make the reconstructions of those nearby elements similar as well.

The argument of this paper starts from the compression-cost side. If you look at the KL divergence term with the prior from an information-theoretic perspective, you can see it as the “cost of encoding your posterior, using a codebook developed from your prior”. This is a bit of an opaque framing, but the right mental image is the Morse code tree: the most common character in the English language corresponds to the shortest Morse symbol, and so on. That tree was optimized to make messages as short as possible, by mapping common letters to short symbols. But if you were to encode a message in, say, Russian, you’d no longer be well optimized for the letter distribution of Russian, and your messages would generally be longer. So, in the typical VAE setting, we’re imagining a receiver who has no idea what message they’ll be sent, and who therefore uses the global prior to build their codebook.

By contrast, the authors suggest a world in which we meaningfully order the entries sent to the receiver in terms of similarity. Then, if you use the heuristic “each message provides a good prior for the next message I’ll receive”, you incur a lot less coding cost, because the “prior” is designed to be a good distribution for encoding this sample, which will hopefully be quite similar to the next one. On a practical level, this translates to:

1. Encoding a z distribution for the input.
2. Choosing one of that z code’s K closest neighbors, c.
3. Putting c as input into a “prior network” that takes in the randomly chosen nearby code and spits out distributional parameters for another distribution over zs, which we’ll call the “prior”.

Intuitively, a lot of the trouble with the constraint that all z encodings be close to the same global prior is that it was just too restrictive. This paper imposes a local prior instead, essentially enforcing local smoothness by pulling each z value closer to others already near it, but without forcing everything to look like a global prior.
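To make the local-prior idea concrete, here is a minimal PyTorch sketch (my own, not the authors’ code) of the KL term computed against a neighbor-conditioned Gaussian prior instead of a fixed unit Gaussian. The `PriorNetwork` architecture, all sizes, and the stand-in tensors are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical prior network: maps a neighbor's code c to the parameters of a
# Gaussian "local prior" over z, replacing the fixed unit-Gaussian prior.
class PriorNetwork(nn.Module):
    def __init__(self, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)

    def forward(self, c):
        h = self.net(c)
        return self.mu(h), self.logvar(h)

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1)

# Usage sketch: mu_q / logvar_q would come from the encoder for input x,
# and c would be the code of one of x's K nearest neighbors in latent space.
prior_net = PriorNetwork(z_dim=32)
mu_q, logvar_q = torch.randn(8, 32), torch.zeros(8, 32)   # stand-in encoder outputs
c = torch.randn(8, 32)                                    # stand-in neighbor codes
mu_p, logvar_p = prior_net(c)
local_kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)    # replaces KL(q || N(0, I))
```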
[link]
Variational Autoencoders are a type of generative model that learns to generate new data by incentivizing the model to reconstruct its input after compressing it to a low-dimensional space. Typically, the reconstruction is scored against the original by comparing pixel-by-pixel values: a reconstruction gets a high score if it places pixels of color in the same places the original did. However, there are compelling reasons why this is a sub-par way of scoring images. The central one is that it focuses on and penalizes superficial differences: if the model accurately reproduces the focal object of the image, but does so, say, 10 pixels to the right of where it was previously, that incurs a penalty we might not actually want to apply. The flip side is that a direct pixel-comparison loss doesn’t differentiate between pixel differences that do or don’t change the fundamental substance of the image. For instance, having 100 pixels wrong around the border of a dog, making it seem very slightly larger, would count as the same amount of error as having 100 pixels concentrated in a weird bulb that appears to be growing out of the dog’s ear, even though the former does a better job of being recognizable as a dog.

The authors of the VAE/GAN paper have a clever approach to solving this problem, which involves taking the typical pixel loss and breaking it into two conceptual parts. The first focuses on aligning the conceptual features of the reconstructed image with those of the input image. It does so by running both the input and the reconstruction through a discriminative convolutional model which - in the typical way of deep learning - learns ever more abstract features at each layer of the network. These “conceptual features” abstract away the precise pixel values, and instead capture the higher-level features of the image. So, instead of calculating the pixelwise squared loss between the specific input x and its after-bottleneck reconstruction x~, you take the squared loss between the feature maps at some layer for both x and x~, and push them to be closer together, so that the reconstruction shares the same features as the original. The second part focuses on detail-level specifics of images, but, cleverly, does so in a general rather than an observation-specific way. This is done by training a GAN-style discriminator to tell the difference between generated images and original images, and then using that loss to train the decoder part of the VAE. The cleverness here is that they still enforce that the details and structural features of the reconstructed image be indistinguishable from those of real images, but in a general sense, rather than requiring the details to exactly match those found in a given input x.

https://i.imgur.com/Bmtmac2.png

The authors freely admit that existing metrics for scoring images (which themselves *use* pixelwise similarity) rate their method as worse than existing VAEs. However, they argue, that is inherently a flawed metric that doesn’t capture the aspects of clean visual quality we want in generated images. The metric they propose instead involves a dataset where a list of attributes is attached to each image (old, black, blond, etc). They add these attributes as additional input while training the network, so that whatever signals the decoder needs to, say, turn someone blond, it gets from the externally given attribute vector rather than from a learned representation. This means that, once the model is trained, we can set some value of the attribute vector and have the decoder generate samples conditioned on it. The metric is constructed by taking the decoded samples conditioned on some attribute set, and then taking a classifier model, trained on the real images, that predicts attribute values from images. The generated images are then scored by how closely the classifier’s predictions match the true values of the attributes. If the generator were working perfectly, this error rate would be as low as it is on real data. By this metric (which: grain of salt, since they invented it), the VAE/GAN model is superior to both GANs and vanilla VAEs.
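As a concrete illustration of the feature-level reconstruction loss described above, here is a small PyTorch sketch; the discriminator architecture, the layer at which features are compared, and the image sizes are placeholder assumptions of mine, not the paper’s configuration.

```python
import torch
import torch.nn as nn

# Sketch: instead of a pixelwise loss between x and its reconstruction x_rec,
# compare the activations they produce at an intermediate layer of the discriminator.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # convolutional feature extractor
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(1))

    def forward(self, x):
        feats = self.features(x)                  # "conceptual" feature maps
        return self.classifier(feats), feats      # (real/fake logit, features)

disc = Discriminator()
x, x_rec = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)  # stand-in images

_, feat_real = disc(x)
_, feat_rec = disc(x_rec)
feature_recon_loss = ((feat_real - feat_rec) ** 2).mean()  # replaces pixelwise MSE

# The GAN term is then the usual real-vs-generated objective on the discriminator's
# logit output, which trains the decoder to produce generally realistic detail.
```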
[link]
There are mathematicians, still today, who look at deep learning and get real salty over the lack of convex optimization. That is to say: convex functions are ones where you have actual guarantees that gradient descent will converge, and mathematicians of olden times (i.e. 2006) spent reams of paper arguing that this or that function had convex properties, and thus could be guaranteed to converge, under this or that set of arcane conditions. And then Deep Learning came along, with its huge, nonlinear, very much nonconvex objective functions, which it was nonetheless trying to optimize via gradient descent. From the perspective of an optimization theorist, this had the whiff of heresy, but exceptionally effective heresy. And so the field of DL has half-exploded, half-stumbled along, showcasing a portfolio of very impressive achievements, but with theory very much a secondary priority relative to performance.

Something else gradient descent isn’t supposed to be able to do is learn models that include discrete (i.e. non-continuous) operators. Without continuous gradients, the functions don’t have an obvious way to “push” in a certain direction to modulate the loss at the end of the network. Discrete nodes mean that the value just jumps from one state to another, with no intermediate values. This has historically posed a problem for algorithms fueled by gradient descent. The authors of this paper came up with a solution that is 60% cleverness, and 40% just guessing that “even if we ignore the theory, things will probably work well enough”.

But, first, their overall goal: to create a Variational AutoEncoder where the latent state - the compressed internal representation that is typically an array of continuous values - is instead an array of categorical values. The goal of this was 1) to have a representation type that is a better match for the discrete nature of data types like speech (which has distinct phonemes we might like to discretely capture), and 2) to have a more compressed latent space that would, of necessity, focus on more global information, leaving local pixel-level information to be learned by the expressive PixelCNN decoder.

The way they do this is remarkably simple. First, they learn a typical VAE encoder, mapping from the input pixels to a continuous z space. (An interesting sidenote: this paper uses spatially organized z; instead of using one single z vector to represent the whole image, they may have 32x32 spatial locations, each of which has its own z vector, to represent a 128x128 image.) Then, for each spatial region, they take the continuous vector produced by the network and compare it to a fixed set of “embedding” vectors of the same shape. That spatial location is then lumped into the category of the embedding it’s closest to, meaning you end up with a compressed layer of 32x32 (in this case) spatial regions, each of which is represented by a categorical number between 0 and max-num-categories. The network then passes forward the embedding that this input vector was just “snapped” to, and the decoder uses the full spatial set of embeddings to do its decoding.

https://i.imgur.com/P8LQRYJ.png

The clever part comes when you ask how to train the encoder to produce a different embedding, when there was this discrete “jump” in between. The authors choose, more or less, to just avoid the problem. They do so by taking the gradient signals that come back from the end of the network to the embedding, and passing them directly to the vector that was used to look up that embedding by nearest neighbors. Basically, they pretend that they passed the encoder’s vector through the rest of the network, rather than the embedding. The embeddings themselves are then trained in a K-Means-Clustering kind of way, iteratively updated to be closer to the points that were assigned to them in each round of training. This is the “Vector Quantization” part of VQ-VAE.

Overall, this seems to perform quite well: the low capacity of the latent space means it is incentivized to handle more global structure, while leaving low-level pixel details to the decoder. It is also much easier to fit after-the-fact distributions over; once we’ve trained a VQ-VAE, we can easily learn a global model that represents the location-by-location dependencies between the categories (i.e. a 1 in this corner makes a 5 in this other corner more probable). This gives us an analytically specified distribution, in latent space, that actually represents the structure of how these “concept-level categories” relate to each other. By contrast, with most continuous latent spaces, it’s intractable to learn an explicit density function after the fact, and thus if we want to be able to sample, we need to specify and enforce a prior distribution over z ahead of time.
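Here is a minimal PyTorch sketch of the quantization step and the straight-through gradient trick just described. Note that I use an explicit codebook/commitment loss variant rather than the k-means-style iterative update mentioned above, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # the fixed set of "embedding" vectors
        self.beta = beta

    def forward(self, z_e):                        # z_e: (batch, H, W, code_dim) encoder outputs
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Nearest-neighbor lookup: distance to every codebook vector.
        dists = torch.cdist(flat, self.codebook.weight)
        indices = dists.argmin(dim=1)              # one categorical code per spatial location
        z_q = self.codebook(indices).view_as(z_e)

        # Codebook vectors are pulled toward encoder outputs; encoder is pulled toward its code.
        codebook_loss = ((z_q - z_e.detach()) ** 2).mean()
        commitment_loss = self.beta * ((z_e - z_q.detach()) ** 2).mean()

        # Straight-through estimator: the forward pass uses z_q, but gradients flow
        # back to z_e as if the quantization step weren't there.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, codebook_loss + commitment_loss

vq = VectorQuantizer()
z_e = torch.randn(2, 32, 32, 64)                   # one continuous vector per spatial cell
z_q, codes, vq_loss = vq(z_e)                      # z_q goes to the decoder; codes are discrete
```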
[link]
I’ve spent the last few days pretty deep in the weeds of GAN theory - with all its attendant sample-squinting and arcane training diagnosis - and so today I’m shifting gears to an applied paper that mostly showcases some clever modifications of an underlying technique. The goal of the MusicVAE is as you might expect: to make music. But the goal isn’t just the ability to produce patterns of notes that sound musical; it’s the ability to learn a vector space where we can modify the values along each dimension and cause the music we produce to vary along conceptually meaningful directions. In an ideal world, we might learn a dimension that corresponds to tempo, another that corresponds to the key we’re in, etc.

To achieve this goal, the modelers use the structure of a Variational AutoEncoder, a model where we pass in the input, compress it down to some latent code (read: a low-dimensional vector of continuous values), and then, starting from that latent code, use a decoder to try to recreate (or “reconstruct”) the input. Think of this as describing a scene to a friend who has their back to it, trying to describe it in a maximally informative way so that they can draw it themselves and get as close as possible to the original. Ideally, this set of constraints incentivizes the model to learn an informative code, which will contain the kind of conceptually meaningful information we want it to.

One problem this can run into is that, given certain mathematical facts about the structure of autoencoders, if you use a decoder with a lot of capacity, like an RNN, the model can “decide” to use the RNN to model the data directly, storing all the conceptual information we’d like to have pulled out into the latent code in the parameters of the RNN instead. And so, to solve this, the authors of the paper came up with a clever solution: instead of generating the full piece of music at once, they build a hierarchical model, with a “conductor” layer that prescribes what a medium-sized chunk of the reconstructed piece will sound like, and a lower-level “decoder” layer that takes the conductor’s direction for that chunk and unspools it into a series of notes. On a more mechanical level, when the encoder spits out a latent code for a given piece of music, we pass that to the conductor. The conductor then produces - say - 10 embeddings, with each embedding corresponding to a set of 4 measures. Each decoder only sees the embedding for its chunk, and is only responsible for mapping that embedding into a series of concrete notes. This inability of each decoder to see what the decoders before and after it are doing means that, in order for the piece to sound coherent, the network needs to learn to develop a condensed set of instructions to give to the conductor.

https://i.imgur.com/PQKoraX.png

In practice, they come up with some really neat results: the example they show on the linked page demonstrates a learned concept-dimension that maps to “how much is this piece composed of long, held notes, vs short staccato ones”. They show that they can “interpolate” across this dimension (that is: slowly change its value) and watch the output slowly morph from very long held notes to a high density of shorter, distinct ones.
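To make the conductor/decoder split concrete, here is a rough PyTorch sketch of a hierarchical decoder in this spirit. It omits autoregressive note sampling and teacher forcing, and the layer sizes, chunk counts, and note representation are assumptions of mine rather than the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, z_dim=256, embed_dim=128, note_dim=90,
                 n_chunks=10, steps_per_chunk=16):
        super().__init__()
        self.n_chunks, self.steps = n_chunks, steps_per_chunk
        self.conductor = nn.LSTM(z_dim, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.to_notes = nn.Linear(embed_dim, note_dim)

    def forward(self, z):                               # z: (batch, z_dim) latent code
        # Conductor: feed z at every step to emit one embedding per chunk of measures.
        z_seq = z.unsqueeze(1).expand(-1, self.n_chunks, -1).contiguous()
        chunk_embeds, _ = self.conductor(z_seq)         # (batch, n_chunks, embed_dim)

        notes = []
        for i in range(self.n_chunks):
            # Each chunk's decoder sees only its own embedding, repeated at every step.
            e = chunk_embeds[:, i:i + 1, :].expand(-1, self.steps, -1).contiguous()
            out, _ = self.decoder(e)
            notes.append(self.to_notes(out))            # (batch, steps, note_dim) logits
        return torch.cat(notes, dim=1)                  # full sequence of note logits

decoder = HierarchicalDecoder()
z = torch.randn(4, 256)                                 # stand-in latent codes
note_logits = decoder(z)                                # shape: (4, 160, 90)
```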
[link]
Despite their difficulties in training, Generative Adversarial Networks are still one of the most exciting recent ideas in machine learning: a way to generate data without the fuzziness and averaging of earlier methods. However, up until recently, there had been a major way in which the GAN’s primary competitor in the field, the Variational Autoencoder, was superior: it could do inference. Intuitively, inference is the inverse of generation. Whereas generation works by taking some source of randomness - a random vector, the setting of some latent code - and transforming that recipe into an observation, an inference process tries to work in reverse, taking in the observation as input and trying to guess what “recipe” was used to generate it. (As a note: in real-world data, it’s generally not the case that explicit numerical factors were used to generate the data; this framing is a simplified model meant to represent the way a small set of latent properties of an object jointly cause a lot of that object’s feature values.) The authors of this paper proposed the BiGAN to fix that deficiency in the GAN literature.

https://i.imgur.com/vZZzWH5.png

The BiGAN - short for Bidirectional GAN - works by having two generators, not one. One generator works in the typical fashion of a GAN: taking in a random vector z and transforming it into G(z) = x. The second works in reverse, taking in data from the underlying dataset and transforming it into a code z, E(x) = z. Once these generators are in place, the discriminator works not by trying to differentiate the x and z values separately, but all together: it is given a pair (x, z) and asked to decide whether that pair came from the z -> x decoder or the x -> z encoder. If this model fully converges, G(z) and E(x) become inverse transformations, giving us a way to take a new input x and infer its underlying factors z. This is valuable because it’s been shown that, in typical GANs, changes in z often correspond to latent factors we care about, so being able to recover z from x is useful for representation learning.

The authors offer quite a nice intuitive argument for why the model learns this inverse mapping. For each pair (x, z), it’s either the case that E(x) = z (if the pair came from the encoder) or that G(z) = x (if the pair came from the decoder). But if only one of those holds, it’s easy for the discriminator to tell which generation process produced the pair. So, in order to fool the discriminator, G and E need to synchronize their decoding and encoding processes.

The authors also tried a method where, instead of this bidirectional GAN structure, they simply built a network on top of the generated samples that tries to predict the original z used, taking the generated x as input. They show that this performs less well on subjective quality measures of the learned representation, which they attribute to the fact that GANs notoriously only learn some modes of the data, and thus an x -> z encoder that is only ever trained on generated samples will not have good coverage over the full distribution of x.
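A skeleton of the joint-pair discriminator setup, in PyTorch, as I understand it from the description above; the MLPs, the flattened x vectors, and the loss bookkeeping are placeholder assumptions rather than the paper’s architecture.

```python
import torch
import torch.nn as nn

# Generator G: z -> x, encoder E: x -> z, and a single discriminator D that scores
# (x, z) pairs rather than x or z alone.
x_dim, z_dim = 784, 64

G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
D = nn.Sequential(nn.Linear(x_dim + z_dim, 256), nn.ReLU(), nn.Linear(256, 1))

bce = nn.BCEWithLogitsLoss()
x_real = torch.randn(32, x_dim)                 # stand-in for a batch of real data
z_prior = torch.randn(32, z_dim)                # samples from the z prior

# Two kinds of pairs: (G(z), z) from the generator side, (x, E(x)) from the encoder side.
pair_gen = torch.cat([G(z_prior), z_prior], dim=1)
pair_enc = torch.cat([x_real, E(x_real)], dim=1)

# The discriminator tries to tell the two pair types apart...
d_loss = bce(D(pair_enc), torch.ones(32, 1)) + bce(D(pair_gen), torch.zeros(32, 1))
# ...while G and E are trained with the labels flipped, pushing (G(z), z) and (x, E(x))
# toward being indistinguishable, i.e. toward E and G becoming inverses.
g_e_loss = bce(D(pair_gen), torch.ones(32, 1)) + bce(D(pair_enc), torch.zeros(32, 1))
```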