This paper draws from two strains of recent work: the hierarchical music modeling of MusicVAE, which intentionally models musical structure at both local and more global levels, and the discrete autoencoder approach of Vector Quantized VAEs, which seeks to maintain the overall structure of a VAE while applying a less aggressive form of regularization.

The goal of this paper is to build a model that can generate music, not from that music's symbolic representation - lists of notes - but from actual waveform audio. This is a more difficult task, because the model now has to learn mappings between waveforms and symbolic notes, but it confers the advantage of being able to model expressive dimensions of music that are difficult to capture in a purely symbolic representation. Models of pure waveform data have been used before - Wavenet is a central example - but typically they are learned alongside some kind of text conditioning structure, which is to say: you tell the model to say "Hello there, world", and the model is only responsible for building local mappings between those phonemes and waveforms, not for actually modeling coherent words to follow after "Hello".

To address this problem, the authors propose learning an autoencoded representation over the full music sample, to try to capture global structure. Each predicted value of the global structure sequence then represents some number of timesteps of the generated sequence: say, 20. The idea is to learn a global model that produces 1/N as many sequence points (1/20, in this case), whose job is ensuring long-term consistency. The authors then also use a lower-level decoder model that takes the conditioning information from the autoencoder and, in a similar fashion to a text-to-speech Wavenet, captures a high-fidelity mapping between that conditioning and the output waveform. This overall structure has a lot in common with the recently released MusicVAE paper.

The most salient architectural change proposed by this paper is the use of Argmax VAEs rather than VQ-VAEs. The overall reason for training discrete autoencoders is to have a more easily adjustable way of regularizing the bottlenecked representation, to avoid the fact that, for some challenging problems, excessively strong VAE regularization can lead to the high-level representational space simply not being used. To understand the difference, it's worth understanding that VQ-VAEs work by generating a continuous encoding vector (the same as a typical VAE), but instead of passing that continuous vector directly on to the decoder, the VQ-VAE fits what is basically a K-means operation: it maps the continuous vector to one of its "prototypical" or "codebook" vectors based on closeness in Euclidean distance (these codebook vectors are learned in a separate training loop, in a K-means-style algorithm). The Argmax VAE is similar, but instead of needing that alternating step of learning codebook vectors via K-means, it performs a much simpler quantization operation: just taking the argmax of indices across the continuous vector, so that the output is the one-hot vector closest to the continuous input. While this reduces the capacity of the model, it also limits the problem of "codebook collapse", a failure mode that can happen during the K-means iteration (I'm actually not entirely clear on the prototypical example of codebook collapse, or exactly why it happens).
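To make the contrast concrete, here is a minimal numpy sketch (my own, not from the paper) of the two quantization steps for a single encoding vector; batching and gradient handling (e.g. straight-through estimation) are omitted, and the codebook here is just a random stand-in for learned prototypes.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                                 # number of discrete codes; for the argmax variant,
z_e = rng.normal(size=K)              # the encoding dimension equals the number of codes

# VQ-VAE-style quantization: snap the continuous encoding to the nearest codebook vector
# (the codebook itself is learned in its own K-means-style loop).
codebook = rng.normal(size=(K, K))    # hypothetical learned prototypes
nearest = np.argmin(((codebook - z_e) ** 2).sum(axis=1))
z_q_vq = codebook[nearest]

# Argmax-VAE-style quantization: no codebook at all; just take the one-hot vector closest
# to the encoding, i.e. put the 1 at the index of the largest coordinate.
z_q_argmax = np.zeros(K)
z_q_argmax[np.argmax(z_e)] = 1.0
```

In both cases the decoder is then conditioned on the quantized vector (`z_q_vq` or `z_q_argmax`) rather than on the raw continuous encoding.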
https://i.imgur.com/H5YqSZG.png

Combining these ideas: this paper's model works by learning an Argmax VAE over a longer, coarser timescale, and then learning a local, high-resolution decoder - similar to Wavenet - over the smaller timescales, conditioned on the output of the Argmax VAE making high-level decisions. This combination balances the needs of coherent musical structure and local fidelity, and allows the trade-off between them to be weighted in a fairly flexible way, by changing the frequency at which the Argmax VAE produces conditioning output.
Variational Inference builds around the ELBO (Evidence Lower BOund) -- a lower bound on the marginal log-likelihood of the observed data, $\log p(x) = \log \int p(x, z) dz$ (which is typically intractable). The ELBO makes use of an approximate posterior to form a lower bound: $$ \log p(x) \ge \mathbb{E}_{q(z|x)} \log \frac{p(x, z)}{q(z|x)} $$

# Introduction to Quasi Monte Carlo

It's assumed that both the joint $p(x, z)$ (or, equivalently, the likelihood $p(x|z)$ and the prior $p(z)$) and the approximate posterior $q(z|x)$ are tractable (have closed-form densities and are easy to sample from). Then one can estimate the ELBO via Monte Carlo as $$ \text{ELBO} \approx \frac{1}{N} \sum_{n=1}^N \log \frac{p(x, z_n)}{q(z_n|x)}, \quad\quad z_n \sim q(z|x) $$ This estimate can be used in stochastic optimization, essentially maximizing the ELBO stochastically, which leads to either increasing the marginal log-likelihood or decreasing the gap between the true posterior $p(z|x)$ and the approximate one $q(z|x)$.

The efficiency of stochastic optimization depends on the amount of stochasticity: the bigger the variance, the harder it is to locate the optimum. It's well known that in typical Monte Carlo the variance scales as $1/N$ for a sample of size $N$, and hence the typical error of such an "approximation" has order $1/\sqrt{N}$. However, there are more efficient schemes for evaluating integrals of this expectation form. To give some intuition, consider $$ \mathbb{E}_{q(z)} f(z) = \int_\mathcal{Z} f(z) q(z) dz = \int_{[0, 1]^d} f(z(u)) du $$ Here I used the fact that any random variable can be expressed as a deterministic transformation of a uniform r.v. (by applying the inverse CDF of the former r.v.), so estimating the expectation via MC essentially means sampling a bunch of uniform r.v.s $u_1, \dots, u_N$ and transforming them into the corresponding $z$s. However, uniformly distributed random variables sometimes clump together and leave some areas uncovered:

https://i.imgur.com/fejsl2t.png

Low Discrepancy sequences are designed to cover the unit cube more uniformly, in the sense that points are unlikely to clump and should not leave "holes" in the landscape, effectively facilitating better exploration. Quasi Monte Carlo then evaluates the integrand at the points of such a sequence, giving a (deterministic!) approximation with an error of order $\tfrac{(\log N)^d}{N}$. If you want some randomness, there are clever randomization techniques that give you Randomized Quasi Monte Carlo with roughly the same guarantees.

# RQMC applied to VI

The authors estimate the ELBO using samples obtained from Randomized QMC (a scrambled Sobol sequence, in particular), and show experimentally that this leads to lower gradient variance and faster convergence.
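To give a feel for the mechanics, here is a small sketch (my own, not from the paper) that draws uniform points either i.i.d. or from a scrambled Sobol sequence via `scipy.stats.qmc`, pushes them through the inverse CDF as described above, and compares the spread of the resulting estimates of a toy expectation standing in for the ELBO integrand.

```python
import numpy as np
from scipy.stats import norm, qmc

# Toy integrand standing in for log p(x, z) - log q(z|x); its true expectation under N(0, 1) is 1.
f = lambda z: z ** 2

def mc_estimate(n, rng):
    u = rng.random((n, 1))                              # plain Monte Carlo: i.i.d. uniforms
    return f(norm.ppf(u)).mean()                        # inverse-CDF transform, then average

def rqmc_estimate(n, seed):
    sampler = qmc.Sobol(d=1, scramble=True, seed=seed)  # randomized (scrambled) Sobol sequence
    u = sampler.random_base2(m=int(np.log2(n)))         # n must be a power of two here
    return f(norm.ppf(u)).mean()

rng = np.random.default_rng(0)
n = 128
mc = [mc_estimate(n, rng) for _ in range(200)]
rqmc = [rqmc_estimate(n, seed) for seed in range(200)]
print("MC   std:", np.std(mc))    # roughly 1/sqrt(n)
print("RQMC std:", np.std(rqmc))  # much smaller for a smooth, low-dimensional integrand
```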
# Theoretical Properties

The authors also analyse Stochastic Gradient Descent with RQMC and prove several convergence theorems. To the best of my knowledge, this is the first work considering stochastic optimization using QMC (which is understandable, given that one needs to be able to control the gradients to do so).

# Critique

The paper was a great read, and spurred great interest in me. I find the idea of using QMC very intriguing; however, in my opinion there are several problems on the road to mass adoption:

1. The authors use RQMC to get the stochastic nature of $z_n$; however, that essentially changes the effective distribution of the generated $z$, which should be accounted for in the ELBO. Otherwise the objective they're maximizing is not an ELBO (except, perhaps, asymptotically) and hence not necessarily a lower bound on the marginal log-likelihood. However, finding the correct proposal density $q(z|x)$ (and successfully using it) does not seem easy, as most randomization schemes give you degenerate support, and the KL is then not well-defined.
2. The authors have an experiment on a Bayesian Neural Network, but a very small one; there are reasons to doubt the results will translate to realistic scales, as the positive effect of QMC vanishes as the dimension grows (since it's harder for uniform samples to clump together in high dimensions).
3. Standard control variates might no longer reduce the variance; further research is needed.
A central question of this paper is: under what circumstances will agents that have been trained to optimize their own reward implement strategies - like tit-for-tat - that are more sophisticated and achieve higher overall reward than each agent simply pursuing its dominant strategy? The games under consideration here are "general sum" games like the Iterated Prisoner's Dilemma, where each agent's dominant strategy is to defect, but where, with some amount of coordination or reciprocity, better overall outcomes are possible. Previous models have achieved this via explicit hardcoding, but this paper strove to use a simpler, more general approach: allowing each agent A to optimize its reward not only with regard to a fixed opponent, but with regard to an opponent that will make a predictable update move in response to the action A is about to take.

Specifically, this model - shorthanded as LOLA, Learning with Opponent-Learning Awareness - maximizes a given agent's expected discounted reward, but looks at reward *conditional on* the way the opponent will update in response to a given action. In a simplified world where the explicit reward function is known, it's possible to literally take the derivative through the opponent's expected update step, taking into account the ways your expected reward is changed by the response you expect from your opponent. Outside of this simplified framework, in the world of policy gradients, there's no analytic loss function; you can no longer directly differentiate your reward function with respect to your opponent's actions, but you can differentiate your expected reward estimator with respect to them. This concept is quite similar to a 2016 paper, Metz et al, which used it to train a more effective GAN by allowing each network in the adversarial pair to "look ahead" to its opponent's expected response, as a way to avoid getting stuck in repetitive action/response cycles. In circumstances where the parameters of the opponent are not known - obviously closer to realistic for an adversarial scenario - the paper demonstrates a proof-of-concept ability to model an opponent's strategy based on its past actions, and to use that model to conduct response-step estimates.

https://i.imgur.com/5xddJRj.png

It should of course be said in all this: even though this setup did produce results closer to what we would expect of rational reciprocity, it's still very simplified. In most of the experiments, each agent had perfect knowledge of the opponent's priorities and likely responses; in most game theory scenarios, constructing a model of your opponent is a nontrivial part of the difficulty. Nonetheless, I found it a compelling proof of concept.
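Since the core mechanic is "differentiate your value through the opponent's anticipated update", here is a minimal PyTorch sketch of that step on a toy bilinear two-player game; the payoff matrices, learning rates, and parameter shapes are hypothetical stand-ins, not the paper's iterated-game setup.

```python
import torch

torch.manual_seed(0)
A, B = torch.randn(3, 3), torch.randn(3, 3)   # toy payoffs: V1 = t1^T A t2, V2 = t1^T B t2
theta1 = torch.randn(3, requires_grad=True)   # player 1's policy parameters
theta2 = torch.randn(3, requires_grad=True)   # player 2's policy parameters
eta, lr = 0.1, 0.05                           # opponent's assumed step size; our step size

V1 = lambda t1, t2: t1 @ A @ t2
V2 = lambda t1, t2: t1 @ B @ t2

# Naive gradient: treat the opponent's parameters as fixed.
naive_grad = torch.autograd.grad(V1(theta1, theta2), theta1)[0]

# LOLA-style gradient: first compute the opponent's anticipated (naive) update on its own
# objective, keeping the graph so the step stays a differentiable function of theta1 ...
opp_step = torch.autograd.grad(V2(theta1, theta2), theta2, create_graph=True)[0]
theta2_lookahead = theta2 + eta * opp_step
# ... then differentiate our value at the opponent's *updated* parameters w.r.t. theta1.
lola_grad = torch.autograd.grad(V1(theta1, theta2_lookahead), theta1)[0]

with torch.no_grad():
    theta1 += lr * lola_grad                  # ascend reward, accounting for the response
```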
The overall goal of the paper is to measure how similar different layers' activation profiles are to one another, in the hope of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: "how similar are the representations learned by different networks on the same task?", and "what is the dynamic of representational change in a given layer throughout training?"

Canonical Correlation Analysis (CCA) is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron's activation value, evaluated over some observation set X. Importantly, for the two layers you're comparing, the set of observations X needs to be of the same length, but the layers can have different numbers of neurons (and, consequently, different numbers of vectors making up each layer's multivariate set).

Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, chosen to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. Each canonical correlate pair, (U1, V1), is a linear combination of the activation vectors of sets L1 and L2 respectively, and is chosen with the goal of minimizing the angle (equivalently, maximizing the cosine similarity) between the correlates in each pair. If you think of L1 and L2 as each only having two activations (that is, as being two-dimensional spaces), then the goal of CCA is to find the angles between the planes defined by the two activation spaces.

An important intuition here is that, in this framing, vector sets that are just linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn't be the case if you just looked at raw correlations between the individual activations. This is connected to the linear algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means the measure can capture similarities between representational spaces that have fundamental geometric similarities, even if they differ on a more surface level. In prior papers, CCA had been used by calculating the CCA vectors between various pairs of layers, and then taking the mean CCA value over all of the pairs of correlate vectors.
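As a rough illustration of the quantity being computed (a simplified sketch of my own, not the paper's exact pipeline), the canonical correlations between two centered activation matrices can be read off the singular values of the product of their orthonormal column bases; the plain mean of these coefficients is the "average CCA" measure that the next paragraph argues against.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two activation matrices.

    X: (n_examples, n_neurons_1), Y: (n_examples, n_neurons_2).
    Returns the vector of canonical correlation coefficients.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the cosines of the principal angles between the two
    # activation subspaces, i.e. the canonical correlations.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

# Example: two random "layers" evaluated on the same 1000 inputs; the second is a linear
# transformation of the first plus a little noise, so the correlations come out near 1.
rng = np.random.default_rng(0)
L1 = rng.normal(size=(1000, 64))
L2 = L1 @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(1000, 32))
rho = canonical_correlations(L1, L2)
print("plain mean CCA:", rho.mean())
```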
This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as an analogy, of a matrix with three columns that only actually spans two dimensions), and so including very low-order correlations in your average mostly adds uninformative noise to the similarity measure. Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in each pair; as best I can tell, this is roughly analogous to weighting by eigenvalues in a PCA setting. Using this weighted-average similarity measure, the authors perform some really interesting investigations into learning dynamics. These include:

* Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs. via actually-generalizing solutions, showing that, during training, the intermediate representations of generalizing networks are more similar to one another than those of memorizing networks are to one another. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set*, across multiple bootstrapped or randomized trainings, could be used as a proxy for test set performance, which could be particularly valuable in contexts where test data is limited. https://i.imgur.com/JwyHFmN.png
* Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple features (e.g. edge detectors) tend to be quite similar across networks, but the higher-level representations are more divergent and more influenced by small quirks of the training set.
* Within a given dataset, you can cluster the learned internal representations across many training runs and recover groups trained with the same learning rate, even though the final-layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima that the optimization algorithm finds as a function of different learning rates.

Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of "tiny hack for an extra 0.05 accuracy".
If one is a Bayesian, he or she best expresses beliefs about the next observation $x_{n+1}$, after observing $x_1, \dots, x_n$, using the **posterior predictive distribution** $p(x_{n+1}\vert x_1, \dots, x_n)$. Typically one invokes the de Finetti theorem and assumes there exists an underlying model $p(x\vert\theta)$, hence $p(x_{n+1}\vert x_1, \dots, x_n) = \int p(x_{n+1} \vert \theta) p(\theta \vert x_1, \dots, x_n) d\theta$; however, this integral is far from tractable in most cases. Nevertheless, having a tractable posterior predictive is useful in cases like few-shot generative learning, where we only observe a few instances of a given class and are asked to produce more of it.

In this paper the authors take a slightly different approach and build a neural model with a tractable posterior predictive distribution $p(x_{n+1} | x_1, \dots, x_n)$ suited for complex objects like images. In order to do so, they take a simple model with a tractable posterior predictive $p(z_{n+1} | z_1, \dots, z_n)$ (like a Gaussian Process, but not quite) and use it to model a latent code, which is obtained from observations using an analytically invertible encoder $f$. This setup lets you take a complex $x$, like an image, and run it through $f$ to obtain $z = f(x)$ -- a simplified latent representation for which it's easier to build a joint density over all possible representations, and hence easier to model the posterior predictive. By feeding the latent representations of $x_1, \dots, x_n$ (namely, $z_1, \dots, z_n$) to the posterior predictive $p(z_{n+1} | f(x_1), \dots, f(x_n))$, we obtain a distribution over latent representations that are coherent with those of the already observed $x$s. By sampling $z$ from this distribution and running it through $f^{-1}$, we recover an object in the observation space, $x_\text{pred} = f^{-1}(z)$ -- a sample coherent with the previous observations.

Important choices are:

* Model for latent representations $z$: one could use a Gaussian Process, however the authors claim it lacks some helpful properties and go for a more general [Student-T Process](http://www.shortscience.org/paper?bibtexKey=journals/corr/1402.4306). The authors then assume that each component of $z$ is a univariate sample from this process (and hence independent of the other components).
* Encoder $f$: it has to be easily invertible and have an easy-to-evaluate Jacobian determinant. The former is needed to decode predictions from the latent representation space, and the latter is used to efficiently compute the density of observations $p(x_1, \dots, x_n)$ using the standard change-of-variables formula $$p(x_1, \dots, x_n) = p(z_1, \dots, z_n) \left\vert\text{det} \frac{\partial f(x)}{\partial x} \right\vert$$ The architecture of choice for this task is [RealNVP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1605.08803).

Done this way, it's possible to write out the marginal density $p(x_1, \dots, x_n)$ of all the observed $x$s and maximize it (as in Maximum Likelihood Estimation).
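As a toy illustration of this density computation (my own sketch: an elementwise affine map stands in for the RealNVP encoder, and independent univariate Student-t marginals stand in for the full Student-t process over the $n$ observations), the log-density of a single observation follows directly from the change-of-variables formula above.

```python
import numpy as np
from scipy.stats import t as student_t

# Hypothetical elementwise-affine stand-in for the invertible encoder f (the paper uses RealNVP).
log_scale = np.array([0.3, -0.2, 0.1])
shift = np.array([0.5, -1.0, 0.0])

def f(x):                      # encoder: x -> z
    return (x - shift) * np.exp(-log_scale)

def f_inv(z):                  # decoder: z -> x
    return z * np.exp(log_scale) + shift

def log_det_jacobian(x):       # log |det df/dx|, constant for an affine map
    return -np.sum(log_scale)

# Independent Student-t density over each latent dimension (a crude stand-in for the
# Student-t process, which additionally couples the n observations).
def log_p_z(z, df=3.0):
    return student_t.logpdf(z, df=df).sum()

x = np.array([0.2, 1.5, -0.7])
z = f(x)
log_p_x = log_p_z(z) + log_det_jacobian(x)   # change-of-variables formula from above
assert np.allclose(f_inv(z), x)              # the map really is invertible
print(log_p_x)
```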
The authors choose to factor the joint density in an autoregressive fashion (via the chain rule): $$p(x_1, \dots, x_n) = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_1, x_2) \dots p(x_n \vert x_1, \dots, x_{n-1}) $$ with all the conditional marginals $p(x_i \vert x_1, \dots, x_{i-1})$ having an analytic density (a Student-t density times the Jacobian determinant). This allows one to form a fully differentiable recurrent computation graph whose parameters (the parameters of the Student-t processes for each component of $z$, plus the parameters of the encoder $f$) can be learned using any stochastic gradient method.

https://i.imgur.com/yRrRaMs.png