The paper extends the [WGAN](http://www.shortscience.org/paper?bibtexKey=journals/corr/1701.07875) paper by replacing the L2 norm in the transportation cost with some other metric $d(x, y)$. Following the same reasoning as in the WGAN paper, one arrives at a dual optimization problem similar to WGAN's, except that the critic $f$ has to be 1-Lipschitz w.r.t. the given norm (rather than L2). This, in turn, means that the critic's gradient (w.r.t. the input $x$) has to be bounded in the dual norm (a notion that makes sense in general Banach spaces, hence the name). The authors build upon [WGAN-GP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1704.00028) and incorporate a similar gradient penalty term to enforce the critic's constraint.

In particular, the authors choose the [Sobolev norm](https://en.wikipedia.org/wiki/Sobolev_space#Multidimensional_case): $$ ||f||_{W^{s,p}} = \left( \int \sum_{k=0}^s ||\nabla^k f(x)||_{L_p}^p dx \right)^{1 / p} $$ This norm is chosen because it not only forces pixel values to be close, but also forces the gradients to be close. The gradients are small in smooth-texture regions and large on the edges, so this norm lets you regulate how much you care about the edges.

Alternatively, you can express the same norm by first transforming $f$ with the Fourier Transform, multiplying the result pointwise by $(1 + ||x||_{L_2}^2)^{s/2}$, transforming it back, and integrating over the whole space: $$ ||f||_{W^{s,p}} = \left( \int \left( \mathcal{F}^{-1} \left[ (1 + ||x||_{L_2}^2)^{s/2} \mathcal{F}[f] (x) \right] (x) \right)^p dx \right)^{1 / p} $$ Here $f(x)$ would be image pixel intensities and $x$ would be image coordinates, so $\nabla^k f(x)$ would be a spatial gradient -- one you don't have direct access to and which is a bit hard to estimate with finite differences -- so the authors go for the second, Fourier, option. Luckily, the DFT is just a linear operator and fast implementations exist, so you can backpropagate through it (TensorFlow already includes tf.spectral). A small numerical sketch of this Fourier-domain norm is given at the end of this summary.

The authors perform experiments on CIFAR and report state-of-the-art non-progressive results in terms of Inception Score (though not beating SNGANs by a statistically significant margin). The samples they present, however, are too small to tell whether the network really cared about the edges.
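To make the Fourier form of the norm concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) that evaluates $||f||_{W^{s,p}}$ for a single grayscale image by multiplying its spectrum by $(1 + ||\xi||^2)^{s/2}$. Treating the image as a periodic function on the unit square and taking the frequencies from `np.fft.fftfreq` are simplifying assumptions made here.

```python
# Minimal sketch (not the authors' code): the Fourier form of the Sobolev norm
# ||f||_{W^{s,p}} for a single grayscale image f, treated as a function on a
# periodic unit square sampled on an (h, w) grid.
import numpy as np

def sobolev_norm(f, s=1.0, p=2.0):
    """Approximate ||f||_{W^{s,p}} of a 2-D array f via the FFT."""
    h, w = f.shape
    # Frequencies in cycles over the whole image, so ||xi||^2 at each coefficient.
    fy = np.fft.fftfreq(h, d=1.0 / h)[:, None]      # shape (h, 1)
    fx = np.fft.fftfreq(w, d=1.0 / w)[None, :]      # shape (1, w)
    xi_sq = fx ** 2 + fy ** 2                       # shape (h, w)

    # Multiply the spectrum by (1 + ||xi||^2)^{s/2}, then transform back.
    F = np.fft.fft2(f)
    g = np.fft.ifft2((1.0 + xi_sq) ** (s / 2.0) * F).real

    # L_p norm of the filtered image (mean approximates the integral over [0,1]^2).
    return (np.abs(g) ** p).mean() ** (1.0 / p)

# Usage: a smooth image gets a much smaller W^{1,2} norm than a noisy one,
# which is the edge/variation sensitivity the summary talks about.
rng = np.random.default_rng(0)
smooth = np.outer(np.hanning(64), np.hanning(64))
noisy = smooth + 0.1 * rng.standard_normal((64, 64))
print(sobolev_norm(smooth), sobolev_norm(noisy))
```

The same filtering step, implemented with a differentiable FFT, is what lets one backpropagate through the norm (and hence through the gradient penalty) during training.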
Variational Inference builds around the ELBO (Evidence Lower BOund) -- a lower bound on the marginal log-likelihood of the observed data $\log p(x) = \log \int p(x, z) dz$ (which is typically intractable). The ELBO makes use of an approximate posterior to form a lower bound: $$ \log p(x) \ge \mathbb{E}_{q(z|x)} \log \frac{p(x, z)}{q(z|x)} $$

# Introduction to Quasi Monte Carlo

It's assumed that both the joint $p(x, z)$ (or, equivalently, the likelihood $p(x|z)$ and the prior $p(z)$) and the approximate posterior $q(z|x)$ are tractable (have a closed-form density and are easy to sample from). One can then estimate the ELBO via Monte Carlo as $$ \text{ELBO} \approx \frac{1}{N} \sum_{n=1}^N \log \frac{p(x, z_n)}{q(z_n|x)}, \quad\quad z_n \sim q(z|x) $$ This estimate can be used in stochastic optimization, essentially maximizing the ELBO stochastically, which leads to either increasing the marginal log-likelihood or decreasing the gap between the true posterior $p(z|x)$ and the approximate one $q(z|x)$.

The efficiency of stochastic optimization depends on the amount of stochasticity: the bigger the variance, the harder it is to locate the optimum. It's well known that in typical Monte Carlo the variance scales as $1/N$ for a sample of size $N$, and hence the typical error of such an "approximation" is of order $1/\sqrt{N}$.

However, there are more efficient schemes for evaluating integrals of this expectation form. To get some intuition, consider $$ \mathbb{E}_{q(z)} f(z) = \int_\mathcal{Z} f(z) q(z) dz = \int_{[0, 1]^d} f(z(u)) du $$ Here I used the fact that any random variable can be expressed as a deterministic transformation of a uniform r.v. (by applying the inverse CDF of the target r.v.), so estimating the expectation using MC essentially means sampling a bunch of uniform r.v.s $u_1, \dots, u_N$ and transforming them into the corresponding $z$s. However, uniformly distributed random variables sometimes clump together and leave some areas uncovered:

![](https://i.imgur.com/fejsl2t.png)

Low-discrepancy sequences are designed to cover the unit cube more uniformly, in the sense that the points are unlikely to clump and should not leave "holes" in the landscape, effectively facilitating better exploration. Quasi Monte Carlo then evaluates the integrand at the points of such a sequence, giving a (deterministic!) approximation with an error of order $\tfrac{(\log N)^d}{N}$. If you want some randomness, there are clever randomization techniques that give you Randomized Quasi Monte Carlo with roughly the same guarantees.

# RQMC applied to VI

The authors estimate the ELBO using samples obtained from Randomized QMC (a scrambled Sobol sequence, in particular) and show experimentally that this leads to lower gradient variance and faster convergence. A toy comparison of plain MC and RQMC ELBO estimates is sketched at the end of this summary.

# Theoretical Properties

The authors also analyze Stochastic Gradient Descent with RQMC and prove several convergence theorems. To the best of my knowledge, this is the first work considering stochastic optimization using QMC (which is understandable, given that one needs to be able to control the gradients to do so).

# Critique

The paper was a great read and spurred a great interest in me. I find the idea of using QMC very intriguing; however, in my opinion there are several problems on the road to mass adoption:
1. The authors use RQMC to get the stochastic nature of $z_n$; however, that essentially changes the effective distribution of the generated $z$, which should be accounted for in the ELBO. Otherwise the objective they're maximizing is not an ELBO (except, perhaps, asymptotically) and hence not necessarily a lower bound on the marginal log-likelihood. At the same time, finding the correct proposal density $q(z|x)$ (and successfully using it) does not seem easy, as most randomization schemes give you degenerate support and the KL is not well defined.
2. The authors have an experiment on a Bayesian Neural Network, but a very small one; there are reasons to doubt the results will translate to real-scale problems, as the positive effect of QMC vanishes as the dimension grows (in high dimensions it's already harder for uniform samples to clump together).
3. Standard control variates might no longer reduce the variance; further research is needed.
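To illustrate the MC-vs-RQMC comparison discussed above, here is a small self-contained sketch on a toy 1-D conjugate Gaussian model of my own choosing (not the paper's experiments), estimating the ELBO with plain Monte Carlo and with a scrambled Sobol sequence from `scipy.stats.qmc`. The model, the variational parameters, and the sample sizes are all illustrative assumptions.

```python
# Toy sketch: plain MC vs. randomized QMC (scrambled Sobol) estimates of a
# 1-D ELBO.  Model assumptions: p(z) = N(0, 1), p(x|z) = N(z, 1), observed
# x = 1.0, approximate posterior q(z|x) = N(0.4, 0.6^2).
import numpy as np
from scipy.stats import norm, qmc

x_obs, m, s = 1.0, 0.4, 0.6

def elbo_from_uniforms(u):
    """Map uniforms to z ~ q via the inverse CDF, then average the ELBO integrand."""
    u = np.clip(u, 1e-12, 1 - 1e-12)                  # guard against ppf(0) = -inf
    z = m + s * norm.ppf(u)                           # samples from q(z|x)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x_obs, z, 1)
    log_q = norm.logpdf(z, m, s)
    return np.mean(log_joint - log_q)

N, repeats = 64, 200
rng = np.random.default_rng(0)

mc = [elbo_from_uniforms(rng.uniform(size=N)) for _ in range(repeats)]
rqmc = [elbo_from_uniforms(qmc.Sobol(d=1, scramble=True, seed=r).random(N).ravel())
        for r in range(repeats)]

print("MC   estimate: %.4f +- %.4f" % (np.mean(mc), np.std(mc)))
print("RQMC estimate: %.4f +- %.4f" % (np.mean(rqmc), np.std(rqmc)))
```

On this toy problem both estimators agree in expectation, but the RQMC estimate typically shows a noticeably smaller standard deviation across repetitions -- the same reduced-variance effect the authors exploit for gradient estimates.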
If one is a Bayesian, one best expresses beliefs about the next observation $x_{n+1}$ after observing $x_1, \dots, x_n$ using the **posterior predictive distribution** $p(x_{n+1}\vert x_1, \dots, x_n)$. Typically one invokes the de Finetti theorem and assumes there exists an underlying model $p(x\vert\theta)$, hence $p(x_{n+1}\vert x_1, \dots, x_n) = \int p(x_{n+1} \vert \theta) p(\theta \vert x_1, \dots, x_n) d\theta$; however, this integral is far from tractable in most cases. Nevertheless, having a tractable posterior predictive is useful in cases like few-shot generative learning, where we only observe a few instances of a given class and are asked to produce more of it.

In this paper the authors take a slightly different approach and build a neural model with a tractable posterior predictive distribution $p(x_{n+1} | x_1, \dots, x_n)$ suited for complex objects like images. To do so, they take a simple model with a tractable posterior predictive $p(z_{n+1} | z_1, \dots, z_n)$ (like a Gaussian Process, but not quite) and use it as a latent code, which is obtained from observations using an analytically invertible encoder $f$. This setup lets you take a complex $x$ like an image and run it through $f$ to obtain $z = f(x)$ -- a simplified latent representation for which it's easier to build a joint density over all possible representations, and hence easier to model the posterior predictive. By feeding the latent representations of $x_1, \dots, x_n$ (namely, $z_1, \dots, z_n$) to the posterior predictive $p(z_{n+1} | f(x_1), \dots, f(x_n))$, we obtain a distribution over latent representations that are coherent with those of the already observed $x$s. By sampling $z$ from this distribution and running it through $f^{-1}$, we recover an object in the observation space, $x_\text{pred} = f^{-1}(z)$ -- a sample coherent with the previous observations.

The important choices are:

* The model for the latent representations $z$: one could use a Gaussian Process, however the authors claim it lacks some helpful properties and go for the more general [Student-T Process](http://www.shortscience.org/paper?bibtexKey=journals/corr/1402.4306). The authors then assume that each component of $z$ is a univariate sample from this process (and hence is independent of the other components).
* The encoder $f$: it has to be easily invertible and have an easy-to-evaluate Jacobian (the determinant of the Jacobian matrix). The former is needed to decode predictions from the latent representation space, and the latter is used to efficiently compute the density of observations $p(x_1, \dots, x_n)$ using the standard change-of-variables formula $$p(x_1, \dots, x_n) = p(z_1, \dots, z_n) \left\vert\text{det} \frac{\partial f(x)}{\partial x} \right\vert$$ The architecture of choice for this task is [RealNVP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1605.08803). A minimal sketch of this change-of-variables computation is given below.

Done this way, it's possible to write out the marginal density $p(x_1, \dots, x_n)$ of all the observed $x$s and maximize it (as in Maximum Likelihood Estimation).
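Below is a minimal NumPy sketch of the change-of-variables computation with a single RealNVP-style affine coupling layer standing in for the encoder $f$. The tiny scale/shift network, the standard normal base density (used here in place of the Student-t process of the paper), and the toy dimensions are all assumptions made purely for illustration.

```python
# Illustrative stand-in, not the authors' architecture: one affine coupling
# layer with an analytic inverse and a cheap log-det-Jacobian.
import numpy as np

rng = np.random.default_rng(0)
D = 4                                                # toy observation dimension
W1, b1 = 0.1 * rng.standard_normal((D // 2, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((8, D)), np.zeros(D)

def st(x_a):
    """Tiny MLP producing log-scale s and shift t from the first half of x."""
    h = np.tanh(x_a @ W1 + b1)
    out = h @ W2 + b2
    return out[:, :D // 2], out[:, D // 2:]          # s, t

def f(x):
    """Encoder z = f(x): identity on x_a, affine transform of x_b."""
    x_a, x_b = x[:, :D // 2], x[:, D // 2:]
    s, t = st(x_a)
    z = np.concatenate([x_a, x_b * np.exp(s) + t], axis=1)
    log_det = s.sum(axis=1)                          # log |det df/dx|
    return z, log_det

def f_inv(z):
    """Analytic inverse: recover x from z exactly, no optimization needed."""
    z_a, z_b = z[:, :D // 2], z[:, D // 2:]
    s, t = st(z_a)
    return np.concatenate([z_a, (z_b - t) * np.exp(-s)], axis=1)

def log_p_x(x):
    """log p(x) = log p(z) + log|det df/dx|, with a standard normal base p(z)."""
    z, log_det = f(x)
    log_pz = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=1)
    return log_pz + log_det

x = rng.standard_normal((3, D))
z, _ = f(x)
assert np.allclose(f_inv(z), x)                      # invertibility check
print(log_p_x(x))
```

The two properties the summary relies on are visible here: the inverse is analytic (so decoding a sampled $z$ back to an $x$ is exact and cheap), and the log-determinant of the Jacobian is just the sum of the predicted log-scales, so the exact density of $x$ is easy to evaluate.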
The authors choose to factor the joint density in an autoregressive fashion (via the chain rule): $$p(x_1, \dots, x_n) = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_1, x_2) \dots p(x_n \vert x_1, \dots, x_{n-1}) $$ with all the conditionals $p(x_i \vert x_1, \dots, x_{i-1})$ having an analytic density (a Student-t density times the Jacobian term). This allows one to form a fully differentiable recurrent computation graph whose parameters (the parameters of the Student-t processes for each component of $z$ plus the parameters of the encoder $f$) can be learned using any stochastic gradient method.

![](https://i.imgur.com/yRrRaMs.png)
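Here is a sketch of that recurrent, chain-rule evaluation of the latent joint density. An exchangeable Gaussian process is used below as a simplified stand-in for the paper's Student-t process; the hyperparameters $\mu$, $v$, $\rho$ are made up, and the per-dimension independence follows the summary above.

```python
# Simplified stand-in for the recurrent predictive computation: a Gaussian
# exchangeable process (every pair of observations in a dimension has
# covariance rho, each has variance v and mean mu) instead of the paper's
# Student-t process.
import numpy as np
from scipy.stats import norm

def predictive_log_lik(Z, mu=0.0, v=1.0, rho=0.5):
    """log p(z_1,...,z_n) = sum_i log p(z_i | z_{<i}), summed over dimensions.

    Z has shape (n, d); each of the d latent dimensions is modeled
    independently, as in the summary above.
    """
    n, d = Z.shape
    total = 0.0
    for i in range(n):
        k = i                                          # points seen so far
        if k == 0:
            pred_mean, pred_var = mu * np.ones(d), v * np.ones(d)
        else:
            z_bar = Z[:k].mean(axis=0)
            denom = v + (k - 1) * rho
            pred_mean = mu + k * rho * (z_bar - mu) / denom
            pred_var = (v - k * rho ** 2 / denom) * np.ones(d)
        total += norm.logpdf(Z[i], pred_mean, np.sqrt(pred_var)).sum()
    return total

# Usage: latent codes z_i = f(x_i) of a few objects from the same class; the
# full log-density of the x's would add the log|det df/dx| term for each x_i.
Z = np.random.default_rng(0).standard_normal((5, 3))
print(predictive_log_lik(Z))
```

In the full model, each conditional would additionally include the log-Jacobian of $f$ for the corresponding $x_i$, and the predictive mean/variance recursions would be those of the Student-t process; the overall structure -- a differentiable loop over observations -- is the same.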