The authors introduce a new, sampling-free method for training and evaluating energy-based models (EBMs, also known as unnormalized density models). There are two broad approaches for training EBMs. Sampling-based approaches like contrastive divergence estimate the gradient of the log-likelihood with MCMC, but the estimate can be biased if the chain is not sufficiently long. The speed of training also depends heavily on the sampling parameters. Other approaches, like score matching, avoid sampling by optimizing a surrogate objective that approximates the likelihood. However, using a surrogate objective also introduces bias in the solution. In any case, comparing the goodness of fit of different models is challenging, regardless of how the models were trained.

The authors introduce a measure of distance between probability distributions $p$ and $q$ called the Learned Stein Discrepancy ($LSD$):

$$ LSD(f_{\phi}, p, q) = \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T f_{\phi}(x) + Tr(\nabla_x f_{\phi} (x))] $$

This measure is derived from the Stein Discrepancy $SD(p,q)$. Note that, like the $SD$, the $LSD$ at its optimal critic is 0 iff $p = q$. Typically, $p$ is the data distribution and $q$ is the learned approximate distribution (an EBM), although this doesn't have to be the case. Note also that this objective only requires a differentiable unnormalized distribution $\tilde{q}$, and does not require MCMC sampling or computation of the normalizing constant $Z$, since $Z$ does not depend on $x$: $\nabla_x \log q(x) = \nabla_x \log \tilde{q}(x) - \nabla_x \log Z = \nabla_x \log \tilde{q}(x)$. $f_\phi$ is known as the critic function, and maximizing the $LSD$ with respect to $\phi$ (e.g. with gradient ascent) over a bounded space of functions $\mathcal{F}$ can approximate the $SD$ over that space.
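To make the definition concrete, here is a minimal one-dimensional sketch of a Monte Carlo $LSD$ estimate for a fixed (not optimized) critic. It is not the authors' code: the Gaussian choices for $p$ and $\tilde{q}$, the linear critic, the helper names, and the use of finite differences in place of automatic differentiation are all assumptions made for illustration. In one dimension the trace term reduces to the ordinary derivative $f'(x)$.

```python
import random

def score(log_q_tilde, x, h=1e-5):
    """Central finite-difference estimate of d/dx log q~(x) (stand-in for autodiff)."""
    return (log_q_tilde(x + h) - log_q_tilde(x - h)) / (2 * h)

def derivative(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x); in 1-D this is Tr(grad f)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def lsd_estimate(critic, log_q_tilde, samples):
    """Monte Carlo estimate of E_p[ score_q(x) * f(x) + f'(x) ] from samples of p."""
    total = 0.0
    for x in samples:
        total += score(log_q_tilde, x) * critic(x) + derivative(critic, x)
    return total / len(samples)

def make_log_q_tilde(mu, log_scale=0.0):
    """Unnormalized log-density of a unit-variance Gaussian with mean mu.

    log_scale adds a constant to log q~ (i.e. rescales q~), which mimics an
    unknown normalizing constant and should leave the LSD unchanged.
    """
    return lambda x: -0.5 * (x - mu) ** 2 + log_scale

def critic(x):
    """Hypothetical fixed critic f(x) = 0.5 x + 1 (not trained)."""
    return 0.5 * x + 1.0

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # samples from p = N(0, 1)

lsd_match = lsd_estimate(critic, make_log_q_tilde(0.0), samples)  # q equals p
lsd_shift = lsd_estimate(critic, make_log_q_tilde(1.0), samples)  # q has the wrong mean
lsd_shift_rescaled = lsd_estimate(
    critic, make_log_q_tilde(1.0, log_scale=7.0), samples
)

print(lsd_match, lsd_shift, lsd_shift_rescaled)
```

For this particular critic the expectation can be worked out by hand: it is 0 when $q = p$ and 1 when the mean of $q$ is shifted to 1, so the two estimates should land near those values. Rescaling $\tilde{q}$ by a constant leaves the estimate unchanged up to floating-point error, which is the practical meaning of not needing $Z$.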
The authors choose to define the function space $\mathcal{F} = \{ f: \mathbb{E}_{p(x)} [f(x)^Tf(x)] < \infty \}$, which is convenient because the constraint can be handled by introducing a simple L2 regularizer on the critic's output: $\mathcal{R}_\lambda (f_\phi) = \lambda \mathbb{E}_{p(x)} [f_\phi(x)^T f_\phi(x)]$. Since the trace of the Jacobian is expensive to compute and backpropagate through, the authors use a single-sample Monte Carlo estimate $Tr(\nabla_x f_\phi(x)) \approx \mathbb{E}_{\mathcal{N}(\epsilon|0,I)} [\epsilon^T \nabla_x f_\phi(x) \epsilon] $, which is more efficient since $\epsilon^T \nabla_x f_\phi(x)$ is a vector-Jacobian product. The overall objective is thus the following:

$$ \arg\max_\phi \mathbb{E}_{p(x)} \left[ \nabla_x \log q(x)^T f_{\phi}(x) + \mathbb{E}_{\epsilon} [\epsilon^T \nabla_x f_{\phi} (x) \epsilon] - \lambda f_\phi(x)^T f_\phi(x) \right] $$

It is possible to compare two different EBMs $q_1$ and $q_2$ by optimizing the above objective separately for each model, giving two sets of critic parameters $\phi_1$ and $\phi_2$. The training and validation data are used for critic optimization, and the $LSD$ is then evaluated on the held-out test set. Note that when computing the $LSD$ on the test set, the exact trace can be computed instead of the Monte Carlo approximation to reduce variance, since gradients are no longer required. The model whose $LSD$ is closer to 0 has achieved a better fit. Similarly, a hypothesis test using the $LSD$ can be used to test if $p = q$ for the data distribution $p$ and model distribution $q$.

The authors then show how the EBM parameters $\theta$ can themselves be optimized by gradient descent on the $LSD$ objective, in a minimax problem that is similar to the problem of optimizing a generative adversarial network (GAN). For a given $\theta$, you first optimize the critic $f_\phi$ w.r.t. $\phi$ to try to get $LSD(f_\phi, p, q_\theta)$ close to its theoretical optimum for the current $q_\theta$, then you take a single gradient step along $\nabla_\theta LSD$ to minimize the $LSD$.
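The Monte Carlo trace estimate described above can be checked numerically on a toy critic. The sketch below is an illustration under stated assumptions, not the paper's implementation: the critic $f(x) = Ax + \tanh(x)$, the matrix `A`, and all function names are hypothetical, and a finite-difference Jacobian-vector product stands in for the autodiff vector-Jacobian product a real implementation would use. Many probes are averaged here only to show that the estimator is unbiased; the summary above describes using a single $\epsilon$ sample per data point during training.

```python
import math
import random

# Hypothetical 3x3 matrix defining the linear part of the toy critic.
A = [[0.5, 0.2, 0.0],
     [-0.1, 0.3, 0.4],
     [0.0, 0.1, -0.2]]

def critic(x):
    """Toy critic f(x) = A x + tanh(x), standing in for f_phi."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + math.tanh(x[i])
            for i in range(n)]

def exact_trace(x):
    """Exact Tr(df/dx) for the toy critic: Tr(A) + sum_i (1 - tanh(x_i)^2)."""
    return sum(A[i][i] + 1.0 - math.tanh(x[i]) ** 2 for i in range(len(x)))

def hutchinson_sample(f, x, rng, h=1e-4):
    """One probe eps^T J eps with eps ~ N(0, I).

    J eps is approximated by a central finite difference of f along eps
    (a Jacobian-vector product), so the full Jacobian is never formed.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + h * e for xi, e in zip(x, eps)]
    x_minus = [xi - h * e for xi, e in zip(x, eps)]
    f_plus, f_minus = f(x_plus), f(x_minus)
    jvp = [(fp - fm) / (2 * h) for fp, fm in zip(f_plus, f_minus)]
    return sum(e * j for e, j in zip(eps, jvp))

rng = random.Random(0)
x = [0.1, -0.3, 0.7]
n_probes = 20_000

# Average many single-probe estimates to compare against the exact trace.
estimate = sum(hutchinson_sample(critic, x, rng) for _ in range(n_probes)) / n_probes
exact = exact_trace(x)

print(estimate, exact)
```

Each probe costs two evaluations of the critic regardless of the dimension, whereas the exact trace of a general network's Jacobian needs one derivative pass per input dimension. That trade of variance for cost is why the exact trace is reasonable at test time but not inside the training loop.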
They show some experiments that indicate that this works pretty well.

One thing that was not clear to me when reading this paper is whether $LSD(f_\phi,p,q)$ should be minimized or maximized with respect to $\phi$ to get it close to the true $SD(p,q)$. Although it is possible for the $LSD$ to be above or below 0 for a given choice of $q$ and $f_\phi$, the problem can always be formulated as minimization by simply changing the sign of $f_\phi$ at the beginning such that the $LSD$ is positive (or as maximization by making it negative).
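The sign question can be made explicit with a short derivation, which uses only the definition of the $LSD$ given earlier and the assumption that the function space is symmetric, i.e. $f \in \mathcal{F}$ implies $-f \in \mathcal{F}$ (this holds for the L2-bounded space above). Because the $LSD$ is linear in the critic,

$$ LSD(-f_\phi, p, q) = \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T (-f_\phi(x)) + Tr(\nabla_x (-f_\phi)(x))] = -LSD(f_\phi, p, q) $$

so that

$$ \sup_{f \in \mathcal{F}} LSD(f, p, q) = -\inf_{f \in \mathcal{F}} LSD(f, p, q) \geq 0 $$

In other words, maximizing over $f$ and minimizing over $-f$ reach the same magnitude, and the two formulations differ only in the sign convention for the critic.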