A Bayesian expresses beliefs about the next observation $x_{n+1}$, after observing $x_1, \dots, x_n$, through the **posterior predictive distribution** $p(x_{n+1} \vert x_1, \dots, x_n)$. Typically one invokes de Finetti's theorem and assumes there exists an underlying model $p(x \vert \theta)$, hence $p(x_{n+1} \vert x_1, \dots, x_n) = \int p(x_{n+1} \vert \theta) p(\theta \vert x_1, \dots, x_n) d\theta$; however, this integral is far from tractable in most cases. Nevertheless, a tractable posterior predictive is useful in settings like few-shot generative learning, where we observe only a few instances of a given class and are asked to produce more of it.

In this paper the authors take a slightly different approach and build a neural model with a tractable posterior predictive distribution $p(x_{n+1} \vert x_1, \dots, x_n)$ suited for complex objects like images. To do so, they take a simple model with a tractable posterior predictive $p(z_{n+1} \vert z_1, \dots, z_n)$ (like a Gaussian Process, but not quite) and use it over a latent code, which is obtained from observations using an analytically invertible encoder $f$. This setup lets you take a complex $x$ like an image and run it through $f$ to obtain $z = f(x)$ -- a simplified latent representation for which it is easier to build a joint density over all possible representations, and hence easier to model the posterior predictive. By feeding the latent representations of $x_1, \dots, x_n$ (namely $z_1, \dots, z_n$) into the posterior predictive $p(z_{n+1} \vert f(x_1), \dots, f(x_n))$ we obtain a distribution over latent representations coherent with those of the already observed $x$s. By sampling $z$ from this distribution and running it through $f^{-1}$ we recover an object in the observation space, $x_\text{pred} = f^{-1}(z)$ -- a sample coherent with the previous observations.

The important choices are:

* The model for the latent representations $z$: one could use a Gaussian Process, but the authors claim it lacks some helpful properties and go for the more general [Student-T Process](http://www.shortscience.org/paper?bibtexKey=journals/corr/1402.4306). They then assume that each component of $z$ is a univariate sample from this process (and hence independent of the other components).
* The encoder $f$: it has to be easily invertible and have an easy-to-evaluate Jacobian (the determinant of the Jacobian matrix). The former is needed to decode predictions from the latent representation space, and the latter is used to efficiently compute the density of observations $p(x_1, \dots, x_n)$ via the standard change of variables formula $$p(x_1, \dots, x_n) = p(z_1, \dots, z_n) \left\vert \text{det} \frac{\partial f(x)}{\partial x} \right\vert$$ The architecture of choice for this task is [RealNVP](http://www.shortscience.org/paper?bibtexKey=journals/corr/1605.08803).

Done this way, it is possible to write out the marginal density $p(x_1, \dots, x_n)$ of all the observed $x$s and maximize it (as in Maximum Likelihood Estimation).
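To make the predict-then-decode pipeline concrete, here is a minimal numpy sketch. The elementwise affine encoder and the conjugate Gaussian latent model are stand-ins of my own, not the paper's RealNVP encoder or Student-t process:

```python
import numpy as np

# Toy stand-ins: an elementwise invertible encoder (not RealNVP) and a conjugate
# Gaussian latent model per component (not a Student-t process).
a, b = 1.5, 0.2                      # hypothetical encoder parameters
f = lambda x: a * x + b              # analytically invertible encoder
f_inv = lambda z: (z - b) / a

sigma2, tau2 = 0.1, 1.0              # per-component noise / prior variances

def predictive(z_obs):
    # posterior predictive p(z_{n+1} | z_1..z_n) for z_i | mu ~ N(mu, sigma2), mu ~ N(0, tau2)
    n = z_obs.shape[0]
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * z_obs.sum(axis=0) / sigma2
    return post_mean, post_var + sigma2

rng = np.random.default_rng(0)
x_obs = rng.normal(size=(5, 3))            # 5 observed "images" with 3 pixels each
z_obs = f(x_obs)                           # latent representations of the observations

mean, var = predictive(z_obs)
z_next = rng.normal(mean, np.sqrt(var))    # sample a coherent latent representation
x_pred = f_inv(z_next)                     # decode it back to observation space
print(x_pred)
```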
The authors choose to factor the joint density in an auto-regressive fashion (via the chain rule) $$p(x_1, \dots, x_n) = p(x_1) p(x_2 \vert x_1) p(x_3 \vert x_1, x_2) \dots p(x_n \vert x_1, \dots, x_{n-1})$$ with all the conditional marginals $p(x_i \vert x_1, \dots, x_{i-1})$ having an analytic density (a Student-t density times the Jacobian term). This allows one to form a fully differentiable recurrent computation graph whose parameters (the parameters of the Student-t processes for each component of $z$, plus the parameters of the encoder $f$) can be learned using any stochastic gradient method.

https://i.imgur.com/yRrRaMs.png
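A similarly toy sketch of the autoregressive likelihood that gets maximized: each step contributes an analytic conditional plus the Jacobian correction (again a Gaussian stand-in for the Student-t process and a 1-D affine stand-in for the encoder, not the paper's model):

```python
import numpy as np

a, b = 1.5, 0.2                      # hypothetical encoder parameters
sigma2, tau2 = 0.1, 1.0              # per-component noise / prior variances

def log_normal(z, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def joint_log_density(x_seq):
    log_p, n, s = 0.0, 0, 0.0
    for x in x_seq:
        z = a * x + b                                   # encode
        post_var = 1.0 / (1.0 / tau2 + n / sigma2)      # posterior over the latent mean
        post_mean = post_var * s / sigma2
        log_p += log_normal(z, post_mean, post_var + sigma2)  # log p(z_i | z_1..z_{i-1})
        log_p += np.log(abs(a))                         # log |df/dx|, change of variables
        n, s = n + 1, s + z
    return log_p

print(joint_log_density([0.4, 0.1, -0.2, 0.5]))  # = sum of conditional log-densities
```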
---
This recent paper, a collaboration involving some of the authors of MAML, proposes an intriguing application of techniques developed in the field of meta learning to the problem of unsupervised learning - specifically, the problem of developing representations without labeled data which can then be used to learn quickly from a small amount of labeled data. As a reminder, the idea behind meta learning is that you train models on multiple different tasks, using only a small amount of data from each task, and update the model based on its test-set performance. The conceptual advance proposed by this paper is to adopt the broad strokes of the meta learning framework, but apply it to unsupervised data, i.e. data with no pre-defined supervised tasks. The goal of such a project is, as so often is the case with unsupervised learning, to learn representations - specifically, representations we believe might be useful over a whole distribution of supervised tasks. However, to apply traditional meta learning techniques we need that aforementioned distribution of tasks, and we've defined our problem as being over unsupervised data. How exactly are we supposed to construct the former out of the latter? This may seem a little circular, or strange, or definitionally impossible: how can we generate supervised tasks without supervised labels?

https://i.imgur.com/YaU1y1k.png

The artificial tasks created by this paper are rooted in mechanically straightforward operations, but conceptually interesting ones all the same: it uses an off-the-shelf unsupervised learning algorithm to generate a fixed-width vector embedding of your input data (say, images), then generates multiple different clusterings of the embedded data, and then uses those cluster IDs as labels in a faux-supervised task (see the short sketch at the end of this summary). It manages to get multiple different tasks, rather than just one - remember, the premise of meta learning is models learned over multiple tasks - by randomly up- and down-scaling dimensions of the embedding before clustering is applied. Different scalings of the dimensions put different points close to one another, which produces different partitions of the dataset into clusters. With this distribution of "supervised" tasks in hand, the paper simply applies previously proposed meta learning techniques - like MAML, which learns a model that can be quickly fine-tuned on a new task, or prototypical networks, which learn an embedding space in which observations from the same class, across many possible class definitions, are close to one another.

https://i.imgur.com/BRcg6n7.png

An interesting note from the evaluation is that this method - somewhat amusingly dubbed "CACTUs" - performs best relative to alternative baselines in cases where the true underlying class distribution on which the model is meta-trained is most different from the underlying class distribution on which the model is tested. Intuitively, this makes reasonable sense: meta learning is designed to trade off knowledge of any given specific task against the flexibility to be performant on a new class division, and so it gets the most value from that trade-off when a genuinely dissimilar class split is seen during testing.

One other quick thing I'd like to note is the set of implicit assumptions this model builds on, in the way it creates its unsupervised tasks.
First, it leverages the smoothness assumptions of classes - that is, it assumes that the kinds of classes we might want our model to eventually perform on are close together in some idealized conceptual space. While not a perfect assumption (there's a reason we don't use KNN over embeddings for all of our ML tasks), it does have a general reasonableness behind it, since the kinds of classes we care about are rarely very conceptually heterogeneous. Second, it assumes that a truly unsupervised learning method can learn a representation that, despite being itself sub-optimal as a basis for supervised tasks, is a well-enough designed feature space for the general heuristic of "nearby things are likely of the same class" to at least approximately hold.

I find this set of assumptions interesting because they are so simplifying that it's a bit of a surprise that they actually work: even if the "classes" we meta-train our model on are defined by simple Euclidean rules, optimizing to be able to perform that separation using little data does indeed seem to transfer to the general problem of "separating real-world, messier-in-embedding-space classes using little data".
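As promised above, here is a minimal sketch of the task-construction step. The random embeddings, scaling range, and use of k-means are my own stand-ins rather than the paper's exact recipe:

```python
import numpy as np
from sklearn.cluster import KMeans

# Start from unsupervised embeddings, rescale dimensions at random, cluster, and use
# the cluster IDs as labels for a faux-supervised task.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))   # stand-in for learned unsupervised embeddings

def make_task(embeddings, n_classes=5, rng=rng):
    # random per-dimension scaling changes which points end up near one another
    scales = rng.uniform(0.1, 10.0, size=embeddings.shape[1])
    labels = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(
        embeddings * scales)
    return labels

# a distribution of "supervised" tasks over the same unlabeled data
tasks = [make_task(embeddings) for _ in range(20)]
```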
---
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that the distributions learned by generative models are not yet robust enough to be employed in this way.

* **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results.
* **Cons (-):** it would be interesting to provide further results for different datasets / domain shifts, to observe whether this property can be quantified as a characteristic of the model or of the input data.

---

## Experimental negative result

Three classes of generative models are considered in this paper:

* **Auto-regressive** models such as `PixelCNN` [1]
* **Latent variable** models, such as `VAEs` [2]
* Generative models with **invertible flows** [3], in particular `Glow` [4].

The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR-10, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although the effect is weaker in the `PixelCNN` case.

Intuitively, this could come from the fact that both datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and the result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods do not necessarily coincide with better generated samples [5].

---

## Further analysis for invertible flow models

The authors further study this phenomenon in the case of invertible flow models, as these provide a more rigorous analytical framework (exact likelihood inference, unlike VAEs, which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (invertible function) $f(x; \phi)$ between the input space $\mathcal X$ and the latent space $\mathcal Z$, and by a choice of latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows:

$$ p_x(x) = p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| $$

And the training objective under this transformation becomes

$$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$

Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is built such that computing the log-determinant of the Jacobian in the previous equation can be done efficiently.
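As a toy illustration of the two terms in this objective (with an elementwise affine map of my own standing in for Glow), the per-sample log-likelihood splits into a density term and a volume term:

```python
import numpy as np

# "Density" term log p_z(f(x)) vs. "volume" term log|det df/dx| for a toy flow:
# an elementwise affine map with fixed parameters, a stand-in rather than Glow.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=8)   # hypothetical per-dimension scales
b = rng.normal(size=8)              # hypothetical per-dimension shifts

def log_px_terms(x):
    z = a * x + b
    density_term = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))  # standard-normal p_z
    volume_term = np.sum(np.log(np.abs(a)))                   # log|det Jacobian|
    return density_term, volume_term

x = rng.normal(size=8)
d, v = log_px_terms(x)
print(f"density term: {d:.2f}, volume term: {v:.2f}, log p(x): {d + v:.2f}")
```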
First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experiments with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***.

Secondly, they try to directly analyze the difference in likelihood between the two domains $\mathcal X$ and $\tilde{\mathcal X}$; this can be done via a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR-10's, as observed in practice.

---

## References

* [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., NeurIPS 2016
* [2] Auto-Encoding Variational Bayes, Kingma and Welling, ICLR 2014
* [3] Density estimation using Real NVP, Dinh et al., ICLR 2017
* [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, NeurIPS 2018
* [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016
---
Gowal et al. propose interval bound propagation to obtain certified robustness against adversarial examples. In particular, given a neural network consisting of linear layers and monotonically increasing activation functions, a set of allowed perturbations is propagated to obtain upper and lower bounds at each layer. These lead to bounds on the logits of the network, which are used to verify whether the network changes its prediction on the allowed perturbations.

Specifically, Gowal et al. consider an $L_\infty$ ball around input examples; the initial bounds are, thus, $\underline{z}_0 = x - \epsilon$ and $\overline{z}_0 = x + \epsilon$. For each layer, the lower bound is defined as

$$\underline{z}_{k,i} = \min_{\underline{z}_{k-1} \leq z_{k-1} \leq \overline{z}_{k-1}} e_i^T h_k(z_{k-1})$$

with the analogous maximization problem for the upper bound; here, $h_k$ denotes the applied layer. For linear layers and monotonic activation functions, this is easy to solve, as shown in the paper. Moreover, computing these bounds is very efficient, requiring only roughly twice the computation of one forward pass.

During training, a combination of a clean loss and an adversarial loss is used:

$$\kappa l(z_K, y) + (1 - \kappa) l(\hat{z}_K, y)$$

where $z_K$ are the logits of the input $x$, and $\hat{z}_K$ are the adversarial logits computed as

$$\hat{z}_{K,y'} = \begin{cases} \overline{z}_{K,y'} & \text{if } y' \neq y\\ \underline{z}_{K,y} & \text{otherwise}\end{cases}$$

Both $\epsilon$ and $\kappa$ are annealed during training. In experiments, it is shown that this method results in quite tight bounds on robustness.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
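As an aside, a minimal numpy sketch of the propagation rule for one affine layer followed by a monotonic activation (toy weights of my own, not the paper's implementation):

```python
import numpy as np

def linear_bounds(lower, upper, W, b):
    # propagate an axis-aligned box through z -> W z + b
    mu = (upper + lower) / 2.0          # centre of the box
    r = (upper - lower) / 2.0           # radius of the box
    mu_out = W @ mu + b
    r_out = np.abs(W) @ r
    return mu_out - r_out, mu_out + r_out

def relu_bounds(lower, upper):
    # monotonic activations can simply be applied to both bounds
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

x = np.array([0.5, -0.2])
eps = 0.1
lower, upper = x - eps, x + eps          # initial L_inf ball around the input

W = np.array([[1.0, -2.0], [0.5, 0.3]])  # hypothetical layer weights
b = np.array([0.1, -0.1])
lower, upper = relu_bounds(*linear_bounds(lower, upper, W, b))
print(lower, upper)                      # elementwise bounds on the activations
```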
---
Meta learning, or the idea of training models on some distribution of tasks in the hope that they can then learn more quickly on new tasks because they have "learned how to learn" similar tasks, has become a more central and popular research field in recent years. Although there is a veritable zoo of different techniques (to an amusingly literal degree; there's an emergent fad of naming new methods after animals), the general idea is: have your inner loop consist of training a model on some task drawn from a distribution over tasks (be that maze learning with different wall configurations, letter identification from different languages, etc.), and have the outer loop, which updates some structural part of your model, be based on improving generalization error on each task within the distribution. It's been demonstrated that meta-learned systems can in fact learn more quickly (at least when their tasks are "in distribution" relative to the distribution they were trained on, which is an important point to be cognizant of), but this paper is less interested in how much better or faster they're learning, and more interested in whether there are qualitative differences in the way normal learning systems and meta-trained learning systems go about learning a new task. The author (oddly for DeepMind, which typically goes in for super long author lists, there's only the one on this paper) goes about this by studying simple learning tasks where it's easier for us to introspect into what each model is learning over time.

https://i.imgur.com/ceycq46.png

In the first test, he looks at linear regression in a simple setting: for each individual "task", data is generated according to a known true weight matrix (sampled from a prior over weight matrices), with some noise added in. Given this weight matrix, he takes the singular value decomposition (think: PCA), and so ends up with a factorized representation of the weights, where factors, or "modes", with higher eigenvalues represent larger-scale patterns that explain more variance, and lower eigenvalues are smaller-scale refinements on top of that. He can apply this same procedure to the weights the network has learned at any given point in training, and compare, to see how close the network is to having correctly captured each of these different modes. When normal learners (starting from a raw initialization) approach the task, they start by matching the large-scale (higher-eigenvalue) factors of variation, and then over the course of training improve performance on the higher-precision factors. By contrast, meta learners, in addition to learning faster, also learn large-scale and small-scale modes at the same rate. Similar analysis was performed, and similar results found, for nonlinear regression, where instead of PCA-style components the function generating the data was decomposed into different Fourier frequencies, and the normal learner learned the broad, low-frequency patterns first, whereas the meta learner learned them all at the same rate.

The paper finds intuition for this by showing that the behavior of the meta learners matches quite well against how a Bayes-optimal learner would update on new data points, in a world where that learner had a prior over the data-generating weights that matched the true generating process. So, under this framing, the process of meta learning is roughly equivalent to your model learning a prior corresponding to the task distribution it was trained on.
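To make the mode-tracking analysis concrete, here is a toy version of it (my own sketch, not the author's code):

```python
import numpy as np

# Decompose the true data-generating weight matrix with an SVD and measure how well
# a learned weight matrix captures each mode; large-scale modes come first in s.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(10, 10))
U, s, Vt = np.linalg.svd(W_true)

def mode_recovery(W_learned):
    # project the learned weights onto each true mode and compare to the true eigenvalue
    return np.array([U[:, i] @ W_learned @ Vt[i, :] for i in range(len(s))]) / s

# e.g. a partially trained learner that has only picked up the top 3 modes
W_partial = (U[:, :3] * s[:3]) @ Vt[:3, :]
print(np.round(mode_recovery(W_partial), 2))  # ~1 for the top 3 modes, ~0 elsewhere
```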
This is, at a high level, what I think we all sort of thought was happening with meta learning, but it's pretty neat to see it laid out in a small enough problem where we can actually validate it against an analytic model.

A bit of a meta (heh) point: I wish this paper had more explanation of why the author chose the specific eigenvalue-focused metrics of task-learning progression that he did. They seem reasonable, but I'd have been curious to see an explication of what is captured by these, and what might be captured by alternative metrics of task progress.

(A side note: the paper also contained a reinforcement learning experiment, but I both understood that one less well and also feel like it wasn't really that analogous to the other tests.)