This paper was a real delight to read, and even though I’m summarizing it here, I’d really encourage you, if you’re reading this, to read the paper itself, since I found it to be unusually clearly written. It tackles the problem of understanding how features of loss functions - these integral, yet arcane, objects defined in millions of parameter-dimensions - impact model performance. Loss function analysis is generally a difficult area, since the number of dimensions and the number of points you’d need to evaluate to calculate the loss are both so high. The latter presents computational challenges, the former ones of understanding: human brains and many-dimensional spaces are not a good fit. Overall, this paper contributes by 1) arguing for a new way of visualizing loss functions, 2) demonstrating how and in what cases “flatness” of the loss function contributes to performance and trainability, and 3) proposing a technique for visualizing the trajectory the optimizer takes through the loss landscape.

The authors review a few historically common ways of visualizing loss functions, before introducing their variant. The simplest, one-dimensional visualization technique, 1-D Linear Interpolation, works by taking two parameter settings (say, a random initialization, and the final network minimum), and smoothly interpolating between the two, by taking a convex combination weighted by a coefficient alpha. Then, you can plot the value of the loss at all of these parameter configurations as a function of alpha. If you want to plot in 2D, with a contour plot, you can do so in a pretty similar manner, by picking two random “direction vectors” of the same size as the parameter vector, and then adding amounts of those directions, weighted by alpha and beta, to your starting point. These random directions become your axes, and you get a snapshot of the change in your loss function as you move along them.

The authors then make the observation that these techniques can’t natively be used to compare two different models, if the parameters of those models are on different scales. If you take a neural net, multiply one layer by 10, and then divide the next layer by 10, you’ve essentially done a no-op that won’t impact the outcome. However, if you’re moving by a fixed amount along your random direction in parameter space, that fixed step corresponds to a much smaller relative change in the layer that’s been multiplied by 10: you’d have to move much farther along that direction to make a commensurate change. To address this problem, they suggest a simple fix: after you’ve selected each of your random directions, scale the value in each direction vector by the norm of the filter that corresponds to that value. This gets rid of the sensitivity of your plots to the scale of weights.

(One thing I admit I’m a little confused by here is the fact that each value in the direction vector corresponds to a filter, rather than to a weight; I would have naively thought theta, and all the direction vectors, are of length number-of-model-parameters, and each value is a single weight. I think I still broadly grasp the intuition, but I’d value having a better sense of this.)

To demonstrate the value of their normalization system, they compare the interpolation plots for models with small and large batch sizes, with and without weight decay. Small batches are known to increase flatness of the loss function around the eventual minimum, which seems to co-occur with good generalization results. And that bears out in the linear interpolation plots for the original models (figs a, b, c), where the small-batch model has the wider solution basin, and also better performance.
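To make the plotting recipes above concrete, here is a minimal sketch (mine, not the authors’ released code) of 1-D linear interpolation and of a 2-D slice along filter-normalized random directions. The parameter layout (a list of arrays whose first axis indexes filters), `loss_fn`, and the toy usage at the end are all assumptions for illustration.

```python
# Minimal sketch of the two visualization recipes described above: 1-D linear
# interpolation between two parameter settings, and a 2-D slice of the loss
# along filter-normalized random directions. Parameters are assumed to be a
# list of numpy arrays whose first axis indexes "filters"; loss_fn(params)
# evaluates the training loss at a parameter setting.
import numpy as np

def interpolate_1d(loss_fn, params_a, params_b, num=25):
    """Loss along the line (1 - alpha) * params_a + alpha * params_b."""
    alphas = np.linspace(0.0, 1.0, num)
    losses = []
    for alpha in alphas:
        point = [(1 - alpha) * a + alpha * b for a, b in zip(params_a, params_b)]
        losses.append(loss_fn(point))
    return alphas, np.array(losses)

def filter_normalized_direction(params, rng):
    """Random direction whose per-filter norms match the corresponding filters."""
    direction = []
    for p in params:
        d = rng.standard_normal(p.shape)
        d_flat = d.reshape(d.shape[0], -1)
        p_flat = p.reshape(p.shape[0], -1)
        # Rescale each filter (row along the first axis) of the random direction
        # so its norm equals the norm of the matching filter in the parameters.
        scale = np.linalg.norm(p_flat, axis=1) / (np.linalg.norm(d_flat, axis=1) + 1e-10)
        direction.append((d_flat * scale[:, None]).reshape(p.shape))
    return direction

def loss_surface_2d(loss_fn, params, rng, radius=1.0, num=25):
    """Loss on a 2-D grid spanned by two filter-normalized random directions."""
    d1 = filter_normalized_direction(params, rng)
    d2 = filter_normalized_direction(params, rng)
    coords = np.linspace(-radius, radius, num)
    surface = np.zeros((num, num))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            point = [p + a * u + b * v for p, u, v in zip(params, d1, d2)]
            surface[i, j] = loss_fn(point)
    return coords, surface

# Toy usage: a quadratic "loss" over two dummy weight matrices.
rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
loss_fn = lambda ps: sum(float(np.sum(p ** 2)) for p in ps)
alphas, losses = interpolate_1d(loss_fn, [np.zeros_like(p) for p in params], params)
coords, surface = loss_surface_2d(loss_fn, params, rng)
```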
However, once weight decay is applied (figs d, e, f), the small-batch basin appears to shrink and become very narrow, even though small-batch training still has dominant performance. At first glance, this would seem to contradict the “flatter solutions mean more generalization” rule. https://i.imgur.com/V0H13kK.png

But this is just because weight decay hits small-batch training more strongly: with smaller batches there are more distinct updates per epoch, and the weight decay penalty is applied at each of them. This means that when weight decay is applied, the overall scale of the weights in the small-batch network is lower, and so its solution looked “sharp” when plotted on the same weight scale as the large-batch network. When normalization was used, this effect by and large went away, and you once again saw higher performance with flatter loss functions. (Batch size and performance with and without weight decay, shown normalized below.) https://i.imgur.com/vEUIgo0.png

A few other, more scattered observations from the paper:

- I’ve heard explanations of skip connections in terms of “giving the model shorter gradient paths between parameters and output,” but haven’t really seen an argument for why skip connections lead to smoother loss functions, even though they empirically seem to. https://i.imgur.com/g3QqRzh.png
- The authors also devise a technique for visualizing the change in the loss function along the trajectory taken by the optimization algorithm, so that different optimizers can be compared. The main problem with previous methods for this has been that optimization trajectories lie in a low-dimensional manifold within parameter space, so if you just randomly select directions, you won’t see any interesting movement along the trajectory. To fix this, they choose as their axes the principal components of a matrix whose rows are the parameter values at each epoch: this prioritizes the directions in parameter space that had the most variance throughout training (a minimal sketch of this follows below).
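Here is a small sketch of that trajectory-projection idea, again my own reconstruction with assumed names: stack the flattened parameter vector from each epoch into a matrix, take its top principal directions, and project the trajectory onto them.

```python
# Minimal sketch of projecting an optimization trajectory onto its own
# principal components, as described in the bullet above. `trajectory` is
# assumed to be a list of flattened parameter vectors, one per epoch.
import numpy as np

def trajectory_pca_axes(trajectory, final_params, k=2):
    """Return the top-k principal directions of the trajectory and the projected path."""
    X = np.stack([theta - final_params for theta in trajectory])  # center on the minimum
    # Rows of vt are the directions of maximal variance across epochs.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    axes = vt[:k]                       # (k, num_parameters)
    projected = X @ axes.T              # trajectory coordinates in the PCA plane
    return axes, projected

# Toy usage with a fake four-epoch "trajectory" of a five-parameter model.
rng = np.random.default_rng(0)
trajectory = [rng.standard_normal(5) for _ in range(4)]
axes, path = trajectory_pca_axes(trajectory, trajectory[-1])
```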
This paper focuses on the well-known fact that adversarial examples are often transferable: that is, an adversarial example created by optimizing loss on a surrogate model trained on similar data can often still induce increased loss on the true target model, though typically not to the same magnitude as an example optimized against the target itself. Its goal is to come up with a clearer theoretical formulation for transferred examples, and to more clearly understand what kinds of models transfer better than others. The authors define their two scenarios of interest as white box (where the parameters of the target model are known), and limited knowledge, or black box, where only the data type and feature representation are known, and both the exact training dataset and the parameters of the target model are unknown.

Most of the mathematics of this paper revolves around this equation, which characterizes how to find a delta that maximizes loss on the surrogate model: https://i.imgur.com/Y0mD35x.png

In words: you’re finding a delta (a perturbation of each input value) such that the p-norm of delta is less than some radius epsilon, and such that delta maximizes the dot product between delta and the model’s gradient with respect to the inputs. For vectors of fixed norm, the more aligned two vectors are, the higher their dot product. So, having your delta just *be* the model gradient w.r.t. inputs maximizes that quantity. However, we also need to meet the requirement that our perturbation’s norm be less than epsilon, so in order to find the actual optimal value, we divide the gradient by its norm (to get a norm of 1) and multiply by epsilon (to get a norm of epsilon). This leads to the optimal value of delta being, for the 2-norm: https://i.imgur.com/Op0H7KL.png

An important thing to remember is that all of the above has been using w-hat, meaning it’s been an examination of what the optimal delta is when we’re calculating against the surrogate model. But, if we plug in the optimal transfer value of delta we found above, how does this compare to the increase in loss if we were able to optimize against the true model? https://i.imgur.com/RHILZK1.png

Loss on the true model is, as above, calculated as the dot product of the delta perturbation with the gradient w.r.t. inputs of the true model. Using the same logic as above, this quantity is maximized when our perturbation is as close as possible to the target model’s gradient vector. So, the authors show, the degree to which adversarial examples calculated on one model transfer to another is mediated by the cosine similarity between the surrogate model’s input gradient and the target model’s. The more similar these gradients w.r.t. the input are to one another, the closer the surrogate-model loss increase will be to the target-model loss increase. This is one of those things that makes sense once it’s laid out, but it’s still useful to have a specific conceptual quantity to point to when predicting whether adversarial examples will transfer, rather than just knowing that they do, at least some of the time, to at least some extent.

Another interesting thing to notice from the above equation, though not directly related to transfer examples, is the right-hand side, the upper bound on loss increase, which is the p-norm of the gradient vector of the target model.
In clearer words, this means that the amount of loss it’s possible to induce on a model using a given epsilon of perturbation is directly dependent on the norm of that model’s gradient w.r.t. its inputs. This suggests that more highly regularized models, which tend to be smoother and have smaller gradients with respect to inputs, will be harder to attack. This hypothesis is borne out by the authors’ experiments. However, they also find, consistent with my understanding of prior work, that linear models are harder to attack than non-linear ones. This draws a line between two ways we’re used to thinking about model complexity/simplicity: having a less smooth function with bigger gradients increases your vulnerability, but having a nonlinear model structure seems to decrease it. https://i.imgur.com/mw9exLU.png

One final intriguing empirical finding of this paper is that, in addition to being the hardest models to attack when they are the target, highly regularized models work the best as surrogate models. There’s a simplistic way in which this makes sense, in that if you create your examples against a “harder” adversary to begin with, they’ll be in some sense stronger, and transfer better. However, I’m not sure that intuition is the correct one here.
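A small numerical sketch of the quantities above may help: the optimal 2-norm perturbation against a surrogate is epsilon times the surrogate’s normalized input gradient, and the first-order loss increase it induces on the target scales with how aligned the two models’ input gradients are. The two gradient vectors below are toy stand-ins I made up, not real model gradients.

```python
# Minimal sketch of the transfer argument above: the optimal 2-norm perturbation
# against a surrogate is epsilon * g_surrogate / ||g_surrogate||, and its
# first-order effect on the target is its dot product with the target's input
# gradient, which is governed by the gradients' cosine alignment.
import numpy as np

def optimal_l2_delta(grad_surrogate, epsilon):
    """Perturbation of 2-norm epsilon maximally aligned with the surrogate gradient."""
    return epsilon * grad_surrogate / np.linalg.norm(grad_surrogate)

def first_order_loss_increase(delta, grad_target):
    """Linearized loss increase on a model whose input gradient is grad_target."""
    return float(delta @ grad_target)

rng = np.random.default_rng(0)
g_surrogate = rng.standard_normal(100)
g_target = 0.8 * g_surrogate + 0.2 * rng.standard_normal(100)  # partially aligned gradients
epsilon = 0.1

delta = optimal_l2_delta(g_surrogate, epsilon)
transfer_gain = first_order_loss_increase(delta, g_target)
white_box_bound = epsilon * np.linalg.norm(g_target)  # upper bound: attack the target directly
cosine = g_surrogate @ g_target / (np.linalg.norm(g_surrogate) * np.linalg.norm(g_target))
# The fraction of the white-box bound recovered by the transferred example is
# exactly the cosine similarity of the two input gradients.
assert np.isclose(transfer_gain / white_box_bound, cosine)
```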
In the literature on adversarial examples, there’s this (to me) constant question: are adversarial examples causing the model to objectively make a mistake, or are they just eliciting behavior that is deeply weird, and unintuitive relative to our sense of what these models “should” be doing? A lot of the former question seems to come down to arguing over what’s technically “out of distribution”, which has an occasional angels-dancing-on-a-pin quality, but it’s pretty unambiguously clear that the behavior displayed in this paper is weird, and beyond what I naively expected a network to be able to be manipulated into doing.

The goal these authors set for themselves is what they call “reprogramming” of a network; they want the ability to essentially hijack the network’s computational engine to perform a different task, predicting a different set of labels, on a different set of inputs than the ones the model was trained on. For example, one task they perform is feeding in MNIST images at the center of a bunch of (what appear to be random, but are actually carefully optimized) pixels, and getting a network that can predict MNIST labels out the other end. Obviously, it’s not literally possible to change the number of outputs that a network produces once it’s trained, so the authors arbitrarily map ImageNet outputs to MNIST categories (like, “when this model predicts Husky, that actually means the digit 7”) and then judge how well this mapped output performs as an MNIST classifier. I enjoyed the authors’ wry commentary here about the arbitrariness of the mapping, remarking that “a ‘White Shark’ has nothing to do with counting 3 squares in an image, and an ‘Ostrich’ does not at all resemble 10 squares”. https://i.imgur.com/K02cwK0.png

This paper assumes a white box attack model, which implies visibility of all of the parameters, and the ability to directly calculate gradients through the model. So, given this setup of an input surrounded by modifiable pixel “weights”, and a desire to assign your “MNIST labels” correctly, this becomes a straightforward optimization problem: modify the values of those surrounding pixels so as to maximize your MNIST accuracy. An important point to note here is that the same input mask of pixel values is applied for every new-task image, and so these values are optimized over a full training set of inserted images, the way that normal weights would be.

One interesting observation the authors make is that, counter to the typical setup of adversarial examples, this attack would not work with a fully linear model, since you actually need your “weights” to interact with your “input”, which is different each time, even though these are both just different areas of your true input. This need to have some regions of the input determine how other regions of the input are processed isn’t possible in a linear model, where each input dimension has its own distinct, additive impact on the output, regardless of the other input values. By contrast, when you just need to optimize a single perturbation to get the network to jack up the prediction for one class, that can be accomplished by just applying a strong enough bias everywhere in the input, all pointing in the same direction, which can be added together linearly and still get the job done.

The authors are able to perform MNIST classification, and the task of “count the squares in this small input”, to relatively high levels of accuracy. They also perform reasonably on CIFAR (about as well as a fully connected network would, though not as well as a convnet).
They found that performance was higher when starting from a pre-trained ImageNet model, relative to just random weights. There’s some suggestion made that this implies there’s a kind of transfer learning going on, but honestly, this is weird enough that it’s hard to say. https://i.imgur.com/bj2MUnk.png

They were able to get this reprogramming to work on different model structures, but, fascinatingly, saw distinctive patterns in the “weight pixels” they needed to add to each model structure, with ResNet’s easily distinguishable from Inception’s.

One minor quibble I have with the framing of this paper - which I overall found impressive, creative, and well-written - is that I feel like it stretches the original frame of “adversarial example” a bit too far, to the point of possibly provoking confusion. It’s not obvious that the network is making a mistake, per se, when it classifies this very out-of-distribution input as something silly. I suppose, in an ideal world, we may want our models to return something like a uniform-over-outputs state of low confidence when predicting out of distribution, but that’s a bit different than seeing a gibbon in a picture of a panda. I don’t dispute the authors’ claim that the behavior they’re demonstrating is a vulnerability, in terms of its ability to let outside actors “hijack” a network’s compute, but I worry we might be overloading the term “adversarial example” to cover too many types of network failure modes.
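As a concrete illustration of that setup, here is a minimal sketch of the reprogramming recipe as I understand it from the description above. This is not the authors’ code: the class name, the image sizes, the tanh bounding of the program, and the “take the first ten logits” label mapping are all assumptions made for illustration.

```python
# Minimal sketch of adversarial reprogramming as described above: a frozen
# pretrained classifier, one learned "program" of pixel values shared across
# every input, small task images pasted into the center, and a fixed arbitrary
# mapping from original output classes to the new task's labels.
import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    def __init__(self, pretrained_model, big_size=224, small_size=28, num_new_classes=10):
        super().__init__()
        self.frozen = pretrained_model.eval()
        for p in self.frozen.parameters():
            p.requires_grad_(False)            # only the program below gets trained
        self.program = nn.Parameter(torch.zeros(3, big_size, big_size))
        mask = torch.ones(1, big_size, big_size)
        start = (big_size - small_size) // 2
        mask[:, start:start + small_size, start:start + small_size] = 0  # hole for the task image
        self.register_buffer("mask", mask)
        self.big_size, self.start, self.small_size = big_size, start, small_size
        self.num_new_classes = num_new_classes

    def forward(self, small_images):
        # small_images: (B, 3, small_size, small_size), e.g. MNIST repeated to 3 channels.
        b = small_images.shape[0]
        big = torch.zeros(b, 3, self.big_size, self.big_size, device=small_images.device)
        s, w = self.start, self.small_size
        big[:, :, s:s + w, s:s + w] = small_images              # paste the task image in the center
        adv_input = big + torch.tanh(self.program) * self.mask  # bounded program everywhere else
        logits = self.frozen(adv_input)
        # Arbitrary hard-coded label mapping: original class i "means" new label i.
        return logits[:, :self.num_new_classes]

# Hypothetical training step (model and data names are placeholders):
# reprog = Reprogrammer(torchvision.models.resnet50(weights="IMAGENET1K_V1"))
# opt = torch.optim.Adam([reprog.program], lr=0.05)
# loss = nn.functional.cross_entropy(reprog(mnist_batch), mnist_labels)
# loss.backward(); opt.step(); opt.zero_grad()
```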
This paper tries to solve the problem of how to learn systems that, given a starting state and a desired target, can learn the set of actions necessary to reach that target. The strong version of this problem requires a planning algorithm to learn a full set of actions to take the agent from state A to B. However, this is a difficult and complex task, and so this paper tries to address a relaxed version of it: generating a set of “waypoint” observations between A and B, such that each successive observation is relatively close to the previous one in terms of the actions needed to move between them (the paper calls observations ‘h-reachable’ if they are reachable from one another within h timesteps). With these checkpoint observations in hand, the planning system can then solve many iterations of a much shorter-time-scale version of the problem.

However, the paper asserts, applying pre-designed planning algorithms in observation space (sparse, high-dimensional) is difficult, because planning algorithms apparently do better with denser representations. (I don’t really understand, based on just reading this paper, *why* this is the case, other than the general fact that high-dimensional, sparse data is just hard for most things.) Historically, a typical workflow for applying planning algorithms to an environment would have been to hand-design feature representations where nearby representations were close in causal decision space (i.e. could be easily reached from one another). This paper’s goal is to derive such representations from data, rather than hand-designing them.

The system they design to do this is a little unwieldy to follow, and I only have about 80% confidence that I fully understand all the mechanisms. One basic way you might compress high-dimensional space into a low-dimensional code is by training a Variational Autoencoder, and pulling the latent code out of the bottleneck in the middle. However, we also want to be able to map between our low-dimensional code and a realistic observation space once we’re done planning and have our trajectory of codes, and VAEs typically have difficulty generating high-dimensional observations with high fidelity. If what you want is image-generation fidelity, the natural step would be to use a GAN. However, GANs aren’t really natively designed to learn an informative representation; their main goal is generation, and there’s no real incentive for the noise variables used to seed generation to encode any useful information.

One GAN design that tries to get around this is the InfoGAN, which gets its name from the requirement that there be high mutual information between (some subset of) the noise variables used to seed the generator, and the actual observation produced. I’m not going to get into the math of the variational approximation, but what this mechanically shakes out to is: in addition to generating an observation from a code, an InfoGAN also tries to predict the original code subset given the observation. Intuitively, this requirement, for the observation to contain information about the code, also means the code is forced to contain meaningful information about the image generated from it. However, even with this system, in which each code separately corresponds to a realistic observation, there’s still no guarantee that closeness in state space corresponds to closeness in “causality space”.
That guarantee would be valuable for planning, because it would mean that if you chart out a trajectory through state space, it actually corresponds to a reasonable trajectory through observation space. In order to get this property, the authors added their final, and most novel, modification to the InfoGAN framework: instead of giving the GAN one latent code and having it generate one observation, they would give it two at a time, and have the GAN try to generate a pair of temporally nearby (i.e. less than h actions away) observations. Importantly, they’d also define some transition or sampling function within state space, so that there would be a structured and predictable way that adjacent pairs of states looked. So, if the GAN were able to learn to map adjacent points in state space to adjacent points in observation space, then you’d be able to plan out trajectories in state space and have them be realistic in observation space. https://i.imgur.com/oVlVc0x.png

They do some experiments and show that both adding the “Info” structure of the InfoGAN, and adding the paired causal structure, lead to states with improved planning properties. They also compared the clusters derived from their Causal InfoGAN states to the clusters you’d get from just naively assuming that nearness in observation space means nearness in causality space. https://i.imgur.com/ddQpIdH.png

They specifically tested this on an environment divided into two “rooms”, where there were many pairs of points that were nearby in Euclidean space but far away (or mutually inaccessible) in action space. They showed that the Causal InfoGAN (b) was successfully able to learn representations such that points nearby in action space clustered together, whereas a Euclidean representation (c) didn’t have this property.
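To make the paired-code mechanism a bit more concrete, here is a heavily simplified sketch of how one generator update might look. This is my own reconstruction, not the authors’ implementation: the module sizes, the Gaussian “nearby state” sampler, and the use of an MSE term as a stand-in for the variational mutual-information bound are all assumptions.

```python
# Minimal sketch of the paired-code idea described above: sample an abstract
# state s and a nearby next state s', generate a pair of observations from
# them, ask a discriminator to judge the pair against real temporally-adjacent
# observation pairs, and ask a posterior network to recover (s, s') from the
# generated pair (the "Info" term, here approximated with an MSE loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, noise_dim, obs_dim = 4, 8, 64

generator = nn.Sequential(nn.Linear(2 * state_dim + noise_dim, 128), nn.ReLU(),
                          nn.Linear(128, 2 * obs_dim))          # emits (o, o') as one vector
discriminator = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1))                 # real/fake score for a pair
posterior = nn.Sequential(nn.Linear(2 * obs_dim, 128), nn.ReLU(),
                          nn.Linear(128, 2 * state_dim))         # recovers (s, s') from the pair

def sample_state_pair(batch_size, step_scale=0.1):
    """Abstract state plus a 'nearby' next state, one small transition away."""
    s = torch.randn(batch_size, state_dim)
    s_next = s + step_scale * torch.randn(batch_size, state_dim)
    return s, s_next

def generator_step(batch_size=32):
    s, s_next = sample_state_pair(batch_size)
    z = torch.randn(batch_size, noise_dim)
    fake_pairs = generator(torch.cat([s, s_next, z], dim=1))
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake_pairs), torch.ones(batch_size, 1))    # try to fool the discriminator
    info_loss = F.mse_loss(posterior(fake_pairs), torch.cat([s, s_next], dim=1))
    # A symmetric discriminator step on real adjacent observation pairs
    # (collected from environment rollouts) is omitted here.
    return adv_loss + info_loss

loss = generator_step()
loss.backward()
```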
This paper builds very directly on the idea of “empowerment” as an intrinsic reward for RL agents. Where empowerment incentivizes agents to increase the amount of influence they’re able to have over the environment, “social influence,” this paper’s metric, is based on the degree to which the actions of one agent influence the actions of other agents, within a multi-agent setting. The goals of the two frameworks are a little different. The notion of “empowerment” is built around a singular agent trying to find a short-term proxy for its likelihood of long-term survival (since that’s a feedback point no individual wants to actually hit). By contrast, the problems that the authors of this paper seek to solve are more explicitly multi-agent coordination problems: prisoner’s dilemma-style situations where collective reward requires cooperation. However, the two share a mathematical basis: the idea that an agent’s influence on some other element of its environment (be it the external state, or another agent’s actions) is well modeled by calculating the mutual information between its actions and that element. While this is initially a bit of an odd conceptual jump, it does make sense: if an action gives statistical information that helps you predict an outcome, it’s likely (obviously not certain, but likely) that that action influenced that outcome.

In a multi-agent problem, where cooperation and potentially even communication can help solve the task, being able to influence other agents amounts to “finding ways to make oneself useful to other agents”, because other agents aren’t going to change their behavior based on your actions, or “listen” to your “messages” (in the experiment where a communication channel was available between agents), if those signals don’t help them achieve *their* goals. So this incentive, to influence the behavior of other (self-interested) agents, amounts to a good proxy for incentivizing useful cooperation.

Zooming in on the exact mathematical formulation (which differs slightly from, though it’s in a shared spirit with, the empowerment math): agent A’s Causal Influence reward is calculated by taking the KL divergence between the action distribution of the other agent (B) conditional on the action A actually took, and B’s action distribution marginalized over the other actions A might have taken (see below; connecting back to empowerment, Mutual Information is just the expected value of this quantity, taken over A’s action distribution). https://i.imgur.com/oxXCbdK.png

One thing you may notice from the above equation is that, because we’re working with KL divergences, we expect agent A to have access to the full distribution of agent B’s policy conditional on A’s action, not just the action B actually took. We also require the ability to sample “counterfactuals,” i.e. what agent B would have done if agent A had done something differently. Between these two requirements, if we take a realistic model of two agents interacting with each other in only one timeline, each with access only to the other’s external behavior and not its internal parameters, it becomes clear that these quantities can’t be pulled from direct experience. Instead, they are calculated using an internal model: each agent builds its own MOA (Model of Other Agents), a predictive model of what another agent will do at a given time, conditional on the environment and the actions of all other agents. It’s this model that is used to sample the aforementioned counterfactuals, since that just involves passing in a different input.
I’m not entirely sure, in each experiment, whether the MOAs are trained concurrently with the agent policies, or in a separate prior step. https://i.imgur.com/dn2cBg4.png

Testing on, again, Prisoner’s Dilemma-style problems requiring agents to take risky collaborative actions, the authors did find higher performance using their method, compared to approaches where each agent just maximizes its own external reward (which, it should be said, does depend on other agents’ actions), with no explicit incentive towards collaboration. Interestingly, when they specifically tested giving agents access to a “communication channel” (the ability to output discrete signals or “words” visible to other agents), they found that agents were able to train just as effectively with only an influence reward as with both an influence reward and an external reward.
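For concreteness, here is a small numerical sketch of the influence reward as described above: the KL divergence between B’s predicted policy given A’s actual action and B’s policy marginalized over A’s counterfactual actions, weighted by A’s policy. The `moa_predict` function, the toy distributions, and A’s policy are all made up for illustration; they stand in for an agent’s learned MOA.

```python
# Small numerical sketch of the causal influence reward described above.
# moa_predict stands in for agent A's learned Model of Other Agents: it returns
# B's predicted action distribution conditional on the state and A's action.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def influence_reward(moa_predict, state, a_taken, a_policy, actions):
    """KL between B's policy given A's actual action and B's policy marginalized
    over the counterfactual actions A might have taken (weighted by A's policy)."""
    p_b_given_a = moa_predict(state, a_taken)
    p_b_marginal = sum(a_policy[a] * moa_predict(state, a) for a in actions)
    return kl_divergence(p_b_given_a, p_b_marginal)

# Toy example: two actions for A, three for B; in this fake MOA, B "listens" to A,
# so conditioning on A's actual action shifts B's predicted distribution.
actions = [0, 1]
a_policy = {0: 0.5, 1: 0.5}
fake_moa = {0: np.array([0.7, 0.2, 0.1]), 1: np.array([0.1, 0.2, 0.7])}
moa_predict = lambda state, a: fake_moa[a]
print(influence_reward(moa_predict, state=None, a_taken=0, a_policy=a_policy, actions=actions))
```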