[link]
Lee et al. propose a generative model for obtaining confidence-calibrated classifiers. Neural networks are known to be overconfident in their predictions – not only on examples from the task’s data distribution, but also on examples taken from other distributions. The authors propose a GAN-based approach that forces the classifier to output uniform predictions on examples not taken from the data distribution. In particular, in addition to the target classifier, a generator and a discriminator are introduced. The generator generates “hard” out-of-distribution examples; ideally, these examples are close to the in-distribution, i.e., the data distribution of the actual task. The discriminator is intended to distinguish between out- and in-distribution. The overall algorithm, including the necessary losses, is given in Algorithm 1. In experiments, the approach is shown to detect out-of-distribution examples nearly perfectly. Examples of the generated “hard” out-of-distribution samples are given in Figure 1. https://i.imgur.com/NmF0fpN.png Algorithm 1: The proposed joint training scheme of the out-distribution generator $G$, the in-/out-distribution discriminator $D$ and the original classifier providing $P_\theta(y|x)$ with parameters $\theta$. https://i.imgur.com/kAclSQz.png Figure 1: A comparison of a regular GAN (a and b) to the proposed framework (c and d). Clearly, the proposed approach generates out-of-distribution samples (i.e., no meaningful digits) close to the original data distribution. |
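The core of the joint objective is a confidence term that pushes the classifier towards uniform predictions on generated out-of-distribution samples. Below is a minimal, hedged sketch of that term only (not the authors’ code); the names `classifier`, `x_in`, `x_out` and the weight `beta` are assumptions.

```python
import torch
import torch.nn.functional as F

def confidence_loss(classifier, x_in, y_in, x_out, beta=1.0):
    """Sketch: cross-entropy on in-distribution data plus KL(U || p(y|x)) on
    out-of-distribution samples, encouraging uniform predictions off-distribution."""
    ce = F.cross_entropy(classifier(x_in), y_in)

    log_p_out = F.log_softmax(classifier(x_out), dim=1)
    num_classes = log_p_out.size(1)
    # KL(U || p) = -log K - (1/K) * sum_y log p(y|x)
    kl_uniform = (-log_p_out.mean(dim=1) - torch.log(torch.tensor(float(num_classes)))).mean()

    return ce + beta * kl_uniform
```

In the full scheme, the generator and discriminator are trained jointly with this classifier so that the generated out-of-distribution samples lie close to the in-distribution, which is what makes them “hard”.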
[link]
Morcos et al. study the influence of ablating single units as a proxy for generalization performance. On Cifar10, for example, an 11-layer convolutional network is trained on the clean dataset, as well as on versions of Cifar10 where a fraction $p$ of samples have corrupted labels. In the latter cases, the network is forced to memorize examples, as there is no inherent structure in the label assignment. Then, it is experimentally shown that these memorizing networks are less robust to setting whole feature maps to zero, i.e., ablating them. This is shown in Figure 1. Based on this result, the authors argue that the area under this ablation curve (AUC) can be used as a proxy for generalization performance. For example, early stopping or hyper-parameter selection can be done based on this AUC value. Furthermore, they show that batch normalization discourages networks from relying on these so-called single directions, i.e., single units or feature maps. Specifically, batch normalization seems to favor units holding information about multiple classes/concepts. https://i.imgur.com/h2JwLUF.png Figure 1: Classification accuracy (y-axis) over the number of units that are ablated (x-axis) for networks trained on Cifar10 with various degrees of corrupted labels. The same experiments are shown (left and right) for MNIST and ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
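As a rough illustration of how such an ablation curve and its AUC could be computed, here is a hedged sketch (PyTorch-style; `model`, `layer` and `evaluate_accuracy` are assumed to be provided and are not the authors’ code):

```python
import numpy as np
import torch

def ablation_auc(model, layer, num_units, evaluate_accuracy, steps=20):
    """Sketch: zero out an increasing number of randomly chosen feature maps in
    `layer` and integrate the resulting accuracy curve."""
    order = np.random.permutation(num_units)
    fractions, accuracies = [], []
    for k in np.linspace(0, num_units, steps, dtype=int):
        idx = list(order[:k])

        def hook(_module, _inp, out):
            out = out.clone()
            if idx:
                out[:, idx] = 0.0  # ablate whole feature maps
            return out

        handle = layer.register_forward_hook(hook)
        accuracies.append(evaluate_accuracy(model))
        handle.remove()
        fractions.append(k / num_units)
    return np.trapz(accuracies, fractions)  # area under the ablation curve
```

Networks whose accuracy collapses after ablating only a few feature maps yield a small AUC, matching the paper’s observation for memorizing networks.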
[link]
Zhang and Evans propose cost-sensitive certified robustness where different adversarial examples can be weighted based on their actual impact for the application. Specifically, they consider the certified robustness formulation (and the corresponding training scheme) by Wong and Kolter. This formulation is extended by acknowledging that different adversarial examples have different impacts for specific applications; this is formalized through a cost matrix which quantifies which source-target label combinations of adversarial examples are actually harmful. Based on this cost matrix, cost-sensitive certified robustness as well as the corresponding training scheme is proposed and evaluated in experiments. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Schott et al. propose an analysis-by-synthesis approach for adversarially robust MNIST classification. In particular, as illustrated in Figure 1, class-conditional variational auto-encoders (i.e., one variational auto-encoder per class) are learned. The respective recognition models, i.e., encoders, are discarded. For classification, the optimization problem $l_y^*(x) = \max_z \log p(x|z) - \text{KL}(\mathcal{N}(z, \sigma I)\|\mathcal{N}(0, I))$ is solved for each class $y$. Here, $p(x|z)$ represents the generative model learned for the respective class. The optimization problem leads to a latent code $z$ corresponding to the best reconstruction of the input. The corresponding likelihood can be used for classification using Bayes’ theorem. The obtained posteriors $p(y|x)$ are then scaled using a modified softmax (see paper) to obtain the final decision. (Additionally, input binarization is used as defense.) https://i.imgur.com/ignvoHQ.png Figure 1: The proposed analysis-by-synthesis approach to MNIST classification. The depicted generators are taken from class-specific variational auto-encoders. In addition to the proposed defense, Schott et al. also derive lower and upper bounds on the robustness of the classification procedure. These bounds can be derived from the optimization problem above, see the paper for details. In experiments, they show that their defense outperforms state-of-the-art adversarial training and allows estimating tight bounds. In addition, the method is robust against distal adversarial examples, and the adversarial examples look more meaningful, see Figure 2. https://i.imgur.com/uxGzzg1.png Figure 2: Adversarial examples for the proposed “ABS” method, its binary variant and related work. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
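A hedged sketch of the per-class optimization described above (the decoders’ interface, latent dimensionality, optimizer settings and the plain softmax are assumptions; the paper uses a modified softmax and further details):

```python
import torch

def abs_classify(x, decoders, latent_dim=8, steps=100, lr=0.05):
    """Sketch: optimize a latent code per class-conditional decoder and turn the
    resulting evidence lower bounds into class scores."""
    scores = []
    for decoder in decoders:  # one VAE decoder per class
        z = torch.zeros(latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            recon = decoder(z)
            log_px_z = -((recon - x) ** 2).sum()   # Gaussian log-likelihood up to constants
            kl = 0.5 * (z ** 2).sum()              # z-dependent part of KL(N(z, sigma I) || N(0, I))
            loss = -(log_px_z - kl)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            score = -((decoder(z) - x) ** 2).sum() - 0.5 * (z ** 2).sum()
        scores.append(score)
    return torch.softmax(torch.stack(scores), dim=0)  # class posterior (plain softmax here)
```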
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent code. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space. https://i.imgur.com/XBhHJuY.png Figure 1: Illustration of natural adversarial examples in comparison to regular FGSM adversarial examples. https://i.imgur.com/HT2StGI.png Figure 2: Generative model (GAN) together with the required inverter. In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the latent space (with increasing norm) until an adversarial perturbation is found. Here, the inverter from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
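A hedged sketch of the described black-box search in latent space (assuming a single example and pretrained `generator`, `inverter` and `classifier` modules; the radii and sample counts are made up):

```python
import torch

def natural_adversarial(x, generator, inverter, classifier,
                        samples_per_radius=100, radii=None):
    """Sketch: sample latent perturbations of increasing norm until the decoded
    image is misclassified. Assumes a single example (batch size 1)."""
    radii = radii or [0.1 * (i + 1) for i in range(20)]
    with torch.no_grad():
        y = classifier(x).argmax(dim=-1)
        z = inverter(x)                      # latent code of the clean input
        for r in radii:                      # gradually widen the search radius
            for _ in range(samples_per_radius):
                delta = torch.randn_like(z)
                delta = r * delta / delta.norm()
                x_tilde = generator(z + delta)
                if classifier(x_tilde).argmax(dim=-1) != y:
                    return x_tilde           # first (smallest-radius) natural adversarial found
    return None
```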
[link]
This was definitely one of the more conceptually nuanced and complicated papers I’ve read recently, and I’ve only got about 60% confidence that I fully grasp all of its intuitions. However, I’m going to try to collect together what I did understand. There is a lot of research into generative models of text or image sequences, and some amount of research into building “models” in the reinforcement learning sense, where your model can predict future observations given current observations and your action. There’s an important underlying distinction here between model-based RL (where you learn a model of how the world evolves, and use that to optimize reward) and model-free RL (where you don’t bother explicitly learning a world model, and just directly try to optimize rewards). However, this paper identifies a few limitations of this research. 1) It’s largely focused on predicting observations, rather than predicting *state*. State is a bit of a fuzzy concept, and corresponds to, roughly, “the true underlying state of the game”. An example I like to use is a game where you walk in one door, and right next to it is a second door, which requires you to traverse the space and find rewards and a key before you can open it. Now, imagine that the observation of your agent is it looking at the door. If the game doesn’t have any on-screen representation of the fact that you’ve found the key, it won’t be present in your observations, and you’ll observe the same thing at the point where you have just entered and once you have found the key. However, the state of the game at these two points will be quite different, in that in the latter case, your next states might be “opening the door” rather than “going to collect rewards”. Scenarios like this are referred to broadly as Partially Observable games or environments. This paper wants to build a model of how the game evolves into the future, but it wants to build a model of *state-to-state* evolution, rather than observation-to-observation evolution, since observations are typically both higher-dimensional and more noisy/less informative. 2) Past research has typically focused on predicting each next-step observation, rather than teaching models to directly predict a state many steps in the future, without having to roll out the entire sequence of intermediate predictions. This is arguably quite valuable for making models that can predict the long-term consequences of their decisions. This paper approaches that with an approach inspired by the Temporal Difference framework used in much of RL, in which you update your past estimate of future rewards by forcing it to be consistent with the actual observed rewards you encounter in the future. Except, in this model, we sample a state (z1) and then a state some distance into the future (z2), and try to make our backwards-looking prediction of the state at time 1 – which takes into account observations that happened in between – match the prediction made with only the information available at time 1. An important mechanistic nuance here is the idea of a “belief state”, something that captures all of your knowledge about the game history up to a certain point. We can then directly sample a state zt given the belief state bt at that point. 
This isn’t actually possible with a model where we predict the state at time T given the state at time T-1, because the state at time T-1 is itself a sample, and so in order to get a full distribution over zt, you have to sample zt over the distribution of zt-1, and in order to get the distribution of zt-1 you have to sample over the distribution of zt-2, and so on and so on. Instead, we have a separate non-state variable, bt, that is created conditional on all our past observations (through an RNN). https://i.imgur.com/N0Al42r.png All said and done, the mechanics of this model look like: 1) Pick two points along the sequence trajectory. 2) Calculate the belief state at each point and, from that, construct a distribution over states at each timestep using p(z|b). 3) Have an additional model that predicts z1 given z2, b1, and b2 (that is, the future beliefs and states), and push the distribution over z1 from this model to be close to the distribution over z1 given only the information available at time t1. 4) Have a model that predicts z2 given z1 and the time interval ahead that we’re jumping, and try to have this model be predictive/have high likelihood over the data. 5) And, have a model that predicts an observation at time t2 given the state z2, and train that so that we can convert our way back to an observation, given a state. They mostly test it on fairly simple environments, but it’s an interesting idea, and I’d be curious to see other people develop it in future. (A strange aspect of this model is that, as far as I can tell, it’s non-interventionist, in that we’re not actually conditioning on agent actions, or trying to learn a policy for an agent. This is purely trying to learn the long-term transitions between states.) |
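For concreteness, here is a much-simplified, hedged sketch of the pieces enumerated above (belief RNN, a state distribution per belief, a backwards/smoothing model, a jumpy transition, and a decoder). The module sizes, the Gaussian parameterizations and the exact loss weighting are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class JumpyStateModel(nn.Module):
    def __init__(self, obs_dim, belief_dim=64, state_dim=8):
        super().__init__()
        self.belief_rnn = nn.GRU(obs_dim, belief_dim, batch_first=True)        # b_t from observations
        self.p_z_given_b = nn.Linear(belief_dim, 2 * state_dim)                # p(z | b)
        self.q_z1_given_z2_b = nn.Linear(state_dim + 2 * belief_dim, 2 * state_dim)  # backwards-looking model
        self.p_z2_given_z1 = nn.Linear(state_dim + 1, 2 * state_dim)           # jumpy transition with time gap
        self.decoder = nn.Linear(state_dim, obs_dim)                           # observation from state

    @staticmethod
    def _gaussian(params):
        mean, log_std = params.chunk(2, dim=-1)
        return D.Normal(mean, log_std.exp())

    def loss(self, obs, t1, t2):
        beliefs, _ = self.belief_rnn(obs)                                      # obs: (batch, time, obs_dim)
        b1, b2 = beliefs[:, t1], beliefs[:, t2]
        dt = torch.full((obs.size(0), 1), float(t2 - t1))

        z2 = self._gaussian(self.p_z_given_b(b2)).rsample()                    # sample a future state from b2
        q_z1 = self._gaussian(self.q_z1_given_z2_b(torch.cat([z2, b1, b2], dim=-1)))
        z1 = q_z1.rsample()
        p_z1 = self._gaussian(self.p_z_given_b(b1))                            # what we believed at time t1
        p_z2_from_z1 = self._gaussian(self.p_z2_given_z1(torch.cat([z1, dt], dim=-1)))

        recon = -D.Normal(self.decoder(z2), 1.0).log_prob(obs[:, t2]).sum(-1)  # 5) decode x_t2 from z2
        consistency = D.kl_divergence(q_z1, p_z1).sum(-1)                      # 3) backwards model vs. belief at t1
        transition = -p_z2_from_z1.log_prob(z2).sum(-1)                        # 4) jumpy prediction of z2 from z1
        return (recon + consistency + transition).mean()
```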
[link]
Unsupervised representation learning is a funny thing: our aspiration in learning representations from data is typically that they’ll be useful for future tasks, but, since we (by definition) don’t have access to labels, our approach has historically been to define heuristics, such as representing the data distribution in a low-dimensional space, and hope that those heuristics translate to useful learned representations. And, to a fair extent, they have. However, this paper’s goal is to attack this problem more directly, by explicitly meta-learning an unsupervised update rule that performs well on future tasks. They do this by: https://i.imgur.com/EEkpW9g.png 1) Defining a parametrized weight update function, to slot into the role that Stochastic Gradient Descent on a label-defined loss function would play in a supervised network. This function calculates a “hidden state”, is defined for each neuron in each layer, and takes in the pre- and post-nonlinearity activations for that batch, the hidden state of the next layer, and a set of learned per-layer “backwards weights”. The weight update for that neuron is then calculated using the current hidden state, the last batch's hidden state, and the current value of the weight. In the traditional way of people in this field who want to define some generic function, they instantiate these functions as an MLP. 2) Using that update rule on the data from a new task, taking the representation resulting from applying the update rule, and using it in a linear regression with a small number of samples. The generalization performance from this k-shot regression, taken in expectation over multiple tasks, acts as the meta-training objective. By back-propagating from this objective, to the weight values of the representation, and from there to the parameters of the update rule, they incentivize their updater to learn representations that are useful across some distribution of tasks. A slightly weird thing about this paper is that they train on image datasets, but shuffle the pixels and use a fully connected network rather than a conv net. I presume this has to do with the complexities of defining a weight update rule for a convolution, but it does make it harder to meaningfully compare with other image-based unsupervised systems, which are typically convolutional. An interesting thing they note is that, early in meta-training on images, their update rules generalize fairly well to text data. However, later in training the update rules seem to have specialized to images, and generalize more poorly to text. |
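A heavily simplified, hedged sketch of this meta-training loop (the accessors `initial_weights` and `forward_with`, the update rule’s interface, and the ridge-regression probe are all assumptions made for illustration, not the paper’s code):

```python
import torch

def meta_train_step(update_rule, base_net, unlabeled_x, few_shot_x, few_shot_y,
                    inner_steps=5):
    """Sketch: apply the learned update rule to a small base network on unlabeled
    data, then score the resulting representation with a closed-form few-shot
    linear probe; gradients flow back into `update_rule`."""
    weights = [w.clone() for w in base_net.initial_weights()]          # hypothetical accessor
    for _ in range(inner_steps):
        activations = base_net.forward_with(weights, unlabeled_x)      # per-layer activations (hypothetical)
        # The MLP-parameterized rule proposes a delta for each weight tensor.
        weights = [w + update_rule(w, act) for w, act in zip(weights, activations)]

    features = base_net.forward_with(weights, few_shot_x)[-1]          # top-layer representation
    lam = 1e-3 * torch.eye(features.size(1))
    w_probe = torch.linalg.solve(features.T @ features + lam, features.T @ few_shot_y)
    meta_loss = ((features @ w_probe - few_shot_y) ** 2).mean()        # few_shot_y: one-hot float targets
    meta_loss.backward()                                               # updates the update rule's parameters
    return float(meta_loss)
```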
[link]
Reward functions are a funny part of modern reinforcement learning: enormously salient from the inside, if you’re coding or working with RL systems, yet not as clearly visible from the outside perspective, where we just see agents playing games in what seem to be human-like ways. Just seeing things from this angle, it can be easy to imagine that the mechanisms being used to learn are human-like as well. And, it’s true that some of the Atari games being examined are cases where there is in fact a clear, explicit reward in the form of points, that human players would also be trying to optimize. But in most cases, the world isn’t really in the habit of producing clear reward signals, and it definitely doesn’t typically do so on time scales that account for most of the learning humans do. So, it’s generally hypothesized that in addition to updating on (sparse) environmental rewards, humans also operate according to certain pre-coded, possibly evolutionarily-engineered heuristics, of which one is curiosity. The intuition is: it sure seems like, especially early in life, humans learn by interacting with objects purely driven by curiosity, and we’d love to somehow harness that same drive to allow our learning systems to function in environments lacking dense, informative reward signals. One such environment is the video game Montezuma’s Revenge, which in addition to being amusingly difficult to search for, is a game with sparse, long-range rewards, on which typical reward-based agents have historically performed poorly, and on which this current paper focuses. A strong existing tradition of curiosity objectives focuses on incentivizing agents to be able to better predict the next observation, given the current observation and their action within it. Intuitively, such a network is trained on historical observations, and agents are given a bonus according to that prediction’s error on a given observation. The theory behind this is that if an agent isn’t able to predict the observation-transition dynamics at a given state, that probably means it hasn’t visited many nearby states, and so we want to incentivize it doing so to gain information. If this sounds like the classic “explore vs exploit” trade-off, it’s very much a similar idea: in cases of clear reward, we should take the reward, but in cases of low or uncertain reward, there’s value to exploration. One difficulty of systems like the one described above is that they reward the agent for being in environments where the next observation is difficult to predict from the current one. And while that could describe novel states about which the agent needs to gain information, it can also describe states that are inherently stochastic; the canonical example being random static on a TV screen. The agent has a lot of trouble predicting the next observation because it’s fundamentally non-deterministic to a greater degree than even the random-but-causal dynamics of most games. The proposed alternative of this paper is a little strange, but makes more sense in the context of responding to this stochasticity problem. The authors propose to create a random mapping, in the form of an initialized but untrained neural network, taking in observations and spitting out embedding vectors. Then, they incentivize their agent to go to places that have high prediction error on a network designed to predict these random embeddings. Since the target is just a fixed function mapping, it’s deterministic with respect to observations. 
The idea here is that if you’ve seen observations similar to your current observation, you’ll be better able to predict the corresponding embedding, even if there’s no meaningful relationship that you’re learning. https://i.imgur.com/Ds5gHDE.png The authors found that this performed well on Montezuma’s Revenge and Private Eye, but only middlingly well on other environments. I’m a bit torn on this paper overall. On one hand, it seems like a clever idea, and I’m in general interested in seeing more work on curiosity. It does clearly seem to be capturing something that corresponds to novelty-seeking, and the agent trained using it explores a higher number of rooms than alternative options. On the other, I’m a little skeptical of the fact that it only has consistent performance in two environments, and wish there had been more comparisons to simpler forms of observation similarity, since this really does just seem like a metric of “how similar is this observation to the ones you have seen before”. I find myself wondering if some sort of density modeling could even be effective here, especially if (as may be the case, I’m unsure) the input observations are metadata rather than pixels. |
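A hedged sketch of the random-embedding bonus described above (often referred to as Random Network Distillation; the network sizes and optimizer are assumptions):

```python
import torch
import torch.nn as nn

class RandomEmbeddingBonus(nn.Module):
    """Sketch: a fixed random target network plus a trained predictor; the
    predictor's error on an observation is the intrinsic reward."""
    def __init__(self, obs_dim, embed_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, embed_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)           # random, fixed mapping: never trained
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, embed_dim))
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, obs):
        with torch.no_grad():
            error = ((self.predictor(obs) - self.target(obs)) ** 2).mean(dim=-1)
        return error                           # high where similar observations are rare

    def update(self, obs):
        loss = ((self.predictor(obs) - self.target(obs)) ** 2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
```

The prediction error shrinks for observations similar to ones already visited, so the bonus decays in familiar parts of the environment while remaining deterministic, which is what sidesteps the noisy-TV problem described above.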
[link]
Xiao et al. propose adversarial examples based on spatial transformations. Actually, this work is very similar to the adversarial deformations of [1]. In particular, a deformation flow field is optimized (allowing individual deformations per pixel) to cause a misclassification. The distance of the perturbation is computed on the flow field directly. Examples on MNIST are shown in Figure 1 – it can clearly be seen that most pixels are moved individually and no kind of smoothness is enforced. They also show that commonly used defense mechanisms are more or less useless against these attacks. Unfortunately, and in contrast to [1], they do not consider adversarial training on their own adversarial transformations as defense. https://i.imgur.com/uDfttMU.png Figure 1: Examples of the computed adversarial examples/transformations on MNIST for three different models. Note that these are targeted attacks. [1] R. Alaifari, G. S. Alberti, T. Gauksson. ADef: an Iterative Algorithm to Construct Adversarial Deformations. ArXiv, abs/1804.07729v2, 2018. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
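A hedged sketch of how such a per-pixel flow field could be optimized for a targeted attack (the flow regularization weight, optimizer and step count are assumptions, not the paper’s exact formulation):

```python
import torch
import torch.nn.functional as F

def flow_attack(model, x, target, steps=200, lr=0.01, tau=0.05):
    """Sketch: optimize a per-pixel displacement field so the warped image is
    classified as `target`, penalizing the norm of the flow field itself."""
    b, _, h, w = x.shape
    # Identity sampling grid in [-1, 1] coordinates (x before y, as grid_sample expects).
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    flow = torch.zeros_like(identity, requires_grad=True)             # per-pixel displacement
    opt = torch.optim.Adam([flow], lr=lr)
    for _ in range(steps):
        x_def = F.grid_sample(x, identity + flow, align_corners=True)
        loss = F.cross_entropy(model(x_def), target) + tau * flow.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.grid_sample(x, identity + flow.detach(), align_corners=True)
```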
[link]
Alaifari et al. propose an iterative attack to construct adversarial deformations of images. In particular, and in contrast to general adversarial perturbations, adversarial deformations are described through a deformation vector field – and the corresponding norm of this vector field may be bounded; an illustration can be found in Figure 1. The adversarial deformation is computed iteratively, where the deformation itself is expressed in a differentiable manner. In contrast to very simple transformations such as rotations and translations, the computed adversarial deformations may contain significantly more subtle deformations, as shown in Figure 2. The authors show that such deformations can successfully attack MNIST and ImageNet models. https://i.imgur.com/7N8rLaK.png Figure 1: Illustration of the advantage of using general pixel-level deformations compared to simple transformations such as translations or rotations. https://i.imgur.com/dCWBoI8.png Figure 2: Illustration of untargeted (top) and targeted (bottom) attacks on ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
The authors investigated why humans play some video games better than machines. That is the case for games that do not have continuous rewards (e.g., scores). They experimented with a game -- inspired by _Montezuma's Revenge_ -- in which the player has to climb stairs, collect keys and jump over enemies. RL algorithms can only know whether they succeeded once they finish the game, as there are no rewards during gameplay, so they tend to do much worse than humans in these games. To compare humans and machines, they set up RL algorithms and recruited players from Amazon Mechanical Turk. Humans did much better than machines for the original game setup. However, the authors wanted to check the impact of semantics and prior knowledge on human performance. They set up scenarios with different levels of reduced semantic information, as shown in Figure 2. https://i.imgur.com/e0Dq1WO.png This is what the game originally looked like: https://rach0012.github.io/humanRL_website/main.gif And this is the version with fewer semantic clues: https://rach0012.github.io/humanRL_website/random2.gif You can try it yourself on the [paper's website](https://rach0012.github.io/humanRL_website/). Not surprisingly, humans took much more time to complete the game in scenarios with less semantic information, indicating that humans strongly rely on prior knowledge to play video games. The authors argue that this prior knowledge should also be somehow incorporated into RL algorithms in order to move their efficiency towards the human level. ## Additional reading [Why humans learn faster than AI—for now](https://www.technologyreview.com/s/610434/why-humans-learn-faster-than-ai-for-now/). [OpenReview submission](https://openreview.net/forum?id=Hk91SGWR-) |
[link]
CNN predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classified with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain and hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not yet robust enough to be employed in this way. * **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results. * **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe whether this property can be quantified as a characteristic of the model or of the input data. --- ## Experimental negative result Three classes of generative models are considered in this paper: * **Auto-regressive** models such as `PixelCNN` [1] * **Latent variable** models, such as `VAEs` [2] * Generative models with **invertible flows** [3], in particular `Glow` [4]. The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistently observed for various architectures including [1], [2] and [4], although it is of lesser effect in the `PixelCNN` case. Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and this result is counter-intuitive as it goes against the idea that generative models can reliably be used to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that a higher likelihood does not necessarily coincide with better generated samples [5]. --- ## Further analysis for invertible flow models The authors further study this phenomenon in the invertible flow model case as it provides a more rigorous analytical framework (exact likelihood inference, unlike VAEs which only provide a bound on the true likelihood). More specifically, invertible flow models are characterized by a ***diffeomorphism*** (invertible function) $f(x; \phi)$ between the input space $\mathcal X$ and the latent space $\mathcal Z$, and a choice of latent distribution $p(z; \psi)$. The ***change of variables formula*** links the densities of $x$ and $z$ as follows: $$ p_x(x) = p_z(f(x)) \left| \det \frac{\partial f}{\partial x} \right| $$ And the training objective under this transformation becomes $$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \det \frac{\partial f_{\phi}}{\partial x_i} \right| $$ Typically, $p_z$ is chosen to be Gaussian, and samples are built by inverting $f$, i.e., $z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is built such that computing the log determinant of the Jacobian in the previous equation can be done efficiently. 
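As a small illustration of the decomposition discussed next, here is a hedged sketch of how the exact log-likelihood of a flow splits into a density term and a volume term (`flow` is an assumed module returning $f(x)$ and the log-determinant of its Jacobian):

```python
import torch

def flow_log_likelihood(flow, prior, x):
    """Sketch: exact log-likelihood under a normalizing flow, split into its
    density (log p_z) and volume (log |det J|) contributions."""
    z, log_det_jacobian = flow(x)                  # f(x) and log |det df/dx|
    density_term = prior.log_prob(z).sum(dim=-1)   # log p_z(f(x))
    volume_term = log_det_jacobian                 # log-volume contribution
    return density_term + volume_term, density_term, volume_term

# Usage (shapes are assumptions): comparing the two terms on in- vs.
# out-of-distribution batches mirrors the per-term analysis in the paper.
# prior = torch.distributions.Normal(0.0, 1.0)
# ll, dens, vol = flow_log_likelihood(glow, prior, svhn_batch)
```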
First, they observe that the contribution of the flow can be decomposed into a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experimental results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***. Secondly, they try to directly analyze the difference in likelihood between the two domains $\mathcal X$ and $\tilde{\mathcal X}$, which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E}(\mathcal X) \approx \mathbb{E}(\tilde{\mathcal X})$). For the constant-volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice. --- ## References * [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al., 2016 * [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013 * [3] Density Estimation using Real NVP, Dinh et al., ICLR 2017 * [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal, NeurIPS 2018 * [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016 |
[link]
One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive bias – structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having feature maps learn local patterns, which can then be re-used across the whole image. This is based on the prior belief that patterns in images are generally locally contiguous, and so having feature maps that focus only on small (and gradually larger) local areas is a good fit for that prior. This paper operates in a similar spirit, except its input data isn’t in the form of an image, but a graph: the social graph of multiple agents operating within a Multi Agent RL setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), nodes within graphs have an arbitrary number of neighbors, which can have arbitrary numbers and types of attributes attached to their relationships. The authors of this paper use graph networks as a sort of auxiliary information processing system alongside a more typical policy learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully “capture” a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly. The graph network used in this model represents both agents and objects in the environment as nodes, which have attributes including their position, whether they’re available or not (for capture-able objects), and what their last action was. As best I can tell, all agents start out with directed connections going both ways to all other agents, and to all objects in the environment, with the only edge attribute being whether the players are on the same team, for competitive environments. Given this setup, the graph network works through a sort of “diffusion” of information, analogous to a message passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value and sender and receiver nodes, as well as from a “global feature”. Then, all of the nodes pull in information from their edges, and their own past value. Finally, this “global attribute” gets updated based on summations over the newly-updated node and edge information. (If you were predicting graph-level attributes, this global attribute might be where you’d do that prediction. However, in this case, we’re just interested in predicting agent-level actions). https://i.imgur.com/luFlhfJ.png All of this has the effect of explicitly modeling agents as entities that both have information, and have connections to other entities. 
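A hedged sketch of one such update round (edge update, then node update, then global update); the MLP sizes, sum/mean aggregations and tensor layout are assumptions rather than the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class GraphNetBlock(nn.Module):
    """Sketch of one message-passing round over a fully connected agent/object graph."""
    def __init__(self, node_dim, edge_dim, global_dim, hidden=64):
        super().__init__()
        self.edge_fn = nn.Sequential(nn.Linear(edge_dim + 2 * node_dim + global_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, edge_dim))
        self.node_fn = nn.Sequential(nn.Linear(node_dim + edge_dim + global_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, node_dim))
        self.global_fn = nn.Sequential(nn.Linear(node_dim + edge_dim + global_dim, hidden),
                                       nn.ReLU(), nn.Linear(hidden, global_dim))

    def forward(self, nodes, edges, senders, receivers, u):
        # senders/receivers: LongTensors of node indices per edge; u: global feature vector.
        # 1) Edges pull in their past value, their sender/receiver nodes, and the global feature.
        edges = self.edge_fn(torch.cat([edges, nodes[senders], nodes[receivers],
                                        u.expand(edges.size(0), -1)], dim=-1))
        # 2) Nodes aggregate their incoming edges plus their own past value and the global feature.
        incoming = torch.zeros(nodes.size(0), edges.size(1)).index_add(0, receivers, edges)
        nodes = self.node_fn(torch.cat([nodes, incoming, u.expand(nodes.size(0), -1)], dim=-1))
        # 3) The global attribute is updated from summaries of the new nodes and edges.
        u = self.global_fn(torch.cat([nodes.mean(0), edges.mean(0), u], dim=-1))
        return nodes, edges, u
```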
One benefit the authors claim for this structure is that it allows more interpretability: when they “play out” the values of their graph network, which they call a Relational Forward Model or RFM, they observe edge values for two agents go up if those agents are about to collaborate on an action, and observe edge values for an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it makes it easier for humans to understand, and, in the tests the authors ran, appears to also help agents do better in collaborative games. https://i.imgur.com/BCKSmIb.png While I find graph networks quite interesting, and multi-agent learning quite interesting, I’m a little more uncertain about the inherent “graphiness” of this problem, since there aren’t really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work on sparser graphs, or in settings where a node’s connectivity to its neighbors is more distinct from its connectivity to the average other node in the graph. Here, every node is connected to every other node, so the explicit information localization function of graph networks is less pronounced. I might naively think that – to whatever extent the graph is designed in a way that captures information meaningful to the task – explicit graph methods would have an even greater comparative advantage in that setting. |
[link]
It is a fact universally acknowledged that a reinforcement learning algorithm not in possession of a model must be in want of more data. Because they generally are. Joking aside, it is broadly understood that model-free RL takes a lot of data to train, and, even when you can design them to use off-policy trajectories, collecting data in the real environment might still be too costly. Under those conditions, we might want to learn a model of the environment, generate synthesized trajectories, and train on those. This has the advantage of not needing us to run the actual environment, but the obvious disadvantage that any model will be a simplification of the true environment, and potentially an inaccurate one. These authors seek to answer the question: “is there a way to generate trajectories that have higher fidelity to the true environment?” As you might infer from the fact that they published a paper, and that I’m now writing about it, they argue that, yes, there is, and it’s through explicit causal/counterfactual modeling. Causal modeling is one of those areas of statistics that seems straightforward at its highest level of abstraction, but tends to get mathematically messy and unintuitive when you dive into the math. So, rather than starting with equations, I’m going to try to verbally give some intuitions for the way causal modeling is framed here. Imagine you’re trying to understand what would happen if a person had gone to college. There’s some set of information you know about them, and some set of information you don’t, that’s just random true facts about them and about the universe. If, in the real world, they did go to college, and you want to simulate what would have happened if they hadn’t, it’s not enough to just know the observed facts about them; you want to actually isolate all of the random other facts (about them, about the world) that weren’t specifically “the choice to go to college”, and condition on those as well. Obviously, in the example given here, it isn’t really practically possible to isolate all the specific unseen factors that influence someone’s outcome. But, conceptually, this quantity is what we’re going to focus on in this paper. Now, imagine a situation where an RL agent has been dropped into a maze-like puzzle. It has some set of dynamics, not immediately visible to the player, that make it difficult, but ultimately solvable. The best kind of simulated data, the paper argues, would keep that state of the world (which is partially unobservable) fixed, and sample different sets of actions the agent might take in that space. Thus, “counterfactual modeling”: for a given configuration of random states in the world, sampling different actions within it. To do this, you first have to infer the random state the agent is experiencing. In the normal model-based case, you’d have some prior over world states, and just sample from it. However, if you use the experience of the agent’s trajectory, you can make a better guess as to what world configuration it was dropped into. If you can do this, which is, technically speaking, sampling from the posterior over unseen context, conditional on an agent’s experience, then the paper suggests you’ll be able to generate data that’s more realistic, because the trajectories will be direct counterfactuals of “real world” scenarios, rather than potentially-unsolvable or unrealistic draws from the prior. 
This is, essentially, the approach proposed by the paper: during training, they make this “world state” visible to the agent, and let it learn a model predicting what state it started with, given some trajectory of experience. They also learn a model that predicts the outcome and ultimately the value of actions taken, conditioned on this random context (as well as visible context, and the agent’s prior actions). They start out by using this as a tool for policy evaluation, which is a nice problem setup because you can actually check how well you’re doing against some baseline: if you want to know how good your simulated data is at replicating the policy reward on real data, you can just try it out on real data and see. The authors find that they reduce policy reward estimation error pretty substantially by adding steps of experience (in Bayesian terms, bits of evidence moving them from the prior towards the posterior). https://i.imgur.com/sNAcGjZ.png They also experiment with using this for actual policy search, but, honestly, I didn’t quite follow the intuitions behind Guided Policy Search, so I’m just going to not dive into that for now, since I think a lot of the key contributions of the paper are wrapped up in the idea of “estimate the reward of a policy by simulating data from a counterfactual trajectory”. |
[link]
In the literature on adversarial examples, there’s this (to me) constant question: is it the case that adversarial examples are causing the model to objectively make a mistake, or are they just displaying behavior that is deeply weird and unintuitive relative to our sense of what these models “should” be doing? A lot of the former question seems to come down to arguing about what’s technically “out of distribution”, which has an occasional angels-dancing-on-a-pin quality, but it’s pretty unambiguously clear that the behavior displayed in this paper is weird, and beyond what I naively expected a network to be able to be manipulated to do. The goal these authors set for themselves is what they call “reprogramming” of a network; they want the ability to essentially hijack the network’s computational engine to perform a different task, predicting a different set of labels, on a different set of inputs than the ones the model was trained on. For example, one task they perform is feeding in MNIST images at the center of a bunch of (what appear to be random, but are actually carefully optimized) pixels, and getting a network that can predict MNIST labels out the other end. Obviously, it’s not literally possible to change the number of outputs that a network produces once it’s trained, so the authors would arbitrarily map ImageNet outputs to MNIST categories (like, “when this model predicts Husky, that actually means the digit 7”) and then judge how well this mapped output performs as a MNIST classifier. I enjoyed the authors’ wry commentary here about the arbitrariness of the mapping, remarking that “a ‘White Shark’ has nothing to do with counting 3 squares in an image, and an ‘Ostrich’ does not at all resemble 10 squares”. https://i.imgur.com/K02cwK0.png This paper assumes a white-box attack model, which implies visibility of all of the parameters, and the ability to directly calculate gradients through the model. So, given this setup of an input surrounded by modifiable pixel weights, and a desire to assign your “MNIST labels” correctly, this becomes a straightforward optimization problem: modify the values of your input weights so as to maximize your MNIST accuracy. An important point to note here is that the same input mask of pixel values is applied for every new-task image, and so these values are optimized over a full training set of inserted images, the way that normal weights would be. One interesting observation the authors make is that, counter to the typical setup of adversarial examples, this attack would not work with a fully linear model, since you actually need your “weights” to interact with your “input”, which is different each time, but these are both just different areas of your true input. This need to have different regions of input determine how other areas of input are processed isn’t possible in a linear model, where each input has a distinct impact on the output, regardless of other input values. By contrast, when you just need to optimize a single perturbation to get the network to jack up the prediction for one class, that can be accomplished by just applying a strong enough bias everywhere in the input, all pointing in the same direction, which can be added together linearly and still get the job done. The authors are able to perform MNIST and the task of “count the squares in this small input” to relatively high levels of accuracy. They perform reasonably on CIFAR (about as well as a fully connected network, but not as well as a convnet). 
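A hedged sketch of that optimization: a single shared “program” of pixels surrounds every task image and is trained against a fixed mapping from ImageNet classes to task labels. Image sizes, the tanh parameterization and optimizer settings are assumptions, not the paper’s exact setup.

```python
import torch
import torch.nn.functional as F

def reprogram(imagenet_model, loader, label_map, big=224, small=28, steps=1000, lr=0.05):
    """Sketch: optimize one shared frame of pixels so a frozen ImageNet model
    solves the embedded task under a fixed class-to-label mapping."""
    program = torch.zeros(1, 3, big, big, requires_grad=True)       # shared across all inputs
    opt = torch.optim.Adam([program], lr=lr)
    pad = (big - small) // 2
    for step, (x, y) in enumerate(loader):                          # x: (B, 1, 28, 28) task images
        if step >= steps:
            break
        center = F.pad(x.repeat(1, 3, 1, 1), [pad, pad, pad, pad])  # embed the digit in a large frame
        mask = F.pad(torch.ones_like(x.repeat(1, 3, 1, 1)), [pad, pad, pad, pad])
        adv_input = center + (1 - mask) * torch.tanh(program)       # program acts only outside the digit
        logits = imagenet_model(adv_input)[:, label_map]            # reinterpret chosen ImageNet classes as task labels
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return program.detach()
```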
They found that performance was higher when using a pre-trained ImageNet model, relative to one with just random weights. There’s some suggestion made that this implies there’s a kind of transfer learning going on, but honestly, this is weird enough that it’s hard to say. https://i.imgur.com/bj2MUnk.png They were able to get this reprogramming to work on different model structures, but, fascinatingly, saw distinctive patterns in the "weight pixels" they needed to add to each model structure, with ResNet easily differentiable from Inception. One minor quibble I have with the framing of this paper – which I overall found impressive, creative, and well-written – is that I feel like it’s stretching the original frame of “adversarial example” a bit too far, to the point of possibly provoking confusion. It’s not obvious that the network is making a mistake, per se, when it classifies this very out-of-distribution input as something silly. I suppose, in an ideal world, we may want our models to return to something like a uniform-over-outputs state of low confidence when predicting out of distribution, but that’s a bit different than seeing a gibbon in a picture of a panda. I don’t dispute the authors’ claim that the behavior they’re demonstrating is a vulnerability in terms of its ability to let outside actors “hijack” a network’s compute, but I worry we might be overloading the term “adversarial example” to cover too many types of network failure modes. |
[link]
I really enjoyed this paper – in addition to being a clean, fundamentally empirical work, it was also clearly written, and had some pretty delightful moments of quotable zen, which I’ll reference at the end. The paper’s goal is to figure out how far curiosity-driven learning alone can take reinforcement learning systems, without the presence of an external reward signal. “Intrinsic” reward learning is when you construct a reward out of internal, inherent features of the environment, rather than using an explicit reward function. In some ways, intrinsic learning in RL can be thought of as analogous to unsupervised learning in classification problems, since reward functions are not inherent to most useful environments, and (when outside of game environments that inherently provide rewards) frequently need to be hand-designed. Curiosity-driven learning is a subset of intrinsic learning, which uses as a reward signal the difference between a prediction made by the dynamics model (predicting the next state, given the action) and the true observed next state. Situations where this prediction error is high generate high reward for the agent, which incentivizes it to reach those states, which allows the dynamics model to then make ever-better predictions about them. Two key questions this paper raises are: 1) Does this approach even work when used on its own? Curiosity had previously most often been used as a supplement to extrinsic rewards, and the authors wanted to know how far it could go separately. 2) What is the best feature space to do this “surprisal difference” calculation in? Predicting raw pixels is a high-dimensional and noisy process, so naively we might want something with fewer, more informationally-dense dimensions, but it’s not obvious which methods that satisfy these criteria will work best, so the paper empirically tried them. The answer to (1) seems to be: yes, at least in the video games tested. Impressively, when you track against extrinsic reward (which, again, these games have, but which we’re just ignoring in a curiosity-only setting), the agents manage to increase it despite not optimizing against it directly. There were some Atari games where this effect was stronger than others, but overall performance was stronger than might have been naively expected. One note the authors made, worth keeping in mind, is that it’s unclear how much of this is an artifact of the constraints and incentives surrounding game design, which might reflect back a preference for gradually-increasing novelty because humans find it pleasant. https://i.imgur.com/zhl39vo.png As for (2), another interesting result of this paper is that random features performed consistently well as a feature space to do these prediction/reality comparisons in. Random features here is really just as simple as “design a convolutional net that compresses down to some dimension, randomly initialize it, and then use those randomly initialized weights to run forward passes of the network to get your lower-dimensional state”. This has the strong disadvantage of (presumably) not capturing any meaningful information about the state, but also has the advantage of being stable: the other techniques tried, like pulling out the center of a VAE bottleneck, changed over time as they were being trained on new states, so they were informative, but non-stationary. 
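A hedged sketch of this combination (forward-model error as intrinsic reward, computed in a frozen random feature space); the network sizes, action encoding and optimizer are assumptions:

```python
import torch
import torch.nn as nn

class CuriosityReward(nn.Module):
    """Sketch: a frozen random encoder gives compact features, a forward model
    predicts the next feature vector from (feature, action), and its error is
    the intrinsic reward."""
    def __init__(self, obs_dim, action_dim, feat_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        for p in self.encoder.parameters():
            p.requires_grad_(False)                      # random features: never trained
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + action_dim, 128), nn.ReLU(),
                                           nn.Linear(128, feat_dim))
        self.opt = torch.optim.Adam(self.forward_model.parameters(), lr=1e-4)

    def reward_and_update(self, obs, action, next_obs):
        # action: (batch, action_dim), e.g. one-hot encoded discrete actions.
        with torch.no_grad():
            phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        pred = self.forward_model(torch.cat([phi, action], dim=-1))
        error = ((pred - phi_next) ** 2).mean(dim=-1)
        self.opt.zero_grad()
        error.mean().backward()                          # train the dynamics model
        self.opt.step()
        return error.detach()                            # intrinsic reward per transition
```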
My two favorite quotable moments from this paper were: 1) When the authors noted that they had removed the “done” signal associated with an agent “dying”, because it is itself a sort of extrinsic reward signal. However, “in practice, we do find that the agent avoids dying in the games since that brings it back to the beginning of the game, an area it has already seen many times and where it can predict the dynamics well.” Short and sweet: “avoiding death, because it’s really boring”. https://i.imgur.com/SOfML8d.png 2) When they noted that an easy way to hack the motivation structure of a curiosity-driven agent was through a “noisy TV”, which, every time you pressed the button, jumped to a random channel. As expected, when they put this distraction inside a maze, the agent spent more time jacking up reward through that avenue, rather than exploring. Any resemblance to one’s Facebook feed is entirely coincidental. |
[link]
This paper’s high-level goal is to evaluate how well GAN-type structures for generating text are performing, compared to more traditional maximum likelihood methods. In the process, it zooms in on the ways that the current set of metrics for comparing text generation fails to give a well-rounded picture of how models are performing. In the old paradigm of maximum likelihood estimation, models were both trained and evaluated on maximizing the likelihood of each word, given the prior words in a sequence. That is, models were good when they assigned high probability to true tokens, conditioned on past tokens. However, GANs work in a fundamentally new framework, in that they aren’t trained to increase the likelihood of the next (ground truth) word in a sequence, but to generate a word that will make a discriminator more likely to see the sentence as realistic. Since GANs don’t directly model the probability of token t given prior tokens, you can’t evaluate them using this maximum likelihood framework. This paper surveys a range of prior work that has evaluated GANs and MLE models on two broad categories of metrics, occasionally showing GANs to perform better on one or the other, but not really giving a way to trade off between the two. - The first type of metric, shorthanded as “quality”, measures how aligned the generated text is with some reference corpus of text: to what extent your generated text seems to “come from the same distribution” as the original. BLEU, a heuristic frequently used in translation, and also leveraged here, measures how frequently n-grams from the generated text also occur in the reference text. N typically goes up to 4, and so in addition to comparing the distributions of single tokens in the reference and generated text, BLEU also compares shared bigrams, trigrams, and 4-grams to measure more precise similarity of text. - The second metric, shorthanded as “diversity”, measures how different generated sentences are from one another. If you want to design a model to generate text, you presumably want it to be able to generate a diverse range of text – in probability terms, you want to fully sample from the distribution, rather than just taking the expected or mean value. Linguistically, a lack of diversity would show up as a generator that just generates the same sentence over and over again. This sentence can be highly representative of the original text, but lacks diversity. One metric used for this is the same kind of BLEU score, but computed for each generated sentence against a corpus of previously generated sentences, and, here, the goal is for the overlap to be as low as possible. The trouble with these two metrics is that, in their raw state, they’re pretty incommensurable, and hard to trade off against one another. The authors of this paper try to address this by observing that all models trade off diversity and quality to some extent, just by modifying the entropy of the conditional token distribution they learn. If a distribution is high-entropy, that is, if it spreads probability out onto more tokens, it’s likelier to bounce off into a random place, which increases diversity, but also can make the sentence more incoherent. By contrast, if a distribution is too low-entropy, only ever putting probability on one or two words, then it will only ever be capable of carving out a small number of distinct paths through word space. 
The table below shows a good example of what language generation can look like at high and low levels of entropy. https://i.imgur.com/YWGXDaJ.png The entropy of a softmax distribution can be modified, without changing the underlying model, by changing the *temperature* of the softmax calculation. So, the authors do this, and, as a result, they can chart out each model’s curve on the quality/diversity axes. Conceptually, this is asking “at a range of different quality thresholds, how good is this model’s diversity,” and vice versa. I mentally analogize this to a ROC curve, where it’s not really possible to compare, say, the precision of models that use different thresholds, and so you instead need to compare the curve over a range of different thresholds, and compare models on that. https://i.imgur.com/C3zdEjm.png When they do this for GANs and MLE models, they find that, while GANs might dominate on a single metric at a time, when you modulate the temperature of MLE models, the MLE models are able to achieve superior quality when tuned to commensurate levels of diversity. |
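A hedged sketch of such a temperature sweep (the model interface and the `quality_metric`/`diversity_metric` functions, e.g. BLEU against a reference corpus and self-BLEU, are assumptions):

```python
import torch
import torch.nn.functional as F

def temperature_sweep(model, prompt_ids, temperatures, quality_metric, diversity_metric,
                      num_samples=100, max_len=40):
    """Sketch: sample the same language model at several softmax temperatures and
    record a (quality, diversity) point per setting to trace out the curve."""
    curve = []
    for t in temperatures:
        samples = []
        for _ in range(num_samples):
            ids = list(prompt_ids)
            for _ in range(max_len):
                logits = model(torch.tensor([ids]))[0, -1]   # next-token logits (assumed interface)
                probs = F.softmax(logits / t, dim=-1)        # temperature-scaled distribution
                ids.append(torch.multinomial(probs, 1).item())
            samples.append(ids)
        curve.append((t, quality_metric(samples), diversity_metric(samples)))
    return curve  # one (temperature, quality, diversity) point per setting
```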
[link]
This paper proposes a new training method for multi-agent communication settings. The authors study the following referential game: a speaker sees an image of a 3D-rendered object and describes it to a listener. The listener sees a different image and must decide whether it shows the same object as described by the speaker (i.e., the same color and shape). The game can only be completed successfully if a communication protocol emerges that can express the color and shape the speaker sees. The main contribution of the paper is the training algorithm: the speaker greedily constructs, symbol by symbol, the message that would maximize its own understanding of the message given the image it sees. The listener, given the image and the message, predicts a binary output and is trained using maximum likelihood on the correct answer. Only the listener updates its parameters; therefore, the speaker and listener swap roles every certain number of rounds. They show that a compositional communication protocol emerges and evaluate it using zero-shot tests. [Implementation of this paper in PyTorch](https://github.com/benbogin/obverter) |
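A hedged sketch of the greedy, symbol-by-symbol message construction (the speaker reuses its own listener network; the vocabulary size, maximum length and stopping rule are assumptions):

```python
import torch

def generate_message(listener, image, vocab_size=5, max_len=10):
    """Sketch: extend the message one symbol at a time, always picking the symbol
    that most increases the speaker's own listener probability for its image."""
    message, best_prob = [], 0.0
    with torch.no_grad():
        for _ in range(max_len):
            probs = [float(listener(image, torch.tensor([message + [s]])))
                     for s in range(vocab_size)]
            best_symbol = max(range(vocab_size), key=lambda s: probs[s])
            if probs[best_symbol] <= best_prob:   # stop once no symbol improves the score
                break
            message.append(best_symbol)
            best_prob = probs[best_symbol]
    return message
```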
[link]
Gilmer et al. study the existence of adversarial examples on a synthetic toy dataset consisting of two concentric spheres. The dataset is created by randomly sampling examples from two concentric spheres, one with radius $1$ and one with radius $R = 1.3$. While the authors argue that different difficulties of the dataset can be created by varying $R$ and the dimensionality, they merely experiment with $R = 1.3$ and a dimensionality of $500$. The motivation to study this dataset comes from the idea that adversarial examples can easily be found by leaving the data manifold. Based on this simple dataset, the authors provide several theoretical insights – see the paper for details. Beyond theoretical insights, Gilmer et al. also discuss the so-called manifold attack, an attack using projected gradient descent which ensures that the adversarial example stays on the data manifold – moreover, it is ensured that the class does not change. Unfortunately (as far as I can tell), this idea of a manifold attack is not studied further – which is very unfortunate and raises the question why this concept was introduced in the first place. One of the main take-aways is the suggestion that there is a trade-off between accuracy (i.e., the ability of the network to perform well) and the average distance to an adversarial example. Thus, the existence of adversarial examples might be related to the question why deep neural networks perform very well. Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
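For reference, a hedged sketch of how such a concentric-spheres dataset can be sampled (matching the $R = 1.3$, $d = 500$ setting mentioned above; the helper name is made up):

```python
import torch

def sample_spheres(n, d=500, R=1.3):
    """Sketch: sample points uniformly on two concentric spheres of radius 1 and R;
    the label says which sphere a point lies on."""
    y = torch.randint(0, 2, (n,))
    x = torch.randn(n, d)
    x = x / x.norm(dim=1, keepdim=True)          # uniform direction on the unit sphere
    radius = torch.where(y == 0, torch.ones(n), torch.full((n,), R))
    return x * radius.unsqueeze(1), y

# A manifold attack would additionally project each perturbed point back onto its
# sphere, e.g. x_adv = r * x_adv / x_adv.norm(), so the example never leaves the
# data manifold.
```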