NIPS is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of refereed papers.

- 1989 3
- 1992 1
- 1995 1
- 1996 1
- 1999 1
- 2000 1
- 2007 1
- 2010 1
- 2011 1
- 2012 2
- 2013 51
- 2014 19
- 2015 53
- 2016 18
- 2017 19
- 2018 17
- 2019 20

Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2024/03/04 (just now)

**Abstract:** Hierarchical reinforcement learning (HRL) is a promising approach to extend
traditional reinforcement learning (RL) methods to solve more complex tasks.
Yet, the majority of current HRL methods require careful task-specific design
and on-policy training, making them difficult to apply in real-world scenarios.
In this paper, we study how we can develop HRL algorithms that are general, in
that they do not make onerous additional assumptions beyond standard RL
algorithms, and efficient, in the sense that they can be used with modest
numbers of interaction samples, making them suitable for real-world problems
such as robotic control. For generality, we develop a scheme where lower-level
controllers are supervised with goals that are learned and proposed
automatically by the higher-level controllers. To address efficiency, we
propose to use off-policy experience for both higher and lower-level training.
This poses a considerable challenge, since changes to the lower-level behaviors
change the action space for the higher-level policy, and we introduce an
off-policy correction to remedy this challenge. This allows us to take
advantage of recent advances in off-policy model-free RL to learn both higher-
and lower-level policies using substantially fewer environment interactions
than on-policy algorithms. We term the resulting HRL agent HIRO and find that
it is generally applicable and highly sample-efficient. Our experiments show
that HIRO can be used to learn highly complex behaviors for simulated robots,
such as pushing objects and utilizing them to reach target locations, learning
from only a few million samples, equivalent to a few days of real-time
interaction. In comparisons with a number of prior HRL methods, we find that
our approach substantially outperforms previous state-of-the-art techniques.
more
less

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
# Keypoints - Proposes the HIerarchical Reinforcement learning with Off-policy correction (**HIRO**) algorithm. - Does not require careful task-specific design. - Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions. - Use of off-policy experience using a novel off-policy correction. - A two-level hierarchy architecture - A higher-level controller outputs a goal for the lower-level controller every **c** time steps and collects the rewards given by the environment, being the goal the desired change in state space - The lower level controller has the goal given added to its input and acts directly in the environment, the reward received is parametrized from the current state and the goal. # Background This paper adopts a standard continuous control reinforcement learning setting, in which an agent acts on an environment that yields a next state and a reward from unknown functions. This paper utilizes the TD3 learning algorithm. ## General and Efficient Hierarchical Reinforcement Learning https://i.imgur.com/zAHoWWO.png ## Hierarchy of Two Policies The higher-level policy $\mu^{hi}$ outputs a goal $g_t$, which correspond directly to desired relative changes in state that the lower-level policy $\mu^{lo}$ attempts to reach. $\mu^{hi}$ operates at a time abstraction, updating the goal $g_t$ and collecting the environment rewards $R_t$ every $c$ environment steps, the higher-level transition $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ is stored for off-policy training. The lower-level policy $\mu^{lo}$ outputs an action to be applied directly to the environment, having as input the current environment observations $s_t$ and the goal $g_t$. The goal $g_t$ is given by $\mu^{hi}$ every $c$ environment time steps, for the steps in between, the goal $g_t$ used by $\mu^{lo}$ is given by the transition function $g_t=h(s_{t−1},g_{t−1},s_t)$, the lower-level controller reward is provided by the parametrized reward function $r_t=r(s_t,g_t,a_t,s_{t+1})$. The lower-level transition $(s_t,g_t,a_t,r_t,s_{t+1}, g_{t+1})$ is stored for off-policy training. ## Parameterized Rewards The goal $g_t$ indicates a desired relative changes in state observations, the lower-level agent task is to take actions from state $s_t$ that yield it an observation $s_{t+c}$ that is close to $s_t+g_t$. To maintain the same absolute position of the goal regardless of state change, the goal transition model, used between $\mu^{hi}$ updates every $c$ steps, is defined as: $h(s_t,g_t,s_{t+1}) =s_t+g_t−s_{t+1}$ And the reward given to the lower-level controller is defined as to reinforce reaching a state closer to the goal $g_t$, this paper parametrizes it by the function: $r(s_t,g_t,a_t,s_{t+1}) =−||s_t+g_t−s_{t+1}||_2$. ## Off-Policy Corrections for Higher-Level Training The higher-level transitions stored $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ have to be converted to state-action-reward transitions $(s_t,g_t,∑R_{t:t+c−1},s_{t+c})$ as they can be used in standard off-policy RL algorithms, however, since the lower-level controller is evolving, these past transitions do not accurately represent the actions tha would be taken by the current lower-level policy and must be corrected. This paper correction technique used is to change the goal $g_t$ of past transitions using an out of date lower-level controller to a relabeled goal $g ̃_t$ which is likely to induce the same lower-level behavior with the updated $\mu^{lo}$. In other words, we want to find a goal $g ̃_t$ which maximizes the probability $μ_{lo}(a_{t:t+c−1}|s_{t:t+c−1},g ̃_{t:t+c−1})$, in which the $\mu^{lo}$ is the current policy and the actions $a_{t:t+c−1}$ and states $s_{t:t+c−1}$ are from the stored high level transition. To approximately maximize this quantity in practice, the authors calculated the probability for 10 candidates $g ̃_t$, eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c}−s_t$, the original goal $g_t$ and a goal corresponding to the difference $s_{t+c}−s_t$. # Experiments https://i.imgur.com/iko9nCd.png https://i.imgur.com/kGx8fZv.png The authors compared the $HIRO$ method to prior method in 4 different environments: - Ant Gather; - Ant Maze; - Ant Push; - Ant Fall. They also performed an ablative analysis with the following variants: - With lower-level re-labeling; - With pre-training; - No off-policy correction; - No HRL. # Closing Points - The method proposed is interesting in the hierarchical reinforcement learning setting for not needing a specific design, the generic goal representation enables applicability without the need of designing a goal space manually; - The off-policy correction method enables this algorithm to be sample efficient; - The hierarchical structure with intermediate goals on state-space enables to better visualize the agent goals; - The paper Appendix elaborates on possible alternative off-policy corrections. |

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/09/04 (5 years ago)

**Abstract:** A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving state of the
art results in various environments. We also train our agent entirely inside of
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. Interactive version of paper at
https://worldmodels.github.io
more
less

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

## General Framework The take-home message is that the challenge of Reinforcement Learning for environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory features extractor V to deal with the highly dimensional observation (pixels for example). But also learning a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a controller so as to maximize a reward is really easy (single linear layer evolved with CMA-ES). Authors call these representations a *World model* since they can use the learned environment's dynamics to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment provided that measures are taken to prevent the policy from exploiting the world model's inaccuracies. ## Method **Learning the World Model** ![](https://i.imgur.com/tgV17k4.png =600x) In this work they propose to learn these representations off-line in an unsupervized manner in order to be more efficient. They use a VAE for V that they train exclusively with the reconstruction loss, that way the learned representations are independent of the reward and can be used alongside any reward. They then train M as Mixture-Density-Network-RNN to predict the next sensory features (as extracted by the VAE) --and possibly the done condition and the reward-- and thus learn the dynamics of the environment in the VAE's latent space (which is likely simpler there than in the pixel space). Note that the VAE's latent space is a single Gaussian (adding stochasticity makes it more robust to the "next state" outputs of M), whereas M outputs next states in a mixture of Gaussians. Indeed, an image is likely to have one visual encoding, yet it can have multiple and different future scenarii which are captured by the multimodal output of M. **Training the policy** ![](https://i.imgur.com/H5vpb2H.png) * In the real env: The agent is provided with the visual features and M's hidden state (temporal features). * In the world model: To avoid that the agent exploits this imperfect simulator they increase its dynamics' stochasticity by playing with $\tau$ the sampling temperature of $z_{t+1}$ in M. ## Limitations If exploration is important in the environment the initial random policy might fail to collect data in all the relevant part of the environment and an iterative version of Algorithm 1 might be required (see https://worldmodels.github.io/ for a discussion on the different iterative methods) for the data collection. By training V independently of M it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently so that the reward and $z_{t+1}$'s prediction loss (or next state reconstruction loss) of M flows through V (that would also be trained with its own reconstruction loss). The trade-off is that now V is tuned to a particular reward and cannot be reused. The authors argue that since $h_t$ is such that it can predict $z_{t+1}$, it contains enough insight about the future for the agent not needing to *plan ahead* and just doing reflexive actions based on $h_t$. This is interesting but the considered tasks (driving, dodging fireball) are still very reflexive and do not require much planning. ## Results When trained on the true env, a simple controller with the V and M representations achieve SOTA on car-racing. V + M is better than V alone. When trained inside the world model, its dynamics' stochasticity must be tuned in order for the policy to transfer well and perform well on the real env: too little stochasticity and the agent overfits to the world model flaws and does not transfer to the real env, too much and the agent becomes risk-averse and robust but suboptimal. ![](https://i.imgur.com/e8ETSjQ.png) ## Additional ressources Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/ |

Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier Transform

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Bafna et al. show that iterative hard thresholding results in $L_0$ robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse, and $e$ is assumed to be sparse. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy. https://i.imgur.com/yClXLoo.jpg Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attack on images and other signals. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels

Zhang, Zhilu and Sabuncu, Mert R.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Zhilu and Sabuncu, Mert R.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Gosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to $\sum_{j=1}^c \mathcal{L}(f(x), j) = C$ for all $x$ and $f$ where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not symmetric, while the mean absolute error is. The mean absolute error however, usually results in slower learning and may reach lower accuracy. As alternative, the authors propose $\mathcal{L}(f(x), e_j) = \frac{(1 – f_j(x)^q)}{q}$. Here, $f$ is the classifier which is assumed to contain a softmax layer at the end. For $q \rightarrow 0$ this reduces to the cross entropy and for $q = 1$ it reduces to the mean absolute error. As shown in Figure 1, this loss (or a slightly adapted version, see paper, respectively) may obtain better performance on noisy labels. To this end, the label noise is assumed to be uniform, meaning that $p(\tilde{y} = k|y = j, x)= 1 - \eta$ where $\tilde{y}$ is the perturbed label. https://i.imgur.com/HRQ84Zv.jpg Figure 1: Performance of the proposed loss for different $q$ and noise rate $\eta$ on Cifar-10. A ResNet-34 is used. [1] Aritra Gosh, Himanshu Kumar, PS Sastry. Robust loss functions under label noise for deep neural networks. AAAI, 2017. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

A Spectral View of Adversarially Robust Features

Garg, Shivam and Sharan, Vatsal and Zhang, Brian Hu and Valiant, Gregory

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Garg, Shivam and Sharan, Vatsal and Zhang, Brian Hu and Valiant, Gregory

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Garg et al. propose adversarially robust features based on a graph interpretation of the training data. In this graph, training points are connected based on their distance in input space. Robust features are obtained using the eigenvectors of the Laplacian of the graph. It is theoretically shown that these features are robust, based on some assumptions on the graph. For example, the bound obtained on robustness depends on the gap between second and third eigenvalue. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Regularizing by the Variance of the Activations' Sample-Variances

Littwin, Etai and Wolf, Lior

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Littwin, Etai and Wolf, Lior

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Littwin and Wolf propose a activation variance regularizer that is shown to have a similar, even better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the idea is that the measured variance of these variances is low if the activation values come from a distribution with few modes. Thus, the intention of the regularizer is to encourage distributions of activations with only few modes. This is achieved using the regularizers $\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$ where $\sigma_s^2$ is the measured variance of activation values and $\sigma^2$ is the true variance of activation values. The estimate $\sigma^2_s$ is mostly influenced by the mini-batch used for training. In practice, the regularizer is replaced by $(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$ which can be estimated on two different batches, $s_1$ and $s_2$, during training and $\beta$ is a parameter that can be learned and mainly handles the case where the variance is close to zero. In the paper, the authors provide some theoretical bounds and also make a connection to batch normalization and in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on Cifar and Tiny ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Generalizing Tree Probability Estimation via Bayesian Networks

Zhang, Cheng and IV, Frederick A. Matsen

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Cheng and IV, Frederick A. Matsen

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
A common problem in phylogenetics is: 1. I have $p(\text{DNA sequences} | \text{tree})$ and $p(\text{tree})$. 2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree} | \text{DNA sequences})$. 3. I want to evaluate $p(\text{tree} | \text{DNA sequences})$. The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; referred to in this paper as *simple sample relative frequencies* (SRF). An obvious problem with this method is that if you didn't happen to sample a tree topology you will assign it zero probability. This paper focuses on producing a better solution to this problem, by defining a distribution over trees that's easy to fit and provides support over the entire space of tree topologies. What is a Subsplit Bayesian Network? ============================== Phylogenies are leaf-labeled bifurcating trees, binary trees with labeled leaves. **Clades** are nonempty subsets of the leaf labels; labeled in the following figure as $C_1 \to C_7$: https://i.imgur.com/khS3uSo.png To build a phylogeny, I can just take the full set of leaves and recursively split them into clades as described in the above diagram. This means that the distribution over phylogenies is equivalent to the distribution over clades. To make this distribution tractable, it is typically assumed that clades are conditionally independent given parent clades; called the *Conditional Clade Distribution* (CCD) assumption, attributed here to [Larget][] and [Hohna][]. So, a phylogeny may be described by its clades, this paper proposes that these clades may be described by their **subsplits**; the splitting process placing leaf nodes into one of two clades. Then, the authors note this process is a directed Bayesian network and that this Bayesian network must include all possible clade splits. It is therefore a complete binary tree with depth $N-1$ (where $N$ is the number of leaf nodes). Fitting This Parameterisation to MCMC Samples ===================================== Under this parameterisation the likelihood of observing a given tree is: $$ p(\text{tree}) = p(S_1 = s_{1,k}) \prod_{i>1} p(S_i = s_{i,k} | S_{\pi_i} = s_{\pi_i, k}), $$ assuming a collection of subsplits $T_k = \{ S_i = s_{i,k}, i \geq 1 \}$. In this definition $S_{\pi_i}$ is index set of parent nodes of $S_i$; ie a subsplit can depend on any of the parent nodes. The maximum likelihood probabilities for possible subsplits can be solved in closed form; algorithmically involving counting the occurrences of subsplits and dividing by the number of trees observed. If this seems exactly the same as SRF, that's because it is; I [checked the published code to verify this][sbn_ds1]. The authors then consider the case of unrooted trees, where the log likelihood of observed trees can't be easily factorised. They then present a [simple averaging][sa] (not sure where this variational method is discussed in that paper, appears to be under a different name?) and an EM algorithm to fit SBN models over such distributions. They also discuss conditional probability sharing (parameterising conditional on parent-child relationships) immediately before this and it's not clear if this is used in the distribution fit by SA or EM. Experiments show the distributions fit using the EM algorithm perform well according to KL divergence from a "true" distribution defined by fitting a distribution using SRF to a larger dataset. They also show less dispersion estimating the probability of sampled trees versus that estimated by this "ground truth". [sa]: https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf [sbn_ds1]: https://github.com/morrislab/sbn/blob/master/experiments/DS1.ipynb [larget]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676676/ [hohna]: https://academic.oup.com/sysbio/article/61/1/1/1676649 |

Sanity Checks for Saliency Maps

Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian J. and Hardt, Moritz and Kim, Been

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian J. and Hardt, Moritz and Kim, Been

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
The paper designs some basic tests to compare saliency methods. It founds that some of the most popular methods are independent of model parameters and the data, meaning they are effectively useless. ## Methods compared The paper compare the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCam and SmoothGrad. They provide a refresher on those methods in the appendix. All those methods can be put in the same framework. They require a classification model and an input (typically an image). The output of the method is an *explanation map* of the shape of the input where a higher value for a feature implies greater relevance in the decision of the model. ## Metrics of comparison The authors argue that visual inspection of the saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSMI) and the Pearson correlation of the histogram of gradients. The authors point out that those metrics capture various notions of similarity, but it is an active area of research and those metrics are imperfect. ## First test: model parameters randomization A saliency method must be dependent of model parameters, otherwise it cannot help us understand a model. In this test, the authors randomize the model parameters, layer per layer, starting from the top. Surprisingly, methods such as guided backprop and guided gradcam are completely insensitive to model parameters, as illustrated on this Inception v3 trained on ImageNet: ![image](https://user-images.githubusercontent.com/8659132/61403152-b10b8000-a8a2-11e9-9f6a-cf1ed6a876cc.png) Integrated gradients looks also dubious as the bird is still visible with a mostly fully randomized model, but the quantitative metrics reveal the difference is actually big between the two models. ## Second test: data randomization It is well-known that randomly shuffling the labels of a dataset does not prevent a neural network from getting a high accuracy on the training set, though it does prevent generalization. The model is able to learn by either memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should have no clearly interpretable signal. Here is the result for a ConvNet trained on MNIST and a shuffled MNIST: ![image](https://user-images.githubusercontent.com/8659132/61406757-7efe1c00-a8aa-11e9-9826-a859a373cb4f.png) The results are very damning for most methods. Only gradients and GradCam are very different between both models, as confirmed by the low correlation. ## Discussion - Even though some methods do no depend on model parameters and data, they might still depend on the architecture of the models, which could be of some use in some contexts. - Methods that multiply the input with the gradient are dominated by the input. - Complex saliency methods are just fancy edge detectors. - Only gradient, smooth gradient and GradCam survives the sanity checks. # Comments - Why is their GradCam maps so ugly? They don't look like usual GradCam maps at all. - Their tests are simple enough that it's hard to defend a method that doesn't pass them. - The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret. - In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. In the case of images, people usually just set the image to 0, which is not at all the absence of a feature. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect an estimation of the baseline such as done in [this paper](https://arxiv.org/abs/1702.04595) would be a fairer evaluation of IG. Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19) |

Efficient Neural Network Robustness Certification with General Activation Functions

Zhang, Huan and Weng, Tsui-Wei and Chen, Pin-Yu and Hsieh, Cho-Jui and Daniel, Luca

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Huan and Weng, Tsui-Wei and Chen, Pin-Yu and Hsieh, Cho-Jui and Daniel, Luca

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zhang et al. propose CROWN, a method for certifying adversarial robustness based on bounding activations functions using linear functions. Informally, the main result can be stated as follows: if the activation functions used in a deep neural network can be bounded above and below by linear functions (the activation function may also be segmented first), the network output can also be bounded by linear functions. These linear functions can be computed explicitly, as stated in the paper. Then, given an input example $x$ and a set of allowed perturbations, usually constrained to a $L_p$ norm, these bounds can be used to obtain a lower bound on the robustness of networks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

Tao, Guanhong and Ma, Shiqing and Liu, Yingqi and Zhang, Xiangyu

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Tao, Guanhong and Ma, Shiqing and Liu, Yingqi and Zhang, Xiangyu

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Tao et al. propose Attacks Meet Interpretability, an adversarial example detection scheme based on the interpretability of individual neurons. In the context of face recognition, in a first step, the authors identify neurons that correspond to specific face attributes. This is achieved by constructing sets of images were only specific attributes change, and then investigating the firing neurons. In a second step, all other neurons, i.e., neurons not corresponding to any meaningful face attribute, are weakened in order to improve robustness against adversarial examples. The idea is that adversarial examples make use of these non-interpretable neurons to fool the networks. Unfortunately, this defense has been shown not to be effective in [1]. [1] Nicholas Carlini. Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples? ArXiv.org, abs/1902.02322, 2019. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Towards Robust Interpretability with Self-Explaining Neural Networks

Alvarez-Melis, David and Jaakkola, Tommi S.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Alvarez-Melis, David and Jaakkola, Tommi S.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Alvarez-Melis and Jaakkola propose three requirements for self-explainable models, explicitness, faithfulness and stability, and construct a self-explainable, generalized linear model optimizing for these properties. In particular, the proposed model has the form $f(x) = \theta(x)^T h(x)$ where $\theta(x)$ are features (e.g., from a deep network) and $h(x)$ are interpretable features/concepts. In practice, these concepts are learned using an auto-encoder from the raw input while the latent code, which represents $h(x)$, is regularized to learn concept under weak supervision. Additionally, the classifier is regularized to be locally difference-bounded by the concept function $h(x)$. This means that for each point $x_0$ it holds $\|f(x) – f(x_0)\| \leq L \|h(x) – h(x_0)\|$ for all $\|x – x_0\|_\delta$ for some $\delta$ and $L$. This condition leads to some stability of interpretations with respect to the concepts $h(x)$. In practice, this is enforced through a regularizer. In experiments, the authors argue that this class of models has advantages regarding the following three properties of self-explainable models: explicitness, i.e., whether explanations are actually understandable, faithfulness, i.e. whether estimated importance of features reflects true relevance, and stability, i.e., robustness of interpretations against small perturbations. For some of these conditions, the authors propose quantitative metrics; robustness, for example, can be evaluated using $\arg\max_{\|x’ - x\|\leq\epsilon} \frac{\|f(x) – f(x’)}{\|h(x) – h(x’)\|}$ which is very similar to practically evaluating adversarial robustness. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Constructing Unrestricted Adversarial Examples with Generative Models

Song, Yang and Shu, Rui and Kushman, Nate and Ermon, Stefano

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Song, Yang and Shu, Rui and Kushman, Nate and Ermon, Stefano

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Song et al. propose generative adversarial examples, crafted using a generative adversarial network (GAN) from scratch. In particular a GAN is trained on the original images in order to approximate the generative data distribution. Then, adversarial examples can be found in the learned latent space by finding a latent code that minimizes a loss consisting of fooling the target classifier, not fooling an auxiliary classifier (to not change the actual class) and (optionally) staying close to some fixed random latent code. These adversarial examples do not correspond ot original images anymore, instead they are unrestricted and computed from scratch. Figure 1 shows examples. https://i.imgur.com/Krr9MuU.png Figure 1: Examples of projected gradient descent (PGD, top) to find adversarial examples in the image space, and found adversarial examples in the latent space, as proposed. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Learning to Navigate in Cities Without a Map

Mirowski, Piotr and Grimes, Matthew Koichi and Malinowski, Mateusz and Hermann, Karl Moritz and Anderson, Keith and Teplyashin, Denis and Simonyan, Karen and Kavukcuoglu, Koray and Zisserman, Andrew and Hadsell, Raia

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Mirowski, Piotr and Grimes, Matthew Koichi and Malinowski, Mateusz and Hermann, Karl Moritz and Anderson, Keith and Teplyashin, Denis and Simonyan, Karen and Kavukcuoglu, Koray and Zisserman, Andrew and Hadsell, Raia

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
This paper out of DeepMind used a Google StreetView dataset and set out to train a network capable of navigating to a given goal destination, without knowing where it was on any birds-eye map, and with its only input being photographic viewpoint images of its current location and orientation. This was done through a framework of reinforcement learning, where the model is conditioned on a representation of its goal, and given the image features of its current view of the world, and has to take actions like “turn left,” “turn sharply left”, “go forward”, etc, in order to navigate. Rather than lat-long, goals are specified in city-specific ways, in terms of the distance between the goal position and a reference set of landmarks. I don’t entirely understand the motivation behind this approach; the authors say it’s more scalable, but it wasn’t obvious to me why that would be the case. https://i.imgur.com/V3UATsK.png - The authors construct different architectures that combine these two fundamental pieces of input data - current image and the goal you’re trying to reach - in different ways. In the simplest model, called GoalNav, there’s a single LSTM that combines the goal information with the output of a convolutional encoder processing images of your current viewpoint. - In the next most complex, CityNav, there are two LSTMs: one for processing your goal, and the other for combining the output of the goal network with your convolutional inputs, in order to decide on an action. Notionally, this separation of tasks corresponds to “figure out what absolute to go in, given your goal”, and “figure out how to go in that absolute direction from where you are now”. As a way to support training, the goal network is trained with an auxiliary loss function where it needs to predict how far its current orientation is from North. Note that this does pass some amount of information about current location into the model (since the network gets to know its actual orientation relative to true north), but this is only available during training, with the hope that the model will have gotten good enough at predicting orientation to perform well. - The final model, similar to above, is called MultiCityNav, and is explicitly designed for transfer learning. Instead of training multiple cities on a single shared network, only the convolutional encoder and policy network (the “how do I go in the absolute direction needed to reach my goal” parts) are shared between cities, and the goal processing LSTM (the “which direction should I be going in” part) is re-trained per city. This is designed to allow for transfer in the parts of learning you would expect to generalize, but allow the network to learn a city-specific approach for converting between goal specifications (in terms of city landmarks) and direction. In order to get over the fact that reward in this setting is very sparse (i.e. you only get reward when you reach the goal), the authors (1) train in a curriculum fashion, starting with tasks very nearby the model’s starting point, and gradually getting longer, and (2) add a small amount of reward shaping, where you get rewarded for moving in the direction of the goal, but only if you’re within 200m of it. This last is a bit of a concession on the realism front, and the authors say as much, but it’s just quite hard to train RL with purely dense rewards, and it makes sense that reward shaping would help here. Ultimately, they were able to get performance (in terms of goal-reaching rewards) around ¾ as strong as an Oracle model, who had access to the full map and could calculate the true shortest path. |

Adversarially Robust Generalization Requires More Data

Schmidt, Ludwig and Santurkar, Shibani and Tsipras, Dimitris and Talwar, Kunal and Madry, Aleksander

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Schmidt, Ludwig and Santurkar, Shibani and Tsipras, Dimitris and Talwar, Kunal and Madry, Aleksander

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Schmidt et al. theoretically and experimentally show that training adversarially robust models requires a higher sample complexity compared to regular generalization. Theoretically, they analyze two very simple families of datasets, e.g., consisting of two Gaussian distributions corresponding to a two-class problem. On such datasets, they proof that “robust generalization”, i.e., generalization to adversarial examples, required much higher sample complexity compared to regular generlization, i.e., generalization to the test set. These results are interesting because they suggest that the sample complexity might be even worse for more complex and realistic data distributions – as we commonly tackle in computer vision. Experimentally, they show similar result son MNIST, CIFAR-10 and SVHN. Varying the size of the training set and plotting the accuracy on adversarially computed examples results in Figure 1. As can be seen, there seems to be a clear advantage of having larger training sets. Note that these models were trained using adversarial training using an $L_\infty$ adversary constrained by the given $\epsilon$. https://i.imgur.com/SriBAt4.png Figure 1: Training set size plotted against the adversarial test accuracy on MNIST, CIFAR-10 and SVHN. The models were trained using adversarial training. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/07/09 (5 years ago)

**Abstract:** Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either perfect translation invariance or varying
degrees of translation dependence, as required by the task. CoordConv solves
the coordinate transform problem with perfect generalization and 150 times
faster with 10--100 times fewer parameters than convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse as the transform between high-level spatial latents and
pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST
detection showed 24% better IOU when using CoordConv, and in the RL domain
agents playing Atari games benefit significantly from the use of CoordConv
layers.
more
less

Rosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

[link]
This is a paper where I keep being torn between the response of “this is so simple it’s brilliant; why haven’t people done it before,” and “this is so simple it’s almost tautological, and the results I’m seeing aren’t actually that surprising”. The basic observation this paper makes is one made frequently before, most recently to my memory by Geoff Hinton in his Capsule Net paper: sometimes the translation invariance of convolutional networks can be a bad thing, and lead to worse performance. In a lot of ways, translation invariance is one of the benefits of using a convolutional architecture in the first place: instead of having to learn separate feature detectors for “a frog in this corner” and “a frog in that corner,” we can instead use the same feature detector, and just move it over different areas of the image. However, this paper argues, this makes convolutional networks perform worse than might naively be expected at tasks that require them to remember or act in accordance with coordinates of elements within an image. For example, they find that normal convolutional networks take nearly an hour and 200K worth of parameters to learn to “predict” the one-hot encoding of a pixel, when given the (x,y) coordinates of that pixel as input, and only get up to about 80% accuracy. Similarly, trying to take an input image with only one pixel active, and predict the (x,y) coordinates as output, is something the network is able to do successfully, but only when the test points are sampled from the same spatial region as the training points: if the test points are from a held-out quadrant, the model can’t extrapolate to the (x, y) coordinates there, and totally falls apart. https://i.imgur.com/x6phN4p.png The solution proposed by the authors is a really simple one: at one or more layers within the network, in addition to the feature channels sent up from the prior layer, add two addition channels: one with a with deterministic values going from -1 (left) to 1 (right), and the other going top to bottom. This essentially adds two fixed “features” to each pixel, which jointly carry information about where it is in space. Just by adding this small change, we give the network the ability to use spatial information or not, as it sees fit. If these features don’t prove useful, their weights will stay around their initialization values of expectation-zero, and the behavior should be much like a normal convolutional net. However, if it proves useful, convolution filters at the next layer can take position information into account. It’s easy to see how this would be useful for this paper’s toy problems: you can just create a feature detector for “if this pixel is active, pass forward information about it’s spatial position,” and predict the (x, y) coordinates out easily. You can also imagine this capability helping with more typical image classification problems, by having feature filters that carry with them not only content information, but information about where a pattern was found spatially. The authors do indeed find comparable performance or small benefits to ImageNet, MNIST, and Atari RL, when applying their layers in lieu of normal convolutional layer. On GANs in particular, they find less mode collapse, though I don’t yet 100% follow the intuition of why this would be the case. https://i.imgur.com/wu7wQZr.png |

The challenge of realistic music generation: modelling raw audio at scale

Sander Dieleman and Aäron van den Oord and Karen Simonyan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.SD, cs.LG, eess.AS, stat.ML

**First published:** 2018/06/26 (5 years ago)

**Abstract:** Realistic music generation is a challenging task. When building generative
models of music that are learnt from data, typically high-level representations
such as scores or MIDI are used that abstract away the idiosyncrasies of a
particular performance. But these nuances are very important for our perception
of musicality and realism, so in this work we embark on modelling music in the
raw audio domain. It has been shown that autoregressive models excel at
generating raw audio waveforms of speech, but when applied to music, we find
them biased towards capturing local signal structure at the expense of
modelling long-range correlations. This is problematic because music exhibits
structure at many different timescales. In this work, we explore autoregressive
discrete autoencoders (ADAs) as a means to enable autoregressive models to
capture long-range correlations in waveforms. We find that they allow us to
unconditionally generate piano music directly in the raw audio domain, which
shows stylistic consistency across tens of seconds.
more
less

Sander Dieleman and Aäron van den Oord and Karen Simonyan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.SD, cs.LG, eess.AS, stat.ML

[link]
This paper draws from two strains of recent work: the hierarchical music modeling of MusicVAE - which intentionally model musical structure at both local and more global levels - , and the discrete autoencoder approaches of Vector Quantized VAEs - which seek to maintain the overall structure of a VAE, but apply a less aggressive form of regularization. The goal of this paper is to build a model that can generate music, not from that music’s symbolic representation - lists of notes - but from actual waveform audio. This is a more difficult task because the model now has to learn mappings between waveforms and symbolic notes, but confers the advantage of being able to model expressive dimensions of music that are difficult to capture in a pure symbolic representation. Models of pure waveform data have been used before - Wavenet is a central example - but typically they are learned alongside some kind of text conditioning structure, which is to say, you tell the model to say “Hello there, world” and the model is only responsible for building local mappings between those phonemes and waveforms, not actually modeling coherent words to follow after “Hello”. To try to address this problem, the authors of the paper propose the solution of learning an autoencoded representation over the full music sample, to try to capture global structure. Each predicted value of the global structure sequence then represents some number of timesteps of the generated sequence: say, 20. The idea here is: learn a global model that produces 1/N (1/20, in this case) fewer sequence points, whose job is ensuring long term consistency. Then, the authors also suggest the use of a lower level decoder model that uses the conditioning information from the autoencoder, and, in a similar fashion to a text to speech wavenet, captures a high fidelity mapping between that conditioning and the output waveform. This overall structure has a lot in common with the recently released MusicVAE paper. The most salient architectural change proposed by this paper is that of Argmax VAEs, rather than VQ VAEs. Overall, the reason for training discrete autoencoders is to have a more easily adjustable way of regularizing the bottlenecked representation, to avoid the fact that for some challenging problems, excessively strong VAE regularization can lead to that high level representational space just not being used. To understand the difference, it’s worth understanding that VQ VAEs work by generating a continuous encoding vector (the same as a typical VAE) but then instead of passing that continuous vector itself directly on to the decoder, the VQ VAE instead fits what is basically a K means operation: it maps the continuous vector to one of it’s “prototypical” or “codebook” vectors based on closeness in Euclidean distance (these codebook vectors are learned in a separate trading loop, in a K Means style algorithm). The Argmax VAE is similar, but instead of needing to take that alternating step of learning the codebook vectors via K Means, it performs a much simpler quantization operation: just taking the argmax of indices across the continuous vector, so that the output is the one-hot vector closest to the continuous input. While this reduces the capacity of the model, it also limits the problem of “codebook collapse”, which is a failure mode that can happen during the K Means iteration (I’m actually not entirely clear on the prototypical example of codebook collapse, or exactly why it happens). https://i.imgur.com/H5YqSZG.png Combining these ideas together: this paper’s model works by learning an Argmax VAE over a larger and courser timeframe of the model, and then learning a local, high resolution decoder - similar to Wavenet - over the smaller time scales, conditioned on the output of the Argmax VAE making high level decisions. This combination balances the needs of coherent musical structure and local fidelity, and allows for different weighing of those trade-offs in a fairly flexible way, by changing the frequency at which you produce Argmax VAE conditioning output. |

Insights on representational similarity in neural networks with canonical correlation

Ari S. Morcos and Maithra Raghu and Samy Bengio

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.CV, cs.LG, cs.NE

**First published:** 2018/06/14 (5 years ago)

**Abstract:** Comparing different neural network representations and determining how
representations evolve over time remain challenging open questions in our
understanding of the function of neural networks. Comparing representations in
neural networks is fundamentally difficult as the structure of representations
varies greatly, even across groups of networks trained on identical tasks, and
over the course of training. Here, we develop projection weighted CCA
(Canonical Correlation Analysis) as a tool for understanding neural networks,
building off of SVCCA, a recently proposed method. We first improve the core
method, showing how to differentiate between signal and noise, and then apply
this technique to compare across a group of CNNs, demonstrating that networks
which generalize converge to more similar representations than networks which
memorize, that wider networks converge to more similar solutions than narrow
networks, and that trained networks with identical topology but different
learning rates converge to distinct clusters with diverse representations. We
also investigate the representational dynamics of RNNs, across both training
and sequential timesteps, finding that RNNs converge in a bottom-up pattern
over the course of training and that the hidden state is highly variable over
the course of a sequence, even when accounting for linear transforms. Together,
these results provide new insights into the function of CNNs and RNNs, and
demonstrate the utility of using CCA to understand representations.
more
less

Ari S. Morcos and Maithra Raghu and Samy Bengio

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.CV, cs.LG, cs.NE

[link]
The overall goal of the paper is measure how similar different layer activation profiles are to one another, in hopes of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: “how similar are the representations that are learned by different networks on the same task”, and “what is the dynamic of representational change in a given layer throughout training”? Canonical Correlation Analysis is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron’s activation value, evaluated over some observation set X. Importantly, for the two layers that you’re comparing, the set of observations X needs to be of the same length, but the layers can have different number of neurons (and, consequently, different numbers of vectors making up that layer’s multivariate set). Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. Each canonical correlate pair, (U1, V1) is a linear combination of the activation vectors of sets L1 and L2 respectively, and is chosen with the goal of minimizing the the angle (cosine) distance between the correlates in each pair. If you think about L1 and L2 each only having two activations (that is: if you think about them as being two-dimensional spaces) then the goal of CCA is to find the cosine distance between the planes defined by the two activation spaces. An important intuition here is that in this framing, vector sets that are just linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn’t be the case if you just looked at raw correlations between the individual activations. This is connected to the linear algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means it will be able to capture similarities between representational spaces that have fundamental geometric similarities, even if they’re different on a more surface level. In prior papers, CCA had been used by calculating the CCA vectors between varying sets of layers, and then taking the mean CCA value over all of the pairs of vectors. This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as analogy: a matrix with three columns, that only actually spans two), and so including in your average very low-order correlations is mostly adding uninformative noise to your similarity measure. Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in the pair; as best I can tell, this is roughly analogous to weighting according to eigenvalues, in a PCA setting. Using this weighted-average similarity measure, the authors do some really interesting investigations into learning dynamics. These include: * Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs via actually-generalizing solutions, and show that, during training, the intermediate representations of generalizing networks are more similar to one another than memorizing networks are to one another. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set* across multiple bootstrapped or randomized trainings could be used as a proxy for test set performance, which could be particularly valuable in contexts where test data is limited https://i.imgur.com/JwyHFmN.png * Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple (e.g. edge detectors) tend to be quite similar across networks, but the higher level representations are more divergent and influenceable by smaller quirks of the training set. * Within a given dataset, you can cluster learned internal representations across many training sets and recover groups trained with the same learning rate, even though the final layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima that the optimization algorithm finds, as a function of different learning rates. Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of “tiny hack for an extra 0.05 accuracy”. |

About