NIPS is a single-track machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of refereed papers.


Thwarting Adversarial Examples: An L_0-Robust Sparse Fourier Transform

Bafna, Mitali and Murtagh, Jack and Vyas, Nikhil

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Bafna et al. show that iterative hard thresholding yields $L_0$-robust Fourier transforms. In particular, as shown in Algorithm 1, iterative hard thresholding assumes a signal $y = x + e$ where $x$ is assumed to be sparse in the Fourier domain and the noise $e$ is assumed to be sparse in the input domain. This translates to noise $e$ that is bounded in its $L_0$ norm, corresponding to common adversarial attacks such as adversarial patches in computer vision. Using their algorithm, the authors can provably reconstruct the signal, specifically the top-$k$ coordinates for a $k$-sparse signal, which can subsequently be fed to a neural network classifier. In experiments, the classifier is always trained on sparse signals, and at test time, the sparse signal is reconstructed prior to the forward pass. This way, on MNIST and Fashion-MNIST, the algorithm is able to recover large parts of the original accuracy.

https://i.imgur.com/yClXLoo.jpg

Algorithm 1 (see paper for details): The iterative hard thresholding algorithm resulting in provable robustness against $L_0$ attacks on images and other signals.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
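
To make the alternating structure concrete, here is a minimal NumPy sketch of iterative hard thresholding for $y = x + e$. It is not the paper's Algorithm 1: the function names, the use of the plain FFT, the fixed iteration count and the sparsity budgets `k` and `t` are all assumptions made for illustration.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def iht_denoise(y, k, t, n_iter=50):
    """Alternately re-estimate a k-sparse spectrum and a t-sparse corruption.

    Illustrative sketch only: y is a real 1-D signal, k is the assumed number
    of non-zero Fourier coefficients, t the assumed L_0 bound on the noise.
    """
    e = np.zeros_like(y)
    x_hat = np.zeros(y.shape, dtype=complex)
    for _ in range(n_iter):
        # Project the noise-corrected signal onto k-sparse spectra.
        x_hat = hard_threshold(np.fft.fft(y - e, norm="ortho"), k)
        # Whatever the sparse spectrum cannot explain is attributed to noise,
        # again restricted to the t largest entries.
        residual = y - np.fft.ifft(x_hat, norm="ortho").real
        e = hard_threshold(residual, t)
    return np.fft.ifft(x_hat, norm="ortho").real, e
```

The reconstructed signal (first return value) is what would be passed to the classifier in place of the possibly attacked input.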

Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels

Zhang, Zhilu and Sabuncu, Mert R.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Ghosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to $\sum_{j=1}^c \mathcal{L}(f(x), j) = C$ for all $x$ and $f$ where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not symmetric, while the mean absolute error is. The mean absolute error, however, usually results in slower learning and may reach lower accuracy. As an alternative, the authors propose $\mathcal{L}(f(x), e_j) = \frac{1 - f_j(x)^q}{q}$. Here, $f$ is the classifier, which is assumed to contain a softmax layer at the end. For $q \rightarrow 0$ this reduces to the cross entropy and for $q = 1$ it reduces to the mean absolute error (up to a constant factor). As shown in Figure 1, this loss (or a slightly adapted version, see paper) may obtain better performance on noisy labels. To this end, the label noise is assumed to be uniform, meaning that $p(\tilde{y} = k | y = j, x) = 1 - \eta$ for $k = j$ and $\frac{\eta}{c - 1}$ otherwise, where $\tilde{y}$ is the perturbed label.

https://i.imgur.com/HRQ84Zv.jpg

Figure 1: Performance of the proposed loss for different $q$ and noise rate $\eta$ on Cifar-10. A ResNet-34 is used.

[1] Aritra Ghosh, Himanshu Kumar, PS Sastry. Robust loss functions under label noise for deep neural networks. AAAI, 2017.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
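
The loss itself is short enough to write down directly. The following NumPy sketch assumes `probs` already holds softmax outputs and `labels` holds integer class indices; the function name and the default `q=0.7` are illustrative choices, not taken from the summary.

```python
import numpy as np

def generalized_cross_entropy(probs, labels, q=0.7):
    """Mean of (1 - f_j(x)^q) / q over a batch.

    probs:  (n, c) array of softmax outputs (assumed to be valid probabilities).
    labels: (n,) array of integer labels j.
    q:      in (0, 1]; q -> 0 approaches cross entropy, q = 1 gives 1 - f_j(x).
    """
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean((1.0 - p_true ** q) / q))
```

For a single example with $f_j(x) = 0.5$, `q=1` gives $0.5$, while a very small `q` gives approximately $-\log 0.5 \approx 0.693$, matching the two limiting cases described above.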

A Spectral View of Adversarially Robust Features

Garg, Shivam and Sharan, Vatsal and Zhang, Brian Hu and Valiant, Gregory

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Garg et al. propose adversarially robust features based on a graph interpretation of the training data. In this graph, training points are connected based on their distance in input space. Robust features are obtained using the eigenvectors of the Laplacian of the graph. It is theoretically shown that these features are robust, based on some assumptions on the graph. For example, the bound obtained on robustness depends on the gap between the second and third eigenvalues. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
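
A minimal sketch of the construction, assuming a Gaussian-weighted fully connected graph (the kernel and the bandwidth `sigma` are assumptions for illustration; the summary only says points are connected based on distance). The feature is the eigenvector belonging to the second-smallest eigenvalue of the unnormalized Laplacian $L = D - W$.

```python
import numpy as np

def spectral_feature(X, sigma=1.0):
    """Return the second Laplacian eigenvector (one value per training point)
    together with all eigenvalues in ascending order."""
    # Pairwise squared distances between training points.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Edge weights decay with distance; no self-loops.
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian.
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1], eigvals
```

On two well-separated clusters the feature takes opposite signs on the two groups, and the gap between the second and third eigenvalues is large, which is the regime in which the robustness bound mentioned above is informative.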

Regularizing by the Variance of the Activations' Sample-Variances

Littwin, Etai and Wolf, Lior

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Littwin and Wolf propose an activation variance regularizer that is shown to have a similar, even better, effect than batch normalization. The proposed regularizer is based on an analysis of the variance of activation values; the idea is that the measured variance of these variances is low if the activation values come from a distribution with few modes. Thus, the intention of the regularizer is to encourage distributions of activations with only few modes. This is achieved using the regularizer $\mathbb{E}[(1 - \frac{\sigma_s^2}{\sigma^2})^2]$ where $\sigma_s^2$ is the measured variance of activation values and $\sigma^2$ is the true variance of activation values. The estimate $\sigma^2_s$ is mostly influenced by the mini-batch used for training. In practice, the regularizer is replaced by $(1 - \frac{\sigma_{s_1}^2}{\sigma_{s_2}^2 + \beta})^2$, which can be estimated on two different batches, $s_1$ and $s_2$, during training; $\beta$ is a parameter that can be learned and mainly handles the case where the variance is close to zero. In the paper, the authors provide some theoretical bounds, make a connection to batch normalization and discuss in which cases and why the regularizer might be a better alternative. These claims are supported by experiments on Cifar and Tiny ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
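
The practical two-batch form can be sketched as follows. This assumes activations are arranged as `(batch, units)` arrays, uses the population variance per unit, and treats `beta` as a fixed constant rather than a learned parameter; the function name is hypothetical.

```python
import numpy as np

def variance_constancy_penalty(act_s1, act_s2, beta=1e-3):
    """(1 - var_s1 / (var_s2 + beta))^2, averaged over units.

    act_s1, act_s2: activations of the same layer on two different
    mini-batches, each of shape (batch, units).
    """
    var_s1 = act_s1.var(axis=0)
    var_s2 = act_s2.var(axis=0)
    return float(np.mean((1.0 - var_s1 / (var_s2 + beta)) ** 2))
```

The penalty is zero when both batches yield the same per-unit variance and grows as the two estimates disagree, which is what pushes the network toward activations whose sample variance is stable across batches.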

Generalizing Tree Probability Estimation via Bayesian Networks

Zhang, Cheng and Matsen IV, Frederick A.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

A common problem in phylogenetics is:

1. I have $p(\text{DNA sequences} | \text{tree})$ and $p(\text{tree})$.
2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree} | \text{DNA sequences})$.
3. I want to evaluate $p(\text{tree} | \text{DNA sequences})$.

The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; this is referred to in this paper as *simple sample relative frequencies* (SRF). An obvious problem with this method is that if you didn't happen to sample a tree topology you will assign it zero probability. This paper focuses on producing a better solution to this problem, by defining a distribution over trees that's easy to fit and provides support over the entire space of tree topologies.

What is a Subsplit Bayesian Network?
==============================

Phylogenies are leaf-labeled bifurcating trees, i.e. binary trees with labeled leaves. **Clades** are nonempty subsets of the leaf labels, labeled in the following figure as $C_1 \to C_7$:

https://i.imgur.com/khS3uSo.png

To build a phylogeny, I can just take the full set of leaves and recursively split them into clades as described in the above diagram. This means that the distribution over phylogenies is equivalent to the distribution over clades. To make this distribution tractable, it is typically assumed that clades are conditionally independent given parent clades; this is called the *Conditional Clade Distribution* (CCD) assumption, attributed here to [Larget][] and [Hohna][].

So, a phylogeny may be described by its clades, and this paper proposes that these clades may be described by their **subsplits**: the splitting process placing leaf nodes into one of two clades. Then, the authors note this process is a directed Bayesian network and that this Bayesian network must include all possible clade splits. It is therefore a complete binary tree with depth $N-1$ (where $N$ is the number of leaf nodes).

Fitting This Parameterisation to MCMC Samples
=====================================

Under this parameterisation the likelihood of observing a given tree is:

$$
p(\text{tree}) = p(S_1 = s_{1,k}) \prod_{i>1} p(S_i = s_{i,k} | S_{\pi_i} = s_{\pi_i, k}),
$$

assuming a collection of subsplits $T_k = \{ S_i = s_{i,k}, i \geq 1 \}$. In this definition $\pi_i$ is the index set of the parent nodes of $S_i$; i.e. a subsplit can depend on any of the parent nodes.

The maximum likelihood probabilities for possible subsplits can be solved in closed form; algorithmically, this involves counting the occurrences of subsplits and dividing by the number of trees observed. If this seems exactly the same as SRF, that's because it is; I [checked the published code to verify this][sbn_ds1].

The authors then consider the case of unrooted trees, where the log likelihood of observed trees can't be easily factorised. They then present a [simple averaging][sa] method (not sure where this variational method is discussed in that paper, appears to be under a different name?) and an EM algorithm to fit SBN models over such distributions. They also discuss conditional probability sharing (parameterising conditional on parent-child relationships) immediately before this, and it's not clear if this is used in the distribution fit by SA or EM.

Experiments show the distributions fit using the EM algorithm perform well according to KL divergence from a "true" distribution defined by fitting a distribution using SRF to a larger dataset. They also show less dispersion estimating the probability of sampled trees versus that estimated by this "ground truth".

[sa]: https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf
[sbn_ds1]: https://github.com/morrislab/sbn/blob/master/experiments/DS1.ipynb
[larget]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676676/
[hohna]: https://academic.oup.com/sysbio/article/61/1/1/1676649
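
For reference, the SRF baseline that the paper improves on amounts to a few lines of counting. In this sketch the topologies are represented as canonical strings (Newick-like, a hypothetical encoding chosen for illustration); it only shows the baseline and its zero-probability problem, not the SBN itself.

```python
from collections import Counter

def sample_relative_frequencies(sampled_topologies):
    """Estimate p(topology | data) by counting how often each topology
    appears among the MCMC samples."""
    counts = Counter(sampled_topologies)
    total = len(sampled_topologies)
    return {topology: n / total for topology, n in counts.items()}

# Hypothetical MCMC output over three leaves A, B and C.
samples = ["((A,B),C)", "((A,B),C)", "((A,C),B)", "((A,B),C)"]
srf = sample_relative_frequencies(samples)
# The topology ((B,C),A) was never sampled, so SRF gives it zero mass.
unseen_probability = srf.get("((B,C),A)", 0.0)
```

An SBN instead counts subsplits, so a topology that was never sampled still receives probability as long as each of its subsplits was observed in some sampled tree.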

Sanity Checks for Saliency Maps

Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian J. and Hardt, Moritz and Kim, Been

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

The paper designs some basic tests to compare saliency methods. It finds that some of the most popular methods are independent of model parameters and the data, meaning they are effectively useless.

## Methods compared

The paper compares the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCam and SmoothGrad. They provide a refresher on those methods in the appendix.

All those methods can be put in the same framework. They require a classification model and an input (typically an image). The output of the method is an *explanation map* of the shape of the input where a higher value for a feature implies greater relevance in the decision of the model.

## Metrics of comparison

The authors argue that visual inspection of the saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSIM) and the Pearson correlation of the histogram of gradients. The authors point out that those metrics capture various notions of similarity, but it is an active area of research and those metrics are imperfect.

## First test: model parameters randomization

A saliency method must be dependent on model parameters, otherwise it cannot help us understand a model. In this test, the authors randomize the model parameters, layer by layer, starting from the top.

Surprisingly, methods such as guided backprop and guided GradCam are completely insensitive to model parameters, as illustrated on this Inception v3 trained on ImageNet:

![image](https://user-images.githubusercontent.com/8659132/61403152-b10b8000-a8a2-11e9-9f6a-cf1ed6a876cc.png)

Integrated gradients also looks dubious, as the bird is still visible with an almost fully randomized model, but the quantitative metrics reveal that the difference between the two models is actually big.

## Second test: data randomization

It is well-known that randomly shuffling the labels of a dataset does not prevent a neural network from getting a high accuracy on the training set, though it does prevent generalization. The model is able to learn by either memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should have no clearly interpretable signal. Here is the result for a ConvNet trained on MNIST and a shuffled MNIST:

![image](https://user-images.githubusercontent.com/8659132/61406757-7efe1c00-a8aa-11e9-9826-a859a373cb4f.png)

The results are very damning for most methods. Only gradients and GradCam are very different between both models, as confirmed by the low correlation.

## Discussion

- Even though some methods do not depend on model parameters and data, they might still depend on the architecture of the models, which could be of some use in some contexts.
- Methods that multiply the input with the gradient are dominated by the input.
- Complex saliency methods are just fancy edge detectors.
- Only gradient, smooth gradient and GradCam survive the sanity checks.

# Comments

- Why are their GradCam maps so ugly? They don't look like usual GradCam maps at all.
- Their tests are simple enough that it's hard to defend a method that doesn't pass them.
- The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret.
- In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. In the case of images, people usually just set the image to 0, which is not at all the absence of a feature. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect an estimation of the baseline such as done in [this paper](https://arxiv.org/abs/1702.04595) would be a fairer evaluation of IG.

Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19)
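
The logic of the parameter randomization test can be illustrated on a toy linear model. Everything below is a simplified stand-in: the linear "network", the two saliency functions and the single-shot (rather than layer-by-layer) randomization are assumptions for illustration, not the authors' code.

```python
import numpy as np

def spearman_rank_correlation(a, b):
    """Spearman correlation as the Pearson correlation of ranks
    (ties, if any, are broken arbitrarily)."""
    rank_a = np.argsort(np.argsort(a.ravel()))
    rank_b = np.argsort(np.argsort(b.ravel()))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def gradient_map(weights, x, target):
    """For a linear model logits = weights @ x, the gradient of the
    target logit with respect to x is simply the target row of weights."""
    return np.abs(weights[target])

def input_only_map(weights, x, target):
    """Hypothetical degenerate 'explanation' that ignores the model."""
    return np.abs(x)

def randomization_test(saliency_fn, weights, x, target, seed=0):
    """Compare the map of the given model with the map of a model whose
    parameters were replaced by random values."""
    rng = np.random.default_rng(seed)
    random_weights = rng.normal(size=weights.shape)
    before = saliency_fn(weights, x, target)
    after = saliency_fn(random_weights, x, target)
    return spearman_rank_correlation(before, after)
```

A method that passes the sanity check should yield a low correlation here; a correlation close to 1, as produced by `input_only_map`, indicates that the map does not depend on what the model has learned.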

Efficient Neural Network Robustness Certification with General Activation Functions

Zhang, Huan and Weng, Tsui-Wei and Chen, Pin-Yu and Hsieh, Cho-Jui and Daniel, Luca

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang et al. propose CROWN, a method for certifying adversarial robustness based on bounding activation functions using linear functions. Informally, the main result can be stated as follows: if the activation functions used in a deep neural network can be bounded above and below by linear functions (the activation function may also be segmented first), the network output can also be bounded by linear functions. These linear functions can be computed explicitly, as stated in the paper. Then, given an input example $x$ and a set of allowed perturbations, usually constrained by an $L_p$ norm, these bounds can be used to obtain a lower bound on the robustness of networks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
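
As a concrete instance of "bounding an activation by linear functions", the sketch below gives linear upper and lower bounds for ReLU on a pre-activation interval $[l, u]$. The function name is hypothetical, and the rule for picking the lower slope is one common heuristic, stated here as an assumption rather than as the paper's exact choice.

```python
import numpy as np

def relu_linear_bounds(l, u):
    """Return (a_up, b_up, a_lo, b_lo) such that, for all z in [l, u],
    a_lo * z + b_lo <= max(z, 0) <= a_up * z + b_up."""
    if l >= 0:
        # Always active: ReLU is the identity on the interval.
        return 1.0, 0.0, 1.0, 0.0
    if u <= 0:
        # Always inactive: ReLU is zero on the interval.
        return 0.0, 0.0, 0.0, 0.0
    # Upper bound: the chord through (l, 0) and (u, u).
    a_up = u / (u - l)
    b_up = -a_up * l
    # Lower bound: any line a * z with a in [0, 1] is valid; pick the
    # slope by which side of the interval is larger (assumed heuristic).
    a_lo = 1.0 if u >= -l else 0.0
    return a_up, b_up, a_lo, 0.0
```

Propagating such per-neuron bounds through the layers is what turns them into linear bounds on the network output.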

Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

Tao, Guanhong and Ma, Shiqing and Liu, Yingqi and Zhang, Xiangyu

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Tao et al. propose Attacks Meet Interpretability, an adversarial example detection scheme based on the interpretability of individual neurons. In the context of face recognition, in a first step, the authors identify neurons that correspond to specific face attributes. This is achieved by constructing sets of images where only specific attributes change, and then investigating the firing neurons. In a second step, all other neurons, i.e., neurons not corresponding to any meaningful face attribute, are weakened in order to improve robustness against adversarial examples. The idea is that adversarial examples make use of these non-interpretable neurons to fool the networks. Unfortunately, this defense has been shown not to be effective in [1].

[1] Nicholas Carlini. Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples? ArXiv.org, abs/1902.02322, 2019.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Towards Robust Interpretability with Self-Explaining Neural Networks

Alvarez-Melis, David and Jaakkola, Tommi S.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Alvarez-Melis and Jaakkola propose three requirements for self-explainable models (explicitness, faithfulness and stability) and construct a self-explainable, generalized linear model optimizing for these properties. In particular, the proposed model has the form $f(x) = \theta(x)^T h(x)$ where $\theta(x)$ are input-dependent coefficients (e.g., from a deep network) and $h(x)$ are interpretable features/concepts. In practice, these concepts are learned using an auto-encoder from the raw input while the latent code, which represents $h(x)$, is regularized to learn concepts under weak supervision. Additionally, the classifier is regularized to be locally difference-bounded by the concept function $h(x)$. This means that for each point $x_0$ it holds that $\|f(x) - f(x_0)\| \leq L \|h(x) - h(x_0)\|$ for all $x$ with $\|x - x_0\| < \delta$, for some $\delta$ and $L$. This condition leads to some stability of interpretations with respect to the concepts $h(x)$. In practice, this is enforced through a regularizer. In experiments, the authors argue that this class of models has advantages regarding the following three properties of self-explainable models: explicitness, i.e., whether explanations are actually understandable; faithfulness, i.e., whether the estimated importance of features reflects true relevance; and stability, i.e., robustness of interpretations against small perturbations. For some of these conditions, the authors propose quantitative metrics; robustness, for example, can be evaluated using $\arg\max_{\|x' - x\| \leq \epsilon} \frac{\|f(x) - f(x')\|}{\|h(x) - h(x')\|}$, which is very similar to practically evaluating adversarial robustness. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
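
The stability ratio can be estimated numerically for any pair of functions `f` and `h`. The sketch below replaces the maximization by plain random search inside the $\epsilon$-ball, which is an assumption made for simplicity (a gradient-based search would be the closer analogue to adversarial attacks); the function name and defaults are hypothetical.

```python
import numpy as np

def local_stability_ratio(f, h, x, eps, n_samples=1000, seed=0):
    """Largest observed ||f(x) - f(x')|| / ||h(x) - h(x')|| over random
    points x' with ||x' - x|| <= eps."""
    rng = np.random.default_rng(seed)
    fx, hx = f(x), h(x)
    best = 0.0
    for _ in range(n_samples):
        # Random direction, rescaled to a random radius of at most eps.
        direction = rng.normal(size=x.shape)
        direction *= eps * rng.uniform() / np.linalg.norm(direction)
        x_prime = x + direction
        denom = np.linalg.norm(h(x_prime) - hx)
        if denom > 1e-12:
            best = max(best, np.linalg.norm(f(x_prime) - fx) / denom)
    return best
```

Because random search only explores a finite set of points, the returned value is a lower estimate of the true maximum over the ball.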

Constructing Unrestricted Adversarial Examples with Generative Models

Song, Yang and Shu, Rui and Kushman, Nate and Ermon, Stefano

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Song et al. propose generative adversarial examples, crafted from scratch using a generative adversarial network (GAN). In particular, a GAN is trained on the original images in order to approximate the generative data distribution. Then, adversarial examples can be found in the learned latent space by finding a latent code that minimizes a loss consisting of fooling the target classifier, not fooling an auxiliary classifier (to not change the actual class) and (optionally) staying close to some fixed random latent code. These adversarial examples do not correspond to original images anymore; instead they are unrestricted and computed from scratch. Figure 1 shows examples.

https://i.imgur.com/Krr9MuU.png

Figure 1: Adversarial examples found in image space using projected gradient descent (PGD, top), and adversarial examples found in the latent space, as proposed.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Adversarially Robust Generalization Requires More Data

Schmidt, Ludwig and Santurkar, Shibani and Tsipras, Dimitris and Talwar, Kunal and Madry, Aleksander

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Schmidt et al. theoretically and experimentally show that training adversarially robust models requires a higher sample complexity compared to regular generalization. Theoretically, they analyze two very simple families of datasets, e.g., consisting of two Gaussian distributions corresponding to a two-class problem. On such datasets, they prove that "robust generalization", i.e., generalization to adversarial examples, requires much higher sample complexity compared to regular generalization, i.e., generalization to the test set. These results are interesting because they suggest that the sample complexity might be even worse for more complex and realistic data distributions, such as those we commonly tackle in computer vision. Experimentally, they show similar results on MNIST, CIFAR-10 and SVHN. Varying the size of the training set and plotting the accuracy on adversarially computed examples results in Figure 1. As can be seen, there seems to be a clear advantage of having larger training sets. Note that these models were trained using adversarial training with an $L_\infty$ adversary constrained by the given $\epsilon$.

https://i.imgur.com/SriBAt4.png

Figure 1: Training set size plotted against the adversarial test accuracy on MNIST, CIFAR-10 and SVHN. The models were trained using adversarial training.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

The challenge of realistic music generation: modelling raw audio at scale

Sander Dieleman and Aäron van den Oord and Karen Simonyan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.SD, cs.LG, eess.AS, stat.ML

**First published:** 2018/06/26

**Abstract:** Realistic music generation is a challenging task. When building generative
models of music that are learnt from data, typically high-level representations
such as scores or MIDI are used that abstract away the idiosyncrasies of a
particular performance. But these nuances are very important for our perception
of musicality and realism, so in this work we embark on modelling music in the
raw audio domain. It has been shown that autoregressive models excel at
generating raw audio waveforms of speech, but when applied to music, we find
them biased towards capturing local signal structure at the expense of
modelling long-range correlations. This is problematic because music exhibits
structure at many different timescales. In this work, we explore autoregressive
discrete autoencoders (ADAs) as a means to enable autoregressive models to
capture long-range correlations in waveforms. We find that they allow us to
unconditionally generate piano music directly in the raw audio domain, which
shows stylistic consistency across tens of seconds.
This paper draws from two strains of recent work: the hierarchical music modeling of MusicVAE, which intentionally models musical structure at both local and more global levels, and the discrete autoencoder approaches of Vector Quantized VAEs, which seek to maintain the overall structure of a VAE but apply a less aggressive form of regularization.

The goal of this paper is to build a model that can generate music, not from that music's symbolic representation (lists of notes) but from actual waveform audio. This is a more difficult task because the model now has to learn mappings between waveforms and symbolic notes, but it confers the advantage of being able to model expressive dimensions of music that are difficult to capture in a pure symbolic representation. Models of pure waveform data have been used before (Wavenet is a central example), but typically they are learned alongside some kind of text conditioning structure, which is to say, you tell the model to say "Hello there, world" and the model is only responsible for building local mappings between those phonemes and waveforms, not for actually modeling coherent words to follow after "Hello".

To try to address this problem, the authors of the paper propose learning an autoencoded representation over the full music sample, to try to capture global structure. Each predicted value of the global structure sequence then represents some number of timesteps of the generated sequence: say, 20. The idea here is: learn a global model that produces N times fewer (20 times fewer, in this case) sequence points, whose job is ensuring long term consistency. Then, the authors also suggest the use of a lower level decoder model that uses the conditioning information from the autoencoder and, in a similar fashion to a text to speech Wavenet, captures a high fidelity mapping between that conditioning and the output waveform. This overall structure has a lot in common with the recently released MusicVAE paper.

The most salient architectural change proposed by this paper is the use of Argmax VAEs rather than VQ VAEs. Overall, the reason for training discrete autoencoders is to have a more easily adjustable way of regularizing the bottlenecked representation, to avoid the fact that for some challenging problems, excessively strong VAE regularization can lead to that high level representational space just not being used. To understand the difference, it's worth understanding that VQ VAEs work by generating a continuous encoding vector (the same as a typical VAE), but then, instead of passing that continuous vector itself directly on to the decoder, the VQ VAE fits what is basically a K Means operation: it maps the continuous vector to one of its "prototypical" or "codebook" vectors based on closeness in Euclidean distance (these codebook vectors are learned in a separate training loop, in a K Means style algorithm). The Argmax VAE is similar, but instead of needing to take that alternating step of learning the codebook vectors via K Means, it performs a much simpler quantization operation: just taking the argmax of indices across the continuous vector, so that the output is the one-hot vector closest to the continuous input. While this reduces the capacity of the model, it also limits the problem of "codebook collapse", which is a failure mode that can happen during the K Means iteration (I'm actually not entirely clear on the prototypical example of codebook collapse, or exactly why it happens).

https://i.imgur.com/H5YqSZG.png

Combining these ideas together: this paper's model works by learning an Argmax VAE over a larger and coarser timeframe of the model, and then learning a local, high resolution decoder, similar to Wavenet, over the smaller time scales, conditioned on the output of the Argmax VAE making high level decisions. This combination balances the needs of coherent musical structure and local fidelity, and allows for different weighting of those trade-offs in a fairly flexible way, by changing the frequency at which you produce Argmax VAE conditioning output.
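
The difference between the two quantization steps is easy to see in code. This is a forward-pass-only NumPy sketch with hypothetical function names; it leaves out how gradients are passed through the discrete step and how the codebook is learned.

```python
import numpy as np

def vq_quantize(z, codebook):
    """VQ-style quantization: replace each encoding by its nearest
    codebook vector in Euclidean distance.

    z: (n, d) continuous encodings, codebook: (K, d) learned vectors."""
    sq_dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = sq_dist.argmin(axis=1)
    return codebook[indices], indices

def argmax_quantize(z):
    """Argmax-style quantization: replace each encoding by the one-hot
    vector of its largest coordinate, so no codebook has to be learned."""
    indices = z.argmax(axis=1)
    return np.eye(z.shape[1])[indices], indices

z = np.array([[0.1, 0.9], [2.0, -1.0]])
codebook = np.array([[0.0, 1.0], [2.0, -1.0], [5.0, 5.0]])
vq_vectors, vq_indices = vq_quantize(z, codebook)
onehot_vectors, argmax_indices = argmax_quantize(z)
```

In the argmax variant the number of discrete codes is tied to the encoding dimension, which is the capacity reduction mentioned above.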

Insights on representational similarity in neural networks with canonical correlation

Ari S. Morcos and Maithra Raghu and Samy Bengio

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.CV, cs.LG, cs.NE

**First published:** 2018/06/14

**Abstract:** Comparing different neural network representations and determining how
representations evolve over time remain challenging open questions in our
understanding of the function of neural networks. Comparing representations in
neural networks is fundamentally difficult as the structure of representations
varies greatly, even across groups of networks trained on identical tasks, and
over the course of training. Here, we develop projection weighted CCA
(Canonical Correlation Analysis) as a tool for understanding neural networks,
building off of SVCCA, a recently proposed method. We first improve the core
method, showing how to differentiate between signal and noise, and then apply
this technique to compare across a group of CNNs, demonstrating that networks
which generalize converge to more similar representations than networks which
memorize, that wider networks converge to more similar solutions than narrow
networks, and that trained networks with identical topology but different
learning rates converge to distinct clusters with diverse representations. We
also investigate the representational dynamics of RNNs, across both training
and sequential timesteps, finding that RNNs converge in a bottom-up pattern
over the course of training and that the hidden state is highly variable over
the course of a sequence, even when accounting for linear transforms. Together,
these results provide new insights into the function of CNNs and RNNs, and
demonstrate the utility of using CCA to understand representations.
The overall goal of the paper is measure how similar different layer activation profiles are to one another, in hopes of being able to quantify the similarity of the representations that different layers are learning. If you had a measure that captured this, you could ask questions like: “how similar are the representations that are learned by different networks on the same task”, and “what is the dynamic of representational change in a given layer throughout training”? Canonical Correlation Analysis is one way of approaching this question, and the way taken by this paper. The premise of CCA is that you have two multidimensional variable sets, where each set is made up of vectors representing dimensions within that variable set. Concretely, in this paper, the sets under examination are the activation profiles of two layers (either the same layer at different points in training, or different layers in the same network, or layers in different networks). An activation profile is thought of in terms of multiple vectors, where each vector represents a given neuron’s activation value, evaluated over some observation set X. Importantly, for the two layers that you’re comparing, the set of observations X needs to be of the same length, but the layers can have different number of neurons (and, consequently, different numbers of vectors making up that layer’s multivariate set). Given this setup, the goal of CCA is to find vectors that are linear combinations of the basis vectors of each set, to satisfy some constraint. In that broad sense, this is similar to the project of PCA, which also constructs linear-combination principal components to better represent the underlying data space. However, in PCA, the constraints that define these combinations are based on one multidimensional feature space, not two. In CCA, instead of generating k principal components, you generate k *pairs* of canonical correlates. 
Each canonical correlate pair (U1, V1) consists of a linear combination of the activation vectors of set L1 and a linear combination of the activation vectors of set L2, chosen with the goal of minimizing the angle (cosine distance) between the two correlates in the pair, which is the same as maximizing their correlation. If you think about L1 and L2 each only having two activations (that is: if you think about them as being two-dimensional spaces) then the goal of CCA is to find the cosine distance between the planes defined by the two activation spaces. An important intuition here is that in this framing, vector sets that are just invertible linear transformations of one another (scalings, rotations, swaps in the arbitrary order of activations) will look the same, which wouldn’t be the case if you just looked at raw correlations between the individual activations. This is connected to the linear algebra idea that, if you have two vectors, and a third that is just a linear combination of the first two, the span of those vectors is still just that two-dimensional space. This property is important for the analysis of neural network representations because it means CCA can capture similarities between representational spaces that have fundamental geometric similarities, even if they’re different on a more surface level. In prior papers, CCA had been used by calculating the CCA vectors between varying sets of layers, and then taking the mean correlation over all of the pairs of vectors. This paper argues against that approach, on the theory that network layers are probably not using the full representational capacity of their activation dimensions (think, as analogy: a matrix with three columns that only actually spans two dimensions), and so including the weakest, lowest-ranked correlations in your average mostly adds uninformative noise to your similarity measure.
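The invariance point can be checked numerically. The sketch below is my own illustration, not code from the paper: it mixes a layer's neurons with a random invertible matrix, shuffles their order, and compares the mean CCA correlation (the averaging used in prior work) with raw neuron-by-neuron correlations. The helper `cca_correlations` and all variable names are hypothetical, and full-rank activations are assumed.

```python
import numpy as np

def cca_correlations(acts1, acts2):
    # Hypothetical helper: canonical correlations via QR + SVD,
    # assuming full-rank (n_observations, n_neurons) inputs.
    a = acts1 - acts1.mean(axis=0)
    b = acts2 - acts2.mean(axis=0)
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    rho = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

rng = np.random.default_rng(1)
layer = rng.normal(size=(400, 4))          # made-up activations: 400 observations, 4 neurons

# A random mixing matrix (invertible with probability 1), followed by
# an arbitrary reordering of the neurons.
mix = rng.normal(size=(4, 4))
transformed = (layer @ mix)[:, rng.permutation(4)]

# Mean CCA similarity: the plain average over all canonical pairs.
mean_cca = cca_correlations(layer, transformed).mean()

# Naive alternative: correlate neuron i of one layer with neuron i of the other.
raw = np.array([
    abs(np.corrcoef(layer[:, i], transformed[:, i])[0, 1])
    for i in range(4)
])

print("mean CCA correlation:", round(float(mean_cca), 4))
print("mean raw neuron correlation:", round(float(raw.mean()), 4))
```

Because the transformed layer spans exactly the same space as the original, every canonical correlation is essentially 1, while the raw per-neuron correlations are lower because they depend on the arbitrary basis.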
Instead, this paper weights the correlation coefficients according to the magnitudes of the correlate vectors in the pair; as best I can tell, this is roughly analogous to weighting according to eigenvalues in a PCA setting. Using this weighted-average similarity measure, the authors do some really interesting investigations into learning dynamics. These include:

* Comparing the intermediate-layer representations learned by networks that achieve low train error via memorization vs. via actually-generalizing solutions, and showing that, during training, the intermediate representations of generalizing networks are more similar to one another than those of memorizing networks are to one another. Intuitively, this aligns with the idea that there are many ways to noisily memorize, but a more constrained number of ways to actually learn meaningful information about a dataset. A super interesting implication of this is the idea that representational similarity *on the training set* across multiple bootstrapped or randomized trainings could be used as a proxy for test set performance, which could be particularly valuable in contexts where test data is limited. https://i.imgur.com/JwyHFmN.png
* Across networks, lower layers tend to be more similar to one another than layers closer to the output; said another way, the very simple features (e.g. edge detectors) tend to be quite similar across networks, but the higher-level representations are more divergent and more easily influenced by smaller quirks of the training set.
* Within a given dataset, you can cluster learned internal representations across many training runs and recover groups trained with the same learning rate, even though the final-layer softmax is inherently similar across models that achieve the same training error. This implies that metrics like this can give us some idea of the different minima that the optimization algorithm finds, as a function of different learning rates.
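The weighted average described before the list of findings can be sketched as follows. This is my reading of the weighting, not the authors' implementation: I assume each canonical pair is weighted by how strongly the first layer's (centered) neuron activations project onto that pair's canonical vector, with weights normalized to sum to 1. The function name `weighted_and_mean_cca` and the synthetic data are hypothetical.

```python
import numpy as np

def weighted_and_mean_cca(acts1, acts2):
    """Return (projection-weighted similarity, plain mean similarity).

    Hypothetical sketch; assumes full-rank (n_observations, n_neurons) inputs
    and weights each canonical pair by layer 1's projection onto it.
    """
    a = acts1 - acts1.mean(axis=0)
    b = acts2 - acts2.mean(axis=0)
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    u, rho, _ = np.linalg.svd(qa.T @ qb, full_matrices=False)
    rho = np.clip(rho, 0.0, 1.0)
    # Canonical vectors of layer 1, expressed over the observation set.
    h = qa @ u                                # shape (n_observations, k)
    # Weight of pair i: summed absolute projection of every layer-1
    # neuron's activation vector onto canonical vector i.
    weights = np.abs(h.T @ a).sum(axis=1)     # shape (k,)
    weights = weights / weights.sum()
    return float(weights @ rho), float(rho.mean())

# Made-up layers: a strong shared 2-dimensional signal plus weak private noise neurons.
rng = np.random.default_rng(2)
shared = rng.normal(size=(500, 2))
layer1 = np.hstack([shared @ rng.normal(size=(2, 2)), 0.1 * rng.normal(size=(500, 4))])
layer2 = np.hstack([shared @ rng.normal(size=(2, 2)), 0.1 * rng.normal(size=(500, 4))])

weighted, mean = weighted_and_mean_cca(layer1, layer2)
print("weighted:", round(weighted, 3), "mean:", round(mean, 3))
```

In this toy case the low-magnitude noise directions drag the plain mean down, while the weighted score is dominated by the directions the layer actually uses, which is the behavior the paper argues for.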
Overall, I found this paper a great example of a straightforward idea used to clearly answer important and interesting questions, which is always refreshing amidst a sea of “tiny hack for an extra 0.05 accuracy”.
