Welcome to ShortScience.org!
[link]
Meng and Chen propose MagNet, a combination of adversarial example detection and removal. At test time, given a clean or adversarial test image, the proposed defense works as follows: First, the input is passed through one or more detectors. If one of these detectors fires, the input is rejected. To this end, the authors consider detection based on the reconstruction error of an auto-encoder, or detection based on the divergence between probability predictions (on adversarial vs. clean example). Second, if not rejected, the input is passed through a reformer. The reformer reconstructs the input, e.g., through an auto-encoder, to remove potentially undetected adversarial noise. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
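A minimal sketch of this test-time pipeline (using only the reconstruction-error detector), assuming a trained `autoencoder`, a `classifier`, and a threshold `tau` chosen on clean validation data; these names are placeholders, not the authors' code:

```python
import torch

def magnet_defense(x, autoencoder, classifier, tau):
    """Reject inputs with large reconstruction error, classify the reformed (reconstructed) rest."""
    with torch.no_grad():
        x_rec = autoencoder(x)                                 # reconstruction of the input batch
        rec_error = ((x - x_rec) ** 2).flatten(1).mean(dim=1)  # per-example reconstruction error
        rejected = rec_error > tau                             # detector: flag likely adversarial inputs
        logits = classifier(x_rec)                             # reformer: classify the reconstruction,
                                                               # which strips small undetected perturbations
    return logits, rejected
```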
[link]
Coming from the perspective of the rest of machine learning, a somewhat odd thing about reinforcement learning that often goes unnoticed is the fact that, in basically all reinforcement learning, the performance of an algorithm is judged by its performance on the same environment it was trained on. In the parlance of ML writ large: training on the test set. Historically, most of the focus in RL has been on whether automatic systems could learn a policy from the state distribution of a single environment at all, already a fairly hard task. But now that RL has had more success in the single-environment case, the question arises: how can we train reinforcement learning algorithms that don't just perform well on a single environment, but across a range of environments? One lens onto this question is that of meta-learning, but this paper takes a different approach and looks at how straightforward regularization techniques pulled from the land of supervised learning can (or can't straightforwardly) be applied to reinforcement learning.

In general, the regularization techniques discussed here are all ways of reducing the capacity of the model and preventing it from overfitting. Some ways to reduce capacity are:

- Apply L2 weight penalization.
- Apply dropout, which handicaps the model by randomly zeroing out neurons.
- Use Batch Norm, which uses noisy batch statistics and increases randomness in a way that, similar to the above, deteriorates performance.
- Use an information bottleneck: similar to a VAE, this approach works by learning some compressed representation of your input, p(z|x), and then predicting your output off of that z, in a way that incentivizes your z to be informative (because you want to be able to predict y well) but also penalizes putting too much information into it (because you penalize differences between your learned p(z|x) distribution and an unconditional prior p(z)). This pushes your model to use its conditional-on-x capacity wisely, and only learn features if they're quite valuable in predicting y.

However, the paper points out that there are some complications in straightforwardly applying these techniques to RL. The central one is the fact that in (most) RL, the distribution of transitions you train on comes from prior iterations of your policy. This means that a noisier and less competent policy will also leave you with less data to train on. Additionally, using a noisy policy can increase variance, both by making your trained policy more different from your rollout policy (in an off-policy setting) and by making your estimate of the value function higher-variance, which is problematic because that's what you're using as a target training signal in a temporal difference framework.

The paper is a bit loose in connecting its theoretical justification to its methods, and makes two broad, mostly distinct proposals:

1. The most successful (though also the one least directly justified by the earlier-discussed theoretical difficulties of applying regularization in RL) is an information bottleneck ported into an RL setting. It works almost the same as the classification-model version, except that you're trying to increase the value of your actions given a representation z compressed from the state, rather than trying to increase your ability to correctly predict y. The justification given here is that it's good to incentivize RL algorithms in particular to learn simpler, more compressible features, because they often have such poor data and training signal early in training (see the sketch after this summary).
2. SNI (Selective Noise Injection) works by only applying the stochastic aspects of regularization (sampling from z in an information bottleneck, applying different dropout masks, etc.) to certain parts of the training procedure. In particular, the rollout used to collect data is non-stochastic, removing the issue of noisiness impacting the data that's collected. They then do an interesting thing where they calculate a weighted mixture of the policy update with a deterministic model and the update with a stochastic one. The best performing of these that they tested seems to have been a 50/50 split. This is essentially a knob you can turn on stochasticity, to trade off between the regularizing effect of noise and its variance-increasing negative effect.




Based on my read of the experiments in the paper, the most impressive thing here is how well their information bottleneck mechanism works as a way to improve generalization, compared to both the baseline and other regularization approaches. It does look like there's some additional benefit to SNI, particularly in the CoinRun setting, but very little in the MultiRoom setting, and in general the difference is less dramatic than the difference from using the information bottleneck.
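To make the information-bottleneck part concrete, here is a minimal sketch of adding a variational-information-bottleneck KL penalty to a policy-gradient loss; `encoder`, `policy_head`, and the coefficient `beta` are placeholder names and values, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ib_policy_loss(encoder, policy_head, obs, actions, advantages, beta=1e-3):
    """Policy-gradient loss with a variational information bottleneck on the state encoding.

    Assumes discrete actions (int64 tensor of shape [batch]) and an encoder returning
    the mean and log standard deviation of p(z|x).
    """
    mu, log_std = encoder(obs)                     # parameters of the encoder distribution p(z|x)
    z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterized sample of z
    logits = policy_head(z)                        # action logits from the compressed representation
    log_probs = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

    pg_loss = -(log_probs * advantages).mean()     # standard policy-gradient term

    # KL(p(z|x) || N(0, I)) penalizes putting too much information into z.
    kl = 0.5 * (mu.pow(2) + (2 * log_std).exp() - 2 * log_std - 1).sum(dim=-1).mean()
    return pg_loss + beta * kl
```

SNI would then compute this loss twice, once with the sampled z and once with the deterministic z = mu, and mix the two gradient estimates (the 50/50 split mentioned above), while keeping the data-collecting rollouts non-stochastic.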
[link]
The paper designs some basic tests to compare saliency methods. It finds that some of the most popular methods are independent of the model parameters and the data, meaning they are effectively useless.

## Methods compared

The paper compares the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCam and SmoothGrad. They provide a refresher on those methods in the appendix. All those methods can be put in the same framework: they require a classification model and an input (typically an image), and the output of the method is an *explanation map* of the shape of the input where a higher value for a feature implies greater relevance in the decision of the model.

## Metrics of comparison

The authors argue that visual inspection of the saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSIM) and the Pearson correlation of the histogram of gradients. The authors point out that those metrics capture various notions of similarity, but this is an active area of research and those metrics are imperfect.

## First test: model parameter randomization

A saliency method must depend on the model parameters, otherwise it cannot help us understand a model. In this test, the authors randomize the model parameters, layer by layer, starting from the top. Surprisingly, methods such as guided backprop and guided GradCam are completely insensitive to model parameters, as illustrated on an Inception v3 trained on ImageNet:

*(figure omitted)*

Integrated gradients also looks dubious, as the bird is still visible with a mostly randomized model, but the quantitative metrics reveal that the difference between the two models is actually large.

## Second test: data randomization

It is well known that randomly shuffling the labels of a dataset does not prevent a neural network from getting a high accuracy on the training set, though it does prevent generalization. The model is able to learn by either memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should have no clearly interpretable signal. Here is the result for a ConvNet trained on MNIST and on a shuffled MNIST:

*(figure omitted)*

The results are very damning for most methods. Only gradients and GradCam are very different between both models, as confirmed by the low correlation.

## Discussion

- Even though some methods do not depend on the model parameters and the data, they might still depend on the architecture of the model, which could be of some use in some contexts.
- Methods that multiply the input with the gradient are dominated by the input.
- Complex saliency methods are just fancy edge detectors.
- Only gradient, smooth gradient and GradCam survive the sanity checks.

# Comments

- Why are their GradCam maps so ugly? They don't look like usual GradCam maps at all.
- Their tests are simple enough that it's hard to defend a method that doesn't pass them.
- The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret.
- In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. In the case of images, people usually just set the image to 0, which is not at all the absence of a feature. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect an estimation of the baseline such as done in [this paper](https://arxiv.org/abs/1702.04595) to be a fairer evaluation of IG.

Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19)
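As a rough illustration of the model-parameter randomization test, here is a sketch that re-randomizes layers (approximately top-down) and compares saliency maps with the Spearman rank correlation, one of the metrics above; `saliency_map(model, x)` stands in for any of the compared methods, and the ordering via `reversed(named_modules())` is only an approximation of "top of the network first":

```python
import copy
import torch
from scipy.stats import spearmanr

def cascading_randomization(model, x, saliency_map):
    """Randomize parameterized layers one by one and track how much the saliency map changes."""
    baseline = saliency_map(model, x).flatten()
    randomized = copy.deepcopy(model)
    scores = []
    for name, module in reversed(list(randomized.named_modules())):
        if getattr(module, "weight", None) is not None:
            torch.nn.init.normal_(module.weight)   # destroy the learned parameters of this layer
            current = saliency_map(randomized, x).flatten()
            rho, _ = spearmanr(baseline.detach().abs().cpu().numpy(),
                               current.detach().abs().cpu().numpy())
            scores.append((name, rho))             # low |rho| means the method reacts to the randomization
    return scores
```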
[link]
* The authors define in this paper a special loss function (DeePSiM), mostly for autoencoders.
* Usually one would use an MSE/euclidean distance as the loss function for an autoencoder, but that loss function basically always leads to blurry reconstructed images.
* They add two new ingredients to the loss function, which results in significantly sharper looking images.

### How

* Their loss function has three components:
  * Euclidean distance in image space (i.e. pixel distance between reconstructed image and original image, as usually used in autoencoders).
  * Euclidean distance in feature space. Another pretrained neural net (e.g. VGG, AlexNet, ...) is used to extract features from the original and the reconstructed image. Then the euclidean distance between both vectors is measured.
  * Adversarial loss, as usually used in GANs (generative adversarial networks). The autoencoder is here treated as the GAN generator. A second network, the GAN discriminator, is introduced, and both are trained in the typical GAN fashion. The loss component for DeePSiM is the loss of the discriminator, i.e. when reconstructing an image the autoencoder learns to reconstruct it in a way that lets the discriminator believe that the image is real.
* Using the loss in feature space alone would not be enough, as that tends to lead to overpronounced high-frequency components in the image (i.e. too strong edges, corners, other artefacts).
* To decrease these high-frequency components, a "natural image prior" is usually used. Other papers define some function by hand; this paper uses the adversarial loss for that (i.e. it learns a good prior).
* Instead of training a full autoencoder (encoder + decoder) it is also possible to only train a decoder and feed features - e.g. extracted via AlexNet - into the decoder.

### Results

* Using the DeePSiM loss with a normal autoencoder results in sharp reconstructed images.
* Using the DeePSiM loss with a VAE to generate ILSVRC-2012 images results in sharp images which are locally sound, but globally don't make sense. A simple euclidean distance loss results in blurry images.
* Using the DeePSiM loss when feeding only image space features (extracted via AlexNet) into the decoder leads to high quality reconstructions. Features from early layers lead to more exact reconstructions.
* One can again feed extracted features into the network, but then take the reconstructed image, extract features of that image and feed them back into the network. When using DeePSiM, even after several iterations of that process the images still remain semantically similar, while their exact appearance changes (e.g. a dog's fur color might change, counts of visible objects change).

*Images generated with a VAE using DeePSiM loss.*

*Images reconstructed from features fed into the network. Different AlexNet layers (conv5 - fc8) were used to generate the features. Earlier layers allow more exact reconstruction.*

*First, images are reconstructed from features (AlexNet, layers conv5 - fc8 as columns). Then, features of the reconstructed images are fed back into the network. That is repeated up to 8 times (rows). Images stay semantically similar, but their appearance changes.*

--------------------

### Rough chapter-wise notes

* (1) Introduction
  * Using an MSE of euclidean distances for image generation (e.g. autoencoders) often results in blurry images.
  * They suggest a better loss function that cares about the existence of features, but not as much about their exact translation, rotation or other local statistics.
  * Their loss function is based on distances in suitable feature spaces.
  * They use ConvNets to generate those feature spaces, as these networks are sensitive towards important changes (e.g. edges) and insensitive towards unimportant changes (e.g. translation).
  * However, naively using the ConvNet features does not yield good results, because the networks tend to project very different images onto the same feature vectors (i.e. they are contractive). That leads to artefacts in the generated images.
  * Instead, they combine the feature based loss with GANs (adversarial loss). The adversarial loss decreases the negative effects of the feature loss ("natural image prior").
* (3) Model
  * A typical choice for the loss function in image generation tasks (e.g. when using an autoencoder) would be squared euclidean/L2 loss or L1 loss.
  * They suggest a new class of losses called "DeePSiM".
  * We have a generator `G`, a discriminator `D`, a feature space creator `C` (takes an image, outputs a feature space for that image), one (or more) input images `x` and one (or more) target images `y`. Input and target image can be identical.
  * The total DeePSiM loss is a weighted sum of three components:
    * Feature loss: squared euclidean distance between the feature spaces of (1) the input after being fed through G and (2) the target image, i.e. `||C(G(x))-C(y)||^2_2`.
    * Adversarial loss: a discriminator is introduced to estimate the "fakeness" of images generated by the generator. The losses for D and G are the standard GAN losses.
    * Pixel space loss: classic squared euclidean distance (as commonly used in autoencoders). They found that this loss stabilized their adversarial training.
  * The feature loss alone would create high frequency artefacts in the generated image, which is why a second loss ("natural image prior") is needed. The adversarial loss fulfills that role.
  * Architectures
    * Generator (G):
      * They define different ones based on the task.
      * They all use up-convolutions, which they implement by stacking two layers: (1) a linear upsampling layer, then (2) a normal convolutional layer.
      * They use leaky ReLUs (alpha=0.3).
    * Comparators (C):
      * They use variations of AlexNet and Exemplar-CNN.
      * They extract the features from different layers, depending on the experiment.
    * Discriminator (D):
      * 5 convolutions (with some striding; 7x7 then 5x5, afterwards 3x3), into average pooling, then dropout, then 2x linear, then 2-way softmax.
  * Training details
    * They use Adam with learning rate 0.0002 and normal momentums (0.9 and 0.999).
    * They temporarily stop the discriminator training when it gets too good.
    * Batch size was 64.
    * 500k to 1000k batches per training.
* (4) Experiments
  * Autoencoder
    * Simple autoencoder with an 8x8x8 code layer between encoder and decoder (so actually more values than in the input image?!).
    * Encoder has a few convolutions, decoder a few up-convolutions (linear upsampling + convolution).
    * They train on STL-10 (96x96) and take random 64x64 crops.
    * Using AlexNet for C tends to break small structural details, using Exemplar-CNN breaks color details.
    * The autoencoder with their loss tends to produce less blurry images than the common L2 and L1 based losses.
    * Training an SVM on the 8x8x8 hidden layer performs significantly better with their loss than with L2/L1. That indicates potential for unsupervised learning.
  * Variational Autoencoder
    * They replace part of the standard VAE loss with their DeePSiM loss (keeping the KL divergence term).
    * Everything else is just like in a standard VAE.
    * Samples generated by a VAE with the normal loss function look very blurry. Samples generated with their loss function look crisp and have locally sound statistics, but still (globally) don't really make any sense.
  * Inverting AlexNet
    * Assume the following variables:
      * I: an image
      * ConvNet: a convolutional network
      * F: the features extracted by a ConvNet, i.e. ConvNet(I) (features in all layers, not just the last one)
    * Then you can invert the representation of a network in two ways:
      * (1) An inversion that takes an F and returns roughly the I that resulted in F (it's *not* key here that ConvNet(reconstructed I) returns the same F again).
      * (2) An inversion that takes an F and projects it to *some* I so that ConvNet(I) returns roughly the same F again.
    * Similar to the autoencoder cases, they define a decoder, but no encoder.
    * They feed into the decoder a feature representation of an image. The features are extracted using AlexNet (they try the features from different layers).
    * The decoder has to reconstruct the original image (i.e. inversion scenario 1). They use their DeePSiM loss during the training.
    * The images can be reconstructed quite well from the last convolutional layer in AlexNet. Choosing the later fully connected layers results in more errors (specifically in the case of the very last layer).
    * They also try their luck with inversion scenario (2), but didn't succeed (as their loss function does not care about diversity).
    * They iteratively encode and decode the same image multiple times (probably means: image -> features via AlexNet -> decode -> reconstructed image -> features via AlexNet -> decode -> ...). They observe that the image does not get "destroyed", but rather changes semantically, e.g. three apples might turn into one after several steps.
    * They interpolate between images. The interpolations are smooth.
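A minimal sketch of the generator-side DeePSiM loss summarized above, assuming a generator `G`, a discriminator `D` returning real/fake logits, and a frozen comparator `C` (e.g. an AlexNet feature extractor); the weights are placeholders, and the adversarial term uses a standard non-saturating GAN loss rather than the paper's exact 2-way softmax formulation:

```python
import torch
import torch.nn.functional as F

def deepsim_generator_loss(G, D, C, x, y, w_feat=1.0, w_adv=0.01, w_pix=1.0):
    """Weighted sum of feature-space, adversarial and pixel-space losses for the generator G."""
    y_hat = G(x)                            # reconstructed / generated image

    feat_loss = F.mse_loss(C(y_hat), C(y))  # distance in comparator feature space, ~ ||C(G(x)) - C(y)||^2
    pix_loss = F.mse_loss(y_hat, y)         # classic pixel-space euclidean loss (stabilizes training)

    d_fake = D(y_hat)                       # discriminator logits for the generated image
    # Non-saturating GAN generator loss: push D to believe the reconstruction is real.
    adv_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    return w_feat * feat_loss + w_adv * adv_loss + w_pix * pix_loss
```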
[link]
This paper derives an algorithm for passing gradients through a sample from a mixture of Gaussians. While the reparameterization trick allows one to get the gradients with respect to the Gaussian means and covariances, the same trick cannot be invoked for the mixing proportion parameters (essentially because they are the parameters of a multinomial discrete distribution over the Gaussian components, and the reparameterization trick doesn't extend to discrete distributions). One can think of the derivation as proceeding in 3 steps:

1. Deriving an estimator for the gradient of a sample from a 1-dimensional density $f(x)$ that is such that $f(x)$ is differentiable and its cumulative distribution function (CDF) $F(x)$ is tractable:

   $$\frac{\partial \hat{x}}{\partial \theta} = - \frac{1}{f(\hat{x})}\int_{t=-\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta} dt$$

   where $\hat{x}$ is a sample from density $f(x)$ and $\theta$ is any parameter of $f(x)$ (the above is a simplified version of Equation 6). This is probably the most important result of the paper, and is based on a really clever use of the general form of the Leibniz integral rule.

2. Noticing that one can sample from a $D$-dimensional Gaussian mixture by decomposing it with the product rule $f({\bf x}) = \prod_{d=1}^D f(x_d|{\bf x}_{<d})$ and using ancestral sampling, where each $f(x_d|{\bf x}_{<d})$ is itself a 1-dimensional mixture (i.e. with a differentiable density and tractable CDF).

3. Using the 1-dimensional gradient estimator (of Equation 6) and the chain rule to backpropagate through the ancestral sampling procedure. This requires computing the integral in the expression for $\frac{\partial \hat{x}}{\partial \theta}$ above, where $f(x)$ is one of the 1D conditional Gaussian mixtures and $\theta$ is a mixing proportion parameter $\pi_j$. As it turns out, this integral has an analytical form (see Equation 22).

**My two cents**

This is a really surprising and neat result. The author mentions it could be applicable to variational autoencoders (to support posteriors that are mixtures of Gaussians), and I'm really looking forward to reading about whether that can be successfully done in practice.

The paper provides the derivation only for mixtures of Gaussians with diagonal covariance matrices. It is mentioned that extending to non-diagonal covariances is doable. That said, ancestral sampling with non-diagonal covariances would become more computationally expensive, since the conditionals under each Gaussian involve a matrix inverse.

Beyond the case of Gaussian mixtures, Equation 6 is super interesting in itself, as its applications could go beyond that case. This is probably why the paper also derives a sampling-based estimator for Equation 6, in Equation 9. However, that estimator might be inefficient, since it involves sampling from Equation 10 with rejection, and it might take a lot of time to get an accepted sample if $\hat{x}$ is very small. Also, a good estimate of Equation 6 might require *multiple* samples from Equation 10.

Finally, while I couldn't find any obvious problem with the mathematical derivation, I'd be curious to see whether using the same approach to derive a gradient on one of the Gaussian mean or standard deviation parameters gives a gradient that is consistent with what the reparameterization trick provides.
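One way to see where the estimator in step 1 comes from (my own sketch, not the paper's exact Leibniz-rule derivation): if the sample is produced by inverse-CDF sampling, $\hat{x} = F_\theta^{-1}(u)$ with $u \sim \mathcal{U}(0,1)$ held fixed, then differentiating the identity $F_\theta(\hat{x}(\theta)) = u$ with respect to $\theta$ gives

$$0 = \frac{d}{d\theta} F_\theta\big(\hat{x}(\theta)\big) = \int_{-\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta}\, dt + f(\hat{x})\, \frac{\partial \hat{x}}{\partial \theta} \quad\Longrightarrow\quad \frac{\partial \hat{x}}{\partial \theta} = -\frac{1}{f(\hat{x})} \int_{-\infty}^{\hat{x}} \frac{\partial f(t)}{\partial \theta}\, dt,$$

which is exactly the simplified form of Equation 6 quoted above.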