arXiv is an e-print service in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance and statistics.

- 1996 1
- 2010 1
- 2011 1
- 2012 8
- 2013 31
- 2014 43
- 2015 134
- 2016 231
- 2017 182
- 2018 141
- 2019 100
- 2020 33
- 2021 10
- 2022 7
- 2023 3

Women also Snowboard: Overcoming Bias in Captioning Models (Extended Abstract)

Lisa Anne Hendricks and Kaylee Burns and Kate Saenko and Trevor Darrell and Anna Rohrbach

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2024/08/09 (just now)

**Abstract:** Most machine learning methods are known to capture and exploit biases of the
training data. While some biases are beneficial for learning, others are
harmful. Specifically, image captioning models tend to exaggerate biases
present in training data. This can lead to incorrect captions in domains where
unbiased captions are desired, or required, due to over reliance on the learned
prior and image context. We investigate generation of gender specific caption
words (e.g. man, woman) based on the person's appearance or the image context.
We introduce a new Equalizer model that ensures equal gender probability when
gender evidence is occluded in a scene and confident predictions when gender
evidence is present. The resulting model is forced to look at a person rather
than use contextual cues to make a gender specific prediction. The losses that
comprise our model, the Appearance Confusion Loss and the Confident Loss, are
general, and can be added to any description model in order to mitigate impacts
of unwanted bias in a description dataset. Our proposed model has lower error
than prior work when describing images with people and mentioning their gender
and more closely matches the ground truth ratio of sentences including women to
sentences including men.
more
less

Lisa Anne Hendricks and Kaylee Burns and Kate Saenko and Trevor Darrell and Anna Rohrbach

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
This paper is to reduce gender bias in the captioning model. Concretely, traditional captioning models tend to rely on contextual cues, so they usually predict incorrect captions for an image that contains people. To reduce gender bias, they introduced a new $Equalizer$ model that contains two losses: (1) Appearance Confusion Loss: When it is hard to tell if there is a man or a woman in the image, the model should provide a fair probability of predicting a man or a woman. To define that loss, first, they define a confusion function, which indicates how likely a next predicted word belongs to a set of woman words or a set of man words. https://i.imgur.com/oI6xswy.png Where, $w~_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, $G_{m}$ is the set of man words. And the Loss is defined as the normal cross-entropy loss multiplied by the Confusion function. https://i.imgur.com/kLpROse.png (2) Confident Loss: When it is easy to recognize a man or a woman in an image, this loss encourages the model to predict gender words correctly. In this loss, they also defined in-confidence functions, there are two in-confidence functions, the first one is the in-confidence function for man words, and the second one is for woman words. These two functions are the same. https://i.imgur.com/4stFjac.png This function tells that if the model is confident when predicting a gender (ex. woman), then the value of the in-confidence function for woman words should be low. Then, the confidence loss function is as follows: https://i.imgur.com/1pRgDir.png |

Learning to Count Objects in Natural Images for Visual Question Answering

Zhang, Yan and Hare, Jonathon S. and Prügel-Bennett, Adam

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Yan and Hare, Jonathon S. and Prügel-Bennett, Adam

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Visual Question Answering can not do the counting objects problem properly. So in this paper, they figured out the reason is due to the Soft Attention module, and they also proposed a module that can produce reliable counting from object proposals. There are two challenges in VQA Counting tasks: (1) There is no ground truth label for the objects to be counted. (2) The additional module should not affect performance on non-counting problems. Why Soft Attention is not good for the counting task: One case to explain why Soft Attention limits counting ability: Consider the task of counting cats for two images: an image of a cat and an image that contains two images side-by-side that are copies of the first image. For image 1: after the normalization of the softmax function in the attention, the cat in this image will receive a normalized weight of 1. For image 2: each cat receives a weight of 0.5. Then, the attention module will do the weighted sum to produce an attention feature vector. Because the weighted sum process will average the two cats in the second image back to a single cat, so 2 attention feature vectors of the two images are the same. As a result, the information about possible counts is lost by using the attention map. Counting Component: This component will be in charge of counting objects for an image. This has two things to do: 1) A differentiable mechanism for counting from attention weights. 2) Handling overlapping object proposals to reduce object double-counting. The Counting Component is as follows: https://i.imgur.com/xVGcaov.png Note that, intra-objects are objects that point to the same object and the same class, while inter-objects are objects that point to the different object and the same class. They have three main components: (1) object proposals (4 vertices), the black ones are relevant objects while the white ones are irrelevant objects. Then (2) intra-object edges between duplicate proposals, and (3) blue edges mark the inter-object duplicate edges. Finally, there will be one edge and 2 vertices (2 relevant objects). To illustrate the component in more detail, there are 4 main steps: (1) Input: The component needs n attention weights $a = [a_{1}, a_{2},...,a_{n}]^{T}$ and their corresponding boxes $b = [b_{1}, ..., b_{n}]^{T}$ (2) Deduplication: The goal of this step is to make a graph $A=aa^{T}$ (attention matrix) where each vertex is a bounding box proposal if the $ith$ bounding box is a relevant box, then $a_{i} = 1$ otherwise, $a_{i} = 0$. And the Counting Component will modify this graph to delete those edges until the graph becomes a fully directed graph with self-loops. For example, [a1, a2, a3, a4, a5]=[1,0,1,0,1], the subgraph containing a1, a3, or a5 is a fully directed graph, as follows: https://i.imgur.com/cCKIQ0K.png The illustration for this graph is as follows: https://i.imgur.com/x93gk8c.png Then we will eliminate duplicate edges: (1) intra-object edges and (2) inter-object edges. 1. Intra-object edges First, we eliminate intra-object edges. To achieve this, we need to calculate the distance matrix $D$ where $D_{ij} = 1- IoU(b_{i}, b_{j})$, if $D_{ij}=1$ which means two bounding boxes are quite overlapped, and then should be eliminated. To remove them, multiply the attention matrix $A$, which is calculated before, with the matrix $D$, to remove the connection between duplicate proposals of a single object. https://i.imgur.com/TQAvAnW.png 2. Inter-object edges Second, we eliminate inter-object edges. The main idea is to combine the proposals of the duplicate objects into 1. To do this, scale down the weight of its associated edges (vertices connected to that vertex). For example, if an object has two proposals, the edges involving those proposals should be scaled by 0.5. Essentially, this is averaging the proposal within each base object, since we only use the sum of edge weights to compute the final count. https://i.imgur.com/4An0BAj.png |

MnasNet: Platform-Aware Neural Architecture Search for Mobile

Tan, Mingxing and Chen, Bo and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Tan, Mingxing and Chen, Bo and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
When machine learning models need to run on personal devices, that implies a very particular set of constraints: models need to be fairly small and low-latency when run on a limited-compute device, without much loss in accuracy. A number of human-designed architectures have been engineered to try to solve for these constraints (depthwise convolutions, inverted residual bottlenecks), but this paper's goal is to use Neural Architecture Search (NAS) to explicitly optimize the architecture against latency and accuracy, to hopefully find a good trade-off curve between the two. This paper isn't the first time NAS has been applied on the problem of mobile-optimized networks, but a few choices are specific to this paper. 1. Instead of just optimizing against accuracy, or optimizing against accuracy with a sharp latency requirement, the authors here construct a weighted loss that includes both accuracy and latency, so that NAS can explore the space of different trade-off points, rather than only those below a sharp threshold. 2. They design a search space where individual sections or "blocks" of the network can be configured separately, with the hope being that this flexibility helps NAS trade off complexity more strongly in the early parts of the network, where, at a higher spatial resolution, it implies greater computation cost and latency, without necessary dropping that complexity later in the network, where it might be lower-cost. Blocks here are specified by the type of convolution op, kernel size, squeeze-and-excitation ratio, use of a skip op, output filter size, and the number of times an identical layer of this construction will be repeated to constitute a block. Mechanically, models are specified as discrete strings of tokens (a block is made up of tokens indicating its choices along these design axes, and a model is made up of multiple blocks). These are represented in a RL framework, where a RNN model sequentially selects tokens as "actions" until it gets to a full model specification . This is repeated multiple times to get a batch of models, which here functions analogously to a RL episode. These models are then each trained for only five epochs (it's desirable to use a full-scale model for accurate latency measures, but impractical to run its full course of training). After that point, accuracy is calculated, and latency determined by running the model on an actual Pixel phone CPU. These two measures are weighted together to get a reward, which is used to train the RNN model-selection model using PPO. https://i.imgur.com/dccjaqx.png Across a few benchmarks, the authors show that models found with MNasNet optimization are able to reach parts of the accuracy/latency trade-off curve that prior techniques had not. |

Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2024/08/09 (just now)

**Abstract:** Hierarchical reinforcement learning (HRL) is a promising approach to extend
traditional reinforcement learning (RL) methods to solve more complex tasks.
Yet, the majority of current HRL methods require careful task-specific design
and on-policy training, making them difficult to apply in real-world scenarios.
In this paper, we study how we can develop HRL algorithms that are general, in
that they do not make onerous additional assumptions beyond standard RL
algorithms, and efficient, in the sense that they can be used with modest
numbers of interaction samples, making them suitable for real-world problems
such as robotic control. For generality, we develop a scheme where lower-level
controllers are supervised with goals that are learned and proposed
automatically by the higher-level controllers. To address efficiency, we
propose to use off-policy experience for both higher and lower-level training.
This poses a considerable challenge, since changes to the lower-level behaviors
change the action space for the higher-level policy, and we introduce an
off-policy correction to remedy this challenge. This allows us to take
advantage of recent advances in off-policy model-free RL to learn both higher-
and lower-level policies using substantially fewer environment interactions
than on-policy algorithms. We term the resulting HRL agent HIRO and find that
it is generally applicable and highly sample-efficient. Our experiments show
that HIRO can be used to learn highly complex behaviors for simulated robots,
such as pushing objects and utilizing them to reach target locations, learning
from only a few million samples, equivalent to a few days of real-time
interaction. In comparisons with a number of prior HRL methods, we find that
our approach substantially outperforms previous state-of-the-art techniques.
more
less

Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
# Keypoints - Proposes the HIerarchical Reinforcement learning with Off-policy correction (**HIRO**) algorithm. - Does not require careful task-specific design. - Generic goal representation to make it broadly applicable, without any manual design of goal spaces, primitives, or controllable dimensions. - Use of off-policy experience using a novel off-policy correction. - A two-level hierarchy architecture - A higher-level controller outputs a goal for the lower-level controller every **c** time steps and collects the rewards given by the environment, being the goal the desired change in state space - The lower level controller has the goal given added to its input and acts directly in the environment, the reward received is parametrized from the current state and the goal. # Background This paper adopts a standard continuous control reinforcement learning setting, in which an agent acts on an environment that yields a next state and a reward from unknown functions. This paper utilizes the TD3 learning algorithm. ## General and Efficient Hierarchical Reinforcement Learning https://i.imgur.com/zAHoWWO.png ## Hierarchy of Two Policies The higher-level policy $\mu^{hi}$ outputs a goal $g_t$, which correspond directly to desired relative changes in state that the lower-level policy $\mu^{lo}$ attempts to reach. $\mu^{hi}$ operates at a time abstraction, updating the goal $g_t$ and collecting the environment rewards $R_t$ every $c$ environment steps, the higher-level transition $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ is stored for off-policy training. The lower-level policy $\mu^{lo}$ outputs an action to be applied directly to the environment, having as input the current environment observations $s_t$ and the goal $g_t$. The goal $g_t$ is given by $\mu^{hi}$ every $c$ environment time steps, for the steps in between, the goal $g_t$ used by $\mu^{lo}$ is given by the transition function $g_t=h(s_{t−1},g_{t−1},s_t)$, the lower-level controller reward is provided by the parametrized reward function $r_t=r(s_t,g_t,a_t,s_{t+1})$. The lower-level transition $(s_t,g_t,a_t,r_t,s_{t+1}, g_{t+1})$ is stored for off-policy training. ## Parameterized Rewards The goal $g_t$ indicates a desired relative changes in state observations, the lower-level agent task is to take actions from state $s_t$ that yield it an observation $s_{t+c}$ that is close to $s_t+g_t$. To maintain the same absolute position of the goal regardless of state change, the goal transition model, used between $\mu^{hi}$ updates every $c$ steps, is defined as: $h(s_t,g_t,s_{t+1}) =s_t+g_t−s_{t+1}$ And the reward given to the lower-level controller is defined as to reinforce reaching a state closer to the goal $g_t$, this paper parametrizes it by the function: $r(s_t,g_t,a_t,s_{t+1}) =−||s_t+g_t−s_{t+1}||_2$. ## Off-Policy Corrections for Higher-Level Training The higher-level transitions stored $(s_{t:t+c−1},g_{t:t+c−1},a_{t:t+c−1},R_{t:t+c−1},s_{t+c})$ have to be converted to state-action-reward transitions $(s_t,g_t,∑R_{t:t+c−1},s_{t+c})$ as they can be used in standard off-policy RL algorithms, however, since the lower-level controller is evolving, these past transitions do not accurately represent the actions tha would be taken by the current lower-level policy and must be corrected. This paper correction technique used is to change the goal $g_t$ of past transitions using an out of date lower-level controller to a relabeled goal $g ̃_t$ which is likely to induce the same lower-level behavior with the updated $\mu^{lo}$. In other words, we want to find a goal $g ̃_t$ which maximizes the probability $μ_{lo}(a_{t:t+c−1}|s_{t:t+c−1},g ̃_{t:t+c−1})$, in which the $\mu^{lo}$ is the current policy and the actions $a_{t:t+c−1}$ and states $s_{t:t+c−1}$ are from the stored high level transition. To approximately maximize this quantity in practice, the authors calculated the probability for 10 candidates $g ̃_t$, eight candidate goals sampled randomly from a Gaussian centered at $s_{t+c}−s_t$, the original goal $g_t$ and a goal corresponding to the difference $s_{t+c}−s_t$. # Experiments https://i.imgur.com/iko9nCd.png https://i.imgur.com/kGx8fZv.png The authors compared the $HIRO$ method to prior method in 4 different environments: - Ant Gather; - Ant Maze; - Ant Push; - Ant Fall. They also performed an ablative analysis with the following variants: - With lower-level re-labeling; - With pre-training; - No off-policy correction; - No HRL. # Closing Points - The method proposed is interesting in the hierarchical reinforcement learning setting for not needing a specific design, the generic goal representation enables applicability without the need of designing a goal space manually; - The off-policy correction method enables this algorithm to be sample efficient; - The hierarchical structure with intermediate goals on state-space enables to better visualize the agent goals; - The paper Appendix elaborates on possible alternative off-policy corrections. |

Recurrent World Models Facilitate Policy Evolution

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/09/04 (5 years ago)

**Abstract:** A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving state of the
art results in various environments. We also train our agent entirely inside of
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. Interactive version of paper at
https://worldmodels.github.io
more
less

David Ha and Jürgen Schmidhuber

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## General Framework The take-home message is that the challenge of Reinforcement Learning for environments with high-dimensional and partial observations is learning a good representation of the environment. This means learning a sensory features extractor V to deal with the highly dimensional observation (pixels for example). But also learning a temporal representation M of the environment dynamics to deal with the partial observability. If provided with such representations, learning a controller so as to maximize a reward is really easy (single linear layer evolved with CMA-ES). Authors call these representations a *World model* since they can use the learned environment's dynamics to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment provided that measures are taken to prevent the policy from exploiting the world model's inaccuracies. ## Method **Learning the World Model** ![](https://i.imgur.com/tgV17k4.png =600x) In this work they propose to learn these representations off-line in an unsupervized manner in order to be more efficient. They use a VAE for V that they train exclusively with the reconstruction loss, that way the learned representations are independent of the reward and can be used alongside any reward. They then train M as Mixture-Density-Network-RNN to predict the next sensory features (as extracted by the VAE) --and possibly the done condition and the reward-- and thus learn the dynamics of the environment in the VAE's latent space (which is likely simpler there than in the pixel space). Note that the VAE's latent space is a single Gaussian (adding stochasticity makes it more robust to the "next state" outputs of M), whereas M outputs next states in a mixture of Gaussians. Indeed, an image is likely to have one visual encoding, yet it can have multiple and different future scenarii which are captured by the multimodal output of M. **Training the policy** ![](https://i.imgur.com/H5vpb2H.png) * In the real env: The agent is provided with the visual features and M's hidden state (temporal features). * In the world model: To avoid that the agent exploits this imperfect simulator they increase its dynamics' stochasticity by playing with $\tau$ the sampling temperature of $z_{t+1}$ in M. ## Limitations If exploration is important in the environment the initial random policy might fail to collect data in all the relevant part of the environment and an iterative version of Algorithm 1 might be required (see https://worldmodels.github.io/ for a discussion on the different iterative methods) for the data collection. By training V independently of M it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently so that the reward and $z_{t+1}$'s prediction loss (or next state reconstruction loss) of M flows through V (that would also be trained with its own reconstruction loss). The trade-off is that now V is tuned to a particular reward and cannot be reused. The authors argue that since $h_t$ is such that it can predict $z_{t+1}$, it contains enough insight about the future for the agent not needing to *plan ahead* and just doing reflexive actions based on $h_t$. This is interesting but the considered tasks (driving, dodging fireball) are still very reflexive and do not require much planning. ## Results When trained on the true env, a simple controller with the V and M representations achieve SOTA on car-racing. V + M is better than V alone. When trained inside the world model, its dynamics' stochasticity must be tuned in order for the policy to transfer well and perform well on the real env: too little stochasticity and the agent overfits to the world model flaws and does not transfer to the real env, too much and the agent becomes risk-averse and robust but suboptimal. ![](https://i.imgur.com/e8ETSjQ.png) ## Additional ressources Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/ |

Junction Tree Variational Autoencoder for Molecular Graph Generation

Jin, Wengong and Barzilay, Regina and Jaakkola, Tommi S.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Jin, Wengong and Barzilay, Regina and Jaakkola, Tommi S.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Prior to this paper, most methods that used machine learning to generate molecular blueprints did so using SMILES representations - a string format with characters representing different atoms and bond types. This preference came about because ML had existing methods for generating strings that could be built on for generating SMILES (a particular syntax of string). However, an arguably more accurate and fundamental way of representing molecules is as graphs (with atoms as nodes and bonds as edges). Dealing with molecules as graphs avoids the problem of a given molecule having many potential SMILES representations (because there's no canonical atom to start working your way around the molecule on), and, hopefully, would have an inductive bias that somewhat more closely matches the actual biomechanical interactions within a molecule. One way you could imagine generating a graph structure is by adding on single components (atoms or bonds) at a time. However, the authors of this paper argue that this approach is harder to constrain to only construct valid molecular graphs, since, in the course of sampling out a molecule, you'd have to go through intermediate stages that you expect to be invalid (for example, bonds with no attached atoms), making it hard to add in explicit validity checks. The alternate approach proposed here works as follows: - Atoms within molecules are grouped into valid substructures, based on a combination of biologically-motivated rules (like treating aromatic rings as a single substructure) and computational heuristics. For the purpose of this paper, substructures are generally either 1) a ring, 2) two particular atoms on either end of a bond, or 3) a "tail" with a bond and an atom. Importantly, these substructures are designed to be overlapping - if you had a N bonded with O, and then O with C (this example are entirely made up, and I expect chemically incoherent), then you could have "N-O" as one substructure, and "O-C" as another. https://i.imgur.com/yGzRPjT.png - Using these substructures (or clusters), you can form a simplified representation of a molecule, as a connected, non-cyclic junction tree of clusters connected together. This doesn't give you all the information you'd need to construct the molecule - since there could be multiple different ways, on a molecular level, to connect two substructures, but it does give a blueprint of what the molecule will look like. - Given these two representations, the paper proposes a two-step encoding and decoding process. For a given molecule, we encode both the full molecular graph and the simplified junction tree, getting out vectors Zg and Zt respectively. - The first step of decoding generates a tree given the Zt representation. This generation process works via graph message-passing, taking in the Zt vector in addition to whatever part of the tree exists, and predicting a probability for whether that node has a child, and, if it exists, a probability for what cluster is at that child node. Given this parametrized set of probabilities, we can calculate the probability of the junction tree representation of whatever ground truth molecule we're decoding, and train the tree decoder to increase that model likelihood. (Importantly, although we frame this step as "reconstruction," during training, we're not actually sampling discrete nodes and edges, because we couldn't backprop through that, we're just defining a probability distribution and trying to increase the probability of our real data under it). - The second step of decoding takes in a tree - which at this point is a set of cluster labels with connections specified between one another - as well as the Zg vector, and generates a full, atom-level graph. This is done by enumerating all the ways that two substructures could be attached (this number is typically small, ≤4), and learning a parametrized function that scores each possible type of connection, based on the full tree "blueprint", the Zg embedding, and the molecule that has been generated so far. - When you want to sample a new molecule, you can draw a sample from the prior distributions of Zg and Zt, and run the decoding process in a sampling mode, actually making discrete draws from the distributions given by your model https://i.imgur.com/QdSY25u.png The authors perform three empirical tests: ability to successfully sample-reconstruct a given molecule, ability to optimize for a desired chemical property by finding a Z that scores high on that property according to an auxiliary predictive model, and ability to optimize for a property while staying within a given similarity radius to an original anchor molecule. The Junction Tree approach outperforms on all three tasks. On reconstruction, it matches the best alternative method on reconstruction reliability, but with 100% valid molecules, rather than 43.5% on the competing method. Overall, I found this paper really enjoyable and satisfying to read. Occasionally, ML-for-bio papers err on the side of too little domain thought (just throwing the most generic-for-images model structure at a problem), or too little machine learning thought (take hand-designed features and throw them at a whole range of models), where I think this one struck a nice balance of some amount of domain knowledge (around what makes for valid substructures) but embedded in a complex and thoughtfully designed neural network framework. |

Low Frequency Adversarial Perturbation

Chuan Guo and Jared S. Frank and Kilian Q. Weinberger

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/09/24 (5 years ago)

**Abstract:** Adversarial images aim to change a target model's decision by minimally
perturbing a target image. In the black-box setting, the absence of gradient
information often renders this search problem costly in terms of query
complexity. In this paper we propose to restrict the search for adversarial
images to a low frequency domain. This approach is readily compatible with many
existing black-box attack frameworks and consistently reduces their query cost
by 2 to 4 times. Further, we can circumvent image transformation defenses even
when both the model and the defense strategy are unknown. Finally, we
demonstrate the efficacy of this technique by fooling the Google Cloud Vision
platform with an unprecedented low number of model queries.
more
less

Chuan Guo and Jared S. Frank and Kilian Q. Weinberger

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
Guo et al. propose to augment black-box adversarial attacks with low-frequency noise to obtain low-frequency adversarial examples as shown in Figure 1. To this end, the boundary attack as well as the NES attack are modified to sample from a low-frequency Gaussian distribution instead from Gaussian noise directly. This is achieved through an inverse discrete cosine transform as detailed in the paper. https://i.imgur.com/fejvuw7.jpg Figure 1: Example of a low-frequency adversarial example. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Semantic Adversarial Examples

Hossein Hosseini and Radha Poovendran

Conference and Computer Vision and Pattern Recognition - 2018 via Local CrossRef

Keywords:

Hossein Hosseini and Radha Poovendran

Conference and Computer Vision and Pattern Recognition - 2018 via Local CrossRef

Keywords:

[link]
Hosseini and Poovendran propose semantic adversarial examples by randomly manipulating hue and saturation of images. In particular, in an iterative algorithm, hue and saturation are randomly perturbed and projected back to their valid range. If this results in mis-classification the perturbed image is returned as the adversarial example and the algorithm is finished; if not, another iteration is run. The result is shown in Figure 1. As can be seen, the structure of the images is retained while hue and saturation changes, resulting in mis-classified images. https://i.imgur.com/kFcmlE3.jpg Figure 1: Examples of the computed semantic adversarial examples. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Sensitivity and Generalization in Neural Networks: an Empirical Study

Roman Novak and Yasaman Bahri and Daniel A. Abolafia and Jeffrey Pennington and Jascha Sohl-Dickstein

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG, cs.NE

**First published:** 2018/02/23 (6 years ago)

**Abstract:** In practice it is often found that large over-parameterized neural networks
generalize better than their smaller counterparts, an observation that appears
to conflict with classical notions of function complexity, which typically
favor smaller models. In this work, we investigate this tension between
complexity and generalization through an extensive empirical exploration of two
natural metrics of complexity related to sensitivity to input perturbations.
Our experiments survey thousands of models with various fully-connected
architectures, optimizers, and other hyper-parameters, as well as four
different image classification datasets.
We find that trained neural networks are more robust to input perturbations
in the vicinity of the training data manifold, as measured by the norm of the
input-output Jacobian of the network, and that it correlates well with
generalization. We further establish that factors associated with poor
generalization $-$ such as full-batch training or using random labels $-$
correspond to lower robustness, while factors associated with good
generalization $-$ such as data augmentation and ReLU non-linearities $-$ give
rise to more robust functions. Finally, we demonstrate how the input-output
Jacobian norm can be predictive of generalization at the level of individual
test points.
more
less

Roman Novak and Yasaman Bahri and Daniel A. Abolafia and Jeffrey Pennington and Jascha Sohl-Dickstein

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG, cs.NE

[link]
Novak et al. study the relationship between neural network sensitivity and generalization. Here, sensitivity is measured in terms of the Frobenius gradient of the network’s probabilities (resulting in a Jacobian matrix, not depending on the true label) or based on a coding scheme of activations. The latter is intended to quantify transitions between linear regions of the piece-wise linear model. To this end, all activations are assigned either $0$ or $1$ depending on their ReLU output. Based on a path between two or more input examples, the difference in this coding scheme is an estimator of how many linear regions have been “traversed”. Both metrics are illustrated in Figure 1, showing that they are low for test and training examples, or in regions within the same class, and high otherwise. The second metric is also illustrated in Figure 2. Based on these metrics, the authors show that these metrics correlate with the generalization gap, meaning that the sensitivity of the network and its generalization performance seem to be inherently connected. https://i.imgur.com/iRt3ADe.jpg Figure 1: For a network trained on MNIST, illustrations of a possible trajectory (left) and the corresponding sensitivity metrics (middle and right). I refer to the paper for details. https://i.imgur.com/0G8su3K.jpg Figure 2: Linear regions for a random 2-dimensional slice of the pre-logit space before and after training. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Group Normalization

Yuxin Wu and Kaiming He

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG

**First published:** 2018/03/22 (6 years ago)

**Abstract:** Batch Normalization (BN) is a milestone technique in the development of deep
learning, enabling various networks to train. However, normalizing along the
batch dimension introduces problems --- BN's error increases rapidly when the
batch size becomes smaller, caused by inaccurate batch statistics estimation.
This limits BN's usage for training larger models and transferring features to
computer vision tasks including detection, segmentation, and video, which
require small batches constrained by memory consumption. In this paper, we
present Group Normalization (GN) as a simple alternative to BN. GN divides the
channels into groups and computes within each group the mean and variance for
normalization. GN's computation is independent of batch sizes, and its accuracy
is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN
has 10.6% lower error than its BN counterpart when using a batch size of 2;
when using typical batch sizes, GN is comparably good with BN and outperforms
other normalization variants. Moreover, GN can be naturally transferred from
pre-training to fine-tuning. GN can outperform its BN-based counterparts for
object detection and segmentation in COCO, and for video classification in
Kinetics, showing that GN can effectively replace the powerful BN in a variety
of tasks. GN can be easily implemented by a few lines of code in modern
libraries.
more
less

Yuxin Wu and Kaiming He

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG

[link]
Wu and He propose group normalization as alternative to batch normalization. Instead of computing the statistics used for normalization based on the current mini-batch, group normalization computes these statistics per instance but in groups of channels (for convolutional layers). Specifically, given activations $x_i$ with $i = (i_N, i_C, i_H, i_W)$ indexing along batch size, channels, height and width, batch normalization computes $\mu_i = \frac{1}{|S|}\sum_{k \in S} x_k$ and $\sigma_i = \sqrt{\frac{1}{|S|} \sum_{k \in S} (x_k - \mu_i)^2 + \epsilon}$ with the set $S$ holds all indices for a specific channel (i.e. across samples, height and width). For group normalization, in contrast, $S$ holds all indices of the current instance and group of channels. Meaning the statistics are computed across height, width and the current group of channels. Here, all channels can be divided into groups arbitrarily. In the paper, on ImageNet, groups of $32$ channels are used. Then, Figure 1 shows that for a batch size of 32, group normalization performs en-par with batch normalization – although the validation error is slightly larger. This is attributed to the stochastic element of batch normalization that leads to regularization. Figure 2 additionally shows the influence of the batch size of batch normalization and group normalization. https://i.imgur.com/lwP5ycw.jpg Figure 1: Training and validation error for different normalization schemes on ImageNet. https://i.imgur.com/0c3CnEX.jpg Figure 2: Validation error for different batch sizes. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels

Zhang, Zhilu and Sabuncu, Mert R.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Zhilu and Sabuncu, Mert R.

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Gosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to $\sum_{j=1}^c \mathcal{L}(f(x), j) = C$ for all $x$ and $f$ where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not symmetric, while the mean absolute error is. The mean absolute error however, usually results in slower learning and may reach lower accuracy. As alternative, the authors propose $\mathcal{L}(f(x), e_j) = \frac{(1 – f_j(x)^q)}{q}$. Here, $f$ is the classifier which is assumed to contain a softmax layer at the end. For $q \rightarrow 0$ this reduces to the cross entropy and for $q = 1$ it reduces to the mean absolute error. As shown in Figure 1, this loss (or a slightly adapted version, see paper, respectively) may obtain better performance on noisy labels. To this end, the label noise is assumed to be uniform, meaning that $p(\tilde{y} = k|y = j, x)= 1 - \eta$ where $\tilde{y}$ is the perturbed label. https://i.imgur.com/HRQ84Zv.jpg Figure 1: Performance of the proposed loss for different $q$ and noise rate $\eta$ on Cifar-10. A ResNet-34 is used. [1] Aritra Gosh, Himanshu Kumar, PS Sastry. Robust loss functions under label noise for deep neural networks. AAAI, 2017. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Taming VAEs

Danilo Jimenez Rezende and Fabio Viola

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/01 (5 years ago)

**Abstract:** In spite of remarkable progress in deep latent variable generative modeling,
training still remains a challenge due to a combination of optimization and
generalization issues. In practice, a combination of heuristic algorithms (such
as hand-crafted annealing of KL-terms) is often used in order to achieve the
desired results, but such solutions are not robust to changes in model
architecture or dataset. The best settings can often vary dramatically from one
problem to another, which requires doing expensive parameter sweeps for each
new case. Here we develop on the idea of training VAEs with additional
constraints as a way to control their behaviour. We first present a detailed
theoretical analysis of constrained VAEs, expanding our understanding of how
these models work. We then introduce and analyze a practical algorithm termed
Generalized ELBO with Constrained Optimization, GECO. The main advantage of
GECO for the machine learning practitioner is a more intuitive, yet principled,
process of tuning the loss. This involves defining of a set of constraints,
which typically have an explicit relation to the desired model performance, in
contrast to tweaking abstract hyper-parameters which implicitly affect the
model behavior. Encouraging experimental results in several standard datasets
indicate that GECO is a very robust and effective tool to balance
reconstruction and compression constraints.
more
less

Danilo Jimenez Rezende and Fabio Viola

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
The paper provides derivations and intuitions about the learning dynamics for VAEs based on observations about [$\beta$-VAEs][beta]. Using this they derive an alternative way to constrain the training of VAEs that doesn't require typical heuristics, such as warmup or adding noise to the data. How exactly would this change a typical implementation? Typically, SGD is used to [optimize the ELBO directly](https://github.com/pytorch/examples/blob/master/vae/main.py#L91-L95). Using GECO, I keep a moving average of my constraint $C$ (chosen based on what I want the VAE to do, but it can be just the likelihood plus a tolerance parameter) and use that to calculate Lagrange multipliers, which control the weighting of the constraint to the loss. [This implementation](https://github.com/denproc/Taming-VAEs/blob/master/train.py#L83-L97) from a class project appears to be correct. With the stabilization of training, I can't help but think of this as batchnorm for VAEs. [beta]: https://openreview.net/forum?id=Sy2fzU9gl |

Progress & Compress: A scalable framework for continual learning

Jonathan Schwarz and Jelena Luketina and Wojciech M. Czarnecki and Agnieszka Grabska-Barwinska and Yee Whye Teh and Razvan Pascanu and Raia Hadsell

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/05/16 (6 years ago)

**Abstract:** We introduce a conceptually simple and scalable framework for continual
learning domains where tasks are learned sequentially. Our method is constant
in the number of parameters and is designed to preserve performance on
previously encountered tasks while accelerating learning progress on subsequent
problems. This is achieved by training a network with two components: A
knowledge base, capable of solving previously encountered problems, which is
connected to an active column that is employed to efficiently learn the current
task. After learning a new task, the active column is distilled into the
knowledge base, taking care to protect any previously acquired skills. This
cycle of active learning (progression) followed by consolidation (compression)
requires no architecture growth, no access to or storing of previous data or
tasks, and no task-specific parameters. We demonstrate the progress & compress
approach on sequential classification of handwritten alphabets as well as two
reinforcement learning domains: Atari games and 3D maze navigation.
more
less

Jonathan Schwarz and Jelena Luketina and Wojciech M. Czarnecki and Agnieszka Grabska-Barwinska and Yee Whye Teh and Razvan Pascanu and Raia Hadsell

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
Proposes a two-stage approach for continual learning. An active learning phase and a consolidation phase. The active learning stage optimizes for a specific task that is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phases uses a separate network than the knowledge base, but is not always trained from scratch - authors suggest a heuristic based on task-similarity. Improves EWC by deriving a new online method so parameters don’t increase linearly with the number of tasks. Desiderata for a continual learning solution: - A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks. - It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance. - It should be scalable, that is, the method should be trainable on a large number of tasks. - It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant. - Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries. Experiments: - Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset. - Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pac-man”). - 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017). |

19 dubious ways to compute the marginal likelihood of a phylogenetic tree topology

Mathieu Fourment and Andrew F. Magee and Chris Whidden and Arman Bilge and Frederick A. Matsen IV and Vladimir N. Minin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: q-bio.PE, stat.CO

**First published:** 2018/11/28 (5 years ago)

**Abstract:** The marginal likelihood of a model is a key quantity for assessing the
evidence provided by the data in support of a model. The marginal likelihood is
the normalizing constant for the posterior density, obtained by integrating the
product of the likelihood and the prior with respect to model parameters. Thus,
the computational burden of computing the marginal likelihood scales with the
dimension of the parameter space. In phylogenetics, where we work with tree
topologies that are high-dimensional models, standard approaches to computing
marginal likelihoods are very slow. Here we study methods to quickly compute
the marginal likelihood of a single fixed tree topology. We benchmark the speed
and accuracy of 19 different methods to compute the marginal likelihood of
phylogenetic topologies on a suite of real datasets. These methods include
several new ones that we develop explicitly to solve this problem, as well as
existing algorithms that we apply to phylogenetic models for the first time.
Altogether, our results show that the accuracy of these methods varies widely,
and that accuracy does not necessarily correlate with computational burden. Our
newly developed methods are orders of magnitude faster than standard
approaches, and in some cases, their accuracy rivals the best established
estimators.
more
less

Mathieu Fourment and Andrew F. Magee and Chris Whidden and Arman Bilge and Frederick A. Matsen IV and Vladimir N. Minin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: q-bio.PE, stat.CO

[link]
This paper compares methods to calculate the marginal likelihood, $p(D | \tau)$, when you have a tree topology $\tau$ and some data $D$ and you need to marginalise over the possible branch lengths $\mathbf{\theta}$ in the process of Bayesian inference. In other words, solving the following integral: $$ \int_{ [ 0, \infty ]^{2S - 3} } p(D | \mathbf{\theta}, \tau ) p( \mathbf{\theta} | \tau) d \mathbf{\theta} $$ There are some details about this problem that are common to phylogenetic problems, such as an exponential prior on the branch lengths, but otherwise this is the common problem of approximate Bayesian inference. This paper compares the following methods: * ELBO (appears to be [BBVI][]) * Gamma Laplus Importance Sampling * Variational Bayes Importance Sampling * Beta' Laplus * Gamma Laplus * Maximum un-normalized posterior probability * Maximum likelihood * Naive Monte Carlo * Bridge Sampling * Conditional Predictive Ordinates * Harmonic Mean * Stabilized Harmonic Mean * Nested Sampling * Pointwise Predictive Density * Path Sampling * Modified Path Sampling * Stepping Stone * Generalized Stepping Stone I leave the in depth description of each algorithm to the paper and appendices, although it's worth mentioning that Laplus is a Laplace approximation where the approximating distribution is constrained to be positive. Some takeaways from the empirical results: * If runtime is not a concern power posterior methods are preferred: > The power posterior methods remain the best general-purpose tools for phylogenetic modelcomparisons, though they are certainly too slow to explore the tree space produced by PT. * Bridge sampling is the next choice, if you need something faster. * Harmonic Mean is a bad estimator for phylogenetic tree problems. * Gamma Laplus is a good fast option. * Naive Monte Carlo is a poor estimator, which is probably to be expected. * Gamma Laplus is the best option for very fast algorithms: > Empirical posterior distributions on branch lengths are clearly not point-masses, and yet simply normalizing the unnormalized posterior at the maximum outperforms 6 of the 19 tested methods. All methods were compared on metrics important to phylogenetic inference, such as *average standard deviation of split frequencies" (ASDSF), which is typically used to confirm whether parallel MCMC chains are sampling from the same distribution over tree topologies. Methods were also compared on KL divergence to the true posterior and RMSD (appears to be the mean squared error between CDFs?). [bbvi]: https://arxiv.org/abs/1401.0118 |

Rao-Blackwellized Stochastic Gradients for Discrete Distributions

Runjing Liu and Jeffrey Regier and Nilesh Tripuraneni and Michael I. Jordan and Jon McAuliffe

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/10 (5 years ago)

**Abstract:** We wish to compute the gradient of an expectation over a finite or countably
infinite sample space having $K \leq \infty$ categories. When $K$ is indeed
infinite, or finite but very large, the relevant summation is intractable.
Accordingly, various stochastic gradient estimators have been proposed. In this
paper, we describe a technique that can be applied to reduce the variance of
any such estimator, without changing its bias---in particular, unbiasedness is
retained. We show that our technique is an instance of Rao-Blackwellization,
and we demonstrate the improvement it yields on a semi-supervised
classification problem and a pixel attention task.
more
less

Runjing Liu and Jeffrey Regier and Nilesh Tripuraneni and Michael I. Jordan and Jon McAuliffe

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
This paper approaches the problem of optimizing parameters of a discrete distribution with respect to some loss function that is an expectation over that distribution. In other words, an experiment will probably be a variational autoencoder with discrete latent variables, but there are many real applications: $$ \mathcal{L} (\eta) : = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] $$ Using the [product rule of differentiation][product] the derivative of this loss function can be computed by enumerating all $1 \to K$ possible values of $z$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \nabla_\eta \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) $$ This expectation can also be expressed as the score function estimator (aka the REINFORCE estimator): $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \sum_{k=1}^{K} \left(f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k)\right)\frac{q_\eta (k)}{q_\eta (k)} \\ = \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \nabla_\eta \log q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_\eta (k) \nabla_\eta \log q_\eta (k) + \nabla_\eta f_\eta (k) \right] \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] $$ In other words, both can be referred to as estimators $g(z)$. The authors note that this can be calculated over a subset of the $k$ most probable states (overloading their $k$ from possible values of $z$). Call this set $C_k$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1}\{ z \in C_k\} + g(z) \mathbb{1} \{ z \notin C_k \} \right] \\ = \sum_{z \in C_k} q_\eta(z) g(z) + \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1} \{ z \notin C_k \} \right] $$ As long as $k$ is small, it's easy to calculate the first term, and if most of the probability mass is contained in that set, then it shouldn't matter how well we approximate the second term. The authors choose an importance-sampling for the second term, but this is where I get confused. They denote their importance weighting function $q_\eta (z \notin C_k)$ which *could* mean all of the probability mass *not* under the states in $C_k$? Later, they define a decision variable $b$ that expresses whether we are in this set or not, and it's sampled with probability $q_\eta (z \notin C_k)$, so I think my interpretation is correct. The gradient estimator then becomes: $$ \hat{g} (v) = \sum_{z \in C_k} q_\eta (z) g(z) + q_\eta (z \notin C_k) g(v)\\ v \sim q_\eta | v \notin C_k $$ [product]: https://en.wikipedia.org/wiki/Product_rule Showing this is Rao-Blackwellization ---------------------------------------------- Another way to express $z$ would be to sample a Bernoulli r.v. with probability $\sum_{j \notin C_k} q_\eta (j) $, then if it's $1$ sample from $z \in C_k$ and if it's $0$ sample from $z \notin C_k$. As long as those samples are drawn using $q_\eta$ then: $$ T(u,v,b) \stackrel{d}{=} z \\ T := u^{1-b} v^b $$ where $u \sim q_\eta | C_k$, $v \sim q_\eta | v \notin C_k$ and $b \sim \text{Bernoulli}(\sum_{j \notin C_k} q_\eta (j))$. Expressing $z$ in this way means the gradient estimator from before can be written as: $$ \hat{g} (v) = \mathbb{E} \left[ g( T(u,v,b) ) | v \right] $$ And they left it as an exercise for the reader to expand that out and show it's the same as equation 6: $$ \mathbb{E} \left[ g( T(u,v,b) ) | v \right] = \mathbb{E} \left[ g( T(u,v,b)) \mathbb{1} \{ b=0 \} + g( T(u,v,b)) \mathbb{1} \{ b=1 \} \right] \\ = \mathbb{E} \left[ g(z) \mathbb{1} \{ z \in C_k \} + g( z) \mathbb{1} \{ z \notin C_k \} \right] = \mathbb{E} \left[ g(z) \right] $$ Writing the estimator as a conditional expectation of some statistic of the random variables under the distribution is sufficient to show that this is an instance of Rao-Blackwellization. To be safe, the authors also apply the [conditional variance decomposition][eve] to reinforce the property that RB estimators always have lower variance: $$ Var(Y) = E\left[ Var (Y|X) \right] + Var(E \left[ Y | X \right] ) \\ Var(g (z) ) = Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) + E \left[ Var ( g( T(u,v,b) ) | v ) \right] \\ Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) = Var (\hat{g} (v) ) = Var(g (z) ) - E \left[ Var ( g( T(u,v,b) ) | v ) \right] $$ They go on to show that the variance is less than or equal to $Var(g(z)) \sum_{j \notin C_k} q_\eta (j)$. Finally, they note that the variance of a simple estimator can also be reduced by taking multiple samples and averaging. They then provide an equation to calculate the optimal $k$ number of elements of $z$ to evaluate depending on how concentrated the distribution being evaluated is, and a proof showing that this will have a lower variance than the naive estimator. $$ \hat{k} = \underset{k \in {0, ..., N}}{\operatorname{argmin}} \frac{\sum_{j \notin C_k} q_\eta (j)}{N-k} $$ I'm not very interested in the experiments right now, but skimming through them it's interesting to see that this method performs very well on a high dimensional hard attention task on MNIST. Particularly because a Gumbel-softmax estimator falls apart in the same experiment. It would be nice to see results on RL problems as were shown in the [RELAX][] paper. [eve]: https://en.wikipedia.org/wiki/Law_of_total_variance [relax]: https://arxiv.org/abs/1711.00123 |

MultiPoseNet: Fast Multi-Person Pose Estimation Using Pose Residual Network

Muhammed Kocabas and Salih Karagoz and Emre Akbas

Computer Vision – ECCV 2018 - 2018 via Local CrossRef

Keywords:

Muhammed Kocabas and Salih Karagoz and Emre Akbas

Computer Vision – ECCV 2018 - 2018 via Local CrossRef

Keywords:

[link]
The method is a multi-task learning model performing person detection, keypoint detection, person segmentation, and pose estimation. It is a bottom-up approach as it first localizes identity-free semantics and then group them into instances. https://i.imgur.com/kRs9687.png Model structure: - **Backbone**. A feature extractor is presented by ResNet-(50 or 101) with one [Feature Pyramid Network](https://arxiv.org/pdf/1612.03144.pdf) (FPN) for keypoint branch and one for person detection branch. FPN enhances extracted features through multi-level representation. - **Keypoint detection** detects keypoints as well as produces a pixel-level segmentation mask. https://i.imgur.com/XFAi3ga.png FPN features $K_i$ are processed with multiple $3\times3$ convolutions followed by concatenation and final $1\times1$ convolution to obtain predictions for each keypoint, as well as segmentation mask (see Figure for details). This results in #keypoints_in_dataset_per_person + 1 output layers. Additionally, intermediate supervision (i.e. loss) is applied at the FPN outputs. $L_2$ loss between predictions and Gaussian peaks at the keypoint locations is used. Similarly, $L_2$ loss is applied for segmentation predictions and corresponding ground truth masks. - **Person detection** is essentially a [RetinaNet](https://arxiv.org/pdf/1708.02002.pdf), a one-stage object detector, modified to only handle *person* class. - **Pose estimation**. Given initial keypoint predictions, Pose Estimation Network (PRN) selects a single keypoint for each class. https://i.imgur.com/k8wNP5p.png During inference, PRN takes cropped outputs from keypoint detection branch defined by the predicted bounding boxes from the person detection branch, resizes it to a fixed size, and forwards it through a multilayer perceptron with residual connection. During the training, the same process is performed, except the cropped keypoints come from the ground truth annotation defined by a labeled bounding box. This model is not an end-to-end trainable model. While keypoint and person detection branches can, in theory, be trained simultaneously, PRN network requires separate training. **Personal note**. Interestingly, PRN training with ground truth inputs (i.e. "perfect" inputs) only reaches 89.4 mAP validation score which is surprisingly quite far from the max possible score. This presumably means that even if preceding networks or branches perform god-like, the PRN might become a bottleneck in the performance. Therefore, more efforts should be directed to PRN itself. Moreover, modifying the network to support end-to-end training might help in boosting the performance. Open-source implementations used to make sure the paper apprehension is correct: [link1](https://github.com/LiMeng95/MultiPoseNet.pytorch), [link2](https://github.com/IcewineChen/pytorch-MultiPoseNet). |

WAIC, but Why? Generative Ensembles for Robust Anomaly Detection

Hyunsun Choi and Eric Jang and Alexander A. Alemi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/02 (5 years ago)

**Abstract:** Machine learning models encounter Out-of-Distribution (OoD) errors when the
data seen at test time are generated from a different stochastic generator than
the one used to generate the training data. One proposal to scale OoD detection
to high-dimensional data is to learn a tractable likelihood approximation of
the training distribution, and use it to reject unlikely inputs. However,
likelihood models on natural data are themselves susceptible to OoD errors, and
even assign large likelihoods to samples from other datasets. To mitigate this
problem, we propose Generative Ensembles, which robustify density-based OoD
detection by way of estimating epistemic uncertainty of the likelihood model.
We present a puzzling observation in need of an explanation -- although
likelihood measures cannot account for the typical set of a distribution, and
therefore should not be suitable on their own for OoD detection, WAIC performs
surprisingly well in practice.
more
less

Hyunsun Choi and Eric Jang and Alexander A. Alemi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
### Summary Knowing when a model is qualified to make a prediction is critical to safe deployment of ML technology. Model-independent / Unsupervised Out-of-Distribution (OoD) detection is appealing mostly because it doesn't require task-specific labels to train. It is tempting to suggest a simple one-tailed test in which lower likelihoods are OoD (assigned by a Likelihood Model), but the intuition that In-Distribution (ID) inputs should have highest likelihoods _does not hold in higher dimension_. The authors propose to use the Watanabe-Akaike Information Criterion (WAIC) to circumvent this problem and empirically show the robustness of the approach. ### Counterintuitive Properties of Likelihood Models: https://i.imgur.com/4vo0Ff5.png So a GLOW model with Gaussian prior maps SVHN closer to the origin than Cifar (but never actually generates SVHN because Gaussian samples are on the shell). This is bad news for OoD detection. ### Proposed Methodology: Use the WAIC criterion for OoD detection which gives an asymptotically correct estimate of the gap between the training set and test set expectations: https://i.imgur.com/vasSxuk.png Basically, the correction term subtracts the variance in likelihoods across independent samples from the posterior. This acts to robustify the estimate, ensuring that points that are sensitive to the particular choice of posterior are penalized. They use an ensemble of generative models as a proxy for posterior samples i.e. the ensembles acts as approximate posterior samples. Now, OoD can be detected with a Likelihood Model: https://i.imgur.com/M3CDKOA.png ### Discussion Interestingly, GLOW maps Cifar and other datasets INSIDE the gaussian shell (which is an annulus of radius $\sqrt{dim} = \sqrt{3072} \approx 55.4$ https://i.imgur.com/ERdgOaz.png This is in itself quite disturbing, as it suggests that better flow-based generative models (for sampling) can be obtained by encouraging the training distribution to overlap better with the typical set in latent space. |

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

Conference and Computer Vision and Pattern Recognition - 2018 via Local CrossRef

Keywords:

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

Conference and Computer Vision and Pattern Recognition - 2018 via Local CrossRef

Keywords:

[link]
# Summary This paper presents state-of-the-art methods for both caption generation of images and visual question answering (VQA). The authors build on previous methods by adding what they call a "bottom-up" approach to previous "top-down" attention mechanisms. They show that using their approach they obtain SOTA on both Image captioning (MSCOCO) and the Visual Question and Answering (2017 VQA challenge). They propose a specific network configurations for each. Their biggest contribution is using Faster-R-CNN to retrieve the "important" parts of an image to focus on in both models. ## Top Down Up until this paper, the traditional approach was to use a "top-down" approach, in which the last feature map layer of a CNN is used to obtain a latent representation of the given input image. These features, along with the context of the caption being generated, were used to generate attention weights that were used to predict the next sequence in the context of caption generation. The network would learn to focus its attention on regions of the feature map that matters most. This is the approach used in previous SOTA methods like [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). ## Bottom-up The authors argue that the feature map of a CNN is too generic and can be thought of operating on a uniform, grid-like feature map. In other words, there is no particular reason to think that the feature map of generated by a CNN would give optimal regions to attend to. Also, carefully choosing the dimensions of the feature map can be very arbitrary. In order to fix this, the authors propose combining object detection methods in a *bottom-up* approach. To do so, the authors propose using Faster-R-CNN to identify regions of interest in an image. Given an input image, Faster-R-CNN will identify bounding boxes of the image that likely correspond to objects of a given category and simultaneously compute a feature vector of that bounding box. Figure 1 shows the difference between the Bottom-up and Top-Down approach. ![image](https://user-images.githubusercontent.com/18450628/61817263-2683cd00-ae1c-11e9-971a-d3b531dbbd98.png) ## Combining the two In this paper, the authors suggest using the bottom-up approach to compute the salient regions of the image the network should focus on using Faster-R-CNN. FRCNN is carefully pretrained on both imagenet and the Visual Genome dataset. It is then frozen and only used to generate bounding boxes of regions with high confidence of being of interest. The top-down approach is then used on the features obtained from the bottom-up approach. In order to "enhance" the FRCNN performance, they initialize their FRCNN with a ResNet-101 pre-trained on imagenet. They train their FRCNN on the Visual Genome dataset, adding attributes to the loss function that are available from the Visual Genome dataset, attributes such as color (black, white, gold etc.), state (open, close, dark, bright, etc.). A sample of FRCNN outputs are shown in figure 2. It is important to stress that only the feature representations and not the actual outputs (i.e. not the labels) are used in their model. ![image](https://user-images.githubusercontent.com/18450628/61817487-aca01380-ae1c-11e9-90fa-134033b95bb0.png) ## Caption Generation Figure 3 provides a high-level overview of the model being used for caption generation for images. The image is first passed through FRCNN which produces a set of image features *V*. In their specific implementation, *V* consists of *k* vectors of size 1x2048. Their model consists of two LSTM blocks, one for attention and the other for language generation. ![image](https://user-images.githubusercontent.com/18450628/61818488-effb8180-ae1e-11e9-8ae4-14355115429a.png) The first block of their model is a Top-Down Attention LSTM layer. It takes as input the mean-pooled features *V* , i.e. 1/k*sum(v_i), concatenated with the previous timestep's hidden representation of the language LSTM as well as the word embedding of the previously generated word. The word embedding is learned and not pretrained. The output of the first LSTM is used to compute the attention for each vector using an MLP and softmax: ![image](https://user-images.githubusercontent.com/18450628/61819982-21298100-ae22-11e9-80a9-99640896413d.png) The attention weighted image feature is then used as an input to the language LSTM model, concatenated with the output from the top-down Attention LSTM and a softmax is used to predict the next word in the sequence. The loss function seeks to minimize the cross-entropy of the generated sentence. ## VQA Model The VQA task differs to the image generation in that a text-based question accompanies an input image and the network must produce an answer. The VQA model proposed is different to that of the caption generation model previously described, however they both use the same bottom-up approach to generate the feature vectors of the image based on the FRCNN architecture. A high-level overview of the architecture for the VQA model is presented in Figure 4. ![image](https://user-images.githubusercontent.com/18450628/61821988-8da67f00-ae26-11e9-8456-3c9e5ec60787.png) Each word from the question is converted to a learned word embedding which is used as input to a GRU. The number of words for each question is limited to 14 for computational efficiency. The output from the GRU is concatenated with each of the *k* image features, and attention weights are computed for each *k*th feature using an MLP and softmax, similar to what is done in the attention for caption generation. The weighted sum of the feature vectors is then passed through an linear layer such that its shape is compatible with the gru output, and the Hadamard product (element-wise product) is computed over the GRU output and attention-weighted image feature representation. Finally, a tanh non-linear activation is used. This results in a "gated tanh", which have been shown empirically to outperform both ReLU and tanh. Finally, a softmax probability distribution is generated at the output which selects a candidate answer among all possible candidate answers. ## Results and experiments ### Resnet Baseline To demonstrate that their contribution of bottom-up mechanism actually improves on results, the authors use a ResNet trained on imagenet as a baseline for generating the image feature vectors (they resize the final CNN layers using bilinear interpolation when needed). They consistently obtain better results when using the bottom-up approach over the ResNet approach in both caption generation and VQA. ## MSCOCO The authors demonstrate that they outperform all results on all metrics on the MSCOCO test server. ![image](https://user-images.githubusercontent.com/18450628/61824157-4f5f8e80-ae2b-11e9-8d90-657db453e26e.png) They also show how using the bottom-up approach over ResNet consistently scores them higher on detecting instances of objects, attributes, relations, etc: ![image](https://user-images.githubusercontent.com/18450628/61824238-7fa72d00-ae2b-11e9-81b3-b5a7f80153f3.png) The authors, like their predecessors, insist on demonstrating their network's frisbee ability: ![image](https://user-images.githubusercontent.com/18450628/61824344-bed57e00-ae2b-11e9-87cd-597568587e1d.png) ## VQA Results They also demonstrate that the addition of bottom-up attention improves results over a ResNet baseline. ![image](https://user-images.githubusercontent.com/18450628/61824500-28ee2300-ae2c-11e9-9016-2120a91917e4.png) They also show that their model outperformed all other submissions on the VQA submission. They mention using an ensemble of 30 models for their submission. ![image](https://user-images.githubusercontent.com/18450628/61824634-83877f00-ae2c-11e9-8d84-9589e0ea2be2.png) A sample image of what is attended in an image given a proper answer is shown in figure 6. ![image](https://user-images.githubusercontent.com/18450628/61824608-736f9f80-ae2c-11e9-9d4e-8cb6bd0a1a92.png) # Comments The authors introduce a new way to select portions of the image on which to focus attention. The idea is very original and came at a time when object detection was making significant progress (i.e. FRCNN). A few comments: * This method might not generalize well to other types of data. It requires pre-training on larger datasets (visual genome, imagenet, etc.) which consist of categories that overlap with both the MSCOCO and VQA datasets (i.e. cars, people, etc.). It would be interesting to see an end-to-end model that does not rely on pre-training on other similar datasets. * No insight is given to the computational complexity nor to the inference time or training time. I imagine that FRCNN is resource intensive, and having to do a forward pass of FRCNN for every pass of the network must be a computational bottleneck. Not to mention that they ensembled 30 of them! |

Sanity Checks for Saliency Maps

Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian J. and Hardt, Moritz and Kim, Been

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

Adebayo, Julius and Gilmer, Justin and Muelly, Michael and Goodfellow, Ian J. and Hardt, Moritz and Kim, Been

Neural Information Processing Systems Conference - 2018 via Local Bibsonomy

Keywords: dblp

[link]
The paper designs some basic tests to compare saliency methods. It founds that some of the most popular methods are independent of model parameters and the data, meaning they are effectively useless. ## Methods compared The paper compare the following methods: gradient explanation, gradient x input, integrated gradients, guided backprop, guided GradCam and SmoothGrad. They provide a refresher on those methods in the appendix. All those methods can be put in the same framework. They require a classification model and an input (typically an image). The output of the method is an *explanation map* of the shape of the input where a higher value for a feature implies greater relevance in the decision of the model. ## Metrics of comparison The authors argue that visual inspection of the saliency maps can be misleading. They propose to compute the Spearman rank correlation, the structural similarity index (SSMI) and the Pearson correlation of the histogram of gradients. The authors point out that those metrics capture various notions of similarity, but it is an active area of research and those metrics are imperfect. ## First test: model parameters randomization A saliency method must be dependent of model parameters, otherwise it cannot help us understand a model. In this test, the authors randomize the model parameters, layer per layer, starting from the top. Surprisingly, methods such as guided backprop and guided gradcam are completely insensitive to model parameters, as illustrated on this Inception v3 trained on ImageNet: ![image](https://user-images.githubusercontent.com/8659132/61403152-b10b8000-a8a2-11e9-9f6a-cf1ed6a876cc.png) Integrated gradients looks also dubious as the bird is still visible with a mostly fully randomized model, but the quantitative metrics reveal the difference is actually big between the two models. ## Second test: data randomization It is well-known that randomly shuffling the labels of a dataset does not prevent a neural network from getting a high accuracy on the training set, though it does prevent generalization. The model is able to learn by either memorizing the data or finding spurious patterns. As a result, saliency maps obtained from such a network should have no clearly interpretable signal. Here is the result for a ConvNet trained on MNIST and a shuffled MNIST: ![image](https://user-images.githubusercontent.com/8659132/61406757-7efe1c00-a8aa-11e9-9826-a859a373cb4f.png) The results are very damning for most methods. Only gradients and GradCam are very different between both models, as confirmed by the low correlation. ## Discussion - Even though some methods do no depend on model parameters and data, they might still depend on the architecture of the models, which could be of some use in some contexts. - Methods that multiply the input with the gradient are dominated by the input. - Complex saliency methods are just fancy edge detectors. - Only gradient, smooth gradient and GradCam survives the sanity checks. # Comments - Why is their GradCam maps so ugly? They don't look like usual GradCam maps at all. - Their tests are simple enough that it's hard to defend a method that doesn't pass them. - The methods that are left are not very good either. They give fuzzy maps that are difficult to interpret. - In the case of integrated gradients (IG), I'm not convinced this is sufficient to discard the method. IG requires a "baseline input" that represents the absence of features. In the case of images, people usually just set the image to 0, which is not at all the absence of a feature. The authors also use the "set the image to 0" strategy, and I'd say their tests are damning for this strategy, not for IG in general. I'd expect an estimation of the baseline such as done in [this paper](https://arxiv.org/abs/1702.04595) would be a fairer evaluation of IG. Code: [GitHub](https://github.com/adebayoj/sanity_checks_saliency) (not available as of 17/07/19) |

Second-Order Adversarial Attack and Certifiable Robustness

Li, Bai and Chen, Changyou and Wang, Wenlin and Carin, Lawrence

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Li, Bai and Chen, Changyou and Wang, Wenlin and Carin, Lawrence

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Li et al. propose an adversarial attack motivated by second-order optimization and uses input randomization as defense. Based on a Taylor expansion, the optimal adversarial perturbation should be aligned with the dominant eigenvector of the Hessian matrix of the loss. As the eigenvectors of the Hessian cannot be computed efficiently, the authors propose an approximation; this is mainly based on evaluating the gradient under Gaussian noise. The gradient is then normalized before taking a projected gradient step. As defense, the authors inject random noise on the input (clean example or adversarial example) and compute the average prediction over multiple iterations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Certified Robustness to Adversarial Examples with Differential Privacy

Mathias Lecuyer and Vaggelis Atlidakis and Roxana Geambasu and Daniel Hsu and Suman Jana

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.CR, cs.LG

**First published:** 2018/02/09 (6 years ago)

**Abstract:** Adversarial examples that fool machine learning models, particularly deep
neural networks, have been a topic of intense research interest, with attacks
and defenses being developed in a tight back-and-forth. Most past defenses are
best effort and have been shown to be vulnerable to sophisticated attacks.
Recently a set of certified defenses have been introduced, which provide
guarantees of robustness to norm-bounded attacks, but they either do not scale
to large datasets or are limited in the types of models they can support. This
paper presents the first certified defense that both scales to large networks
and datasets (such as Google's Inception network for ImageNet) and applies
broadly to arbitrary model types. Our defense, called PixelDP, is based on a
novel connection between robustness against adversarial examples and
differential privacy, a cryptographically-inspired formalism, that provides a
rigorous, generic, and flexible foundation for defense.
more
less

Mathias Lecuyer and Vaggelis Atlidakis and Roxana Geambasu and Daniel Hsu and Suman Jana

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.CR, cs.LG

[link]
Lecuyer et al. propose a defense against adversarial examples based on differential privacy. Their main insight is that a differential private algorithm is also robust to slight perturbations. In practice, this amounts to injecting noise in some layer (or on the image directly) and using Monte Carlo estimation for computing the expected prediction. The approach is compared to adversarial training against the Carlini+Wagner attack. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models

Sven Gowal and Krishnamurthy Dvijotham and Robert Stanforth and Rudy Bunel and Chongli Qin and Jonathan Uesato and Relja Arandjelovic and Timothy Mann and Pushmeet Kohli

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

**First published:** 2018/10/30 (5 years ago)

**Abstract:** Recent work has shown that it is possible to train deep neural networks that
are verifiably robust to norm-bounded adversarial perturbations. Most of these
methods are based on minimizing an upper bound on the worst-case loss over all
possible adversarial perturbations. While these techniques show promise, they
remain hard to scale to larger networks. Through a comprehensive analysis, we
show how a careful implementation of a simple bounding technique, interval
bound propagation (IBP), can be exploited to train verifiably robust neural
networks that beat the state-of-the-art in verified accuracy. While the upper
bound computed by IBP can be quite weak for general networks, we demonstrate
that an appropriate loss and choice of hyper-parameters allows the network to
adapt such that the IBP bound is tight. This results in a fast and stable
learning algorithm that outperforms more sophisticated methods and achieves
state-of-the-art results on MNIST, CIFAR-10 and SVHN. It also allows us to
obtain the first verifiably robust model on a downscaled version of ImageNet.
more
less

Sven Gowal and Krishnamurthy Dvijotham and Robert Stanforth and Rudy Bunel and Chongli Qin and Jonathan Uesato and Relja Arandjelovic and Timothy Mann and Pushmeet Kohli

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

[link]
Gowal et al. propose interval bound propagation to obtain certified robustness against adversarial examples. In particular, given a neural network consisting of linear layers and monotonic increasing activation functions, a set of allowed perturbations is propagated to obtain upper and lower bounds at each layer. These lead to bounds on the logits of the network; these are used to verify whether the network changes its prediction on the allowed perturbations. Specifically, Gowal et al. consider an $L_\infty$ ball around input examples; the initial bounds are, thus, $\underline{z}_0 = x - \epsilon$ and $\overline{z}_0 = x + \epsilon$. For each layer, the bounds are defined as $\underline{z}_{k,i} = \min_{\underline{z}_{k – 1} \leq z_{k – 1} \leq \overline{z}_{k-1}} e_i^T h_k(z_{k – 1})$ and the analogous maximization problem for the upper bound; here, $h$ denotes the applied layer. For Linear layers and monotonic activation functions, this is easy to solve, as shown in the paper. Moreover, computing these bounds is very efficient, only needing roughly two times the computation of one forward pass. During training, a combination of a clean loss and adversarial loss is used: $\kappa l(z_K, y) + (1 - \kappa) l(\hat{z}_K, y)$ where $z_K$ are the logits of the input $x$, and $\hat{z}_K$ are the adversarial logits computed as $\hat{Z}_{K,y’} = \begin{cases} \overline{z}_{K,y’} & \text{if } y’ \neq y\\\underline{z}_{K,y} & \text{otherwise}\end{cases}$ Both $\epsilon$ and $\kappa$ are annealed during training. In experiments, it is shown that this method results in quite tight bounds on robustness. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Deep-RBF Networks Revisited: Robust Classification with Rejection

Zadeh, Pourya Habib and Hosseini, Reshad and Sra, Suvrit

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Zadeh, Pourya Habib and Hosseini, Reshad and Sra, Suvrit

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zadeh et al. propose a layer similar to radial basis functions (RBFs) to increase a network’s robustness against adversarial examples by rejection. Based on a deep feature extractor, the RBF units compute $d_k(x) = \|A_k^Tx + b_k\|_p^p$ with parameters $A$ and $b$. The decision rule remains unchanged, but the output does not resemble probabilities anymore. The full network, i.e., feature extractor and RBF layer, is trained using an adapted loss that resembles a max margin loss: $J = \sum_i (d_{y_i}(x_i) + \sum_{j \neq y_i} \max(0, \lambda – d_j(x_i)))$ where $(x_i, y_i)$ is a training examples including label. The loss essentially minimizes the output corresponding to the true class while maximizing the output for all other classes up to a specified margin. Additionally, noise examples are injected during training. For these noise examples, $\sum_j \max(0, \lambda – d_j(x))$ is maximized to enforce that these examples are treated as negatives in a rejection setting where samples not corresponding to the data distribution (or adversarial examples) can be rejected by the model. In experiments, the proposed method seems to be more robust against FGSM and iterative attacks (as evaluated on Foolbox). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Neural Networks with Structural Resistance to Adversarial Attacks

Luca de Alfaro

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG, cs.NE

**First published:** 2018/09/25 (5 years ago)

**Abstract:** In adversarial attacks to machine-learning classifiers, small perturbations
are added to input that is correctly classified. The perturbations yield
adversarial examples, which are virtually indistinguishable from the
unperturbed input, and yet are misclassified. In standard neural networks used
for deep learning, attackers can craft adversarial examples from most input to
cause a misclassification of their choice.
We introduce a new type of network units, called RBFI units, whose non-linear
structure makes them inherently resistant to adversarial attacks. On
permutation-invariant MNIST, in absence of adversarial attacks, networks using
RBFI units match the performance of networks using sigmoid units, and are
slightly below the accuracy of networks with ReLU units. When subjected to
adversarial attacks, networks with RBFI units retain accuracies above 90% for
attacks that degrade the accuracy of networks with ReLU or sigmoid units to
below 2%. RBFI networks trained with regular input are superior in their
resistance to adversarial attacks even to ReLU and sigmoid networks trained
with the help of adversarial examples.
The non-linear structure of RBFI units makes them difficult to train using
standard gradient descent. We show that networks of RBFI units can be
efficiently trained to high accuracies using pseudogradients, computed using
functions especially crafted to facilitate learning instead of their true
derivatives. We show that the use of pseudogradients makes training deep RBFI
networks practical, and we compare several structural alternatives of RBFI
networks for their accuracy.
more
less

Luca de Alfaro

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG, cs.NE

[link]
De Alfaro proposes a deep radial basis function (RBF) network to obtain robustness against adversarial examples. In contrast to “regular” RBF networks, which usually consist of only one hidden layer containing RBF units, de Alfaro proposes to stack multiple layers with RBF units. Specifically, a Gaussian unit utilizing the $L_\infty$ norm is used: $\exp\left( - \max_i(u_i(x_i – w_i))^2\right)$ where $u_i$ and $w_i$ are parameters and $x_i$ are the inputs to the unit – so the network inputs or the outputs of the previous hidden layer. This unit can be understood as computing a soft AND operation; therefore, an alternative OR operation $1 - \exp\left( - \max_i(u_i(x_i – w_i))^2\right)$ is used as well. These two units are used alternatingly in hidden layers in the conducted experiments. Based on these units, de Alfaro argues that the model is less sensitive to adversarial examples, compared to linear operations as commonly used in ReLU networks. For training a deep RBF-network, pseudo gradients are used for both the maximum operation and the exponential function. This is done for simplifying training; I refer to the paper for details. In their experiments, on MNIST, a multi-layer perceptron with the proposed RBF units is used. The network consists of 512 AND units, 512 OR units, 512 AND units and finally 10 OR units. Robustness against FGSM and I-FGSM as well as PGD attacks seems to improve. However, the used PGD attack seems to be weaker than usually, it does not manage to reduce adversarial accuracy of a normal networks to near-zero. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples

Lee, Kimin and Lee, Honglak and Lee, Kibok and Shin, Jinwoo

International Conference on Learning Representations - 2018 via Local Bibsonomy

Keywords: dblp

Lee, Kimin and Lee, Honglak and Lee, Kibok and Shin, Jinwoo

International Conference on Learning Representations - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Lee et al. propose a generative model for obtaining confidence-calibrated classifiers. Neural networks are known to be overconfident in their predictions – not only on examples from the task’s data distribution, but also on other examples taken from different distributions. The authors propose a GAN-based approach to force the classifier to predict uniform predictions on examples not taken from the data distribution. In particular, in addition to the target classifier, a generator and a discriminator are introduced. The generator generates “hard” out-of-distribution examples; ideally these examples are close to the in-distribution, i.e., the data distribution of the actual task. The discriminator is intended to distinguish between out- and in-distribution. The overall algorithm, including the necessary losses, is given in Algorithm 1. In experiments, the approach is shown to allow detecting out-distribution examples nearly perfectly. Examples of the generated “hard” out-of-distribution samples are given in Figure 1. https://i.imgur.com/NmF0fpN.png Algorithm 1: The proposed joint training scheme of out-distribution generator $G$, the in-/out-distribution discriminator $G$ and the original classifier providing $P_\theta$(y|x)$ with parameters $\theta$. https://i.imgur.com/kAclSQz.png Figure 1: A comparison of a regular GAN (a and c) to the proposed framework (c and d). Clearly, the proposed approach generates out-of-distribution samples (i.e., no meaningful digits) close to the original data distribution. |

On the importance of single directions for generalization

Ari S. Morcos and David G. T. Barrett and Neil C. Rabinowitz and Matthew Botvinick

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG, cs.NE

**First published:** 2018/03/19 (6 years ago)

**Abstract:** Despite their ability to memorize large datasets, deep neural networks often
achieve good generalization performance. However, the differences between the
learned solutions of networks which generalize and those which do not remain
unclear. Additionally, the tuning properties of single directions (defined as
the activation of a single unit or some linear combination of units in response
to some input) have been highlighted, but their importance has not been
evaluated. Here, we connect these lines of inquiry to demonstrate that a
network's reliance on single directions is a good predictor of its
generalization performance, across networks trained on datasets with different
fractions of corrupted labels, across ensembles of networks trained on datasets
with unmodified labels, across different hyperparameters, and over the course
of training. While dropout only regularizes this quantity up to a point, batch
normalization implicitly discourages single direction reliance, in part by
decreasing the class selectivity of individual units. Finally, we find that
class selectivity is a poor predictor of task importance, suggesting not only
that networks which generalize well minimize their dependence on individual
units by reducing their selectivity, but also that individually selective units
may not be necessary for strong network performance.
more
less

Ari S. Morcos and David G. T. Barrett and Neil C. Rabinowitz and Matthew Botvinick

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG, cs.NE

[link]
Morcos et al. study the influence of ablating single units as a proxy to generalization performance. On Cifar10, for example, a 11-layer convolutional network is trained on the clean dataset, as well as on versions of Cifar10 where a fraction of $p$ samples have corrupted labels. In the latter cases, the network is forced to memorize examples, as there is no inherent structure in the labels assignment. Then, it is experimentally shown that these memorizing networks are less robust to setting whole feature maps to zero, i.e., ablating them. This is shown in Figure 1. Based on this result, the authors argue that the area under this ablation curve (AUC) can be used as proxy for generalization performance. For example, early stopping or hyper-parameter selection can be done based on this AUC value. Furthermore, they show that batch normalization discourages networks to rely on these so-called single-directions, i.e., single units or feature maps. Specifically, batch normalization seems to favor units holding information about multiple classes/concepts. https://i.imgur.com/h2JwLUF.png Figure 1: Classification accuracy (y-axis) over the number of units that are ablated (x-axis) for networks trained on Cifar10 with various degrees of corrupted labels. The same experiments (left and right) for MNIST and ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Improving Transferability of Adversarial Examples with Input Diversity

Cihang Xie and Zhishuai Zhang and Yuyin Zhou and Song Bai and Jianyu Wang and Zhou Ren and Alan Yuille

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/03/19 (6 years ago)

**Abstract:** Though CNNs have achieved the state-of-the-art performance on various vision
tasks, they are vulnerable to adversarial examples --- crafted by adding
human-imperceptible perturbations to clean images. However, most of the
existing adversarial attacks only achieve relatively low success rates under
the challenging black-box setting, where the attackers have no knowledge of the
model structure and parameters. To this end, we propose to improve the
transferability of adversarial examples by creating diverse input patterns.
Instead of only using the original images to generate adversarial examples, our
method applies random transformations to the input images at each iteration.
Extensive experiments on ImageNet show that the proposed attack method can
generate adversarial examples that transfer much better to different networks
than existing baselines. By evaluating our method against top defense solutions
and official baselines from NIPS 2017 adversarial competition, the enhanced
attack reaches an average success rate of 73.0%, which outperforms the top-1
attack submission in the NIPS competition by a large margin of 6.6%. We hope
that our proposed attack strategy can serve as a strong benchmark baseline for
evaluating the robustness of networks to adversaries and the effectiveness of
different defense methods in the future. Code is available at
https://github.com/cihangxie/DI-2-FGSM.
more
less

Cihang Xie and Zhishuai Zhang and Yuyin Zhou and Song Bai and Jianyu Wang and Zhou Ren and Alan Yuille

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

[link]
Xie et al. propose to improve the transferability of adversarial examples by computing them based on transformed input images. In particular, they adapt I-FGSM such that, in each iteration, the update is computed on a transformed version of the current image with probability $p$. When, at the same time attacking an ensemble of networks, this is shown to improve transferability. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Beyond Pixel Norm-Balls: Parametric Adversaries using an Analytically Differentiable Renderer

Hsueh-Ti Derek Liu and Michael Tao and Chun-Liang Li and Derek Nowrouzezahrai and Alec Jacobson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, cs.GR, stat.ML

**First published:** 2018/08/08 (6 years ago)

**Abstract:** Many machine learning image classifiers are vulnerable to adversarial
attacks, inputs with perturbations designed to intentionally trigger
misclassification. Current adversarial methods directly alter pixel colors and
evaluate against pixel norm-balls: pixel perturbations smaller than a specified
magnitude, according to a measurement norm. This evaluation, however, has
limited practical utility since perturbations in the pixel space do not
correspond to underlying real-world phenomena of image formation that lead to
them and has no security motivation attached. Pixels in natural images are
measurements of light that has interacted with the geometry of a physical
scene. As such, we propose the direct perturbation of physical parameters that
underly image formation: lighting and geometry. As such, we propose a novel
evaluation measure, parametric norm-balls, by directly perturbing physical
parameters that underly image formation. One enabling contribution we present
is a physically-based differentiable renderer that allows us to propagate pixel
gradients to the parametric space of lighting and geometry. Our approach
enables physically-based adversarial attacks, and our differentiable renderer
leverages models from the interactive rendering literature to balance the
performance and accuracy trade-offs necessary for a memory-efficient and
scalable adversarial data augmentation workflow.
more
less

Hsueh-Ti Derek Liu and Michael Tao and Chun-Liang Li and Derek Nowrouzezahrai and Alec Jacobson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, cs.GR, stat.ML

[link]
Liu et al. propose adversarial attacks on physical parameters of images, which can be manipulated efficiently through differentiable renderer. In particular, they propose adversarial lighting and adversarial geometry; in both cases, an image is assumed to be a function of lighting and geometry, generated by a differentiable renderer. By directly manipulating these latent variables, more realistic looking adversarial examples can be generated for synthetic images as shown in Figure 1. https://i.imgur.com/uh2pj9w.png Figure 1: Comparison of the proposed attack with known attacks applied to large perturbations, $L_\infty \approx 0.82$. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Breaking Transferability of Adversarial Samples with Randomness

Zhou, Yan and Kantarcioglu, Murat and Xi, Bowei

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Zhou, Yan and Kantarcioglu, Murat and Xi, Bowei

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zhou et al. study transferability of adversarial examples against ensembles of randomly perturbed networks. Specifically, they consider randomly perturbing the weights using Gaussian additive noise. Using an ensemble of these perturbed networks, the authors show that transferability of adversarial examples decreases significantly. However, the authors do not consider adapting their attack to this defense scenario. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Cost-Sensitive Robustness against Adversarial Examples

Zhang, Xiao and Evans, David

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Zhang, Xiao and Evans, David

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Thang and Evanse propose cost-sensitive certified robustness where different adversarial examples can be weighted based on their actual impact for the application. Specifically, they consider the certified robustness formulation (and the corresponding training scheme) by Wong and Kolter. This formulation is extended by acknowledging that different adversarial examples have different impact for specific applications; this is formulized through a cost matrix which quantifies which source-target label combinations of adversarial examples are actually harmful. Based on this cost matrix, cost-sensitive certified robustness as well as the corresponding training scheme is proposed and evaluated in experiments. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Black-box Adversarial Attacks with Limited Queries and Information

Andrew Ilyas and Logan Engstrom and Anish Athalye and Jessy Lin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.CR, stat.ML

**First published:** 2018/04/23 (6 years ago)

**Abstract:** Current neural network-based classifiers are susceptible to adversarial
examples even in the black-box setting, where the attacker only has query
access to the model. In practice, the threat model for real-world systems is
often more restrictive than the typical black-box model where the adversary can
observe the full output of the network on arbitrarily many chosen inputs. We
define three realistic threat models that more accurately characterize many
real-world classifiers: the query-limited setting, the partial-information
setting, and the label-only setting. We develop new attacks that fool
classifiers under these more restrictive threat models, where previous methods
would be impractical or ineffective. We demonstrate that our methods are
effective against an ImageNet classifier under our proposed threat models. We
also demonstrate a targeted black-box attack against a commercial classifier,
overcoming the challenges of limited query access, partial information, and
other practical issues to break the Google Cloud Vision API.
more
less

Andrew Ilyas and Logan Engstrom and Anish Athalye and Jessy Lin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.CR, stat.ML

[link]
Ilyas et al. propose three query-efficient black-box adversarial example attacks using distribution-based gradient estimation. In particular, their simplest attacks involves estimating the gradient locally using a search distribution: $ \nabla_x \mathbb{E}_{\pi(\theta|x)} [F(\theta)] = \mathbb{E}_{\pi(\theta|x)} [F(\theta) \nabla_x \log(\pi(\theta|x))]$ where $F(\cdot)$ is a loss function – e.g., using the cross-entropy loss which is maximized to obtain an adversarial example. The above equation, using a Gaussian noise search distribution leads to a simple approximator for the gradient: $\nabla \mathbb{E}[F(\theta)] = \frac{1}{\sigma n} \sum_{i = 1}^n \delta_i F(\theta + \sigma \delta_i)$ where $\sigma$ is the search variance and $\delta_i$ are sampled from a unit Gaussian. This scheme can then be applied as part of the projected gradient descent white-box attacks to obtain adversarial examples. The above attack assumes that the black-box network provides probability outputs in order to compute the loss $F$. In the remainder of the paper, the authors also generalize this approach to the label-only case, where the network only provides the top $k$ labels for each input. In experiments, the attacks is shown to be effective while rarely requiring more than $50$k queries on ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

On the Intriguing Connections of Regularization, Input Gradients and Transferability of Evasion and Poisoning Attacks

Ambra Demontis and Marco Melis and Maura Pintor and Matthew Jagielski and Battista Biggio and Alina Oprea and Cristina Nita-Rotaru and Fabio Roli

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML, 68T10, 68T45

**First published:** 2018/09/08 (5 years ago)

**Abstract:** Transferability captures the ability of an attack against a machine-learning
model to be effective against a different, potentially unknown, model. Studying
transferability of attacks has gained interest in the last years due to the
deployment of cyber-attack detection services based on machine learning. For
these applications of machine learning, service providers avoid disclosing
information about their machine-learning algorithms. As a result, attackers
trying to bypass detection are forced to craft their attacks against a
surrogate model instead of the actual target model used by the service. While
previous work has shown that finding test-time transferable attack samples is
possible, it is not well understood how an attacker may construct adversarial
examples that are likely to transfer against different models, in particular in
the case of training-time poisoning attacks. In this paper, we present the
first empirical analysis aimed to investigate the transferability of both
test-time evasion and training-time poisoning attacks. We provide a unifying,
formal definition of transferability of such attacks and show how it relates to
the input gradients of the surrogate and of the target classification models.
We assess to which extent some of the most well-known machine-learning systems
are vulnerable to transfer attacks, and explain why such attacks succeed (or
not) across different models. To this end, we leverage some interesting
connections highlighted in this work among the adversarial vulnerability of
machine-learning models, their regularization hyperparameters and input
gradients.
more
less

Ambra Demontis and Marco Melis and Maura Pintor and Matthew Jagielski and Battista Biggio and Alina Oprea and Cristina Nita-Rotaru and Fabio Roli

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML, 68T10, 68T45

[link]
Demontis et al. study transferability of adversarial examples and data poisening attacks in the light of the targeted models gradients. In particular, they experimentally validate the following hypotheses: First, susceptibility to these attacks depends on the size of the model’s gradients; the higher the gradient, the smaller is the perturbation needed to increase the loss. Second, the size of the gradient depends on regularization. And third, the cosine between the target model’s gradients and the surrogate model’s gradients (trained to compute transferable attacks) influences vulnerability. These insights hold for both evasion and poisening attacks and are motivated by a simple linear Taylor expansion of the target model’s loss. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Adversarial Dropout for Supervised and Semi-Supervised Learning

Park, Sungrae and Park, Jun-Keon and Shin, Su-Jin and Moon, Il-Chul

AAAI Conference on Artificial Intelligence - 2018 via Local Bibsonomy

Keywords: dblp

Park, Sungrae and Park, Jun-Keon and Shin, Su-Jin and Moon, Il-Chul

AAAI Conference on Artificial Intelligence - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Park et al. introduce adversarial dropout, a variant of adversarial training based on adversarially computing dropout masks. Specifically, instead of training on adversarial examples, the authors propose an efficient method to compute adversarial dropout masks during training. In experiments, this approach seems to improve generalization performance in semi-supervised settings. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks

Liu, Kang and Dolan-Gavitt, Brendan and Garg, Siddharth

Springer RAID - 2018 via Local Bibsonomy

Keywords: dblp

Liu, Kang and Dolan-Gavitt, Brendan and Garg, Siddharth

Springer RAID - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Liu et al. propose fine-pruning, a combination of weight pruning and fine-tuning to defend against backdoor attacks on neural networks. Specifically, they consider a setting where training is outsourced to a machine learning service; the attacker has access to the network and training set, however, any change in network architecture would be easily detected. Thus, the attacker tries to inject backdoors through data poisening. As defense against such attacks, the authors propose to identify and prune weights that are not used for the actual tasks but only for the backdoor inputs. This defense can then be combined with fine-tuning and, as shown in experiments, is able to make backdoor attacks less effective – even when considering an attacker aware of this defense. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

On the Geometry of Adversarial Examples

Khoury, Marc and Hadfield-Menell, Dylan

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Khoury, Marc and Hadfield-Menell, Dylan

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Khoury and Hadfield-Menell provide two important theoretical insights regarding adversarial robustness: it is impossible to be robust in terms of all norms, and adversarial training is sample inefficient. Specifically, they study robustness in relation to the problem’s codimension, i.e., the difference between the dimensionality of the embedding space (e.g., image space) and the dimensionality of the manifold (where the data is assumed to actually live on). Then, adversarial training is shown to be sample inefficient in high codimensions. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

The Limitations of Model Uncertainty in Adversarial Settings

Grosse, Kathrin and Pfaff, David and Smith, Michael T. and Backes, Michael

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Grosse, Kathrin and Pfaff, David and Smith, Michael T. and Backes, Michael

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Grosse et al. show that Gaussian Processes allow to reject some adversarial examples based on their confidence and uncertainty; however, attacks maximizing confidence and minimizing uncertainty are still successful. While some state-of-the-art adversarial examples seem to result in significantly different confidence and uncertainty estimates compared to benign examples, Gaussian Processes can still be fooled through particularly crafted adversarial examples. To this end, the confidence is explicitly maximized and, additionally, the uncertainty is constrained to not be larger than the uncertainty of the corresponding benign test example. In experiments, this attack is shown to successfully fool Gaussian Processes while resulting in imperceptible perturbations. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets

Carlini, Nicholas and Liu, Chang and Kos, Jernej and Erlingsson, Úlfar and Song, Dawn

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Carlini, Nicholas and Liu, Chang and Kos, Jernej and Erlingsson, Úlfar and Song, Dawn

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Carlini et al. propose several attacks to extract secrets form trained black-box models. Additionally, they show that state-of-the-art neural networks memorize secrets early during training. Particularly on the Penn treebank, after inserting a secret of specific format, the authors validate that the secret can be identified based on the models output probabilities (i.e., black-box access). Several metrics based on the log-perplexity of the secret show that secrets are memorized early during training and memorization happens for all popular architectures and training strategies; additionally, memorization also works for multiple secrets. Furthermore, the authors propose several attacks to extract secrets, most notably through shortest path search. Here, starting with an empty secret, the characters of the secret are identified sequentially in order to minimize log-perplexity. Using this attack, secrets such as credit card numbers are extractable from popular mail datasets. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Playing the Game of Universal Adversarial Perturbations

Julien Perolat and Mateusz Malinowski and Bilal Piot and Olivier Pietquin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2018/09/20 (5 years ago)

**Abstract:** We study the problem of learning classifiers robust to universal adversarial
perturbations. While prior work approaches this problem via robust
optimization, adversarial training, or input transformation, we instead phrase
it as a two-player zero-sum game. In this new formulation, both players
simultaneously play the same game, where one player chooses a classifier that
minimizes a classification loss whilst the other player creates an adversarial
perturbation that increases the same loss when applied to every sample in the
training set. By observing that performing a classification (respectively
creating adversarial samples) is the best response to the other player, we
propose a novel extension of a game-theoretic algorithm, namely fictitious
play, to the domain of training robust classifiers. Finally, we empirically
show the robustness and versatility of our approach in two defence scenarios
where universal attacks are performed on several image classification datasets
-- CIFAR10, CIFAR100 and ImageNet.
more
less

Julien Perolat and Mateusz Malinowski and Bilal Piot and Olivier Pietquin

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
Pérolat et al. propose a game-theoretic variant of adversarial training on universal adversarial perturbations. In particular, in each training iteration, the model is trained for a specific number of iterations on the current training set. Afterwards, a universal perturbation is found (and the corresponding test images) that fools the network. The found adversarial examples are added to the training set. In the next iteration, the network is trained on the new training set which includes adversarial examples. Overall, this leads to a network being trained on a sequence of universal adversarial perturbations corresponding to earlier versions of that network. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Are adversarial examples inevitable?

Shafahi, Ali and Huang, W. Ronny and Studer, Christoph and Feizi, Soheil and Goldstein, Tom

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Shafahi, Ali and Huang, W. Ronny and Studer, Christoph and Feizi, Soheil and Goldstein, Tom

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Shafahi et al. discuss fundamental limits of adversarial robustness, showing that adversarial examples are – to some extent – inevitable. Specifically, for the unit sphere, the unit cube as well as for different attacks (e.g., sparse attacks and dense attacks), the authors show that adversarial examples likely exist. The provided theoretical arguments also provide some insights on which problems are more (or less) robust. For example, more concentrated class distributions seem to be more robust by construction. Overall, these insights lead the authors to several interesting conclusions: First, the results are likely to extent to datasets which actually live on low-dimensional manifolds of the unit sphere/cube. Second, it needs to be differentiated between the existence adversarial examples and our ability to compute them efficiently. Making it harder to compute adversarial examples might, thus, be a valid defense mechanism. And third, the results suggest that lower-dimensional data might be less susceptible to adversarial examples. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Universal Adversarial Training

Shafahi, Ali and Najibi, Mahyar and Xu, Zheng and Dickerson, John P. and Davis, Larry S. and Goldstein, Tom

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Shafahi, Ali and Najibi, Mahyar and Xu, Zheng and Dickerson, John P. and Davis, Larry S. and Goldstein, Tom

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Shafahi et al. propose universal adversarial training, meaning training on universal adversarial examples. In contrast to regular adversarial examples, universal ones represent perturbations that cause a network to mis-classify many test images. In contrast to regular adversarial training, where several additional iterations are required on each batch of images, universal adversarial training only needs one additional forward/backward pass on each batch. The obtained perturbations for each batch are accumulated in a universal adversarial examples. This makes adversarial training more efficient, however reduces robustness significantly. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Fortified Networks: Improving the Robustness of Deep Networks by Modeling the Manifold of Hidden Representations

Lamb, Alex and Binas, Jonathan and Goyal, Anirudh and Serdyuk, Dmitriy and Subramanian, Sandeep and Mitliagkas, Ioannis and Bengio, Yoshua

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Lamb, Alex and Binas, Jonathan and Goyal, Anirudh and Serdyuk, Dmitriy and Subramanian, Sandeep and Mitliagkas, Ioannis and Bengio, Yoshua

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Lamb et al. introduce fortified networks with denoising auto encoders as hidden layers. These denoising auto encoders are meant to learn the manifold of hidden representations, project adversarial input back to the manifold and improve robustness. The main idea is illustrated in Figure 1. The denoising auto encoders can be added at any layer and are trained jointly with the classification network – either on the original input, or on adversarial examples as done in adversarial training. https://i.imgur.com/5vaZrVk.png Figure 1: Illustration of a fortified layer, i.e., a hidden layer that is reconstructed through a denoising auto encoder as defense mechanism. The denoising auto encoders are trained jointly with the network. In experiments, they show that the proposed defense mechanism improves robustness on MNIST and CIFAR, compared to adversarial training and other baselines. The improvements are, however, very marginal. Especially, as the proposed method imposes an additional overhead (in addition to adversarial training). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Towards the first adversarially robust neural network model on MNIST

Lukas Schott and Jonas Rauber and Matthias Bethge and Wieland Brendel

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/05/23 (6 years ago)

**Abstract:** Despite much effort, deep neural networks remain highly susceptible to tiny
input perturbations and even for MNIST, one of the most common toy datasets in
computer vision, no neural network model exists for which adversarial
perturbations are large and make semantic sense to humans. We show that even
the widely recognized and by far most successful defense by Madry et al. (1)
overfits on the L-infinity metric (it's highly susceptible to L2 and L0
perturbations), (2) classifies unrecognizable images with high certainty, (3)
performs not much better than simple input binarization and (4) features
adversarial perturbations that make little sense to humans. These results
suggest that MNIST is far from being solved in terms of adversarial robustness.
We present a novel robust classification model that performs analysis by
synthesis using learned class-conditional data distributions. We derive bounds
on the robustness and go to great length to empirically evaluate our model
using maximally effective adversarial attacks by (a) applying decision-based,
score-based, gradient-based and transfer-based attacks for several different Lp
norms, (b) by designing a new attack that exploits the structure of our
defended model and (c) by devising a novel decision-based attack that seeks to
minimize the number of perturbed pixels (L0). The results suggest that our
approach yields state-of-the-art robustness on MNIST against L0, L2 and
L-infinity perturbations and we demonstrate that most adversarial examples are
strongly perturbed towards the perceptual boundary between the original and the
adversarial class.
more
less

Lukas Schott and Jonas Rauber and Matthias Bethge and Wieland Brendel

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
Schott et al. propose an analysis-by-synthetis approach for adversarially robust MNIST classification. In particular, as illustrated in Figure 1, class-conditional variational auto-encoders (i.e., one variational auto-encoder per class) are learned. The respective recognition models, i.e., encoders, are discarded. For classification, the optimization problem $l_y^*(x) = \max_z \log p(x|z) - \text{KL}(\mathcal{N}(z, \sigma I)|\mathcal{N}(0,1))$ is solved for each class $z$. Here, $p(x|z)$ represents the learned generative model. The optimization problem leads a latent code $z$ corresponding to the best reconstruction of the input. The corresponding likelihood can be used for classificaiton using Bayes’ theorem. The obtained posteriors $p(y|x)$ are then scaled using a modified softmax (see paper) to obtain the final decision. (Additionally, input binarization is used as defense.) https://i.imgur.com/ignvoHQ.png Figure 1: The proposed analysis by synthesis approach to MNIST classification. The depicted generators are taken from class-specific variational auto-encoders. In addition to the proposed defense, Schott et al. also derive lower and upper bounds on the robustness of the classification procedure. These bounds can be derived from the optimization problem above, see the paper for details. In experiments, they show that their defense outperforms state-of-the-art adversarial training and allows to estimate tight bounds. In addition, the method is robust against distal adversarial examples and the adversarial examples look more meaningful, see Figure 2. https://i.imgur.com/uxGzzg1.png Figure 2: Adversarial examples for the proposed “ABS” method, its binary variant and related work. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Generating Natural Adversarial Examples

Zhao, Zhengli and Dua, Dheeru and Singh, Sameer

International Conference on Learning Representations - 2018 via Local Bibsonomy

Keywords: dblp

Zhao, Zhengli and Dua, Dheeru and Singh, Sameer

International Conference on Learning Representations - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent code. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space. https://i.imgur.com/XBhHJuY.png Figure 1: Illustration of natural adversarial examples in comparison ot regular, FGSM adversarial examples. https://i.imgur.com/HT2StGI.png Figure 2: Generative model (GAN) together with the required inverter. In practice, e.g., on MNIST, any black-box classifier can be attacked by randomly sampling possible perturbations $\delta$ in the random space (with increasing norm) until an adversarial perturbation is found. Here, the inverted from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Adversarial Training Versus Weight Decay

Angus Galloway and Thomas Tanay and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/04/10 (6 years ago)

**Abstract:** Performance-critical machine learning models should be robust to input
perturbations not seen during training. Adversarial training is a method for
improving a model's robustness to some perturbations by including them in the
training process, but this tends to exacerbate other vulnerabilities of the
model. The adversarial training framework has the effect of translating the
data with respect to the cost function, while weight decay has a scaling
effect. Although weight decay could be considered a crude regularization
technique, it appears superior to adversarial training as it remains stable
over a broader range of regimes and reduces all generalization errors. Equipped
with these abstractions, we provide key baseline results and methodology for
characterizing robustness. The two approaches can be combined to yield one
small model that demonstrates good robustness to several white-box attacks
associated with different metrics.
more
less

Angus Galloway and Thomas Tanay and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
Galloway et al. provide a theoretical and experimental discussion of adversarial training and weight decay with respect to robustness as well as generalization. In the following I want to try and highlight the most important findings based on their discussion of linear logistic regression. Considering the softplus loss $\mathcal{L}(z) = \log(1 + e^{-z})$, the learning problem takes the form: $\min_w \mathbb{E}_{x,y \sim p_{data}} [\mathcal{L}(y(w^Tx + b)]$ where $y \in \{-1,1\}$. This optimization problem is also illustrated in Figure 1 (top). Now considering $L_2$ weight decay can also be seen to be equivalent to scaling the softplus loss. In particular, Galloway et al. Argue that $w^Tx + b = \|w\|_2 d(x)$ where $d(x)$ is the (signed) Euclidean distance to the decision boundary. (This follows directly from the fact that $d(x) = \frac{w^Tx +b}{\|w\|w_2}$.) Then, the problem can be rewritten as $\min_w \mathbb{E}_{x,y \sim p_{data}} [\mathcal{L}(yd(x) \|w\|_2)]$ This can be understood as a scaled version of the softplus loss; adding a $L_2$ weight decay term basically controls the level of scaling. This is illustrated in Figure 1 (middle) for different levels of scaling. Finally, adversarial training means training on the worst-case example for a given $\epsilon$. In practice, for the linear logistic regression model, this results in training on $x - \epsilon y \frac{w}{\|w\|_2}$ - which can easily be understood when considering that the attacker can cause the most disturbance when changing the samples in the direction of $-w$ for label $1$. Then, $y (w^T(x - \epsilon y \frac{w}{\|w\|_2}) + b) = y(w^Tx + b) - \epsilon \|w\|_2 = \|w\|_2 (yd(x) - \epsilon)$, which results in a shift of the data by $\epsilon$ - as illustrated in Figure 1 (bottom). Overall, show that weight decay acts as scaling the objective and adversarial training acts as shifting the data (or equivalently the objective). In the non-linear case, decaying weights is argued to be equivalent to decaying the logits. Effectively, this results in a temperature parameter for the softmax function resulting in smoother probability distributions. Similarly, adversarial training (in a first-order approximation) can be understood as effectively reducing the probability attributed to the correct class. Here, again, weight decay results in a scaling effect and adversarial training in a shifting effect. In conclusion, adversarial training is argued to be only effective with small perturbation sizes (i.e., if the shift is not too large), weil weight decay is also beneficial for generalization. However, from reading the paper, it is unclear what the actual recommendation on both methods is. In the experimental section, the authors focus on two models, a wide residual network and a very constrained 4-layer convolutional neural network. Here, their discussion shifts slightly to the complexity of the employed model. While not stated very explicitly, one of the take-aways is that the simpler model might be more robust, especially for fooling images. https://i.imgur.com/FKT3a2O.png https://i.imgur.com/wWwFKqn.png https://i.imgur.com/oaTfqHJ.png Figure 1: Illustration of the linear logistic regression argument. Top: illustration of linear logistic regression where $\xi$ is the loss $\mathcal{L}$, middle: illustration of the impact of weight decay/scaling, bottom: illustration of the impact of shift for adversarial training. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Unsupervised Learning via Meta-Learning

Kyle Hsu and Sergey Levine and Chelsea Finn

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, cs.CV, stat.ML

**First published:** 2018/10/04 (5 years ago)

**Abstract:** A central goal of unsupervised learning is to acquire representations from
unlabeled data or experience that can be used for more effective learning of
downstream tasks from modest amounts of labeled data. Many prior unsupervised
learning works aim to do so by developing proxy objectives based on
reconstruction, disentanglement, prediction, and other metrics. Instead, we
develop an unsupervised learning method that explicitly optimizes for the
ability to learn a variety of tasks from small amounts of data. To do so, we
construct tasks from unlabeled data in an automatic way and run meta-learning
over the constructed tasks. Surprisingly, we find that, when integrated with
meta-learning, relatively simple mechanisms for task design, such as clustering
unsupervised representations, lead to good performance on a variety of
downstream tasks. Our experiments across four image datasets indicate that our
unsupervised meta-learning approach acquires a learning algorithm without any
labeled data that is applicable to a wide range of downstream classification
tasks, improving upon the representation learned by four prior unsupervised
learning methods.
more
less

Kyle Hsu and Sergey Levine and Chelsea Finn

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, cs.CV, stat.ML

[link]
What is stopping us from applying meta-learning to new tasks? Where do the tasks come from? Designing task distribution is laborious. We should automatically learn tasks! Unsupervised Learning via Meta-Learning: The idea is to use a distance metric in an out-of-the-box unsupervised embedding space created by BiGAN/ALI or DeepCluster to construct tasks in an unsupervised way. If you cluster points to randomly define classes (e.g. random k-means) you can then sample tasks of 2 or 3 classes and use them to train a model. Where does the extra information come from? The metric space used for k-means asserts specific distances. The intuition why this works is that it is useful model initialization for downstream tasks. This summary was written with the help of Chelsea Finn. |

Model-Based Reinforcement Learning via Meta-Policy Optimization

Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter

Conference on Robot Learning - 2018 via Local Bibsonomy

Keywords: dblp

Clavera, Ignasi and Rothfuss, Jonas and Schulman, John and Fujita, Yasuhiro and Asfour, Tamim and Abbeel, Pieter

Conference on Robot Learning - 2018 via Local Bibsonomy

Keywords: dblp

[link]
In terms of model based RL, learning dynamics models is imperfect, which often leads to the learned policy overfitting to the learned dynamics model, doing well in the learned simulator but not in the real world. Key solution idea: No need to try to learn one accurate simulator. We can learn an ensemble of models that together will sufficiently represent the space. If we learn an ensemble of models (to be used as many learned simulators) we can denoise estimates of performance. In a meta-learning sense these simulations become the tasks. The real world is then just yet another task, to which the policy could adapt quickly. One experimental observation is that at the start of training there is a lot of variation between learned simulators, and then the simulations come together over training, which might also point to this approach providing improved exploration. This summary was written with the help of Pieter Abbeel. |

Explaining Image Classifiers by Counterfactual Generation

Chun-Hao Chang and Elliot Creager and Anna Goldenberg and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/07/20 (6 years ago)

**Abstract:** When an image classifier makes a prediction, which parts of the image are
relevant and why? We can rephrase this question to ask: which parts of the
image, if they were not seen by the classifier, would most change its decision?
Producing an answer requires marginalizing over images that could have been
seen but weren't. We can sample plausible image in-fills by conditioning a
generative model on the rest of the image. We then optimize to find the image
regions that most change the classifier's decision after in-fill. Our approach
contrasts with ad-hoc in-filling approaches, such as blurring or injecting
noise, which generate inputs far from the data distribution, and ignore
informative relationships between different parts of the image. Our method
produces more compact and relevant saliency maps, with fewer artifacts compared
to previous methods.
more
less

Chun-Hao Chang and Elliot Creager and Anna Goldenberg and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
In the area of explaining model predictions over images, there are two main strains of technique: methods that look for pixels that have the highest gradient effect on the output class, and assign those as the “reason” for the class, and approaches that ask which pixel regions are most responsible for a given classification, in the sense that the classification would change the most if they were substituted with some uninformative reference value. The tricky thing about the second class of methods is that you need to decide what to use as your uninformative fill-in value. It’s easy enough to conceptually pose the problem of “what would our model predict if it couldn’t see this region of pixels,” but as a practical matter, these models take in full images, and you have to put *something* to give the classifier in a region, if you’re testing what the score would be if you removed the information contained in the pixels in that region. What should you fill in instead? The simplest answers are things like “zeros”, or “a constant value” or “white noise”. But all of these are very off-distribution for the model; it wouldn’t have typically seen images that resemble white noise, or all zeros, or all a single value. So if you measure the change in your model score from an off-distribution baseline to your existing pixels, you may not be getting the marginal value of the pixels, so much as the marginal disutility of having something so different from what the model has previously seen. There are other, somewhat more sensible approaches, like blurring out the areas around the pixel region of interest, but these experience less intense forms of the same issue. This paper proposes instead, using generative models to fill in the regions conditioned on the surrounding pixels, and use that as a reference. The notion here is that a conditioned generative model, like a GAN or VAE, can take into account the surrounding pixels, and “imagine” a fill-in that flows smoothly from the surrounding pixels, and looks generally like an image, but which doesn’t contain the information from the pixels in the region being tested, since it wasn’t conditioned on that. https://i.imgur.com/2fKnY0M.png Using this approach, the authors run two types of test: one where they optimize to find the smallest region they can remove from the image, and have it switch class (Smallest Deletion Region, or SDR), and also the smallest informative region that can be added to an otherwise uninformative image, and have the model predict the class connected to that region. They find that regions calculated using their generative model fill in, and specifically with GANs, find a smaller and more compact pixel region as their explanation for the prediction, which is consistent with both human intuitions and also with a higher qualitative sensibleness of the explanations found. |

Temporal Difference Variational Auto-Encoder

Gregor, Karol and Besse, Frederic

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Gregor, Karol and Besse, Frederic

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
This was definitely one of the more conceptually nuanced and complicated papers I’ve read recently, and I’ve only got about 60% confidence that I fully grasp all of its intuitions. However, I’m going to try to collect together what I did understand. There is a lot of research into generative models of text or image sequences, and some amount of research into building “models” in the reinforcement learning sense, where your model can predict future observations given current observations and your action. There’s an important underlying distinction here between model-based RL (where you learn a model of how the world evolves, and use that to optimize reward) and model-free RL (where you leave don’t bother explicitly learning a world model, and just directly try to optimize rewards) However, this paper identifies a few limitations of this research. 1) It’s largely focused on predicting observations, rather than predicting *state*. State is a bit of a fuzzy concept, and corresponds to, roughly, “the true underlying state of the game”. An example I like to use is a game where you walk in one door, and right next to it is a second door, which requires you to traverse the space and find rewards and a key before you can open. Now, imagine that the observation of your agent is it looking at the door. If the game doesn’t have any on-screen representation of the fact that you’ve found the key, it won’t be present in your observations, and you’ll observe the same thing at the point you have just entered and once you found the key. However, the state of the game at these two points will be quite different, in that in the latter case, your next states might be “opening the door” rather than “going to collect rewards”. Scenarios like this are referred to broadly as Partially Observable games or environments. This paper wants to build a model of how the game evolves into the future, but it wants to build a model of *state-to-state* evolution, rather than observation-to-observation evolution, since observations are typically both higher-dimensionality and also more noisy/less informative. 2) Past research has typically focused on predicting each next-step observation, rather than teaching models to be able to directly predict a state many steps in the future, without having to roll out the entire sequence of intermediate predictions. This is arguably quite valuable for making models that can predict the long term consequences of their decision This paper approaches that with an approach inspired by the Temporal Difference framework used in much of RL, in which you update your past estimate of future rewards by forcing it to be consistent with the actual observed rewards you encounter in the future. Except, in this model, we sample two a state (z1) and then a state some distance into the future (z2), and try to make our backwards-looking prediction of the state at time 1, taking into account observations that happened in between, match what our prediction was with only the information at time one. An important mechanistic nuance here is the idea of a “belief state”, something that captures all of your knowledge about game history up to a certain point. We can then directly sample a state Zt given the belief state Bt at that point. This isn’t actually possible with a model where we predict a state at time T given the state at time T-1, because the state at time Z-1 is itself a sample, and so in order to get a full distribution of Zt, you have to sample Zt over the distribution of Zt-1, and in order to get the distribution of Zt-1 you have to sample over the distribution of Zt-2, and so on and so on. Instead, we have a separate non-state variable, Bt that is created conditional on all our past observations (through a RNN). https://i.imgur.com/N0Al42r.png All said and done, the mechanics of this model look like: 1) Pick two points along the sequence trajectory 2) Calculate the belief state at each point, and, from that, construct a distribution over states at each timestep using p(z|b) 3) Have an additional model that predicts z1 given z2, b1, and b2 (that is, the future beliefs and states), and push the distribution over z1 from this model to be close to the distribution over z1 given only the information available at time t1 4) Have a model that predicts Z2 given Z1 and the time interval ahead that we’re jumping, and try to have this model be predictive/have high likelihood over the data 5) And, have a model that predicts an observation at time T2 given the state Z2, and train that so that we can convert our way back to an observation, given a state They mostly test it on fairly simple environments, but it’s an interesting idea, and I’d be curious to see other people develop it in future. (A strange aspect of this model is that, as far as I can tell, it’s non-interventionist, in that we’re not actually conditioning over agent action, or trying to learn a policy for an agent. This is purely trying to learn the long term transitions between states) |

Meta-Learning Update Rules for Unsupervised Representation Learning

Luke Metz and Niru Maheswaranathan and Brian Cheung and Jascha Sohl-Dickstein

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.NE, stat.ML

**First published:** 2018/03/31 (6 years ago)

**Abstract:** A major goal of unsupervised learning is to discover data representations
that are useful for subsequent tasks, without access to supervised labels
during training. Typically, this involves minimizing a surrogate objective,
such as the negative log likelihood of a generative model, with the hope that
representations useful for subsequent tasks will arise as a side effect. In
this work, we propose instead to directly target later desired tasks by
meta-learning an unsupervised learning rule which leads to representations
useful for those tasks. Specifically, we target semi-supervised classification
performance, and we meta-learn an algorithm -- an unsupervised weight update
rule -- that produces representations useful for this task. Additionally, we
constrain our unsupervised update rule to a be a biologically-motivated,
neuron-local function, which enables it to generalize to different neural
network architectures, datasets, and data modalities. We show that the
meta-learned update rule produces useful features and sometimes outperforms
existing unsupervised learning techniques. We further show that the
meta-learned unsupervised update rule generalizes to train networks with
different widths, depths, and nonlinearities. It also generalizes to train on
data with randomly permuted input dimensions and even generalizes from image
datasets to a text task.
more
less

Luke Metz and Niru Maheswaranathan and Brian Cheung and Jascha Sohl-Dickstein

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.NE, stat.ML

[link]
Unsupervised representation learning is a funny thing: our aspiration in learning representations from data is typically that they’ll be useful for future tasks, but, since we (by definition) don’t have access to labels, our approach has historically been to define heuristics, such as representing the data distribution in a low-dimensional space, and hope that those heuristics translate to useful learned representations. And, to a fair extent, they have. However, this paper’s goal is to attach this problem more directly, by explicitly meta-learning an unsupervised update rule so that performs well in future tasks. They do this by: https://i.imgur.com/EEkpW9g.png 1) Defining a parametrized weight update function, to slot into the role that Stochastic Gradient Descent on a label-defined loss function would play in a supervised network. This function calculates a “hidden state”, is defined for each neuron in each layer, and takes in the pre and post-nonlinearity activations for that batch, the hidden state of the next layer, and a set of learned per-layer “backwards weights”. The weight update for that neuron is then calculated using the current hidden state, the last batch's hidden state, and the current value of the weight. In the traditional way of people in this field who want to define some generic function, they instantiate these functions as a MLP. 2) Using that update rule on the data from a new task, taking the representing resulting from applying the update rule, and using it in a linear regression with a small number of samples. The generalization performance from this k-shot regression, taken in expectation over multiple tasks, acts as our meta training objective. By back-propagating from this objective, to the weight values of the representation, and from there to the parameters of the optimization step, they incentivize their updater to learn representations that are useful across some distribution of tasks. A slightly weird thing about this paper is that they train on image datasets, but shuffle the pixels and use a fully connected network rather than a conv net. I presume this has to do with the complexities of defining a weight update rule for a convolution, but it does make it harder to meaningfully compare with other image-based unsupervised systems, which are typically done using convolution. An interesting thing they note is that, early in meta-training on images, their update rules generalize fairly well to text data. However, later in training the update rules seem to have specialized to images, and generalize more poorly to images. |

Learning to Navigate in Cities Without a Map

Mirowski, Piotr and Grimes, Matthew Koichi and Malinowski, Mateusz and Hermann, Karl Moritz and Anderson, Keith and Teplyashin, Denis and Simonyan, Karen and Kavukcuoglu, Koray and Zisserman, Andrew and Hadsell, Raia

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Mirowski, Piotr and Grimes, Matthew Koichi and Malinowski, Mateusz and Hermann, Karl Moritz and Anderson, Keith and Teplyashin, Denis and Simonyan, Karen and Kavukcuoglu, Koray and Zisserman, Andrew and Hadsell, Raia

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
This paper out of DeepMind used a Google StreetView dataset and set out to train a network capable of navigating to a given goal destination, without knowing where it was on any birds-eye map, and with its only input being photographic viewpoint images of its current location and orientation. This was done through a framework of reinforcement learning, where the model is conditioned on a representation of its goal, and given the image features of its current view of the world, and has to take actions like “turn left,” “turn sharply left”, “go forward”, etc, in order to navigate. Rather than lat-long, goals are specified in city-specific ways, in terms of the distance between the goal position and a reference set of landmarks. I don’t entirely understand the motivation behind this approach; the authors say it’s more scalable, but it wasn’t obvious to me why that would be the case. https://i.imgur.com/V3UATsK.png - The authors construct different architectures that combine these two fundamental pieces of input data - current image and the goal you’re trying to reach - in different ways. In the simplest model, called GoalNav, there’s a single LSTM that combines the goal information with the output of a convolutional encoder processing images of your current viewpoint. - In the next most complex, CityNav, there are two LSTMs: one for processing your goal, and the other for combining the output of the goal network with your convolutional inputs, in order to decide on an action. Notionally, this separation of tasks corresponds to “figure out what absolute to go in, given your goal”, and “figure out how to go in that absolute direction from where you are now”. As a way to support training, the goal network is trained with an auxiliary loss function where it needs to predict how far its current orientation is from North. Note that this does pass some amount of information about current location into the model (since the network gets to know its actual orientation relative to true north), but this is only available during training, with the hope that the model will have gotten good enough at predicting orientation to perform well. - The final model, similar to above, is called MultiCityNav, and is explicitly designed for transfer learning. Instead of training multiple cities on a single shared network, only the convolutional encoder and policy network (the “how do I go in the absolute direction needed to reach my goal” parts) are shared between cities, and the goal processing LSTM (the “which direction should I be going in” part) is re-trained per city. This is designed to allow for transfer in the parts of learning you would expect to generalize, but allow the network to learn a city-specific approach for converting between goal specifications (in terms of city landmarks) and direction. In order to get over the fact that reward in this setting is very sparse (i.e. you only get reward when you reach the goal), the authors (1) train in a curriculum fashion, starting with tasks very nearby the model’s starting point, and gradually getting longer, and (2) add a small amount of reward shaping, where you get rewarded for moving in the direction of the goal, but only if you’re within 200m of it. This last is a bit of a concession on the realism front, and the authors say as much, but it’s just quite hard to train RL with purely dense rewards, and it makes sense that reward shaping would help here. Ultimately, they were able to get performance (in terms of goal-reaching rewards) around ¾ as strong as an Oracle model, who had access to the full map and could calculate the true shortest path. |

Systematic Generalization: What Is Required and Can It Be Learned?

Dzmitry Bahdanau and Shikhar Murty and Michael Noukhovitch and Thien Huu Nguyen and Harm de Vries and Aaron Courville

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CL, cs.AI

**First published:** 2018/11/30 (5 years ago)

**Abstract:** Numerous models for grounded language understanding have been recently
proposed, including (i) generic models that can be easily adapted to any given
task and (ii) intuitively appealing modular models that require background
knowledge to be instantiated. We compare both types of models in how much they
lend themselves to a particular form of systematic generalization. Using a
synthetic VQA test, we evaluate which models are capable of reasoning about all
possible object pairs after training on only a small subset of them. Our
findings show that the generalization of modular models is much more systematic
and that it is highly sensitive to the module layout, i.e. to how exactly the
modules are connected. We furthermore investigate if modular models that
generalize well could be made more end-to-end by learning their layout and
parametrization. We find that end-to-end methods from prior work often learn
inappropriate layouts or parametrizations that do not facilitate systematic
generalization. Our results suggest that, in addition to modularity, systematic
generalization in language understanding may require explicit regularizers or
priors.
more
less

Dzmitry Bahdanau and Shikhar Murty and Michael Noukhovitch and Thien Huu Nguyen and Harm de Vries and Aaron Courville

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CL, cs.AI

[link]
The paper discusses neural module network trees (NMN-trees). Here modules are composed in a tree structure to answer a question/task and modules are trained in different configurations to ensure they learn more core concepts and can generalize. Longer summary: How to perform systematic generalization? First we need to ask how good current models are at understanding language. Adversarial examples show how fragile these models can be. This leads us to conclude that systematic generalization is an issue that requires specific attention. Maybe we should rethink the modeling assumptions being made. We can think that samples can come from different data domains but are generated by some set of shared rules. If we correctly learned these rules then domain shift in the test data would not hurt model performance. Currently we can construct an experiment to introduce systematic bias in the data which causes the performance to suffer. From this experiment we can start to determine what the issue is. A recent new idea is to force a model to have more independent units is neural module network trees (NMN-trees). Here modules are composed in a tree structure to answer a question/task and modules are trained in different configurations to ensure they learn more core concepts and can generalize. |

Diversity is All You Need: Learning Skills without a Reward Function

Eysenbach, Benjamin and Gupta, Abhishek and Ibarz, Julian and Levine, Sergey

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Eysenbach, Benjamin and Gupta, Abhishek and Ibarz, Julian and Levine, Sergey

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
[I do occasionally wonder if people will look back on the “Is All You Need” with genuine confusion in a few years. “Really…all you need?”] This paper merges the ideas of curiosity-based learning and hierarchical reinforcement learning, to propose an architecture for learning distinctive skills based solely on an incentive to make those skills distinguishable from one another and relatively internally random, rather than because they’re directly useful in achieving some reward. The notion of hierarchical reinforcement learning is that, instead of learning a single joint policy, we learn some discrete number of subpolicies, and then treat the distribution over those subpolicies as you would a distribution over actions in a baseline RL policy. In order to achieve a reward, a model jointly optimizes the action distribution of the subpolicies, and also the distribution over subpolicies. One issue with this approach, which is raised by this paper (though I don’t really have strong enough domain background here to know how much of a problem this is in practice) is that this joint optimization process means that, early in the process, we choose subpolicies that are doing the best, and sample more from and thus improve those. This “early exploitation” problem (in the explore vs exploit frame) means that we might not learn skills that would be valuable to know later on, but that don’t give us any reward until we’ve developed them further. To address this, this paper proposes DIAYN, an algorithm which (1) samples discrete latent skill vectors according to a uniform, high-entropy prior, rather than according to how useful we think they are now, and (2) doesn’t even have a direct notion of usefulness, but instead incentivizes shaping of skills to be more distinct from one another, in terms of the states that are visited by each skill’s policy. The network then learns policies conditioned on each skill vector, and at each point operates according to whichever has been sampled. This idea of distinctiveness is encapsulated by saying “we want to have high mutual information between the states visited by a skill, and the discrete ID of that skill,” or, in more practical terms, “we want to be able to train a discriminator to do a good job predicting which skill we’re sampling from, based on the states it sees. (I swear, every time I read a paper where someone uses mutual information these days, it’s actually a discriminator under the hood). https://i.imgur.com/2a378Bo.png This incentivizes the model to take actions associated with each skill that will get it to states that are unlikely to occur in any of the existing skills. Depending on what set of observations you give the discriminator to work with, you can shape what axes your skills are incentivized to vary on; if you try to discriminate skills based solely on an agent’s center of mass, you’ll end up with policies that vary their center of mass more wildly. The paper shows that, at least on simple environments, agents can learn distinctive clusters of skills based on this objective. An interesting analogy here is to unsupervised pretraining of e.g. large language models and other similar settings, where we first train a model without (potentially costly) explicit reward, and this gives us a starting point set of representations that allow us to reach good performance more quickly once we start training on supervised reward signal. There is some evidence that this pretraining effect could be captured by this kind of purely-exploratory approach, as suggested by experiments done to take the learned skills or subpolicies, hold them fixed, and train a meta-controller to pick subpolicies according to an external reward, where the “pretrained” policy reaches high reward more quickly. |

Exploration by Random Network Distillation

Burda, Yuri and Edwards, Harrison and Storkey, Amos and Klimov, Oleg

- 2018 via Local Bibsonomy

Keywords: reinforcement-learning

Burda, Yuri and Edwards, Harrison and Storkey, Amos and Klimov, Oleg

- 2018 via Local Bibsonomy

Keywords: reinforcement-learning

[link]
Reward functions are a funny part of modern reinforcement learning: enormously salient from the inside, if you’re coding or working with RL systems, yet not as clearly visible from the outside perspective, where we just see agents playing games in what seem to be human-like ways. Just seeing things from this angle, it can be easy to imagine that the mechanisms being used to learn are human-like as well. And, it’s true that some of the Atari games being examined are cases where there is in fact a clear, explicit reward in the form of points, that human players would also be trying to optimize. But in most cases, the world isn’t really in the habit of producing clear reward signals, and it definitely doesn’t typically do so on time scales that account for most of the learning humans do. So, it’s generally hypothesized that in addition to updating on (sparse) environmental rewards, humans also operate according to certain pre-coded, possibly evolutionarily-engineered heuristics, of which one is curiosity. The intuition is: it sure seems like, especially early in life, humans learn by interacting with objects purely driven by curiosity, and we’d love to somehow harness that same drive to allow our learning systems to function in environments lacking dense, informative reward signals. One such environment is the video game Montezuma’s Revenge, which in addition to being amusingly difficult to search for, is a game with sparse, long-range rewards, on which typical reward-based agents have historically performed poorly, and on which this current paper focuses. A strong existing tradition of curiosity objectives focuses on incentivizing agents to be able to better predict the next observation, given the current observation and their action within it. Intuitively, by training such a network on historical observations, and giving agents a bonus according to that prediction’s error on a given observation. The theory behind this is that if an agent isn’t able to predict the observation-transition dynamics at a given state, that probably means it hasn’t visited many nearby states, and so we want to incentivize it doing so to gain information. If this sounds familiar to the classic “explore vs exploit” trade-off, it’s very much a similar idea: in cases of clear reward, we should take the reward, but in cases of low or uncertain reward, there’s value to exploration. One difficulty of systems like the one described above is that they reward the agent for being in environments where the next observation is difficult to predict from the current one. And while that could describe novel states about which the agent needs to gain information, it can also describe states that are inherently stochastic; the canonical example being random static on a TV screen. The agent has a lot of trouble predicting the next observation because it’s fundamentally non-deterministic to a greater degree than even the random-but-causal dynamics of most games. The proposed alternative of this paper is a little strange, but makes more sense in the context of responding to this stochasticity problem. The authors propose to create a random mapping, in the form of an initialized but untrained neural network, taking in observations and spitting out embedding vectors. Then, they incentivize their agent to go to places that have high prediction error on a network designed to predict these random embeddings. Since the output is just a function mapping, it’s deterministic with respect to observations. The idea here is that if you’ve seen observations similar to your current observation, you’ll be more able to predict the corresponding embedding, even if there’s no meaningful relationship that you’re learning. https://i.imgur.com/Ds5gHDE.png The authors found that this performed well on Montezuma’s Revenge and Private Eye, but only middlingly-well on other environments. I’m a bit torn on this paper overall. On one hand, it seems like a clever idea, and I’m in general interested in seeing more work on curiosity. It does clearly seem to be capturing something that corresponds to novelty-seeking, and the agent trained using it explores a higher number of rooms than alternative options. On the other, I’m a little skeptical of the fact that it only has consistent performance in two environments, and wish there had been more comparisons to simpler forms of observation similarity, since this really does just seem like a metric of “how similar of observation vectors to this have you seen before”. I find myself wondering if some sort of density modeling could even be effective here, especially if (as may be the case, I’m unsure) the input observations are metadata rather than pixels. |

Effective Ways to Build and Evaluate Individual Survival Distributions

Humza Haider and Bret Hoehn and Sarah Davis and Russell Greiner

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/11/28 (5 years ago)

**Abstract:** An accurate model of a patient's individual survival distribution can help
determine the appropriate treatment for terminal patients. Unfortunately, risk
scores (e.g., from Cox Proportional Hazard models) do not provide survival
probabilities, single-time probability models (e.g., the Gail model, predicting
5 year probability) only provide for a single time point, and standard
Kaplan-Meier survival curves provide only population averages for a large class
of patients meaning they are not specific to individual patients. This
motivates an alternative class of tools that can learn a model which provides
an individual survival distribution which gives survival probabilities across
all times - such as extensions to the Cox model, Accelerated Failure Time, an
extension to Random Survival Forests, and Multi-Task Logistic Regression. This
paper first motivates such "individual survival distribution" (ISD) models, and
explains how they differ from standard models. It then discusses ways to
evaluate such models - namely Concordance, 1-Calibration, Brier score, and
various versions of L1-loss - and then motivates and defines a novel approach
"D-Calibration", which determines whether a model's probability estimates are
meaningful. We also discuss how these measures differ, and use them to evaluate
several ISD prediction tools, over a range of survival datasets.
more
less

Humza Haider and Bret Hoehn and Sarah Davis and Russell Greiner

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
The paper looks at approaches to predicting individual survival time distributions (isd). The motivation is shown in the figure below. Between two patients the survival time varies greatly so we should be able to predict a distribution like the red curve. https://i.imgur.com/2r9JvUp.png The paper studies the following methods: - class-based survival curves Kaplan-Meier [31] - Kalbfleisch-Prentice extension of the Cox (cox-kp) [29] - Accelerated Failure Time (aft) model [29] - Random Survival Forest model with Kaplan-Meier extensions (rsf-km) - elastic net Cox (coxen-kp) [55] - Multi-task Logistic Regression (mtlr) [57] Looking at the predictions of these methods side by side we can observe some systematic differences between the methods: https://i.imgur.com/vJoCL4a.png The paper presents a "D-Calibration" metric (distributional calibration) which represents of the method answers this question: Should the patient believe the predictions implied by the survival curve? https://i.imgur.com/MX8CbZ7.png |

On the Suitability of Lp-Norms for Creating and Preventing Adversarial Examples

Sharif, Mahmood and Bauer, Lujo and Reiter, Michael K.

Conference and Computer Vision and Pattern Recognition - 2018 via Local Bibsonomy

Keywords: dblp

Sharif, Mahmood and Bauer, Lujo and Reiter, Michael K.

Conference and Computer Vision and Pattern Recognition - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Sharif et al. study the effectiveness of $L_p$ norms for creating adversarial perturbations. In this context, their main discussion revolves around whether $L_p$ norms are sufficient and/or necessary for perceptual similarity. Their main conclusion is that $L_p$ norms are neither necessary nor sufficient to ensure perceptual similarity. For example, an adversarial example might be within a specific $L_p$ bal, but humans might still identify it as not similar enough to the originally attacked sample; on the other hand, there are also some imperceptible perturbations that usually extend beyond a reasonable $L_p$ ball. Such transformatons might for example include small rotations or translation. These findings are interesting because it indicates that our current model, or approximation, or perceptual similarity is not meaningful in all cases. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning

Nicolas Papernot and Patrick McDaniel

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/03/13 (6 years ago)

**Abstract:** Deep neural networks (DNNs) enable innovative applications of machine
learning like image recognition, machine translation, or malware detection.
However, deep learning is often criticized for its lack of robustness in
adversarial settings (e.g., vulnerability to adversarial inputs) and general
inability to rationalize its predictions. In this work, we exploit the
structure of deep learning to enable new learning-based inference and decision
strategies that achieve desirable properties such as robustness and
interpretability. We take a first step in this direction and introduce the Deep
k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest
neighbors algorithm with representations of the data learned by each layer of
the DNN: a test input is compared to its neighboring training points according
to the distance that separates them in the representations. We show the labels
of these neighboring points afford confidence estimates for inputs outside the
model's training manifold, including on malicious inputs like adversarial
examples--and therein provides protections against inputs that are outside the
models understanding. This is because the nearest neighbors can be used to
estimate the nonconformity of, i.e., the lack of support for, a prediction in
the training data. The neighbors also constitute human-interpretable
explanations of predictions. We evaluate the DkNN algorithm on several
datasets, and show the confidence estimates accurately identify inputs outside
the model, and that the explanations provided by nearest neighbors are
intuitive and useful in understanding model failures.
more
less

Nicolas Papernot and Patrick McDaniel

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
Papernot and McDaniel introduce deep k-nearest neighbors where nearest neighbors are found at each intermediate layer in order to improve interpretbaility and robustness. Personally, I really appreciated reading this paper; thus, I will not only discuss the actually proposed method but also highlight some ideas from their thorough survey and experimental results. First, Papernot and McDaniel provide a quite thorough survey of relevant work in three disciplines: confidence, interpretability and robustness. To the best of my knowledge, this is one of few papers that explicitly make the connection of these three disciplines. Especially the work on confidence is interesting in the light of robustness as Papernot and McDaniel also frequently distinguish between in-distribution and out-distribution samples. Here, it is commonly known that deep neural networks are over-confidence when moving away from the data distribution. The deep k-nearest neighbor approach is described in Algorithm 1 and summarized in the following. For a trained model and a training set of labeled samples, they first find k nearest neighbors for each intermediate layer of the network. The layer nonconformity with a specific label $j$, referred to as $\alpha$ in Algorithm 1, is computed as the number of labels that in the set of nearest neighbors that do not share this label. By comparing these nonconformity values to a set of reference values (computing over a set of labeled calibration data), the prediction can be refined. In particular, the probability for label $j$ can be computed as the fraction of reference nonconformity values that are higher than the computed one. See Algorthm 1 or the paper for details. https://i.imgur.com/RA6q1VI.png https://i.imgur.com/CkRf8ex.png Algorithm 1: The deep k-nearest neighbor algorithm and an illustration. Finally, they provide experimental results – again considering the three disciplines of confidence/credibility, interpretability and robustness. The main take-aways are that the resulting confidences are more reliable on out-of-distribution samples, which also include adversarial examples. Additioanlly, the nearest neighbor allow very basic interpretation of the predictions. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Towards Imperceptible and Robust Adversarial Example Attacks against Neural Networks

Bo Luo and Yannan Liu and Lingxiao Wei and Qiang Xu

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

**First published:** 2018/01/15 (6 years ago)

**Abstract:** Machine learning systems based on deep neural networks, being able to produce
state-of-the-art results on various perception tasks, have gained mainstream
adoption in many applications. However, they are shown to be vulnerable to
adversarial example attack, which generates malicious output by adding slight
perturbations to the input. Previous adversarial example crafting methods,
however, use simple metrics to evaluate the distances between the original
examples and the adversarial ones, which could be easily detected by human
eyes. In addition, these attacks are often not robust due to the inevitable
noises and deviation in the physical world. In this work, we present a new
adversarial example attack crafting method, which takes the human perceptual
system into consideration and maximizes the noise tolerance of the crafted
adversarial example. Experimental results demonstrate the efficacy of the
proposed technique.
more
less

Bo Luo and Yannan Liu and Lingxiao Wei and Qiang Xu

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CR, stat.ML

[link]
Luo et al. Propose a method to compute less-perceptible adversarial examples compared to standard methods constrained in $L_p$ norms. In particular, they consider the local variation of the image and argue that humans are more likely to notice larger variations in low-variance regions than vice-versa. The sensitivity of a pixel is therefore defined as one over its local variance, meaning that it is more sensitive to perturbations. They propose a simple algorithm which iteratively sorts pixels by their sensitivity and then selects a subset to perturb each step. Personally, I wonder why they do not integrate the sensitivity into simple projected gradient descent attacks, where a Lagrange multiplier is used to enforce the $L_p$ norm of the sensitivity weighted perturbation. However, qualitative results show that their approach also works well and results in (partly) less perceptible changes, see Figure 1. https://i.imgur.com/M7Ile8Y.png Figure 1: Qualitative results including a comparison to other state-of-the-art attacks. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Spatially Transformed Adversarial Examples

Xiao, Chaowei and Zhu, Jun-Yan and Li, Bo and He, Warren and Liu, Mingyan and Song, Dawn

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Xiao, Chaowei and Zhu, Jun-Yan and Li, Bo and He, Warren and Liu, Mingyan and Song, Dawn

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Xiao et al. propose adversarial examples based on spatial transformations. Actually, this work is very similar to the adversarial deformations of [1]. In particular, a deformation flow field is optimized (allowing individual deformations per pixel) to cause a misclassification. The distance of the perturbation is computed on the flow field directly. Examples on MNIST are shown in Figure 1 – it can clearly be seen that most pixels are moved individually and no kind of smoothness is enforced. They also show that commonly used defense mechanisms are more or less useless against these attacks. Unfortunately, and in contrast to [1], they do not consider adversarial training on their own adversarial transformations as defense. https://i.imgur.com/uDfttMU.png Figure 1: Examples of the computed adversarial examples/transformations on MNIST for three different models. Note that these are targeted attacks. [1] R. Alaifair, G. S. Alberti, T. Gauksson. Adef: an Iterative Algorithm to Construct Adversarial Deformations. ArXiv, abs/1804.07729v2, 2018. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Robustness of Rotation-Equivariant Networks to Adversarial Perturbations

Dumont, Beranger and Maggio, Simona and Montalvo, Pablo

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Dumont, Beranger and Maggio, Simona and Montalvo, Pablo

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Dumont et al. Compare different adversarial transformation attacks (including rotations and translations) against common as well as rotation-invariant convolutional neural networks. On MNIST, CIFAR-10 and ImageNet, they consider translations, rotations as well as the attack of [1] based on spatial transformer networks. Additionally, they consider rotation-invariant convolutional neural networks – however, both the attacks and the networks are not discussed/introduced in detail. The results are interesting because translation- and rotation-based attacks are significantly more successful on CIFAR-10 compared to MNIST and ImageNet. The authors, however, do not give a satisfying explanation of this observation. [1] C. Xiao, J.-Y. Zhu, B. Li, W. H, M. Liu, D. Song. Spatially-Transformed Adversarial Examples. ICLR, 2018. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples

Athalye, Anish and Carlini, Nicholas and Wagner, David A.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Athalye, Anish and Carlini, Nicholas and Wagner, David A.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Athalye et al. propose methods to circumvent different types of defenses against adversarial example based on obfuscated gradients. In particular, they identify three types of obfuscated gradients: shattered gradients (e.g., caused by undifferentiable parts of a network or through numerical instability), stochastic gradients, and exploding and vanishing gradients. These phenomena all influence the effectiveness of gradient-based attacks. Athalye et al. Give several indicators of how to find out when obfuscated gradients occur. Personally, I find most of these points straight forward, but it is still beneficial to write these “debug strategies” down. The main contribution, however, is a comprehensive evaluation of all eight ICLR’18 defenses against state-of-the-art attacks. As all (except adversarial training) cause obfuscated gradients, Athalye et al. Discuss several strategies to “un-obfuscate” the gradients to successfully compute adversarial examples. Overall, they show that seven out of eight defenses are not reliable, only adversarial training with projected gradient descent can withstand attacks limited to $\epsilon\approx 0.3$. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

ADef: an Iterative Algorithm to Construct Adversarial Deformations

Alaifari, Rima and Alberti, Giovanni S. and Gauksson, Tandri

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Alaifari, Rima and Alberti, Giovanni S. and Gauksson, Tandri

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Alaifari et al. propose an iterative attack to construct adversarial deformations of images. In particular, and in contrast to general adversarial perturbations, adversarial deformations are described through a deformation vector field – and the corresponding norm of this vector field may be bounded; an illustration can be found in Figure 1. The adversarial deformation is computed iteratively where the deformation itself is expressed in a differentiable manner. In contrast to very simple transformations such as rotations and translations, the computed adversarial deformations may contain significantly more subtle deformations as shown in Figure 2. The authors show that such deformations can successful attack MNIST and ImageNet models. https://i.imgur.com/7N8rLaK.png Figure 1: Illustration of the advantage of using general pixel-level deformations compared to simple transformations such as translations or rotations. https://i.imgur.com/dCWBoI8.png Figure 2: Illustration of untargeted (top) and targeted (bottom) attacks on ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

On the Robustness of the CVPR 2018 White-Box Adversarial Example Defenses

Athalye, Anish and Carlini, Nicholas

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Athalye, Anish and Carlini, Nicholas

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Athalye and Carlini present experiments showing that pixel deflection [1] and high-level guided denoiser [2] are ineffective as defense against adversarial examples. In particular, they show that these defenses are not effective against the (currently) strongest first-order attack, projected gradient descent. Here, they also comment on the right threat model to use and explicitly state that the attacker would know the employed defense – which intuitively makes much sense when evaluating defenses. [1] Prakash, Aaditya, Moran, Nick, Garber, Solomon, DiLillo, Antonella, and Storer, James. Deflecting adversarial at tacks with pixel deflection. In CVPR, 2018. [2] Liao, Fangzhou, Liang, Ming, Dong, Yinpeng, Pang, Tianyu, Zhu, Jun, and Hu, Xiaolin. Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, 2018. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

There Is No Free Lunch In Adversarial Robustness (But There Are Unexpected Benefits)

Tsipras, Dimitris and Santurkar, Shibani and Engstrom, Logan and Turner, Alexander and Madry, Aleksander

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Tsipras, Dimitris and Santurkar, Shibani and Engstrom, Logan and Turner, Alexander and Madry, Aleksander

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Tsipras et al. investigate the trade-off between classification accuracy and adversarial robustness. In particular, on a very simple toy dataset, they proof that such a trade-off exists; this means that very accurate models will also have low robustness. Overall, on this dataset, they find that there exists a sweet-spot where the accuracy is 70% and the adversarial accuracy (i.e., accuracy on adversarial examples) is 70%. Using adversarial training to obtain robust networks, they additionally show that the robustness is increased by not using “fragile” features – features that are only weakly correlated with the actual classification tasks. Only focusing on few, but “robust” features also has the advantage of more interpretable gradients and sparser weights (or convolutional kernels). Due to the induced robustness, adversarial examples are perceptually significantly more different from the original examples, as illustrated in Figure 1 on MNIST. https://i.imgur.com/OP2TOOu.png Figure 1: Illustration of adversarial examples for a standard model, a model trained using $L_\infty$ adversarial training and $L_2$ adversarial training. Especially for the $L_2$ case it is visible that adversarial examples need to change important class characteristics to fool the network. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Protecting JPEG Images Against Adversarial Attacks

Prakash, Aaditya and Moran, Nick and Garber, Solomon and DiLillo, Antonella and Storer, James A.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Prakash, Aaditya and Moran, Nick and Garber, Solomon and DiLillo, Antonella and Storer, James A.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Motivated by JPEG compression, Prakash et al. propose an adaptive quantization scheme as defense against adversarial attacks. They argue that JPEG experimentally reduces adversarial noise; however, it is difficult to automatically decide on the level of compression as it also influences a classifier’s performance. Therefore, Prakash et al. use a saliency detector to identify background region, and then apply adaptive quantization – with coarser detail at the background – to reduce the impact of adversarial noise. In experiments, they demonstrate that this approach outperforms simple JPEG compression as defense while having less impact on the image quality. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Adversarial Logit Pairing

Kannan, Harini and Kurakin, Alexey and Goodfellow, Ian J.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Kannan, Harini and Kurakin, Alexey and Goodfellow, Ian J.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Kannan et al. propose a defense against adversarial examples called adversarial logit pairing where the logits of clean and adversarial example are regularized to be similar. In particular, during adversarial training, they add a regularizer of the form $\lambda L(f(x), f(x’))$ were $L$ is, for example, the $L_2$ norm and $f(x’)$ the logits corresponding to adversarial example $x’$ (corresponding to clean example $x$). Intuitively, this is a very simple approach – adversarial training itself enforces the classification results of clean and corresponding adversarial examples to be the same and adversarial logit pairing enforces the internal representation, i.e., the logits, to be similar. In theory, this could also be applied to any set of activations within the network. In the paper, they conclude that “We hypothesize that adversarial logit pairing works well because it provides an additional prior that regularizes the model toward a more accurate understanding of the classes.” In experiments, they show that this approach slightly outperforms adversarial training alone on SVHN, MNIST as well as ImageNet. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Out-distribution training confers robustness to deep neural networks

Abbasi, Mahdieh and Gagné, Christian

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Abbasi, Mahdieh and Gagné, Christian

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Abbasi and Gagné propose explicit but natural out-distribution training as defense against adversarial examples. Specifically, as also illustrated on the toy dataset in Figure 1, they argue that networks commonly produce high-confident predictions in regions that are clearly outside of the data manifold (i.e., the training data distribution). As mitigation strategy, the authors propose to explicitly train on out-of-distribution data, allowing the network to additionally classify this data as “dustbin” data. On MNIST, for example, this data comes from NotMNIST, a dataset of letters A-J – on CIFA-10, this data could be CIFAR-100. Experiments show that this out-of-distribution training allow networks to identify adversarial examples as “dustbin” and thus improve robustness. https://i.imgur.com/nUSDZay.png Figure 1: Illustration of a naive model versus an augmented model, i.e., trained on out-of-distribution data, on a toy dataset (left) and on MNIST (right). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Adversarial Defense based on Structure-to-Signal Autoencoders

Folz, Joachim and Palacio, Sebastian and Hees, Jörn and Borth, Damian and Dengel, Andreas

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Folz, Joachim and Palacio, Sebastian and Hees, Jörn and Borth, Damian and Dengel, Andreas

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
Folz et al. propose an auto-encoder based defense against adversarial examples. In particular, they propose structure-to-signal auto-encoders, S2SNets, as defense mechanism – this auto-encoder is first trained in an unsupervised fashion to reconstruct images (which can be done independent of attack models or the classification network under attack). Then, the network’s decoder is fine tuned using gradients from the classification network. Their main argumentation is that the gradients of the composite network – auto-encoder plus classification network – are not class specific anymore as only the decoder is fine-tuned but not the encoder (as the encoder is trained to encode any image independent of the classification task). Experimentally they show that the gradients are indeed less class-specific. Additionally, the authors highlight that this defense is independent of an attack model and can be applied to any pre-trained classification model. Unforutntely, the approach is not compared against other defense machenisms – while related work was mentioned, a comparison would have been useful. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Improving MMD-GAN Training with Repulsive Loss Function

Wei Wang and Yuan Sun and Saman Halgamuge

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2018/12/24 (5 years ago)

**Abstract:** Generative adversarial nets (GANs) are widely used to learn the data sampling
process and their performance may heavily depend on the loss functions, given a
limited computational budget. This study revisits MMD-GAN that uses the maximum
mean discrepancy (MMD) as the loss function for GAN and makes two
contributions. First, we argue that the existing MMD loss function may
discourage the learning of fine details in data as it attempts to contract the
discriminator outputs of real data. To address this issue, we propose a
repulsive loss function to actively learn the difference among the real data by
simply rearranging the terms in MMD. Second, inspired by the hinge loss, we
propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the
repulsive loss function. The proposed methods are applied to the unsupervised
image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets.
Results show that the repulsive loss function significantly improves over the
MMD loss at no additional computational cost and outperforms other
representative loss functions. The proposed methods achieve an FID score of
16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral
normalization.
more
less

Wei Wang and Yuan Sun and Saman Halgamuge

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

[link]
**TL;DR**: Rearranging the terms in Maximum Mean Discrepancy yields a much better loss function for the discriminator of Generative Adversarial Nets. **Keywords**: Generative adversarial nets, Maximum Mean Discrepancy, spectral normalization, convolutional neural networks, Gaussian kernel, local stability. **Summary** Generative adversarial nets (GANs) are widely used to learn the data sampling process and are notoriously difficult to train. The training of GANs may be improved from three aspects: loss function, network architecture, and training process. This study focuses on a loss function called the Maximum Mean Discrepancy (MMD), defined as: $$ MMD^2(P_X,P_G)=\mathbb{E}_{P_X}k_{D}(x,x')+\mathbb{E}_{P_G}k_{D}(y,y')-2\mathbb{E}_{P_X,P_G}k_{D}(x,y) $$ where $G,D$ are the generator and discriminator networks, $x,x'$ are real samples, $y,y'$ are generated samples, $k_D=k\circ D$ is a learned kernel that calculates the similariy between two samples. Overall, MMD calculates the distance between the real and the generated sample distributions. Thus, traditionally, the generator is trained to minimize $L_G=MMD^2(P_X,P_G)$, while the discriminator minimizes $L_D=-MMD^2(P_X,P_G)$. This study makes three contributions: - It argues that $L_D$ encourages the discriminator to ignores the fine details in real data. By minimizing $L_D$, $D$ attempts to maximize $\mathbb{E}_{P_X}k_{D}(x,x')$, the similarity between real samples scores. Thus, $D$ has to focus on common features shared by real samples rather than fine details that separate them. This may slow down training. Instead, a repulsive loss is proposed, with no additional computational cost to MMD: $$ L_D^{rep}=\mathbb{E}_{P_X}k_{D}(x,x')-\mathbb{E}_{P_G}k_{D}(y,y') $$ - Inspired by the hinge loss, this study proposes a bounded Gaussian kernel for the discriminator to facilitate stable training of MMD-GAN. - The spectral normalization method divides the weight matrix at each layer by its spectral norm to enforce that each layer is Lipschitz continuous. This study proposes a simple method to calculate the spectral norm of a convolutional kernel. The results show the efficiency of proposed methods on CIFAR-10, STL-10, CelebA and LSUN-bedroom datasets. In Appendix, we prove that MMD-GAN training using gradient method is locally exponentially stable (a property that the Wasserstein loss does not have), and show that the repulsive loss works well with gradient penalty. The paper has been accepted at ICLR 2019 ([OpenReview link](https://openreview.net/forum?id=HygjqjR9Km)). The code is available at [GitHub link](https://github.com/richardwth/MMD-GAN). |

Learning a SAT Solver from Single-Bit Supervision

Selsam, Daniel and Lamm, Matthew and Bünz, Benedikt and Liang, Percy and de Moura, Leonardo and Dill, David L.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Selsam, Daniel and Lamm, Matthew and Bünz, Benedikt and Liang, Percy and de Moura, Leonardo and Dill, David L.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

[link]
The goal is to solve SAT problems with weak supervision: In that case a model is trained only to predict ***the satisfiability*** of a formula in conjunctive normal form. As a byproduct, when the formula is satisfiable, an actual satisfying assignment can be worked out by clustering the network's activations in most cases. * **Pros (+):** Weak supervision, interesting structured architecture, seems to generalize nicely to harder problems by increasing the number message passing iterations. * **Cons (-):** Limited practical applicability since it is outperfomed by classical SAT solvers. --- # NeuroSAT ## Inputs We consider Boolean logic formulas in their ***conjunctive normal form*** (CNF), i.e. each input formula is represented as a conjunction ($\land$) of **clauses**, which are themselves disjunctions ($\lor$) of litterals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable. A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a litteral in all clauses, invariance to permutations in $\lor$ and $\land$ etc.). The authors use an ***undirected graph representation*** where: * $\mathcal V$: vertices are the litterals (positive and negative form of variables, denoted as $x$ and $\bar x$) and the clauses occuring in the input formula * $\mathcal E$: Edges are added to connect (i) the litterals with clauses they appear in and (ii) each litteral to its negative counterpart. The graph relations are encoded as an ***adjacency matrix***, $A$, with as many rows as there are litterals and as many columns as there are clauses. In particular, this structure does not constrain the vertices ordering, and does not make any preferential treatment between positive or negative litterals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives. ## Message-passing model In a high-level view, the model keeps track of an embedding for each vertex (litterals, $L^t$ and clauses, $C^t$), updated via ***message-passing on the graph***, and combined via a Multi Layer perceptrion (MLP) to output the model prediction of the formula's satisfiability. The model updates are as follow: $$ \begin{align} C^t, h_C^t &= \texttt{LSTM}_\texttt{C}(h_C^{t - 1}, A^T \texttt{MLP}_{\texttt{L}}(L^{t - 1}) )\ \ \ \ \ \ \ \ \ \ \ (1)\\ L^t, h_L^t &= \texttt{LSTM}_\texttt{L}(h_L^{t - 1}, \overline{L^{t - 1}}, A\ \texttt{MLP}_{\texttt{C}}(C^{t }) )\ \ \ \ \ \ (2) \end{align} $$ where $h$ designates a hidden context vector for the LSTMs. The operator $L \mapsto \bar{L}$ returns $\overline{L}$, the embedding matrix $L$ where the row of each litteral is swapped with the one corresponding to the litteral's negation. In other words, in **(1)** each clause embedding is updated based on the litteral that composes it, while in **(2)** each litteral embedding is updated based on the clauses it appears in and its negated counterpart. After $T$ iterations of this message-passing scheme, the model computes a ***logit for the satisfiability classification problem***, which is trained via sigmoid cross-entropy: $$ \begin{align} L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\ y^t &= \mbox{mean}(L^t_{\mbox{vote}}) \end{align} $$ --- # Training and Inference ## Training Set The training set is built such that for any satisfiable training formula $S$, it also includes an unsatisfiable counterpart $S'$ which differs from $S$ ***only by negating one litteral in one clause***. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to ***40 variables***, and on average ***200 clauses***. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision) but are large enough they prove challenging for Machine Learning models. ## Inferring the SAT assignment When a formula is satisfiable, one often also wants to know a ***valuation*** (variable assignment) that satisfies it. Recall that $L^t_{\mbox{vote}}$ encodes a "vote" for every litteral and its negative counterpart. Qualitative experiments show that thoses scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows: * (1) Reshape $L^T_{\mbox{vote}}$ to size $(n, 2)$ where $n$ is the number of litterals. * (2) Cluster the litterals into two clusters with centers $\Delta_1$ and $\Delta_2$ using the following criterion: \begin{align} \|x_i - \Delta_1\|^2 + \|\overline{x_i} - \Delta_2\|^2 \leq \|x_i - \Delta_2\|^2 + \|\overline{x_i} - \Delta_1\|^2 \end{align} * (3) Try the two resulting assignments (set $\Delta_1$ to true and $\Delta_2$ to false, or vice-versa) and choose the one that yields satisfiability if any. In practice, this method retrieves a satistifiability assignment for over 70% of the satisfiable test formulas. --- # Experiments In practice, the ***NeuroSAT*** model is trained with embeddings of dimension 128 and 26 message passing iterations using standard MLPs: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula's satisfiability on the test set. It also can generalize to ***larger problems***, requiring to increase the number of message passing iterations, although the classification performance decreases as the problem size grows (e.g. 25% for 200 variables). Interestingly, the model also generalizes well to other classes of problems that were first ***reduced to SAT***, although they have different structure than the random formulas generated for training, which seems to show that the model does learn some general structural characteristics of Boolean formulas. |

Data Augmentation for Skin Lesion Analysis

Fábio Perez and Cristina Vasconcelos and Sandra Avila and Eduardo Valle

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/09/05 (5 years ago)

**Abstract:** Deep learning models show remarkable results in automated skin lesion
analysis. However, these models demand considerable amounts of data, while the
availability of annotated skin lesion images is often limited. Data
augmentation can expand the training dataset by transforming input images. In
this work, we investigate the impact of 13 data augmentation scenarios for
melanoma classification trained on three CNNs (Inception-v4, ResNet, and
DenseNet). Scenarios include traditional color and geometric transforms, and
more unusual augmentations such as elastic transforms, random erasing and a
novel augmentation that mixes different lesions. We also explore the use of
data augmentation at test-time and the impact of data augmentation on various
dataset sizes. Our results confirm the importance of data augmentation in both
training and testing and show that it can lead to more performance gains than
obtaining new images. The best scenario results in an AUC of 0.882 for melanoma
classification without using external data, outperforming the top-ranked
submission (0.874) for the ISIC Challenge 2017, which was trained with
additional data.
more
less

Fábio Perez and Cristina Vasconcelos and Sandra Avila and Eduardo Valle

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
_Disclaimer: I'm the first author of this paper._ The code for this paper can be found at https://github.com/fabioperez/skin-data-augmentation. In this work, we wanted to compare different data augmentation scenarios for skin lesion analysis. We tried 13 scenarios, including commonly used augmentation techniques (color and geometry transformations), unusual ones (random erasing, elastic transformation, and a novel lesion mix to simulate collision lesions), and a combination of those. Examples of the augmentation scenarios: https://i.imgur.com/TpgxzLZ.png a) no augmentation b) color (saturation, contrast, and brightness) c) color (saturation, contrast, brightness, and hue) d) affine (rotation, shear, scaling) e) random flips f) random crops g) random erasing h) elastic i) lesion mix j) basic set (f, d, e, c) k) basic set + erasing (f, g, d, e, c) l) basic set + elastic (f, d, h, e, c) m) basic set + mix (i, f, d, e, c) --- We used the ISIC 2017 Challenge dataset (2000 training images, 150 validation images, and 600 test images). We tried three network architectures: Inception-v4, ResNet-152, and DenseNet-161. We also compared different test-time data augmentation methods: a) no augmentation; b) 144-crops; c) same data augmentation as training (64 augmented copies of the original image). Final prediction was the average of all augmented predictions. ## Results https://i.imgur.com/WK5VKUf.png * Basic set (combination of commonly used augmentations) is the best scenario. * Data augmentation at test-time is very beneficial. * Elastic is better than no augmentation, but when compared incorporated to the basic set, decreases the performance. * The best result was better than the winner of the challenge in 2017, without using ensembling. * Test data augmentation is very similar with 144-crop, but takes less images during prediction (64 vs 144), so it's faster. # Impact of data augmentation on dataset sizes We also used the basic set scenarios on different dataset sizes by sampling random subsets of the original dataset, with sizes 1500, 1000, 500, 250 and 125. https://i.imgur.com/m3Ut6ht.png ## Results * Using data augmentation can be better than using more data (but you should always use more data since the model can benefit from both). For instance, using 500 images with data augmentation on training and test for Inception is better than training with no data augmentation with 2000 images. * ResNet and DenseNet works better than Inception for less data. * Test-time data augmentation is always better than not augmenting on test-time. * Using data augmentation on train only was worse than not augmenting at all in some cases. |

Investigating Human Priors for Playing Video Games

Rachit Dubey and Pulkit Agrawal and Deepak Pathak and Thomas L. Griffiths and Alexei A. Efros

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.AI, cs.LG

**First published:** 2018/02/28 (6 years ago)

**Abstract:** What makes humans so good at solving seemingly complex video games? Unlike
computers, humans bring in a great deal of prior knowledge about the world,
enabling efficient decision making. This paper investigates the role of human
priors for solving video games. Given a sample game, we conduct a series of
ablation studies to quantify the importance of various priors on human
performance. We do this by modifying the video game environment to
systematically mask different types of visual information that could be used by
humans as priors. We find that removal of some prior knowledge causes a drastic
degradation in the speed with which human players solve the game, e.g. from 2
minutes to over 20 minutes. Furthermore, our results indicate that general
priors, such as the importance of objects and visual consistency, are critical
for efficient game-play. Videos and the game manipulations are available at
https://rach0012.github.io/humanRL_website/
more
less

Rachit Dubey and Pulkit Agrawal and Deepak Pathak and Thomas L. Griffiths and Alexei A. Efros

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.AI, cs.LG

[link]
Authors investigated why humans play some video games better than machines. That is the case for games that do not have continuous rewards (e.g., scores). They experimented with a game -- inspired by _Montezuma's Revenge_ -- in which the player has to climb stairs, collect keys and jump over enemies. RL algorithms can only know if they succeed if they finish the game, as there is no rewards during the gameplay, so they tend to do much worse than humans in these games. To compare between humans and machines, they set up RL algorithms and recruite players from Amazon Mechanical Turk. Humans did much better than machines for the original game setup. However, authors wanted to check the impact of semantics and prior knowledge on humans performance. They set up scenarios with different levels of reduced semantic information, as shown in Figure 2. https://i.imgur.com/e0Dq1WO.png This is what the game originally looked like: https://rach0012.github.io/humanRL_website/main.gif And this is the version with lesser semantic clues: https://rach0012.github.io/humanRL_website/random2.gif You can try yourself in the [paper's website](https://rach0012.github.io/humanRL_website/). Not surprisingly, humans took much more time to complete the game in scenarios with less semantic information, indicating that humans strongly rely on prior knowledge to play video games. The authors argue that this prior knowledge should also be somehow included into RL algorithms in order to move their efficiency towards the human level. ## Additional reading [Why humans learn faster than AI—for now](https://www.technologyreview.com/s/610434/why-humans-learn-faster-than-ai-for-now/). [OpenReview submission](https://openreview.net/forum?id=Hk91SGWR-) |

Neural Ordinary Differential Equations

Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/06/19 (6 years ago)

**Abstract:** We introduce a new family of deep neural network models. Instead of
specifying a discrete sequence of hidden layers, we parameterize the derivative
of the hidden state using a neural network. The output of the network is
computed using a black-box differential equation solver. These continuous-depth
models have constant memory cost, adapt their evaluation strategy to each
input, and can explicitly trade numerical precision for speed. We demonstrate
these properties in continuous-depth residual networks and continuous-time
latent variable models. We also construct continuous normalizing flows, a
generative model that can train by maximum likelihood, without partitioning or
ordering the data dimensions. For training, we show how to scalably
backpropagate through any ODE solver, without access to its internal
operations. This allows end-to-end training of ODEs within larger models.
more
less

Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078). A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this: h1 = f1(x) h2 = f2(h1) h3 = f3(h2) h4 = f3(h3) y = f5(h4) They now build them like this: h1 = f1(x) + x h2 = f2(h1) + h1 h3 = f3(h2) + h2 h4 = f4(h3) + h3 y = f5(h4) + h4 Where f1, f2, etc are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once. In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods. We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net. Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time. |

Do Deep Generative Models Know What They Don't Know?

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/22 (5 years ago)

**Abstract:** A neural network deployed in the wild may be asked to make predictions for
inputs that were drawn from a different distribution than that of the training
data. A plethora of work has demonstrated that it is easy to find or synthesize
inputs for which a neural network is highly confident yet wrong. Generative
models are widely viewed to be robust to such mistaken confidence as modeling
the density of the input features can be used to detect novel,
out-of-distribution inputs. In this paper we challenge this assumption. We find
that the density learned by flow-based models, VAEs, and PixelCNNs cannot
distinguish images of common objects such as dogs, trucks, and horses (i.e.
CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher
likelihood to the latter when the model is trained on the former. Moreover, we
find evidence of this phenomenon when pairing several popular image data sets:
FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN.
To investigate this curious behavior, we focus analysis on flow-based
generative models in particular since they are trained and evaluated via the
exact marginal likelihood. We find such behavior persists even when we restrict
the flow models to constant-volume transformations. These transformations admit
some theoretical analysis, and we show that the difference in likelihoods can
be explained by the location and variances of the data and the model curvature.
Our results caution against using the density estimates from deep generative
models to identify inputs similar to the training distribution until their
behavior for out-of-distribution inputs is better understood.
more
less

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
CNNs predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classifiied with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not robust enough yet to employ them in this way. * **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results. * **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe if this property can be quanitfied as a characteristics of the model or of the input data. --- ## Experimental negative result Three classes of generative models are considered in this paper: * **Auto-regressive** models such as `PixelCNN` [1] * **Latent variable** models, such as `VAEs` [2] * Generative models with **invertible flows** [3], in particular `Glow` [4]. The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistantly observed for various architectures including [1], [2] and [4], although it is of lesser effect in the `PixelCNN` case. Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and this result is counter-intuitive as it goes against the direction that generative models can reliably be use to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods does not necessarily coincide with better generated samples [5]. --- ## Further analysis for invertible flow models The authors further study this phenomenon in the invertible flow models case as they provide a more rigorous analytical framework (exact likelihood inference unlike VAE which only provide a bound on the true likelihood). More specifically invertible flow models are characterized with a ***diffeomorphism*** (invertible function), $f(x; \phi)$, between input space $\mathcal X$ and latent space $\mathcal Z$, and choice of the latent distribution $p(z; \psi)$. The ***change of variable formula*** links the density of $x$ and $z$ as follows: $$ \int_x p_x(x)d_x = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$ And the training objective under this transformation becomes $$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$ Typically, $p_z$ is chosen to be Gaussian, and samples are build by inverting $f$, i.e.,$z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is build such that computing the log determinant of the Jacabian in the previous equation can be done efficiently. First, they observe that contribution of the flow can be decomposed in a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experiment results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***. Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$; which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice. --- ## References * [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al, 2016 * [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013 * [3] Density estimation using Real NVP, Dinh et al., ICLR 2015 * [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal * [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016 |

Learning Confidence for Out-of-Distribution Detection in Neural Networks

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/02/13 (6 years ago)

**Abstract:** Modern neural networks are very powerful predictive models, but they are
often incapable of recognizing when their predictions may be wrong. Closely
related to this is the task of out-of-distribution detection, where a network
must determine whether or not an input is outside of the set on which it is
expected to safely perform. To jointly address these issues, we propose a
method of learning confidence estimates for neural networks that is simple to
implement and produces intuitively interpretable outputs. We demonstrate that
on the task of out-of-distribution detection, our technique surpasses recently
proposed techniques which construct confidence based on the network's output
distribution, without requiring any additional labels or access to
out-of-distribution examples. Additionally, we address the problem of
calibrating out-of-distribution detectors, where we demonstrate that
misclassified in-distribution examples can be used as a proxy for
out-of-distribution examples.
more
less

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
## Summary In a prior work 'On Calibration of Modern Nueral Networks', temperature scailing is used for outputing confidence. This is done at inference stage, and does not change the existing classifier. This paper considers the confidence at training stage, and directly outputs the confidence from the network. ## Architecture An additional branch for confidence is added after the penultimate layer, in parallel to logits and probs (Figure 2). https://i.imgur.com/vtKq9g0.png ## Training The network outputs the prob $p$ and the confidence $c$ which is a single scalar. The modified prob $p'=c*p+(1-c)y$ where $y$ is the label (hint). The confidence loss is $\mathcal{L}_c=-\log c$, the NLL is $\mathcal{L}_t= -\sum \log(p'_i)y_i$. ### Budget Parameter The authors introduced the confidence loss weight $\lambda$ and a budget $\beta$. If $\mathcal{L}_c>\beta$, increase $\lambda$, if $\mathcal{L}_c<\beta$, decrease $\lambda$. $\beta$ is found reasonable in [0.1,1.0]. ### Hinting with 50% Sometimes the model relies on the free label ($c=0$) and does not fit the complicated structure of data. The authors give hints with 50% so the model cannot rely 100% on the hint. They used $p'$ for only half of the bathes for each epoch. ### Misclassified Examples A high-capacity network with small dataset overfits well, and mis-classified samples are required to learn the confidence. The network likely assigns low confidence to samples. The paper used an aggressive data augmentation to create difficult examples. ## Inference Reject if $c\le\delta$. For out-of-distribution detection, they used the same input perturbation as in ODIN (2018). ODIN used temperature scailing and used the max prob, while this paper does not need temperature scailing since it directly outputs $c$. In evaluation, this paper outperformed ODIN. ## Reference ODIN: [Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1706.02690#elbaro) |

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

Supasorn Suwajanakorn and Noah Snavely and Jonathan Tompson and Mohammad Norouzi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/07/05 (6 years ago)

**Abstract:** This paper presents KeypointNet, an end-to-end geometric reasoning framework
to learn an optimal set of category-specific 3D keypoints, along with their
detectors. Given a single image, KeypointNet extracts 3D keypoints that are
optimized for a downstream task. We demonstrate this framework on 3D pose
estimation by proposing a differentiable objective that seeks the optimal set
of keypoints for recovering the relative pose between two views of an object.
Our model discovers geometrically and semantically consistent keypoints across
viewing angles and instances of an object category. Importantly, we find that
our end-to-end framework using no ground-truth keypoint annotations outperforms
a fully supervised baseline using the same neural network architecture on the
task of pose estimation. The discovered 3D keypoints on the car, chair, and
plane categories of ShapeNet are visualized at http://keypointnet.github.io/.
more
less

Supasorn Suwajanakorn and Noah Snavely and Jonathan Tompson and Mohammad Norouzi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

[link]
What the paper is about: KeypointNet learns the optimal set of 3D keypoints and their 2D detectors for a specified downstream task. The authors demonstrate this by extracting 3D keypoints and their 2D detectors for the task of relative pose estimation across views. They show that, using keypoints extracted by KeypointNet, relative pose estimates are superior to ones that are obtained from a supervised set of keypoints. Approach: Training samples for KeypointNet comprise two views (images) of an object. The task is to then produce an ordered list of 3D keypoints that, upon orthogonal procrustes alignment, produce the true relative 3D pose across those views. The network has N heads, each of which extracts one (3D) keypoint (from a 2D image). There are two primary loss terms. A multi-view consistency loss measures the discrepancy between the two sets of extracted keypoints under the ground-truth transform. A relative-pose estimation loss penalizes the angular discrepency (under orthogonal procrustes) of the estimated transform using the extracted keypoints vs the GT transform. Additionally, they require keypoints to be distant from each other, and to lie within the object silhouette. |

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner and Timothy Lillicrap and Ian Fischer and Ruben Villegas and David Ha and Honglak Lee and James Davidson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/11/12 (5 years ago)

**Abstract:** Planning has been very successful for control tasks with known environment
dynamics. To leverage planning in unknown environments, the agent needs to
learn the dynamics from interactions with the world. However, learning dynamics
models that are accurate enough for planning has been a long-standing
challenge, especially in image-based domains. We propose the Deep Planning
Network (PlaNet), a purely model-based agent that learns the environment
dynamics from pixels and chooses actions through online planning in latent
space. To achieve high performance, the dynamics model must accurately predict
the rewards ahead for multiple time steps. We approach this problem using a
latent dynamics model with both deterministic and stochastic transition
function and a generalized variational inference objective that we name latent
overshooting. Using only pixel observations, our agent solves continuous
control tasks with contact dynamics, partial observability, and sparse rewards.
PlaNet uses significantly fewer episodes and reaches final performance close to
and sometimes higher than top model-free algorithms.
more
less

Danijar Hafner and Timothy Lillicrap and Ian Fischer and Ruben Villegas and David Ha and Honglak Lee and James Davidson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
**Summary**: This paper presents three tricks that make model-based reinforcement more reliable when tested in tasks that require walking and balancing. The tricks are 1) are planning based on features, 2) using a recursive network that mixes probabilistic and deterministic information, and 3) looking forward multiple steps. **Longer summary** Imagine playing pool, armed with a tablet that can predict exactly where the ball will bounce, and the next bounce, and so on. That would be a huge advantage to someone learning pool, however small inaccuracies in the model could mislead you especially when thinking ahead to the 2nd and third bounce. The tablet is analogous to the dynamics model in model-based reinforcement learning (RL). Model based RL promises to solve a lot of the open problems with RL, letting the agent learn with less experience, transfer well, dream, and many others advantages. Despite the promise, dynamics models are hard to get working: they often suffer from even small inaccuracies, and need to be redesigned for specific tasks. Enter PlaNet, a clever name, and a net that plans well in range of environments. To increase the challenge the model must predict directly from pixels in fairly difficult tasks such as teaching a cheetah to run or balancing a ball in a cup. How do they do this? Three main tricks. - Planning in latest space: this means that the policy network doesn't need to look at the raw image, but looks at a summary of it as represented by a feature vector. - Recurrent state space models: They found that probabilistic information helps describe the space of possibilities but makes it harder for their RNN based model to look back multiple steps. However mixing probabilistic information and deterministic information gives it the best of both worlds, and they have results that show a starting performance increase when both compared to just one. - Latent overshooting: They train the model to look more than one step ahead, this helps prevent errors that build up over time Overall this paper shows great results that tackle the shortfalls of model based RL. I hope the results remain when tested on different and more complex environments. |

Gradient Reversal Against Discrimination

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

**First published:** 2018/07/01 (6 years ago)

**Abstract:** No methods currently exist for making arbitrary neural networks fair. In this
work we introduce GRAD, a new and simplified method to producing fair neural
networks that can be used for auto-encoding fair representations or directly
with predictive networks. It is easy to implement and add to existing
architectures, has only one (insensitive) hyper-parameter, and provides
improved individual and group fairness. We use the flexibility of GRAD to
demonstrate multi-attribute protection.
more
less

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

[link]
Given some input data $x$ and attribute $a_p$, the task is to predict label $y$ from $x$ while making $a_p$ *protected*, in other words, such that the model predictions are invariant to changes in $a_p$. * **Pros (+)**: Simple and intuitive idea, easy to train, naturally extended to protecting multiple attributes. * **Cons (-)**: Comparison to baselines could be more detailed / comprehensive, in particular the comparison to ALFR [4] which also relies on adversarial training. --- ## Proposed Method **Domain adversarial networks.** The proposed model builds on the *Domain Adversarial Network* [1], originally introduced for unsupervised domain adaptation. Given some labeled data $(x, y) \sim \mathcal X \times \mathcal Y$, and some unlabeled data $\tilde x \sim \tilde{\mathcal X}$, the goal is to learn a network that solves both classification tasks $\mathcal X \rightarrow \mathcal Y$ and $\tilde{\mathcal X} \rightarrow \mathcal Y$ while learning a shared representation between $\mathcal X$ and $\tilde{\mathcal X}$. The model is composed of a feature extractor $G_f$ which then branches off into a *target* branch, $G_t$, to predict the target label, and a *domain* branch, $G_d$, predicting whether the input data comes either from domain $\mathcal X$ or $\tilde{\mathcal X}$. The model parameters are trained with the following objective: $$ \begin{align} (\theta_{G_f}, \theta_{G_t} ) &= \arg\min \mathbb E_{(x, y) \sim \mathcal X \times \mathcal Y}\ \ell_t \left( G_t \circ G_f(x), y \right)\\ \theta_{G_d} &= \arg\max \mathbb E_{x \sim \mathcal X} \ \ell_d\left( G_d \circ G_f(x), 1 \right) + \mathbb E_{\tilde x \sim \tilde{\mathcal X}}\ \ell_d \left(G_d \circ G_f(\tilde x), 0\right)\\ \mbox{where } &\ell_t \mbox{ and } \ell_d \mbox{ are classification losses} \end{align} $$ The gradient updates for this saddle point problem can be efficiently implemented using the Gradient Reversal Layer introduced in [1] **GRAD-pred.** In **G**radient **R**eversal **A**gainst **D**iscrimination, samples come only from one domain $\mathcal X$, and the domain classifier $G_d$ is replaced by an *attribute* classifier, $G_p$, whose goal is to predict the value of the protected attribute $a_p$. In other words, the training objective strives to build a feature representation of $x$ that is good enough to predict the correct label $y$ but such that $a_p$ cannot easily be deduced from it. On the contrary, directly learning classification network $G_y \circ G_f$ penalized when predicting the correct value of attribute $a_p$ could instead lead to a model that learns $a_p$ and trivially outputs an incorrect value. This situation is prevented by the adversarial training scheme here. **GRAD-auto.** The authors also consider a variant of the described model where the target branch $G_t$ instead solves the auto-encoding/reconstruction task. The features learned by the encoder $G_f$ can then later be used as entry point of a smaller network for classification or any other task. --- ## Experiments **Evaluation metrics.** The model is evaluated on four metrics to qualify both accuracy and fairness, following the protocol in [2]: * *Accuracy*, the proportion of correct classifications * *Discrimination*, the average score differences (logits of the ground-truth class) between samples with $a_p = + 1$ and $a_p = -1 $ (assuming a binary attribute) * *Consistency*, the average difference between a sample score and the mean of its nearest neighbors' score. * *Delta = Accuracy - Discrimination*, a penalized version of accuracy **Baselines.** * **Vanilla** CNN trained without the protected attribute protection branch * **LFR** [2]: A classifier with an intermediate latent code $Z \in \{1 \dots K\}$ is trained with an objective that combines a classification loss (the model should accurately classify $x$), a reconstruction loss (the learned representation should encode enough information about the input to reconstruct it accurately) and a parity loss (estimate the probability $P(Z=z | x)$ for both populations with $a_p = 1$ and $a_p = -1$ and strive to make them equal) * **VFA** [3]: A VAE where the protected attribute $a_p$ is factorized out of the latent code $z$, and additional invariance is imposed via a MMD objective which tries to match the moments of the posterior distributions $q(z|a_p = -1)$ and $q(z| a_p = 1)$. * **ALFR** [4] : As in LFR, this paper proposes a model trained with a reconstruction loss and a classification loss. Additionally, they propose to quantify the dependence between the learned representation and the protected attribute by adding an adversary classifier that tries to extract the attribute value from the representation, formulated and trained as in the Generative Adversarial Network (GAN) setting. **Results.** GRAD always reaches highest consistency compared to baselines. For the other metrics, the results are more mitigated, although it usually achieves best or second best results. It's also not clear how to choose between GRAD-pred and GRAD-auto as there does not seem to be a clear winner, although GRAD-pred seems more intuitive when supervision is available, as it directly solves the classification task. Authors also report a small experiment showing that protecting several attributes at the same time can be more beneficial than protecting a single attribute. This can be expected as some attributes are highly correlated or interact in meaningful way. In particular, protecting several attributes at once can easily be done in the GRAD framework by making the attribute prediction branch multi-class for instance: however it is not clear in the paper how it is actually done in practice, nor whether the same idea could also be integrated in the baselines for further comparison. --- ## References * [1] Domain-Adversarial Training of Neural Networks, Ganin et al, JMRL 2016 * [2] Learning Fair Representations, Zemel et al, ICML 2013 * [3] The Variational Fair Autoencoder, Louizos et al, 2016 * [4] Censoring Representations with an Adversary, Edwards and Storkey, ICLR 2016 |

BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

Tianwei Lin and Xu Zhao and Haisheng Su and Chongjing Wang and Ming Yang

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/06/08 (6 years ago)

**Abstract:** Temporal action proposal generation is an important yet challenging problem,
since temporal proposals with rich action content are indispensable for
analysing real-world videos with long duration and high proportion irrelevant
content. This problem requires methods not only generating proposals with
precise temporal boundaries, but also retrieving proposals to cover truth
action instances with high recall and high overlap using relatively fewer
proposals. To address these difficulties, we introduce an effective proposal
generation method, named Boundary-Sensitive Network (BSN), which adopts "local
to global" fashion. Locally, BSN first locates temporal boundaries with high
probabilities, then directly combines these boundaries as proposals. Globally,
with Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating
the confidence of whether a proposal contains an action within its region. We
conduct experiments on two challenging datasets: ActivityNet-1.3 and THUMOS14,
where BSN outperforms other state-of-the-art temporal action proposal
generation methods with high recall and high temporal precision. Finally,
further experiments demonstrate that by combining existing action classifiers,
our method significantly improves the state-of-the-art temporal action
detection performance.
more
less

Tianwei Lin and Xu Zhao and Haisheng Su and Chongjing Wang and Ming Yang

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

[link]
## Boundary sensitive network ### **keyword**: action detection in video; accurate proposal **Summary**: In order to generate precise temporal boundaries and improve recall with lesses proposals, Tianwei Lin et al use BSN which first combine temporal boundaries with high probability to form proposals and then select proposals by evaluating whether a proposal contains an action(confidence score+ boundary probability). **Model**: 1. video feature encoding: use the two-stream extractor to form the input of BSN. $F = \{f_{tn}\}_{n=1}^{l_s} = \{(f_{S,Tn}, f_{T,t_n}\}_{n=1}^{l_s)} $ 2. BSN: * temporal evaluation: input feature sequence, using 3-layer CNN+3 fiter with sigmoid, to generate start, end, and actioness probability * proposal generation: 1.combine bound with high start/end probability or if probility peak to form proposal; 2. use actioness probability to generate proposal feature for each proposal by sampling the actioness probability during proposal region. * proposal evaluation: using 1 hidden layer perceptron to evaluate confidence score based on proposal features. proposal $\varphi =(t_s,t_e,p_{conf},p_{t_s}^s,p_{t_e}^e) $ $p_{t_e}^e$ is the end probability,and $p_{conf}$ is confidence score https://i.imgur.com/VjJLQDc.png **Training**: * **Learn to generate probility curve**: In order to calculate the accuracy of proposals the loss in the temporal evaluation is calculated as following: $L_{TEM} = \lambda L^{action} + L ^{start} + L^{end}$; $L = \frac{1}{l_w} \sum_{i =1}^{l_w}(\frac{l_w}{l_w-\sum_i g_i} b_i*log(p_i)+\frac{l_w}{\sum_i g_i} (1-b_i)*log(1-p_i))$ $ b_i = sign(g_i-\theta_{IoP})$ Thus, if start region proposal is highly overlapped with ground truth, the start point probability should increase to lower the loss, after training, the information of ground truth region could be leveraged to predict the accurate probability for start. actions and end probability could apply the same rule. * **Learn to choose right proposal**: In order to choose the right proposal based on confidence score, push confidence score to match with IOU of the groud truth and proposal is important. So the loss to do this is described as follow: $L_p = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (p_{conf,i}-g_{iou,i})^2$. $N_{train}$ is number of training proposals and among it the ratio of positive to negative proposal is 1:2.$g_{iou,i}$ is the ith proposal's overlap with its corresponding ground truth. During test and prediction, the final confidence is calculated to fetch and suppress proposals using gaussian decaying soft-NMS. and final confidence score for each proposal is $p_f = p_{conf}p_{ts}^sp_{te}^e$ Thus, after training, the confidence score should reveal the iou between the proposal and its corresponding ground truth based on the proposal feature which is generated through actionness probability, whereas final proposal is achieved by ranking final confidence score. **Conclusion**: Different with segment proposal or use RNN to decide where to look next, this paper generate proposals with boundary probability and select them using the confidence score-- the IOU between the proposal and corresponding ground truth. with sufficient data, it can provide right bound probability and confidence score. and the highlight of the paper is it can be very accurate within feature sequence. *However, it only samples part of the video for feature sequence. so it is possible it will jump over the boundary point. if an accurate policy to decide where to sample is used, accuracy should be further boosted. * * **computation complexity**: within this network, computation includes 1. two-stream feature extractor for video samples 2. probility generation module: 3-layers cnn for the generated sequence 3. proposal generation using combination 4. sampler to generate proposal feature 5. 1-hidden layer perceptron to generate confidence score. major computing complexity should attribute to feature extractor(1') and proposal relate module if lots of proposals are generated(3',4') **Performance**: when combined with SCNN-classifier, it reach map@0.5 = 36.9 on THUMOS14 dataset |

Relational Forward Models for Multi-Agent Learning

Andrea Tacchetti and H. Francis Song and Pedro A. M. Mediano and Vinicius Zambaldi and Neil C. Rabinowitz and Thore Graepel and Matthew Botvinick and Peter W. Battaglia

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, cs.MA, stat.ML

**First published:** 2018/09/28 (5 years ago)

**Abstract:** The behavioral dynamics of multi-agent systems have a rich and orderly
structure, which can be leveraged to understand these systems, and to improve
how artificial agents learn to operate in them. Here we introduce Relational
Forward Models (RFM) for multi-agent learning, networks that can learn to make
accurate predictions of agents' future behavior in multi-agent environments.
Because these models operate on the discrete entities and relations present in
the environment, they produce interpretable intermediate representations which
offer insights into what drives agents' behavior, and what events mediate the
intensity and valence of social interactions. Furthermore, we show that
embedding RFM modules inside agents results in faster learning systems compared
to non-augmented baselines. As more and more of the autonomous systems we
develop and interact with become multi-agent in nature, developing richer
analysis tools for characterizing how and why agents make decisions is
increasingly necessary. Moreover, developing artificial agents that quickly and
safely learn to coordinate with one another, and with humans in shared
environments, is crucial.
more
less

Andrea Tacchetti and H. Francis Song and Pedro A. M. Mediano and Vinicius Zambaldi and Neil C. Rabinowitz and Thore Graepel and Matthew Botvinick and Peter W. Battaglia

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, cs.MA, stat.ML

[link]
One of the dominant narratives of the deep learning renaissance has been the value of well-designed inductive bias - structural choices that shape what a model learns. The biggest example of this can be found in convolutional networks, where models achieve a dramatic parameter reduction by having features maps learn local patterns, which can then be re-used across the whole image. This is based on the prior belief that patterns in local images are generally locally contiguous, and so having feature maps that focus only on small (and gradually larger) local areas is a good fit for that prior. This paper operates in a similar spirit, except its input data isn’t in the form of an image, but a graph: the social graph of multiple agents operating within a Multi Agent RL Setting. In some sense, a graph is just a more general form of a pixel image: where a pixel within an image has a fixed number of neighbors, which have fixed discrete relationships to it (up, down, left, right), nodes within graphs have an arbitrary number of nodes, which can have arbitrary numbers and types of attributes attached to that relationship. The authors of this paper use graph networks as a sort of auxiliary information processing system alongside a more typical policy learning framework, on tasks that require group coordination and knowledge sharing to complete successfully. For example, each agent might be rewarded based on the aggregate reward of all agents together, and, in the stag hunt, it might require collaborative effort by multiple agents to successfully “capture” a reward. Because of this, you might imagine that it would be valuable to be able to predict what other agents within the game are going to do under certain circumstances, so that you can shape your strategy accordingly. The graph network used in this model represents both agents and objects in the environment as nodes, which have attributes including their position, whether they’re available or not (for capture-able objects), and what their last action was. As best I can tell, all agents start out with directed connections going both ways to all other agents, and to all objects in the environment, with the only edge attribute being whether the players are on the same team, for competitive environments. Given this setup, the graph network works through a sort of “diffusion” of information, analogous to a message passing algorithm. At each iteration (analogous to a layer), the edge features pull in information from their past value and sender and receiver nodes, as well as from a “global feature”. Then, all of the nodes pull in information from their edges, and their own past value. Finally, this “global attribute” gets updated based on summations over the newly-updated node and edge information. (If you were predicting attributes that were graph-level attributes, this global attribute might be where you’d do that prediction. However, in this case, we’re just interested in predicting agent-level actions). https://i.imgur.com/luFlhfJ.png All of this has the effect of explicitly modeling agents as entities that both have information, and have connections to other entities. One benefit the authors claim of this structure is that it allows them more interpretability: when they “play out” the values of their graph network, which they call a Relational Forward Model or RFM, they observe edge values for two agents go up if those agents are about to collaborate on an action, and observe edge values for an agent and an object go up before that object is captured. Because this information is carefully shaped and structured, it makes it easier for humans to understand, and, in the tests the authors ran, appears to also help agents do better in collaborative games. https://i.imgur.com/BCKSmIb.png While I find graph networks quite interesting, and multi-agent learning quite interesting, I’m a little more uncertain about the inherent “graphiness” of this problem, since there aren’t really meaningful inherent edges between agents. One thing I am curious about here is how methods like these would work in situations of sparser graphs, or, places where the connectivity level between a node’s neighbors, and the average other node in the graph is more distinct. Here, every node is connected to every other node, so the explicit information localization function of graph networks is less pronounced. I might naively think that - to whatever extent the graph is designed in a way that captures information meaningful to the task - explicit graph methods would have an even greater comparative advantage in this setting. |

Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search

Lars Buesing and Theophane Weber and Yori Zwols and Sebastien Racaniere and Arthur Guez and Jean-Baptiste Lespiau and Nicolas Heess

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2018/11/15 (5 years ago)

**Abstract:** Learning policies on data synthesized by models can in principle quench the
thirst of reinforcement learning algorithms for large amounts of real
experience, which is often costly to acquire. However, simulating plausible
experience de novo is a hard problem for many complex environments, often
resulting in biases for model-based policy evaluation and search. Instead of de
novo synthesis of data, here we assume logged, real experience and model
alternative outcomes of this experience under counterfactual actions, actions
that were not actually taken. Based on this, we propose the
Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies
in POMDPs from off-policy experience. It leverages structural causal models for
counterfactual evaluation of arbitrary policies on individual off-policy
episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use
of available logged data to de-bias model predictions. In contrast to
off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS
leverages a model to explicitly consider alternative outcomes, allowing the
algorithm to make better use of experience data. We find empirically that these
advantages translate into improved policy evaluation and search results on a
non-trivial grid-world task. Finally, we show that CF-GPS generalizes the
previously proposed Guided Policy Search and that reparameterization-based
algorithms such Stochastic Value Gradient can be interpreted as counterfactual
methods.
more
less

Lars Buesing and Theophane Weber and Yori Zwols and Sebastien Racaniere and Arthur Guez and Jean-Baptiste Lespiau and Nicolas Heess

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
It is a fact universally acknowledged that a reinforcement learning algorithm not in possession of a model must be in want of more data. Because they generally are. Joking aside, it is broadly understood that model-free RL takes a lot of data to train, and, even when you can design them to use off-policy trajectories, collecting data in the real environment might still be too costly. Under those conditions, we might want to learn a model of the environment and generate synthesized trajectories, and train on those. This has the advantage of not needing us to run the actual environment, but the obvious disadvantage that any model will be a simplification of the true environment, and potentially an inaccurate one. These authors seek to answer the question of: “is there a way to generate trajectories that has higher fidelity to the true environment.” As you might infer from the fact that they published a paper, and that I’m now writing about it, they argue that, yes, there is, and it’s through explicit causal/counterfactual modeling. Causal modeling is one of those areas of statistics that seems straightforward at its highest level of abstraction, but tends to get mathematically messy and unintuitive when you dive into the math. So, rather than starting with equations, I’m going to try to verbally give some intuitions for the way causal modeling is framed here. Imagine you’re trying to understand what would happen if a person had gone to college. There’s some set of information you know about them, and some set of information you don’t, that’s just random true facts about them and about the universe. If, in the real world, they did go to college, and you want to simulate what would have happened if they didn’t, it’s not enough to just know the observed facts about them, you want to actually isolate all of the random other facts (about them, about the world) that weren’t specifically “the choice to go to college”, and condition on those as well. Obviously, in the example given here, it isn’t really practically possible to isolate all the specific unseen factors that influence someone’s outcome. But, conceptually, this quantity, is what we’re going to focus on in this paper. Now, imagine a situation where a RL agent has been dropped into a maze-like puzzle. It has some set of dynamics, not immediately visible to the player, that make it difficult, but ultimately solvable. The best kind of simulated data, the paper argues, would be to keep that state of the world (which is partially unobservable) fixed, and sample different sets of actions the agent might take in that space. Thus, “counterfactual modeling”: for a given configuration of random states in the world, sampling different actions within it. To do this, you first have to infer the random state the agent is experiencing. In the normal model-based case, you’d have some prior over world states, and just sample from it. However, if you use the experience of the agent’s trajectory, you can make a better guess as to what world configuration it was dropped into. If you can do this, which is, technically speaking, sampling from the posterior over unseen context, conditional on an agent’s experience, then the paper suggests you’ll be able to generate data that’s more realistic, because the trajectories will be direct counterfactuals of “real world” scenarios, rather than potentially-unsolvable or unrealistic draws from the prior. This is, essentially, the approach proposed by the paper: during training, they make this “world state” visible to the agent, and let it learn a model predicting what state it started with, given some trajectory of experience. They also learn a model that predicts the outcome and ultimately the value of actions taken, conditioned on this random context (as well as visible context, and the agent’s prior actions). They start out by using this as a tool for policy evaluation, which is a nice problem setup because you can actually check how well you’re doing against some baseline: if you want to know how good your simulated data is at replicating the policy reward on real data, you can just try it out on real data and see. The authors find that they reduce policy reward estimation error pretty substantially by adding steps of experience (in Bayesian terms, bit of evidence moving them from the prior, towards the posterior). https://i.imgur.com/sNAcGjZ.png They also experiment with using this for actual policy search, but, honestly, I didn’t quite follow the intuitions behind Guided Policy Search, so I’m just going to not dive into that for now, since I think a lot of the key contributions of the paper are wrapped up in the idea of “estimate the reward of a policy by simulating data from a counterfactual trajectory” |

Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

Matthew Riemer and Ignacio Cases and Robert Ajemian and Miao Liu and Irina Rish and Yuhai Tu and Gerald Tesauro

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/10/29 (5 years ago)

**Abstract:** Lack of performance when it comes to continual learning over non-stationary
distributions of data remains a major challenge in scaling neural network
learning to more human realistic settings. In this work we propose a new
conceptualization of the continual learning problem in terms of a trade-off
between transfer and interference. We then propose a new algorithm,
Meta-Experience Replay (MER), that directly exploits this view by combining
experience replay with optimization based meta-learning. This method learns
parameters that make interference based on future gradients less likely and
transfer based on future gradients more likely. We conduct experiments across
continual lifelong supervised learning benchmarks and non-stationary
reinforcement learning environments demonstrating that our approach
consistently outperforms recently proposed baselines for continual learning.
Our experiments show that the gap between the performance of MER and baseline
algorithms grows both as the environment gets more non-stationary and as the
fraction of the total experiences stored gets smaller.
more
less

Matthew Riemer and Ignacio Cases and Robert Ajemian and Miao Liu and Irina Rish and Yuhai Tu and Gerald Tesauro

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

[link]
Catastrophic forgetting is the tendency of an neural network to forget previously learned information when learning new information. This paper combats that by keeping a buffer of experience and applying meta-learning to it. They call their new module Meta Experience Replay or MER. How does this work? At each update they compute multiple possible updates to the model weights. One for the new batch of information and some more updates for batches of previous experience. Then they apply meta-learning using the REPTILE algorithm, here the meta-model sees each possible update and has to predict the output which combines them with the least interference. This is done by predicting an update vector that maximizes the dot product between the new and old update vectors, that way it transfers as much learning as possible from the new update without interfering with the old updates. https://i.imgur.com/TG4mZOn.png Does it work? Yes, while it may take longer to train, the results show that it generalizes better and needs a much smaller buffer of experience than the popular approach of using replay buffers. |

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Rosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/07/09 (6 years ago)

**Abstract:** Few ideas have enjoyed as large an impact on deep learning as convolution.
For any problem involving pixels or spatial representations, common intuition
holds that convolutional neural networks may be appropriate. In this paper we
show a striking counterexample to this intuition via the seemingly trivial
coordinate transform problem, which simply requires learning a mapping between
coordinates in (x,y) Cartesian space and one-hot pixel space. Although
convolutional networks would seem appropriate for this task, we show that they
fail spectacularly. We demonstrate and carefully analyze the failure first on a
toy problem, at which point a simple fix becomes obvious. We call this solution
CoordConv, which works by giving convolution access to its own input
coordinates through the use of extra coordinate channels. Without sacrificing
the computational and parametric efficiency of ordinary convolution, CoordConv
allows networks to learn either perfect translation invariance or varying
degrees of translation dependence, as required by the task. CoordConv solves
the coordinate transform problem with perfect generalization and 150 times
faster with 10--100 times fewer parameters than convolution. This stark
contrast raises the question: to what extent has this inability of convolution
persisted insidiously inside other tasks, subtly hampering performance from
within? A complete answer to this question will require further investigation,
but we show preliminary evidence that swapping convolution for CoordConv can
improve models on a diverse set of tasks. Using CoordConv in a GAN produced
less mode collapse as the transform between high-level spatial latents and
pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST
detection showed 24% better IOU when using CoordConv, and in the RL domain
agents playing Atari games benefit significantly from the use of CoordConv
layers.
more
less

Rosanne Liu and Joel Lehman and Piero Molino and Felipe Petroski Such and Eric Frank and Alex Sergeev and Jason Yosinski

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML