Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1583 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Gaussian Processes in Machine Learning

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

Rasmussen, Carl Edward

Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy

Keywords: dblp

[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between data-fit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated. |

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Kumar, Aviral and Fu, Justin and Soh, Matthew and Tucker, George and Levine, Sergey

Neural Information Processing Systems Conference - 2019 via Local Bibsonomy

Keywords: dblp

Kumar, Aviral and Fu, Justin and Soh, Matthew and Tucker, George and Levine, Sergey

Neural Information Processing Systems Conference - 2019 via Local Bibsonomy

Keywords: dblp

[link]
Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely form a fixed batch of data, $B$, without any interactions with the environments. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to place some probability mass on every non negligible action. The difference is illustrated in the picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png The behavior policy is in both images the dotted red line, the left image shows the policy matching where the algorithm is constrained to the purple choices, while the right image shows the support matching. **Theoretical Contribution:** The paper analysis formally how the use of out-of-distribution actions to compute the target in the Bellman equation influences the back-propagated error. Firstly a distribution constrained backup operator is defined as $T^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} V(s')]$ and $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$ which considers only policies $\pi \in \Pi$. It is possible that the optimal policy $\pi^*$ is not contained in the policy set $\Pi$, thus there is a suboptimallity constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) ]\vert $ which captures how far $\pi^{*}$ is from $\Pi$. Letting $P^{\pi_i}$ be the transition-matrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch and $\pi_1, \dots, \pi_k \in \Pi $. The error analysis relies upon a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$ a concentrability coefficient is defined as: $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k).$ The concentrability takes values between 1 und $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$ and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining this Kumar et a. get a bound of the Bellman error for distribution constrained value iteration with the constrained Bellman operator $T^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s)] \leq \frac{\gamma}{(1-\gamma^2)} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)] + \frac{1-\gamma}{\gamma}\alpha(\Pi) ] ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$ and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to use support sets to construct $\Pi$, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This amounts to the set of all policies that place probability on all non-negligible actions of the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded. **Algorithm**: The algorithm has an actor critic style, where the Q-value to update the policy is taken to be the minimum over the ensemble. The support constraint to place at least some probability mass on every non negligible action from the batch is enforced via sampled MMD. The proposed algorithm is a member of the policy regularized algorithms as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [min_{j = 1 \dots, k} Q_j(s,a)] s.t. \mathbb{E}_{s \sim B}[MMD(D(s), \pi(\cdot \vert s))] \leq \epsilon$ The Bellman target to update the Q-functions is computed as the convex combination of minimum and maximum of the ensemble. **Experiments** The experiments use the Mujoco environments Halfcheetah, Walker, Hopper and Ant. Three scenarios of batch collection, always consisting of 1Mio. samples, are considered: - completely random behavior policy - partially trained behavior policy - optimal policy as behavior policy The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations show further that the choice of MMD is crucial as it is sometimes on par and sometimes substantially better than choosing KL-divergence. |

Adversarial examples in the physical world

Alexey Kurakin and Ian Goodfellow and Samy Bengio

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV, cs.CR, cs.LG, stat.ML

**First published:** 2016/07/08 (6 years ago)

**Abstract:** Most existing machine learning classifiers are highly vulnerable to
adversarial examples. An adversarial example is a sample of input data which
has been modified very slightly in a way that is intended to cause a machine
learning classifier to misclassify it. In many cases, these modifications can
be so subtle that a human observer does not even notice the modification at
all, yet the classifier still makes a mistake. Adversarial examples pose
security concerns because they could be used to perform an attack on machine
learning systems, even if the adversary has no access to the underlying model.
Up to now, all previous work have assumed a threat model in which the adversary
can feed data directly into the machine learning classifier. This is not always
the case for systems operating in the physical world, for example those which
are using signals from cameras and other sensors as an input. This paper shows
that even in such physical world scenarios, machine learning systems are
vulnerable to adversarial examples. We demonstrate this by feeding adversarial
images obtained from cell-phone camera to an ImageNet Inception classifier and
measuring the classification accuracy of the system. We find that a large
fraction of adversarial examples are classified incorrectly even when perceived
through the camera.
more
less

Alexey Kurakin and Ian Goodfellow and Samy Bengio

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV, cs.CR, cs.LG, stat.ML

[link]
Adversarial examples are datapoints that are designed to fool a classifier. For example, we can take an image that is classified correctly using a neural network, then backprop through the model to find which changes we need to make in order for it to be classified as something else. And these changes can be quite small, such that a human would hardly notice a difference. https://i.imgur.com/pkK570X.png Examples of adversarial images. In this paper, they show that much of this property holds even when the images are fed into the classifier from the real world – after being photographed with a cell phone camera. While the accuracy goes from 85.3% to 36.3% when adversarial modifications are applied on the source images, the performance still drops from 79.8% to 36.4% when the images are photographed. They also propose two modifications to the process of generating adversarial images – making it into a more gradual iterative process, and optimising for a specific adversarial class. |

Ensemble Adversarial Training: Attacks and Defenses

Florian Tramèr and Alexey Kurakin and Nicolas Papernot and Ian Goodfellow and Dan Boneh and Patrick McDaniel

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG

**First published:** 2017/05/19 (6 years ago)

**Abstract:** Adversarial examples are perturbed inputs designed to fool machine learning
models. Adversarial training injects such examples into training data to
increase robustness. To scale this technique to large datasets, perturbations
are crafted using fast single-step methods that maximize a linear approximation
of the model's loss. We show that this form of adversarial training converges
to a degenerate global minimum, wherein small curvature artifacts near the data
points obfuscate a linear approximation of the loss. The model thus learns to
generate weak perturbations, rather than defend against strong ones. As a
result, we find that adversarial training remains vulnerable to black-box
attacks, where we transfer perturbations computed on undefended models, as well
as to a powerful novel single-step attack that escapes the non-smooth vicinity
of the input data via a small random step. We further introduce Ensemble
Adversarial Training, a technique that augments training data with
perturbations transferred from other models. On ImageNet, Ensemble Adversarial
Training yields models with strong robustness to black-box attacks. In
particular, our most robust model won the first round of the NIPS 2017
competition on Defenses against Adversarial Attacks.
more
less

Florian Tramèr and Alexey Kurakin and Nicolas Papernot and Ian Goodfellow and Dan Boneh and Patrick McDaniel

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG

[link]
Tramèr et al. introduce both a novel adversarial attack as well as a defense mechanism against black-box attacks termed ensemble adversarial training. I first want to highlight that – in addition to the proposed methods – the paper gives a very good discussion of state-of-the-art attacks as well as defenses and how to put them into context. Tramèr et al. consider black-box attacks, focussing on transferrable adversarial examples. Their main observation is as follows: one-shot attacks (i.e. one evaluation of the model's gradient) on adversarially trained models are likely to overfit to the model's training loss. This observation has two aspects that are experimentally validated in the paper. First, the loss of the adversarially trained model increases sharply when considering adversarial examples crafted on a different model; second, the network learns to fool the attacker by, locally, misleading the gradient – this means that perturbations computed on adversarially trained models are specialized to the local loss. These observations are also illustrated in Figure 1, however, I refer to the paper for a detailed discussion. https://i.imgur.com/dIpRz9P.png Figure 1: Illustration of the discussed observations. On the left, the loss function of an adversarially trained model considering a sample $x = x + \epsilon_1 x' + \epsilon_2 x''$ where $x'$ is a perturbation computed on the adversarially trained model and $x''$ is a perturbation computed on a different model. On the right, zoomed in version where it can be seen that the loss rises sharply in the direction of $\epsilon_1$; i.e. the model gives misleading gradients. Based on the above observations, Tramèr et al. First introduce a new one-shot attack exploiting the fact that the adversarially trained model is trained on overfitted perturbations and second introduce a new counter-measure for training more robust networks. Their attack is quite simple; they consider one Fast-Gradient Sign Method (FSGM) step, but apply a random perturbation first to leave the local vicinity of the sample first: $x' = x + \alpha \text{sign}(\mathcal{N}(0, I))$ $x'' = x' + (\epsilon - \alpha)\text{sign}(\nabla_{x'} J(x', y))$ where $J$ is the loss function and $y$ the label corresponding to sample $x$. In experiments, they show that the attack has higher success rates on adversarially trained models. To counter the proposed attack, they propose ensemble adversarial training. The key idea is to train the model utilizing not only adversarial samples crafted on the model itself but also transferred from pre-trained models. On MNIST, for example, they randomly select 64 FGSM samples from 4 different models (including the one in training). Experimentally, they show that ensemble adversarial training improves the defense again all considered attacks, including FGSM, iterative FGSM as well as the proposed attack. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training Substitute Models

Pin-Yu Chen and Huan Zhang and Yash Sharma and Jinfeng Yi and Cho-Jui Hsieh

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG

**First published:** 2017/08/14 (5 years ago)

**Abstract:** Deep neural networks (DNNs) are one of the most prominent technologies of our
time, as they achieve state-of-the-art performance in many machine learning
tasks, including but not limited to image classification, text mining, and
speech processing. However, recent research on DNNs has indicated
ever-increasing concern on the robustness to adversarial examples, especially
for security-critical tasks such as traffic sign identification for autonomous
driving. Studies have unveiled the vulnerability of a well-trained DNN by
demonstrating the ability of generating barely noticeable (to both human and
machines) adversarial images that lead to misclassification. Furthermore,
researchers have shown that these adversarial images are highly transferable by
simply training and attacking a substitute model built upon the target model,
known as a black-box attack to DNNs.
Similar to the setting of training substitute models, in this paper we
propose an effective black-box attack that also only has access to the input
(images) and the output (confidence scores) of a targeted DNN. However,
different from leveraging attack transferability from substitute models, we
propose zeroth order optimization (ZOO) based attacks to directly estimate the
gradients of the targeted DNN for generating adversarial examples. We use
zeroth order stochastic coordinate descent along with dimension reduction,
hierarchical attack and importance sampling techniques to efficiently attack
black-box models. By exploiting zeroth order optimization, improved attacks to
the targeted DNN can be accomplished, sparing the need for training substitute
models and avoiding the loss in attack transferability. Experimental results on
MNIST, CIFAR10 and ImageNet show that the proposed ZOO attack is as effective
as the state-of-the-art white-box attack and significantly outperforms existing
black-box attacks via substitute models.
more
less

Pin-Yu Chen and Huan Zhang and Yash Sharma and Jinfeng Yi and Cho-Jui Hsieh

arXiv e-Print archive - 2017 via Local arXiv

Keywords: stat.ML, cs.CR, cs.LG

[link]
Chen et al. propose a gradient-based black-box attack to compute adversarial examples. Specifically, they follow the general idea of [1] where the following objective is optimized: $\min_x \|x – x_0\|_2 + c \max\{\max_{i\neq t}\{z_i\} – z_t, - \kappa\}$. Here, $x$ is the adversarial example based on training sample $x_0$. The second part expresses that $x$ is supposed to be misclassified, i.e. the logit $z_i$ for some $i \neq t$ distinct form the true label $t$ is supposed to be larger that the logit $z_t$ corresponding to the true label. This is optimized subject to the constraint that $x$ is a valid image. The attack proposed in [1] assumes a white-box setting were we have access to the logits and the gradients (basically requiring access to the full model). Chen et al., in contrast want to design a black-box attacks. Therefore, they make the following changes: - Instead of using logits $z_i$, the probability distribution $f_i$ (i.e. the actual output of the network) is used. - Gradients are approximated by finite differences. Personally, I find that the first point does violate a strict black-box setting. As company, for example, I would prefer not to give away the full probability distribution but just the final decision (or the decision plus a confidence score). Then, however, the proposed method is not applicable anymore. Anyway, the changed objective looks as follows: $\min_x \|x – x_0\|_2 + c \max\{\max_{i\neq t}\{\log f_i\} – \log f_t, - \kappa\}$ where, according to the authors, the logarithm is essential for optimization. One remaining problem is efficient optimization with finite differences. To this end, they propose a randomized/stochastic coordinate descent algorithm. In particular, in each step, a ranodm pixel is chosen and a local update is performed by calculating the gradient on this pixel using finite differences and performing an ADAM step. [1] N. Carlini, D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium of Security and Privacy, 2017. Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

About