Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1581 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

MagNet: A Two-Pronged Defense against Adversarial Examples

Meng, Dongyu and Chen, Hao

ACM Conference on Computer and Communications Security - 2017 via Local Bibsonomy

Keywords: dblp

Meng and Chen propose MagNet, a defense that combines adversarial example detection with adversarial noise removal. At test time, a clean or adversarial test image is processed in two steps. First, the input is passed through one or more detectors; if any of these detectors fires, the input is rejected. To this end, the authors consider detection based on the reconstruction error of an auto-encoder, or detection based on the divergence between probability predictions (on adversarial vs. clean examples). Second, if the input is not rejected, it is passed through a reformer. The reformer reconstructs the input, e.g., through an auto-encoder, to remove potentially undetected adversarial noise. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
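The detect-then-reform pipeline described above can be sketched in a few lines. This is only a minimal illustration, not the authors' implementation: the "auto-encoder" is a hypothetical stand-in that projects inputs onto a single assumed clean-data direction (`CLEAN_DIRECTION`), and `toy_classifier` and the `threshold` value are likewise made up for the example.

```python
import math

# Hypothetical stand-in for a trained auto-encoder: it projects the input
# onto a single "clean data" direction, so clean inputs are reconstructed
# almost exactly and off-manifold perturbations are removed.
CLEAN_DIRECTION = [0.6, 0.8]  # unit vector, assumed for illustration


def autoencode(x):
    """Reconstruct x by projecting it onto CLEAN_DIRECTION."""
    coeff = sum(xi * di for xi, di in zip(x, CLEAN_DIRECTION))
    return [coeff * di for di in CLEAN_DIRECTION]


def reconstruction_error(x):
    """Euclidean distance between x and its reconstruction."""
    return math.sqrt(sum((xi - ri) ** 2 for xi, ri in zip(x, autoencode(x))))


def magnet_style_defense(x, classify, threshold=0.5):
    """Detector first, reformer second; returns None when the input is rejected."""
    if reconstruction_error(x) > threshold:
        return None  # detector fired: reject the input
    reformed = autoencode(x)  # reformer: move the input back towards clean data
    return classify(reformed)


def toy_classifier(x):
    """Hypothetical classifier: 1 if the first coordinate is positive, else 0."""
    return 1 if x[0] > 0 else 0


clean = [1.2, 1.6]                 # lies on the clean direction
slightly_perturbed = [1.28, 1.54]  # small off-manifold noise, passes the detector
strongly_perturbed = [2.0, 0.1]    # far from the clean direction, gets rejected

for name, x in [("clean", clean),
                ("slightly perturbed", slightly_perturbed),
                ("strongly perturbed", strongly_perturbed)]:
    print(name, magnet_style_defense(x, toy_classifier))
```

Small perturbations slip past the detector but are projected away by the reformer, while large ones are rejected outright, which is the division of labour between the two prongs.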

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Kumar, Aviral and Fu, Justin and Soh, Matthew and Tucker, George and Levine, Sergey

Neural Information Processing Systems Conference - 2019 via Local Bibsonomy

Keywords: dblp

Kumar et al. propose an algorithm for batch reinforcement learning (RL), a setting where an agent learns purely from a fixed batch of data, $B$, without any interaction with the environment. The data in the batch is collected according to a behavior policy $\pi_b$ (also written $\beta$ below). Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which only constrains the learned policy to place its probability mass on actions that are non-negligible under the behavior policy. The difference is illustrated in the picture from the BEAR blog post:

https://i.imgur.com/zUw7XNt.png

In both images the behavior policy is the dotted red line. The left image shows policy matching, where the algorithm is constrained to the purple choices, while the right image shows support matching.

**Theoretical Contribution:** The paper formally analyzes how using out-of-distribution actions to compute the target in the Bellman equation influences the back-propagated error. First, a distribution-constrained backup operator is defined as $\mathcal{T}^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} [V(s')]]$ with $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$, which considers only policies $\pi \in \Pi$. The optimal policy $\pi^*$ may not be contained in the policy set $\Pi$, so there is a suboptimality constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) \vert$ which captures how far $\pi^{*}$ is from $\Pi$. Let $P^{\pi_i}$ be the transition matrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch, and $\pi_1, \dots, \pi_k \in \Pi$. The error analysis relies on a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$, a concentrability coefficient is defined as $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k)$. The concentrability coefficient takes values between 1 and $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$, and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining these, Kumar et al. obtain a bound for distribution-constrained value iteration with the constrained Bellman operator $\mathcal{T}^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s) \vert] \leq \frac{\gamma}{(1-\gamma)^2} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)]] + \frac{1-\gamma}{\gamma}\alpha(\Pi) ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$) and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). Finally, the authors propose to construct $\Pi$ from support sets, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This is the set of all policies that place probability mass only on actions that are non-negligible under the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded.

**Algorithm:** The algorithm has an actor-critic style and uses an ensemble of $k$ Q-functions, where the Q-value used to update the policy is taken to be the minimum over the ensemble. The support constraint, which keeps the policy's probability mass on actions that are non-negligible in the batch, is enforced via a sampled maximum mean discrepancy (MMD). The proposed algorithm belongs to the family of policy-regularized algorithms, as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [\min_{j = 1, \dots, k} Q_j(s,a)] \text{ s.t. } \mathbb{E}_{s \sim B}[\text{MMD}(D(s), \pi(\cdot \vert s))] \leq \epsilon$. The Bellman target used to update the Q-functions is computed as a convex combination of the minimum and maximum of the ensemble.

**Experiments:** The experiments use the MuJoCo environments HalfCheetah, Walker, Hopper and Ant. Three scenarios of batch collection, each consisting of 1 million samples, are considered:

- completely random behavior policy
- partially trained behavior policy
- optimal policy as behavior policy

The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations further show that the choice of MMD is important, as it is sometimes on par with and sometimes substantially better than a KL-divergence constraint.
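To make the two algorithmic ingredients more concrete, here is a small sketch of a sampled MMD estimate between batch actions and policy actions, and of a Bellman target built from a convex combination of the ensemble minimum and maximum. It is an illustration under assumptions rather than the authors' code: the Gaussian kernel, the scalar actions, the bandwidth `sigma`, the mixing weight `lam` and the sample values are all chosen for the example.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two scalar actions (kernel choice is an assumption)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def sampled_mmd_squared(xs, ys, sigma=1.0):
    """Biased sample estimate of MMD^2 between two sets of actions."""
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


def ensemble_target(reward, next_q_values, gamma=0.99, lam=0.75):
    """Bellman target from a convex combination of ensemble min and max.

    lam is a hypothetical mixing weight: lam=1 uses only the minimum.
    """
    mixed = lam * min(next_q_values) + (1.0 - lam) * max(next_q_values)
    return reward + gamma * mixed


batch_actions = [-0.2, 0.0, 0.1, 0.3]   # actions stored in the batch for one state
close_policy = [-0.1, 0.05, 0.2, 0.25]  # policy samples inside the batch support
far_policy = [2.0, 2.2, 2.4, 2.6]       # policy samples outside the batch support

print(sampled_mmd_squared(batch_actions, close_policy))
print(sampled_mmd_squared(batch_actions, far_policy))
print(ensemble_target(1.0, [2.0, 4.0], gamma=0.5))
```

A policy whose samples stay inside the region covered by the batch yields a much smaller MMD than one that moves outside it, so a constraint of the form MMD $\leq \epsilon$ penalizes the latter.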

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.

This paper deals with the question of what and how exactly CNNs learn, given that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error that is no better than random guessing, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can fit any labeling (finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is totally random) into account: if $p_c = 1$, then $m()$ should be 0, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, rather than specifically on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
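The finite sample expressivity claim from the key contributions can be checked numerically. The sketch below is one way to build a 2-layer ReLU network with $2n+d$ parameters that fits arbitrary labels: project the samples onto a random direction, interleave the hidden-unit biases with the sorted projections, and solve the resulting lower-triangular system. It is an illustration of that idea and may differ in details from the construction in section 4 of the paper; the function names, data and seeds are made up, and it assumes the projected samples are pairwise distinct.

```python
import random


def fit_two_layer_relu(xs, ys, seed=0):
    """Return (a, b, w) with sum_j w[j] * relu(a.x - b[j]) == y on every sample."""
    rng = random.Random(seed)
    d = len(xs[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random projection direction
    z = [sum(ai * xi for ai, xi in zip(a, x)) for x in xs]
    order = sorted(range(len(xs)), key=lambda i: z[i])
    z_sorted = [z[i] for i in order]
    # Interleave the biases: b[0] < z_sorted[0] < b[1] < z_sorted[1] < ...
    b = [z_sorted[0] - 1.0] + [
        (z_sorted[j - 1] + z_sorted[j]) / 2.0 for j in range(1, len(z_sorted))
    ]
    # Forward substitution on the lower-triangular system A w = y,
    # where A[i][j] = relu(z_sorted[i] - b[j]) is zero for j > i.
    w = []
    for i, idx in enumerate(order):
        partial = sum(w[j] * (z_sorted[i] - b[j]) for j in range(i))
        w.append((ys[idx] - partial) / (z_sorted[i] - b[i]))
    return a, b, w


def predict(x, a, b, w):
    """Evaluate the 2-layer ReLU network on a single input."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    return sum(wj * max(z - bj, 0.0) for wj, bj in zip(w, b))


data_rng = random.Random(1)
samples = [[data_rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(8)]
labels = [data_rng.choice([0.0, 1.0]) for _ in samples]  # random labels

a, b, w = fit_two_layer_relu(samples, labels)
max_error = max(abs(predict(x, a, b, w) - y) for x, y in zip(samples, labels))
print("parameters:", len(a) + len(b) + len(w))
print("max training error:", max_error)
```

With $n = 8$ samples of dimension $d = 3$, the network has $d + n + n = 19$ parameters and reproduces the random labels up to floating-point error, which is memorization rather than learning.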

Beyond Pixel Norm-Balls: Parametric Adversaries using an Analytically Differentiable Renderer

Hsueh-Ti Derek Liu and Michael Tao and Chun-Liang Li and Derek Nowrouzezahrai and Alec Jacobson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, cs.GR, stat.ML

**First published:** 2018/08/08

**Abstract:** Many machine learning image classifiers are vulnerable to adversarial
attacks, inputs with perturbations designed to intentionally trigger
misclassification. Current adversarial methods directly alter pixel colors and
evaluate against pixel norm-balls: pixel perturbations smaller than a specified
magnitude, according to a measurement norm. This evaluation, however, has
limited practical utility since perturbations in the pixel space do not
correspond to underlying real-world phenomena of image formation that lead to
them and has no security motivation attached. Pixels in natural images are
measurements of light that has interacted with the geometry of a physical
scene. As such, we propose the direct perturbation of physical parameters that
underly image formation: lighting and geometry. As such, we propose a novel
evaluation measure, parametric norm-balls, by directly perturbing physical
parameters that underly image formation. One enabling contribution we present
is a physically-based differentiable renderer that allows us to propagate pixel
gradients to the parametric space of lighting and geometry. Our approach
enables physically-based adversarial attacks, and our differentiable renderer
leverages models from the interactive rendering literature to balance the
performance and accuracy trade-offs necessary for a memory-efficient and
scalable adversarial data augmentation workflow.

Liu et al. propose adversarial attacks on physical parameters of images, which can be manipulated efficiently through a differentiable renderer. In particular, they propose adversarial lighting and adversarial geometry; in both cases, an image is assumed to be a function of lighting and geometry, generated by a differentiable renderer. By directly manipulating these latent variables, more realistic-looking adversarial examples can be generated for synthetic images, as shown in Figure 1.

https://i.imgur.com/uh2pj9w.png

Figure 1: Comparison of the proposed attack with known attacks applied to large perturbations, $L_\infty \approx 0.82$.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
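The idea of attacking lighting instead of pixels can be illustrated with a toy example. The sketch below is not the renderer from the paper: it assumes a hypothetical three-pixel Lambertian "renderer", a made-up linear classifier score, and a single signed-gradient step on the light vector. It only shows how pixel gradients can be pushed through an analytically differentiable renderer to a physical parameter.

```python
def render(normals, light, albedo=1.0):
    """Toy Lambertian renderer: one pixel per surface normal."""
    return [albedo * max(sum(n * l for n, l in zip(normal, light)), 0.0)
            for normal in normals]


def score(pixels, weights):
    """Hypothetical linear classifier score for the true class."""
    return sum(w * p for w, p in zip(weights, pixels))


def score_gradient_wrt_light(normals, light, weights, albedo=1.0):
    """Chain rule: pixel gradients (the weights) pushed through the renderer."""
    grad = [0.0] * len(light)
    for normal, w in zip(normals, weights):
        if sum(n * l for n, l in zip(normal, light)) > 0.0:  # lit pixels only
            for k, n in enumerate(normal):
                grad[k] += w * albedo * n
    return grad


def lighting_attack(normals, light, weights, epsilon=0.1):
    """One signed-gradient step on the lighting, bounded by epsilon per coordinate."""
    grad = score_gradient_wrt_light(normals, light, weights)
    return [l - epsilon * (1 if g > 0 else -1 if g < 0 else 0)
            for l, g in zip(light, grad)]


normals = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]  # made-up geometry
light = [0.2, 0.2, 1.0]                                        # made-up lighting
weights = [1.0, 0.5, -0.25]                                    # made-up classifier

clean_score = score(render(normals, light), weights)
adv_light = lighting_attack(normals, light, weights)
adv_score = score(render(normals, adv_light), weights)
print(clean_score, adv_score)
```

The perturbation is bounded in the lighting parameters rather than in pixel space, so every pixel of the adversarial image is still a valid rendering of the same geometry.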

Critic Regularized Regression

Ziyu Wang and Alexander Novikov and Konrad Zolna and Jost Tobias Springenberg and Scott Reed and Bobak Shahriari and Noah Siegel and Josh Merel and Caglar Gulcehre and Nicolas Heess and Nando de Freitas

arXiv e-Print archive - 2020 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2020

**Abstract:** Offline reinforcement learning (RL), also known as batch RL, offers the
prospect of policy optimization from large pre-recorded datasets without online
environment interaction. It addresses challenges with regard to the cost of
data collection and safety, both of which are particularly pertinent to
real-world applications of RL. Unfortunately, most off-policy algorithms
perform poorly when learning from a fixed dataset. In this paper, we propose a
novel offline RL algorithm to learn policies from data using a form of
critic-regularized regression (CRR). We find that CRR performs surprisingly
well and scales to tasks with high-dimensional state and action spaces --
outperforming several state-of-the-art offline RL algorithms by a significant
margin on a wide range of benchmark tasks.

Offline reinforcement learning is a potentially high-value thing for the machine learning community to learn to do well, because there are many applications where it'd be useful to generate a learnt policy for responding to a dynamic environment, but where it'd be too unsafe or expensive to learn in an on-policy or online way, where we continually evaluate our actions in the environment to test their value. In such settings, we'd like to be able to take a batch of existing data - collected from a human demonstrator, or from some other algorithm - and be able to learn a policy from those pre-collected transitions, without being able to query the environment further by taking arbitrary actions.

There are two broad strategies for learning a policy from precollected transitions. One is to simply learn to mimic the action policy used by the demonstrator, predicting the action the demonstrator would take in a given state, without making use of reward data at all. This is Behavioral Cloning, and has the advantage of being somewhat more conservative (in terms of not experimenting with possibly-unsafe-or-low-reward actions the demonstrator never took), but this is also a disadvantage, because it's not possible to get higher reward than the demonstrator themselves got if you're simply copying their behavior.

Another approach is to learn a Q function - estimating the value of a given action in a given state - using the reward data from the precollected transitions. This can also have some downsides, mostly in the direction of overconfidence. Q value Temporal Difference learning works by using the current reward added to the max Q value over possible next actions as the target for the current-state Q estimate. This tends to lead to overestimates, because regression-to-the-mean effects mean that the highest-value Q estimates are disproportionately likely to be noisy (possibly because they correspond to an action with little data in the demonstrator dataset).
In on-policy Q learning, this is less problematic, because the agent can take the action associated with its noisily inaccurate estimate, and as a result get more data for that action, and get an estimate that is less noisy in future. But when we're in a fully offline setting, all our learning is completed before we actually start taking actions with our policy, so taking high-uncertainty actions isn't a valuable source of new information, but just risky.

The approach suggested by this DeepMind paper - Critic Regularized Regression, or CRR - is essentially a synthesis of these two possible approaches. The method learns a Q function as normal, using temporal difference methods. The distinction in this method comes from how to get a policy, given a learned Q function. Rather than simply taking the action your Q estimate says is highest-value at a particular point, CRR optimizes a policy according to the formula shown below. The f() function is a stand-in for various potential functions, all of which are monotonic with respect to the Q function, meaning they increase when the Q function does.

https://i.imgur.com/jGmhYdd.png

This basically amounts to a form of a behavioral cloning loss (with the part that maximizes the probability under your policy of the actions sampled from the demonstrator dataset), but weighted or, as the paper terms it, filtered, by the learned Q function. The higher the estimated Q value for a transition, the more weight is placed on that transition from the demo dataset having high probability under your policy. Rather than trying to mimic all of the actions of the demonstrator, the policy preferentially tries to mimic the demonstrator actions that it estimates were particularly high-quality. Different f() functions lead to different kinds of filtration.
The `binary` version is an indicator function for the Advantage of an action (the Q value for that action at that state minus some reference value for the state, describing how much better the action is than other alternatives at that state) being greater than zero. Another, `exp`, uses exponential weightings which do a more "soft" upweighting or downweighting of transitions based on advantage, rather than the sharp binary of whether an action's advantage is above zero.

The authors demonstrate that, on multiple environments from three different environment suites, CRR outperforms other off-policy baselines - either more pure behavioral cloning, or more pure RL - and in many cases does so quite dramatically. They find that the sharper binary weighting scheme does better on simpler tasks, since the trade-off of fewer but higher-quality samples to learn from works there. However, on more complex tasks, the policy benefits from the exp weighting, which still uses and learns from more samples (albeit at lower weights). This introduces some potential mimicking of lower-quality transitions, but in exchange for a larger effective dataset size to learn from.
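The filtered behavioral cloning loss described above can be written down in a few lines. This is a minimal sketch under assumptions, not the paper's implementation: the state value used as the advantage baseline is simply passed in, `beta` is a hypothetical temperature for the `exp` variant, and the two transitions are made-up numbers.

```python
import math


def crr_weight(q_value, state_value, mode="binary", beta=1.0):
    """Filter f applied to the advantage A = Q(s, a) - V(s)."""
    advantage = q_value - state_value
    if mode == "binary":
        return 1.0 if advantage > 0.0 else 0.0  # keep only better-than-baseline actions
    if mode == "exp":
        return math.exp(advantage / beta)       # soft up- or down-weighting
    raise ValueError(f"unknown mode: {mode}")


def crr_policy_loss(log_probs, q_values, state_values, mode="binary", beta=1.0):
    """Negative weighted log-likelihood of the dataset actions under the policy."""
    weighted = [
        crr_weight(q, v, mode, beta) * lp
        for lp, q, v in zip(log_probs, q_values, state_values)
    ]
    return -sum(weighted) / len(weighted)


# Two made-up transitions: the first action beats the baseline, the second does not.
log_probs = [math.log(0.5), math.log(0.25)]  # log pi(a|s) for the dataset actions
q_values = [2.0, 0.5]
state_values = [1.0, 1.0]

binary_loss = crr_policy_loss(log_probs, q_values, state_values, mode="binary")
exp_loss = crr_policy_loss(log_probs, q_values, state_values, mode="exp")
print(binary_loss, exp_loss)
```

With `binary`, the second transition contributes nothing to the loss, whereas `exp` keeps it at a reduced weight, which mirrors the trade-off between sample quality and effective dataset size discussed above.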
