[link]
Kumar et al. propose an algorithm for batch reinforcement learning (RL), a setting where an agent learns purely from a fixed batch of data $B$, without any interaction with the environment. The data in the batch is collected according to a behavior policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose Bootstrapping Error Accumulation Reduction (BEAR), which only constrains the learned policy to place its probability mass on actions that are non-negligible under the behavior policy (support matching rather than distribution matching). The difference is illustrated in this picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png In both images the behavior policy is the dotted red line; the left image shows distribution matching, where the algorithm is constrained to the purple choices, while the right image shows support matching.

**Theoretical Contribution:** The paper formally analyzes how using out-of-distribution actions to compute the target in the Bellman equation influences the propagated error. First, a distribution-constrained backup operator is defined as $\mathcal{T}^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \mathbb{E}_{P(s' \vert s,a)}[V(s')]]$ with $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$, which considers only policies $\pi \in \Pi$. The optimal policy $\pi^*$ may not be contained in the policy set $\Pi$, so a suboptimality constant $\alpha(\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) \vert$ captures how far $\pi^{*}$ is from $\Pi$. Let $P^{\pi_i}$ be the transition matrix when following policy $\pi_i$, $\rho_0$ the initial state distribution, $\mu(s)$ the state marginal of the batch, and $\pi_1, \dots, \pi_k \in \Pi$. The error analysis relies on a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$, a concentrability coefficient is defined as $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k)$. The concentrability coefficient takes values between $1$ and $\infty$, where $1$ corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$, and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining these quantities, Kumar et al. obtain a bound on the error of distribution-constrained value iteration with the constrained Bellman operator $\mathcal{T}^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s)\vert] \leq \frac{\gamma}{(1-\gamma)^2} \left[ C(\Pi)\, \mathbb{E}_{\mu}\left[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)]\right] + \frac{1-\gamma}{\gamma}\alpha(\Pi) \right]$, where $\delta(s,a)$ is the Bellman error. This exhibits the inherent batch-RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$) and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to construct $\Pi$ from support sets, i.e. $\Pi_{\epsilon} = \{\pi \,\vert\, \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$, the set of all policies that place probability mass only on the non-negligible actions of the behavior policy $\beta$. For this particular choice $\Pi = \Pi_{\epsilon}$, the concentrability coefficient can be bounded.
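To make the distribution-constrained backup concrete, here is a small tabular sketch (my own NumPy illustration, not code from the paper) of one application of $\mathcal{T}^{\Pi_\epsilon}$: for the support set $\Pi_\epsilon$, the inner maximization over policies is attained by a deterministic policy, so it reduces to a maximum over the actions with $\beta(a \vert s) \geq \epsilon$.

```python
import numpy as np

# Illustrative NumPy sketch (not from the paper) of one application of the
# distribution-constrained backup T^Pi for the support set Pi_eps on a tabular MDP.
# It assumes every state has at least one action with beta(a|s) >= eps.

def distribution_constrained_backup(Q, R, P, beta, eps, gamma):
    """Q, R, beta: arrays of shape [S, A]; P: transition probabilities of shape [S, A, S]."""
    allowed = beta >= eps                      # support of the behavior policy
    Q_masked = np.where(allowed, Q, -np.inf)   # forbid out-of-support actions
    V = Q_masked.max(axis=1)                   # V(s) = max_{pi in Pi_eps} E_pi[Q(s, a)]
    return R + gamma * np.einsum("sat,t->sa", P, V)  # T^Pi Q(s, a)
```

Iterating this backup to a fixed point is the distribution-constrained value iteration whose error the bound above controls.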
**Algorithm**: The algorithm has an actor-critic style, where the Q-value used to update the policy is taken to be the minimum over an ensemble of Q-functions. The support constraint, which forces the learned policy to place probability mass only on actions that are non-negligible in the batch, is enforced via a sampled maximum mean discrepancy (MMD). The proposed algorithm is a member of the policy-regularized family, as the policy is updated to optimize $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B}\, \mathbb{E}_{a \sim \pi(\cdot \vert s)} [\min_{j = 1, \dots, K} Q_j(s,a)] \text{ s.t. } \mathbb{E}_{s \sim B}[\text{MMD}(D(s), \pi(\cdot \vert s))] \leq \epsilon$, where $D(s)$ denotes the distribution of actions in the batch at state $s$ (a rough code sketch of this constrained update is given at the end of this summary). The Bellman target used to update the Q-functions is computed as a convex combination of the minimum and maximum over the ensemble.

**Experiments**: The experiments use the MuJoCo environments HalfCheetah, Walker, Hopper and Ant. Three scenarios of batch collection, each consisting of 1 million samples, are considered:

- a completely random behavior policy
- a partially trained behavior policy
- an optimal policy as behavior policy

The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations further show that the choice of MMD is crucial, as it is sometimes on par with and sometimes substantially better than the KL divergence.
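As promised above, here is a rough PyTorch sketch (my own, not the authors' code) of the MMD-constrained policy update. The Gaussian kernel, its bandwidth, the fixed penalty weight `lam`, and the `pi.rsample` / `q(states, actions)` interfaces are illustrative assumptions; the paper instead handles the constraint with a Lagrangian formulation whose multiplier is learned.

```python
import torch

def mmd_squared(x, y, sigma=10.0):
    """Sampled MMD^2 between two action batches x [n, d] and y [m, d], Gaussian kernel."""
    def k(a, b):
        d2 = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)  # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma))
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

def policy_loss(pi, q_ensemble, states, batch_actions, lam=1.0, eps=0.05):
    """Maximize the ensemble-minimum Q while penalizing MMD to the batch actions."""
    pi_actions = pi.rsample(states)  # reparameterized samples from pi(.|s) (assumed interface)
    q_min = torch.min(torch.stack([q(states, pi_actions) for q in q_ensemble]), dim=0).values
    mmd = mmd_squared(batch_actions, pi_actions)
    return -(q_min.mean() - lam * (mmd - eps))  # negated for gradient descent
```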
[link]
Salman et al. combine randomized smoothing with adversarial training, using an attack designed specifically against smoothed classifiers. They consider the formulation of randomized smoothing by Cohen et al. [1]: Gaussian noise is sampled around the input (adversarial or clean) and the classifier takes a simple majority vote over the noisy copies. In [1], Cohen et al. show that this yields good certified robustness bounds. In this paper, Salman et al. propose an adaptive attack against randomized smoothing. Essentially, they use a simple PGD attack to attack the smoothed classifier, i.e., to maximize the cross-entropy loss of the smoothed classifier. To make this objective tractable, Monte Carlo samples are used in each iteration of the PGD optimization. Based on this attack, they perform adversarial training, with adversarial examples computed against the smoothed (and adversarially trained) classifier. In experiments, this approach outperforms the certified robustness obtained by Cohen et al. on several datasets.

[1] Jeremy M. Cohen, Elan Rosenfeld and J. Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. arXiv:1902.02918, 2019.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
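To illustrate the attack, here is a simplified PyTorch sketch (my own, not the authors' code) of PGD against a Monte Carlo estimate of the smoothed classifier: each step averages the softmax over a few Gaussian noise samples, ascends the resulting cross-entropy, and projects back onto an $\ell_2$ ball. The step size, sample count, and the assumption of image-shaped inputs `[B, C, H, W]` are illustrative choices.

```python
import torch
import torch.nn.functional as F

def smooth_pgd_attack(model, x, y, sigma=0.25, eps=1.0, steps=10, n_samples=8, lr=0.2):
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Monte Carlo estimate of the smoothed classifier's class probabilities.
        noise = sigma * torch.randn(n_samples, *x_adv.shape, device=x.device)
        probs = torch.stack([F.softmax(model(x_adv + n_), dim=-1) for n_ in noise]).mean(0)
        loss = F.nll_loss(torch.log(probs + 1e-12), y)  # cross-entropy of the smoothed prediction
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Normalized gradient ascent step, then projection onto the L2 ball of radius eps.
            x_adv = x_adv + lr * grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
            delta = x_adv - x
            delta = delta * torch.clamp(eps / (delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12), max=1.0)
            x_adv = x + delta
    return x_adv.detach()
```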
[link]
Coming from the perspective of the rest of machine learning, a somewhat odd thing about reinforcement learning that often goes unnoticed is that, in basically all reinforcement learning, an algorithm is judged by its performance on the same environment it was trained on. In the parlance of ML writ large: training on the test set. In RL, most of the focus has historically been on whether automatic systems can learn a policy from the state distribution of a single environment at all, already a fairly hard task. But now that RL has had more success in the single-environment case, the question arises: how can we train reinforcement learning algorithms that perform well not just on a single environment, but over a range of environments? One lens onto this question is meta-learning, but this paper takes a different approach and looks at how straightforward regularization techniques from supervised learning can (or can't straightforwardly) be applied to reinforcement learning. In general, the regularization techniques discussed here are all ways of reducing the capacity of the model and preventing it from overfitting. Some ways to reduce capacity are:

- Apply L2 weight penalization
- Apply dropout, which handicaps the model by randomly zeroing out neurons
- Use batch norm, which relies on noisy batch statistics and so injects randomness that, similar to dropout, handicaps the model
- Use an information bottleneck: similar to a VAE, this approach works by learning a compressed representation of your input, p(z|x), and then predicting your output from that z, in a way that incentivizes z to be informative (because you want to be able to predict y well) but also penalizes putting too much information into it (because you penalize differences between your learned p(z|x) distribution and an unconditional prior p(z)). This pushes your model to use its conditional-on-x capacity wisely and only learn features if they're quite valuable in predicting y.

However, the paper points out that there are some complications in straightforwardly applying these techniques to RL. The central one is that in (most) RL, the distribution of transitions you train on comes from prior iterations of your policy. This means that a noisier and less competent policy also leaves you with worse data to train on. Additionally, using a noisy policy can increase variance, both by making your trained policy more different from your rollout policy (in an off-policy setting) and by making your estimate of the value function higher-variance, which is problematic because that estimate is what you use as a target training signal in a temporal-difference framework. The connection the paper draws between this justification and its methods is a bit loose, and it makes two broad, mostly distinct proposals (a rough code sketch of both is given at the end of this summary):

1. The most successful (though also the one least directly justified by the earlier-discussed theoretical difficulties of applying regularization in RL) is an information bottleneck ported into an RL setting. It works almost the same as the classification-model one, except that you're trying to increase the value of your actions given the compressed-from-state representation z, rather than trying to increase your ability to correctly predict y. The justification given here is that it's good to incentivize RL algorithms in particular to learn simpler, more compressible features, because they often have poor data and a poor training signal early in training.
2. SNI (Selective Noise Injection) works by applying the stochastic aspects of regularization (sampling z in an information bottleneck, applying different dropout masks, etc.) only to certain parts of the training procedure. In particular, the rollout used to collect data is non-stochastic, removing the issue of noise affecting the data that's collected. They then do an interesting thing where they compute a weighted mixture of the policy update under a deterministic version of the model and the update under a stochastic one. The best performing split they tested seems to have been 50/50. This is essentially a knob you can turn on stochasticity, trading off between the regularizing effect of noise and its variance-increasing negative effect.

https://i.imgur.com/fi0dHgf.png

https://i.imgur.com/LLbDaRw.png

Based on my read of the experiments in the paper, the most impressive thing here is how well their information bottleneck mechanism improves generalization, compared to both the baseline and other regularization approaches. It does look like there's some additional benefit to SNI, particularly in the CoinRun setting, but very little in the MultiRoom setting, and in general the difference is less dramatic than the difference from using the information bottleneck.
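To make the two proposals above more concrete, here is a rough PyTorch sketch (my own, not the paper's code) of a policy with an information-bottleneck state representation and an SNI-style 50/50 mixture of a deterministic and a stochastic update. The architecture sizes, the plain policy-gradient loss, and the coefficient values are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBPolicy(nn.Module):
    """Policy whose action head reads a stochastic bottleneck code z ~ p(z|x)."""
    def __init__(self, obs_dim, z_dim, n_actions):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, 2 * z_dim)  # outputs mean and log-variance of p(z|x)
        self.pi_head = nn.Linear(z_dim, n_actions)

    def forward(self, obs, stochastic=True):
        mu, logvar = self.encoder(obs).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu) if stochastic else mu
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)  # KL(p(z|x) || N(0, I))
        return F.log_softmax(self.pi_head(z), dim=-1), kl

def sni_loss(policy, obs, actions, advantages, beta=1e-3, mix=0.5):
    # Stochastic pass: sampled z, carries the bottleneck (KL) regularization.
    logp_s, kl = policy(obs, stochastic=True)
    loss_s = -(logp_s.gather(1, actions[:, None]).squeeze(1) * advantages).mean() + beta * kl.mean()
    # Deterministic pass: mean z, no injected noise.
    logp_d, _ = policy(obs, stochastic=False)
    loss_d = -(logp_d.gather(1, actions[:, None]).squeeze(1) * advantages).mean()
    # SNI-style mixture; mix=0.5 corresponds to the 50/50 split mentioned above.
    return mix * loss_d + (1 - mix) * loss_s
```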
[link]
As per the “holistic” in the paper title, the goal of this work is to take a suite of existing ideas within semi-supervised learning and combine many of them into one training pipeline that can (with really impressive empirical success) leverage their respective advantages. The core premise of semi-supervised learning is that, given a true-label training signal from only a small number of labels, you can leverage large amounts of unlabeled data to improve your model. A central intuition of many of these methods is that, even if you don’t know the class of a given sample, you know it *has* a class, and you can build a loss by pushing your model to make the same prediction for an example and a modified or perturbed version of that example: if you have a prior belief that the modification should not change the true class label, then the unlabeled data point should get the same class prediction both times. Entropy minimization is built on a similar notion: although we don’t know a point’s class, we know it must have one, so we’d like the model’s prediction to put most of its weight on a single class rather than be spread out, since we know the “correct” prediction is a very confident one for a single class, even though we don’t know which. These methods give context and a frame of mind for understanding the techniques merged together into the MixMatch approach. At its very highest level, MixMatch’s goal is to take a dataset of both labeled and unlabeled data and produce a training set of inputs, predictions, and (occasionally constructed or modified) labels from which to calculate a model-update loss (a condensed code sketch of the core steps appears at the end of this summary).

https://i.imgur.com/6lHQqMD.png

- First, for each unlabeled example in the dataset, produce K different augmented versions of that image (by cropping it, rotating it, flipping it, etc.). This is in the spirit of the consistency-loss literature, where you want your model to make the same prediction across augmentations.
- Do the same augmentation for each labeled example, but only once per input rather than K times.
- Run the K augmentations of each unlabeled example through your model and take the average of their predictions. The idea is that the average of the predictions is a lower-variance, more stable pseudo-target to pull each of the individual predictions towards. Also, in the spirit of making something shaped more like a real label, a sharpening step lowers the temperature of the averaged distribution. This has the effect of more confidently pulling the original predictions towards a single “best guess” label.
- At this point, we have a set of augmented labeled data with true labels, and a set of augmented unlabeled data with labels based on an averaged and sharpened best guess from the model over the different modifications. The pipeline then uses something called “MixUp” (on which there is a previous paper, so I won’t dive into it too much here), which takes pairs of data points, calculates a convex combination of the inputs, runs it through the model, and uses the matching convex combination of the labels as the loss-function target. So, in the simple binary case, if you have a positively and a negatively labeled image and sample a combination parameter of 0.75, you get an image that is 0.75 positive and 0.25 negative, and the new label you calculate cross-entropy loss against is 0.75.
- MixMatch generates the pairs for its MixUp calculation by mixing (heh) the labeled and unlabeled batches together and pairing each labeled and each unlabeled example with one observation from the merged set. At this point, we have combined inputs and combined labels, and we can calculate a loss between them.

With all of these methods combined, MixMatch takes the previous benchmark of 38% error, for a CIFAR dataset with only 250 labels, and drops it to 11%, which is a pretty astonishing improvement in error rate. In an ablation study, the authors find that MixUp itself, temperature sharpening, and computing K>1 augmentations of unlabeled data rather than K=1 are the strongest value-adds; there doesn’t appear to be much difference from mixing unlabeled and labeled examples for the MixUp pairs.
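As a recap of the mechanics, here is a condensed PyTorch sketch (my own, not the reference implementation) of the label-guessing and MixUp steps described above; `augment` is a placeholder for the stochastic augmentation function, and the values of K, T, and alpha are just illustrative defaults.

```python
import torch
import torch.nn.functional as F

def guess_labels(model, x_unlabeled, augment, K=2, T=0.5):
    """Average the model's predictions over K augmentations, then sharpen them."""
    augs = [augment(x_unlabeled) for _ in range(K)]
    with torch.no_grad():
        p = torch.stack([F.softmax(model(a), dim=-1) for a in augs]).mean(0)  # average over K augmentations
    p = p ** (1.0 / T)                                                        # sharpen (lower temperature)
    return augs, p / p.sum(dim=-1, keepdim=True)                              # renormalize to a distribution

def mixup(x1, y1, x2, y2, alpha=0.75):
    """Convex combination of inputs and of (one-hot or guessed) labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)  # keep the mix biased toward the first argument, as MixMatch does
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```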