[link]
This paper is to mitigate the scene bias in the action recognition task. Scene bias is defined as the model only focusing on scene or object information without paying attention to the actual activity. To mitigate this issue, the author proposed 2 additional types of loss: (1) scene adversarial loss that helps the network to learn features that are suitable for action but invariant to scene type. Hence, reduce the scene bias. (2) human mask confusion loss that prevents a model from predicting the correct action (label) of this video if there is no person in this video. Hence, this can mitigate the scene bias because the model can not predict the correct action based on only the surrounding scene. https://i.imgur.com/BBfWE17.png To mask out the person in the video, they use a human detector to detect and then mask the person out. In the above diagram, there is a gradient reversal layer, which works as follows: In the forward pass, the output is similar to the input. In the backward pass, the output is equal to the input times -1. https://i.imgur.com/hif9ZL9.png This layer comes from Domain Adaptation. In domain adaptation, there is a need to make the distribution of the source and the target domain distinguishable. So, in this work, they want to make the action distribution and the scene distribution distinguishable, which is why they train the action classifier and scene classifier in an adversarial way. https://i.imgur.com/trNJGlm.png And by using the Gradient reversal layer, for the training instances, the action predictor will be trained for predicting the labels of the training instances. The feature extractor will therefore be trained to minimize the classification loss of the action predictor and maximize the classification loss of the scene predictor. As a result, the action will be scene-agnostic. ![]() |
[link]
Kumar et al. propose an algorithm to learn in batch reinforcement learning (RL), a setting where an agent learns purely form a fixed batch of data, $B$, without any interactions with the environments. The data in the batch is collected according to a batch policy $\pi_b$. Whereas most previous methods (like BCQ) constrain the learned policy to stay close to the behavior policy, Kumar et al. propose bootstrapping error accumulation reduction (BEAR), which constrains the newly learned policy to place some probability mass on every non negligible action. The difference is illustrated in the picture from the BEAR blog post: https://i.imgur.com/zUw7XNt.png The behavior policy is in both images the dotted red line, the left image shows the policy matching where the algorithm is constrained to the purple choices, while the right image shows the support matching. **Theoretical Contribution:** The paper analysis formally how the use of out-of-distribution actions to compute the target in the Bellman equation influences the back-propagated error. Firstly a distribution constrained backup operator is defined as $T^{\Pi}Q(s,a) = \mathbb{E}[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s' \vert s,a)} V(s')]$ and $V(s) = \max_{\pi \in \Pi} \mathbb{E}_{\pi}[Q(s,a)]$ which considers only policies $\pi \in \Pi$. It is possible that the optimal policy $\pi^*$ is not contained in the policy set $\Pi$, thus there is a suboptimallity constant $\alpha (\Pi) = \max_{s,a} \vert \mathcal{T}^{\Pi}Q^{*}(s,a) - \mathcal{T}Q^{*}(s,a) ]\vert $ which captures how far $\pi^{*}$ is from $\Pi$. Letting $P^{\pi_i}$ be the transition-matrix when following policy $\pi_i$, $\rho_0$ the state marginal distribution of the training data in the batch and $\pi_1, \dots, \pi_k \in \Pi $. The error analysis relies upon a concentrability assumption $\rho_0 P^{\pi_1} \dots P^{\pi_k} \leq c(k)\mu(s)$, with $\mu(s)$ the state marginal. Note that $c(k)$ might be infinite if the support of $\Pi$ is not contained in the state marginal of the batch. Using the coefficients $c(k)$ a concentrability coefficient is defined as: $C(\Pi) = (1-\gamma)^2\sum_{k=1}^{\infty}k \gamma^{k-1}c(k).$ The concentrability takes values between 1 und $\infty$, where 1 corresponds to the case that the batch data were collected by $\pi$ and $\Pi = \{\pi\}$ and $\infty$ to cases where $\Pi$ has support outside of $\pi$. Combining this Kumar et a. get a bound of the Bellman error for distribution constrained value iteration with the constrained Bellman operator $T^{\Pi}$: $\lim_{k \rightarrow \infty} \mathbb{E}_{\rho_0}[\vert V^{\pi_k}(s)- V^{*}(s)] \leq \frac{\gamma}{(1-\gamma^2)} [C(\Pi) \mathbb{E}_{\mu}[\max_{\pi \in \Pi}\mathbb{E}_{\pi}[\delta(s,a)] + \frac{1-\gamma}{\gamma}\alpha(\Pi) ] ]$, where $\delta(s,a)$ is the Bellman error. This presents the inherent batch RL trade-off between keeping policies close to the behavior policy of the batch (captured by $C(\Pi)$ and keeping $\Pi$ sufficiently large (captured by $\alpha(\Pi)$). It is finally proposed to use support sets to construct $\Pi$, that is $\Pi_{\epsilon} = \{\pi \vert \pi(a \vert s)=0 \text{ whenever } \beta(a \vert s) < \epsilon \}$. This amounts to the set of all policies that place probability on all non-negligible actions of the behavior policy. For this particular choice of $\Pi = \Pi_{\epsilon}$ the concentrability coefficient can be bounded. **Algorithm**: The algorithm has an actor critic style, where the Q-value to update the policy is taken to be the minimum over the ensemble. The support constraint to place at least some probability mass on every non negligible action from the batch is enforced via sampled MMD. The proposed algorithm is a member of the policy regularized algorithms as the policy is updated to optimize: $\pi_{\Phi} = \max_{\pi} \mathbb{E}_{s \sim B} \mathbb{E}_{a \sim \pi(\cdot \vert s)} [min_{j = 1 \dots, k} Q_j(s,a)] s.t. \mathbb{E}_{s \sim B}[MMD(D(s), \pi(\cdot \vert s))] \leq \epsilon$ The Bellman target to update the Q-functions is computed as the convex combination of minimum and maximum of the ensemble. **Experiments** The experiments use the Mujoco environments Halfcheetah, Walker, Hopper and Ant. Three scenarios of batch collection, always consisting of 1Mio. samples, are considered: - completely random behavior policy - partially trained behavior policy - optimal policy as behavior policy The experiments confirm that BEAR outperforms other off-policy methods like BCQ or KL-control. The ablations show further that the choice of MMD is crucial as it is sometimes on par and sometimes substantially better than choosing KL-divergence. ![]() |
[link]
Salman et al. combined randomized smoothing with adversarial training based on an attack specifically designed against smoothed classifiers. Specifically, they consider the formulation of randomized smoothing by Cohen et al. [1]; here, Gaussian noise around the input (adversarial or clean) is sampled and the classifier takes a simple majority vote. In [1], Cohen et al. show that this results in good bounds on robustness. In this paper, Salman et al. propose an adaptive attack against randomized smoothing. Essentially, they use a simple PGD attack to attack a smoothed classifier, i.e., maximize the cross entropy loss of the smoothed classifier. To make the objective tractable, Monte Carlo samples are used in each iteration of the PGD optimization. Based on this attack, they do adversarial training, with adversarial examples computed against the smoothed (and adversarially trained) classifier. In experiments, this approach outperforms the certified robustness by Cohen et al. on several datasets. [1] Jeremy M. Cohen, Elan Rosenfeld and J. Zico Kolter. Certified Adversarial Robustness via Randomized Smoothing. ArXiv, 1902.02918, 2019. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). ![]() |
[link]
In my view, the Lottery Ticket Hypothesis is one of the weirder and more mysterious phenomena of the last few years of Machine Learning. We've known for awhile that we can take trained networks and prune them down to a small fraction of their weights (keeping those weights with the highest magnitudes) and maintain test performance using only those learned weights. That seemed somewhat surprising, in that there were a lot of weights that weren't actually necessary to encoding the learned function, but, the thinking went, possibly having many times more weights than that was helpful for training, even if not necessary once a model is trained. The authors of the original Lottery Ticket paper came to the surprising realization that they could take the weights that were pruned to exist in the final network, re-initialize them (and only them) to the values they had during initial training, and perform almost as well as the final pruned model that had all weights active during training. And, performance using the specific weights and their particular initialization values is much higher than training a comparable topology of weights with random initial values. This paper out of Facebook AI adds another fascinating experiment to the pile of off evidence around lottery tickets: they test whether lottery tickets transfer *between datasets*, and they find that they often do (at least when the dataset on which the lottery ticket is found is more complex (in terms of in size, input complexity, or number of classes) than the dataset the ticket is being transferred to. Even more interestingly, they find that for sufficiently simple datasets, the "ticket" initialization pattern learned on a more complex dataset actually does *better* than ones learned on the simple dataset itself. They also find that tickets by and large transfer between SGD and Adam, so whatever kind of inductive bias or value they provide is general across optimizers in addition to at least partially general across datasets. https://i.imgur.com/H0aPjRN.png I find this result fun to think about through a few frames. The first is to remember that figuring out heuristics for initializing networks (as a function of their topology) was an important step in getting them to train at all, so while this result may at first seem strange and arcane, in that context it feels less surprising that there are still-better initialization heuristics out there, possibly with some kind of interesting theoretical justification to them, that humans simply haven't been clever enough to formalize yet, and have only discovered empirically through methods like this. This result is also interesting in terms of transfer: we've known for awhile that the representations learned on more complex datasets can convey general information back to smaller ones, but it's less easy to think about what information is conveyed by the topology and connectivity of a network. This paper suggests that the information is there, and has prompted me to think more about the slightly mind-bending question of how training models could lead to information compressed in this form, and how this information could be better understood. ![]() |
[link]
VQ-VAE is a Variational AutoEncoder that uses as its information bottleneck a discrete set of codes, rather than a continuous vector. That is: the encoder creates a downsampled spatial representation of the image, where in each grid cell of the downsampled image, the cell is represented by a vector. But, before that vector is passed to the decoder, it's discretized, by (effectively) clustering the vectors the network has historically seen, and substituting each vector with the center of the vector it's closest to. This has the effect of reducing the capacity of your information bottleneck, but without just pushing your encoded representation closer to an uninformed prior. (If you're wondering how the gradient survives this very much not continuous operation, the answer is: we just pretend that operation didn't exist, and imagine that the encoder produced the cluster-center "codebook" vector that the decoder sees). The part of the model that got a (small) upgrade in this paper is the prior distribution model that's learned on top of these latent representations. The goal of this prior is to be able to just sample images, unprompted, from the distribution of latent codes. Once we have a trained decoder, if we give it a grid of such codes, it can produce an image. But these codes aren't one-per-image, but rather a grid of many codes representing features in different part of the image. In order to generate a set of codes corresponding to a reasonable image, we can either generate them all at once, or else (as this paper does) use an autoregressive approach, where some parts of the code grid are generated, and then subsequent ones conditioned on those. In the original version of the paper, the autoregressive model used was a PixelCNN (don't have the space to fully explain that here, but, at a high level: a model that uses convolutions over previously generated regions to generate a new region). In this paper, the authors took inspiration from the huge rise of self-attention in recent years, and swapped that operation in in lieu of the convolutions. Self-attention has the nice benefit that you can easily have a global receptive range (each region being generated can see all other regions) which you'd otherwise need multiple layers of convolutions to accomplish. In addition, the authors add an additional layer of granularity: generating both a 32x32 and 64x64 grid, and using both to generate the decoded reconstruction. They argue that this allows one representation to focus on more global details, and the other on more precise ones. https://i.imgur.com/zD78Pp4.png The final result is the ability to generate quite realistic looking images, that at least are being claimed to be more diverse than those generated by GANs (examples above). I'm always a bit cautious of claims of better performance in the image-generation area, because it's all squinting at pixels and making up somewhat-reasonable but still arbitrary metrics. That said, it seems interesting and useful to be aware of the current relative capabilities of two of the main forms of generative modeling, and so I'd recommend this paper on that front, even if it's hard for me personally to confidently assess the improvements on prior art. ![]() |
[link]
Coming from the perspective of the rest of machine learning, a somewhat odd thing about reinforcement learning that often goes unnoticed is the fact that, in basically all reinforcement learning, performance of an algorithm is judged by its performance on the same environment it was trained on. In the parlance of ML writ large: training on the test set. In RL, most of the focus has historically been on whether automatic systems would be able to learn a policy from the state distribution of a single environment, already a fairly hard task. But, now that RL has had more success in the single-environment case, there comes the question: how can we train reinforcement algorithms that don't just perform well on a single environment, but over a range of environments. One lens onto this question is that of meta-learning, but this paper takes a different approach, and looks at how straightforward regularization techniques pulled from the land of supervised learning can (or can't straightforwardly) be applied to reinforcement learning. In general, the regularization techniques discussed here are all ways of reducing the capacity of the model, and preventing it from overfitting. Some ways to reduce capacity are: - Apply L2 weight penalization - Apply dropout, which handicaps the model by randomly zeroing out neurons - Use Batch Norm, which uses noisy batch statistics, and increases randomness in a way that, similar to above, deteriorates performance - Use an information bottleneck: similar to a VAE, this approach works by learning some compressed representation of your input, p(z|x), and then predicting your output off of that z, in a way that incentivizes your z to be informative (because you want to be able to predict y well) but also penalizes too much information being put in it (because you penalize differences between your learned p(z|x) distribution and an unconditional prior p(z) ). This pushes your model to use its conditional-on-x capacity wisely, and only learn features if they're quite valuable in predicting y However, the paper points out that there are some complications in straightforwardly applying these techniques to RL. The central one is the fact that in (most) RL, the distribution of transitions you train on comes from prior iterations of your policy. This means that a noisier and less competent policy will also leave you with less data to train on. Additionally, using a noisy policy can increase variance, both by making your trained policy more different than your rollout policy (in an off-policy setting) and by making your estimate of the value function higher-variance, which is problematic because that's what you're using as a target training signal in a temporal difference framework. The paper is a bit disconnected in its connection between justification and theory, and makes two broad, mostly distinct proposals: 1. The most successful (though also the one least directly justified by the earlier-discussed theoretical difficulties of applying regularization in RL) is an information bottleneck ported into a RL setting. It works almost the same as the classification-model one, except that you're trying to increase the value of your actions given compressed-from-state representation z, rather than trying to increase your ability to correctly predict y. The justification given here is that it's good to incentivize RL algorithms in particular to learn simpler, more compressible features, because they often have such poor data and also training signal earlier in training 2. SNI (Selective Noise Injection) works by only applying stochastic aspects of regularization (sampling from z in an information bottleneck, applying different dropout masks, etc) to certain parts of the training procedure. In particular, the rollout used to collect data is non-stochastic, removing the issue of noisiness impacting the data that's collected. They then do an interesting thing where they calculate a weighted mixture of the policy update with a deterministic model, and the update with a stochastic one. The best performing of these that they tested seems to have been a 50/50 split. This is essentially just a knob you can turn on stochasticity, to trade off between the regularizing effect of noise and the variance-increasing-negative effect of it. https://i.imgur.com/fi0dHgf.png https://i.imgur.com/LLbDaRw.png Based on my read of the experiments in the paper, the most impressive thing here is how well their information bottleneck mechanism works as a way to improve generalization, compared to both the baseline and other regularization approaches. It does look like there's some additional benefit to SNI, particularly in the CoinRun setting, but very little in the MultiRoom setting, and in general the difference is less dramatic than the difference from using the information bottleneck. ![]() |
[link]
Self-Supervised Learning is a broad category of approaches whose goal is to learn useful representations by asking networks to perform constructed tasks that only use the content of a dataset itself, and not external labels. The idea with these tasks is to design tasks such that solving them requires the network to have learned useful Some examples of this approach include predicting the rotation of rotated images, reconstructing color from greyscale, and, the topic of this paper, maximizing mutual information between different areas of the image. The hope behind this last approach is that if two areas of an image are generated by the same set of underlying factors (in the case of a human face: they're parts of the same person's face), then a representation that correctly captures those factors for one area will give you a lot of information about the representation of the other area. Historically, this conceptual desire for representations that are mutually informative has been captured by mutual information. If we define the representation distribution over the data of area 1 as p(x) and area 2 as q(x), the mutual information is the KL divergence between the joint distribution of these two distributions and the product of their marginals. This is an old statistical intuition: the closer the joint is to the product of marginals, the closer the variables are to independent; the farther away, the closer they are to informationally identical. https://i.imgur.com/2SzD5d5.png This paper argues that the presence of the KL divergence in this mutual information formulation impedes the ability of networks to learn useful representations. This argument is theoretically based on a result from a recent paper (which for the moment I'll just take as foundation, without reading it myself) that empirical lower-bound measurements of mutual information, of the kind used in these settings, are upper bounded by log(n) where n is datapoints. Our hope in maximizing a lower bound to any quantity is that the bound is fairly tight, since that means that optimizing a network to push upward a lower bound actually has the effect of pushing the actual value up as well. If the lower bound we can estimate is constrained to be far below the actual lower bound in the data, then pushing it upward doesn't actually require the value to move upward. The authors identify this as a particular problem in areas where the underlying mutual information of the data is high, such as in videos where one frame is very predictive of the next, since in those cases the constraint imposed by the dataset size will be small relative to the actual possible maximum mutual information you could push your network to achieve. https://i.imgur.com/wm39mQ8.png Taking a leaf out of the GAN literature, the authors suggest keeping replacing the KL divergence component of mutual information and replacing it with the Wasserstein Distance; otherwise known as the "earth-mover distance", the Wasserstein distance measures the cost of the least costly way to move probability mass from one distribution to another, assuming you're moving that mass along some metric space. A nice property of the Wasserstein distance, in both GANs and in this application) is that they don't saturate quite as quickly: the value of a KL divergence can shoot up if the distributions are even somewhat different, making it unable to differentiate between distributions that are somewhat and very far away, whereas a Wasserstein distance continues to have more meaningful signal in that regime. In the context of the swap for mutual information, the authors come up with the "Wasserstein Dependency Measure", which is just the Wasserstein Distance between the joint distributions and the product of the marginals. https://i.imgur.com/3s2QRRz.png In practice, they use the dual formulation of the Wasserstein distance, which amounts to applying a (neural network) function f(x) to values from both distributions, optimizing f(x) so that the values are far apart, and using that distance as your training signal. Crucially, this function has to be relatively smooth in order for the dual formulation to work: in particular it has to have a small Lipschitz value (meaning its derivatives are bounded by some value). Intuitively, this has the effect of restricting the capacity of the network, which is hoped to incentivize it to use its limited capacity to represent true factors of variation, which are assumed to be the most compact way to represent the data. Empirically, the authors found that their proposed Wasserstein Dependency Measure (with a slight variation applied to reduce variance) does have the predicted property of performing better for situations where the native mutual information between two areas is high. I found the theoretical points of this paper interesting, and liked the generalization of the idea of Wasserstein distances from GANs to a new area. That said, I wish I had a better mechanical sense for how it ground out in actual neural network losses: this is partially just my own lack of familiarity with how e.g. mutual information losses are actually formulated as network objectives, but I would have appreciated an appendix that did a bit more of that mapping between mathematical intuition and practical network reality. ![]() |
[link]
The Lottery Ticket Hypothesis is the idea that you can train a deep network, set all but a small percentage of its high-magnitude weights to zero, and retrain the network using the connection topology of the remaining weights, but only if you re-initialize the unpruned weights to the the values they had at the beginning of the first training. This suggests that part of the value of training such big networks is not that we need that many parameters to use their expressive capacity, but that we need many “draws” from the weight and topology distribution to find initial weight patterns that are well-disposed for learning. This paper out of Uber is a refreshingly exploratory experimental work that tries to understand the contours and contingencies of this effect. Their findings included: - The pruning criteria used in the original paper, where weights are kept according to which have highest final magnitude, works well. However, an alternate criteria, where you keep the weights that have increased the most in magnitude, works just as well and sometimes better. This makes a decent amount of sense, since it seems like we’re using magnitude as a signal of “did this weight come to play a meaningful role during training,” and so weights whose influence increased during training fall in that category, regardless of their starting point https://i.imgur.com/wTkNBod.png - The authors’ next question was: other than just re-initialize weights to their initial values, are there other things we can do that can capture all or part of the performance effect? The answer seems to be yes; they found that the most important thing seems to be keeping the sign of the weights aligned with what it was at its starting point. As long as you do that, redrawing initial weights (but giving them the right sign), or re-setting weights to a correctly signed constant value, both work nearly as well as the actual starting values https://i.imgur.com/JeujUr3.png - Turning instead to the weights on the pruning chopping block, the authors find that, instead of just zero-ing out all pruned weights, they can get even better performance if they zero the weights that moved towards zero during training, and re-initialize (but freeze) the weights that moved away from zero during training. The logic of the paper is “if the weight was trying to move to zero, bring it to zero, otherwise reinitialize it”. This performance remains high at even lower levels of training than does the initial zero-masking result - Finally, the authors found that just by performing the masking (i.e. keeping only weights with large final values), bringing those back to their values, and zeroing out the rest, *and not training at all*, they were able to get 40% test accuracy on MNIST, much better than chance. If they masked according to “large weights that kept the same sign during training,” they could get a pretty incredible 80% test accuracy on MNIST. Way below even simple trained models, but, again, this model wasn’t *trained*, and the only information about the data came in the form of a binary weight mask This paper doesn’t really try to come up with explanations that wrap all of these results up neatly with a bow, and I really respect that. I think it’s good for ML research culture for people to feel an affordance to just run a lot of targeted experiments aimed at explanation, and publish the results even if they don’t quite make sense yet. I feel like on this problem (and to some extent in machine learning generally), we’re the blind men each grabbing at one part of an elephant, trying to describe the whole. Hopefully, papers like this can bring us closer to understanding strange quirks of optimization like this one ![]() |
[link]
As per the “holistic” in the paper title, the goal of this work is to take a suite of existing work within semi-supervised learning, and combine many of its ideas into one training pipeline that can (with really impressive empirical success) leverage the advantages of those different ideas. The core premise of supervised learning is that, given true-label training signal from a small number of labels, you can leverage large amounts of unsupervised data to improve your model. A central intuition of many of these methods is that, even if you don’t know the class of a given sample, you know it *has* a class, and you can develop a loss by pushing your model to predict the class for an example and a modified or perturbed version of that example, since, if you have a prior belief that that modification should not change your true class label, then your unlabeled data point should have the same class prediction both times. Entropy minimization is built off similar notions: although we don’t know a point’s class, we know it must have one, and so we’d like our model to make a prediction that puts more of its weight on a single class, rather than be spread out, since we know the “correct model” will be a very confident prediction of one class, though we don’t know which it is. These methods will give context and a frame of mind for understanding the techniques merged together into the MixMatch approach. At its very highest level, MixMatch’s goal is to take in a dataset of both labeled and unlabeled data, and produce a training set of inputs, predictions, and (occasionally constructed or modified labels) to calculate a model update loss from. https://i.imgur.com/6lHQqMD.png - First, for each unlabeled example in the dataset, we produce K different augmented versions of that image (by cropping it, rotating it, flipping it, etc). This is in the spirit of the consistency loss literature, where you want your model to make the same prediction across augmentations - Do the same augmentation for each labeled example, but only once per input, rather than k times - Run all of your augmented examples through your model, and take the average of their predictions. This is based on the idea that the average of the predictions will be a lower variance, more stable pseudo-target to pull each of the individual predictions towards. Also in the spirit of making something more shaped like a real label, they undertake a sharpening step, turning down the temperature of the averaged distribution. This seems like it would have the effect of more confidently pulling the original predictions towards a single “best guess” label - At this point, we have a set of augmented labeled data, with a true label, and also a set of augmented unlabeled data, with a label based off of an averaged and sharpened best guess from the model over different modifications. At this point, the pipeline uses something called “MixUp” (on which there is a previous paper, so I won’t dive into it too much here), which takes pairs of data points, calculates a convex combination of the inputs, runs it through the model, and uses as the loss-function target a convex combination of the outputs. So, in the simple binary case, if you have a positive and negatively labeled image and sample a combination parameter of 0.75, you have an image that is 0.75 positive, 0.25 negative, and the new label that you’re calculating cross entropy loss against is 0.75. - MixMatch generates pairs for its MixUp calculation by mixing (heh) labeled and unlabeled data together, and pairing each labeled and unlabeled pair with one observation from the merged set. At this point, we have combined inputs, and we have combined labels, and we can calculate loss between them With all of these methods combined, this method takes the previous benchmark of 38% error, for a CIFAR dataset with only 250 labels, and drops that to 11%, which is a pretty astonishing improvement in error rate. After performing an ablation study, they find that MixUp itself, temperature sharpening, and calculating K>1 augmentations of unlabeled data rather than K=1 are the strongest value-adds; it doesn’t appear like there’s that much difference that comes from mixing between unlabeled and labeled for the MixUp pairs. ![]() |