[link]
This paper argues that, in semi-supervised learning, it's suboptimal to use the same weight for every example (as happens implicitly when the unsupervised component of the loss for each example is just added together directly). Instead, it tries to learn a weight for each specific data example, through a meta-learning-esque process.

The form of semi-supervised learning being discussed here is label-based consistency loss, where an unlabeled image is augmented and run through the current version of the model, and the model is optimized to try to induce the same label for the augmented image as for the unaugmented one. The premise of the authors' argument for learning per-example weights is that, ideally, you would enforce consistency loss less on examples where the model is unconfident in its label prediction for the unlabeled example.

As a way to do this, the authors suggest learning a vector of parameters, one for each example in the dataset, where element i of the vector is the weight on element i of the dataset in the summed-up unsupervised loss. They do this via a two-step process: first they optimize the parameters of the network given the example weights, and then they optimize the example weights themselves. To optimize the example weights, they calculate the gradient of those weights with respect to the post-training validation loss, which requires backpropagating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced a better validation loss). This requires calculating the inverse Hessian (the matrix of second derivatives of the loss), which is, generally speaking, quite a costly operation for networks with huge numbers of parameters. To lessen this cost, they pretend that only the final layer of weights in the network is being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably, those are the only ones we have enough information about to upweight or downweight.

With this model, the authors achieve modest improvements: performance comparable to, or within-error-bounds better than, the current state of the art, FixMatch.

Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach:

- They don't seem to have benchmarked against the simpler baseline of using some inverse of Dropout-estimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example"
- If the presumed need for this is the model's lack of certainty, that's a non-stationary property that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target
- Despite using techniques rooted in meta-learning, it doesn't seem like this model learns anything generalizable: it's learning index-based weights on specific examples, which doesn't give it anything useful it can do with a new data point it wasn't specifically trained on

Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme).
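To make the two-step procedure concrete, here's a minimal sketch of the idea as I understand it (not the authors' implementation): a toy linear model, a toy consistency loss, and per-example weights updated by differentiating a labeled validation loss through a single "virtual" inner update. This stands in for the full backprop-through-training with the final-layer inverse-Hessian approximation; all names and sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n_unlabeled, dim = 32, 10
x_unlabeled = torch.randn(n_unlabeled, dim)
x_val, y_val = torch.randn(16, dim), torch.randint(0, 2, (16,))

model = torch.nn.Linear(dim, 2)
per_example_w = torch.zeros(n_unlabeled, requires_grad=True)   # one learnable weight per unlabeled example
opt_w = torch.optim.SGD([per_example_w], lr=0.1)
lr_inner = 0.05

def consistency_loss(logits_aug, logits_clean):
    # per-example squared difference between predictions on augmented vs. clean inputs
    return ((logits_aug.softmax(-1) - logits_clean.softmax(-1)) ** 2).sum(-1)

for step in range(10):
    # Step 1: inner update of the model, weighting each example's unsupervised loss
    logits_clean = model(x_unlabeled)
    logits_aug = model(x_unlabeled + 0.1 * torch.randn_like(x_unlabeled))  # toy "augmentation"
    weights = torch.sigmoid(per_example_w)
    inner_loss = (weights * consistency_loss(logits_aug, logits_clean)).mean()

    grads = torch.autograd.grad(inner_loss, list(model.parameters()), create_graph=True)
    # "virtual" post-update parameters, kept in the graph so the validation loss
    # can be differentiated with respect to the per-example weights
    new_params = [p - lr_inner * g for p, g in zip(model.parameters(), grads)]

    # Step 2: outer update of the per-example weights on labeled validation data
    val_logits = torch.nn.functional.linear(x_val, new_params[0], new_params[1])
    val_loss = torch.nn.functional.cross_entropy(val_logits, y_val)
    opt_w.zero_grad()
    val_loss.backward()
    opt_w.step()

    # Commit the inner update to the model (outside the weight-learning graph)
    with torch.no_grad():
        for p, new_p in zip(model.parameters(), new_params):
            p.copy_(new_p)
```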
[link]
Bechtle et al. propose meta-learning via learned loss ($ML^3$) and derive and empirically evaluate the framework on classification, regression, model-based and model-free reinforcement learning tasks. The problem is formalized as learning the parameters $\Phi$ of a meta loss function $M_\Phi$ that computes loss values $L_{learned} = M_{\Phi}(y, f_{\theta}(x))$. Following the outer-inner-loop meta-algorithm design, the learned loss $L_{learned}$ is used to update the parameters of the learner in the inner loop via gradient descent: $\theta_{new} = \theta - \alpha \nabla_{\theta}L_{learned}$.

The key contribution of the paper is the way a differentiable learning signal for the loss parameters $\Phi$ is constructed. The framework requires specifying a task loss $L_T$ at meta-train time, which can be, for example, the mean squared error for regression tasks. After updating the model parameters to $\theta_{new}$, the task loss is used to measure how much learning progress has been made with loss parameters $\Phi$. The key insight is the decomposition of $\nabla_{\Phi} L_T(y, f_{\theta_{new}})$ via the chain rule:

$\nabla_{\Phi} L_T(y, f_{\theta_{new}}) = \nabla_f L_T \, \nabla_{\theta_{new}}f_{\theta_{new}} \, \nabla_{\Phi} \theta_{new} = \nabla_f L_T \, \nabla_{\theta_{new}}f_{\theta_{new}} \, \nabla_{\Phi}\left[\theta - \alpha \nabla_{\theta} \mathbb{E}[M_{\Phi}(y, f_{\theta}(x))]\right]$

This allows the loss parameters to be updated with gradient descent as $\Phi_{new} = \Phi - \eta \nabla_{\Phi} L_T(y, f_{\theta_{new}})$. These update rules yield the following $ML^3$ algorithm for supervised learning tasks: https://i.imgur.com/tSaTbg8.png

For reinforcement learning, the task loss is the expected future reward of the policy $\pi_{\theta_{new}}$ induced by the update; for model-based RL it is computed with respect to the approximate dynamics model, and for the model-free case via a system-independent surrogate: $L_T(\pi_{\theta_{new}}) = -\mathbb{E}_{\pi_{\theta_{new}}} \left[ R(\tau_{\theta_{new}}) \log \pi_{\theta_{new}}(\tau_{\theta_{new}})\right]$.

The framework further allows extra information to be incorporated via an additional loss term $L_{extra}$, by considering the augmented task loss $\beta L_T + \gamma L_{extra}$, with weights $\beta, \gamma$, at train time. Possible extra loss terms are used to add physics priors, to encourage exploratory behavior, or to incorporate expert demonstrations. The experiments show that this information, which is unavailable at test time, is retained in the shape of the learned loss landscape.

The paper is packed with insightful experiments and shows that the learned loss function:

- yields better accuracies on train and test tasks in regression and classification
- generalizes well and speeds up learning in model-based RL tasks
- yields better generalization and faster learning in model-free RL
- is agnostic across a bunch of evaluated architectures (2, 3, 4, 5 layers)
- with incorporated extra knowledge yields better performance than without, and is superior to alternative approaches like iLQR in a MountainCar task.

The paper introduces a promising alternative to MAML-like approaches that learn the model parameters: learning the loss parameters instead. It would be interesting to see whether the learned loss function generalizes better than learned model parameters to a broader distribution of tasks, like the Meta-World tasks.
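As a rough illustration of the inner/outer updates for regression (a minimal sketch under toy assumptions, not the authors' code): the learned loss $M_\Phi$ is a small network over $(y, f_\theta(x))$, the inner step updates $\theta$ with it while keeping the computation graph, and the outer step updates $\Phi$ by backpropagating the task loss (MSE here) through that inner step.

```python
import torch

torch.manual_seed(0)
x = torch.randn(64, 1)
y = 3 * x + 0.1 * torch.randn(64, 1)                   # toy regression target

f_theta = torch.nn.Linear(1, 1)                        # learner
M_phi = torch.nn.Sequential(                           # learned loss: maps (y, f(x)) -> per-sample loss
    torch.nn.Linear(2, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 1), torch.nn.Softplus(),       # Softplus keeps loss values positive
)
opt_phi = torch.optim.Adam(M_phi.parameters(), lr=1e-3)
alpha = 0.1                                            # inner-loop learning rate

for outer_step in range(100):
    # Inner loop: update theta with the *learned* loss, keeping the graph
    pred = f_theta(x)
    L_learned = M_phi(torch.cat([y, pred], dim=-1)).mean()
    grads = torch.autograd.grad(L_learned, list(f_theta.parameters()), create_graph=True)
    theta_new = [p - alpha * g for p, g in zip(f_theta.parameters(), grads)]

    # Outer loop: evaluate the *task* loss (MSE) at theta_new and update phi through the inner step
    pred_new = torch.nn.functional.linear(x, theta_new[0], theta_new[1])
    L_task = torch.nn.functional.mse_loss(pred_new, y)
    opt_phi.zero_grad()
    L_task.backward()
    opt_phi.step()

    # Commit the inner update to the learner's parameters
    with torch.no_grad():
        for p, p_new in zip(f_theta.parameters(), theta_new):
            p.copy_(p_new)
```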
[link]
A central problem in the domain of reinforcement learning is how to incentivize exploration and diversity of experience, since RL agents can typically only learn from states they visit, and it can often be the case that states with high reward don't have an obvious trail of high-reward states leading to them, meaning that algorithms naively optimizing for reward will be relatively unlikely to discover them. One potential way to promote exploration is to train an ensemble of agents, and have part of your loss incentivize diversity in the behavior of those agents. Intuitively, an incentive for diversity will push policies away from one another, which will force them to behave differently, and thus reach a wider range of different states.

Current work in diversity tends to penalize, for each agent, the average pairwise distance between it and every other agent. The authors of this paper have two critiques of this approach:

1. Pure pairwise distance doesn't fully capture the notion of diversity we might want, since if you have policies that are clustered together into multiple clusters, average pairwise distance can be increased by moving the clusters farther apart, without having to break up the clusters themselves
2. Having the diversity term be calculated for each policy independently can lead to "cycling" behavior, where policies move in ways that increase distance at each individual step, but don't do so once every agent's independent step is taken into account

As an alternative, they propose calculating the pairwise kernel similarity between all policies (where each policy is represented as the average action probabilities it returns across a sample of states), and calculating the determinant of that matrix. The authors claim that this represents a better measure of full-population diversity. I can't fully connect the dots on this intuitively, but what I think they're saying is: the determinant of the kernel matrix tells you something about the effective dimensionality spanned by the different policies. In the same way that a low-rank matrix tells you that some vectors within it can be nearly represented by linear combinations of others, a low value of the kernel determinant means that some policies can be sufficiently represented by combinations of other policies, such that they aren't really adding diversity value to the ensemble.

https://i.imgur.com/CmlGsNP.png

Another contribution of this paper is an interesting bandit-based way of determining when to incentivize diversity vs. focus on pure reward. The diversity term in the loss is governed by a lambda parameter, and the paper's model sets up Thompson sampling to determine what the value of that parameter should be at each training iteration. This bandit setup works by starting out uncertain over whether to include diversity in the loss, and building a model of whether reward tends to increase during steps where diversity is used. Over time, if diversity consistently doesn't produce benefits, the sampler will tend more towards excluding it from the loss. This is just a really odd, different idea, that I've never heard of before in the context of hyperparameter scheduling during training. I will say that I'm somewhat confused about the window of time the bandit uses for calculating rewards. Is it the set of trajectories used in a single training batch, or longer than that?
Questions aside, they do show fairly convincingly that the adaptive parameter schedule outperforms a fixed schedule, though they don't test it against simpler forms of annealing, so I'm less clear on whether it would outperform those. Overall, I still have some confusions about the method proposed by this paper, but it seems to be approaching the question of exploration from an interesting direction, and I'd enjoy trying to understand it further in future.
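For intuition on the determinant-based diversity score described above, here's a minimal sketch (my reading of the idea, not the paper's implementation): embed each policy as its average action probabilities over a shared batch of probe states, build an RBF kernel matrix over those embeddings, and take its log-determinant, which collapses toward negative infinity as policies become near-duplicates. The policy architecture, kernel, and sizes are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n_policies, n_states, state_dim, n_actions = 5, 128, 8, 4
states = torch.randn(n_states, state_dim)                 # shared batch of probe states

policies = [torch.nn.Linear(state_dim, n_actions) for _ in range(n_policies)]

def behavioral_embedding(policy):
    # a policy is summarized by its average action distribution over the probe states
    return policy(states).softmax(dim=-1).mean(dim=0)      # shape: (n_actions,)

embeddings = torch.stack([behavioral_embedding(p) for p in policies])   # (n_policies, n_actions)

# RBF kernel between policy embeddings: similar policies -> entries near 1
sq_dists = torch.cdist(embeddings, embeddings).pow(2)
K = torch.exp(-sq_dists / 2.0)

# The log-determinant is a numerically safer stand-in for the raw determinant;
# near-duplicate policies make K nearly singular, pushing the score toward -inf.
diversity_score = torch.logdet(K + 1e-6 * torch.eye(n_policies))
print(diversity_score)
```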

[link]
This paper deals with the question of what / how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained. When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels, as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension, Rademacher complexity, and uniform stability are bad explanations for the generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters, where $n$ is the number of samples and $d$ is the dimension of each sample, which can easily fit any labeling. (Finite sample expressivity.) See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c = 0$ is a perfect labeling and $p_c = 1$ is totally random) into account: if $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, but not specifically on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
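The randomization test at the heart of the paper is easy to reproduce in miniature. Below is a toy sketch (sizes, architecture, and optimizer are my own assumptions, not the paper's setup): a small MLP trained on Gaussian-noise inputs with completely random labels still drives training error to roughly zero.

```python
import torch

torch.manual_seed(0)
n, d, n_classes = 256, 32, 10
x = torch.randn(n, d)                      # "images" that are pure Gaussian noise
y = torch.randint(0, n_classes, (n,))      # completely random labels

mlp = torch.nn.Sequential(
    torch.nn.Linear(d, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, n_classes),
)
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for step in range(2000):                   # full-batch training, purely to memorize
    loss = torch.nn.functional.cross_entropy(mlp(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

train_acc = (mlp(x).argmax(dim=-1) == y).float().mean().item()
print(f"training accuracy on random labels: {train_acc:.3f}")   # typically close to 1.0
```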
[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact) that pruned networks can achieve accuracy comparable to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks.

Given these two facts (the usefulness of pruning, and the success of weight rewinding), the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training the remaining weights from the values they had at pruning time, keeping the final learning rate of the network constant. The authors find that:

1. Weight rewinding, where you rewind weights to *near* their starting value and then retrain using the learning rates of early in training, outperforms fine-tuning from the values the weights had when you pruned; but, also
2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, is actually the most effective for a given amount of training time/search cost

To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all; that value is just outstripped by the benefit of rolling back to the old learning rates.

All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper.
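To make the three retraining options concrete, here's a minimal sketch (not the paper's code; the model, schedules, and one-step-per-learning-rate training loop are toy assumptions): magnitude-prune a trained network, then either fine-tune at the final low learning rate, rewind the surviving weights to an early snapshot and replay the schedule, or keep the weights and just replay the early high-learning-rate schedule.

```python
import copy
import torch

torch.manual_seed(0)

def magnitude_prune(model, fraction=0.2):
    # zero out the smallest-magnitude weights; return masks so they stay zero
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() < 2:                      # skip biases
                continue
            k = max(1, int(fraction * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()
            p.mul_(masks[name])
    return masks

def train(model, lr_schedule, masks=None):
    # toy stand-in for training: one full-batch SGD step per learning rate in the schedule
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    for lr in lr_schedule:
        loss = torch.nn.functional.cross_entropy(model(x), y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.add_(p.grad, alpha=-lr)
                if masks is not None and name in masks:
                    p.mul_(masks[name])          # keep pruned weights at zero
    return model

full_schedule = [0.1] * 50 + [0.01] * 50         # "early" high LR, then "late" low LR
model = torch.nn.Sequential(torch.nn.Linear(10, 2))
early_snapshot = copy.deepcopy(model.state_dict())   # stands in for a checkpoint from early in training
train(model, full_schedule)
masks = magnitude_prune(model)

# 1) Fine-tuning: keep the trained weights, keep only the final (low) learning rate
finetuned = train(copy.deepcopy(model), [0.01] * 100, masks)

# 2) Weight rewinding: rewind surviving weights to the early snapshot, replay the whole schedule
rewound = copy.deepcopy(model)
rewound.load_state_dict(early_snapshot)
with torch.no_grad():
    for name, p in rewound.named_parameters():
        if name in masks:
            p.mul_(masks[name])                  # keep the pruning mask found after training
weight_rewound = train(rewound, full_schedule, masks)

# 3) Learning rate rewinding: keep the trained weights, but replay the early high-LR schedule
lr_rewound = train(copy.deepcopy(model), full_schedule, masks)
```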