Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments

Maruan Al-Shedivat and Trapit Bansal and Yuri Burda and Ilya Sutskever and Igor Mordatch and Pieter Abbeel

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.AI

**First published:** 2017/10/10 (5 years ago)

**Abstract:** Ability to continuously learn and adapt from limited experience in
nonstationary environments is an important milestone on the path towards
general intelligence. In this paper, we cast the problem of continuous
adaptation into the learning-to-learn framework. We develop a simple
gradient-based meta-learning algorithm suitable for adaptation in dynamically
changing and adversarial scenarios. Additionally, we design a new multi-agent
competitive environment, RoboSumo, and define iterated adaptation games for
testing various aspects of continuous adaptation strategies. We demonstrate
that meta-learning enables significantly more efficient adaptation than
reactive baselines in the few-shot regime. Our experiments with a population of
agents that learn and compete suggest that meta-learners are the fittest.
more
less

Maruan Al-Shedivat and Trapit Bansal and Yuri Burda and Ilya Sutskever and Igor Mordatch and Pieter Abbeel

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.AI

[link]
DeepMind’s recently released paper (one of a boatload coming out in the wake of ICLR, which just finished in Vancouver) addresses the problem of building an algorithm that can perform well on tasks that don’t just stay fixed in their definition, but instead evolve and change, without giving the agent a chance to re-train in the middle. An example of this, is one used at various points in the paper: of an agent trying to run East, that finds two of its legs (a different two each time) slowly less functional. The theoretical framework they use to approach this problem is that of meta learning. Meta Learning is typically formulated as: how can I learn to do well on a new task, given only a small number of examples of that task? That’s why it’s called “meta”: it’s an extra, higher-level optimization loop applied around the process of learning. Typical learning learns parameters of some task, meta learning learns longer-scale parameters that make the short-scale, typical learning work better. Here, the task that evolves and changes over time (i.e. a nonstationary task) is seen as a close variant of the the multi-task problem. And, so, the hope is that a model that can quickly adapt to arbitrary new tasks can also be used to learn the ability to adapt to a gradually changing task environment. The meta learning algorithm that got most directly adapted for this paper is MAML: Model Agnostic Meta Learning. This algorithm works by, for a large number of tasks, initializing the model at some parameter set theta, evaluating the loss for a few examples on that task, and moving the gradients from the initialization theta, to a task-specific parameter set phi. Then, it calculating the “test set” performance of the one-step phi parameters, on the task. But then - the crucial thing here - the meta learning model updates its initialization parameters, theta. So, the meta learning model is learning a set of parameters that provides a good jumping off point for any given task within the distribution of tasks the model is trained on. In order to do this well, the theta parameters need to both 1) learn any general information, shared across all tasks, and 2) position the parameters such that an initial update step moves the model in the profitable direction. They adapted this idea, of training a model that could quickly update to multiple tasks, to the environment of a slowly/continuously changing environment, where certain parameters of the task the agent is facing. In this formulation, our set of tasks is no longer random draws from the distribution of possible tasks, but a smooth, Markov-walk gradient over tasks. The main change that the authors made to the original MAML algorithm was to say that each general task would start at theta, but then, as that task gradually evolved, it would perform multiple updates: theta to phi1, phi1 to phi2, and so on. The original theta parameters would then be updated according to a similar principle as the MAML parameters: so as to make the loss, summed over the full non-stationary task (notionally composed of many little sub-tasks) is as low as possible. |

Comparing Rewinding and Fine-tuning in Neural Network Pruning

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

Renda, Alex and Frankle, Jonathan and Carbin, Michael

International Conference on Learning Representations - 2020 via Local Bibsonomy

Keywords: dblp

[link]
This is an interestingly pragmatic paper that makes a super simple observation. Often, we may want a usable network with fewer parameters, to make our network more easily usable on small devices. It's been observed (by these same authors, in fact), that pruned networks can achieve comparable weights to their fully trained counterparts if you rewind and retrain from early in the training process, to compensate for the loss of the (not ultimately important) pruned weights. This observation has been dubbed the "Lottery Ticket Hypothesis", after the idea that there's some small effective subnetwork you can find if you sample enough networks. Given these two facts - the usefulness of pruning, and the success of weight rewinding - the authors explore the effectiveness of various ways to train after pruning. Current standard practice is to prune low-magnitude weights, and then continue training remaining weights from values they had at pruning time, keeping the final learning rate of the network constant. The authors find that: 1. Weight rewinding, where you rewind weights to *near* their starting value, and then retrain using the learning rates of early in training, outperforms fine tuning from the place weights were when you pruned but, also 2. Learning rate rewinding, where you keep weights as they are, but rewind learning rates to what they were early in training, are actually the most effective for a given amount of training time/search cost To me, this feels a little bit like burying the lede: the takeaway seems to be that when you prune, it's beneficial to make your network more "elastic" (in the metaphor-to-neuroscience sense) so it can more effectively learn to compensate for the removed neurons. So, what was really valuable in weight rewinding was the ability to "heat up" learning on a smaller set of weights, so they could adapt more quickly. And the fact that learning rate rewinding works better than weight rewinding suggests that there is value in the learned weights after all, that value is just outstripped by the benefit of rolling back to old learning rates. All in all, not a super radical conclusion, but a useful and practical one to have so clearly laid out in a paper. |

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Lasse Espeholt and Hubert Soyer and Remi Munos and Karen Simonyan and Volodymir Mnih and Tom Ward and Yotam Doron and Vlad Firoiu and Tim Harley and Iain Dunning and Shane Legg and Koray Kavukcuoglu

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI

**First published:** 2018/02/05 (4 years ago)

**Abstract:** In this work we aim to solve a large collection of tasks using a single
reinforcement learning agent with a single set of parameters. A key challenge
is to handle the increased amount of data and extended training time. We have
developed a new distributed agent IMPALA (Importance Weighted Actor-Learner
Architecture) that not only uses resources more efficiently in single-machine
training but also scales to thousands of machines without sacrificing data
efficiency or resource utilisation. We achieve stable learning at high
throughput by combining decoupled acting and learning with a novel off-policy
correction method called V-trace. We demonstrate the effectiveness of IMPALA
for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the
DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available
Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our
results show that IMPALA is able to achieve better performance than previous
agents with less data, and crucially exhibits positive transfer between tasks
as a result of its multi-task approach.
more
less

Lasse Espeholt and Hubert Soyer and Remi Munos and Karen Simonyan and Volodymir Mnih and Tom Ward and Yotam Doron and Vlad Firoiu and Tim Harley and Iain Dunning and Shane Legg and Koray Kavukcuoglu

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI

[link]
This reinforcement learning paper starts with the constraints imposed an engineering problem - the need to scale up learning problems to operate across many GPUs - and ended up, as a result, needing to solve an algorithmic problem along with it. In order to massively scale up their training to be able to train multiple problem domains in a single model, the authors of this paper implemented a system whereby many “worker” nodes execute trajectories (series of actions, states, and reward) and then send those trajectories back to a “learner” node, that calculates gradients and updates a central policy model. However, because these updates are queued up to be incorporated into the central learner, it can frequently happen that the policy that was used to collect the trajectories is a few steps behind from the policy on the central learner to which its gradients will be applied (since other workers have updated the learner since this worker last got a policy download). This results in a need to modify the policy network model design accordingly. IMPALA (Importance Weighted Actor Learner Architectures) uses an “Actor Critic” model design, which means you learn both a policy function and a value function. The policy function’s job is to choose which actions to take at a given state, by making some higher probability than others. The value function’s job is to estimate the reward from a given state onward, if a certain policy p is followed. The value function is used to calculate the “advantage” of each action at a given state, by taking the reward you receive through action a (and reward you expect in the future), and subtracting out the value function for that state, which represents the average future reward you’d get if you just sampled randomly from the policy from that point onward. The policy network is then updated to prioritize actions which are higher-advantage. If you’re on-policy, you can calculate a value function without needing to explicitly calculate the probabilities of each action, because, by definition, if you take actions according to your policy probabilities, then you’re sampling each action with a weight proportional to its probability. However, if your actions are calculated off-policy, you need correct for this, typically by calculating an “importance sampling” ratio, that multiplies all actions by a probability under the desired policy divided by the probability under the policy used for sampling. This cancels out the implicit probability under the sampling policy, and leaves you with your actions scaled in proportion to their probability under the policy you’re actually updating. IMPALA shares the basic structure of this solution, but with a few additional parameters to dynamically trade off between the bias and variance of the model. The first parameter, rho, controls how much bias you allow into your model, where bias here comes from your model not being fully corrected to “pretend” that you were sampling from the policy to which gradients are being applied. The trade-off here is that if your policies are far apart, you might downweight its actions so aggressively that you don’t get a strong enough signal to learn quickly. However, the policy you learn might be statistically biased. Rho does this by weighting each value function update by: https://i.imgur.com/4jKVhCe.png where rho-bar is a hyperparameter. If rho-bar is high, then we allow stronger weighting effects, whereas if it’s low, we put a cap on those weights. The other parameter is c, and instead of weighting each value function update based on policy drift at that state, it weights each timestep based on how likely or unlikely the action taken at that timestep was under the true policy. https://i.imgur.com/8wCcAoE.png Timesteps that much likelier under the true policy are upweighted, and, once again, we use a hyperparameter, c-bar, to put a cap on the amount of allowed upweighting. Where the prior parameter controlled how much bias there was in the policy we learn, this parameter helps control the variance - the higher c-bar, the higher the amount of variance there will be in the updates used to train the model, and the longer it’ll take to converge. |

Improving Network Robustness against Adversarial Attacks with Compact Convolution

Ranjan, Rajeev and Sankaranarayanan, Swami and Castillo, Carlos D. and Chellappa, Rama

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Ranjan, Rajeev and Sankaranarayanan, Swami and Castillo, Carlos D. and Chellappa, Rama

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

[link]
Ranjan et al. propose to constrain deep features to lie on hyperspheres in order to improve robustness against adversarial examples. For the last fully-connected layer, this is achieved by the L2-softmax, which forces the features to lie on the hypersphere. For intermediate convolutional or fully-connected layer, the same effect is achieved analogously, i.e., by normalizing inputs, scaling them and applying the convolution/weight multiplication. In experiments, the authors argue that this improves robustness against simple attacks such as FGSM and DeepFool. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |

Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary

Fang, Meng and Cohn, Trevor

Association for Computational Linguistics - 2017 via Local Bibsonomy

Keywords: dblp

Fang, Meng and Cohn, Trevor

Association for Computational Linguistics - 2017 via Local Bibsonomy

Keywords: dblp

[link]
They get multilingual alignments from dictionaries, then train a Bilstm pos tagger in source language, then automatically tag many tokens in the target language, then manually annotate 1000 tokens in target language, then train a system with combined loss over distant tagging and gold tagging. They add an additional output layer that is learned for the gold annotations. |

About