Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1567 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Out-distribution training confers robustness to deep neural networks

Abbasi, Mahdieh and Gagné, Christian

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Abbasi and Gagné propose explicit but natural out-distribution training as a defense against adversarial examples. Specifically, as illustrated on the toy dataset in Figure 1, they argue that networks commonly produce high-confidence predictions in regions that are clearly outside of the data manifold (i.e., the training data distribution). As a mitigation strategy, the authors propose to explicitly train on out-of-distribution data, allowing the network to additionally classify this data as “dustbin” data. On MNIST, for example, this data comes from NotMNIST, a dataset of letters A-J; on CIFAR-10, this data could be CIFAR-100. Experiments show that this out-of-distribution training allows networks to identify adversarial examples as “dustbin” and thus improves robustness.

https://i.imgur.com/nUSDZay.png

Figure 1: Illustration of a naive model versus an augmented model, i.e., trained on out-of-distribution data, on a toy dataset (left) and on MNIST (right).

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
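To make the “dustbin” idea concrete, here is a minimal sketch of how a training set might be augmented with an extra class and how predictions could be screened afterwards. The helper names `add_dustbin_class` and `is_rejected` are hypothetical and not taken from the paper; the sketch assumes the dustbin class is simply appended as class index `num_classes`, and that an input is rejected when the dustbin class has the highest predicted probability.

```python
def add_dustbin_class(in_samples, in_labels, out_samples, num_classes):
    """Return an augmented dataset where out-distribution samples get label `num_classes`."""
    dustbin = num_classes  # index of the extra class (assumption: appended last)
    samples = list(in_samples) + list(out_samples)
    labels = list(in_labels) + [dustbin] * len(out_samples)
    return samples, labels, dustbin


def is_rejected(class_probs, dustbin):
    """Flag an input as suspicious when the dustbin class has the highest probability."""
    predicted = max(range(len(class_probs)), key=lambda k: class_probs[k])
    return predicted == dustbin


# Toy usage: 10 in-distribution classes (e.g. digits) plus one dustbin class.
samples, labels, dustbin = add_dustbin_class(
    in_samples=["digit_a", "digit_b"],
    in_labels=[3, 7],
    out_samples=["letter_a"],
    num_classes=10,
)
```

A classifier trained on `samples` and `labels` would then have 11 outputs, and `is_rejected` can be applied to its softmax output at test time.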

RTFM: Generalising to Novel Environment Dynamics via Reading

Zhong, Victor and Rocktäschel, Tim and Grefenstette, Edward

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

Reinforcement learning is notoriously sample-inefficient, and one reason why is that agents learn about the world entirely through experience, and it takes lots of experience to learn useful things. One solution you might imagine to this problem is the one humans by and large use when encountering new environments: instead of learning everything through first-person exploration, they acquire much of their knowledge by hearing or reading condensed descriptions of the world that help them take more sensible actions within it. This paper and others like it have the goal of learning RL agents that can take in information about the world in the form of text, and use that information to solve a task. This paper is not the first to propose a solution in this general domain, but it claims to be unique by dint of having both the dynamics of the environment and the goal of the agent change on a per-environment basis, and be described in text.

The precise details of the architecture are very much the result of particular engineering done to solve this problem, and as such, it's a bit hard to abstract away generalizable principles that this paper showed, other than the proof-of-concept fact that tasks of the form they describe (where an agent has to learn which objects can kill which enemies, and pursue the goal of killing certain ones) can be solved. Arguably the most central design principle of the paper is aggressive and repeated use of different forms of conditioning architectures, to fully mix the information contained in the textual and visual data streams. This was done in two main ways:

- Multiple different attention summaries were created, using the document embedding as input, but with queries conditioned on different things (the task, the inventory, a summarized form of the visual features). This is a natural but clever extension of the fact that attention is an easy way to generate conditional aggregated versions of some input.

  https://i.imgur.com/xIsRu2M.png

- The architecture uses FiLM (Feature-wise Linear Modulation), which is essentially a many-generations-generalized version of conditional batch normalization in which the gamma and beta used to scale and shift a feature vector are learned, taking some other data as input. The canonical version of this would be taking in text input, summarizing it into a vector, and then using that vector as input to an MLP that generates gamma and beta parameters for all of the convolutional layers in a vision system. The interesting innovation of this paper is essentially to argue that this conditioning operation is quite neutral, and that there's no essential way in which the vision input is the "true" data and the text simply the auxiliary conditioning data: it's more accurate to say that each form of data should condition the processing of the other one. And so they use Bidirectional FiLM, which does just that, conditioning vision features on text summaries, but also conditioning text features on vision summaries.

  https://i.imgur.com/qFaH1k3.png

- The model overall is composed of multiple layers that perform both this mixing FiLM operation and visually-conditioned attention.

The authors did show, not super surprisingly, that these additional forms of conditioning added performance value to the model relative to the cases where they were ablated.
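The FiLM operation described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: features are flat vectors rather than convolutional feature maps, each of gamma and beta comes from a single linear layer, and the function names (`linear`, `film`, `bidirectional_film`) are hypothetical rather than taken from the paper's code.

```python
def linear(vec, weights, bias):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features, conditioning, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift `features` element-wise with parameters predicted from `conditioning`."""
    gamma = linear(conditioning, w_gamma, b_gamma)  # per-feature scale
    beta = linear(conditioning, w_beta, b_beta)     # per-feature shift
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


def bidirectional_film(vision, text, vision_params, text_params):
    """Each modality modulates the other: text conditions vision and vision conditions text."""
    new_vision = film(vision, text, *vision_params)
    new_text = film(text, vision, *text_params)
    return new_vision, new_text


# Toy usage: gamma is the conditioning vector itself, beta is a constant shift of 1.
identity = [[1, 0], [0, 1]]
zeros = [[0, 0], [0, 0]]
out = film([1, 2], [3, 4], identity, [0, 0], zeros, [1, 1])
```

In the toy call, `gamma` equals the conditioning vector `[3, 4]` and `beta` is `[1, 1]`, so each feature is multiplied by its own scale and then shifted.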

Bayesian dark knowledge

Balan, Anoop Korattikara and Rathod, Vivek and Murphy, Kevin P. and Welling, Max

Neural Information Processing Systems Conference - 2015 via Local Bibsonomy

Keywords: dblp

This paper combines two ideas. The first is stochastic gradient Langevin dynamics (SGLD), which is an efficient Bayesian learning method for larger datasets that makes it possible to efficiently sample from the posterior over the parameters of a model (e.g. a deep neural network). In short, SGLD is stochastic (minibatch) gradient descent, but where Gaussian noise is added to the gradients before each update. Each update thus results in a sample from the SGLD sampler. To make a prediction for a new data point, a number of previous parameter values are combined into an ensemble, which effectively corresponds to a Monte Carlo estimate of the posterior predictive distribution of the model.

The second idea is distillation or dark knowledge, which in short is the idea of training a smaller model (student) to replicate the behavior and performance of a much larger model (teacher), essentially by training the student to match the outputs of the teacher.

The observation made in this paper is that the step of creating an ensemble of several models (e.g. deep networks) can be expensive, especially if many samples are used and/or if each model is large. Thus, they propose to approximate the output of that ensemble by training a single network to predict the output of the ensemble. Ultimately, this is done by having the student predict the output of a teacher corresponding to the model with the last parameter value sampled by SGLD.

Interestingly, this process can be operated in an online fashion, where one alternates between sampling from SGLD (i.e. performing a noisy SGD step on the teacher model) and performing a distillation update (i.e. updating the student model, given the current teacher model). The end result is a student model whose outputs should be calibrated to the Bayesian predictive distribution.
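The alternation between an SGLD step and a distillation step can be illustrated on a deliberately tiny problem. The sketch below is not the paper's setup: it assumes a one-parameter "teacher" (the mean of a unit-variance Gaussian with a Gaussian prior), a one-parameter "student" that regresses onto the teacher's current output with a squared-error loss, and hand-picked step sizes. The function names are hypothetical.

```python
import math
import random


def sgld_step(theta, minibatch, n_total, step_size, prior_var, rng):
    """One SGLD update for the mean of a unit-variance Gaussian with a Gaussian prior."""
    grad_log_prior = -theta / prior_var
    # Minibatch estimate of the full-data log-likelihood gradient.
    grad_log_lik = (n_total / len(minibatch)) * sum(x - theta for x in minibatch)
    # Injected Gaussian noise with variance equal to the step size.
    noise = rng.gauss(0.0, math.sqrt(step_size))
    return theta + 0.5 * step_size * (grad_log_prior + grad_log_lik) + noise


def distill_step(student, teacher_output, lr):
    """One SGD step on the squared error between student and teacher outputs."""
    return student - lr * 2.0 * (student - teacher_output)


def run(data, steps=2000, batch_size=10, seed=0):
    """Alternate one teacher (SGLD) update with one student (distillation) update."""
    rng = random.Random(seed)
    theta, student = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        theta = sgld_step(theta, batch, len(data), 0.01, 100.0, rng)
        student = distill_step(student, theta, lr=0.025)
    return student


# Toy data: 100 draws around a true mean of 2.0.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
student = run(data)
```

Because the student only ever sees the most recent teacher sample, its parameter ends up tracking a running average of the SGLD samples, which in this toy case should sit near the posterior mean.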

Reward Augmented Maximum Likelihood for Neural Structured Prediction

Mohammad Norouzi and Samy Bengio and Zhifeng Chen and Navdeep Jaitly and Mike Schuster and Yonghui Wu and Dale Schuurmans

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/09/01 (6 years ago)

**Abstract:** A key problem in structured output prediction is direct optimization of the
task reward function that matters for test evaluation. This paper presents a
simple and computationally efficient approach to incorporate task reward into a
maximum likelihood framework. We establish a connection between the
log-likelihood and regularized expected reward objectives, showing that at a
zero temperature, they are approximately equivalent in the vicinity of the
optimal solution. We show that optimal regularized expected reward is achieved
when the conditional distribution of the outputs given the inputs is
proportional to their exponentiated (temperature adjusted) rewards. Based on
this observation, we optimize conditional log-probability of edited outputs
that are sampled proportionally to their scaled exponentiated reward. We apply
this framework to optimize edit distance in the output label space. Experiments
on speech recognition and machine translation for neural sequence to sequence
models show notable improvements over a maximum likelihood baseline by using
edit distance augmented maximum likelihood.

(See also a more thorough summary in [a LaTeX PDF][1].)

This paper has some nice clear theory which bridges maximum likelihood (supervised) learning and standard reinforcement learning. It focuses on *structured prediction* tasks, where we want to learn to predict $p_\theta(y \mid x)$ where $y$ is some object with complex internal structure.

We can agree on some deficiencies of maximum likelihood learning:

- ML training fails to assign **partial credit**. Models are trained to maximize the likelihood of the ground-truth outputs in the dataset, and all other outputs are equally wrong. This is an increasingly important problem as the space of possible solutions grows.
- ML training is potentially disconnected from **downstream task reward**. In machine translation, we usually want to optimize relatively complex metrics like BLEU or TER. Since these metrics are non-differentiable, we have to settle for optimizing proxy losses that we hope are related to the metric of interest.

Reinforcement learning offers an attractive alternative in theory. RL algorithms are designed to optimize non-differentiable (even stochastic) reward functions, which sounds like just what we want. But RL algorithms have their own problems with this sort of structured output space:

- Standard RL algorithms rely on samples from the model we are learning, $p_\theta(y \mid x)$. This becomes intractable when our output space is very complex (e.g. 80-token sequences where each word is drawn from a vocabulary of 80,000 words).
- The reward spaces for problems of interest are extremely sparse. Our metrics will assign 0 reward to most of the $80{,}000^{80}$ possible outputs in the translation problem in the paper.
- Vanilla RL doesn't take into account the ground-truth outputs available to us in structured prediction.

This paper designs a solution which combines supervised learning with a reinforcement learning-inspired smoothing method.
Concretely, the authors design an **exponentiated payoff distribution** $q(y \mid y^*; \tau)$ which assigns high mass to high-reward outputs $y$ and low mass elsewhere. This distribution is used to effectively smooth the loss function established by the ground-truth outputs in the supervised data. We end up optimizing the following objective:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D}\left[ \sum_y q(y \mid y^*; \tau) \log p_\theta(y \mid x) \right]$$

This optimization depends on samples from our dataset $\mathcal D$ and, more importantly, the stationary payoff distribution $q$. This contrasts strongly with standard RL training, where the objective depends on samples from the non-stationary model distribution $p_\theta$. To make that clear, we can rewrite the above with another expectation:

$$\mathcal L_\text{RML} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim q(y \mid y^*; \tau)}\left[ \log p_\theta(y \mid x) \right]$$

### Model details

If you're interested in the low-level details, I wrote up the gist of the math in [this PDF][1].

### Analysis

#### Relationship to label smoothing

This training approach is mathematically equivalent to label smoothing, applied here to structured output problems. In next-word prediction language modeling, a popular trick involves smoothing the target distributions by combining the ground-truth output with some simple base model, e.g. a unigram word frequency distribution. (This just means we take a weighted sum of the one-hot vector from our supervised data and a normalized frequency vector calculated on some corpus.) Mathematically, the cross entropy with label smoothing is

$$\mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D} \left[ \sum_y p_\text{smooth}(y; y^*) \log p_\theta(y \mid x) \right]$$

(The equation above leaves out a constant entropy term.)
The gradient of this objective looks exactly the same as the reward-augmented ML gradient from the paper:

$$\nabla_\theta \mathcal L_\text{ML-smooth} = - \mathbb E_{x, y^* \sim \mathcal D, y \sim p_\text{smooth}} \left[ \nabla_\theta \log p_\theta(y \mid x) \right]$$

So reward-augmented likelihood is equivalent to label smoothing, where our smoothing distribution is log-proportional to our downstream reward function.

#### Relationship to distillation

Optimizing the reward-augmented maximum likelihood is equivalent to minimizing the KL divergence

$$D_\text{KL}(q(y \mid y^*; \tau) \mid\mid p_\theta(y \mid x))$$

This divergence reaches zero iff $q = p$. We can say, then, that the effect of optimizing on $\mathcal L_\text{RML}$ is to **distill** the reward function (which parameterizes $q$) into the model parameters $\theta$ (which parameterize $p_\theta$).

It's exciting to think about other sorts of more complex models that we might be able to distill in this framework. The unfortunate (?) restriction is that the "source" model of the distillation ($q$ in this paper) must admit efficient sampling.

#### Relationship to adversarial training

We can also view reward-augmented maximum likelihood training as a data augmentation technique: it synthesizes new "partially correct" examples using the reward function as a guide. We then train on all of the original and synthesized data, again weighting the gradients based on the reward function.

Adversarial training is a similar data augmentation technique which generates examples that force the model to be robust to changes in its input space (robust to changes of $x$). Both adversarial training and the RML objective encourage the model to be robust "near" the ground-truth supervised data. A high-level comparison:

- Adversarial training can be seen as data augmentation in the input space; RML training performs data augmentation in the output space.
- Adversarial training is a **model-based data augmentation**: the samples are generated from a process that depends on the current parameters during training. RML training performs **data-based augmentation**, which could in theory be done independent of the actual training process.

---

Thanks to Andrej Karpathy, Alec Radford, and Tim Salimans for interesting discussion which contributed to this summary.

[1]: https://drive.google.com/file/d/0B3Rdm_P3VbRDVUQ4SVBRYW82dU0/view
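As a rough illustration of the exponentiated payoff distribution $q(y \mid y^*; \tau)$ and the resulting loss for a single $(x, y^*)$ pair, here is a small sketch. It makes assumptions the summary does not: the output space is a tiny enumerable candidate set, the reward is negative Hamming distance as a stand-in for an edit-distance reward, and the function names are hypothetical. The paper itself works with far larger output spaces where enumeration is not possible.

```python
import math
import random


def hamming_reward(y, y_star):
    """Negative Hamming distance between equal-length sequences (toy task reward)."""
    return -sum(a != b for a, b in zip(y, y_star))


def payoff_distribution(candidates, y_star, tau):
    """q(y | y*; tau) proportional to exp(reward / tau), normalized over `candidates`."""
    scores = [math.exp(hamming_reward(y, y_star) / tau) for y in candidates]
    total = sum(scores)
    return [s / total for s in scores]


def raml_loss(log_probs, q):
    """Negative expected model log-probability under q, for a single (x, y*) pair."""
    return -sum(qi * lp for qi, lp in zip(q, log_probs))


# Toy usage: four candidate outputs, ground truth "ab", temperature 1.
candidates = ["ab", "aa", "bb", "ba"]
q = payoff_distribution(candidates, y_star="ab", tau=1.0)

# Training targets are drawn from q rather than from the model.
sampled = random.Random(0).choices(candidates, weights=q, k=5)
```

Lowering `tau` concentrates `q` on the ground truth, which recovers ordinary maximum likelihood in the limit; raising it spreads mass onto outputs with lower reward.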

Curiosity-driven reinforcement learning with homeostatic regulation

Ildefons Magrans de Abril and Ryota Kanai

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.AI

**First published:** 2018/01/23 (4 years ago)

**Abstract:** We propose a curiosity reward based on information theory principles and
consistent with the animal instinct to maintain certain critical parameters
within a bounded range. Our experimental validation shows the added value of
the additional homeostatic drive to enhance the overall information gain of a
reinforcement learning agent interacting with a complex environment using
continuous actions. Our method builds upon two ideas: i) To take advantage of a
new Bellman-like equation of information gain and ii) to simplify the
computation of the local rewards by avoiding the approximation of complex
distributions over continuous states and actions.

Exploring an environment with non-linearities in a continuous action space can be improved by regulating the agent's curiosity with a homeostatic drive. This means that a heterostatic drive to move away from habitual states is blended with a homeostatic motivation that encourages actions leading to states where the agent is familiar with a state-action pair.

This approach improves upon forward models and ICM (Pathak et al. 17) with an enhanced information gain that basically consists of the following: while the reward in Pathak et al. 17 is formulated as the forward model prediction error, the reward in this paper subtracts from that forward model prediction error the prediction error obtained when knowing not only $s_t$ and $a_t$, but also $a_{t+1}$.

Curiosity-driven reinforcement learning shows that an additional homeostatic drive enhances the information gain of a classical curious/heterostatic agent.

Implementation: They take advantage of a new Bellman-like equation of information gain and simplify the computation of the local rewards. It could help by prioritizing the exploration of the state-action space according to how hard it is to learn each region.

Background: The concept of homeostatic regulation in social robots was first proposed in Breazeal et al. 04. They extend existing approaches by compensating the heterostatic drive encouraged by the curiosity reward with an additional homeostatic drive.

1) The first component implements the heterostatic drive (the same as in Pathak et al. 17). In other words, it refers to the tendency to push the agent away from its habitual state.
2) Homeostatic motivation: the second component is the paper's novel contribution. It encourages taking actions $a_t$ that lead to future states $s_{t+1}$ where the corresponding future action $a_{t+1}$ gives additional information about $s_{t+1}$. This situation happens when the agent is "familiar" with the state-action pair $\{s_{t+1}, a_{t+1}\}$.

The article lacks an exact comparison with Pathak et al. on a shared task. In this paper, the task is a 3-room navigation map used to measure exploration.
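The reward described in this summary, a forward-model error minus the error of an extended forward model that also sees $a_{t+1}$, can be written down in a few lines. This sketch only follows the summary's wording: it assumes squared error as the prediction error, treats the two models' predictions as given inputs, and uses hypothetical function names. The paper's exact formulation may differ.

```python
def squared_error(pred, target):
    """Sum of squared differences between a predicted and an observed state vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def curiosity_reward(next_state, forward_pred, extended_pred):
    """Forward-model error minus extended-model error.

    forward_pred:  prediction of s_{t+1} from (s_t, a_t)
    extended_pred: prediction of s_{t+1} from (s_t, a_t, a_{t+1})
    """
    forward_error = squared_error(forward_pred, next_state)
    extended_error = squared_error(extended_pred, next_state)
    return forward_error - extended_error


# Toy usage: the extended model predicts the next state exactly, the forward model does not.
reward = curiosity_reward(
    next_state=[1.0, 0.0],
    forward_pred=[0.0, 0.0],
    extended_pred=[1.0, 0.0],
)
```

The reward is large when the forward model is surprised but knowing the next action removes most of that surprise, which matches the "familiar state-action pair" reading given above.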
