Summary by Jon Gauthier
This article makes the argument for *interactive* language learning, motivated by some nice recent small-domain success. I can certainly agree with the motivation: if language is used in conversation, shouldn't we be building models which know how to behave in conversation?
The authors develop a standard multi-agent communication paradigm in which two agents learn to communicate in a single-round reference game. (There are no references to e.g. Kirby or any [ILM][1] work, which is in the same space.) Agent `A1` examines a referent `R` and "transmits" a one-hot utterance representation to `A2`, who must identify `R` given the utterance. A conversation counts as a success when `A2` picks the correct referent `R`, and the two agents are jointly trained to maximize this success metric via REINFORCE (policy gradient).
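To make the setup concrete, here is a minimal sketch of how such a one-round reference game trained with REINFORCE might look. The architectures (a single linear speaker, an embedding-based listener), the toy sizes, and the lack of a reward baseline are my own simplifications for illustration, not the paper's actual model.

```python
# Minimal sketch (my own, not the authors' code) of the one-round reference game:
# A1 emits a discrete symbol for a referent, A2 points at a referent among
# distractors, and both agents are trained jointly with REINFORCE.
import torch
import torch.nn as nn

N_ATTRS, VOCAB, N_CANDIDATES = 8, 16, 5   # assumed toy sizes

speaker = nn.Linear(N_ATTRS, VOCAB)            # A1: referent features -> utterance logits
listener_embed = nn.Embedding(VOCAB, N_ATTRS)  # A2: utterance -> query over candidates
opt = torch.optim.Adam(
    list(speaker.parameters()) + list(listener_embed.parameters()), lr=1e-2)

for step in range(1000):
    candidates = torch.randn(N_CANDIDATES, N_ATTRS)    # random candidate referents
    target = torch.randint(N_CANDIDATES, (1,)).item()  # index of the referent A1 sees

    # A1 samples a one-hot utterance from a categorical policy over the vocabulary.
    utt_probs = torch.softmax(speaker(candidates[target]), dim=-1)
    utt = torch.multinomial(utt_probs, 1)

    # A2 scores each candidate against the utterance embedding and samples a guess.
    scores = candidates @ listener_embed(utt).squeeze(0)
    guess_probs = torch.softmax(scores, dim=-1)
    guess = torch.multinomial(guess_probs, 1)

    # Shared reward: 1 if A2 picked the correct referent, else 0.
    reward = float(guess.item() == target)

    # REINFORCE: maximize E[reward] via the score-function gradient
    # (no baseline / variance reduction here, for brevity).
    log_p = torch.log(utt_probs[utt]) + torch.log(guess_probs[guess])
    loss = -reward * log_p.sum()

    opt.zero_grad()
    loss.backward()
    opt.step()
```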
**This is mathematically equivalent to [the NVIL model (Mnih and Gregor, 2014)][2]**, an autoencoder with "hard" latent codes which is likewise trained by policy gradient methods.
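Concretely, the equivalence as I read it (this gloss is mine, not a formula from either paper) is that both setups optimize an expectation over a discrete latent choice with the same score-function (REINFORCE) gradient estimator; only the "reward" term differs:

```latex
% Score-function (REINFORCE) gradient shared by both setups:
\nabla_\theta \, \mathbb{E}_{z \sim q_\theta(z \mid x)}\big[ f(x, z) \big]
  = \mathbb{E}_{z \sim q_\theta(z \mid x)}\big[ f(x, z) \, \nabla_\theta \log q_\theta(z \mid x) \big]

% NVIL (autoencoder with a hard latent code z):
%   f(x, z) = \log p_\phi(x \mid z)              (reconstruction term)
% Reference game (the utterance plays the role of z):
%   f(x, z) = \mathbf{1}[\, A_2 \text{ picks } R \,]   (communicative success)
```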
They perform a nice, thorough evaluation of both successful and "cheating" models. This will serve as a useful reference and starting point for people interested in interactive language acquisition.
The way forward is clear, I think: let's develop agents in more complex environments, interacting in multi-round conversations, with more complex / longer utterances.
[1]: http://cocosci.berkeley.edu/tom/papers/IteratedLearningEvolutionLanguage.pdf
[2]: https://arxiv.org/abs/1402.0030
