Learning with Opponent-Learning Awareness

Jakob N. Foerster and Richard Y. Chen and Maruan Al-Shedivat and Shimon Whiteson and Pieter Abbeel and Igor Mordatch

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.AI, cs.GT

**First published:** 2017/09/13 (3 years ago)

**Abstract:** Multi-agent settings are quickly gathering importance in machine learning.
This includes a plethora of recent work on deep multi-agent reinforcement
learning, but also can be extended to hierarchical RL, generative adversarial
networks and decentralised optimisation. In all these settings the presence of
multiple learning agents renders the training problem non-stationary and often
leads to unstable training or undesired final results. We present Learning with
Opponent-Learning Awareness (LOLA), a method in which each agent shapes the
anticipated learning of the other agents in the environment. The LOLA learning
rule includes an additional term that accounts for the impact of one agent's
policy on the anticipated parameter update of the other agents. Preliminary
results show that the encounter of two LOLA agents leads to the emergence of
tit-for-tat and therefore cooperation in the iterated prisoners' dilemma, while
independent learning does not. In this domain, LOLA also receives higher
payouts compared to a naive learner, and is robust against exploitation by
higher order gradient-based methods. Applied to repeated matching pennies, LOLA
agents converge to the Nash equilibrium. In a round robin tournament we show
that LOLA agents can successfully shape the learning of a range of multi-agent
learning algorithms from literature, resulting in the highest average returns
on the IPD. We also show that the LOLA update rule can be efficiently
calculated using an extension of the policy gradient estimator, making the
method suitable for model-free RL. This method thus scales to large parameter
and input spaces and nonlinear function approximators. We also apply LOLA to a
grid world task with an embedded social dilemma using deep recurrent policies
and opponent modelling. Again, by explicitly considering the learning of the
other agent, LOLA agents learn to cooperate out of self-interest.
more
less

Jakob N. Foerster and Richard Y. Chen and Maruan Al-Shedivat and Shimon Whiteson and Pieter Abbeel and Igor Mordatch

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.AI, cs.GT

[link]
Normal RL agents in multi-agent scenarios treat their opponents as a static part of the environment, not taking into account the fact that other agents are learning as well. This paper proposes LOLA, a learning rule that should take the agency and learning of opponents into account by optimizing "return under one step look-ahead of opponent learning" So instead of optimizing under the current parameters of agent 1 and 2 $$V^1(\theta_i^1, \theta_i^2)$$ LOLA proposes to optimize taking into account one step of opponent (agent 2) learning $$V^1(\theta_i^1, \theta_i^2 + \Delta \theta^2_i)$$ where we assume the opponent's naive learning update $\Delta \theta^2_i = \nabla_{\theta^2} V^2(\theta^1, \theta^2) \cdot \eta$ and we add a second-order correction term on top of this, the authors propose - a learning rule with policy gradients in the case that the agent does not have access to exact gradients - a way to estimate the parameters of the opponent, $\theta^2$, from its trajectories using maximum likelihood in the case you can't access them directly $$\hat \theta^2 = \text{argmax}_{\theta^2} \sum_t \log \pi_{\theta^2}(u_t^2|s_t)$$ LOLA is tested on iterated prisoner's dilemma and converges to a tit-for-tat strategy more frequently than the naive RL learning algorithm, and outperforms it. LOLA is tested on iterated matching pennies (similar to prisoner's dilemma) and stably converges to the Nash equilibrium whereas the naive learners do not. In testing on coin game (a higher dimensional version of prisoner's dilemma) they find that naive learners generally choose the defect option whereas LOLA agents have a mostly-cooperative strategy. As well, the authors show that LOLA is a dominant learning rule in IPD, where both agents always do better if either is using LOLA (and even better if both are using LOLA). Finally, the authors also propose second order LOLA, which instead of assuming the opponent is a naive learner, assumes the opponent uses a LOLA learning rule. They show that second order LOLA does not lead to improved performance so there is no need to have a $n$th order LOLA arms race. |

Deep contextualized word representations

Matthew E. Peters and Mark Neumann and Mohit Iyyer and Matt Gardner and Christopher Clark and Kenton Lee and Luke Zettlemoyer

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CL

**First published:** 2018/02/15 (3 years ago)

**Abstract:** We introduce a new type of deep contextualized word representation that
models both (1) complex characteristics of word use (e.g., syntax and
semantics), and (2) how these uses vary across linguistic contexts (i.e., to
model polysemy). Our word vectors are learned functions of the internal states
of a deep bidirectional language model (biLM), which is pre-trained on a large
text corpus. We show that these representations can be easily added to existing
models and significantly improve the state of the art across six challenging
NLP problems, including question answering, textual entailment and sentiment
analysis. We also present an analysis showing that exposing the deep internals
of the pre-trained network is crucial, allowing downstream models to mix
different types of semi-supervision signals.
more
less

Matthew E. Peters and Mark Neumann and Mohit Iyyer and Matt Gardner and Christopher Clark and Kenton Lee and Luke Zettlemoyer

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CL

[link]
This paper introduces a deep universal word embedding based on using a bidirectional LM (in this case, biLSTM). First words are embedded with a CNN-based, character-level, context-free, token embedding into $x_k^{LM}$ and then each sentence is parsed using a biLSTM, maximizing the log-likelihood of a word given it's forward and backward context (much like a normal language model). The innovation is in taking the output of each layer of the LSTM ($h_{k,j}^{LM}$ being the output at layer $j$) $$ \begin{align} R_k &= \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1 \ldots L \} \\ &= \{h_{k,j}^{LM} | j = 0 \ldots L \} \end{align} $$ and allowing the user to learn a their own task-specific weighted sum of these hidden states as the embedding: $$ ELMo_k^{task} = \gamma^{task} \sum_{j=0}^L s_j^{task} h_{k,j}^{LM} $$ The authors show that this weighted sum is better than taking only the top LSTM output (as in their previous work or in CoVe) because it allows capturing syntactic information in the lower layer of the LSTM and semantic information in the higher level. Table below shows that the second layer is more useful for the semantic task of word sense disambiguation, and the first layer is more useful for the syntactic task of POS tagging. https://i.imgur.com/dKnyvAa.png On other benchmarks, they show it is also better than taking the average of the layers (which could be done by setting $\gamma = 1$) https://i.imgur.com/f78gmKu.png To add the embeddings to your supervised model, ELMo is concatenated with your context-free embeddings, $\[ x_k; ELMo_k^{task} \]$. It can also be concatenated with the output of your RNN model $\[ h_k; ELMo_k^{task} \]$ which can show improvements on the same benchmarks https://i.imgur.com/eBqLe8G.png Finally, they show that adding ELMo to a competitive but simple baseline gets SOTA (at the time) on very many NLP benchmarks https://i.imgur.com/PFUlgh3.png It's all open-source and there's a tutorial [here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) |

About