Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racaniere, Arthur Guez, Jean-Baptiste Lespiau and Nicolas Heess
arXiv e-Print archive, 2018
Keywords: cs.LG, stat.ML
First published: 2018/11/15

Abstract: Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. It leverages structural causal models for counterfactual evaluation of arbitrary policies on individual off-policy episodes. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling, which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradients can be interpreted as counterfactual methods.
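
To make the core idea concrete, below is a minimal sketch of counterfactual policy evaluation in a structural causal model, following the standard three-step recipe (abduction of noise, intervention on actions, prediction of outcomes) that the abstract's "counterfactual evaluation of arbitrary policies on individual off-policy episodes" refers to. This is not the paper's implementation: the toy 1-D environment, the `behavior_policy` and `candidate_policy` functions, and the additive-noise transition are all hypothetical choices made so that abduction is exact and the script stays self-contained.

```python
import numpy as np

# Hypothetical toy SCM: a 1-D environment with additive exogenous noise.
# Transition: s_{t+1} = s_t + a_t + u_t.
# Because the noise enters additively, abduction is exact here; in
# general it requires posterior inference over the SCM's noise variables.

rng = np.random.default_rng(0)

def behavior_policy(s):
    return 1.0 if s < 5.0 else 0.0    # policy that generated the logged data

def candidate_policy(s):
    return 0.5                        # policy we want to evaluate counterfactually

def reward(s):
    return -abs(s - 5.0)              # reward peaks at s = 5

# 1) Logged episode under the behavior policy (real experience).
T, s = 10, 0.0
episode = []                          # list of (s_t, a_t, s_{t+1})
for t in range(T):
    a = behavior_policy(s)
    u = rng.normal(scale=0.3)         # exogenous noise, unobserved by the agent
    s_next = s + a + u
    episode.append((s, a, s_next))
    s = s_next

# 2) Abduction: infer the noise that must have produced the logged data.
noises = [s_next - s_t - a_t for (s_t, a_t, s_next) in episode]

# 3) Intervention + prediction: replay the *same* inferred noise under the
#    candidate policy to obtain a counterfactual episode and its return.
s_cf, cf_return = episode[0][0], 0.0
for u in noises:
    a_cf = candidate_policy(s_cf)
    s_cf = s_cf + a_cf + u            # same noise, different action
    cf_return += reward(s_cf)

print(f"counterfactual return of candidate policy: {cf_return:.2f}")
```

The contrast with Importance Sampling is visible in step 3: rather than re-weighting the logged returns, the inferred noise is held fixed while the actions are swapped, so each logged episode yields a full simulated outcome for the candidate policy.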