Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum
arXiv e-Print archive, 2019
Keywords: cs.LG, stat.ML
First published: 2019/04/12

Abstract: A critical flaw of existing inverse reinforcement learning (IRL) methods is their inability to significantly outperform the demonstrator. This is because IRL typically seeks a reward function that makes the demonstrator appear near-optimal, rather than inferring the demonstrator's underlying intention, which may have been poorly executed in practice. In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in order to infer high-quality reward functions even when the demonstrations themselves are poor. When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks, often achieving more than twice the return of the best demonstration. We also demonstrate that T-REX is robust to ranking noise and can accurately extrapolate intention simply by watching a learner noisily improve at a task over time.
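
The abstract compresses the key mechanism, so it is worth spelling out. Given a pair of demonstrated trajectories where τ_i is ranked worse than τ_j, T-REX trains a parameterized reward function r̂_θ so that the predicted return Σ_{s∈τ_j} r̂_θ(s) exceeds Σ_{s∈τ_i} r̂_θ(s), via a softmax (Bradley-Terry style) cross-entropy loss over the two summed returns. The sketch below is a minimal PyTorch illustration of that pairwise ranking loss, not the authors' released implementation; the MLP architecture, hyperparameters, and random stand-in trajectories are assumptions made purely for demonstration.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP reward r_theta(s). The paper uses a CNN over pixels for
    Atari and an MLP for MuJoCo; this architecture is an assumption."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):           # states: (T, obs_dim)
        return self.net(states).sum()    # predicted return: sum of per-state rewards

def trex_pair_loss(reward_net, traj_worse, traj_better):
    """T-REX ranking loss for one pair tau_i < tau_j:
    -log( exp(R(tau_j)) / (exp(R(tau_i)) + exp(R(tau_j))) )."""
    returns = torch.stack([reward_net(traj_worse), reward_net(traj_better)])
    # label 1 = the second (better-ranked) trajectory should win
    return nn.functional.cross_entropy(returns.unsqueeze(0), torch.tensor([1]))

# Hypothetical usage: ranked_trajs is a list of (T, obs_dim) state tensors,
# ordered from worst to best; random tensors stand in for real demonstrations.
obs_dim = 11
reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)
ranked_trajs = [torch.randn(50, obs_dim) for _ in range(5)]

for _ in range(100):
    i, j = sorted(torch.randint(len(ranked_trajs), (2,)).tolist())
    if i == j:
        continue                          # need two distinct ranks
    loss = trex_pair_loss(reward_net, ranked_trajs[i], ranked_trajs[j])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once the reward network is trained on ranked pairs, the learned reward is handed to a standard deep RL algorithm (PPO in the paper's experiments), which is what lets the resulting policy extrapolate beyond the return of the best demonstration.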