Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum
arXiv e-Print archive - 2019
Keywords:
cs.LG, stat.ML
First published: 2019/04/12
Abstract: A critical flaw of existing inverse reinforcement learning (IRL) methods is
their inability to significantly outperform the demonstrator. This is because
IRL typically seeks a reward function that makes the demonstrator appear
near-optimal, rather than inferring the underlying intentions of the
demonstrator that may have been poorly executed in practice. In this paper, we
introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked
Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately)
ranked demonstrations in order to infer high-quality reward functions from a
set of potentially poor demonstrations. When combined with deep reinforcement
learning, T-REX outperforms state-of-the-art imitation learning and IRL methods
on multiple Atari and MuJoCo benchmark tasks and achieves performance that is
often more than twice the performance of the best demonstration. We also
demonstrate that T-REX is robust to ranking noise and can accurately
extrapolate intention by simply watching a learner noisily improve at a task
over time.
## General Framework
The learner only has access to a finite set of **ranked demonstrations**. The demonstrations contain only **observations** (no actions) and **do not need to be optimal**, but they must be (approximately) ranked from worst to best.
The **reward-learning part is offline**, but the policy-learning part is not (it requires interaction with the environment).
In a nutshell: learn a reward model that looks only at observations. The reward model is trained to predict whether one demonstration is ranked higher than another. Once the reward model is learned, one simply uses RL to learn a policy, and the resulting policy can outperform the demonstrations.
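A minimal sketch of this reward-learning step, assuming a small PyTorch reward network and trajectories given as tensors of observations (names and architecture are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps observations to per-step scalar rewards (illustrative architecture)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):               # obs: (T, obs_dim)
        return self.net(obs).squeeze(-1)  # per-step rewards, shape (T,)

def ranking_loss(reward_net, traj_worse, traj_better):
    """Cross-entropy over predicted returns: the higher-ranked trajectory should win."""
    returns = torch.stack([reward_net(traj_worse).sum(),
                           reward_net(traj_better).sum()])
    # target index 1: the second (better-ranked) trajectory should get higher return
    return nn.functional.cross_entropy(returns.unsqueeze(0), torch.tensor([1]))
```

Minimizing this loss over many (worse, better) pairs is exactly a pairwise ranking objective: the reward network is pushed to assign higher total return to higher-ranked trajectories.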
## Motivations
Current IRL methods cannot significantly outperform the demonstrations because they seek a reward function that makes the demonstrator appear near-optimal, and thus do not infer the underlying intentions of the demonstrator, which may have been poorly executed in practice.
In practice, high-quality demonstrations may be difficult to provide, and it is often easier to provide suboptimal demonstrations together with a ranking of their relative performance (desirability).
## Trajectory-ranked Reward EXtrapolation (T-REX)
![](https://i.imgur.com/cuL8ZFJ.png =400x)
T-REX uses ranked demonstrations to extrapolate a user's underlying intent beyond the best demonstration by learning a reward function that assigns greater return to higher-ranked trajectories. While standard IRL seeks a reward that **justifies** the demonstrations, T-REX learns a reward that **explains** the ranking over demonstrations.
![](https://i.imgur.com/4IQ13TC.png =500x)
Having rankings over demonstrations may remove the reward-ambiguity problem (an always-zero reward cannot explain the ranking) and also provides some "data augmentation", since a few ranked demonstrations define many pairwise comparisons. Additionally, suboptimal demonstrations may provide more diverse data by exploring a larger area of the state space (but they may miss the parts relevant to solving the task...).
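A sketch of that "data augmentation" effect: from a list of demonstrations ordered worst to best, every ordered pair becomes a labelled training example (illustrative helper, not the paper's code):

```python
from itertools import combinations

def pairwise_comparisons(ranked_demos):
    """ranked_demos: list of trajectories ordered from worst to best.
    Returns (worse, better) pairs; N demos yield N*(N-1)/2 comparisons."""
    return [(ranked_demos[i], ranked_demos[j])
            for i, j in combinations(range(len(ranked_demos)), 2)]
```

For example, 12 ranked demonstrations already yield 66 labelled comparisons for training the reward model.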
## Tricks and Tips
The authors used the ground-truth reward to rank trajectories, but they also show that approximate rankings do not hurt performance much.
To avoid overfitting, they used an ensemble of 5 neural networks to predict the reward.
For episodic tasks, they compare subtrajectories taken at similar timesteps (the snippet from the better trajectory starts a bit later in the episode than the one it is compared against, so that the learned reward tends to increase as the episode progresses).
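A hedged sketch of that subtrajectory-sampling trick (snippet length and the exact start-time constraint are assumptions, not the paper's values; both trajectories are assumed longer than `length`):

```python
import random

def sample_subtrajectory_pair(traj_worse, traj_better, length=50):
    """Sample time-aligned snippets; the better snippet starts no earlier than the worse one."""
    t_worse = random.randint(0, len(traj_worse) - length)
    # constrain the better snippet to start at or after the worse snippet's start time
    t_better = random.randint(min(t_worse, len(traj_better) - length),
                              len(traj_better) - length)
    return (traj_worse[t_worse:t_worse + length],
            traj_better[t_better:t_better + length])
```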
At RL training time, the learned reward is passed through a sigmoid to avoid large changes in the reward scale across time-steps.
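A minimal sketch of how that learned reward might be fed to the RL agent, combining the ensemble averaging and the sigmoid squashing (reuses the illustrative `RewardNet` above; the interface is an assumption):

```python
import torch

def rl_reward(ensemble, obs):
    """Reward given to the RL agent for a single observation: ensemble average,
    squashed by a sigmoid so the scale stays bounded across time-steps."""
    with torch.no_grad():
        preds = torch.stack([net(obs) for net in ensemble])
        return torch.sigmoid(preds.mean()).item()
```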
## Results
![](https://i.imgur.com/7ysYZKd.png)
![](https://i.imgur.com/CHO9aVT.png)
![](https://i.imgur.com/OzVD9sf.png =600x)
Results are quite positive, and performance can be good even when the learned reward is not strongly correlated with the ground truth (cf. HalfCheetah).
They also show that T-REX is robust to different kinds of ranking noise: random swapping of pairwise rankings, and rankings by humans who only have access to a description of the task and not the ground-truth reward. **They also rank demonstrations automatically by the number of training steps of a learning agent: T-REX could therefore be used as an intrinsic reward alongside the ground truth to accelerate training.**
![](https://i.imgur.com/IfOeLY6.png =500x)
**Limitations**
They do not show that T-REX can match an optimal expert; perhaps ranking demonstrations hurts when all the demos are close to optimal?