## General Framework

Only access to a finite set of **ranked demonstrations**. The demonstrations only contain **observations** and **do not need to be optimal**, but they must be (approximately) ranked from worst to best. The **reward learning part is off-line**, but the policy learning part is not (it requires interactions with the environment).

In a nutshell: learn a reward model that looks at observations. The reward model is trained to predict whether one demonstration is ranked higher than another. Then, once the reward model is learned, one simply uses RL to learn a policy. The resulting policy can outperform the demonstrations.

## Motivations

Current IRL methods cannot outperform the demonstrations because they seek a reward function that makes the demonstrator optimal, and thus do not infer the underlying intentions of a demonstrator that may have executed them poorly. In practice, high-quality demonstrations may be difficult to provide, and it is often easier to provide demonstrations together with a ranking of their relative performance (desirability).

## Trajectory-ranked Reward EXtrapolation (T-REX)

Uses ranked demonstrations to extrapolate a user's underlying intent beyond the best demonstration by learning a reward that assigns greater return to higher-ranked trajectories. While standard IRL seeks a reward that **justifies** the demonstrations, T-REX learns a reward that **explains** the ranking over demonstrations (a sketch of the corresponding pairwise loss is given at the end of this note).

Having rankings over demonstrations may remove the reward ambiguity problem (an always-zero reward cannot explain the ranking) and provides some "data augmentation", since a few ranked demonstrations define many pairwise comparisons. Additionally, suboptimal demonstrations may provide more diverse data by exploring a larger area of the state space (but may miss the parts relevant to solving the task...).

## Tricks and Tips

The authors used the ground-truth reward to rank trajectories, but they also show that approximate rankings do not hurt performance much. To avoid overfitting, they used an ensemble of 5 neural networks to predict the reward. For episodic tasks, they compare subtrajectories that start at similar timesteps (the subtrajectory from the higher-ranked demonstration starts a bit later in its episode than the one it is compared against, so that the reward increases as the episode progresses). At RL training time, the learned reward is passed through a sigmoid to avoid large changes in the reward scale across time-steps.

## Results

Results are quite positive, and performance can be good even when the learned reward is not really correlated with the ground truth (cf. HalfCheetah). They also show that T-REX is robust to different kinds of ranking noise: random swapping of pairwise rankings, and rankings by humans who only have access to a description of the task and not the ground-truth reward. **They also automatically rank the demonstrations using the number of learning steps of a learning agent: T-REX could therefore be used as an intrinsic reward alongside the ground truth to accelerate training.**

**Limitations:** They do not show that T-REX can match an optimal expert; maybe ranking demonstrations hurts when all the demos are close to optimality?
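## Appendix: Sketch of the T-REX Reward-Learning Loss

T-REX trains the reward network $\hat{r}_\theta$ with a pairwise ranking (Bradley-Terry / cross-entropy) loss over trajectory pairs $\tau_i \prec \tau_j$:

$$\mathcal{L}(\theta) = -\sum_{\tau_i \prec \tau_j} \log \frac{\exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}{\exp \sum_{s \in \tau_i} \hat{r}_\theta(s) + \exp \sum_{s \in \tau_j} \hat{r}_\theta(s)}$$

Below is a minimal PyTorch sketch of this objective, not the paper's implementation: it assumes observations are flat float vectors, that `demos` is a list of trajectories sorted from worst to best, and the network sizes and hyperparameters are illustrative. The ensemble of 5 networks and the sigmoid applied at RL time are omitted for brevity.

```python
# Minimal sketch of the T-REX pairwise reward-learning objective (PyTorch).
import random
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a single observation to a scalar reward."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):                 # obs: (T, obs_dim)
        return self.net(obs).squeeze(-1)    # per-step rewards: (T,)

def sample_subtraj_pair(demos, length=50):
    """Pick two demos i < j (j is ranked higher) and cut subtrajectories,
    starting the higher-ranked one no earlier than the lower-ranked one."""
    i, j = sorted(random.sample(range(len(demos)), 2))
    lo, hi = demos[i], demos[j]
    start_lo = random.randint(0, max(0, len(lo) - length))
    start_hi = random.randint(start_lo, max(start_lo, len(hi) - length))
    return lo[start_lo:start_lo + length], hi[start_hi:start_hi + length]

def trex_loss(reward_net, worse, better):
    """Cross-entropy loss: the higher-ranked subtrajectory should get
    the larger predicted return."""
    returns = torch.stack([reward_net(worse).sum(), reward_net(better).sum()])
    # Label 1 = index of the higher-ranked subtrajectory.
    return nn.functional.cross_entropy(returns.unsqueeze(0),
                                       torch.tensor([1]))

# Toy usage: random "demonstrations" whose ranking is arbitrary here.
obs_dim = 8
demos = [torch.randn(200, obs_dim) for _ in range(5)]   # worst -> best
reward_net = RewardNet(obs_dim)
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-4)

for step in range(1000):
    worse, better = sample_subtraj_pair(demos)
    loss = trex_loss(reward_net, worse, better)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Once trained, `reward_net` (or an ensemble average of several such networks) replaces the environment reward when running a standard RL algorithm such as PPO.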