Inverse Reward Design
Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.AI, cs.LG
First published: 2017/11/08
Abstract: Autonomous agents optimize the reward function we give them. What they don't
know is how hard it is for us to design a reward function that actually
captures what we want. When designing the reward, we might think of some
specific training scenarios, and make sure that the reward will lead to the
right behavior in those scenarios. Inevitably, agents encounter new scenarios
(e.g., new types of terrain) where optimizing that same reward may lead to
undesired behavior. Our insight is that reward functions are merely
observations about what the designer actually wants, and that they should be
interpreted in the context in which they were designed. We introduce inverse
reward design (IRD) as the problem of inferring the true objective based on the
designed reward and the training MDP. We introduce approximate methods for
solving IRD problems, and use their solution to plan risk-averse behavior in
test MDPs. Empirical results suggest that this approach can help alleviate
negative side effects of misspecified reward functions and mitigate reward
hacking.
The method has the robot reason roughly as follows:
1. The human gave me a reward function $\tilde{r}$, selected in order to get me to behave the way they wanted.
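2. So I should favor reward functions which produce that kind of behavior in the training environment.

This amounts to running RL on $\tilde{r}$ in the training environment (step 1), then running IRL on the resulting policy to infer which reward functions are consistent with it (step 2); see the final paragraph of Section 4.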
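The two-step reasoning above can be sketched on a toy problem. This is a minimal illustration under several assumptions that are not the paper's setup: the training environment is a fixed finite set of trajectories with hand-picked feature counts (`traj_features`), rewards are linear in those features, the true reward and the proxy both come from the same small finite set (`candidates`), and the prior is uniform. The likelihood follows the idea that the designer tends to pick proxies whose induced training behavior scores well under the true reward; all function names here are hypothetical.

```python
import itertools
import numpy as np

def optimal_features(w, traj_features):
    """Step 1 (stand-in for RL): feature counts of the trajectory maximizing w . phi."""
    return traj_features[np.argmax(traj_features @ w)]

def ird_posterior(w_proxy, candidates, traj_features, beta=1.0):
    """Step 2 (stand-in for IRL): posterior over candidate true rewards, uniform prior."""
    # Behavior (feature counts) that each possible proxy would induce in training.
    induced = np.array([optimal_features(w, traj_features) for w in candidates])
    # scores[i, j]: candidate true reward i evaluated on the behavior induced by proxy j.
    scores = beta * candidates @ induced.T
    # Log-normalizer over all proxies the designer could have chosen (stable log-sum-exp).
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    # Log-likelihood of the proxy actually chosen, under each candidate true reward.
    phi_proxy = optimal_features(w_proxy, traj_features)
    log_lik = beta * candidates @ phi_proxy - log_z
    post = np.exp(log_lik - log_lik.max())
    return post / post.sum()

# Features are (dirt, grass, lava); no training trajectory ever touches lava.
traj_features = np.array([[4.0, 0.0, 0.0],
                          [2.0, 2.0, 0.0],
                          [0.0, 4.0, 0.0]])
# Hypothetical finite reward space: every weight vector in {-1, 0, 1}^3.
candidates = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=3)))
# The designer's proxy: dirt is good, grass is bad, lava was never considered.
w_proxy = np.array([1.0, -1.0, 0.0])

post = ird_posterior(w_proxy, candidates, traj_features)
```

In this toy, candidates that differ only in their lava weight receive the same posterior mass, because lava never appears in training and so the proxy says nothing about it. That remaining uncertainty is what a risk-averse planner could use to avoid lava at test time.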