Summary by CodyWild 6 years ago
It’s a well-understood problem in Reinforcement Learning that it is difficult to fully specify the exact reward function for an agent you’re training, especially when that agent will need to operate in conditions potentially different from those it was trained in. The canonical example, used throughout the Inverse Reward Design paper, is an agent trained in an environment of grass and dirt that then encounters an environment containing lava. In a typical problem setup, the agent would be indifferent between passing over the lava and avoiding it, because it was never disincentivized from doing so during training.
The fundamental approach this paper takes is to explicitly assume that there exists a reward designer who gave the agent some proxy reward, and that this proxy reward is a good approximation of the true reward in the training environment but may not be in the test environment. This framing of the reward as a noisy signal allows the model to formalize its uncertainty about scenarios where the proxy reward might map poorly onto the true one.
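If I’m remembering the formalism right (and this is my notation, not necessarily the paper’s exact symbols), the proxy weights $\tilde{w}$ are treated as an observation about the true weights $w^*$, generated in the training MDP $\tilde{M}$ with a likelihood roughly of the form

$$
P(\tilde{w} \mid w^*, \tilde{M}) \propto \exp\!\big(\beta\, w^{*\top} \phi(\xi_{\tilde{w}})\big),
$$

where $\phi(\xi_{\tilde{w}})$ are the feature counts of the trajectory the proxy reward induces in $\tilde{M}$, and $\beta$ controls how near-optimal the designer is assumed to be. Bayes’ rule then gives a posterior $P(w^* \mid \tilde{w}, \tilde{M}) \propto P(\tilde{w} \mid w^*, \tilde{M})\, P(w^*)$ over what the true reward could actually be.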
The paper tests this through a fairly simplified model. In the example, the agent is given a reward function expressed as a weighting over the types of squares it can move onto: a strong positive weight on dirt, and a strong negative one on grass. The agent then enters an environment containing lava, which implicitly carries a weight of 0 in its reward function. However, if you integrate over all possible weight values for “lava”, none of them would have produced different behavior over the training trajectories, so the training data leaves that weight completely unconstrained. If you then assume high uncertainty over it and adopt a risk-averse policy that assumes bad outcomes under uncertainty, the agent ends up avoiding values of the environment feature vector that the training data never weighted against.
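To make that risk-averse step concrete, here’s a minimal toy sketch (my own Python, with made-up weights and feature counts, not code or notation from the paper): each candidate trajectory is scored by its worst-case return over a set of weight hypotheses that are all consistent with the training environment, and it’s the unconstrained lava weight that drags the shortcut down.

```python
# Toy sketch of risk-averse evaluation under reward uncertainty.
# Feature order: [dirt, grass, lava]. All numbers are made up for illustration.
import numpy as np

proxy_w = np.array([+1.0, -1.0, 0.0])  # designer's proxy: lava weight defaults to 0

# Hypotheses over the true weights: dirt/grass are pinned down by training
# behavior, but the lava weight was never exercised, so it could be anything.
hypotheses = [np.array([+1.0, -1.0, lava_w]) for lava_w in (-10.0, 0.0, +10.0)]

# Candidate trajectories summarized by feature counts
# (how many dirt / grass / lava cells each one crosses).
trajectories = {
    "shortcut_over_lava": np.array([5.0, 0.0, 2.0]),  # proxy return: 5.0
    "long_way_on_dirt":   np.array([4.0, 0.0, 0.0]),  # proxy return: 4.0
}

def worst_case_return(features, weight_hypotheses):
    """Minimum return over all weight hypotheses (risk-averse evaluation)."""
    return min(float(w @ features) for w in weight_hypotheses)

# The proxy reward alone prefers the lava shortcut (5.0 > 4.0)...
print({name: float(proxy_w @ f) for name, f in trajectories.items()})

# ...but the risk-averse criterion picks the dirt route, because one
# hypothesis assigns lava a large negative weight.
best = max(trajectories, key=lambda n: worst_case_return(trajectories[n], hypotheses))
print(best)  # -> "long_way_on_dirt"
```

The paper’s actual method plans against a posterior over weights rather than a hand-picked hypothesis set, but the worst-case logic is the same.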
Overall, the intuition of this paper makes sense to me, but it’s unclear whether the formulation generalizes beyond a fairly trivial setting, where the reward function is an explicit, given function of the feature vectors, rather than (as is typical) a scalar score not explicitly parametrized by the states of the game prior to the very last one. It’s certainly possible that it does, but I don’t feel I have the confidence to say at this point.