First published: 2018/11/15 (3 years ago) Abstract: To solve complex real-world problems with reinforcement learning, we cannot
rely on manually specified reward functions. Instead, we can have humans
communicate an objective to the agent directly. In this work, we combine two
approaches to learning from human feedback: expert demonstrations and
trajectory preferences. We train a deep neural network to model the reward
function and use its predicted reward to train an DQN-based deep reinforcement
learning agent on 9 Atari games. Our approach beats the imitation learning
baseline in 7 games and achieves strictly superhuman performance on 2 games
without using game rewards. Additionally, we investigate the goodness of fit of
the reward model, present some reward hacking problems, and study the effects
of noise in the human labels.
How can humans help an agent perform at a task that has no clear reward? Imitation, demonstration, and preferences. This paper asks which combinations of imitation, demonstration, and preferences will best guide an agent in Atari games.
For example an agent that is playing Pong on the Atari, but can't access the score. You might help it by demonstrating your play style for a few hours. To help the agent further you are shown two short clips of it playing and you are asked to indicate which one, if any, you prefer.
To avoid spending many hours rating videos the authors sometimes used an automated approach where the game's score decides which clip is preferred, but they also compared this approach to human preferences. It turns out that human preferences are often worse because of reward traps. These happen, for example, when the human tries to encourage the agent to explore ladders, resulting in the agent obsessing about ladders instead of continuing the game.
They also observed that the agent often misunderstood the preferences it was given, causing unexpected behavior called reward hacking. The only solution they mention was to have someone keep an eye on it and continue giving it preferences, but this isn't always feasible. This is the alignment problem which is a hard problem in AGI research.
Results: adding merely a few thousand preferences can help in most games, unless they have sparse rewards. Demonstrations, on the other hand, tend to help those games with sparse rewards but only if the demonstrator is good at the game.