[link]
This paper combines imitation learning algorithm GAIL with recent advances in goal-conditioned reinforcement learning, to create a combined approach that can make efficient use of demonstrations, but can also learn information about a reward that can allow the agent to outperform the demonstrator. Goal-conditioned learning is a form of reward-driven reinforcement learning where the reward is a defined to be 1 when an agent reaches a particular state, and 0 otherwise. This can be a particularly useful form of learning for navigation tasks, where, instead of only training your agent to reach a single hardcoded goal (as you would with a reward function) you teach it to reach arbitrary goals when information about the goal is passed in as input. A typical difficulty with this kind of learning is that its reward is sparse: for any given goal, if an agent never reaches it, it won't ever get reward signal it can use to learn to find it again. A clever solution to this, proposed by earlier method HER (Hindsight Experience Replay), is to perform rollouts of the agent trajectory, and then train your model to reach all the states it actually reached along that trajectory. Said another way, even if your agent did a random, useless thing with respect to one goal, if you retroactively decided that the goal was where it ended up, then it'd be able to receive reward signal after all. In a learning scenario with a fixed reward, this trick wouldn't make any sense, since you don't want to train your model to only go wherever it happened to initially end up. But because the policy here is goal-conditioned, we're not giving our policy wrong information about how to go to the place we want, we're incentivizing it to remember ways it got to where it ended up, in the hopes that it can learn generalizable things about how to reach new places. The other technique being combined in this paper is imitation learning, or learning from demonstrations. Demonstrations can be highly useful for showing the agent how to get to regions of state space it might not find on its own. The authors of this paper advocate creating a goal-conditioned version of one particular imitation learning algorithm (Generative Adversarial Imitation Learning, or GAIL), and combining that with an off-policy version of Hindsight Experience Replay. In their model, a discriminator tries to tell the behavior of the demonstrator from that of the agent, given some input goal, and uses that as loss, combined with the loss of a more normal Q learning loss with a reward set to 1 when a goal is achieved. Importantly, they amplify both of these methods using the relabeling trick mentioned before: for both the demonstrators and the actual agent trajectories, they take tuples of (state, next state, goal) and replace the intended goal with another state reached later in the trajectory. For the Q learner, this performs its normal role as a way to get reward in otherwise sparse settings, and for the imitation learner, it is a form of data amplification, where a single trajectory + goal can be turned into multiple trajectories "successfully" reaching all of the intermediate points along the observed trajectory. The authors show that their method learns more quickly (as a result of the demonstrations), but also is able to outperform demonstrators, which it wouldn't generally be able to do without an independent, non-demonstrator reward signal
Your comment:
|