Summary by CodyWild 6 years ago
This paper proposes a new curiosity-based intrinsic reward technique that seeks to address one of the failure modes of previous curiosity methods. The basic idea of curiosity is that exploring novel areas of an environment is often correlated with gaining reward within that environment, and that we can find ways to incentivize the former that don’t require a hand-designed reward function. This is appealing because many useful-to-learn environments either lack inherent reward altogether, or have reward that is very sparse (i.e. no signal until you reach the end, at which point you get a reward of 1). In both of these cases, supplementing with some kind of intrinsic incentive towards exploration might improve performance. The existing baseline curiosity technique is called ICM, and works based on “surprisal”: asking the agent to predict the next state as a function of its current state, and incentivizing exploration of areas where the gap between the predicted and actual next state is high, to promote exploration of harder-to-predict (and presumably more poorly sampled) locations. However, one failure mode of this approach is something called the “noisy TV” problem: if the environment contains something analogous to a television where pressing a button jumps to a random channel, that television is highly unpredictable, and thus a source of easy intrinsic reward, and thus liable to distract the agent from any other actions.
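To make the surprisal idea concrete, here is a minimal sketch of that style of bonus (not the full ICM architecture, which also involves an inverse model and a learned feature encoder); the embedding inputs, network sizes, and names are placeholders I'm assuming for illustration:

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next state embedding from the current embedding and action."""
    def __init__(self, embed_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, phi_s, action_onehot):
        return self.net(torch.cat([phi_s, action_onehot], dim=-1))

def surprisal_bonus(model, phi_s, action_onehot, phi_next):
    # Intrinsic reward = prediction error on the next embedding; a "noisy TV"
    # keeps this error high forever, which is the failure mode the paper targets.
    return ((model(phi_s, action_onehot) - phi_next) ** 2).mean(dim=-1)
```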
As an alternative, the authors here suggest a different way of defining novelty: rather than something that is unpredictable, novelty should be seen as something far away from what I as an agent have seen before. This is more direct than the prior approach, which takes ‘hard to predict’ as a proxy for ‘somewhere I haven’t explored’, which may not necessarily be a reasonable assumption.
https://i.imgur.com/EfcAOoI.png
They implement this idea by keeping a memory of past (embedded) observations that the agent has seen during this episode, and, at each step, checking whether the current observation is predicted to be more than k steps away from all of the observations in memory (more on that in a moment). If so, a bonus reward is added, and this observation is added to the aforementioned memory. (Which, waving hands vigorously, kind of ends up functioning as a spanning set of prior experience.)
https://i.imgur.com/gmHE11s.png
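As a rough sketch of that memory-and-bonus loop, assuming a pre-trained comparator (described below) that returns the probability that two embedded observations are more than k steps apart; the paper actually scores reachability per memory entry, aggregates with a percentile, and uses a shaped bonus, so the min over memory and the fixed bonus here are simplifications:

```python
class EpisodicMemory:
    """Per-episode memory of embedded observations; grants a bonus when the
    comparator judges the current observation to be more than k steps away
    from everything stored so far."""
    def __init__(self, comparator, novelty_threshold=0.5, bonus=1.0):
        self.comparator = comparator   # callable: (stored_emb, new_emb) -> P(far apart)
        self.threshold = novelty_threshold
        self.bonus = bonus
        self.embeddings = []

    def reset(self):
        self.embeddings = []

    def reward(self, obs_embedding):
        if self.embeddings:
            # Novel only if predicted far from *every* stored observation;
            # min over memory is a simple stand-in for the paper's aggregation.
            farness = min(self.comparator(m, obs_embedding) for m in self.embeddings)
            if farness < self.threshold:   # close to something already seen
                return 0.0
        # Far from everything in memory: reward it and remember it.
        self.embeddings.append(obs_embedding)
        return self.bonus
```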
The question of “how many steps is observation A from observation B” is answered by a separate Comparator network, which is trained in a pretty straightforward fashion: a random-sampling policy is used to collect trajectories, which are then turned into pairs of observations, labeled 1 if they occurred more than k + p steps apart and 0 if they occurred fewer than k steps apart. These paired observations are passed into a shared-weight convolutional network, which creates an embedding, and, from that embedding, a prediction is made as to whether they’re closer than the threshold or farther away. This network is pre-trained before the actual RL training starts. (Minor sidenote: at RL-training time, the network is chopped in two; the embeddings are read out and stored in memory, and each is then paired with the current observation’s embedding to make the prediction.)
https://i.imgur.com/1oUWKyb.png
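A sketch of how such a comparator might look as a siamese network; the conv sizes, labeling helper, and names here are assumptions for illustration, and the paper's actual encoder and training loop differ from this toy version:

```python
import torch
import torch.nn as nn

class Comparator(nn.Module):
    """Shared-weight encoder plus a small head that predicts whether two
    observations are temporally far apart."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.embed = nn.Sequential(               # shared conv encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.head = nn.Sequential(                # comparison head
            nn.Linear(2 * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),                     # logit for "more than k + p steps apart"
        )

    def forward(self, obs_a, obs_b):
        ea, eb = self.embed(obs_a), self.embed(obs_b)
        return self.head(torch.cat([ea, eb], dim=-1))

def label_pairs(step_gaps, k, p):
    """Label 1 if a pair is > k + p steps apart, 0 if < k steps apart;
    pairs in the ambiguous middle band are dropped."""
    gaps = torch.as_tensor(step_gaps, dtype=torch.float32)
    keep = (gaps > k + p) | (gaps < k)
    return keep, (gaps > k + p).float()
```

Because the encoder and head are separate modules, the “chopped in two” sidenote above corresponds to caching embed(obs) for everything in memory and only running the head against the current observation’s embedding at RL time.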
Overall, the authors find that their method works better than both ICM and no-intrinsic-reward on VizDoom (a maze + shooting game), and the advantage is stronger in settings where the external reward is sparser.
https://i.imgur.com/4AURZbX.png
On DeepMind Lab tasks, they saw no advantage on tasks with already-dense extrinsic rewards, and little advantage on the “normally sparse” task, which they suggest may be because it was actually easier than expected. They added doors to a maze navigation task, to ensure the agent couldn’t find the target right away, and in this harder setting their method performed better. They also tried a fully no-extrinsic-reward setting, and their method strongly outperformed both the ICM baseline and (obviously) the only-extrinsic-reward baseline, which was basically an untrained random policy in this setting. Regarding the poor performance of the ICM baseline in this environment: “we hypothesise that the agent can most significantly change its current view when it is close to the wall — thus increasing one-step prediction error — so it tends to get stuck near “interesting” diverse textures on the walls.”