Model-Based Active Exploration on ShortScience.org

Model-Based Active Exploration
Pranav Shyam and Wojciech Jaśkowski and Faustino Gomez
arXiv e-Print archive - 2018 via Local arXiv
Keywords: cs.LG, cs.AI, cs.IT, cs.NE, math.IT, stat.ML

Summary by CodyWild 5 years ago

This paper continues in the tradition of curiosity-based methods, which reward agents for exploring novel parts of their environment in the hope that this intrinsically motivates learning. However, the paper argues that it’s insufficient to treat novelty as an occasional bonus on top of a normal reward function, and that you should instead use a process specifically designed to seek out novelty. Specifically: you should design a policy whose goal is to experience transitions and world-states that are high novelty.

In this setup, as in other curiosity-based papers, “high novelty” is defined in terms of a state being unpredictable given a prior state, history, and action. However, where other papers apply a novelty reward only once the agent has arrived somewhere novel, here the authors build a model (technically, an ensemble of models) to predict the state at various future points. The ensemble is important because it’s (quasi) bootstrapped, and thus gives us a measure of uncertainty. States where the predictions of the ensemble diverge represent places of uncertainty, and thus of high value to explore. I don’t 100% follow the analytic specification of this idea (even though the heuristic/algorithmic description makes sense). The authors frame the utility function of a state and action as being equivalent to the Jensen-Shannon divergence (~distance between probability distributions) shown below.

https://i.imgur.com/YIuomuP.png

Here, P(S | S, a, T) is the probability of a state given the prior state and action under a given model of the environment (transition model), and P(gamma) is the distribution over the space of possible transition models one might learn. A “model” here is one network out of the ensemble of networks that makes up our bootstrapped (each trained on a different set of data) distribution over models. Conceptually, I think this calculation is measuring “how different is each sampled model's state distribution from those of all the other models in the distribution over possible models”. If the models within the distribution diverge from one another, that indicates a location of higher uncertainty.
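To make the "ensemble disagreement" intuition concrete, here is a minimal sketch of the Jensen-Shannon divergence among equally weighted ensemble members, computed as the entropy of the averaged prediction minus the average entropy of the individual predictions. It assumes a small discrete state space and uniform weights over members; the function names and toy distributions are made up for illustration and are not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def ensemble_jsd(predictions):
    """Jensen-Shannon divergence among equally weighted ensemble members.

    predictions: one next-state distribution per ensemble member, each a
    list of probabilities over the same discrete set of states.
    (Uniform weighting of members is an assumption of this sketch.)
    """
    n = len(predictions)
    # Mixture distribution: average the members' predictions state by state.
    mixture = [sum(column) / n for column in zip(*predictions)]
    # JSD = H(mixture) - mean of H(member).
    return entropy(mixture) - sum(entropy(p) for p in predictions) / n


# Hypothetical toy predictions for one (state, action) pair.
agree = [[0.9, 0.1, 0.0], [0.9, 0.1, 0.0], [0.9, 0.1, 0.0]]
disagree = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

low_utility = ensemble_jsd(agree)      # members agree, so close to zero
high_utility = ensemble_jsd(disagree)  # members fully disagree, so log(3)
```

When every member predicts the same distribution the divergence is (numerically) zero, and when the three members each put all their mass on a different state it reaches its maximum for three members, log 3.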

What’s important about this is that, by building a full transition model, the authors can calculate the expected novelty or “utility” of future transitions the agent might take, because the agent can make a best guess based on this transition model (which, while called a “prior”, is really something trained on all data up to the current iteration). My understanding is that these kinds of models function similarly to a Q(s,a) or V(s) in a pure-reward case: they estimate the “utility reward” of different states and actions, and then the policy is updated to increase that expected reward.
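As a rough illustration of the Q(s,a) analogy, the sketch below runs finite-horizon value iteration in which the "reward" is a novelty utility rather than an external reward. Everything here is an assumption for illustration: the tabular setting, the discount factor, the hand-written utility table (standing in for ensemble disagreement), and the deterministic transition table (standing in for the learned model). It is not the planner used in the paper.

```python
def exploration_values(utility, transition, horizon, discount=0.9):
    """Finite-horizon value iteration with novelty utility as the reward.

    utility[s][a]: estimated novelty of taking action a in state s.
    transition[s][a]: dict mapping next state -> probability under the model.
    Returns q[s][a], the discounted expected utility over `horizon` steps.
    """
    states = list(utility)
    v = {s: 0.0 for s in states}
    q = {}
    for _ in range(horizon):
        q = {
            s: {
                a: utility[s][a]
                + discount * sum(p * v[s2] for s2, p in transition[s][a].items())
                for a in utility[s]
            }
            for s in states
        }
        # Greedy state value: best achievable utility from each state.
        v = {s: max(q[s].values()) for s in states}
    return q


# Hypothetical 3-state chain: novelty is concentrated in the far state 2.
utility = {
    0: {"stay": 0.1, "right": 0.0},
    1: {"stay": 0.0, "right": 0.0},
    2: {"stay": 1.0, "right": 1.0},
}
transition = {
    s: {"stay": {s: 1.0}, "right": {min(s + 1, 2): 1.0}} for s in utility
}

q_myopic = exploration_values(utility, transition, horizon=1)
q_planned = exploration_values(utility, transition, horizon=3)
best_myopic = max(q_myopic[0], key=q_myopic[0].get)
best_planned = max(q_planned[0], key=q_planned[0].get)
```

With a one-step horizon the agent in state 0 prefers "stay" (the only action with immediate novelty), while with a three-step horizon it prefers "right", because the model lets it anticipate the high-novelty state two steps away. That is the practical difference between planning for novelty and collecting a bonus only on arrival.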

I’ve recently read papers on ICM, and I was a little disappointed that this paper didn’t appear to benchmark against it, but instead against Bootstrapped DQN and Exploration Bonus DQN, which I know less well, so I can speak less to their conceptual differences from this approach. Another difficulty in getting a good sense of the results was that the task being tested on is fairly specific, and different from the RL results coming out of the world of e.g. Atari and DeepMind Lab. All of that said, this is a cautiously interesting idea, if the results generalize to beat more baselines on more environments.

Hi, I am one of the authors of the paper. This is a very nice summary! To clarify one point: ICM is a special case of the Exploration Bonus DQN baseline we used. Unlike ICM, we measure the state visitation frequency directly with an oracle instead of relying on prediction errors as a proxy (details in appendix). Hence, the numbers for ICM will be very similar to the Exploration Bonus DQN baseline, and the performance we report can be considered an upper bound on ICM’s potential performance. We agree that the value of the proposed method is dependent on its scalability and we are currently working on it :)

Great, thanks for the clarification! In general, this wasn't any particular judgment that ICM is a better or more correct baseline, it's just that I wasn't familiar with Exploration Bonus DQN, and so couldn't comment on it.
