Model-Based Reinforcement Learning for Atari
Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Ryan Sepassi, George Tucker, Henryk Michalewski
arXiv e-Print archive, 2019
Keywords:
cs.LG, stat.ML
First published: 2019/03/01
Abstract: Model-free reinforcement learning (RL) can be used to learn effective
policies for complex tasks, such as Atari games, even from image observations.
However, this typically requires very large amounts of interaction --
substantially more, in fact, than a human would need to learn the same games.
How can people learn so quickly? Part of the answer may be that people can
learn how the game works and predict which actions will lead to desirable
outcomes. In this paper, we explore how video prediction models can similarly
enable agents to solve Atari games with orders of magnitude fewer interactions
than model-free methods. We describe Simulated Policy Learning (SimPLe), a
complete model-based deep RL algorithm based on video prediction models and
present a comparison of several model architectures, including a novel
architecture that yields the best results in our setting. Our experiments
evaluate SimPLe on a range of Atari games and achieve competitive results with
only 100K interactions between the agent and the environment (400K frames),
which corresponds to about two hours of real-time play.
This paper shows exciting results on using model-based RL for Atari.
Model-based RL has shown impressive improvements in sample efficiency on MuJoCo tasks ([Chua et al., 2018](https://arxiv.org/abs/1805.12114)), so it's nice to see that the sample-efficiency improvements carry over to pixel-based environments like Atari too.
Specifically, the authors show that their model-based method can do well on several Atari games after training on only 100K environment steps (400K frames with frame skip 4), which roughly corresponds to 2 hours of gameplay. They compare to SOTA model-free baselines (Rainbow, PPO) after a similar number of frames and show that the model-based version achieves much better scores.
The overall training procedure has a very Dyna-like flavor. The algorithm, termed SimPLe, follows an iterative scheme (sketched in code after the figure below):
* Collect experience from the real environment using a policy (initialized to random).
* Use this experience to train the world model (a next-step frame prediction model and a reward prediction model). This amounts to supervised learning on `{(s, a) -> s'}` and `{(s, a) -> r}` pairs.
* Generate rollouts using the world model, and learn a policy with these rollouts using PPO.
![SimPLe training loop](https://i.imgur.com/SZLmdME.png)
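To make the loop concrete, here is a minimal Python-style sketch of the outer iteration. This is my own illustration, not the tensor2tensor implementation: all names (`collect_real_experience`, `world_model.fit`, `world_model.rollout`, `policy.ppo_update`, `sample_start_states`) and the default iteration/step counts are hypothetical placeholders; only the 50-step model rollout horizon comes from the paper.

```python
# Hypothetical sketch of the SimPLe (Dyna-style) outer loop.
# None of these function/method names are from the actual codebase;
# they just mark where each step of the loop happens.

def simple_outer_loop(env, policy, world_model, replay_buffer,
                      num_iterations=15,          # illustrative value
                      real_steps_per_iter=6400,   # illustrative value
                      policy_update_steps=1000,   # illustrative value
                      model_rollout_len=50):      # short horizon, as discussed below
    for _ in range(num_iterations):
        # 1) Collect experience in the real environment with the current policy
        #    (random at the very first iteration).
        replay_buffer.extend(
            collect_real_experience(env, policy, real_steps_per_iter))

        # 2) Supervised training of the world model:
        #    (s, a) -> s' (next frame) and (s, a) -> r (reward).
        world_model.fit(replay_buffer)

        # 3) Policy improvement entirely inside the learned model: short
        #    imagined rollouts starting from real states, optimized with PPO.
        for _ in range(policy_update_steps):
            starts = sample_start_states(replay_buffer)
            imagined_rollouts = world_model.rollout(
                starts, policy, horizon=model_rollout_len)
            policy.ppo_update(imagined_rollouts)

    return policy
```

Note that the policy only ever sees real data indirectly, through the world model; the real environment is touched just once per outer iteration to gather fresh training data for the model.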
**Countering distributional shift:**
A key issue when training models is compounding errors when doing multi-step rollouts. This is similar to the problem of making predictions with RNNs trained via teacher-forcing, and hence it's natural to leverage existing techniques from that literature.
This paper uses one such technique, scheduled sampling: during training, some input frames are randomly replaced with the model's own predictions from the previous step. This seems like a natural way to make the model robust to slight distributional shift.
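A minimal sketch of what scheduled sampling looks like for an autoregressive frame predictor, under my own assumptions: `model.predict_next` and `model.loss` are hypothetical interfaces, the model here conditions on a single previous frame rather than a stack (purely for brevity), and the annealing schedule for `replace_prob` is not the paper's exact one.

```python
import random

def frame_prediction_loss(model, frames, actions, replace_prob):
    """Training loss for a next-frame predictor with scheduled sampling.

    frames:  list of observed frames [f_0, ..., f_T]
    actions: list of actions [a_0, ..., a_{T-1}]
    replace_prob: probability of feeding the model its own prediction;
                  typically annealed upward over training (an assumption).
    """
    total_loss = 0.0
    prev_prediction = None
    for t in range(len(actions)):
        # With probability `replace_prob`, feed the model its own previous
        # prediction instead of the ground-truth frame, so it learns to
        # recover from its own mistakes during multi-step rollouts.
        if prev_prediction is not None and random.random() < replace_prob:
            model_input = prev_prediction
        else:
            model_input = frames[t]
        prev_prediction = model.predict_next(model_input, actions[t])
        total_loss += model.loss(prev_prediction, frames[t + 1])
    return total_loss
```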
**Commentary / possible future work:**
* The paper evaluates only 26 out of the 60 Atari games in ALE. I would have really liked it if the authors had shown performance numbers on all the games, even when they weren't good.
* Related: I suspect the method would not work well when the initial diversity of frames given by the random policy is not sufficient (e.g., sparse-reward games like Montezuma's Revenge or Pitfall). Using sample-efficient exploration algorithms to augment model learning would be really interesting.
* The trained world model can only roll out for about 50 time steps (compounding errors prevent longer rollouts); it might be worthwhile to explore models that can make long-horizon predictions [(TD-VAE?)](https://openreview.net/forum?id=S1x4ghC9tQ).
* Apart from sample-efficiency gains, one reason I am excited about models is their potential ability to generalize to different tasks in the same environment. Benchmarking their generalization capability should thus be an exciting next step.
Finally, props to the authors for open-sourcing the code ([tensor2tensor/rl](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl)) and providing detailed instructions to run it.