Summary by Ankesh Anand
This paper shows exciting results on using Model-based RL for Atari.
Model-based RL has shown impressive improvements in sample efficiency on MuJoCo tasks ([Chua et al., 2018](https://arxiv.org/abs/1805.12114)), so it's nice to see that these sample-efficiency gains carry over to pixel-based environments like Atari too.
Specifically, the authors show that their model-based method does well on several Atari games after training on only 100K environment steps (400K frames with a frame skip of 4), which roughly corresponds to 2 hours of gameplay. They compare against SOTA model-free baselines (Rainbow, PPO) after a similar number of frames and show that the model-based method achieves much better scores.
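As a quick sanity check on that budget (assuming the standard 60 Hz Atari emulator frame rate, which is not stated in this summary itself):

```python
env_steps = 100_000              # agent decisions (environment steps)
frames = env_steps * 4           # frame skip of 4 -> 400,000 emulator frames
hours = frames / 60 / 3600       # at 60 emulator frames per second
print(frames, round(hours, 2))   # 400000 1.85  (~2 hours of play)
```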
The overall training procedure has a very Dyna-like flavor. The algorithm, termed SimPLe, follows an iterative scheme (a structural sketch in code follows the figure below):
* Collect experience from the real environment using a policy (initialized to random).
* Use this experience to train the world model (a next-step frame prediction model, and a reward prediction model). This amounts to supervised learning on `{(s, a) -> s'}` and `{(s, a) -> r}` pairs.
* Generate rollouts using the world model, and learn a policy with these rollouts using PPO.
![](https://i.imgur.com/SZLmdME.png)
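A minimal structural sketch of this Dyna-style loop. The callables (`collect_experience`, `train_world_model`, `train_policy_with_ppo`) and the initial random policy are placeholders I'm assuming for illustration, not the paper's actual tensor2tensor code:

```python
def simple_style_loop(real_env, initial_policy, collect_experience,
                      train_world_model, train_policy_with_ppo,
                      num_iterations, real_steps_per_iter):
    """Alternate real-data collection, world-model fitting, and policy
    improvement inside the learned model. All helpers are hypothetical
    stand-ins supplied by the caller."""
    policy, data = initial_policy, []
    for _ in range(num_iterations):
        # 1. Gather real transitions (s, a, r, s') with the current policy.
        data += collect_experience(real_env, policy, steps=real_steps_per_iter)
        # 2. Supervised learning on {(s, a) -> s'} (frame model)
        #    and {(s, a) -> r} (reward model) pairs.
        world_model = train_world_model(data)
        # 3. PPO on short rollouts simulated by the world model,
        #    branching from states sampled out of the real data.
        policy = train_policy_with_ppo(world_model, data)
    return policy
```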
**Countering distributional shift:**
A key issue when training models is compounding errors when doing multi-step rollouts. This is similar to the problem of making predictions with RNNs trained via teacher-forcing, and hence it's natural to leverage existing techniques from that literature.
This paper uses one such technique, scheduled sampling: during training, some of the input frames are randomly replaced with the model's own predictions from the previous step. This seems like a natural way to make the model robust to slight distributional shift.
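To make the idea concrete, here is a minimal scheduled-sampling training step for a generic autoregressive frame model (PyTorch-flavored sketch; the `model(frame, action) -> next_frame` interface, the MSE loss, and the per-step replacement probability are assumptions for illustration, not the paper's exact setup, which implements this inside its tensor2tensor video model and anneals the replacement probability over training):

```python
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, frames, actions, replace_prob):
    """One training step over a short rollout with scheduled sampling.

    frames:  (T+1, B, C, H, W) ground-truth frame sequence
    actions: (T, B, ...) actions taken at each step
    replace_prob: probability of feeding the model its own previous
                  prediction instead of the ground-truth frame.
    """
    loss = 0.0
    prev_pred = None
    for t in range(actions.shape[0]):
        inp = frames[t]
        # Randomly swap the ground-truth input frame for the model's own
        # prediction from the previous step, so the model sees (and learns
        # to recover from) its own errors during training.
        if prev_pred is not None and torch.rand(()) < replace_prob:
            inp = prev_pred.detach()
        pred = model(inp, actions[t])
        loss = loss + F.mse_loss(pred, frames[t + 1])
        prev_pred = pred
    return loss / actions.shape[0]
```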
**Commentary / possible future work:**
* The paper evaluates on only 26 out of the 60 Atari games in ALE. I would have really liked the authors to show performance numbers on all the games, even the ones where the method doesn't do well.
* Related: I suspect the method would not work well when the initial diversity of frames provided by the random policy is insufficient (e.g., sparse-reward games like Montezuma's Revenge or Pitfall). Using sample-efficient exploration algorithms to augment model learning would be really interesting.
* The trained world model can only roll out for about 50 time steps (compounding errors don't allow for longer rollouts); it might be worthwhile to explore models that can make long-horizon predictions [(TD-VAE?)](https://openreview.net/forum?id=S1x4ghC9tQ).
* Apart from sample-efficiency gains, one reason I am excited about models is their potential ability to generalize to different tasks in the same environment. Benchmarking their generalization capability should thus be an exciting next step.
Finally, props to the authors for open-sourcing the code ([tensor2tensor/rl](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/rl)) and providing detailed instructions to run it.