First published: 2018/09/04 Abstract: A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving state of the
art results in various environments. We also train our agent entirely inside of
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. Interactive version of paper at
https://worldmodels.github.io
## General Framework
The take-home message is that, for reinforcement learning in environments with high-dimensional and partial observations, the key challenge is learning a good representation of the environment. This means learning a sensory feature extractor V to deal with the high-dimensional observations (pixels for example), but also learning a temporal model M of the environment's dynamics to deal with the partial observability. Given such representations, learning a controller that maximizes reward becomes easy: a single linear layer evolved with CMA-ES suffices.
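To make the "single linear layer evolved with CMA-ES" concrete, here is a minimal sketch (not the authors' code) using numpy and the `cma` package. The controller maps the concatenated features $[z_t, h_t]$ linearly to the action; `rollout` is a hypothetical helper that would run episodes with a given parameter vector and return the cumulative reward.

```python
import numpy as np
import cma  # pip install cma

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3   # sizes reported for CarRacing

def act(params, z, h):
    """Controller C: a_t = tanh(W [z_t, h_t] + b), with W and b unpacked from a flat vector."""
    n_in = Z_DIM + H_DIM
    W = params[: n_in * ACTION_DIM].reshape(ACTION_DIM, n_in)
    b = params[n_in * ACTION_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)

def evolve_controller(rollout, n_generations=100, popsize=64):
    """Evolve the flat parameter vector of C with CMA-ES (ask/tell interface).
    `rollout(params)` is a hypothetical helper: it should run one or more
    episodes using act(params, z, h) and return the mean cumulative reward."""
    n_params = (Z_DIM + H_DIM) * ACTION_DIM + ACTION_DIM   # 867 parameters for CarRacing
    es = cma.CMAEvolutionStrategy(n_params * [0.0], 0.5, {"popsize": popsize})
    for _ in range(n_generations):
        candidates = es.ask()
        # CMA-ES minimizes, so negate the reward.
        es.tell(candidates, [-rollout(np.asarray(p)) for p in candidates])
    return np.asarray(es.result.xbest)
```

The striking point is how small C is compared to V and M: all the heavy lifting is done by the representations.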
The authors call these representations a *world model*, since the learned dynamics can be used to simulate roll-outs. They show that policies trained inside the world model transfer well back to the real environment, provided measures are taken to prevent the policy from exploiting the world model's inaccuracies.
## Method
**Learning the World Model**
![](https://i.imgur.com/tgV17k4.png =600x)
In this work they propose to learn these representations offline and in an unsupervised manner, which makes training more efficient.
They use a VAE for V, trained exclusively with its reconstruction objective; that way the learned representation is independent of the reward and can be reused for any task. They then train M as a Mixture Density Network RNN (MDN-RNN) to predict the next sensory features (as extracted by the VAE), and possibly the done condition and the reward, thus learning the dynamics of the environment in the VAE's latent space (where they are likely simpler than in pixel space).
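As a rough sketch of what M could look like (layer sizes and mixture count are assumptions for illustration, close to the CarRacing settings), the MDN-RNN consumes $(z_t, a_t)$, updates an LSTM state, and outputs the parameters of a Gaussian mixture over each dimension of $z_{t+1}$, trained by negative log-likelihood:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNRNN(nn.Module):
    """LSTM over (z_t, a_t) with a mixture-density head over z_{t+1}."""
    def __init__(self, z_dim=32, action_dim=3, hidden_dim=256, n_mix=5):
        super().__init__()
        self.z_dim, self.n_mix = z_dim, n_mix
        self.lstm = nn.LSTM(z_dim + action_dim, hidden_dim, batch_first=True)
        # For every latent dimension: n_mix mixture weights, means and log-stds.
        self.head = nn.Linear(hidden_dim, 3 * n_mix * z_dim)

    def forward(self, z, a, hidden=None):
        # z: (B, T, z_dim), a: (B, T, action_dim)
        out, hidden = self.lstm(torch.cat([z, a], dim=-1), hidden)
        log_pi, mu, log_sigma = self.head(out).chunk(3, dim=-1)
        shape = out.shape[:-1] + (self.z_dim, self.n_mix)
        log_pi = F.log_softmax(log_pi.reshape(shape), dim=-1)
        return log_pi, mu.reshape(shape), log_sigma.reshape(shape), hidden

def mdn_nll(log_pi, mu, log_sigma, z_next):
    """Negative log-likelihood of the observed z_{t+1} under the predicted mixture."""
    z = z_next.unsqueeze(-1)  # broadcast against the n_mix components
    log_prob = -0.5 * ((z - mu) / log_sigma.exp()) ** 2 \
               - log_sigma - 0.5 * math.log(2 * math.pi)
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```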
Note that the VAE encodes each frame as a single Gaussian (this stochasticity makes the controller more robust to the sampled "next state" outputs of M), whereas M outputs the next state as a mixture of Gaussians. Indeed, an image has essentially one visual encoding, yet it can have multiple different future scenarios, which are captured by the multimodal output of M.
**Training the policy**
![](https://i.imgur.com/H5vpb2H.png)
* In the real env:
The agent is provided with the visual features $z_t$ and M's hidden state $h_t$ (temporal features).
* In the world model:
To prevent the agent from exploiting this imperfect simulator, they increase the dynamics' stochasticity by tuning $\tau$, the temperature used when sampling $z_{t+1}$ from M (see the sampling sketch below).
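Here is a minimal numpy sketch of one common way to apply a temperature $\tau$ when sampling $z_{t+1}$ from the mixture: divide the mixture logits by $\tau$ and scale the Gaussian noise by $\sqrt{\tau}$. The exact scheme used in the authors' released code may differ in details.

```python
import numpy as np

def sample_z_next(log_pi, mu, log_sigma, tau=1.0, rng=None):
    """Sample z_{t+1} from an MDN output, one latent dimension at a time.
    log_pi, mu, log_sigma: arrays of shape (z_dim, n_mix).
    A higher tau flattens the mixture weights and widens the Gaussians,
    making the dreamed environment more stochastic and harder to exploit."""
    rng = np.random.default_rng() if rng is None else rng
    z_dim, n_mix = mu.shape
    scaled = log_pi / tau                                   # temperature on the logits
    pi = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)
    z_next = np.empty(z_dim)
    for d in range(z_dim):
        k = rng.choice(n_mix, p=pi[d])                      # pick one mixture component
        noise = np.sqrt(tau) * rng.standard_normal()        # wider noise for larger tau
        z_next[d] = mu[d, k] + np.exp(log_sigma[d, k]) * noise
    return z_next
```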
## Limitations
If exploration matters in the environment, the initial random policy might fail to collect data in all the relevant parts of the environment, and an iterative version of Algorithm 1 might be required for data collection (see https://worldmodels.github.io/ for a discussion of the different iterative methods).
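A hedged sketch of what such an iterative loop could look like; `collect_rollouts`, `train_vae`, `train_mdn_rnn` and `evolve_controller` are hypothetical helpers standing in for the steps of Algorithm 1, and the episode counts are arbitrary:

```python
def iterative_world_model_training(collect_rollouts, train_vae, train_mdn_rnn,
                                   evolve_controller, random_policy, n_iterations=5):
    """Hypothetical iterative variant of Algorithm 1: alternate between
    collecting data with the current controller and refitting V, M and C."""
    dataset = collect_rollouts(policy=random_policy, n_episodes=10_000)
    for _ in range(n_iterations):
        vae = train_vae(dataset)                        # V: frames -> z
        mdn_rnn = train_mdn_rnn(vae, dataset)           # M: (z_t, a_t) -> p(z_{t+1})
        controller = evolve_controller(vae, mdn_rnn)    # C: [z_t, h_t] -> a_t (CMA-ES)
        # Use the improved controller (plus some exploration noise) to reach
        # parts of the environment the initial random policy never visited.
        dataset += collect_rollouts(policy=controller, n_episodes=1_000)
    return vae, mdn_rnn, controller
```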
Because V is trained independently of M, it might fail to encode all the information relevant to the task. Another option would be to train V and M concurrently, so that M's reward and $z_{t+1}$ prediction losses (or a next-state reconstruction loss) flow through V (which would also keep its own reconstruction loss). The trade-off is that V then becomes tuned to a particular reward and cannot be reused.
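A sketch of what such a joint objective could look like (this is not what the paper does; it reuses `MDNRNN` and `mdn_nll` from the sketch above, and assumes a hypothetical `vae` module exposing `encode`/`decode` plus a small `reward_head` regressor):

```python
import torch
import torch.nn.functional as F

def joint_loss(vae, mdn_rnn, reward_head, frames, actions, rewards):
    """Joint V+M objective so that dynamics and reward losses backpropagate into V.
    frames: (B, T+1, C, H, W), actions: (B, T, A), rewards: (B, T)."""
    mu, log_var = vae.encode(frames)                        # hypothetical API
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # reparameterization
    recon_loss = F.mse_loss(vae.decode(z), frames)
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()

    # M predicts z_{t+1} from (z_t, a_t); its loss now shapes V's encoder.
    log_pi, mdn_mu, log_sigma, _ = mdn_rnn(z[:, :-1], actions)
    dynamics_loss = mdn_nll(log_pi, mdn_mu, log_sigma, z[:, 1:])

    # Reward prediction also shapes the latent space, making V task-specific.
    pred_r = reward_head(torch.cat([z[:, :-1], actions], dim=-1)).squeeze(-1)
    reward_loss = F.mse_loss(pred_r, rewards)

    return recon_loss + kl + dynamics_loss + reward_loss
```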
The authors argue that since $h_t$ is trained to predict $z_{t+1}$, it contains enough insight about the future that the agent does not need to *plan ahead* and can act reflexively based on $h_t$. This is interesting, but the considered tasks (driving, dodging fireballs) are still quite reflexive and do not require much planning.
## Results
When trained in the true environment, a simple controller on top of the V and M representations achieves state of the art on CarRacing; V + M is better than V alone.
When trained inside the world model, the dynamics' stochasticity must be tuned for the policy to transfer and perform well in the real environment: with too little stochasticity, the agent overfits to the world model's flaws and does not transfer to the real environment; with too much, the agent becomes risk-averse, robust but suboptimal.
![](https://i.imgur.com/e8ETSjQ.png)
## Additional resources
Thorough interactive blog post with additional experiments and discussions: https://worldmodels.github.io/