Benchmarking Model-Based Reinforcement Learning

Wang, Tingwu and Bao, Xuchan and Clavera, Ignasi and Hoang, Jerrick and Wen, Yeming and Langlois, Eric and Zhang, Shunshi and Zhang, Guodong and Abbeel, Pieter and Ba, Jimmy

arXiv e-Print archive - 2019 via Local Bibsonomy

Keywords: dblp

This is not a detailed summary, just general notes:

The authors make an excellent and extensive comparison of model-free and model-based methods in 18 environments. In general, they compare 3 classes of Model-Based Reinforcement Learning (MBRL) algorithms, using as the comparison metric the total return in the environment after 200K steps (reporting the mean and std over windows of 5000 steps throughout the whole training, averaged across 4 seeds for each algorithm). The MBRL classes they compare are:

- **Dyna style:** use a policy to gather data, train a transition function model on this data (i.e. the dynamics function / "world model"), and use data predicted by the model (i.e. "imaginary" data) to train the policy.
- **Policy Search with Backpropagation through Time (BPTT):** starting at some state $s_0$, the policy rolls out an episode using the model. Then, given the trajectory and its sum of rewards (or any other objective function to maximize), one can differentiate this expression with respect to the policy's parameters $\theta$ to obtain the gradient. The training process iterates between collecting data using the current policy and improving the policy by computing the BPTT gradient ... Some versions include dynamic programming approaches where the ground-truth dynamics need to be known.
- **Model Predictive Control (MPC) / Shooting methods:** there is in general no explicit policy to choose actions. Instead, the action sequence is chosen by starting with a set of candidate action sequences $a_{t:t+\tau}$, propagating these action sequences through the dynamics model, and then choosing the sequence that achieved the highest return throughout the propagated episode. The agent then applies only the first action from the optimal sequence and re-plans at every time-step.

They also compare these to Model-Free (MF) methods such as SAC and TD3.
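The shooting procedure described above can be sketched in a few lines. This is only a minimal illustration of random shooting, not the implementation benchmarked in the paper: the function names (`random_shooting_mpc`, `toy_dynamics`, `toy_reward`), the uniform sampling of candidates in $[-1, 1]$, and the 1-D toy system are all assumptions made for the example, and the "learned" model is replaced by a known toy function.

```python
import numpy as np

def random_shooting_mpc(state, dynamics_fn, reward_fn, action_dim,
                        horizon=10, num_candidates=500, rng=None):
    """Return the first action of the best sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences a_{t:t+tau}; uniform sampling is an assumption.
    candidates = rng.uniform(-1.0, 1.0,
                             size=(num_candidates, horizon, action_dim))
    # Every candidate starts from the same current state.
    states = np.repeat(np.asarray(state, dtype=float)[None, :],
                       num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        # Propagate all candidates one step through the (learned) model.
        states = dynamics_fn(states, candidates[:, t])
        returns += reward_fn(states, candidates[:, t])
    best = int(np.argmax(returns))
    # Only the first action is executed; the rest is discarded on re-planning.
    return candidates[best, 0]

# Hypothetical toy problem: drive a 1-D point towards the origin.
def toy_dynamics(states, actions):
    return states + 0.1 * actions

def toy_reward(states, actions):
    return -np.sum(states ** 2, axis=1)

rng = np.random.default_rng(0)
state = np.array([1.0])
for _ in range(30):
    # Re-plan from the newly observed state at every time-step.
    action = random_shooting_mpc(state, toy_dynamics, toy_reward,
                                 action_dim=1, rng=rng)
    state = toy_dynamics(state[None, :], action[None, :])[0]
```

Because the plan is recomputed from the observed state at every step, errors in earlier predictions are not carried forward, which is one way to read the robustness to noise noted in the conclusions.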
**Brief conclusions which I noticed from MB and MF comparisons:** (note that $>$ means "better than")

- **MF:** SAC & TD3 $>$ PPO & TRPO
- **Performance:** MPC (shooting, robust performance except for complex envs) $>$ Dyna (bad for long horizons H) $>$ BPTT (SVG very good for complex envs)
- **State and Action Noise:** MPC (shooting, re-planning compensates for noise) $>$ Dyna (lots of model prediction errors … although meta-learning actually benefits from noise due to lack of exploration)
- **MB dynamics error accumulation:** MB performance plateaus, more data $\neq$ better performance $\rightarrow$ 1. prediction error accumulates through time; 2. as policy and model improvement are closely linked, we can fall into local minima early
- **Early Termination (ET):** including ET always negatively affects MB methods. Different ways of incorporating ET into the planning horizon (see Appendix G for details) work better for some environments but worse for more complex envs. So, tbh, there's no conclusion to be made about early termination schemes (same as for the entire paper :D, but it's not the authors' fault, it's just the (sad) direction in which most RL / DL research is moving).

**Some stuff which seems counterintuitive:**

- Why can't we see a significant sample-efficiency advantage of MB over MF?
- Why does PILCO suck at almost everything? The original authors' implementation seems to excel at several tasks.
- When using ensembles (e.g. PETS), why is predicting the next state as the expectation over the ensemble (PE-E: $\boldsymbol{s}_{t+1}=\mathbb{E}\left[\tilde{f}_{\boldsymbol{\theta}}\left(\boldsymbol{s}_{t}, \boldsymbol{a}_{t}\right)\right]$) better than, or at least highly comparable to, propagating particles through the ensemble of models (PE-TS: given $P$ initial states $\boldsymbol{s}_{t=0}^{p}=\boldsymbol{s}_{0} \; \forall p$, propagate each state / particle $p$ using *one* model $b(p)$ from the entire ensemble ($B$), such that $\boldsymbol{s}_{t+1}^{p} \sim \tilde{f}_{\boldsymbol{\theta}_{b(p)}}\left(\boldsymbol{s}_{t}^{p}, \boldsymbol{a}_{t}\right)$), which should in theory better capture the uncertainty / multimodality of the state space?
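To make the difference between the two propagation schemes in the last question concrete, here is a minimal sketch. It is not the PETS code: the helper names (`propagate_expectation`, `propagate_trajectory_sampling`) and the toy linear ensemble are hypothetical, and the ensemble members are deterministic functions here, whereas the $\sim$ in the formula above indicates that each model's prediction is sampled.

```python
import numpy as np

def propagate_expectation(s0, actions, models):
    """PE-E: the next state is the mean prediction over the ensemble."""
    s = np.asarray(s0, dtype=float)
    traj = [s]
    for a in actions:
        s = np.mean([f(s, a) for f in models], axis=0)
        traj.append(s)
    return np.stack(traj)  # shape: (horizon + 1, state_dim)

def propagate_trajectory_sampling(s0, actions, models, num_particles, rng):
    """PE-TS: particle p is propagated by one fixed model b(p)."""
    s0 = np.asarray(s0, dtype=float)
    # Assign one ensemble member b(p) to each particle for the whole rollout.
    b = rng.integers(len(models), size=num_particles)
    particles = np.repeat(s0[None, :], num_particles, axis=0)
    traj = [particles]
    for a in actions:
        particles = np.stack([models[b[p]](particles[p], a)
                              for p in range(num_particles)])
        traj.append(particles)
    return np.stack(traj)  # shape: (horizon + 1, num_particles, state_dim)

# Hypothetical ensemble: three linear models that disagree on the gain.
gains = (0.9, 1.0, 1.1)
models = [lambda s, a, k=k: k * s + a for k in gains]

actions = np.zeros((5, 1))  # five zero actions, action_dim = 1
rng = np.random.default_rng(0)
pe_e = propagate_expectation([1.0], actions, models)
pe_ts = propagate_trajectory_sampling([1.0], actions, models,
                                      num_particles=20, rng=rng)
```

In this toy case PE-E collapses the ensemble to a single trajectory (the mean gain is 1, so the state stays near 1), while each PE-TS particle follows its own model and the particles spread across the outcomes of the individual members. That spread is the uncertainty information the question expects to matter.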
