[link]
Summary by Robert Müller 1 year ago
The typical model based reinforcement learning (RL) loop consists of collecting data, training a model of the environment, using the model to do model predictive control (MPC). If however the model is wrong, for example for state-action pairs that have been barely visited, the dynamics model might be very wrong and the MPC fails as the imagined model and the reality align to longer. Boney et a. propose to tackle this with a denoising autoencoder for trajectory regularization according to the familiarity of a trajectory.
MPC uses at each time t the learned model $s_{t+1} = f_{\theta}(s_t, a_t)$ to select a plan of actions, that is maximizing the sum of expected future reward:
$
G(a_t, \dots, a_{t+h}) = \mathbb{E}[\sum_{k=t}^{t+H}r(o_t, a_t)] ,$
where $r(o_t, a_t)$ is the observation and action dependent reward. The plan obtained by trajectory optimization is subsequently unrolled for H steps.
Boney et al. propose to regularize trajectories by the familiarity of the visited states leading to the regularized objective:
$G_{reg} = G + \alpha \log p(o_k, a_k, \dots, o_{t+H}, a_{t+H})
$
Instead of regularizing over the whole trajectory they propose to regularize over marginal probabilities of windows of length w:
$G_{reg} = G + \alpha \sum_{k = t}^{t+H-w} \log p(x_k), \text{ where } x_k = (o_k, a_k, \dots, o_{t+w}, a_{t+w}).$
Instead of explicitly learning a generative model of the familiarity $p(x_k)$ a denoising auto-encoder is used that approximates instead the derivative of the log probability density $\frac{\delta}{\delta x} \log p(x)$. This allows the following back-propagation rule:
$ \frac{\delta G_{reg}}{\delta_i} = \frac{\delta G}{\delta a_i} + \alpha \sum_{k = i}^{i+w} \frac{\delta x_k}{\delta a_i} \frac{\delta}{\delta x} \log p(x).$
The experiments show that the proposed method has competitive sample-efficiency. For example on Halfcheetah the asymptotic performance of PETS is not matched.
This is due to the biggest limitation of this approach, the hindering of exploration. Penalizing the unfamiliarity of states is in contrast to approaches like optimism in the face of uncertainty, which is a core principle of exploration. Aiming to avoid states of high unfamiliarity, the proposed method is the precise opposite of curiosity driven exploration. The appendix proposes preliminary experiments to account for exploration. I would expect, that the pure penalization of unfamiliarity works best in a batch RL setting, which would be an interesting extension of this work.

more
less