Learning Continuous Control Policies by Stochastic Value Gradients
Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa and Tom Erez
arXiv e-Print archive - 2015 via Local arXiv
Keywords:
cs.LG, cs.NE
First published: 2015/10/30
Abstract: We present a unified framework for learning continuous control policies using
backpropagation. It supports stochastic control by treating stochasticity in
the Bellman equation as a deterministic function of exogenous noise. The
product is a spectrum of general policy gradient algorithms that range from
model-free methods with value functions to model-based methods without value
functions. We use learned models but only require observations from the
environment instead of observations from model-predicted trajectories,
minimizing the impact of compounded model errors. We apply these algorithms
first to a toy stochastic control problem and then to several physics-based
control problems in simulation. One of these variants, SVG(1), shows the
effectiveness of learning models, value functions, and policies simultaneously
in continuous domains.
This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and deal with stochastic environment models.
Value gradients are a class of policy gradient algorithms that represent the value function either by:
* A learned Q-function (a critic), or
* Chaining together a policy, an environment model and a reward function into a recursive function that simulates the trajectory, and hence the total return, from a given state.
By backpropagating through these functions, value gradient methods can calculate a policy gradient. This use of backpropagation sets them apart from other policy gradient methods (such as REINFORCE) which are model-free and estimate returns by sampling from the real environment.
Applying value gradients to stochastic problems requires differentiating the stochastic Bellman equation:
\begin{equation}
V^t(s) = \int \left[ r^t + \gamma \int V^{t+1}(s') \, p(s' | s, a) \, ds' \right] p(a | s; \theta) \, da
\end{equation}
To do this, the authors use the re-parameterisation trick to express the stochastic Bellman equation as a deterministic function that takes a noise variable as an input. To differentiate a re-parameterised function, one simply samples the noise variable and then computes the derivative as if the function were deterministic. Repeating this $M$ times and averaging gives a Monte Carlo estimate of the derivative of the stochastic function.
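As a concrete illustration of the trick (a minimal toy sketch of my own, not code from the paper): take a Gaussian policy whose mean and log-standard deviation are the parameters $\theta$, and a simple quadratic reward. The gradient of the expected reward can then be estimated by sampling the noise once and backpropagating through the resulting deterministic expression, e.g. with JAX:

```python
import jax
import jax.numpy as jnp

# Hypothetical toy example (not from the paper): a Gaussian "policy"
# a = mu + exp(log_sigma) * eta with eta ~ N(0, 1), and a quadratic reward.
def reward(s, a):
    return -(a - s) ** 2

def expected_reward(theta, s, etas):
    mu, log_sigma = theta
    # Re-parameterised action: deterministic in theta given the sampled noise.
    actions = mu + jnp.exp(log_sigma) * etas
    return jnp.mean(reward(s, actions))

theta = (jnp.array(0.5), jnp.array(0.0))                 # (mu, log_sigma)
s = jnp.array(1.0)
etas = jax.random.normal(jax.random.PRNGKey(0), (128,))  # M = 128 noise samples

# Sample the noise, then differentiate as if the function were deterministic.
grad_mu, grad_log_sigma = jax.grad(expected_reward)(theta, s, etas)
```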
The re-parameterised Bellman equation is:
$ V (s) = \mathbb{E}_{ \rho(\eta) } \left[ r(s, \pi(s, \eta; \theta)) + \gamma \mathbb{E}_{\rho(\xi) } \left[ V' (f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $
Its derivatives with respect to the current state and the policy parameters are:
$ V_s = \mathbb{E}_{\rho(\eta)} \left[ r_s + r_a \pi_s + \gamma \mathbb{E}_{\rho(\xi)} V'_{s'} \left( f_s + f_a \pi_s \right) \right] $
$ V_\theta = \mathbb{E}_{\rho(\eta)} \left[ r_a \pi_\theta + \gamma \mathbb{E}_{\rho(\xi)} \left[ V'_{s'} f_a \pi_\theta + V'_\theta \right] \right] $
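In practice these derivatives can be obtained by automatic differentiation of the one-step backup. Below is a rough sketch (again my own illustration; `policy`, `dynamics`, `reward` and `next_value` are toy stand-ins for the learned or known functions, not the paper's models):

```python
import jax
import jax.numpy as jnp

# Toy stand-ins for the differentiable pieces (assumptions for illustration).
def policy(s, eta, theta):              # a = pi(s, eta; theta)
    return jnp.tanh(theta @ s) + 0.1 * eta

def dynamics(s, a, xi):                 # s' = f(s, a, xi)
    return s + 0.1 * jnp.sum(a) + 0.01 * xi

def reward(s, a):                       # r(s, a)
    return -jnp.sum(s ** 2) - 0.01 * jnp.sum(a ** 2)

def next_value(s_next):                 # V'(s'), e.g. a learned value function
    return -jnp.sum(s_next ** 2)

def bellman_backup(s, theta, eta, xi, gamma=0.99):
    a = policy(s, eta, theta)
    s_next = dynamics(s, a, xi)
    return reward(s, a) + gamma * next_value(s_next)

# Single-sample estimates of V_s and V_theta: draw (eta, xi), then
# backpropagate through the deterministic composition above.
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
s, theta = jnp.array([0.3, -0.2]), jnp.zeros((1, 2))
eta, xi = jax.random.normal(k1, (1,)), jax.random.normal(k2, (2,))

V_s = jax.grad(bellman_backup, argnums=0)(s, theta, eta, xi)
V_theta = jax.grad(bellman_backup, argnums=1)(s, theta, eta, xi)
```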
Based on these relationships, the authors define two algorithms: SVG(∞) and SVG(1).
* SVG(∞) takes the trajectory from an entire episode and, starting at the terminal state, accumulates the gradients $V_s$ and $V_\theta$ using the expressions above to arrive at a policy gradient (see the sketch after this list). SVG(∞) is on-policy and only works with finite-horizon environments.
* SVG(1) trains a value function and uses its gradient as an estimate for $V'_{s'}$ above. SVG(1) also uses importance weighting so as to be off-policy, and it can work with infinite-horizon environments.
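Continuing the stand-in functions from the previous sketch, SVG(∞) amounts to backpropagating through the whole unrolled episode. A rough sketch of that idea follows (my own illustration: the paper evaluates these gradients along states observed in the real environment, with the noise inferred from data, whereas this toy version simply re-unrolls the model):

```python
def trajectory_return(theta, s0, etas, xis, gamma=0.99):
    # Unroll policy, model and reward along the noise sequence and accumulate
    # the discounted return; differentiating this w.r.t. theta (and s0)
    # backpropagates through the entire finite-horizon episode.
    s, total, discount = s0, 0.0, 1.0
    for eta, xi in zip(etas, xis):        # etas, xis: per-step noise samples
        a = policy(s, eta, theta)
        total = total + discount * reward(s, a)
        s = dynamics(s, a, xi)
        discount = discount * gamma
    return total

T = 50                                    # toy episode length
keys = jax.random.split(jax.random.PRNGKey(1), 2 * T)
etas = [jax.random.normal(k, (1,)) for k in keys[:T]]
xis = [jax.random.normal(k, (2,)) for k in keys[T:]]

# Policy gradient for the whole episode (uses policy, dynamics, reward,
# theta and s from the previous sketch).
svg_inf_grad = jax.grad(trajectory_return, argnums=0)(theta, s, etas, xis)
```

SVG(1) instead stops this recursion after a single step and substitutes the gradient of a learned value function for the remainder, which is what makes it compatible with off-policy replay data.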
Both algorithms use an environment model trained from an experience replay database. The paper also introduces SVG(0), which is similar to SVG(1) but model-free: it backpropagates through a learned action-value function (critic) rather than through a model.
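For contrast, here is a minimal sketch of the SVG(0)-style gradient (still reusing the stand-in `policy`, `theta`, `s` and `eta` from the earlier sketch; `critic` is a hypothetical learned Q-function, not the paper's network): the environment model disappears, and the gradient flows only through the critic and the re-parameterised policy.

```python
def critic(s, a, w):                  # stand-in for a learned Q(s, a; w)
    return -jnp.sum((a - w @ s) ** 2)

def svg0_objective(theta, s, eta, w):
    # Model-free: backpropagate through the critic and the re-parameterised
    # policy only; no dynamics model is required.
    a = policy(s, eta, theta)
    return critic(s, a, w)

w = jnp.ones((1, 2))
svg0_grad = jax.grad(svg0_objective)(theta, s, eta, w)
```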
The SVG algorithms were evaluated on several MuJoCo environments, and it was found that:
* SVG(∞) outperformed a BPTT planner on a control problem with a stochastic model, indicating that gradient evaluation along real trajectories is more effective than planning for stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments