[link]
# Introduction

This research examines how the brain's motor system carries out movement tasks by looking at the trajectories of human subjects' arms during (mostly) reaching and pistol-aiming movements. The authors' observation was that even though subjects were able to carry out these movements reliably, the way in which they did so (i.e. the trajectory of their arm) varied considerably between trials.

Previous models of motor coordination suggest that the brain strictly separates motor planning from motor execution. That is to say, the brain decides in advance how to move its limbs in order to carry out a task and then follows that sequence to complete the movement. However, this model struggles to explain why the motor system is still able to complete movements in the presence of unforeseen perturbations.

Instead, the authors propose a theory of motor coordination based on stochastic optimal feedback control. They suggest that motor coordination is implemented as a feedback control loop in which both the motor signals and the sensory feedback are subject to noise and transmission delay. To complete a movement, the motor system derives an optimal feedback control law which iteratively calculates the motor outputs throughout a given task based on the instantaneous state of the system. The 'optimal control law' is defined as the one that maximises task performance while minimising the total effort expended in carrying out the movement.

# Method

The authors ran simulations of simple movement tasks. They used optimal feedback control laws to drive the simulations and then compared the results with measurements taken from human subjects performing the same tasks. The simulations showed the same variability along task-irrelevant dimensions of the joint trajectories as the human subjects in the practical experiments.

# Results

The research concluded that the control algorithm implemented by the motor system can be explained by the principle of minimal intervention, which falls out of the optimal control framework. The principle of minimal intervention dictates that control effort should only be expended on state dimensions that are relevant to completing the task at hand. This minimises the total control effort and avoids degrading task performance by attempting to correct irrelevant errors with noisy control signals.
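The minimal-intervention idea is easy to reproduce in a toy simulation. The sketch below is my own illustration, not the paper's arm model: it assumes two effectors whose *sum* must reach a target while their *difference* is task-irrelevant, with an arbitrary feedback gain and noise level. Because feedback corrects only the task-relevant error, variability accumulates along the task-irrelevant dimension, mirroring the pattern reported for the human data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (not the paper's model): two effectors x1, x2 whose SUM
# must track a target; the difference x1 - x2 is task-irrelevant.
target = 1.0
gain = 0.5          # assumed feedback gain on the task-relevant error only
noise_std = 0.05    # assumed additive motor noise on each effector
n_trials, n_steps = 200, 100

final_states = np.zeros((n_trials, 2))
for trial in range(n_trials):
    x = np.zeros(2)
    for t in range(n_steps):
        task_error = target - x.sum()                 # task-relevant error (the sum)
        # Minimal intervention: both effectors correct the shared task error,
        # but nothing pushes the task-irrelevant difference back towards zero.
        u = gain * task_error * np.array([0.5, 0.5])
        x += u + rng.normal(0.0, noise_std, size=2)   # noisy execution
    final_states[trial] = x

task_dim = final_states.sum(axis=1)                   # relevant: x1 + x2
null_dim = final_states[:, 0] - final_states[:, 1]    # irrelevant: x1 - x2
print(f"variance along task-relevant dimension:   {task_dim.var():.4f}")
print(f"variance along task-irrelevant dimension: {null_dim.var():.4f}")
```

Running this, the task-relevant variance stays small while the task-irrelevant variance grows roughly as a random walk, which is the signature of correcting only the errors that matter.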
[link]
This paper shows how a family of reinforcement learning algorithms known as value gradient methods can be generalised to learn stochastic policies and to handle stochastic environment models. Value gradients are a type of policy gradient algorithm which represent a value function either by:

* a learned Q-function (a critic), or
* linking together a policy, an environment model and a reward function to define a recursive function that simulates the trajectory and the total return from a given state.

By backpropagating through these functions, value gradient methods can calculate a policy gradient. This backpropagation sets them apart from other policy gradient methods (like REINFORCE, for example) which are model-free and sample returns from the real environment.

Applying value gradients to stochastic problems requires differentiating the stochastic Bellman equation:

\begin{equation}
V^t(s) = \int \left[ r^t + \gamma \int V^{t+1}(s') \, p(s' | s, a) \, ds' \right] p(a | s; \theta) \, da
\end{equation}

To do that, the authors use a trick called re-parameterisation to express the stochastic Bellman equation as a deterministic function which takes a noise variable as an input. To differentiate a re-parameterised function, one simply samples the noise variable and then computes the derivative as if the function were deterministic. This can be repeated $M$ times and averaged to arrive at a Monte Carlo estimate of the derivative of the stochastic function (see the first sketch after this summary). The re-parameterised Bellman equation is:

$$ V(s) = \mathbb{E}_{\rho(\eta)} \left[ r(s, \pi(s, \eta; \theta)) + \gamma \, \mathbb{E}_{\rho(\xi)} \left[ V'(f(s, \pi(s, \eta; \theta), \xi)) \right] \right] $$

Its derivatives with respect to the current state and the policy parameters are:

$$ V_\mathbf{s} = \mathbb{E}_{\rho(\eta)} \left[ r_\mathbf{s} + r_\mathbf{a} \pi_\mathbf{s} + \gamma \, \mathbb{E}_{\rho(\xi)} \, V'_{\mathbf{s}'} \left( \mathbf{f}_\mathbf{s} + \mathbf{f}_\mathbf{a} \pi_\mathbf{s} \right) \right] $$

$$ V_\theta = \mathbb{E}_{\rho(\eta)} \left[ r_\mathbf{a} \pi_\theta + \gamma \, \mathbb{E}_{\rho(\xi)} \left[ V'_{\mathbf{s}'} \mathbf{f}_\mathbf{a} \pi_\theta + V'_\theta \right] \right] $$

Based on these relationships, the authors define two algorithms, SVG(∞) and SVG(1):

* SVG(∞) takes the trajectory from an entire episode and, starting at the terminal state, accumulates the gradients $V_\mathbf{s}$ and $V_\theta$ using the expressions above to arrive at a policy gradient (see the second sketch after this summary). SVG(∞) is on-policy and only works with finite-horizon environments.
* SVG(1) trains a value function and then uses its gradient as an estimate for $V_\mathbf{s}$ above. SVG(1) also uses importance weighting so that it can learn off-policy, and it can work with infinite-horizon environments.

Both algorithms use an environment model which is trained from an experience replay database. The paper also introduces SVG(0), which is similar to SVG(1) but model-free.

SVG was analysed using several MuJoCo environments and it was found that:

* SVG(∞) outperformed a BPTT planner on a control problem with a stochastic model, indicating that evaluating gradients along real trajectories is more effective than planning in stochastic environments
* SVG(1) is more robust to inaccurate environment models and value functions than SVG(∞)
* SVG(1) was able to solve several complex environments
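As a minimal sketch of the re-parameterisation trick (my own toy example, not code from the paper), assume a one-step problem with a Gaussian policy $a = \theta + \sigma\eta$, $\eta \sim \mathcal{N}(0, 1)$, and a quadratic reward $r(a) = -(a - a^*)^2$; the values of $\theta$, $\sigma$ and $a^*$ are arbitrary. Sampling the noise and differentiating as if the function were deterministic recovers the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy re-parameterisation example (assumed setup, not the paper's code).
# Stochastic policy: a = pi(s, eta; theta) = theta + sigma * eta,  eta ~ N(0, 1)
# Reward: r(a) = -(a - a_star)**2, so dr/da = -2 * (a - a_star)
theta, sigma, a_star = 0.3, 0.5, 1.0
M = 10_000  # number of noise samples for the Monte Carlo estimate

grads = []
for _ in range(M):
    eta = rng.normal()
    a = theta + sigma * eta          # sample the noise, then treat the function as deterministic
    dr_da = -2.0 * (a - a_star)      # derivative of the reward w.r.t. the action
    da_dtheta = 1.0                  # derivative of the re-parameterised policy w.r.t. theta
    grads.append(dr_da * da_dtheta)  # chain rule, exactly as in the deterministic case

print("Monte Carlo gradient estimate:", np.mean(grads))
print("Analytic gradient            :", -2.0 * (theta - a_star))
```

The Monte Carlo average converges to the analytic value $-2(\theta - a^*)$ as $M$ grows, which is the behaviour the expectations over $\rho(\eta)$ and $\rho(\xi)$ rely on.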
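The backward accumulation used by SVG(∞) can be illustrated on a scalar linear system with a linear policy and quadratic reward, where all the required derivatives ($r_\mathbf{s}$, $r_\mathbf{a}$, $\mathbf{f}_\mathbf{s}$, $\mathbf{f}_\mathbf{a}$, $\pi_\mathbf{s}$, $\pi_\theta$) can be written by hand. This is a sketch under assumed dynamics and constants, not the paper's implementation: the policy noise $\eta$ is dropped for brevity, and in the real algorithm the model derivatives come from a learned network rather than known equations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SVG(inf)-style backward pass (assumed scalar system, not the paper's code).
# Dynamics: s' = A*s + B*a + xi, policy: a = theta*s, reward: r(s, a) = -s**2 - c*a**2.
A, B, c, gamma, theta = 0.9, 0.2, 0.1, 0.95, -0.5
T = 50

# --- Forward pass: roll out one episode and record the trajectory -----------
states, actions = [rng.normal()], []
for t in range(T):
    s = states[-1]
    a = theta * s
    actions.append(a)
    states.append(A * s + B * a + rng.normal(0.0, 0.01))   # process noise xi

# --- Backward pass: accumulate V_s and V_theta along the recorded trajectory ---
v_s_next, v_theta_next = 0.0, 0.0
for t in reversed(range(T)):
    s, a = states[t], actions[t]
    r_s, r_a = -2.0 * s, -2.0 * c * a      # reward derivatives
    f_s, f_a = A, B                        # model derivatives
    pi_s, pi_theta = theta, s              # policy derivatives
    v_theta = r_a * pi_theta + gamma * (v_s_next * f_a * pi_theta + v_theta_next)
    v_s = r_s + r_a * pi_s + gamma * v_s_next * (f_s + f_a * pi_s)
    v_s_next, v_theta_next = v_s, v_theta

print("policy gradient estimate dV/dtheta at the initial state:", v_theta_next)
```

SVG(1) would replace this full-episode recursion with a single backward step, substituting the gradient of a learned value function for the recursively accumulated $V_\mathbf{s}$ term.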