First published: 2018/02/26 (3 years ago) Abstract: In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and critic. Our algorithm takes the minimum value between a pair
of critics to restrict overestimation and delays policy updates to reduce
per-update error. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.
As in Q-learning, modern actor-critic methods suffer from value estimation errors due to high bias and variance. While there are many attempts to address this in Q-learning (such as Double DQN), not much was done in actor-critic methods.
Authors of the paper propose three modifications to DDPG and empirically show that they help address both bias and variance issues:
* 1.) Clipped Double Q-Learning:
Add a second pair of critics $Q_{\theta}$ and $Q_{\theta_\text{target}}$ (so four critics total) and use them to upper-bound the value estimate target update: $y = r + \gamma \min\limits_{i=1,2} Q_{\theta_{target,i}}(s', \pi_{\phi_1}(s'))$
* 2.) Reduce number of policy and target networks updates, and magnitude of target networks updates: $\theta_{target} \leftarrow \tau\theta + (1-\tau)\theta_{target}$
* 3.) Inject (clipped) random noise to the target policy: $\hat{a} \leftarrow \pi_{\phi_{target}}(s) + \text{clip}(N(0,\sigma), -c, c)$
Implementing these results, authors show significant improvements on seven continuous control tasks, beating not only reference DDPG algorithm, but also PPO, TRPO and ACKTR.
Full algorithm from the paper:
https://i.imgur.com/rRjwDyT.png
Source code: https://github.com/sfujim/TD3