First published: 2018/02/26 (4 years ago) Abstract: In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and critic. Our algorithm takes the minimum value between a pair
of critics to restrict overestimation and delays policy updates to reduce
per-update error. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.