RUDDER: Return Decomposition for Delayed Rewards
Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Sepp Hochreiter
arXiv e-Print archive, 2018
Keywords:
cs.LG, cs.AI, math.OC, stat.ML
First published: 2018/06/20
Abstract: We propose a novel reinforcement learning approach for finite Markov decision
processes (MDPs) with delayed rewards. In this work, biases of temporal
difference (TD) estimates are proved to be corrected only exponentially slowly
in the number of delay steps. Furthermore, variances of Monte Carlo (MC)
estimates are proved to increase the variance of other estimates, the number of
which can exponentially grow in the number of delay steps. We introduce RUDDER,
a return decomposition method, which creates a new MDP with the same optimal
policies as the original MDP but with redistributed rewards that have largely
reduced delays. If the return decomposition is optimal, then the new MDP does
not have delayed rewards and TD estimates are unbiased. In this case, the
rewards track Q-values so that the future expected reward is always zero. We
experimentally confirm our theoretical results on bias and variance of TD and
MC estimates. On artificial tasks with different lengths of reward delays, we
show that RUDDER is exponentially faster than TD, MC, and MC Tree Search
(MCTS). RUDDER outperforms Rainbow, A3C, DDQN, Distributional DQN, Dueling
DDQN, Noisy DQN, and Prioritized DDQN on the delayed-reward Atari game Venture
in only a fraction of the learning time. RUDDER considerably improves the
state-of-the-art on the delayed-reward Atari game Bowling in much less learning
time. Source code is available at https://github.com/ml-jku/baselines-rudder,
with demonstration videos at https://goo.gl/EQerZV.
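To make the return-decomposition idea concrete, here is a minimal sketch of reward redistribution on a toy delayed-reward task. It uses a simple least-squares return predictor in place of the LSTM contribution analysis used in the paper, and redistributes the delayed episodic reward as differences of consecutive return predictions; the toy environment, variable names, and predictor are illustrative assumptions, not the authors' code.

```python
# Sketch of RUDDER-style reward redistribution on a toy task (assumption:
# a simple least-squares return predictor stands in for the paper's LSTM).
import numpy as np

rng = np.random.default_rng(0)

# Toy episodes: at each step the agent either collects a key (1) or not (0).
# The only reward arrives at the END of the episode and equals the number of
# keys collected, i.e., a maximally delayed reward.
T, n_episodes = 20, 500
actions = rng.integers(0, 2, size=(n_episodes, T))   # binary step features
returns = actions.sum(axis=1).astype(float)          # episodic return

# Return predictor g: predicts the final return from the episode prefix,
# here summarized by the running feature sum and fitted by least squares.
prefix_sums = actions.cumsum(axis=1)                  # shape (n_episodes, T)
X = prefix_sums.reshape(-1, 1).astype(float)
y = np.repeat(returns, T)                             # return repeated per step
w = np.linalg.lstsq(X, y, rcond=None)[0][0]           # fitted slope

def g(prefix_sum):
    """Predicted episode return given the state prefix (its feature sum)."""
    return w * prefix_sum

# Redistribute the delayed reward as differences of consecutive predictions:
# r_tilde[t] = g(s_{0:t}) - g(s_{0:t-1}); the sum telescopes to the final
# prediction, which approximates the true episodic return.
episode = actions[0]
preds = g(episode.cumsum())
r_tilde = np.diff(np.concatenate(([0.0], preds)))

print("original rewards :", [0.0] * (T - 1) + [returns[0]])
print("redistributed    :", np.round(r_tilde, 2))
print("sum of redistributed rewards:", round(float(r_tilde.sum()), 2),
      "vs. true return:", returns[0])
```

In the paper itself, the return predictor is an LSTM trained on full state-action sequences, and the per-step contributions are obtained by contribution analysis; the redistributed rewards sum to the predicted return, so the optimal policies of the new MDP match those of the original one while the delay is largely removed.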