First published: 2018/06/20 (2 years ago) Abstract: We propose a novel reinforcement learning approach for finite Markov decision
processes (MDPs) with delayed rewards. In this work, biases of temporal
difference (TD) estimates are proved to be corrected only exponentially slowly
in the number of delay steps. Furthermore, variances of Monte Carlo (MC)
estimates are proved to increase the variance of other estimates, the number of
which can exponentially grow in the number of delay steps. We introduce RUDDER,
a return decomposition method, which creates a new MDP with same optimal
policies as the original MDP but with redistributed rewards that have largely
reduced delays. If the return decomposition is optimal, then the new MDP does
not have delayed rewards and TD estimates are unbiased. In this case, the
rewards track Q-values so that the future expected reward is always zero. We
experimentally confirm our theoretical results on bias and variance of TD and
MC estimates. On artificial tasks with different lengths of reward delays, we
show that RUDDER is exponentially faster than TD, MC, and MC Tree Search
(MCTS). RUDDER outperforms rainbow, A3C, DDQN, Distributional DQN, Dueling
DDQN, Noisy DQN, and Prioritized DDQN on the delayed reward Atari game Venture
in only a fraction of the learning time. RUDDER considerably improves the
state-of-the-art on the delayed reward Atari game Bowling in much less learning
time. Source code is available at https://github.com/ml-jku/baselines-rudder,
with demonstration videos at https://goo.gl/EQerZV.
[Summary by author /u/SirJAM_armedi](https://www.reddit.com/r/MachineLearning/comments/8sq0jy/rudder_reinforcement_learning_algorithm_that_is/e11swv8/).
Math aside, the "big idea" of RUDDER is the following: We use an LSTM to predict the return of an episode. To do this, the LSTM will have to recognize what actually causes the reward (e.g. "shooting the gun in the right direction causes the reward, even if we get the reward only once the bullet hits the enemy after travelling along the screen"). We then use a salience method (e.g. LRP or integrated gradients) to get that information out of the LSTM, and redistribute the reward accordingly (i.e., we then give reward already once the gun is shot in the right direction). Once the reward is redistributed this way, solving/learning the actual Reinforcement Learning problem is much, much easier and as we prove in the paper, the optimal policy does not change with this redistribution.