The authors propose a unified setting for evaluating batch reinforcement learning algorithms. The proposed benchmark is discrete and based on the popular Atari domain. The authors review and benchmark several current batch RL algorithms against a newly introduced version of BCQ (Batch-Constrained Deep Q-Learning) adapted to discrete action spaces.

https://i.imgur.com/zrCZ173.png

Note in line 5 that the policy chooses actions with a restricted argmax operation, eliminating actions that do not have enough support in the batch. One of the key difficulties in batch RL is the divergence of value estimates. The authors therefore use Double DQN: actions are selected with the value network $Q_{\theta}$ and the policy evaluation is done with a target network $Q_{\theta'}$ (line 6); see the first sketch at the end of this summary.

**How is the batch created?** A partially trained DQN agent (trained online for 10 million steps, i.e. 40 million frames) is used as the behavioral policy to collect a batch $B$ of 10 million transitions. The behavioral agent acts with $\epsilon = 0.2$ with probability 0.8 and with $\epsilon = 0.001$ with probability 0.2 (see the second sketch below). The batch RL agents are trained on this batch for 10 million steps and evaluated every 50k time steps over 10 episodes. This process of batch creation differs from the settings used in other papers in i) using only a single behavioral policy, ii) the batch size and iii) the proficiency level of the batch policy.

The experiments, performed on the Arcade Learning Environment, include DQN, REM, QR-DQN, KL-Control, BCQ, online DQN and behavioral cloning, and show that:

- among conventional RL algorithms, distributional algorithms (QR-DQN) outperform their plain counterparts (DQN)
- batch RL algorithms perform better than conventional algorithms, with BCQ outperforming every other algorithm in every tested game

In addition to the returns, the authors plot the value estimates of the Q-networks. A drop in performance corresponds in all cases to a divergence (upwards or downwards) of the value estimates.

The paper is an important contribution to the debate about the right setting for evaluating batch RL algorithms. It remains to be seen, however, whether the proposed choice of i) a single behavior policy, ii) the batch size and iii) the quality level of the behavior policy will be accepted as a standard. Further work is in any case required to establish a benchmark for continuous domains.
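Below is a minimal PyTorch-style sketch of the restricted argmax (algorithm line 5) and the Double-DQN-style target (algorithm line 6) described above. This is not the authors' code: the network interfaces, the threshold value `tau` and all variable names are assumptions made purely for illustration.

```python
# Sketch of discrete BCQ's target computation, under assumed interfaces:
# q_net / target_q_net map states to (batch, num_actions) Q-values,
# bc_net maps states to (batch, num_actions) behavior-cloning logits.
import torch
import torch.nn.functional as F

def bcq_targets(q_net, target_q_net, bc_net, batch, gamma=0.99, tau=0.3):
    next_states = batch["next_states"]   # (B, state_dim)
    rewards     = batch["rewards"]       # (B,)
    dones       = batch["dones"]         # (B,) in {0, 1}

    with torch.no_grad():
        # Relative likelihood of each action under the behavior policy.
        probs = F.softmax(bc_net(next_states), dim=1)
        mask = probs / probs.max(dim=1, keepdim=True).values > tau

        # Restricted argmax (line 5): actions without enough support in
        # the batch are masked out before the maximization.
        q_next = q_net(next_states)
        q_next[~mask] = -float("inf")
        best_actions = q_next.argmax(dim=1, keepdim=True)

        # Double-DQN-style evaluation (line 6): actions are chosen with
        # Q_theta, but their value comes from the target network Q_theta'.
        next_values = target_q_net(next_states).gather(1, best_actions).squeeze(1)
        targets = rewards + gamma * (1.0 - dones) * next_values
    return targets
```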
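And a small sketch of the $\epsilon$-mixture used by the behavioral policy when filling the batch. Whether the mixture is drawn per step or per episode is not stated in this summary, so the per-step draw below is an assumption.

```python
import random

def behavior_epsilon() -> float:
    # Assumed per-step draw: epsilon = 0.2 with probability 0.8,
    # epsilon = 0.001 otherwise (probability 0.2).
    return 0.2 if random.random() < 0.8 else 0.001

def epsilon_greedy_action(q_values, num_actions: int) -> int:
    # Standard epsilon-greedy step of the partially trained behavioral DQN.
    eps = behavior_epsilon()
    if random.random() < eps:
        return random.randrange(num_actions)
    return int(q_values.argmax())
```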