Discovering Reinforcement Learning Algorithms on ShortScience.org

Discovering Reinforcement Learning Algorithms
Junhyuk Oh and Matteo Hessel and Wojciech M. Czarnecki and Zhongwen Xu and Hado van Hasselt and Satinder Singh and David Silver
arXiv e-Print archive - 2020 via Local arXiv
Keywords: cs.LG, cs.AI
Summary by CodyWild 4 years ago
This work uses meta-learning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values - a scalar and a vector - that are used to update the agent's model. I'm not going to go too deep into meta-learning here, but, at a high level, meta-learning methods optimize the parameters that govern how an agent learns: over many training runs across many environments, those parameters are tuned so that the reward accumulated over the agent's full training lifetime is higher.

To be more concrete, the agent in a given environment learns two things: 

- A policy, that is, a distribution over actions given a state.
- A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agent-model generates and updates. The notion here is that maybe our human-designed construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we meta-learn, we might find something better. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1-of-m prediction.

At each step, after acting in the environment, the agent passes to an LSTM: 

- The reward at the step
- A binary flag for whether the trajectory is done
- The discount factor
- The probability of the action that was taken from state t
- The prediction vector evaluated at state t
- The prediction vector evaluated at state t+1
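
As a rough illustration of the inputs listed above, here is a minimal Python sketch of how one step's features could be packed into a single vector. The names `Step` and `lstm_input` are hypothetical, and feeding the raw prediction vectors in directly is a simplifying assumption rather than a confirmed detail of the paper. The point it shows is that the input length depends only on the prediction-vector size m, not on the environment's action or observation space.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One environment step, as seen by the learned update rule (hypothetical container)."""
    reward: float
    done: bool
    action_prob: float     # probability the policy gave to the action actually taken
    pred_t: List[float]    # prediction vector evaluated at state t
    pred_tp1: List[float]  # prediction vector evaluated at state t+1


def lstm_input(step: Step, discount: float) -> List[float]:
    """Concatenate the six quantities into one flat feature vector of length 4 + 2m."""
    return [
        step.reward,
        float(step.done),
        discount,
        step.action_prob,
        *step.pred_t,
        *step.pred_tp1,
    ]
```

With m = 2, for example, `lstm_input(Step(1.0, False, 0.25, [0.1, 0.9], [0.5, 0.5]), 0.99)` yields a vector of length 8, whatever the environment's action set looks like.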

Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things: 

- A scalar, pi-hat
- A prediction vector, y-hat

These two quantities are used to update the existing policy and prediction model according to the rule below.

https://i.imgur.com/xx1W9SU.png

Conceptually, the scalar governs whether to increase or decrease the probability assigned to the taken action under the policy, and y-hat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input depend on the action or observation space of the environment, so, once it is learned, it can (hopefully) generalize to new environments.
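
To make the roles of pi-hat and y-hat concrete, here is a minimal Python sketch of one update step for a softmax policy over a handful of actions. It is not the paper's exact rule (that is the one in the image above): `apply_update`, the learning rates `lr` and `lr_y`, and the squared-error pull on the prediction vector are all assumptions made for brevity.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_update(
    logits: List[float],
    action: int,
    pred: List[float],
    pi_hat: float,
    y_hat: List[float],
    lr: float = 0.1,    # hypothetical policy step size
    lr_y: float = 0.5,  # hypothetical prediction step size
) -> Tuple[List[float], List[float]]:
    """Apply one step: scale the log-prob gradient by pi_hat, pull pred toward y_hat."""
    probs = softmax(logits)
    # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
    new_logits = [
        l + lr * pi_hat * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    # Assumed squared-error pull: move each component part of the way to its target.
    new_pred = [y + lr_y * (t - y) for y, t in zip(pred, y_hat)]
    return new_logits, new_pred
```

A positive `pi_hat` raises the probability of the taken action and a negative one lowers it, which matches the "increase or decrease" reading above.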

Given this, the basic meta-learning objective falls out fairly easily - optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this meta-learning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector.) These include:

- An entropy bonus, pushing the meta-learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)
- An L2 penalty on both pi-hat and y-hat
- A removal of the softmax that had originally been taken over the m-dimensional categorical prediction vector, and a switch of that target's loss from a KL divergence to a straight mean squared error. As far as I can tell, this makes the prediction vector no longer actually a 1-of-m categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it more sensible to think of it as m separate binaries? I was definitely confused about this in the paper overall.

https://i.imgur.com/EL8R1yd.png
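
As a sketch of how the first two regularizers combine with the base objective, the Python below adds entropy bonuses and subtracts L2 penalties from a lifetime return. The function name `regularized_objective` and the coefficient values in `betas` are hypothetical, and treating the prediction vector as a probability distribution for the entropy term is an assumption.

```python
import math
from typing import Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(
    lifetime_return: float,
    policy_probs: Sequence[float],
    pred_probs: Sequence[float],
    pi_hat: float,
    y_hat: Sequence[float],
    betas: Tuple[float, float, float, float] = (0.01, 0.001, 0.001, 0.001),  # hypothetical
) -> float:
    """Lifetime return plus entropy bonuses minus L2 penalties on pi_hat and y_hat."""
    b0, b1, b2, b3 = betas
    bonus = b0 * entropy(policy_probs) + b1 * entropy(pred_probs)
    penalty = b2 * pi_hat ** 2 + b3 * sum(v * v for v in y_hat)
    return lifetime_return + bonus - penalty
```

Holding everything else fixed, a more uniform policy scores higher under this objective, and larger-magnitude outputs from the LSTM score lower.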

With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to perform comparably to or better than A2C - the human-designed baseline - across the simple grid-worlds it was trained on. However, the two most interesting aspects of the evaluation were:

1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about which states were high value. Beyond that, they found that the meta-learned vector could be used to predict values calculated with discount rates different from the one used in the meta-learned training, which the hand-engineered alternative, TD-lambda, wasn't able to do (it could only predict values well at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate.

2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well - beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to reinforcement learning that it's capturing.