Summary by CodyWild 4 years ago
The thing I think is happening here:
It proposes a self-supervised learning scheme (which... seems fairly basic, but okay) to generate encodings. It then trains a Latent World Model, which takes in the current state encoding, the action, and the belief state (I think just the prior RNN state?) and predicts the next state. The intrinsic reward is the difference between this prediction and the actual encoding of the next step (so it depends on a particular action and the resulting next observation). I don't really know what the belief state is doing here. Is it a scalar rather than an RNN state? It being said to start at 0 makes it seem that way.
Summary: For years, an active area of research has been the question of how to incentivize reinforcement learning agents to explore their environment more effectively. In many environments the state space is large, it's quite difficult to stumble onto reward just by traversing it randomly, and, in the absence of a reward signal, most reinforcement learning algorithms won't learn. To remedy this, a common approach has been to define a measure of novelty, so that you can reward policies for exploring novel states. One way to do this is to tabulate counts of how often you've visited given states, and reward exploration in inverse proportion to those counts. However, this is difficult to scale to image-based and continuous state spaces.
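As a concrete (if toy) illustration of the count-based idea - not something taken from this paper - a commonly used bonus has the form beta / sqrt(N(s)), where N(s) is the visit count. The sketch below assumes a hashable, discretized state key, which is exactly what becomes awkward for image-based or continuous states; the class and parameter names are my own.

```python
from collections import defaultdict
import math

class CountBonus:
    """Toy count-based exploration bonus: rarely visited states get larger reward."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)  # visit count per discretized state
        self.beta = beta

    def bonus(self, state_key):
        # state_key must be hashable, e.g. a tuple of discretized features
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])
```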
Another tactic has been to use uncertainty in an agent's model of the world as an indication that a part of the state space is insufficiently explored. In this framing, exploring a part of the space gives you more samples from it, and if you use those samples to train your forward predictive model - a model predicting the next state - it will become more accurate for that state and states like it. So, in this setting, the "novelty reward" for your agent comes from the prediction error: the agent is incentivized to explore states where its model is most wrong. However, a problem with this, if you do simple pixel-based prediction, is that some sources of uncertainty in an environment are inherent and don't get reduced by drawing more samples from those parts of the space. The canonical example is static on a TV - it's fundamentally noisy and unpredictable, and no amount of gathering data will reduce that noise. Naive ways of incentivizing uncertainty draw you into those parts of the state space again and again, even though visiting them doesn't serve the purpose of getting you to explore interesting, uninvestigated parts of the space.
This paper argues for a similar uncertainty-based metric, but based on prediction of a particular kind of representation of the state, which they argue avoids the pathological property described earlier of getting stuck in regions of high inherent uncertainty. They do this by first learning a self-supervised representation that seems *kind of* like Contrastive Predictive Coding, but slightly different. Their approach simply pushes the Euclidean distance between the representations of nearby timesteps to be smaller, without any explicit negative set to contrast against. To avoid the network learning the degenerate solution of "always predict a constant, so everything is close to everything else", the authors propose a "whitening" (or rescaling, or normalizing) operation before the mean squared error. This involves subtracting out the mean representation and transforming by the inverse square root of the covariance matrix, before taking the mean squared error. This means that, even if you've pushed your representations together in absolute space, after the whitening operation they are "pulled out" again to approximately a spherical unit Gaussian, which stops the network from gaining anything by falling into the collapsed solution.
https://i.imgur.com/Psjlf4t.png
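To make the whitening idea concrete, here's a minimal PyTorch-style sketch of how I understand the representation loss: embed two nearby timesteps, whiten each batch of embeddings (subtract the mean, multiply by the inverse square root of the covariance), and then take the MSE between the whitened pairs. The function names (`whiten`, `representation_loss`) and details like the exact whitening estimator are my assumptions, not taken from the paper's code.

```python
import torch

def whiten(z, eps=1e-5):
    # z: (batch, dim) batch of representations
    z = z - z.mean(dim=0, keepdim=True)            # subtract the batch mean
    cov = (z.T @ z) / (z.shape[0] - 1)             # empirical covariance
    # inverse square root of the covariance via eigendecomposition (ZCA-style whitening)
    eigvals, eigvecs = torch.linalg.eigh(cov + eps * torch.eye(z.shape[1], device=z.device))
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp(min=eps).rsqrt()) @ eigvecs.T
    return z @ inv_sqrt

def representation_loss(encoder, obs_t, obs_tk):
    # pull representations of nearby timesteps together, but only after whitening,
    # so the encoder can't collapse everything to a constant
    z_t = whiten(encoder(obs_t))
    z_tk = whiten(encoder(obs_tk))
    return ((z_t - z_tk) ** 2).mean()
```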
Given this pretrained encoder, the method works by:
- Constructing a recurrent Latent World Model (LWM) that takes in the encoded observation, the current action, and the prior belief state of the recurrent model
- Encoding the current observation with the pretrained encoder
- Passing that representation into the LWM to get a predicted next representation out (prediction happens in representation space, not pixel space)
- Using the error between the actual encoding of the next observation, and the predicted next representation, as the novelty signal
- Training a DQN on top of the encoding (one step of this loop is sketched below)
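To make the data flow concrete, here is how I picture one step of that loop. The module interfaces and the way the intrinsic and extrinsic rewards get combined are my assumptions rather than the paper's exact implementation, and I've assumed a frozen encoder during RL, which (see below) I'm not actually sure about.

```python
import torch

def agent_step(encoder, lwm, dqn, obs, action, next_obs, belief, ext_reward, beta=1.0):
    # encode observations with the (assumed frozen) pretrained encoder
    with torch.no_grad():
        z_t = encoder(obs)
        z_next = encoder(next_obs)

    # recurrent Latent World Model: predict the next representation
    # from (current representation, action, previous belief state)
    z_pred, next_belief = lwm(z_t, action, belief)

    # intrinsic reward = prediction error in representation space, not pixel space
    intrinsic = ((z_pred - z_next) ** 2).mean()

    # the DQN acts on the encoded state; novelty is folded into the reward
    total_reward = ext_reward + beta * intrinsic.item()
    q_values = dqn(z_t)

    return q_values, total_reward, next_belief
```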
Something I'm a little confused by is whether the encoding network is exclusively trained via MSE loss, or whether it also gets gradients from the actual RL DQN task.
Overall, this method makes sense, but I'm not quite sure why the proposed representation structure would be notably better than other, more canonical self-supervised losses, like CPC. They show Action-Conditioned CPC as a baseline when demonstrating the quality of the representations, but not as a slot-in replacement for the MSE representations in their overall architecture. The method does seem to get strong performance on exploration-heavy tasks, but I'll admit I'm not familiar with the quality of the baselines they chose, and so don't have a great sense of whether the - admittedly quite strong! - performance shown in the table below is in fact being compared against the current state of the art.
https://i.imgur.com/cIr2Y4w.png