Weakly-Supervised Reinforcement Learning for Controllable Behavior
arXiv e-Print archive - 2020 via Local Bibsonomy
I tried my best, but I'm really confused by the central methodology of this paper. Here are the things I do understand:
1. The goal of the method is to learn disentangled representations, and specifically representations that correspond to human-selected factors of variation in the environment. To get there, we ask humans whether a given image is higher or lower than a second image on a particular relevant axis, and aggregate those rankings into a vector, where each index of the vector corresponds to one factor. Given a small amount of this supervision, the hope is to learn an encoder that takes in an image and produces a z code that encodes where the image falls on each such axis.
2. With those disentangled representations, the authors hope to learn goal-conditioned policies, where the distance between the current image's representation and the goal image's representation serves as a reward. In particular, they're trying to show that their weakly supervised disentangled representation works better as a metric space for goal-conditioning distance calculations than other learned spaces do.
3. The approach uses a GAN-based design, where a generator produces a pair of images from a pair of codes, z1 and z2, and the discriminator tries to tell the difference between two real images, paired with their supervision vector, and two generated images, paired with their fake supervision vector.
[Here is the relevant equation, along with some notation-explaining text]
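To make point (2) concrete for myself, here is a minimal sketch of what a latent-distance reward could look like. This is my own illustration, not the paper's code: the function name `latent_distance_reward`, the choice of Euclidean distance, and the optional `factor_indices` argument (restricting the distance to a subset of factors) are all assumptions.

```python
import math

def latent_distance_reward(z_current, z_goal, factor_indices=None):
    """Reward = negative Euclidean distance between two latent codes.

    Hypothetical helper: z_current and z_goal stand in for encoder outputs.
    factor_indices (an assumption) optionally restricts the distance to
    the factors we actually care about controlling.
    """
    if factor_indices is None:
        factor_indices = range(len(z_current))
    diffs = [z_current[i] - z_goal[i] for i in factor_indices]
    return -math.sqrt(sum(d * d for d in diffs))

# The reward is highest (zero) when the codes match on the chosen factors.
reward_all = latent_distance_reward([0.0, 0.0], [3.0, 4.0])
reward_first = latent_distance_reward([0.0, 0.0], [3.0, 4.0], factor_indices=[0])
```

The point of the sketch is only that the quality of the reward depends entirely on how meaningful distances in z-space are, which is why the disentanglement step matters so much.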
The thing I'm confused by is the actual mechanism by which (3) gets you disentangled representations. To my understanding, the generator should be trying to generate images whose relationship to one another is governed by the relationship between z1 and z2; if z is really capturing your factors of variation, the two images should differ in the places and ways where those z values differ. Based on this, I'd expect the fake supervision vector here to be some kind of binarized element-wise difference between the two (randomly sampled) vectors, z1 and z2. But the authors claim that the fake supervision vector that the generator is trying to replicate is just the zero vector. That seems like it would just push the generator to produce images that don't differ on any axis, even though it is given two different z vectors as input.
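To pin down the difference between what I expected and what I read the paper as doing, here is a small sketch of the two candidate fake supervision vectors. Everything in it is my own assumption for illustration: the function names, the convention that a 1 means "z1 is higher than z2 on that factor", and the example values.

```python
def binarized_difference(z1, z2):
    """What I expected: 1 where z1 exceeds z2 on a factor, else 0.

    The '>' convention is an assumption; the real labeling rule may differ.
    """
    return [1 if a > b else 0 for a, b in zip(z1, z2)]

def zero_label(z1, z2):
    """What I understood the paper to use: all zeros, regardless of z1 and z2."""
    return [0] * len(z1)

# Two made-up latent codes that differ on the first two factors.
z1 = [0.5, -1.0, 2.0]
z2 = [0.1, 0.3, 2.0]

expected_label = binarized_difference(z1, z2)  # depends on the pair
paper_label = zero_label(z1, z2)               # ignores the pair entirely
```

The first label carries information about how the two codes relate, while the second is constant, which is exactly why I don't see what signal it gives the generator.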