Understanding Visual Concepts with Continuation Learning
William F. Whitney, Michael Chang, Tejas D. Kulkarni, and Joshua B. Tenenbaum
arXiv e-Print archive - 2016 via Local Bibsonomy
In recent years, many generative models have been proposed to learn distributed representations automatically from data. One criticism of these models is that they produce representations that are "entangled": no single component of the representation vector has meaning on its own. This paper proposes a novel neural architecture and an associated learning algorithm for learning disentangled representations. It demonstrates the network learning visual concepts from pairs of frames taken from Atari games and rendered faces.
The proposed architecture uses a gating mechanism to select an index into the hidden elements, so that the "unpredictable" parts of the frame are stored in a single component. The architecture bears some similarity to other "gated" architectures (e.g. relational autoencoders, three-way RBMs) in that it models input-output pairs and encodes transformations. However, these other architectures do not use an explicit mechanism to make the network model "differences"; that explicit mechanism is the novel part. The paper also claims that the objective function is novel: "given the previous frame $x_{t-1}$ of a video and the current frame $x_t$, reconstruct the current frame $x_t$." This is essentially the same objective as relational autoencoders (Memisevic) and is similar to gated and conditional RBMs, which have been used to model pairs of frames. I would therefore recommend de-emphasizing the novelty of the objective.
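To make the gating idea concrete, here is a minimal NumPy sketch of one way such a mechanism could work. It is an illustration under stated assumptions, not the authors' implementation: the linear encoder and decoder, the softmax gate, the squared-error loss, and all names (`gated_code`, `W_enc`, `W_dec`, `gate_logits`) are hypothetical, and the gate logits are fixed by hand rather than learned.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gated_code(h_prev, h_cur, gate_logits):
    # Assumed gating rule: a softmax over hidden components decides which
    # components are taken from the current frame's code; all others are
    # copied from the previous frame's code.
    g = softmax(gate_logits)
    return g * h_cur + (1.0 - g) * h_prev

def reconstruction_loss(x_cur, x_hat):
    # Objective as described in the review: reconstruct the current frame.
    # Mean squared error is an assumption made for this sketch.
    return float(np.mean((x_cur - x_hat) ** 2))

rng = np.random.default_rng(0)
dim_x, dim_h = 16, 4

# Hypothetical untrained linear encoder/decoder weights.
W_enc = rng.normal(size=(dim_h, dim_x)) * 0.1
W_dec = rng.normal(size=(dim_x, dim_h)) * 0.1

x_prev = rng.normal(size=dim_x)  # stands in for frame x_{t-1}
x_cur = rng.normal(size=dim_x)   # stands in for frame x_t

h_prev = np.tanh(W_enc @ x_prev)
h_cur = np.tanh(W_enc @ x_cur)

# Sharply peaked logits: the gate effectively selects component 2.
gate_logits = np.array([0.0, 0.0, 50.0, 0.0])

h_mix = gated_code(h_prev, h_cur, gate_logits)
x_hat = W_dec @ h_mix
loss = reconstruction_loss(x_cur, x_hat)
```

With a sharply peaked gate, only one component of the code can carry information from the current frame, so whatever cannot be predicted from the previous frame has to be packed into that component. In a trained model the gate logits would presumably be produced by the network itself rather than set by hand.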
Significance - This paper opens up many possibilities for explicit mechanisms of "relative" encodings to produce symbolic representations. There isn't much detail in the results (it's an extended abstract!), but I think the work is exciting and I'm looking forward to reading a follow-up paper.
Pros
- Attacks a major problem of current generative models (entanglement)
- Proposes a simple yet novel solution
- Results show visually that the technique seems to work on two non-trivial datasets
Cons
- Experiments are really preliminary - no quantitative results
- Doesn't mention mechanisms like dropout which attempt to prevent co-adaptation of features