[link]
Summary by Abhishek Das 7 years ago
This paper describes using Relation Networks (RN) for reasoning about relations between objects/entities.
RN is a plug-and-play module and although expects object representations as input,
the semantics of what an object is need not be specified, so object representations
can be convolutional layer feature vectors or entity embeddings from text, or something else.
And the feedforward network is free to discover relations between objects (as opposed to being
hand-assigned specific relations).
- At its core, RN has two parts:
- a feedforward network `g` that operates on pairs of object representations,
for all possible pairs, all pairwise computations pooled via element-wise addition
- a feedforward network `f` that operates on pooled features for downstream
task, everything being trained end-to-end
- When dealing with pixels (as in CLEVR experiment), individual object representations are
spatially distinct convolutional layer features (196 512-d object representations for VGG conv5 say).
The other experiment on CLEVR uses explicit factored object state representations with 3D coordinates,
shape, material, color, size.
- For bAbI, object representations are LSTM encodings of supporting sentences.
- For VQA tasks, `g` conditions its processing on question encoding as well, as relations
that are relevant for figuring out the answer would be question-dependent.
## Strengths
- Very simple idea, clearly explained, performs well. Somewhat shocked that it
hasn't been tried before.
## Weaknesses / Notes
Fairly simple idea — let a feedforward network
operate on all pairs of object representations and figure out relations
necessary for downstream task with end-to-end training. And it is fairly general in its design,
relations aren't hand-designed and neither are object representations — for
RGB images, these are spatially distinct convolutional layer features, for text,
these are LSTM encodings of supporting facts, and so on. This module can be dropped
in and combined with more sophisticated networks to improve performance at VQA.
RNs also offer an alternative design choice to prior works on CLEVR, that have
this explicit notion of programs or modules with specialized roles (that need to be pre-defined),
as opposed to letting these relations emerge, reducing dependency on hand-designing
modules and adding in inductive biases from an architectural point-of-view for
the network to reason about relations (earlier end-to-end VQA models didn't have
the capacity to figure out relations).
more
less