This paper proposes a framework where an agent learns to navigate a 2D maze-like
environment (XWORLD) from (templated) natural language commands, in the process
simultaneously learning visual representations, syntax and semantics of language and
performing navigation actions. The task is essentially VQA + navigation; at every step
the agent either gets a question about the environment or navigation command,
and the output is either a navigation action or answer. Key contributions:
- Grounding and recognition are tied together to be two versions of the same problem.
In grounding, given an image feature map and label (word), the problem is to find
regions of the image corresponding to word semantics (attention map); and in
recognition, given an image feature map and attention, the problem is to assign
a word label. And thus word embeddings (for grounding) and softmax layer weights
(for recognition) are tied together. This enables transferring concepts
learnt during recognition to navigation.
- Further, recognition is modulated by question intent. For e.g. given an
attention map that highlights an agent's west, should it be recognized as
'west', 'apple' or 'red' (location, object or attribute)? It depends on what
the question asks. Thus, GRU encoding of question produces an embedding mask
that modulates recognition. The equivalent when grounding is that word embeddings
are passed through fully-connected layers.
- Compositionality in language is exploited by performing grounding and
recognition by sequentially (softly) attending to parts of a sentence and
grounding in image. The resulting attention map is selectively combined
with attention from previous timesteps for final decision.
## Weaknesses / Notes
Although the environment is super simple, it's a neat framework and it is useful
that the target is specified in natural language (unlike prior/concurrent work
e.g. Zhu et al., ICRA17). The model gets to see a top-down centred view of the
entire environment at all times, which is a little weird.