This paper presents an approach to visual question answering that dynamically composes networks of independent neural modules based on a semantic parse of the question. Main contributions:
- Independent neural modules that can be combined together and jointly trained.
    - Attention: Convolutional layer, with different filters for different instances. For example, attend[dog], attend[cat], etc.
    - Re-attention: FC-ReLU-FC-ReLU, with different weights for different instances. For example, re-attend[above], re-attend[not], etc.
    - Combination: Stacks two attention maps, followed by conv-ReLU to map them to a single attention map. For example, combine[and], combine[except], etc.
    - Classification: Combines the attention map with the image, followed by FC-Softmax to map to an answer. For example, classify[color].
    - Measurement: FC-ReLU-FC-Softmax, takes only an attention map as input. For example, measure[exists].
- Structured representations are extracted from questions, and these are then mapped to network layouts, which specify both the modules and the connections between them.
    - All leaves become attend modules, all internal nodes become re-attend or combine modules depending on their arity, and root nodes become measure modules for yes/no questions and classify modules for all other question types.
    - Networks with the same structure but different instantiations can be processed in the same batch. For example, classify[color]\(attend[cat]\) and classify[where]\(attend[truck]\).
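- Predictions from the module network are combined with LSTM representations of the question to get the final answer. The LSTM captures information that the simplified parse discards:
    - Syntactic regularities: 'what is flying?' and 'what are flying?' get mapped to the same module network, although one expects a singular answer and the other a plural one.
    - Semantic regularities: 'green' is an implausible answer for 'what color is the bear?', regardless of the image.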
- Experiments are performed on the synthetic SHAPES dataset and the VQA dataset.
    - The improvement is more pronounced on SHAPES, since that dataset is designed to benefit from compositionality.
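The layout rules listed above (leaves become attend, internal nodes become re-attend or combine by arity, the root becomes measure or classify) can be sketched in a few lines. This is only an illustration of the mapping as summarized in these notes, not the authors' implementation: the `Node` type, the nested-tuple layout format, and the helper names `to_layout`, `structure`, and `render` are all hypothetical, and the modules here are just names with no learned parameters.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node of the structured question representation (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def to_layout(node: Node, yes_no: bool, is_root: bool = True) -> Tuple:
    """Map a parse tree to a nested (module, instance, *inputs) tuple."""
    inputs = [to_layout(c, yes_no, is_root=False) for c in node.children]
    if is_root:
        # Root: measure for yes/no questions, classify for everything else.
        module = "measure" if yes_no else "classify"
    elif not inputs:
        module = "attend"        # leaf
    elif len(inputs) == 1:
        module = "re-attend"     # unary internal node
    elif len(inputs) == 2:
        module = "combine"       # binary internal node
    else:
        raise ValueError(f"unsupported arity {len(inputs)} at {node.label!r}")
    return (module, node.label, *inputs)


def structure(layout: Tuple) -> Tuple:
    """Drop instance labels, keeping only module types and wiring.

    Assumption: layouts with equal structure can share a batch.
    """
    module, _label, *inputs = layout
    return (module, *[structure(i) for i in inputs])


def render(layout: Tuple) -> str:
    """Pretty-print a layout in the module[instance](inputs) notation."""
    module, label, *inputs = layout
    args = ", ".join(render(i) for i in inputs)
    return f"{module}[{label}]" + (f"({args})" if args else "")


cat = to_layout(Node("color", [Node("cat")]), yes_no=False)
truck = to_layout(Node("where", [Node("truck")]), yes_no=False)
shapes = to_layout(
    Node("exists", [Node("and", [Node("red"),
                                 Node("above", [Node("circle")])])]),
    yes_no=True,
)

print(render(cat))     # classify[color](attend[cat])
print(render(shapes))  # measure[exists](combine[and](attend[red], re-attend[above](attend[circle])))
print(structure(cat) == structure(truck))  # True
```

The `structure` helper reflects the batching observation: `classify[color](attend[cat])` and `classify[where](attend[truck])` differ only in instance labels, so they reduce to the same key and could be grouped together.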
## Strengths
- This model takes advantage of the inherently compositional nature of language, which makes a lot of sense. VQA is an extremely complex task, and breaking it up into separate functions/modules is an excellent approach.
## Weaknesses / Notes
- The mapping from syntactic structure to module network layout is hand-designed. Ideally, the model should learn this mapping too, so that it generalizes.
- Due to its compositional nature, this kind of model could possibly be used in a zero-shot learning setting, i.e., to generalize to novel question types that the network hasn't seen before.