A modular neural architecture for visual question answering. A seq2seq component predicts the sequence of neural modules (eg find() and compare()) based on the textual question, which are then dynamically combined and trained end-to-end. Achieves good results on three separate benchmarks that focus on reasoning about the image.
https://i.imgur.com/iOkSh8y.png