Dynamic Memory Networks for Visual and Textual Question Answering
Caiming Xiong, Stephen Merity, and Richard Socher
arXiv e-Print archive, 2016
Keywords:
cs.NE, cs.CL, cs.CV
First published: 2016/03/04
Abstract: Neural network architectures with memory and attention mechanisms exhibit
certain reasoning capabilities required for question answering. One such
architecture, the dynamic memory network (DMN), obtained high accuracy on a
variety of language tasks. However, it was not shown whether the architecture
achieves strong results for question answering when supporting facts are not
marked during training or whether it could be applied to other modalities such
as images. Based on an analysis of the DMN, we propose several improvements to
its memory and input modules. Together with these changes we introduce a novel
input module for images in order to be able to answer visual questions. Our new
DMN+ model improves the state of the art on both the Visual Question Answering
dataset and the bAbI-10k text question-answering dataset without supporting
fact supervision.
The Dynamic Memory Network (DMN) has four modules:
1. **Input module**: Processes the input data about which a question is being asked into a set of vectors termed facts. It consists of a GRU run over the input words.
2. **Question Module**: Represents the question as a single vector (the final hidden state of a GRU run over the words in the question).
3. **Episodic Memory Module**: Retrieves the information required to answer the question from the facts produced by the input module. Consists of two parts:
   1. an attention mechanism
   2. a memory update mechanism
Intuitively: when we first see a question, our memory holds only the question (the initial memory vector equals the question vector). Based on the question and the previous memory, the attention mechanism passes over the input facts and generates a context vector. The memory is then updated from the context vector and the previous memory, and this attend-then-update cycle repeats over several episodes (see the sketch after this list).
4. **Answer Module**: Uses the question vector and the final memory from the episodic memory module to generate the answer (a linear layer with softmax activation for single-word answers, an RNN for longer answers).
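A minimal PyTorch sketch of the episodic memory loop. The interaction features and the ReLU memory update follow the paper's description, but the soft-attention gate, the layer sizes, and all names here are illustrative assumptions (the paper also proposes an attention-based GRU in place of the plain weighted sum):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Soft attention over facts followed by a memory update,
    repeated for a fixed number of episodes (a sketch)."""

    def __init__(self, hidden_size, num_episodes=3):
        super().__init__()
        self.num_episodes = num_episodes
        # Two-layer scorer over fact/question/memory interactions.
        self.gate = nn.Sequential(
            nn.Linear(4 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )
        # Memory update: m_t = ReLU(W [m_{t-1}; c_t; q] + b).
        self.update = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, facts, question):
        # facts: (batch, num_facts, hidden); question: (batch, hidden)
        memory = question                  # initial memory == question
        q = question.unsqueeze(1)          # broadcast over facts
        for _ in range(self.num_episodes):
            m = memory.unsqueeze(1)
            # Interactions between each fact, the question, and memory.
            z = torch.cat([facts * q, facts * m,
                           (facts - q).abs(), (facts - m).abs()], dim=2)
            attn = F.softmax(self.gate(z).squeeze(2), dim=1)
            context = (attn.unsqueeze(2) * facts).sum(dim=1)
            memory = F.relu(self.update(
                torch.cat([memory, context, question], dim=1)))
        return memory
```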
**Improved DMN+**
The original input module used a single GRU to process the data. This has two shortcomings:
1. The GRU only allows a sentence to gather context from the sentences before it, not after it, so information cannot propagate from future sentences. Therefore, bi-directional GRUs are used in DMN+.
2. Supporting sentences may be too far apart at the word level for distant sentences to interact through a word-level GRU. DMN+ therefore uses sentence embeddings rather than word embeddings, with GRUs running over the sentence embeddings so they can interact (the input fusion layer); see the sketch below.
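A hedged sketch of the sentence encoding plus input fusion layer. The positional-encoding weights follow the scheme the paper adopts from end-to-end memory networks; the dimensions and names are assumptions:

```python
import torch
import torch.nn as nn

def positional_encoding(word_embeds):
    # word_embeds: (M words, D dims) for one sentence. Sentence vector
    # f = sum_j l_j * w_j with l[j, d] = (1 - j/M) - (d/D)(1 - 2j/M),
    # using 1-based indices as in the paper.
    M, D = word_embeds.shape
    j = torch.arange(1, M + 1, dtype=torch.float).unsqueeze(1)
    d = torch.arange(1, D + 1, dtype=torch.float).unsqueeze(0)
    l = (1 - j / M) - (d / D) * (1 - 2 * j / M)
    return (l * word_embeds).sum(dim=0)

class InputFusionLayer(nn.Module):
    """Bi-directional GRU over sentence embeddings, letting each fact
    pick up context from both earlier and later sentences (a sketch)."""

    def __init__(self, embed_size, hidden_size):
        super().__init__()
        self.bigru = nn.GRU(embed_size, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, sentence_vectors):
        # sentence_vectors: (batch, num_sentences, embed_size)
        outputs, _ = self.bigru(sentence_vectors)
        h = outputs.size(2) // 2
        # Sum the forward and backward states to form the facts.
        return outputs[:, :, :h] + outputs[:, :, h:]
```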
**For Visual Question Answering**
Split the image into regions and treat the regional feature vectors as the analogue of sentences in the text input module. A linear layer with tanh activation projects the regional vectors (from the image) into the textual feature space (for text-based question answering, sentences are instead embedded with positional encoding). Bi-directional GRUs over the projected regions then form the facts, and the same process as for text-based question answering follows. A sketch:
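The visual input module in the same sketchy PyTorch style. The 512-dimensional region features and the hidden size are assumptions (e.g. a CNN feature grid flattened into regions):

```python
import torch
import torch.nn as nn

class VisualInputModule(nn.Module):
    """Project regional image features into the textual feature space,
    then fuse them with a bidirectional GRU (a sketch)."""

    def __init__(self, region_feat_size=512, hidden_size=80):
        super().__init__()
        # Linear + tanh projection of each region into "sentence" space.
        self.project = nn.Sequential(
            nn.Linear(region_feat_size, hidden_size),
            nn.Tanh(),
        )
        self.bigru = nn.GRU(hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, region_features):
        # region_features: (batch, num_regions, region_feat_size),
        # e.g. a 14x14 feature grid flattened to 196 regions.
        regions = self.project(region_features)
        outputs, _ = self.bigru(regions)
        h = outputs.size(2) // 2
        # Sum directions, as in the text input fusion layer.
        return outputs[:, :, :h] + outputs[:, :, h:]
```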