A Dynamic Memory Network (DMN) has four modules:

1. **Input Module**: Processes the input data about which a question is being asked into a set of vectors termed facts. This module consists of a GRU run over the input words.
2. **Question Module**: Represents the question as a vector (the final hidden state of a GRU run over the words in the question).
3. **Episodic Memory Module**: Retrieves the information required to answer the question from the facts produced by the input module. It consists of two parts: an attention mechanism and a memory update mechanism. To make it more intuitive: when we see a question, we only have the question in memory (i.e. the initial memory vector equals the question vector). Based on the question and the previous memory, we pass over the input facts and generate a contextual vector (the job of the attention mechanism); the memory is then updated from that contextual vector and the previous memory. This is repeated over several passes (a sketch of one such pass is given at the end of this note).
4. **Answer Module**: Uses the question vector and the most recently updated memory from the episodic memory module to generate the answer (a linear layer with softmax activation for single-word answers, an RNN for more complicated answers).

**Improved DMN+**

The original input module used a single GRU to process the data, which has two shortcomings:

1. The GRU only gives a sentence context from the sentences before it, not after it, so information cannot propagate from future sentences. DMN+ therefore uses bi-directional GRUs.
2. Supporting sentences may be too far apart at the word level to interact through a word-level GRU. DMN+ uses sentence embeddings rather than word embeddings, and runs GRUs over the sentence embeddings so they can interact (the input fusion layer).

**For Visual Question Answering**

Split the image into regions and treat them as parallel to the sentences in the text input module. A linear layer with tanh activation projects the regional vectors (from the image) into the textual feature space (for text-based question answering, positional encoding is used to embed the sentences). Bi-directional GRUs are again used to form the facts (see the sketch below), and the rest of the process is the same as for text-based question answering.
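The following is a minimal sketch (assuming PyTorch) of one pass of the episodic memory module and a single-word answer module. It uses simple soft attention over the facts rather than the paper's attention-based GRU, and the interaction features, hidden size, and number of passes (`num_episodes`) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Sketch: attention over facts + memory update, repeated for several episodes."""

    def __init__(self, hidden_dim, num_episodes=3):
        super().__init__()
        self.num_episodes = num_episodes
        # Two-layer scoring network over interaction features of (fact, memory, question).
        self.attn = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        # Memory update from [previous memory; context vector; question].
        self.update = nn.Linear(3 * hidden_dim, hidden_dim)

    def forward(self, facts, question):
        # facts: (batch, num_facts, hidden_dim); question: (batch, hidden_dim)
        memory = question  # the initial memory vector is the question vector
        q = question.unsqueeze(1).expand_as(facts)
        for _ in range(self.num_episodes):
            m = memory.unsqueeze(1).expand_as(facts)
            # Interaction features between each fact, the question, and the current memory.
            z = torch.cat([facts * q, facts * m,
                           torch.abs(facts - q), torch.abs(facts - m)], dim=-1)
            gates = F.softmax(self.attn(z).squeeze(-1), dim=-1)        # (batch, num_facts)
            context = torch.bmm(gates.unsqueeze(1), facts).squeeze(1)  # (batch, hidden_dim)
            # Update memory from the previous memory, the contextual vector, and the question.
            memory = F.relu(self.update(torch.cat([memory, context, question], dim=-1)))
        return memory


class AnswerModule(nn.Module):
    """Sketch: single-word answers via a linear layer over [final memory; question]."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, memory, question):
        # Returns logits over the vocabulary; softmax/cross-entropy applied outside.
        return self.out(torch.cat([memory, question], dim=-1))
```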
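And a similarly hedged sketch of the visual input path: regional image features are projected into the textual feature space with a linear layer and tanh, then a bi-directional GRU (the input fusion layer) lets the regions interact and produces the facts. The region count and feature sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class VisualInputModule(nn.Module):
    """Sketch: project image regions to the textual space, then fuse with a bi-GRU."""

    def __init__(self, region_dim=512, hidden_dim=80):
        super().__init__()
        # Linear layer with tanh projects regional image vectors into the textual feature space.
        self.project = nn.Sequential(nn.Linear(region_dim, hidden_dim), nn.Tanh())
        # Input fusion layer: bi-directional GRU over the region sequence.
        self.fusion = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, regions):
        # regions: (batch, num_regions, region_dim), e.g. a flattened grid of CNN features
        projected = self.project(regions)     # (batch, num_regions, hidden_dim)
        outputs, _ = self.fusion(projected)   # (batch, num_regions, 2 * hidden_dim)
        # Sum the forward and backward directions to get one fact vector per region.
        h = self.fusion.hidden_size
        return outputs[..., :h] + outputs[..., h:]
```

For text, the same fusion layer is applied to sentence embeddings (built with positional encoding) instead of projected image regions.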