Summary by Shagun Sodhani
## Introduction
* Neural Network with a recurrent attention model over a large external memory.
* Continuous form of the Memory Network but with end-to-end training, so it can be applied to more domains.
* Extension of RNNSearch that can perform multiple hops (computational steps) over the memory per output symbol.
* [Link to the paper](http://arxiv.org/pdf/1503.08895v5.pdf).
* [Link to the implementation](https://github.com/facebook/MemNN).
## Approach
* The model takes as input $x_1, ..., x_n$ (to be stored in memory) and a query $q$, and outputs an answer $a$.
### Single Layer
* Each input $x_i$ is embedded into a $D$-dimensional space using embedding matrix $A$ to obtain memory vectors $m_i$.
* Query is also embedded using matrix $B$ to obtain internal state $u$.
* Compute the match between each memory $m_i$ and $u$ in the embedding space, followed by a softmax, to obtain a probability vector $p$ over the inputs: $p_i = \text{softmax}(u^T m_i)$.
* Each $x_i$ maps to an output vector $c_i$ (using embedding matrix $C$).
* Output $o$ = weighted sum of the transformed inputs $c_i$, weighted by $p_i$: $o = \sum_i p_i c_i$.
* The sum of the output vector $o$ and the embedding vector $u$ is passed through weight matrix $W$ followed by a softmax to produce the predicted answer $\hat{a} = \text{softmax}(W(o + u))$ (see the sketch after this list).
* $A$, $B$, $C$ and $W$ are learnt jointly by minimizing the cross-entropy loss between $\hat{a}$ and the true answer $a$.
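A minimal NumPy sketch of one such memory hop, assuming bag-of-words input vectors; the function name, toy sizes, and variable names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def single_hop(x_bows, q_bow, A, B, C, W):
    """One memory hop.
    x_bows : (n, V) bag-of-words vectors for the n input sentences
    q_bow  : (V,)   bag-of-words vector for the query
    A, B, C: (D, V) embedding matrices; W: (V, D) answer matrix
    """
    m = x_bows @ A.T              # memory vectors m_i, shape (n, D)
    c = x_bows @ C.T              # output vectors c_i, shape (n, D)
    u = B @ q_bow                 # internal state u, shape (D,)
    p = softmax(m @ u)            # p_i = softmax(u^T m_i), attention over inputs
    o = p @ c                     # o = sum_i p_i c_i
    return softmax(W @ (o + u))   # a_hat = softmax(W(o + u))

# Toy usage with random weights (no training, just to check shapes).
V, D, n = 50, 20, 6
rng = np.random.default_rng(0)
A, B, C = [rng.normal(0, 0.1, (D, V)) for _ in range(3)]
W = rng.normal(0, 0.1, (V, D))
x = rng.integers(0, 2, (n, V)).astype(float)
q = rng.integers(0, 2, V).astype(float)
a_hat = single_hop(x, q, A, B, C, W)   # (V,) distribution over candidate answers
```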
### Multiple Layers
* For layers above the first, the input is $u^{k+1} = u^k + o^k$.
* Each layer has its own embedding matrices $A^k$ and $C^k$, tied according to one of the constraint schemes below.
* At the final layer, the predicted answer is $\hat{a} = \text{softmax}(W(o^K + u^K))$ (a multi-hop sketch follows this list).
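A sketch of stacking $K$ hops with separate per-layer matrices, reusing the `softmax` helper and shapes from the previous snippet; `A_list` and `C_list` are assumed lists of per-hop embedding matrices.

```python
def multi_hop(x_bows, q_bow, A_list, C_list, B, W):
    """K hops with separate A^k and C^k per layer; u^{k+1} = u^k + o^k."""
    u = B @ q_bow
    for A_k, C_k in zip(A_list, C_list):
        m = x_bows @ A_k.T        # memories for this hop
        c = x_bows @ C_k.T
        p = softmax(m @ u)
        o = p @ c
        u = u + o                 # u^{k+1} = u^k + o^k
    return softmax(W @ u)         # equals softmax(W(o^K + u^K))
```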
### Constraints On Embedding Vectors
* Adjacent
    * Output embedding of one layer is the input embedding of the next, i.e. $A^{k+1} = C^k$.
    * Answer prediction matrix matches the final output embedding: $W^T = C^K$.
    * Question embedding matches the first-layer input embedding: $B = A^1$.
* Layer-wise (RNN-like)
    * Same input and output embeddings across layers, i.e. $A^1 = A^2 = \dots = A^K$ and $C^1 = C^2 = \dots = C^K$.
    * A linear mapping $H$ is added to the update of $u$ between hops:
      $u^{k+1} = Hu^k + o^k$.
    * $H$ is also learnt.
    * Think of this as a traditional RNN with two outputs:
        * Internal output - the attention over memory.
        * External output - the predicted answer.
    * $u$ becomes the hidden state.
    * $p$ is an internal output which, combined with $C$, is used to update the hidden state (see the sketch after this list).
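A sketch of the layer-wise (RNN-like) variant under the same assumed shapes: one $A$ and one $C$ shared across hops, plus the learned linear map $H$ (assumed to be a $D \times D$ matrix).

```python
def multi_hop_layerwise(x_bows, q_bow, A, C, H, B, W, K=3):
    """Layer-wise tying: one A and one C shared across all K hops,
    with a learned linear map H in the state update u^{k+1} = H u^k + o^k."""
    m = x_bows @ A.T          # memories are the same at every hop
    c = x_bows @ C.T
    u = B @ q_bow             # u plays the role of the RNN hidden state
    for _ in range(K):
        p = softmax(m @ u)    # "internal" output: attention over memory
        o = p @ c
        u = H @ u + o         # u^{k+1} = H u^k + o^k
    return softmax(W @ u)     # "external" output: predicted answer
```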
## Related Architectures
* RNN - Memory is stored as the state of the network, which makes it unreliable over long temporal contexts.
* LSTM - Locks in the network state using local memory cells, but still fails over longer temporal contexts.
* Memory Networks - Use a global memory, but require supervision at each layer and are therefore not trainable end-to-end.
* Bidirectional RNN (RNNSearch) - Uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states, but, unlike MemN2N, performs only a single pass over the memory per output symbol.
## Sentence Representation for Question Answering Task
* Bag-of-words representation
    * Input sentences and the question are embedded as a bag of words.
    * Cannot capture the order of the words within a sentence.
* Position Encoding
    * Takes the order of words within a sentence into account by weighting each word embedding according to its position before summing (see the sketch after this list).
* Temporal Encoding
    * Temporal information is encoded by a matrix $T_A$, and the memory vectors are modified as
      $m_i = \sum_j A x_{ij} + T_A(i)$, where $T_A(i)$ is the $i$-th row of $T_A$.
* Random Noise
    * Dummy memories (empty memories) are added at training time to regularize $T_A$.
* Linear Start (LS) training
    * Removes the softmax layers at the start of training and re-inserts them once the validation loss stops decreasing.
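A sketch of the position-encoding weights, assuming the scheme $l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J)$ with 1-based word index $j$ and embedding dimension $k$; the function name and example sizes are illustrative.

```python
def position_encoding(J, d):
    """Weights l_{kj} = (1 - j/J) - (k/d)(1 - 2j/J) over words j = 1..J
    and embedding dimensions k = 1..d; the memory vector then becomes
    m_i = sum_j l_j * (A x_{ij}) (elementwise product per word)."""
    j = np.arange(1, J + 1)[None, :]                 # word positions, shape (1, J)
    k = np.arange(1, d + 1)[:, None]                 # embedding dimensions, shape (d, 1)
    return (1 - j / J) - (k / d) * (1 - 2 * j / J)   # shape (d, J)

# Example: per-position weights for a 7-word sentence embedded in D = 20 dims.
L = position_encoding(J=7, d=20)   # column j weights the j-th word's embedding
```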
## Observations
* Best MemN2N models are close to supervised models in performance.
* Position Encoding improves over the bag-of-words approach.
* Linear Start helps to avoid local minima.
* Random Noise gives a small yet consistent boost in performance.
* More computational hops lead to improved performance.
* For the language modelling task, some hops concentrate on recent words, while other hops attend more broadly over all memory locations.