[link]
TL;DR: The authors propose a recurrent memory-based model that can reason over multiple hops and be trained end to end with standard gradient descent. They evaluate the model on question answering (QA) and language modeling tasks. For QA, the network's inputs are a list of sentences, a query, and (during training) an answer. The network attends to the sentences at each time step, picking out the next piece of information relevant to the question. It outperforms baseline approaches, but does not come close to a strongly supervised approach in which the relevant sentences are pre-selected.

#### Key Takeaways

- Sentence representations: 1. averaged word embeddings (BoW); 2. Positional Encoding (PE).
- Synthetic dataset with a vocabulary size of ~180. Version 1 has 1k training examples, version 2 has 10k training examples.
- The model is similar to Bahdanau's seq2seq attention model, except that it operates on sentences, does not produce an output at every step, and uses a simpler scoring function.

#### Questions / Notes

- The positional encoding formula is neither explained nor intuitive (see the sketch below).
- There are so many hyperparameters and model variations (jittering, linear start) that it is easy to lose track of the essentials.
- There is no intuitive explanation of what the model does. The easiest way for me to understand it was to view it as a variation of Bahdanau's attention model, which is very intuitive. I don't understand the intuition behind the proposed weight constraints.
- The LM results are not convincing. The model beats the baselines by a small margin, but probably only thanks to very time-intensive hyperparameter optimization.
- What are the training complexity and training time?
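For reference, the Position Encoding weighting from the paper can be made concrete with a small sketch. This is my own reading of the formula, not the authors' code; `J` is the number of words in a sentence and `d` the embedding size:

```python
import numpy as np

def position_encoding(J, d):
    """PE weights l_kj = (1 - j/J) - (k/d) * (1 - 2j/J), with 1-indexed
    word position j and embedding dimension k. Each word position gets a
    different per-dimension weight, so word order is no longer thrown away."""
    l = np.zeros((J, d))
    for j in range(1, J + 1):        # word position within the sentence
        for k in range(1, d + 1):    # embedding dimension
            l[j - 1, k - 1] = (1 - j / J) - (k / d) * (1 - 2 * j / J)
    return l

# The PE memory vector is then m_i = sum_j l_j * (A x_ij), i.e. an
# element-wise weighted sum of word embeddings instead of a plain average.
```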
[link]
This paper presents an end-to-end version of memory networks (Weston et al., 2015): the model does not need the intermediate "supporting facts" strong supervision indicating which input sentences are the best memory accesses, making it much more realistic. It also performs multiple hops (computational steps) per output symbol. The tasks are Q&A and language modeling, and the model achieves strong results. The paper is a useful extension of MemNN because it removes the strong, unrealistic supervision requirement and still performs pretty competitively. The architecture is defined cleanly and simply. The related work section is well written, detailing the various similarities and differences with multiple streams of related work. The discussion of the model's connection to RNNs is also useful.
[link]
## Introduction

* Neural network with a recurrent attention model over a large external memory.
* A continuous form of the Memory Network, but trained end to end, so it can be applied to more domains.
* An extension of RNNSearch that can perform multiple hops (computational steps) over the memory per output symbol.
* [Link to the paper](http://arxiv.org/pdf/1503.08895v5.pdf).
* [Link to the implementation](https://github.com/facebook/MemNN).

## Approach

* The model takes as input $x_1, ..., x_n$ (to store in memory) and a query $q$, and outputs an answer $a$.

### Single Layer

* Each input $x_i$ is embedded into a $d$-dimensional space using embedding matrix $A$ to obtain a memory vector $m_i$.
* The query is also embedded, using matrix $B$, to obtain the internal state $u$.
* The match between each memory $m_i$ and $u$ is computed in the embedding space, followed by a softmax, to obtain a probability vector $p$ over the inputs.
* Each $x_i$ also maps to an output vector $c_i$ (using embedding matrix $C$).
* The output $o$ is the sum of the transformed inputs $c_i$, weighted by $p_i$.
* The sum of the output vector $o$ and the internal state $u$ is passed through the weight matrix $W$, followed by a softmax, to produce the prediction.
* $A$, $B$, $C$ and $W$ are learned by minimizing a cross-entropy loss.

### Multiple Layers

* For layers above the first, the input is $u^{k+1} = u^k + o^k$.
* Each layer has its own $A^k$ and $C^k$, subject to constraints.
* At the final layer, the prediction is $\hat{a} = \text{softmax}(W(o^K + u^K))$ (a minimal sketch of this forward pass appears at the end of these notes).

### Constraints on Embedding Matrices

* Adjacent
    * The output embedding of one layer is the input embedding of the next, i.e. $A^{k+1} = C^k$.
    * $W^T = C^K$
    * $B = A^1$
* Layer-wise (RNN-like)
    * The same input and output embeddings are shared across layers, i.e. $A^1 = A^2 = ... = A^K$ and $C^1 = C^2 = ... = C^K$.
    * A linear mapping $H$ is added to the update of $u$ between hops: $u^{k+1} = Hu^k + o^k$.
    * $H$ is also learned.
    * Think of this as a traditional RNN with two outputs:
        * an internal output, used to address the memory;
        * an external output, the predicted result.
    * $u$ becomes the hidden state.
    * $p$ is the internal output which, combined with $C$, is used to update the hidden state.

## Related Architectures

* RNN - Memory is stored as the state of the network and becomes unusable over long temporal contexts.
* LSTM - Locks in the network state using local memory cells, but still fails over longer temporal contexts.
* Memory Networks - Use a global memory.
* Bidirectional RNN (RNNSearch) - Uses a small neural network with a sophisticated gated architecture (attention model) to find useful hidden states, but unlike MemNN performs only a single pass over the memory.

## Sentence Representation for the Question Answering Task

* Bag-of-words representation
    * Input sentences and questions are embedded as a bag of words.
    * Cannot capture the order of the words.
* Position Encoding
    * Takes the order of words into account.
* Temporal Encoding
    * Temporal information is encoded by a matrix $T_A$, and the memory vectors are modified as $m_i = \sum_j Ax_{ij} + T_A(i)$.
* Random Noise
    * Dummy memories (empty memories) are added at training time to regularize $T_A$.
* Linear Start (LS) training
    * The softmax layers are removed at the start of training and re-inserted when the validation loss stops decreasing.

## Observations

* The best MemN2N models are close to the supervised models in performance.
* Position Encoding improves over the bag-of-words approach.
* Linear Start helps to avoid local minima.
* Random Noise gives a small yet consistent boost in performance.
* More computational hops lead to improved performance.
* For the language modeling task, some hops concentrate on recent words, while other hops have a broader attention span over all memory locations.
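To make the approach summary concrete, here is a minimal numpy sketch of a multi-hop forward pass as I understand it, using a single $A$ and $C$ shared across all hops for brevity (the paper instead ties weights with the adjacent or layer-wise schemes); the names and shapes are illustrative, not the authors' code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def memn2n_forward(x_bows, q_bow, A, B, C, W, hops=3):
    """Sketch of the MemN2N forward pass (assumed shapes):
      x_bows: (n, V) bag-of-words vectors for the n input sentences
      q_bow:  (V,)   bag-of-words vector for the query
      A, B, C: (d, V) embedding matrices;  W: (V, d) output matrix
    """
    m = x_bows @ A.T        # memory vectors   m_i = A x_i
    c = x_bows @ C.T        # output vectors   c_i = C x_i
    u = B @ q_bow           # internal state   u^1 = B q
    for _ in range(hops):
        p = softmax(m @ u)  # attention over memories: p_i = softmax(u^T m_i)
        o = p @ c           # o^k = sum_i p_i c_i
        u = u + o           # u^{k+1} = u^k + o^k
    return softmax(W @ u)   # a_hat = softmax(W (o^K + u^K))
```

The only MemN2N-specific machinery is the softmax over memories and the residual-style update of $u$; everything else is plain matrix multiplication, which is what allows training end to end with standard backpropagation.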