TLDR; The authors propose Neural Turing Machines (NTMs). An NTM consists of a memory bank and a controller network. The controller network (an LSTM or MLP in this paper) controls read/write heads by focusing their attention softly, using a distribution over all memory addresses. It can learn the parameters for two addressing mechanisms: content-based addressing ("find similar items") and location-based addressing. NTMs can be trained end-to-end using gradient descent. The authors evaluate NTMs on program generation tasks and compare their performance against that of LSTMs. Tasks include copying, recall, prediction, and sorting of binary vectors. While both LSTMs and NTMs seem to perform well on training data, only NTMs are able to generalize to longer sequences.
#### Key Observations
- The controller network was tried with both an LSTM and an MLP. Which one works better is task-dependent, but the LSTM controller's internal state ("cache") can be a bottleneck.
- Controller size, number of read/write heads, and memory size are hyperparameters.
- Monitoring the memory addressing shows that the NTM actually learns meaningful programs.
- The number of LSTM parameters grows quadratically with the number of hidden units due to the recurrent connections; this is not the case for NTMs, leading to models with fewer parameters.
- Example problems are very small, typically using sequences of 8-bit vectors.
#### Notes/Questions
- At what sequence length do NTMs stop working? Would've liked to see where results get significantly worse.
- Can we automatically transform fuzzy NTM programs into deterministic ones?
A Neural Turing Machine (NTM) consists of a neural network controller interacting with a working memory bank in a learnable manner. This is analogous to a conventional computer: the controller corresponds to the CPU (with hidden activations as registers) and the memory matrix corresponds to RAM. Key ideas:
- The controller (a modified RNN) interacts with the external world via input and output vectors, and with memory via read and write "heads"
- "Read" vector is a convex combination of the row vectors of $M\_t$ (the memory matrix at time $t$): $r\_t = \sum\_i w\_t(i) M\_t(i)$, where $w\_t$ is a vector of weightings over the $N$ memory locations (see the read/write sketch after this list)
- "Writing" is decomposed into 1) erasing and 2) adding
- The write head produces an erase vector $e\_t$ and an add vector $a\_t$ along with the vector of weightings $w\_t$ over memory locations
- $M\_t(i) = M\_{t-1}(i)[\mathbf{1} - w\_t(i) e\_t] + w\_t(i) a\_t$ (the multiplication against the erase term is elementwise)
- Erase and add vectors control which components of memory are updated, while weightings w_t control which locations are updated
- Weight vectors are produced by an addressing mechanism that combines content-based and location-based addressing (see the addressing sketch after this list)
- Content-based addressing
- Each head produces a length-$M$ key $k\_t$ that is compared to each row $M\_t(i)$ by cosine similarity, scaled by a key strength $\beta\_t$ (an inverse temperature); the resulting scores are normalized with a softmax
- Location-based addressing
- Interpolation: Each head produces an interpolation gate $g\_t$ used to blend the weighting from the previous timestep with the content weighting of the current timestep: $w^{g}\_t = g\_t w^{c}\_t + (1-g\_t)w\_{t-1}$
- Shift: Circular convolution (modulo $N$) of the gated weighting with a shift distribution emitted by the head, e.g. a softmax over a small set of integer shifts (say 3 positions)
- Sharpening: Each head emits $\gamma\_t \geq 1$ to sharpen the final weighting: $w\_t(i) = \tilde{w}\_t(i)^{\gamma\_t} / \sum\_j \tilde{w}\_t(j)^{\gamma\_t}$
- Experiments on copy, repeat copy, associative recall, dynamic N-grams, and priority sort
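
Below is a minimal NumPy sketch of the read and erase/add write operations described above. It is not the paper's code: the function names, argument layout, and the toy sizes ($N = 8$ locations, $M = 4$ components) are assumptions for illustration.

```python
import numpy as np

def read(memory, w):
    """Read vector r_t = sum_i w_t(i) M_t(i): a convex combination of memory rows.

    memory: (N, M) memory matrix; w: (N,) weighting over locations (non-negative, sums to 1).
    """
    return w @ memory  # shape (M,)

def write(memory, w, erase, add):
    """Erase-then-add write: M_t(i) = M_{t-1}(i) * (1 - w(i) * e_t) + w(i) * a_t.

    erase: (M,) values in [0, 1]; add: (M,) arbitrary values.
    """
    memory = memory * (1.0 - np.outer(w, erase))  # erase phase (elementwise)
    return memory + np.outer(w, add)              # add phase

# Toy usage with N = 8 locations of M = 4 components each.
rng = np.random.default_rng(0)
mem = rng.normal(size=(8, 4))
w = np.zeros(8)
w[2], w[3] = 0.7, 0.3                  # soft focus on locations 2 and 3
r = read(mem, w)                       # blended read vector, shape (4,)
mem = write(mem, w, erase=np.ones(4), add=rng.normal(size=4))
```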
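
And a sketch of the addressing pipeline (content similarity, interpolation, circular shift, sharpening) that produces a head's weighting $w\_t$. Again, the function name `address`, the three-position shift range, and the toy parameters are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def address(memory, w_prev, key, beta, g, shift, gamma):
    """Produce a head's weighting via content -> interpolate -> shift -> sharpen.

    memory: (N, M); w_prev: (N,) previous weighting; key: (M,) content key;
    beta > 0 key strength; g in [0, 1] interpolation gate;
    shift: distribution over shifts (-1, 0, +1); gamma >= 1 sharpening exponent.
    """
    # 1) Content addressing: cosine similarity scaled by beta, normalized by softmax.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = softmax(beta * sims)
    # 2) Interpolation with the previous timestep's weighting.
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3) Location shift: circular convolution (modulo N) with the shift distribution.
    w_s = np.zeros_like(w_g)
    for offset, s in zip((-1, 0, 1), shift):
        w_s += s * np.roll(w_g, offset)
    # 4) Sharpening: raise to gamma and renormalize.
    w = w_s ** gamma
    return w / w.sum()

# Toy usage: with a high beta and g = 1, attention locks onto the row matching `key`.
rng = np.random.default_rng(1)
mem = rng.normal(size=(8, 4))
w = address(mem, w_prev=np.full(8, 1 / 8), key=mem[5], beta=50.0, g=1.0,
            shift=np.array([0.0, 1.0, 0.0]), gamma=2.0)
```

With a large key strength and the gate fully open, the weighting collapses onto the best-matching row; lowering $\beta$ or $g$ keeps the focus softer.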
## Links
- [Attention and Augmented RNNs](http://distill.pub/2016/augmented-rnns/)
- [NTM-Lasagne](https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315)