Hybrid computing using a neural network with dynamic external memory
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, Demis Hassabis
Nature - 2016
The paper introduces an approach to model external memory in a differentiable way. Memory is modeled as an $N \times W$ matrix: $N$ independent storage locations, each able to store a datum of length $W$.
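A minimal sketch of this memory layout in NumPy (the sizes $N=16$, $W=8$ are illustrative, not taken from the paper):

```python
import numpy as np

# Illustrative sizes, not from the paper: N locations, each of width W.
N, W = 16, 8
memory = np.zeros((N, W))        # each of the N rows stores one datum of length W
memory[3] = np.random.randn(W)   # writing a datum fills one row
```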
### DNC Architecture
A controller network is trained to use the memory. Memory access proceeds in cycles: at each time-step $t$ the network emits read and write commands as part of its output. The commands are then processed, and the data that was read is fed back to the network at time-step $t+1$ as part of its input. Any deep learning architecture can serve as the controller (e.g. a standard feed-forward network); the paper uses a deep LSTM.
https://storage.googleapis.com/deepmind-live-cms/images/dnc_figure1.width-1500.png
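A hedged sketch of this read/write cycle, assuming placeholder functions `controller_step` and `memory_access` (these names and signatures are illustrative, not the paper's API):

```python
import numpy as np

def run_dnc(inputs, controller_step, memory_access, W):
    """Run the controller over a sequence, feeding the data read from memory
    at step t back in as part of the input at step t+1."""
    state = None
    read_vector = np.zeros(W)            # nothing has been read before t = 0
    outputs = []
    for x_t in inputs:
        # Controller sees the current input concatenated with the previous read.
        controller_in = np.concatenate([x_t, read_vector])
        output_t, interface_t, state = controller_step(controller_in, state)
        # The interface vector is translated into memory read/write commands,
        # and the resulting read vector is carried over to the next step.
        read_vector = memory_access(interface_t)
        outputs.append(output_t)
    return outputs
```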
### Memory Interaction
The control commands allow the network to interact with the data in three different ways:
1. Content lookup: access is weighted by how closely each location matches a key emitted by the controller (see the sketch after this list).
2. Sequential read access: for each vector $v$ the network read at time $t$, it can access the data that was written directly after $v$ was written.
3. Usage-based write access: a "usage" vector $u \in [0,1]^N$ models how heavily each location is used. The network can choose to write to the location with the lowest usage, and $u$ can be decreased at each time-step ("freeing" memory).
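For the content lookup, a common formulation (and the one the DNC uses) is a softmax over cosine similarities between the key and each memory row, sharpened by a key-strength scalar $\beta$. A minimal sketch, with illustrative variable names:

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Soft weighting over the N locations by cosine similarity between
    each memory row and the key, sharpened by the strength beta."""
    eps = 1e-8
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    similarity = memory @ key / norms      # cosine similarity per row, shape (N,)
    scores = beta * similarity
    scores -= scores.max()                 # subtract max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()         # non-negative, sums to 1 over locations
```

Larger $\beta$ concentrates the weighting on the best-matching location; small $\beta$ spreads it more evenly across memory.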
#### Differentiable Operations
All memory operations are modeled in a differentiable way, so that the entire model can be trained end-to-end using gradient descent. To do this, each read command is mapped to a soft read weighting $w^r \in [0,1]^N$ with $\sum^N_{i=1} w^r_i = 1$. The read vector $r$ is then defined as the weighted sum over the rows of the memory: $r = \sum^N_{i=1} w^r_i \, M[i, \cdot]$.
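The read itself is then just a matrix-vector product; a minimal sketch:

```python
import numpy as np

def soft_read(memory, w_r):
    """Differentiable read: memory is (N, W), w_r is a distribution over the
    N locations, and the result r is the weighted sum of rows (length W)."""
    return memory.T @ w_r
```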
### Experiments
The network was trained on a variety of memory-intensive tasks. The results show that the network learns to take advantage of the external memory.
https://www.youtube.com/watch?v=B9U8sI7TcMY