First published: 2016/08/18 (4 years ago) Abstract: Training directed neural networks typically requires forward-propagating data
through a computation graph, followed by backpropagating error signal, to
produce weight updates. All layers, or more generally, modules, of the network
are therefore locked, in the sense that they must wait for the remainder of the
network to execute forwards and propagate error backwards before they can be
updated. In this work we break this constraint by decoupling modules by
introducing a model of the future computation of the network graph. These
models predict what the result of the modelled subgraph will produce using only
local information. In particular we focus on modelling error gradients: by
using the modelled synthetic gradient in place of true backpropagated error
gradients we decouple subgraphs, and can update them independently and
asynchronously i.e. we realise decoupled neural interfaces. We show results for
feed-forward models, where every layer is trained asynchronously, recurrent
neural networks (RNNs) where predicting one's future gradient extends the time
over which the RNN can effectively model, and also a hierarchical RNN system
with ticking at different timescales. Finally, we demonstrate that in addition
to predicting gradients, the same framework can be used to predict inputs,
resulting in models which are decoupled in both the forward and backwards pass
-- amounting to independent networks which co-learn such that they can be
composed into a single functioning corporation.