First published: 2018/06/19
Abstract: We introduce a new family of deep neural network models. Instead of
specifying a discrete sequence of hidden layers, we parameterize the derivative
of the hidden state using a neural network. The output of the network is
computed using a black-box differential equation solver. These continuous-depth
models have constant memory cost, adapt their evaluation strategy to each
input, and can explicitly trade numerical precision for speed. We demonstrate
these properties in continuous-depth residual networks and continuous-time
latent variable models. We also construct continuous normalizing flows, a
generative model that can train by maximum likelihood, without partitioning or
ordering the data dimensions. For training, we show how to scalably
backpropagate through any ODE solver, without access to its internal
operations. This allows end-to-end training of ODEs within larger models.
Summary by senior author [duvenaud on Hacker News](https://news.ycombinator.com/item?id=18678078):
A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:
h1 = f1(x)
h2 = f2(h1)
h3 = f3(h2)
h4 = f4(h3)
y = f5(h4)
They now build them like this:
h1 = f1(x) + x
h2 = f2(h1) + h1
h3 = f3(h2) + h2
h4 = f4(h3) + h3
y = f5(h4) + h4
Where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.
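The two ways of stacking layers above can be written as a runnable sketch. The function names `plain_forward` and `residual_forward` and the toy "layers" (each just returns 10% of its input) are assumptions made for illustration; real layers would be learned neural net modules.

```python
def plain_forward(x, layers):
    """Plain deep net: each layer replaces the hidden state."""
    h = x
    for f in layers:
        h = f(h)
    return h


def residual_forward(x, layers):
    """Residual net: each layer adds a correction to the hidden state."""
    h = x
    for f in layers:
        h = f(h) + h
    return h


# Hypothetical layers that each output only a small correction.
layers = [lambda h: 0.1 * h for _ in range(5)]

plain_out = plain_forward(1.0, layers)        # the signal shrinks at every layer
residual_out = residual_forward(1.0, layers)  # the input is nudged step by step
```

With layers that output small values, the plain stack collapses the input toward zero, while the residual stack keeps the input and only adjusts it, which is the "small change to an almost-correct answer" behaviour described above.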
In the last couple of years, a few different groups noticed that this looks like a primitive ODE solver (Euler's method), which solves for the trajectory of a system by taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
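To make the connection to Euler's method concrete, here is a minimal sketch in which every residual update is scaled by a step size `dt`. The dynamics `f` (simple exponential decay, dh/dt = -h) and the name `residual_stack` are assumptions chosen so that the exact answer is known; in a real network `f` would be a learned layer.

```python
import math


def f(h, t):
    # Hypothetical dynamics standing in for a neural net layer: dh/dt = -h.
    return -h


def residual_stack(h0, num_layers, t0=0.0, t1=1.0):
    """Apply num_layers residual updates h <- h + dt * f(h, t).

    This is exactly Euler's method on [t0, t1] with num_layers steps.
    """
    dt = (t1 - t0) / num_layers
    h, t = h0, t0
    for _ in range(num_layers):
        h = h + dt * f(h, t)  # residual update = one Euler step
        t += dt
    return h


exact = math.exp(-1.0)            # true solution of dh/dt = -h at t = 1, h(0) = 1
coarse = residual_stack(1.0, 5)   # 5 "layers"
fine = residual_stack(1.0, 500)   # 500 "layers"
```

As the number of layers grows and the step shrinks, the output of the stack approaches the exact solution of the ODE, which is what motivates treating depth as a continuous variable.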
We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So rather than updating the hidden units layer by layer, we define their derivative with respect to depth. We call this an ODE net.
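In symbols, the step from a residual net to an ODE net can be written as follows. The notation ($h$ for the hidden state, $\theta$ for the parameters, $T$ for the final "depth") is an assumption here, chosen to match the usual presentation of this idea.

$$
h_{t+1} = h_t + f(h_t, \theta_t)
\quad\longrightarrow\quad
\frac{dh(t)}{dt} = f(h(t), t, \theta),
\qquad
h(T) = h(0) + \int_0^T f(h(t), t, \theta)\, dt .
$$

The input is $h(0)$ and the network output is $h(T)$, the solution of this initial value problem at the final time.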
Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
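The trade-off between tolerance and speed can be illustrated with a toy adaptive solver. This is only a sketch under stated assumptions: `adaptive_heun` is a hypothetical, hand-written embedded Euler/Heun scheme for a scalar state, not the solver used in the paper, and the dynamics `decay` are a stand-in for a learned network.

```python
import math


def adaptive_heun(f, h0, t0, t1, tol):
    """Integrate dh/dt = f(h, t) from t0 to t1 with step-size control.

    Uses an Euler step and a Heun step from the same two evaluations of f;
    their difference serves as the local error estimate.
    Returns (final state, number of calls to f).
    """
    h, t = h0, t0
    dt = (t1 - t0) / 10.0
    evals = 0
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)
        k1 = f(h, t)
        k2 = f(h + dt * k1, t + dt)
        evals += 2
        euler = h + dt * k1
        heun = h + 0.5 * dt * (k1 + k2)
        err = abs(heun - euler)
        if err <= tol:
            h, t = heun, t + dt  # accept the step
        # Grow or shrink the step based on the error estimate, with a safety factor.
        scale = 0.9 * math.sqrt(tol / err) if err > 0 else 2.0
        dt *= min(2.0, max(0.2, scale))
    return h, evals


def decay(h, t):
    # Hypothetical dynamics standing in for a neural net: dh/dt = -h.
    return -h


exact = math.exp(-1.0)
loose, loose_evals = adaptive_heun(decay, 1.0, 0.0, 1.0, tol=1e-2)
tight, tight_evals = adaptive_heun(decay, 1.0, 0.0, 1.0, tol=1e-6)
```

Here `loose_evals` is much smaller than `tight_evals`, at the cost of a larger error in `loose`. Each call to `f` corresponds to one evaluation of the network that defines the dynamics, so a looser tolerance means less compute at test time.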