Summary by Peter O'Connor
# Very Short
The authors propose a deep, recurrent, convolutional architecture called PredNet, inspired by the idea of predictive coding from neuroscience. In PredNet, the first layer attempts to predict the input frame, based on past frames and on input from higher layers. The next layer then attempts to predict the *prediction error* of the first layer, and so on. The authors show that such an architecture can predict future frames of video, and can predict the parameters of synthetically-generated video, better than a conventional recurrent autoencoder.
# Short
## The Model
PredNet has the following architecture:
https://i.imgur.com/7vOcGwI.png
Where the R blocks are recurrent neural networks, the A blocks are convolutional layers, and $E_l$ indicates the prediction error at layer $l$.
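As a rough illustration of one layer's wiring, here is a minimal PyTorch sketch. It is this summary's simplification, not the authors' code: the paper uses a ConvLSTM for the R block, which is reduced here to a plain tanh convolution over its inputs, and all channel counts are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredNetLayer(nn.Module):
    """One PredNet layer (sketch). Assumes R_top has already been
    upsampled to this layer's spatial resolution."""

    def __init__(self, a_channels, r_channels, top_r_channels):
        super().__init__()
        # R block: recurrent unit fed E_l, its own previous state, and R_{l+1}.
        # (The paper uses a ConvLSTM; a single conv stands in for it here.)
        self.r_conv = nn.Conv2d(2 * a_channels + r_channels + top_r_channels,
                                r_channels, kernel_size=3, padding=1)
        # \hat A_l: this layer's prediction of its own input, read out of R_l.
        self.a_hat_conv = nn.Conv2d(r_channels, a_channels,
                                    kernel_size=3, padding=1)
        # A_{l+1}: the next layer's input, computed from the error E_l.
        self.a_up_conv = nn.Conv2d(2 * a_channels, 2 * a_channels,
                                   kernel_size=3, padding=1)

    def forward(self, A, E_prev, R_prev, R_top):
        # Update the recurrent state from last step's error, last step's
        # state, and the state of the layer above.
        R = torch.tanh(self.r_conv(torch.cat([E_prev, R_prev, R_top], dim=1)))
        A_hat = F.relu(self.a_hat_conv(R))                            # prediction
        E = torch.cat([F.relu(A - A_hat), F.relu(A_hat - A)], dim=1)  # error units
        A_next = F.max_pool2d(F.relu(self.a_up_conv(E)), 2)           # to layer l+1
        return R, A_hat, E, A_next
```

The network is trained from snippets of video, and the loss is given as: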
$L_{train} = \sum_{t=1}^T \sum_l \frac{\lambda_l}{n_l} \sum_{i=1}^{2n_l} [E_l^t]_i$
Where $t$ indexes the time step, $l$ indexes the layer, $n_l$ is the number of units in layer $l$, $E_l^t = [\text{ReLU}(A_l^t-\hat A_l^t) ; \text{ReLU}(\hat A_l^t - A_l^t)]$ is the concatenation of the positive and negative components of the prediction error, and $\lambda_l$ is a hyperparameter determining how much the error at layer $l$ contributes to the loss.
In the experiments, they use two settings for the $\lambda_l$ hyperparameters. In the "$L_0$" setting, they set $\lambda_0=1, \lambda_{l>0}=0$, which turns out to work best when the goal is to minimize next-frame L1 error. In the "$L_{all}$" setting, they use $\lambda_0=1, \lambda_{l>0}=0.1$, which, in the synthetic-images experiment, is better for decoding the parameters of the synthetic-image generator.
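To make the loss concrete, here is a hedged PyTorch sketch of the training objective with the two $\lambda$ settings. The tensor shapes and the handling of the batch dimension are this summary's assumptions; the formula above leaves them implicit.

```python
import torch

def prednet_loss(errors, layer_weights):
    """L_train = sum_t sum_l (lambda_l / n_l) * sum_i [E_l^t]_i.

    errors[t][l]: the error tensor E_l^t for layer l at time step t,
    of shape [batch, 2 * channels, H, W] (channels doubled by the
    positive/negative split). layer_weights[l] is lambda_l.
    """
    loss = torch.zeros(())
    for errors_t in errors:                          # sum over time steps t
        for lam, E in zip(layer_weights, errors_t):  # sum over layers l
            if lam == 0.0:
                continue                             # skip unweighted layers
            n_l = E[0].numel() // 2                  # n_l: units in layer l
            loss = loss + (lam / n_l) * E.mean(dim=0).sum()  # batch-averaged
    return loss

# The two settings used in the experiments (here for a 4-layer model):
lambdas_L0   = [1.0, 0.0, 0.0, 0.0]   # "L_0": only the pixel-level error
lambdas_Lall = [1.0, 0.1, 0.1, 0.1]   # "L_all": higher layers weighted 0.1
```

Since the error units are ReLU outputs, the sum over $E_l^t$ amounts to an L1 penalty on the prediction error at each layer.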
## Results
They apply the model to two tasks:
1) Predicting the future frames of a synthetic video generated by a graphics engine.
Here they predict both the next frame (on which their $L_0$ model does best) and the parameters (face characteristics, rotation, angle) of the program that generates the synthetic faces (on which their $L_{all}$ model does best). They predict the face-generating parameters by first training the model, then freezing its weights and regressing from the learned representations at a given layer to the parameters; a rough sketch of this readout follows the figure below. They show that both the $L_0$ and $L_{all}$ models outperform a more conventional recurrent autoencoder.
https://i.imgur.com/S8PpJnf.png
**Next-frame predictions on a sequence of faces (note: here, predictions are *not* fed back into the model to generate the next frame)**
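The readout protocol from this experiment can be sketched as follows. This is a hedged illustration with stand-in data: `features` would be the frozen model's representations (e.g. the $R_l$ activations at the chosen layer) for each video snippet, and `Ridge` is this summary's choice of linear decoder, not necessarily the paper's exact regressor.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in arrays with made-up shapes; in practice, run the frozen PredNet
# over each snippet and read out one layer's representation.
rng = np.random.default_rng(0)
n_videos, feat_dim, n_params = 1000, 512, 4   # n_params: e.g. face id + angles
features = rng.normal(size=(n_videos, feat_dim))   # frozen representations
params = rng.normal(size=(n_videos, n_params))     # generator parameters

X_train, X_test, y_train, y_test = train_test_split(
    features, params, test_size=0.2, random_state=0)

decoder = Ridge(alpha=1.0)   # linear readout; PredNet weights are not updated
decoder.fit(X_train, y_train)
print("held-out decoding R^2:", decoder.score(X_test, y_test))
```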
2) Predicting future frames of video from dashboard cameras.
https://i.imgur.com/Zus34Vm.png
**Next-frame predictions of dashboard-camera images**
The authors conclude that allowing higher layers to model *prediction errors*, rather than *abstract representations*, can lead to better modeling of video.