Delving Deeper into Convolutional Networks for Learning Video Representations
Nicolas Ballas
and
Li Yao
and
Chris Pal
and
Aaron Courville
arXiv e-Print archive - 2015 via Local arXiv
Keywords:
cs.CV, cs.LG, cs.NE
First published: 2015/11/19
Abstract: We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset. While high-level percepts contain highly discriminative information, they tend to have a low spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. Using low-level percepts can, however, lead to high-dimensional video representations. To mitigate this effect and control the number of model parameters, we introduce a variant of the GRU model that leverages convolution operations to enforce sparse connectivity of the model units and to share parameters across the input spatial locations.
We empirically validate our approach on both Human Action Recognition and Video Captioning tasks. In particular, we achieve results equivalent to state-of-the-art on the YouTube2Text dataset using a simpler text-decoder model and without extra 3D CNN features.
This paper presents a neat method for learning spatio-temporal representations from videos. Convolutional features from intermediate layers of a CNN are extracted, to preserve spatial resolution, and fed into a modified GRU that can (in theory) model arbitrarily long temporal dependencies. Main contributions:
- Their variant of the GRU (called GRU-RCN) replaces the fully-connected units with convolution operations (see the sketch after this list).
- This exploits the local correlation in image frames across spatial locations.
- Features from pool2, pool3, pool4 and pool5 are extracted and fed into independent GRU-RCNs. The hidden state of each GRU-RCN at the last time step is a feature volume, which is average-pooled down to 1x1 spatially and fed into a linear + softmax classifier. The outputs of these classifiers are averaged to get the final prediction.
- Other variants that they experiment with are bidirectional GRU-RCNs and stacked GRU-RCNs, i.e. GRU-RCNs with connections between them (with max-pooling operations for dimensionality reduction).
- Bidirectional GRU-RCNs perform the best.
- Stacked GRU-RCNs perform worse than the other variants, probably because of limited data.
- They evaluate their method on action recognition and video captioning, and show significant improvements over a CNN+RNN baseline, comparing favorably with other state-of-the-art methods (like C3D).
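
To make the GRU-RCN idea concrete, here is a minimal sketch of a convolutional GRU cell in PyTorch. The class name, kernel size and layer choices are illustrative assumptions, not the authors' implementation; the point is simply that every matrix product in the standard GRU equations is replaced by a 2D convolution, so the hidden state stays a feature map with the same spatial size as the input percept.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Hypothetical convolutional GRU cell (GRU-RCN-style sketch)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        p = kernel_size // 2  # 'same' padding keeps the spatial resolution
        # Input-to-hidden and hidden-to-hidden convolutions for the
        # update (z) and reset (r) gates, computed jointly.
        self.conv_zr_x = nn.Conv2d(in_channels, 2 * hidden_channels,
                                   kernel_size, padding=p)
        self.conv_zr_h = nn.Conv2d(hidden_channels, 2 * hidden_channels,
                                   kernel_size, padding=p, bias=False)
        # Convolutions for the candidate hidden state.
        self.conv_c_x = nn.Conv2d(in_channels, hidden_channels,
                                  kernel_size, padding=p)
        self.conv_c_h = nn.Conv2d(hidden_channels, hidden_channels,
                                  kernel_size, padding=p, bias=False)

    def forward(self, x, h):
        # x: (batch, in_channels, H, W) percept from an intermediate CNN layer
        # h: (batch, hidden_channels, H, W) previous hidden state
        zr = torch.sigmoid(self.conv_zr_x(x) + self.conv_zr_h(h))
        z, r = zr.chunk(2, dim=1)                                 # update / reset gates
        c = torch.tanh(self.conv_c_x(x) + self.conv_c_h(r * h))  # candidate state
        return (1 - z) * h + z * c                                # new hidden state
```

In the pipeline summarized above, one such cell would be rolled over the per-frame percepts of a given pooling layer; the final hidden state (still a feature volume) is average-pooled to 1x1 and passed to a linear + softmax classifier, and the per-layer predictions are averaged.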
## Strengths
- The idea is simple and elegant. Earlier methods for learning video representations typically used 3D convolutions (k x k x T filters), which suffer from a finite temporal extent, or RNNs sitting on top of last-layer CNN features, which cannot capture finer spatial structure. In theory, this formulation solves both problems.
- Replacing fully-connected operations with convolutions has the additional advantage of requiring fewer parameters (roughly n\_input x n\_output x input\_width x input\_height per gate vs. n\_input x n\_output x k\_width x k\_height), as the worked example below shows.
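
To give a rough sense of scale, here is an illustrative back-of-the-envelope count for a single gate's input-to-hidden weights. The feature-map size and kernel size are assumptions (512 channels at 14x14, roughly VGG-16 pool4 percepts for a 224x224 input, with 3x3 kernels), not numbers from the paper.

```python
# Illustrative parameter count for one gate's input-to-hidden weights,
# assuming 512-channel 14x14 percepts and 3x3 convolution kernels.
n_input, n_output = 512, 512
H, W = 14, 14   # spatial size of the percept (assumed)
k = 3           # convolution kernel size (assumed)

fc_params = n_input * n_output * H * W    # fully-connected gate
conv_params = n_input * n_output * k * k  # convolutional gate

print(f"fully-connected: {fc_params:,}")   # 51,380,224
print(f"convolutional:   {conv_params:,}") # 2,359,296
```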