Delving Deeper into Convolutional Networks for Learning Video Representations
Nicolas Ballas, Li Yao, Chris Pal and Aaron Courville
arXiv e-Print archive - 2015
Keywords: cs.CV, cs.LG, cs.NE
First published: 2015/11/19
Abstract: We propose an approach to learn spatio-temporal features in videos from
intermediate visual representations we call "percepts" using
Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts
that are extracted from all levels of a deep convolutional network trained on
the large ImageNet dataset. While high-level percepts contain highly
discriminative information, they tend to have a low-spatial resolution.
Low-level percepts, on the other hand, preserve a higher spatial resolution
from which we can model finer motion patterns. Using low-level percepts can,
however, lead to high-dimensional video representations. To mitigate this effect
and control the number of model parameters, we introduce a variant of the GRU
model that leverages convolution operations to enforce sparse connectivity of the
model units and share parameters across the input spatial locations.
We empirically validate our approach on both Human Action Recognition and
Video Captioning tasks. In particular, we achieve results equivalent to
state-of-the-art on the YouTube2Text dataset using a simpler text-decoder model and
without extra 3D CNN features.
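
The GRU variant described in the abstract replaces the fully-connected transforms of a standard GRU with convolutions, so the gates are computed locally over each percept's spatial map and parameters are shared across spatial locations. Below is a minimal sketch of such a convolutional GRU cell in PyTorch, assuming 2D feature-map percepts as input; the class name, channel sizes, and kernel size are illustrative choices and not taken from the authors' code.

```python
# Minimal sketch of a convolutional GRU cell: the fully-connected transforms
# of a standard GRU are replaced by 2D convolutions, enforcing sparse (local)
# connectivity and sharing parameters across spatial locations.
import torch
import torch.nn as nn


class ConvGRUCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # preserve the spatial resolution of the percepts
        # Input-to-hidden convolutions for update gate, reset gate, candidate.
        self.conv_xz = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)
        self.conv_xr = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)
        self.conv_xh = nn.Conv2d(in_channels, hidden_channels, kernel_size, padding=pad)
        # Hidden-to-hidden convolutions.
        self.conv_hz = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad, bias=False)
        self.conv_hr = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad, bias=False)
        self.conv_hh = nn.Conv2d(hidden_channels, hidden_channels, kernel_size, padding=pad, bias=False)

    def forward(self, x, h):
        # x: (batch, in_channels, H, W) percept at time t
        # h: (batch, hidden_channels, H, W) previous hidden state
        z = torch.sigmoid(self.conv_xz(x) + self.conv_hz(h))            # update gate
        r = torch.sigmoid(self.conv_xr(x) + self.conv_hr(h))            # reset gate
        h_tilde = torch.tanh(self.conv_xh(x) + self.conv_hh(r * h))     # candidate state
        return (1 - z) * h + z * h_tilde                                 # new hidden state


# Usage: run the cell over a sequence of conv-net feature maps ("percepts").
percepts = torch.randn(8, 10, 512, 7, 7)   # (batch, time, channels, H, W), hypothetical sizes
cell = ConvGRUCell(in_channels=512, hidden_channels=256)
h = torch.zeros(8, 256, 7, 7)
for t in range(percepts.size(1)):
    h = cell(percepts[:, t], h)
```

Because the recurrent transforms are convolutions, the parameter count depends on the kernel size and channel counts rather than on the spatial resolution of the percepts, which is what makes low-level, high-resolution feature maps tractable as inputs.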