End-to-end Learning of Action Detection from Frame Glimpses in Videos
Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei
arXiv e-Print archive - 2015
Keywords: cs.CV, cs.LG
First published: 2015/11/22
Abstract: In this work we introduce a fully end-to-end approach for action detection in
videos that learns to directly predict the temporal bounds of actions. Our
intuition is that the process of detecting actions is naturally one of
observation and refinement: observing moments in video, and refining hypotheses
about when an action is occurring. Based on this insight, we formulate our
model as a recurrent neural network-based agent that interacts with a video
over time. The agent observes video frames and decides both where to look next
and when to emit a prediction. Since backpropagation is not adequate in this
non-differentiable setting, we use REINFORCE to learn the agent's decision
policy. Our model achieves state-of-the-art results on the THUMOS'14 and
ActivityNet datasets while observing only a fraction (2% or less) of the video
frames.
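
The sampling decisions (where to observe next, whether to emit a prediction) are non-differentiable, which is why the policy is trained with REINFORCE. Below is a minimal, hypothetical PyTorch sketch of that idea: a recurrent agent that observes one frame feature per step, samples its next observation location and an emit decision, and is updated with a score-function (REINFORCE) gradient. All module names, dimensions, and the toy reward are illustrative assumptions, not the authors' implementation.

    # Hedged sketch: RNN glimpse agent trained with REINFORCE on its
    # non-differentiable decisions (next location, emit-or-not).
    import torch
    import torch.nn as nn

    class GlimpseAgent(nn.Module):
        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            self.rnn = nn.GRUCell(feat_dim, hidden_dim)
            self.loc_head = nn.Linear(hidden_dim, 1)   # mean of next observation location (normalized 0..1)
            self.emit_head = nn.Linear(hidden_dim, 1)  # logit for emitting a detection at this step
            self.pred_head = nn.Linear(hidden_dim, 2)  # candidate (start, end) bounds, trained with a supervised loss

        def forward(self, frame_feat, h):
            h = self.rnn(frame_feat, h)
            loc_dist = torch.distributions.Normal(torch.sigmoid(self.loc_head(h)), 0.1)
            emit_dist = torch.distributions.Bernoulli(logits=self.emit_head(h))
            return h, loc_dist, emit_dist, self.pred_head(h)

    # One REINFORCE update over a single toy episode of T glimpses.
    agent = GlimpseAgent()
    opt = torch.optim.Adam(agent.parameters(), lr=1e-4)

    h = torch.zeros(1, 256)
    log_probs = []
    for _ in range(6):                                  # T = 6 glimpses
        frame_feat = torch.randn(1, 512)                # stand-in for a CNN frame feature
        h, loc_dist, emit_dist, bounds = agent(frame_feat, h)
        loc = loc_dist.sample()                         # where to look next (non-differentiable)
        emit = emit_dist.sample()                       # whether to emit a prediction (non-differentiable)
        log_probs.append(loc_dist.log_prob(loc).sum() + emit_dist.log_prob(emit).sum())

    reward = torch.tensor(1.0)                          # e.g. positive if emitted detections match ground truth
    loss = -reward * torch.stack(log_probs).sum()       # REINFORCE: ascend expected reward
    opt.zero_grad()
    loss.backward()
    opt.step()

In the paper the reward is derived from how well the emitted temporal bounds match ground-truth action instances; the sketch above collapses that to a constant purely to show where the reward enters the gradient.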