Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs
Zheng Shou, Dongang Wang, and Shih-Fu Chang
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CV
First published: 2016/01/09
Abstract: We address temporal action localization in untrimmed long videos. This is
important because videos in real applications are usually unconstrained and
contain multiple action instances plus video content of background scenes or
other activities. To address this challenging issue, we exploit the
effectiveness of deep networks in temporal action localization via three
segment-based 3D ConvNets: (1) a proposal network identifies candidate segments
in a long video that may contain actions; (2) a classification network learns
one-vs-all action classification model to serve as initialization for the
localization network; and (3) a localization network fine-tunes on the learned
classification network to localize each action instance. We propose a novel
loss function for the localization network to explicitly consider temporal
overlap and therefore achieve high temporal localization accuracy. Only the
proposal network and the localization network are used during prediction. On
two large-scale benchmarks, our approach achieves significantly superior
performances compared with other state-of-the-art systems: mAP increases from
1.7% to 7.4% on MEXaction2 and increases from 15.0% to 19.0% on THUMOS 2014,
when the overlap threshold for evaluation is set to 0.5.
## Segment-CNN (S-CNN)
**Summary**: This paper uses three segment-based 3D CNNs to identify candidate proposals, recognize actions, and localize temporal boundaries.
**Models**:
The pipeline has three parts: generating proposals, selecting proposals and refining their temporal boundaries, and applying non-maximum suppression (NMS) to remove redundant detections.
1. Generate multi-scale segments (16, 32, 64, 128, 256, and 512 frames) using sliding windows with 75% overlap. This is computationally expensive, since every window at every scale has to pass through a 3D ConvNet.
2. Network: each of the three stages is a 3D ConvNet followed by 3 FC layers.
* The proposal network is essentially a binary classifier that judges whether each segment contains an action.
* The classification network classifies each segment that the proposal network considers valid into background or one of K action categories. It only serves as initialization for the localization network and is not used at prediction time.
* The localization network acts as a scoring system: it raises the scores of segments that overlap strongly with the corresponding ground truth and lowers the scores of the others.
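Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sliding_segments`, the end-exclusive `(start, end)` convention, and the stride rule (window length times 25%) are assumptions made here to match the "75% overlap" description.

```python
from typing import List, Sequence, Tuple

# Window lengths in frames and overlap ratio, as listed in the summary.
WINDOW_LENGTHS = (16, 32, 64, 128, 256, 512)
OVERLAP = 0.75


def sliding_segments(
    num_frames: int,
    lengths: Sequence[int] = WINDOW_LENGTHS,
    overlap: float = OVERLAP,
) -> List[Tuple[int, int]]:
    """Return candidate (start, end) frame ranges, with end exclusive."""
    segments = []
    for length in lengths:
        if length > num_frames:
            continue  # window does not fit into the video
        # Assumed stride: 75% overlap means stepping by 25% of the window.
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            segments.append((start, start + length))
    return segments


if __name__ == "__main__":
    candidates = sliding_segments(64)
    print(len(candidates), candidates[:3])
```

The number of candidates per scale grows roughly as `num_frames / stride`, which is where the high computational cost noted above comes from.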
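The final NMS step could look roughly like the greedy procedure below. It is a sketch under assumptions: the `(start, end, score)` tuple layout, the helper names `temporal_iou` and `temporal_nms`, and the default threshold of 0.5 are illustrative choices, not values taken from the paper.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (start, end, score); layout is hypothetical


def temporal_iou(a: Tuple, b: Tuple) -> float:
    """Intersection over union of two 1D segments given as (start, end, ...)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], iou_threshold: float = 0.5) -> List[Segment]:
    """Greedy NMS: keep the highest-scoring segments, drop heavy overlaps."""
    kept: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        # Keep a segment only if it does not overlap too much with any kept one.
        if all(temporal_iou(seg, k) <= iou_threshold for k in kept):
            kept.append(seg)
    return kept


if __name__ == "__main__":
    scored = [(0, 16, 0.9), (4, 20, 0.8), (32, 64, 0.7)]
    print(temporal_nms(scored))
```

In the example, the second segment overlaps the first with a temporal IoU of 12/20, so it is suppressed while the non-overlapping third segment survives.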