Temporal Action Detection with Structured Segment Networks
Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, Dahua Lin
arXiv e-Print archive - 2017
Keywords: cs.CV
First published: 2017/04/20
Abstract: Detecting actions in untrimmed videos is an important yet challenging task.
In this paper, we present the structured segment network (SSN), a novel
framework which models the temporal structure of each action instance via a
structured temporal pyramid. On top of the pyramid, we further introduce a
decomposed discriminative model comprising two classifiers, respectively for
classifying actions and determining completeness. This allows the framework to
effectively distinguish positive proposals from background or incomplete ones,
thus leading to both accurate recognition and localization. These components
are integrated into a unified network that can be efficiently trained in an
end-to-end fashion. Additionally, a simple yet effective temporal action
proposal scheme, dubbed temporal actionness grouping (TAG) is devised to
generate high quality action proposals. On two challenging benchmarks, THUMOS14
and ActivityNet, our method remarkably outperforms previous state-of-the-art
methods, demonstrating superior accuracy and strong adaptivity in handling
actions with various temporal structures.
## Structured Segment Network
### **Keywords**: action detection in video; computational complexity reduction; structured proposals
**Abstract**: the method uses a temporal actionness grouping (TAG) scheme to generate accurate proposals, a structured temporal pyramid to model the temporal structure of each action instance (addressing the problem of incomplete detections), two classifiers to determine the action class and completeness, and a per-category location regressor to further refine the temporal bounds. In this paper, Yue Zhao et al. mainly tackle two problems: the high computational cost of video detection, which they reduce by sampling frames and removing redundant proposals, and the lack of action-stage modeling.
**Model**:
1. generate proposals: find continuous temporal regions with mostly high actionness, $P = \{ p_i = [s_i, e_i]\}_{i=1}^{N}$ (a simplified grouping sketch follows this list)
2. split each proposal into three stages, start, course, and end: the proposal is first augmented to twice its length, symmetrically about its center; the course stage is the original proposal, while the start and end stages are the left and right parts of the difference between the augmented interval and the original one. For example, a proposal [10, 20] is augmented to [5, 25], giving start [5, 10], course [10, 20], and end [20, 25].
3. build a temporal pyramid representation for each stage: L snippets are first sampled from the augmented proposal, then a two-stream feature extractor is applied to each of them, and the features are pooled per stage
4. build a global representation for each proposal by concatenating the stage-level representations
5. feed the global representation of each proposal to the classifiers and the regressor
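To make step 1 concrete, below is a minimal Python sketch of actionness grouping. Everything here is illustrative, not the authors' code: the function name `tag_proposals`, the threshold `tau`, and the gap tolerance `gamma` are assumptions, and the paper's full TAG scheme additionally sweeps multiple thresholds and removes duplicate regions with non-maximum suppression.

```python
import numpy as np

def tag_proposals(actionness, tau=0.5, gamma=0.25):
    """Group consecutive high-actionness snippets into proposals [s_i, e_i].

    Simplified flood-fill variant: a region keeps expanding while the
    fraction of its snippets falling below the threshold `tau` stays
    under the gap tolerance `gamma`.
    """
    proposals, n, i = [], len(actionness), 0
    while i < n:
        if actionness[i] < tau:              # skip low-actionness snippets
            i += 1
            continue
        s, j, low = i, i, 0                  # open a region at snippet i
        while j + 1 < n:
            low_next = low + int(actionness[j + 1] < tau)
            if low_next / (j + 2 - s) > gamma:
                break                        # too many gaps: stop expanding
            j, low = j + 1, low_next
        while actionness[j] < tau:           # trim trailing low snippets
            j -= 1
        proposals.append((s, j))
        i = j + 1
    return proposals

# example: two bursts of high actionness separated by a gap
scores = np.array([0.1, 0.9, 0.8, 0.7, 0.2, 0.1, 0.8, 0.9, 0.1])
print(tag_proposals(scores))  # -> [(1, 3), (6, 7)]
```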
* input: $\{S_t\}_{t=1}^{T}$, a sequence of T snippets representing the video; each snippet consists of the RGB frames plus an optical flow stack
* network: a two-stream feature extractor applied to the L sampled snippets, several pooling layers, and two linear classifiers
* output: the action category, completeness, and temporal-bound refinement for each proposal (a combined sketch follows the figure)
![SSN overview](https://i.imgur.com/thM9oWz.png)
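The figure above summarizes the architecture. As a rough sketch of steps 2-5, the following PyTorch module (all names and dimensions are assumptions, not the authors' implementation) pools a single-level representation for the start and end stages and a two-level pyramid for the course stage, concatenates them into the global representation, and applies the activity classifier, the completeness classifier, and the per-class location regressor:

```python
import torch
import torch.nn as nn

class SSNHead(nn.Module):
    """Sketch of the SSN head: structured pyramid pooling + two classifiers."""

    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        # start (1) + course pyramid (1 + 2) + end (1) = 5 pooled vectors
        global_dim = 5 * feat_dim
        self.activity_fc = nn.Linear(global_dim, num_classes + 1)   # +1 for background
        self.completeness_fc = nn.Linear(global_dim, num_classes)   # one score per class
        self.regressor_fc = nn.Linear(global_dim, 2 * num_classes)  # (center, span) per class

    def forward(self, start_feat, course_feat, end_feat):
        # each input: (num_snippets, feat_dim) two-stream snippet features
        half = course_feat.shape[0] // 2
        pooled = [
            start_feat.mean(0),           # start stage, single level
            course_feat.mean(0),          # course stage, pyramid level 1
            course_feat[:half].mean(0),   # course stage, level 2, first half
            course_feat[half:].mean(0),   # course stage, level 2, second half
            end_feat.mean(0),             # end stage, single level
        ]
        g = torch.cat(pooled)             # global proposal representation
        return self.activity_fc(g), self.completeness_fc(g), self.regressor_fc(g)

# example: one proposal with 3 / 6 / 3 snippets in the three stages
head = SSNHead()
act, comp, reg = head(torch.randn(3, 1024), torch.randn(6, 1024), torch.randn(3, 1024))
print(act.shape, comp.shape, reg.shape)  # (21,), (20,), (40,)
```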
**Training**:
* joint loss for the classifiers: $L_{cls} = -\log\big(P(c_i \mid p_i)\, P(b_i \mid c_i, p_i)\big)$, where $c_i$ is the class label of proposal $p_i$ and $b_i$ its completeness indicator
* loss for location regression: $\lambda \cdot \mathbb{1}(c_i \ge 1,\, b_i = 1)\, L_{reg}(\mu_i, \varphi_i; p_i)$, applied only to complete foreground proposals
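Combining the two terms, here is a minimal per-proposal sketch of the joint objective, assuming the head sketched above; batching, the paper's online hard example mining, and its exact regression-target parameterization are omitted, and smooth L1 stands in for $L_{reg}$:

```python
import torch
import torch.nn.functional as F

def ssn_loss(act_logits, comp_logits, reg_out, c, b, reg_target, lam=0.5):
    """Joint loss for a single proposal.

    c: class label (0 = background), b: completeness (1 = complete),
    reg_target: ground-truth (center, span) offsets for class c.
    """
    # classification term: -log P(c_i | p_i)
    loss = F.cross_entropy(act_logits.unsqueeze(0), c.view(1))
    if c.item() >= 1:
        # completeness term: -log P(b_i | c_i, p_i), a binary score per class
        p_complete = torch.sigmoid(comp_logits[c - 1])
        loss = loss - torch.log(p_complete if b.item() == 1 else 1.0 - p_complete)
        if b.item() == 1:
            # regression term, only for complete foreground proposals
            mu = reg_out.view(-1, 2)[c - 1]
            loss = loss + lam * F.smooth_l1_loss(mu, reg_target)
    return loss

# example with random head outputs (21 activity logits, 20 completeness
# scores, 40 regression outputs), class 5, complete proposal
act, comp, reg = torch.randn(21), torch.randn(20), torch.randn(40)
loss = ssn_loss(act, comp, reg, torch.tensor(5), torch.tensor(1),
                torch.tensor([0.1, -0.2]))
print(loss.item())
```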
**Summary**:
This paper has three highlights:
1. Parallelism: the network processes proposals in parallel, which shortens processing time on GPUs
2. Temporal structure modeling and regression: each proposal is given an explicit stage structure, so the completeness of a proposal can be judged and its temporal bounds refined
3. Reduced computational complexity: two tricks are used, removing video redundancy by sampling frames and removing proposal redundancy