## Structured segment network

### **Keywords**: action detection in video; computational complexity reduction; structured proposals

**Abstract**: The method uses a temporal actionness grouping (TAG) scheme to generate accurate proposals, a structured temporal pyramid to model the temporal structure of each action instance (addressing the problem that detected actions are often incomplete), two classifiers to determine the class and the completeness of each proposal, and a per-category regressor to refine the temporal boundaries. In this paper, Yue Zhao et al. mainly tackle two problems in video action detection: high computational cost, which they reduce by sampling video frames and removing redundant proposals, and the lack of action stage modeling.

**Model**:

1. Generate proposals: find contiguous temporal regions with mostly high actionness, $P = \{ p_i = [s_i, e_i] \}_{i=1}^{N}$.
2. Split each proposal into three stages (start, course, and end): first extend the proposal to twice its length, symmetrically about its center. The course stage is the original proposal; the start and end stages are the left and right parts of the augmented proposal that lie outside the original one.
3. Build a temporal pyramid representation for each stage: sample $L$ snippets from the augmented proposal, apply the two-stream feature extractor to each of them, and pool the features within each stage.
4. Build a global representation for each proposal by concatenating the stage-level representations.
5. Use the global representation of each proposal as the input to the classifiers.

* input: $\{S_t\}_{t=1}^{T}$, a sequence of $T$ snippets representing the video; each snippet consists of the frames plus an optical flow stack
* network: two linear classifiers, $L$ two-stream feature extractors, and several pooling layers
* output: the category, the completeness, and the boundary refinement for each proposal
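Steps 2 to 4 of the model can be illustrated with a minimal pure-Python sketch. It rests on several assumptions that are not in the summary: the function names (`split_stages`, `pyramid_pool`, `global_representation`) are hypothetical, average pooling stands in for the pooling layers, the course stage uses a two-level pyramid while start and end use a single level, and snippets are selected by timestamp rather than by sampling exactly $L$ of them. Plain lists of numbers stand in for two-stream features.

```python
def split_stages(s, e):
    """Split proposal [s, e] into start / course / end intervals.

    The proposal is extended to twice its length, symmetrically about
    its center, so each side grows by half of the original duration.
    """
    d = e - s
    return {"start": (s - d / 2, s), "course": (s, e), "end": (e, e + d / 2)}


def mean_pool(rows):
    """Average a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pyramid_pool(rows, levels):
    """Pool features at each pyramid level.

    Level k cuts the snippets into k contiguous parts and averages each part;
    the pooled vectors are concatenated in order.
    """
    n = len(rows)
    if n < max(levels):
        raise ValueError("not enough snippets in this stage for the pyramid")
    pooled = []
    for k in levels:
        for j in range(k):
            part = rows[(j * n) // k:((j + 1) * n) // k]
            pooled.extend(mean_pool(part))
    return pooled


def global_representation(times, feats, s, e, course_levels=(1, 2)):
    """Concatenate stage-level representations for proposal [s, e].

    times[i] is the timestamp of snippet i, feats[i] its feature vector.
    """
    rep = []
    for name, (lo, hi) in split_stages(s, e).items():
        rows = [f for t, f in zip(times, feats) if lo <= t < hi]
        levels = course_levels if name == "course" else (1,)
        rep.extend(pyramid_pool(rows, levels))
    return rep


# Toy usage: 40 snippets, 2-dimensional stand-in features.
times = list(range(40))
feats = [[float(t), 1.0] for t in times]
rep = global_representation(times, feats, s=10, e=20)
print(len(rep))  # 5 pooled vectors of dimension 2
```

With these settings the global representation has one pooled vector for the start stage, three for the course stage (one whole-stage average plus two halves), and one for the end stage.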
https://i.imgur.com/thM9oWz.png

**Training**:

* joint loss for the classifiers: $L_{cls} = -\log\big(P(c_i \mid p_i) \cdot P(b_i \mid c_i, p_i)\big)$, where $c_i$ is the class label and $b_i$ the completeness label of proposal $p_i$
* loss for location regression: $\lambda \cdot \mathbf{1}(c_i \geq 1, b_i = 1)\, L_{reg}(u_i, \varphi_i; p_i)$, which is active only for complete proposals of a non-background class

**Summary**: This paper has three highlights:

1. Parallelism: the network structure lets proposals be processed in parallel, which shortens the processing time on a GPU.
2. Temporal structure modeling and regression: each proposal is given an explicit stage structure, so that the completeness of proposals can be determined.
3. Reduced computational complexity, through two tricks: removing video redundancy by sampling frames, and removing redundant proposals.
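The training objective from the Training section can be written out for a single proposal as a rough sketch. Several things here are assumptions rather than details from the summary: the function name `proposal_loss` is hypothetical, the class probability comes from a softmax over class logits, the completeness probability comes from a sigmoid over a per-class logit, smooth L1 is used as the regression loss, and the default value of `lam` ($\lambda$) is arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def smooth_l1(diffs):
    """Smooth L1 penalty summed over the regression residuals (assumed choice)."""
    return sum(0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5 for d in diffs)


def proposal_loss(class_logits, completeness_logits, reg_pred, reg_target,
                  c, b, lam=1.0):
    """Joint classification loss plus gated regression loss for one proposal.

    c is the class label (0 = background), b is the completeness label (0 or 1).
    """
    p_class = softmax(class_logits)[c]                # P(c | p)
    p_complete = sigmoid(completeness_logits[c])      # P(b = 1 | c, p)
    p_b = p_complete if b == 1 else 1.0 - p_complete  # P(b | c, p)
    l_cls = -math.log(p_class * p_b)

    # The indicator 1(c >= 1, b = 1) switches the regression term on only
    # for complete proposals of a non-background class.
    if c >= 1 and b == 1:
        l_reg = smooth_l1([p - t for p, t in zip(reg_pred, reg_target)])
    else:
        l_reg = 0.0
    return l_cls + lam * l_reg


# Toy usage with uniform logits over 3 classes.
loss_fg = proposal_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                        reg_pred=[0.5, 2.0], reg_target=[0.0, 0.0], c=1, b=1)
loss_bg = proposal_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                        reg_pred=[0.5, 2.0], reg_target=[0.0, 0.0], c=0, b=0)
print(loss_fg, loss_bg)
```

In the toy call the two proposals have the same classification loss; only the complete foreground proposal also pays the regression penalty.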