## Segment CNN (S-CNN)

**Summary**: this paper uses a 3-stage 3D CNN to identify candidate proposals, recognize actions, and localize temporal boundaries.

**Model**: the network can be divided into three parts: proposal generation, proposal selection with temporal-boundary refinement, and NMS to remove redundant proposals (the window generation and NMS steps are sketched after this note).

1. Generate multi-scale segments (16, 32, 64, 128, 256, 512 frames) using a sliding window with 75% overlap. High computational complexity!
2. Network: each stage of the three-stage network is a 3D ConvNet followed by 3 FC layers.
   * The proposal network is essentially a binary classifier that judges whether each segment contains an action or not.
   * The classification network classifies each proposal that the proposal network considers valid into background or one of K action categories.
   * The localization network functions as a scoring system that raises the scores of proposals with high overlap with the corresponding ground truth while decreasing the others.
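To make the candidate-generation step concrete, here is a minimal sketch of multi-scale sliding-window segment generation with 75% overlap; the function and variable names are mine, not from the paper:

```python
def sliding_window_segments(num_frames, scales=(16, 32, 64, 128, 256, 512), overlap=0.75):
    """Generate (start, end) frame segments at multiple scales.

    With 75% overlap, the window advances by 25% of its length each step,
    which is why the number of candidates (and the compute cost) is high.
    """
    segments = []
    for window in scales:
        stride = max(1, int(window * (1 - overlap)))
        for start in range(0, num_frames - window + 1, stride):
            segments.append((start, start + window))
    return segments

# e.g. a 30-second clip at 25 fps already yields several hundred candidates
print(len(sliding_window_segments(750)))
```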
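And a minimal sketch of the 1-D (temporal) NMS used to remove redundant proposals; the IoU threshold value and the proposal/score format are my assumptions:

```python
def iou_1d(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring proposal, drop overlapping ones, repeat."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_1d(proposals[i], proposals[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [proposals[i] for i in keep]
```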
## Temporal Unit Regression Network (TURN)

**Keywords**: temporal action proposal; computational efficiency

**Summary**: In this paper, Jiyang Gao et al. design a proposal generation and refinement network with high computational efficiency by reusing unit features in coordinated regression and classification networks. In addition, a new metric for temporal proposals, AR-F, is proposed to meet two criteria:
1. efficiently evaluate different methods on the same dataset;
2. evaluate the same method's performance across several datasets (generalization capability).

**Model**:
* Decompose the video and extract features to form a clip pyramid:
  1. A video is first decomposed into short units, where each unit has 16/32 frames.
  2. Each unit's feature is extracted using a C3D/two-stream CNN model.
  3. Several units' features are mean-pooled to compose a clip-level feature. To provide context and adapt to actions of different lengths, the clip-level feature also concatenates surrounding (context) features, and clips are scaled to different lengths by concatenating more or fewer units. The feature for a clip is
  $f_c = P(\{u_j\}_{s_u-n_{ctx}}^{s_u}) \,\|\, P(\{u_j\}_{s_u}^{e_u}) \,\|\, P(\{u_j\}_{e_u}^{e_u+n_{ctx}})$
  where $P$ denotes mean pooling and $\|$ denotes concatenation (a pooling sketch follows this note).
  4. For each proposal in the clip pyramid, a classifier judges whether the proposal contains an action, and a regressor provides offsets to refine the proposal's temporal boundary.
  5. Finally, during prediction, NMS is used to remove redundant proposals, providing high accuracy without changing the recall rate.

![TURN architecture](https://i.imgur.com/zqvHOxj.png)

**Training**: There are two outputs to optimize: the classification result and the regression offsets. Intuitively, the distance between a proposal and its corresponding ground truth should be measured. The authors use an L1 loss for the regressor, applied only to the positive proposals. The total loss is:
$L = \lambda L_{reg} + L_{cls}$
$L_{reg} = \frac{1}{N_{pos}}\sum_{i=1}^{N} l_i^{*}\left(|o_{s,i} - o_{s,i}^{*}| + |o_{e,i} - o_{e,i}^{*}|\right)$
where $l_i^{*}$ indicates whether proposal $i$ is positive and $o^{*}$ are the ground-truth offsets (a loss sketch also follows this note). During training, the ratio of positive to negative samples is set to 1:10, and each positive proposal is matched to the ground truth with which it has the highest IoU, or with which it has an IoU above 0.5.

**Results**:
1. Computational complexity: 880 FPS using C3D features on a TITAN X GPU, and 260 FPS using flow CNN features on the same machine.
2. Accuracy: mAP@0.5 = 25.6% on THUMOS14.

**Conclusion**: TURN generates proposals by creating candidates at each unit at different scales and then using regression to refine the boundaries. *However, there are many redundant proposals per unit, an unnecessary waste of computing resources; proposals are generated with pre-defined lengths, which restricts adaptivity to actions of different lengths; and proposals are generated at the unit level, which suffers from a granularity problem.*
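A minimal sketch of the context-concatenated clip feature $f_c$ above, assuming per-unit features are given as a NumPy array; names like `clip_feature` are mine:

```python
import numpy as np

def clip_feature(unit_feats, s_u, e_u, n_ctx):
    """Mean-pool units inside [s_u, e_u) plus n_ctx context units on each
    side, then concatenate: f_c = P(left ctx) || P(clip) || P(right ctx).

    unit_feats: (num_units, dim) array of per-unit C3D/two-stream features.
    """
    def pool(a, b):
        a, b = max(0, a), min(len(unit_feats), b)
        if a >= b:                         # context falls outside the video
            return np.zeros(unit_feats.shape[1])
        return unit_feats[a:b].mean(axis=0)

    return np.concatenate([
        pool(s_u - n_ctx, s_u),   # left context
        pool(s_u, e_u),           # the clip itself
        pool(e_u, e_u + n_ctx),   # right context
    ])
```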
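And a sketch of the boundary refinement and the positives-only L1 regression loss $L_{reg}$ as written above; the array layout is my assumption:

```python
import numpy as np

def refine_boundaries(starts, ends, offset_s, offset_e):
    """Apply predicted offsets to shift proposal boundaries (in units)."""
    return starts + offset_s, ends + offset_e

def regression_loss(offsets, gt_offsets, is_positive):
    """L_reg = (1/N_pos) * sum_i l_i* (|o_s - o_s*| + |o_e - o_e*|).

    offsets, gt_offsets: (N, 2) arrays of (start, end) offsets.
    is_positive: (N,) 0/1 array -- only positive proposals contribute.
    """
    l1 = np.abs(offsets - gt_offsets).sum(axis=1)  # |o_s - o_s*| + |o_e - o_e*|
    n_pos = max(1, int(is_positive.sum()))
    return float((is_positive * l1).sum() / n_pos)
```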
## Boundary Sensitive Network (BSN)

**Keywords**: action detection in video; accurate proposals

**Summary**: In order to generate precise temporal boundaries and improve recall with fewer proposals, Tianwei Lin et al. use BSN, which first combines temporal boundaries with high probability to form proposals, and then selects proposals by evaluating whether each one contains an action (confidence score + boundary probabilities).

**Model**:
1. Video feature encoding: a two-stream extractor forms the input of BSN:
   $F = \{f_{t_n}\}_{n=1}^{l_s} = \{(f_{S,t_n}, f_{T,t_n})\}_{n=1}^{l_s}$
2. BSN:
   * Temporal evaluation: from the input feature sequence, a 3-layer CNN with three sigmoid output filters generates start, end, and actionness probability sequences.
   * Proposal generation: (1) combine boundary locations with high start/end probability, or at a probability peak, to form proposals; (2) use the actionness probabilities to generate a proposal feature for each proposal by sampling the actionness sequence over the proposal region (both steps are sketched after this note).
   * Proposal evaluation: a 1-hidden-layer perceptron evaluates a confidence score from the proposal feature. A proposal is $\varphi = (t_s, t_e, p_{conf}, p_{t_s}^{s}, p_{t_e}^{e})$, where $p_{t_s}^{s}$ is the start probability, $p_{t_e}^{e}$ is the end probability, and $p_{conf}$ is the confidence score.

![BSN architecture](https://i.imgur.com/VjJLQDc.png)

**Training**:
* **Learning to generate probability curves**: the loss of the temporal evaluation module is
  $L_{TEM} = \lambda L^{action} + L^{start} + L^{end}$
  where each term is a weighted binary logistic loss (the weights balance the positive and negative locations within the window):
  $L = -\frac{1}{l_w}\sum_{i=1}^{l_w}\left(\frac{l_w}{\sum_i b_i}\, b_i \log(p_i) + \frac{l_w}{l_w - \sum_i b_i}\,(1-b_i)\log(1-p_i)\right)$, with $b_i = sign(g_i - \theta_{IoP})$ binarized to $\{0, 1\}$.
  Thus, if a start region is highly overlapped with the ground truth, the start-point probability should increase to lower the loss; after training, ground-truth region information is leveraged to predict accurate start probabilities. The same rule applies to the actionness and end probabilities.
* **Learning to choose the right proposals**: to select proposals by confidence score, the confidence score must be pushed to match the IoU between the proposal and its ground truth:
  $L_p = \frac{1}{N_{train}}\sum_{i=1}^{N_{train}}(p_{conf,i} - g_{iou,i})^2$
  $N_{train}$ is the number of training proposals, with a positive-to-negative ratio of 1:2, and $g_{iou,i}$ is the $i$-th proposal's overlap with its corresponding ground truth. During testing and prediction, the final confidence score for each proposal, $p_f = p_{conf}\, p_{t_s}^{s}\, p_{t_e}^{e}$, is used to rank and suppress proposals with Gaussian-decaying soft-NMS. Thus, after training, the confidence score should reveal the IoU between the proposal and its corresponding ground truth based on the proposal feature generated from the actionness probabilities, and the final proposals are obtained by ranking the final confidence score.

**Conclusion**: Unlike segment proposals, or using an RNN to decide where to look next, this paper generates proposals from boundary probabilities and selects them with a confidence score -- the predicted IoU between the proposal and its corresponding ground truth. With sufficient data, it can provide correct boundary probabilities and confidence scores, and the highlight of the paper is that it can be very accurate within the feature sequence. *However, it only samples part of the video for the feature sequence, so it may jump over a boundary point; with an accurate policy to decide where to sample, accuracy should be further boosted.*
* **Computational complexity**: within this network, computation includes (1) the two-stream feature extractor over video samples, (2) the probability-generation module (a 3-layer CNN over the feature sequence), (3) proposal generation by boundary combination, (4) the sampler that generates proposal features, and (5) the 1-hidden-layer perceptron that produces confidence scores. The major cost should be attributed to the feature extractor (1) and, if many proposals are generated, to the proposal-related modules (3, 4).

**Performance**: when combined with the SCNN classifier, it reaches mAP@0.5 = 36.9% on the THUMOS14 dataset.
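A minimal sketch of the boundary-pairing step and the final score $p_f = p_{conf}\,p_{t_s}^{s}\,p_{t_e}^{e}$; the peak/threshold selection rule and the maximum duration are my assumptions, and `score_proposal` stands in for BSN's learned proposal-evaluation module:

```python
def candidate_locations(prob, thresh=0.5):
    """Locations that are local peaks of the probability curve, or that
    exceed a high threshold (the two selection rules described above)."""
    locs = []
    for t in range(len(prob)):
        is_peak = 0 < t < len(prob) - 1 and prob[t] > prob[t - 1] and prob[t] > prob[t + 1]
        if is_peak or prob[t] > thresh:
            locs.append(t)
    return locs

def generate_proposals(start_prob, end_prob, score_proposal, max_len=100):
    """Pair every candidate start with every later candidate end, then
    score each pair: p_f = p_conf * p_start * p_end."""
    proposals = []
    for ts in candidate_locations(start_prob):
        for te in candidate_locations(end_prob):
            if ts < te <= ts + max_len:
                p_conf = score_proposal(ts, te)  # 1-hidden-layer MLP in BSN
                p_f = p_conf * start_prob[ts] * end_prob[te]
                proposals.append((ts, te, p_f))
    return proposals
```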
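And a sketch of the Gaussian-decaying soft-NMS used at test time: instead of deleting overlapping proposals outright, their scores are decayed by $\exp(-iou^2/\sigma)$. The `sigma` value and the interval format are my assumptions:

```python
import math

def soft_nms_gaussian(proposals, sigma=0.5):
    """proposals: list of (start, end, score). Returns the rescored list.

    At each step the top-scoring proposal is kept, and every remaining
    proposal's score is decayed by exp(-iou^2 / sigma)."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    rest = list(proposals)
    kept = []
    while rest:
        rest.sort(key=lambda p: p[2], reverse=True)
        best = rest.pop(0)
        kept.append(best)
        rest = [(s, e, sc * math.exp(-iou((s, e), best[:2]) ** 2 / sigma))
                for s, e, sc in rest]
    return kept
```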
## Structured Segment Network (SSN)

**Keywords**: action detection in video; computational complexity reduction; structured proposals

**Abstract**: SSN uses a temporal actionness grouping scheme (TAG) to generate accurate proposals, a structured temporal pyramid to model the temporal structure of each action instance (tackling the issue that detected actions are incomplete), two classifiers to determine class and completeness, and a per-category regressor to further refine the temporal bounds. In this paper, Yue Zhao et al. mainly tackle the high computational complexity of video detection (by sampling video frames and removing redundant proposals) and the lack of action-stage modeling.

**Model**:
1. Generate proposals: find continuous temporal regions with mostly high actionness, $P = \{p_i = [s_i, e_i]\}_{i=1}^{N}$.
2. Split each proposal into 3 stages (start, course, end): first augment the proposal to twice its length, symmetrically about its center; the course stage is the original proposal, while the start and end stages are the left and right parts of the difference between the augmented and original intervals.
3. Build a temporal pyramid representation for each stage: L snippets are sampled from the augmented proposal, a two-stream feature extractor is applied to each, and features are pooled per stage (see the sketch after this note).
4. Build a global representation for each proposal by concatenating the stage-level representations.
5. The global representation of each proposal is the input to the classifiers.

* Input: $\{S_t\}_{t=1}^{T}$, a sequence of T snippets representing the video; each snippet = the frames + an optical-flow stack.
* Network: two linear classifiers, L two-stream feature extractors, and several pooling layers.
* Output: category, completeness, and boundary refinement for each proposal.

![SSN architecture](https://i.imgur.com/thM9oWz.png)

**Training**:
* Joint loss for the classifiers: $L_{cls} = -\log\left(P(c_i \mid p_i)\, P(b_i \mid c_i, p_i)\right)$
* Loss for location regression: $\lambda \cdot 1(c_i \geq 1, b_i = 1)\, L(u_i, \varphi_i; p_i)$

**Summary**: This paper has three highlights:
1. Parallelism: it uses a parallel network structure where proposals can be processed in parallel, which shortens processing time on a GPU.
2. Temporal structure modeling and regression: each proposal is given a stage structure so that the completeness of proposals can be assessed.
3. Reduced computational complexity via two tricks: removing video redundancy by sampling frames, and removing proposal redundancy.
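A minimal sketch of the stage split and a simple two-level temporal pyramid pooling over snippet features, under my assumptions about the feature layout; SSN's actual pyramid configuration may differ:

```python
import numpy as np

def split_stages(s, e):
    """Augment [s, e) to twice its length about the center, then split
    into start / course / end stages."""
    half = (e - s) / 2.0
    aug_s, aug_e = s - half, e + half          # 2x symmetric augmentation
    return (aug_s, s), (s, e), (e, aug_e)      # start, course, end

def pyramid_pool(snippet_feats):
    """Two-level temporal pyramid: a global mean pool plus mean pools of
    the two halves, concatenated. snippet_feats: (L, dim), L >= 2."""
    mid = len(snippet_feats) // 2
    return np.concatenate([
        snippet_feats.mean(axis=0),            # level 1: whole stage
        snippet_feats[:mid].mean(axis=0),      # level 2: first half
        snippet_feats[mid:].mean(axis=0),      # level 2: second half
    ])

def proposal_representation(feats_by_stage):
    """Concatenate stage-level pyramid features into the global vector
    that feeds the classifiers."""
    return np.concatenate([pyramid_pool(f) for f in feats_by_stage])
```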
## End-to-end action detection from frame glimpses

**Keywords**: RNN; serialized model; non-differentiable backpropagation; action detection in video

**Abstract**: This paper uses an end-to-end model -- a recurrent neural network trained by REINFORCE -- to directly predict the temporal bounds of actions. The intuition is that people observe moments in a video and decide where to look next to predict when an action occurs. After training, Serena Yeung et al. achieve state-of-the-art results while observing only 2% of the video frames.

**Model**: To take a long video and output all instances of a given action, the model has two parts: an observation network and a recurrent network.
* Observation network: encodes the visual representation of a video frame.
  * Input: $l_n$ -- the normalized location of the frame -- plus the frame $v_{l_n}$.
  * Network: fc7 features of a fine-tuned VGG16.
  * Output: $o_n$, a 1024-dimensional vector encoding both time and frame features.
* Recurrent network: sequentially processes the visual representations and decides where to watch next and whether to emit a detection. For each timestep:
  * Input: $o_n$ -- the representation of the frame -- plus the previous state $h_{n-1}$.
  * Network: $d_n = f_d(h_n; \theta_d)$ and $p_n = f_p(h_n; \theta_p)$, where $f_d$ is an FC layer and $f_p$ is an FC layer + sigmoid.
  * Output: $d_n = (s_n, e_n, c_n)$ as the candidate detection, where $s_n$ and $e_n$ are the start and end of the detection and $c_n$ is the confidence level; $p_n$, indicating whether $d_n$ is a valid detection; and $l_{n+1}$, where to observe next. All of these parameters fall in [0, 1].

![Model architecture](https://i.imgur.com/SeFianV.png)

**Training**: To learn from the supervision annotations in long videos and handle the non-differentiable components, the authors use backpropagation to train $d_n$ and REINFORCE to train $p_n$ and $l_{n+1}$ (see the sketch after this note).
* For $d_n$: $L(D) = \sum_n L_{cls}(d_n) + \sum_n \sum_m 1[y_{mn} = 1]\, L_{loc}(d_n, g_m)$
* For $p_n, l_{n+1}$: the expected reward $J(\theta) = \sum_{a \in A} p(a)\, r(a)$, where $p(a)$ is the distribution over action sequences and $r(a)$ is the reward for a sequence; training maximizes this.

**Summary**: This paper uses a serialized model that first extracts a feature from each frame, then uses the frame feature and the previous state to generate the next observation location, a candidate detection, and a detection indicator. Specifically, to exploit previous information, it uses an RNN to store state and REINFORCE to train $p_n$ and $l_n$, where the goal is to maximize the reward of an action sequence; Monte Carlo sampling is used to numerically estimate the gradient of this high-dimensional function.

**Questions**:
1. Why are $p_n$ and $l_n$ non-differentiable components?
2. If $p_n$ and $l_n$ are indeed non-differentiable, how do we arrive at REINFORCE to compute the gradient?
3. Why don't we get $p_n$ from $p_n = f_p(h_n, \theta_p)$ directly, but rather use $f_p$ as the parameter of a Bernoulli distribution at training time? A similar question applies to the calculation of $l_{n+1}$.
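A minimal PyTorch sketch of the REINFORCE (score-function) gradient for the Bernoulli emission decision $p_n$; the reward value, baseline, network shapes, and names are my assumptions, and the same trick applies to the location output $l_{n+1}$:

```python
import torch

# Hypothetical shapes: h_n is the RNN hidden state; f_p is FC + sigmoid.
h_n = torch.randn(1, 256)
f_p = torch.nn.Sequential(torch.nn.Linear(256, 1), torch.nn.Sigmoid())

prob = f_p(h_n)                           # p_n = f_p(h_n; theta_p), in (0, 1)
dist = torch.distributions.Bernoulli(probs=prob)
emit = dist.sample()                      # non-differentiable sampling step

reward = torch.tensor(1.0)                # stand-in for the sequence reward r(a)
baseline = 0.5                            # variance-reduction baseline (assumed)

# REINFORCE: grad J ~ (r - b) * grad log p(a); minimize the negative.
loss = -(reward - baseline) * dist.log_prob(emit).sum()
loss.backward()                           # gradients flow into f_p's weights
```

Because the sampled decision is discrete, gradients cannot flow through `emit` itself; the score-function estimator sidesteps this by differentiating the log-probability of the sampled action instead, which is why training samples from a Bernoulli parameterized by $f_p$ rather than using $p_n$ directly.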