## Temporal unit regression network

keyword: temporal action proposal; computing efficiency

**Summary**: In this paper, Jiyang Gao et al. design a proposal generation and refinement network with high computational efficiency by reusing unit features across coordinated regression and classification networks. In addition, a new metric for temporal proposals, called AR-F, is proposed to meet two criteria:

1. evaluate different methods on the same dataset efficiently;
2. evaluate the same method's performance across several datasets (generalization capability).

**Model**:

* decompose the video and extract features to form a clip pyramid:

1. A video is first decomposed into short units, where each unit has 16 or 32 frames.
2. Each unit's feature is extracted using a C3D or two-stream CNN model.
3. Several units' features are average-pooled to compose a clip-level feature. To provide context and to adapt to actions of different lengths, the clip-level feature is also concatenated with the features of the surrounding units, and clips of different lengths are formed by grouping more or fewer units. The feature for a clip is $f_c = P(\{u_j\}_{s_u-n_{ctx}}^{s_u})||P(\{u_j\}_{s_u}^{e_u})||P(\{u_j\}_{e_u}^{e_u+n_{ctx}})$, where $P$ is the pooling function, $||$ denotes concatenation, $s_u$ and $e_u$ are the clip's start and end units, and $n_{ctx}$ is the number of context units.
4. For each proposal in the pyramid, a classifier judges whether the proposal contains an action, and a regressor provides offsets that refine the proposal's temporal boundaries.
5. Finally, during prediction, non-maximum suppression (NMS) is used to remove redundant proposals, which provides high accuracy without changing the recall rate.

https://i.imgur.com/zqvHOxj.png

**Training**: There are two outputs to optimize: the classification result and the regression offsets. Intuitively, the distance between a proposal and its corresponding ground truth should be measured. In this paper, the authors use the L1 distance for the regressor, applied only to the positive proposals.
The total loss is measured as follows:

$L = \lambda L_{reg}+L_{cls}$

$L_{reg} = \frac{1}{N_{pos}}\sum_{i = 1}^N l_i^*|(o_{s,i} - o_{s,i}^*)+(o_{e,i} - o_{e,i}^*)|$

Here $o_{s,i}$ and $o_{e,i}$ are the predicted start and end offsets of sample $i$, starred symbols are the ground-truth values, $l_i^*$ is 1 for positive samples and 0 otherwise, and $N_{pos}$ is the number of positive samples.

During training, the ratio between positive and negative samples is set to 1:10. For each positive proposal, its ground truth is the segment with which it has the highest IoU, or one with which its IoU exceeds 0.5.

**Result**:

1. Computational efficiency: 880 FPS using the C3D feature on a TITAN X GPU, and 260 FPS using the flow CNN feature on the same machine.
2. Accuracy: mAP@0.5 = 25.6% on THUMOS14.

**Conclusion**: This paper generates proposals by placing candidates of different scales at each unit and then using regression to refine their boundaries. *However, there are many redundant proposals for each unit, which is an unnecessary waste of computing resources; also, proposals are generated with predefined lengths, which restricts adaptivity to actions of different lengths; finally, proposals are generated at the unit level, which leads to a granularity problem.*
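
To make the clip-level feature $f_c$ from the Model section concrete, here is a minimal NumPy sketch. It is not the authors' code: the function names, the half-open index convention (`s_u` inclusive, `e_u` exclusive), and the zero-filling of context that falls outside the video are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(units, dim):
    """Average-pool a (k, dim) stack of unit features; return zeros if k == 0."""
    if len(units) == 0:
        return np.zeros(dim, dtype=float)
    return units.mean(axis=0)


def clip_feature(unit_feats, s_u, e_u, n_ctx):
    """Build f_c = P(left context) || P(clip units) || P(right context).

    unit_feats: (num_units, dim) array of per-unit features.
    s_u, e_u:   start (inclusive) and end (exclusive) unit indices (assumed convention).
    n_ctx:      number of context units taken on each side.
    """
    unit_feats = np.asarray(unit_feats, dtype=float)
    num_units, dim = unit_feats.shape
    left = unit_feats[max(0, s_u - n_ctx):s_u]
    inner = unit_feats[s_u:e_u]
    right = unit_feats[e_u:min(num_units, e_u + n_ctx)]
    return np.concatenate(
        [mean_pool(left, dim), mean_pool(inner, dim), mean_pool(right, dim)]
    )


# Toy example: 6 units with 2-dimensional features.
feats = np.arange(12).reshape(6, 2)
f_c = clip_feature(feats, s_u=2, e_u=4, n_ctx=1)
```

Because the unit features are computed once and only pooled again for every clip, the expensive CNN pass is shared across all scales, which is the reuse the summary credits for the method's speed.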
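
The regression loss from the Training section can be sketched as follows. This is a hedged illustration that follows the formula as written in this summary, with a single absolute value over the summed start and end errors; the function names and the array layout are assumptions, and the value of $\lambda$ is left to the caller because the summary does not state it.

```python
import numpy as np


def regression_loss(pred_offsets, gt_offsets, labels):
    """L_reg as written in the summary: mean over positives of |start_err + end_err|.

    pred_offsets, gt_offsets: (N, 2) arrays of (start, end) offsets.
    labels: (N,) array, 1 for positive samples and 0 for negatives.
    """
    pred = np.asarray(pred_offsets, dtype=float)
    gt = np.asarray(gt_offsets, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n_pos = labels.sum()
    if n_pos == 0:
        return 0.0  # no positive samples, nothing to regress
    diff = pred - gt
    per_sample = np.abs(diff[:, 0] + diff[:, 1])
    return float((labels * per_sample).sum() / n_pos)


def total_loss(l_cls, l_reg, lam):
    """L = lambda * L_reg + L_cls."""
    return lam * l_reg + l_cls


pred = [[1.0, 2.0], [0.5, -0.5], [3.0, 3.0]]
gt = [[0.5, 1.0], [0.0, 0.0], [0.0, 0.0]]
labels = [1, 1, 0]  # the third sample is a negative and is ignored
l_reg = regression_loss(pred, gt, labels)
```

In this toy input the second sample's start and end errors have opposite signs and cancel inside the single absolute value. A per-coordinate L1 loss, $|o_{s,i} - o_{s,i}^*| + |o_{e,i} - o_{e,i}^*|$, would not cancel; which form an actual implementation uses is not stated in this summary.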
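
Both the positive-sample assignment and the NMS step rely on temporal IoU between segments. The sketch below shows a standard greedy temporal NMS in plain Python; it is a generic version of the technique, not taken from the paper, and the `iou_threshold` parameter and the `(start, end)` tuple representation are assumptions.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(proposals, scores, iou_threshold):
    """Greedy NMS: keep the highest-scoring proposals, dropping overlapping ones.

    Returns the indices of the kept proposals, in descending score order.
    """
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep proposal i only if it does not overlap too much with any kept one.
        if all(temporal_iou(proposals[i], proposals[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


proposals = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = temporal_nms(proposals, scores, iou_threshold=0.5)
```

Here the second proposal overlaps the first with IoU 9/11 and is suppressed, while the third does not overlap and is kept.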