### **Keywords**: RNN; sequential model; non-differentiable components trained with REINFORCE; action detection in video

**Abstract**: This paper proposes an end-to-end model, a recurrent neural network trained with REINFORCE, that directly predicts the temporal bounds of actions. The intuition is that people observe a few moments of a video and decide where to look next to figure out when an action is occurring. After training, Yeung et al. achieve state-of-the-art results while observing only 2% of the video frames.

**Model**: To take a long video and output all instances of a given action, the model has two parts: an observation network and a recurrent network.

* observation network: encodes the visual representation of a video frame.
  * input: $l_n$ -- the normalized temporal location of the frame, plus the frame $v_{l_n}$
  * network: fc7 features of a fine-tuned VGG16 network
  * output: $o_n$, a 1024-dimensional vector encoding both the time and the frame's appearance
* recurrent network: sequentially processes the visual representations and decides where to watch next and whether to emit a detection (a minimal sketch of one timestep is given at the end of this note).

##### For each timestep:
* input: $o_n$ -- the representation of the current frame, plus the previous hidden state $h_{n-1}$
* network: $d_n = f_d(h_n; \theta_d)$ and $p_n = f_p(h_n; \theta_p)$, where $f_d$ is a fully-connected layer and $f_p$ is a fully-connected layer followed by a sigmoid
* output: $d_n = (s_n, e_n, c_n)$, the candidate detection, where $s_n$ and $e_n$ are its start and end and $c_n$ is its confidence; $p_n$, indicating whether $d_n$ is a valid detection; and $l_{n+1}$, where to observe next. All of these values lie in $[0,1]$.

![](https://i.imgur.com/SeFianV.png)

**Training**: To learn from the sparse annotations in long videos and to handle the non-differentiable components, the authors train $d_n$ with standard backpropagation and train $p_n$ and $l_{n+1}$ with REINFORCE (see the sketches after the questions).

* for $d_n$: $L(D) = \sum_n L_{cls}(d_n) + \sum_n \sum_m 1[y_{mn} = 1] L_{loc}(d_n, g_m)$
* for $p_n, l_{n+1}$: the expected reward $J(\theta) = \sum_{a \in A} p(a) r(a)$, where $p(a)$ is the distribution over action sequences and $r(a)$ is the reward of a sequence; training maximizes this quantity.

**Summary**: This paper uses a sequential model that first extracts a feature from each observed frame, then uses that feature together with the previous state to produce the next observation location, a candidate detection, and a detection indicator. To carry information across steps they use an RNN, and they train $p_n$ and $l_{n+1}$ with REINFORCE: the objective is to maximize the expected reward of an action sequence, and Monte Carlo sampling is used to numerically estimate the gradient of this high-dimensional expectation.

**Questions**:
1. Why are $p_n$ and $l_{n+1}$ non-differentiable components?
2. If $p_n$ and $l_{n+1}$ are indeed non-differentiable, how does REINFORCE let us compute a gradient for them?
3. At training time, why don't we take $p_n = f_p(h_n; \theta_p)$ directly, but instead use $f_p$'s output as the parameter of a Bernoulli distribution? A similar question applies to the computation of $l_{n+1}$.
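To make the per-timestep computation concrete, here is a minimal PyTorch sketch of the recurrent agent, mapping $(o_n, h_{n-1})$ to $(d_n, p_n, l_{n+1}, h_n)$. The GRU cell, the 1024-dimensional sizes, and the sigmoid on $d_n$ (to keep $(s_n, e_n, c_n)$ in $[0,1]$ as described above) are my assumptions, not necessarily the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    """One timestep of the recurrent network: (o_n, h_{n-1}) -> (d_n, p_n, l_{n+1}, h_n)."""

    def __init__(self, obs_dim=1024, hidden_dim=1024):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden_dim)   # assumed recurrent cell
        self.f_d = nn.Linear(hidden_dim, 3)          # d_n = (s_n, e_n, c_n)
        self.f_p = nn.Linear(hidden_dim, 1)          # prediction indicator p_n (fc + sigmoid)
        self.f_l = nn.Linear(hidden_dim, 1)          # next observation location l_{n+1}

    def forward(self, o_n, h_prev):
        h_n = self.rnn(o_n, h_prev)
        d_n = torch.sigmoid(self.f_d(h_n))           # assumed sigmoid so (s_n, e_n, c_n) lie in [0, 1]
        p_n = torch.sigmoid(self.f_p(h_n))           # Bernoulli parameter at training time
        l_next = torch.sigmoid(self.f_l(h_n))        # mean of the next-location distribution
        return d_n, p_n, l_next, h_n

# Usage: process one observation for a batch of size 1.
agent = GlimpseAgent()
o_n = torch.randn(1, 1024)                           # stand-in for the fc7-based observation o_n
h_prev = torch.zeros(1, 1024)
d_n, p_n, l_next, h_n = agent(o_n, h_prev)
```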
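The candidate-detection loss $L(D)$ can be written down directly from the formula in the training section. In this sketch I assume $L_{cls}$ is a binary cross-entropy on the confidence $c_n$ (positive iff the candidate is matched to some ground truth) and $L_{loc}$ is a squared error on the start/end boundaries, with the matching matrix $y$ given; the paper's exact loss terms and matching scheme may differ.

```python
import torch
import torch.nn.functional as F

def detection_loss(d, y, g):
    """
    L(D) = sum_n L_cls(d_n) + sum_n sum_m 1[y_nm = 1] L_loc(d_n, g_m)

    d : (N, 3) candidate detections (s_n, e_n, c_n), values in [0, 1]
    y : (N, M) 0/1 matrix with y[n, m] = 1 iff candidate d_n matches ground truth g_m
    g : (M, 2) ground-truth segments (start, end)
    """
    s_e, c = d[:, :2], d[:, 2]
    # Classification: confidence should be 1 iff the candidate matches some ground truth.
    target = (y.sum(dim=1) > 0).float()
    l_cls = F.binary_cross_entropy(c, target, reduction="sum")
    # Localization: squared error between matched candidate and ground-truth boundaries.
    n_idx, m_idx = torch.nonzero(y, as_tuple=True)
    l_loc = ((s_e[n_idx] - g[m_idx]) ** 2).sum()
    return l_cls + l_loc
```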
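Finally, a minimal sketch of the REINFORCE part with a single Monte-Carlo sample, which also hints at question 3: at training time $p_n$ is treated as the parameter of a Bernoulli distribution and the predicted location as the mean of a Gaussian, the sampled decisions are scored by the reward, and the log-likelihood-ratio trick turns that into a gradient despite the sampling step. The fixed standard deviation `sigma` and the scalar `baseline` are illustrative assumptions, not values from the paper.

```python
import torch
from torch.distributions import Bernoulli, Normal

def reinforce_loss(p_params, l_params, reward, sigma=0.1, baseline=0.0):
    """Surrogate loss whose gradient is the Monte-Carlo REINFORCE estimate.

    p_params : (N,) Bernoulli parameters p_n (emit a detection or not)
    l_params : (N,) predicted means for the next observation locations l_{n+1}
    reward   : scalar reward r(a) for the sampled action sequence
    sigma, baseline : assumed fixed location std-dev and reward baseline
    """
    p_dist = Bernoulli(probs=p_params)
    l_dist = Normal(loc=l_params, scale=sigma)

    p_sample = p_dist.sample()            # stochastic "emit or not" decisions
    l_sample = l_dist.sample()            # stochastic next locations

    # log pi(a) of the whole sequence; gradients flow into p_params and l_params.
    log_prob = (p_dist.log_prob(p_sample) + l_dist.log_prob(l_sample)).sum()

    # Minimizing -(r - b) * log pi(a) maximizes the expected reward J(theta).
    return -(reward - baseline) * log_prob

# Usage with dummy per-sequence values (N = 6 glimpses).
p_params = torch.rand(6, requires_grad=True)
l_params = torch.rand(6, requires_grad=True)
loss = reinforce_loss(p_params, l_params, reward=1.0)
loss.backward()                           # policy gradients for p_n and l_{n+1}
```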