In this race for squeezing out an extra few % improvement for a '*brand-new*' paper, this paper brings a breath of fresh air by posing some very pertinent questions backed by rigorous experimental analysis. It's an ICCV 2017 paper. The paper is about understanding activities in videos, from both the classification and the detection perspective. In doing so, the authors examine several datasets, evaluation metrics, and algorithms, and point to possible future directions worth exploring. The default dataset is Charades; MultiTHUMOS, THUMOS, and ActivityNet are used as and when required. The activity classification/detection algorithms analyzed are Two-Stream, improved dense trajectories (IDT), LSTM on VGG, ActionVLAD, and Temporal Fields.

The paper starts with the very definition of action. To quote: *"When we talk about activities, we are referring to anything a person is doing, regardless of whether the person is intentionally and actively altering the environment, or simply sitting still."* This is a perspective complementary to what the community has perceived as action so far, namely *"intentional bodily motion of biological agents"* [1]. The paper generalizes this notion and advocates that bodily motion is not indispensable to actionness (*e.g.*, 'watching the TV' or 'lying on a couch' hardly involves any bodily motion). The role of motion in understanding activity is analyzed at length later in the paper. Let's look at the major questions the authors explore.

1. "Only verbs" can make actions ambiguous. To quote: "Verbs such as 'drinking' and 'running' are unique on their own, but verbs such as 'take' and 'put' are ambiguous unless nouns and even prepositions are included: 'take medication', 'take shoes', 'take off shoes'". The experiments involving both humans (sec 3.1) and activity algorithms (sec 4.1) show that, given the verb, less confusion arises when the object is mentioned ('holding a cup' vs. 'holding a broom'), but given the object, confusion among different verbs is higher ('holding a cup' vs. 'drinking from a cup'). All the current algorithms show significant confusion among similar action categories, in terms of both verbs and objects. In fact, for a given category, the more categories share its object or verb, the worse the accuracy.

2. The next study, to me, is the most important one. It concerns the long-standing question of whether activities have clear and universal temporal boundaries. The human study shows that they are in fact ambiguous: average human agreement with the ground truth is only 72.5% IoU for Charades and 58.7% IoU for MultiTHUMOS. As a natural next step, the authors check whether this ambiguity affects the measured performance of the algorithms. For this purpose, they relax the ground-truth boundaries to be more flexible (sec 3.2) and then re-evaluate the algorithms (a minimal sketch of both the temporal-IoU agreement measure and the boundary relaxation appears after this list). The surprising fact is that this relaxation does not improve performance much. The authors opine that, despite boundary ambiguity, current datasets allow current algorithms to understand and learn from the temporal extent of activities. I must say, I did not expect ambiguity in temporal boundaries to have such an insignificant effect on localization performance. Beyond the conclusion drawn by the authors, another issue may be at play: the (bad) effect of other factors is so large that correcting for boundary ambiguity cannot change the performance much. What I mean is, it may not be that the datasets are sufficient; rather, the algorithms may be suffering from other flaws far more than from boundary ambiguity.

3. Another important question the authors deal with is how the amount of labeled training data affects performance. The broad finding agrees with the common knowledge that more data means better performance. However, the authors point out a plethora of finer, equally important insights. The amount of data does not affect all categories equally, especially in a dataset with a long-tailed distribution of classes: smaller categories are affected more. In addition, activities with more similar categories (those sharing the same object/verb) are also affected much more than their counterparts. The authors end the subsection (sec 4.2) with the observation that improvement can be made by designing algorithms that are better able to make use of the wealth of data in small categories than in large ones.

4. The authors do a thorough analysis of the role of temporal reasoning (motion, continuity, and temporal context) in activity understanding. The very first finding is that current methods do better on longer activities than on shorter ones. Another common notion, that naive temporal smoothing of the predictions helps improve localization and classification, is also verified (see the smoothing sketch after this list).

5. An action is almost invariably related to a person, so the authors investigate whether person-based reasoning helps. For that, they experiment with removing the person from the scene, keeping nothing but the person, and so on. They also examine how diverse the datasets are in terms of human pose, and whether injecting human pose information helps the current approaches. The conclusion is that person-based reasoning does help, and that the nature of the videos requires activity-understanding approaches to harness pose information for improved performance.

6. Finally, the authors examine which aspects would help most if solved perfectly by an oracle. The oracles include perfect object detection, perfect verb identification, and so on. It varies across datasets to some extent, but the general finding is that all the oracles help, some more, some less.
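To make point 2 concrete, here is a minimal sketch of temporal IoU and of a simple boundary-relaxation check in the spirit of sec 3.2. The segment representation, function names, and padding scheme are my own assumptions, not the paper's exact protocol.

```python
# A minimal sketch. Assumption: temporal segments are (start, end) tuples
# in seconds; the paper's exact matching protocol may differ.

def temporal_iou(a, b):
    """Intersection-over-union of two temporal segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0.0 else 0.0

def relax(segment, pad):
    """Pad a ground-truth segment by `pad` seconds on each side (clamped at 0)."""
    start, end = segment
    return (max(0.0, start - pad), end + pad)

def detected(pred, gt_segments, thresh=0.5, pad=0.0):
    """Does `pred` match any (optionally relaxed) ground-truth segment at `thresh` IoU?"""
    return any(temporal_iou(pred, relax(gt, pad)) >= thresh for gt in gt_segments)

# Human agreement: two annotators marking the same activity instance.
print(temporal_iou((12.0, 30.0), (15.0, 33.0)))  # ~0.71, near the 72.5% Charades figure

# Boundary relaxation: a borderline detection fails strictly but passes
# once each ground-truth boundary is padded by 3 seconds.
print(detected((17.0, 35.0), [(12.0, 30.0)], thresh=0.6, pad=0.0))  # False
print(detected((17.0, 35.0), [(12.0, 30.0)], thresh=0.6, pad=3.0))  # True
```

In this toy setup the relaxation flips a borderline detection from a miss to a hit; the paper's finding is that, averaged over real predictions, such flips barely move the overall numbers.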
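And a hedged sketch of the naive temporal smoothing mentioned in point 4: a moving average over per-frame class scores. The window size and the raw-score setup are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def smooth_scores(scores, window=9):
    """Moving-average smoothing of per-frame class scores.

    scores: (num_frames, num_classes) array. Returns an array of the same
    shape, with each class track averaged over a centered `window` of frames.
    """
    kernel = np.ones(window) / window
    # mode="same" keeps the number of frames unchanged (edges zero-padded).
    return np.stack([np.convolve(scores[:, c], kernel, mode="same")
                     for c in range(scores.shape[1])], axis=1)

# Toy example: an isolated spurious spike is damped, while a sustained
# activation survives, which is why smoothing tends to help localization.
rng = np.random.default_rng(0)
raw = rng.random((100, 3)) * 0.2       # background noise
raw[40:70, 1] += 0.8                   # a sustained activity of class 1
raw[10, 2] += 0.9                      # an isolated spurious spike for class 2
smoothed = smooth_scores(raw, window=9)
print(raw[10, 2] > 0.9, smoothed[10, 2] < 0.5)  # True True: the spike is damped
```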
I think this is a much-needed work that will help the community ponder the different avenues of activity understanding in videos and design better systems.

[1] Wei Chen, Caiming Xiong, Ran Xu, Jason J. Corso, "Actionness Ranking with Lattice Conditional Ordinal Random Fields," CVPR 2014.