Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition on ShortScience.org


Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition
Choi, Jinwoo and Gao, Chen and Messou, Joseph C. E. and Huang, Jia-Bin
Neural Information Processing Systems Conference - 2019 via Local Bibsonomy


Summary by ngthanhtinqn 2 years ago

This paper aims to mitigate scene bias in action recognition.

Scene bias is defined as the model relying on scene or object information to predict the action, without paying attention to the actual activity.

To mitigate this issue, the authors propose two additional losses:
(1) A scene adversarial loss, which encourages the network to learn features that are useful for action recognition but invariant to scene type, thereby reducing scene bias.
(2) A human mask confusion loss, which prevents the model from predicting the correct action label for a video when the person has been masked out. This mitigates scene bias because the model can no longer predict the correct action from the surrounding scene alone.
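As a rough illustration of loss (2), one way to penalize confident predictions on a human-masked clip is to use the negative entropy of the predicted action distribution, so that minimizing the loss pushes the prediction toward uniform. This is a minimal sketch in plain Python; treating the confusion loss as negative entropy is an assumption made here, and the function names `softmax` and `confusion_loss` are hypothetical, not taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confusion_loss(logits):
    """Negative entropy of the action prediction on a human-masked clip.

    Assumption: the confusion loss is modeled as negative entropy.
    It is smallest (-log K for K classes) when the prediction is uniform,
    i.e. when the model cannot tell the action from the scene alone.
    """
    probs = softmax(logits)
    return sum(p * math.log(p) for p in probs if p > 0.0)


# A confident prediction on a masked clip is penalized more than a uniform one.
uniform_loss = confusion_loss([0.0, 0.0, 0.0, 0.0])
confident_loss = confusion_loss([10.0, 0.0, 0.0, 0.0])
```

Under this sketch, a model that still predicts an action confidently after the person is removed receives a higher loss than one that is maximally unsure.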

https://i.imgur.com/BBfWE17.png

To mask out the person in the video, they run a human detector on the frames and then mask out the detected person regions.
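The masking step can be sketched as follows, assuming the detector returns axis-aligned bounding boxes per frame. The function `mask_humans`, the `(x1, y1, x2, y2)` box format, and the constant fill value are all hypothetical choices for illustration; the summary does not specify how the masked region is filled.

```python
def mask_humans(frame, boxes, fill=0):
    """Return a copy of `frame` with every detected person box filled in.

    frame: 2D list of pixel values (rows of columns), a stand-in for an image.
    boxes: list of (x1, y1, x2, y2) boxes, with x2 and y2 exclusive.
           Assumed to come from some human detector (not implemented here).
    """
    masked = [row[:] for row in frame]  # copy so the input frame is untouched
    height = len(masked)
    width = len(masked[0]) if height else 0
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the frame bounds before filling.
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                masked[y][x] = fill
    return masked


# Toy 3x4 "frame" with one detected person covering columns 1-2 of rows 0-1.
frame = [[1, 1, 1, 1] for _ in range(3)]
masked_frame = mask_humans(frame, [(1, 0, 3, 2)])
```

Applying this to every frame of a clip yields the human-masked video on which the confusion loss is computed.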

In the above diagram, there is a gradient reversal layer, which works as follows:

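In the forward pass, the layer acts as the identity: the output is equal to the input.

In the backward pass, the layer multiplies the incoming gradient by -1 (or by a negative scaling factor), so the gradient passed to the earlier layers has the opposite sign.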

https://i.imgur.com/hif9ZL9.png
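The behaviour of the gradient reversal layer can be sketched without any deep learning framework. The class below is a minimal, framework-free illustration; the class name `GradientReversal` and the scaling parameter `lam` are hypothetical, and in practice this would be written as a custom autograd operation in the framework being used.

```python
class GradientReversal:
    """Toy gradient reversal layer operating on plain Python lists."""

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient; lam=1.0 gives a plain sign flip.
        self.lam = lam

    def forward(self, x):
        # Forward pass: identity, the features pass through unchanged.
        return list(x)

    def backward(self, grad_output):
        # Backward pass: flip the sign of the gradient (scaled by lam).
        return [-self.lam * g for g in grad_output]


grl = GradientReversal()
features_out = grl.forward([1.0, -2.0])   # same values as the input
grad_in = grl.backward([0.5, -1.0])       # sign-flipped gradient
```

Everything before the layer therefore receives a gradient that points in the direction that increases the loss computed after it.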

This layer comes from domain adaptation, where the goal is to make the feature distributions of the source and target domains indistinguishable to a domain classifier. Here, the scene classifier plays the role of the domain classifier: the authors want features from which the scene type cannot be predicted, which is why they train the feature extractor and the scene classifier in an adversarial way while the action classifier is trained as usual.

https://i.imgur.com/trNJGlm.png

With the gradient reversal layer in place, the action predictor is trained to predict the action labels of the training instances, and the scene predictor is trained to predict their scene labels. The feature extractor is therefore trained to minimize the classification loss of the action predictor and, because the gradient coming from the scene predictor is reversed, to maximize the classification loss of the scene predictor.

As a result, the learned action features become scene-agnostic.
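The effect on the feature extractor can be checked on a toy problem. The sketch below assumes a one-parameter feature extractor (`feat = w * x`), fixed identity heads, and squared-error losses in place of the real classification losses; all names and values are hypothetical and chosen only to make the gradient signs easy to follow.

```python
def squared_error(w, x, target):
    """Toy stand-in for a classification loss on the feature w * x."""
    return (w * x - target) ** 2


def extractor_step(w, x, action_target, scene_target, lam=1.0, lr=0.1):
    """One gradient step on the feature extractor weight w."""
    feat = w * x
    d_action = 2.0 * (feat - action_target)  # dL_action / dfeat
    d_scene = 2.0 * (feat - scene_target)    # dL_scene / dfeat
    # The scene gradient passes through gradient reversal before it
    # reaches the extractor, so it enters with a flipped sign.
    d_feat = d_action + (-lam) * d_scene
    grad_w = d_feat * x
    return w - lr * grad_w


w0 = 0.0
w1 = extractor_step(w0, x=1.0, action_target=1.0, scene_target=-1.0)

action_before = squared_error(w0, 1.0, 1.0)
action_after = squared_error(w1, 1.0, 1.0)
scene_before = squared_error(w0, 1.0, -1.0)
scene_after = squared_error(w1, 1.0, -1.0)
```

In this toy setup a single step lowers the action loss and raises the scene loss, which is the behaviour described above for the feature extractor.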
