This paper aims to mitigate scene bias in the action recognition task.
Scene bias refers to a model relying only on scene or object information, without paying attention to the actual activity.
To mitigate this issue, the authors propose two additional losses:
(1) a scene adversarial loss that helps the network learn features that are discriminative for actions but invariant to scene type, hence reducing scene bias;
(2) a human mask confusion loss that prevents the model from predicting the correct action label for a video when no person is visible in it. This mitigates scene bias because the model can no longer predict the correct action from the surrounding scene alone.
To mask out the person in a video, they run a human detector and then mask out the detected person.
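The masking step can be sketched as follows, assuming the detector returns a single bounding box `(x1, y1, x2, y2)` in pixel coordinates (the function name and box values below are hypothetical, not from the paper):

```python
import numpy as np

def mask_person(frame, box):
    """Zero out the pixels inside a detected person's bounding box.

    frame: H x W x C image array; box: (x1, y1, x2, y2) in pixels.
    Returns a copy of the frame with the person region blacked out.
    """
    x1, y1, x2, y2 = box
    masked = frame.copy()
    masked[y1:y2, x1:x2, :] = 0  # black out the person region
    return masked

# Toy example: an 8x8 all-ones frame with a hypothetical detection box.
frame = np.ones((8, 8, 3))
masked = mask_person(frame, (2, 2, 5, 5))
```

Applying this per frame yields the person-masked video used by the human mask confusion loss.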
In the above diagram, there is a gradient reversal layer, which works as follows:
In the forward pass, the output is identical to the input (an identity function).
In the backward pass, the incoming gradient is multiplied by -1 before being passed to the preceding layer.
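The two passes can be sketched without any deep-learning framework: forward is the identity, and backward flips the sign of the incoming gradient. (Implementations often also scale the reversed gradient by a factor lambda; that factor is included here as an assumption.)

```python
class GradientReversal:
    """Identity on the forward pass; negates (and optionally scales)
    the gradient on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength lambda (assumed; often scheduled)

    def forward(self, x):
        return x  # output is identical to the input

    def backward(self, grad):
        return -self.lam * grad  # flip the gradient's sign
```

In a real framework this would be a custom autograd op; the class above only illustrates the forward/backward behavior.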
This layer comes from domain adaptation, where the goal is to make the feature distributions of the source and target domains indistinguishable. Analogously, in this work they want the learned features to carry no scene information, which is why they train the action classifier and the scene classifier in an adversarial way.
By using the gradient reversal layer, the action predictor is trained to predict the action labels of the training instances, while the feature extractor is trained to minimize the classification loss of the action predictor and, at the same time, maximize the classification loss of the scene predictor.
As a result, the learned action features become scene-agnostic.
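Putting the pieces together, the training objective can be sketched as an action cross-entropy term, a scene term (whose gradient the reversal layer flips for the feature extractor), and a confusion term that rewards high-entropy action predictions on the person-masked video. The weight names `w_adv` and `w_mask` and the toy probabilities are hypothetical, not the paper's values:

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the correct class.
    return -np.log(probs[label])

def entropy(probs):
    # Shannon entropy of a probability vector.
    return -np.sum(probs * np.log(probs))

def total_loss(action_probs, action_label,
               scene_probs, scene_label,
               masked_action_probs,
               w_adv=0.5, w_mask=0.5):
    # 1) Standard action classification loss.
    l_action = cross_entropy(action_probs, action_label)
    # 2) Scene loss: the scene classifier minimizes it, while the reversal
    #    layer makes the feature extractor effectively maximize it.
    l_scene = cross_entropy(scene_probs, scene_label)
    # 3) Human mask confusion loss: maximize the entropy of the action
    #    prediction on the masked video (no person -> no confident action),
    #    i.e. minimize its negative entropy.
    l_mask = -entropy(masked_action_probs)
    return l_action + w_adv * l_scene + w_mask * l_mask

# Toy forward pass with 3 actions and 2 scene classes.
uniform = np.array([1/3, 1/3, 1/3])       # maximally confused on masked video
confident = np.array([0.9, 0.05, 0.05])   # confidently wrong behavior
loss_good = total_loss(np.array([0.7, 0.2, 0.1]), 0,
                       np.array([0.5, 0.5]), 0, uniform)
loss_bad = total_loss(np.array([0.7, 0.2, 0.1]), 0,
                      np.array([0.5, 0.5]), 0, confident)
```

A uniform prediction on the masked video yields a lower confusion term than a confident one, which is exactly the behavior the loss encourages.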