Summary by ngthanhtinqn 1 year ago
This paper aims to reduce gender bias in image captioning models. Concretely, traditional captioning models tend to rely on contextual cues, so they often predict incorrect gender words for images that contain people.
To reduce gender bias, they introduce a new $Equalizer$ model trained with two losses:
(1) Appearance Confusion Loss: when it is hard to tell whether the person in the image is a man or a woman, the model should assign a fair (balanced) probability to predicting man or woman.
To define this loss, they first define a confusion function, which measures how likely the next predicted word is to belong to the set of woman words versus the set of man words.
https://i.imgur.com/oI6xswy.png
where $\tilde{w}_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, and $G_{m}$ is the set of man words.
The loss is then defined as the standard cross-entropy loss multiplied by the confusion function.
https://i.imgur.com/kLpROse.png
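The description above can be sketched in a few lines of Python. The exact formulas are in the linked images, so this is a hedged approximation: the vocabulary, word lists, and the gap-between-masses form of the confusion term are assumptions chosen to match the behaviour described (balanced gender probabilities give low loss), not the paper's verbatim equations.

```python
import math

# Hypothetical tiny vocabulary and gendered word lists; the paper uses
# larger curated sets G_w and G_m.
VOCAB = ["woman", "man", "dog", "snowboard"]
WOMAN_WORDS = {"woman"}
MAN_WORDS = {"man"}

def gendered_mass(probs, word_set):
    """Probability mass the model places on one gendered word set."""
    return sum(p for w, p in zip(VOCAB, probs) if w in word_set)

def confusion(probs):
    """One plausible instantiation of the confusion term: the gap between
    the woman-word and man-word probability masses. It is near zero when
    the model assigns the two sets fair (balanced) probability, and large
    when the model commits to one gender."""
    return abs(gendered_mass(probs, WOMAN_WORDS) - gendered_mass(probs, MAN_WORDS))

def appearance_confusion_loss(probs, target_word):
    """Per-time-step sketch: cross-entropy on the ground-truth word,
    scaled by the confusion term, applied only at gender-word positions.
    Minimizing it pushes the two gendered masses toward balance."""
    if target_word not in WOMAN_WORDS | MAN_WORDS:
        return 0.0
    ce = -math.log(probs[VOCAB.index(target_word)])
    return ce * confusion(probs)
```

In a full model this term would be summed over the gender-word time steps of each caption; here a single step is shown for clarity.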
(2) Confident Loss: When it is easy to recognize a man or a woman in an image, this loss encourages the model to predict gender words correctly.
For this loss, they also define in-confidence functions: one for man words and one for woman words. The two functions have the same form.
https://i.imgur.com/4stFjac.png
This function says that if the model is confident when predicting a gender (e.g., woman), then the value of the in-confidence function for woman words should be low.
Then, the confidence loss function is as follows:
https://i.imgur.com/1pRgDir.png
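A minimal sketch of the confident loss, under the same caveat as above: the exact formulas are in the linked images, so the vocabulary, word lists, and the ratio form of the in-confidence score are assumptions chosen to match the stated behaviour (confident, correct gender predictions contribute little loss).

```python
import math

# Hypothetical tiny vocabulary and gendered word lists (assumed names).
VOCAB = ["woman", "man", "dog", "snowboard"]
WOMAN_WORDS = {"woman"}
MAN_WORDS = {"man"}
EPS = 1e-6  # avoids division by zero

def gendered_mass(probs, word_set):
    """Probability mass the model places on one gendered word set."""
    return sum(p for w, p in zip(VOCAB, probs) if w in word_set)

def in_confidence(probs, target_set, other_set):
    """Assumed in-confidence score: mass on the *other* gender's words
    relative to mass on the target gender's words. When the model
    confidently predicts the target gender, this value is low."""
    return gendered_mass(probs, other_set) / (gendered_mass(probs, target_set) + EPS)

def confident_loss(probs, target_word):
    """Per-time-step sketch: cross-entropy weighted by the in-confidence
    score, so already-confident correct gender predictions are barely
    penalized, while uncertain ones keep a full training signal."""
    if target_word in WOMAN_WORDS:
        weight = in_confidence(probs, WOMAN_WORDS, MAN_WORDS)
    elif target_word in MAN_WORDS:
        weight = in_confidence(probs, MAN_WORDS, WOMAN_WORDS)
    else:
        return 0.0
    return weight * -math.log(probs[VOCAB.index(target_word)])
```

Using the in-confidence as a weight on cross-entropy (rather than as a standalone penalty) means the loss vanishes exactly in the regime the authors want to reward: confident, correct gender predictions.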