Summary by Abir Das
Concern about the issue of fairness (or the lack of it) in machine learning models is gaining widespread visibility among the general public, governments, and researchers. This is especially alarming as AI-enabled systems become ever more pervasive in our society, with decisions being taken by AI agents in domains ranging from healthcare to autonomous driving to criminal justice. Bias in any dataset is, in some way or other, a reflection of the general attitude of humankind towards different activities that are stereotypically associated with a certain gender, race, or ethnicity. As these datasets are the sources of knowledge for AI models (especially multimodal end-to-end models, which depend on human-annotated training data for literally everything), their decision making also inherits the bias of the dataset. This paper makes an important observation about image captioning models: they not only pick up the bias in the dataset but tend to exaggerate it during inference. This is a clear shortcoming of current supervised models, which are marked by their over-reliance on image context. The related work section of the paper (Section 2, first part: “Unwanted Dataset Bias”) gives an extensive review of the types of bias in datasets and of the few recent works trying to address them. Gender bias (a kitchen scene makes most of us guess that the person is a woman when the person is not clearly visible, or a man is presumed to snowboard more often than a woman) and reporting bias (over-reporting less common co-occurrences, such as “male nurse” or “green banana”) are two of the many biases present in machine learning datasets.
The paper addresses the problem of fair caption generation that does not presume a specific gender without appropriate visual evidence for that gender. This is done by introducing an ‘Equalizer’ model, which adds two complementary losses to the normal cross-entropy loss of image captioning systems. The Appearance Confusion Loss (ACL) encourages the model to generate gender-neutral words (for example, ‘person’) when an image does not contain enough evidence of gender. During training, the people in the images are masked out, and the loss term encourages the gendered words (“man” and “woman”) to have equal probability, i.e., the model is encouraged to be confused when it should be confused instead of hallucinating gender from the context. The loss expression is fairly intuitive (eqns. (2) and (3)). However, making the model confused is not enough on its own, so the other loss, the Confident Loss (Conf), is introduced. This loss encourages the model to predict gendered words, and to predict them correctly, when there is enough evidence of gender in the image. The loss function (eqns. (4) and (5)) makes clever use of the quotient between the predicted probabilities of male and female gendered words. If I had to give a single takeaway line from the paper, it would be the following, which summarizes the working principle behind the two losses very succinctly.
> “These complementary losses allow the Equalizer model to encourage models to be cautious in the absence of gender information and discriminative in its presence.”
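To make the two losses more concrete, here is a minimal PyTorch-style sketch of how they could be computed for a single caption. The placeholder word-id lists (`WOMAN_IDS`, `MAN_IDS`), the tensor shapes, and the per-caption reduction are my own assumptions for illustration, not the authors' implementation.

```python
import torch

# Hypothetical vocabulary indices for gendered words; the real model uses
# full word lists ("woman", "girl", ... / "man", "boy", ...).
WOMAN_IDS = torch.tensor([101, 102])
MAN_IDS = torch.tensor([201, 202])
EPS = 1e-6

def appearance_confusion_loss(probs_masked, target_ids):
    """probs_masked: (T, V) decoder word probabilities for the person-masked
    image I'. target_ids: (T,) ground-truth word ids. Penalizes any gap
    between the total woman-word and man-word probability at time steps
    whose ground-truth word is gendered (dataset-level averaging omitted)."""
    p_woman = probs_masked[:, WOMAN_IDS].sum(dim=1)
    p_man = probs_masked[:, MAN_IDS].sum(dim=1)
    is_gendered = (torch.isin(target_ids, WOMAN_IDS) |
                   torch.isin(target_ids, MAN_IDS)).float()
    confusion = (p_woman - p_man).abs()
    return (is_gendered * confusion).sum()

def confident_loss(probs, target_ids):
    """probs: (T, V) decoder word probabilities for the original image I.
    When the ground truth is a woman-word, penalize the quotient
    p(man words) / p(woman words); symmetrically for man-words."""
    p_woman = probs[:, WOMAN_IDS].sum(dim=1)
    p_man = probs[:, MAN_IDS].sum(dim=1)
    is_woman = torch.isin(target_ids, WOMAN_IDS).float()
    is_man = torch.isin(target_ids, MAN_IDS).float()
    return (is_woman * p_man / (p_woman + EPS) +
            is_man * p_woman / (p_man + EPS)).sum()
```

The quotient form means the Confident loss only vanishes when the probability mass assigned to the wrong gender is small relative to the correct one, which is exactly the "discriminative in its presence" behaviour the quote describes.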
The experiments are also well thought out. For the experiments, three different versions of the MSCOCO dataset are created: MSCOCO-Bias, MSCOCO-Confident, and MSCOCO-Balanced, in which the gender bias gradually decreases. Three different metrics are used to evaluate the models: error rate (the fraction of man/woman misclassifications), gender ratio (how close the gender ratio in the predicted captions of the test set is to the ground-truth gender ratio), and right for the right reasons (whether the visual evidence the model uses to predict gendered words coincides with the person regions of the image). There are a few baseline models and ablation studies. The baselines are a naive image captioning model (the ‘Show and Tell’ approach), an approach where images of the less common gender are sampled more often during training, and another baseline where the gendered words are given higher weights in the cross-entropy loss. The ablation models consider the two losses (ACL and Conf) separately. For all the datasets, the proposed Equalizer model consistently performs well according to all three metrics. The experiments also show that, as the evaluation datasets become more and more balanced (i.e., the gender distribution departs more and more from the biased gender distribution of the training dataset), the performance of all the models degrades. However, the proposed model performs the best, with the least variation in performance across the datasets. The qualitative examples with Grad-CAM and sliding-window saliency maps for the gendered words are also a positive point of the paper.
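As a rough illustration of the first two metrics, a small Python sketch is given below. The word lists, tokenization, and tie handling are my own simplifications; the paper's exact evaluation protocol may differ.

```python
from typing import List

WOMAN_WORDS = {"woman", "women", "girl"}   # assumed word lists
MAN_WORDS = {"man", "men", "boy"}

def predicted_gender(caption: str) -> str:
    """Map a caption to 'woman', 'man', or 'other' based on the words it uses."""
    tokens = set(caption.lower().split())
    has_w, has_m = bool(tokens & WOMAN_WORDS), bool(tokens & MAN_WORDS)
    if has_w and not has_m:
        return "woman"
    if has_m and not has_w:
        return "man"
    return "other"

def error_rate(pred_captions: List[str], gt_genders: List[str]) -> float:
    """Fraction of images whose caption uses a gendered word of the wrong gender."""
    wrong = sum(1 for c, g in zip(pred_captions, gt_genders)
                if predicted_gender(c) not in ("other", g))
    return wrong / len(pred_captions)

def gender_ratio(captions: List[str]) -> float:
    """Ratio of woman-mentions to man-mentions in a set of captions; compared
    against the same ratio computed on the ground-truth captions."""
    genders = [predicted_gender(c) for c in captions]
    return genders.count("woman") / max(genders.count("man"), 1)
```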
Things I would have liked the paper to contain:
* There are a few points of confusion in the expression of the Conf loss in eqn. (4). Specifically, I am not sure what the difference between $w_t$ and $\tilde{w}_t$ is. It seems the first is the ground-truth word and the latter is the predicted word. A clarification would have been welcome.
Overall, the paper is quite novel both in defining the problem and in solving it. The solution strategy is intuitive and easy to grasp, and the paper is well written too. One can sincerely hope that work addressing problems at the intersection of machine learning and societal issues will appear more frequently; the discussed paper is a very significant first step in that direction.