[link]
Concern about fairness (or the lack of it) in machine learning models is gaining widespread visibility among the general public, governments, and researchers. This is especially alarming as AI-enabled systems become increasingly pervasive in our society, with decisions being taken by AI agents in domains ranging from healthcare to autonomous driving to criminal justice. Bias in a dataset is, in one way or another, a reflection of humankind's general attitude towards different activities that are typified by a certain gender, race, or ethnicity. As these datasets are the sources of knowledge for AI models (especially multimodal end-to-end models, which depend on human-annotated training data for literally everything), their decision-making ability is also shadowed by the bias in the data. This paper makes an important observation about image captioning models: they not only exploit the bias in the dataset but tend to exaggerate it during inference. This is a clear shortcoming of current supervised models, which are marked by their over-reliance on image context.

The related-work section (Section 2, first part: "Unwanted Dataset Bias") gives an extensive review of the types of bias in datasets and of the few recent works trying to address them. Gender bias (most of us guess "woman" in a kitchen scene when the person is not clearly visible, and assume that a man is more likely to snowboard than a woman) and reporting bias (over-reporting less common co-occurrences, such as "male nurse" or "green banana") are two of the many biases present in machine learning datasets.

The paper addresses the problem of fair caption generation: a caption should not presume a specific gender without appropriate evidence for that gender. This is done by introducing an 'Equalizer' model, which adds two complementary losses to the usual cross-entropy loss of image captioning systems. The Appearance Confusion Loss (ACL) encourages the model to generate gender-neutral words (for example, 'person') when an image does not contain enough evidence of gender. During training, the person regions in the images are masked out, and the loss encourages the gender words ("man" and "woman") to have equal probability, i.e., the model is encouraged to be confused when it should be confused instead of hallucinating a gender from the context. The loss expression is quite intuitive (eqns. (2) and (3)). However, it is not a good idea to only make the model confused, so a second loss, the Confident Loss (Conf), is introduced. It encourages the model to predict gender words, and to predict them correctly, when there is enough evidence of gender in the image. The loss (eqns. (4) and (5)) makes intelligent use of the quotient between the predicted probabilities of the male and female gender words. A rough sketch of how the two losses might look in code is given below.

If I have to give a single takeaway line from the paper, it is the following, which summarizes the working principle behind the two losses very succinctly:

> "These complementary losses allow the Equalizer model to encourage models to be cautious in the absence of gender information and discriminative in its presence."

The experiments are also well thought out. Three different versions of the MSCOCO dataset are created: MSCOCO-Bias, MSCOCO-Confident, and MSCOCO-Balanced, in which the gender bias gradually decreases.
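To make the mechanics of the two complementary losses concrete, here is a minimal, hypothetical PyTorch sketch. It assumes per-time-step word probabilities from the decoder, known positions of the gender words in the ground-truth caption, and hypothetical vocabulary indices `MAN_IDX` and `WOMAN_IDX`; it is an illustration of the idea, not the authors' exact formulation in eqns. (2)-(5).

```python
import torch

# Hypothetical vocabulary indices for the gender words.
MAN_IDX, WOMAN_IDX = 0, 1

def appearance_confusion_loss(probs_masked, gender_positions):
    """ACL sketch: on images with the person region masked out, push the
    predicted probabilities of "man" and "woman" towards each other at the
    time steps where a gender word occurs in the ground-truth caption."""
    p_man = probs_masked[gender_positions, MAN_IDX]
    p_woman = probs_masked[gender_positions, WOMAN_IDX]
    # Symmetric confusion term: zero when the two probabilities are equal.
    return torch.abs(p_man - p_woman).mean()

def confident_loss(probs_full, gender_positions, targets):
    """Conf sketch: on unmasked images, when the ground-truth word is a gender
    word, penalise the quotient of the wrong gender's probability over the
    correct one, so the model is rewarded for being correct and confident."""
    p_man = probs_full[gender_positions, MAN_IDX]
    p_woman = probs_full[gender_positions, WOMAN_IDX]
    tgt = targets[gender_positions]
    eps = 1e-8
    quotient = torch.where(tgt == MAN_IDX,
                           p_woman / (p_man + eps),
                           p_man / (p_woman + eps))
    return quotient.mean()

# total_loss = cross_entropy + alpha * acl + beta * conf   (weights are hyper-parameters)
```

The total objective would then be the standard captioning cross-entropy plus a weighted sum of the two terms, with the ACL computed on person-masked images and the Confident loss on the original ones.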
Three metrics are used to evaluate the model: error rate (the fraction of man/woman misclassifications), gender ratio (how close the gender ratio in the predicted captions of the test set is to the ground-truth gender ratio), and right-for-the-right-reasons (whether the visual evidence used by the model for predicting the gender words coincides with the person regions in the image). A toy sketch of the first two metrics is given at the end of this summary.

There are a few baseline models and ablation studies. The baselines include a naive image captioning model (the 'Show and Tell' approach), an approach where images of the less common gender are sampled more often during training, and another where the gender words are given a higher weight in the cross-entropy loss. The ablation models use the two losses (ACL and Conf) separately. For all the datasets, the proposed Equalizer model consistently performed well on all three metrics. The experiments also show that as the evaluation datasets become more and more balanced (i.e., the gender distribution departs more and more from the biased gender distribution of the training data), the performance of all the models falls away. However, the proposed model performs best and shows the least variation in performance across the datasets. The qualitative examples with Grad-CAM and sliding-window saliency maps for the gender words are also a positive point of the paper.

Things I would have liked the paper to contain:

* There is some confusion in the expression of the Conf loss in eqn. (4). Specifically, I am not sure what the difference between $w_t$ and $\tilde{w}_t$ is. It seems the former is the ground-truth word and the latter is the predicted word; a clarification would have been good.

Overall, the paper is novel both in defining the problem and in solving it. The solution strategy is intuitive and easy to grasp, and the paper is well written. We can sincerely hope that more work addressing problems at the intersection of machine learning and societal issues will follow, and the discussed paper is a very significant first step in that direction.
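As a side note, here is a toy numpy sketch of how the first two metrics (error rate and gender ratio) could be computed from per-image gender predictions extracted from the generated captions. The paper's exact bookkeeping (e.g., how gender-neutral captions enter the ratio) may differ, so treat this only as an illustration.

```python
import numpy as np

def gender_metrics(predicted, ground_truth):
    """Toy sketch of the first two Equalizer metrics.
    `predicted`: per-image gender word found in the generated caption,
                 one of {"man", "woman", "neutral"}.
    `ground_truth`: annotated gender per image, "man" or "woman"."""
    pred = np.array(predicted)
    gt = np.array(ground_truth)

    # Error rate: fraction of images where a gendered word is predicted
    # but it disagrees with the annotated gender.
    gendered = pred != "neutral"
    error_rate = (gendered & (pred != gt)).sum() / len(gt)

    # Gender ratio: gap between the woman/man ratio in the predictions
    # and the woman/man ratio in the ground truth.
    def ratio(labels):
        men = (labels == "man").sum()
        return (labels == "woman").sum() / max(men, 1)

    ratio_gap = abs(ratio(pred) - ratio(gt))
    return error_rate, ratio_gap
```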
[link]
This paper deals with an important problem: making a deep classification system explainable. After the (continuing) success of deep networks, researchers are trying to open the black box, and this work is one of the foremost efforts in that direction. The authors exploit the strength of one deep learning method (a vision-language model) to explain the behaviour of another (an image classifier). The approach jointly predicts a class label and explains in natural language why it predicted that label.

The paper starts with a very important distinction between two basic schools of *explanation* systems - *introspection* explanation systems and *justification* explanation systems. An introspection system looks inside the model to obtain an explanation (e.g., "This is a Western Grebe because filter 2 has a high activation..."). A justification system, on the other hand, justifies the decision by producing a sentence detailing how the visual evidence is compatible with the system's output (e.g., "This is a Western Grebe because it has red eyes..."). The paper focuses on *justification* explanation systems and proposes a novel one. The authors argue that, unlike a description of an image or a sentence defining a class (not necessarily grounded in an image), a visual explanation, conditioned on the input image, explains why the image is classified as a certain category while mentioning only image-relevant features.

The broad outline of the approach is given in Fig. 2 of the paper. https://i.imgur.com/tta2qDp.png The first stage is a deep convolutional network for classification, which produces a softmax distribution over the classes. As the task is fine-grained bird species classification, it uses a compact bilinear feature representation known to work well for fine-grained classification. The second stage is a stacked LSTM that generates natural language sentences, or explanations, justifying the decision of the first stage. The first LSTM of the stack receives the previously generated word. The second LSTM receives the output of the first LSTM along with the image features and the predicted label distribution from the classification network, and produces output words until an "end-of-sentence" token is generated. The intuition behind feeding in the predicted label distribution is that it informs the explanation generator which words and attributes are more likely to occur in the description. A rough sketch of one decoding step is given below.

Two losses are used for the second stage, i.e., the language model. The first is termed the *Relevance Loss*, the typical sentence-generation loss seen in the literature: the sum of cross-entropy losses of the generated words with respect to the ground-truth words. Its role is to optimize the alignment between generated and ground-truth sentences. However, this loss is not very effective at producing sentences that include class-discriminative information, because class specificity is a global sentence property. This is illustrated with the following example - *whereas a sentence "This is an all black bird with a bright red eye" is class specific to a "Bronzed Cowbird", words and phrases in the sentence, such as "black" or "red eye" are less class discriminative on their own.* As a result, a cross-entropy loss on individual words turns out to be less effective at capturing global sentence properties, of which class specificity is an example.
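Returning to the two-stage architecture, the sketch below shows what one decoding step of the stacked-LSTM explanation generator might look like in PyTorch. Layer sizes, the embedding, and the exact wiring are assumptions for illustration, not the authors' configuration; the key point is that the second LSTM sees the first LSTM's output concatenated with the image features and the predicted label distribution.

```python
import torch
import torch.nn as nn

class ExplanationDecoderStep(nn.Module):
    """One step of a stacked-LSTM explanation decoder (illustrative sizes)."""
    def __init__(self, vocab_size, num_classes, img_dim, hid=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.lstm1 = nn.LSTMCell(hid, hid)                          # consumes the previous word
        self.lstm2 = nn.LSTMCell(hid + img_dim + num_classes, hid)  # + image feature + label distribution
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, prev_word, img_feat, label_dist, state1, state2):
        h1, c1 = self.lstm1(self.embed(prev_word), state1)
        x2 = torch.cat([h1, img_feat, label_dist], dim=1)
        h2, c2 = self.lstm2(x2, state2)
        return self.out(h2), (h1, c1), (h2, c2)   # logits over the next word
```

At each step the previous word (the ground-truth word during training, the sampled or argmax word at test time) is fed to the first LSTM, and the returned logits define the distribution over the next word until the end-of-sentence token is produced.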
The authors address this issue by proposing an additional loss, termed the *Discriminative Loss*, which is based on a reinforcement learning paradigm. Before computing the loss, a sentence is sampled from the model and passed through an LSTM-based classification network whose task is to produce the ground-truth category $C$ given only the sampled sentence. The reward is simply the probability of the ground-truth category $C$ given only the sentence. The intuition is that, for the model to obtain a large reward, the generated sentence must contain enough information to classify the original image correctly. The *Discriminative Loss* is the expectation of the negative of this reward, and a weighted linear combination of the two losses is optimized during training. A rough sketch of this loss is given at the end of this summary.

My experience with reinforcement learning is limited, but I must say I did not quite get why sampling of the sentences is required (which calls for a special backpropagation algorithm). If the idea is to check whether a generated sentence can be used to recover the ground-truth category, could the last internal state of one of the stacked LSTMs not be used instead? Some more intuition behind the sampling operation would have been welcome. Another thing which (though fairly obvious) I felt was missing is a mention of the loss used for the fine-grained classification network.

The experimentation is rigorous. The proposed method is compared with four baseline and ablation models - description, definition, explanation-label, and explanation-discriminative - obtained from different combinations of the two losses, the class prediction information, etc. The evaluation metrics measure different qualities of the generated explanations, specifically image relevance and class relevance. To measure image relevance, METEOR/CIDEr scores of the generated sentences against the ground-truth (image-based) explanations are computed. To measure class relevance, CIDEr scores against class-definition sentences (not necessarily based on images from the dataset) are computed. The proposed approach consistently outperforms all the baseline and ablation methods.

I would specifically mention one experiment where the effect of class conditioning is studied (end of Sec. 5.2). The finding is quite interesting: providing correct or incorrect class information has a drastic effect on the generated explanations. Giving incorrect class information makes the explanation model hallucinate colors or attributes that are not present in the image but are specific to the (wrong) class. This raises the question of whether it is worth feeding in the class information when the classifier is poor in the first place. But I think the answer lies in the observation that row 5 (with class prediction information) in Table 1 is always better than row 4 (without it). Since row 5 is better than row 4, the classifier must be reasonable, which in turn suggests that end-to-end training can improve all stages of a pipeline and ultimately the overall performance of the system.

In summary, the paper is a very good first step towards explaining intelligent systems and should encourage a lot more effort in this direction.
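As promised above, here is a hypothetical PyTorch sketch of the REINFORCE-style discriminative loss: a sentence is sampled from the decoder's word distributions, the reward is the probability an external sentence classifier assigns to the true class given only that sentence, and the gradient flows through the log-probability of the sampled words. The shapes, the `sentence_classifier` interface, and the single-sample estimator are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def discriminative_loss(word_logits, sentence_classifier, true_class):
    """REINFORCE-style sketch. `word_logits` is [T, vocab] from the decoder;
    `sentence_classifier(words)` returns log-probabilities over the classes."""
    # Sample one word per time step from the decoder's distributions.
    probs = F.softmax(word_logits, dim=-1)
    sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)       # [T]
    log_p = F.log_softmax(word_logits, dim=-1)
    log_p_sampled = log_p[torch.arange(len(sampled)), sampled].sum()

    # Reward: probability of the ground-truth class given only the sentence.
    with torch.no_grad():
        reward = sentence_classifier(sampled)[true_class].exp()

    # Minimising -reward * log p(sampled sentence) increases the expected reward.
    return -reward * log_p_sampled
```

The total training objective would then be the relevance loss plus this term scaled by a weight, matching the weighted linear combination mentioned above.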
[link]
In this race for that extra few percent of improvement needed for a '*brand-new*' paper, this paper brings in a breath of fresh air by posing some very pertinent questions supported by rigorous experimental analysis. It is an ICCV 2017 paper. The paper is about understanding activities in videos, from both the activity classification and the detection perspective. In doing so, the authors examine several datasets, evaluation metrics, and algorithms, and point to possible future directions worth exploring. The default dataset is Charades; MultiTHUMOS, THUMOS, and ActivityNet are used as and when required. The activity classification/detection algorithms analyzed are two-stream networks, improved dense trajectories (IDT), LSTM on VGG features, ActionVLAD, and Temporal Fields.

The paper starts with the very definition of action. To quote: *"When we talk about activities, we are referring to anything a person is doing, regardless of whether the person is intentionally and actively altering the environment, or simply sitting still."* This is a perspective complementary to what the community has perceived as action so far - *"intentional bodily motion of biological agents"* [1]. The paper generalizes this notion and advocates that bodily motion is not indispensable for defining actionness (*e.g.*, 'watching TV' or 'lying on a couch' hardly involve any bodily motion). The analysis of motion's role in activity understanding plays a major part later in the paper. Let us look at some of the major questions the authors explore.

1. "Only verbs" can make actions ambiguous. To quote: "Verbs such as 'drinking' and 'running' are unique on their own, but verbs such as 'take' and 'put' are ambiguous unless nouns and even prepositions are included: 'take medication', 'take shoes', 'take off shoes'". The experiments involving both humans (Sec. 3.1) and activity recognition algorithms (Sec. 4.1) show that, given the verb, less confusion arises when the object is mentioned ('holding a cup' vs. 'holding a broom'), but given the object, confusion is higher among different verbs ('holding a cup' vs. 'drinking from a cup'). All the current algorithms show significant confusion among similar action categories, both in terms of verbs and objects. In fact, the more categories share an object or verb with a given category, the worse its accuracy.

2. The next study is, to me, the most important one. It concerns the long-standing question of whether activities have clear and universal temporal boundaries. The human study shows that they are in fact ambiguous: average human agreement with the ground truth is only 72.5% IoU on Charades and 58.7% IoU on MultiTHUMOS. As a natural next step, the authors check whether this ambiguity affects the evaluated performance of the algorithms. For this purpose, they relax the ground-truth boundaries (Sec. 3.2) and re-evaluate the algorithms (a toy illustration of such a relaxed evaluation is included in the sketch after this list). The surprising fact is that this relaxation does not improve performance much. The authors conclude that, despite boundary ambiguity, current datasets allow current algorithms to understand and learn from the temporal extent of activities. I must say I did not expect ambiguity in temporal boundaries to have such an insignificant effect on localization performance. In addition to the conclusion drawn by the authors, this could be caused by another issue: the (negative) effect of other factors may be so large that correcting for boundary ambiguity cannot change the performance much.
What I mean is that it may not be that the datasets are sufficient; rather, the algorithms may be suffering from other flaws much more than from boundary ambiguity.

3. Another important question the authors address is how the amount of labeled training data affects performance. The broad finding matches the common wisdom that "more data means better performance". However, the authors point out a number of finer, equally important insights. The amount of data does not affect all categories equally, especially in a dataset with a long-tailed class distribution: smaller categories are affected more. In addition, activities with more similar categories (that share the same object or verb) are affected much more than their counterparts. The authors end the subsection (Sec. 4.2) with the observation that improvement can be made by designing algorithms that are better able to make use of the wealth of data in the small categories than in the large ones.

4. The authors perform a thorough analysis of the role of temporal reasoning (motion, continuity, and temporal context) in activity understanding. The first finding is that current methods do better on longer activities than on shorter ones. Another common notion - that naive temporal smoothing of the predictions improves localization and classification - is also verified (a toy version of such smoothing is included in the sketch after this list).

5. An action is almost invariably related to a person, so the authors investigate whether person-based reasoning helps. For this, they experiment with removing the person from the scene, keeping nothing but the person, etc. They also examine how diverse the datasets are in terms of human pose and whether injecting pose information helps current approaches. The conclusion is that person-based reasoning helps, and that the nature of the videos requires activity understanding approaches to harness pose information for improved performance.

6. Finally, the authors measure which aspects would help most if they were solved perfectly by an oracle. The oracles include perfect object detection, perfect verb identification, and so on. The outcome varies somewhat across datasets, but in general all the oracles help, some more, some less.

I think this is a much-needed work that will help the community ponder different avenues of activity understanding in videos and design better systems.

[1] Wei Chen, Caiming Xiong, Ran Xu, Jason J. Corso, "Actionness Ranking with Lattice Conditional Ordinal Random Fields", CVPR 2014.
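As referenced in points 2 and 4 above, here is a toy numpy sketch of the two evaluation-side procedures discussed there: temporal IoU against boundary-relaxed ground truth, and naive temporal smoothing of per-frame class scores. The function names, shapes, and the relaxation/window parameters are hypothetical illustrations, not the paper's evaluation code.

```python
import numpy as np

def relaxed_tiou(pred, gt, relax=1.0):
    """Temporal IoU of a predicted segment against a ground-truth segment whose
    boundaries are relaxed by `relax` seconds on each side (cf. point 2).
    Segments are (start, end) pairs in seconds."""
    gs, ge = gt[0] - relax, gt[1] + relax
    inter = max(0.0, min(pred[1], ge) - max(pred[0], gs))
    union = max(pred[1], ge) - min(pred[0], gs)
    return inter / union if union > 0 else 0.0

def smooth_predictions(frame_scores, window=5):
    """Naive temporal smoothing of per-frame class scores (cf. point 4):
    a moving average over a small window, applied independently per class.
    `frame_scores` is a [T, num_classes] array."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(frame_scores[:, c], kernel, mode="same")
                     for c in range(frame_scores.shape[1])], axis=1)
```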