Welcome to ShortScience.org!
[link]
What is the paper doing? This paper proposes a way to explain model decisions through human-readable concepts. For example, if the model thinks the following image shows a black-throated sparrow, a human can understand that decision via the input descriptors.

https://i.imgur.com/xVleDhp.png

The descriptors were obtained from GPT-3: 500 descriptors were generated for each class, and the class name was removed from each descriptor. Then, for each class, $k$ concepts were chosen so that every class has an equal number of concepts. These concepts were passed through a concept selection module to select a more fine-grained subset of concepts for each class. The selected concepts and the image were then fed into CLIP, which produces a score for each concept. Finally, a class-concept weight matrix on top of CLIP fine-tunes these scores and outputs the predicted class name. Note that this weight matrix was initialized with language priors.

https://i.imgur.com/r9Op5Lm.png
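To make the scoring pipeline concrete, here is a minimal PyTorch sketch of the last two steps (CLIP-style concept scoring plus the class-concept weight matrix), assuming the CLIP image and concept embeddings have already been computed; all shapes, names, and the toy usage are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckHead(nn.Module):
    """Scores an image against concept embeddings with CLIP-style cosine
    similarity, then maps concept scores to class logits via a
    class-concept weight matrix initialized from language priors."""

    def __init__(self, language_prior: torch.Tensor):
        super().__init__()
        # language_prior: (num_classes, num_concepts), e.g. similarities
        # between class-name embeddings and concept embeddings.
        self.class_concept_w = nn.Parameter(language_prior.clone())

    def forward(self, image_emb: torch.Tensor, concept_emb: torch.Tensor):
        # image_emb:   (batch, dim)         -- from CLIP's image encoder
        # concept_emb: (num_concepts, dim)  -- from CLIP's text encoder
        image_emb = F.normalize(image_emb, dim=-1)
        concept_emb = F.normalize(concept_emb, dim=-1)
        concept_scores = image_emb @ concept_emb.T               # (batch, num_concepts)
        class_logits = concept_scores @ self.class_concept_w.T   # (batch, num_classes)
        return concept_scores, class_logits

# Toy usage with random tensors standing in for CLIP features.
num_classes, num_concepts, dim = 10, 50, 512
prior = torch.rand(num_classes, num_concepts)
head = ConceptBottleneckHead(prior)
scores, logits = head(torch.randn(4, dim), torch.randn(num_concepts, dim))
```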
[link]
The paper proposes a new object detection method that detects novel classes via Conditional Matching. The detector can be conditioned on either an image or text, which means a user can provide an image or a text query and have the model detect the corresponding bounding boxes in the picture. This model differs from other open-vocabulary detectors in two ways:

1) Other detectors rely on a Region Proposal Network (RPN), which cannot cover all the objects in a picture and therefore hurts performance on novel objects. In this work, they use CLIP instead: conditional queries read over the whole picture, so they can cover many objects in it.

https://i.imgur.com/GqvvSVs.png

2) Other detectors rely on bipartite matching to match class label names with detected bounding boxes. The downside of bipartite matching is that it cannot match novel objects to any label name, because novel objects have no labels. This work instead proposes Conditional Matching, which turns the matching into a binary matching problem: each object is assigned a "matched" or "not matched" label with respect to the conditioning query, as sketched after this summary.

https://i.imgur.com/FjI2iub.png
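A rough PyTorch sketch of binary conditional matching as described in point 2, assuming all predictions are produced under a single conditioning query (image exemplar or text embedding); the cost terms and helper names are illustrative, not the paper's exact formulation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def conditional_match(pred_boxes, pred_match_logits, gt_boxes):
    """Binary conditional matching sketch.

    pred_boxes:        (num_queries, 4) boxes predicted under one conditioning query.
    pred_match_logits: (num_queries,) logit that a query "matches" the condition.
    gt_boxes:          (num_gt, 4) ground-truth boxes of the conditioning class/exemplar.
    Returns a (num_queries,) tensor of binary targets: 1 = matched, 0 = not matched.
    """
    prob = pred_match_logits.sigmoid()
    # Cost: prefer queries with low L1 box error and a high "match" probability.
    cost = torch.cdist(pred_boxes, gt_boxes, p=1) - prob[:, None]
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    targets = torch.zeros_like(prob)
    targets[row] = 1.0   # queries assigned to a ground-truth box get the positive label
    return targets
```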
[link]
This paper proposes a way to do classification using primitive concepts such as color, shape, and texture. The framework is simple and has two sub-models: (1) the first is a pre-trained VL model such as CLIP, ViLT, or ALBEF; given the primitive (attribute) concepts and an image, it outputs a score for each concept. (2) the second is a linear model that uses the concepts and their scores to do classification; it is trained in a supervised manner (a sketch of this stage follows below).

https://i.imgur.com/7WMmGyv.png
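A minimal sketch of stage (2), training the supervised linear model on top of concept scores; the scores are assumed to come from stage (1) (e.g. CLIP image-text similarities), and all names and shapes are placeholders.

```python
import torch
import torch.nn as nn

# Assume stage (1) already produced a (num_samples, num_concepts) matrix of
# concept scores, e.g. VL-model similarities between images and attribute prompts.
num_samples, num_concepts, num_classes = 1000, 64, 10
concept_scores = torch.randn(num_samples, num_concepts)   # placeholder scores
labels = torch.randint(0, num_classes, (num_samples,))    # placeholder labels

linear_head = nn.Linear(num_concepts, num_classes)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    logits = linear_head(concept_scores)   # concept scores -> class logits
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
```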
[link]
This paper presents a way to leverage frozen pre-trained encoders to do VL tasks such as VQA and image captioning. To get a good VL model, the modality gap must be reduced. The paper proposes the Q-Former, a Transformer module that is first trained with a frozen image encoder, and then trained with this frozen image encoder together with a frozen Large Language Model.

https://i.imgur.com/rQ3V3oQ.png

The Q-Former is trained in two stages because: (1) training with the frozen image encoder teaches it to extract the most informative visual features; https://i.imgur.com/gshAy1p.png (2) training with the frozen language model teaches it to extract the visual features that are most relevant to the text. https://i.imgur.com/gPz40GC.png
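A toy PyTorch sketch of the Q-Former idea: a fixed set of learnable query tokens that self-attend and cross-attend to the frozen image encoder's features. The layer sizes and structure here are simplified assumptions, not BLIP-2's actual architecture or training objectives.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Learnable query tokens that self-attend among themselves and
    cross-attend to frozen image features (heavily simplified)."""

    def __init__(self, num_queries=32, dim=768, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from the *frozen* image encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q)[0]
        q = q + self.cross_attn(q, image_feats, image_feats)[0]
        q = q + self.ffn(q)
        return q   # (batch, num_queries, dim); in stage 2 these go to the frozen LLM

# Toy usage with random features standing in for a frozen ViT's output.
feats = torch.randn(2, 257, 768)
out = QFormerSketch()(feats)
```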
[link]
This paper aims to mitigate scene bias in the action recognition task. Scene bias means the model focuses only on scene or object information without paying attention to the actual activity. To mitigate this issue, the authors propose two additional losses: (1) a scene adversarial loss that helps the network learn features that are suitable for the action but invariant to the scene type, hence reducing scene bias; (2) a human mask confusion loss that prevents the model from predicting the correct action label of a video when there is no person visible in it. This also mitigates scene bias, because the model cannot predict the correct action based only on the surrounding scene.

https://i.imgur.com/BBfWE17.png

To mask out the person in a video, they use a human detector to detect and then mask the person out. In the diagram above there is a gradient reversal layer, which works as follows: in the forward pass, the output is identical to the input; in the backward pass, the gradient is multiplied by -1 before being passed back (see the sketch below).

https://i.imgur.com/hif9ZL9.png

This layer comes from domain adaptation, where the goal is to make the feature distributions of the source and target domains indistinguishable to a domain classifier. In this work, the analogous goal is to make the learned features informative about the action but uninformative about the scene, which is why the action classifier and the scene classifier are trained in an adversarial way.

https://i.imgur.com/trNJGlm.png

With the gradient reversal layer, the action classifier is trained to predict the labels of the training instances, while the feature extractor is trained to minimize the classification loss of the action classifier and maximize the classification loss of the scene classifier. As a result, the action features become scene-agnostic.
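For reference, a standard PyTorch implementation of a gradient reversal layer (this is the general technique from domain adaptation, not necessarily the authors' exact code); the usage snippet and names are illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, multiplies
    the incoming gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: features -> grad_reverse -> scene classifier. The scene classifier
# still learns to predict the scene, but the reversed gradient pushes the
# feature extractor to *maximize* the scene loss, making features scene-agnostic.
feats = torch.randn(8, 256, requires_grad=True)
scene_logits = torch.nn.Linear(256, 10)(grad_reverse(feats))
```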