Welcome to ShortScience.org!
The Slot Attention module maps a set of N input feature vectors to a set of K output vectors that we refer to as slots. Each vector in this output set can, for example, describe an object or an entity in the input. Figure: https://i.imgur.com/81nh508.png
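A minimal numpy sketch of the N-inputs-to-K-slots mapping described above. This is a simplification for illustration: the actual module (Locatello et al.) also uses LayerNorm, a learned GRU update, and a residual MLP, and the projection matrices here are randomly initialized stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Map N input feature vectors to K slot vectors (simplified sketch).

    inputs: array of shape (N, d). Returns slots of shape (num_slots, d).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Stand-ins for learned query/key/value projections.
    W_q = rng.normal(scale=d**-0.5, size=(d, d))
    W_k = rng.normal(scale=d**-0.5, size=(d, d))
    W_v = rng.normal(scale=d**-0.5, size=(d, d))
    slots = rng.normal(size=(num_slots, d))   # slots sampled at init
    k, v = inputs @ W_k, inputs @ W_v
    for _ in range(iters):
        q = slots @ W_q
        # Softmax over the slot axis: slots compete to explain each input.
        attn = softmax(q @ k.T / np.sqrt(d), axis=0)
        # Normalize over inputs so each slot takes a weighted mean.
        attn = attn / attn.sum(axis=1, keepdims=True)
        slots = attn @ v                      # paper uses a GRU update here
    return slots
```

The key design choice is the axis of the softmax: normalizing over slots (rather than over inputs, as in standard attention) forces the slots to compete for input features, which is what lets each slot bind to a distinct object.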
This paper tackles zero-shot action recognition using a cluster-based representation. Concretely, it uses the REINFORCE algorithm, a reinforcement learning method, to optimize the cluster centroids, with the classification scores serving as the reward signal. Figure: https://i.imgur.com/gWyJLX0.png
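A hedged sketch of how REINFORCE can optimize centroids with a classification score as reward, as described above. Everything here is hypothetical scaffolding, not the paper's implementation: I treat the assignment of a feature to a centroid as a stochastic action drawn from a softmax over negative distances, and apply the standard score-function gradient.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_centroid_step(x, centroids, reward_fn, lr=0.1, rng=None):
    """One REINFORCE update of cluster centroids (illustrative sketch).

    x: feature vector (d,); centroids: (K, d);
    reward_fn(action) -> scalar, e.g. a classification score.
    """
    rng = rng or np.random.default_rng(0)
    # Closer centroids get higher assignment probability.
    logits = -np.sum((centroids - x) ** 2, axis=1)
    probs = softmax(logits)
    a = rng.choice(len(centroids), p=probs)   # sample an assignment (action)
    r = reward_fn(a)                          # reward = classification score
    # Score-function gradient: d log pi(a|x) / d logits_k = 1{k=a} - probs[k],
    # and d logits_k / d c_k = 2 (x - c_k).
    grad_logits = -probs.copy()
    grad_logits[a] += 1.0
    grad = grad_logits[:, None] * 2.0 * (x - centroids)
    return centroids + lr * r * grad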
This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can perform instruction following, VQA, IQ testing, visual dialogue, etc. Figures: https://i.imgur.com/9P3Vuse.png https://i.imgur.com/HcYtbdD.png The model's training input consists of image-caption pairs and interleaved image-text data. Figure: https://i.imgur.com/LL4HiM3.png The input data are fed into an embedding module that encodes them into vectors, and the vectors are then fed into a Transformer decoder, which predicts the next token given the previous context. In my view, this large model's downside is that it can only generate text, not images.
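The pipeline described above (embed interleaved image/text inputs, run a causal decoder, predict the next token) can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions: real KOSMOS-1 uses a learned vision encoder and a full multi-layer Transformer, whereas here the "decoder" is a single parameter-free causal self-attention step and all arrays are hypothetical stand-ins.

```python
import numpy as np

def causal_self_attention(x):
    """One causal attention step: each position attends only to itself
    and earlier positions (no peeking at future tokens)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -1e9                              # block future positions
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

def next_token_logits(text_ids, image_feats, tok_emb, out_proj):
    """Embed interleaved image features and text tokens into one
    sequence, run the causal decoder, and score the next token."""
    seq = np.concatenate([image_feats, tok_emb[text_ids]], axis=0)
    h = causal_self_attention(seq)
    return h[-1] @ out_proj       # logits for the token after the context
```

The essential idea is that once images are embedded into the same vector space as text tokens, the decoder treats the whole interleaved sequence uniformly and simply continues it with next-token prediction.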
This paper aims to reduce gender bias in captioning models. Traditional captioning models tend to rely on contextual cues, so they often predict incorrect gender words for images that contain people. To reduce this bias, the authors introduce a new $Equalizer$ model with two losses: (1) Appearance Confusion Loss: when it is hard to tell whether the person in the image is a man or a woman, the model should assign similar probabilities to predicting a man or a woman. To define this loss, they first define a confusion function, which indicates how likely the next predicted word is to belong to a set of woman words or a set of man words: https://i.imgur.com/oI6xswy.png where $\tilde{w}_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, and $G_{m}$ is the set of man words. The loss is then the standard cross-entropy loss multiplied by the confusion function: https://i.imgur.com/kLpROse.png (2) Confident Loss: when it is easy to recognize a man or a woman in an image, this loss encourages the model to predict gender words correctly. Here they also define in-confidence functions, one for man words and one for woman words, both of the same form: https://i.imgur.com/4stFjac.png This function says that if the model is confident when predicting a gender (e.g., woman), then the value of the in-confidence function for woman words should be low. The confident loss is then defined as follows: https://i.imgur.com/1pRgDir.png
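Following the description above, the Appearance Confusion Loss can be sketched as cross-entropy weighted by how gender-confident the next-word distribution is. This is a rough reading of the summary, not the paper's exact formulation: I take "confusion" to be the absolute gap between the probability mass on woman words and on man words, so a balanced (appropriately uncertain) prediction drives the loss toward zero.

```python
import numpy as np

def appearance_confusion_loss(probs, target_idx, woman_ids, man_ids):
    """Sketch of a confusion-weighted cross-entropy loss.

    probs: next-word distribution over the vocabulary (sums to 1);
    woman_ids / man_ids: vocabulary indices of the gendered word sets.
    """
    p_w = probs[woman_ids].sum()            # mass on woman words
    p_m = probs[man_ids].sum()              # mass on man words
    gap = abs(p_w - p_m)                    # small gap = model is "confused"
    ce = -np.log(probs[target_idx] + 1e-12)
    # Weighting CE by the gap means a genuinely uncertain (balanced)
    # prediction is not penalized on gender-ambiguous images.
    return gap * ce
```

The Confident Loss would be the mirror image: on images where gender is visually clear, it down-weights the loss only when the model is already confident in the correct gendered word.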
This paper proposes a method to locate an object given an image and a sentence describing objects in that image, and then to predict an embedding for a new visual concept from two graphs: (1) a graph describing the relationships between the objects mentioned in a supplemental sentence, and (2) a graph describing the relationship between the detected object in the image and example images of the objects in the supplemental sentence. This embedding can be used for many downstream tasks, such as visual entailment and visual reasoning. For example, in the domain 1 image, the new visual concept is "red", and the model can locate the red cube in the image (1a). Then, in (1b), the model can interpret the supplemental sentences that relate the novel concept to other concepts. https://i.imgur.com/yBIteYT.png To locate the box containing the object, this work uses Mask R-CNN to first detect the objects in the scene, then a neuro-symbolic program to match the objects mentioned in the input sentence with the objects detected by Mask R-CNN. https://i.imgur.com/2cG9IUX.png To learn the concept embedding for that object, the method needs a supplemental sentence describing several objects, all of which are known concepts except one novel concept. It then builds two graphs, $GNN_{concept}$ and $GNN_{example}$. $GNN_{concept}$ represents the relationships between the known concepts and the novel concept; in the example graph, "White-eyed Vireo" is the novel concept. https://i.imgur.com/1LjirJz.png $GNN_{example}$ represents the relationship between the detected object corresponding to the novel concept and example images of that concept. https://i.imgur.com/SdR74Vu.png The embedding for the novel concept is then learned from these two graphs. https://i.imgur.com/YGYEPvc.png https://i.imgur.com/VXOBn6n.png
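To make the graph side concrete, here is a generic one-round message-passing layer of the kind $GNN_{concept}$ and $GNN_{example}$ could use to refine the novel concept's embedding from its neighbors (known concepts, or example-image features). This is a standard mean-aggregation GNN sketch with hypothetical weight matrices, not the paper's architecture.

```python
import numpy as np

def gnn_layer(node_feats, adj, W_self, W_neigh):
    """One message-passing round: each node combines its own features
    with the mean of its neighbors' features.

    node_feats: (num_nodes, d); adj: (num_nodes, num_nodes) 0/1 matrix;
    W_self, W_neigh: (d, d) weight matrices (stand-ins for learned params).
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # avoid divide-by-zero
    neigh = (adj @ node_feats) / deg                   # mean over neighbors
    return np.tanh(node_feats @ W_self + neigh @ W_neigh)
```

After a few such rounds, the row corresponding to the novel-concept node has mixed in information from the related known concepts (or example images), which is the intuition behind reading off its embedding from the graph.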