[link]
The Slot Attention module maps from a set of N input feature vectors to a set of K output vectors that we refer to as slots. Each vector in this output set can, for example, describe an object or an entity in the input. https://i.imgur.com/81nh508.png |
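As a reference, here is a minimal PyTorch sketch of the Slot Attention update (softmax over the slot axis so that slots compete for input features, a weighted mean, then a GRU update); the residual MLP of the full algorithm is omitted and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal sketch of a Slot Attention module (dimensions are assumptions)."""
    def __init__(self, num_slots=4, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters = num_slots, iters
        self.scale = dim ** -0.5
        # Slots are sampled from a learned Gaussian at the start of each forward pass.
        self.slots_mu = nn.Parameter(torch.zeros(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inputs):                              # inputs: (B, N, dim)
        B, dim = inputs.shape[0], inputs.shape[-1]
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, dim, device=inputs.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
            attn = attn.softmax(dim=1) + 1e-8               # softmax over slots: competition
            attn = attn / attn.sum(dim=-1, keepdim=True)    # weighted mean over inputs
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, dim),
                             slots.reshape(-1, dim)).reshape(B, self.num_slots, dim)
        return slots                                        # (B, K, dim)
```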
[link]
This paper tackles zero-shot action recognition with a cluster-based representation. Concretely, it uses the REINFORCE algorithm (a policy-gradient reinforcement learning method) to optimize the cluster centroids, with the classification scores serving as the reward signal. https://i.imgur.com/gWyJLX0.png |
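A hedged sketch of the idea (not the paper's exact algorithm): cluster assignment is treated as a stochastic action, and the classification score acts as the reward in a REINFORCE update of the centroids. The shapes, the distance-based policy, and the extra cross-entropy term are assumptions.

```python
import torch
import torch.nn.functional as F

K, D, C = 8, 128, 10                                   # clusters, feature dim, classes (placeholders)
centroids = torch.randn(K, D, requires_grad=True)
classifier = torch.nn.Linear(D, C)
optimizer = torch.optim.SGD([centroids, *classifier.parameters()], lr=1e-2)

def reinforce_step(features, labels):
    # features: (B, D) video features, labels: (B,) class ids
    logits = -torch.cdist(features, centroids)          # policy over clusters: closer = likelier
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                              # sampled cluster per example
    scores = classifier(centroids[actions])              # classify the chosen centroid
    reward = scores.softmax(-1)[torch.arange(len(labels)), labels].detach()
    loss = -(dist.log_prob(actions) * reward).mean()     # REINFORCE term
    loss = loss + F.cross_entropy(scores, labels)        # keep the classifier trained too
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```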
[link]
This paper is about Multimodal Large Language Models (MLLMs). The authors propose an MLLM called KOSMOS-1 that can do instruction following, VQA, IQ testing, visual dialog, etc. https://i.imgur.com/9P3Vuse.png https://i.imgur.com/HcYtbdD.png The model is trained on image-caption pairs and interleaved image-text data. https://i.imgur.com/LL4HiM3.png The input data is fed into an embedding module that encodes it into vectors, and these vectors are passed to a Transformer decoder, which predicts the next token based on the previous context. I think this large model's downside is that it can only predict text, not images. |
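A toy sketch of the data flow described above, assuming precomputed image features: image and text embeddings are interleaved into one sequence and a causal Transformer predicts the next token. All module names and sizes here are illustrative assumptions, not KOSMOS-1's actual configuration.

```python
import torch
import torch.nn as nn

vocab, d_model = 32000, 512
tok_emb = nn.Embedding(vocab, d_model)
img_proj = nn.Linear(768, d_model)                 # projects precomputed image features
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=4)   # used causally via the mask below
lm_head = nn.Linear(d_model, vocab)

def next_token_logits(text_ids, image_feats):
    # text_ids: (B, T) token ids, image_feats: (B, M, 768) features from a vision encoder
    seq = torch.cat([img_proj(image_feats), tok_emb(text_ids)], dim=1)   # (B, M+T, d)
    L = seq.shape[1]
    causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    h = decoder(seq, mask=causal_mask)             # each position only sees its past
    return lm_head(h[:, -1])                       # logits for the next token
```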
[link]
This paper aims to reduce gender bias in captioning models. Traditional captioning models tend to rely on contextual cues, so they often predict incorrect gender words for images that contain people. To reduce this bias, the authors introduce an $Equalizer$ model with two losses: (1) Appearance Confusion Loss: when it is hard to tell whether the image contains a man or a woman, the model should assign similar probabilities to both. To define this loss, they first define a confusion function, which measures how strongly the next predicted word favors the set of woman words versus the set of man words. https://i.imgur.com/oI6xswy.png Here, $\tilde{w}_{t}$ is the next predicted word, $G_{w}$ is the set of woman words, and $G_{m}$ is the set of man words. The loss is then the usual cross-entropy loss multiplied by the confusion function. https://i.imgur.com/kLpROse.png (2) Confident Loss: when it is easy to recognize a man or a woman in the image, this loss encourages the model to predict gender words correctly. They also define in-confidence functions, one for man words and one for woman words, which have the same form. https://i.imgur.com/4stFjac.png If the model is confident when predicting a gender (e.g., woman), the value of the in-confidence function for woman words should be low. The confident loss is then defined as follows: https://i.imgur.com/1pRgDir.png |
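To make the two ingredients concrete, here is a rough sketch of how one might implement the confusion and in-confidence functions over the model's next-word distribution. The exact functional forms in the paper may differ, and the index sets `woman_ids` / `man_ids` are placeholders.

```python
import torch

def confusion(probs, woman_ids, man_ids):
    # probs: (B, V) next-word distribution; high value = model strongly prefers one gender,
    # low value = model is "confused" (assigns similar mass to both gender word sets).
    p_w = probs[:, woman_ids].sum(-1)
    p_m = probs[:, man_ids].sum(-1)
    return (p_w - p_m).abs()

def in_confidence(probs, target_ids, other_ids, eps=1e-6):
    # Low when the model is confident about `target_ids` relative to `other_ids`.
    return probs[:, other_ids].sum(-1) / (probs[:, target_ids].sum(-1) + eps)
```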
[link]
This paper proposes a method to locate an object in an image given a sentence describing objects in that image, and then to learn an embedding for a new visual concept from two graphs: (1) a graph describing the relationships between the concepts mentioned in a supplemental sentence, and (2) a graph describing the relationships between the detected object in the image and example images of the objects in the supplemental sentence. This embedding can be used for many downstream tasks such as visual entailment and visual reasoning. For example, in the domain-1 image the novel visual concept is "red", and the model can locate the red cube in the image (1a). Then, in (1b), the model can interpret the supplemental sentences that relate the novel concept to other concepts. https://i.imgur.com/yBIteYT.png To locate the box describing the object, this work uses Mask R-CNN to first detect the objects in the scene, and then a neuro-symbolic program to match the objects mentioned in the input sentence with the objects detected by Mask R-CNN. https://i.imgur.com/2cG9IUX.png To learn the concept embedding for that object, the method needs a supplemental sentence describing several objects, all of which are known concepts except for one novel concept. Two graphs are then built: $GNN_{concept}$ and $GNN_{example}$. $GNN_{concept}$ is a graph representing the relationships between the known concepts and the novel concept; for example, in this graph, "White-eyed Vireo" is the new concept. https://i.imgur.com/1LjirJz.png $GNN_{example}$ is a graph representing the relationships between the detected object corresponding to the novel concept and example images of that concept. https://i.imgur.com/SdR74Vu.png The concept embedding for the novel concept is then learned from these two graphs. https://i.imgur.com/YGYEPvc.png https://i.imgur.com/VXOBn6n.png |
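As a very loose illustration of learning a novel-concept embedding from the two graphs, here is a generic message-passing sketch that fuses the novel-concept node from a concept graph and from an example graph; the aggregation rules, graph construction, and dimensions are all assumptions rather than the paper's actual GNNs.

```python
import torch
import torch.nn as nn

class ConceptGNNLayer(nn.Module):
    """One generic message-passing layer over a normalized adjacency matrix."""
    def __init__(self, dim=256):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, dim), adj: (N, N) normalized adjacency
        messages = adj @ self.msg(node_feats)
        return self.update(messages, node_feats)

concept_layer, example_layer = ConceptGNNLayer(), ConceptGNNLayer()
fuse = nn.Linear(512, 256)

def novel_concept_embedding(concept_feats, concept_adj, example_feats, example_adj, idx=0):
    # Node `idx` is assumed to hold the novel concept in both graphs.
    h_c = concept_layer(concept_feats, concept_adj)[idx]
    h_e = example_layer(example_feats, example_adj)[idx]
    return fuse(torch.cat([h_c, h_e], dim=-1))
```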
[link]
What is the paper doing? This paper proposes a way to explain a model's decision with human-readable concepts. For example, if the model thinks the following image is a black-throated sparrow, a human can understand this decision via the input descriptors. https://i.imgur.com/xVleDhp.png The descriptors were obtained from GPT-3: they collected 500 descriptors per class and then removed the class name from each descriptor. Then, for each class, they chose $k$ concepts so that every class has the same number of concepts. These concepts are passed into a concept-selection module that selects a more fine-grained subset of concepts per class. The selected concepts and the image are then fed into CLIP to compute a score for each concept. Finally, a class-concept weight matrix on top of CLIP fine-tunes these scores and outputs the predicted class name. Note that this weight matrix is initialized with language priors. https://i.imgur.com/r9Op5Lm.png |
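A minimal sketch of the scoring pipeline using the OpenAI `clip` package: the image is scored against each concept, and a class-concept weight matrix maps the concept scores to class logits. The concept list, number of classes, and random weight initialization below are placeholders (the paper selects concepts with a dedicated module and initializes the matrix with language priors).

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concepts = ["a black throat patch", "a white stripe above the eye", "a long notched tail"]
num_classes = 200                                   # e.g. bird classes (assumption)

with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(concepts).to(device)).float()
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

class_concept_W = torch.nn.Linear(len(concepts), num_classes, bias=False).to(device)

def classify(image):                                # image: PIL.Image
    x = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(x).float()
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    concept_scores = img_feat @ text_feat.T         # (1, num_concepts) CLIP similarities
    return class_concept_W(concept_scores)          # (1, num_classes) class logits
```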
[link]
The paper proposes a new object-detection method that detects novel classes via conditional matching. The detector can be conditioned on either an image or text, meaning a user can provide an image or a text query and the model will detect the corresponding bounding boxes in the picture. The model has two changes compared to other open-vocabulary detectors: 1) Other detectors rely on a Region Proposal Network (RPN), which cannot cover all objects in a picture and therefore hurts performance on novel objects. In this work, they instead use CLIP-conditioned queries: the queries read the whole picture, so they can cover many objects, which works better than an RPN for detecting novel objects. https://i.imgur.com/GqvvSVs.png 2) Other detectors rely on bipartite matching to match class label names with detected bounding boxes. The downside of bipartite matching is that it cannot match novel objects to any label name, because novel objects have no labels. In this work, they propose conditional matching, which turns the problem into a binary matching problem: given the condition, each detected object is assigned either a "matched" or a "not matched" label. https://i.imgur.com/FjI2iub.png |
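A loose sketch of how conditional matching could look, assuming DETR-style queries and a CLIP embedding of the condition: queries are conditioned on the embedding, and matching only decides "matched" vs. "not matched" for that condition. The cost terms and the additive conditioning are assumptions, not the paper's exact formulation.

```python
import torch

def condition_queries(queries, clip_embed):
    # queries: (Q, D) learned object queries, clip_embed: (D,) text or image embedding
    return queries + clip_embed.unsqueeze(0)            # simple additive conditioning

def binary_matching_cost(pred_boxes, pred_match_logits, gt_boxes):
    # pred_boxes: (Q, 4), pred_match_logits: (Q,), gt_boxes: (G, 4) boxes of the conditioned class
    match_prob = pred_match_logits.sigmoid()             # probability of "matched"
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)    # (Q, G) L1 box distance
    cls_cost = -match_prob.unsqueeze(1)                  # prefer confident "matched" queries
    # Total cost, e.g. passed to scipy.optimize.linear_sum_assignment for the assignment.
    return box_cost + cls_cost
```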
[link]
This paper proposes a way to do classification using primitive concepts such as color, shape, and texture. The framework is simple and has two sub-models: (1) a pre-trained VL model such as CLIP, ViLT, or ALBEF, which takes the primitive (attribute) concepts and an image as input and outputs a score for each concept; (2) a linear model that uses the concepts and their scores to do classification. This second model is trained in a supervised manner. https://i.imgur.com/7WMmGyv.png |
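Since the second sub-model is just a supervised linear classifier over concept scores, a minimal sketch looks like the following; the concept scoring by the frozen VL model is abstracted into the `concept_scores` input, and all sizes are placeholders.

```python
import torch
import torch.nn as nn

num_concepts, num_classes = 64, 10                   # placeholder sizes
linear_head = nn.Linear(num_concepts, num_classes)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(concept_scores, labels):
    # concept_scores: (B, num_concepts) similarities produced by the frozen VL model
    logits = linear_head(concept_scores)
    loss = criterion(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```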
[link]
This paper proposes a way to leverage pre-trained vision and language encoders for VL tasks such as VQA and image captioning. To obtain a good VL model, the modality gap must be reduced. The authors propose a Q-Former, a Transformer module that is first trained with a frozen image encoder, and then trained with both the frozen image encoder and a frozen large language model. https://i.imgur.com/rQ3V3oQ.png The Q-Former is trained in two stages because: (1) training with the frozen image encoder teaches it to extract the most informative visual features https://i.imgur.com/gshAy1p.png; (2) training with the frozen language model teaches it to extract the visual features most relevant to the text. https://i.imgur.com/gPz40GC.png |
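A toy sketch of the Q-Former idea, assuming the frozen image encoder outputs a sequence of features: a set of learned query tokens cross-attends to those features, and the outputs are projected toward the language model. The real BLIP-2 Q-Former is a BERT-style module with interleaved self- and cross-attention; sizes here are placeholders.

```python
import torch
import torch.nn as nn

d_model, num_queries = 256, 32
query_tokens = nn.Parameter(torch.randn(1, num_queries, d_model))
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
proj_to_llm = nn.Linear(d_model, 4096)               # placeholder LLM hidden size

def qformer_forward(frozen_image_feats):
    # frozen_image_feats: (B, N, d_model) from the frozen image encoder (no gradient)
    B = frozen_image_feats.shape[0]
    q = query_tokens.expand(B, -1, -1)
    out, _ = cross_attn(q, frozen_image_feats, frozen_image_feats)
    return proj_to_llm(out)                           # (B, num_queries, llm_dim) soft prompts
```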
[link]
This paper aims to mitigate scene bias in action recognition. Scene bias means the model focuses only on scene or object information without paying attention to the actual activity. To mitigate this issue, the authors propose two additional losses: (1) a scene adversarial loss that helps the network learn features that are suitable for the action but invariant to the scene type, thereby reducing scene bias; (2) a human-mask confusion loss that prevents the model from predicting the correct action label for a video when no person is visible, so the model cannot predict the action from the surrounding scene alone. https://i.imgur.com/BBfWE17.png To mask out the person in a video, they use a human detector and then mask the detected person. In the diagram above there is a gradient reversal layer, which works as follows: in the forward pass the output equals the input; in the backward pass the gradient is multiplied by -1. https://i.imgur.com/hif9ZL9.png This layer comes from domain adaptation, where the goal is to make the source and target feature distributions indistinguishable. Analogously, in this work the learned features should carry no scene information, which is why the action classifier and the scene classifier are trained adversarially. https://i.imgur.com/trNJGlm.png With the gradient reversal layer, the action predictor is trained to predict the action labels of the training instances, while the feature extractor is trained to minimize the action classification loss and maximize the scene classification loss. As a result, the action features become scene-agnostic. |
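The gradient reversal layer is simple enough to write down directly; here is a minimal PyTorch version consistent with the description above (identity in the forward pass, negated gradient in the backward pass), with an optional scaling factor `lambd`.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)                     # identity forward

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reversed (and scaled) gradient

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: features -> action head (normal gradient);
# features -> grad_reverse -> scene head, so the backbone learns to fool the scene classifier.
```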
[link]
Open-vocabulary semantic segmentation generates semantic segmentation regions from text descriptions. Thanks to the text descriptions, the model can segment unseen objects that were not observed during training. Some works use two-stage methods that first create class-agnostic segments and then use CLIP to assign each segment to a phrase. https://i.imgur.com/eyME6i1.png To compute the prediction for an image, they ensemble two types of prediction scores. (1) To classify a mask into $K$ classes, first encode the $K$ class names into phrase embeddings $t_{k}$ and encode the mask into a visual embedding $v_{i}$, then compute the score as a temperature-scaled softmax over the similarities: $p_{k} = \frac{\exp(\mathrm{sim}(v_{i}, t_{k})/\tau)}{\sum_{k'}\exp(\mathrm{sim}(v_{i}, t_{k'})/\tau)}$. (2) Another way to classify a mask into $K$ classes is to feed the masked image into the CLIP vision encoder and classify it against the $K$ class embeddings to get a score $p'_{k}$. The final prediction is the ensemble of these two scores, $p_{k}^{\mathrm{final}} = p_{k}^{(1-\lambda)} \cdot {p'}_{k}^{\lambda}$ with $\lambda \in [0,1]$. But CLIP does not work well on masked images (segments), because it was trained on full natural images. A critical problem with masked images is that they contain blank areas; when these areas are fed into CLIP they become zero tokens, and according to the paper these tokens not only carry no information but also introduce a distribution shift. In this work, CLIP is made to work well on masked images by replacing these zero tokens with learnable tokens, which they call mask prompts. https://i.imgur.com/muhdGxP.png |
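A small sketch of the two-score ensemble, assuming `v`, `t`, and `clip_probs` have already been computed; the variable names, temperature, and the value of λ are placeholders.

```python
import torch

def ensemble_scores(v, t, clip_probs, temperature=0.01, lam=0.7):
    # v: (D,) mask embedding from the segmentation model
    # t: (K, D) class-name embeddings, clip_probs: (K,) CLIP probabilities for the masked crop
    v = v / v.norm()
    t = t / t.norm(dim=-1, keepdim=True)
    p1 = torch.softmax((t @ v) / temperature, dim=-1)   # score (1): softmax over cosine sims
    p2 = clip_probs                                     # score (2): CLIP classification
    return p1.pow(1 - lam) * p2.pow(lam)                # geometric ensemble p1^(1-λ) * p2^λ
```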
[link]
Visual question answering models cannot handle the object-counting problem properly. In this paper, the authors identify the soft attention module as the reason and propose a module that can produce reliable counts from object proposals. There are two challenges in VQA counting: (1) there is no ground-truth label for the objects to be counted, and (2) the additional module should not affect performance on non-counting questions. Why soft attention is bad for counting: consider the task of counting cats in two images: an image of a cat, and an image consisting of two side-by-side copies of the first image. For image 1, after the softmax normalization in the attention module, the cat receives a normalized weight of 1. For image 2, each cat receives a weight of 0.5. The attention module then computes a weighted sum to produce an attention feature vector. Because the weighted sum averages the two cats in the second image back to the equivalent of a single cat, the attention feature vectors of the two images are identical; the information about possible counts is lost by using the attention map. Counting component: this component is in charge of counting the objects in an image. It has two things to do: 1) a differentiable mechanism for counting from attention weights, and 2) handling overlapping object proposals to reduce double-counting. The counting component is as follows: https://i.imgur.com/xVGcaov.png Note that intra-object edges connect duplicate proposals of the same underlying object, while inter-object edges connect proposals of different objects of the same class. The figure has three main elements: (1) object proposals (four vertices), where the black vertices are relevant objects and the white ones are irrelevant; (2) intra-object edges between duplicate proposals; and (3) blue edges marking the inter-object duplicate edges. In the end, only one edge and two vertices (the two relevant objects) remain. In more detail, there are four main steps: (1) Input: the component takes n attention weights $a = [a_{1}, a_{2},...,a_{n}]^{T}$ and their corresponding boxes $b = [b_{1}, ..., b_{n}]^{T}$. (2) Deduplication: the attention weights define a graph $A = aa^{T}$ (the attention matrix) whose vertices are the bounding-box proposals; ideally $a_{i} = 1$ if the $i$-th bounding box is relevant and $a_{i} = 0$ otherwise. The counting component then removes duplicate edges until the subgraph over the relevant proposals is a complete directed graph with self-loops. For example, if $[a_{1}, a_{2}, a_{3}, a_{4}, a_{5}] = [1,0,1,0,1]$, the subgraph containing $a_{1}$, $a_{3}$, and $a_{5}$ is a complete directed graph, as follows: https://i.imgur.com/cCKIQ0K.png The illustration for this graph is as follows: https://i.imgur.com/x93gk8c.png We then eliminate two kinds of duplicate edges: (1) intra-object edges and (2) inter-object edges. 1. Intra-object edges: first, we eliminate intra-object edges. To do this, we compute the distance matrix $D$ with $D_{ij} = 1 - IoU(b_{i}, b_{j})$. If $D_{ij}$ is close to 0, the two bounding boxes overlap heavily and are likely duplicate proposals of a single object, so the edge between them should be eliminated. Multiplying the attention matrix $A$ element-wise with $D$ removes the connections between duplicate proposals of a single object. https://i.imgur.com/TQAvAnW.png 2. Inter-object edges: second, we eliminate inter-object edges. The main idea is to merge the duplicate proposals of each object into one. To do this, the weights of the associated edges are scaled down; for example, if an object has two proposals, the edges involving those proposals should be scaled by 0.5. Essentially, this averages the proposals within each underlying object, since only the sum of edge weights is used to compute the final count. https://i.imgur.com/4An0BAj.png |
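A rough sketch of the deduplication step, assuming boxes in (x1, y1, x2, y2) format; the inter-object merging is only acknowledged in a comment, and the final count readout is a simplification of the paper's mechanism.

```python
import torch

def pairwise_iou(boxes):
    # boxes: (n, 4) in (x1, y1, x2, y2)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    return inter / (area[:, None] + area[None, :] - inter + 1e-6)

def soft_count(attn, boxes):
    # attn: (n,) attention weights in [0, 1], boxes: (n, 4)
    A = attn[:, None] * attn[None, :]                   # A = a a^T, proposal-pair graph
    D = 1.0 - pairwise_iou(boxes)                       # ~0 for duplicates of one object
    A_dedup = A * D + torch.diag(torch.diagonal(A))     # drop intra-object edges, keep self-loops
    # The real component also merges inter-object duplicates by scaling their edges;
    # without that step this is only a rough count estimate.
    return A_dedup.sum().clamp(min=0).sqrt()            # C objects -> ~C^2 edge mass -> count C
```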
[link]
The Transformer was proposed to capture long-range information with the self-attention mechanism, but it comes with quadratic computation cost and lacks multi-resolution information. The Swin Transformer then introduced local window self-attention to reduce the cost to linear w.r.t. image size, shifted-window attention to capture cross-window information, and a hierarchical architecture to exploit multi-resolution information. However, shifted-window attention struggles to capture long-range information due to its small coverage area, and it lacks inductive bias, like ViT. Global Context ViT is proposed to address these limitations of the Swin Transformer. Improvements: (1) Unlike the Swin Transformer, this paper uses global context self-attention together with local self-attention, rather than shifted-window self-attention, to model both long- and short-range dependencies. (2) Although global window attention is still a window attention, it leverages global query tokens that carry global information, and hence captures long-range dependencies. (3) In addition, the paper compensates for the lack of inductive bias in both ViTs and Swin Transformers by using a CNN-based module. Key components: Stem/PatchEmbed: a stem/patchify layer processes the image at the beginning of the network; it creates patches/tokens and converts them into embeddings. Level: the repetitive building block that extracts features using the different blocks below. Global Token Gen./FeatExtract: generates global tokens/patches with a depthwise CNN, SE (squeeze-and-excitation), CNN, and MaxPooling, so it is basically a feature extractor. Block: the repetitive module that applies attention to the features and projects them to a certain dimension. Local-MSA: local multi-head self-attention. Global-MSA: global multi-head self-attention. MLP: a linear layer that projects a vector to another dimension. Downsample/ReduceSize: very similar to the Global Token Gen. module, except it uses a CNN instead of MaxPooling to downsample, with additional layer-normalization modules. Head: the module responsible for the classification task. Pooling: converts N×2D features to N×1D features. Classifier: processes the N×1D features to make a decision about the class. I annotated the architecture like this to make it easier to digest: https://i.imgur.com/bTqIUH2.png |
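To make the "global query" idea concrete, here is a toy sketch of window attention where the queries come from globally pooled tokens while the keys/values come from a local window; this is only an illustration of the idea, not the exact GCViT block, and all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class GlobalWindowAttention(nn.Module):
    """Cross-attention from global query tokens to local window tokens (illustrative only)."""
    def __init__(self, dim=96, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, window_tokens, global_tokens):
        # window_tokens: (num_windows*B, W, dim) local keys/values for one window
        # global_tokens: (num_windows*B, G, dim) globally pooled queries, repeated per window
        out, _ = self.attn(global_tokens, window_tokens, window_tokens)
        return out            # each global query aggregates information from the local window
```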
[link]
This work enforces vision-language pretraining models to comprehend events and their associated argument (participant) roles. https://i.imgur.com/TH7cOfZ.png To achieve this, they created a framework with 3 steps: https://i.imgur.com/8fpOA1r.png (1) Event structural knowledge extraction, including (a) text extraction: using a SOTA text information-extraction system to extract events and their arguments (e.g., agent, entity, instrument); (b) image extraction: using a Faster R-CNN trained on Open Images to detect objects; (c) primary event detection: the primary event is the event that is closest to the root of the dependency parse tree, has the largest number of arguments, the highest event-type frequency, and the highest CLIP similarity between its trigger word and the image. (2) Event-structure-driven negative sampling: the negatives and positives help the text and vision encoders learn robust features (the encoders can learn why they are wrong and why they are correct). They use three types of negatives: (a) negative event sampling: compute the confusion matrix over event types and select the top one as the predicted event type; event types whose visual features are easily confused with the primary event type become the negative events. (b) Negative argument sampling: if there are multiple roles, they perform a right-rotation of the argument-role sequence to get negative argument samples; if the event has only one argument, they instead rely on the confusion matrix of the text argument-extraction system to pick a confusable role. (c) Description generation: to encode positive and negative event structures, they use multiple prompt functions, such as single template-based prompts, composed template-based prompts, continuous prompts, and caption editing, and then feed 5 manual event-description examples into GPT-3 to obtain a fine-grained event description. https://i.imgur.com/fPo0UpH.png https://i.imgur.com/vIWv4lc.png (3) Event graph alignment via optimal transport: each event and its arguments can be organized as a graph. Encoding event-graph structures enables the model to capture the interactions between events and arguments; for example, the injured man should be aligned with the ENTITY being transported rather than the AGENT. https://i.imgur.com/NiWfNe4.png There are 3 types of alignment: (a) image-level alignment: computes the cosine similarity $s(t,i)$ and distance $d(t,i)$ between the text $t$ and the image $i$; (b) entity-level alignment: computes the cosine similarity between a text entity $t_{e}$ and an image object $i_{o}$, where $t_{e}$ is the text mention of entity $e$ and its embedding is contextualized on the sentence, encoded with the text Transformer and average-pooled over the tokens of the entity mention; similarly, $i_{o}$ is the bounding box of object $o$ and its embedding is contextualized on the image, obtained by average pooling the vision-Transformer representations of the patches covered by the bounding box; (c) event-level alignment: to obtain a global alignment score based on the structures of the two graphs, they use OT to get the minimal distance $d(G_{t}, G_{i})$ between the text event graph $G_{t}$ and the image event graph $G_{i}$. Finally, the whole framework is trained with contrastive learning. |
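Since the event-level alignment relies on optimal transport, here is a small self-contained Sinkhorn sketch that computes a transport plan and an alignment distance between the node embeddings of two graphs. The cost function (1 - cosine similarity) and the uniform node masses are assumptions, and the paper's graph-structure terms are omitted.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    # cost: (n, m) pairwise transport costs
    n, m = cost.shape
    K = torch.exp(-cost / eps)                                   # Gibbs kernel
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)  # uniform node masses
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    return u[:, None] * K * v[None, :]                           # transport plan

def graph_alignment_distance(text_nodes, image_nodes):
    # text_nodes: (n, d), image_nodes: (m, d) node embeddings of the two event graphs
    t = text_nodes / text_nodes.norm(dim=-1, keepdim=True)
    i = image_nodes / image_nodes.norm(dim=-1, keepdim=True)
    cost = 1.0 - t @ i.T
    plan = sinkhorn(cost)
    return (plan * cost).sum()                                   # d(G_t, G_i)
```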
[link]
This paper designs a generalized multimodal architecture that can solve a wide range of vision-language tasks. Concretely, the model is pre-trained on 4 main tasks (MLM, ITM, WRA, MRM) and evaluated on various downstream tasks (VQA, VCR, NLVR). https://i.imgur.com/IG7suDj.png As shown in Fig 1, UNITER first encodes image regions (visual features and bounding-box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder. Then a Transformer module is applied to learn generalizable contextualized embeddings for each region and each word. The contribution is two-fold: (1) masked language/region modeling is conditioned on the full observation of the other modality (image/text), rather than applying joint random masking to both modalities; (2) a novel WRA pre-training task uses Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions. Intuitively, OT-based learning optimizes distribution matching by minimizing the cost of transporting one distribution to another. In this context, the goal is to minimize the cost of transporting the embeddings from image regions to the words in a sentence (and vice versa), thus optimizing toward better cross-modal alignment. |
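A tiny sketch of what the conditional masking contribution means in practice, assuming BERT-style token ids: only the text side is masked while the image regions stay fully observed (the symmetric masked-region case would mask regions while keeping the text intact). The mask id and shapes are placeholders.

```python
import torch

MASK_ID = 103                                       # e.g. BERT's [MASK] id (assumption)

def mask_text_only(text_ids, region_feats, mask_prob=0.15):
    # text_ids: (B, T) token ids; region_feats: (B, R, D) image regions, left fully observed
    mask = torch.rand_like(text_ids, dtype=torch.float) < mask_prob
    masked_ids = text_ids.masked_fill(mask, MASK_ID)
    return masked_ids, region_feats, mask           # `mask` marks the positions to predict
```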
[link]
This paper aims to learn a sparse semantic representation of texts and images, instead of the dense representations learned by CLIP or ALIGN. The sparse embeddings are obtained as follows: (1) For an input (image or text), extract features $h$ with a Transformer, where $h_{j}$ corresponds to the $j$-th token of the input. (2) Each token embedding $h_{j}$ is mapped to $p(h_{j})$ in the vocabulary space $V$ using a mapping function (in this paper, the BERT masked-language-model (MLM) head), so each $p(h_{j})$ assigns a weight to every token in the vocabulary $V$. (3) A max-pooling layer over the sequence is applied to the $p(h_{j})$ to get one value per vocabulary token. In the end, we obtain a sparse vector living in the $|V|$-dimensional space. https://i.imgur.com/BTvndLR.png Training: to achieve the two goals of (1) aligning text and images in the sparse embedding space and (2) grounding the sparse vector in human-understandable words of the vocabulary, they propose 3-stage training: Stage 1: training the image embedding with masked tokens. In the first stage, they co-train both the image and text encoders and apply a binary mask on the text embedding. By matching against the masked text embedding, the image encoder learns to ground its image embedding on the tokens of the paired text. Therefore, after stage-1 training, the image embedding lives in the vocabulary's interpretable space. Stage 2: training with a frozen image encoder. In this stage, they focus on grounding the text embedding in the same interpretable space in which the image embedding was trained to reside in stage 1. The key idea is to let the image encoder teach the text encoder as a teacher model. After stage-2 training, both image and text embeddings lie in the same human-interpretable embedding space constructed by the vocabulary. Stage 3: fine-tuning both encoders jointly to further boost image-text matching performance. https://i.imgur.com/PWrEbkk.png To further encourage sparsity, they use a FLOPs regularization loss so that only a small number of token embeddings in $V$ are non-zero. |
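A minimal sketch of the sparse-embedding head described above: an MLM-style projection into vocabulary space, a max-pool over the sequence, and a FLOPs-style regularizer. The log1p/ReLU activation and all sizes are assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 30522, 768                       # placeholder sizes
mlm_head = nn.Linear(hidden, vocab_size)              # stand-in for the BERT MLM head

def sparse_embedding(h):
    # h: (B, T, hidden) token features from the text or image Transformer
    logits = mlm_head(h)                                          # (B, T, V): p(h_j) per token
    return torch.log1p(torch.relu(logits)).max(dim=1).values      # (B, V) max-pool over tokens

def flops_regularizer(sparse_batch):
    # Encourages each vocabulary dimension to be zero for most examples in the batch.
    mean_activation = sparse_batch.mean(dim=0)                    # (V,)
    return (mean_activation ** 2).sum()
```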