This work trains vision-language pretraining models to comprehend events and their associated argument (participant) roles.

https://i.imgur.com/TH7cOfZ.png

To achieve this, the authors build a framework with 3 steps:

https://i.imgur.com/8fpOA1r.png

(1) Event structural knowledge extraction, which includes: (a) Text extraction: a SOTA text information extraction system extracts events and their argument roles (e.g. agent, entity, instrument). (b) Image extraction: a Faster R-CNN trained on Open Images detects objects. (c) Primary event detection: the primary event is the one that is closer to the root of the dependency parse tree, has a larger number of arguments, has a higher event type frequency, and whose trigger word has a higher CLIP similarity with the image.

(2) Event structure driven negative sampling: contrasting positives with hard negatives helps the text and vision encoders learn robust features (the encoders can learn why a description is wrong and why it is correct). This step has 3 parts: (a) Negative event sampling: a confusion matrix over event types is computed, taking the top-scoring type as the predicted event type; event types whose visual features are ambiguous with the primary event type are then used as negative events. (b) Negative argument sampling: if the event has multiple roles, the argument role sequence is right-rotated to obtain the negative argument samples. If the event has only one argument, the negative role is derived from the confusion matrix of the text argument extraction system. (c) Description generation: to encode positive and negative event structures, several prompt functions are used, such as single template-based prompts, composed template-based prompts, continuous prompts, and caption editing. In addition, 5 manually written event descriptions are given to GPT-3 as examples, and its output is a fine-grained event description.
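The right-rotation used for negative argument sampling in step (2b) can be sketched as follows. This is only an illustration, not the authors' code: the function name `rotate_roles_right` and the `(role, mention)` pair representation are assumptions made for this example, and it also assumes the roles in an event are distinct, so that every mention ends up with a wrong role.

```python
from typing import List, Tuple

Argument = Tuple[str, str]  # (role, mention); hypothetical representation


def rotate_roles_right(arguments: List[Argument]) -> List[Argument]:
    """Build a negative argument structure by right-rotating the role labels.

    Mentions stay in place; each role label moves one position to the right,
    and the last role wraps around to the first mention.
    """
    if len(arguments) < 2:
        # The single-argument case is handled differently in the summary
        # (via the confusion matrix of the argument extraction system).
        raise ValueError("rotation needs at least two arguments")
    roles = [role for role, _ in arguments]
    mentions = [mention for _, mention in arguments]
    rotated_roles = roles[-1:] + roles[:-1]
    return list(zip(rotated_roles, mentions))


if __name__ == "__main__":
    positive = [
        ("AGENT", "medics"),
        ("ENTITY", "injured man"),
        ("INSTRUMENT", "stretcher"),
    ]
    # The negative sample pairs every mention with a wrong role,
    # e.g. the injured man becomes the AGENT.
    print(rotate_roles_right(positive))
```

The positive and rotated structures would then both be turned into text by one of the prompt functions, so the encoders see descriptions that differ only in who plays which role.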
https://i.imgur.com/fPo0UpH.png https://i.imgur.com/vIWv4lc.png

(3) Event graph alignment via optimal transport: each event and its arguments can be organized as a graph. Encoding event graph structures enables the model to capture the interactions between events and arguments. For example, the injured man should be aligned with the ENTITY being transported, rather than the AGENT.

https://i.imgur.com/NiWfNe4.png

There are 3 types of alignment: (a) Image-level alignment: computes the cosine similarity $s(t,i)$ and distance $d(t,i)$ between the text $t$ and the image $i$. (b) Entity-level alignment: computes the cosine similarity between text entity $t_{e}$ and image object $i_{o}$, where $t_{e}$ is the text mention of entity $e$ and $\mathbf{t}_{e}$ is its embedding contextualized on the sentence. This contextualized embedding is encoded with the text Transformer, applying average pooling over the tokens in the entity mention $t_{e}$. Similarly, $i_{o}$ is the bounding box of object $o$ and $\mathbf{i}_{o}$ is its embedding contextualized on the image, obtained by average pooling over the vision Transformer representations of the patches covered by the bounding box. (c) Event-level alignment: to obtain a global alignment score based on the structures of the two graphs, optimal transport (OT) is used to get the minimal distance $d(G_{t}, G_{i})$ between the text event graph $G_{t}$ and the image event graph $G_{i}$.

Finally, the whole framework is trained with contrastive learning.
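To make the event-level alignment more concrete, here is a minimal sketch of an OT distance between two small sets of node embeddings. Several things here are assumptions rather than details from the summary: the cost is taken as one minus cosine similarity, both graphs get uniform node weights, edges are ignored, and the transport plan is approximated with entropy-regularized Sinkhorn iterations. The names `sinkhorn_distance`, `reg`, and `n_iters` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sinkhorn_distance(
    text_nodes: List[Vector],
    image_nodes: List[Vector],
    reg: float = 0.1,      # entropic regularization strength (assumed value)
    n_iters: int = 200,    # number of Sinkhorn iterations (assumed value)
) -> float:
    """Approximate OT distance between two sets of node embeddings."""
    n, m = len(text_nodes), len(image_nodes)
    # Cost of moving mass from text node r to image node c.
    cost = [[1.0 - cosine(t, i) for i in image_nodes] for t in text_nodes]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    # Uniform mass on every node of each graph (an assumption).
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[r] / sum(kernel[r][c] * v[c] for c in range(m)) for r in range(n)]
        v = [b[c] / sum(kernel[r][c] * u[r] for r in range(n)) for c in range(m)]
    plan = [[u[r] * kernel[r][c] * v[c] for c in range(m)] for r in range(n)]
    # Total transport cost under the approximate plan.
    return sum(plan[r][c] * cost[r][c] for r in range(n) for c in range(m))


if __name__ == "__main__":
    text_graph = [[1.0, 0.0], [0.0, 1.0]]       # toy embeddings: trigger, argument
    matching_image = [[0.0, 1.0], [1.0, 0.0]]   # same nodes, different order
    mismatched_image = [[-1.0, 0.0], [0.0, -1.0]]
    print(sinkhorn_distance(text_graph, matching_image))
    print(sinkhorn_distance(text_graph, mismatched_image))
```

In this toy setup the matching pair should come out with a much smaller distance than the mismatched one, which is the kind of signal a contrastive objective can push on: small $d(G_{t}, G_{i})$ for the true image-text pair, larger for negatives.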