[link]
This work forces vision-language pretraining models to comprehend events and the associated argument (participant) roles. https://i.imgur.com/TH7cOfZ.png To achieve this, the authors built a framework with 3 steps: https://i.imgur.com/8fpOA1r.png

(1) Event structural knowledge extraction: (a) Text extraction: a SOTA text information extraction system extracts events and their arguments (e.g., agent, entity, instrument). (b) Image extraction: a Faster R-CNN trained on Open Images detects objects. (c) Primary event detection: the primary event is the one closest to the root of the dependency parse tree, with the largest number of arguments, the highest event type frequency, and the highest similarity between its trigger word and the image (computed with CLIP).

(2) Event-structure-driven negative sampling: the negatives and positives help the text and vision encoders learn robust features (the encoders can learn why they are wrong, and why they are correct). There are 3 types of negatives: (a) Negative event sampling: compute the confusion matrix over event types and select the top one as the predicted event type; event types whose visual features are ambiguous with the primary event type become the negative events. (b) Negative argument sampling: if an event has multiple argument roles, perform a right-rotation of the argument role sequence to get the negative argument samples; if it has only one argument, use the confusion matrix of the text argument extraction system instead. (c) Description generation: to encode positive and negative event structures, they use multiple prompt functions (single template-based prompt, composed template-based prompt, continuous prompt, caption editing), then feed 5 manual event description examples into GPT-3, whose output is a fine-grained event description. https://i.imgur.com/fPo0UpH.png https://i.imgur.com/vIWv4lc.png

(3) Event graph alignment via optimal transport: each event and its arguments can be organized as a graph, and encoding event graph structures enables the model to capture the interactions between events and arguments. For example, the injured man should be aligned with the ENTITY being transported, rather than the AGENT. https://i.imgur.com/NiWfNe4.png There are 3 levels of alignment: (a) Image-level alignment: compute the cosine similarity $s(t,i)$ and distance $d(t,i)$ between the text $t$ and the image $i$. (b) Entity-level alignment: compute the cosine similarity between a text entity and an image object, where $t_{e}$ is the text mention of entity $e$ and $\mathbf{t}_{e}$ is its embedding contextualized on the sentence, encoded with the text Transformer and average-pooled over the tokens in the mention $t_{e}$; similarly, $i_{o}$ is the bounding box of object $o$ and $\mathbf{i}_{o}$ is its embedding contextualized on the image, obtained by average-pooling the vision Transformer representations of the patches covered by the bounding box. (c) Event-level alignment: to obtain a global alignment score based on the structures of the two graphs, optimal transport (OT) gives the minimal distance $d(G_{t}, G_{i})$ between the text event graph $G_{t}$ and the image event graph $G_{i}$.

Finally, the whole framework is trained with contrastive learning.
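To make step 2(b) concrete, here is a minimal sketch of the right-rotation trick for negative argument sampling; the event type and role names below are hypothetical examples, not taken from the paper.

```python
def rotate_roles(arguments):
    """Right-rotate the role assignment of an event's arguments to build
    a negative sample: each argument keeps its text span but inherits a
    neighbor's role, so the structure stays well-formed while the role
    bindings become wrong."""
    roles = [role for role, _ in arguments]
    spans = [span for _, span in arguments]
    rotated = roles[-1:] + roles[:-1]  # right rotation by one position
    return list(zip(rotated, spans))

# Hypothetical TRANSPORT event, as (role, text span) pairs.
positive = [("AGENT", "the soldier"),
            ("ENTITY", "the injured man"),
            ("INSTRUMENT", "a stretcher")]
negative = rotate_roles(positive)
# -> [("INSTRUMENT", "the soldier"), ("AGENT", "the injured man"),
#     ("ENTITY", "a stretcher")]
```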
[link]
This paper designs a generalized multimodal architecture that can solve all vision-language tasks. Concretely, the model is pre-trained on 4 main tasks (MLM, ITM, WRA, MRM) and evaluated on various downstream tasks (VQA, VCR, NLVR). https://i.imgur.com/IG7suDj.png As shown in Fig 1, UNITER first encodes image regions (visual features and bounding box features) and textual words (tokens and positions) into a common embedding space with an Image Embedder and a Text Embedder. Then, a Transformer module is applied to learn generalizable contextualized embeddings for each region and each word. The contribution is two-fold: (1) masked language/region modeling is conditioned on full observation of the image/text, rather than applying joint random masking to both modalities; (2) a novel WRA pre-training task uses Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions. Intuitively, OT-based learning optimizes distribution matching by minimizing the cost of transporting one distribution to another. Here, the aim is to minimize the cost of transporting the embeddings from image regions to words in a sentence (and vice versa), thus optimizing toward better cross-modal alignment.
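To make the WRA objective concrete, here is a minimal sketch of an OT-based alignment loss between word and region embeddings. Plain Sinkhorn iterations stand in for the IPOT solver UNITER actually uses, and the cosine cost, uniform marginals, and hyperparameters are my assumptions; the transport plan is treated as a constant so gradients flow only through the cost matrix.

```python
import torch
import torch.nn.functional as F

def wra_loss(word_emb, region_emb, epsilon=0.1, n_iters=20):
    """Word-Region Alignment loss sketch: entropy-regularized OT distance
    between a sentence's word embeddings and an image's region embeddings."""
    w = F.normalize(word_emb, dim=-1)    # (n_words, d)
    r = F.normalize(region_emb, dim=-1)  # (n_regions, d)
    cost = 1.0 - w @ r.T                 # cosine transport cost

    with torch.no_grad():                # plan computed as a constant
        a = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform marginals
        b = torch.full((cost.size(1),), 1.0 / cost.size(1))
        K = torch.exp(-cost / epsilon)   # Gibbs kernel
        u = torch.ones_like(a)
        for _ in range(n_iters):         # Sinkhorn scaling updates
            u = a / (K @ (b / (K.T @ u)))
        plan = u[:, None] * K * (b / (K.T @ u))[None, :]

    # Total transport cost; gradients pull aligned pairs together.
    return (plan * cost).sum()

# Toy usage: 6 word embeddings vs. 4 region embeddings, 32-dim.
words = torch.randn(6, 32, requires_grad=True)
regions = torch.randn(4, 32, requires_grad=True)
loss = wra_loss(words, regions)
loss.backward()  # gradients reach the encoders through the cost matrix
```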
[link]
This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here.

The general goal of this paper is to better understand the contours of when self-supervised representation learning is valuable for vision (and specifically when it can compete with supervised learning), and when it isn't. In general, the results all use ResNet backbones, with SimCLR SSL (see the loss sketch after the list), on image classification datasets. Some bullet-point takeaways:

- The SSL models being tested here seem to roughly saturate at unsupervised dataset sizes of around 500K; the comparative jump from dataset sizes of 500K to 1M is fairly small.
- Once you have a supervised dataset of around 50K or more, the benefit of SSL pretraining starts to diminish, and performance converges toward that of just supervised learning on that number of labeled images. On the flip side, it's only possible to get close to "good" fully supervised performance by using 100K labeled images or more on top of an SSL baseline.
- Even within image classification datasets, it's much better to do SSL representation learning on the same dataset as the one you'll use for downstream training; trying to transfer representations to different datasets leads to meaningfully worse results. Interestingly, this is even true when you add out-of-domain (i.e. other-dataset) data to an existing in-domain dataset: a dataset of 250K in-dataset images does better than a 500K dataset of images from mixed datasets, and does notably better than a 1M dataset of mixed images. In this case, adding more out-of-domain images seems to have just degraded performance.
- SSL seems to perform more closely to SL on a coarse label set; when the label set gets more granular, the task gets harder overall, but, more specifically, the gap between SSL and SL grows.
- When the authors tried different forms of dataset corruption, SSL was much more robust to adding salt-and-pepper noise than it was to removing high-frequency information in the form of reducing the images to a lower resolution.
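For reference, since every result in the paper is built on SimCLR, here is a minimal sketch of its NT-Xent contrastive loss; the temperature, batch size, and embedding dimension below are placeholder values.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR's NT-Xent loss sketch: two augmented views per image; each
    view's positive is its counterpart, every other view in the batch is
    a negative."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # (2N, d)
    sim = z @ z.T / temperature                  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    n = z1.size(0)
    # Positive for row i is row i + n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Toy usage: projections of two augmented views of an 8-image batch.
loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```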
[link]
This paper is an interesting extension of earlier work, in the TransformerXL paper, that sought to give Transformers access to a "memory" beyond the scope of the subsequence where full self-attention was being performed. This was done by caching the activations from prior subsequences, and making them available to the subsequence currently being calculated in a "read-only" way, with gradients not propagated backwards. This had the effect of (1) reducing the maximum memory size compared to simply doubling the subsequence length, and (2) reducing the extent to which gradients had to propagate backward through time.

The authors of the Compressive Transformers paper want to build on that set of ideas to construct an even longer accessible memory. So, they take the baseline non-backpropagated memory design of TransformerXL, but instead of having tokens roll out of memory after the end of the previous (cached) subsequence, they create an extra compressed memory. Each token in this compressed memory is a function of C inputs in the normal memory. So, if C=3, you would input 3 memory vectors into your compression function to get one instance of a compressed memory vector. Depending on the scale of your C, you can turn up the temporal distance into the past that your compressed memory reaches. https://i.imgur.com/7BaCzoU.png

While the gradients from the main loss function didn't, as far as I could tell, pass back into the compression function, they did apply a compression loss to incentivize the compression to be coherent. They considered an autoencoder loss to reconstruct the input tokens from the compressed memory, but decided against it on the principle that memory inherently has to be compressed and lossy to be effective, and an autoencoder loss would push toward infeasibly lossless compression. Instead, they take the interesting approach of incentivizing the compressed representations to be able to reconstruct the attention calculation performed on the pre-compressed representations. Basically, any information pulled out of the pre-compressed memories by content-based lookup also needs to be able to be pulled out of the compressed memories. This incentivizes the network to preferentially keep the information that was being actively used by the attention mechanisms in prior steps, and discard less useful information.

One framing from this paper that I enjoyed was the comparison it draws between the approach of Transformers (keeping all lower-level activations in memory, and recombining them "in real time" for each downstream use of that information) and the approach of RNNs (keeping a running compressed representation of everything seen up to this point). In this frame, their method is somewhere in between, with a tunable compression rate C (by contrast, an RNN would have an effectively unlimited compression rate, since all prior tokens would be compressed into a single state representation).
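Here is a minimal sketch of the compression step and the attention-reconstruction loss described above, assuming a strided-convolution compression function (one of several the paper considers) and a simplified single-head attention read with shared keys and values; all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(queries, keys_values):
    """Single-head dot-product attention read (a simplification)."""
    scores = queries @ keys_values.T / keys_values.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ keys_values

class Compressor(nn.Module):
    """Compress every C old memory vectors into one with a strided conv."""
    def __init__(self, d_model, c=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)

    def forward(self, old_mem):            # (T, d)
        x = old_mem.T.unsqueeze(0)         # (1, d, T)
        return self.conv(x).squeeze(0).T   # (T // c, d)

d, c, T = 64, 3, 12
comp = Compressor(d, c)
old_mem = torch.randn(T, d)                # memories about to roll out
queries = torch.randn(5, d)                # current-segment queries
compressed = comp(old_mem)                 # (4, d)

# Attention-reconstruction loss: whatever attention could read from the
# raw memories should also be readable from the compressed ones.
loss = F.mse_loss(attend(queries, compressed),
                  attend(queries, old_mem).detach())
loss.backward()                            # trains only the compressor
```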
[link]
The idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight. The network is implemented such that different experts can actually live on different devices. https://i.imgur.com/HEB7cJw.png

This architecture is inspired by previous Mixture of Experts work, which applied a similar scheme, but sent each token through a set of k experts rather than just a single one. This had the ostensible effect of increasing stability and performance, but the authors of this paper argue that using a single expert per token is actually preferable on both of these fronts. There are a lot of experiments in this paper, and I'd recommend taking a look at them in detail if you're interested, but, at a high level, they found evidence that, compared to models with a comparable number of parameters, they were indeed able to get comparable or better performance with fewer FLOPs. It also meant they were able to build up to a trillion-parameter model without unreasonable computation requirements. Some interesting considerations relevant to this approach (see the routing sketch after this list):

- To keep training speed up, you need to strike the right balance in the number of tokens sent to each expert; in this case, the authors added a loss term to incentivize the division between experts to be roughly uniform.
- There was some numerical instability in the expert training procedure when using float16 data types, so they switched to float32, but only within the experts themselves, rather than in the rest of the network.
- To regularize a network this huge, the authors applied dropout, but only within the experts.
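Here is a minimal sketch of top-1 ("switch") routing with the load-balancing auxiliary loss; the two-layer MLP experts, the sizes, and the auxiliary-loss coefficient are illustrative assumptions, and the cross-device expert dispatch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    def __init__(self, d_model=64, n_experts=4, aux_coef=0.01):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token gating
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.aux_coef = aux_coef

    def forward(self, tokens):                       # (n_tokens, d)
        probs = F.softmax(self.router(tokens), dim=-1)
        gate, expert_idx = probs.max(dim=-1)         # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = gate[mask].unsqueeze(1) * expert(tokens[mask])

        # Load-balancing loss: fraction of tokens routed to each expert
        # times its mean router probability; minimized when both are uniform.
        n_experts = probs.size(-1)
        frac_tokens = F.one_hot(expert_idx, n_experts).float().mean(0)
        frac_probs = probs.mean(0)
        aux = self.aux_coef * n_experts * (frac_tokens * frac_probs).sum()
        return out, aux

# Toy usage: a batch of 16 token embeddings.
layer = SwitchLayer()
y, aux_loss = layer(torch.randn(16, 64))
```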