[link]
This paper tackles zero-shot action recognition using a cluster-based representation. Concretely, it uses REINFORCE, a reinforcement learning algorithm, to optimize the cluster centroids, with the classification scores serving as the reward signal. https://i.imgur.com/gWyJLX0.png
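As a rough illustration of the idea, here is a minimal sketch of REINFORCE-style centroid optimization where the reward is a classification score. The centroid parameterization, the assignment policy, and all names and hyperparameters below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

# Hypothetical setup: centroids define a stochastic cluster-assignment policy;
# the reward is a downstream classification score. Shapes are illustrative.
num_clusters, feat_dim = 10, 128
centroids = torch.randn(num_clusters, feat_dim, requires_grad=True)
optimizer = torch.optim.Adam([centroids], lr=1e-3)

def classification_score(features, assignments):
    """Placeholder for the task head that returns a per-sample reward."""
    return torch.rand(features.size(0))  # stand-in for real classifier scores

def reinforce_step(features):
    # Softmax over negative distances to centroids gives pi(cluster | feature).
    logits = -torch.cdist(features, centroids)            # (B, K)
    dist = torch.distributions.Categorical(logits=logits)
    assignments = dist.sample()                           # sampled "actions"
    reward = classification_score(features, assignments)  # reward signal
    # REINFORCE: maximize E[reward * log pi(action)], with a mean-reward
    # baseline to reduce variance.
    loss = -((reward - reward.mean()) * dist.log_prob(assignments)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```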
[link]
This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here. The general goal of this paper is to better understand the contours of when self-supervised representation learning is valuable for vision (and specifically when it can compete with supervised learning), and when it isn't. In general, the results all use ResNet backbones, with SimCLR SSL, on image classification datasets. Some bullet-point takeaways:

- The SSL models being tested here seem to roughly saturate at unsupervised dataset sizes of around 500K; the comparative jump from dataset sizes of 500K to 1M is fairly small.
- Once you have a supervised dataset of around 50K or more, the benefit of SSL pretraining starts to diminish, and it converges to being more similar to just supervised learning on that number of labeled images. On the flip side, it's only possible to get close to "good" fully supervised performance by using 100K images or more on top of an SSL baseline.
- Even within image classification datasets, it's much better to do SSL representation learning on the same dataset as the one you'll use for downstream training; trying to transfer representations to different datasets leads to meaningfully worse results. Interestingly, this is even true when you add out-of-domain (i.e. other-dataset) data to an existing in-domain dataset: a dataset of 250K in-dataset images does better than a 500K dataset of images from mixed datasets, and does notably better than a 1M dataset of mixed images. In this case, adding more out-of-domain images seems to have just degraded performance.
- SSL seems to perform more closely to SL on a coarse label set; when the label set gets more granular, the task gets harder overall, but, more specifically, the gap between SSL and SL grows.
- When the authors tried different forms of dataset corruption, SSL was much more robust to adding salt-and-pepper noise than it was to removing high-frequency information in the form of reducing the images to a lower resolution.
[link]
The idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight (a rough sketch of this top-1 routing and the balancing loss follows the list below). The network is implemented such that different experts can actually live on different devices. https://i.imgur.com/HEB7cJw.png

This architecture is inspired by previous Mixture of Experts work, which applied a similar scheme, but sent each token through a set of k experts rather than just a single one. This had the ostensible effect of increasing stability and performance, but the authors of this paper argue that using a single expert per token is actually preferable on both of these fronts. There are a lot of experiments in this paper, and I'd recommend taking a look at them in detail if you're interested, but, at a high level, they found evidence that, compared to models with a comparable number of parameters, they were indeed able to get comparable or better performance with fewer FLOPS. It also meant they were able to build up to a trillion-parameter model without having unreasonable computation requirements. Some interesting considerations relevant to this approach:

- To keep training speed up, you need to strike the right balance in the number of tokens sent to each expert; in this case, the authors added a loss term to incentivize the division between experts to be roughly uniform.
- There was some numerical instability in the expert training procedure when using float16 data types, so they switched to using float32, but only within the experts themselves, rather than in the rest of the network.
- To regularize a network this huge, the authors decided to apply dropout, but only within the experts.
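Here is a minimal sketch of top-1 ("switch") routing with a load-balancing auxiliary loss, to make the mechanism concrete. The expert definition, dimensions, and loss coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

class SwitchRouter(torch.nn.Module):
    """Routes each token to a single expert chosen by a learned gate."""

    def __init__(self, d_model: int, num_experts: int, aux_coef: float = 1e-2):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, num_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, 4 * d_model),
                torch.nn.ReLU(),
                torch.nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )
        self.aux_coef = aux_coef

    def forward(self, tokens):                        # tokens: (num_tokens, d_model)
        probs = F.softmax(self.gate(tokens), dim=-1)  # router weights per token
        expert_idx = probs.argmax(dim=-1)             # each token -> one expert
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the router probability so gradients reach the gate.
                out[mask] = probs[mask, i].unsqueeze(-1) * expert(tokens[mask])
        # Load-balancing loss: push both the fraction of tokens per expert and
        # the mean router probability per expert toward uniform.
        frac_tokens = torch.bincount(expert_idx, minlength=len(self.experts)).float()
        frac_tokens = frac_tokens / tokens.size(0)
        mean_probs = probs.mean(dim=0)
        aux_loss = self.aux_coef * len(self.experts) * (frac_tokens * mean_probs).sum()
        return out, aux_loss
```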
[link]
In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symbols, to use them in an effective way.

This paper suggests the straightforward and clever approach of, instead of just having agents communicate using arbitrary vectors produced as part of a policy, having those communication vectors be directly linked to the content of an agent's observations. Specifically, this is done by taking the encoding of the image that is used for making policy decisions, and passing that encoding through an autoencoder, using the bottleneck at the middle of the autoencoder as the communication vector sent to other agents. This structure incentivizes the agent to generate communication vectors that are intrinsically grounded in the observation, enforcing a certain level of consistency that the authors hope makes it easier for the other agent to follow and interpret the communication. https://i.imgur.com/u9OAZm8.png

Empirically, there seems to be fairly compelling evidence that this autoencoder-based form of grounding is more stable and thus more mutually learnable than learning from RL alone. The authors even found that adding RL training on top of the autoencoder-based training degraded performance.
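A minimal sketch of the grounding idea, assuming a simple fully-connected agent: the message is the autoencoder bottleneck of the same observation encoding the policy uses. All layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GroundedCommAgent(nn.Module):
    """Agent whose outgoing message is an autoencoder bottleneck of its observation encoding."""

    def __init__(self, obs_dim=256, enc_dim=128, msg_dim=16, num_actions=6):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, enc_dim), nn.ReLU())
        self.policy_head = nn.Linear(enc_dim + msg_dim, num_actions)
        # Autoencoder over the observation encoding; the bottleneck is the message.
        self.ae_encoder = nn.Linear(enc_dim, msg_dim)
        self.ae_decoder = nn.Linear(msg_dim, enc_dim)

    def forward(self, obs, incoming_msg):
        enc = self.obs_encoder(obs)
        message = self.ae_encoder(enc)                  # communication vector
        recon = self.ae_decoder(message)
        # Reconstruction loss keeps the message grounded in the observation,
        # independent of whatever the RL objective does.
        ae_loss = ((recon - enc.detach()) ** 2).mean()
        logits = self.policy_head(torch.cat([enc, incoming_msg], dim=-1))
        return logits, message, ae_loss
```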
[link]
This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. The basic premise is:

- Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, use a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding.
- Run all of these embeddings through a Transformer with shared weights for all three modalities.
- Take the final projected CLS representation for the video patches, and perform contrastive learning against both an aligned audio patch and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text.

One technique that was fun in its simplicity was the authors' DropToken strategy, which basically just said "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence-length cost?" This obviously leads to some performance cost, but they found it not very dramatic.

Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.
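To illustrate how simple the DropToken idea is, here is a rough sketch of randomly dropping a fraction of tokens before the Transformer. The drop rate and tensor shapes are assumptions, not the paper's exact setup.

```python
import torch

def drop_tokens(tokens: torch.Tensor, drop_rate: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of tokens. tokens: (batch, seq_len, dim) -> (batch, kept_len, dim)."""
    batch, seq_len, dim = tokens.shape
    keep = max(1, int(seq_len * (1.0 - drop_rate)))
    # Sample a random subset of positions independently for each batch element.
    scores = torch.rand(batch, seq_len, device=tokens.device)
    keep_idx = scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```

Since attention is quadratic in sequence length, dropping half the tokens cuts the attention cost to roughly a quarter, which is why the performance hit they observed is a reasonable trade.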
[link]
This new architecture out of DeepMind combines information extraction and bottlenecks with a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed. Currently, self-attention models are quite powerful and capable, but because attention is quadratic in sequence length in both time and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This paper proposes what they call "cross-attention," where a smaller-dimensional latent vector attends to the input (the latent generates the queries, the input the keys and values). This lets the network pull information out of the larger-dimensional input into a smaller, fixed-by-hyperparameter-sized latent. From there, multiple self-attention layers are applied to generate a new latent, which can be fed back into the beginning of the process to query new information from the input, accounting for the "iterative" in the title of this work. The authors argue this approach lets them take larger inputs and create deeper models, because the cost of each self-attention layer (going from latent-dim to latent-dim) is small and controlled. Like many other Transformer-based architectures, they use positional encodings, theirs based on Fourier features at different frequencies. https://i.imgur.com/Wc8rzII.png

My overall take from the results presented is that it is competitive on many of the audio and vision tasks tested, with none of the convolutional priors that even something like Vision Transformer (which does coarse convolution-style preprocessing before going into Transformer layers) requires, though it didn't dramatically outperform the state of the art on any of the tested tasks. One thing that was strange to me was that they didn't (at least in the main paper; I haven't read the appendix) seem to evaluate on text, which would seem like an obvious benchmark if you're proposing a Transformer-alternative architecture.
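A rough sketch of the cross-attention step described above, where a small learned latent array queries a large input so the per-layer cost is latent-length times input-length rather than input-length squared. The use of `nn.MultiheadAttention`, the sizes, and the single self-attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """One iteration: latents cross-attend to the input, then self-attend."""

    def __init__(self, num_latents=256, latent_dim=512, input_dim=512, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.cross_attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=input_dim, vdim=input_dim, batch_first=True
        )
        self.self_attn = nn.TransformerEncoderLayer(
            latent_dim, heads, dim_feedforward=4 * latent_dim, batch_first=True
        )

    def forward(self, inputs, latents=None):    # inputs: (B, input_len, input_dim)
        if latents is None:
            latents = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        # Latents provide the queries; the raw input provides keys and values.
        latents, _ = self.cross_attn(latents, inputs, inputs)
        # Cheap self-attention in latent space; the result can be fed back in
        # as the query for the next iterative cross-attention step.
        return self.self_attn(latents)
```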
[link]
Model combination/ensembling: Average ensembling is practical - but naive. Combining while considering each network's strengths is much better! Moreover, let's make the networks diverse so they will have different strengths. Wenjuan Han & Hwee Tou Ng (no twitters?) #enough2skim #NLProc

The basic idea is quite simple: given some models, why would we want the average? We want to rely on each one (or group) when it is more likely to be the correct one. This was actually introduced in our previous work (as admitted by the authors) in aclanthology.org/W19-4414.pdf

The paper's additions:
1. Given a set of black-box models, we may train at least one of them to be different from the rest with RL.
2. We can use more sophisticated NNs to combine the outputs.
3. We can ignore domain knowledge for the combination (I am not sure this is a bonus).

Results are very strong. Especially nice is that they show that the diversity training indeed helps.

My criticism: the comparisons are always to SoTA, which is meaningless. The authors propose different parts (the diversity, the combination, and the combined models). It is unclear whether ensembling after the diversity training would be preferable to theirs or not. Similarly, they compare to Kantor et al., but Kantor provided a combination method, so why not compare on the same models, or combine the models after the diversity training with Kantor's method?

To conclude, I really like the direction, and ensembling is a very practical tool that for some reason has not been improved in a long time.
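For intuition about "combine considering each network's strengths" rather than averaging, here is a hedged sketch of a learned combiner that predicts per-example weights over the base models. This is only an illustration of the general idea, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LearnedCombiner(nn.Module):
    """Predicts per-example weights over base models instead of averaging them."""

    def __init__(self, num_models: int, num_classes: int):
        super().__init__()
        # The gate sees all models' class probabilities, flattened.
        self.gate = nn.Linear(num_models * num_classes, num_models)

    def forward(self, model_probs):          # (batch, num_models, num_classes)
        weights = torch.softmax(self.gate(model_probs.flatten(1)), dim=-1)
        # Weighted combination: lean on whichever model is likely right here.
        return (weights.unsqueeze(-1) * model_probs).sum(dim=1)
```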
[link]
Huge **commit summarization** dataset. The dataset cleans tons of open-source projects to keep only ones with high-quality committing habits (e.g., large active projects with commits of significant length, etc.).

We present some ways to evaluate whether the meaning was kept while summarizing, so you can go beyond ROUGE.

We provide a strict split that keeps some (a thousand or so) repositories totally out of the training set, so you can check in-domain and out-of-domain performance, or just be sure results are clean.

If you ever want an even larger dataset, follow the same procedure and use more repositories (we took only ones active in 2020; pick ones that are no longer active or weren't active back then).

Dataset: https://figshare.com/articles/dataset/CumSum_data_set/14711370
Code: https://github.com/evidencebp/comsum
Paper: https://arxiv.org/pdf/2108.10763.pdf
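A hedged sketch of what a repository-level split like the one described could look like, so no project appears in both train and test. The field names (`"repo"`) and the holdout fraction are assumptions, not the dataset's actual schema or procedure.

```python
import random

def split_by_repository(examples, holdout_fraction=0.1, seed=0):
    """Hold out whole repositories so train and test share no project."""
    repos = sorted({ex["repo"] for ex in examples})
    random.Random(seed).shuffle(repos)
    num_holdout = max(1, int(len(repos) * holdout_fraction))
    holdout = set(repos[:num_holdout])
    train = [ex for ex in examples if ex["repo"] not in holdout]
    test = [ex for ex in examples if ex["repo"] in holdout]
    return train, test
```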
[link]
**Background:** The goal of this work is to identify image features that are relevant to a neural network's prediction and convey that information to the user by displaying a counterfactual image animation.

**The Latent Shift Method:** This method works with any pretrained encoder/decoder and any classifier that is differentiable. No special considerations are needed during model training. The approach wants the exact opposite of an adversarial attack but uses the same idea: perturb the input image so that the classifier reduces its prediction. If we just compute $\frac{\partial f}{\partial x}$ and move the pixels directly, we get an imperceptible difference, just like an adversarial attack. Using a decoder regularizes the transformation so it only yields valid images. The encoder takes the input image and encodes it into a latent representation $z$. The decoder then reconstructs the image and feeds the reconstruction into the classifier. The gradient of the classifier's output is computed with respect to $z$. Subtracting the gradient from $z$ and reconstructing the image generates a counterfactual. https://i.imgur.com/iuZGUTH.gif

They found that if they change the prediction by -30% the images come out pretty good, so they perform an iterative search along the vector defined by the gradient in latent space until the prediction is reduced by 30%. From this sequence, a 2D image similar to a traditional attribution map can be reconstructed by taking the maximum pixel-wise difference between every image and the unperturbed reconstruction. https://i.imgur.com/V3PCgXZ.png

The results look great! https://i.imgur.com/DBki84c.gif https://i.imgur.com/kFfQNKD.gif

To validate whether this approach can help spot false positive predictions, two radiologists were asked to evaluate how confident they were in a model's predictions. For each image, the radiologists viewed the prediction in two ways: using traditional methods or the Latent Shift images. Traditional methods include the image gradient, guided backprop, and integrated gradients. The Latent Shift counterfactual includes the animation as well as the 2D version. https://i.imgur.com/TlUBhzL.png

What they would like to see is that for true positives the ratings are all 5 and for false positives they are all 1. What they observe, however, is that many false positives still cause high confidence in the model's predictions, though not as much as the true positives. Between the two methods, they find that for true positives the Latent Shift counterfactuals show a significant increase in confidence, which is good.

> 0.15±0.95 confidence increase using the Latent Shift method (p=0.01).

For false positives they find an increase in confidence, but it is not significant.

> 0.04±1.06 increase which is not significant (p=0.57)

**Conclusions:**

- Latent Shift's ability to generate counterfactuals is pretty good!
- Vanilla autoencoders are sufficient for some pathologies.
- StyleGAN and higher-quality models should improve performance.
- IoU analysis may not be the best fit.
- Explainable AI methods can have an impact on user confidence in the model.

(Disclaimer: I am the author of this work)

Project Website: https://mlmed.org/gifsplanation/
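As a companion to the description above, here is a simplified sketch of the latent-shift procedure, assuming a differentiable encoder, decoder, and classifier with a single-image batch and a scalar prediction. The step size, iteration cap, and helper names are illustrative assumptions; the 30% stopping criterion and the max pixel-wise difference map follow the text.

```python
import torch

def latent_shift_counterfactual(x, encoder, decoder, classifier,
                                step=10.0, target_drop=0.30, max_steps=50):
    """Walk the latent code against the classifier's gradient until the prediction drops ~30%."""
    z = encoder(x).detach().requires_grad_(True)
    base_pred = classifier(decoder(z))                    # assumed scalar output
    grad = torch.autograd.grad(base_pred.sum(), z)[0]     # d f(D(z)) / d z
    frames, z_shifted = [], z.detach()
    for _ in range(max_steps):
        z_shifted = z_shifted - step * grad               # move against the prediction
        recon = decoder(z_shifted)
        frames.append(recon.detach())
        if classifier(recon) <= (1.0 - target_drop) * base_pred:
            break
    # 2D attribution map: max pixel-wise difference vs. the unperturbed reconstruction.
    baseline = decoder(z.detach()).detach()
    attribution = torch.stack([(f - baseline).abs() for f in frames]).amax(dim=0)
    return frames, attribution
```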