This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model. The basic premise is: - Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, a similar strategy but with waveform patches. For text, the normal per-token embedding is done. Combine this tokenization with a modality-specific positional encoding. - Run all of these embeddings through a Transformer with shared weights for all three modalities - Take the final projected CLS representation for each the video patches, and perform contrastive learning against both an aligned audio patch, and an aligned text region. This contrastive loss is calculated by, for each pair, projecting into a shared space (video and audio each project into a shared audio-video space, video and text each project into a shared video-text space, with specific projection weights), and then doing a normal contrastive setup where positive pairs come either from a direct alignment of audio and video, or from a soft "nearest neighbors" alignment of text with video, to account for not all video snippets containing text One technique that was fun in its simplicity was the author's DropToken strategy, which basically just said "hey, we have a high-resolution input, what if we just randomly dropped tokens within our sequence to reduce the S^2 sequence length cost. This obviously leads to some performance cost, but they found it not very dramatic. Experimental results were all-around impressive, achieving SOTA on a number of modality-specific tasks (action prediction in video, audio prediction) with their cross-modality model.