This new architecture out of Deepmind applies combines information extraction and bottlenecks to a traditional Transformer base to get a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed. Currently, self-attention models are quite powerful and capable, but because attention is quadratic-in-sequence-length in both time, and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This paper propose what they call "cross-attention," where some smaller-dimensional latent vector attends to the input (the latent generates the queries, the input the keys and values). This lets the network pull information out of the larger-dimensional input into a smaller and fixed-by-hyperparameter, size of latent. From there, multiple self-attention layers are applied to generate a new latent, which can be fed back into the beginning of the process to query new information from the input, accounting for the "iterative" in the title of this work. The authors argue this approach lets them take larger inputs, and create deeper models, because the cost of each self-attention layer (going from latent-dim to latent-dim) is small and controlled. Like many other Transformer-based architectures, they use positional encodings, theirs based on Fourier features at different frequencies. https://i.imgur.com/Wc8rzII.png My overall take from the results presented is that it is competitive on many of the audio and vision tasks tested, with none of the convolutional priors that even something like Vision Transformer (which does course convolution-style preprocessing before going into Transformer layers) require, though it didn't dramatically outperform the state-of-the-art on any of the tested tasks. One thing that was strange to me was that they didn't (at least in the main paper, haven't read the appendix) seem to evaluate on text, which would seem like an obvious benchmark if you're proposing a Transformer-alternate architecture.