First published: 2024/12/03
Abstract: We propose the global context vision transformer (GC ViT), a novel architecture
that enhances parameter and compute utilization for computer vision tasks. The
core of the model is a set of global context self-attention modules, used jointly with
standard local self-attention, to effectively yet efficiently model both long-
and short-range spatial interactions, as an alternative to complex operations
such as attention masking or local window shifting. While the local
self-attention modules are responsible for modeling short-range information,
the global query tokens are shared across all global self-attention modules to
interact with local key and values. In addition, we address the lack of
inductive bias in ViTs and improve the modeling of inter-channel dependencies
by proposing a novel downsampler which leverages a parameter-efficient fused
inverted residual block. The proposed GC ViT achieves new state-of-the-art
performance across image classification, object detection and semantic
segmentation tasks. On ImageNet-1K dataset for classification, GC ViT models
with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1
accuracy, respectively, surpassing comparably-sized prior art such as CNN-based
ConvNeXt and ViT-based Swin Transformer. Pre-trained GC ViT backbones in
downstream tasks of object detection, instance segmentation, and semantic
segmentation on MS COCO and ADE20K datasets outperform prior work consistently,
sometimes by large margins.
The Transformer was proposed to capture long-range information with the self-attention mechanism, but it comes with quadratic computation cost and lacks multi-resolution information. The Swin Transformer then introduced local window self-attention to reduce the cost to linear in the image size, shifted-window attention to capture cross-window information, and a hierarchical architecture to exploit multi-resolution information. However, shifted-window attention still struggles to capture long-range information because of its small coverage area, and, like ViT, Swin lacks inductive bias. Global Context ViT (GC ViT) is proposed to address these limitations of the Swin Transformer.
Improvements:
(1) Unlike the Swin Transformer, this paper uses global context self-attention together with local self-attention, rather than shifted-window self-attention, to model both long- and short-range dependencies.
(2) Although global attention is still a form of window attention, it leverages global query tokens that carry image-wide information, and can therefore capture long-range dependencies (see the sketch after this list).
(3) In addition, this paper compensates for the lack of inductive bias in both ViT and the Swin Transformer by utilizing CNN-based modules.
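To make point (2) concrete, here is a minimal PyTorch sketch, not the official implementation, of window attention in which the query either comes from the window itself (Local-MSA) or from shared global query tokens (Global-MSA). Class and argument names such as `WindowAttentionSketch` and `q_global`, and the exact tensor layout, are my own illustrative assumptions.

```python
# Minimal sketch (not the official code): window attention whose query is either
# local to the window or taken from shared global query tokens.
import torch
import torch.nn as nn

class WindowAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads, use_global_query=False):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.use_global_query = use_global_query
        # The global variant only needs key/value projections; its query is given.
        self.qkv = nn.Linear(dim, dim * (2 if use_global_query else 3))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global=None):
        # x: (num_windows * B, N, C) tokens of one local window
        B_, N, C = x.shape
        if self.use_global_query:
            # k, v come from the local window ...
            kv = self.qkv(x).reshape(B_, N, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
            k, v = kv[0], kv[1]
            # ... while q is the shared global query, repeated for every window.
            # Assumed q_global shape: (B, num_heads, N, head_dim), with B_ = B * num_windows.
            q = q_global.repeat_interleave(B_ // q_global.shape[0], dim=0)
        else:
            qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
            q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q * self.scale) @ k.transpose(-2, -1)      # (B_, heads, N, N)
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)   # merge heads
        return self.proj(x)
```

The only difference between the two modes is where the query comes from; keys and values always stay local to the window, which is how the cost stays linear in image size while the global query still carries long-range context.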
Key components:
Stem/PatchEmbed: A stem/patchify layer processes the image at the beginning of the network. For this network, it splits the image into patches/tokens and converts them into embeddings (see the patch-embedding sketch after this list).
Level: The repeated stage of the hierarchy that extracts features using the different blocks described below.
Global Token Gen./FeatExtract: Generates global tokens/patches with a depthwise CNN, SE (squeeze-and-excitation), a CNN, and MaxPooling, so it is essentially a feature extractor (sketched after this list).
Block: The repeated module that applies attention to the features and projects them to a certain dimension (see the block sketch after this list).
Local-MSA: Local multi-head self-attention within each window.
Global-MSA: Global multi-head self-attention, where the query comes from the shared global tokens.
MLP: Linear layers that project a vector to another dimension.
Downsample/ReduceSize: Very similar to the Global Token Gen. module, except it downsamples with a CNN instead of MaxPooling and adds Layer Normalization modules (see the FeatExtract sketch after this list).
Head: The module responsible for the classification task (sketched after this list).
Pooling: Converts the N 2D feature maps into N 1D feature vectors.
Classifier: Processes the N 1D feature vectors to make a decision about the class.
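A minimal sketch of the Stem/PatchEmbed idea, assuming a single 3x3 strided convolution that projects the image into a grid of patch embeddings; the real stem may include a further downsampling step, and the names here are illustrative.

```python
# Hedged patch-embedding sketch: a strided conv turns the image into patch tokens.
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    def __init__(self, in_chans=3, dim=96):
        super().__init__()
        # 3x3 conv with stride 2 halves the resolution and projects to `dim` channels.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, x):                      # x: (B, 3, H, W) input image
        x = self.proj(x)                       # (B, dim, H/2, W/2) patch grid
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, dim) token embeddings
```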
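A hedged sketch of the Global Token Gen./FeatExtract and Downsample/ReduceSize pattern: a depthwise conv, squeeze-and-excitation, a pointwise conv, and then either MaxPooling (token generation) or a strided conv (downsampling). The extra LayerNorms of ReduceSize are omitted, and module names such as `FeatExtractSketch` are my own.

```python
# Hedged feature-extraction sketch in the spirit of a fused inverted residual block.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1), nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))        # channel-wise reweighting

class FeatExtractSketch(nn.Module):
    def __init__(self, dim, keep_dim=True):
        super().__init__()
        # Depthwise conv -> SE -> 1x1 conv, applied as a residual branch.
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # depthwise
            nn.GELU(),
            SqueezeExcite(dim),
            nn.Conv2d(dim, dim, 1))
        # Global Token Gen. reduces with MaxPooling; Downsample/ReduceSize instead
        # uses a strided conv (LayerNorms omitted here for brevity).
        self.reduce = nn.MaxPool2d(3, stride=2, padding=1) if keep_dim \
            else nn.Conv2d(dim, 2 * dim, 3, stride=2, padding=1)

    def forward(self, x):                       # x: (B, C, H, W)
        x = x + self.conv(x)                    # residual feature extraction
        return self.reduce(x)                   # (B, C or 2C, H/2, W/2)
```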
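A sketch of one block, assuming the usual pre-norm transformer layout (LayerNorm, MSA, residual, then LayerNorm, MLP, residual) and reusing the `WindowAttentionSketch` defined after the Improvements list; window partitioning/merging is omitted to keep it short, and the alternation pattern shown at the end is an assumption for illustration.

```python
# Hedged block sketch: attention (local or global window MSA) followed by an MLP.
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    def __init__(self, dim, num_heads, use_global_query, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = WindowAttentionSketch(dim, num_heads, use_global_query)
        self.norm2 = nn.LayerNorm(dim)
        # MLP: expand the channels, apply a nonlinearity, project back.
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x, q_global=None):        # x: (num_windows*B, N, C) window tokens
        x = x + self.attn(self.norm1(x), q_global)
        x = x + self.mlp(self.norm2(x))
        return x

# Within a level, local and global attention blocks could alternate, e.g.:
# blocks = [BlockSketch(dim, heads, use_global_query=(i % 2 == 1)) for i in range(depth)]
```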
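Finally, a small sketch of the Head: pooling collapses the N 2D feature maps into N 1D vectors, and a linear classifier turns each vector into class logits; `num_classes=1000` is just the ImageNet-1K example, and the placement of the normalization is an assumption.

```python
# Hedged classification-head sketch: global average pooling + linear classifier.
import torch
import torch.nn as nn

class HeadSketch(nn.Module):
    def __init__(self, dim, num_classes=1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # (B, C, H, W) -> (B, C, 1, 1)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, C, H, W) final feature maps
        x = self.pool(x).flatten(1)              # (B, C) one vector per image
        x = self.norm(x)
        return self.classifier(x)                # (B, num_classes) class logits
```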
I annotated it like this to make it easier to digest:
https://i.imgur.com/bTqIUH2.png