The Transformer was proposed to capture long-range information with the self-attention mechanism, but it comes with quadratic computational cost and lacks multi-resolution information. Swin Transformer then introduced local window self-attention to reduce the cost to linear with respect to image size, shifted-window attention to capture cross-window information, and a hierarchical architecture to exploit multi-resolution information. However, shifted-window attention struggles to capture long-range information because each shifted window covers only a small area, and, like ViT, the architecture lacks an inductive bias. Global Context ViT (GCViT) was proposed to address these limitations of the Swin Transformer.

Improvements: (1) Unlike Swin Transformer, GCViT uses global context self-attention together with local self-attention, rather than shifted-window self-attention, to model both long- and short-range dependencies. (2) Although global-window attention is still a window attention, it leverages a global query that carries image-wide information and hence captures long-range dependencies. (3) In addition, it compensates for the inductive bias lacking in both ViT and Swin Transformer by utilizing a CNN-based module.

Key components:
- Stem/PatchEmbed: a stem/patchify layer that processes the image at the network's beginning; it creates patches/tokens and converts them into embeddings.
- Level: the repetitive building block that extracts features using the blocks below.
- Global Token Gen./FeatExtract: generates global tokens/patches with a Depthwise-CNN, SE (Squeeze-and-Excitation), a CNN, and MaxPooling. So basically it's a feature extractor.
- Block: the repetitive module that applies attention to the features and projects them to a certain dimension.
- Local-MSA: local multi-head self-attention.
- Global-MSA: global multi-head self-attention.
- MLP: linear layers that project a vector to another dimension.
- Downsample/ReduceSize: very similar to the Global Token Gen. module, except it uses a CNN instead of MaxPooling to downsample, with additional Layer Normalization modules.
- Head: the module responsible for the classification task.
  - Pooling: converts each sample's 2D feature map into a 1D feature vector (N×2D features to N×1D features).
  - Classifier: processes the N×1D features to make a decision about the class.

I annotated the architecture like this to make it easier to digest: https://i.imgur.com/bTqIUH2.png
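To make the Local-MSA vs. Global-MSA distinction concrete, here is a minimal single-head NumPy sketch. It is not the paper's implementation: the projection matrices are omitted, and the window size, shapes, and `q_global` tokens are illustrative assumptions. The only difference between the two attentions is whether the query comes from the window itself or from shared global tokens:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, ws):
    # (H, W, C) -> (num_windows, ws*ws, C)
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_attention(x, ws, q_global=None):
    # Single-head window attention; q/k/v projections skipped for brevity.
    # If q_global (ws*ws, C) is given, it replaces the per-window query:
    # that is the Global-MSA case, where every window attends with the
    # same globally-informed query tokens.
    C = x.shape[-1]
    win = window_partition(x, ws)                      # (nW, ws*ws, C)
    k, v = win, win
    q = win if q_global is None else np.broadcast_to(q_global, win.shape)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))
    return attn @ v                                    # (nW, ws*ws, C)

x = np.random.rand(8, 8, 16)                # toy 8x8 feature map, C=16
local_out = window_attention(x, ws=4)       # Local-MSA: query from the window
g = np.random.rand(16, 16)                  # hypothetical global query tokens
global_out = window_attention(x, ws=4, q_global=g)  # Global-MSA: shared query
```

Attention itself stays window-local (linear cost), but in the global case the query has already seen the whole image, which is how long-range information enters.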
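The Global Token Gen./FeatExtract pipeline (Depthwise-CNN, SE, CNN, MaxPooling) can be sketched the same way; kernel sizes, the SE reduction ratio, and all weights below are toy assumptions, not the paper's values:

```python
import numpy as np

def dwconv3x3(x, k):
    # depthwise 3x3 conv with zero padding; k: (3, 3, C), one filter per channel
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += p[i:i + H, j:j + W] * k[i, j]
    return out

def se_block(x, w1, w2):
    # Squeeze-and-Excitation: global-pool -> two linear layers -> channel gates
    s = x.mean(axis=(0, 1))                               # squeeze: (C,)
    g = 1 / (1 + np.exp(-(np.maximum(s @ w1, 0) @ w2)))   # excite: sigmoid gates
    return x * g                                          # recalibrate channels

def pw_conv(x, w):
    # the plain CNN step, reduced here to a 1x1 (pointwise) convolution
    return x @ w

def max_pool2x2(x):
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

x = np.random.rand(8, 8, 16)                   # toy feature map
k = np.random.rand(3, 3, 16)                   # depthwise kernels
w1, w2 = np.random.rand(16, 4), np.random.rand(4, 16)   # SE (ratio 4, assumed)
wp = np.random.rand(16, 16)                    # pointwise conv weights
tokens = max_pool2x2(pw_conv(se_block(dwconv3x3(x, k), w1, w2), wp))  # (4, 4, 16)
```

Downsample/ReduceSize follows the same shape recipe, except the final 2x reduction would come from a strided CNN rather than `max_pool2x2`, with LayerNorm around it.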
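The Head's two stages (Pooling, then Classifier) are just a global average pool followed by a linear layer; the channel count and 1000-class output below are illustrative assumptions:

```python
import numpy as np

def head(feat, w, b):
    # Pooling: (N, C, H, W) -> mean over the 2D spatial map -> (N, C)
    pooled = feat.mean(axis=(2, 3))
    # Classifier: linear layer producing per-class logits
    return pooled @ w + b

feat = np.random.rand(2, 512, 7, 7)       # hypothetical final feature maps
w = np.random.rand(512, 1000) * 0.01      # hypothetical classifier weights
b = np.zeros(1000)
logits = head(feat, w, b)                 # (2, 1000)
```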