First published: 2024/11/21

Abstract: We propose global context vision transformer (GC ViT), a novel architecture
that enhances parameter and compute utilization for computer vision tasks. The
core of the novel model is a set of global context self-attention modules, used
jointly with standard local self-attention to effectively yet efficiently model
both long- and short-range spatial interactions, as an alternative to complex
operations such as attention masks or local window shifting. While the local
self-attention modules are responsible for modeling short-range information,
the global query tokens are shared across all global self-attention modules to
interact with local key and values. In addition, we address the lack of
inductive bias in ViTs and improve the modeling of inter-channel dependencies
by proposing a novel downsampler which leverages a parameter-efficient fused
inverted residual block. The proposed GC ViT achieves new state-of-the-art
performance across image classification, object detection and semantic
segmentation tasks. On ImageNet-1K dataset for classification, GC ViT models
with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1
accuracy, respectively, surpassing comparably-sized prior art such as CNN-based
ConvNeXt and ViT-based Swin Transformer. Pre-trained GC ViT backbones in
downstream tasks of object detection, instance segmentation, and semantic
segmentation on MS COCO and ADE20K datasets outperform prior work consistently,
sometimes by large margins.
The original Transformer (ViT) captures long-range information with the self-attention mechanism, but it comes with quadratic computation cost and lacks multi-resolution information. Swin Transformer then introduces local window self-attention to reduce the cost to linear with respect to image size, shifted-window attention to capture cross-window information, and a hierarchical architecture to exploit multi-resolution information. However, shifted-window attention still struggles to capture long-range information because each shifted window covers only a small area, and, like ViT, Swin lacks inductive bias. Global Context ViT is proposed to address these limitations of the Swin Transformer.
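To make the cost argument concrete: with N = H×W tokens, full self-attention costs O(N²), while attention restricted to fixed M×M windows costs O(N·M²), i.e. linear in image size. Below is a minimal PyTorch-style sketch of local window self-attention, written under my own simplifying assumptions (window_partition and LocalWindowAttention are illustrative names, not the papers' code):

import torch
import torch.nn as nn

def window_partition(x, window_size):
    # (B, H, W, C) -> (B*num_windows, window_size*window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

class LocalWindowAttention(nn.Module):
    # Plain multi-head self-attention applied independently inside each window.
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C), H and W divisible by window_size
        windows = window_partition(x, self.window_size)
        out, _ = self.attn(windows, windows, windows)   # attention never crosses window borders
        return out

x = torch.randn(2, 56, 56, 96)                  # batch of 2, 56x56 feature map, 96 channels
out = LocalWindowAttention(dim=96, num_heads=3, window_size=7)(x)   # (128, 49, 96): 64 windows per image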
Improvements:
(1) Unlike Swin Transformer, this paper uses global context self-attention together with local self-attention, rather than shifted-window self-attention, to model both long- and short-range dependencies.
(2) Although global window attention is still a form of window attention, it leverages a global query that carries image-wide information and can therefore capture long-range dependencies (a minimal sketch of this idea appears right after this list).
(3) In addition, this paper compensates for the lack of inductive bias in both ViT and Swin Transformer by utilizing a CNN-based module.
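To make point (2) concrete, here is a minimal sketch of the global-query idea under my own simplifying assumptions (GlobalWindowAttention, x_windows and q_global are illustrative names, not the official GC ViT code): query tokens derived once from the whole feature map are shared by every local window, while keys and values are computed per window.

import torch
import torch.nn as nn

class GlobalWindowAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.kv = nn.Linear(dim, 2 * dim)               # keys/values stay local to each window
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_windows, q_global):
        # x_windows: (B*num_windows, M*M, C) local window tokens
        # q_global:  (B, M*M, C) global query tokens, shared by all windows of an image
        Bw, N, C = x_windows.shape
        B = q_global.shape[0]
        k, v = self.kv(x_windows).chunk(2, dim=-1)
        q = q_global.repeat_interleave(Bw // B, dim=0)  # broadcast the same query to every window

        def heads(t):                                   # (Bw, N, C) -> (Bw, num_heads, N, head_dim)
            return t.view(Bw, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bw, N, C)
        return self.proj(out)

x_windows = torch.randn(2 * 64, 49, 96)                 # 64 windows per image, 7x7 tokens each
q_global = torch.randn(2, 49, 96)                       # one set of global query tokens per image
out = GlobalWindowAttention(dim=96, num_heads=3)(x_windows, q_global)   # (128, 49, 96)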
Key components:
Stem/PatchEmbed: A stem/patchify layer that processes the image at the very beginning of the network. Here it splits the image into patches/tokens and converts them into embeddings.
Level: The repeated building block of each stage; it extracts features using the modules listed below.
Global Token Gen./FeatExtract: Generates the global tokens/patches using a depthwise CNN, SE (Squeeze-and-Excitation), a CNN and MaxPooling, so it is essentially a feature extractor (a sketch of this block appears after this list).
Block: It is the repetitive module that applies attention to the features and projects them to a certain dimension.
Local-MSA: Local Multi-head Self-Attention.
Global-MSA: Global Multi-head Self-Attention.
MLP: Linear layers that project a vector to another dimension.
Downsample/ReduceSize: Very similar to the Global Token Gen. module, except that it uses a strided CNN instead of MaxPooling to downsample, plus additional Layer Normalization modules (covered by the same sketch after this list).
Head: The module responsible for the classification task (a short sketch of it follows the list).
Pooling: Converts the N×2D feature maps into N×1D feature vectors.
Classifier: Processes the N×1D feature vectors to predict the class.
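For the Global Token Gen./FeatExtract and Downsample/ReduceSize entries, here is a minimal sketch of the kind of CNN block described, under my own assumptions about layer ordering (not the official code; channel expansion and the downsampler's extra LayerNorms are omitted for brevity): a depthwise conv + Squeeze-and-Excitation + pointwise conv block with a residual connection, followed by MaxPooling for Global Token Gen. or a strided conv for Downsample.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                    # squeeze: global spatial average per channel
            nn.Conv2d(dim, dim // reduction, 1), nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)                         # excite: re-weight channels

class FeatExtract(nn.Module):
    def __init__(self, dim, use_pool=True):
        super().__init__()
        self.conv = nn.Sequential(                      # fused-inverted-residual-style block
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim, bias=False),   # depthwise conv
            nn.GELU(),
            SqueezeExcite(dim),
            nn.Conv2d(dim, dim, 1, bias=False),         # pointwise conv
        )
        # Global Token Gen. reduces resolution with max-pooling; Downsample uses a strided conv.
        self.reduce = (nn.MaxPool2d(3, stride=2, padding=1) if use_pool
                       else nn.Conv2d(dim, dim, 3, stride=2, padding=1, bias=False))

    def forward(self, x):                               # (B, C, H, W) -> (B, C, H/2, W/2)
        return self.reduce(x + self.conv(x))

tokens = FeatExtract(96, use_pool=True)(torch.randn(2, 96, 56, 56))     # (2, 96, 28, 28)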
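And for the Head/Pooling/Classifier entries, a short sketch of what converting N×2D features to N×1D features means in practice, assuming global average pooling (the channel count 512 and the 1000 ImageNet classes are placeholders):

import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),    # (B, C, H, W) -> (B, C, 1, 1): each 2D feature map becomes one value
    nn.Flatten(1),              # -> (B, C), i.e. a 1D feature vector per image
    nn.Linear(512, 1000),       # placeholder: C=512 channels to 1000 class logits
)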
I annotated the architecture like this to make it easier to digest:
https://i.imgur.com/bTqIUH2.png