First published: 2017/09/05

Abstract: Convolutional neural networks are built upon the convolution operation, which
extracts informative features by fusing spatial and channel-wise information
together within local receptive fields. In order to boost the representational
power of a network, much existing work has shown the benefits of enhancing
spatial encoding. In this work, we focus on channels and propose a novel
architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that
adaptively recalibrates channel-wise feature responses by explicitly modelling
interdependencies between channels. We demonstrate that by stacking these
blocks together, we can construct SENet architectures that generalise extremely
well across challenging datasets. Crucially, we find that SE blocks produce
significant performance improvements for existing state-of-the-art deep
architectures at slight computational cost. SENets formed the foundation of our
ILSVRC 2017 classification submission which won first place and significantly
reduced the top-5 error to 2.251%, achieving a 25% relative improvement over
the winning entry of 2016.
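
To make the channel recalibration described above concrete, here is a minimal PyTorch sketch of an SE block. The squeeze (global average pooling) and excitation (bottleneck MLP with sigmoid gating) steps follow the paper's design; the class name, the default reduction ratio of 16, and the example tensor shapes are illustrative assumptions rather than details taken from this abstract.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: recalibrates channel-wise responses.

    Squeeze: global average pooling collapses each HxW feature map into a
    single per-channel descriptor. Excitation: a bottleneck MLP (reduction
    ratio r) followed by a sigmoid yields per-channel gates in [0, 1],
    which rescale the original feature maps.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # squeeze: (B, C) channel descriptors
        w = self.fc(s).view(b, c, 1, 1)  # excitation: per-channel gates
        return x * w                     # recalibrate the feature maps

# Usage: the block can wrap the output of any convolutional stage.
feats = torch.randn(2, 64, 32, 32)
out = SEBlock(64)(feats)
assert out.shape == feats.shape
```

Because the gates depend on the input itself, the rescaling is adaptive: the same block emphasises different channels for different inputs, which is the explicit modelling of channel interdependencies the abstract refers to.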