Summary by Hadrien Bertrand 5 years ago
Natural images can be decomposed into frequencies: higher frequencies contain small changes and details, while lower frequencies contain the global structure. We can see an example in this image:
![image](https://user-images.githubusercontent.com/8659132/58988729-4e599b80-87b0-11e9-88e2-0ecde2cce369.png)
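To make the decomposition concrete, here is a minimal sketch (my own illustration, not from the paper) that splits an image tensor into a low-frequency part (a downsampled, averaged copy) and a high-frequency residual, using PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 224)                              # dummy image batch (B, C, H, W)

low = F.avg_pool2d(x, kernel_size=2)                         # 112x112: coarse, global structure
low_up = F.interpolate(low, scale_factor=2, mode="nearest")  # back to 224x224
high = x - low_up                                            # what remains: fine details and edges

print(low.shape, high.shape)  # (1, 3, 112, 112) and (1, 3, 224, 224)
```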
Each filter of a convolutional layer focuses on different frequencies of the image. This paper proposes a way to group them explicitly into high- and low-frequency filters.
To do that, the low-frequency group is reduced spatially by a factor of 2 in each dimension (a reduction the authors define as an octave) before applying the convolution. This spatial reduction is a pooling operation, which makes sense as a low-pass filter: small details are discarded, but the global structure is kept.
More concretely, the layer takes as input two groups of feature maps, one at a higher resolution than the other. The output is also two groups of feature maps, separated into high and low frequencies. Information is exchanged between the two groups by pooling or upsampling as needed, as shown in this image:
![image](https://user-images.githubusercontent.com/8659132/58990790-c7f38880-87b4-11e9-8bca-6a23c63963ad.png)
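To make the mechanics concrete, here is a simplified PyTorch sketch of such a layer (my own illustration, not the authors' code; it ignores stride, bias choices, and the special first/last layers where one of the groups is empty). The four convolution paths correspond to the four arrows in the figure above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv2d(nn.Module):
    """Simplified octave convolution: maps a (high, low) pair of feature maps to a (high, low) pair."""

    def __init__(self, in_channels, out_channels, kernel_size=3, alpha=0.25, padding=1):
        super().__init__()
        in_lo = int(alpha * in_channels)     # channels kept at half resolution
        in_hi = in_channels - in_lo          # channels kept at full resolution
        out_lo = int(alpha * out_channels)
        out_hi = out_channels - out_lo

        # Four convolution paths: high->high, high->low, low->high, low->low
        self.conv_hh = nn.Conv2d(in_hi, out_hi, kernel_size, padding=padding)
        self.conv_hl = nn.Conv2d(in_hi, out_lo, kernel_size, padding=padding)
        self.conv_lh = nn.Conv2d(in_lo, out_hi, kernel_size, padding=padding)
        self.conv_ll = nn.Conv2d(in_lo, out_lo, kernel_size, padding=padding)

    def forward(self, x_high, x_low):
        # High-frequency output: same-resolution conv + upsampled conv of the low group
        y_hh = self.conv_hh(x_high)
        y_lh = F.interpolate(self.conv_lh(x_low), scale_factor=2, mode="nearest")
        # Low-frequency output: conv of the low group + conv of the pooled high group
        y_ll = self.conv_ll(x_low)
        y_hl = self.conv_hl(F.avg_pool2d(x_high, 2))
        return y_hh + y_lh, y_ll + y_hl
```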
The proportion of high- and low-frequency feature maps is controlled by a single ratio parameter (called α in the paper), and through testing the authors found that making around 25% of the features low-frequency gives the best performance.
One important property of this layer is that it can be used as a drop-in replacement for a standard convolutional layer, and thus does not require other changes to the architecture. The authors test it on various ResNets, DenseNets and MobileNets.
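As a rough illustration of the drop-in aspect, continuing the `OctConv2d` sketch above: the layers stack just like ordinary convolutions, except that feature maps travel as a (high, low) pair. (The paper handles the entry and exit of the network by setting the input or output low-frequency ratio to zero in the first and last octave layers.)

```python
layer1 = OctConv2d(64, 128, alpha=0.25)
layer2 = OctConv2d(128, 128, alpha=0.25)

x_high = torch.randn(8, 48, 56, 56)  # 75% of 64 input channels at full resolution
x_low = torch.randn(8, 16, 28, 28)   # 25% of 64 input channels at half resolution

h, l = layer1(x_high, x_low)         # shapes: (8, 96, 56, 56) and (8, 32, 28, 28)
h, l = layer2(F.relu(h), F.relu(l))  # stacked like ordinary conv layers
```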
In terms of tasks, they get performance near state of the art on [ImageNet top-1](https://paperswithcode.com/sota/image-classification-on-imagenet) and top-5. So why use octave convolutions? Because they reduce the amount of memory and computation required by the network.
# Comments
- I would have liked to see more groups at varying frequencies. Since a reduction by n octaves corresponds to a spatial reduction of 2^n, the authors could do the same with n > 1. I expect this will be addressed in future work.
- While the results are not quite SOTA, octave convolutions seem compatible with EfficientNet, and I expect this would improve the performance of both.
- Since each octave convolution layer outputs a multi-scale representation of the input, doesn't that mean that pooling becomes less necessary in the network? If so, octave convolutions would give better performance in a new architecture optimized for them.
Code: [Official](https://github.com/facebookresearch/OctConv), [all implementations](https://paperswithcode.com/paper/drop-an-octave-reducing-spatial-redundancy-in)