Brendel and Bethge show empirically that state-of-the-art deep neural networks on ImageNet rely to a large extent on local features, without any notion of interaction between them. To this end, they propose a bag-of-local-features model by applying a ResNet-like architecture to small patches of ImageNet images. The class predictions obtained from these local patches are then averaged, and a linear classifier is trained on top. Due to this locality, the model allows us to inspect which areas of an image contribute to the model's decision, as shown in Figure 1. Furthermore, these local features are sufficient for good performance on ImageNet. Finally, using scrambled ImageNet images, they show that regular deep neural networks also rely heavily on local features, without any notion of spatial interaction between them.

https://i.imgur.com/8NO1w0d.png

Figure 1: Illustration of the heat maps obtained using BagNets, the bag-of-local-features model proposed in the paper, for different sizes of the local patches.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
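To make the bag-of-local-features model described above concrete, here is a minimal PyTorch sketch of the idea. This is a toy rendition rather than the authors' BagNet implementation: the patch encoder, channel widths, and receptive field are placeholder assumptions (the real model is a modified ResNet-50 whose receptive field is restricted to small patches).

```python
# Minimal sketch of the bag-of-local-features idea (toy version, not the
# authors' BagNet): a small CNN with a limited receptive field scores each
# image location independently, a 1x1 convolution acts as a per-patch linear
# classifier, and the per-patch class evidence is simply averaged over space.
import torch
import torch.nn as nn

class ToyBagNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Placeholder local encoder with a receptive field of a few tens of
        # pixels, standing in for the paper's ResNet-style patch encoder.
        self.local_features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3), nn.ReLU(),
        )
        # 1x1 convolution = linear classifier applied to each location separately.
        self.local_logits = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.local_features(x)         # (B, 256, H', W')
        logits_map = self.local_logits(feats)  # per-patch class evidence
        # Average local evidence over all spatial positions; there are no
        # interactions between patches beyond this linear pooling step.
        return logits_map.mean(dim=(2, 3))     # (B, num_classes)

model = ToyBagNet()
scores = model(torch.randn(2, 3, 224, 224))
print(scores.shape)  # torch.Size([2, 1000])
```

Because both the 1x1 classifier and the spatial average are linear, the pre-pooling logit map can be read off directly as a class-evidence heat map like the ones in Figure 1.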
When talking about modern machine learning, particularly on images, it can feel like deep neural networks are a world unto themselves when it comes to complexity. On one hand, there are straightforward things like hand-designed features and linear classifiers, and on the other, there are these deep, heavily interacting networks that dazzle us with their performance but seem almost unavoidably difficult to hold in our heads or interpret. This paper, from ICLR 2019 earlier this year, investigates another point along this trade-off curve of complexity: a model that uses deep layers of convolutions, but limits the receptive field of those convolutions so that each feature is calculated using only a small spatial area of the image.

https://i.imgur.com/NR0vFbN.png

This approach, termed BagNet, essentially predicts class logits from a small area of the image, without using information from anywhere else. Then, to aggregate the local predictions, a few simple, linear steps are performed: the predictions from each spatial area are averaged together into one vector containing the "aggregate information" for each class, and that class-information vector is passed into a linear (non-interacting!) model to predict final class probabilities. This is quite nice for interpretability, because you can directly identify the areas of the image that contributed evidence to the prediction, and you can know that the impact of those areas wasn't amplified by feature values elsewhere, because there are no interaction effects outside of these small regions.

Now, it's fairly obvious that you're not going to get any state-of-the-art results out of this: the entire point is to handicap a network in ways believed to make it more interpretable. So the interesting question is instead: what degree of performance loss comes from such a (fairly drastic) limitation of model capacity and receptive field? And the answer of the paper is: less than you might think. (Or, at least, less than *they* think you think.) If you only use features calculated from 33x33-pixel chunks of ImageNet images, and aggregate their evidence together in a purely linear way, you can get to 87.6% top-5 accuracy on ImageNet, which is about where we were with AlexNet in 2012.

The authors also compare their network to more common neural networks, to argue that even fully nonlinear neural nets don't make much use of interactions between distant image regions in their predictions. One way they did this was by masking different areas of the image, and comparing the effect of masking each area individually to the effect of masking all areas together. In a purely linear model like BagNet, where the effects of different areas are just aggregated together, these effects would sum exactly: the performance loss from masking all areas at once would equal the sum of the losses from masking each individually. To measure this "effective spatial linearity" of each network, they measured the correlation between the sum of the individual effects and the joint effect (a rough sketch of this kind of check is included at the end of this summary). For VGG, they found a correlation of 0.75 here (compared to 1.0 for BagNet), which they use to argue that VGG doesn't make heavy use of spatial interactions either. I found this result hard to really get a grounding on, since I don't have a good intuitive grasp of what differences in this correlation value would mean. Is a difference of 0.25 a small difference, or a dramatic one?

https://i.imgur.com/hA58AKM.png

That aside, I found this paper interesting, and I'm quite pleased it was written.
On one hand, you can say: well, obviously, we've done a lot of work in the seven years since then to build ResNet and DenseNet and whatnot, so of course if you apply those more advanced architectures, even on a small region of image space, you'll get good performance. That said, I still think this is an interesting finding, because it helps us understand how much of the added value in recent research requires high (and uninterpretable) interaction complexity, and what proportion of the overall performance can be achieved with a simpler-to-understand model. Machine learning is used in a lot of settings, and in practice it always sits on a trade-off curve, where performance is important, but it's often worth giving up some performance to do better on other considerations, and this paper does a good job of illustrating that trade-off curve more fully.
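As promised above, here is a rough sketch of how the masking-based spatial-linearity check could look. It reflects my reading of the idea, not the paper's protocol: the grid-cell masking, zero-filling, number of masked cells, and the `model` / `images` / `labels` inputs are all placeholder assumptions.

```python
# Rough sketch (my reading, not the paper's code) of the masking-based
# "spatial linearity" check: compare the summed effect of deleting image
# regions one at a time with the effect of deleting them all at once.
import numpy as np
import torch

def class_evidence(model, images, labels):
    """Logit of the true class for each image in the batch."""
    with torch.no_grad():
        logits = model(images)
    return logits[torch.arange(len(labels)), labels]

def masking_linearity(model, images, labels, grid=4, k=4, seed=0):
    rng = np.random.default_rng(seed)
    B, _, H, W = images.shape
    base = class_evidence(model, images, labels)

    individual_sum = torch.zeros_like(base)
    joint = images.clone()
    # Pick k grid cells to mask (zero-filling is a simplifying assumption).
    for cell in rng.choice(grid * grid, size=k, replace=False):
        i, j = divmod(int(cell), grid)
        ys = slice(i * H // grid, (i + 1) * H // grid)
        xs = slice(j * W // grid, (j + 1) * W // grid)
        masked = images.clone()
        masked[:, :, ys, xs] = 0                # mask one region at a time
        individual_sum += base - class_evidence(model, masked, labels)
        joint[:, :, ys, xs] = 0                 # accumulate the joint mask

    joint_effect = base - class_evidence(model, joint, labels)
    # For a model that combines spatial evidence linearly (BagNet by
    # construction) the two quantities coincide, so the correlation across
    # images is ~1; the summary above quotes ~0.75 for VGG.
    return float(np.corrcoef(individual_sum.cpu().numpy(),
                             joint_effect.cpu().numpy())[0, 1])
```

Calling something like `masking_linearity(vgg, images, labels)` on a batch of validation images would produce the kind of correlation number discussed above; the paper's exact masking scheme differs, so treat this purely as an illustration of the logic.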