First published: 2014/12/21

Abstract: Most modern convolutional neural networks (CNNs) used for object recognition
are built using the same principles: Alternating convolution and max-pooling
layers followed by a small number of fully connected layers. We re-evaluate the
state of the art for object recognition from small images with convolutional
networks, questioning the necessity of different components in the pipeline. We
find that max-pooling can simply be replaced by a convolutional layer with
increased stride without loss in accuracy on several image recognition
benchmarks. Following this finding -- and building on other recent work for
finding simple network structures -- we propose a new architecture that
consists solely of convolutional layers and yields competitive or state of the
art performance on several object recognition datasets (CIFAR-10, CIFAR-100,
ImageNet). To analyze the network we introduce a new variant of the
"deconvolution approach" for visualizing features learned by CNNs, which can be
applied to a broader range of network structures than existing approaches.

This paper simplifies the convolutional network proposed
by Alex Krizhevsky by replacing max-pooling with strided
convolutions, under the assumption that max-pooling is
needed only for dimensionality reduction. The authors also
propose a new technique for visualizing the representations
learnt by intermediate layers; it produces cleaner visualizations
in input pixel space than the DeconvNet (Zeiler et al.) and saliency
map (Simonyan et al.) approaches.
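
The pooling-to-convolution substitution can be illustrated on a single channel. The following is a minimal sketch in plain Python, assuming a 2x2 window, stride 2, and no padding. The helper names `max_pool2d` and `strided_conv2d` and the averaging kernel are illustrative choices, not taken from the paper. Both operations shrink a 4x4 input to 2x2; the difference is that the convolution's weights would be learned rather than fixed.

```python
def max_pool2d(x, size=2, stride=2):
    """Max-pool a 2D list of numbers (no padding)."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [
        [
            max(
                x[i * stride + di][j * stride + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def strided_conv2d(x, kernel, stride=2):
    """Cross-correlate a 2D list with a square kernel at the given stride (no padding)."""
    k = len(kernel)
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    return [
        [
            sum(
                kernel[di][dj] * x[i * stride + di][j * stride + dj]
                for di in range(k)
                for dj in range(k)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 2, 2],
    [1, 0, 2, 6],
]

# Hypothetical kernel for illustration; in a real network these weights are learned.
kernel = [[0.25, 0.25], [0.25, 0.25]]

pooled = max_pool2d(image)                 # fixed, non-learned downsampling
convolved = strided_conv2d(image, kernel)  # downsampling with trainable weights

print(pooled)
print(convolved)
```

With this particular kernel the strided convolution behaves like average pooling, which shows that a stride-2 convolution can express a pooling-style reduction as a special case while remaining free to learn a different one.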
## Strengths
- Their model performs on par with or better than the original AlexNet formulation.
- Max-pooling replaced by convolution with stride 2
- Fully-connected layers replaced by 1x1 convolutions and global averaging + softmax
- Smaller filter size (same intuition as VGGNet paper)
- Guided backpropagation combines the DeconvNet (Zeiler et al.) and backpropagation
(Simonyan et al.) approaches, which differ only in how they handle the ReLU operator:
it masks out values where at least one of the input activation or the output
reconstruction is negative. This is neat and leads to nice visualizations.
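
The three ways of passing a signal backwards through a ReLU can be compared side by side. This is a minimal sketch in plain Python, assuming the gating rules as described in the bullet above; the function name `relu_backward`, the `mode` strings, and the example values are hypothetical and chosen only for illustration.

```python
def relu_backward(forward_input, grad_output, mode):
    """Pass a backward signal through a ReLU under one of three rules.

    forward_input: values that entered the ReLU during the forward pass.
    grad_output:   signal arriving from the layer above during the backward pass.
    mode:          "backprop", "deconvnet", or "guided" (illustrative labels).
    """
    result = []
    for x, g in zip(forward_input, grad_output):
        if mode == "backprop":
            keep = x > 0            # gate on the forward activation only
        elif mode == "deconvnet":
            keep = g > 0            # gate on the backward signal only
        elif mode == "guided":
            keep = x > 0 and g > 0  # gate on both: guided backpropagation
        else:
            raise ValueError(f"unknown mode: {mode}")
        result.append(g if keep else 0.0)
    return result


# Made-up example values covering all four sign combinations.
forward_input = [1.0, -1.0, 2.0, -3.0]
grad_output = [0.5, 2.0, -1.5, -0.5]

backprop = relu_backward(forward_input, grad_output, "backprop")
deconvnet = relu_backward(forward_input, grad_output, "deconvnet")
guided = relu_backward(forward_input, grad_output, "guided")

print(backprop)   # [0.5, 0.0, -1.5, 0.0]
print(deconvnet)  # [0.5, 2.0, 0.0, 0.0]
print(guided)     # [0.5, 0.0, 0.0, 0.0]
```

Only the first position survives under the guided rule, because it is the only one where both the forward input and the backward signal are positive.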
## Weaknesses / Notes
- Saliency maps generated by guided backpropagation look much better than
DeconvNet visualizations and the saliency maps from Simonyan et al.'s paper.
This is probably because the negative saliency values only arise from the very
first convolution, since negative error signals are never propagated back through the