First published: 2016/06/08

Abstract: Despite the success of CNNs, selecting the optimal architecture for a given task remains an open problem. Instead of aiming to select a single optimal architecture, we propose a "fabric" that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern. The only hyper-parameters of a fabric are the number of channels and layers. While individual architectures can be recovered as paths, the fabric can in addition ensemble all embedded architectures together, sharing their weights where their paths overlap. Parameters can be learned using standard methods based on back-propagation, at a cost that scales linearly in the fabric size. We present benchmark results competitive with the state of the art for image classification on MNIST and CIFAR10, and for semantic segmentation on the Part Labels dataset.
Convolutional Neural Fabrics (CNFs) are a scheme for constructing CNN architectures.
> Instead of aiming to select a single optimal architecture, we propose a “fabric” that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern.
![Image](http://i.imgur.com/wlISXgo.png)
* **Pooling**: CNFs don't use pooling; it is arguably not needed, since resolution changes are handled by strided convolutions instead (see the sketch after this list).
* **Filter size**: All convolutions use kernel size 3.
* **Output layer**: Scale $1 \times 1$, channels = number of classes
* **Activation function**: Rectified linear units (ReLUs) are used at all nodes.
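
To make the connectivity concrete, here is a minimal PyTorch sketch of one fabric layer (my own illustration, not the authors' code). It only models the scale direction of the trellis: each scale gets 3×3 convolutions from the finer, same, and coarser scales of the previous layer, with stride-2 convolutions going down in resolution and nearest-neighbour upsampling going up. The sparse channel connectivity of the full 3D trellis is replaced here by ordinary (dense over channels) convolutions; the class name `FabricLayer` and all hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FabricLayer(nn.Module):
    """One layer of a fabric (scale direction only, illustrative sketch).

    Every scale s receives a 3x3 convolution from scales s-1 (finer), s and
    s+1 (coarser) of the previous layer; the contributions are summed and
    passed through a ReLU. Down-sampling uses stride-2 convolutions and
    up-sampling uses nearest-neighbour interpolation, so no pooling is used.
    """

    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.num_scales = num_scales

        def conv(stride):
            return nn.Conv2d(channels, channels, kernel_size=3,
                             stride=stride, padding=1)

        self.same = nn.ModuleList(conv(1) for _ in range(num_scales))
        self.down = nn.ModuleList(conv(2) for _ in range(num_scales - 1))  # finer -> coarser
        self.up = nn.ModuleList(conv(1) for _ in range(num_scales - 1))    # coarser -> finer

    def forward(self, maps):
        # maps[s] holds the response map at scale s (resolution H/2^s x W/2^s)
        out = []
        for s in range(self.num_scales):
            x = self.same[s](maps[s])
            if s > 0:  # contribution from the finer scale s-1; stride-2 conv halves the resolution
                x = x + self.down[s - 1](maps[s - 1])
            if s < self.num_scales - 1:  # contribution from the coarser scale s+1
                coarser = F.interpolate(maps[s + 1], scale_factor=2, mode="nearest")
                x = x + self.up[s](coarser)
            out.append(F.relu(x))
        return out


# Toy usage: 4 scales on 32x32 inputs (resolutions 32, 16, 8, 4), 64 channels.
layer = FabricLayer(channels=64, num_scales=4)
maps = [torch.randn(1, 64, 32 // 2 ** s, 32 // 2 ** s) for s in range(4)]
outs = layer(maps)  # 4 response maps at the same resolutions as the inputs
```

Stacking such layers and reading off the node at scale $1 \times 1$ with one channel per class gives a classifier; any single path through the trellis corresponds to a conventional CNN.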
## Evaluation
* Part Labels dataset (face images from the LFW dataset): a super-pixel accuracy of 95.6%
* MNIST: 0.33% error (cf. [SotA](https://martin-thoma.com/sota/#image-classification): 0.21%)
* CIFAR10: 7.43% error (cf. [SotA](https://martin-thoma.com/sota/#image-classification): 2.72%)
## What I didn't understand
* "Activations are thus a linear function over multi-dimensional neighborhoods, i.e. a four dimensional
3×3×3×3 neighborhood when processing 2D images"
* "within the first layer, channel c at scale s receives input from channels c + {−1, 0, 1} from scale s − 1": Why does the scale change? Why doesn't the first layer receive input from the same scale?