Resnet in Resnet: Generalizing Residual Architectures
Sasha Targ, Diogo Almeida, Kevin Lyman
arXiv e-Print archive - 2016
* They describe an architecture that merges classical convolutional networks and residual networks.
* The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
* The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the amount of convolutional layers in the whole network).
### How
* Just like residual networks, their network is organized into "blocks" of convolutional layers.
* Each block contains both residual units and non-residual units.
* They have two "streams" of data in their network (feature maps passed from block to block):
* Residual stream: the residual units write to this stream (i.e. it's their output).
* Transient stream: the non-residual units write to this stream.
* Residual and non-residual units receive *both* streams as input, but only write to *their* stream as output (see the update equations below).
* Their architecture, visualized (figure from the paper, not reproduced here).

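Written out, the per-unit stream updates look roughly like the following (my own notation, not copied from the paper; $*$ denotes convolution, $\sigma$ the ReLU, and the exact placement of batch normalization is an assumption):

$$r_{l+1} = \sigma\big(\mathrm{BN}(W_{r \to r} * r_l + W_{t \to r} * t_l + r_l)\big)$$
$$t_{l+1} = \sigma\big(\mathrm{BN}(W_{r \to t} * r_l + W_{t \to t} * t_l)\big)$$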
* Because of this architecture, their model can learn the number of layers per residual block: e.g. a residual unit whose convolution weights go to zero reduces to the identity shortcut, effectively removing that layer (though BN and ReLU might cause problems here?).

* The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged); a code sketch follows this list:
* Input of size CxHxW (both streams, each C/2 planes)
* Concatenate both streams along the channel axis.
* Residual stream: apply a convolution with C/2 output planes to the C input planes, then add the residual-stream input (shortcut).
* Transient stream: apply a convolution with C/2 output planes to the C input planes.
* Apply BN
* Apply ReLU
* Output of size CxHxW.
* The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero.
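A minimal PyTorch sketch of the recipe above (my own code, not from the paper; the class name, kernel size 3, and "same" padding are assumptions):

```python
import torch
import torch.nn as nn


class GeneralizedResidualBlock(nn.Module):
    """One two-stream block: concat -> two convs -> shortcut -> BN -> ReLU."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2  # each stream carries C/2 planes
        # Both convolutions read the concatenated C input planes and
        # each writes C/2 output planes to its own stream.
        self.conv_to_residual = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.conv_to_transient = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, r, t):
        x = torch.cat([r, t], dim=1)           # C x H x W
        r_out = self.conv_to_residual(x) + r   # shortcut addition on the residual stream
        t_out = self.conv_to_transient(x)      # no shortcut on the transient stream
        out = self.relu(self.bn(torch.cat([r_out, t_out], dim=1)))
        half = out.size(1) // 2
        return out[:, :half], out[:, half:]    # split back into the two streams


# Example usage on CIFAR-sized inputs with C = 32 (16 planes per stream):
block = GeneralizedResidualBlock(32)
r = torch.randn(4, 16, 32, 32)
t = torch.randn(4, 16, 32, 32)
r, t = block(r, t)
```

Stacking several such blocks (plus the usual downsampling/widening between stages) would give the full network; this sketch only covers a single block.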
### Results
* They test on CIFAR-10 and CIFAR-100.
* They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search.
* Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%).