The authors propose a new way to initialize the weights of a deep feedforward network, inspired by residual networks, and then apply it to initialize the layers of a residual network, with improved results on CIFAR-10/100.
The abstract is inaccurate with respect to the experiments actually performed in the paper: an architecture with the ability to 'forget' is only mentioned briefly, without detail, towards the end of the paper, backed by a single experiment.
The authors propose an initialization scheme based on some comparisons to the ResNet architecture. They also replace CONV blocks with the proposed ResNetInit CONV blocks to obtain a ResNet in ResNet (RiR). These experiments are worthwhile, and the connections made between the models in the paper are interesting.
* They describe an architecture that merges classical convolutional networks and residual networks.
* The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them.
* The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the amount of convolutional layers in the whole network).
* Just like residual networks, they have "blocks". Each block contains convolutional layers.
* Each block contains residual units and non-residual units.
* They have two "streams" of data in their network (simply the feature maps produced by each block):
* Residual stream: The residual blocks write to this stream (i.e. it's their output).
* Transient stream: The non-residual blocks write to this stream.
* Residual and non-residual layers receive *both* streams as input, but only write to *their* stream as output.
* Their architecture visualized:
* Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?):
![Learning layercount](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/Resnet_in_Resnet__learning_layercount.png?raw=true "Learning layercount")
* The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged):
* Input of size CxHxW (both streams, each C/2 planes)
  * Residual block: Apply C/2 convolutions to the C input planes, then add the residual stream to the result (shortcut addition).
* Transient block: Apply C/2 convolutions to the C input planes.
* Apply BN
* Apply ReLU
* Output of size CxHxW.
* The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero (so that each stream only writes to its own output).
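* The two-stream unit described above can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions (equal channel split between the streams, 3x3 convolutions, and illustrative names like `RiRUnit` that are not from the paper's code):

```python
# Minimal sketch of one generalized residual (RiR) unit.
# Both streams are read as input; each conv writes only to its own stream;
# the shortcut addition is applied only on the residual stream.
import torch
import torch.nn as nn


class RiRUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # Each conv reads all C input planes and produces C/2 output planes.
        self.conv_residual = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.conv_transient = nn.Conv2d(channels, half, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, r, t):
        x = torch.cat([r, t], dim=1)       # both streams as input (C planes)
        r_out = self.conv_residual(x) + r  # shortcut only on the residual stream
        t_out = self.conv_transient(x)     # transient stream: no shortcut
        out = self.relu(self.bn(torch.cat([r_out, t_out], dim=1)))
        half = r.shape[1]
        return out[:, :half], out[:, half:]
```

  Note the design choice this makes visible: because the transient stream has no shortcut, its conv can learn to suppress (or ignore) information, which is what gives the architecture its theoretical ability to emulate both plain CNNs and ResNets.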
* They test on CIFAR-10 and CIFAR-100.
* They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search.
* Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%).