First published: 2016/05/20
Abstract: We describe Swapout, a new stochastic training method that outperforms
ResNets of identical network structure yielding impressive results on CIFAR-10
and CIFAR-100. Swapout samples from a rich set of architectures including
dropout, stochastic depth and residual architectures as special cases. When
viewed as a regularization method swapout not only inhibits co-adaptation of
units in a layer, similar to dropout, but also across network layers. We
conjecture that swapout achieves strong regularization by implicitly tying the
parameters across layers. When viewed as an ensemble training method, it
samples a much richer set of architectures than existing methods such as
dropout or stochastic depth. We propose a parameterization that reveals
connections to existing architectures and suggests a much richer set of
architectures to be explored. We show that our formulation suggests an
efficient training method and validate our conclusions on CIFAR-10 and
CIFAR-100, matching state-of-the-art accuracy. Remarkably, our 32-layer wider
model performs similarly to a 1001-layer ResNet model.
This paper presents Swapout, a simple dropout method applied to Residual Networks (ResNets). In a ResNet, a layer $Y$ is computed from the previous layer $X$ as
$Y = X + F(X)$
where $F(X)$ is essentially the composition of a few convolutional layers. Swapout simply applies dropout separately to each term of this equation:
$Y = \Theta_1 \odot X + \Theta_2 \odot F(X)$
where $\Theta_1$ and $\Theta_2$ are independent binary (Bernoulli) dropout masks, one for each term.
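To make the layer concrete, here is a minimal sketch of such a block in PyTorch. This is not the authors' implementation: the name `SwapoutBlock`, the choice of two 3x3 convolutions for $F(X)$, and the single `drop_prob` parameter are illustrative assumptions, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class SwapoutBlock(nn.Module):
    # Hypothetical sketch of Y = Theta1 * X + Theta2 * F(X).
    def __init__(self, channels, drop_prob=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.keep_prob = 1.0 - drop_prob

    def forward(self, x):
        # F(X): a small stack of convolutions standing in for the block body
        fx = self.conv2(torch.relu(self.conv1(torch.relu(x))))
        if self.training:
            # Independent Bernoulli masks, one value per unit of each term
            theta1 = torch.bernoulli(torch.full_like(x, self.keep_prob))
            theta2 = torch.bernoulli(torch.full_like(fx, self.keep_prob))
            return theta1 * x + theta2 * fx
        # Deterministic inference: replace each mask by its expectation
        return self.keep_prob * (x + fx)
```

The special cases claimed in the abstract fall out directly: keeping $\Theta_1 = 1$ and sampling $\Theta_2$ per unit gives dropout on the residual branch, sampling $\Theta_2$ as a single per-layer value gives stochastic depth, and fixing both masks to 1 recovers a plain ResNet block.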
The paper shows that this form of dropout is at least as good as, and often superior to, other forms of dropout, including the recently proposed [stochastic depth dropout][1]. Much as in the stochastic depth paper, better performance is achieved by linearly increasing the dropout rate from 0 at the first hidden layer to 0.5 at the last.
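As a concrete reading of that schedule, the per-block drop probability can be computed with a small helper like the one below; the function name and the choice to interpolate over residual blocks are assumptions consistent with the description above, not code from the paper.

```python
def linear_drop_schedule(num_blocks, max_drop=0.5):
    # Drop probability is 0 at the first block and max_drop at the last,
    # increasing linearly in between.
    return [max_drop * i / (num_blocks - 1) for i in range(num_blocks)]

# e.g. linear_drop_schedule(5) -> [0.0, 0.125, 0.25, 0.375, 0.5]
```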
Beyond this, I also note the following empirical observations from the paper:
1. At test time, averaging the network's outputs over multiple sampled dropout masks (referred to as stochastic inference) works better than replacing the masks by their expectations (deterministic inference), the latter being the usual standard; a sketch appears after this list.
2. Comparable performance is achieved with a wider (e.g. 4 times) ResNet with fewer layers (e.g. 32) than the thin but very deep (more than 1000 layers) ResNets of the original ResNet work. This is consistent with a similar observation from [this paper][2].
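The stochastic inference from observation 1 can be sketched as below, assuming a model built from blocks like the `SwapoutBlock` above; the helper name and the 30-sample count are illustrative, not taken from the paper.

```python
import torch

def stochastic_inference(model, x, num_samples=30):
    # Sample several independent sets of swapout masks and average the
    # resulting class probabilities over the samples.
    model.train()  # keep mask sampling active; note this also puts BatchNorm
                   # layers (if any) in training mode, which a real
                   # implementation would want to avoid
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(num_samples)])
    return probs.mean(dim=0)
```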
Overall, these are useful observations to be aware of for anyone wanting to use ResNets in practice.
[1]: http://arxiv.org/abs/1603.09382v1
[2]: https://arxiv.org/abs/1605.07146