Deep Networks with Stochastic Depth
Huang, Gao and Sun, Yu and Liu, Zhuang and Sedra, Daniel and Weinberger, Kilian
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords:
deeplearning, acreuser
TLDR; The authors randomly drop entire layers during training using a modified ResNet architecture. The survival probability decreases linearly with depth (deeper layers are more likely to be dropped), ending at 0.5 for the final block in the experiments. This mechanism mitigates vanishing gradients and diminishing feature reuse, and shortens training time. The model achieves new records on the CIFAR-10, CIFAR-100 and SVHN datasets.
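In symbols, the paper's linear decay rule (writing $p_\ell$ for the survival probability of block $\ell$ out of $L$) is:

$$ p_\ell = 1 - \frac{\ell}{L}\,(1 - p_L), \qquad p_L = 0.5 $$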
#### Key Points:
- The ResNet architecture is easily modified to drop out whole layers by keeping only the identity skip connection (see the sketch after this list)
- Lower layers get a lower probability of being dropped since they intuitively contain more "stable" features. The authors use linear decay, with the final block surviving with probability 0.5.
- Training time is reduced by 25%-50%, depending on the survival probability hyperparameter
- The authors find that vanishing gradients are indeed reduced, shown by plotting gradient magnitudes vs. the number of epochs
- Can be interpreted as an ensemble of networks with varying depth
- All layers are used at test time, and their activations are scaled by the corresponding survival probabilities
- The authors successfully train a network with more than 1,000 layers and achieve a further error reduction
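A minimal, illustrative sketch of such a block in PyTorch (not the authors' implementation; the class name `StochasticDepthBlock`, the two-convolution residual branch, and the block count `L = 54` are assumptions for illustration):

```python
import torch
import torch.nn as nn


class StochasticDepthBlock(nn.Module):
    """Residual block whose residual branch is randomly dropped during training.

    With probability (1 - p_survive) the branch is skipped entirely, leaving
    only the identity skip connection. At test time the branch is always used,
    scaled by p_survive. Hypothetical sketch, not the authors' code.
    """

    def __init__(self, channels: int, p_survive: float):
        super().__init__()
        self.p_survive = p_survive
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # One coin flip per mini-batch: keep or drop the whole residual branch.
            if torch.rand(1).item() < self.p_survive:
                out = x + self.branch(x)
            else:
                out = x  # only the identity skip connection survives
        else:
            # Test time: always use the branch, scaled by its survival probability.
            out = x + self.p_survive * self.branch(x)
        return self.relu(out)


# Linear decay of survival probabilities across L blocks, ending at 0.5.
L, p_L = 54, 0.5  # assumed block count for illustration
survival_probs = [1.0 - (l / L) * (1.0 - p_L) for l in range(1, L + 1)]
blocks = nn.Sequential(*[StochasticDepthBlock(16, p) for p in survival_probs])
```

The coin is flipped once per forward pass, i.e. per mini-batch, matching the paper's description; scaling the branch by its survival probability at test time keeps the expected block output consistent with training.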