Deep Residual Learning for Image Recognition
He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian
arXiv e-Print archive - 2015 via Local Bibsonomy
Keywords:
dblp
Sources:
- https://arxiv.org/pdf/1512.03385.pdf
- http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf
Summary:
- Took first place in all five main tracks of ILSVRC & COCO 2015 (ImageNet classification, detection, and localization; COCO detection and segmentation)
- Revolution of depth: GoogLeNet (ILSVRC 2014 winner) had 22 layers and 6.7% top-5 error; ResNet-152 has 152 layers and 3.57% top-5 error
- Light on complexity: the 34-layer baseline needs 3.6 billion FLOPs (multiply-adds), only ~18% of VGG-19's 19.6 billion
- Even ResNet-152 has lower time complexity (11.3 billion FLOPs) than VGG-16/19
- Extends well to detection and segmentation tasks
- Just stacking more layers gives worse performance. Why? In theory, a deeper model should not have higher training error:
  - A solution by construction: copy the original layers from a learned shallower model and set the extra layers as identity; this deeper model has at least the same training error (toy sketch below)
  - So the degradation points to optimization difficulties: solvers cannot find such a solution when going deeper
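A minimal sketch of that construction argument (PyTorch assumed; the layer sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# A learned shallow model (stand-in for "a learned shallower model").
shallow = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Deeper model by construction: the copied shallow layers plus extra identity layers.
deeper = nn.Sequential(
    *shallow,        # original layers: copied from the learned shallower model
    nn.Identity(),   # extra layers: set as identity
    nn.Identity(),
)

x = torch.randn(4, 64)
# Identical outputs, so the deeper model's training error can be no higher.
assert torch.allclose(shallow(x), deeper(x))
```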
- Why do the residual connections help? It's easier to learn a residual mapping with respect to the identity (sketch after the quote below)
- If the identity were optimal, it is easy to drive the residual weights to zero
- > If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
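A minimal PyTorch sketch of the basic residual block, y = F(x) + x with F as two 3x3 conv/BN layers (my reading of the paper's Fig. 2, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual block: y = F(x) + x, so the convs only learn the
    perturbation with respect to the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut; addition before the final ReLU
```

If the optimal mapping were the identity, the solver only has to push conv1/conv2 toward zero.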
- Basic design (VGG-style)
- all 3x3 conv (almost)
- spatial size /2 => # filters x2 (see the sketch after the remarks below)
- Simple design; just deep!
- Other remarks:
- no max pooling (almost)
- no hidden fc
- no dropout
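A hedged sketch of how those design rules combine in one block (again PyTorch, not the reference implementation): halving the spatial size goes together with doubling the filter count, and a 1x1 projection shortcut (the paper's option B) matches dimensions when they change:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """3x3-conv residual block: stride 2 halves the spatial size and is
    paired with doubling the number of filters; the shortcut is the identity
    unless the shape changes, in which case a 1x1 projection is used."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:  # projection shortcut (option B) when dimensions change
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

# e.g. moving from a 56x56 / 64-filter stage to a 28x28 / 128-filter stage:
x = torch.randn(1, 64, 56, 56)
print(ResBlock(64, 128, stride=2)(x).shape)  # torch.Size([1, 128, 28, 28])
```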
- Training
- All plain/residual nets are trained from scratch
- All plain/residual nets use Batch Normalization
- Standard hyper-parameters & augmentation
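For reference, the recipe from Sec. 3.4 of the paper (SGD, mini-batch 256, lr 0.1 divided by 10 when the error plateaus, momentum 0.9, weight decay 1e-4, no dropout, scale augmentation + 224x224 crop + horizontal flip), sketched with standard PyTorch/torchvision pieces; the normalization statistics and the plateau scheduler are conventional stand-ins, not taken from the paper:

```python
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.resnet34()  # trained from scratch (no pretrained weights)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),   # stand-in for the paper's scale augmentation + crop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # conventional ImageNet stats,
                         std=[0.229, 0.224, 0.225]),   # not the paper's per-pixel mean
])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# "divide the learning rate by 10 when the error plateaus":
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
```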
- The learned features transfer well to other tasks
- Works well with Faster R-CNN (object detection)
- Works well for semantic / instance segmentation
Also skimmed:
- Deep Residual Networks with 1K Layers
https://github.com/KaimingHe/resnet-1k-layers
https://arxiv.org/pdf/1603.05027.pdf