Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
arXiv e-Print archive, 2015 (via Local Bibsonomy)
Deeper networks should never have a higher **training** error than shallower ones: in the worst case, the extra layers could simply learn the identity mapping. In practice this turns out not to be easy for conventional (plain) networks, which get noticeably worse as more layers are stacked. The idea is therefore to add identity shortcut connections that skip some layers, so the stacked layers only have to learn the **residual** relative to the identity.
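A minimal sketch of such a residual block, assuming a PyTorch-style "basic block" with two 3×3 convolutions (layer sizes and names are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-layer residual block: output = F(x) + x (identity shortcut)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions with batch norm
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # The shortcut adds the input back, so the weights only model the residual F(x)
        return self.relu(residual + x)
```

For example, `ResidualBlock(64)(torch.randn(1, 64, 32, 32))` returns a tensor of the same shape; if the optimal mapping is close to the identity, the block only has to push $\mathcal{F}(x)$ toward zero.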
Advantages:
* Learning the identity becomes learning the zero function, which is simpler
* Loss of information flow in the forward pass is no longer a problem, since the input is passed through unchanged
* Gradients get a direct path through the shortcuts, so vanishing / exploding gradients are much less of a problem (see the sketch after this list)
* The identity shortcuts add no extra parameters to be learned
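A short sketch of why the shortcut helps the gradients, using the paper's formulation $y = \mathcal{F}(x, \{W_i\}) + x$:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial \mathcal{F}}{\partial x}\right)$$

The gradient always contains the direct term $\frac{\partial \mathcal{L}}{\partial y}$, no matter how small $\frac{\partial \mathcal{F}}{\partial x}$ becomes, and learning the identity amounts to driving $\mathcal{F}(x)$ to zero rather than fitting an identity map with a stack of nonlinear layers.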
## Evaluation
Training uses SGD with momentum 0.9, a learning rate that starts at 0.1 and is divided by 10 when the error plateaus, weight decay of $10^{-4}$, and mini-batches of size 128.
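As a rough sketch, this recipe could be written in PyTorch as below; the toy model, the random data, and the `ReduceLROnPlateau` scheduler are assumptions standing in for the paper's full ResNet and its plateau-based schedule:

```python
import torch
import torch.nn as nn

# Illustrative stand-in model; in the paper this would be the full ResNet.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Flatten(),
                      nn.Linear(16 * 32 * 32, 10))

# Hyperparameters quoted above: lr 0.1, momentum 0.9, weight decay 1e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# "Divide by 10 when the error plateaus": ReduceLROnPlateau is one way to
# express that rule automatically.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

criterion = nn.CrossEntropyLoss()
for epoch in range(3):                    # toy loop on random data, mini-batch size 128
    x = torch.randn(128, 3, 32, 32)
    y = torch.randint(0, 10, (128,))
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())           # in practice: validation error, not train loss
```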
* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5
## See also
* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)