# ResNet: Deep Residual Learning for Image Recognition
Sources:
- https://arxiv.org/pdf/1512.03385.pdf
- http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

Summary:
- Took first place in all five main ILSVRC & COCO 2015 tracks.
- Revolution of depth: GoogLeNet had 22 layers with 6.7% top-5 error; ResNet-152 has 152 layers with 3.57% top-5 error.
- Light on complexity: the 34-layer baseline uses 18% of the FLOPs (multiply-adds) of VGG-19.
- Even ResNet-152 has lower time complexity than VGG-16/19.
- Extends well to detection and segmentation tasks.
- Just stacking more layers gives worse performance. Why? In theory:
  > A deeper model should not have higher training error
  > - A solution by construction:
  >   - original layers: copied from a learned shallower model
  >   - extra layers: set as identity
  >   - at least the same training error
  > - Optimization difficulties: solvers cannot find the solution when going deeper…
- Why do the residual connections help? It is easier to learn a residual mapping with respect to identity (a minimal sketch of a residual block follows this entry).
  - If identity were optimal, it is easy to drive the residual weights to 0.
  > If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
- Basic design (VGG-style):
  - all 3x3 conv (almost)
  - spatial size /2 => # filters x2
  - simple design; just deep!
- Other remarks:
  - no max pooling (almost)
  - no hidden fc layers
  - no dropout
- Training:
  - all plain/residual nets are trained from scratch
  - all plain/residual nets use Batch Normalization
  - standard hyper-parameters & augmentation
- The learned features transfer well to other tasks:
  - works well with Faster R-CNN
  - works well for semantic/instance segmentation

Also skimmed: Deep Residual Networks with 1K Layers
- https://github.com/KaimingHe/resnet-1k-layers
- https://arxiv.org/pdf/1603.05027.pdf
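Below is a minimal PyTorch-style sketch of the basic two-layer 3x3 residual block described above. It is an illustration only, not the authors' original (Caffe) implementation; the class/argument names are mine, and the 1x1 projection shortcut is just one of the options the paper discusses for dimension changes.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Sketch of the basic 3x3/3x3 residual block: the block learns a
    residual F(x) and outputs F(x) + x via an identity shortcut."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # When the spatial size is halved, the number of filters is doubled;
        # a 1x1 projection (one of the paper's shortcut options) keeps the
        # shortcut shape-compatible. Otherwise the shortcut is pure identity.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # If the optimal mapping is close to identity, the solver only has to
        # push this residual toward zero rather than learn the mapping anew.
        return F.relu(residual + self.shortcut(x))
```

For example, `BasicBlock(64, 128, stride=2)` would correspond to a stage transition where the spatial size halves and the filter count doubles.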
# FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
- Implementations:
  - https://hub.docker.com/r/mklinov/caffe-flownet2/
  - https://github.com/lmb-freiburg/flownet2-docker
  - https://github.com/lmb-freiburg/flownet2
- Explanations:
  - "A Brief Review of FlowNet" (not a clear explanation): https://medium.com/towards-data-science/a-brief-review-of-flownet-dca6bd574de0
  - https://www.youtube.com/watch?v=JSzUdVBmQP4
- Supplementary material: http://openaccess.thecvf.com/content_cvpr_2017/supplemental/Ilg_FlowNet_2.0_Evolution_2017_CVPR_supplemental.pdf
# S4Net: salient instance segmentation
It's like Mask R-CNN but for salient instances. Code will be available at https://github.com/RuochenFan/S4Net.

They introduce a layer, RoIMasking, that they claim is better than RoIPool and RoIAlign (a rough sketch follows this entry):

> As can be seen, our proposed binary RoIMasking and ternary RoIMasking both outperform RoIPool and RoIAlign in mAP^0.7. Specifically, our ternary RoIMasking result improves the RoIAlign result by around 2.5 points. This reflects that considering more context information outside the proposals does help for salient instance segmentation.

Important benchmark attached: https://i.imgur.com/wOF2Ovz.png
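A rough NumPy sketch of my reading of RoIMasking (an assumption, not the authors' code): unlike RoIPool/RoIAlign, the feature map is not cropped or resampled; it is multiplied by a mask that keeps features inside the proposal, and the ternary variant additionally marks an enlarged context region around the box. The margin ratio and the -1 context weight below are guesses for illustration.

```python
import numpy as np

def roi_masking(feat, box, context_ratio=0.0, context_weight=-1.0):
    """Toy RoIMasking sketch.

    feat : (C, H, W) feature map
    box  : (x0, y0, x1, y1) proposal in feature-map coordinates
    context_ratio : 0.0 -> binary masking (keep inside, zero outside);
                    >0  -> ternary masking with a context ring whose margin
                           is context_ratio * box size (assumed behaviour)
    """
    C, H, W = feat.shape
    x0, y0, x1, y1 = box
    mask = np.zeros((H, W), dtype=feat.dtype)

    if context_ratio > 0:
        mx = context_ratio * (x1 - x0)
        my = context_ratio * (y1 - y0)
        cx0, cy0 = max(int(x0 - mx), 0), max(int(y0 - my), 0)
        cx1, cy1 = min(int(x1 + mx), W), min(int(y1 + my), H)
        mask[cy0:cy1, cx0:cx1] = context_weight  # surrounding context (assumed -1)

    mask[int(y0):int(y1), int(x0):int(x1)] = 1.0  # keep features inside the proposal
    return feat * mask  # same spatial size, unlike RoIPool/RoIAlign
```

The point the benchmark quote makes is that context outside the proposal helps salient instance segmentation, and RoIPool/RoIAlign discard that context by construction.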
# The Do's and Don'ts for CNN-based Face Verification
## Metadata
* **Title**: The Do's and Don'ts for CNN-based Face Verification
* **Authors**: Ankan Bansal, Carlos Castillo, Rajeev Ranjan, Rama Chellappa (UMIACS, University of Maryland, College Park)
* **Link**: https://arxiv.org/abs/1705.07426

## Abstract
> Convolutional neural networks (CNN) have become the most sought after tools for addressing object recognition problems. Specifically, they have produced state-of-the art results for unconstrained face recognition and verification tasks. While the research community appears to have developed a consensus on the methods of acquiring annotated data, design and training of CNNs, many questions still remain to be answered. In this paper, we explore the following questions that are critical to face recognition research: (i) Can we train on still images and expect the systems to work on videos? (ii) Are deeper datasets better than wider datasets? (iii) Does adding label noise lead to improvement in performance of deep networks? (iv) Is alignment needed for face recognition? We address these questions by training CNNs using CASIA-WebFace, UMDFaces, and a new video dataset and testing on YouTubeFaces, IJBA and a disjoint portion of UMDFaces datasets. Our new data set, which will be made publicly available, has 22,075 videos and 3,735,476 human annotated frames extracted from them.

## Introduction
> We make the following main contributions in this paper:
> - We introduce a large dataset of videos of over 3,000 subjects along with 3,735,476 human annotated bounding boxes in frames extracted from these videos.
> - We conduct a large scale systematic study about the effects of making certain apparently routine decisions about the training procedure. Our experiments clearly show that data variety, number of individuals in the dataset, quality of the dataset, and good alignment are keys to obtaining good performance.
> - We suggest the best practices that could lead to an improvement in the performance of deep face recognition networks. These practices will also guide future data collection efforts.

## How they made the dataset
- Collect YouTube videos.
- Automated filtering with YOLO and facial-landmark detection.
- Crowd-source the final filtering on AMT: show Turkers 50 face images and ask which don't belong.
- Quality control through sentinels: give Turkers the same kind of test but with 5 known correct answers, and rank the Turkers by how they perform on this ground-truth test. If they do well, trust their answers on the real tests (a toy sketch of this scoring appears after this entry).
- Result:
  > we have 3,735,476 annotated frames in 22,075 videos. We will publicly release this massive dataset

## Questions and experiments

### Do deep recognition networks trained on stills perform well on videos?
> We study the effects of this difference between still images and frames extracted from videos in section 3.1 using our new dataset. We found that mixing both still images and the large number of video frames during training performs better than using just still images or video frames for testing on any of the test datasets

### What is better: deeper or wider datasets?
> In section 3.2 we investigate the impact of using a deep dataset against using a wider dataset. For two datasets with the same number of images, we call one deeper than the other if on average it has more images per subject (and hence fewer subjects) than the other. We show that it is important to have a wider dataset than a deeper dataset with the same number of images.

### Does some amount of label noise help improve the performance of deep recognition networks?
> When training any supervised face classification system, each image is first associated with a label. Label noise is the phenomenon of assigning an incorrect label to some images. Label noise is an inherent part of the data collection process. Some authors intentionally leave in some label noise [25, 6, 7] in the dataset in hopes of making the deep networks more robust. In section 3.3 we examine the effect of this label noise on the performance of deep networks for verification trained on these datasets and demonstrate that clean datasets almost always lead to significantly better performance than noisy datasets.

### Does thumbnail creation method affect performance?
> ... This leads to generation of different types of bounding boxes for faces. Verification accuracy can be affected by the type of bounding box used. In addition, most recent face recognition and verification methods [35, 31, 33, 5, 9, 34] use some kind of 2D or 3D alignment procedure [41, 14, 28, 8]. ... In section 3.4 we study the consequences of using different thumbnail generation methods on verification performance of deep networks. We show that using a good keypoint detection method and aligning faces both during training and testing leads to the best performance.
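A toy Python sketch of the sentinel-based quality control described in the dataset-collection notes above. The function names and the 0.8 acceptance threshold are my own assumptions; the notes only say that Turkers are ranked by their accuracy on the 5 planted (ground-truth) questions.

```python
def score_workers(sentinel_answers, ground_truth):
    """sentinel_answers: {worker_id: {question_id: answer}}
    ground_truth:        {question_id: correct_answer} for the planted questions.
    Returns each worker's accuracy on the planted (sentinel) questions."""
    scores = {}
    for worker, answers in sentinel_answers.items():
        correct = sum(answers.get(q) == a for q, a in ground_truth.items())
        scores[worker] = correct / len(ground_truth)
    return scores

def trusted_workers(scores, min_accuracy=0.8):
    """Keep annotations only from workers who did well on the sentinels
    (the 0.8 cutoff is illustrative, not from the paper)."""
    return {w for w, s in scores.items() if s >= min_accuracy}
```

Answers from workers below the cutoff would presumably be discarded or re-collected; the paper's exact acceptance policy isn't captured in these notes.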