# ResNet: Deep Residual Learning for Image Recognition
Sources:
- https://arxiv.org/pdf/1512.03385.pdf
- http://image-net.org/challenges/talks/ilsvrc2015_deep_residual_learning_kaiminghe.pdf

Summary:
- Took first place in all five main ILSVRC & COCO 2015 tracks.
- Revolution of depth: GoogLeNet had 22 layers with 6.7% top-5 error; ResNet-152 has 152 layers with 3.57% top-5 error.
- Light on complexity: the 34-layer baseline uses 18% of the FLOPs (multiply-adds) of VGG-19.
- Even ResNet-152 has lower time complexity than VGG-16/19.
- Extends well to detection and segmentation tasks.
- Just stacking more layers gives worse performance. Why? In theory:
  > A deeper model should not have higher training error
  > - A solution by construction:
  >   - original layers: copied from a learned shallower model
  >   - extra layers: set as identity
  >   - at least the same training error
  > - Optimization difficulties: solvers cannot find the solution when going deeper…
- Why do the residual connections help? It is easier to learn a residual mapping with respect to identity (a minimal sketch of a residual block follows this entry).
  - If identity were optimal, it is easy to drive the residual weights to 0.
  > If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.
- Basic design (VGG-style):
  - all 3x3 conv (almost)
  - spatial size /2 => # filters x2
  - simple design; just deep!
- Other remarks:
  - no max pooling (almost)
  - no hidden fc layers
  - no dropout
- Training:
  - all plain/residual nets are trained from scratch
  - all plain/residual nets use Batch Normalization
  - standard hyper-parameters & augmentation
- The learned features transfer well to other tasks:
  - works well with Faster R-CNN
  - works well for semantic/instance segmentation

Also skimmed: Deep Residual Networks with 1K Layers
- https://github.com/KaimingHe/resnet-1k-layers
- https://arxiv.org/pdf/1603.05027.pdf
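Below is a minimal PyTorch-style sketch of the basic two-layer 3x3 residual block described above. It is an illustration only, not the authors' original (Caffe) implementation; the class/argument names are mine, and the 1x1 projection shortcut is just one of the options the paper discusses for dimension changes.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Sketch of the basic 3x3/3x3 residual block: the block learns a
    residual F(x) and outputs F(x) + x via an identity shortcut."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # When the spatial size is halved, the number of filters is doubled;
        # a 1x1 projection (one of the paper's shortcut options) keeps the
        # shortcut shape-compatible. Otherwise the shortcut is pure identity.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # If the optimal mapping is close to identity, the solver only has to
        # push this residual toward zero rather than learn the mapping anew.
        return F.relu(residual + self.shortcut(x))
```

For example, `BasicBlock(64, 128, stride=2)` would correspond to a stage transition where the spatial size halves and the filter count doubles.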
# FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks
- Implementations:
  - https://hub.docker.com/r/mklinov/caffe-flownet2/
  - https://github.com/lmb-freiburg/flownet2-docker
  - https://github.com/lmb-freiburg/flownet2
- Explanations:
  - "A Brief Review of FlowNet" (not a clear explanation): https://medium.com/towards-data-science/a-brief-review-of-flownet-dca6bd574de0
  - https://www.youtube.com/watch?v=JSzUdVBmQP4
- Supplementary material: http://openaccess.thecvf.com/content_cvpr_2017/supplemental/Ilg_FlowNet_2.0_Evolution_2017_CVPR_supplemental.pdf
# S4Net: salient instance segmentation
It's like Mask R-CNN but for salient instances. Code will be available at https://github.com/RuochenFan/S4Net.

They introduce a layer, RoIMasking, that they claim is better than RoIPool and RoIAlign (a rough sketch follows this entry):

> As can be seen, our proposed binary RoIMasking and ternary RoIMasking both outperform RoIPool and RoIAlign in mAP^0.7. Specifically, our ternary RoIMasking result improves the RoIAlign result by around 2.5 points. This reflects that considering more context information outside the proposals does help for salient instance segmentation.

Important benchmark attached: https://i.imgur.com/wOF2Ovz.png
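A rough NumPy sketch of my reading of RoIMasking (an assumption, not the authors' code): unlike RoIPool/RoIAlign, the feature map is not cropped or resampled; it is multiplied by a mask that keeps features inside the proposal, and the ternary variant additionally marks an enlarged context region around the box. The margin ratio and the -1 context weight below are guesses for illustration.

```python
import numpy as np

def roi_masking(feat, box, context_ratio=0.0, context_weight=-1.0):
    """Toy RoIMasking sketch.

    feat : (C, H, W) feature map
    box  : (x0, y0, x1, y1) proposal in feature-map coordinates
    context_ratio : 0.0 -> binary masking (keep inside, zero outside);
                    >0  -> ternary masking with a context ring whose margin
                           is context_ratio * box size (assumed behaviour)
    """
    C, H, W = feat.shape
    x0, y0, x1, y1 = box
    mask = np.zeros((H, W), dtype=feat.dtype)

    if context_ratio > 0:
        mx = context_ratio * (x1 - x0)
        my = context_ratio * (y1 - y0)
        cx0, cy0 = max(int(x0 - mx), 0), max(int(y0 - my), 0)
        cx1, cy1 = min(int(x1 + mx), W), min(int(y1 + my), H)
        mask[cy0:cy1, cx0:cx1] = context_weight  # surrounding context (assumed -1)

    mask[int(y0):int(y1), int(x0):int(x1)] = 1.0  # keep features inside the proposal
    return feat * mask  # same spatial size, unlike RoIPool/RoIAlign
```

The point the benchmark quote makes is that context outside the proposal helps salient instance segmentation, and RoIPool/RoIAlign discard that context by construction.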
# The Do's and Don'ts for CNN-based Face Verification
## Metadata
* **Title**: The Do's and Don'ts for CNN-based Face Verification
* **Authors**: Ankan Bansal, Carlos Castillo, Rajeev Ranjan, Rama Chellappa (UMIACS, University of Maryland, College Park)
* **Link**: https://arxiv.org/abs/1705.07426

## Abstract
> Convolutional neural networks (CNN) have become the most sought after tools for addressing object recognition problems. Specifically, they have produced state-of-the art results for unconstrained face recognition and verification tasks. While the research community appears to have developed a consensus on the methods of acquiring annotated data, design and training of CNNs, many questions still remain to be answered. In this paper, we explore the following questions that are critical to face recognition research: (i) Can we train on still images and expect the systems to work on videos? (ii) Are deeper datasets better than wider datasets? (iii) Does adding label noise lead to improvement in performance of deep networks? (iv) Is alignment needed for face recognition? We address these questions by training CNNs using CASIA-WebFace, UMDFaces, and a new video dataset and testing on YouTubeFaces, IJBA and a disjoint portion of UMDFaces datasets. Our new data set, which will be made publicly available, has 22,075 videos and 3,735,476 human annotated frames extracted from them.

## Introduction
> We make the following main contributions in this paper:
> - We introduce a large dataset of videos of over 3,000 subjects along with 3,735,476 human annotated bounding boxes in frames extracted from these videos.
> - We conduct a large scale systematic study about the effects of making certain apparently routine decisions about the training procedure. Our experiments clearly show that data variety, number of individuals in the dataset, quality of the dataset, and good alignment are keys to obtaining good performance.
> - We suggest the best practices that could lead to an improvement in the performance of deep face recognition networks. These practices will also guide future data collection efforts.

## How they made the dataset
- Collect YouTube videos.
- Automated filtering with YOLO and facial-landmark detection.
- Crowd-source the final filtering on AMT: show Turkers 50 face images and ask which don't belong.
- Quality control through sentinels: give Turkers the same kind of test but with 5 known correct answers, and rank the Turkers by how they perform on this ground-truth test. If they do well, trust their answers on the real tests (a toy sketch of this scoring appears after this entry).
- Result:
  > we have 3,735,476 annotated frames in 22,075 videos. We will publicly release this massive dataset

## Questions and experiments

### Do deep recognition networks trained on stills perform well on videos?
> We study the effects of this difference between still images and frames extracted from videos in section 3.1 using our new dataset. We found that mixing both still images and the large number of video frames during training performs better than using just still images or video frames for testing on any of the test datasets

### What is better: deeper or wider datasets?
> In section 3.2 we investigate the impact of using a deep dataset against using a wider dataset. For two datasets with the same number of images, we call one deeper than the other if on average it has more images per subject (and hence fewer subjects) than the other. We show that it is important to have a wider dataset than a deeper dataset with the same number of images.

### Does some amount of label noise help improve the performance of deep recognition networks?
> When training any supervised face classification system, each image is first associated with a label. Label noise is the phenomenon of assigning an incorrect label to some images. Label noise is an inherent part of the data collection process. Some authors intentionally leave in some label noise [25, 6, 7] in the dataset in hopes of making the deep networks more robust. In section 3.3 we examine the effect of this label noise on the performance of deep networks for verification trained on these datasets and demonstrate that clean datasets almost always lead to significantly better performance than noisy datasets.

### Does thumbnail creation method affect performance?
> ... This leads to generation of different types of bounding boxes for faces. Verification accuracy can be affected by the type of bounding box used. In addition, most recent face recognition and verification methods [35, 31, 33, 5, 9, 34] use some kind of 2D or 3D alignment procedure [41, 14, 28, 8]. ... In section 3.4 we study the consequences of using different thumbnail generation methods on verification performance of deep networks. We show that using a good keypoint detection method and aligning faces both during training and testing leads to the best performance.
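A toy Python sketch of the sentinel-based quality control described in the dataset-collection notes above. The function names and the 0.8 acceptance threshold are my own assumptions; the notes only say that Turkers are ranked by their accuracy on the 5 planted (ground-truth) questions.

```python
def score_workers(sentinel_answers, ground_truth):
    """sentinel_answers: {worker_id: {question_id: answer}}
    ground_truth:        {question_id: correct_answer} for the planted questions.
    Returns each worker's accuracy on the planted (sentinel) questions."""
    scores = {}
    for worker, answers in sentinel_answers.items():
        correct = sum(answers.get(q) == a for q, a in ground_truth.items())
        scores[worker] = correct / len(ground_truth)
    return scores

def trusted_workers(scores, min_accuracy=0.8):
    """Keep annotations only from workers who did well on the sentinels
    (the 0.8 cutoff is illustrative, not from the paper)."""
    return {w for w, s in scores.items() if s >= min_accuracy}
```

Answers from workers below the cutoff would presumably be discarded or re-collected; the paper's exact acceptance policy isn't captured in these notes.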