Mask R-CNN on ShortScience.org

Mask R-CNN
He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross B.
arXiv e-Print archive - 2017 via Local Bibsonomy
Keywords: dblp

Summary by nandini 7 years ago

#### Mask R-CNN framework for instance segmentation
### Goal:
* Instance segmentation combines two tasks:
 * object detection: classify individual objects and localize each using a bounding box
 * semantic segmentation: classify each pixel into a fixed set of categories without differentiating object instances
https://i.imgur.com/XfBRa5O.png
* Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression.
* The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.
1. RoIAlign:
 * fixes the misalignment introduced by the coarse spatial quantization of RoIPool and faithfully preserves exact spatial locations
 * improves mask accuracy by a relative 10% to 50% while remaining fast
2. Decouple mask and class prediction:
 * predict a binary mask for each class independently, without competition among classes
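
The RoIAlign idea in item 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `bilinear` and `roi_align`, the single-channel list-of-rows feature map, and the convention that pixel centers sit at integer coordinates are all assumptions made for the example. The point is that the RoI coordinates are never rounded; each output bin averages a few bilinearly interpolated samples.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so samples near the border stay defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2):
    """Pool an RoI into an out_size x out_size grid without rounding coordinates.

    roi = (y1, x1, y2, x2) in feature-map coordinates (floats).
    Each bin averages samples x samples regularly spaced interpolated points.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, yy, xx))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

# Toy feature map whose value equals its x coordinate.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(ramp, (0.5, 0.5, 2.5, 2.5))
print(pooled)  # [[1.0, 2.0], [1.0, 2.0]]
```

On the ramp, each pooled value equals the x coordinate of its bin center, even though the RoI corners fall between pixels. A quantizing pool would first snap the box to integer cells and lose that sub-pixel alignment.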

#### History:
* R-CNN: the Region-based CNN (R-CNN) approach to bounding-box object detection
* Fast R-CNN: speeds up and simplifies R-CNN
 * RoI (Region of Interest) pooling
 * jointly trains the CNN, classifier, and bounding-box regressor in a single model
* Faster R-CNN: speeds up region proposal
 * reuses the same CNN features for region proposals through a Region Proposal Network (RPN), instead of running a separate selective search algorithm
 * only one CNN needs to be trained

#### Related Work
* Instance segmentation: “fully convolutional instance segmentation” (FCIS)
* Faster R-CNN has two stages:
 * a Region Proposal Network (RPN) proposes candidate object bounding boxes
 * Fast R-CNN [12] extracts features using RoIPool from each candidate box and performs classification and bounding-box regression
* Mask R-CNN: adopts the same two-stage procedure as Faster R-CNN, with an identical first stage (RPN); in the second stage, in parallel with predicting the class and box offset, it also outputs a binary mask for each RoI
* Mask representation: a mask encodes an object's spatial layout, so it is predicted pixel-to-pixel by an FCN from RoI features that the RoIAlign layer keeps spatially aligned (e.g. a 7×7 RoI feature map)
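
The per-RoI binary mask and the decoupling of mask and class prediction can be illustrated with a small loss sketch. It assumes the mask branch outputs one m×m map of logits per class, and that only the map of the ground-truth class is penalized with an average per-pixel binary cross-entropy. The function name `mask_loss` and the nested-list data layout are hypothetical, chosen only for this example.

```python
import math

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class's mask only.

    mask_logits[k][i][j]: logit for class k at pixel (i, j), K maps of m x m.
    gt_mask[i][j]: 0 or 1. Masks of all other classes do not contribute,
    so classes do not compete with each other for pixels.
    """
    logits = mask_logits[gt_class]
    total, count = 0.0, 0
    for row_logits, row_gt in zip(logits, gt_mask):
        for z, t in zip(row_logits, row_gt):
            # Numerically stable form of -[t*log(s(z)) + (1-t)*log(1-s(z))].
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

gt = [[1, 0], [0, 1]]
logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: uninformative mask
    [[5.0, -5.0], [-5.0, 5.0]],  # class 1: confident, matches gt
]
loss_if_class0 = mask_loss(logits, gt, gt_class=0)  # equals ln 2
loss_if_class1 = mask_loss(logits, gt, gt_class=1)  # close to 0
```

Because a sigmoid is applied per pixel and per class, rather than a softmax across classes, changing the logits of class 1 leaves the loss for an RoI labeled class 0 unchanged.
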
#### Network Architecture
*  convolutional backbone architecture used for feature extraction over an entire image (ResNet-50-C4, FPN)
*  network head for bounding-box recognition (classification and regression) and mask prediction
https://i.imgur.com/pUvKdmx.png
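#### Training:
* image size: resized so that the shorter edge is 800 pixels
* mini-batch: 2 images per GPU
* N (sampled RoIs per image): 64 for the C4 backbone
* training: 8 GPUs for 160k iterations
* learning rate: 0.02
* train images: 80K
* val images: 35K subset, used for training together with the train images (trainval35k)
* minival: the remaining 5K val images, used for ablations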
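
The schedule figures above imply a few derived quantities, worked out in the short calculation below. It assumes the 80K train images and the 35K val images are both used for training, and that N = 64 RoIs are sampled per image; the variable names are chosen for illustration only.

```python
images_per_gpu = 2
num_gpus = 8
iterations = 160_000
rois_per_image = 64            # N in the list above
train_images = 80_000 + 35_000 # assumed: train plus the 35K val subset

effective_batch = images_per_gpu * num_gpus      # images per iteration
images_seen = effective_batch * iterations       # images processed in total
approx_epochs = images_seen / train_images       # passes over the training set
rois_per_step = effective_batch * rois_per_image # RoIs sampled per iteration

print(effective_batch, images_seen, round(approx_epochs, 1), rois_per_step)
# 16 2560000 22.3 1024
```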
https://i.imgur.com/6ZLpewi.png
https://i.imgur.com/5o3um0Y.png