SSD aims to solve a major problem with most of the current state-of-the-art object detectors, namely Faster R-CNN and the like. These object detection algorithms share the same methodology:

- Train two different nets: a Region Proposal Network (RPN) and a classifier that detect an object's class and its bounding box separately.
- During inference, run the test image at different scales to detect objects at multiple scales, to account for scale invariance.

This makes the nets extremely slow. Faster R-CNN operates at **7 FPS with 73.2% mAP**, while SSD achieves **59 FPS with 74.3% mAP** on the VOC 2007 dataset.

#### Methodology

SSD uses a single net to predict both the object class and the bounding box. However, it doesn't do that directly. It uses a mechanism for choosing ROIs and trains end-to-end to predict the class and the boundary shift for each ROI.

##### ROI selection

Borrowing from Faster R-CNN, SSD uses the concept of anchor boxes to generate ROIs from the feature maps of the last shared conv layer. For each pixel in a feature map, k default boxes with different aspect ratios are placed around that pixel. So for a feature map of resolution m × n, that's *mnk* ROIs for a single feature layer. SSD uses multiple feature layers (with differing resolutions) to generate such ROIs, primarily to capture the size invariance of objects. But because earlier layers in a deep conv net tend to capture low-level features, it only uses feature maps from a certain depth onwards. (A minimal sketch of default-box generation appears after this summary.)

##### ROI labelling

Any ROI that matches a ground-truth box for a class, after applying the appropriate transforms, with a Jaccard overlap greater than 0.5, is labelled positive. Now, given that the feature maps are at different resolutions and the boxes have different aspect ratios, doing this is not simple. SSD uses simple scaling and aspect ratios to bring the default boxes at each pixel to the appropriate ground-truth dimensions for calculating the Jaccard overlap at the given resolution. (The overlap test is also sketched after this summary.)

##### ROI classification

SSD uses a single convolution kernel with a 3×3 receptive field to predict, for each ROI, the 4 offsets (centre-x offset, centre-y offset, height offset, width offset) from the ground-truth box, along with confidence scores for each class. So if there are c classes (including background), there are (c+4) filters for each convolution kernel that looks at an ROI. (A shape-level sketch of this head appears after this summary.)

In summary: convolution kernels look at ROIs (default boxes around each pixel in a feature map layer) and generate (c+4) scores for each ROI. Multiple feature map layers with different resolutions are used to generate such ROIs. Some ROIs are positive and some negative, depending on the Jaccard overlap after the ground-truth box has been scaled appropriately, taking the resolution difference between the input image and the feature map into account.

##### Training

For each ROI, a combined loss is calculated as a combination of the localisation error and the classification error.

##### Inference

For the ROI predictions, a small threshold is first used to filter out irrelevant predictions; Non-Maximum Suppression (NMS) with a Jaccard overlap of 0.45 per class is then applied to the remaining candidate ROIs, and the top 200 detections per image are kept. (An NMS sketch also follows this summary.)

For further understanding of the intuitions behind the paper and the results obtained, please consider giving the full paper a read. The open-sourced code is available at this [Github repo](https://github.com/weiliu89/caffe/tree/ssd).
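To make the ROI-selection step concrete, here is a minimal numpy sketch of generating k default boxes around every cell of an m × n feature map, following the paper's w = s·√ar, h = s/√ar convention. The particular scale and aspect-ratio values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def default_boxes(m, n, scale, aspect_ratios):
    """Generate k = len(aspect_ratios) default boxes per cell of an
    m x n feature map, as normalized (cx, cy, w, h) rows."""
    boxes = []
    for i in range(m):
        for j in range(n):
            # Centre of the current feature-map cell, in [0, 1] coordinates.
            cx, cy = (j + 0.5) / n, (i + 0.5) / m
            for ar in aspect_ratios:
                # Width/height follow w = s * sqrt(ar), h = s / sqrt(ar),
                # so the box area stays fixed while its shape varies.
                boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.array(boxes)

# One 8x8 feature layer with k = 4 aspect ratios -> 8 * 8 * 4 = 256 ROIs.
rois = default_boxes(8, 8, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5, 3.0])
print(rois.shape)  # (256, 4)
```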
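The labelling rule then reduces to a Jaccard overlap (IoU) test between each default box and each ground-truth box, once both are in the same coordinate frame. A minimal sketch, assuming boxes in (xmin, ymin, xmax, ymax) form:

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes in (xmin, ymin, xmax, ymax) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def label_rois(rois, gt_boxes, threshold=0.5):
    """An ROI is positive if it overlaps any ground-truth box by > 0.5."""
    return [any(jaccard(r, g) > threshold for g in gt_boxes) for r in rois]
```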
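The classification head itself is just a convolution whose output channels encode the per-ROI scores. A shape-level sketch, assuming an m × n feature map with k default boxes per cell and c classes (21 here, i.e. 20 VOC classes plus background):

```python
import numpy as np

m, n, k, c = 8, 8, 4, 21                   # 8x8 map, 4 boxes/cell, 21 classes
head_out = np.zeros((m, n, k * (c + 4)))   # output of one 3x3 conv head

# Reshape to one (c + 4)-vector per ROI: c class confidences + 4 box offsets.
per_roi = head_out.reshape(m * n * k, c + 4)
class_scores, offsets = per_roi[:, :c], per_roi[:, c:]
print(per_roi.shape)  # (256, 25)
```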
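Finally, the inference step (confidence filtering, per-class NMS at 0.45 overlap, top 200 detections) can be sketched as a greedy loop. This reuses the `jaccard` helper above and would be run once per class:

```python
import numpy as np

def nms(boxes, scores, overlap=0.45, top_k=200, conf_threshold=0.01):
    """Greedy per-class non-maximum suppression."""
    # 1. Drop low-confidence predictions, sort the rest by score (descending).
    order = [i for i in np.argsort(scores)[::-1] if scores[i] >= conf_threshold]
    keep = []
    # 2. Repeatedly keep the best remaining box and suppress its neighbours.
    while order and len(keep) < top_k:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if jaccard(boxes[best], boxes[i]) < overlap]
    return keep  # indices of the surviving detections
```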
* They suggest a new bounding box detector.
* Their detector works without an RPN and RoI-Pooling, making it very fast (almost 60fps).
* Their detector works at multiple scales, making it better at detecting small and large objects.
* They achieve scores similar to Faster R-CNN.

### How

* Architecture
  * Similar to Faster R-CNN, they use a base network (a modified version of VGG16) to transform images into feature maps.
  * They do not use an RPN.
  * For each location in the feature maps, they predict via convolutions:
    * (a) one confidence value per class (a high confidence indicates that there is a bounding box of that class at the given location),
    * (b) x/y offsets that indicate where exactly the center of the bounding box is (e.g. a bit to the left or top of the feature map cell's center),
    * (c) height/width values that reflect the (logarithm of the) height/width of the bounding box.
  * Similar to Faster R-CNN, they also use the concept of anchor boxes. So they generate the described values not only once per location, but several times, for several anchor boxes (they use six). Each anchor box has a different height/width and optionally a different scale.
  * (Figure: visualization of the predictions and anchor boxes.)
  * They generate these predictions not only for the final feature map, but also for various feature maps in between (e.g. before pooling layers). This makes it easier for the network to detect small (as well as large) bounding boxes (multi-scale detection).
  * (Figure: visualization of the multi-scale architecture.)
* Training
  * Ground truth bounding boxes have to be matched with anchor boxes (at multiple scales) to determine the correct outputs. To do this, an anchor box and a ground truth bounding box are matched if their jaccard overlap is 0.5 or higher. Any unmatched ground truth bounding box is matched to the anchor box with the highest jaccard overlap.
  * Note that this means a ground truth bounding box can be assigned to multiple anchor boxes (in Faster R-CNN it is always only one).
  * The loss function is similar to Faster R-CNN's, i.e. a mixture of a confidence loss (classification) and a location loss (regression). They use softmax with cross-entropy for the confidence loss and a smooth L1 loss for the location. (A minimal sketch of this loss follows the Results section below.)
  * Similar to Faster R-CNN, they perform hard negative mining. Instead of training every anchor box at every scale, they only train the ones with the highest loss (per example image). While doing that, they also pick the anchor boxes to be trained so that 3 in 4 boxes are negative examples (and 1 in 4 positive).
  * Data augmentation: They sample patches from images using a wide range of possible sizes and aspect ratios. They also horizontally flip images, perform cropping and padding, and apply some photometric distortions.
* Non-Maximum Suppression (NMS)
  * At inference time, they remove all bounding boxes that have a confidence below 0.01.
  * They then apply NMS, removing a bounding box if there is already a similar one (measured by a jaccard overlap of 0.45 or more).

### Results

* Pascal VOC 2007
  * They achieve around 1-3 points mAP better results than Faster R-CNN.
  * Despite the multi-scale method, the model's performance is still significantly worse for small objects than for large ones.
  * Adding data augmentation significantly improved the results compared to no data augmentation (around 6 points mAP).
  * Using more than one anchor box also had a noticeable effect on the results (around 2 mAP or more).
  * Using multiple feature maps to predict outputs (multi-scale architecture) significantly improves the results (around 10 mAP), though adding very coarse (high-level) feature maps seems to hurt rather than help.
* Pascal VOC 2012
  * Around 4 mAP better results than Faster R-CNN.
* COCO
  * Between 1 and 4 mAP better results than Faster R-CNN.
* Times
  * At a batch size of 1, SSD runs at about 46 fps at input resolution 300x300 (74.3 mAP on Pascal VOC) and 19 fps at input resolution 512x512 (76.8 mAP on Pascal VOC).
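To tie the training bullets together, here is a simplified numpy sketch of the combined loss with hard negative mining: softmax cross-entropy on each anchor's matched class, smooth L1 on the box offsets of positive anchors only, and negatives kept at a 3:1 ratio by loss. Shapes and helper names are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def ssd_loss(class_logits, loc_preds, labels, loc_targets, neg_pos_ratio=3):
    """Confidence loss + location loss with hard negative mining.

    class_logits: (num_anchors, num_classes) raw scores, class 0 = background.
    loc_preds, loc_targets: (num_anchors, 4) predicted / target box offsets.
    labels: (num_anchors,) matched class per anchor, 0 for background.
    """
    # Per-anchor softmax cross-entropy against the matched class.
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    conf_loss = -log_probs[np.arange(len(labels)), labels]

    pos = labels > 0
    num_pos = max(int(pos.sum()), 1)

    # Hard negative mining: keep only the highest-loss background anchors,
    # at most neg_pos_ratio negatives per positive (3:1 in the paper).
    hard_negs = np.sort(conf_loss[~pos])[::-1][:neg_pos_ratio * num_pos]

    # Location loss is only computed for positive (matched) anchors.
    loc_loss = smooth_l1(loc_preds[pos] - loc_targets[pos]).sum()

    return (conf_loss[pos].sum() + hard_negs.sum() + loc_loss) / num_pos
```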