* They propose a new version of YOLO, a model that detects objects (bounding boxes plus class labels) in images.
* The new version is more accurate, faster and can be trained to recognize over 9000 classes.
### How
* Their base model is the previous YOLOv1, which they improve here.
* Accuracy improvements
* They add batch normalization to the network.
* Pretraining usually happens on ImageNet at 224x224; fine-tuning for bounding box detection then follows on another dataset, e.g. Pascal VOC 2012, at a higher resolution (448x448 in the case of YOLOv1).
This is problematic, because the pretrained network has to adapt to the higher resolution and learn a new task at the same time.
They instead first pretrain on low-resolution ImageNet examples, then continue pretraining on high-resolution ImageNet examples and only then switch to bounding box detection (sketched below).
That improves their accuracy by about 4 percentage points mAP.
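
  A minimal sketch of that three-stage schedule, using hypothetical `train_classifier`/`train_detector`/`add_detection_head` helpers (not the authors' code; the epoch counts are from the paper):

  ```python
  backbone = Darknet19()  # hypothetical backbone constructor

  # Stage 1: classification pretraining on ImageNet at low resolution.
  train_classifier(backbone, dataset="ImageNet", resolution=224, epochs=160)

  # Stage 2: keep training the classifier, but at the higher resolution,
  # so the filters can adapt before the task changes (~10 epochs in the paper).
  train_classifier(backbone, dataset="ImageNet", resolution=448, epochs=10)

  # Stage 3: only now swap in the detection head and fine-tune for bounding boxes.
  detector = add_detection_head(backbone)
  train_detector(detector, dataset="Pascal VOC", resolution=416)
  ```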
* They switch to anchor boxes, similar to Faster R-CNN. The grid-based prediction is otherwise largely the same as in YOLOv1, but class probabilities are now predicted per anchor box instead of per grid cell.
The regression of the x/y-coordinates is now a bit smarter and uses sigmoids, so that a predicted box center can only be shifted within its grid cell (see the sketch below).
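
  A minimal numpy sketch of that box decoding, following the formulas in the paper (b_x = sigmoid(t_x) + c_x, b_w = p_w * exp(t_w)); the variable names are mine:

  ```python
  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
      """Decode raw outputs (tx, ty, tw, th) for an anchor of shape (pw, ph)
      predicted in the grid cell whose top-left corner is (cx, cy).
      All values are in units of grid cells."""
      bx = sigmoid(tx) + cx   # the sigmoid keeps the center inside the cell
      by = sigmoid(ty) + cy
      bw = pw * np.exp(tw)    # width/height rescale the anchor shape
      bh = ph * np.exp(th)
      return bx, by, bw, bh
  ```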
* In Faster R-CNN the anchor box shapes are manually chosen (e.g. small squared boxes, large squared boxes, thin but high boxes, ...).
Here instead they learn these shapes from data.
That is done by applying k-means to the bounding box shapes (widths and heights) in a dataset, using 1 - IoU as the distance measure (sketched below).
They cluster them into k=5 clusters and then use the centroids as anchor box shapes.
Their accuracy this way is about the same as with 9 manually chosen anchor boxes.
(Using k=9 further increases their accuracy significantly, but also increases model complexity. As they want to predict 9000 classes they stay with k=5.)
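
  A minimal numpy sketch of that clustering, assuming the boxes are given as (width, height) pairs; this is a reconstruction, not the authors' code:

  ```python
  import numpy as np

  def iou_wh(boxes, centroids):
      """IoU between (w, h) pairs, treating all boxes as if they shared the same center."""
      inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
              np.minimum(boxes[:, None, 1], centroids[None, :, 1])
      union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
              (centroids[:, 0] * centroids[:, 1])[None, :] - inter
      return inter / union  # shape: (num_boxes, num_centroids)

  def anchor_kmeans(boxes, k=5, iters=100):
      """k-means on box shapes with d = 1 - IoU as the distance measure."""
      centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
      for _ in range(iters):
          assignment = np.argmax(iou_wh(boxes, centroids), axis=1)  # min distance = max IoU
          centroids = np.array([boxes[assignment == i].mean(axis=0)
                                if np.any(assignment == i) else centroids[i]
                                for i in range(k)])
      return centroids  # the k anchor box shapes (w, h)
  ```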
* To better predict small bounding boxes, they add a pass-through connection from an earlier, higher resolution layer (26x26) to the end of the network (13x13). The higher resolution features are reshaped by stacking adjacent spatial positions into channels and then concatenated with the final features (sketched below).
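
  A minimal numpy sketch of that pass-through ("reorg") reshaping, assuming a (height, width, channels) layout; it stacks each 2x2 block of spatial positions into the channel dimension:

  ```python
  import numpy as np

  def reorg(features, stride=2):
      """Turn a (H, W, C) feature map into (H/stride, W/stride, C*stride^2)
      by moving adjacent spatial positions into the channel axis (space-to-depth)."""
      h, w, c = features.shape
      features = features.reshape(h // stride, stride, w // stride, stride, c)
      features = features.transpose(0, 2, 1, 3, 4)
      return features.reshape(h // stride, w // stride, c * stride * stride)

  # Example: an earlier 26x26x512 layer becomes 13x13x2048, which can then be
  # concatenated with the final 13x13 feature map along the channel axis.
  print(reorg(np.zeros((26, 26, 512))).shape)  # (13, 13, 2048)
  ```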
* They train their network at multiple scales: every few batches, the input resolution is randomly switched to a new value from {320, 352, ..., 608} (all multiples of 32). (As the network is now fully convolutional, they can easily do that; see the sketch below.)
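
  A minimal sketch of that multi-scale schedule, assuming hypothetical `resize_batch`/`train_step` helpers and an iterable of `batches`:

  ```python
  import random

  SCALES = list(range(320, 608 + 1, 32))  # {320, 352, ..., 608}, all multiples of 32

  resolution = 416  # some starting resolution
  for batch_idx, batch in enumerate(batches):
      if batch_idx % 10 == 0:                  # paper: pick a new size every 10 batches
          resolution = random.choice(SCALES)
      batch = resize_batch(batch, resolution)  # hypothetical helper
      train_step(model, batch)                 # hypothetical helper
  ```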
* Speed improvements
* They get rid of their fully connected layers. Instead the network is now fully convolutional.
* They also remove a handful of their convolutional layers.
* Capability improvement (weakly supervised learning)
* They suggest a method to predict bounding boxes of the 9000 most common classes in ImageNet.
They add a few more abstract classes to that (e.g. dog for all breeds of dogs) and arrive at over 9000 classes (9418 to be precise).
* They train on ImageNet and MSCOCO.
* ImageNet only contains class labels, no bounding boxes. MSCOCO only contains general classes (e.g. "dog" instead of the specific breed).
* They train jointly on both datasets, mixing examples from them. MSCOCO examples are used for detection and classification, while ImageNet examples are only used for classification.
For an ImageNet example of class `c`, they search among the predicted bounding boxes for the one with the highest predicted probability of class `c`
and backpropagate only the classification loss for that box (sketched below).
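
  A minimal sketch of that per-example loss selection, assuming hypothetical `detection_loss`/`classification_loss` helpers and box objects with a `class_probs` mapping:

  ```python
  def joint_loss(predicted_boxes, example):
      """example comes either from MSCOCO (with ground truth boxes)
      or from ImageNet (class label only, no boxes)."""
      if example.has_boxes:
          # MSCOCO: backpropagate the full detection loss (localization + classification).
          return detection_loss(predicted_boxes, example.boxes)
      # ImageNet: find the predicted box that is most confident about the
      # ground truth class and backpropagate only its classification loss.
      best = max(predicted_boxes, key=lambda box: box.class_probs[example.label])
      return classification_loss(best.class_probs, example.label)
  ```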
* In order to compensate for the different abstraction levels of the classes (e.g. "dog" vs. a specific breed), they make use of WordNet.
Based on that data they generate a hierarchy/tree of classes; one path through that tree could be: object -> animal -> canine -> dog -> hunting dog -> terrier -> yorkshire terrier.
They let the network predict paths in that hierarchy, so that the prediction "dog" for a specific dog breed is not completely wrong (see the sketch below).
* Visualization of the hierarchy:
* ![YOLO9000 hierarchy](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/YOLO9000__hierarchy.jpg?raw=true "YOLO9000 hierarchy")
* They predict many small softmaxes, one per node of the hierarchy, each covering that node's children (i.e. the probabilities are conditional on the parent):
* ![YOLO9000 softmaxes](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/YOLO9000__softmaxes.jpg?raw=true "YOLO9000 softmaxes")
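
  A minimal sketch of how a class probability is computed from those per-node (conditional) softmaxes, by multiplying the conditional probabilities along the path to the root; the tree fragment and its probabilities are made up:

  ```python
  # Toy WordTree fragment: class -> (parent, P(class | parent)).
  # The conditional probabilities would come from the per-node softmaxes of the network.
  tree = {
      "object":            (None, 1.0),
      "animal":            ("object", 0.9),
      "canine":            ("animal", 0.8),
      "dog":               ("canine", 0.9),
      "hunting dog":       ("dog", 0.6),
      "terrier":           ("hunting dog", 0.8),
      "yorkshire terrier": ("terrier", 0.5),
  }

  def absolute_probability(cls):
      """P(class) = product of the conditional probabilities along the path to the root."""
      prob = 1.0
      while cls is not None:
          parent, conditional = tree[cls]
          prob *= conditional
          cls = parent
      return prob

  print(absolute_probability("yorkshire terrier"))  # 0.5 * 0.8 * 0.6 * 0.9 * 0.8 * 0.9 * 1.0 ≈ 0.156
  ```

  At test time, the paper traverses the tree downwards, always taking the child with the highest confidence, until the accumulated probability falls below a threshold; the node reached at that point is the predicted class.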
### Results
* Accuracy
* They reach about 73.4 mAP when training on Pascal VOC 2007 and 2012. That's slightly behind Faster R-CNN with VGG16 (75.9 mAP, trained on MSCOCO+2007+2012).
* Speed
* They reach 91 fps (about 11ms/image) at image resolution 288x288 and 40 fps (25ms/image) at 544x544.
* Weakly supervised learning
* They test their 9000-class detection on ImageNet's detection task, which contains bounding boxes for 200 object classes.
* They achieve 19.7 mAP over all classes and 16.0 mAP for the 156 classes that are not part of MSCOCO.
* For some classes they get an AP of 0.
* The system performs well for all kinds of animals, but struggles with non-living object categories such as clothing and equipment (e.g. sunglasses).
* Example images (notice the class labels):
* ![YOLO9000 examples](https://raw.githubusercontent.com/aleju/papers/master/neural-nets/images/YOLO9000__examples.jpg?raw=true "YOLO9000 examples")