Describes how to compare classifiers that were evaluated on multiple datasets (e.g. CIFAR-10, MNIST and SVHN). Recommends the Wilcoxon signed-ranks test for comparing two classifiers and the Friedman test, with the corresponding post-hoc tests, for comparing more than two. Introduces CD (critical difference) diagrams.
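
As a rough illustration of the Friedman test and the critical difference used in CD diagrams, here is a minimal sketch in plain Python. The accuracy table is made up for illustration, the function names are hypothetical, and the critical value `q_alpha = 2.343` is the tabulated Nemenyi value for three classifiers at α = 0.05; treat all of these as assumptions rather than results from the paper.

```python
import math

def average_ranks(scores):
    """scores[i][j] = score of classifier j on dataset i. Rank 1 is the best."""
    k = len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # Classifier indices sorted from best to worst score on this dataset.
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            # Extend the block while the next classifier has the same score (a tie).
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average of the 1-based tied positions
            for j in order[pos:end + 1]:
                rank_sums[j] += shared
            pos = end + 1
    return [s / len(scores) for s in rank_sums]

def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference between average ranks for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Hypothetical accuracies: 6 datasets (rows) x 3 classifiers (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.95, 0.96, 0.90],
    [0.70, 0.65, 0.60],
    [0.88, 0.80, 0.82],
    [0.92, 0.91, 0.85],
    [0.75, 0.70, 0.70],
]
avg_ranks = average_ranks(scores)
chi2 = friedman_statistic(avg_ranks, len(scores))
cd = nemenyi_cd(k=3, n_datasets=len(scores), q_alpha=2.343)  # assumed table value

print("average ranks:", avg_ranks)
print("Friedman chi-square:", chi2)
print("critical difference:", cd)
```

Two classifiers differ significantly under the Nemenyi test when their average ranks differ by at least the critical difference. A CD diagram plots the average ranks on an axis and connects groups of classifiers whose ranks are closer than that.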
* The McNemar test and the 5x2cv test are suitable for comparing two classifiers on a single dataset
* Describes the Wilcoxon signed-ranks test in detail in Section 3.1.3
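
The Wilcoxon signed-ranks test can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the scores are hypothetical integer accuracies in percent, the function name is made up, and zero differences are simply split evenly between the two rank sums (the rule of dropping one zero difference when their count is odd is not implemented).

```python
import math

def wilcoxon_signed_ranks(scores_a, scores_b):
    """Return (T, z) for paired scores of two classifiers over N datasets."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Rank the absolute differences, giving average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based tied positions
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    # Sum the ranks of positive and negative differences separately.
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    # Simplification: split the ranks of zero differences evenly.
    zero_share = sum(r for d, r in zip(diffs, ranks) if d == 0) / 2
    r_plus += zero_share
    r_minus += zero_share
    t = min(r_plus, r_minus)
    # Normal approximation, intended for a larger number of datasets.
    z = (t - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return t, z

# Hypothetical accuracies (percent) of two classifiers on 8 datasets.
acc_a = [85, 90, 78, 92, 88, 75, 81, 95]
acc_b = [83, 91, 74, 93, 85, 70, 82, 89]
t, z = wilcoxon_signed_ranks(acc_a, acc_b)
print("T =", t, "z =", z)
```

With only a handful of datasets, as here, T should be compared against a table of exact critical values. The z approximation is only appropriate for larger N, where the null hypothesis is rejected at α = 0.05 if z is below -1.96.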