This paper describes common pitfalls when classifiers are compared and recommends McNemar's test.
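* The t-test is simply the wrong test for such an experimental design: it assumes independent test sets, but the classifiers being compared are evaluated on the same data.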
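
To make the recommended test concrete, here is a minimal sketch of McNemar's test for two classifiers evaluated on the same test set. It is an illustration under stated assumptions, not code from the paper: the function name `mcnemar_test` and its arguments are hypothetical, and it uses the common continuity-corrected chi-square form with one degree of freedom. Only the examples on which the two classifiers disagree enter the statistic.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test (hypothetical helper, not from the paper).

    Returns (statistic, p_value) for the null hypothesis that classifiers
    A and B have the same error rate on this shared test set.
    """
    n10 = 0  # A correct, B wrong
    n01 = 0  # A wrong, B correct
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == t), (b == t)
        if a_ok and not b_ok:
            n10 += 1
        elif b_ok and not a_ok:
            n01 += 1

    # Without any disagreement there is no evidence of a difference.
    if n01 + n10 == 0:
        return 0.0, 1.0

    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Toy data: 10 cases only A gets right, 2 only B gets right,
    # 5 both get right, 3 both get wrong.
    y_true = [1] * 20
    pred_a = [1] * 10 + [0] * 2 + [1] * 5 + [0] * 3
    pred_b = [0] * 10 + [1] * 2 + [1] * 5 + [0] * 3
    stat, p = mcnemar_test(y_true, pred_a, pred_b)
    print(f"statistic={stat:.3f}, p={p:.4f}")
```

When the number of disagreements is small, the chi-square approximation is rough and an exact binomial test on the disagreement counts is usually preferred.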
## See also
* Prechelt "A quantitative study of experimental evaluations of neural network algorithms" - most of the 200 evaluated papers had flaws
* Wolpert "On the connection between in-sample testing and generalization error" - No classifier is always better than another one.
* Dietterich: [Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms](http://www.shortscience.org/paper?bibtexKey=journals/neco/Dietterich98)
* Demsar: [Statistical Comparisons of Classifiers over Multiple Data Sets](http://www.shortscience.org/paper?bibtexKey=demvsar2006statistical)