First published: 2017/06/14

Abstract: Confidence calibration -- the problem of predicting probability estimates
representative of the true correctness likelihood -- is important for
classification models in many applications. We discover that modern neural
networks, unlike those from a decade ago, are poorly calibrated. Through
extensive experiments, we observe that depth, width, weight decay, and Batch
Normalization are important factors influencing calibration. We evaluate the
performance of various post-processing calibration methods on state-of-the-art
architectures with image and document classification datasets. Our analysis and
experiments not only offer insights into neural network learning, but also
provide a simple and straightforward recipe for practical settings: on most
datasets, temperature scaling -- a single-parameter variant of Platt Scaling --
is surprisingly effective at calibrating predictions.

A neural network for classification typically ends in a **softmax** layer and predicts the class with the highest probability. That maximum probability is treated as the model's **confidence**, but it does not necessarily reflect the true likelihood that the prediction is correct. A model is **well-calibrated** if its confidence matches its accuracy: among samples predicted with confidence around 0.8, about 80% should be correct, and so the average confidence (average of max probs) over a dataset matches the accuracy. Older models like LeNet (1998) were well-calibrated, but modern networks like ResNet (2016) are not. This paper investigates what causes this and compares various calibration methods.
## Figure - Confidence Histogram
The bottom row: group the samples by confidence (max probabilities) into bins, and calculate the accuracy (# correct / bin size) within each bin.
- ECE (Expected Calibration Error): average of |accuracy - confidence| over the bins, weighted by the fraction of samples in each bin
- MCE (Maximum Calibration Error): max of |accuracy - confidence| over the bins
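
The two metrics above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's code: the function name `calibration_errors`, the default of 10 equal-width bins, the half-open bin convention, and the toy data are all assumptions made for the example.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) over equal-width confidence bins.

    confidences: max softmax probability per sample, each in [0, 1].
    correct: truthy where the predicted class equals the label.
    """
    counts = [0] * n_bins        # samples per bin
    conf_sums = [0.0] * n_bins   # summed confidence per bin
    hits = [0] * n_bins          # correct predictions per bin
    for conf, ok in zip(confidences, correct):
        # Assumed convention: bin m covers [m/M, (m+1)/M);
        # a confidence of exactly 1.0 falls in the last bin.
        m = min(int(conf * n_bins), n_bins - 1)
        counts[m] += 1
        conf_sums[m] += conf
        hits[m] += 1 if ok else 0

    n = len(confidences)
    ece, mce = 0.0, 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit / count - conf_sum / count)  # |accuracy - confidence|
        ece += (count / n) * gap                   # weight by bin size
        mce = max(mce, gap)
    return ece, mce


# Toy data (made up for illustration): four predictions at 0.95, two at 0.55.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False, True, False]
ece, mce = calibration_errors(confidences, correct)
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```

On this toy data the 0.95 bin has accuracy 0.75 (gap 0.2) and the 0.55 bin has accuracy 0.5 (gap 0.05), so ECE = (4/6)(0.2) + (2/6)(0.05) = 0.15 and MCE = 0.2.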
## Analysis - What
The paper examines how different factors affect miscalibration: (1) model capacity (depth and width), (2) batch norm, (3) weight decay, (4) the NLL training loss.
## Solution - Calibration Methods
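Several calibration methods for binary and multi-class classification are evaluated. The method that performed best on most datasets is **temperature scaling**, which divides the logits by a single constant T > 0 (the temperature) before the softmax. Because this rescaling does not change which logit is largest, the predicted class and the accuracy stay the same; only the confidence changes. The paper chooses T by minimizing NLL on the validation set.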
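
As a rough sketch of temperature scaling, the snippet below fits a single temperature on held-out logits by minimizing NLL. The grid search is a simple stand-in chosen for readability and should not be read as the paper's optimizer. The function names, the candidate grid, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def softmax(logits, temperature=1.0):
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= log_softmax(logits, temperature)[y]
    return total / len(labels)


def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0 (assumed grid)
    return min(candidates, key=lambda t: mean_nll(val_logits, val_labels, t))


# Toy validation set (made up): the model is about 98% confident on every
# sample but only 3 of 4 predictions are right, i.e. it is overconfident.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
print("confidence before:", max(softmax(val_logits[0])))
print("confidence after: ", max(softmax(val_logits[0], T)))
```

On this toy data the fitted temperature is greater than 1, which softens the probabilities so the confidence moves toward the 75% accuracy, while the predicted class for each sample is unchanged.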