Learning Confidence for Out-of-Distribution Detection in Neural Networks
Terrance DeVries
and
Graham W. Taylor
arXiv e-Print archive, 2018
Keywords:
stat.ML, cs.LG
First published: 2018/02/13
Abstract: Modern neural networks are very powerful predictive models, but they are
often incapable of recognizing when their predictions may be wrong. Closely
related to this is the task of out-of-distribution detection, where a network
must determine whether or not an input is outside of the set on which it is
expected to safely perform. To jointly address these issues, we propose a
method of learning confidence estimates for neural networks that is simple to
implement and produces intuitively interpretable outputs. We demonstrate that
on the task of out-of-distribution detection, our technique surpasses recently
proposed techniques which construct confidence based on the network's output
distribution, without requiring any additional labels or access to
out-of-distribution examples. Additionally, we address the problem of
calibrating out-of-distribution detectors, where we demonstrate that
misclassified in-distribution examples can be used as a proxy for
out-of-distribution examples.
## Summary
In prior work, 'On Calibration of Modern Neural Networks', temperature scaling is used to produce calibrated confidences. That approach operates at inference time and does not modify the existing classifier. This paper instead learns confidence during training, with the network directly outputting a confidence estimate.
## Architecture
An additional branch for confidence is added after the penultimate layer, in parallel to logits and probs (Figure 2).
https://i.imgur.com/vtKq9g0.png
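A minimal PyTorch sketch of this layout, not the authors' code; names such as `backbone` and `feat_dim` are illustrative:

```python
import torch
import torch.nn as nn

class ClassifierWithConfidence(nn.Module):
    """Classification head plus a parallel confidence branch, both fed by
    the penultimate feature vector (illustrative sketch)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # produces penultimate features
        self.classifier = nn.Linear(feat_dim, num_classes)  # logits branch
        self.confidence = nn.Linear(feat_dim, 1)       # single-scalar confidence branch

    def forward(self, x):
        feats = self.backbone(x)
        logits = self.classifier(feats)
        probs = torch.softmax(logits, dim=-1)          # class probabilities p
        conf = torch.sigmoid(self.confidence(feats))   # confidence c in (0, 1)
        return probs, conf
```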
## Training
The network outputs the class probabilities $p$ and a confidence $c$, a single scalar in $(0,1)$. The modified prediction is $p'=c \cdot p+(1-c)y$, where $y$ is the one-hot label (the hint). The confidence loss is $\mathcal{L}_c=-\log c$ and the task (NLL) loss is $\mathcal{L}_t= -\sum_i y_i \log p'_i$; the total loss is $\mathcal{L}_t + \lambda \mathcal{L}_c$.
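A hedged sketch of the two loss terms (tensor shapes and the `eps` stabilizer are assumptions, not from the paper):

```python
import torch

def confidence_losses(probs, conf, y_onehot, eps=1e-12):
    """probs: (B, C) softmax outputs p; conf: (B, 1) confidence c in (0, 1);
    y_onehot: (B, C) one-hot labels. Returns the task and confidence losses."""
    # Interpolate the prediction toward the label, controlled by confidence:
    # p' = c * p + (1 - c) * y
    p_prime = conf * probs + (1.0 - conf) * y_onehot
    # Task loss: NLL on the modified probabilities
    L_t = -(torch.log(p_prime + eps) * y_onehot).sum(dim=1).mean()
    # Confidence loss: penalize asking for hints (c -> 0)
    L_c = -torch.log(conf + eps).mean()
    return L_t, L_c
```

The total training loss is then `L_t + lam * L_c`, with `lam` adjusted as described below.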
### Budget Parameter
The authors introduce a confidence loss weight $\lambda$ and a budget $\beta$: if $\mathcal{L}_c>\beta$, $\lambda$ is increased; if $\mathcal{L}_c<\beta$, $\lambda$ is decreased. Values of $\beta$ in $[0.1, 1.0]$ are found to work reasonably well.
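A minimal sketch of this adjustment; the summary only gives the direction of the update, so the multiplicative step size here is an assumption:

```python
def update_lambda(lam, L_c, beta, factor=1.01):
    """Nudge the confidence loss weight so L_c stays near the budget beta
    (factor is an assumed step size, not from the paper)."""
    if L_c > beta:
        lam = lam * factor   # confidence loss above budget: weight it more
    elif L_c < beta:
        lam = lam / factor   # confidence loss below budget: relax the weight
    return lam
```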
### Hinting with 50%
If hints are always available, the model can rely on the free label (driving $c \to 0$) instead of fitting the complicated structure of the data. The authors therefore give hints only 50% of the time, so the model cannot rely on them entirely: $p'$ is used for only half of the batches in each epoch. A sketch of this is shown below.
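A hedged sketch of the per-batch coin flip; the exact scheduling beyond "half of the batches" is an assumption:

```python
import torch

def maybe_hint(probs, conf, y_onehot, hint_prob=0.5):
    """Apply the label hint to this batch with probability hint_prob
    (illustrative; the paper uses hints for half of the batches per epoch)."""
    if torch.rand(1).item() < hint_prob:
        return conf * probs + (1.0 - conf) * y_onehot  # hinted prediction p'
    return probs  # plain prediction, no hint for this batch
```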
### Misclassified Examples
A high-capacity network easily overfits a small dataset, yet misclassified examples are needed for the network to learn confidence: it should assign low confidence to samples it would get wrong. The paper therefore uses aggressive data augmentation to create difficult examples.
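An illustrative augmentation pipeline; the specific operations and magnitudes are assumptions, since the summary only says "aggressive data augmentation":

```python
from torchvision import transforms

# Assumed example pipeline for 32x32 images, not the paper's exact recipe.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # shift content off-center
    transforms.RandomHorizontalFlip(),          # mirror images
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```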
## Inference
Reject if $c\le\delta$.
For out-of-distribution detection, the same input perturbation as in ODIN (2018) is applied. ODIN uses temperature scaling and thresholds the maximum softmax probability, whereas this method needs no temperature scaling because it outputs $c$ directly. In the reported evaluation, this method outperforms ODIN.
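A hedged sketch of the detection step, assuming a model that returns `(probs, conf)` as in the architecture sketch above; the perturbation magnitude `epsilon` and threshold `delta` are assumed tuning parameters:

```python
import torch

def detect_ood(model, x, delta, epsilon=0.001):
    """Flag inputs as out-of-distribution: ODIN-style input perturbation
    toward higher confidence, then threshold the learned confidence c."""
    x = x.clone().requires_grad_(True)
    probs, conf = model(x)
    # Gradient step that increases log c; in-distribution inputs tend to
    # gain more confidence from this perturbation than OOD inputs.
    loss = -torch.log(conf + 1e-12).sum()
    loss.backward()
    x_perturbed = x - epsilon * x.grad.sign()
    with torch.no_grad():
        _, conf = model(x_perturbed)
    # Reject (flag as out-of-distribution) if c <= delta
    return conf.squeeze(-1) <= delta
```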
## Reference
ODIN: [Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1706.02690#elbaro)