Learning Confidence for Out-of-Distribution Detection in Neural Networks

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/02/13 (5 years ago)

**Abstract:** Modern neural networks are very powerful predictive models, but they are
often incapable of recognizing when their predictions may be wrong. Closely
related to this is the task of out-of-distribution detection, where a network
must determine whether or not an input is outside of the set on which it is
expected to safely perform. To jointly address these issues, we propose a
method of learning confidence estimates for neural networks that is simple to
implement and produces intuitively interpretable outputs. We demonstrate that
on the task of out-of-distribution detection, our technique surpasses recently
proposed techniques which construct confidence based on the network's output
distribution, without requiring any additional labels or access to
out-of-distribution examples. Additionally, we address the problem of
calibrating out-of-distribution detectors, where we demonstrate that
misclassified in-distribution examples can be used as a proxy for
out-of-distribution examples.
more
less

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

[link]
## Summary In a prior work 'On Calibration of Modern Nueral Networks', temperature scailing is used for outputing confidence. This is done at inference stage, and does not change the existing classifier. This paper considers the confidence at training stage, and directly outputs the confidence from the network. ## Architecture An additional branch for confidence is added after the penultimate layer, in parallel to logits and probs (Figure 2). https://i.imgur.com/vtKq9g0.png ## Training The network outputs the prob $p$ and the confidence $c$ which is a single scalar. The modified prob $p'=c*p+(1-c)y$ where $y$ is the label (hint). The confidence loss is $\mathcal{L}_c=-\log c$, the NLL is $\mathcal{L}_t= -\sum \log(p'_i)y_i$. ### Budget Parameter The authors introduced the confidence loss weight $\lambda$ and a budget $\beta$. If $\mathcal{L}_c>\beta$, increase $\lambda$, if $\mathcal{L}_c<\beta$, decrease $\lambda$. $\beta$ is found reasonable in [0.1,1.0]. ### Hinting with 50% Sometimes the model relies on the free label ($c=0$) and does not fit the complicated structure of data. The authors give hints with 50% so the model cannot rely 100% on the hint. They used $p'$ for only half of the bathes for each epoch. ### Misclassified Examples A high-capacity network with small dataset overfits well, and mis-classified samples are required to learn the confidence. The network likely assigns low confidence to samples. The paper used an aggressive data augmentation to create difficult examples. ## Inference Reject if $c\le\delta$. For out-of-distribution detection, they used the same input perturbation as in ODIN (2018). ODIN used temperature scailing and used the max prob, while this paper does not need temperature scailing since it directly outputs $c$. In evaluation, this paper outperformed ODIN. ## Reference ODIN: [Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1706.02690#elbaro) |

Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2017/06/08 (5 years ago)

**Abstract:** We consider the problem of detecting out-of-distribution images in neural
networks. We propose ODIN, a simple and effective method that does not require
any change to a pre-trained neural network. Our method is based on the
observation that using temperature scaling and adding small perturbations to
the input can separate the softmax score distributions between in- and
out-of-distribution images, allowing for more effective detection. We show in a
series of experiments that ODIN is compatible with diverse network
architectures and datasets. It consistently outperforms the baseline approach
by a large margin, establishing a new state-of-the-art performance on this
task. For example, ODIN reduces the false positive rate from the baseline 34.7%
to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is
95%.
more
less

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

[link]
## Task Add '**rejection**' output to an existing classification model with softmax layer. ## Method 1. Choose some threshold $\delta$ and temperature $T$ 2. Add a perturbation to the input x (eq 2), let $\tilde x = x - \epsilon \text{sign}(-\nabla_x \log S_{\hat y}(x;T))$ 3. If $p(\tilde x;T)\le \delta$, rejects 4. If not, return the output of the original classifier $p(\tilde x;T)$ is the max prob with temperature scailing for input $\tilde x$ $\delta$ and $T$ are manually chosen. |

On Calibration of Modern Neural Networks

Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG

**First published:** 2017/06/14 (5 years ago)

**Abstract:** Confidence calibration -- the problem of predicting probability estimates
representative of the true correctness likelihood -- is important for
classification models in many applications. We discover that modern neural
networks, unlike those from a decade ago, are poorly calibrated. Through
extensive experiments, we observe that depth, width, weight decay, and Batch
Normalization are important factors influencing calibration. We evaluate the
performance of various post-processing calibration methods on state-of-the-art
architectures with image and document classification datasets. Our analysis and
experiments not only offer insights into neural network learning, but also
provide a simple and straightforward recipe for practical settings: on most
datasets, temperature scaling -- a single-parameter variant of Platt Scaling --
is surprisingly effective at calibrating predictions.
more
less

Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG

[link]
## Task A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this probability does not represent the **confidence**. If the average confidence (average of max probs) for a dataset matches the accuracy, it is called **well-calibrated**. Old models like LeNet (1998) was well-calibrated, but modern networks like ResNet (2016) are no longer well-calibrated. This paper explains what caused this and compares various calibration methods. ## Figure - Confidence Histogram https://i.imgur.com/dMtdWsL.png The bottom row: group the samples by confidence (max probailities) into bins, and calculates the accuracy (# correct / # bin size) within each bin. - ECE (Expected Calibration Error): average of |accuracy-confidence| of bins - MCE (Maximum Calibration Error): max of |accuracy-confidence| of bins ## Analysis - What The paper experiments how models are mis-calibrated with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL. ## Solution - Calibration Methods Many calibration methods for binary classification and multi-class classification are evaluated. The method that performed the best is **temperature scailing**, which simply multiplies logits before the softmax by some constant. The paper used the validation set to choose the best constant. |

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

**First published:** 2017/07/25 (5 years ago)

**Abstract:** Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, improving the best published result in terms of
CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the
broad applicability of the method, applying the same approach to VQA we obtain
first place in the 2017 VQA Challenge.
more
less

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

[link]
This paper solves two tasks: Image Captioning and VQA. The main idea is to use Faster R-CNN to embed images (kx2048 from k bounding boxes) instead of ResNet (14x14x2048) and apply attention over k vectors. For **VQA**, this is basically (Faster R-CNN + ShowAttendAskAnswer). SAAA(ShowAskAttendAnswer) calculates a 2D attention map from the concatenation of a text vector (2048-dim from LSTM) and image tensor (2048x14x14 from ResNet). This image feature can be thought as a collection of 2048-dim feature vectors. This paper uses Faster R-CNN to get k bounding boxes. Each bounding box is a 2048-dim vector so we have kx2048, which is fed to SAAA. **SAAA**: https://i.imgur.com/2FnPXi0.png **This paper (VQA)**: https://i.imgur.com/xib77Iy.png For **Image Captioning**, it uses 2-layer LSTM. The first layer gets the average of k 2048-dim vectors. The output is used to calculate the attention weights over k vectors. The second layer gets the weight-averaged 2048-dim vector and the output of the first layer. https://i.imgur.com/GeXaC30.png |

About