Learning Confidence for Out-of-Distribution Detection in Neural Networks

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/02/13 (6 years ago)

**Abstract:** Modern neural networks are very powerful predictive models, but they are
often incapable of recognizing when their predictions may be wrong. Closely
related to this is the task of out-of-distribution detection, where a network
must determine whether or not an input is outside of the set on which it is
expected to safely perform. To jointly address these issues, we propose a
method of learning confidence estimates for neural networks that is simple to
implement and produces intuitively interpretable outputs. We demonstrate that
on the task of out-of-distribution detection, our technique surpasses recently
proposed techniques which construct confidence based on the network's output
distribution, without requiring any additional labels or access to
out-of-distribution examples. Additionally, we address the problem of
calibrating out-of-distribution detectors, where we demonstrate that
misclassified in-distribution examples can be used as a proxy for
out-of-distribution examples.

## Summary

In the prior work 'On Calibration of Modern Neural Networks', temperature scaling is used to output confidence. This is done at the inference stage and does not change the existing classifier. This paper instead considers confidence at the training stage, and outputs the confidence directly from the network.

## Architecture

An additional branch for confidence is added after the penultimate layer, in parallel to the logits and probabilities (Figure 2).

https://i.imgur.com/vtKq9g0.png

## Training

The network outputs the probabilities $p$ and the confidence $c$, a single scalar. The modified probability is $p' = c \cdot p + (1-c)y$, where $y$ is the label (the "hint"). The confidence loss is $\mathcal{L}_c = -\log c$; the NLL is $\mathcal{L}_t = -\sum_i \log(p'_i) y_i$.

### Budget Parameter

The authors introduce a confidence loss weight $\lambda$ and a budget $\beta$. If $\mathcal{L}_c > \beta$, increase $\lambda$; if $\mathcal{L}_c < \beta$, decrease $\lambda$. Values of $\beta$ in $[0.1, 1.0]$ are found to be reasonable.

### Hinting with 50% Probability

Sometimes the model relies on the free label ($c = 0$) and does not fit the complicated structure of the data. The authors therefore give hints only 50% of the time, so the model cannot rely on the hint 100%: they used $p'$ for only half of the batches in each epoch.

### Misclassified Examples

A high-capacity network overfits a small dataset easily, yet misclassified samples are required to learn confidence, since the network should assign low confidence to such samples. The paper used aggressive data augmentation to create difficult examples.

## Inference

Reject if $c \le \delta$. For out-of-distribution detection, they used the same input perturbation as in ODIN (2018). ODIN used temperature scaling on the max probability, while this paper does not need temperature scaling since it outputs $c$ directly. In the evaluation, this paper outperformed ODIN.
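The training objective above can be sketched in plain NumPy. This is a minimal illustration, not the authors' code; the `step` size in the $\lambda$ update is an assumed hyperparameter, and in practice $p$ and $c$ would come from network outputs with gradients flowing through them.

```python
import numpy as np

def confidence_losses(probs, conf, labels_onehot, lam):
    """One step of the confidence-learning objective (sketch).

    probs:         (N, K) softmax outputs p
    conf:          (N,)   confidence scalars c in (0, 1]
    labels_onehot: (N, K) one-hot targets y
    lam:           weight lambda on the confidence loss
    """
    c = conf[:, None]
    # Interpolate the prediction toward the label ("hint"): p' = c*p + (1-c)*y
    p_prime = c * probs + (1.0 - c) * labels_onehot
    # Task loss: NLL on the modified probabilities, L_t = -sum_i log(p'_i) y_i
    nll = -np.sum(labels_onehot * np.log(p_prime), axis=1).mean()
    # Confidence loss: L_c = -log c penalizes asking for hints
    conf_loss = -np.log(conf).mean()
    return nll + lam * conf_loss, nll, conf_loss

def update_lambda(lam, conf_loss, beta, step=0.01):
    # Budget beta: increase lambda when L_c exceeds the budget, else decrease it
    return lam + step if conf_loss > beta else max(lam - step, 0.0)
```

Note that with $c = 1$ the modified probability reduces to $p' = p$ and the task loss is the ordinary NLL, while the confidence loss vanishes.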
## Reference

ODIN: [Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1706.02690#elbaro)

Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2017/06/08 (6 years ago)

**Abstract:** We consider the problem of detecting out-of-distribution images in neural
networks. We propose ODIN, a simple and effective method that does not require
any change to a pre-trained neural network. Our method is based on the
observation that using temperature scaling and adding small perturbations to
the input can separate the softmax score distributions between in- and
out-of-distribution images, allowing for more effective detection. We show in a
series of experiments that ODIN is compatible with diverse network
architectures and datasets. It consistently outperforms the baseline approach
by a large margin, establishing a new state-of-the-art performance on this
task. For example, ODIN reduces the false positive rate from the baseline 34.7%
to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is
95%.

## Task

Add a '**rejection**' output to an existing classification model with a softmax layer.

## Method

1. Choose a threshold $\delta$ and a temperature $T$.
2. Add a perturbation to the input $x$ (eq. 2): let $\tilde x = x - \epsilon\, \text{sign}(-\nabla_x \log S_{\hat y}(x; T))$.
3. If $p(\tilde x; T) \le \delta$, reject.
4. Otherwise, return the output of the original classifier.

Here $p(\tilde x; T)$ is the max probability with temperature scaling for the input $\tilde x$. Both $\delta$ and $T$ are chosen manually.
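The four steps above can be sketched for a toy linear-softmax classifier, where the gradient of $\log S_{\hat y}$ with respect to $x$ has the closed form $W_{\hat y} - \sum_k p_k W_k$. This is an illustration under that assumption; for a real network the gradient would come from backpropagation.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: divide logits by T before normalizing
    z = z / T
    e = np.exp(z - z.max())
    return e / e.sum()

def odin_reject(x, W, T, eps, delta):
    """ODIN-style rejection for a linear softmax classifier (toy sketch).

    Perturbs x against the gradient of -log S_yhat(x; T), then thresholds
    the max temperature-scaled probability of the perturbed input.
    """
    p = softmax(W @ x, T)
    yhat = int(np.argmax(p))
    # d log p_yhat / dx for logits z = W x:  W[yhat] - sum_k p_k W[k]
    grad_log_p = W[yhat] - p @ W
    # x~ = x - eps * sign(-grad): push x toward higher max softmax score
    x_tilde = x - eps * np.sign(-grad_log_p)
    score = softmax(W @ x_tilde, T).max()
    return score <= delta, score
```

The perturbation raises the max softmax score more for in-distribution inputs than for out-of-distribution ones, which is what makes the threshold $\delta$ effective.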

On Calibration of Modern Neural Networks

Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG

**First published:** 2017/06/14 (6 years ago)

**Abstract:** Confidence calibration -- the problem of predicting probability estimates
representative of the true correctness likelihood -- is important for
classification models in many applications. We discover that modern neural
networks, unlike those from a decade ago, are poorly calibrated. Through
extensive experiments, we observe that depth, width, weight decay, and Batch
Normalization are important factors influencing calibration. We evaluate the
performance of various post-processing calibration methods on state-of-the-art
architectures with image and document classification datasets. Our analysis and
experiments not only offer insights into neural network learning, but also
provide a simple and straightforward recipe for practical settings: on most
datasets, temperature scaling -- a single-parameter variant of Platt Scaling --
is surprisingly effective at calibrating predictions.

## Task

A neural network for classification typically has a **softmax** layer and outputs the class with the maximum probability. However, this probability does not represent the model's **confidence**. If the average confidence (the average of the max probabilities) over a dataset matches the accuracy, the model is called **well-calibrated**. Old models like LeNet (1998) were well-calibrated, but modern networks like ResNet (2016) are no longer well-calibrated. This paper explains what caused this and compares various calibration methods.

## Figure - Confidence Histogram

https://i.imgur.com/dMtdWsL.png

The bottom row: group the samples by confidence (max probability) into bins, and calculate the accuracy (# correct / bin size) within each bin.

- ECE (Expected Calibration Error): the average of |accuracy - confidence| over the bins, weighted by bin size
- MCE (Maximum Calibration Error): the maximum of |accuracy - confidence| over the bins

## Analysis - What

The paper examines how models become miscalibrated under different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL.

## Solution - Calibration Methods

Many calibration methods for binary and multi-class classification are evaluated. The method that performs best is **temperature scaling**, which simply divides the logits by a single scalar $T$ before the softmax. The paper uses the validation set to choose the best $T$.
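The binned ECE computation can be sketched as follows (a minimal illustration of the metric as described above, with bin edges and count assumed to follow the usual equal-width scheme):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: bin predictions by max-prob confidence, then average
    |accuracy - confidence| over the bins, weighted by bin size.

    confidences: (N,) max softmax probability per sample
    correct:     (N,) 1.0 if the prediction was right, else 0.0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = np.mean(correct[mask])        # accuracy within the bin
            conf = np.mean(confidences[mask])   # average confidence in the bin
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated model (e.g. 75% accuracy among predictions made with confidence 0.75) yields an ECE of zero.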

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

**First published:** 2017/07/25 (6 years ago)

**Abstract:** Top-down visual attention mechanisms have been used extensively in image
captioning and visual question answering (VQA) to enable deeper image
understanding through fine-grained analysis and even multiple steps of
reasoning. In this work, we propose a combined bottom-up and top-down attention
mechanism that enables attention to be calculated at the level of objects and
other salient image regions. This is the natural basis for attention to be
considered. Within our approach, the bottom-up mechanism (based on Faster
R-CNN) proposes image regions, each with an associated feature vector, while
the top-down mechanism determines feature weightings. Applying this approach to
image captioning, our results on the MSCOCO test server establish a new
state-of-the-art for the task, improving the best published result in terms of
CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the
broad applicability of the method, applying the same approach to VQA we obtain
first place in the 2017 VQA Challenge.

This paper solves two tasks: image captioning and VQA. The main idea is to use Faster R-CNN to embed images (k x 2048, from k bounding boxes) instead of ResNet (14 x 14 x 2048), and to apply attention over the k vectors.

For **VQA**, this is essentially (Faster R-CNN + ShowAskAttendAnswer). SAAA (ShowAskAttendAnswer) calculates a 2D attention map from the concatenation of a text vector (2048-dim, from an LSTM) and an image tensor (2048 x 14 x 14, from ResNet). This image feature can be thought of as a collection of 2048-dim feature vectors. This paper uses Faster R-CNN to get k bounding boxes; each bounding box yields a 2048-dim vector, so we have k x 2048, which is fed to SAAA.

**SAAA**: https://i.imgur.com/2FnPXi0.png

**This paper (VQA)**: https://i.imgur.com/xib77Iy.png

For **image captioning**, it uses a 2-layer LSTM. The first layer receives the average of the k 2048-dim vectors; its output is used to calculate the attention weights over the k vectors. The second layer receives the attention-weighted 2048-dim vector and the output of the first layer.

https://i.imgur.com/GeXaC30.png
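The core operation, attending over the k region vectors with a top-down query, can be sketched as follows. This is an illustration, not the paper's exact scoring function: the dot-product score and the query vector (e.g. an LSTM state projected to the feature dimension) are assumptions here.

```python
import numpy as np

def top_down_attention(regions, query):
    """Attend over k region features with a top-down query (sketch).

    regions: (k, d) box features from the bottom-up detector
    query:   (d,)   top-down signal, e.g. a projected LSTM state
    Returns the attention-weighted average of the k region vectors.
    """
    scores = regions @ query           # (k,) unnormalized alignment scores
    w = np.exp(scores - scores.max())  # stable softmax over the k regions
    w = w / w.sum()
    return w @ regions                 # (d,) weighted-average feature
```

The output replaces the spatially weighted 14 x 14 feature map of grid-based attention: the softmax runs over object proposals rather than uniform grid cells.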
