ShortScience.org - Making Science Accessible!

Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Szegedy, Christian and Ioffe, Sergey and Vanhoucke, Vincent
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Open Review 9 years ago

This paper combines the Inception architecture
with residual networks by adding a shortcut connection
to each Inception module. Alternatively, this can be seen as a ResNet in which
the two conv layers are replaced by a (slightly modified) Inception module.
The paper (claims to) provide results against the hypothesis that adding residual
connections improves training; rather, increasing the model size is what makes the difference.

On Calibration of Modern Neural Networks
Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG

[link] Summary by David Stutz 6 years ago

Guo et al. study calibration of deep neural networks as a post-processing step. Here, calibration means correcting the predicted confidence scores, as recent deep networks are commonly overconfident. They consider several state-of-the-art post-processing steps for calibration and show that, surprisingly, a simple linear mapping, or even a single scaling, works well. So if $z_i$ are the logits of the network, then (the network being fixed) a parameter $T$ is found such that

$\sigma(\frac{z_i}{T})$

is calibrated, where $\sigma$ is the softmax and $T$ is chosen to minimize the NLL loss on a held-out validation set. Here, the temperature $T$ either softens or sharpens the probability distribution over classes. Interestingly, finding $T$ by optimizing the same loss that was used for training helps to reduce over-confidence.
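As a rough illustration of the procedure described above, the following sketch fits a single temperature by a grid search over the validation NLL. The function names, the grid range, and the use of a grid search instead of a gradient-based optimizer are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the class axis, numerically stabilized.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Mean negative log likelihood of the true labels at temperature T.
    p = softmax(logits, T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=np.linspace(0.05, 10.0, 200)):
    # Hypothetical helper: pick the T on the grid with the lowest validation NLL.
    # The network (and hence the logits) stays fixed; only T is searched.
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

On held-out logits from an overconfident network, the fitted value would be expected to come out above 1, which softens the predicted distribution without changing the argmax.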

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
Shiyu Liang and Yixuan Li and R. Srikant
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.LG, stat.ML

[link] Summary by David Stutz 6 years ago

Liang et al. propose a perturbation-based approach for detecting out-of-distribution examples using a network’s confidence predictions. In particular, the approach is based on the observation that neural networks make more confident predictions on images from the original data distribution (in-distribution examples) than on examples taken from a different distribution, i.e., a different dataset (out-distribution examples). This effect can further be amplified by using a temperature-scaled softmax, i.e.,

$ S_i(x, T) = \frac{\exp(f_i(x)/T)}{\sum_{j = 1}^N \exp(f_j(x)/T)}$

where $f_i(x)$ are the predicted logits and $T$ a temperature parameter. Based on these softmax scores, perturbations $\tilde{x}$ are computed using

$\tilde{x} = x - \epsilon\,\text{sign}(-\nabla_x \log S_{\hat{y}}(x, T))$

where $\hat{y}$ is the predicted label of $x$. This is similar to “one-step” adversarial examples; however, in contrast to minimizing the confidence of the true label, the confidence in the predicted label is maximized. Applied to in- and out-distribution examples, this perturbation is illustrated in Figure 1 and is meant to emphasize the difference in confidence. Afterwards, in- and out-distribution examples can be distinguished using simple thresholding on the predicted confidence, as shown in various experiments, e.g., on Cifar10 and Cifar100.

https://i.imgur.com/OjDVZ0B.png
Figure 1: Illustration of the proposed perturbation to amplify the difference in confidence between in- and out-distribution examples.
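The perturbation step can be sketched for a toy model. The example below assumes hypothetical linear logits $f(x) = Wx + b$ so that the gradient of the log-softmax has a closed form; a real network would use automatic differentiation instead. The names `odin_perturb` and `softmax_T` and all parameter values are illustrative only.

```python
import numpy as np

def softmax_T(logits, T):
    # Temperature-scaled softmax S(x, T) for a single example.
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def odin_perturb(x, W, b, T, eps):
    # Assumption: linear logits f(x) = W x + b (toy stand-in for a network).
    s = softmax_T(W @ x + b, T)
    y_hat = int(np.argmax(s))
    # Gradient of log S_yhat(x, T) w.r.t. x for linear logits:
    # (W[y_hat] - sum_j S_j W[j]) / T
    grad = (W[y_hat] - s @ W) / T
    # x~ = x - eps * sign(-grad): a small step that raises the
    # confidence in the predicted label.
    x_tilde = x - eps * np.sign(-grad)
    return x_tilde, y_hat
```

Detection would then threshold the maximum of `softmax_T` evaluated at the perturbed input; the threshold, `T`, and `eps` are hyperparameters that the summary does not specify.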

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Ovadia, Yaniv and Fertig, Emily and Ren, Jie and Nado, Zachary and Sculley, David and Nowozin, Sebastian and Dillon, Joshua V. and Lakshminarayanan, Balaji and Snoek, Jasper
arXiv e-Print archive - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 5 years ago

A common critique of deep learning is its brittleness off-distribution, combined with its tendency to give confident predictions for off-distribution inputs, as is seen in the case of adversarial examples. In response to this critique, a number of different methods have cropped up in recent years, that try to capture a model's uncertainty as well as its overall prediction. This paper tries to do a broad evaluation of uncertainty methods, and, particularly, to test how they perform on out of distribution data, including both data that is perturbed from its original values, and fully OOD data from ground-truth categories never seen during training. Ideally, we would want an uncertainty method that is less confident in its predictions as data is made more dissimilar from the distribution that the model is trained on. Some metrics the paper uses for capturing this are: 

- Brier Score (The squared difference between the predicted probabilities and the ground truth 0/1 labels, averaged over all examples)
- Negative Log Likelihood
- Expected Calibration Error (Predictions are grouped into buckets by predicted score; within a given bucket, the error is the absolute difference between the accuracy against ground truth labels and the average predicted score in that bucket, and the per-bucket errors are averaged with weights proportional to bucket size. This captures that you'd ideally want a lower predicted score in cases where you have low accuracy, and vice versa)
- Entropy - For labels that are fully out of distribution, and don't map to any of the model's categories, you can't directly calculate ground truth accuracy, but you can ideally ask for a model that has high entropy (close to uniform) probabilities over the classes it knows about when the image is drawn from an entirely different class
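The four metrics listed above can be written down compactly. The sketch below uses standard textbook definitions with equal-width confidence buckets for ECE; the bucket count and function names are assumptions for illustration, and the paper's exact evaluation code may differ.

```python
import numpy as np

def brier_score(probs, labels):
    # Squared error between predicted probabilities and one-hot labels,
    # summed over classes and averaged over examples.
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def negative_log_likelihood(probs, labels):
    # Mean negative log probability assigned to the true class.
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def expected_calibration_error(probs, labels, n_bins=10):
    # Bucket examples by confidence, then take the size-weighted average
    # of |accuracy - mean confidence| over the non-empty buckets.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

def predictive_entropy(probs):
    # Entropy of each predicted distribution; needs no labels, so it can be
    # computed on fully out-of-distribution inputs.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)
```

Lower is better for the first three; for fully OOD inputs, higher entropy is the desired behavior.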

The authors test over image datasets small (MNIST) and large (ImageNet and CIFAR10), as well as a categorical ad-click-prediction dataset. They came up with some interesting findings. 

https://i.imgur.com/EVnjS1R.png

1. More fully principled Bayesian estimation of posteriors over parameters, in the form of Stochastic Variational Inference, works well on MNIST, but quite poorly on either categorical data or higher dimensional image datasets 

https://i.imgur.com/3emTYNP.png

2. Temperature scaling, which basically performs a second supervised calibration using a hold-out set to push your probabilities towards true probabilities, performs well in-distribution but collapses fairly quickly off-distribution (which sort of makes sense given that it too is just another supervised method that can do poorly when off-distribution) 
3. In general, ensemble methods, where you train several models independently (in the paper, on the same data from different random initializations) and combine their predictions, with disagreement between members signaling uncertainty, perform the best across the bigger image models as well as the ad click model, likely because SVI (along with many other Bayesian methods) is too computationally intensive to get to work well on higher-dimensional data 
4. Overall, none of the methods worked particularly well, and even the best-performing ones were often confidently wrong off-distribution 
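For the ensemble approach in item 3, one common way to combine members is to average their predicted class probabilities and read uncertainty off the entropy of the average. The sketch below assumes that combination rule; it is not taken from the summary, and `member_probs` is a hypothetical array of already-computed softmax outputs.

```python
import numpy as np

def ensemble_predict(member_probs):
    # member_probs: shape (M, N, K) for M members, N examples, K classes.
    member_probs = np.asarray(member_probs, dtype=float)
    # Average the members' probabilities for each example.
    mean_probs = member_probs.mean(axis=0)
    # Entropy of the averaged prediction: grows when members disagree.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)
    return mean_probs, entropy
```

When members agree, the averaged distribution stays peaked and the entropy is low; when they disagree, the average flattens and the entropy rises.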

I think it's fair to say that we're far from where we wish we were when it comes to models that "know when they don't know," and this paper does a good job of highlighting that in specific fashion.

Deep Reinforcement Learning for Dialogue Generation
Li, Jiwei and Monroe, Will and Ritter, Alan and Jurafsky, Dan and Galley, Michel and Gao, Jianfeng
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 8 years ago

This paper builds on top of a bunch of existing ideas for building neural conversational agents so as to control against generic and repetitive responses.

Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering: Measured as the negative log likelihood of responding to the generated utterance with one of a list of hand-picked dull responses (the less likely a dull reply, the higher the reward).
2. Information flow: Consecutive responses from the same agent (person) should carry different information, measured as the negative log of the cosine similarity between their representations (lower similarity gives higher reward).
3. Semantic coherence: Mutual information between source and target (the response should make sense with respect to the query), $\log P(a|q) + \log P(q|a)$, where $a$ is the answer and $q$ is the question.
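To make the three rewards concrete, here is a minimal sketch that computes them from quantities a sequence-to-sequence model would provide. All inputs (log-probabilities, lengths, encoder vectors) are hypothetical placeholders, and the length normalization, the clipping of non-positive cosine values, and the weighted sum are assumptions of this example.

```python
import numpy as np

def ease_of_answering(dull_logprobs, dull_lengths):
    # Negative mean of length-normalized log-likelihoods of the dull replies:
    # the less likely a dull reply, the higher the reward.
    lp = np.asarray(dull_logprobs, dtype=float)
    n = np.asarray(dull_lengths, dtype=float)
    return float(-np.mean(lp / n))

def information_flow(h_prev, h_curr, eps=1e-8):
    # Negative log cosine similarity between two consecutive turns of the
    # same agent. Assumption: similarities <= 0 are clipped to eps.
    h_prev, h_curr = np.asarray(h_prev, float), np.asarray(h_curr, float)
    cos = h_prev @ h_curr / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr) + eps)
    return float(-np.log(np.clip(cos, eps, 1.0)))

def semantic_coherence(logp_a_given_q, len_a, logp_q_given_a, len_q):
    # Length-normalized forward and backward log-likelihoods, summed.
    return float(logp_a_given_q / len_a + logp_q_given_a / len_q)

def total_reward(r1, r2, r3, weights):
    # Weighted sum of the three rewards; the weights are hyperparameters.
    w1, w2, w3 = weights
    return w1 * r1 + w2 * r2 + w3 * r3
```

For instance, two identical turn encodings give an information-flow reward near zero, while dissimilar encodings give a positive reward.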

The model is pre-trained with the usual supervised objective function, taking the source to be the concatenation of the two previous utterances. Then there are two stages of policy gradient training, first with just a mutual information reward and then with a combination of all three. The policy network (the sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model, and their rewards are averaged. Gradients are then computed for the first $L$ tokens of the response using MLE and for the remaining $T-L$ tokens using policy gradients, with $L$ gradually annealed to zero (moving towards just the long-term reward).
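The mixed objective can be sketched as a scalar surrogate loss. This assumes per-token log-probabilities are already available for both the reference response and a sampled response, and that a simple baseline is subtracted from the reward (REINFORCE style); the linear annealing schedule and all names are hypothetical.

```python
import numpy as np

def mixed_loss(ref_logp, sampled_logp, reward, baseline, L):
    # ref_logp[t]: log-prob of the reference token at position t (MLE part).
    # sampled_logp[t]: log-prob of the sampled token at position t (RL part).
    ref_logp = np.asarray(ref_logp, dtype=float)
    sampled_logp = np.asarray(sampled_logp, dtype=float)
    # First L tokens: ordinary negative log likelihood.
    mle_part = -np.sum(ref_logp[:L])
    # Remaining T - L tokens: policy-gradient surrogate, weighted by the
    # baseline-subtracted reward.
    pg_part = -(reward - baseline) * np.sum(sampled_logp[L:])
    return float(mle_part + pg_part)

def anneal_L(T, step_every, iteration):
    # Assumed schedule: start at L = T and drop by one token every
    # `step_every` iterations until L reaches zero.
    return max(T - iteration // step_every, 0)
```

In an autodiff framework the log-probabilities would be differentiable tensors and the reward a constant, so differentiating this scalar yields the MLE gradient on the prefix and the policy gradient on the suffix.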

Evaluation is based on length of dialogue, diversity (distinct unigrams and bigrams) and human studies on:

1. Which of two outputs has better quality (single turn)
2. Which of two outputs is easier to respond to, and
3. Which of two conversations has better quality (multi turn).
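The diversity measure mentioned above (distinct unigrams and bigrams) can be sketched as follows. Whitespace tokenization and scaling by the total number of generated tokens are assumptions of this example; other normalizations are also in use.

```python
def distinct_n(responses, n):
    # Count distinct n-grams across all responses and scale by the total
    # number of generated tokens, so that longer outputs are not favored.
    ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0
```

A system that keeps producing the same generic reply scores low here, because repeated responses add tokens to the denominator without adding new n-grams.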

## Strengths

- Interesting results
- Avoids generic responses
- 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties and using those to fine-tune a pre-trained network using policy gradients is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.