Welcome to ShortScience.org!

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1584 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Cutting out the Middle-Man: Training and Evaluating Energy-Based Models without Sampling

Grathwohl, Will and Wang, Kuan-Chieh and Jacobsen, Jorn-Henrik and Duvenaud, David and Zemel, Richard

- 2020 via Local Bibsonomy

Keywords: bayesian, generative-models, energy-models, uncertainty

The authors introduce a new, sampling-free method for training and evaluating energy-based models (aka EBMs, aka unnormalized density models). There are two broad approaches for training EBMs. Sampling-based approaches like contrastive divergence try to estimate the gradient of the likelihood with MCMC, but can be biased if the chain is not sufficiently long. The speed of training also greatly depends on the sampling parameters. Other approaches, like score matching, avoid sampling by solving a surrogate objective that approximates the likelihood. However, using a surrogate objective also introduces bias in the solution. In any case, comparing goodness of fit of different models is challenging, regardless of how the models were trained.

The authors introduce a measure of probability distance between distributions $p$ and $q$ called the Learned Stein Discrepancy ($LSD$):

$$ LSD(f_{\phi}, p, q) = \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T f_{\phi}(x) + Tr(\nabla_x f_{\phi} (x))] $$

This measure is derived from the Stein Discrepancy $SD(p,q)$. Note that like the $SD$, the $LSD$ is 0 iff $p = q$. Typically, $p$ is the data distribution and $q$ is the learned approximate distribution (an EBM), although this doesn't have to be the case. Note also that this objective only requires a differentiable unnormalized distribution $\tilde{q}$, and does not require MCMC sampling or computation of the normalizing constant $Z$, since $\nabla_x \log q(x) = \nabla_x \log \tilde{q}(x) - \nabla_x \log Z = \nabla_x \log \tilde{q}(x)$.

$f_\phi$ is known as the critic function, and minimizing the $LSD$ with respect to $\phi$ (i.e. with gradient descent) over a bounded space of functions $\mathcal{F}$ can approximate the $SD$ over that space.

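
As a rough illustration of why only the unnormalized density is needed, the sketch below estimates the $LSD$ in one dimension with plain Python. Everything in it is an assumption made for illustration and is not taken from the paper: the data distribution $p = \mathcal{N}(0, 1)$, the model $\tilde{q}(x) \propto \exp(-(x - \mu)^2 / 2)$, the fixed constant critic $f(x) = 1$, and the use of central finite differences in place of automatic differentiation.

```python
import random


def grad(fn, x, h=1e-4):
    """Central finite-difference derivative of a scalar function of one variable."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


def lsd_estimate(samples, log_q_tilde, critic):
    """Monte Carlo estimate of E_p[ d/dx log q(x) * f(x) + d/dx f(x) ] in 1-D."""
    total = 0.0
    for x in samples:
        # The score is the same for q and q_tilde because log Z is a constant in x.
        score = grad(log_q_tilde, x)
        # In one dimension the trace of the Jacobian of f is simply f'(x).
        total += score * critic(x) + grad(critic, x)
    return total / len(samples)


def make_log_q_tilde(mu, log_z=0.0):
    """Unnormalized Gaussian log-density; log_z only shifts it by a constant."""
    return lambda x: -0.5 * (x - mu) ** 2 - log_z


def constant_critic(x):
    """Hypothetical fixed critic, chosen only to keep the example small."""
    return 1.0


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # samples from p = N(0, 1)

lsd_shifted_model = lsd_estimate(data, make_log_q_tilde(mu=1.0), constant_critic)
lsd_shifted_with_z = lsd_estimate(data, make_log_q_tilde(mu=1.0, log_z=3.0), constant_critic)
lsd_matching_model = lsd_estimate(data, make_log_q_tilde(mu=0.0), constant_critic)
```

With this particular critic the estimate reduces to $\mu$ minus the sample mean, so it should come out close to 1 for the shifted model and close to 0 for the matching one, and the constant `log_z` should make no difference beyond floating-point noise.
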
The authors choose to define the function space $\mathcal{F} = \{ f: \mathbb{E}_{p(x)} [f(x)^Tf(x)] < \infty \}$, which is convenient because it can be optimized by introducing a simple L2 regularizer on the critic's output: $\mathcal{R}_\lambda (f_\phi) = \lambda \mathbb{E}_{p(x)} [f_\phi(x)^T f_\phi(x)]$. Since the trace of the Jacobian $\nabla_x f_\phi(x)$ is expensive to compute and backpropagate through, the authors use a single-sample Monte Carlo estimate $Tr(\nabla_x f_\phi(x)) \approx \mathbb{E}_{\mathcal{N}(\epsilon|0,1)} [\epsilon^T \nabla_x f_\phi(x) \epsilon] $, which is more efficient since $\epsilon^T \nabla_x f_\phi(x)$ is a vector-Jacobian product. The overall objective is thus the following:

$$ \text{arg} \max_\phi \mathbb{E}_{p(x)} [\nabla_x \log q(x)^T f_{\phi}(x) + \mathbb{E}_{\epsilon} [\epsilon^T \nabla_x f_{\phi} (x) \epsilon] - \lambda f_\phi(x)^T f_\phi(x)] $$

It is possible to compare two different EBMs $q_1$ and $q_2$ by optimizing the above objective for two different critic parameters $\phi_1$ and $\phi_2$, using the training and validation data for critic optimization (then evaluating on the held-out test set). Note that when computing the $LSD$ on the test set, the exact trace can be computed instead of the Monte Carlo approximation to reduce variance, since gradients are no longer required. The model whose $LSD$ is closer to 0 has achieved a better fit. Similarly, a hypothesis test using the $LSD$ can be used to test if $p = q$ for the data distribution $p$ and model distribution $q$.

The authors then show how EBM parameters $\theta$ can actually be optimized by gradient descent on the $LSD$ objective, in a minimax problem that is similar to the problem of optimizing a generative adversarial network (GAN). For given $\theta$, you first optimize the critic $f_\phi$ w.r.t. $\phi$ to try to get the $LSD(f_\phi, p, q_\theta)$ close to its theoretical optimum with the current $q_\theta$, then you take a single gradient step $\nabla_\theta LSD$ to minimize the $LSD$.

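
The trace estimate used above is usually called Hutchinson's estimator. A minimal pure-Python sketch follows, assuming a linear critic $f(x) = Ax$ so that its Jacobian is just the matrix $A$; the matrix values and the number of noise samples are made up for illustration (the paper uses a single sample per step, many are averaged here only to make the convergence visible).

```python
import random


def quadratic_form(eps, matrix):
    """Compute eps^T A eps by first forming the vector-matrix product eps^T A."""
    n = len(eps)
    eps_t_a = [sum(eps[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return sum(eps_t_a[j] * eps[j] for j in range(n))


def hutchinson_trace(matrix, num_samples, rng):
    """Average eps^T A eps over standard-normal eps; its expectation is Tr(A)."""
    n = len(matrix)
    total = 0.0
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += quadratic_form(eps, matrix)
    return total / num_samples


def exact_trace(matrix):
    """Sum of the diagonal, usable at test time when no gradients are needed."""
    return sum(matrix[i][i] for i in range(len(matrix)))


jacobian = [[2.0, 1.0], [0.0, 3.0]]  # hypothetical Jacobian of f(x) = A x
estimate = hutchinson_trace(jacobian, 20000, random.Random(0))
```

The estimator is unbiased because $\mathbb{E}[\epsilon \epsilon^T] = I$, so off-diagonal entries such as the 1.0 above average out; in a real model the product $\epsilon^T \nabla_x f_\phi(x)$ would come from reverse-mode autodiff rather than an explicit matrix.
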
They show some experiments that indicate that this works pretty well. One thing that was not clear to me when reading this paper is whether the $LSD(f_\phi,p,q)$ should be minimized or maximized with respect to $\phi$ to get it close to the true $SD(p,q)$. Although it is possible for $LSD$ to be above or below 0 for a given choice of $q$ and $f_\phi$, the problem can always be formulated as minimization by simply changing the sign of $f_\phi$ at the beginning such that the $LSD$ is positive (or as maximization by making it negative).
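
To make the alternating scheme and the sign question concrete, here is a toy version in which the critic can be solved in closed form. It is a sketch under strong assumptions that are mine, not the paper's: a one-dimensional model $\tilde{q}_\theta(x) \propto \exp(-(x - \theta)^2 / 2)$, a constant critic $f(x) = b$ instead of a neural network, and hypothetical values for $\lambda$, the step size, and the data.

```python
import random


def optimal_constant_critic(theta, data_mean, lam):
    """Maximizer over b of the regularized objective (theta - mean) * b - lam * b**2."""
    return (theta - data_mean) / (2.0 * lam)


def lsd_with_constant_critic(theta, data_mean, b):
    """LSD for f(x) = b: E_p[(theta - x) * b]; the trace term is zero for a constant."""
    return (theta - data_mean) * b


def train_theta(data, theta0, lam=0.5, lr=0.5, steps=50):
    """Alternate between fitting the critic and one gradient step on theta."""
    data_mean = sum(data) / len(data)
    theta = theta0
    for _ in range(steps):
        b = optimal_constant_critic(theta, data_mean, lam)  # inner step: fit the critic
        theta -= lr * b  # outer step: d LSD / d theta equals b for a fixed critic
    return theta


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]  # samples from p = N(2, 1)
data_mean = sum(data) / len(data)
theta_fit = train_theta(data, theta0=5.0)
b_star = optimal_constant_critic(5.0, data_mean, lam=0.5)
```

In this toy setting the maximizing critic always yields a non-negative $LSD$, namely $(\theta - \bar{x})^2 / (2\lambda)$, and flipping the sign of $b$ yields the same magnitude with a negative sign, which matches the remark above that the direction of optimization is a matter of convention.
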

Deep Reinforcement Learning for Dialogue Generation

Li, Jiwei and Monroe, Will and Ritter, Alan and Jurafsky, Dan and Galley, Michel and Gao, Jianfeng

Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy

Keywords: dblp

This paper builds on a number of existing ideas for neural conversational agents in order to avoid generic and repetitive responses. Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and then fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering: measured as the likelihood of responding to a query with one of a list of hand-picked dull responses (a more negative log likelihood gives a higher reward).
2. Information flow: consecutive responses from the same agent (person) should carry different information, measured as the negative log of the cosine similarity between them (lower similarity gives a higher reward).
3. Semantic coherence: mutual information between source and target (the response should make sense with respect to the query), $P(a|q) + P(q|a)$, where $a$ is the answer and $q$ is the question.

The model is pre-trained with the usual supervised objective function, taking the concatenation of the two previous utterances as source. Then they have two stages of policy gradient training, first with just a mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model, and their rewards are averaged. Gradients are computed for the first L tokens of the response using MLE and for the remaining T-L tokens with policy gradients, with L being gradually annealed to zero (moving towards just the long-term reward).

Evaluation is done based on length of dialogue, diversity (distinct unigrams, bigrams) and human studies on:

1. Which of two outputs has better quality (single turn)
2. Which of two outputs is easier to respond to
3. Which of two conversations has better quality (multi turn)

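
The information-flow reward and the way the three rewards are combined can be sketched in a few lines. This is an illustration under assumptions: the vectors stand in for encoder representations of two consecutive turns, the clipping constant `eps` is my addition to keep the logarithm finite, and the equal weights are hypothetical placeholders rather than the values used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def information_flow_reward(h_prev, h_next, eps=1e-8):
    """Negative log cosine similarity between two consecutive turns of one agent."""
    # Clipping at eps (an assumption) avoids log of zero or of a negative similarity.
    similarity = max(cosine_similarity(h_prev, h_next), eps)
    return -math.log(similarity)


def combined_reward(ease, flow, coherence, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three rewards; the default weights are hypothetical."""
    w_ease, w_flow, w_coherence = weights
    return w_ease * ease + w_flow * flow + w_coherence * coherence


repeated_turn = information_flow_reward([1.0, 0.0], [1.0, 0.0])
new_information = information_flow_reward([1.0, 0.0], [1.0, 1.0])
```

Repeating the same representation gives a reward of zero, while a turn that points in a different direction is rewarded, which is the behaviour the second reward is meant to encourage.
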
## Strengths

- Interesting results
- Avoids generic responses
- 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties and using those to fine-tune a pre-trained network using policy gradients is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.

Deep Residual Learning for Image Recognition

He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian

arXiv e-Print archive - 2015 via Local Bibsonomy

Keywords: dblp

Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the additional layers should "simply" learn identities. This seems not to be so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0 which is simpler
* Loss in information flow in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% top-5 error (ensemble)
* CIFAR-10: 6.43% error
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
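
The point that "learning the identity becomes learning 0" can be checked with a tiny fully connected stand-in for a residual block. This is only a sketch: the real blocks use convolutions, batch normalization and a ReLU after the addition, all of which are left out here, and the function names are made up for illustration.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vector]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def plain_block(x, w1, w2):
    """Two stacked layers without a shortcut: F(x) = W2 relu(W1 x)."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x, w1, w2):
    """The same two layers with an identity shortcut: y = x + F(x)."""
    residual = plain_block(x, w1, w2)
    return [xi + ri for xi, ri in zip(x, residual)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]
identity_output = residual_block(x, zeros, zeros)  # all-zero weights give the identity
plain_output = plain_block(x, zeros, zeros)        # the plain block collapses to zero
```

With all weights at zero the residual block passes its input through unchanged, whereas the plain block would have to learn a specific non-trivial weight setting to do the same.
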

Understanding deep learning requires rethinking generalization

Chiyuan Zhang and Samy Bengio and Moritz Hardt and Benjamin Recht and Oriol Vinyals

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.LG

**First published:** 2016/11/10 (7 years ago)

**Abstract:** Despite their massive size, successful deep artificial neural networks can
exhibit a remarkably small difference between training and test performance.
Conventional wisdom attributes small generalization error either to properties
of the model family, or to the regularization techniques used during training.
Through extensive systematic experiments, we show how these traditional
approaches fail to explain why large neural networks generalize well in
practice. Specifically, our experiments establish that state-of-the-art
convolutional networks for image classification trained with stochastic
gradient methods easily fit a random labeling of the training data. This
phenomenon is qualitatively unaffected by explicit regularization, and occurs
even if we replace the true images by completely unstructured random noise. We
corroborate these experimental findings with a theoretical construction showing
that simple depth two neural networks already have perfect finite sample
expressivity as soon as the number of parameters exceeds the number of data
points as it usually does in practice.
We interpret our experimental findings by comparison with traditional models.

This paper deals with the question of what and how exactly CNNs learn, considering the fact that they usually have more trainable parameters than data points on which they are trained.

When the authors write "deep neural networks", they are talking about Inception V3, AlexNet and MLPs.

## Key contributions

* Deep neural networks easily fit random labels (achieving a training error of 0 and a test error which is just randomly guessing labels as expected). $\Rightarrow$ Those architectures can simply brute-force memorize the training data.
* Deep neural networks fit random images (e.g. Gaussian noise) with 0 training error. The authors conclude that VC-dimension / Rademacher complexity, and uniform stability are bad explanations for generalization capabilities of neural networks.
* The authors give a construction for a 2-layer network with $p = 2n+d$ parameters - where $n$ is the number of samples and $d$ is the dimension of each sample - which can easily fit any labeling. (Finite sample expressivity). See section 4.

## What I learned

* Any measure $m$ of the generalization capability of classifiers $H$ should take the percentage of corrupted labels ($p_c \in [0, 1]$, where $p_c =0$ is a perfect labeling and $p_c=1$ is totally random) into account: If $p_c = 1$, then $m()$ should be 0, too, as it is impossible to learn something meaningful with totally random labels.
* We seem to have built models which work well on image data in general, not specifically on "natural" / meaningful images as we thought.

## Funny

> deep neural nets remain mysterious for many reasons

> Note that this is not exactly simple as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes in on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call.

## See also

* [Deep Nets Don't Learn Via Memorization](https://openreview.net/pdf?id=rJv6ZgHYg)
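
The finite-sample expressivity result can be tried out directly. The sketch below follows my reading of the idea behind the construction: project every sample onto a random direction $a$, interleave one bias $b_j$ between consecutive projections, and solve the resulting lower-triangular system for the output weights $w$. Placing the biases at midpoints, the data sizes, and the function names are my own choices, and the code assumes the projections are all distinct.

```python
import random


def fit_two_layer_relu(xs, ys, rng):
    """Fit c(x) = sum_j w_j * relu(a.x - b_j) exactly, using d + n + n parameters."""
    n, d = len(xs), len(xs[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # d projection weights
    z = [sum(ai * xi for ai, xi in zip(a, x)) for x in xs]
    order = sorted(range(n), key=lambda i: z[i])
    zs = [z[i] for i in order]
    # Interleave the biases: b_1 < z_1 < b_2 < z_2 < ... (assumes distinct projections).
    b = [zs[0] - 1.0] + [(zs[j - 1] + zs[j]) / 2.0 for j in range(1, n)]
    # relu(zs[i] - b[j]) is zero for j > i, so forward substitution solves for w.
    w = []
    for i in range(n):
        partial = sum(w[j] * (zs[i] - b[j]) for j in range(i))
        w.append((ys[order[i]] - partial) / (zs[i] - b[i]))
    return a, b, w


def predict(x, a, b, w):
    """Evaluate the two-layer ReLU network on one sample."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    return sum(wj * max(z - bj, 0.0) for wj, bj in zip(w, b))


rng = random.Random(0)
n_samples, dim = 8, 3
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
ys = [rng.choice([-1.0, 1.0]) for _ in range(n_samples)]  # completely random labels
a, b, w = fit_two_layer_relu(xs, ys, rng)
max_error = max(abs(predict(x, a, b, w) - y) for x, y in zip(xs, ys))
```

The network reproduces the random labels up to floating-point error while using exactly $2n + d$ numbers, which is the memorization capacity the summary refers to; it says nothing about how such a solution would behave on unseen data.
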

Deep contextualized word representations

Matthew E. Peters and Mark Neumann and Mohit Iyyer and Matt Gardner and Christopher Clark and Kenton Lee and Luke Zettlemoyer

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CL

**First published:** 2018/02/15 (6 years ago)

**Abstract:** We introduce a new type of deep contextualized word representation that
models both (1) complex characteristics of word use (e.g., syntax and
semantics), and (2) how these uses vary across linguistic contexts (i.e., to
model polysemy). Our word vectors are learned functions of the internal states
of a deep bidirectional language model (biLM), which is pre-trained on a large
text corpus. We show that these representations can be easily added to existing
models and significantly improve the state of the art across six challenging
NLP problems, including question answering, textual entailment and sentiment
analysis. We also present an analysis showing that exposing the deep internals
of the pre-trained network is crucial, allowing downstream models to mix
different types of semi-supervision signals.

This paper introduces a deep universal word embedding based on a bidirectional LM (in this case, a biLSTM). First, words are embedded with a CNN-based, character-level, context-free token embedding into $x_k^{LM}$, and then each sentence is processed by a biLSTM, maximizing the log-likelihood of a word given its forward and backward context (much like a normal language model).

The innovation is in taking the output of each layer of the LSTM ($h_{k,j}^{LM}$ being the output at layer $j$)

$$ \begin{align} R_k &= \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1 \ldots L \} \\ &= \{h_{k,j}^{LM} | j = 0 \ldots L \} \end{align} $$

and allowing the user to learn their own task-specific weighted sum of these hidden states as the embedding:

$$ ELMo_k^{task} = \gamma^{task} \sum_{j=0}^L s_j^{task} h_{k,j}^{LM} $$

The authors show that this weighted sum is better than taking only the top LSTM output (as in their previous work or in CoVe) because it allows capturing syntactic information in the lower layer of the LSTM and semantic information in the higher layer. The table below shows that the second layer is more useful for the semantic task of word sense disambiguation, and the first layer is more useful for the syntactic task of POS tagging.

https://i.imgur.com/dKnyvAa.png

On other benchmarks, they show it is also better than taking the average of the layers (which could be done by setting $\gamma = 1$)

https://i.imgur.com/f78gmKu.png

To add the embeddings to your supervised model, ELMo is concatenated with your context-free embeddings, $[ x_k; ELMo_k^{task} ]$. It can also be concatenated with the output of your RNN model, $[ h_k; ELMo_k^{task} ]$, which can show improvements on the same benchmarks

https://i.imgur.com/eBqLe8G.png

Finally, they show that adding ELMo to a competitive but simple baseline gets SOTA (at the time) on many NLP benchmarks

https://i.imgur.com/PFUlgh3.png

It's all open-source and there's a tutorial [here](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md)
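
The task-specific weighted sum is simple enough to write out by hand. The sketch below is not the AllenNLP implementation: it assumes the weights $s_j$ are obtained from a softmax over raw scores, uses short made-up vectors in place of real biLM states, and the function names are hypothetical.

```python
import math


def softmax(raw_scores):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(raw_scores)
    exps = [math.exp(score - shift) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


def elmo_mix(layers, raw_scores, gamma):
    """gamma * sum_j s_j * h_j for one token, with s = softmax(raw_scores)."""
    weights = softmax(raw_scores)
    dim = len(layers[0])
    return [
        gamma * sum(weights[j] * layers[j][i] for j in range(len(layers)))
        for i in range(dim)
    ]


# Hypothetical states for one token: layer 0 (token embedding) plus two LSTM layers.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
average_mix = elmo_mix(layers, raw_scores=[0.0, 0.0, 0.0], gamma=1.0)
top_layer_mix = elmo_mix(layers, raw_scores=[0.0, 0.0, 100.0], gamma=2.0)
```

Equal raw scores reduce the mix to a plain average of the layers, and a very large score on the last layer approaches "top layer only" scaled by $\gamma$; the learned weights sit somewhere between these two extremes for each task.
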
