Summaries from Empirical Methods on Natural Language Processing (EMNLP) on ShortScience.org

aclweb.org
scholar.google.com

Automatic Features for Essay Scoring - An Empirical Study
Dong, Fei and Zhang, Yue
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors investigate convolutional networks for essay scoring. They use a two-level convolution – first over words and then over sentences. Evaluation is performed on the Kaggle ASAP dataset, training separate models on individual topics, and also reporting some cross-topic results.

https://i.imgur.com/WmNqgGm.png

Globally Coherent Text Generation with Neural Checklist Models
Kiddon, Chloé and Zettlemoyer, Luke and Choi, Yejin
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They describe a neural model for text generation, which keeps track of a checklist of items that need to be mentioned in the text.

https://i.imgur.com/yKSIpza.png

The basic system is an encoder-decoder GRU model for text generation. On top of that, the model uses attention over items that need to be mentioned and items that have already been mentioned, both of which are encoded as vectors. An additional cost objective encourages the checklist to be filled by the end of the text. Evaluation is performed on recipe and dialogue generation.

A Neural Approach to Automated Essay Scoring
Taghipour, Kaveh and Ng, Hwee Tou
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors construct a neural network for automated essay scoring.

https://i.imgur.com/XTWGpmy.png

A convolution window of size 3 is passed over the text, and its output is used as input to an LSTM. The LSTM outputs are averaged over all timesteps, and a single value in the range [0,1] is predicted as the scaled-down score for the essay. They evaluate by measuring quadratic weighted kappa on the Kaggle essay scoring dataset.
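As a rough illustration of the evaluation metric, here is a minimal pure-Python sketch of quadratic weighted kappa between human and predicted integer scores. The function name and arguments are illustrative, not taken from the paper, and the sketch assumes at least two rating levels and non-degenerate score distributions (otherwise the denominator is zero).

```python
from collections import Counter

def quadratic_weighted_kappa(human, predicted, min_rating, max_rating):
    """Quadratic weighted kappa between two equal-length lists of integer ratings."""
    n_ratings = max_rating - min_rating + 1
    n_items = len(human)
    # Observed counts of (human, predicted) rating pairs.
    observed = Counter(zip(human, predicted))
    # Marginal histograms, used to build the chance-agreement counts.
    hist_h = Counter(human)
    hist_p = Counter(predicted)

    numerator = 0.0
    denominator = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = ((i - j) ** 2) / ((n_ratings - 1) ** 2)
            expected = hist_h[i] * hist_p[j] / n_items
            numerator += weight * observed[(i, j)]
            denominator += weight * expected
    return 1.0 - numerator / denominator
```

A model output in [0,1] would first have to be rescaled to the prompt's integer score range and rounded before being passed to this function.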

Named Entity Recognition for Novel Types by Transfer Learning
Qu, Lizhen and Ferraro, Gabriela and Zhou, Liyuan and Hou, Weiwei and Baldwin, Timothy
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors tackle the problem of domain adaptation for NER, where the label set of the target domain is different from the source domain.

They first train a CRF model on the source domain. Next, they train an LR classifier to predict labels in the target domain, based on the label scores predicted by the source CRF model. Finally, the weights from the classifier are used to initialise another CRF model, which is then fine-tuned on the target domain data.

https://i.imgur.com/zwSB7qN.png

Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics
Kiela, Douwe and Vero, Anita Lilla and Clark, Stephen
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

The authors compare different image recognition models and image data sources for multimodal word representation learning.

https://i.imgur.com/iHwCSks.png

Image recognition models used for vector generation

Experiments are performed on SimLex-999 (similarity) and MEN (relatedness). The performance of different models (AlexNet, GoogLeNet, VGGNet) is found to be quite similar, with VGGNet performing slightly better at the cost of requiring more computation. Using search engines for image sources gives good coverage; ImageNet performs quite well with VGGNet; ESP Game dataset gave the lowest performance. Combining visual and linguistic vectors was found to be beneficial on both English and Italian.

Numerically Grounded Language Models for Semantic Error Correction
Spithourakis, Georgios P. and Augenstein, Isabelle and Riedel, Sebastian
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They create an LSTM neural language model that 1) has better handling of numerical values, and 2) is conditioned on a knowledge base.

https://i.imgur.com/Rb6V1Hy.png

First, the numerical value of each token is given as an additional signal to the network at each time step. While the token “25” would normally be represented only by a word embedding, it now also gets an extra feature with the numerical value float(25). Second, they condition the language model on text in a knowledge base. All the information in the KB is converted to a string, passed through an LSTM and then used to condition the main LM.
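The numerical grounding step can be sketched as follows. This is only an illustration: the helper names, the toy embedding table, and the choice of 0.0 as the feature value for non-numeric tokens are assumptions, not details confirmed by the summary.

```python
import math

def numeric_value(token):
    """Return float(token) for numeric tokens, else 0.0 (the fallback is an assumption)."""
    try:
        value = float(token)
    except ValueError:
        return 0.0
    # Guard against strings such as "nan" or "inf", which float() also accepts.
    return value if math.isfinite(value) else 0.0

def grounded_input(token, embeddings, dim):
    """Concatenate the word embedding with the extra numerical feature."""
    # Unknown tokens fall back to a zero vector in this toy sketch.
    vector = embeddings.get(token, [0.0] * dim)
    return list(vector) + [numeric_value(token)]
```

For example, `grounded_input("25", {"25": [0.1, 0.2]}, dim=2)` yields the embedding followed by `25.0`, which is the vector the LSTM would receive at that time step.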

They evaluate on a dataset of 16,003 clinical records which come paired with small KB tuples of 20 possible attributes. The numerical grounding helps quite a bit, and the best results are obtained when the KB conditioning is also added.

Variational Neural Machine Translation
Zhang, Biao and Xiong, Deyi and Su, Jinsong and Duan, Hong and Zhang, Min
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Marek Rei 7 years ago

They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.

https://i.imgur.com/6yIEbDf.png

The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are pushed to be similar by minimising the Kullback-Leibler divergence between them, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.
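To make the training term concrete, here is a small sketch of the KL divergence between a posterior and a prior, under the assumption that both are diagonal Gaussians (the summary does not state the distribution family). The function and argument names are illustrative.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for diagonal Gaussians q = N(mu_q, var_q) and p = N(mu_p, var_p)."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        # Closed-form KL for one dimension, summed over independent dimensions.
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total
```

Here q would play the role of the posterior (conditioned on input and output) and p the prior (conditioned on the input only); the divergence is zero exactly when the two match.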

Deep Reinforcement Learning for Dialogue Generation
Li, Jiwei and Monroe, Will and Ritter, Alan and Jurafsky, Dan and Galley, Michel and Gao, Jianfeng
Empirical Methods on Natural Language Processing (EMNLP) - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Abhishek Das 7 years ago

This paper builds on top of a bunch of existing ideas for building neural conversational agents so as to control against generic and repetitive responses.

Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering: measured via the likelihood of responding to a generated turn with one of a list of hand-picked dull responses (the negative log-likelihood is the reward, so a lower likelihood of dull replies gives a higher reward).
2. Information flow: consecutive responses from the same agent (person) should carry different information, measured as the negative log of the cosine similarity between their encodings (lower similarity gives a higher reward).
3. Semantic coherence: mutual information between source and target (the response should make sense with respect to the query), computed as $\log P(a|q) + \log P(q|a)$, where $a$ is the answer and $q$ is the question.
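The three rewards can be sketched in a few lines of Python. This is a simplified reading, not the authors' code: the log-likelihoods are assumed to come from a separate sequence-to-sequence model, the length normalisation and the small clamp `eps` are assumptions added to keep the sketch well-defined, and all names are illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ease_of_answering_reward(dull_log_likelihoods, dull_lengths):
    """Negative mean length-normalised log-likelihood of the dull responses."""
    scores = [lp / n for lp, n in zip(dull_log_likelihoods, dull_lengths)]
    return -sum(scores) / len(scores)

def information_flow_reward(h_prev, h_curr, eps=1e-8):
    """Negative log cosine similarity between two consecutive turns of one agent."""
    # Clamp so that orthogonal or opposed encodings do not produce log(<=0).
    similarity = max(cosine_similarity(h_prev, h_curr), eps)
    return -math.log(similarity)

def semantic_coherence_reward(log_p_a_given_q, len_a, log_p_q_given_a, len_q):
    """Length-normalised forward plus backward log-likelihood."""
    return log_p_a_given_q / len_a + log_p_q_given_a / len_q
```

In training, a weighted sum of these three values would serve as the scalar reward for a sampled response.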

The model is pre-trained with the usual supervised objective, taking the concatenation of the two previous utterances as the source. Policy gradient training then proceeds in two stages: first with just the mutual information reward, and then with a combination of all three. The policy network (the sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from it, and their rewards are averaged. Gradients for the first L tokens of a response are computed with MLE and for the remaining T-L tokens with policy gradients, with L gradually annealed to zero (moving towards just the long-term reward).

Evaluation is based on dialogue length, diversity (distinct unigrams and bigrams) and human studies on:

1. Which of two outputs has better quality (single turn),
2. Which of two outputs is easier to respond to, and
3. Which of two conversations has better quality (multi turn).

## Strengths

- Interesting results
- Avoids generic responses
- 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineered approximate reward functions based on conversational properties and using those to fine-tune a pre-trained network using policy gradients is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.

Natural Language Comprehension with the EpiReader
Adam Trischler and Zheng Ye and Xingdi Yuan and Kaheer Suleman
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL

[link] Summary by Denny Britz 8 years ago

TLDR; The authors propose the "EpiReader" model for Question Answering / Machine Comprehension. The model consists of two modules: an Extractor that selects answer candidates (single words) using a Pointer network, and a Reasoner that ranks these candidates by estimating textual entailment. The model is trained end-to-end and works on cloze-style questions. The authors evaluate the model on the CBT and CNN datasets, where they beat Attention Sum Reader and MemNN architectures.


#### Notes

- In most architectures, the correct answer is among the top 5 candidates 95% of the time.
- Soft Attention is a problem in many architectures. Need a way to do hard attention.

Sequence-Level Knowledge Distillation
Kim, Yoon and Rush, Alexander M.
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Denny Britz 8 years ago

TLDR; The authors train a standard Neural Machine Translation (NMT) model (the teacher model) and distill it by having a smaller student model learn the distribution of the teacher model. They investigate three types of knowledge distillation for sequence models: 1. Word Level Distillation 2. Sequence Level Distillation and 3. Sequence Level Interpolation. Experiments on WMT'14 and IWSLT 2015 show that it is possible to significantly reduce the parameters of the model with only a minor loss in BLEU score. The experiments also demonstrate that the distillation techniques are largely complementary. Interestingly, the perplexity of distilled models is significantly higher than that of the baselines without leading to a loss in BLEU score.

### Key Points

- Knowledge Distillation: Learn a smaller student network from a larger teacher network.
- Approach 1 - Word Level KD: This is standard Knowledge Distillation applied to sequences where we match the student output distribution of each word to the teacher's using the cross-entropy loss.
- Approach 2 - Sequence Level KD: We want to mimic the distribution of a full sequence, not just per word. To do that we sample outputs from the teacher using beam search and then train the student on these "examples" using Cross Entropy. This is a very sparse approximation of the true objective.
- Approach 3 - Sequence-Level Interpolation: We train the student on a mixture of training data and teacher-generated data. We could use the approximation from #2 here, but that's not ideal because it doubles the size of the training data and leads to different targets conditioned on the same source. The solution is to generate a response that has high probability under the teacher model and is similar to the ground truth, and then have both mixture terms use it.
- Greedy Decoding with seq-level fine-tuned model behaves similarly to beam search on original model.
- Hypothesis: KD allows student to only model the mode of the teacher distribution, not wasting other parameters. Experiments show good evidence of this. Thus, greedy decoding has an easier time finding the true max whereas beam search was necessary to do that previously.
- Lower perplexity does not lead to better BLEU. Distilled models have significantly higher perplexity (22.7 vs 8.2) but have better BLEU (+4.2).
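As a minimal sketch of the word-level variant (Approach 1), the loss below sums, over target positions, the cross-entropy between the teacher's and the student's distributions over the vocabulary. Plain lists stand in for model outputs, and the name `word_level_kd_loss` and the `eps` smoothing term are illustrative assumptions.

```python
import math

def word_level_kd_loss(teacher_probs, student_probs, eps=1e-12):
    """Sum over positions of the cross-entropy H(teacher, student).

    Both arguments are lists with one probability distribution (a list of
    floats over the vocabulary) per target position.
    """
    loss = 0.0
    for q_t, p_t in zip(teacher_probs, student_probs):
        # Cross-entropy for one position: -sum_k q(k) * log p(k).
        loss -= sum(q * math.log(p + eps) for q, p in zip(q_t, p_t))
    return loss
```

Sequence-level distillation would instead replace the reference translation with the teacher's beam-search output and train the student on it with the ordinary cross-entropy loss.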

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian Vlad and Noseworthy, Michael and Charlin, Laurent and Pineau, Joelle
arXiv e-Print archive - 2016 via Local Bibsonomy
Keywords: dblp

[link] Summary by Shagun Sodhani 8 years ago

#### Introduction

* The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems (in an unsupervised setting).
* [Link to the paper](https://arxiv.org/abs/1603.08023)

#### Evaluation Metrics Considered

##### Word Based Similarity Metric

###### BLEU

* Analyses the co-occurrences of n-grams in the ground truth and the proposed responses.
* BLEU-N: N-gram precision for the entire dataset.
* Brevity penalty added to avoid bias towards short sentences.
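A simplified sketch of BLEU for a single candidate and a single reference is shown below. It uses clipped n-gram precision, a geometric mean over n-gram orders and the brevity penalty; it is sentence-level and unsmoothed, whereas the BLEU-N scores discussed here are computed over the entire dataset, so treat it as an illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def brevity_penalty(candidate, reference):
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)

def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)
```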

###### METEOR

* Create an explicit alignment between candidate and target response (using WordNet, stemmed tokens, etc.).
* Compute the harmonic mean of precision and recall between proposed and ground truth.

###### ROUGE

* F-measure based on Longest Common Subsequence (LCS) between candidate and target response.
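The LCS-based F-measure can be sketched with a standard dynamic program. The `beta` parameter, which weights recall against precision, defaults to 1.0 here as an assumption; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """F-measure over LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```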

##### Embedding Based Metric

###### Greedy Matching

* Each token in actual response is greedily matched with each token in predicted response based on cosine similarity of word embedding (and vice-versa).
* Total score is averaged over all words.

###### Embedding Average

* Calculate sentence level embedding by averaging word level embeddings
* Compare sentence level embeddings between candidate and target sentences.

###### Vector Extrema

* For each dimension in the word vector, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
* Idea is that by taking the maxima along each dimension, we can ignore the common words (which will be pulled towards the origin in the vector space).
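The three embedding-based metrics can be sketched on plain lists of word vectors as follows. The tie-breaking in `vector_extrema` and the symmetric averaging in `greedy_matching` are one reasonable reading of the descriptions above, and all names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def embedding_average(vectors):
    """Sentence embedding as the mean of its word vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def vector_extrema(vectors):
    """Per dimension, keep the value with the largest magnitude."""
    dim = len(vectors[0])
    return [max((v[d] for v in vectors), key=abs) for d in range(dim)]

def greedy_matching(candidate_vectors, reference_vectors):
    """Average best-match cosine similarity, computed in both directions."""
    def one_way(source, target):
        return sum(max(cosine(s, t) for t in target) for s in source) / len(source)
    return 0.5 * (one_way(candidate_vectors, reference_vectors)
                  + one_way(reference_vectors, candidate_vectors))
```

For the first two metrics, the final score is the cosine similarity between the candidate's and the reference's sentence-level vectors, e.g. `cosine(vector_extrema(cand), vector_extrema(ref))`.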

#### Dialogue Models Considered

##### Retrieval Models

###### TF-IDF

* Compute the TF-IDF vectors for each context and response in the corpus.
* C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response with the highest score.
* R-TFIDF computes the cosine similarity between the input context and each response directly.
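A toy version of both retrieval variants is sketched below on tokenised (context, response) pairs. The smoothed IDF formula, the helper names and the `mode` switch are assumptions made for illustration; the paper's exact weighting may differ.

```python
import math
from collections import Counter

def idf_table(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def sparse_cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query, pairs, mode="context"):
    """mode="context" mimics C-TFIDF, mode="response" mimics R-TFIDF."""
    contexts = [c for c, _ in pairs]
    responses = [r for _, r in pairs]
    idf = idf_table(contexts + responses)
    query_vec = tfidf(query, idf)
    keys = contexts if mode == "context" else responses
    scores = [sparse_cosine(query_vec, tfidf(k, idf)) for k in keys]
    best = max(range(len(pairs)), key=scores.__getitem__)
    return responses[best]
```

With `mode="context"` the query is compared against stored contexts and the response paired with the best match is returned; with `mode="response"` the query is compared against the candidate responses directly.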

###### Dual Encoder

* Two RNNs which respectively compute the vector representation of the input context and response.
* Then calculate the probability that given response is the ground truth response given the context.

##### Generative Models

###### LSTM language model

* LSTM model trained to predict the next word in the (context, response) pair.
* Given a context, model encodes it with the LSTM and generates a response using a greedy beam search procedure.

###### Hierarchical Recurrent Encoder-Decoder (HRED)

* Uses a hierarchy of encoders.
* Each utterance in the context passes through an 'utterance-level' encoder and the output of these encoders is passed through another 'context-level' encoder.
* Better handling of long-term dependencies as compared to the conventional Encoder-Decoder.

#### Observations

* Human survey to determine the correlation between human judgement on the quality of responses, and the score assigned by each metric.
* Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
* Best performing metric:
    * Using word-overlaps - BLEU-2 score
    * Using word embeddings - vector average
* Embedding-based metrics would benefit from a weighting of word saliency.
* BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.

Sequence-to-Sequence Learning as Beam-Search Optimization
Sam Wiseman and Alexander M. Rush
arXiv e-Print archive - 2016 via Local arXiv
Keywords: cs.CL, cs.LG, cs.NE, stat.ML

[link] Summary by Udibr 8 years ago

This paper is covered by the author in this [talk](https://github.com/udibr/notes/blob/master/Talk%20by%20Sasha%20Rush%20-%20Interpreting%2C%20Training%2C%20and%20Distilling%20Seq2Seq%E2%80%A6.pdf).