How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian Vlad and Noseworthy, Michael and Charlin, Laurent and Pineau, Joelle
arXiv e-Print archive, 2016
#### Introduction
* The paper explores the strengths and weaknesses of evaluation metrics for end-to-end dialogue systems in the unsupervised setting.
* [Link to the paper](https://arxiv.org/abs/1603.08023)
#### Evaluation Metrics Considered
##### Word-Based Similarity Metrics
###### BLEU
* Analyses the co-occurrences of n-grams in the ground truth and the proposed responses.
* BLEU-N: geometric mean of the n-gram precisions (up to order N), computed over the entire dataset.
* Brevity penalty added to avoid bias towards short sentences.
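A minimal per-sentence sketch of BLEU-2 using NLTK; the paper computes BLEU at the corpus level, so the smoothing and single-pair setup here are illustrative assumptions.

```python
# Per-sentence BLEU-2 sketch with NLTK (the paper uses corpus-level BLEU).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i am going to the store tonight".split()
candidate = "i am going to the shop tonight".split()

# weights=(0.5, 0.5) -> BLEU-2: geometric mean of 1-gram and 2-gram precision,
# multiplied by the brevity penalty.
score = sentence_bleu(
    [reference], candidate,
    weights=(0.5, 0.5),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2 = {score:.3f}")
```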
###### METEOR
* Creates an explicit alignment between the candidate and target responses (using WordNet synonyms, stemmed tokens, etc.).
* Computes a harmonic mean of precision and recall between the proposed and ground-truth responses.
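For reference, the original METEOR formulation uses a recall-weighted harmonic mean; a trivial sketch (the alignment step that produces precision and recall is omitted):

```python
# Recall-weighted harmonic mean from the original METEOR formulation
# (Banerjee & Lavie, 2005); alignment-based precision/recall assumed given.
def meteor_f_mean(precision: float, recall: float) -> float:
    if precision == 0 or recall == 0:
        return 0.0
    return (10 * precision * recall) / (recall + 9 * precision)

print(meteor_f_mean(0.8, 0.6))  # ~0.615
```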
###### ROUGE
* F-measure based on Longest Common Subsequence (LCS) between candidate and target response.
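A small sketch of ROUGE-L as an LCS-based F-measure; the balanced `beta=1.0` is an assumption, since ROUGE allows weighting recall more heavily.

```python
# ROUGE-L sketch: F-measure over the Longest Common Subsequence (LCS)
# of candidate and reference tokens.
def lcs_length(a, b):
    # Classic dynamic-programming LCS.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.0):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

print(rouge_l("the cat sat on the mat".split(), "the cat lay on the mat".split()))
```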
##### Embedding-Based Metrics
###### Greedy Matching
* Each token in the actual response is greedily matched to the most similar token in the predicted response, based on cosine similarity of word embeddings (and vice versa).
* The total score is the average over all words, in both directions.
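A sketch of greedy matching, assuming a dictionary `emb` of pre-trained word embeddings (e.g. word2vec vectors) has already been loaded:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def greedy_match(src_tokens, tgt_tokens, emb):
    # For each source token, take its best cosine match among the target tokens.
    tgt = [t for t in tgt_tokens if t in emb]
    if not tgt:
        return 0.0
    scores = [max(cosine(emb[s], emb[t]) for t in tgt)
              for s in src_tokens if s in emb]
    return sum(scores) / len(scores) if scores else 0.0

def greedy_matching_score(candidate, reference, emb):
    # Symmetric: average the candidate->reference and reference->candidate directions.
    return 0.5 * (greedy_match(candidate, reference, emb)
                  + greedy_match(reference, candidate, emb))
```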
###### Embedding Average
* Calculates a sentence-level embedding by averaging the word-level embeddings.
* Compares the sentence-level embeddings of the candidate and target sentences via cosine similarity.
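A sketch of the embedding-average metric under the same assumed `emb` dictionary:

```python
import numpy as np

def sentence_avg(tokens, emb):
    # Mean of the word vectors that have an embedding; None if no token is covered.
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else None

def embedding_average_score(candidate, reference, emb):
    c, r = sentence_avg(candidate, emb), sentence_avg(reference, emb)
    if c is None or r is None:
        return 0.0
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r) + 1e-8))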
###### Vector Extrema
* For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding.
* The idea is that by taking the extrema along each dimension, we can ignore common words, which are pulled towards the origin in the vector space.
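A sketch of vector extrema under the same assumed `emb` dictionary, keeping the value with the largest magnitude per dimension and comparing the resulting sentence vectors with cosine similarity:

```python
import numpy as np

def extrema_vector(tokens, emb):
    vecs = np.array([emb[t] for t in tokens if t in emb])
    if vecs.size == 0:
        return None
    max_v, min_v = vecs.max(axis=0), vecs.min(axis=0)
    # Per dimension, keep whichever of max/min has the larger absolute value.
    return np.where(np.abs(max_v) >= np.abs(min_v), max_v, min_v)

def vector_extrema_score(candidate, reference, emb):
    c, r = extrema_vector(candidate, emb), extrema_vector(reference, emb)
    if c is None or r is None:
        return 0.0
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r) + 1e-8))
```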
#### Dialogue Models Considered
##### Retrieval Models
###### TF-IDF
* Compute the TF-IDF vectors for each context and response in the corpus.
* C-TFIDF computes the cosine similarity between the input context and every context in the corpus, and returns the response associated with the highest-scoring context.
* R-TFIDF computes the cosine similarity between the input context and each candidate response directly, and returns the highest-scoring response.
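A sketch of both retrieval variants using scikit-learn; the toy `contexts`/`responses` lists stand in for the training corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy parallel corpus of (context, response) pairs.
contexts = ["how are you ?", "what time is it ?"]
responses = ["i am fine , thanks .", "it is almost noon ."]

vectorizer = TfidfVectorizer()
context_vecs = vectorizer.fit_transform(contexts)
response_vecs = vectorizer.transform(responses)

def c_tfidf(input_context):
    # Match the input context against all training contexts,
    # return the response paired with the most similar context.
    q = vectorizer.transform([input_context])
    best = cosine_similarity(q, context_vecs).argmax()
    return responses[best]

def r_tfidf(input_context):
    # Match the input context directly against the candidate responses.
    q = vectorizer.transform([input_context])
    best = cosine_similarity(q, response_vecs).argmax()
    return responses[best]
```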
###### Dual Encoder
* Two RNNs respectively compute vector representations of the input context and of a candidate response.
* The model then estimates the probability that the candidate is the ground-truth response for that context.
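A minimal PyTorch sketch of the dual-encoder scoring rule sigmoid(c^T M r); the LSTM cell and layer sizes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def encode(self, token_ids):
        _, (h, _) = self.rnn(self.embed(token_ids))
        return h[-1]                      # final hidden state, shape (batch, hidden)

    def forward(self, context_ids, response_ids):
        c = self.encode(context_ids)      # context vector
        r = self.encode(response_ids)     # response vector
        # Probability that the response is the true next utterance for the context.
        return torch.sigmoid((c @ self.M * r).sum(dim=-1))
```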
##### Generative Models
###### LSTM language model
* LSTM model trained to predict the next word in the (context, response) pair.
* Given a context, the model encodes it with the LSTM and generates a response using a greedy beam-search procedure.
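A sketch of greedy (beam width 1) decoding from a trained language model; the `lm(tokens, state)` interface returning next-token logits and a recurrent state is a hypothetical stand-in, not the paper's implementation.

```python
import torch

def greedy_decode(lm, context_ids, eos_id, max_len=30):
    # Run the LM over the context, then emit one token at a time,
    # always picking the highest-probability word (assumes batch size 1).
    logits, state = lm(context_ids, state=None)
    token = logits[:, -1].argmax(dim=-1, keepdim=True)
    output = []
    for _ in range(max_len):
        output.append(token)
        if token.item() == eos_id:
            break
        logits, state = lm(token, state=state)
        token = logits[:, -1].argmax(dim=-1, keepdim=True)
    return torch.cat(output, dim=1)
```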
###### Hierarchical Recurrent Encoder-Decoder (HRED)
* Uses a hierarchy of encoders.
* Each utterance in the context is passed through an 'utterance-level' encoder, and the outputs of these encoders are fed to a 'context-level' encoder that summarises the dialogue so far; the decoder then generates the response conditioned on this summary.
* Better handling of long-term dependencies as compared to the conventional Encoder-Decoder.
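A rough PyTorch sketch of the HRED hierarchy; GRU cells and layer sizes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, utt_hidden=512, ctx_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utterance_enc = nn.GRU(emb_dim, utt_hidden, batch_first=True)
        self.context_enc = nn.GRU(utt_hidden, ctx_hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, ctx_hidden, batch_first=True)
        self.out = nn.Linear(ctx_hidden, vocab_size)

    def forward(self, context_utterances, response_ids):
        # context_utterances: list of (batch, seq_len) token tensors, one per turn.
        utt_vectors = []
        for utt in context_utterances:
            _, h = self.utterance_enc(self.embed(utt))
            utt_vectors.append(h[-1])                   # (batch, utt_hidden)
        # The context-level encoder runs over the sequence of utterance vectors.
        _, ctx_h = self.context_enc(torch.stack(utt_vectors, dim=1))
        # The decoder is initialised from the context summary and predicts the response.
        dec_out, _ = self.decoder(self.embed(response_ids), ctx_h)
        return self.out(dec_out)                        # (batch, resp_len, vocab)
```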
#### Observations
* A human survey measures the correlation between human judgements of response quality and the score assigned by each metric.
* Metrics (especially BLEU-4 and BLEU-3) correlate poorly with human evaluation.
* Best-performing metrics:
    * Among word-overlap metrics: BLEU-2.
    * Among embedding-based metrics: vector (embedding) average.
* Embedding-based metrics would benefit from a weighting of word saliency.
* BLEU could still be a good evaluation metric in constrained tasks like mapping dialogue acts to natural language sentences.