Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems
Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, Jason Weston
arXiv e-Print archive - 2015
Keywords:
cs.CL, cs.LG
First published: 2015/11/21
Abstract: A long-term goal of machine learning is to build intelligent conversational
agents. One recent popular approach is to train end-to-end models on a large
amount of real dialog transcripts between humans (Sordoni et al., 2015; Vinyals
& Le, 2015; Shang et al., 2015). However, this approach leaves many questions
unanswered as an understanding of the precise successes and shortcomings of
each model is hard to assess. A contrasting recent proposal are the bAbI tasks
(Weston et al., 2015b) which are synthetic data that measure the ability of
learning machines at various reasoning tasks over toy language. Unfortunately,
those tests are very small and hence may encourage methods that do not scale.
In this work, we propose a suite of new tasks of a much larger scale that
attempt to bridge the gap between the two regimes. Choosing the domain of
movies, we provide tasks that test the ability of models to answer factual
questions (utilizing OMDB), provide personalization (utilizing MovieLens),
carry short conversations about the two, and finally to perform on natural
dialogs from Reddit. We provide a dataset covering 75k movie entities and with
3.5M training examples. We present results of various models on these tasks,
and evaluate their performance.
#### Introduction
* The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent.
* [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/)
#### Dataset
* Created from large-scale, real-world sources: OMDB (Open Movie Database), MovieLens, and Reddit.
* Consists of ~75K movie entities and ~3.5M training examples.
#### Tasks
##### QA Task
* Answering factoid questions, with no dependence on the previous dialogue.
* A KB (Knowledge Base) is built from OMDB and stored as triples of the form (entity, relation, entity).
* Questions in natural language are generated from templates built using the [SimpleQuestions](https://arxiv.org/abs/1506.02075) dataset.
* Instead of returning a single response, the system ranks all candidate answers by relevance (a minimal sketch follows this list).
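
A minimal sketch of this setup: a toy KB of (entity, relation, entity) triples, a template-instantiated question, and exhaustive ranking of candidate entities. The triples, template, and 0/1 scoring below are illustrative assumptions, not the actual OMDB-derived data or the paper's trained ranker.

```python
# Toy KB of (entity, relation, entity) triples; contents are made up.
KB = [
    ("Blade Runner", "directed_by", "Ridley Scott"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "directed_by", "Ridley Scott"),
]

# Hypothetical natural-language template, keyed by relation.
TEMPLATES = {"directed_by": "Who directed the movie {entity}?"}

def generate_question(entity, relation):
    return TEMPLATES[relation].format(entity=entity)

def rank_answers(entity, relation):
    # Score every entity in the KB; matching triples get score 1, others 0,
    # so the ranked list puts correct answers first (ties broken arbitrarily).
    candidates = {e for triple in KB for e in (triple[0], triple[2])}
    scores = {c: 0 for c in candidates}
    for head, rel, tail in KB:
        if head == entity and rel == relation:
            scores[tail] = 1
    return sorted(scores, key=scores.get, reverse=True)

print(generate_question("Blade Runner", "directed_by"))
print(rank_answers("Blade Runner", "directed_by")[:3])
```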
##### Recommendation Task
* Providing personalised responses to the user via recommendations, rather than the universal facts of the QA task.
* Uses the MovieLens dataset, a *user x item* matrix of ratings.
* Statements for a user are generated by sampling movies the user rated highly and slotting them into natural-language templates (see the sketch after this list).
* As in the QA task, a ranked list of responses is produced.
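
A rough illustration of the statement-generation step, assuming a toy ratings dictionary and a made-up template; the real templates and sampling scheme come from the paper's MovieLens-based pipeline.

```python
import random

# Hypothetical slice of a user x item ratings matrix.
ratings = {"user_42": {"Pulp Fiction": 5, "Fargo": 4, "Twister": 2}}

def make_statement(user, threshold=4, k=2):
    # Sample movies the user rated highly and slot them into a template.
    liked = [m for m, r in ratings[user].items() if r >= threshold]
    sample = random.sample(liked, min(k, len(liked)))
    return "I really liked {}. Can you suggest another film I might enjoy?".format(
        " and ".join(sample))

print(make_statement("user_42"))
```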
##### QA + Recommendation Task
* Maintaining short dialogues involving both factoid and personalised content.
* Dataset consists of short conversations of three exchanges (three turns from each participant).
##### Reddit Discussion Task
* Identify the most likely response in discussions from Reddit.
* Data is processed to flatten each discussion thread so that it reads as a two-participant conversation (a toy flattening sketch follows this list).
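
A toy sketch of flattening a threaded discussion into a two-speaker dialog by following one root-to-leaf reply chain and relabeling alternating turns; the thread structure and relabeling rule here are assumptions for illustration, not the paper's exact preprocessing.

```python
# Invented comment tree standing in for a Reddit thread.
thread = {
    "text": "Anyone seen the new sci-fi release?",
    "replies": [
        {"text": "Yes, the visuals were stunning.",
         "replies": [{"text": "Agreed, though the plot dragged.", "replies": []}]},
    ],
}

def flatten(node, path=None):
    # Walk a single root-to-leaf path through the reply tree.
    path = (path or []) + [node["text"]]
    if not node["replies"]:
        return path
    return flatten(node["replies"][0], path)  # follow the first reply chain

# Relabel alternating turns as speaker A / speaker B.
dialog = [("A" if i % 2 == 0 else "B", utt) for i, utt in enumerate(flatten(thread))]
for speaker, utt in dialog:
    print(f"{speaker}: {utt}")
```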
##### Joint Task
* Combines all the previous tasks into one single task to test all the skills at once.
#### Models Tested
* **Memory Networks** - Comprise a memory component covering both long-term memory and short-term context.
* **Supervised Embedding Models** - Sum the word embeddings of the input and of the candidate response independently, then compare the two with a similarity metric (sketched below, after this list).
* **Recurrent Language Models** - RNN, LSTM, Seq2Seq.
* **Question Answering Systems** - Systems that answer natural-language questions by converting them into search queries over a KB.
* **SVD (Singular Value Decomposition)** - A standard benchmark for recommendation.
* **Information Retrieval Models** - Given a message, either find the most similar message in the training set and return its response, or find the most similar response to the input directly.
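
A minimal sketch of the supervised embedding baseline: sum word embeddings for the input and for each candidate independently, then rank candidates by dot product. The vocabulary, dimensions, and random weights below are placeholders; in the paper the embeddings are trained with a supervised ranking objective.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(
    "who directed blade runner ridley scott james cameron".split())}
dim = 16
E_in = rng.normal(size=(len(vocab), dim))   # input-side word embeddings
E_out = rng.normal(size=(len(vocab), dim))  # candidate-side word embeddings

def embed(text, table):
    # Bag-of-words embedding: sum the vectors of known words.
    idx = [vocab[w] for w in text.lower().split() if w in vocab]
    return table[idx].sum(axis=0)

def rank(question, candidates):
    # Score each candidate by dot-product similarity with the question.
    q = embed(question, E_in)
    scores = [(float(q @ embed(c, E_out)), c) for c in candidates]
    return sorted(scores, reverse=True)

print(rank("Who directed Blade Runner", ["Ridley Scott", "James Cameron"]))
```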
#### Results
##### QA Task
* QA System > Memory Networks > Supervised Embeddings > LSTM
##### Recommendation Task
* Supervised Embeddings > Memory Networks > LSTM > SVD
##### Tasks Involving Dialog History
* Covers the QA + Recommendation Task and the Reddit Discussion Task.
* Memory Networks > Supervised Embeddings > LSTM
##### Joint Task
* Supervised word embeddings perform very poorly, even with a large number of dimensions (2000).
* Memory Networks outperform embedding models because they can exploit both the local context and long-term memory, though on the QA portion they fall short of their standalone QA performance (a minimal one-hop attention sketch follows).
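
To make the "local context plus long-term memory" point concrete, here is a minimal one-hop memory attention sketch in the spirit of end-to-end memory networks (Sukhbaatar et al., 2015): the query attends over embedded memories and the weighted readout augments the query before scoring candidates. All embeddings are random toy values, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_mem, n_cand = 16, 5, 3
memories = rng.normal(size=(n_mem, dim))     # embedded dialog turns + KB facts
candidates = rng.normal(size=(n_cand, dim))  # embedded candidate responses
query = rng.normal(size=dim)                 # embedded last user utterance

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

attention = softmax(memories @ query)   # relevance of each memory to the query
output = attention @ memories           # attention-weighted memory readout
scores = candidates @ (query + output)  # rank candidates with memory context
print("best candidate:", int(scores.argmax()))
```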