SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang
arXiv e-Print archive, 2016
Keywords: cs.CL
First published: 2016/06/16
Abstract: We present a new reading comprehension dataset, SQuAD, consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset in both manual and automatic ways to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We built a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research.
TLDR; A new dataset of ~100k questions and answers based on ~500 Wikipedia articles. Both questions and answers were collected via crowdsourcing. Answers come in various types: 20% dates and numbers, 32% proper nouns, 31% noun phrases, and 16% other phrases. Humans achieve an F1 score of 86%, while the proposed Logistic Regression model reaches 51%. It does well on simple answers but struggles with more complex types of reasoning. The dataset is publicly available at https://stanford-qa.com/.
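The F1 numbers here are token-overlap scores, not strict string matches. Below is a minimal sketch of how SQuAD-style exact match and overlap F1 are typically computed; it is my own illustration following the normalization conventions of the official evaluation script (lowercasing, stripping punctuation and articles), not code from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the 25th of May", "25th of May 1961"))  # 0.0
print(f1_score("the 25th of May", "25th of May 1961"))     # ~0.86
```

The last two lines show why F1 is more forgiving than exact match: a predicted span that misses or adds a few tokens still gets partial credit.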
#### Key Points
- The system must select an answer from all possible spans in the passage: $O(N^2)$ candidates for a passage of $N$ tokens (see the enumeration sketch after this list).
- Answer boundaries are ambiguous. Humans achieve 77% on exact match but 86% on F1 (token-overlap based; see the metric sketch above). Human performance would likely be close to 100% if the answer phrases were unambiguous.
- Lexicalized and dependency tree path features are most important for the LR model
- Model performs best on dates and numbers, single tokens, and categories with few possible candidates
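To make the $O(N^2)$ candidate-span point concrete, here is a small illustration (not from the paper, which prunes this space using parse trees) that simply enumerates every contiguous token span of a passage as a candidate answer:

```python
from typing import List, Optional, Tuple


def candidate_spans(tokens: List[str], max_len: Optional[int] = None) -> List[Tuple[int, int, str]]:
    """Enumerate every contiguous token span (start, end, text) of a passage.

    Without a length cap there are N * (N + 1) / 2 spans, i.e. O(N^2)
    candidates that the answer selector has to score.
    """
    n = len(tokens)
    spans = []
    for start in range(n):
        end_limit = n if max_len is None else min(n, start + max_len)
        for end in range(start + 1, end_limit + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


# A short example passage.
passage = "Super Bowl 50 was an American football game".split()
spans = candidate_spans(passage)
print(len(passage), len(spans))  # 8 tokens -> 36 candidate spans (8 * 9 / 2)
```

Even for an 8-token passage there are 36 candidates, which is why restricting candidates (e.g. to constituents) and strong span-scoring features matter.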