On the Evaluation of Common-Sense Reasoning in Natural Language Understanding
Paul Trichelair, Ali Emami, Jackie Chi Kit Cheung, Adam Trischler, Kaheer Suleman, and Fernando Diaz
arXiv e-Print archive, 2018
Keywords: cs.LG, cs.AI, cs.CL, stat.ML
First published: 2018/11/05

Abstract: The NLP and ML communities have long been interested in developing models capable of common-sense reasoning, and recent work has significantly improved the state of the art on benchmarks like the Winograd Schema Challenge (WSC). Despite these advances, the complexity of tasks designed to test common-sense reasoning remains under-analyzed. In this paper, we present a case study of the Winograd Schema Challenge and, based on two new measures of instance-level complexity, design a protocol that both clarifies and qualifies the results of previous work. Our protocol accounts for the WSC's limited size and variable instance difficulty, properties common to other common-sense benchmarks. Accounting for these properties when assessing model results may prevent unjustified conclusions.
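The abstract's point about limited benchmark size can be made concrete. The standard WSC test set contains only 273 instances, so the statistical uncertainty around a reported accuracy is substantial. The following Python sketch is illustrative only (it is not code from the paper, and the two model scores are hypothetical): it computes a Wilson score interval for accuracy on a small benchmark, showing how apparent gaps between models can fall within overlapping intervals.

import math

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - margin, centre + margin

if __name__ == "__main__":
    n = 273  # size of the standard WSC273 test set
    for correct in (155, 172):  # two hypothetical model scores
        lo, hi = wilson_interval(correct, n)
        print(f"{correct}/{n} = {correct/n:.1%}  95% CI: [{lo:.1%}, {hi:.1%}]")

With n = 273, each interval spans roughly plus or minus six percentage points, which is why the paper argues that evaluation protocols on small common-sense benchmarks must account for sample size before drawing conclusions about model superiority.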