First published: 2016/06/10
Abstract: Supervised machine learning models boast remarkable predictive capabilities.
But can you trust your model? Will it work in deployment? What else can it tell
you about the world? We want models to be not only good, but interpretable. And
yet the task of interpretation appears underspecified. Papers provide diverse
and sometimes non-overlapping motivations for interpretability, and offer
myriad notions of what attributes render models interpretable. Despite this
ambiguity, many papers proclaim interpretability axiomatically, absent further
explanation. In this paper, we seek to refine the discourse on
interpretability. First, we examine the motivations underlying interest in
interpretability, finding them to be diverse and occasionally discordant. Then,
we address model properties and techniques thought to confer interpretability,
identifying transparency to humans and post-hoc explanations as competing
notions. Throughout, we discuss the feasibility and desirability of different
notions, and question the oft-made assertions that linear models are
interpretable and that deep neural networks are not.