"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin
arXiv e-Print archive - 2016
Keywords:
cs.LG, cs.AI, stat.ML
First published: 2016/02/16
Abstract: Despite widespread adoption, machine learning models remain mostly black
boxes. Understanding the reasons behind predictions is, however, quite
important in assessing trust, which is fundamental if one plans to take action
based on a prediction, or when choosing whether to deploy a new model. Such
understanding also provides insights into the model, which can be used to
transform an untrustworthy model or prediction into a trustworthy one. In this
work, we propose LIME, a novel explanation technique that explains the
predictions of any classifier in an interpretable and faithful manner, by
learning an interpretable model locally around the prediction. We also propose
a method to explain models by presenting representative individual predictions
and their explanations in a non-redundant way, framing the task as a submodular
optimization problem. We demonstrate the flexibility of these methods by
explaining different models for text (e.g. random forests) and image
classification (e.g. neural networks). We show the utility of explanations via
novel experiments, both simulated and with human subjects, on various scenarios
that require trust: deciding if one should trust a prediction, choosing between
models, improving an untrustworthy classifier, and identifying why a classifier
should not be trusted.
This paper describes how to find local interpretable model-agnostic explanations (LIME) for why a black-box model $m_B$ came to a classification decision for a single sample $x$. The key idea is to evaluate many perturbed samples in the neighbourhood of $x$ (hence "local") and to fit an interpretable model $m_I$ to the black-box predictions on those samples. The way of sampling and the kind of interpretable model depend on the problem domain.
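A minimal sketch of this idea for a tabular classifier is shown below. The callable `m_B`, the Gaussian sampling, the RBF proximity kernel, and the ridge surrogate are illustrative assumptions, not the paper's reference implementation:

```python
# Minimal sketch of the LIME idea on tabular data (assumed example).
import numpy as np
from sklearn.linear_model import Ridge

def lime_local_explanation(m_B, x, n_samples=5000, sigma=1.0, rng=None):
    """Fit an interpretable (linear) surrogate m_I around the sample x.

    m_B: callable mapping an (n, d) array to class probabilities of shape (n,)
    x:   1-D feature vector to be explained
    """
    rng = np.random.default_rng(rng)
    d = x.shape[0]

    # 1. Sample perturbed points in the neighbourhood of x (local sampling).
    X_pert = x + rng.normal(scale=sigma, size=(n_samples, d))

    # 2. Query the black-box model on the perturbed samples.
    y_pert = m_B(X_pert)

    # 3. Weight each sample by its proximity to x (an RBF kernel here).
    dist2 = np.sum((X_pert - x) ** 2, axis=1)
    weights = np.exp(-dist2 / (2 * sigma ** 2))

    # 4. Fit the interpretable surrogate m_I with locality weights.
    m_I = Ridge(alpha=1.0)
    m_I.fit(X_pert, y_pert, sample_weight=weights)

    # The coefficients are the local explanation: the (approximate)
    # contribution of each feature to the black-box decision near x.
    return m_I.coef_
```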
For computer vision / image classification, the image $x$ is divided into superpixels. Random subsets of superpixels are blacked out, and each perturbed image $x'$ is evaluated as $p' = m_B(x')$. This is repeated many times, and the interpretable model is fit to the resulting perturbation/prediction pairs (see the sketch below).
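The image variant could look roughly like the following sketch. The SLIC segmentation, the black fill value, the mask-based kernel, and the callable `m_B` are assumptions chosen for illustration:

```python
# Sketch of LIME on an image via superpixel perturbation (assumed details).
import numpy as np
from skimage.segmentation import slic
from sklearn.linear_model import Ridge

def explain_image(m_B, x, n_samples=1000, n_segments=50, rng=None):
    """m_B maps a batch of images (n, H, W, 3) to class probabilities (n,)."""
    rng = np.random.default_rng(rng)

    # 1. Divide the image x into superpixels.
    segments = slic(x, n_segments=n_segments)
    seg_ids = np.unique(segments)

    # 2. Draw binary masks: 0 means "this superpixel is blacked out".
    masks = rng.integers(0, 2, size=(n_samples, len(seg_ids)))

    preds = np.empty(n_samples)
    for i, mask in enumerate(masks):
        x_prime = x.copy()
        for j, seg in enumerate(seg_ids):
            if mask[j] == 0:
                x_prime[segments == seg] = 0      # black out this superpixel
        preds[i] = m_B(x_prime[None])[0]          # p' = m_B(x')

    # 3. Weight perturbed images by similarity to the original
    #    (the original corresponds to the all-ones mask).
    weights = np.exp(-np.sum(1 - masks, axis=1) / len(seg_ids))

    # 4. Fit the interpretable model on the binary superpixel features.
    m_I = Ridge(alpha=1.0)
    m_I.fit(masks, preds, sample_weight=weights)
    return m_I.coef_, segments                    # per-superpixel importance
```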
The paper is also explained in [this YouTube video](https://www.youtube.com/watch?v=KP7-JtFMLo4) by Marco Tulio Ribeiro.
A very similar idea already appears in the [Zeiler & Fergus paper](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13#martinthoma).
## Follow-up Papers
* June 2016: [Model-Agnostic Interpretability of Machine Learning](https://arxiv.org/abs/1606.05386)
* November 2016:
  * [Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance](https://arxiv.org/abs/1611.05817)
  * [An unexpected unity among methods for interpreting model predictions](https://arxiv.org/abs/1611.07478)