Interpretable & Explorable Approximations of Black Box Models
Lakkaraju, Himabindu; Kamar, Ece; Caruana, Rich; Leskovec, Jure
arXiv e-Print archive, 2017
Model interpretations must be faithful to the model while also promoting human understanding of how the model works. This calls for an interpretability method that balances the two.
**Idea**: Although there exist interpretation methods that balance fidelity and human comprehension at a local level, specific to an underlying model, there is no global, model-agnostic interpretation method that achieves the same.
**Solution:**
- Partition the behavior of the underlying model into distinct, compact, non-overlapping decision sets, generating explanations that are faithful to the model and that together cover the model's entire feature space.
- How the solution addresses each desideratum:
    - *Fidelity* (staying true to the model): the labels assigned by the approximation match those assigned by the underlying model.
    - *Unambiguity* (single clear decision): because the compact decision sets do not overlap, each region of the feature space receives exactly one label.
    - *Interpretability* (understandable by humans): an intuitive rule-based representation, with a limited number of rules and predicates.
    - *Interactivity* (allowing the user to focus on specific feature subspaces): the feature space is divided into distinct compact subspaces, so users can drill into their region of interest.
- Details on a “decision set”:
    - Each decision set is a two-level decision (a nested if-then structure), where the outer if-then condition specifies the subspace and the inner if-then condition specifies the logic by which the model assigns a label within that subspace.
    - A default rule assigns labels to instances that do not satisfy any of the two-level decisions.
    - The advantage of this structure is that the logic behind an assigned label never has to be traced more than two levels deep, making it less complex than a decision tree, which follows a similar if-then structure but can nest arbitrarily deep. A sketch of this representation follows below.
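
To make the structure concrete, here is a minimal Python sketch of such a two-level decision set approximation. The rules, feature names, and class labels are hypothetical illustrations, not taken from the paper:

```python
import operator

# Hypothetical illustration, not code from the paper: each rule pairs an
# outer "neighborhood" condition with an inner decision condition and a label.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def holds(predicates, x):
    """True if instance x satisfies every (feature, op, value) predicate."""
    return all(OPS[op](x[feat], val) for feat, op, val in predicates)

class TwoLevelRule:
    def __init__(self, subspace, decision, label):
        self.subspace = subspace  # outer if-then: which region of feature space
        self.decision = decision  # inner if-then: label logic within that region
        self.label = label

class DecisionSetApproximation:
    def __init__(self, rules, default_label):
        self.rules = rules
        self.default_label = default_label  # default rule for uncovered instances

    def predict(self, x):
        # Subspaces are disjoint, so at most one rule fires per instance:
        # the label is unambiguous and traceable in two if-then steps.
        for r in self.rules:
            if holds(r.subspace, x) and holds(r.decision, x):
                return r.label
        return self.default_label

# Hypothetical rules approximating a black-box credit model:
approx = DecisionSetApproximation(
    rules=[
        TwoLevelRule([("age", "<", 30)], [("income", ">", 50000)], "approve"),
        TwoLevelRule([("age", ">=", 30)], [("debt", "<", 10000)], "approve"),
    ],
    default_label="deny",
)
print(approx.predict({"age": 25, "income": 60000, "debt": 0}))  # -> approve
```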
**Mapping fidelity vs. interpretability**
- To evaluate how their model trades off fidelity against interpretability, they plotted the agreement rate (the fraction of instances for which the approximation's label matches the black-box-assigned label) against predefined measures of interpretability complexity, such as (see the sketch after this list):
    - Number of predicates (the sum of the widths of all decision sets)
    - Number of rules (each rule being a triple of outer decision, inner decision, and class label)
    - Number of defined neighborhoods (distinct outer if-then decisions)
- Their model reached higher agreement rates than competing methods at lower values of interpretability complexity.
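
As an illustration, the agreement rate and these complexity measures could be computed over the hypothetical representation sketched above as follows (the helper names are my own, not from the paper):

```python
# Hypothetical helpers (reusing DecisionSetApproximation from the sketch
# above) for the agreement rate and the complexity measures plotted against it.
def agreement_rate(approx, black_box_predict, instances):
    """Fidelity: fraction of instances whose approximation label matches
    the label assigned by the black box."""
    hits = sum(approx.predict(x) == black_box_predict(x) for x in instances)
    return hits / len(instances)

def num_predicates(approx):
    """Sum of widths of all decision sets: total predicate count over
    all outer and inner conditions."""
    return sum(len(r.subspace) + len(r.decision) for r in approx.rules)

def num_rules(approx):
    """Each rule is an (outer decision, inner decision, label) triple."""
    return len(approx.rules)

def num_neighborhoods(approx):
    """Distinct outer if-then decisions (subspace descriptors)."""
    return len({tuple(r.subspace) for r in approx.rules})
```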