The Design and Evaluation of Neural Attention Mechanisms for Explaining Text Classifiers
Carton, Samuel
2019
Abstract
The last several years have seen a surge of interest in interpretability in AI and machine learning: the idea of producing human-understandable explanations for AI model behavior. This interest has grown out of concerns about the robustness and accountability of AI-driven systems, particularly deep neural networks, in light of the increasing ubiquity of such systems in industry, science, and government. The general hope of the field is that by producing explanations of model behavior for human consumption, one or more model-using stakeholder groups (e.g. model designers, model-advised decision-makers, recipients of model-driven decisions) will be able to derive some type of increased utility from those models (e.g. easier model debugging, better decision-making, higher user satisfaction). The early years of this field have seen a profusion of techniques but a paucity of evaluation. A number of methods have been proposed for explaining the decisions of deep neural models, or for constraining neural models to behave in more interpretable ways. However, it has proven difficult for the community to reach a consensus about how to evaluate the quality of such methods. Automated evaluation protocols such as collecting gold-standard explanations do not necessarily correlate well with true practical utility, while fully application-oriented evaluations are expensive, difficult to generalize from, and, it increasingly appears, an extremely difficult HCI challenge. In this work I address gaps in both the design and evaluation of interpretability methods for text classifiers. I present two novel interpretability methods. The first method is a feature-based explanation technique which uses an adversarial attention mechanism to identify all predictive signal in the body of an input text, allowing it to outperform strong baselines with respect to human gold-standard annotations.
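To make the general idea concrete, here is a minimal sketch of attention-style feature attribution for a text classifier. This is an illustration of the broad technique only, not the adversarial mechanism developed in the thesis; the token scores, function names, and top-k cutoff are all hypothetical.

```python
import math

def softmax(scores):
    # Normalize raw per-token relevance scores into attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_explanation(tokens, scores, k=2):
    """Return the k tokens with the highest attention weight.

    This treats attention weights as a feature-based explanation:
    the tokens the model attended to most are offered to a human
    reader as the rationale for the prediction.
    """
    weights = softmax(scores)
    ranked = sorted(zip(tokens, weights), key=lambda tw: tw[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

tokens = ["the", "movie", "was", "wonderful", "overall"]
scores = [0.1, 0.7, 0.1, 2.5, 0.4]  # hypothetical pre-softmax relevance scores
print(attention_explanation(tokens, scores))  # → ['wonderful', 'movie']
```

The thesis's contribution differs from this naive version in that its adversarial mechanism is designed to surface *all* predictive signal in the text, rather than only the weights one attention head happens to produce.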
The second method is an example-based technique that retrieves explanatory examples using only the features that were important to a given prediction, leading to examples which are much more relevant than those produced by strong baselines. I accompany each method with a formal user study evaluating whether that type of explanation improves human performance in model-assisted decision-making. In neither study am I able to demonstrate an improvement in human performance as an effect of explanation presence. This, along with other recent results in the interpretability literature, begins to reveal an intriguing expectation gap between the enthusiasm that the interpretability topic has engendered in the machine learning community and the actual utility of these techniques in terms of human outcomes that the community has been able to demonstrate. Both studies represent contributions to the design of evaluation studies for interpretable machine learning. The second study in particular is one of the first human evaluations of example-based explanations for neural text classifiers. Its outcome reveals several important, non-obvious design issues in example-based explanation systems which should inform future work on the topic.
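The idea of retrieving examples through only the prediction-relevant features can be sketched as follows. This is a simplified bag-of-features illustration under my own assumptions, not the thesis's actual neural retrieval method; the corpus, feature sets, and function names are hypothetical.

```python
def masked_similarity(query_feats, candidate_feats, important):
    # Count feature overlap only over features important to the prediction.
    return len(query_feats & candidate_feats & important)

def retrieve_examples(query_feats, corpus, important, n=1):
    """Rank training examples by overlap restricted to important features.

    `important` is the set of features the model relied on for this
    prediction; ignoring all other features keeps retrieved examples
    relevant to the decision rather than to the text as a whole.
    """
    ranked = sorted(
        corpus.items(),
        key=lambda kv: masked_similarity(query_feats, kv[1], important),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]

corpus = {
    "ex1": {"plot", "boring", "long"},
    "ex2": {"acting", "wonderful", "cast"},
}
query = {"movie", "wonderful", "soundtrack"}
important = {"wonderful"}  # hypothetical set of prediction-relevant features
print(retrieve_examples(query, corpus, important))  # → ['ex2']
```

The design choice this illustrates is the masking step: an unrestricted similarity search can surface examples that resemble the input text overall while having nothing to do with why the model made its prediction.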
Subjects
Machine learning; Interpretability
Types
Thesis