Hazy Oracles in Deep Learning
Lemmer, Stephan
2023
Abstract
While deep learning problems are often motivated as enabling technologies for human-computer interaction---a robot, for example, must align natural language referents and sensor readings to operate in a human world---the assumptions of these works make them poorly suited to real-world human interaction. Specifically, evaluation typically assumes that humans are oracles that provide semantically correct and unambiguous information, and that all such information is equally useful. While this is enforced in controlled experiments via carefully curated datasets, models operating in the wild will need to compensate for the fact that humans are hazy oracles that may provide information that is incorrect, ambiguous, or misaligned with the features learned by the model. A natural question follows: how can we use models trained under the oracle assumption with hazy humans?
We answer this question via a method we call deferred inference, which allows models trained via supervised learning to solicit and integrate additional information from the human when necessary. Deferred inference begins with a method for determining whether the model should defer inference and wait for additional human-provided information. Past work has generally simplified this to one of two questions: is the human-provided information correct, or is the output correct? However, we find that these approaches are insufficient due to the complex relationship between human inputs, sensor readings, and deep models: low-quality human-provided information may not cause error, while high-quality human-provided information may not correct it. To address this misalignment, we introduce Dual-loss Additional Error Regression (DAER), a method that successfully locates instances where a new human input can reduce error.
After introducing DAER, we note that we must consider not only how to find error caused by human input, but also how to integrate potentially noisy deferral responses and how to measure overall performance. For this, we introduce aggregation functions that integrate information across multiple inferences and a novel evaluation framework that measures the trade-off between error and additional human effort. Through this evaluation, we show that we can reduce error by up to 48% at a reasonable level of human effort, without any changes to training or architecture.
Last, we consider how to shift from datasets to individuals. While crowdsourced datasets allow rapid implementation and evaluation of deferral and aggregation functions, they do not accurately model human-computer interaction: the mechanisms used to crowdsource data impose distribution shifts, and the failure to identify individual annotators imposes the tacit assumptions that all humans are the same and that inputs do not change over time or with deferral depth. Through a human-centered experiment, we show that these assumptions do not hold: an ideal deferral function must be calibrated for a specific user, users learn the model over time, and the deferral response is likely to be of lower quality than the initial query.
While deep-learned models have been proposed for many applications that require cooperation between humans and computers, deploying models that were trained and evaluated on carefully curated datasets remains a challenge due to the hazy nature of human inputs. In this dissertation, we propose deferred inference as a method for addressing this challenge while respecting the paradigm of supervised training. By demonstrating deferred inference on four disparate problems, we provide insights into its challenges, benefits, and generalizability that motivate and lay the foundation for the eventual deployment of deep-learned human-in-the-loop models.
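To make the deferred-inference loop described in the abstract concrete, the minimal sketch below shows one possible structure, assuming a hypothetical deferral scorer standing in for a learned estimator such as DAER, and a simple probability-averaging aggregation function. All names, signatures, and the placeholder score are illustrative assumptions, not the dissertation's actual implementation.

import torch

def deferral_score(output, human_input, sensor_input):
    """Placeholder for a learned estimator such as DAER, which would
    predict how much error a new human input could remove. Returns a
    fixed value here purely so the sketch runs end to end."""
    return 0.0

def aggregate_mean(outputs):
    # One simple aggregation function: average class probabilities
    # across all inferences rather than trusting only the last one.
    return torch.stack(outputs).softmax(dim=-1).mean(dim=0)

def deferred_inference(model, sensor_input, get_human_input,
                       threshold=0.5, max_deferrals=2):
    """Illustrative deferred-inference loop.

    The model is trained normally under the oracle assumption; only
    inference changes. When the deferral score predicts that a new
    human input would reduce error, we solicit one and re-infer."""
    human_input = get_human_input()                  # initial query
    outputs = [model(sensor_input, human_input)]
    for _ in range(max_deferrals):
        score = deferral_score(outputs[-1], human_input, sensor_input)
        if score < threshold:                        # trusted; stop deferring
            break
        human_input = get_human_input()              # defer: new human input
        outputs.append(model(sensor_input, human_input))
    # Aggregating across inferences guards against a noisy deferral
    # response overriding a good initial one.
    return aggregate_mean(outputs)

Under this framing, the error-versus-effort evaluation the abstract describes would correspond to sweeping the deferral threshold: a lower threshold defers more often, spending additional human effort in exchange for reduced error.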
Subjects
Human-Computer Interaction; Computer Vision; Deep Learning; Deferred Inference
Types
Thesis