Understanding and Identifying Challenges in Design of Safety-Critical AI Systems
Lubana, Ekdeep Singh
2024
Abstract
Increasingly powerful and general-purpose AI systems have found their way into our daily lives. As these systems spread to high-stakes applications, ensuring their safe deployment has become crucial: we must ensure these systems augment and benefit our society, instead of becoming an active source of harm. To this end, recent regulatory work has sought to define standards for identifying and mitigating vulnerabilities in AI-driven applications. These works ground themselves in frameworks of risk regulation, motivated by the fact that establishing causality in AI-caused harms can be difficult. The goal of this dissertation is to challenge this design choice. Over three broad parts, the presented work argues that peculiarities of modern neural networks, such as their increasingly open-ended nature, lead to loopholes in the off-the-shelf use of risk regulation as a framework for ensuring the development of safe AI. The contributions of each part are summarized as follows.

(i) Risk regulation assumes one can preemptively list the harms expected from a system, allowing the design of protocols to monitor them. In the first part, we present several empirical and formal models that demonstrate the unpredictable nature of neural network capabilities, showing that capabilities can suddenly emerge and enable a network to perform tasks it was not intended for. Specifically, we show emergent learning occurs either when general structures underlying the data-generating process are learned by the model, hence accelerating the learning of narrower tasks, or when a task is compositional in nature and the capabilities relevant to performing the composition are learned by the model. Such unpredictability renders preemptive enumeration of risks infeasible.

(ii) Risk assessment requires well-defined evaluations. In the second part, we analyze two challenges to this goal. First, we show that when evaluating compositional capabilities, models exhibit an intriguing phenomenon: well before standard benchmarking would support the claim that a model possesses a capability, there exist latent interventions that can force the model to generate the desired output. Such capabilities can therefore evade risk-assessment evaluations. Second, we analyze how minor input modifications can significantly alter a model's behavior, complicating the establishment of safe-use standards. We formalize this problem as input underspecification and analyze the mechanisms a model uses to infer which solution, among the spectrum of valid ones, should be used to respond to an input. We present evidence suggesting that large-scale models engage in a Bayesian selection protocol, i.e., minimal input changes that alter the posterior probability of a solution can completely change these models' output.

(iii) Fine-tuning protocols are the current de facto strategy for mitigating vulnerabilities identified in a neural network. In the final part, we identify limitations in these protocols, revealing that they learn minimalistic "wrappers" over a model's base capabilities and hence do not adequately suppress undesirable behaviors outside the fine-tuning data distribution.

Overall, the contributions of this dissertation suggest that regulation of AI systems requires exploration of novel and more nuanced paradigms that go beyond mere risk regulation. This can involve intermixing several viable frameworks, e.g., liability-based models and tort law, to define backstops for cases where risk regulation fails to adequately foresee the harms possible from a system.
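To make the Bayesian selection claim in part (ii) concrete, the following is a minimal illustrative sketch; the notation (input x, candidate solution h, hypothesis space H) is introduced here for exposition and is not taken verbatim from the dissertation. The model is viewed as choosing among the valid solutions consistent with an underspecified input:

\[
\hat{h}(x) \;=\; \arg\max_{h \in \mathcal{H}} \; p(h \mid x),
\qquad
p(h \mid x) \;\propto\; p(x \mid h)\, p(h).
\]

Under this reading, a minimal edit $x \to x'$ that changes which candidate $h$ attains the maximum posterior can flip the model's output entirely, even though $x$ and $x'$ differ only slightly.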
Subjects
Emergent abilities in neural networks; Risk regulation
Types
Thesis