Situated Language Grounding for Multimodal AI Assistant Modeling
Zhang, Yichi
2025
Abstract
Building multimodal AI assistants that can perceive the physical world, communicate naturally with humans, and help with real-world tasks is a cornerstone of AI research. Situated language grounding—the ability to connect language to rich, multimodal contexts—is fundamental for developing assistants that can interpret human instructions, reason about their environment, and provide timely responses. Despite advances in Large Language Models (LLMs) and their multimodal extensions (MLLMs) for processing multimodal inputs, effectively grounding language within the dynamic physical world to facilitate seamless human-AI interaction remains a significant challenge. This dissertation addresses this challenge by investigating situated language grounding across multiple dimensions and developing novel approaches for creating more contextually aware AI assistants.

In Part I, we explore language grounding in visual perception. We first introduce GROUNDHOG, a multimodal large language model designed for holistic visual segmentation across various semantic granularities. By leveraging innovative architectural designs and a curated multi-granularity dataset, this model outperforms previous task-specific approaches on a range of grounding tasks using a single model. We then examine the behavior of vision-language models under visual illusions. We propose GVIL, a systematic benchmark that tests whether models perceive visual illusions as humans do or instead represent reality faithfully, and we reveal critical insights into shared and misaligned patterns between humans and machines. These contributions advance our understanding of how multimodal models link language to vision in both standard and challenging perceptual scenarios.

In Part II, we extend our investigation to language grounding in embodied planning within 3D environments. We propose two neural-symbolic approaches that enable assistants to follow natural language instructions in such settings. First, HITUT introduces a unified transformer model for hierarchical task planning, allowing the same model to perform both high-level task decomposition and low-level action prediction. Second, DANLI constructs semantic representations of the physical environment and integrates a symbolic planner for transparent and efficient task execution. These complementary approaches demonstrate how assistants can leverage grounded understanding to interact effectively within complex environments.

In Part III, we focus on grounding language in dynamic and interactive settings, where assistants must process continuous visual input from a user's first-person perspective and engage proactively. We propose two complementary approaches. First, ProAssist provides a framework for generating realistic user-assistant dialogues from egocentric videos, harnessing large-scale real-world video data to train proactive assistants. Second, BASIS presents a simulation-based pipeline that supports scalable assistant development and evaluation through situated interactions in virtual environments, reducing the need for costly human intervention. These works lay foundations for the scalable development of interactive, perceptually grounded egocentric assistant systems.

Together, these contributions advance assistant modeling by introducing novel tasks, datasets, and modeling strategies for grounded, context-aware human-AI interaction.
Subjects
Artificial Intelligence, Multimodal Learning, Language Grounding, Embodied AI, Human-Robot Interaction, Large Language Model
Types
Thesis