Towards Video Understanding through Language in Real-life Settings
dc.contributor.author | Castro, Santiago | |
dc.date.accessioned | 2024-09-03T18:43:33Z | |
dc.date.available | 2024-09-03T18:43:33Z | |
dc.date.issued | 2024 | |
dc.date.submitted | 2024 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/194696 | |
dc.description.abstract | Videos have become an integral part of our daily lives, with a rapidly growing number on YouTube, Netflix, and TikTok serving as testimony to their widespread popularity. Behind the simplicity of their interfaces and user experiences, the systems that power these products employ numerous video-understanding techniques, even for straightforward use cases such as finding a video on how to cook salmon. Despite the significant progress achieved in this area, there remains a gap between lab-setting capabilities and reality, as many approaches are not designed for realistic settings, leading to issues such as domain mismatches and a failure to account for the diverse ways people interact in videos (e.g., sarcastically). My work aims to bridge this gap by enabling the understanding of video content in realistic settings. The issues that make current video understanding research unsuitable for real life can be classified into data, methods, and evaluation. The data aspect is crucial since current research has predominantly overlooked real-life settings. I present new datasets and benchmarks for such domains: daily situations and in-the-wild scenarios. These benchmarks measure the effectiveness of new methods in these more realistic settings. Likewise, I introduce a novel framework that accounts for a typical yet understudied human behavior: sarcasm. Sarcasm is particularly well suited to study in video, since I show that leveraging what we see and hear (as people commonly do) leads to a better understanding of it. On the methods side, I address a fundamental issue: the impracticality and lack of scalability of the traditional in-the-lab practice of tuning one model for each newly addressed task and domain. I propose a robust method that allows practitioners to employ a single model for novel tasks and domains with satisfactory performance. Additionally, I present a technique to improve the compositional generalization of existing models. 
Finally, I focus on current practices for evaluation and propose a framework better suited to realistic settings. Current benchmarks for short video understanding have drawbacks, such as employing easy-to-detect distractor answers, not accounting for the diversity of ways the same situation can be depicted, and not considering realistic settings. I present a novel evaluation format that tackles all these issues, along with a benchmark that leverages it. The benchmark shows a gap between the performance of several methods and that of humans. | |
dc.language.iso | en_US | |
dc.subject | Video Understanding | |
dc.subject | Natural Language Processing | |
dc.subject | Computer Vision | |
dc.subject | Compositional Generalization | |
dc.title | Towards Video Understanding through Language in Real-life Settings | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Mihalcea, Rada | |
dc.contributor.committeemember | Owens, Andrew | |
dc.contributor.committeemember | Caba, Fabian | |
dc.contributor.committeemember | Chai, Joyce | |
dc.contributor.committeemember | Johnson, Justin Christopher | |
dc.contributor.committeemember | Moncecchi, Guillermo | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.contributor.affiliationumcampus | Ann Arbor | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/194696/1/sacastro_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/24044 | |
dc.identifier.orcid | 0000-0001-8781-9323 | |
dc.working.doi | 10.7302/24044 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |