
Language-Driven Video Understanding

dc.contributor.author: Zhou, Luowei
dc.date.accessioned: 2020-05-08T14:35:52Z
dc.date.available: NO_RESTRICTION
dc.date.available: 2020-05-08T14:35:52Z
dc.date.issued: 2020
dc.date.submitted: 2020
dc.identifier.uri: https://hdl.handle.net/2027.42/155174
dc.description.abstract: Video understanding has come a long way in the past decade, from low-level segmentation and tracking tasks that treat objects as pixel-level segments or bounding boxes, to higher-level activity recognition and classification tasks that assign a categorical action label to a video scene. Despite this progress, much of the work remains a proxy for an eventual task or application that requires a holistic view of the video, encompassing objects, actions, attributes, and other semantic components. In this dissertation, we argue that language can deliver the required holistic representation. Language plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as in text-to-video search and voice-guided robot manipulation, to name a few examples. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. Our viewpoint differs from the prior literature in two ways. First, we propose a bottom-up structured learning scheme that decomposes a long video into individual procedure steps and represents each step with a description. Second, we propose both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts, which enables interpretable modeling of the two spaces. We start by drawing attention to the shortage of large benchmarks for long-form video and language and propose the largest-of-their-kind YouCook2 and ActivityNet-Entities datasets in Chaps. II and III. The remaining chapters center on the two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment, or a short video clip in general, we propose a non-recurrent (i.e., Transformer-based) approach to video description generation in Chap. V, as opposed to prior RNN-based methods, and demonstrate superior performance. Moving forward, we identify a potential issue with end-to-end video description generation: the lack of visual grounding ability and model interpretability that would allow humans to interact directly with machine vision models. To address this issue, we shift our focus from end-to-end, video-to-text systems to systems that explicitly capture the grounding between the two modalities, introducing a novel grounded video description framework in Chap. VI. Up to this point, all of the methods are fully supervised, i.e., the model training signal comes directly from extensive and expensive human annotation. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework in which models learn grounding from a (weak) description signal. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach, vision-language pre-training, that implicitly learns visual grounding from multi-modal web data. This mimics how humans acquire commonsense knowledge from the environment through multi-modal interaction.
dc.language.iso: en_US
dc.subject: Vision and Language
dc.subject: Video description
dc.subject: Visual captioning
dc.subject: Visual grounding
dc.subject: Computer Vision
dc.subject: Deep Learning and Machine Learning
dc.title: Language-Driven Video Understanding
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Robotics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Corso, Jason
dc.contributor.committeemember: Mihalcea, Rada
dc.contributor.committeemember: Chai, Joyce Y
dc.contributor.committeemember: Fouhey, David Ford
dc.contributor.committeemember: Rohrbach, Marcus
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbtoplevel: Engineering
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/155174/1/luozhou_1.pdf
dc.identifier.orcid: 0000-0003-1197-0101
dc.identifier.name-orcid: Zhou, Luowei; 0000-0003-1197-0101
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

