
Language-Driven Video Understanding

dc.contributor.author: Zhou, Luowei
dc.date.accessioned: 2020-05-08T14:35:52Z
dc.date.available: NO_RESTRICTION
dc.date.available: 2020-05-08T14:35:52Z
dc.date.issued: 2020
dc.date.submitted: 2020
dc.identifier.uri: https://hdl.handle.net/2027.42/155174
dc.description.abstract: Video understanding has come a long way in the past decade, from low-level segmentation and tracking tasks that treat objects as pixel-level segments or bounding boxes, to higher-level activity recognition and classification tasks that assign a categorical action label to a video scene. Despite this progress, much of the work remains a proxy for an eventual task or application that requires a holistic view of the video, encompassing objects, actions, attributes, and other semantic components. In this dissertation, we argue that language can deliver the required holistic representation. Language plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as in text-to-video search and voice-guided robot manipulation, to name a few examples. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. Our viewpoint differs from the prior literature in two ways. First, we propose a bottom-up structured learning scheme that decomposes a long video into individual procedure steps and represents each step with a description. Second, we propose both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts, which enables interpretable modeling of the two spaces. We start by drawing attention to the shortage of large benchmarks for long-form video and language and propose the largest-of-their-kind YouCook2 and ActivityNet-Entities datasets in Chaps. II and III. The remaining chapters center on the two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment, or a short video clip in general, we propose a non-recurrent (i.e., Transformer-based) approach to video description generation in Chap. V, as opposed to prior RNN-based methods, and demonstrate superior performance. Moving forward, we identify a potential issue with end-to-end video description generation: the lack of visual grounding ability and model interpretability that would allow humans to interact directly with machine vision models. To address this issue, we shift our focus from end-to-end, video-to-text systems to systems that explicitly capture the grounding between the two modalities, introducing a novel grounded video description framework in Chap. VI. Up to this point, all of the methods are fully supervised, i.e., the model training signal comes directly from extensive and expensive human annotation. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework in which models learn grounding from a (weak) description signal. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach, vision-language pre-training, that implicitly learns visual grounding from multi-modal web data. This mimics how humans acquire commonsense knowledge from the environment through multi-modal interaction.
dc.language.iso: en_US
dc.subject: Vision and Language
dc.subject: Video description
dc.subject: Visual captioning
dc.subject: Visual grounding
dc.subject: Computer Vision
dc.subject: Deep Learning and Machine Learning
dc.title: Language-Driven Video Understanding
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Robotics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Corso, Jason
dc.contributor.committeemember: Mihalcea, Rada
dc.contributor.committeemember: Chai, Joyce Y
dc.contributor.committeemember: Fouhey, David Ford
dc.contributor.committeemember: Rohrbach, Marcus
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbtoplevel: Engineering
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/155174/1/luozhou_1.pdf
dc.identifier.orcid: 0000-0003-1197-0101
dc.identifier.name-orcid: Zhou, Luowei; 0000-0003-1197-0101
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

