Language-Driven Video Understanding
Zhou, Luowei
2020
Abstract
Video understanding has advanced considerably over the past decade, progressing from low-level segmentation and tracking tasks that study objects as pixel-level segments or bounding boxes to higher-level activity recognition tasks that classify a video scene into a categorical action label. Despite this progress, much of the work remains a proxy for eventual tasks or applications that require a holistic view of the video, encompassing objects, actions, attributes, and other semantic components. In this dissertation, we argue that language can deliver the required holistic representation. Language plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as shown in tasks such as text-to-video search and voice-guided robot manipulation. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. What sets our viewpoint apart from the prior literature is twofold. First, we propose a bottom-up structured learning scheme by decomposing a long video into individual procedure steps and representing each step with a description. Second, we propose both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts, which enables interpretable modeling of the two spaces. We start by drawing attention to the shortage of large benchmarks for long-form video and language and propose the largest-of-their-kind YouCook2 and ActivityNet-Entities datasets in Chap. II and III. The remaining chapters center on the two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment, or a short video clip in general, we propose a non-recurrent (i.e., Transformer-based) approach for video description generation in Chap. V, as opposed to prior RNN-based methods, and demonstrate superior performance. Moving forward, we note a potential issue in end-to-end video description generation: the lack of visual grounding ability and model interpretability that would allow humans to interact directly with machine vision models. To address this issue, we shift our focus from end-to-end, video-to-text systems to systems that explicitly capture the grounding between the two modalities, presenting a novel grounded video description framework in Chap. VI. Up to this point, all of the methods are fully supervised, i.e., the training signal comes directly from extensive and expensive human annotations. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework in which models learn grounding from the (weak) description signal. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach, vision-language pre-training, which implicitly learns visual grounding from multi-modal web data. This mimics how humans acquire commonsense from the environment through multi-modal interactions.
Subjects
Vision and Language; Video description; Visual captioning; Visual grounding; Computer Vision; Deep Learning and Machine Learning
Types
Thesis