Enforcing Realism and Temporal Consistency for Large-Scale Video Inpainting
Abstract

Today, people are consuming more videos than ever before. At the same time, video manipulation has rapidly been gaining traction due to the influence of viral videos, as well as the convenience of editing software. Although video manipulation has legitimate entertainment purposes, it can also be incredibly destructive. In order to understand the positive and negative consequences of media manipulation---as well as to maintain the integrity of mass media---it is important to investigate the capabilities of video manipulation techniques. In this dissertation, we focus on the manipulation task of video inpainting, where the goal is to automatically fill in missing parts of a masked video with semantically relevant content. Inpainting results should possess high visual quality with respect to reconstruction performance, realism, and temporal consistency, i.e., they should faithfully recreate missing contents in a way that resembles the real world and exhibits minimal flickering artifacts. Two major challenges have impeded progress toward improving visual quality: semantic ambiguity and diagnostic evaluation. Semantic ambiguity exists for any masked video because there are several plausible explanations of the events in the observed scene; prior methods have struggled with this ambiguity due to their limited temporal contexts. As for diagnostic evaluation, prior work has overemphasized aggregate analysis on large datasets and underemphasized fine-grained analysis of modern inpainting failure modes; as a result, the expected behaviors of models under specific scenarios have remained poorly understood. Our work improves on both models and evaluation techniques for video inpainting, thereby providing deeper insight into how an inpainting model's design impacts the visual quality of its outputs. To advance the state of the art in video inpainting, we propose two novel solutions that improve visual quality by expanding the available temporal context.
Our first approach, bi-TAI, intelligently integrates information from multiple frames before and after the desired sequence. It produces more realistic results than prior work, which could only consume limited contextual information. Our second approach, HyperCon, suppresses flickering artifacts from frame-wise processing by identifying and propagating consistencies found in high frame-rate space; we successfully apply it to tasks as disparate as video inpainting and style transfer. Aside from methodological improvements, we also propose two novel evaluation tools to diagnose failure modes of modern video inpainting methods. Our first such contribution is the Moving Symbols dataset, which we use to characterize the sensitivity of a state-of-the-art video prediction model to controllable appearance and motion parameters. Our second contribution is the DEVIL benchmark, which provides a dataset and a comprehensive evaluation scheme to quantify how several semantic properties of the input video and mask affect video inpainting quality. Through models that exploit temporal context---as well as evaluation paradigms that reveal fine-grained failure modes of modern inpainting methods at scale---our contributions enforce better visual quality for video inpainting on a larger scale than prior work. We enable the production of more convincing manipulated videos for data processing and social media needs; we also establish replicable fine-grained analysis techniques to cultivate future progress in the field.
computer vision, video manipulation, generative visual modeling