
Learning Visual Representations from Cross-Modal Correspondence

dc.contributor.author: El Banani, Mohamed
dc.date.accessioned: 2024-05-22T17:22:32Z
dc.date.available: 2024-05-22T17:22:32Z
dc.date.issued: 2024
dc.date.submitted: 2024
dc.identifier.uri: https://hdl.handle.net/2027.42/193255
dc.description.abstract: One of the goals of computer vision is to develop visual agents that can learn without human annotation. This is typically done by learning from images and their augmentations. In contrast, humans learn from dynamic and multi-sensory environments without requiring such explicit supervision. My dissertation delves into this contrast, exploring how models can learn visual representations directly from their environments. My core observation is that such environments, despite their complexity, present consistent patterns across modalities. These cross-modal patterns offer a rich training signal, as we can leverage similarity in one modality to learn generalizable representations in another without requiring additional supervision. In this dissertation, I argue that cross-modal correspondence provides both a rich signal for learning visual representations and a useful tool for analyzing them. I first discuss how models can learn visual representations by finding 3D correspondences in RGB-D videos. By estimating geometrically consistent correspondences between video frames, models can learn representations that rival those of supervised models. I then discuss how the notion of correspondence can be extended to language. I propose language-guided self-supervised learning, where language models are used to find image pairs that depict similar concepts. I show that language guidance outperforms both self-supervised and language-supervised models, further showcasing the utility of learning from correspondence. Finally, I explore how correspondence can also be used to analyze the 3D awareness and consistency of visual representations learned by large-scale vision models. My analysis suggests that while current approaches yield good models of semantics and localization, their 3D awareness remains limited.
dc.language.iso: en_US
dc.subject: Computer Vision
dc.subject: Self-supervised Learning
dc.subject: Representation Learning
dc.subject: Correspondence
dc.title: Learning Visual Representations from Cross-Modal Correspondence
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Computer Science & Engineering
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Johnson, Justin Christopher
dc.contributor.committeemember: Owens, Andrew
dc.contributor.committeemember: Efros, Alexei
dc.contributor.committeemember: Fouhey, David Ford
dc.contributor.committeemember: Yu, Stella
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbtoplevel: Engineering
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/193255/1/mbanani_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/22900
dc.identifier.orcid: 0000-0003-4686-6048
dc.identifier.name-orcid: El Banani, Mohamed; 0000-0003-4686-6048
dc.working.doi: 10.7302/22900
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

