
Learning Visual Representations from Cross-Modal Correspondence

dc.contributor.author: El Banani, Mohamed
dc.date.accessioned: 2024-05-22T17:22:32Z
dc.date.available: 2024-05-22T17:22:32Z
dc.date.issued: 2024
dc.date.submitted: 2024
dc.identifier.uri: https://hdl.handle.net/2027.42/193255
dc.description.abstract: One of the goals of computer vision is to develop visual agents that can learn without human annotation. This is typically done by learning from images and their augmentations. In contrast, humans learn from dynamic and multi-sensory environments without requiring such explicit supervision. My dissertation delves into this contrast, exploring how models can learn visual representations directly from their environments. My core observation is that such environments, despite their complexity, present consistent patterns across modalities. These cross-modal patterns offer a rich training signal, as we can leverage similarity in one modality to learn generalizable representations in another without requiring additional supervision. In this dissertation, I argue that cross-modal correspondence provides both a rich signal for learning visual representations and a useful tool for analyzing them. I first discuss how models can learn visual representations by finding 3D correspondences in RGB-D videos. By estimating geometrically consistent correspondences between video frames, models can learn representations that rival those of supervised models. I then discuss how the notion of correspondence can be extended to language. I propose language-guided self-supervised learning, where language models are used to find image pairs that depict similar concepts. I show that language guidance outperforms both self-supervised and language-supervised models, further showcasing the utility of learning from correspondence. Finally, I explore how correspondence can also be used to analyze the 3D awareness and consistency of visual representations learned by large-scale vision models. My analysis suggests that while current approaches yield good models of semantics and localization, their 3D awareness remains limited.
dc.language.iso: en_US
dc.subject: Computer Vision
dc.subject: Self-supervised Learning
dc.subject: Representation Learning
dc.subject: Correspondence
dc.title: Learning Visual Representations from Cross-Modal Correspondence
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Computer Science & Engineering
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Johnson, Justin Christopher
dc.contributor.committeemember: Owens, Andrew
dc.contributor.committeemember: Efros, Alexei
dc.contributor.committeemember: Fouhey, David Ford
dc.contributor.committeemember: Yu, Stella
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbtoplevel: Engineering
dc.contributor.affiliationumcampus: Ann Arbor
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/193255/1/mbanani_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/22900
dc.identifier.orcid: 0000-0003-4686-6048
dc.identifier.name-orcid: El Banani, Mohamed; 0000-0003-4686-6048
dc.working.doi: 10.7302/22900
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

