Language Supervision for Computer Vision

Desai, Karan Prakash

dc.contributor.author	Desai, Karan Prakash
dc.date.accessioned	2024-05-22T17:21:41Z
dc.date.available	2024-05-22T17:21:41Z
dc.date.issued	2024
dc.date.submitted	2024
dc.identifier.uri	https://hdl.handle.net/2027.42/193220
dc.description.abstract	Representation learning lies at the core of modern Artificial Intelligence. In computer vision, labeled image datasets like ImageNet have been the standard choice for representation learning. Despite being empirically successful, this approach is expensive to scale due to labeling costs. Moreover, the representation quality is limited by the size and diversity of datasets and their associated label ontologies. My research explores using natural language supervision for computer vision. Using natural language allows us to go beyond fixed label ontologies and scale up to more general sources such as internet data. Toward this goal, my dissertation explores four problems: (1) Learning representations: I propose one of the first methods for language-supervised visual learning that uses image captioning as the training objective, showing its efficacy compared to ImageNet-trained methods on downstream tasks like object detection and segmentation. (2) Scaling data: I explore social media as a rich source of high-quality image descriptions and curate a dataset of 12 million image-text pairs while ensuring responsible curation practices. (3) Understanding data: It is difficult to comprehend the diversity of visual concepts present in millions of image-text pairs. I posit that images and text naturally organize into a tree-like hierarchy and propose an approach for learning representations that capture this hierarchy using tools from hyperbolic geometry. (4) Transfer to downstream tasks: Large vision-language models show impressive zero-shot transfer capabilities on image-level tasks like classification and retrieval. However, their transferability to pixel-level tasks like object detection and segmentation has relied on expensive labeled mask annotations. I propose an object detector to efficiently transfer pre-trained vision models to segment and classify visual objects without any fine-tuning, unlike existing detectors that train using orders of magnitude more labeled masks to achieve high performance. In summary, my research affirms that using language supervision can drive the next leap of progress in computer vision and has immense utility in practical applications.
dc.language.iso	en_US
dc.subject	Use natural language as a supervisory signal with visual data to train computer vision models.
dc.title	Language Supervision for Computer Vision
dc.type	Thesis
dc.description.thesisdegreename	PhD
dc.description.thesisdegreediscipline	Computer Science & Engineering
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Johnson, Justin Christopher
dc.contributor.committeemember	Owens, Andrew
dc.contributor.committeemember	Baldridge, Jason
dc.contributor.committeemember	Yu, Stella
dc.subject.hlbsecondlevel	Computer Science
dc.subject.hlbtoplevel	Engineering
dc.contributor.affiliationumcampus	Ann Arbor
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/193220/1/kdexd_1.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/22865
dc.identifier.orcid	0009-0000-9739-3047
dc.identifier.name-orcid	Desai, Karan; 0009-0000-9739-3047	en_US
dc.working.doi	10.7302/22865	en
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: kdexd_1.pdf
Size: 23.09MB
Format: PDF
