Towards Human Action Understanding in Social Media Videos Using Multimodal Models
dc.contributor.author | Ignat, Oana | |
dc.date.accessioned | 2022-09-06T16:25:59Z | |
dc.date.available | 2022-09-06T16:25:59Z | |
dc.date.issued | 2022 | |
dc.date.submitted | 2022 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/174613 | |
dc.description.abstract | Human action understanding is one of the most impactful and challenging tasks a computer system can perform. Once a computer system learns how to interact with humans, it can assist us in our everyday activities and significantly improve our quality of life. Despite the attention it has received in fields such as Natural Language Processing and Computer Vision, and the significant strides towards accurate and robust action recognition and localization systems, human action understanding remains an unsolved problem. In this thesis, we introduce and analyze how models can learn from multimodal data, i.e., from what humans say and do while performing their everyday activities. As a step towards endowing systems with a richer understanding of human actions in online videos, this thesis proposes new techniques that rely on the vision and language channels to address four important challenges: i) human action visibility identification in online videos, ii) temporal human action localization in online videos, iii) human action reason identification in online videos, and iv) human action co-occurrence identification. We focus on the widespread genre of lifestyle vlogs, which consist of videos of people performing actions while verbally describing them. We construct a dataset with crowdsourced manual annotations of visible actions, temporal action localization, and action reasons in online vlogs. We propose a multimodal unsupervised model that automatically infers the reasons behind an action presented in a video, a simple yet effective method to localize narrated actions based on their expected duration, and a multimodal supervised classification model of action visibility in videos. We also perform ablations on how each modality contributes to solving the tasks, and compare the performance of the multimodal models with that of single-modality models based on the visual content and vlog transcripts. Finally, we present an extensive analysis of this data, which allows for a better understanding of how the language and visual modalities interact throughout the videos and paves the way for rich avenues of future work. | |
dc.language.iso | en_US | |
dc.subject | human action understanding | |
dc.subject | multimodal machine learning models | |
dc.subject | computer vision and natural language processing | |
dc.title | Towards Human Action Understanding in Social Media Videos Using Multimodal Models | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | en_US |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Mihalcea, Rada | |
dc.contributor.committeemember | Owens, Andrew | |
dc.contributor.committeemember | Caba Heilbron, Fabian | |
dc.contributor.committeemember | Chai, Joyce | |
dc.contributor.committeemember | Fouhey, David Ford | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/174613/1/oignat_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/6344 | |
dc.identifier.orcid | 0000-0003-0272-5147 | |
dc.identifier.name-orcid | Ignat, Oana; 0000-0003-0272-5147 | en_US |
dc.working.doi | 10.7302/6344 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |