Work Description

Title: LifeQA dataset features Open Access Deposited

Methodology
  • To collect this dataset, we begin by searching for videos on YouTube, using manually chosen keywords that lead to videos of people living out their daily lives in varied settings (e.g., "my morning routine," "dialogue," "kids playing," "class in elementary school" and "watching TV"). We then hand-pick 59 such videos based on the condition that they must contain recordings of natural interactions in natural settings. We explicitly exclude videos that do not contain language interactions. The identification of such videos turns out to be a challenging task, requiring significant manual effort. This is primarily because most of the recordings available online are in the form of vlogs, which include video recordings with voice-overs and are, therefore, not typical of natural interactions. We manually split the source videos into 275 video clips so that each clip includes coherent scenes lasting 1-2 minutes. We obtain transcriptions for the video clips using the Google Cloud Speech-to-Text platform. We also collect manual transcriptions for each video. Next, two annotators write five questions per video. For each question, we ask the annotators to write the correct answer to the question as well as three distractors (which we define as incorrect but semantically related answers). The annotators are instructed to formulate a diverse set of questions, which require an understanding of both the visual and linguistic content of the videos. We then instruct a third annotator to merge the two sets of questions from the original annotators, manually eliminate duplicate questions, and correct typographical errors. In total, we collect 2,326 questions using this procedure. For more information, refer to https://lit.eecs.umich.edu/lifeqa/.

  • The features present here are extracted for the following baselines: * Question + Vision. We use two variants of ST-VQA (Jang et al., 2017). Both encode the video using a CNN followed by an LSTM, whose final hidden state is then used as in ST-VQA-Text. ST-VQA-Tp. uses the concatenation of the output of an ImageNet (Deng et al., 2009) pretrained ResNet152 (He et al., 2016) pool5 layer and of a Sports1M (Karpathy et al., 2014) pretrained C3D (Tran et al., 2015) fc6 layer as the video encoder input. ST-VQA-Sp.Tp. computes a spatial attention map to decide what parts of the image are most useful and uses the res5c and conv5b layers of the two CNN encoders. Both use temporal attention maps to pool important information across video frames. We also tried a variant that uses RGB-I3D (Carreira and Zisserman, 2017) (with the avg_pool and mixed_5c layers, respectively) instead of C3D, pretrained on ImageNet and Kinetics, but do not report it because we obtained similar results.

  • * Question + Transcriptions + Vision. We implement two neural models that use all modalities, TVQA (Lei et al., 2018) and MovieQA (Tapaswi et al., 2016). Both models use object detection networks to identify visual concepts in the corresponding video frames, allowing them to make use of the visual modality. For both we use as visual inputs the output predictions of a Faster R-CNN (Ren et al., 2015) object detection model pretrained on Visual Genome (Krishna et al., 2017). For more information, refer to  https://lit.eecs.umich.edu/lifeqa/.
Description
  • We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research. For more information, refer to  https://lit.eecs.umich.edu/lifeqa/.
Creator
  • Castro, Santiago
  • Azab, Mahmoud
  • Stroud, Jonathan C.
  • Noujaim, Cristina
  • Wang, Ruoyao
  • Deng, Jia
  • Mihalcea, Rada
Creator ORCID iD
Depositor
Contact information
  • Santiago Castro
Discipline
Funding agency
  • Other Funding Agency
Other Funding agency
  • Toyota Research Institute
Citations to related material
  • Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358). https://aclanthology.org/2020.lrec-1.536/
Resource type
  • Dataset
Last modified
  • 03/07/2025
Published
  • 03/07/2025
DOI
  • https://doi.org/10.7302/nbj0-np80
License
  • Creative Commons Attribution 4.0 International (CC BY 4.0)
To Cite this Work:
Castro, S., Azab, M., Stroud, J. C., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2025). LifeQA dataset features [Data set]. University of Michigan - Deep Blue Data. https://doi.org/10.7302/nbj0-np80

Relationships

This work is not a member of any user collections.

Files (Count: 88; Size: 409 GB)

Date: 11 May, 2020

Dataset Title: LifeQA: A Real-Life Dataset for Video Question Answering

Dataset Creators: Santiago Castro, Mahmoud Azab, Jonathan C. Stroud, Cristina Noujaim, Ruoyao Wang, Jia Deng, Rada Mihalcea

Dataset Contact: Santiago Castro

Funding: N024988 (Toyota Research Institute, TRI)

Research Overview:
We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research.

Methodology:
To collect this dataset, we begin by searching for videos on YouTube, using manually chosen keywords that lead to videos of people living out their daily lives in varied settings (e.g., "my morning routine," "dialogue," "kids playing," "class in elementary school" and "watching TV"). We then hand-pick 59 such videos based on the condition that they must contain recordings of natural interactions in natural settings. We explicitly exclude videos that do not contain language interactions. The identification of such videos turns out to be a challenging task, requiring significant manual effort. This is primarily because most of the recordings available online are in the form of vlogs, which include video recordings with voice-overs and are, therefore, not typical of natural interactions. We manually split the source videos into 275 video clips so that each clip includes coherent scenes lasting 1-2 minutes. We obtain transcriptions for the video clips using the Google Cloud Speech-to-Text platform. We also collect manual transcriptions for each video. Next, two annotators write five questions per video. For each question, we ask the annotators to write the correct answer to the question as well as three distractors (which we define as incorrect but semantically related answers). The annotators are instructed to formulate a diverse set of questions, which require an understanding of both the visual and linguistic content of the videos. We then instruct a third annotator to merge the two sets of questions from the original annotators, manually eliminate duplicate questions, and correct typographical errors. In total, we collect 2,326 questions using this procedure.
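
As a rough illustration only (not the authors' code), the automatic transcription step above could be run with the Google Cloud Speech-to-Text Python client roughly as sketched below; the storage path, audio encoding, sample rate, and other settings are hypothetical placeholders, not values from the LifeQA pipeline.

    from google.cloud import speech

    # Hypothetical sketch: transcribe one clip's audio stored in a GCS bucket.
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri="gs://your-bucket/clip_001.wav")
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_automatic_punctuation=True,
    )

    # Long-running recognition handles clips longer than about a minute.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=600)

    transcript = " ".join(r.alternatives[0].transcript for r in response.results)
    print(transcript)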

The features present here are extracted for the following baselines:

* Question + Vision. We use two variants of ST-VQA (Jang et al., 2017). Both encode the video using a CNN followed by an LSTM, whose final hidden state is then used as in ST-VQA-Text. ST-VQA-Tp. uses the concatenation of the output of an ImageNet (Deng et al., 2009) pretrained ResNet152 (He et al., 2016) pool5 layer and of a Sports1M (Karpathy et al., 2014) pretrained C3D (Tran et al., 2015) fc6 layer as the video encoder input. ST-VQA-Sp.Tp. computes a spatial attention map to decide what parts of the image are most useful and uses the res5c and conv5b layers of the two CNN encoders. Both use temporal attention maps to pool important information across video frames. We also tried a variant that uses RGB-I3D (Carreira and Zisserman, 2017) (with the avg_pool and mixed_5c layers, respectively) instead of C3D, pretrained on ImageNet and Kinetics, but do not report it because we obtained similar results. (A minimal sketch of this kind of video encoder follows this list.)
* Question + Transcriptions + Vision. We implement two neural models that use all modalities, TVQA (Lei et al., 2018) and MovieQA (Tapaswi et al., 2016). Both models use object detection networks to identify visual concepts in the corresponding video frames, allowing them to make use of the visual modality. For both we use as visual inputs the output predictions of a Faster R-CNN (Ren et al., 2015) object detection model pretrained on Visual Genome (Krishna et al., 2017).
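
For intuition only, an ST-VQA-Tp.-style video stream could be sketched as below. This is not the authors' implementation: the class name, time-step alignment, and hidden size are assumptions; only the feature dimensionalities (2048 for ResNet152 pool5, 4096 for C3D fc6) follow the standard models named above.

    import torch
    import torch.nn as nn

    class STVQATpVideoEncoderSketch(nn.Module):
        """Hypothetical sketch: per-frame ResNet152 pool5 features concatenated
        with C3D fc6 features, fed to an LSTM whose final hidden state serves
        as the video representation."""

        def __init__(self, resnet_dim=2048, c3d_dim=4096, hidden_dim=512):
            super().__init__()
            self.lstm = nn.LSTM(resnet_dim + c3d_dim, hidden_dim, batch_first=True)

        def forward(self, pool5_feats, fc6_feats):
            # pool5_feats: (batch, steps, 2048), e.g. from LifeQA_RESNET_pool5.hdf5
            # fc6_feats:   (batch, steps, 4096), e.g. from LifeQA_C3D_fc6.hdf5
            # Trim both streams to a common number of time steps before
            # concatenating; the alignment in the actual baseline may differ.
            steps = min(pool5_feats.size(1), fc6_feats.size(1))
            video = torch.cat([pool5_feats[:, :steps], fc6_feats[:, :steps]], dim=-1)
            _, (h_n, _) = self.lstm(video)
            return h_n[-1]  # final hidden state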

For more information, refer to https://lit.eecs.umich.edu/lifeqa/.

Files contained here:
The files contain pre-computed features (primarily in HDF5 format) for the videos in our dataset, together with the pre-trained model weights used to extract them; a loading sketch follows the list:
* c3d.pickle: the weights of a Sports1M-pre-trained C3D model.
* i3d.pt: the weights of a pre-trained RGB-I3D model (pre-trained on ImageNet and Kinetics).
* LifeQA_C3D_conv5b.hdf5{00..08}: features from the conv5b layer output from the Sports1M-pre-trained C3D model.
* LifeQA_C3D_fc6.hdf5{00..01}: features from the fc6 layer output from the Sports1M-pre-trained C3D model.
* LifeQA_C3D_fc7.hdf5{00..01}: features from the fc7 layer output from the Sports1M-pre-trained C3D model.
* LifeQA_I3D_avg_pool.hdf5: features from the avg_pool layer output from the pre-trained RGB-I3D model (pre-trained on ImageNet and Kinetics).
* LifeQA_RESNET_pool5.hdf5: features from the pool5 layer output from the ImageNet-pre-trained ResNet152 model.
* LifeQA_RESNET_res5c.hdf5{00..44}: features from the res5c layer output from the ImageNet-pre-trained ResNet152 model.
* LifeQA_RESOF_pool5.hdf5: features from the pool5 layer output from the ImageNet-pre-trained ResNet152 model, applied to Gunnar Farnebäck's optical flow.
* visual-genome.tar.gz{00..11}: the list of detected objects using a Visual-Genome-pre-trained Faster R-CNN.
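
The numbered suffixes above suggest that each large HDF5 file is distributed as byte-level parts. Under that assumption, a minimal sketch for reassembling one file and inspecting it with h5py could look like the following; how the datasets are keyed inside the file (e.g., one entry per video clip) is also an assumption, so the sketch lists a few keys first.

    import glob
    import shutil
    import h5py

    # Reassemble the numbered parts into one HDF5 file (assuming the *.hdf5NN
    # files are plain byte-level splits of a single file, e.g. made with `split`).
    parts = sorted(glob.glob("LifeQA_C3D_fc6.hdf5[0-9][0-9]"))
    with open("LifeQA_C3D_fc6.hdf5", "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

    # Inspect the result: print a few dataset keys with their shapes and dtypes.
    with h5py.File("LifeQA_C3D_fc6.hdf5", "r") as f:
        for key in list(f.keys())[:5]:
            print(key, f[key].shape, f[key].dtype)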

These features were extracted with the script https://github.com/mmazab/LifeQA/blob/master/feature_extraction/extract_features.py

Related publication(s):
Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358).

Use and Access:
The features (and the list of detected objects) are made available under a Creative Commons Attribution 4.0 International license (CC BY 4.0). For the model weights, please refer to the original publications.

To Cite Data:
Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358).

The total work file size of 409 GB is too large to download directly; consider using Globus, the platform Deep Blue Data uses to make large data sets (over 3 GB) available.
