Work Description

Title: LifeQA dataset features Open Access Deposited

Methodology
  • To collect this dataset, we begin by searching for videos on YouTube, using manually chosen keywords that lead to videos of people living out their daily lives in varied settings (e.g., "my morning routine," "dialogue," "kids playing," "class in elementary school" and "watching TV"). We then hand-pick 59 such videos based on the condition that they must contain recordings of natural interactions in natural settings. We explicitly exclude videos that do not contain language interactions. The identification of such videos turns out to be a challenging task, requiring significant manual effort. This is primarily because most of the recordings available online are in the form of vlogs, which include video recordings with voice-overs and are, therefore, not typical of natural interactions. We manually split the source videos into 275 video clips so that each clip includes coherent scenes lasting 1-2 minutes. We obtain transcriptions for the video clips using the Google Cloud Speech-to-Text platform. We also collect manual transcriptions for each video. Next, two annotators write five questions per video. For each question, we ask the annotators to write the correct answer to the question as well as three distractors (which we define as incorrect but semantically related answers). The annotators are instructed to formulate a diverse set of questions, which require an understanding of both the visual and linguistic content of the videos. We then instruct a third annotator to merge the two sets of questions from the original annotators, manually eliminate duplicate questions, and correct typographical errors. In total, we collect 2,326 questions using this procedure. For more information, refer to https://lit.eecs.umich.edu/lifeqa/.

  • The features present here are extracted for the following baselines: * Question + Vision. We use two variants of ST-VQA (Jang et al., 2017). Both encode the video using a CNN followed by an LSTM, whose final hidden state is then used as in ST-VQA-Text. ST-VQA-Tp. uses the concatenation of the output of an ImageNet (Deng et al., 2009) pretrained ResNet152 (He et al., 2016) pool5 layer and of a Sports1M (Karpathy et al., 2014) pretrained C3D (Tran et al., 2015) fc6 layer as the video encoder input. ST-VQA-Sp.Tp. computes a spatial attention map to decide what parts of the image are most useful and uses the res5c and conv5b layers of the two CNN encoders. Both use temporal attention maps to pool important information across video frames. We also tried a variant that uses RGB-I3D (Carreira and Zisserman, 2017) (with the avg_pool and mixed_5c layers, respectively) instead of C3D, pretrained on ImageNet and Kinetics, but do not report it because we obtained similar results.

  • * Question + Transcriptions + Vision. We implement two neural models that use all modalities, TVQA (Lei et al., 2018) and MovieQA (Tapaswi et al., 2016). Both models use object detection networks to identify visual concepts in the corresponding video frames, allowing them to make use of the visual modality. For both we use as visual inputs the output predictions of a Faster R-CNN (Ren et al., 2015) object detection model pretrained on Visual Genome (Krishna et al., 2017). For more information, refer to  https://lit.eecs.umich.edu/lifeqa/.
Description
  • We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research. For more information, refer to  https://lit.eecs.umich.edu/lifeqa/.
Creator
  • Castro, Santiago
  • Azab, Mahmoud
  • Stroud, Jonathan C.
  • Noujaim, Cristina
  • Wang, Ruoyao
  • Deng, Jia
  • Mihalcea, Rada
Creator ORCID iD
Depositor
Contact information
  • Santiago Castro
Discipline
Funding agency
  • Other Funding Agency
Other Funding agency
  • Toyota Research Institute
Citations to related material
  • Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358). https://aclanthology.org/2020.lrec-1.536/
Resource type
  • Dataset
Last modified
  • 03/07/2025
Published
  • 03/07/2025
DOI
  • https://doi.org/10.7302/nbj0-np80
License
  • Creative Commons Attribution 4.0 International (CC BY 4.0)
To Cite this Work:
Castro, S., Azab, M., Stroud, J. C., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2025). LifeQA dataset features [Data set]. University of Michigan - Deep Blue Data. https://doi.org/10.7302/nbj0-np80

Relationships

This work is not a member of any user collections.

Files (Count: 88; Size: 409 GB)

Date: 11 May, 2020

Dataset Title: LifeQA: A Real-Life Dataset for Video Question Answering

Dataset Creators: Santiago Castro, Mahmoud Azab, Jonathan C. Stroud, Cristina Noujaim, Ruoyao Wang, Jia Deng, Rada Mihalcea

Dataset Contact: Santiago Castro

Funding: N024988 (Toyota Research Institute, TRI)

Research Overview:
We introduce LifeQA, a benchmark dataset for video question answering focusing on daily real-life situations. Current video question-answering datasets consist of movies and TV shows. However, it is well-known that these visual domains do not represent our day-to-day lives. Movies and TV shows, for example, benefit from professional camera movements, clean editing, crisp audio recordings, and scripted dialog between professional actors. While these domains provide a large amount of data for training models, their properties make them unsuitable for testing real-life question-answering systems. Our dataset, by contrast, consists of video clips that represent only real-life scenarios. We collect 275 such video clips and over 2.3k multiple-choice questions. In this paper, we analyze the challenging but realistic aspects of LifeQA and apply several state-of-the-art video question-answering models to provide benchmarks for future research.

Methodology:
To collect this dataset, we begin by searching for videos on YouTube, using manually chosen keywords that lead to videos of people living out their daily lives in varied settings (e.g., "my morning routine," "dialogue," "kids playing," "class in elementary school" and "watching TV"). We then hand-pick 59 such videos based on the condition that they must contain recordings of natural interactions in natural settings. We explicitly exclude videos that do not contain language interactions. The identification of such videos turns out to be a challenging task, requiring significant manual effort. This is primarily because most of the recordings available online are in the form of vlogs, which include video recordings with voice-overs and are, therefore, not typical of natural interactions. We manually split the source videos into 275 video clips so that each clip includes coherent scenes lasting 1-2 minutes. We obtain transcriptions for the video clips using the Google Cloud Speech-to-Text platform. We also collect manual transcriptions for each video. Next, two annotators write five questions per video. For each question, we ask the annotators to write the correct answer to the question as well as three distractors (which we define as incorrect but semantically related answers). The annotators are instructed to formulate a diverse set of questions, which require an understanding of both the visual and linguistic content of the videos. We then instruct a third annotator to merge the two sets of questions from the original annotators, manually eliminate duplicate questions, and correct typographical errors. In total, we collect 2,326 questions using this procedure.
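
As a rough illustration only (not the authors' code), the automatic transcription step above could be run with the Google Cloud Speech-to-Text Python client roughly as sketched below; the storage path, audio encoding, sample rate, and other settings are hypothetical placeholders, not values from the LifeQA pipeline.

    from google.cloud import speech

    # Hypothetical sketch: transcribe one clip's audio stored in a GCS bucket.
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri="gs://your-bucket/clip_001.wav")
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_automatic_punctuation=True,
    )

    # Long-running recognition handles clips longer than about a minute.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=600)

    transcript = " ".join(r.alternatives[0].transcript for r in response.results)
    print(transcript)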

The features present here are extracted for the following baselines:

* Question + Vision. We use two variants of ST-VQA (Jang et al., 2017). Both encode the video using a CNN followed by an LSTM, whose final hidden state is then used as in ST-VQA-Text. ST-VQA-Tp. uses the concatenation of the output of an ImageNet (Deng et al., 2009) pretrained ResNet152 (He et al., 2016) pool5 layer and of a Sports1M (Karpathy et al., 2014) pretrained C3D (Tran et al., 2015) fc6 layer as the video encoder input. ST-VQA-Sp.Tp. computes a spatial attention map to decide what parts of the image are most useful and uses the res5c and conv5b layers of the two CNN encoders. Both use temporal attention maps to pool important information across video frames. We also tried a variant that uses RGB-I3D (Carreira and Zisserman, 2017) (with the avg_pool and mixed_5c layers, respectively) instead of C3D, pretrained on ImageNet and Kinetics, but do not report it because we obtained similar results. (A minimal sketch of this kind of video encoder follows this list.)
* Question + Transcriptions + Vision. We implement two neural models that use all modalities, TVQA (Lei et al., 2018) and MovieQA (Tapaswi et al., 2016). Both models use object detection networks to identify visual concepts in the corresponding video frames, allowing them to make use of the visual modality. For both we use as visual inputs the output predictions of a Faster R-CNN (Ren et al., 2015) object detection model pretrained on Visual Genome (Krishna et al., 2017).
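
For intuition only, an ST-VQA-Tp.-style video stream could be sketched as below. This is not the authors' implementation: the class name, time-step alignment, and hidden size are assumptions; only the feature dimensionalities (2048 for ResNet152 pool5, 4096 for C3D fc6) follow the standard models named above.

    import torch
    import torch.nn as nn

    class STVQATpVideoEncoderSketch(nn.Module):
        """Hypothetical sketch: per-frame ResNet152 pool5 features concatenated
        with C3D fc6 features, fed to an LSTM whose final hidden state serves
        as the video representation."""

        def __init__(self, resnet_dim=2048, c3d_dim=4096, hidden_dim=512):
            super().__init__()
            self.lstm = nn.LSTM(resnet_dim + c3d_dim, hidden_dim, batch_first=True)

        def forward(self, pool5_feats, fc6_feats):
            # pool5_feats: (batch, steps, 2048), e.g. from LifeQA_RESNET_pool5.hdf5
            # fc6_feats:   (batch, steps, 4096), e.g. from LifeQA_C3D_fc6.hdf5
            # Trim both streams to a common number of time steps before
            # concatenating; the alignment in the actual baseline may differ.
            steps = min(pool5_feats.size(1), fc6_feats.size(1))
            video = torch.cat([pool5_feats[:, :steps], fc6_feats[:, :steps]], dim=-1)
            _, (h_n, _) = self.lstm(video)
            return h_n[-1]  # final hidden state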

For more information, refer to https://lit.eecs.umich.edu/lifeqa/.

Files contained here:
The files contain pre-computed features (primarily in HDF5 format) for the videos in our dataset, together with the pre-trained model weights used to extract them; a loading sketch follows the list:
* c3d.pickle: the weights of a Sports1M-pre-trained C3D model.
* i3d.pt: the weights of a pre-trained RGB-I3D model (pre-trained on ImageNet and Kinetics).
* LifeQA_C3D_conv5b.hdf5{00..08}: features from the conv5b layer output from the Sports1M-pre-trained C3D model.
* LifeQA_C3D_fc6.hdf5{00..01}: features from the fc6 layer output from the Sports1M-pre-trained C3D model.
* LifeQA_C3D_fc7.hdf5{00..01}: features from the fc7 layer output from the Sports1M-pre-trained C3D model.
* LifeQA_I3D_avg_pool.hdf5: features from the avg_pool layer output from the pre-trained RGB-I3D model (pre-trained on ImageNet and Kinetics).
* LifeQA_RESNET_pool5.hdf5: features from the pool5 layer output from the ImageNet-pre-trained ResNet152 model.
* LifeQA_RESNET_res5c.hdf5{00..44}: features from the res5c layer output from the ImageNet-pre-trained ResNet152 model.
* LifeQA_RESOF_pool5.hdf5: features from the pool5 layer output from the ImageNet-pre-trained ResNet152 model, applied to Gunnar Farnebäck's optical flow.
* visual-genome.tar.gz{00..11}: the list of detected objects using a Visual-Genome-pre-trained Faster R-CNN.
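
The numbered suffixes above suggest that each large HDF5 file is distributed as byte-level parts. Under that assumption, a minimal sketch for reassembling one file and inspecting it with h5py could look like the following; how the datasets are keyed inside the file (e.g., one entry per video clip) is also an assumption, so the sketch lists a few keys first.

    import glob
    import shutil
    import h5py

    # Reassemble the numbered parts into one HDF5 file (assuming the *.hdf5NN
    # files are plain byte-level splits of a single file, e.g. made with `split`).
    parts = sorted(glob.glob("LifeQA_C3D_fc6.hdf5[0-9][0-9]"))
    with open("LifeQA_C3D_fc6.hdf5", "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

    # Inspect the result: print a few dataset keys with their shapes and dtypes.
    with h5py.File("LifeQA_C3D_fc6.hdf5", "r") as f:
        for key in list(f.keys())[:5]:
            print(key, f[key].shape, f[key].dtype)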

These features were extracted with the script https://github.com/mmazab/LifeQA/blob/master/feature_extraction/extract_features.py

Related publication(s):
Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358).

Use and Access:
The features (and the list of detected objects) are made available under a Creative Commons Attribution 4.0 International license (CC BY 4.0). For the model weights, please refer to the original publications.

To Cite Data:
Castro, S., Azab, M., Stroud, J., Noujaim, C., Wang, R., Deng, J., & Mihalcea, R. (2020, May). LifeQA: A real-life dataset for video question answering. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 4352-4358).

The total work file size of 409 GB is too large to download directly; consider using Globus, the platform Deep Blue Data uses to make large data sets (over 3 GB) available.
