Index Catalog // Deep Blue Data

China urban carbon offsets (CDM) datasets

Creator:: Benjamin Leffel
Description:: The three datasets provided here identify the city location of all CDM projects in China by referencing the individual Project Description Documents (via the UNFCCC) attached to each project. Through this method, all 3,764 Clean Development Mechanism projects in China are identified at the city level, out of a total of over 8,000 projects globally.
Keyword:: climate finance, carbon offset, China, urban, clean development mechanism, cities, and climate change
Citation to related publication:: https://escholarship.org/uc/item/3vr8850s and https://doi.org/10.1002/wcc.709
Discipline:: Government, Politics and Law, Social Sciences, International Studies, and Business

Data for Service Providers' Influence in Collaborative Governance Networks: Effectiveness in Reducing Chronic Homelessness

Creator:: Mosley, Jennifer and Park, Sunggeun
Description:: This data set is comprised of publicly available data from three HUD websites and the 2014 National Continuum of Care (CoC) Survey questionnaire and protocol. The HUD data sets are comprised of Community Planning and Development (CPD) Awards information from 2005-2014, demographic information on areas served by CoC sites (sub-region estimates from the 2006-2010 American Community Survey), and Housing Inventory Count (HIC) and Point-in-Time (PIT) counts per CoC from 2007-2015. The data are associated with the article "Service Providers' Influence in Collaborative Governance Networks: Effectiveness in Reducing Chronic Homelessness," conditionally accepted for publication in the Journal of Public Administration Research and Theory (JPART).
Discipline:: Social Sciences

Impulse Buying: Designing for Self-Control with E-commerce

Creator:: Moser, Carol, Schoenebeck, Sarita, and Resnick, Paul
Description:: These data, survey instruments (including informed consent) and analysis scripts come from Carol Moser's dissertation titled, Impulse Buying: Designing for Self-Control with E-commerce.
Keyword:: Impulse Buying, Self-control, and Experimental Design
Discipline:: Social Sciences

Magnetometer Survey Data from the Weeden Island Site (8Pi1), Florida (2013-2014)

Creator:: Horsley, Timothy J. and Sampson, Christina P.
Description:: The data (raw data, composite files [processed], and some images) can be read by the program TerraSurveyor. Version 3.0.34.10 of the software was used to create the composite files in this deposit. The magnetometer survey was the second step in a geophysical survey program that began with a magnetic susceptibility survey of a portion of the Weedon Island Preserve in St. Petersburg, Florida. Geophysical survey was used to map human occupation of the study area and to guide subsequent archaeological excavations.
Keyword:: magnetometry, geophysical survey, remote sensing, Florida archaeology, and coastal archaeology
Citation to related publication:: Sampson, C. P. (2019) Safety Harbor at the Weeden Island Site: Late Pre-Columbian Craft, Community, and Complexity on Florida's Gulf Coast. PhD Dissertation, University of Michigan. and Sampson, Christina Perry and Timothy J. Horsley. Using Multi-Staged Magnetic Survey and Excavation to Assess Community Settlement Organization: A Case Study from the Central Peninsular Gulf Coast of Florida. Advances in Archaeological Practice. Cambridge University Press: 18 December 2019. https://doi.org/10.1017/aap.2019.45
Discipline:: Science and Social Sciences

Judgment Accuracy Experiment Data

Creator:: Yan, Haoyang MI and Yates, J Frank
Description:: The research aims to demonstrate judgment accuracy about other individuals' socio-political opinions produced by prototype matching and base/shift heuristics or people's own strategies (control).
Keyword:: judgment accuracy and socio-political opinions
Discipline:: Social Sciences

The Lannang Corpus (LanCorp): A POS-tagged, sociolinguistic corpus containing recordings and transcriptions of Lannang speech collected from the metropolitan Manila Lannangs between 2016 and 2020

Creator:: Gonzales, Wilkinson Daniel Wong
Description:: The Lannang Corpus (LanCorp) is a sociolinguistic POS-tagged 375,000-word speech-and-text corpus of Lannang languages based on audio recordings collected in metropolitan Manila between 2016 and 2020. It hopes to furnish scholars interested in Sino-Philippine (socio)linguistics with a contemporary, multilingual corpus (i.e., Hokkien, Tagalog, English, Lánnang-uè, Mandarin) compiled using recorded oral data primarily collected from a Sino-Philippine community in metropolitan Manila by the community: the Manila Lannangs. The publicly available corpus contains manual transcriptions (time-aligned to the audio), source language and part-of-speech tags derived using a mix of manual and computational methods, and a wide range of social metadata; it is also organized and stored systematically for easy data retrieval and (socio)linguistic analysis. Although there are existing sociolinguistic corpora, they are small in scale and were not released publicly due to lack of informant consent – LanCorp readily fills the gap.
Keyword:: Lannang, Chinese Filipino, Filipino-Chinese, Hokkien, diaspora, mixed language, recordings, oral variety, multilingual, corpus, data, dataset, databank, LanCorp, Lannang Corpus, sociolinguistics, and ELAN
Citation to related publication:: [1] Gonzales, Wilkinson Daniel Wong. 2021. Interactions of Sinitic languages in the Philippines: Sinicization, Filipinization, and Sino-Philippine language creation. The Palgrave handbook of Chinese language studies, ed. by Zhengdao Ye. London: Palgrave-MacMillan. [2] Gonzales, Wilkinson Daniel Wong. 2021. Filipino, Chinese, neither, or both? The Lannang identity and its relationship with language. Language & Communication 77. [3] Gonzales, Wilkinson Daniel Wong. 2022. “Truly a Language of Our Own” A Corpus-Based, Experimental, and Variationist Account of Lánnang-uè in Manila. Ann Arbor: University of Michigan Ph.D. dissertation. [4] Gonzales, Wilkinson Daniel Wong. 2022. Hybridization. Philippine English: Development, Structure, and Sociology of English in the Philippines, ed. by Ariane Macalinga Borlongan. London: Routledge. [5] Gonzales, Wilkinson Daniel Wong. in preparation. Advancing Sino-Philippine (socio)linguistics using the Lannang Corpus (LanCorp) – a multilingual, POS-tagged, and audio-textual databank.
Discipline:: Social Sciences

Human challenge study dataset 2015

Creator:: Hero, Alfred O, Zhai, Yaya, Burke, Thomas, Doraiswamy, Murali, Ginsburg, Geoffrey S, Henao, Ricardo, Turner, Ronald B, and Woods, Christopher W
Description:: The data deposited here are as follows. The clinical shedding/symptom data, RNAseq, steroid, and wearable E4 data were partially presented in publications [1]-[3], and the cognitive lumos and VAFS data are presented in paper [4], which is under review and embargoed. The data files are subject.json, sample.json, and genematrix_TPM.csv. In addition, a copy of the blank consent form used to enroll volunteers in the study is included (17964_Adult Consent_2015Mar17-Mod 1_clean.pdf).
Clinical symptom and viral shedding data (subject.json): reports each subject's accumulated and maximum self-reported symptom score (modified Jackson score) and shedding titrations from nasal-pharyngeal washes after inoculation.
RNAseq data (genematrix_TPM.csv): Whole blood was collected in PAXgene™ Blood RNA tubes (PreAnalytiX), and total RNA was extracted using the PAXgene™ Blood miRNA Kit (QIAGEN) following the manufacturer’s recommended protocol. RNA quantity and quality were assessed using a Nanodrop 2000 spectrophotometer (Thermo-Fisher) and a Bioanalyzer 2100 with RNA 6000 Nano Chips (Agilent). RNA sequencing libraries were prepared using the Illumina TruSeq mRNA Library Kit with RiboZero Globin depletion, and sequenced on an Illumina NextSeq sequencer with 50bp paired-end reads (target 40M reads per sample). After demultiplexing to FASTQ paired-end read counts files, the 396 samples were TPM transformed using HISAT2 software with the reference genome Homo_sapiens.GRCh38.84. Each sample corresponds to one of the 18 subjects at one of 22 time points. One of these samples was of insufficient quality to be mapped to read counts. In addition to the TPM normalized RNAseq data contained in this repository, the raw FASTQ data for the 395 samples are deposited in the GEO repository (https://www.ncbi.nlm.nih.gov/geo), Accession # GSE215087.
Cognitive data (sample.json): outcomes from a NeuroCognitive Performance Test (NCPT) that was taken approximately 3 times daily by all volunteers. The NCPT is a repeatable, web-based, computerized, cognitive assessment platform designed to measure subtle changes in performance across multiple cognitive domains. Subject scores on 18 cognitive variables were collected at approximately 22 time points during the challenge study. The data structure sample.json contains the raw cognitive data and the extracted 18 cognitive scores over time for each subject.
Visual Analog Fatigue Scale (sample.json): the VAFS is a measure of cognitive fatigue that was measured approximately 3 times per day at the same time as the NCPT and blood draw.
Wearable device data (sample.json): participants wore an Empatica E4 device for the duration of the challenge study. Summarized features are provided for each subject that include sleep duration (mean and std), sleep offset (mean and std), and temperature (mean and std).
Steroid data (sample.json): collected from the whole blood samples, consisting of cortisol, melatonin, and DHEAS.
See README.txt for more specific details on the data structures contained in the sample.json, subject.json, and genematrix_TPM.csv files.
Keyword:: human challenge study and cognitive health and immunity
Citation to related publication:: [1] X She, Y Zhai, R Henao, CW Woods, C Chiu, Geoffrey S. Ginsburg, Peter X.K. Song, AO. Hero, “Adaptive multi-channel event segmentation and feature extraction for monitoring health outcomes,” IEEE Transactions on Biomedical Engineering, vol. 68, no. 8, pp. 2377-2388, Aug. 2021, doi: 10.1109/TBME.2020.3038652. Available on arxiv:2008.09215. [2] Emilia Grzesiak, Brinnae Bent, Micah T. McClain, Christopher W. Woods, Ephraim L. Tsalik, Bradly P. Nicholson, Timothy Veldman, Thomas W. Burke, Zoe Gardener, Emma Bergstrom, Ronald B. Turner, Christopher Chiu, P. Murali Doraiswamy, Alfred Hero, Ricardo Henao, Geoffrey S. Ginsburg, Jessilyn Dunn. Assessment of the Feasibility of Using Noninvasive Wearable Biometric Monitoring Sensors to Detect Influenza and the Common Cold Before Symptom Onset. JAMA Netw Open. 2021;4(9):e2128534. doi:10.1001/jamanetworkopen.2021.28534. [3] E Sabeti, S Oh, PX Song, A Hero. “A Pattern Dictionary Method for Anomaly Detection,” Entropy, vol 24, pp. 1095 Aug 2022. doi: 10.3390/e24081095. [4] Yaya Zhai, P. Murali Doraiswamy, Christopher W. Woods, Ronald B. Turner, Thomas W. Burke, Geoffrey S. Ginsburg, Alfred O. Hero, "Pre-exposure cognitive performance variability is associated with severity of respiratory infection," manuscript under review.
Discipline:: Health Sciences and Social Sciences

Personal and Social Media Data Survey

Creator:: Hemphill, Libby
Description:: Social media data offer a rich resource for researchers interested in public health, labor economics, politics, social behaviors, and other topics. However, scale and anonymity mean that researchers often cannot directly get permission from users to collect and analyze their social media data. This article applies the basic ethical principle of respect for persons to consider individuals’ perceptions of acceptable uses of data. We compare individuals' perceptions of acceptable uses of other types of sensitive data, such as health records and individual identifiers, with their perceptions of acceptable uses of social media data. Our survey of 1018 people shows that individuals think of their social media data as moderately sensitive and agree that it should be protected. Respondents are generally okay with researchers using their data in social research but prefer that researchers clearly articulate benefits and seek explicit consent before conducting research. We argue that researchers must ensure that their research provides social benefits worthy of individual risks and that they must address those risks throughout the research process.
Keyword:: social media, data ethics, and data reuse
Discipline:: Social Sciences

Interview Protocol on the Use of Information Spaces to Navigate Identity

Creator:: Schöpke-Gonzalez, Angela M., Thomer, Andrea K., and Conway, Paul
Description:: This interview protocol was designed to investigate the research question: How do self-identified refugees in the receiving societies of Greece and Germany engage with information spaces to navigate identity during liminal and post-liminal portions of their refugee experiences?
Keyword:: information space, identity, liminality, and migration
Citation to related publication:: Schöpke-Gonzalez, A., Thomer, A., & Conway, P. (2020). Identity Navigation During Refugee Experiences: The International Journal of Information, Diversity, & Inclusion (IJIDI), 4(2), 36–67. https://doi.org/10.33137/ijidi.v4i2.33151
Discipline:: Social Sciences

Quantifying News Media Bias through Crowdsourcing and Machine Learning Dataset

Creator:: Budak, Ceren, Goel, Sharad, and Rao, Justin M
Description:: Our primary analysis is based on articles published in 2013 by the top thirteen US news outlets and two popular political blogs. To compile the set of articles published by these outlets, we first examined the complete web-browsing records for US-located users who installed the Bing Toolbar, an optional add-on application for the Internet Explorer web browser. For each of the fifteen news sites, we recorded all unique URLs that were visited by at least ten toolbar users, and we then crawled the news sites to obtain the full article title and text. This process resulted in a corpus of 803,146 articles published on the fifteen news sites over the course of a year, with each article annotated with its relative popularity.
Next, we built two binary classifiers using large-scale logistic regression. The first classifier, which we refer to as the news classifier, identifies “news” articles (i.e., articles that would typically appear in the front section of a traditional newspaper). The second classifier, the politics classifier, identifies political news from the subset of articles identified as news by the first classifier. Of the 803,146 articles, 340,191 (42 percent) were classified as news. Of these 340,191 news articles, 114,814 (34 percent) were classified as political.
Having identified approximately 115,000 political news articles, we next seek to categorize the articles by topic (e.g., gay rights, healthcare, etc.), and to quantify the political slant of each article. To do so, we turn to human judges recruited via Mechanical Turk to analyze the articles. For every day in 2013, we randomly selected two political articles, when available, from each of the 15 outlets we study, with sampling weights equal to the number of times the article was visited by our panel of toolbar users.
Amazon Mechanical Turk labeling task: To detect and control for possible preconceptions of an outlet’s ideological slant, workers, upon first entering the experiment, were randomly assigned to either a blinded or unblinded condition. In the blinded condition, workers were presented with only the article’s title and text, whereas in the unblinded condition, they were additionally shown the name of the outlet in which the article was published. Each article was then analyzed by two workers, one each from the sets of workers in the two conditions. For each article, each worker completed the following three tasks. First, they provided primary and secondary article classifications from a list of fifteen topics: (1) civil rights; (2) Democrat scandals; (3) drugs; (4) economy; (5) education; (6) elections; (7) environment; (8) gay rights; (9) gun-related crimes; (10) gun rights/regulation; (11) healthcare; (12) international news; (13) national security; (14) Republican scandals; and (15) other. Second, workers determined whether the article was descriptive news or opinion. Third, to measure ideological slant, workers were asked, “Is the article generally positive, neutral, or negative toward members of the Democratic Party?” and separately, “Is the article generally positive, neutral, or negative toward members of the Republican Party?” Choices for these last two questions were provided on a five-point scale: very positive, somewhat positive, neutral, somewhat negative, and very negative. To mitigate question-ordering effects, workers were initially randomly assigned to being asked either the Democratic or Republican party question first; the question order remained the same for any subsequent articles the worker rated.
Finally, we assigned each article a partisanship score between –1 and 1, where a negative rating indicates that the article is net left-leaning and a positive rating indicates that it is net right-leaning. Specifically, for an article’s depiction of the Democratic Party, the five-point scale from very positive to very negative is encoded as –1, –0.5, 0, 0.5, 1. Analogously, for an article’s depiction of the Republican Party, the scale is encoded as 1, 0.5, 0, –0.5, –1. The score for each article is defined as the average over these two ratings. Thus, an average score of –1, for example, indicates that the article is very positive toward Democrats and very negative toward Republicans. The result of this procedure is a large, representative sample of political news articles, with direct human judgments on partisanship and article topic.
Keyword:: news media, media bias, crowdsourcing, and machine learning
Citation to related publication:: https://academic.oup.com/poq/article-abstract/80/S1/250/2223443/?redirectedFrom=fulltext and Ceren Budak, Sharad Goel, Justin M. Rao, Fair and Balanced? Quantifying Media Bias through Crowdsourced Content Analysis, Public Opinion Quarterly, Volume 80, Issue S1, 2016, Pages 250–271, https://doi.org/10.1093/poq/nfw007
Discipline:: Social Sciences
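
The partisanship score described in the news media bias record above can be illustrated with a short sketch. The encodings are taken from the description; the names `DEM_SCALE`, `REP_SCALE`, and `partisanship_score` are hypothetical and are not taken from the dataset or its analysis scripts. The sketch scores one pair of ratings only; how ratings from the two workers per article are combined is not specified in the description and is not modeled here.

```python
# Illustrative sketch only: names are hypothetical, not from the deposited dataset.
# Encoding of the five-point scale for the Democratic Party question
# (very positive toward Democrats counts as left-leaning, i.e. negative).
DEM_SCALE = {
    "very positive": -1.0,
    "somewhat positive": -0.5,
    "neutral": 0.0,
    "somewhat negative": 0.5,
    "very negative": 1.0,
}

# Encoding for the Republican Party question (mirror image of the above).
REP_SCALE = {
    "very positive": 1.0,
    "somewhat positive": 0.5,
    "neutral": 0.0,
    "somewhat negative": -0.5,
    "very negative": -1.0,
}


def partisanship_score(dem_rating: str, rep_rating: str) -> float:
    """Average the two encoded ratings; the result lies between -1 and 1."""
    return (DEM_SCALE[dem_rating] + REP_SCALE[rep_rating]) / 2


if __name__ == "__main__":
    # Very positive toward Democrats and very negative toward Republicans.
    print(partisanship_score("very positive", "very negative"))
```

A negative result indicates a net left-leaning article and a positive result a net right-leaning one, matching the sign convention in the description.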
