Modern Survey Estimation with Social Media and Auxiliary Data
Ferg, Robyn
2020
Abstract
Traditional survey methods have been successful for nearly a century, but recently response rates have been declining and costs have been increasing, making the future of survey science uncertain. At the same time, new media sources are generating new forms of data, population data is increasingly readily available, and sophisticated machine learning algorithms are being created. This dissertation uses modern data sources and tools to improve survey estimates and advance the field of survey science. We begin by exploring the challenges of using data from new media, demonstrating how relationships between social media data and survey responses can appear deceptively strong. We examine a previously observed relationship between sentiment of "jobs" tweets and consumer confidence, performing a sensitivity analysis on how sentiment of tweets is calculated and sorting "jobs" tweets into categories based on their content, concluding that the original observed relationship was merely a chance occurrence. Next we track the relationship between sentiment of "Trump" tweets and presidential approval. We develop a framework to interpret the strength of this observed relationship by implementing placebo analyses, in which we perform the same analysis but with tweets assumed to be unrelated to presidential approval, concluding that our observed relationship is not strong. Failing to find a meaningful signal, we next propose following a set of users over time. For a set of politically active users, we are able to find evidence of a political signal in terms of frequency and sentiment of their tweets around the 2016 presidential election. In a given corpus of tweets, there are likely to be several topics present, which has the potential to introduce bias when using the corpus to track survey responses. To help discover and sort tweets into these topics, we create a clustering-based topic modeling algorithm.
Using the entire corpus, we create distances between words based on how often they appear together in the same tweet, create distances between tweets based on the distance between words in the tweets, and perform clustering on the resulting distances. We show that this method is effective using a validation set of tweets and apply it to the corpus of tweets from politically active users and "jobs" tweets. Finally, we use population auxiliary data and machine learning algorithms to improve survey estimates. We develop an imputation-based estimation method that produces an unbiased estimate of the mean response of a finite population from a simple random sample when population auxiliary data are available. Our method allows for any prediction function or machine learning algorithm to be used to predict the response for out-of-sample observations, and is therefore able to accommodate a high dimensional setting and all covariate types. Exact unbiasedness is guaranteed by estimating the bias of the prediction function using subsamples of the original simple random sample. Importantly, the unbiasedness property does not depend on the accuracy of the imputation method. We apply this estimation method to simulated data, college tuition data, and the American Community Survey.
Subjects
statistics; Twitter data; survey estimation; topic modeling; predictive modeling
Types
Thesis