Selecting Methods for Multiple Imputation of Missing Data

dc.contributor.author: Fischer, Micha
dc.date.accessioned: 2023-09-22T15:32:43Z
dc.date.available: 2023-09-22T15:32:43Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.identifier.uri: https://hdl.handle.net/2027.42/177950
dc.description.abstract: Most data sets from sample surveys contain incomplete observations for various reasons, such as a respondent’s refusal to answer questions. Unfortunately, most analysis tools assume complete data sets. When applying such tools to incomplete data, researchers are limited to using either complete observations or complete variables, which can have problematic consequences: biased and inefficient estimates, and decreased power in statistical tests. However, the challenges of missing data can often be circumvented through sequential imputation (SI), an iterative procedure that imputes missing values variable by variable, conditioning on observed or previously imputed values of other variables. SI generates a complete data set that can be analyzed using standard analytical tools. Multiple imputation, which generates multiple data sets with different draws of the missing values, can be used to improve efficiency and provide inferences that take into account imputation uncertainty. Various procedures have been proposed for SI, and each procedure involves a choice of options, which can lead to subjectivity in the imputation process. Further, data are mainly analyzed with a substantive question in mind, and missing data imputation might not be the primary focus of an analyst. To address these issues, previous studies compared different procedures to find the best way to apply SI. However, they often rely on one assessment strategy, e.g., simulated data only, and often compare only a small number of procedures. These shortcomings lead to findings with low generalizability. This dissertation tries to close this gap by comparing multiple parametric and non-parametric procedures for multiple imputation within the SI framework, and by automating the SI process and reducing its sensitivity to procedural choices. Study One compares several parametric and non-parametric procedures for SI. The evaluation uses a simulation approach, analyzing data from 1) parametric models, 2) non-parametric models, and 3) a real survey data set, a publicly available version of the National Health and Nutrition Examination Survey (NHANES) data. The procedures compared include parametric and tree-based procedures. The first study finds that there is no overall best-performing method. However, we provide guidance for practice based on the simulation, taking into account the data situation and required modelling effort. Study Two proposes a modified SI procedure in which the assessment of different procedures is automated. The study develops criteria for binary, nominal, and continuous incomplete variables to assess imputation methods within SI in an automated and objective fashion. The modified SI process is assessed via a simulation study using data from the NHANES. This study provides methodology for a more automated SI procedure with built-in plausibility checks for a potential application to high-dimensional data sets with missing values, where having a human imputer specify models is inefficient. Study Three investigates the use and implications of incorporating response indicators (RIs) for covariates in the imputation process. This approach leads to imputation under a missing-not-at-random (MNAR) model. A literature review provides insights into how to include RIs for predictors in models with different analysis goals. Furthermore, a targeted simulation study suggests data situations and analysis goals where this approach is sensible. The simulation shows that, under missing-at-random (MAR) mechanisms, methods including RIs perform as well as those without them. In MNAR scenarios, methods including RIs can improve performance.
dc.language.iso: en_US
dc.subject: Multiple Imputation
dc.subject: Missing Data
dc.subject: Model Selection
dc.title: Selecting Methods for Multiple Imputation of Missing Data
dc.type: Thesis
dc.description.thesisdegreename: PhD
dc.description.thesisdegreediscipline: Survey and Data Science
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Little, Roderick J
dc.contributor.committeemember: West, Brady Thomas
dc.contributor.committeemember: Buskirk, Trent D.
dc.contributor.committeemember: Elliott, Michael R
dc.contributor.committeemember: Raghunathan, Trivellore E
dc.contributor.committeemember: van Buuren, Stef
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Science
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/177950/1/michaf_1.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/8407
dc.working.doi: 10.7302/8407
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)
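
Illustrative note on the abstract: the sequential imputation (SI) loop it describes, imputing variable by variable while conditioning on observed or previously imputed values, can be sketched roughly as follows. This is not the dissertation's code. It is a minimal sketch that assumes all variables are continuous, uses a linear regression with normal noise for every variable, and, as a simplification, does not draw the regression parameters from their posterior distribution. The function names `sequential_imputation` and `multiple_imputation` are hypothetical, and every column is assumed to have at least one observed value.

```python
import numpy as np


def sequential_imputation(data, n_iter=10, rng=None):
    """Return one completed copy of `data` (2-D float array, NaN = missing)."""
    rng = np.random.default_rng(rng)
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    n_rows, n_cols = x.shape

    # Starting values: fill each gap with a randomly chosen observed value
    # from the same column.
    for j in range(n_cols):
        observed = x[~missing[:, j], j]
        x[missing[:, j], j] = rng.choice(observed, size=missing[:, j].sum())

    for _ in range(n_iter):
        for j in range(n_cols):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Predictors: intercept plus all other variables, using their
            # observed or currently imputed values.
            others = np.delete(x, j, axis=1)
            design = np.column_stack([np.ones(n_rows), others])
            # Fit the regression on rows where variable j was observed.
            beta, *_ = np.linalg.lstsq(design[~miss_j], x[~miss_j, j], rcond=None)
            resid = x[~miss_j, j] - design[~miss_j] @ beta
            dof = max((~miss_j).sum() - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Impute by drawing from the fitted predictive distribution
            # (simplification: beta and sigma are treated as known).
            x[miss_j, j] = design[miss_j] @ beta + rng.normal(0.0, sigma, miss_j.sum())
    return x


def multiple_imputation(data, m=5, seed=0):
    """Return m completed data sets, each with independent random draws."""
    seeds = np.random.SeedSequence(seed).spawn(m)
    return [sequential_imputation(data, rng=np.random.default_rng(s)) for s in seeds]
```

Calling `multiple_imputation(data, m=5)` on an array with `NaN` entries returns five completed arrays that agree on the observed cells and differ in the imputed ones. An analysis would then be run on each copy and the results combined, which is how multiple imputation reflects imputation uncertainty.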

