Multiple Imputation Methods for Statistical Disclosure Control.
An, Di
2008
Abstract: Statistical disclosure control (SDC) is an important consideration in the release of public use data sets. Statistical agencies seek SDC methods that limit the risk of identifying respondents while preserving the original information in the data.
This dissertation concerns disclosure risk caused by extreme values of variables such as income or age. Top-coding is a simple SDC procedure for this situation, but it limits analysis for the data user and may result in distorted inference. We propose two alternatives to top-coding for SDC: a non-parametric hot-deck procedure and a parametric Bayesian method. Both methods are based on multiple imputation (MI).
In the first part of the dissertation, we describe our SDC methods and illustrate their performance for inference about the mean of a variable subject to SDC, through simulations and an application to data from the Chinese income project. We compare estimates from our methods with those calculated from the original data and from the top-coding method. Results show that our MI methods provide better inferences from the publicly released data than top-coding.
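To make the comparison concrete, here is a minimal Python sketch contrasting top-coding with a simple hot-deck multiple imputation when estimating a mean. It is an illustration under stated assumptions, not the dissertation's procedure: the function names, the fixed cutoff, and the choice to draw donors with replacement from all values above the cutoff are hypothetical. The sketch reports the pooled point estimate and the between-imputation variance only, because the abstract does not say which variance-combining rule is used.

```python
import random
import statistics


def top_code(values, cutoff):
    """Top-coding: replace every value above the cutoff with the cutoff."""
    return [min(v, cutoff) for v in values]


def hot_deck_impute(values, cutoff, rng):
    """Replace each value above the cutoff with a random draw (with
    replacement) from the pool of all values above the cutoff.
    Assumption: this simple donor pool is illustrative only."""
    donors = [v for v in values if v > cutoff]
    return [rng.choice(donors) if v > cutoff else v for v in values]


def mi_mean(values, cutoff, m=5, seed=0):
    """Create m imputed data sets and pool the estimated means.

    Returns (pooled_mean, between_imputation_variance)."""
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(hot_deck_impute(values, cutoff, rng))
        for _ in range(m)
    ]
    pooled = statistics.fmean(estimates)
    between = statistics.variance(estimates) if m > 1 else 0.0
    return pooled, between


if __name__ == "__main__":
    # Hypothetical incomes with a few extreme values.
    incomes = [12, 15, 18, 20, 22, 25, 30, 120, 150, 400]
    cutoff = 100
    print("original mean:  ", statistics.fmean(incomes))
    print("top-coded mean: ", statistics.fmean(top_code(incomes, cutoff)))
    print("hot-deck MI:    ", mi_mean(incomes, cutoff, m=20))
```

Top-coding pulls every extreme value down to the cutoff, so the top-coded mean is biased downward whenever values exceed it. The hot-deck draws stay within the observed range of extreme values, so the pooled mean varies around the original mean.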
In the second part of the dissertation, we study the impact of SDC methods on linear regression where the outcome is subject to top-coding, and extend the previous methods to condition on the observed covariates. We propose stratified and regression-based extensions of our MI methods and show in simulation studies that these methods yield estimates of regression coefficients close to those obtained before deletion.
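One way to picture conditioning on covariates is a stratified hot deck, in which an extreme outcome is replaced only by a donor value from the same covariate stratum. The Python sketch below is a hypothetical illustration of that idea, not the dissertation's algorithm: the `(stratum, value)` record layout, the function names, and the single fixed cutoff are all assumptions.

```python
import random
from collections import defaultdict


def stratified_hot_deck(records, cutoff, rng):
    """records is a list of (stratum, value) pairs.

    Each value above the cutoff is replaced by a random draw (with
    replacement) from the extreme values in the same stratum, so the
    imputed outcome still depends on the covariate defining the strata."""
    donors = defaultdict(list)
    for stratum, value in records:
        if value > cutoff:
            donors[stratum].append(value)
    return [
        (stratum, rng.choice(donors[stratum]) if value > cutoff else value)
        for stratum, value in records
    ]


def make_imputed_sets(records, cutoff, m=5, seed=0):
    """Return m independently imputed copies of the records."""
    rng = random.Random(seed)
    return [stratified_hot_deck(records, cutoff, rng) for _ in range(m)]


if __name__ == "__main__":
    # Hypothetical data: stratum label and outcome.
    data = [("a", 1), ("a", 90), ("a", 95), ("b", 2), ("b", 200), ("b", 300)]
    for imputed in make_imputed_sets(data, cutoff=50, m=3):
        print(imputed)
```

Each imputed copy can then be analyzed with the usual regression software and the results pooled across copies.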
In the third part of the dissertation, we consider a specific application concerning disclosure risk caused by some participants attaining high ages because of prolonged participation in a longitudinal study, and develop nonparametric, stratified MI methods. We apply these methods in survival analysis using Cox’s proportional hazards model. Simulation studies show that these methods work well in preserving the relationship between hazard and covariates. We illustrate the methods on data from the Charleston Heart Study.