Show simple item record

High-dimensional principal component analysis with heterogeneous missingness

dc.contributor.authorZhu, Ziwei
dc.contributor.authorWang, Tengyao
dc.contributor.authorSamworth, Richard J.
dc.date.accessioned2022-12-05T16:38:44Z
dc.date.available2023-12-05 11:38:43en
dc.date.available2022-12-05T16:38:44Z
dc.date.issued2022-11
dc.identifier.citationZhu, Ziwei; Wang, Tengyao; Samworth, Richard J. (2022). "High-dimensional principal component analysis with heterogeneous missingness." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 84(5): 2000-2031.
dc.identifier.issn1369-7412
dc.identifier.issn1467-9868
dc.identifier.urihttps://hdl.handle.net/2027.42/175179
dc.description.abstractWe study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
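The refine-and-impute iteration described in the abstract can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' R package: the initializer below is a crude SVD of the zero-filled matrix standing in for the OPW estimator, and all names (`prime_pca_sketch`, `n_iter`) are made up for this example. Each pass regresses every row's observed entries onto the current subspace estimate to fill in the missing entries, then takes the leading right singular vectors of the imputed matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 2

# Noiseless rank-k data: n samples in d dimensions, true subspace span(V).
U = rng.standard_normal((n, k))
V, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = U @ V.T

# Heterogeneous missingness: each column has its own observation probability.
p = rng.uniform(0.3, 0.9, size=d)
mask = rng.random((n, d)) < p  # True = observed

def prime_pca_sketch(X, mask, k, n_iter=50):
    # Crude initializer: leading right singular space of the zero-filled
    # matrix (a stand-in for the OPW estimator the paper starts from).
    X0 = np.where(mask, X, 0.0)
    _, _, Vt = np.linalg.svd(X0, full_matrices=False)
    V_hat = Vt[:k].T
    for _ in range(n_iter):
        X_imp = X0.copy()
        for i in range(X.shape[0]):
            obs = mask[i]
            # Project the observed entries of row i onto span(V_hat) ...
            coef, *_ = np.linalg.lstsq(V_hat[obs], X[i, obs], rcond=None)
            # ... and use the fitted scores to impute the missing entries.
            X_imp[i, ~obs] = V_hat[~obs] @ coef
        # Update: leading right singular space of the imputed matrix.
        _, _, Vt = np.linalg.svd(X_imp, full_matrices=False)
        V_hat = Vt[:k].T
    return V_hat

V_hat = prime_pca_sketch(X, mask, k)
# Subspace error: Frobenius distance between projection matrices.
err = np.linalg.norm(V_hat @ V_hat.T - V @ V.T, ord="fro")
```

In this noiseless setting the iteration illustrates the paper's geometric-convergence guarantee: `err` shrinks rapidly toward zero even though different columns are observed at very different rates.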
dc.publisherAcademic Press
dc.publisherWiley Periodicals, Inc.
dc.subject.otherheterogeneous missingness
dc.subject.otherhigh-dimensional statistics
dc.subject.otheriterative projections
dc.subject.othermissing data
dc.subject.otherprincipal component analysis
dc.titleHigh-dimensional principal component analysis with heterogeneous missingness
dc.typeArticle
dc.rights.robotsIndexNoFollow
dc.subject.hlbsecondlevelStatistics and Numeric Data
dc.subject.hlbtoplevelScience
dc.description.peerreviewedPeer Reviewed
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/175179/1/rssb12550_am.pdf
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/175179/2/rssb12550.pdf
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/175179/3/rssb12550-sup-0001-SupInfo.pdf
dc.identifier.doi10.1111/rssb.12550
dc.identifier.sourceJournal of the Royal Statistical Society: Series B (Statistical Methodology)
dc.identifier.citedreferenceNegahban, S. & Wainwright, M.J. ( 2012 ) Restricted strong convexity and weighted matrix completion: optimal bounds with noise. Journal of Machine Learning Research, 13, 1665 – 1697.
dc.identifier.citedreferenceJosse, J., Pagès, J. & Husson, F. ( 2009 ) Gestion des données manquantes en analyse en composantes principales. Journal de la société française de statistique, 150, 28 – 51.
dc.identifier.citedreferenceKeshavan, R.H., Montanari, A. & Oh, S. ( 2010 ) Matrix completion from a few entries. IEEE Transactions on Information Theory, 56, 2980 – 2998.
dc.identifier.citedreferenceKiers, H.A.L. ( 1997 ) Weighted least squares fitting using ordinary least squares algorithms. Psychometrika, 62, 251 – 266.
dc.identifier.citedreferenceKoltchinskii, V., Lounici, K. & Tsybakov, A.B. ( 2011 ) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39, 2302 – 2329.
dc.identifier.citedreferenceLittle, R.J. & Rubin, D.B. ( 2019 ) Statistical analysis with missing data. Hoboken: John Wiley & Sons.
dc.identifier.citedreferenceLoh, P.-L. & Tan, X.L. ( 2018 ) High-dimensional robust precision matrix estimation: cellwise corruption under ε-contamination. The Electronic Journal of Statistics, 12, 1429 – 1467.
dc.identifier.citedreferenceLoh, P.-L. & Wainwright, M.J. ( 2012 ) High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. The Annals of Statistics, 40, 1637 – 1664.
dc.identifier.citedreferenceLounici, K. ( 2013 ) Sparse principal component analysis with missing observations. In: Houdré, C. (Ed.) High dimensional probability VI. Basel: Birkhäuser, pp. 327 – 356.
dc.identifier.citedreferenceLounici, K. ( 2014 ) High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20, 1029 – 1058.
dc.identifier.citedreferenceMazumder, R., Hastie, T. & Tibshirani, R. ( 2010 ) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287 – 2322.
dc.identifier.citedreferencePaul, D. ( 2007 ) Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17, 1617 – 1642.
dc.identifier.citedreferenceRohe, K., Chatterjee, S. & Yu, B. ( 2011 ) Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39, 1878 – 1915.
dc.identifier.citedreferenceRubin, D.B. ( 1976 ) Inference and missing data. Biometrika, 63, 581 – 592.
dc.identifier.citedreferenceRubin, D.B. ( 2004 ) Multiple imputation for nonresponse in surveys. Hoboken: John Wiley & Sons.
dc.identifier.citedreferenceSchönemann, P. ( 1966 ) A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1 – 10.
dc.identifier.citedreferenceSeaman, S., Galati, J., Jackson, D. & Carlin, J. ( 2013 ) What is meant by “missing at random”? Statistical Science, 28, 257 – 268.
dc.identifier.citedreferenceShen, D., Shen, H., Zhu, H. & Marron, J. ( 2016 ) The statistics and mathematics of high dimension low sample size asymptotics. Statistica Sinica, 26, 1747 – 1770.
dc.identifier.citedreferenceVershynin, R. ( 2018 ) High-dimensional probability: an introduction with applications in data science. Cambridge: Cambridge University Press.
dc.identifier.citedreferenceWang, T. ( 2016 ) Spectral methods and computational trade-offs in high-dimensional statistical inference. Ph.D. thesis, University of Cambridge.
dc.identifier.citedreferenceWang, W. & Fan, J. ( 2017 ) Asymptotics of empirical eigen-structure for ultra-high dimensional spiked covariance model. The Annals of Statistics, 45, 1342 – 1374.
dc.identifier.citedreferenceWold, H. & Lyttkens, E. ( 1969 ) Nonlinear iterative partial least squares (NIPALS) estimation procedures. Bulletin of the International Statistical Institute, 43, 29 – 51.
dc.identifier.citedreferenceZhang, A., Cai, T.T. & Wu, Y. ( 2018 ) Heteroskedastic PCA: algorithm, optimality, and applications. arXiv:1810.08316.
dc.identifier.citedreferenceZhu, Z., Wang, T. & Samworth, R. J. ( 2019 ) primePCA: projected refinement for imputation of missing entries in principal component analysis. R package, version 1.2. Available from: https://CRAN.R-project.org/web/packages/primePCA/.
dc.identifier.citedreferenceBeaton, A.E. ( 1964 ) The use of special matrix operators in statistical calculus. ETS Research Bulletin Series, 2, i – 222.
dc.identifier.citedreferenceBelloni, A., Rosenbaum, M. & Tsybakov, A.B. ( 2017 ) Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 79, 939 – 956.
dc.identifier.citedreferenceAnderson, T.W. ( 1957 ) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association, 52, 200 – 203.
dc.identifier.citedreferenceCai, T.T., Ma, Z. & Wu, Y. ( 2013 ) Sparse PCA: optimal rates and adaptive estimation. The Annals of Statistics, 41, 3074 – 3110.
dc.identifier.citedreferenceCai, T.T. & Zhang, A. ( 2016 ) Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. The Journal of Multivariate Analysis, 150, 55 – 74.
dc.identifier.citedreferenceCai, T.T. & Zhang, L. ( 2018 ) High-dimensional linear discriminant analysis: optimality, adaptive algorithm, and missing data. arXiv:1804.03018.
dc.identifier.citedreferenceCandès, E.J., Li, X., Ma, Y. & Wright, J. ( 2011 ) Robust principal component analysis? Journal of the ACM, 58, 11:1 – 11:37.
dc.identifier.citedreferenceCandès, E.J. & Plan, Y. ( 2010 ) Matrix completion with noise. Proceedings of the IEEE, 98, 925 – 936.
dc.identifier.citedreferenceCandès, E.J. & Recht, B. ( 2009 ) Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 717 – 772.
dc.identifier.citedreferenceChi, Y., Lu, Y. & Chen, Y. ( 2018 ) Nonconvex optimization meets low-rank matrix factorization: an overview. arXiv:1809.09573.
dc.identifier.citedreferenceCho, J., Kim, D. & Rohe, K. ( 2017 ) Asymptotic theory for estimating the singular vectors and values of a partially-observed low rank matrix with noise. Statistica Sinica, 27, 1921 – 1948.
dc.identifier.citedreferenceDavis, C. & Kahan, W.M. ( 1970 ) The rotation of eigenvectors by a perturbation III. The SIAM Journal on Numerical Analysis, 7, 1 – 46.
dc.identifier.citedreferenceDempster, A.P., Laird, N.M. & Rubin, D.B. ( 1977 ) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 39, 1 – 38.
dc.identifier.citedreferenceDray, S. & Josse, J. ( 2015 ) Principal component analysis with missing values: a comparative survey of methods. Plant Ecology, 216, 657 – 667.
dc.identifier.citedreferenceElsener, A. & van de Geer, S. ( 2018 ) Sparse spectral estimation with missing and corrupted measurements. arXiv:1811.10443.
dc.identifier.citedreferenceFan, J., Liao, Y. & Mincheva, M. ( 2013 ) Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 75, 603 – 680.
dc.identifier.citedreferenceFord, B.L. ( 1983 ) An overview of hot-deck procedures. In: Madow, W.G., Olkin, I. & Rubin, D.B. (Eds.) Incomplete data in sample surveys, Vol. 2: theory and bibliographies. New York: Academic Press, pp. 185 – 207.
dc.identifier.citedreferenceGao, C., Ma, Z., Zhang, A.Y. & Zhou, H.H. ( 2016 ) Achieving optimal misclassification proportion in stochastic block models. Journal of Machine Learning Research, 18, 1 – 45.
dc.identifier.citedreferenceHastie, T., Mazumder, R., Lee, J.D. & Zadeh, R. ( 2015 ) Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16, 3367 – 3402.
dc.identifier.citedreferenceJohnstone, I.M. & Lu, A.Y. ( 2009 ) On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104, 682 – 693.
dc.identifier.citedreferenceJosse, J. & Husson, F. ( 2012 ) Handling missing values in exploratory multivariate data analysis methods. Journal de la société française de statistique, 153, 1 – 21.
dc.working.doiNOen
dc.owningcollnameInterdisciplinary and Peer-Reviewed

