Debiased lasso for generalized linear models with a diverging number of covariates

dc.contributor.authorXia, Lu
dc.contributor.authorNan, Bin
dc.contributor.authorLi, Yi
dc.date.accessioned2023-04-04T17:40:15Z
dc.date.available2024-04-04 13:40:13
dc.date.available2023-04-04T17:40:15Z
dc.date.issued2023-03
dc.identifier.citationXia, Lu; Nan, Bin; Li, Yi (2023). "Debiased lasso for generalized linear models with a diverging number of covariates." Biometrics 79(1): 344-357.
dc.identifier.issn0006-341X
dc.identifier.issn1541-0420
dc.identifier.urihttps://hdl.handle.net/2027.42/176040
dc.description.abstractModeling and drawing inference on the joint associations between single-nucleotide polymorphisms and a disease has sparked interest in genome-wide association studies. In the motivating Boston Lung Cancer Survival Cohort (BLCSC) data, the presence of a large number of single-nucleotide polymorphisms of interest, though smaller than the sample size, challenges inference on their joint associations with the disease outcome. In similar settings, we find that neither the debiased lasso approach (van de Geer et al., 2014), which assumes sparsity on the inverse information matrix, nor the standard maximum likelihood method can yield confidence intervals with satisfactory coverage probabilities for generalized linear models. Under this “large n, diverging p” scenario, we propose an alternative debiased lasso approach by directly inverting the Hessian matrix without imposing the matrix sparsity assumption, which further reduces bias compared to the original debiased lasso and ensures valid confidence intervals with nominal coverage probabilities. We establish the asymptotic distributions of any linear combinations of the parameter estimates, which lays the theoretical groundwork for drawing inference. Simulations show that the proposed refined debiased estimating method performs well in removing bias and yields honest confidence interval coverage. We use the proposed method to analyze the aforementioned BLCSC data, a large-scale hospital-based epidemiology cohort study investigating the joint effects of genetic variants on lung cancer risks.
dc.publisherSpringer
dc.publisherWiley Periodicals, Inc.
dc.subject.otherlung cancer
dc.subject.otherstatistical inference
dc.subject.otherhigh-dimensional regression
dc.subject.otherbias correction
dc.subject.otherasymptotics
dc.titleDebiased lasso for generalized linear models with a diverging number of covariates
dc.typeArticle
dc.rights.robotsIndexNoFollow
dc.subject.hlbsecondlevelMathematics
dc.subject.hlbtoplevelScience
dc.description.peerreviewedPeer Reviewed
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/176040/1/biom13587_am.pdf
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/176040/2/biom13587.pdf
dc.description.bitstreamurlhttp://deepblue.lib.umich.edu/bitstream/2027.42/176040/3/biom13587-sup-0001-SuppMat.pdf
dc.identifier.doi10.1111/biom.13587
dc.identifier.sourceBiometrics
dc.identifier.citedreferenceSur, P. & Candès, E.J. ( 2019 ) A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences of the United States of America, 116, 14516 – 14525.
dc.identifier.citedreferenceKong, S. & Nan, B. ( 2014 ) Non-asymptotic oracle inequalities for the high-dimensional Cox regression via lasso. Statistica Sinica, 24, 25 – 42.
dc.identifier.citedreferenceLee, J.D., Sun, D.L., Sun, Y. & Taylor, J.E. ( 2016 ) Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44, 907 – 927.
dc.identifier.citedreferenceMeinshausen, N. & Bühlmann, P. ( 2006 ) High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436 – 1462.
dc.identifier.citedreferenceMiller, D.P., Liu, G., De Vivo, I., Lynch, T.J., Wain, J.C., Su, L. et al. ( 2002 ) Combinations of the variant genotypes of GSTP1, GSTM1, and p53 are associated with an increased lung cancer risk. Cancer Research, 62, 2819 – 2823.
dc.identifier.citedreferenceNing, Y. & Liu, H. ( 2017 ) A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45, 158 – 195.
dc.identifier.citedreferencePintarelli, G., Galvan, A., Pozzi, P., Noci, S., Pasetti, G., Sala, F. et al. ( 2017 ) Pharmacogenetic study of seven polymorphisms in three nicotinic acetylcholine receptor subunits in smoking-cessation therapies. Scientific Reports, 7, 16730.
dc.identifier.citedreferencePortnoy, S. ( 1984 ) Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is large, I. Consistency. The Annals of Statistics, 12, 1298 – 1309.
dc.identifier.citedreferencePortnoy, S. ( 1985 ) Asymptotic behavior of M-estimators of p regression parameters when $p^2/n$ is large, II. Normal approximation. The Annals of Statistics, 13, 1403 – 1417.
dc.identifier.citedreferenceRepapi, E., Sayers, I., Wain, L.V., Burton, P.R., Johnson, T., Obeidat, M. et al. ( 2010 ) Genome-wide association study identifies five loci associated with lung function. Nature Genetics, 42, 36 – 44.
dc.identifier.citedreferenceSchaid, D.J., Chen, W. & Larson, N.B. ( 2018 ) From genome-wide associations to candidate causal variants by statistical fine-mapping. Nature Reviews Genetics, 19, 491 – 504.
dc.identifier.citedreferenceStevens, V.L., Bierut, L.J., Talbot, J.T., Wang, J.C., Sun, J., Hinrichs, A.L. et al. ( 2008 ) Nicotinic receptor gene variants influence susceptibility to heavy smoking. Cancer Epidemiology, Biomarkers & Prevention, 17, 3517 – 3525.
dc.identifier.citedreferenceTaylor, J.G., Choi, E.-H., Foster, C.B. & Chanock, S.J. ( 2001 ) Using genetic variation to study human disease. Trends in Molecular Medicine, 7, 507 – 512.
dc.identifier.citedreferenceTibshirani, R. ( 1996 ) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58, 267 – 288.
dc.identifier.citedreferencevan de Geer, S.A. ( 2008 ) High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36, 614 – 645.
dc.identifier.citedreferencevan de Geer, S., Bühlmann, P., Ritov, Y. & Dezeure, R. ( 2014 ) On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42, 1166 – 1202.
dc.identifier.citedreferenceVershynin, R. ( 2012 ) Introduction to the non-asymptotic analysis of random matrices. In: Compressed sensing. Cambridge: Cambridge University Press, pp. 210 – 268.
dc.identifier.citedreferenceWang, L. ( 2011 ) GEE analysis of clustered binary data with diverging number of covariates. The Annals of Statistics, 39, 389 – 417.
dc.identifier.citedreferenceYohai, V.J. & Maronna, R.A. ( 1979 ) Asymptotic behavior of M-estimators for the linear model. The Annals of Statistics, 7, 258 – 268.
dc.identifier.citedreferenceZhang, X. & Cheng, G. ( 2017 ) Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112, 757 – 768.
dc.identifier.citedreferenceZhang, C.-H. & Zhang, S.S. ( 2014 ) Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 217 – 242.
dc.identifier.citedreferenceZou, H. ( 2006 ) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418 – 1429.
dc.identifier.citedreferenceZou, H. & Hastie, T. ( 2005 ) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301 – 320.
dc.identifier.citedreferenceMa, R., Cai, T.T. & Li, H. ( 2021 ) Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, 116, 984 – 998.
dc.identifier.citedreferenceMcKay, J.D., Hung, R.J., Han, Y., Zong, X., Carreras-Torres, R., Christiani, D.C. et al. ( 2017 ) Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nature Genetics, 49, 1126 – 1132.
dc.identifier.citedreferenceAmos, C.I., Wu, X., Broderick, P., Gorlov, I.P., Gu, J., Eisen, T. et al. ( 2008 ) Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics, 40, 616 – 622.
dc.identifier.citedreferenceBellec, P.C., Lecué, G. & Tsybakov, A.B. ( 2018 ) Slope meets lasso: improved oracle bounds and optimality. The Annals of Statistics, 46, 3603 – 3642.
dc.identifier.citedreferenceBossé, Y. & Amos, C.I. ( 2018 ) A decade of GWAS results in lung cancer. Cancer Epidemiology, Biomarkers & Prevention, 27, 363 – 379.
dc.identifier.citedreferenceBühlmann, P. & van de Geer, S. ( 2011 ) Statistics for high-dimensional data: methods, theory and applications. Berlin: Springer.
dc.identifier.citedreferenceCandès, E. & Tao, T. ( 2007 ) The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35, 2313 – 2351.
dc.identifier.citedreferenceDoyle, G.A., Wang, M.-J., Chou, A.D., Oleynick, J.U., Arnold, S.E., Buono, R.J. et al. ( 2011 ) In vitro and ex vivo analysis of CHRNA3 and CHRNA5 haplotype expression. PLoS One, 6, e23373.
dc.identifier.citedreferenceEvans, W.E. & Relling, M.V. ( 2004 ) Moving towards individualized medicine with pharmacogenomics. Nature, 429, 464 – 468.
dc.identifier.citedreferenceFan, J. & Li, R. ( 2001 ) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348 – 1360.
dc.identifier.citedreferenceFan, J. & Peng, H. ( 2004 ) Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32, 928 – 961.
dc.identifier.citedreferenceFang, E.X., Ning, Y. & Liu, H. ( 2017 ) Testing and confidence intervals for high dimensional proportional hazards models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1415 – 1437.
dc.identifier.citedreferenceFriedman, J., Hastie, T. & Tibshirani, R. ( 2010 ) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1 – 22.
dc.identifier.citedreferenceGabrielsen, M.E., Romundstad, P., Langhammer, A., Krokan, H.E. & Skorpen, F. ( 2013 ) Association between a 15q25 gene variant, nicotine-related habits, lung cancer and COPD among 56307 individuals from the HUNT study in Norway. European Journal of Human Genetics, 21, 1293 – 1299.
dc.identifier.citedreferenceGuan, Y. & Stephens, M. ( 2011 ) Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5, 1780 – 1815.
dc.identifier.citedreferenceHalldén, S., Sjögren, M., Hedblad, B., Engström, G., Hamrefors, V., Manjer, J. et al. ( 2016 ) Gene variance in the nicotinic receptor cluster (CHRNA5-CHRNA3-CHRNB4) predicts death from cardiopulmonary disease and cancer in smokers. Journal of Internal Medicine, 279, 388 – 398.
dc.identifier.citedreferenceHe, Q. & Lin, D.-Y. ( 2010 ) A variable selection method for genome-wide association studies. Bioinformatics, 27, 1 – 8.
dc.identifier.citedreferenceHe, X. & Shao, Q.-M. ( 2000 ) On parameters of increasing dimensions. Journal of Multivariate Analysis, 73, 120 – 135.
dc.identifier.citedreferenceHuber, P.J. ( 1973 ) Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1, 799 – 821.
dc.identifier.citedreferenceJavanmard, A. & Montanari, A. ( 2014 ) Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15, 2869 – 2909.
dc.working.doiNO
dc.owningcollnameInterdisciplinary and Peer-Reviewed

