Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience
dc.contributor.author | Ziegler, Andreas | en_US |
dc.contributor.author | MacCluer, Jean W. | en_US |
dc.contributor.author | Almasy, Laura A. | en_US |
dc.date.accessioned | 2011-12-05T18:32:02Z | |
dc.date.available | 2012-02-21T18:47:02Z | en_US |
dc.date.issued | 2011 | en_US |
dc.identifier.citation | Ziegler, Andreas; MacCluer, Jean W.; Almasy, Laura (2011). "Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience." Genetic Epidemiology 35(S1): S5-S11. <http://hdl.handle.net/2027.42/88012> | en_US |
dc.identifier.issn | 0741-0395 | en_US |
dc.identifier.issn | 1098-2272 | en_US |
dc.identifier.uri | https://hdl.handle.net/2027.42/88012 | |
dc.description.abstract | Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression‐based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high‐dimension, low‐sample‐size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression‐based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree‐based methods (e.g., decision trees and random forests), were used for variable selection (selecting genetic and clinical features most associated with or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case‐control status or quantitative trait value. We include a discussion of cross‐validation for model selection and assessment, and a description of available software resources for these methods. Genet. Epidemiol. 35:S5–S11, 2011. © 2011 Wiley Periodicals, Inc. | en_US |
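The abstract mentions cross‐validation for model selection and assessment. As a minimal illustrative sketch of the k‐fold idea (not code from the article; all function names are my own), the data are split into k folds, each fold is held out in turn, and the held‐out error is averaged:

```python
def k_fold_indices(n, k):
    """Split sample indices 0..n-1 into k roughly equal folds (no shuffling)."""
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, error):
    """Mean held-out error over k folds.

    fit(train_x, train_y) -> model
    error(model, test_x, test_y) -> float
    """
    folds = k_fold_indices(len(xs), k)
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        model = fit(train_x, train_y)       # refit on the k-1 training folds
        total += error(model, test_x, test_y)  # score on the held-out fold
    return total / k
```

In model selection, `cross_validate` would be called once per candidate model (e.g., per LASSO penalty value), and the model with the lowest mean held‐out error chosen; the `fit` and `error` callables stand in for whichever regression or tree‐based learner is being tuned.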
dc.publisher | Wiley Subscription Services, Inc., A Wiley Company | en_US |
dc.subject.other | Unsupervised Learning | en_US |
dc.subject.other | Supervised Learning | en_US |
dc.subject.other | Cluster Analysis | en_US |
dc.subject.other | Logistic Regression | en_US |
dc.subject.other | Poisson Regression | en_US |
dc.subject.other | Logic Regression | en_US |
dc.subject.other | LASSO | en_US |
dc.subject.other | Ridge Regression | en_US |
dc.subject.other | Decision Trees | en_US |
dc.subject.other | Random Forests | en_US |
dc.subject.other | Cross‐Validation | en_US |
dc.subject.other | Software | en_US |
dc.title | Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience | en_US |
dc.type | Article | en_US |
dc.rights.robots | IndexNoFollow | en_US |
dc.subject.hlbsecondlevel | Biological Chemistry | en_US |
dc.subject.hlbsecondlevel | Genetics | en_US |
dc.subject.hlbsecondlevel | Molecular, Cellular and Developmental Biology | en_US |
dc.subject.hlbtoplevel | Health Sciences | en_US |
dc.subject.hlbtoplevel | Science | en_US |
dc.description.peerreviewed | Peer Reviewed | en_US |
dc.contributor.affiliationum | Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI | en_US |
dc.contributor.affiliationother | Clinical Sciences Section, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD | en_US |
dc.contributor.affiliationother | Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany | en_US |
dc.contributor.affiliationother | Statistical Genetics Section, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD | en_US |
dc.contributor.affiliationother | Center for Information Technology, National Institutes of Health, Bethesda, MD | en_US |
dc.contributor.affiliationother | 333 Cassell Drive, Suite 1200, National Institutes of Health/NHGRI, Baltimore, MD 21224 | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/88012/1/20642_ftp.pdf | |
dc.identifier.doi | 10.1002/gepi.20642 | en_US |
dc.identifier.source | Genetic Epidemiology | en_US |
dc.identifier.citedreference | Almasy LA, Dyer TD, Peralta JM, Kent Jr JW, Charlesworth JC, Curran JE, Blangero J. 2011. Genetic Analysis Workshop 17 mini‐exome simulation. BMC Proc 5: S2. | en_US |
dc.identifier.citedreference | Breiman L. 1996. Bagging predictors. Mach Learn 24: 123 – 140. | en_US |
dc.identifier.citedreference | Breiman L. 2001. Random forests. Mach Learn 45: 5 – 32. | en_US |
dc.identifier.citedreference | Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and Regression Trees. Boca Raton, FL: CRC Press. | en_US |
dc.identifier.citedreference | Clarke B, Fokoue E, Zhang HH. 2009. Principles and Theory for Data Mining and Machine Learning. New York: Springer. | en_US |
dc.identifier.citedreference | Diaz‐Uriarte R. 2007. GeneSrF and varSelRF: a web‐based tool and R package for gene selection and classification using random forest. BMC Bioinform 8: 328. | en_US |
dc.identifier.citedreference | Efron B. 1979. Bootstrap methods: another look at the jackknife. Ann Stat 7: 1 – 26. | en_US |
dc.identifier.citedreference | Efron B, Tibshirani RJ. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall. | en_US |
dc.identifier.citedreference | Elisseeff A, Evgeniou T, Pontil M. 2005. Stability of randomized learning algorithms. J Mach Learn Res 6: 55 – 79. | en_US |
dc.identifier.citedreference | Evgeniou T, Pontil M, Elisseeff A. 2004. Leave one out error, stability, and generalization of voting combinations of classifiers. Mach Learn 55: 71 – 97. | en_US |
dc.identifier.citedreference | Friedman JH, Hall P. 2000. On bagging and non‐linear estimation. Technical report, Stanford University, Stanford, CA. | en_US |
dc.identifier.citedreference | Grandvalet Y. 2004. Bagging equalizes influence. Mach Learn 55: 251 – 270. | en_US |
dc.identifier.citedreference | Hall DB, Shen J. 2010. Robust estimation for zero‐inflated Poisson regression. Scand J Stat 37: 237 – 252. | en_US |
dc.identifier.citedreference | Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. 2009. The WEKA data mining software: an update. SIGKDD Explorations 11: 10 – 18. | en_US |
dc.identifier.citedreference | Hartigan JA, Wong MA. 1979. A K ‐means clustering algorithm. Appl Stat 28: 100 – 108. | en_US |
dc.identifier.citedreference | Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer. | en_US |
dc.identifier.citedreference | Hoerl AE, Kennard R. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55 – 67. | en_US |
dc.identifier.citedreference | Kaufman L, Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley. | en_US |
dc.identifier.citedreference | Kohavi R. 1995. A study of cross‐validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, v. 2, 1137 – 1145. San Francisco, CA: Morgan Kaufmann. http://citeseer.ist.psu.edu/kohavi95study.html. | en_US |
dc.identifier.citedreference | Lambert D. 1992. Zero‐inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1 – 14. | en_US |
dc.identifier.citedreference | Liu H, Zhang J. 2009. Estimation consistency of the group LASSO and its applications. J Mach Learn Res Workshop Conf Proc 5: 376 – 383. | en_US |
dc.identifier.citedreference | McCullagh P, Nelder JA. 1989. Generalized Linear Models, 2nd ed. New York: Chapman & Hall. | en_US |
dc.identifier.citedreference | Meier L, van de Geer S, Bühlmann P. 2008. The group LASSO for logistic regression. J R Stat Soc Ser B 70: 53 – 71. | en_US |
dc.identifier.citedreference | Meinshausen N, Yu B. 2009. LASSO‐type recovery of sparse representations for high‐dimensional data. Ann Stat 37: 246 – 270. | en_US |
dc.identifier.citedreference | Nisbet R, Elder J, Miner G. 2009. Handbook of Statistical Analysis and Data Mining Applications. New York: Academic Press. | en_US |
dc.identifier.citedreference | Quinlan JR. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann. | en_US |
dc.identifier.citedreference | Ruczinski I, Kooperberg C, LeBlanc M. 2003. Logic regression. J Comput Graph Stat 12: 475 – 511. | en_US |
dc.identifier.citedreference | Ruczinski I, Kooperberg C, LeBlanc M. 2004. Exploring interactions in high‐dimensional genomic data: an overview of logic regression, with applications. J Multivariate Anal 90: 178 – 195. | en_US |
dc.identifier.citedreference | Schwarz DF, König IR, Ziegler A. 2010. On safari to Random Jungle: a fast implementation of random forests for high‐dimensional data. Bioinformatics 26: 1752 – 1758. | en_US |
dc.identifier.citedreference | Sun YV. 2010. Multigenic modeling of complex disease by random forests. Adv Genet 72: 73 – 99. | en_US |
dc.identifier.citedreference | Szymczak S, Biernacka JM, Cordell HJ, González‐Recio O, König IR, Zhang H, Sun YV. 2009. Machine learning in genome‐wide association studies. Genet Epidemiol 33: S51 – S57. | en_US |
dc.identifier.citedreference | Tibshirani R. 1996. Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58: 267 – 288. | en_US |
dc.identifier.citedreference | Wilson AF, Ziegler A. 2011. Lessons learned from the Genetic Analysis Workshop 17: Transitioning from genome‐wide association studies to whole‐genome statistical genetic analysis. Genet Epidemiol, this issue. | en_US |
dc.identifier.citedreference | Wu TT, Chen YF, Hastie T, Sobel E, Lange K. 2009. Genome‐wide association analysis by LASSO penalized logistic regression. Bioinformatics 25: 714 – 721. | en_US |
dc.owningcollname | Interdisciplinary and Peer-Reviewed |