Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience
dc.contributor.author | Ziegler, Andreas | en_US |
dc.contributor.author | MacCluer, Jean W. | en_US |
dc.contributor.author | Almasy, Laura A. | en_US |
dc.date.accessioned | 2011-12-05T18:32:02Z | |
dc.date.available | 2012-02-21T18:47:02Z | en_US |
dc.date.issued | 2011 | en_US |
dc.identifier.citation | Ziegler, Andreas; MacCluer, Jean W.; Almasy, Laura (2011). "Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience." Genetic Epidemiology 35(S1): S5-S11. <http://hdl.handle.net/2027.42/88012> | en_US |
dc.identifier.issn | 0741-0395 | en_US |
dc.identifier.issn | 1098-2272 | en_US |
dc.identifier.uri | https://hdl.handle.net/2027.42/88012 | |
dc.description.abstract | Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression‐based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high‐dimension, low‐sample‐size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression‐based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree‐based methods (e.g., decision trees and random forests), were used for variable selection (selecting genetic and clinical features most associated with or predictive of outcome) and prediction (developing models using common and rare genetic variants to accurately predict outcome), with the outcome being case‐control status or quantitative trait value. We include a discussion of cross‐validation for model selection and assessment, and a description of available software resources for these methods. Genet. Epidemiol. 35:S5–S11, 2011. © 2011 Wiley Periodicals, Inc. | en_US |
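The abstract mentions cross‐validation for model selection and assessment. As a minimal illustrative sketch of the k‐fold idea (not code from the article; all function names are my own), the data are split into k folds, each fold is held out in turn, and the held‐out error is averaged:

```python
def k_fold_indices(n, k):
    """Split sample indices 0..n-1 into k roughly equal folds (no shuffling)."""
    folds = []
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, error):
    """Mean held-out error over k folds.

    fit(train_x, train_y) -> model
    error(model, test_x, test_y) -> float
    """
    folds = k_fold_indices(len(xs), k)
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        model = fit(train_x, train_y)       # refit on the k-1 training folds
        total += error(model, test_x, test_y)  # score on the held-out fold
    return total / k
```

In model selection, `cross_validate` would be called once per candidate model (e.g., per LASSO penalty value), and the model with the lowest mean held‐out error chosen; the `fit` and `error` callables stand in for whichever regression or tree‐based learner is being tuned.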
dc.publisher | Wiley Subscription Services, Inc., A Wiley Company | en_US |
dc.subject.other | Unsupervised Learning | en_US |
dc.subject.other | Supervised Learning | en_US |
dc.subject.other | Cluster Analysis | en_US |
dc.subject.other | Logistic Regression | en_US |
dc.subject.other | Poisson Regression | en_US |
dc.subject.other | Logic Regression | en_US |
dc.subject.other | LASSO | en_US |
dc.subject.other | Ridge Regression | en_US |
dc.subject.other | Decision Trees | en_US |
dc.subject.other | Random Forests | en_US |
dc.subject.other | Cross‐Validation | en_US |
dc.subject.other | Software | en_US |
dc.title | Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience | en_US |
dc.type | Article | en_US |
dc.rights.robots | IndexNoFollow | en_US |
dc.subject.hlbsecondlevel | Biological Chemistry | en_US |
dc.subject.hlbsecondlevel | Genetics | en_US |
dc.subject.hlbsecondlevel | Molecular, Cellular and Developmental Biology | en_US |
dc.subject.hlbtoplevel | Health Sciences | en_US |
dc.subject.hlbtoplevel | Science | en_US |
dc.description.peerreviewed | Peer Reviewed | en_US |
dc.contributor.affiliationum | Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI | en_US |
dc.contributor.affiliationother | Clinical Sciences Section, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD | en_US |
dc.contributor.affiliationother | Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany | en_US |
dc.contributor.affiliationother | Statistical Genetics Section, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD | en_US |
dc.contributor.affiliationother | Center for Information Technology, National Institutes of Health, Bethesda, MD | en_US |
dc.contributor.affiliationother | 333 Cassell Drive, Suite 1200, National Institutes of Health/NHGRI, Baltimore, MD 21224 | en_US |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/88012/1/20642_ftp.pdf | |
dc.identifier.doi | 10.1002/gepi.20642 | en_US |
dc.identifier.source | Genetic Epidemiology | en_US |
dc.identifier.citedreference | Almasy LA, Dyer TD, Peralta JM, Kent Jr JW, Charlesworth JC, Curran JE, Blangero J. 2011. Genetic Analysis Workshop 17 mini‐exome simulation. BMC Proc 5: S2. | en_US |
dc.identifier.citedreference | Breiman L. 1996. Bagging predictors. Mach Learn 24: 123 – 140. | en_US |
dc.identifier.citedreference | Breiman L. 2001. Random forests. Mach Learn 45: 5 – 32. | en_US |
dc.identifier.citedreference | Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and Regression Trees. Boca Raton, FL: CRC Press. | en_US |
dc.identifier.citedreference | Clarke B, Fokoue E, Zhang HH. 2009. Principles and Theory for Data Mining and Machine Learning. New York: Springer. | en_US |
dc.identifier.citedreference | Diaz‐Uriarte R. 2007. GeneSrF and varSelRF: a web‐based tool and R package for gene selection and classification using random forest. BMC Bioinform 8: 328. | en_US |
dc.identifier.citedreference | Efron B. 1979. Bootstrap methods: another look at the jackknife. Ann Stat 7: 1 – 26. | en_US |
dc.identifier.citedreference | Efron B, Tibshirani RJ. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall. | en_US |
dc.identifier.citedreference | Elisseeff A, Evgeniou T, Pontil M. 2005. Stability of randomized learning algorithms. J Mach Learn Res 6: 55 – 79. | en_US |
dc.identifier.citedreference | Evgeniou T, Pontil M, Elisseeff A. 2004. Leave one out error, stability, and generalization of voting combinations of classifiers. Mach Learn 55: 71 – 97. | en_US |
dc.identifier.citedreference | Friedman JH, Hall P. 2000. On bagging and non‐linear estimation. Technical report, Stanford University, Stanford, CA. | en_US |
dc.identifier.citedreference | Grandvalet Y. 2004. Bagging equalizes influence. Mach Learn 55: 251 – 270. | en_US |
dc.identifier.citedreference | Hall DB, Shen J. 2010. Robust estimation for zero‐inflated Poisson regression. Scand J Stat 37: 237 – 252. | en_US |
dc.identifier.citedreference | Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. 2009. The WEKA data mining software: an update. SIGKDD Explorations 11: 10 – 18. | en_US |
dc.identifier.citedreference | Hartigan JA, Wong MA. 1979. A K ‐means clustering algorithm. Appl Stat 28: 100 – 108. | en_US |
dc.identifier.citedreference | Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer. | en_US |
dc.identifier.citedreference | Hoerl AE, Kennard R. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55 – 67. | en_US |
dc.identifier.citedreference | Kaufman L, Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley. | en_US |
dc.identifier.citedreference | Kohavi R. 1995. A study of cross‐validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, v. 2, 1137 – 1145. San Francisco, CA: Morgan Kaufmann. http://citeseer.ist.psu.edu/kohavi95study.html. | en_US |
dc.identifier.citedreference | Lambert D. 1992. Zero‐inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1 – 14. | en_US |
dc.identifier.citedreference | Liu H, Zhang J. 2009. Estimation consistency of the group LASSO and its applications. J Mach Learn Res Workshop Conf Proc 5: 376 – 383. | en_US |
dc.identifier.citedreference | McCullagh P, Nelder JA. 1989. Generalized Linear Models, 2nd ed. New York: Chapman & Hall. | en_US |
dc.identifier.citedreference | Meier L, van de Geer S, Bühlmann P. 2008. The group LASSO for logistic regression. J R Stat Soc Ser B 70: 53 – 71. | en_US |
dc.identifier.citedreference | Meinshausen N, Yu B. 2009. LASSO‐type recovery of sparse representations for high‐dimensional data. Ann Stat 37: 246 – 270. | en_US |
dc.identifier.citedreference | Nisbet R, Elder J, Miner G. 2009. Handbook of Statistical Analysis and Data Mining Applications. New York: Academic Press. | en_US |
dc.identifier.citedreference | Quinlan JR. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann. | en_US |
dc.identifier.citedreference | Ruczinski I, Kooperberg C, LeBlanc M. 2003. Logic regression. J Comput Graph Stat 12: 475 – 511. | en_US |
dc.identifier.citedreference | Ruczinski I, Kooperberg C, LeBlanc M. 2004. Exploring interactions in high‐dimensional genomic data: an overview of logic regression, with applications. J Multivariate Anal 90: 178 – 195. | en_US |
dc.identifier.citedreference | Schwarz DF, König IR, Ziegler A. 2010. On safari to Random Jungle: a fast implementation of random forests for high‐dimensional data. Bioinformatics 26: 1752 – 1758. | en_US |
dc.identifier.citedreference | Sun YV. 2010. Multigenic modeling of complex disease by random forests. Adv Genet 72: 73 – 99. | en_US |
dc.identifier.citedreference | Szymczak S, Biernacka JM, Cordell HJ, González‐Recio O, König IR, Zhang H, Sun YV. 2009. Machine learning in genome‐wide association studies. Genet Epidemiol 33: S51 – S57. | en_US |
dc.identifier.citedreference | Tibshirani R. 1996. Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58: 267 – 288. | en_US |
dc.identifier.citedreference | Wilson AF, Ziegler A. 2011. Lessons learned from the Genetic Analysis Workshop 17: Transitioning from genome‐wide association studies to whole‐genome statistical genetic analysis. Genet Epidemiol, this issue. | en_US |
dc.identifier.citedreference | Wu TT, Chen YF, Hastie T, Sobel E, Lange K. 2009. Genome‐wide association analysis by LASSO penalized logistic regression. Bioinformatics 25: 714 – 721. | en_US |
dc.owningcollname | Interdisciplinary and Peer-Reviewed |