Show simple item record

Brief review of regression‐based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience

dc.contributor.author: Ziegler, Andreas
dc.contributor.author: MacCluer, Jean W.
dc.contributor.author: Almasy, Laura A.
dc.date.accessioned: 2011-12-05T18:32:02Z
dc.date.available: 2012-02-21T18:47:02Z
dc.date.issued: 2011
dc.identifier.citation: Ziegler, Andreas; MacCluer, Jean W.; Almasy, Laura (2011). "Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience." Genetic Epidemiology 35(S1): S5-S11. <http://hdl.handle.net/2027.42/88012>
dc.identifier.issn: 0741-0395
dc.identifier.issn: 1098-2272
dc.identifier.uri: https://hdl.handle.net/2027.42/88012
dc.description.abstract: Genetic Analysis Workshop 17 provided common and rare genetic variants from exome sequencing data and simulated binary and quantitative traits in 200 replicates. We provide a brief review of the machine learning and regression-based methods used in the analyses of these data. Several regression and machine learning methods were used to address different problems inherent in the analyses of these data, which are high-dimension, low-sample-size data typical of many genetic association studies. Unsupervised methods, such as cluster analysis, were used for data segmentation and subset selection. Supervised learning methods, which include regression-based methods (e.g., generalized linear models, logic regression, and regularized regression) and tree-based methods (e.g., decision trees and random forests), were used for variable selection (selecting the genetic and clinical features most associated with or predictive of the outcome) and prediction (developing models that use common and rare genetic variants to accurately predict the outcome), with the outcome being case-control status or quantitative trait value. We include a discussion of cross-validation for model selection and assessment and a description of available software resources for these methods. Genet. Epidemiol. 35:S5–S11, 2011. © 2011 Wiley Periodicals, Inc.
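The abstract mentions regularized regression (e.g., ridge regression, the LASSO) and cross-validation for model selection on high-dimension, low-sample-size data. As a minimal illustrative sketch — not the authors' code — the following Python snippet fits closed-form ridge regression to simulated genotype-style data (n = 200 subjects, p = 1000 variants, only 5 truly associated; all dimensions and the data-generating setup are assumptions for illustration, loosely echoing the GAW17 scenario) and chooses the penalty by k-fold cross-validation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated high-dimension, low-sample-size data: 200 subjects, 1000 variants
# coded as 0/1/2 allele counts; only the first 5 variants affect the trait.
n, p = 200, 1000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X'X + lam*I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated mean squared prediction error."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for fold in folds:
        train = np.ones(len(y), dtype=bool)
        train[fold] = False                      # hold this fold out
        b = ridge_fit(X[train], y[train], lam)   # fit on the rest
        errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
    return float(np.mean(errs))

# Select the ridge penalty by cross-validation over a small grid.
lambdas = [0.1, 1.0, 10.0, 100.0]
best = min(lambdas, key=lambda lam: cv_mse(X, y, lam))
```

The ridge penalty keeps X'X + lam*I invertible even though n < p, which is exactly the high-dimension, low-sample-size situation the review describes; swapping in an L1 penalty (LASSO) would additionally zero out coefficients and perform variable selection.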
dc.publisher: Wiley Subscription Services, Inc., A Wiley Company
dc.subject.other: Unsupervised Learning
dc.subject.other: Supervised Learning
dc.subject.other: Cluster Analysis
dc.subject.other: Logistic Regression
dc.subject.other: Poisson Regression
dc.subject.other: Logic Regression
dc.subject.other: LASSO
dc.subject.other: Ridge Regression
dc.subject.other: Decision Trees
dc.subject.other: Random Forests
dc.subject.other: Cross-Validation
dc.subject.other: Software
dc.title: Brief review of regression-based and machine learning methods in genetic epidemiology: the Genetic Analysis Workshop 17 experience
dc.type: Article
dc.rights.robots: IndexNoFollow
dc.subject.hlbsecondlevel: Biological Chemistry
dc.subject.hlbsecondlevel: Genetics
dc.subject.hlbsecondlevel: Molecular, Cellular and Developmental Biology
dc.subject.hlbtoplevel: Health Sciences
dc.subject.hlbtoplevel: Science
dc.description.peerreviewed: Peer Reviewed
dc.contributor.affiliationum: Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI
dc.contributor.affiliationother: Clinical Sciences Section, National Institute of Arthritis, Musculoskeletal, and Skin Diseases, National Institutes of Health, Bethesda, MD
dc.contributor.affiliationother: Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany
dc.contributor.affiliationother: Statistical Genetics Section, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD
dc.contributor.affiliationother: Center for Information Technology, National Institutes of Health, Bethesda, MD
dc.contributor.affiliationother: 333 Cassell Drive, Suite 1200, National Institutes of Health/NHGRI, Baltimore, MD 21224
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/88012/1/20642_ftp.pdf
dc.identifier.doi: 10.1002/gepi.20642
dc.identifier.source: Genetic Epidemiology
dc.identifier.citedreference: Almasy LA, Dyer TD, Peralta JM, Kent Jr JW, Charlesworth JC, Curran JE, Blangero J. 2011. Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc 5: S2.
dc.identifier.citedreference: Breiman L. 1996. Bagging predictors. Mach Learn 24: 123–140.
dc.identifier.citedreference: Breiman L. 2001. Random forests. Mach Learn 45: 5–32.
dc.identifier.citedreference: Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and Regression Trees. Boca Raton, FL: CRC Press.
dc.identifier.citedreference: Clarke B, Fokoue E, Zhang HH. 2009. Principles and Theory for Data Mining and Machine Learning. New York: Springer.
dc.identifier.citedreference: Diaz-Uriarte R. 2007. GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinform 8: 328.
dc.identifier.citedreference: Efron B. 1979. Bootstrap methods: another look at the jackknife. Ann Stat 7: 1–26.
dc.identifier.citedreference: Efron B, Tibshirani RJ. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall.
dc.identifier.citedreference: Elisseeff A, Evgeniou T, Pontil M. 2005. Stability of randomized learning algorithms. J Mach Learn Res 6: 55–79.
dc.identifier.citedreference: Evgeniou T, Pontil M, Elisseeff A. 2004. Leave one out error, stability, and generalization of voting combinations of classifiers. Mach Learn 55: 71–97.
dc.identifier.citedreference: Friedman JH, Hall P. 2000. On bagging and non-linear estimation. Technical report, Stanford University, Stanford, CA.
dc.identifier.citedreference: Grandvalet Y. 2004. Bagging equalizes influence. Mach Learn 55: 251–270.
dc.identifier.citedreference: Hall DB, Shen J. 2010. Robust estimation for zero-inflated Poisson regression. Scand J Stat 37: 237–252.
dc.identifier.citedreference: Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. 2009. The WEKA data mining software: an update. SIGKDD Explorations 11: 10–18.
dc.identifier.citedreference: Hartigan JA, Wong MA. 1979. A K-means clustering algorithm. Appl Stat 28: 100–108.
dc.identifier.citedreference: Hastie T, Tibshirani R, Friedman J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer.
dc.identifier.citedreference: Hoerl AE, Kennard R. 1970. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55–67.
dc.identifier.citedreference: Kaufman L, Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
dc.identifier.citedreference: Kohavi R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, v. 2, 1137–1145. San Francisco, CA: Morgan Kaufmann. http://citeseer.ist.psu.edu/kohavi95study.html.
dc.identifier.citedreference: Lambert D. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14.
dc.identifier.citedreference: Liu H, Zhang J. 2009. Estimation consistency of the group LASSO and its applications. J Mach Learn Res Workshop Conf Proc 5: 376–383.
dc.identifier.citedreference: McCullagh P, Nelder JA. 1989. Generalized Linear Models, 2nd ed. New York: Chapman & Hall.
dc.identifier.citedreference: Meier L, van de Geer S, Buhlmann P. 2008. The group LASSO for logistic regression. J R Stat Soc Ser B 70: 53–71.
dc.identifier.citedreference: Meinshausen N, Yu B. 2009. LASSO-type recovery of sparse representations for high-dimensional data. Ann Stat 37: 246–270.
dc.identifier.citedreference: Nisbet R, Elder J, Miner G. 2009. Handbook of Statistical Analysis and Data Mining Applications. New York: Academic Press.
dc.identifier.citedreference: Quinlan JR. 1993. C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann.
dc.identifier.citedreference: Ruczinski I, Kooperberg C, LeBlanc M. 2003. Logic regression. J Comput Graph Stat 12: 475–511.
dc.identifier.citedreference: Ruczinski I, Kooperberg C, LeBlanc M. 2004. Exploring interactions in high-dimensional genomic data: an overview of logic regression, with applications. J Multivariate Anal 90: 178–195.
dc.identifier.citedreference: Schwarz DF, Konig IR, Ziegler A. 2010. On safari to Random Jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26: 1752–1758.
dc.identifier.citedreference: Sun YV. 2010. Multigenic modeling of complex disease by random forests. Adv Genet 72: 73–99.
dc.identifier.citedreference: Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV. 2009. Machine learning in genome-wide association studies. Genet Epidemiol 33: S51–S57.
dc.identifier.citedreference: Tibshirani R. 1996. Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58: 267–288.
dc.identifier.citedreference: Wilson AF, Ziegler A. 2011. Lessons learned from the Genetic Analysis Workshop 17: transitioning from genome-wide association studies to whole-genome statistical genetic analysis. Genet Epidemiol, this issue.
dc.identifier.citedreference: Wu TT, Chen YF, Hastie T, Sobel E, Lange K. 2009. Genome-wide association analysis by LASSO penalized logistic regression. Bioinformatics 25: 714–721.
dc.owningcollname: Interdisciplinary and Peer-Reviewed



