Using data mining techniques to characterize participation in observational studies

dc.contributor.author: Linden, Ariel
dc.contributor.author: Yarnold, Paul R.
dc.date.accessioned: 2017-01-06T20:48:23Z
dc.date.available: 2018-01-08T19:47:51Z
dc.date.issued: 2016-12
dc.identifier.citation: Linden, Ariel; Yarnold, Paul R. (2016). "Using data mining techniques to characterize participation in observational studies." Journal of Evaluation in Clinical Practice 22(6): 835-843.
dc.identifier.issn: 1356-1294
dc.identifier.issn: 1365-2753
dc.identifier.uri: https://hdl.handle.net/2027.42/134951
dc.description.abstract: Data mining techniques are gaining in popularity among health researchers for an array of purposes, such as improving diagnostic accuracy, identifying high‐risk patients and extracting concepts from unstructured data. In this paper, we describe how these techniques can be applied to another area in the health research domain: identifying characteristics of individuals who do and do not choose to participate in observational studies. In contrast to randomized studies where individuals have no control over their treatment assignment, participants in observational studies self‐select into the treatment arm and therefore have the potential to differ in their characteristics from those who elect not to participate. These differences may explain part, or all, of the difference in the observed outcome, making it crucial to assess whether there is differential participation based on observed characteristics. As compared to traditional approaches to this assessment, data mining offers a more precise understanding of these differences. To describe and illustrate the application of data mining in this domain, we use data from a primary care‐based medical home pilot programme and compare the performance of commonly used classification approaches – logistic regression, support vector machines, random forests and classification tree analysis (CTA) – in correctly classifying participants and non‐participants. We find that CTA is substantially more accurate than the other models. Moreover, unlike the other models, CTA offers transparency in its computational approach, ease of interpretation via the decision rules produced and provides statistical results familiar to health researchers. Beyond their application to research, data mining techniques could help administrators to identify new candidates for participation who may most benefit from the intervention.
dc.publisher: APA Books
dc.publisher: Wiley Periodicals, Inc.
dc.subject.other: observational studies
dc.subject.other: observed characteristics
dc.subject.other: selection bias
dc.subject.other: selection
dc.subject.other: machine learning
dc.subject.other: data mining
dc.title: Using data mining techniques to characterize participation in observational studies
dc.type: Article
dc.rights.robots: IndexNoFollow
dc.subject.hlbsecondlevel: Medicine (General)
dc.subject.hlbtoplevel: Health Sciences
dc.description.peerreviewed: Peer Reviewed
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/134951/1/jep12515.pdf
dc.identifier.doi: 10.1111/jep.12515
dc.identifier.source: Journal of Evaluation in Clinical Practice
dc.owningcollname: Interdisciplinary and Peer-Reviewed