Using data mining techniques to characterize participation in observational studies

Linden, Ariel; Yarnold, Paul R.

dc.contributor.author	Linden, Ariel
dc.contributor.author	Yarnold, Paul R.
dc.date.accessioned	2017-01-06T20:48:23Z
dc.date.available	2018-01-08T19:47:51Z	en
dc.date.issued	2016-12
dc.identifier.citation	Linden, Ariel; Yarnold, Paul R. (2016). "Using data mining techniques to characterize participation in observational studies." Journal of Evaluation in Clinical Practice 22(6): 835-843.
dc.identifier.issn	1356-1294
dc.identifier.issn	1365-2753
dc.identifier.uri	https://hdl.handle.net/2027.42/134951
dc.description.abstract	Data mining techniques are gaining in popularity among health researchers for an array of purposes, such as improving diagnostic accuracy, identifying high‐risk patients and extracting concepts from unstructured data. In this paper, we describe how these techniques can be applied to another area in the health research domain: identifying characteristics of individuals who do and do not choose to participate in observational studies. In contrast to randomized studies where individuals have no control over their treatment assignment, participants in observational studies self‐select into the treatment arm and therefore have the potential to differ in their characteristics from those who elect not to participate. These differences may explain part, or all, of the difference in the observed outcome, making it crucial to assess whether there is differential participation based on observed characteristics. As compared to traditional approaches to this assessment, data mining offers a more precise understanding of these differences. To describe and illustrate the application of data mining in this domain, we use data from a primary care‐based medical home pilot programme and compare the performance of commonly used classification approaches – logistic regression, support vector machines, random forests and classification tree analysis (CTA) – in correctly classifying participants and non‐participants. We find that CTA is substantially more accurate than the other models. Moreover, unlike the other models, CTA offers transparency in its computational approach, ease of interpretation via the decision rules produced and provides statistical results familiar to health researchers. Beyond their application to research, data mining techniques could help administrators to identify new candidates for participation who may most benefit from the intervention.
dc.publisher	APA Books
dc.publisher	Wiley Periodicals, Inc.
dc.subject.other	observational studies
dc.subject.other	observed characteristics
dc.subject.other	selection bias
dc.subject.other	selection
dc.subject.other	machine learning
dc.subject.other	data mining
dc.title	Using data mining techniques to characterize participation in observational studies
dc.type	Article	en_US
dc.rights.robots	IndexNoFollow
dc.subject.hlbsecondlevel	Medicine (General)
dc.subject.hlbtoplevel	Health Sciences
dc.description.peerreviewed	Peer Reviewed
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/134951/1/jep12515.pdf
dc.identifier.doi	10.1111/jep.12515
dc.identifier.source	Journal of Evaluation in Clinical Practice
dc.owningcollname	Interdisciplinary and Peer-Reviewed
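
The abstract describes comparing classifiers by how accurately they separate participants from non-participants. A minimal sketch of how such a comparison might be tabulated from a 2x2 classification table follows. The class name, the model labels and all counts are hypothetical illustrations, not the authors' code or results.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


@dataclass(frozen=True)
class ConfusionTable:
    """2x2 table for a binary participation classifier (hypothetical helper)."""

    true_pos: int   # participants classified as participants
    false_neg: int  # participants classified as non-participants
    false_pos: int  # non-participants classified as participants
    true_neg: int   # non-participants classified as non-participants

    def sensitivity(self) -> float:
        # Share of actual participants that the model identifies.
        return _ratio(self.true_pos, self.true_pos + self.false_neg)

    def specificity(self) -> float:
        # Share of actual non-participants that the model identifies.
        return _ratio(self.true_neg, self.true_neg + self.false_pos)

    def positive_predictive_value(self) -> float:
        # Share of predicted participants who actually participated.
        return _ratio(self.true_pos, self.true_pos + self.false_pos)

    def negative_predictive_value(self) -> float:
        # Share of predicted non-participants who actually did not participate.
        return _ratio(self.true_neg, self.true_neg + self.false_neg)

    def accuracy(self) -> float:
        # Share of all individuals classified correctly.
        total = self.true_pos + self.false_neg + self.false_pos + self.true_neg
        return _ratio(self.true_pos + self.true_neg, total)


# Made-up counts for two unnamed models; these are NOT results from the paper.
tables = {
    "model_a": ConfusionTable(true_pos=80, false_neg=20, false_pos=30, true_neg=170),
    "model_b": ConfusionTable(true_pos=60, false_neg=40, false_pos=50, true_neg=150),
}

for name, table in tables.items():
    print(
        f"{name}: sensitivity={table.sensitivity():.3f} "
        f"specificity={table.specificity():.3f} "
        f"accuracy={table.accuracy():.3f}"
    )
```

Reporting sensitivity and specificity alongside overall accuracy matters when participants and non-participants differ in number, because a model can reach high accuracy by favouring the larger group.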

Files in this item

Name: jep12515.pdf
Size: 317.2 KB
Format: PDF