Improving Machine Learning Models for Microbiome Analysis and Democratizing Data Science Along the Way

Sovacool, Kelly

dc.contributor.author	Sovacool, Kelly
dc.date.accessioned	2023-09-22T15:31:49Z
dc.date.available	2023-09-22T15:31:49Z
dc.date.issued	2023
dc.date.submitted	2023
dc.identifier.uri	https://hdl.handle.net/2027.42/177936
dc.description.abstract	The human microbiome plays an important role in maintaining health. Changes in the taxonomic and functional composition of the gut microbiota have been implicated in numerous diseases including colorectal cancer, Clostridioides difficile infection (CDI), and others. Thus, the gut microbiome is a promising source of biomarkers for disease diagnosis and prediction. Machine learning (ML) approaches can leverage large datasets to gain insights into associations between the microbiota and disease. Here, we present a new algorithm that improves microbiome analysis for ML applications, apply ML to predict severity of CDI, and introduce resources that empower data scientists to go from the basics of coding to applying ML for reproducible research. Assigning amplicon sequences to operational taxonomic units (OTUs) is an important step in characterizing microbial communities across large datasets. However, a gap in existing OTU assignment methods inhibited the ability of researchers to incorporate new samples into previously clustered datasets, such as when deploying ML models. To provide an efficient method to fit sequences to existing OTUs while maintaining high OTU quality, we developed the OptiFit algorithm, an improved implementation of reference-based clustering. Our benchmarks revealed that OptiFit produces OTUs of similar quality to a gold standard method, yet at faster speeds. Thus, OptiFit provides a suitable option for users requiring consistent and high-quality OTU assignments for ML applications and beyond. CDI can lead to severe complications including death, with half a million cases annually in the United States. The composition of the gut microbiome plays an important role in determining colonization resistance and clearance upon exposure to C. difficile. We investigated whether ML models trained on OTUs from stool samples on the day of CDI diagnosis could predict which cases led to severe outcomes.
We trained models to predict CDI severity for four different severity definitions. The models performed best when predicting pragmatic severity, a composite definition of complications due to any cause or confirmed as CDI-attributable via chart review when possible. Our results suggest that while chart review is valuable to verify the cause of complications, including as many samples as possible is indispensable for training performant models on imbalanced datasets. We evaluated the potential clinical value of these models and found similar performance compared to prior models based on electronic health records, although further work is needed to determine the feasibility of deploying such models in clinical practice. These results represent a step toward the goal of deploying ML to inform clinical decisions and ultimately improve CDI outcomes. Bioinformatics is a kind of data science, an interdisciplinary field integrating computer science, statistics, and domain knowledge. Novice researchers frequently have domain knowledge but lack the other skills necessary to apply data science to their datasets while adhering to best practices in reproducibility. We developed three resources to help democratize data science: a curriculum teaching the basics of Python for data science to young students, a curriculum teaching programming skills for reproducible research, and an R package implementing an ML framework to help novices apply ML responsibly while being customizable for advanced users. These contributions span a range of audience skill levels to help fill gaps in existing resources for data science. In summary, this dissertation advances bioinformatics for microbiome research from the start of data analysis through application, and ultimately toward enabling others to reproduce and extend our work.
dc.language.iso	en_US
dc.subject	bioinformatics
dc.subject	human gut microbiome
dc.subject	supervised machine learning
dc.subject	16S rRNA gene amplicon sequencing
dc.subject	Clostridioides difficile infection
dc.subject	data science education
dc.title	Improving Machine Learning Models for Microbiome Analysis and Democratizing Data Science Along the Way
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Bioinformatics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Schloss, Patrick D
dc.contributor.committeemember	Dick, Gregory James
dc.contributor.committeemember	Wiens, Jenna
dc.contributor.committeemember	Young, Vincent Bensan
dc.subject.hlbsecondlevel	Computer Science
dc.subject.hlbsecondlevel	Microbiology and Immunology
dc.subject.hlbsecondlevel	Statistics and Numeric Data
dc.subject.hlbtoplevel	Engineering
dc.subject.hlbtoplevel	Science
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/177936/1/sovacool_1.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/8393
dc.identifier.orcid	0000-0003-3283-829X
dc.identifier.name-orcid	Sovacool, Kelly; 0000-0003-3283-829X	en_US
dc.working.doi	10.7302/8393	en
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: sovacool_1.pdf
Size: 5.330MB
Format: PDF
