Detecting and Correcting Contamination in Genetic Data.

Flickinger, Matthew

dc.contributor.author	Flickinger, Matthew
dc.date.accessioned	2016-06-10T19:31:47Z
dc.date.available	NO_RESTRICTION
dc.date.available	2016-06-10T19:31:47Z
dc.date.issued	2016
dc.date.submitted	2016
dc.identifier.uri	https://hdl.handle.net/2027.42/120783
dc.description.abstract	While technological innovation has dramatically increased the amount and variety of genomic data available to geneticists, no assay is perfect, and both human error and technical artifacts can lead to erroneous data. A proper analysis pipeline must detect errors and, if possible, correct them. One common source of errors in genetic data is sample-to-sample contamination. This dissertation identifies methods to address contamination in the most common types of genetic studies. Chapter 2 focuses on methods for detecting and quantifying contamination in both array-based and next-generation sequencing (NGS) genotype data. For the array-based data, we use the observed intensities from the genotyping instruments to quantify contamination with two distinct methods: 1) a regression-based model using intensities and population allele frequencies and 2) a multivariate normal mixture model based on the clustering of intensities. For NGS data, we model the reads using a mixture model to determine the proportion of reads from the true sample and from the contaminating sample. Chapter 3 outlines a method to make accurate genotype calls with contaminated NGS data. Given an estimated level of contamination, we propose a likelihood that can be maximized to call genotypes and estimate allele frequencies for samples with no previous genotype data. We evaluate the method on data from two common sequencing strategies: 1) low-pass (2-4x depth) genome-wide sequencing and 2) high-depth (50-100x depth) exome sequencing. Chapter 4 looks at contamination in the context of RNA sequencing (RNA-Seq) data. While the technology to generate RNA-Seq data is similar to exome sequencing, the difference in expression between the contaminating and true samples makes it more difficult to accurately estimate the contamination proportion. We propose methods to improve the quality of these estimates.
dc.language.iso	en_US
dc.subject	contamination
dc.subject	genetic sequencing
dc.title	Detecting and Correcting Contamination in Genetic Data.
dc.type	Thesis	en_US
dc.description.thesisdegreename	PhD
dc.description.thesisdegreediscipline	Biostatistics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Boehnke, Michael Lee
dc.contributor.committeemember	Burke, David T
dc.contributor.committeemember	Abecasis, Goncalo
dc.contributor.committeemember	Kang, Hyun Min
dc.subject.hlbsecondlevel	Genetics
dc.subject.hlbsecondlevel	Statistics and Numeric Data
dc.subject.hlbtoplevel	Science
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/120783/1/mflick_1.pdf
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: mflick_1.pdf
Size: 3.217MB
Format: PDF
