Detecting and Correcting Contamination in Genetic Data.
dc.contributor.author | Flickinger, Matthew | |
dc.date.accessioned | 2016-06-10T19:31:47Z | |
dc.date.available | NO_RESTRICTION | |
dc.date.available | 2016-06-10T19:31:47Z | |
dc.date.issued | 2016 | |
dc.date.submitted | 2016 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/120783 | |
dc.description.abstract | While technological innovation has dramatically increased the amount and variety of genomic data available to geneticists, no assay is perfect, and both human error and technical artifacts can lead to erroneous data. A proper analysis pipeline must both detect errors and, if possible, correct them. One common source of errors in genetic data is sample-to-sample contamination. This dissertation identifies methods to address contamination in the most common types of genetic studies. Chapter 2 focuses on methods for detecting and quantifying contamination in both array-based and next-generation sequencing (NGS) genotype data. For the array-based data, we use the observed intensities from the genotyping instruments to quantify contamination with two distinct methods: 1) a regression-based model using intensities and population allele frequencies and 2) a multivariate normal mixture model of the clustering of intensities. For NGS data, we model the reads using a mixture model to determine the proportion of reads from the true sample and the contaminating sample. Chapter 3 outlines a method to make accurate genotype calls with contaminated NGS data. Given an estimated level of contamination, we propose a likelihood that can be maximized to call genotypes and estimate allele frequencies for samples with no previous genotype data. We evaluate the method on data from two common sequencing strategies: 1) low-pass (2-4x depth) genome-wide sequencing and 2) high-depth (50-100x depth) exome sequencing. Chapter 4 examines contamination in the context of RNA sequencing (RNA-Seq) data. While the technology to generate RNA-Seq data is similar to exome sequencing, the difference in expression between the contaminating and true sample makes it more difficult to accurately estimate the contamination proportion. We propose methods to improve the quality of these estimates. | |
dc.language.iso | en_US | |
dc.subject | contamination | |
dc.subject | genetic sequencing | |
dc.title | Detecting and Correcting Contamination in Genetic Data. | |
dc.type | Thesis | en_US |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Biostatistics | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Boehnke, Michael Lee | |
dc.contributor.committeemember | Burke, David T | |
dc.contributor.committeemember | Abecasis, Goncalo | |
dc.contributor.committeemember | Kang, Hyun Min | |
dc.subject.hlbsecondlevel | Genetics | |
dc.subject.hlbsecondlevel | Statistics and Numeric Data | |
dc.subject.hlbtoplevel | Science | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/120783/1/mflick_1.pdf | |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |
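The abstract describes modeling NGS reads as a mixture of reads from the true sample and a contaminating sample. The record does not give the dissertation's actual likelihood, so the following is only a minimal sketch of that general idea under stated assumptions: the true sample's genotype is known at each site, the contaminant's genotype follows Hardy-Weinberg proportions at the population allele frequency, the base error rate is fixed, and the contamination fraction is found by a simple grid search. All function names and parameters (`site_log_likelihood`, `estimate_contamination`, `simulate_sites`, `err`, `steps`) are hypothetical and chosen for illustration.

```python
import math
import random


def site_log_likelihood(alpha, g1, f, n_alt, n_ref, err=0.01):
    """Log-likelihood of the read counts at one biallelic site.

    alpha : assumed fraction of reads coming from the contaminating sample
    g1    : known alt-allele count (0, 1, 2) of the true sample
    f     : population alt-allele frequency, strictly between 0 and 1
    err   : assumed per-read error rate, strictly between 0 and 0.5
    """
    # Hardy-Weinberg prior on the unknown contaminant genotype g2 (assumption).
    priors = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
    terms = []
    for g2, prior in enumerate(priors):
        # Expected alt-allele fraction among reads from the two-sample mixture.
        q = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        # Probability a read shows the alt allele after symmetric read error.
        p = q * (1 - err) + (1 - q) * err
        terms.append(
            math.log(prior) + n_alt * math.log(p) + n_ref * math.log(1 - p)
        )
    # Log-sum-exp over contaminant genotypes for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def estimate_contamination(sites, err=0.01, steps=100, max_alpha=0.5):
    """Grid-search estimate of alpha over [0, max_alpha].

    sites is an iterable of (g1, f, n_alt, n_ref) tuples.
    """
    sites = list(sites)
    best_alpha, best_ll = 0.0, -math.inf
    for i in range(steps + 1):
        alpha = max_alpha * i / steps
        ll = sum(
            site_log_likelihood(alpha, g1, f, a, r, err) for g1, f, a, r in sites
        )
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha


def simulate_sites(alpha, n_sites=1000, depth=30, err=0.01, seed=0):
    """Simulate read counts under the same assumed model (for illustration)."""
    rng = random.Random(seed)

    def draw_genotype(f):
        # Two independent allele draws give a Hardy-Weinberg genotype.
        return int(rng.random() < f) + int(rng.random() < f)

    sites = []
    for _ in range(n_sites):
        f = rng.uniform(0.05, 0.95)
        g1, g2 = draw_genotype(f), draw_genotype(f)
        q = (1 - alpha) * g1 / 2 + alpha * g2 / 2
        p = q * (1 - err) + (1 - q) * err
        n_alt = sum(rng.random() < p for _ in range(depth))
        sites.append((g1, f, n_alt, depth - n_alt))
    return sites


if __name__ == "__main__":
    simulated = simulate_sites(alpha=0.10, seed=1)
    print("estimated contamination fraction:", estimate_contamination(simulated))
```

Grid search is used here only because it is easy to verify; a one-dimensional optimizer could replace it. The sketch also assumes genotypes for the true sample are available, whereas the abstract notes that Chapter 3 addresses samples with no previous genotype data.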