Error Correction for DNA Sequencing via Disk Based Index and Box Queries
dc.contributor.author | Gu, Yarong | |
dc.contributor.advisor | Zhu, Qiang | |
dc.date.accessioned | 2017-04-26T17:33:13Z | |
dc.date.available | 2018-05-04T20:56:59Z | en |
dc.date.issued | 2017-04-30 | |
dc.date.submitted | 2017-04-07 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/136615 | |
dc.description.abstract | The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing methods for error correction demand large, expensive memory space, which limits their scalability for handling large datasets. In this thesis, we introduce a new disk based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers for sequencing genome data along with their associated metadata in a disk based index tree, called the BoND-tree, and uses the index to efficiently process specially designed box queries to obtain relevant k-mers and their occurrence frequencies. It takes an input read and locates the potential errors in the sequence. It then applies a comprehensive voting mechanism and possibly an efficient binary encoding based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. To overcome the drawback of an offline approach such as DiskBQcor, which wastes computing resources while DNA sequencing is in progress, we suggest an online approach to correcting sequencing errors. The online processing strategies and accuracy measures are discussed. An algorithm for deleting indexed k-mers from the BoND-tree, which is a stepping stone for the online sequencing error correction, is also introduced. Our experiments demonstrate that the proposed methods are quite promising in error correction for sequencing genome data on disk. The resulting BoND-tree with correct k-mers can also be used for sequence analysis applications such as variant detection. | en_US |
dc.language.iso | en_US | en_US |
dc.subject | Bioinformatics | en_US |
dc.subject | Genome Database | en_US |
dc.subject | Sequencing Error Correction | en_US |
dc.subject | Index Method | en_US |
dc.subject | Box Query | en_US |
dc.subject | Algorithm | en_US |
dc.subject.other | Computer and Information Science | en_US |
dc.title | Error Correction for DNA Sequencing via Disk Based Index and Box Queries | en_US |
dc.type | Thesis | en_US |
dc.description.thesisdegreename | Master of Science (MS) | en_US |
dc.description.thesisdegreediscipline | Computer and Information Science, College of Engineering and Computer Science | en_US |
dc.description.thesisdegreegrantor | University of Michigan-Dearborn | en_US |
dc.contributor.committeemember | Shen, Jie | |
dc.contributor.committeemember | Medjahed, Brahim | |
dc.identifier.uniqname | 81982392 | en_US |
dc.description.bitstreamurl | https://deepblue.lib.umich.edu/bitstream/2027.42/136615/3/Thesis_YarongGu_519_corrected.pdf | |
dc.identifier.orcid | 0000-0002-8862-1025 | en_US |
dc.identifier.name-orcid | Gu, Yarong; 0000-0002-8862-1025 | en_US |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |