Error Correction for DNA Sequencing via Disk Based Index and Box Queries

dc.contributor.author: Gu, Yarong
dc.contributor.advisor: Zhu, Qiang
dc.date.accessioned: 2017-04-26T17:33:13Z
dc.date.available: 2018-05-04T20:56:59Z [en]
dc.date.issued: 2017-04-30
dc.date.submitted: 2017-04-07
dc.identifier.uri: https://hdl.handle.net/2027.42/136615
dc.description.abstract: The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing methods for error correction demand large amounts of expensive memory, which limits their scalability for handling large datasets. In this thesis, we introduce a new disk based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers for sequencing genome data along with their associated metadata in a disk based index tree, called the BoND-tree, and uses the index to efficiently process specially designed box queries to obtain relevant k-mers and their occurrence frequencies. It takes an input read and locates the potential errors in the sequence. It then applies a comprehensive voting mechanism and possibly an efficient binary encoding based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. To overcome the drawback of an offline approach such as DiskBQcor, which wastes computing resources while DNA sequencing is in progress, we suggest an online approach to correcting sequencing errors. The online processing strategies and accuracy measures are discussed. An algorithm for deleting indexed k-mers from the BoND-tree, which is a stepping stone for the online sequencing error correction, is also introduced. Our experiments demonstrate that the proposed methods are quite promising in error correction for sequencing genome data on disk. The resulting BoND-tree with correct k-mers can also be used for sequence analysis applications such as variant detection. [en_US]
dc.language.iso: en_US [en_US]
dc.subject: Bioinformatics [en_US]
dc.subject: Genome Database [en_US]
dc.subject: Sequencing Error Correction [en_US]
dc.subject: Index Method [en_US]
dc.subject: Box Query [en_US]
dc.subject: Algorithm [en_US]
dc.subject.other: Computer and Information Science [en_US]
dc.title: Error Correction for DNA Sequencing via Disk Based Index and Box Queries [en_US]
dc.type: Thesis [en_US]
dc.description.thesisdegreename: Master of Science (MS) [en_US]
dc.description.thesisdegreediscipline: Computer and Information Science, College of Engineering and Computer Science [en_US]
dc.description.thesisdegreegrantor: University of Michigan-Dearborn [en_US]
dc.contributor.committeemember: Shen, Jie
dc.contributor.committeemember: Medjahed, Brahim
dc.identifier.uniqname: 81982392 [en_US]
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/136615/3/Thesis_YarongGu_519_corrected.pdf
dc.identifier.orcid: 0000-0002-8862-1025 [en_US]
dc.identifier.name-orcid: Gu, Yarong; 0000-0002-8862-1025 [en_US]
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

