Error Correction for DNA Sequencing via Disk Based Index and Box Queries

Gu, Yarong

dc.contributor.author	Gu, Yarong
dc.contributor.advisor	Zhu, Qiang
dc.date.accessioned	2017-04-26T17:33:13Z
dc.date.available	2018-05-04T20:56:59Z	en
dc.date.issued	2017-04-30
dc.date.submitted	2017-04-07
dc.identifier.uri	https://hdl.handle.net/2027.42/136615
dc.description.abstract	The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as the Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing methods for error correction demand large amounts of expensive memory, which limits their scalability for handling large datasets. In this thesis, we introduce a new disk-based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores the k-mers of genome sequencing data, along with their associated metadata, in a disk-based index tree called the BoND-tree, and uses the index to efficiently process specially designed box queries that retrieve relevant k-mers and their occurrence frequencies. It takes an input read and locates the potential errors in the sequence. It then applies a comprehensive voting mechanism and possibly an efficient binary-encoding-based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. To overcome the drawback that an offline approach such as DiskBQcor wastes computing resources while DNA sequencing is in progress, we suggest an online approach to correcting sequencing errors. The online processing strategies and accuracy measures are discussed. An algorithm for deleting indexed k-mers from the BoND-tree, which is a stepping stone toward online sequencing error correction, is also introduced. Our experiments demonstrate that the proposed methods are quite promising for error correction of genome sequencing data on disk. The resulting BoND-tree with correct k-mers can also be used for sequence analysis applications such as variant detection.	en_US
dc.language.iso	en_US	en_US
dc.subject	Bioinformatics	en_US
dc.subject	Genome Database	en_US
dc.subject	Sequencing Error Correction	en_US
dc.subject	Index Method	en_US
dc.subject	Box Query	en_US
dc.subject	Algorithm	en_US
dc.subject.other	Computer and Information Science	en_US
dc.title	Error Correction for DNA Sequencing via Disk Based Index and Box Queries	en_US
dc.type	Thesis	en_US
dc.description.thesisdegreename	Master of Science (MS)	en_US
dc.description.thesisdegreediscipline	Computer and Information Science, College of Engineering and Computer Science	en_US
dc.description.thesisdegreegrantor	University of Michigan-Dearborn	en_US
dc.contributor.committeemember	Shen, Jie
dc.contributor.committeemember	Medjahed, Brahim
dc.identifier.uniqname	81982392	en_US
dc.description.bitstreamurl	https://deepblue.lib.umich.edu/bitstream/2027.42/136615/3/Thesis_YarongGu_519_corrected.pdf
dc.identifier.orcid	0000-0002-8862-1025	en_US
dc.identifier.name-orcid	Gu, Yarong; 0000-0002-8862-1025	en_US
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)
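
The abstract describes indexing k-mers with their frequencies, retrieving candidates through box queries, and voting on a suspect base. The sketch below illustrates that general idea only. It assumes an in-memory Counter as a stand-in for the disk-based BoND-tree, treats a box query as one set of allowed bases per k-mer position, and uses a simple highest-frequency vote. The function names (count_kmers, box_query, vote_base) are hypothetical and are not taken from the thesis, and the voting rule is a simplification rather than the DiskBQcor mechanism.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def count_kmers(reads, k):
    """Count every k-mer in the reads (in-memory stand-in for the index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def box_query(counts, box):
    """Return {k-mer: frequency} for indexed k-mers inside the box.

    `box` is a list with one set of allowed bases per position.
    This brute-force enumeration is an assumption for illustration;
    a real index would prune the search instead of enumerating.
    """
    hits = {}
    for letters in product(*[sorted(allowed) for allowed in box]):
        kmer = "".join(letters)
        if kmer in counts:
            hits[kmer] = counts[kmer]
    return hits


def vote_base(counts, kmer, pos):
    """Pick the base at `pos` backed by the most frequent matching k-mer."""
    box = [{base} for base in kmer]
    box[pos] = set(BASES)  # open the suspect position to all four bases
    hits = box_query(counts, box)
    if not hits:
        return kmer[pos]  # nothing indexed: keep the original base
    best = max(hits, key=lambda km: (hits[km], km))  # ties broken by k-mer
    return best[pos]


if __name__ == "__main__":
    reads = ["ACGTAC"] * 3 + ["ACGAAC"]  # the last read has a likely error
    counts = count_kmers(reads, k=4)
    print(vote_base(counts, "ACGA", pos=3))
```

Enumerating the box is practical only for small boxes such as this single open position; the thesis's index exists precisely to avoid scanning or enumerating candidates in memory.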

Files in this item

Name: Thesis_YarongGu_519_corrected.pdf
Size: 379.3 KB
Format: PDF
