A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking

Varma, Umang

dc.contributor.author	Varma, Umang
dc.date.accessioned	2020-01-27T16:25:46Z
dc.date.available	NO_RESTRICTION
dc.date.available	2020-01-27T16:25:46Z
dc.date.issued	2019
dc.date.submitted
dc.identifier.uri	https://hdl.handle.net/2027.42/153449
dc.description.abstract	A driving force behind the development of machine learning techniques is the availability of vast amounts of continuously generated data and the enormous potential of using these data. It is, however, not uncommon to have a paucity of data. This paucity can come in different forms: we may have an insufficient amount of data to perform the task at hand, we may have missing entries, or we may be working with aggregate statistics that only capture a fraction of the data we seek to study. Extracting meaningful conclusions from scarce data is challenging and requires creative approaches that account for the scarcity in their assumptions or find other means to make up for insufficient data. It is not always possible to adapt the standard algorithm for a given problem, and the challenge often lies in finding the right basis from which to build a viable solution. This thesis takes on two problems within this realm, where there is a paucity of data, and develops techniques to overcome the challenges such paucities can pose. In single cell RNA-sequencing, entries of a gene expression matrix are counts of the molecules observed, but only a small fraction of the molecules in the cell are observed. This is particularly challenging because biologists are often concerned with whether a gene is expressed in a cell at all; however, a zero entry in a gene expression matrix only says that no such molecules were \emph{seen} in the given cell, not that there \emph{were} no such molecules in the given cell. In practice, the vast majority (often over 90%) of entries in gene expression matrices are zero. We focus on the problem of feature selection: we give theoretical guarantees for information-theoretic algorithms and address practical issues involved in their implementation.
The problem of ranking items from pairwise comparisons is also often constrained by a paucity of data, since collecting pairwise comparisons made by humans can be expensive, and we propose ways to incorporate other data to compensate for a small number of pairwise comparison observations. In addition to giving a better sample complexity bound for the RankCentrality algorithm, with simpler proofs that use matrix concentration inequalities, we introduce $\lambda$-regularized RankCentrality, which is capable of giving non-trivial output even when the number of observations is small (the theoretical analysis of this regularization depends on our new proofs for RankCentrality), and a similarity-based regularization that can use features of the items being ranked to significantly improve performance in the small-sample regime.
dc.language.iso	en_US
dc.subject	single cell RNA-seq
dc.subject	ranking
dc.subject	machine learning
dc.subject	feature selection
dc.subject	pairwise comparisons
dc.title	A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Mathematics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Gilbert, Anna Catherine
dc.contributor.committeemember	Colacino, Justin
dc.contributor.committeemember	Jain, Lalit
dc.contributor.committeemember	Rajapakse, Indika
dc.contributor.committeemember	Smith, Karen E
dc.subject.hlbsecondlevel	Computer Science
dc.subject.hlbsecondlevel	Genetics
dc.subject.hlbsecondlevel	Mathematics
dc.subject.hlbsecondlevel	Statistics and Numeric Data
dc.subject.hlbtoplevel	Engineering
dc.subject.hlbtoplevel	Science
dc.description.bitstreamurl	https://deepblue.lib.umich.edu/bitstream/2027.42/153449/1/uvarma_1.pdf
dc.identifier.orcid	0000-0003-1326-7970
dc.identifier.name-orcid	Varma, Umang; 0000-0003-1326-7970	en_US
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: uvarma_1.pdf
Size: 1.266MB
Format: PDF