
A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking

dc.contributor.author: Varma, Umang
dc.date.accessioned: 2020-01-27T16:25:46Z
dc.date.available: NO_RESTRICTION
dc.date.available: 2020-01-27T16:25:46Z
dc.date.issued: 2019
dc.date.submitted:
dc.identifier.uri: https://hdl.handle.net/2027.42/153449
dc.description.abstract: A driving force behind the development of machine learning techniques is the availability of vast amounts of data that are continuously generated and the enormous potential of using these data. It is, however, not uncommon to have a paucity of data. This paucity can come in different forms: we may have an insufficient amount of data to perform the task at hand, we may have missing entries, or we may be working with aggregate statistics that only capture a fraction of the data we seek to study. To extract meaningful conclusions from scarce data is challenging and requires creative approaches that account for the scarcity in their assumptions and/or find other means to make up for insufficient data. It is not always possible to adapt the standard algorithm for a given problem and the challenge often lies in finding the right basis from which to build a viable solution. This thesis takes on two problems within this realm, where there is a paucity of data, and develops techniques to overcome the challenges such paucities can pose. In single cell RNA-sequencing, entries of a gene expression matrix are counts of the number of molecules observed where only a small fraction of the molecules in the cell have been observed. This is particularly challenging because biologists are often concerned with whether a gene is expressed in a cell at all; however, a zero entry in a gene expression matrix only says that no such molecules were \emph{seen} in the given cell, not that there \emph{were} no such molecules in the given cell. In practice, a vast majority (often over 90%) of entries in gene expression matrices are zero. We focus on the problem of feature selection: we give theoretical guarantees for information-theoretic algorithms and address practical issues involved in their implementation.
The problem of ranking items from pairwise comparisons is also often constrained by a paucity of data---collecting pairwise comparisons made by humans can be expensive---and we propose ways to incorporate other data to overcome a small number of pairwise comparison observations. In addition to giving a better sample complexity bound for the RankCentrality algorithm with simpler proofs that use matrix concentration inequalities, we introduce $\lambda$-regularized RankCentrality (theoretical analysis for this regularization depends on our new proofs for RankCentrality), which is capable of giving non-trivial output even when the number of observations is small, and a similarity-based regularization that can use features of the items being ranked to significantly improve performance in the small-sample regime.
dc.language.iso: en_US
dc.subject: single cell RNA-seq
dc.subject: ranking
dc.subject: machine learning
dc.subject: feature selection
dc.subject: pairwise comparisons
dc.title: A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking
dc.type: Thesis
dc.description.thesisdegreename: PhD (en_US)
dc.description.thesisdegreediscipline: Mathematics
dc.description.thesisdegreegrantor: University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember: Gilbert, Anna Catherine
dc.contributor.committeemember: Colacino, Justin
dc.contributor.committeemember: Jain, Lalit
dc.contributor.committeemember: Rajapakse, Indika
dc.contributor.committeemember: Smith, Karen E
dc.subject.hlbsecondlevel: Computer Science
dc.subject.hlbsecondlevel: Genetics
dc.subject.hlbsecondlevel: Mathematics
dc.subject.hlbsecondlevel: Statistics and Numeric Data
dc.subject.hlbtoplevel: Engineering
dc.subject.hlbtoplevel: Science
dc.description.bitstreamurl: https://deepblue.lib.umich.edu/bitstream/2027.42/153449/1/uvarma_1.pdf
dc.identifier.orcid: 0000-0003-1326-7970
dc.identifier.name-orcid: Varma, Umang; 0000-0003-1326-7970 (en_US)
dc.owningcollname: Dissertations and Theses (Ph.D. and Master's)

