A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking
dc.contributor.author | Varma, Umang | |
dc.date.accessioned | 2020-01-27T16:25:46Z | |
dc.date.available | NO_RESTRICTION | |
dc.date.available | 2020-01-27T16:25:46Z | |
dc.date.issued | 2019 | |
dc.date.submitted | ||
dc.identifier.uri | https://hdl.handle.net/2027.42/153449 | |
dc.description.abstract | A driving force behind the development of machine learning techniques is the availability of vast amounts of data that are continuously generated and the enormous potential of using these data. It is, however, not uncommon to have a paucity of data. This paucity can come in different forms: we may have an insufficient amount of data to perform the task at hand, we may have missing entries, or we may be working with aggregate statistics that only capture a fraction of the data we seek to study. To extract meaningful conclusions from scarce data is challenging and requires creative approaches that account for the scarcity in their assumptions and/or finding other means to make up for insufficient data. It is not always possible to adapt the standard algorithm for a given problem and the challenge often lies in finding the right basis from which to build a viable solution. This thesis takes on two problems within this realm, where there is a paucity of data, and develops techniques to overcome the challenges such paucities can pose. In single cell RNA-sequencing, entries of a gene expression matrix are counts of the number of molecules observed where only a small fraction of the molecules in the cell have been observed. This is particularly challenging because biologists are often concerned with whether a gene is expressed in a cell at all; however, a zero entry in a gene expression matrix only says that no such molecules were \emph{seen} in the given cell, not that there \emph{were} no such molecules in the given cell. In practice, a vast majority (often over 90%) of entries in gene expression matrices are zero. We focus on the problem of feature selection: we give theoretical guarantees for information-theoretic algorithms and address practical issues involved in their implementation. 
The problem of ranking items from pairwise comparisons is also often constrained by a paucity of data (collecting pairwise comparisons made by humans can be expensive), and we propose ways to incorporate other data to overcome a small number of pairwise comparison observations. In addition to giving a better sample complexity bound for the RankCentrality algorithm with simpler proofs that use matrix concentration inequalities, we introduce $\lambda$-regularized RankCentrality (theoretical analysis for this regularization depends on our new proofs for RankCentrality), that is capable of giving non-trivial output even when the number of observations is small, and a similarity-based regularization that can use features of the items being ranked to significantly improve performance in the small-sample regime. | |
dc.language.iso | en_US | |
dc.subject | single cell RNA-seq | |
dc.subject | ranking | |
dc.subject | machine learning | |
dc.subject | feature selection | |
dc.subject | pairwise comparisons | |
dc.title | A Paucity of Data in Machine Learning: Applications in Single Cell RNA Sequencing and Ranking | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | en_US |
dc.description.thesisdegreediscipline | Mathematics | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Gilbert, Anna Catherine | |
dc.contributor.committeemember | Colacino, Justin | |
dc.contributor.committeemember | Jain, Lalit | |
dc.contributor.committeemember | Rajapakse, Indika | |
dc.contributor.committeemember | Smith, Karen E | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbsecondlevel | Genetics | |
dc.subject.hlbsecondlevel | Mathematics | |
dc.subject.hlbsecondlevel | Statistics and Numeric Data | |
dc.subject.hlbtoplevel | Engineering | |
dc.subject.hlbtoplevel | Science | |
dc.description.bitstreamurl | https://deepblue.lib.umich.edu/bitstream/2027.42/153449/1/uvarma_1.pdf | |
dc.identifier.orcid | 0000-0003-1326-7970 | |
dc.identifier.name-orcid | Varma, Umang; 0000-0003-1326-7970 | en_US |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |