Streaming Sketching: Mathematical Theory and Practical Algorithms
dc.contributor.author | Wang, Dingyu | |
dc.date.accessioned | 2025-05-12T17:39:43Z | |
dc.date.available | 2025-05-12T17:39:43Z | |
dc.date.issued | 2025 | |
dc.date.submitted | 2025 | |
dc.identifier.uri | https://hdl.handle.net/2027.42/197252 | |
dc.description.abstract | A streaming sketch is a data structure that can approximately answer queries of a data stream, whose size is much smaller than the stream length or the domain size. This thesis studies a wide range of topics on streaming sketching including approximate counting, cardinality estimation, the tractability of queries, the universality of sketches, non-mergeable sketches, as well as sampling. My research goes in two directions. Mathematical theory: For this direction, the simplicity and generality of the theory is the primary goal, and the produced algorithms can be existential or more like mathematical prototypes. Notable progress includes the following. I explicitly connect Levy processes to streaming algorithms, revealing a deep connection between infinite divisibility and scale-invariance. This connection sheds new light on the important tractability problem and also pins down the hash pattern one needs to $G$-sample over incremental streams with exactly correct sampling probabilities. I propose a new universal sketching scheme by decomposing general streaming queries into easy-to-estimate “harmonic queries.” The decomposition can be obtained from the Levy-Khintchine representation theorem or the Fourier transform. I propose a formal measurement of the goodness of a cardinality sketch, which sets up a tension between the Fisher information number and Shannon's entropy. I also generalize cardinality sketches to work over any (finite) commutative monoid, using the harmonic decomposition technique together with the semigroup decomposition of commutative monoids. Practical algorithms: For this direction, the simplicity and practicality of the algorithm is the primary goal, and the associated analysis is usually complicated and computation-heavy. Small-scale experiments are often included to confirm the analysis. I analyze various classes of sketches and estimators for the cardinality query. Many recipes are constructed for practitioners to pick and construct cardinality sketches and estimators in different contexts with different performance emphases. I study the sequential behaviors of cardinality sketches and construct simple fraud detectors. I study two variants of the Morris counter, one tracks a multi-dimensional count vector and the other tracks a count that can be both incremented and decremented. Fortunately, some of the algorithms I construct turn out to possess simplicity in both theory and practice. Notable examples are the $F_{1/2}$-sampler, stable-HyperLogLog, and the $d$-dimensional approximate counter. | |
dc.language.iso | en_US | |
dc.subject | streaming and sketching | |
dc.subject | algorithms and data structures | |
dc.title | Streaming Sketching: Mathematical Theory and Practical Algorithms | |
dc.type | Thesis | |
dc.description.thesisdegreename | PhD | |
dc.description.thesisdegreediscipline | Computer Science & Engineering | |
dc.description.thesisdegreegrantor | University of Michigan, Horace H. Rackham School of Graduate Studies | |
dc.contributor.committeemember | Pettie, Seth | |
dc.contributor.committeemember | Rudelson, Mark | |
dc.contributor.committeemember | Bansal, Nikhil | |
dc.contributor.committeemember | Derezinski, Michal | |
dc.contributor.committeemember | Woodruff, David | |
dc.subject.hlbsecondlevel | Computer Science | |
dc.subject.hlbtoplevel | Engineering | |
dc.contributor.affiliationumcampus | Ann Arbor | |
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/197252/1/wangdy_1.pdf | |
dc.identifier.doi | https://dx.doi.org/10.7302/25678 | |
dc.identifier.orcid | 0009-0000-5462-8715 | |
dc.identifier.name-orcid | Wang, Dingyu; 0009-0000-5462-8715 | en_US |
dc.working.doi | 10.7302/25678 | en |
dc.owningcollname | Dissertations and Theses (Ph.D. and Master's) |
Files in this item
Remediation of Harmful Language
The University of Michigan Library aims to describe its collections in a way that respects the people and communities who create, use, and are represented in them. We encourage you to Contact Us anonymously if you encounter harmful or problematic language in catalog records or finding aids. More information about our policies and practices is available at Remediation of Harmful Language.
Accessibility
If you are unable to use this file in its current format, please select the Contact Us link and we can modify it to make it more accessible to you.