Conditional Clustering Method on KNN for Big Data
dc.contributor.author | Chen, Qi
dc.contributor.advisor | Fredrickson, Mark
dc.date.accessioned | 2024-06-25T14:16:34Z
dc.date.available | 2024-06-25T14:16:34Z
dc.date.issued | 2024
dc.identifier.uri | https://hdl.handle.net/2027.42/193909
dc.description.abstract | This thesis addresses the challenges that k-nearest neighbor (KNN) classifiers face when handling big data, particularly large storage requirements and long training times. The proposed solution centers on data filtering. Drawing on the fundamental premise of KNN classifiers, that similar data points have similar conditional distributions of the response variable, this study employs clustering to segment the training data and then filters it by selecting the closest cluster as the training set. Stemming from Bayes' rule and the mixture distribution of the data, the clustering refinement applies clustering techniques conditional on the class of the responses. The algorithm adopts a hierarchical clustering model, chosen for its stability and efficiency. To counterbalance the information lost by filtering the training set, the algorithm replaces the standard KNN classifier with a local KNN classifier; by locally adjusting the KNN classifier's parameters, the model achieves a more favorable trade-off between bias and variance. The proposed model is evaluated on three real-world data sets: Fashion MNIST, Forest Cover Type Prediction, and Online Shoppers Intention from the UCI Machine Learning Repository. The test results demonstrate that the conditional clustering method significantly improves runtime efficiency, while the local KNN classifier improves predictive ability. Notably, the number of clusters proves to be a critical factor in the model's accuracy: although increasing the number of clusters shrinks the filtered training set, and thus loses information, a higher number of clusters gives the local KNN classifier more opportunities to balance variance and bias, consequently lowering model risk.
dc.subject | k-nearest neighbor
dc.subject | local k-nearest neighbor
dc.subject | hierarchical clustering
dc.subject | Bayes estimator
dc.subject | classification
dc.title | Conditional Clustering Method on KNN for Big Data
dc.type | Thesis
dc.description.thesisdegreename | Honors (Bachelor's)
dc.description.thesisdegreediscipline | Statistics | en_US
dc.description.thesisdegreegrantor | University of Michigan
dc.subject.hlbsecondlevel | Statistics
dc.subject.hlbtoplevel | Science
dc.contributor.affiliationum | Statistics
dc.contributor.affiliationumcampus | Ann Arbor
dc.description.bitstreamurl | http://deepblue.lib.umich.edu/bitstream/2027.42/193909/1/chenlala.pdf
dc.identifier.doi | https://dx.doi.org/10.7302/23391
dc.working.doi | 10.7302/23391 | en |
dc.owningcollname | Honors Theses (Bachelor's) |
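
One way to make the abstract's appeal to Bayes' rule and the mixture distribution concrete (the notation below, including the cluster weights pi_{c,g} and component densities f_{c,g}, is ours for illustration, not the thesis's own): Bayes' rule writes the posterior class probability as

    \[ P(Y = c \mid X = x) = \frac{f(x \mid Y = c)\, P(Y = c)}{\sum_{c'} f(x \mid Y = c')\, P(Y = c')}, \]

and modeling each class-conditional density as a mixture over within-class clusters,

    \[ f(x \mid Y = c) = \sum_{g=1}^{G} \pi_{c,g}\, f_{c,g}(x), \]

suggests that near a query point x the nearest cluster's component dominates the sum, so restricting the training set to the closest cluster of each class should discard little of the information needed to estimate the posterior.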
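
Under that reading, the following is a minimal Python sketch of the pipeline the abstract describes, assuming NumPy and scikit-learn. The function names (fit_conditional_clusters, filter_training_set, predict_one) and the simple cap-by-filtered-size rule standing in for the thesis's local KNN adjustment are illustrative choices, not the thesis's code.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.neighbors import KNeighborsClassifier

    def fit_conditional_clusters(X, y, n_clusters):
        """Hierarchically cluster the training data within each response class.

        Assumes every class has at least n_clusters points.
        """
        clusters, centroids = {}, {}  # keyed by (class label, cluster id)
        for c in np.unique(y):
            idx = np.where(y == c)[0]
            labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X[idx])
            for g in range(n_clusters):
                members = idx[labels == g]
                clusters[(c, g)] = members
                centroids[(c, g)] = X[members].mean(axis=0)
        return clusters, centroids

    def filter_training_set(x, y, clusters, centroids):
        """Keep only the cluster nearest to the query x within each class."""
        kept = []
        for c in np.unique(y):
            keys = [key for key in centroids if key[0] == c]
            nearest = min(keys, key=lambda key: np.linalg.norm(x - centroids[key]))
            kept.append(clusters[nearest])
        return np.concatenate(kept)

    def predict_one(x, X, y, clusters, centroids, base_k=15):
        """Classify x with a KNN classifier fit on the filtered training set only."""
        idx = filter_training_set(x, y, clusters, centroids)
        # Crude stand-in for the local KNN adjustment: cap k by the size of
        # the filtered set so small clusters do not force an oversized k.
        k = min(base_k, len(idx))
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        return knn.predict(x.reshape(1, -1))[0]

In this sketch, a larger n_clusters makes each filtered training set smaller and faster to search at the cost of information, while also leaving the local classifier more room to adjust k, which mirrors the trade-off the abstract reports.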