The Evaluation Of The K-Means Clustering Algorithm Using Different Distancing Methods
School Name
South Carolina Governor's School for Science and Mathematics
Grade Level
12th Grade
Presentation Topic
Math and Computer Science
Presentation Type
Mentored
Abstract
This research focused on evaluating the differences among distance metrics used in the k-means clustering algorithm: the Euclidean distance, the Manhattan distance, and the Earth Mover's Distance. To accomplish this, Python code, wrapped around C and Fortran routines, was used to process images and measure the quality of the clusters produced by the algorithm. The tests were run against a ground truth so the quality of each metric could be judged; for this experiment, Kylberg's Texture Set served as that primary ground truth. After initial results were obtained with an assumed cluster count of 28 (one cluster per texture class), further testing was required to search for significant differences in the data, so the cluster count was optimized using a Q-test and the Anderson-Darling statistic. This optimization data was used to refine the primary experiment, and the optimal cluster count for Kylberg's Texture Set turned out to lie around 400 to 500. Results of running k-means on Kylberg's Texture Set with the optimized cluster count are not included in this paper.
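The comparison described above hinges on running k-means with an interchangeable distance function. A minimal pure-NumPy sketch of Lloyd's-style k-means with a pluggable metric is given below; it is a simplified illustration, not the paper's actual Python/C/Fortran pipeline, and the function names and toy data are assumptions. The Earth Mover's Distance is omitted here since it requires a dedicated optimal-transport solver.

```python
import numpy as np

def euclidean(points, c):
    # L2 distance from each point to centroid c
    return np.linalg.norm(points - c, axis=-1)

def manhattan(points, c):
    # L1 (city-block) distance from each point to centroid c
    return np.abs(points - c).sum(axis=-1)

def kmeans(points, k, distance, n_iter=100, seed=0):
    """Lloyd's k-means with a pluggable distance metric.

    points   : (n, d) array of feature vectors
    distance : callable mapping ((n, d), (d,)) -> (n,) distances
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid under the chosen metric
        dists = np.stack([distance(points, c) for c in centroids], axis=1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points,
        # keeping the old centroid if a cluster emptied out
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

One caveat worth noting: the mean-based centroid update is only the exact minimizer for the Euclidean distance; under the Manhattan distance the per-coordinate median is the true minimizer (the k-medians variant), so swapping metrics into the assignment step alone, as sketched here, is a heuristic.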
Recommended Citation
Hoover, Alex, "The Evaluation Of The K-Means Clustering Algorithm Using Different Distancing Methods" (2015). South Carolina Junior Academy of Science. 112.
https://scholarexchange.furman.edu/scjas/2015/all/112
Start Date
4-11-2015 10:30 AM
End Date
4-11-2015 10:45 AM
Mentor
Sebastian Palacio, Data Mining, German Research Center for Artificial Intelligence (DFKI)