The Evaluation Of The K-Means Clustering Algorithm Using Different Distancing Methods

Author(s)

Alex Hoover

School Name

South Carolina Governor's School for Science and Mathematics

Grade Level

12th Grade

Presentation Topic

Math and Computer Science

Presentation Type

Mentored

Mentor

Sebastian Palacio, Data Mining, German Research Center for Artificial Intelligence (DFKI)

Abstract

This research focused on identifying differences among distancing methods used in the k-means clustering algorithm: the Euclidean distance, the Manhattan distance, and the Earth Mover's Distance. Python code, wrapping some C and Fortran routines, was used to process images and evaluate the quality of the clusters the algorithm produced. The tests were run against a ground truth so that the quality of each measurement could be judged; Kylberg's Texture Set served as that primary ground truth. Initial results were obtained with an assumed cluster count of 28 (one cluster per texture), but further testing was required to search for significant differences in the data, so the cluster count was optimized using a Q-test and the Anderson-Darling statistic. The optimization data was then used to obtain more accurate results for the primary experiment: for Kylberg's Texture Set, the optimal cluster count was actually around 400 to 500. Results of k-means clustering on Kylberg's Texture Set using the optimized cluster count are not included in this paper.
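To make the experimental setup concrete, below is a minimal sketch of k-means with a swappable distance function, written in plain NumPy. This is not the authors' actual pipeline (which wrapped C and Fortran code); the `kmeans` function, its `init` parameter, and the metric helpers are illustrative names introduced here, not taken from the paper.

```python
import numpy as np

def euclidean(a, b):
    # straight-line (L2) distance between two feature vectors
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    # city-block (L1) distance between two feature vectors
    return float(np.abs(a - b).sum())

def kmeans(points, k, distance, iters=20, init=None, seed=0):
    """Lloyd's algorithm with a pluggable point-to-center distance."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        # default init: k distinct points chosen at random
        centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    else:
        centers = np.asarray(init, dtype=float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assignment step: nearest center under the chosen metric
        labels = np.array([
            min(range(k), key=lambda j: distance(p, centers[j]))
            for p in points
        ])
        # update step: move each center to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers
```

Note that the update step always uses the mean, which is only the true minimizer for squared Euclidean distance; a variant built around the Manhattan distance would instead use the component-wise median (k-medians). The Earth Mover's Distance between image histograms could likewise be passed in as `distance` (for one-dimensional histograms, `scipy.stats.wasserstein_distance` computes it).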

Start Date

4-11-2015 10:30 AM

End Date

4-11-2015 10:45 AM
