Clustering Gene Cells Using Hierarchical Clustering
School Name
Governor's School for Science and Math
Grade Level
12th Grade
Presentation Topic
Math and Computer Science
Presentation Type
Mentored
Abstract
The amount of genetic data that needs to be analyzed has increased over the years, and faster, more efficient algorithms are needed to process it. The Dynamically Growing Self Organizing Tree (DGSOT) algorithm, a Java program published in 2004, is one of many algorithms that group genes using a method called hierarchical clustering. It was designed to overcome the drawbacks of other clustering algorithms. The goal of this project was to test the DGSOT algorithm on multiple well-known datasets to determine its accuracy and efficiency. The algorithm was run on several datasets containing genetic data from different cell types. It grouped the data into ten to twelve clusters, a number similar to that produced by the Shared Nearest Neighbor (SNN) and Locality Preserving Projection (LPP) algorithms. The Adjusted Rand Index (ARI) was then calculated from the resulting clusters. The ARI is a commonly used clustering validation measure whose value is at most one; the closer the value is to one, the more accurate the clustering. Despite producing a similar number of clusters, the DGSOT algorithm was the least accurate of the three algorithms tested, with a significantly lower ARI value.
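As an illustration of the validation step described in the abstract, the following is a minimal sketch of how an Adjusted Rand Index can be computed from two labelings of the same items. It is not the code used in the study; the function name adjusted_rand_index and the example labels are hypothetical, and it assumes that reference labels (for example, known cell types) are available for comparison.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Return the Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration only; it follows the standard
    pair-counting definition of the ARI.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement

    # Count items per (true cluster, predicted cluster) cell and per cluster.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Count pairs of items that fall together in each cell, row, and column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2        # largest possible agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Made-up labels: six items, two reference groups, three predicted clusters.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 2, 2]
score = adjusted_rand_index(reference, predicted)
```

A score of 1 means the two labelings agree exactly, up to renaming the clusters. Values near 0 indicate agreement no better than chance, and the score can be negative when agreement is worse than chance.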
Recommended Citation
Campbell, Kaitlyn, "Clustering Gene Cells Using Hierarchical Clustering" (2016). South Carolina Junior Academy of Science. 86.
https://scholarexchange.furman.edu/scjas/2016/all/86
Location
Owens 207
Start Date
4-16-2016 9:30 AM
Mentor
Dr. Luo, School of Computing, Clemson University