Title

Clustering Gene Cells Using Hierarchical Clustering

Author(s)

Kaitlyn Campbell

School Name

Governor's School for Science and Math

Grade Level

12th Grade

Presentation Topic

Math and Computer Science

Presentation Type

Mentored

Mentor

Dr. Luo, School of Computing, Clemson University

Abstract

Over the years, the amount of genetic data that needs to be analyzed has increased, and faster, more efficient algorithms are needed to process it. The Dynamically Growing Self Organizing Tree (DGSOT) algorithm, published in 2004 and implemented as a Java program, is one of many algorithms that group genes using a method called hierarchical clustering. It was designed to overcome the drawbacks of other clustering algorithms. The goal of this project was to test the DGSOT algorithm on multiple well-known datasets to determine its accuracy and efficiency. The algorithm was run on several datasets containing genetic data from different cell types. It grouped the data into ten to twelve clusters, a number similar to that produced by the Shared Nearest Neighbor (SNN) and Locality Preserving Projection (LPP) algorithms. The Adjusted Rand Index (ARI) was then calculated from the resulting clusters. The ARI is a commonly used clustering validation measure whose value is at most one; the closer the value is to one, the more accurate the clustering. Despite producing a similar number of clusters, the DGSOT algorithm was the least accurate of the three algorithms tested because its ARI value was significantly lower.
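To make the ARI concrete, the following is a minimal Python sketch of the standard pair-counting formula. It is not the code used in this project: the function name adjusted_rand_index and the toy label lists are hypothetical, and it assumes that a reference grouping (for example, known cell types) is available to compare against.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Return the Adjusted Rand Index between two labelings of the same items.

    Hypothetical helper for illustration; uses the standard pair-counting form.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the two labelings trivially agree

    # Contingency counts: how many items fall in reference group i and cluster j.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Number of item pairs placed together in both, in the reference, in the clustering.
    index = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # largest possible agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single group
    return (index - expected) / (max_index - expected)


# Toy data (made up): four cells with two reference types.
reference = [0, 0, 1, 1]
same_grouping = [1, 1, 0, 0]   # same partition, different label names
crossed_grouping = [0, 1, 0, 1]

print(adjusted_rand_index(reference, same_grouping))     # 1.0
print(adjusted_rand_index(reference, crossed_grouping))  # about -0.5
```

The second toy case shows that the ARI can fall below zero when a clustering agrees with the reference less often than chance would predict, which is why the measure is bounded above by one rather than confined to the range from zero to one.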

Location

Owens 207

Start Date

4-16-2016 9:30 AM
