#### Title

Clustering Gene Cells Using Hierarchical Clustering

#### School Name

Governor's School for Science and Math

#### Grade Level

12th Grade

#### Presentation Topic

Math and Computer Science

#### Presentation Type

Mentored

#### Abstract

The amount of genetic data that needs to be analyzed has grown steadily, and faster, more efficient algorithms are needed to process it. The Dynamically Growing Self Organizing Tree (DGSOT) algorithm, implemented in Java and published in 2004, is one of many algorithms that group genes using hierarchical clustering. It was designed to overcome drawbacks of other clustering algorithms. The goal of this project was to test the DGSOT algorithm on multiple well-known datasets to determine its accuracy and efficiency. The algorithm was run on several datasets containing genetic data from different cell types. It grouped the data into ten to twelve clusters, a number similar to that produced by the Shared Nearest Neighbor (SNN) and Locality Preserving Projection (LPP) algorithms. The Adjusted Rand Index (ARI) was then calculated for each clustering. The ARI is a commonly used clustering validation measure whose value is at most one; the closer the value is to one, the more accurate the clustering. Despite producing a similar number of clusters, the DGSOT algorithm had a significantly lower ARI value than the other two algorithms, making it the least accurate of the three tested.
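To illustrate how an ARI value can be obtained, the following is a minimal Python sketch of the standard pair-counting formula. It is not the code used in this project; the function name `adjusted_rand_index` and the toy label lists are hypothetical and chosen only for illustration. The sketch assumes that a reference labeling of the cells (for example, known cell types) is available for comparison with the labels an algorithm assigns.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Return the Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree

    # Contingency counts: how many items share each (true, predicted) label pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs grouped together in both, in the reference, in the prediction.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # largest possible agreement

    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: six cells with two known cell types.
reference = [0, 0, 0, 1, 1, 1]
same_groups_renamed = [1, 1, 1, 0, 0, 0]  # identical grouping, different label names
mixed_groups = [0, 0, 1, 1, 0, 1]         # also two clusters, but a different grouping

print(adjusted_rand_index(reference, same_groups_renamed))  # 1.0
print(adjusted_rand_index(reference, mixed_groups))
```

Both toy predictions contain two clusters, yet only the first matches the reference grouping and scores 1.0. This mirrors the point made in the abstract: a similar number of clusters does not by itself imply a similar ARI. Because the index is corrected for chance, it can also fall below zero when agreement is worse than random.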

#### Recommended Citation

Campbell, Kaitlyn, "Clustering Gene Cells Using Hierarchical Clustering" (2016). *South Carolina Junior Academy of Science*. 86.

http://scholarexchange.furman.edu/scjas/2016/all/86

#### Location

Owens 207

#### Start Date

4-16-2016 9:30 AM

#### Mentor

Dr. Luo, School of Computing, Clemson University