Session 2B: An Improved Initialization Approach to the K-Means Clustering Algorithm

Session Number

Session 2B: 1st Presentation

Advisor(s)

Jon Jung-Woon Yoo, Bradley University

Location

Room A149

Start Date

28-4-2017 10:00 AM

End Date

28-4-2017 11:15 AM

Abstract

Cluster analysis is a method in data analysis used to group data points with similar characteristics. K-means clustering is a widely used algorithm that groups data around a prototype, which in this case is the mean of all the data points in a cluster. This project investigated a better way to initialize data points into clusters before applying the algorithm. After the improved algorithm was programmed in Visual Basic for Applications and applied to the Abalone data set, the proposed algorithm was found to be 64.7% faster than the basic algorithm. However, the resulting clusters were of slightly lower quality, as measured by the sum of squared distances from each data point to its cluster centroid. An important area of future research is therefore to test the improved initializer on other clustering data sets and to adapt the algorithm to improve the accuracy and quality of the output clusters. Because cluster analysis has applications in every field that uses data analysis, finding improved algorithms for it matters across a wide range of domains.
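
The abstract does not describe the proposed initializer, so the sketch below illustrates only the baseline it is compared against: basic K-means with random initial centroids, plus the sum of squared distances to the cluster centroid used as the quality measure. It is written in Python rather than Visual Basic for Applications, and every function name and parameter here (`kmeans`, `sse`, `seed`, `max_iter`) is a hypothetical choice made for illustration, not the project's code.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points (the cluster prototype)."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means. Assumption: initial centroids are k randomly chosen points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = assign(points, centroids)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            updated.append(mean_point(members) if members else centroids[j])
        centroids = updated
        # Assignment step: stop once no point changes cluster.
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids, sse(points, labels, centroids), iterations
```

A different initializer would replace only the `rng.sample` line. Comparing the returned iteration count and SSE between the two versions is one plausible way to observe the kind of speed and quality trade-off the abstract reports.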
