Session 2B: An Improved Initialization Approach to the K-Means Clustering Algorithm
Session Number
Session 2B: 1st Presentation
Advisor(s)
Jon Jung-Woon Yoo, Bradley University
Location
Room A149
Start Date
28-4-2017 10:00 AM
End Date
28-4-2017 11:15 AM
Abstract
Cluster analysis is a data analysis method that groups data points with similar characteristics. K-means clustering is a widely used algorithm that partitions data around a prototype for each cluster, in this case the mean of all the data points in the cluster. This project investigated a better way to initialize data points into clusters before applying the algorithm. After the improved algorithm was programmed in Visual Basic for Applications and applied to the Abalone data set, it was found to be 64.7% faster than the basic algorithm. However, the resulting clusters were of slightly lower quality, as shown by the sum of squared distances to the cluster centroids. An important area of future research is therefore to test the improved initializer on other clustering data sets and to adapt the algorithm to improve the accuracy and quality of the resulting clusters. Cluster analysis has applications in every field that uses data analysis, so finding improved algorithms for it is crucial to our understanding of almost every aspect of our world.
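For readers unfamiliar with the baseline, the following is a minimal Python sketch of basic K-means and of the quality measure named in the abstract (the sum of squared distances to the cluster centroids). It is not the project's Visual Basic for Applications code, and it does not implement the improved initializer, whose details the abstract does not give. The random choice of starting centroids, the function names, and the `seed` and `max_iter` parameters are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means; returns (labels, centroids, sse).

    Assumption: the starting centroids are k distinct points drawn at
    random. This is a common baseline, not the initializer studied in
    the project.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    # Quality measure: sum of squared distances to the assigned centroid.
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse
```

Calling `kmeans(points, k)` on a list of coordinate tuples returns one cluster label per point, the final centroids, and the sum of squared distances; a lower value indicates tighter clusters. Because the initializer only changes where the loop starts, comparing two initializers amounts to comparing the run time and this final value for each.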