Rakotomandimby, Monsieur Seta (2022) Determining the number of clusters of k-means using coding compression principles PRE - Research Project, ENSTA.



In this report, we propose a criterion to find the right number of clusters in a multi-dimensional data set. This criterion is the length of a Two-Part Code. We designed this Two-Part code according to the Minimum description Length principle. Its first part en-codes the MODEL used to describe the data set, its second part encodes the DATA with the help of the model. The problem at hand is formalized as a model selection problem among multiple round Gaussian models. We designed three versions of Two-Part codes, each version codes the partition of the clustered data set differently — by its cluster tag, by its cluster size or by coding its cluster size offset. To do the clustering in itself, we used the Random Swap algorithm [1], an im-proved version of K-MEANS which swap randomly a centroid with a point at each step. Our codes were tested on synthetic multivariate Gaussian data, with different dimensions, number of clusters, overlapping between clusters and were compared with AIC and BIC and some cluster validation indices from [4]. Our Two-part code with the tag proved to be better than all of the other criterion with the exception of the SILHOUETTE index which performed as good and even better. However our Two-Part code is considerably faster than the SILHOUETTE index, having a linear time complexity against a quadratic time complexity

Item Type:Thesis (PRE - Research Project)
Uncontrolled Keywords:Clustering
Subjects:Information and Communication Sciences and Technologies
Mathematics and Applications
ID Code:9182
Deposited By:Seta Rakotomandimby
Deposited On:18 juill. 2023 10:58
Dernière modification:18 juill. 2023 10:58

Repository Staff Only: item control page