Rakotomandimby, Monsieur Seta (2022) Determining the number of clusters of k-means using coding compression principles PRE - Projet de recherche, ENSTA.
Fichier(s) associé(s) à ce document :
| PDF 473Kb |
Résumé
In this report, we propose a criterion to find the right number of clusters in a multi-dimensional data set. This criterion is the length of a Two-Part Code. We designed this Two-Part code according to the Minimum description Length principle. Its first part en-codes the MODEL used to describe the data set, its second part encodes the DATA with the help of the model. The problem at hand is formalized as a model selection problem among multiple round Gaussian models. We designed three versions of Two-Part codes, each version codes the partition of the clustered data set differently — by its cluster tag, by its cluster size or by coding its cluster size offset. To do the clustering in itself, we used the Random Swap algorithm [1], an im-proved version of K-MEANS which swap randomly a centroid with a point at each step. Our codes were tested on synthetic multivariate Gaussian data, with different dimensions, number of clusters, overlapping between clusters and were compared with AIC and BIC and some cluster validation indices from [4]. Our Two-part code with the tag proved to be better than all of the other criterion with the exception of the SILHOUETTE index which performed as good and even better. However our Two-Part code is considerably faster than the SILHOUETTE index, having a linear time complexity against a quadratic time complexity
Type de document: | Rapport ou mémoire (PRE - Projet de recherche) |
---|---|
Mots-clés libres: | Clustering |
Sujets: | Sciences et technologies de l'information et de la communication Mathématiques et leurs applications |
Code ID : | 9182 |
Déposé par : | Seta Rakotomandimby |
Déposé le : | 18 juill. 2023 10:58 |
Dernière modification: | 18 juill. 2023 10:58 |