Fast high-dimensional Bayesian classification and clustering

Vahid Partovi Nia
2009
EPFL thesis

Abstract

We introduce a fast approach to classification and clustering for high-dimensional continuous data, based on Bayesian mixture models for which explicit computations are available. This allows classification and clustering to be treated in a single framework, and allows the probability of an unobserved class to be calculated. The new classifier is robust to the addition of noise variables as a result of the built-in spike-and-slab structure of the proposed Bayesian model. The usefulness of classification using our method is shown on a metabolomic example, and on the Iris data with and without noise variables. Agglomerative hierarchical clustering based on the posterior probabilities of particular partitions is used to construct a dendrogram with a probabilistic interpretation. An extension to variable selection is proposed, which summarises the importance of variables for classification or clustering and has a probabilistic interpretation. The simplicity of the model allows its parameters to be estimated by maximum likelihood and therefore yields a fully automatic algorithm. The new clustering method is applied to metabolomic, microarray, and image data and is studied using simulated data motivated by real datasets. The computational difficulties of the new approach are discussed, solutions for accelerating the algorithm are proposed, and the computer code written for it is briefly analysed. Simulations show that the quality of the estimated model parameters depends on the parametric distribution assumed for the effects but that, once the model parameters are fixed at reasonable values, the distribution of the effects influences clustering very little. Simulations confirm that the clustering algorithm and the proposed variable selection method are reliable even when the model assumptions are wrong.
The new approach is compared with the popular Bayesian clustering alternative, MCLUST, fitted on the principal components. Under the two loss functions considered, our proposed approach is found to be more efficient in almost every situation.
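
The idea of building a dendrogram from posterior probabilities of partitions can be illustrated with a minimal sketch. It is not the model of the thesis, whose details the abstract does not give. The sketch assumes a simplified stand-in: within a cluster, each variable has a Gaussian mean effect with prior variance `tau2`, Gaussian noise with variance `sigma2`, both fixed by hand, and a uniform prior over partitions, so that the log posterior of a partition is, up to a constant, the sum of the closed-form cluster log marginal likelihoods. The spike-and-slab component and the maximum likelihood estimation of parameters are omitted, and the names `log_marginal` and `agglomerate` are hypothetical.

```python
import numpy as np


def log_marginal(X, sigma2=1.0, tau2=10.0):
    """Closed-form log marginal likelihood of the rows of X as one cluster.

    Assumed model (illustrative only): for each variable j independently,
    x_ij = mu_j + e_ij with mu_j ~ N(0, tau2) and e_ij ~ N(0, sigma2).
    Integrating out mu_j gives x_.j ~ N(0, sigma2 * I + tau2 * 11').
    """
    n = X.shape[0]
    s = X.sum(axis=0)            # per-variable sums
    ss = (X ** 2).sum(axis=0)    # per-variable sums of squares
    denom = sigma2 + n * tau2
    quad = (ss - tau2 * s ** 2 / denom) / sigma2          # x' Sigma^{-1} x
    logdet = (n - 1) * np.log(sigma2) + np.log(denom)     # log det Sigma
    return float(np.sum(-0.5 * (n * np.log(2 * np.pi) + logdet + quad)))


def agglomerate(X, sigma2=1.0, tau2=10.0):
    """Greedy agglomerative clustering on log marginal likelihood gains.

    Returns the list of merges (cluster a, cluster b, log gain) and the
    visited partition with the highest total log marginal likelihood.
    """
    clusters = [[i] for i in range(X.shape[0])]
    lm = [log_marginal(X[c], sigma2, tau2) for c in clusters]
    best_total, best_partition = sum(lm), [list(c) for c in clusters]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair whose merge most increases the log posterior.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = log_marginal(X[clusters[a] + clusters[b]], sigma2, tau2)
                gain = merged - lm[a] - lm[b]
                if best is None or gain > best[0]:
                    best = (gain, a, b, merged)
        gain, a, b, merged = best
        merges.append((list(clusters[a]), list(clusters[b]), gain))
        clusters[a] = clusters[a] + clusters[b]
        lm[a] = merged
        del clusters[b], lm[b]
        total = sum(lm)
        if total > best_total:
            best_total, best_partition = total, [list(c) for c in clusters]
    return merges, best_partition


# Two well-separated groups of five observations on three variables.
offsets = np.linspace(-0.5, 0.5, 5)[:, None]
X = np.vstack([-5.0 + offsets * np.ones((5, 3)), 5.0 + offsets * np.ones((5, 3))])
merges, partition = agglomerate(X)
print(sorted(sorted(c) for c in partition))
```

Each merge height in such a dendrogram is a log ratio of marginal likelihoods, which is what gives the tree a probabilistic reading under the stated assumptions. The exhaustive pair search is cubic in the number of observations and is kept only for clarity.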

Official source

https://infoscience.epfl.ch/record/138938?ln=fr

