Publication

ε-Coresets for Clustering (with Outliers) in Doubling Metrics

Jian Li, Lingxiao Huang
2018
Conference paper
Abstract

We study the problem of constructing $\varepsilon$-coresets for the $(k, z)$-clustering problem in a doubling metric $M(X, d)$. An $\varepsilon$-coreset is a weighted subset $S \subseteq X$ with weight function $w : S \to \mathbb{R}_{\ge 0}$, such that for any $k$-subset $C \subseteq X$ it holds that $\sum_{x \in S} w(x) \cdot d^z(x, C) \in (1 \pm \varepsilon) \cdot \sum_{x \in X} d^z(x, C)$. We present an efficient algorithm that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem in $M(X, d)$, where the size of the coreset depends only on the parameters $k$, $z$, $\varepsilon$ and the doubling dimension $\mathrm{ddim}(M)$. To the best of our knowledge, this is the first efficient $\varepsilon$-coreset construction of size independent of $|X|$ for general clustering problems in doubling metrics.

To this end, we establish the first relation between the doubling dimension of $M(X, d)$ and the shattering dimension (or VC dimension) of the range space induced by the distance $d$. Such a relation was not known before, since one can easily construct instances in which neither quantity can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small $(1 \pm \varepsilon)$-distortion of the distance function $d$ (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded by $O(\varepsilon^{-O(\mathrm{ddim}(M))})$. For the purpose of coreset construction this bound does not suffice, as it only works for unweighted spaces. We therefore introduce the notion of the $\tau$-error probabilistic shattering dimension, and prove a (drastically better) upper bound of $O(\mathrm{ddim}(M) \cdot \log(1/\varepsilon) + \log\log(1/\tau))$ on the probabilistic shattering dimension for weighted doubling metrics. As it turns out, an upper bound on the probabilistic shattering dimension is enough for constructing a small coreset. We believe the new relation between the doubling and shattering dimensions is of independent interest and may find other applications.

Furthermore, we study robust coresets for $(k, z)$-clustering with outliers in a doubling metric. We show an improved connection between $\alpha$-approximations and robust coresets. This also improves upon the previous best known bound on the size of robust coresets for Euclidean space [Feldman and Langberg, STOC 2011]. The new bound entails a few new results in clustering and property testing. As another application, we show that constant-sized $(\varepsilon, k, z)$-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were known only for Euclidean spaces. We can apply our centroid sets to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the $(k, z)$-clustering problem in doubling metrics.
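To make the coreset guarantee concrete, here is a minimal Python sketch (illustrative only: the function names are hypothetical, and points are assumed to lie in Euclidean space rather than a general doubling metric) that evaluates the weighted $(k, z)$-clustering cost and checks the $(1 \pm \varepsilon)$ condition for a single candidate center set:

```python
import numpy as np

def kz_cost(points, centers, z, weights=None):
    """Weighted (k, z)-clustering cost: sum over points x of
    w(x) * d(x, C)^z, where d(x, C) is the distance from x to the
    nearest center in C."""
    # Pairwise distances from every point to every center: shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)  # d(x, C) for every x
    if weights is None:
        weights = np.ones(len(points))
    return float(np.sum(weights * nearest ** z))

def satisfies_coreset_guarantee(X, S, w, centers, z, eps):
    """Check, for ONE candidate center set C, that the weighted cost on the
    coreset S lies within (1 +/- eps) of the cost on the full set X."""
    full = kz_cost(X, centers, z)
    core = kz_cost(S, centers, z, weights=w)
    return (1 - eps) * full <= core <= (1 + eps) * full
```

Note that the definition quantifies over every k-subset C of X, so a check like this can falsify, but never certify, the coreset property; the paper's construction guarantees the condition for all C simultaneously.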

Related concepts (34)
K-means
k-means clustering is a data-partitioning method and a combinatorial optimization problem. Given a set of points and an integer k, the problem is to divide the points into k groups, often called clusters, so as to minimize a certain function. One considers the distance from each point to the mean of the points in its cluster; the function to minimize is the sum of the squares of these distances.
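As a concrete illustration of this objective (a minimal sketch with hypothetical names, not code from any of the works cited here), the following Python snippet evaluates the k-means cost of a given assignment of points to clusters:

```python
import numpy as np

def kmeans_cost(points, labels, k):
    """Sum, over all clusters, of squared distances from each point to the
    mean of its cluster -- the quantity k-means seeks to minimize."""
    cost = 0.0
    for j in range(k):
        cluster = points[labels == j]
        if len(cluster) == 0:
            continue  # an empty cluster contributes nothing
        centroid = cluster.mean(axis=0)
        cost += float(((cluster - centroid) ** 2).sum())
    return cost

# Toy usage: two well-separated pairs assigned correctly give a small cost.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(kmeans_cost(pts, np.array([0, 0, 1, 1]), k=2))  # 0.01
```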
Data clustering
Data clustering (partitionnement de données in French) is a method in data analysis. It aims to divide a set of data into homogeneous groups, in the sense that the data in each subset share common characteristics, most often corresponding to proximity criteria (similarity) defined by introducing measures and classes of distance between objects.
Correlation clustering
Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. In machine learning, correlation clustering (or cluster editing) operates in a scenario where the relationships between the objects are known instead of the actual representations of the objects.
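To illustrate the objective (a minimal sketch on a hypothetical toy instance, not taken from the publication above), the following Python snippet counts the disagreements a clustering incurs against pairwise +/- similarity labels:

```python
def disagreements(labels, clusters):
    """Correlation-clustering cost: pairs labelled similar (+1) that are
    split across clusters, plus pairs labelled dissimilar (-1) that end up
    in the same cluster."""
    cost = 0
    for (i, j), sign in labels.items():
        same = clusters[i] == clusters[j]
        if (sign == +1 and not same) or (sign == -1 and same):
            cost += 1
    return cost

# Toy instance: a and b are similar; both are dissimilar to c.
labels = {("a", "b"): +1, ("a", "c"): -1, ("b", "c"): -1}
print(disagreements(labels, {"a": 0, "b": 0, "c": 1}))  # 0 disagreements
print(disagreements(labels, {"a": 0, "b": 1, "c": 1}))  # 2 disagreements
```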
Related publications (58)

Interpret3C: Interpretable Student Clustering Through Individualized Feature Selection

Vinitra Swamy, Paola Mejia Domenzain, Julian Thomas Blackwell, Isadora Alves de Salles

Clustering in education, particularly in large-scale online environments like MOOCs, is essential for understanding and adapting to diverse student needs. However, the effectiveness of clustering depends on its interpretability, which becomes challenging w ...
2024

Fairness and Explainability in Clustering Problems

Xinrui Jia

In this thesis we present and analyze approximation algorithms for three different clustering problems. The formulations of these problems are motivated by fairness and explainability considerations, two issues that have recently received attention in the ...
EPFL, 2023

Sublinear Algorithms for Spectral Graph Clustering

Aidasadat Mousavifar

This thesis focuses on designing spectral tools for graph clustering in sublinear time. With the emergence of big data, many traditional polynomial-time, and even linear-time, algorithms have become prohibitively expensive. Processing modern datasets requir ...
EPFL, 2021
Related MOOCs (19)
Introduction to optimization on smooth manifolds: first order methods
Learn to optimize on smooth, nonlinear spaces: Join us to build your foundations (starting at "what is a manifold?") and confidently implement your first algorithm (Riemannian gradient descent).
Algèbre Linéaire (Partie 1)
A French-language MOOC on linear algebra, accessible to everyone, taught rigorously and requiring no prerequisites.
