**Êtes-vous un étudiant de l'EPFL à la recherche d'un projet de semestre?**

Travaillez avec nous sur des projets en science des données et en visualisation, et déployez votre projet sous forme d'application sur GraphSearch.

Publication# epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics

Résumé

We study the problem of constructing epsilon-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An epsilon-coreset is a weighted subset S subset of X with weight function w : S -> R->= 0, such that for any k-subset C is an element of X, it holds that Sigma(x is an element of S) w(x).d(z) (x, C) is an element of (1 +/- epsilon) . Sigma(x is an element of X) d(z) (x, C). We present an efficient algorithm that constructs an epsilon-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset only depends on the parameters k, z, epsilon and the doubling dimension ddim(M). To the best of our knowledge, this is the first efficient epsilon-coreset construction of size independent of vertical bar X vertical bar for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of M(X, d) and the shattering dimension (or VC-dimension) of the range space induced by the distance d. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 +/- epsilon)-distortion of the distance function d (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded by O(epsilon(-O(ddim(M)))). For the purpose of coreset construction, the above bound does not suffice as it only works for unweighted spaces. Therefore, we introduce the notion of tau-error probabilistic shattering dimension, and prove a (drastically better) upper bound of O(ddim(M).log(1/epsilon)+log log 1/tau) for the probabilistic shattering dimension for weighted doubling metrics. As it turns out, an upper bound for the probabilistic shattering dimension is enough for constructing a small coreset. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications. Furthermore, we study robust coresets for (k, z)-clustering with outliers in a doubling metric. We show an improved connection between alpha-approximation and robust coresets. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing. As another application, we show constant-sized (epsilon, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.

Official source

Cette page est générée automatiquement et peut contenir des informations qui ne sont pas correctes, complètes, à jour ou pertinentes par rapport à votre recherche. Il en va de même pour toutes les autres pages de ce site. Veillez à vérifier les informations auprès des sources officielles de l'EPFL.

Concepts associés

Chargement

Publications associées

Chargement

Publications associées (7)

Chargement

Chargement

Chargement

Concepts associés (22)

Espace euclidien

En mathématiques, un espace euclidien est un objet algébrique permettant de généraliser de façon naturelle la géométrie traditionnelle développée par Euclide, dans ses Éléments. Une géométrie de cett

Distance (mathématiques)

En mathématiques, une distance est une application qui formalise l'idée intuitive de distance, c'est-à-dire la longueur qui sépare deux points. C'est par l'analyse des principales propriétés de la di

K-moyennes

Le partitionnement en k-moyennes (ou k-means en anglais) est une méthode de partitionnement de données et un problème d'optimisation combinatoire. Étant donnés des points et un entier k, le problèm

Approximation algorithms are a commonly used tool for designing efficient algorithmic solutions for intractable problems, at the expense of the quality of the output solution. A prominent technique for designing such algorithms is the use of Linear Programming (LP) relaxations. An optimal solution to such a relaxation provides a bound on the objective value of the optimal integral solution, to which we compare the integral solution we return. In this context, when studying a specific problem, two natural questions often arise: What is a strong LP relaxation for this problem, and how can we exploit it? Over the course of the past few decades, a significant amount of effort has been expended by the research community in order to answer these questions for a variety of interesting intractable problems. Although there exist multiple problems for which we have designed LP relaxations that achieve best-possible guarantees, there still exist numerous problems for which we either have no strong LP relaxations, or do not know how to use them. The main focus of this thesis is extending our understanding of such strong relaxations. We focus on designing good approximation algorithms for certain allocation problems, by employing a class of strong LP relaxations, called configuration-LPs. For many such allocation problems, the best-known results are derived by using simple and natural LP relaxations, whereas configuration-LPs have been used successfully on several occasions in order to break pre-existing barriers set by weaker relaxations. However, our understanding of configuration-LPs is far from complete for many problems. Therefore, understanding and using these relaxations to the farthest extent possible is a quite intriguing question. Answering this question could result in improved approximation algorithms for a wide variety of allocation problems. The first problem we address in this thesis is the restricted max-min fair allocation problem. Prior to our work, the best known result provided an $\Omega(1)$-approximation that ran in polynomial time. Also, it was known how to estimate the value of an optimal solution to the problem within a factor of $1/(4+c)$, for any $c>0$, by solving the corresponding configuration-LP. Our first contribution in this thesis is the design of a $1/13$-approximation algorithm for the problem, using the configuration-LP. Specifically, although our algorithm is fully combinatorial, it consists of a local-search procedure that is guaranteed to succeed only when the configuration-LP is feasible. In order to establish the correctness and running time of the algorithm, it is crucial to use the configuration-LP in our analysis. The second problem we study is the scheduling of jobs on unrelated machines in order to minimize the sum of weighted completion times. For this problem, the best known approximation algorithm achieves a ratio of $3/2-r$, for some small $r>0$. Our second contribution in this thesis is the improvement of this ratio to $(1+\sqrt{2})/2+c$, for any $c>0$, for the special case of the problem where the jobs have uniform Smith ratios. To achieve this ratio, we design a randomized rounding algorithm that rounds solutions to the corresponding configuration-LP. Through a careful examination of the distribution this randomized algorithm outputs, we identify the one that maximizes the approximation ratio, and we then upper bound the ratio this worst-case distribution exhibits by $(1+\sqrt{2})/2+c$.

Carlos Eisenberg, Boi Faltings

Local search algorithms have been very successful for solving constraint satisfaction problems (CSP). However, a major weakness has been that local search is unable to detect unsolvability and is thus not suitable for highly constrained or overconstrained problems. In this paper, we present a scheme where a local search algorithm, the breakout algorithm, is used to identify hard or unsolvable subproblems. This is used in two ways. The first to generate a fail-first variable order for a systematic backtrack search that proves unsolvability or solves the problem efficiently. The combination of the two methods is a complete algorithm. On randomly generated coloring problems, the method performs extremely well, in particular, for tightly and overconstrained CSPs. The second way of using the breakout algorithm is as a filter for identifying possibly unsolvable subproblems. We present an efficient algorithm that guarantees to find the smallest unsolvable subproblem by systematic search. The presented scheme is of great practical use as ideal failure analysis tool, which also supports the repair of a problem.

Market forces are continuously driving public and private organisations towards higher productivity, shorter process and production times, and fewer labour hours. To cope with these changes, organisations are adopting new organisational models of coordination and cooperation that increase their flexibility, consistency, efficiency, productivity and profit margins. In this thesis an organisational model of coordination and cooperation is examined using a real life example; the technical integration of a distributed large-scale project of an international physics collaboration. The distributed resource constraint project scheduling problem is modelled and solved with the methods of distributed constraint satisfaction. A distributed local search method, the distributed breakout algorithm (DisBO), is used as the basis for the coordination scheme. The efficiency of the local search method is improved by extending it with an incremental problem solving scheme with variable ordering. The scheme is implemented as central algorithm, incremental breakout algorithm (IncBO), and as distributed algorithm, distributed incremental breakout algorithm (DisIncBO). In both cases, strong performance gains are observed for solving underconstrained problems. Distributed local search algorithms are incomplete and lack a termination guarantee. When problems contain hard or unsolvable subproblems and are tightly or overconstrained, local search falls into infinite cycles without explanation. A scheme is developed that identifies hard or unsolvable subproblems and orders these to size. This scheme is based on the constraint weight information generated by the breakout algorithm during search. This information, combined with the graph structure, is used to derive a fail first variable order. Empirical results show that the derived variable order is 'perfect'. When it guides simple backtracking, exceptionally hard problems do not occur, and, when problems are unsolvable, the fail depth is always the shortest. Two hybrid algorithms, BOBT and BOBT-SUSP are developed. When the problem is unsolvable, BOBT returns the minimal subproblem within the search scope and BOBT-SUSP returns the smallest unsolvable subproblem using a powerful weight sum constraint. A distributed hybrid algorithm (DisBOBT) is developed that combines DisBO with DisBT. The distributed hybrid algorithm first attempts to solve the problem with DisBO. If no solution is available after a bounded number of breakouts, DisBO is terminated, and DisBT solves the problem. DisBT is guided by a distributed variable order that is derived from the constraint weight information and the graph structure. The variable order is incrementally established, every time the partial solution needs to be extended, the next variable within the order is identified. Empirical results show strong performance gains, especially when problems are overconstrained and contain small unsolvable subproblems.