
# Testing Graph Clusterability: Algorithms and Lower Bounds

Ashish Hari Chiplunkar, Mikhail Kapralov, Aidasadat Mousavifar

IEEE Computer Society, 2018

Conference paper

Abstract

We consider the problem of testing graph cluster structure: given access to a graph G = (V, E), can we quickly determine whether the graph can be partitioned into a few clusters with good inner conductance, or is far from any such graph? This is a generalization of the well-studied problem of testing graph expansion, where one wants to distinguish between the graph having good expansion (i.e. being a good single cluster) and the graph having a sparse cut (i.e. being a union of at least two clusters). A recent work of Czumaj, Peng, and Sohler (STOC'15) gave an ingenious sublinear time algorithm for testing k-clusterability in time $\tilde{O}(n^{1/2}\,\mathrm{poly}(k))$. Their algorithm implicitly embeds a random sample of vertices of the graph into Euclidean space, and then clusters the samples based on estimates of Euclidean distances between the points. This yields a very efficient testing algorithm, but only works if the cluster structure is very strong: it is necessary to assume that the gap between conductances of accepted and rejected graphs is at least logarithmic in the size of the graph G. In this paper we show how one can leverage more refined geometric information, namely angles as opposed to distances, to obtain a sublinear time tester that works even when the gap is a sufficiently large constant. Our tester is based on the singular value decomposition of a natural matrix derived from random walk transition probabilities from a small sample of seed nodes. We complement our algorithm with a matching lower bound on the query complexity of testing clusterability. Our lower bound is based on a novel property testing problem, which we analyze using Fourier analytic tools. As a byproduct of our techniques, we also achieve new lower bounds for the problem of approximating MAX-CUT value in sublinear time.
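The angle-based idea can be pictured with a toy sketch. This is not the paper's algorithm (which runs in sublinear time via sampled random walks and an SVD of the resulting matrix); here the walk matrix is computed densely, and the graph, walk length `t`, and cosine threshold are illustrative choices only:

```python
import numpy as np

def lazy_walk_matrix(A):
    """Lazy random walk matrix W = (I + D^{-1} A) / 2; row i of W^t is the
    t-step walk distribution started at vertex i."""
    d = A.sum(axis=1)
    return 0.5 * (np.eye(len(d)) + A / d[:, None])

def cluster_seeds_by_angle(A, seeds, t=2, cos_thresh=0.8):
    """Group seed vertices whose t-step walk distributions point in similar
    directions (large cosine) -- a toy stand-in for clustering sampled seeds
    by angles rather than distances."""
    M = np.linalg.matrix_power(lazy_walk_matrix(A), t)[seeds]
    U = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalize rows
    C = U @ U.T                                       # pairwise cosines
    labels, next_label = [-1] * len(seeds), 0
    for i in range(len(seeds)):
        if labels[i] == -1:
            labels[i] = next_label
            for j in range(i + 1, len(seeds)):
                if labels[j] == -1 and C[i, j] > cos_thresh:
                    labels[j] = next_label
            next_label += 1
    return labels
```

On two triangles joined by a single edge (vertices 0-2 and 3-5, bridge 2-3), seeds 0, 1 end up in one group and 3, 4 in the other: walk distributions mix within a cluster long before any substantial probability crosses the sparse cut, so within-cluster angles are small and cross-cluster angles are large.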




Related concepts (9)

Algorithm

An algorithm is a finite, unambiguous sequence of instructions and operations for solving a class of problems.

Sorting algorithm

In computer science and mathematics, a sorting algorithm is an algorithm that arranges the objects of a collection according to a given order relation. The objects to be sorted are elements of a set equipped with a total order.

Problem solving

Problem solving is the process of identifying and then implementing a solution to a problem.

Related publications (19)


As it has become easier and cheaper to collect big datasets over the last few decades, designing efficient and low-cost algorithms for these datasets has attracted unprecedented attention. However, in most applications, even storing datasets as acquired has become extremely costly and inefficient, which motivates the study of sublinear algorithms. This thesis focuses on two fundamental graph problems in the sublinear regime, and additionally presents a fast kernel density estimation algorithm and data structure. The first part of this thesis focuses on graph spectral sparsification in dynamic streams. Our algorithm achieves almost optimal runtime and space simultaneously in a single pass. Our method is based on a novel bucketing scheme that enables us to recover high effective resistance edges faster. This contribution presents a novel approach to the effective resistance embedding of the graph, using locality-sensitive hash functions, with possible further applications. The second part of this thesis presents spanner construction results in the dynamic streaming and simultaneous communication models. First, we show how to construct an $\tilde{O}(n^{2/3})$-spanner using the above-mentioned almost-optimal single-pass spectral sparsifier, resulting in the first single-pass algorithm in the literature for non-trivial spanner construction. We then generalize this result to constructing $\tilde{O}(n^{2/3(1-\alpha)})$-spanners using $\tilde{O}(n^{1+\alpha})$ space for any $\alpha \in [0,1]$, providing a smooth trade-off between distortion and memory complexity. Moreover, we study the simultaneous communication model and propose a novel protocol with low per-player information, and we show how additional rounds of communication in this setting can achieve better distortion guarantees. Finally, in the third part of this thesis, we study the kernel density estimation problem. In this problem, a kernel function and an input dataset induce a kernel density on the space, and the goal is to design fast and memory-efficient data structures that output approximations to the kernel density at queried points. This thesis presents a data structure, based on classical near neighbor search and locality-sensitive hashing techniques, that improves or matches the query time and space complexity of prior work for the radial kernels considered in the literature. The approach implements (approximate) importance sampling for each distance range, using near neighbor search algorithms to recover points from these distance ranges. We then show how to improve the runtime for a class of radial kernels, including the Gaussian kernel, using recent advances in data-dependent near neighbor search data structures.
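For reference, the simplest sublinear estimator of kernel density is plain uniform Monte Carlo sampling; importance-sampling data structures of the kind described above improve on exactly this baseline when the true density is small. A minimal sketch, where the Gaussian kernel, dataset, and sample size are illustrative choices:

```python
import math
import random

def gaussian_kernel(p, q, sigma=1.0):
    """k(p, q) = exp(-||p - q||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))

def kde_exact(data, q):
    """Exact kernel density at q: the average kernel value over the dataset."""
    return sum(gaussian_kernel(p, q) for p in data) / len(data)

def kde_monte_carlo(data, q, m, rng):
    """Unbiased estimate from m uniform samples. Achieving relative error eps
    at density mu needs m on the order of 1/(eps^2 mu) samples -- the cost
    that importance sampling reduces for small mu."""
    total = sum(gaussian_kernel(data[rng.randrange(len(data))], q) for _ in range(m))
    return total / m
```

For a query in a dense region the two estimates agree closely even for modest `m`; the interesting regime for the data structures above is queries where the density is tiny and uniform sampling rarely hits a contributing point.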

Many of the currently best-known approximation algorithms for NP-hard optimization problems are based on Linear Programming (LP) and Semidefinite Programming (SDP) relaxations. Given its power, this class of algorithms seems to contain the most promising candidates for improving on the current state-of-the-art approximation guarantees for NP-hard problems for which a gap remains between the known inapproximability results and the approximation guarantees achievable in polynomial time. In this thesis, we address both the power and the limitations of these relaxations, as well as the connection between the shortcomings of these relaxations and the inapproximability of the underlying problem. In the first part, we study the limitations of LP relaxations of well-known graph problems such as the Vertex Cover problem and the Independent Set problem. We prove that any small LP relaxation for these problems cannot have an integrality gap strictly better than $2$ and $\omega(1)$, respectively. Furthermore, our lower bound for the Independent Set problem also holds for any SDP relaxation. Prior to our work, it was only known that such LP relaxations cannot have an integrality gap better than $1.5$ for the Vertex Cover problem, and better than $2$ for the Independent Set problem. In the second part, we study the so-called knapsack cover inequalities, which are used in the current best relaxations for numerous combinatorial optimization problems of covering type. In spite of their widespread use, these inequalities yield LP relaxations of exponential size, over which it is not known how to optimize exactly in polynomial time. We address this issue and obtain LP relaxations of quasi-polynomial size that are at least as strong as those given by the knapsack cover inequalities. In the last part, we show a close connection between structural hardness for $k$-partite graphs and tight inapproximability results for scheduling problems with precedence constraints. This connection is inspired by a family of integrality gap instances of a certain LP relaxation. Assuming the hardness of an optimization problem on $k$-partite graphs, we obtain a hardness of $2-\varepsilon$ for the problem of minimizing the makespan for scheduling with preemption on identical parallel machines, and super-constant inapproximability for the problem of scheduling on related parallel machines. Prior to this result, it was only known that the first problem does not admit a PTAS, and that the second problem is NP-hard to approximate within a factor strictly better than $2$, assuming the Unique Games Conjecture.
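The factor-$2$ barrier for Vertex Cover already shows up in the standard LP relaxation $\min \sum_v x_v$ subject to $x_u + x_v \ge 1$ for every edge and $x \in [0,1]^V$: on the complete graph $K_n$ the all-$1/2$ vector is feasible with value $n/2$, while any integral cover needs $n-1$ vertices, so the gap tends to $2$. A small brute-force check of this classical example (illustrative, not from the thesis):

```python
from itertools import combinations

def min_vertex_cover_size(n, edges):
    """Brute-force the smallest integral vertex cover."""
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            s = set(subset)
            if all(u in s or v in s for u, v in edges):
                return k

def lp_feasible(x, edges):
    """Feasibility for the vertex cover LP: x_u + x_v >= 1 on every edge."""
    return all(x[u] + x[v] >= 1 for u, v in edges)

n = 6
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]  # complete graph K_6
x_half = [0.5] * n                 # fractional solution: 1/2 on every vertex
assert lp_feasible(x_half, edges)
opt_int = min_vertex_cover_size(n, edges)  # n - 1 = 5 on K_n
lp_val = sum(x_half)                       # n / 2 = 3.0
gap = opt_int / lp_val                     # 5/3 here; approaches 2 as n grows
```

The thesis result quoted above says this gap of $2$ cannot be beaten by any small LP relaxation, not just this particular one.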

This thesis focuses on designing spectral tools for graph clustering in sublinear time. With the emergence of big data, many traditional polynomial-time, and even linear-time, algorithms have become prohibitively expensive. Processing modern datasets requires a new set of algorithms for computing with extremely constrained resources, i.e., \emph{sublinear algorithms}. Clustering is one of the well-known techniques for solving large-scale optimization problems in a wide variety of domains, including machine learning, data science and graph analysis~\cite{aydin2016distributed, rolnick2016geocuts, gargi2011large}. Efficient sublinear solutions for fundamental graph clustering problems require going well beyond classic techniques. In this thesis, we present an \emph{optimal} sublinear-time algorithm for the \textit{$k$-clusterability testing} problem, i.e., quickly determining whether the graph can be partitioned into at most $k$ expanders, or is far from any such graph. This is a generalization of the well-studied problem of testing graph expansion. The classic results on testing $k$-clusterability either consider the expansion testing problem (i.e., $k=1$ vs $k\geq 2$) \cite{KaleS_SIAMJC11,NachmiasS10}, or address the problem for larger values of $k$ under the assumption that the gap between conductances of accepted and rejected graphs is at least logarithmic in the size of the graph \cite{CzumajPS_STOC15}. We overcome these barriers by developing novel spectral techniques based on analyzing the spectrum of the Gram matrix of random walk transition probabilities. We complement our algorithm with a matching lower bound on the query complexity of testing $k$-clusterability, which improves upon the long-standing previous lower bound for testing graph expansion. Furthermore, we extend our previous result from the \textit{property testing} framework to an efficient clustering algorithm in the \textit{local computation algorithm} (LCA) model. We focus on a popular variant of graph clustering where the input graph can be partitioned into $k$ expanders with outer conductance bounded by $\epsilon$. We construct a small-space data structure that allows quickly classifying the vertices of $G$ according to the cluster they belong to in sublinear time. Our spectral clustering oracle provides $O(\epsilon \log k)$ error per cluster for any $\epsilon \ll 1/\log k$. Our main contribution is a sublinear-time oracle that provides dot product access to the spectral embedding of the graph. We estimate dot products with high precision using an appropriate linear transformation of the Gram matrix of random walk transition probabilities. Finally, using dot product access to the spectral embedding, we design a spectral clustering oracle. At a high level, our approach amounts to hyperplane partitioning in the spectral embedding of the graph, but crucially operates on a nested sequence of carefully defined subspaces of the embedding to achieve per-cluster recovery guarantees.
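The kind of primitive such an oracle exposes can be pictured in a dense, non-sublinear toy form: embed each vertex by the top-$k$ eigenvectors of the normalized adjacency matrix and answer same-cluster queries via the angle between embedding vectors. This is illustrative only (the actual oracle estimates these dot products from random walks in sublinear time); the graph, $k$, and threshold below are toy choices:

```python
import numpy as np

def spectral_embedding(A, k):
    """Rows of the top-k eigenvector matrix of the normalized adjacency
    D^{-1/2} A D^{-1/2} -- a dense stand-in for the spectral embedding
    the sublinear-time oracle gives dot product access to."""
    d = A.sum(axis=1)
    N = A / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(N)
    return vecs[:, np.argsort(vals)[-k:]]  # columns = top-k eigenvectors

def same_cluster(A, u, v, k=2, cos_thresh=0.9):
    """Compare two vertices via the cosine of their embedding vectors,
    mimicking same-cluster queries answered from dot product access."""
    F = spectral_embedding(A, k)
    fu, fv = F[u], F[v]
    cos = fu @ fv / (np.linalg.norm(fu) * np.linalg.norm(fv) + 1e-12)
    return cos > cos_thresh
```

On two triangles joined by a single edge (vertices 0-2 and 3-5, bridge 2-3), embedding vectors of vertices inside one triangle are nearly parallel while vectors across the sparse cut are nearly orthogonal, so thresholding the cosine recovers the two clusters.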