**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# Sketches, metrics and fast algorithms

Abstract

As it has become easier and cheaper to collect big datasets in the last few decades, designing efficient and low-cost algorithms for these datasets has attracted unprecedented attention. However, in most applications, even storing datasets as acquired has become extremely costly and inefficient, which motivates the study of sublinear algorithms. This thesis focuses on studying two fundamental graph problems in the sublinear regime. Furthermore, it presents a fast kernel density estimation algorithm and data structure. The first part of this thesis focuses on graph spectral sparsification in dynamic streams. Our algorithm achieves almost optimal runtime and space simultaneously in a single pass. Our method is based on a novel bucketing scheme that enables us to recover high effective resistance edges faster. This contribution presents a novel approach to the effective resistance embedding of the graph, using locality-sensitive hash functions, with possible further future applications.The second part of this thesis presents spanner construction results in the dynamic streams and the simultaneous communication models. First, we show how one can construct a $\tilde{O}(n^{2/3})$-spanner using the above-mentioned almost-optimal single-pass spectral sparsifier, resulting in the first single-pass algorithm for non-trivial spanner construction in the literature. Then, we generalize this result to constructing $\tilde{O}(n^{2/3(1-\alpha)})$-spanners using $\tilde{O}(n^{1+\alpha})$ space for any $\alpha \in [0,1]$, providing a smooth trade-off between distortion and memory complexity. Moreover, we study the simultaneous communication model and propose a novel protocol with low per player information. Also, we show how one can leverage more rounds of communication in this setting to achieve better distortion guarantees. Finally, in the third part of this thesis, we study the kernel density estimation problem. In this problem, given a kernel function, an input dataset imposes a kernel density on the space. The goal is to design fast and memory-efficient data structures that can output approximations to the kernel density at queried points. This thesis presents a data structure based on the classical near neighbor search and locality-sensitive hashing techniques that improves or matches the query time and space complexity for radial kernels considered in the literature. The approach is based on an implementation of (approximate) importance sampling for each distance range and then using near neighbor search algorithms to recover points from these distance ranges. Later, we show how to improve the runtime, using recent advances in the data-dependent near neighbor search data structures, for a class of radial kernels that includes the Gaussian kernel.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts

Loading

Related publications

Loading

Related publications (56)

Loading

Loading

Loading

Related concepts (23)

Data structure

In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of

Kernel density estimation

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function o

Nearest neighbor search

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically

Mikhail Kapralov, Jakab Tardos

Cut and spectral sparsification of graphs have numerous applications, including e.g. speeding up algorithms for cuts and Laplacian solvers. These powerful notions have recently been extended to hypergraphs, which are much richer and may offer new applications. However, the current bounds on the size of hypergraph sparsifiers are not as tight as the corresponding bounds for graphs. Our first result is a polynomial-time algorithm that, given a hypergraph on n vertices with maximum hyperedge size r, outputs an epsilon-spectral sparsifier with O* (nr) hyperedges, where O* suppresses (epsilon(-1) log n)(O(1) )factors. This size bound improves the two previous bounds: O*(n(3)) [Soma and Yoshida, SODA'19] and O* (nr(3)) [Bansal, Svensson and Trevisan, FOCS'19]. Our main technical tool is a new method for proving concentration of the nonlinear analogue of the quadratic form of the Laplacians for hypergraph expanders. We complement this with lower bounds on the bit complexity of any compression scheme that (1 + epsilon)-approximates all the cuts in a given hypergraph, and hence also on the bit complexity of every epsilon-cut/spectral sparsifier. These lower bounds are based on Ruzsa-Szemeredi graphs, and a particular instantiation yields an Omega(nr) lower bound on the bit complexity even for fixed constant epsilon. In the case of hypergraph cut sparsifiers, this is tight up to polylogarithmic factors in n, due to recent result of [Chen, Khanna and Nagda, FOCS'20]. For spectral sparsifiers it narrows the gap to O*(r). Finally, for directed hypergraphs, we present an algorithm that computes an c-spectral sparsifier with O*(n(2)r(3)) hyperarcs, where r is the maximum size of a hyperarc. For small r, this improves over O*(n(3)) known from [Soma and Yoshida, SODA'19], and is getting close to the trivial lower bound of Omega(n(2)) hyperarcs.

Dynamic Programming OPtimization (DPOP) is an algorithm proposed for solving distributed constraint optimization problems. In this algorithm, Hypercubes are used as the format of the messages exchanged between the agents. Then, Hybrid-DPOP (H-DPOP), a hybrid algorithm based on DPOP was proposed as an improved solution. Its main advantage over DPOP lies under the use of Multi-valued Decision Diagrams (MDDs) instead of Hypercubes. Furthermore, MDDs give a more compact representation of the messages. The MDDs were implemented and presented in by Kumar even though it was presented as Constrained Decision Diagrams (CDD) which is a generalization of MDDs. The first part of this project aims at implementing Hypercubes. An Implementation already exists. Therefore, the goal is to propose an implementation which is more efficient in term of speed and memory usage for the already implemented functions (join, project and slice). The solution implemented during this project should also provide new functions (split and re-order) and the possibility of saving Hypercubes in the XML format. The second part of the project consists in proposing and implementing a data structure that is more efficient than the MDDs implemented in H-DPOP. This data structure should also provide more functions, and also provide the possibility of saving data in the XML format.

2008Graph theory is an important topic in discrete mathematics. It is particularly interesting because it has a wide range of applications. Among the main problems in graph theory, we shall mention the following ones: graph coloring and the Hamiltonian circuit problem. Chapter 1 presents basic definitions of graph theory, such as graph coloring, graph coloring with color-classes of bounded size b, and Hamiltonian circuits and paths. We also present online algorithms and online coloring. Chapter 2 starts with some general remarks about online graph covering with sets of bounded sizes (such as online bounded coloring): we give a simple method for transforming an online covering algorithm into an online bounded covering algorithm, and to derive the performance ratio of the bounded algorithm from the performance ratio of the unbounded algorithm. As will be shown in later chapters, this method often leads to optimal results. Furthermore, some basic preliminary results on online graph covering with sets of bounded size are given: for every graph, the performance ratio is bounded above by 1/2 + b/2 and for b = 2, this bound is optimal. In the second part, online coloring of co-interval graphs is studied. Based on two industrial applications, two different versions of this problem are discussed. In the case where the intervals are presented in increasing order of their left ends, we show that the performance ratio is 1 in the unbounded case and 2 - 1/b in the bounded case. In the case where the intervals may be presented in any order, we show that the performance ratio is at most 3 in the bounded case. Chapter 3 deals with online coloring of permutation and comparability graphs. First, we give a tight analysis of the First-Fit algorithm on bipartite permutation graphs and we show that its performance ratio is O(√n), even for some simple presentation orders. For both classes of graphs, we show that the performance ratio is bounded above by (χ+1)/2 in the unbounded case and that the performance ratio of First-Fit is equal to 1/2 + b/2 in the bounded case. In the second part of this chapter, we study cocoloring of permutation graphs. We show that the performance ratio is n/4 + 1/2 and we give better bounds in some more restricted cocoloring problems. Chapter 4 deals with an application of online coloring: the online Track Assignment Problem. Depending on the assumptions that are made, the Track Assignment Problem can be reduced to coloring permutation or overlap graphs online. We show that when a permutation graph is presented on a latticial plane, from west to east, then the performance ratio is exactly 2 - (min{b,k})-1, where k is the best known upper bound on the bounded chromatic number. We also show that, when a permutation graph is presented on a latticial plan, starting from the origin and growing, simultaneously or not, towards west and east, then the performance ratio is exactly 2 - 1/χ. We also show that online coloring overlap graphs does not have a performance ratio bounded by a constant, even if the overlap graph is bipartite and presented in increasing order of the intervals left ends. In this special case, we show that First-Fit has a tight performance ratio of O(√n). We consider coloring overlap graphs online where the intervals have a bounded size between 1 and a given number M. In this case, we show that the performance ratio can be bounded above by 2√M if M ≤ M0, and by log M (⎡log M / log log M⎤ + 1) if M > M0, M0 being defined by the equation 2√M0 = 3 log(M0). For large values of M, the ratio is O(log2 M / log log M). Chapter 5 is about online coloring of trees, forests and split-graphs. For trees, we show that the performance ratio of online coloring is exactly ½log2(2n) in the unbouded case and at most 1 + ⎣log2(b)⎦/χb in the bounded case. For split-graphs, we show that the performance ratio of online coloring is exactly 1 + 1/χ in the unbounded case and is at most 2 + 1/χb + 3/b in the bounded case. In Chapter 6, we present a class of digraphs: the quasi-adjoint graphs. These are a super class of both the graphs used for a DNA sequencing algorithm in (Blazewicz, Kasprzak, "Computational complexity of isothermic DNA sequencing by hybridization", 2006) and the adjoints. A polynomial recognition algorithm in O(n3), as well as a polynomial algorithm in O(n2 + m2) for finding a Hamiltonian circuit in quasi-adjoint graphs are given. Furthermore, some results about related problems such as finding a Eulerian circuit while respecting some forbidden transitions (a sequence of two consecutive arcs) are discussed.