This thesis focuses on developing efficient algorithmic tools for processing large datasets. In many modern data analysis tasks, the sheer volume of the available data far outstrips our ability to process it. This scenario commonly arises in tasks such as tuning the parameters of machine learning models (e.g., in Google Vizier) and training neural networks. These tasks often require solving numerical linear algebra problems on large matrices, making the classical primitives prohibitively expensive; hence, there is a crucial need to efficiently compress the available datasets. In other settings, even collecting the input dataset is extremely expensive, making it vital to design optimal data sampling strategies; this is common in applications such as MRI acquisition and spectrum sensing. These two fundamental questions are often dual to each other, and hence can be addressed using the same set of core techniques. Indeed, exploiting structured Fourier sparsity is a recurring source of efficiency in this thesis, leading to both fast numerical linear algebra methods and sample-efficient data acquisition schemes.
One of the main results that we present in this thesis is the first sublinear-time model-based Sparse FFT algorithm that achieves nearly optimal sample complexity for the recovery of every signal whose Fourier transform is well approximated by a small number of blocks (a structure that is common, e.g., in spectrum sensing). Our method matches, in sublinear time, the result of Baraniuk et al. (2010), which started the field of model-based compressed sensing. Another highlight of this thesis is the first dimension-independent Sparse FFT algorithm, which computes the Fourier transform of a sparse signal in sublinear time in any dimension. This is the first algorithm that, like the FFT of Cooley and Tukey, is dimension-independent, and it avoids the curse of dimensionality inherent in all previously known techniques. Finally, we give a universal sampling scheme for the reconstruction of structured Fourier signals from continuous measurements. Our approach matches the classical results of Slepian, Pollak, and Landau (1960s) on the reconstruction of bandlimited signals via prolate spheroidal wave functions, and extends them to a wide class of Fourier structure types.
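To make the block-sparse Fourier model concrete, the following toy sketch (our own illustration, not the algorithm from the thesis; the length n, block count k0, and block width w are arbitrary choices) constructs a signal whose spectrum is supported on a few contiguous blocks. A dense FFT recovers it from all n samples; the model-based Sparse FFT described above targets the same recovery from a nearly optimal number of samples, in sublinear time.

```python
import numpy as np

# Toy illustration (not the thesis algorithm): a length-n signal whose
# Fourier transform is supported on k0 contiguous blocks of width w.
rng = np.random.default_rng(0)
n, k0, w = 1024, 3, 16

spectrum = np.zeros(n, dtype=complex)
starts = rng.choice(n - w, size=k0, replace=False)
for s in starts:
    # Each block gets random complex coefficients.
    spectrum[s:s + w] = rng.standard_normal(w) + 1j * rng.standard_normal(w)

# Time-domain signal with a block-sparse spectrum.
x = np.fft.ifft(spectrum)

# A dense FFT uses all n samples and O(n log n) time; the model-based
# Sparse FFT aims to recover the same spectrum from a number of samples
# scaling with k0 * w rather than n.
recovered = np.fft.fft(x)
assert np.allclose(recovered, spectrum)
print("active blocks start at:", sorted(starts))
```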
Besides having classical applications in signal processing and data analysis, Fourier techniques are at the core of many machine learning tasks such as kernel matrix approximation. The second half of this thesis is dedicated to finding compressed and low-rank representations of kernel matrices, which are the primary means of computation with large kernel matrices in machine learning. Building on Fourier techniques, we achieve spectral approximation guarantees for the Gaussian kernel using an optimal number of samples, significantly improving upon the classical Random Fourier Features of Rahimi and Recht (2008). Finally, we present a nearly optimal Oblivious Subspace Embedding for high-degree polynomial kernels, which leads to nearly optimal embeddings of the high-dimensional Gaussian kernel. This is the first result that does not suffer an exponential loss in the degree of the polynomial kernel or in the dimension of the input point set, providing exponential improvements over prior works, including the TensorSketch of Pagh (2013) and applications of the celebrated Fast Multipole Method of Greengard and Rokhlin (1986) to the kernel approximation problem.
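As a point of reference for these kernel approximation results, here is a minimal sketch of the classical Random Fourier Features baseline that the thesis improves upon. The function name, the feature dimension D, and the bandwidth sigma are our own illustrative choices, not notation from the thesis.

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    """Classical Random Fourier Features for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    Returns features Z such that Z @ Z.T approximates the kernel matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies sampled from the Fourier transform of the Gaussian kernel.
    W = rng.standard_normal((d, D)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Usage: compare the feature-based approximation against the exact kernel.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
Z = random_fourier_features(X, D=2000)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)
print("max entrywise error:", np.abs(Z @ Z.T - K).max())
```

This baseline only gives entrywise (pointwise) approximation guarantees; the thesis strengthens this to spectral approximation with an optimal number of samples.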