Analyzing and processing data that are siloed and dispersed among multiple distrustful stakeholders is difficult and can even become impossible when the data are sensitive or confidential. Current data-protection and privacy regulations severely restrict the sharing and outsourcing of personal information among stakeholders that are in different jurisdictions. Sharing data is, however, required in many domains. The medical sector is a paradigmatic example: privacy is paramount, and data sharing is needed in numerous applications where data are scarce and scattered among multiple stakeholders around the world. Existing privacy-preserving solutions for federated analytics (FA) rely either (1) on data centralization or outsourcing to a limited number of entities, which incurs multiple security and trust issues, or (2) on the exchange of cleartext aggregated and optionally obfuscated data, which can leak personal information or introduce bias in the final result. In this thesis, our goal is (1) to propose privacy-preserving federated solutions for exploration, and for statistical and machine-learning analyses on data held by multiple distrustful stakeholders, and (2) to analyze and evaluate the proposed systems, thus showing that they provide an efficient, secure, scalable, and accurate alternative to existing solutions for FA by proving their utility in real-world state-of-the-art biomedical studies. We rely on multiparty homomorphic encryption (MHE). MHE combines secure multiparty computation (SMC) techniques with homomorphic encryption (HE) by pooling the advantages of both SMC and HE, i.e., interactivity and flexibility, and by minimizing their disadvantages, i.e., difficulty in scaling to a large number of parties and computational complexity.

First, we design UnLynx, a system that enables privacy-preserving federated data exploration on a distributed dataset held by multiple data providers (DPs), where N-1 out of N of the nodes performing the computations can be malicious. We build interactive protocols by relying on ElGamal additive homomorphic encryption (AHE) and ensure that each untrusted-node operation can be publicly verified by means of zero-knowledge proofs (ZKPs). We then explore how statistics, e.g., standard deviation and variance, can be computed by relying on AHE and ZKPs through the design of another system named Drynx. In Drynx, we also explore how to limit the influence of an entity that inputs wrong data in the system, and we propose an efficient federated solution for correctness verification.

We propose Spindle, a solution for secure cooperative gradient descent on federated data that we instantiate for the privacy-preserving training and oblivious evaluation of generalized linear models. Spindle covers the entire machine-learning workflow, as it enables oblivious predictions to be performed on a trained model that remains secret. It ensures both data and model confidentiality in a passive adversarial model in which N-1 out of N DPs can collude.

Finally, we demonstrate that the solutions proposed in this thesis can be efficient enablers for large-scale, sensitive, multi-site biomedical studies. We design and test, by replicating recent medical studies, secure workflows for the federated execution of computations that span from analyses with low computational complexity, such as survival analyses, to analyses with high computational complexity, such as genome-wide association studies on millions of variants.
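
The additively homomorphic aggregation described above can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis. It assumes exponential ElGamal over a toy prime-order subgroup (the parameters `P`, `Q`, and `G` are far too small to be secure), a collective public key formed as the product of the computing nodes' public keys, and an encoding of the variance as encrypted sums of values and of squared values. All function names are hypothetical, and the zero-knowledge proofs are omitted.

```python
import secrets

# Toy parameters, far too small to be secure: P = 2*Q + 1 with P and Q prime.
P = 2039
Q = 1019
G = 4  # 2^2 mod P, generates the subgroup of prime order Q


def keygen():
    """Return one node's (secret key, public key) pair."""
    sk = secrets.randbelow(Q - 1) + 1  # uniform in [1, Q-1]
    return sk, pow(G, sk, P)


def collective_key(public_keys):
    """Combine the nodes' public keys; decryption then needs every secret key."""
    pk = 1
    for key in public_keys:
        pk = pk * key % P
    return pk


def encrypt(pk, m):
    """Exponential ElGamal: the message sits in the exponent, so it must stay small."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), pow(G, m, P) * pow(pk, r, P) % P


def add(c, d):
    """The component-wise product of two ciphertexts encrypts the sum of their plaintexts."""
    return c[0] * d[0] % P, c[1] * d[1] % P


def decrypt(secret_keys, c):
    """Each node strips its key share, then a brute-force search recovers m < Q."""
    c1, c2 = c
    for sk in secret_keys:
        c2 = c2 * pow(c1, Q - sk, P) % P  # c1^(Q - sk) equals c1^(-sk) in the subgroup
    acc = 1
    for m in range(Q):
        if acc == c2:
            return m
        acc = acc * G % P
    raise ValueError("plaintext outside the decodable range")


def federated_mean_variance(datasets, n_nodes=3):
    """Each data provider encrypts (count, sum, sum of squares); only totals are decrypted."""
    keys = [keygen() for _ in range(n_nodes)]
    pk = collective_key([pub for _, pub in keys])
    totals = None
    for data in datasets:
        local = [
            encrypt(pk, len(data)),
            encrypt(pk, sum(data)),
            encrypt(pk, sum(x * x for x in data)),
        ]
        totals = local if totals is None else [add(t, l) for t, l in zip(totals, local)]
    n, s, ss = (decrypt([sk for sk, _ in keys], c) for c in totals)
    mean = s / n
    return mean, ss / n - mean ** 2  # population variance


mean, variance = federated_mean_variance([[3, 5, 4], [6, 2], [7]])
print(mean, variance)
```

With the three example datasets, the aggregated count, sum, and sum of squares are 6, 27, and 139, which gives a mean of 4.5 and a population variance of 17.5/6 (about 2.92). No individual value and no per-provider subtotal is ever decrypted. Because the plaintext is recovered by a discrete-logarithm search, this encoding only suits small aggregate values.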