Rapid progress in human-genome sequencing is making genomic data widely available. Such data is notoriously sensitive and stable over time, and it is highly correlated among relatives. A growing number of genomes are becoming accessible online (e.g., through leaks, or after being posted on genome-sharing websites). What, then, are the implications for kin genomic privacy? We formalize the problem and detail an efficient reconstruction attack based on graphical models and belief propagation. With this approach, an attacker can infer the genomes of the relatives of an individual whose genome is observed, relying notably on Mendel's Laws and on statistical relationships between the nucleotides of the DNA sequence. Then, to quantify the level of genomic privacy that remains after the proposed inference attack, we discuss possible definitions of genomic privacy metrics. Genomic data reveals Mendelian diseases and the likelihood of developing degenerative diseases such as Alzheimer's. We therefore also introduce the quantification of health privacy, specifically a measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the extent of the threat by combining data gathered from a genome-sharing website and from an online social network.
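To make the idea of kin inference concrete, here is a minimal Python sketch of how one observed genotype constrains a relative's genotype through Mendel's law of segregation. It is not the attack described in the abstract: it handles a single biallelic position in a father-mother-child trio, computes the posterior by brute-force enumeration (the quantity belief propagation would compute on this tiny graph), and ignores correlations between nucleotides. The genotype coding (0, 1, 2 copies of the minor allele), the Hardy-Weinberg prior, the entropy-based uncertainty measure, and all function names are assumptions made for illustration.

```python
from itertools import product
from math import log2


def hardy_weinberg_prior(maf):
    """Assumed prior over genotypes 0/1/2 (copies of the minor allele),
    given the minor-allele frequency maf."""
    q = 1.0 - maf
    return [q * q, 2 * maf * q, maf * maf]


def mendel(child, father, mother):
    """P(child genotype | parents' genotypes) under Mendel's law of segregation.
    Each parent passes on the minor allele with probability genotype / 2."""
    pf, pm = father / 2.0, mother / 2.0
    dist = [(1 - pf) * (1 - pm), pf * (1 - pm) + (1 - pf) * pm, pf * pm]
    return dist[child]


def father_posterior(child_obs, maf):
    """Posterior over the father's genotype once the child's genotype is observed.
    The unobserved mother is marginalized out using the same prior."""
    prior = hardy_weinberg_prior(maf)
    unnorm = [0.0, 0.0, 0.0]
    for f, m in product(range(3), repeat=2):
        unnorm[f] += prior[f] * prior[m] * mendel(child_obs, f, m)
    z = sum(unnorm)
    return [u / z for u in unnorm]


def entropy_bits(dist):
    """Shannon entropy in bits: one hypothetical way to express how uncertain
    the attacker remains about a genotype."""
    return -sum(p * log2(p) for p in dist if p > 0)


if __name__ == "__main__":
    maf = 0.5
    prior = hardy_weinberg_prior(maf)
    post = father_posterior(child_obs=2, maf=maf)
    print("prior:", prior, "entropy:", entropy_bits(prior))
    print("posterior:", post, "entropy:", entropy_bits(post))
```

With a minor-allele frequency of 0.5 and a child observed as homozygous for the minor allele, the father's posterior becomes [0, 0.5, 0.5]: he must carry at least one copy, and the attacker's uncertainty under this entropy measure drops from 1.5 bits to 1 bit.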
Melanie Blokesch, Sandrine Stutzmann, Alexandre Lemopoulos, Natalia Carolina Drebes Dorr
Christof Holliger, Julien Maillard, Aline Sondra Adler, Marco Pagni, Simon Marius Jean Poirier