The work presented in this thesis explores the genetic architecture of complex human traits at the dawn of genomic medicine.
The mechanisms underlying the highly polygenic nature of most complex human traits are still unknown. The first chapter explores a possible explanatory model in which variant effects arise through an indirect mechanism, namely competition among genes for shared intracellular resources such as ribosomes. Our findings show that, under most reasonable assumptions, resource competition should not be expected to have much impact on either the protein expression levels of individual genes or complex trait outcomes.
The prediction accuracy of polygenic scores (PGS) remains relatively modest compared to what is expected given the estimated heritability of traits. Traditionally, PGS are constructed from a large number of genetic variants, most of which have weak additive effects. Recent machine learning methods could improve PGS by also aggregating epistatic effects. To evaluate these different methods, we conducted an experiment based on an innovative crowdsourcing concept, detailed in the second chapter. We collaborated with opensnp.org, an open repository where people share their genotyping data and phenotypic information, and with crowdai.org, a platform that allowed us to create a public competition for the genomic prediction of height. The challenge lasted three months and attracted 138 participants. This was the first crowdsourcing challenge based on publicly available genome-wide genotyping data.
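As a rough illustration of the traditional additive construction mentioned above, a PGS can be sketched as a weighted sum of allele dosages. The function name, the effect sizes, and the toy genotypes below are hypothetical and are not taken from the thesis or from the challenge data.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], effect_sizes: Sequence[float]) -> float:
    """Additive PGS: sum over variants of (allele dosage * per-variant effect size).

    Dosages are assumed to count copies of the effect allele (0, 1 or 2).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is needed per variant")
    return sum(g * b for g, b in zip(dosages, effect_sizes))


# Hypothetical toy data: 3 individuals genotyped at 4 variants.
effect_sizes = [0.5, -0.25, 0.125, 1.0]
genotypes = [
    [0, 1, 2, 0],
    [2, 0, 1, 1],
    [1, 1, 1, 1],
]

scores = [polygenic_score(g, effect_sizes) for g in genotypes]
print(scores)  # [0.0, 2.125, 1.375]
```

Because each variant contributes independently of the others, a score of this form cannot capture epistatic (interaction) effects, which is the limitation that the machine learning methods evaluated in the challenge try to address.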
Due to the enormous number of potential combinations of variants, it is difficult to integrate epistatic effects into PGS. In the third chapter, we present a method in which we restrict the possible combinations to variants lying within the boundaries of the same topologically associated domain (TAD), treating each TAD independently. Using the UK Biobank and the height phenotype, we included 17,560 variants in an artificial neural network (ANN) and compared the variance explained ($R^2$) by the PGS with and without knowledge of the TADs. We found that TAD information brings a significant improvement, with the average $R^2$ rising from 0.287 to 0.293 (with a p-value for n=20). We concluded that it should be possible to build better PGS using ANNs and epistasis in TADs.
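To give a sense of how much restricting combinations to TADs shrinks the search space, the sketch below enumerates only the pairs of variants that share a TAD. It is not the ANN described in the thesis: it covers pairwise combinations only, and the function name, TAD labels, and variant assignments are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Hashable, List, Optional, Sequence, Tuple


def pairs_within_tads(variant_tads: Sequence[Optional[Hashable]]) -> List[Tuple[int, int]]:
    """Return index pairs of variants assigned to the same TAD.

    variant_tads[i] is the TAD label of variant i, or None if the variant
    lies outside every TAD (such variants form no pairs here).
    """
    by_tad = defaultdict(list)
    for idx, tad in enumerate(variant_tads):
        if tad is not None:
            by_tad[tad].append(idx)

    pairs: List[Tuple[int, int]] = []
    for members in by_tad.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical assignment of 6 variants to 2 TADs (the last one is in no TAD).
variant_tads = ["tad_a", "tad_a", "tad_a", "tad_b", "tad_b", None]

within = pairs_within_tads(variant_tads)
n = len(variant_tads)
all_pairs = n * (n - 1) // 2

print(sorted(within))           # [(0, 1), (0, 2), (1, 2), (3, 4)]
print(len(within), all_pairs)   # 4 15
```

For scale, 17,560 variants give 17,560 × 17,559 / 2 = 154,168,020 unordered pairs before any restriction, and higher-order combinations grow faster still, which is why some form of constraint is needed.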
The effect of genetic ancestry on phenotypes is not taken into account in PGS-based risk estimates. Accounting for it could accelerate the adoption of genomic medicine for underrepresented populations and mixed-race individuals. The fourth chapter presents a method for integrating ancestry through a secondary score derived from genome-wide genotyping data, the PC score (PCS). We compared two models, one using only the PGS and the other using both the PGS and the PCS. Using the UK Biobank, we found an improvement in genetic prediction for all phenotypes tested: