Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
The evolutionary relationships between organisms or phylogenies are fundamental to biology. They are invaluable as guiding tools to mine, organize and exploit the enormous amounts of biological data in the post-genomic era. The advent of high-throughput sequencing has resulted in whole genome data of many organisms and the need for inferring larger phylogenies ; both have necessitated the development of new methods and models in the field of phylogenetic reconstruction. The work presented in this dissertation contributes to this development. Phylogenies are most often inferred from genomic sequences – DNA or amino acid sequences. A key step before using a phylogenetic reconstruction method is that of aligning the input sequences. Finding accurate multiple sequence alignments is hard because of the heterogeneity of evolutionary signal present in the sequences—a more acute problem in whole genome sequences. A new approach to refining the phylogenetic signal of alignments is presented that identifies and eliminates phylogenetically noisy parts of the alignment in order to yield better phylogenies. The more recent model of evolution based on rearrangements (large-scale structural changes in genomes) has been a major step towards reconstructing large phylogenies. The design of models and methods in this area has presented significant mathematical challenges and some of these algorithmic and statistical questions are addressed in this work. The efficacy of simple distance-based methods are demonstrated in building accurate trees with rearrangement data, using precise estimates of true evolutionary distances. Novel methods methods are designed for robustness assessment of trees inferred by such distance-based methods. These methods are the first methods for robustness assessment for trees inferred from rearrangement data that are on par with previous such measures for trees inferred from sequence data. Further, two algorithmic problems on inversions are discussed : sorting by inversions and the inversion median problem. New algorithms and theoretical insights about the structure of the problems are described.
Kathryn Hess Bellwald, Lida Kanari, Adélie Eliane Garin
Anne-Florence Raphaëlle Bitbol, Nicola Dietler, Umberto Lupo