There is a growing need for unbiased clustering algorithms, ideally automated, to analyze complex datasets. Topological data analysis (TDA) has been used to approach this problem. This recent field of mathematics discerns characteristic features of a space without relying on probabilistic approaches. It provides robust qualitative and quantitative assessments of the structure of data. Mapper, a TDA algorithm, has shown greater power than standard methods on complex data and has overcome problems of noise. However, it relies on the selection of several parameters and is not well suited to small datasets. To overcome these problems, we have developed a topology-based clustering algorithm called Two-Tier Mapper (TTMap) to detect subgroups in global gene expression datasets and to identify their distinguishing features in a two-group comparison. First, TTMap discerns and adjusts for highly variable features in the control group and identifies outliers. Second, to obtain an individual assessment of the differences with respect to the control group, a profile of deviation is computed for each test sample. Test samples are then clustered in two tiers, creating a global and a local network, using a new topological algorithm based on Mapper in which all parameters are either carefully chosen or data-driven, avoiding any user-induced bias. These choices render the algorithm theoretically stable. In particular, when sample sizes are small, TTMap outperforms existing clustering methods in finding relevant subgroups, in stability on synthetic and biological datasets, and in revealing more gene expression changes. Datasets from different sources can readily be combined into one analysis. Thus, TTMap can extract information from highly variable biological samples, and, since an individual profile of deviation is established, it has potential for personalized medicine. The algorithm was developed as an open-source R package deposited in Bioconductor.
Furthermore, two additional applications of topology were developed: one to find differences in gene expression across the menstrual cycle, and one to find cyclical patterns in gene expression related to hormone response.
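The idea of a per-sample profile of deviation from the control group can be illustrated with a minimal Python sketch. It is a simplified stand-in and not the procedure implemented in the TTMap package: it assumes, purely for illustration, that a feature deviates only when it falls outside the range spanned by the control samples, and it omits the adjustment for highly variable features and the outlier detection. The function name `deviation_profile` and the toy data are hypothetical.

```python
def deviation_profile(test_sample, control_samples):
    """Return, per feature, the signed distance of the test value from the
    interval [min, max] spanned by the control samples (0.0 if inside).

    Assumption: this range-based rule is an illustrative simplification,
    not the deviation measure used by TTMap itself.
    """
    profile = []
    for j, value in enumerate(test_sample):
        # Collect the j-th feature across all control samples.
        control_values = [sample[j] for sample in control_samples]
        low, high = min(control_values), max(control_values)
        if value < low:
            profile.append(value - low)    # negative: below the control range
        elif value > high:
            profile.append(value - high)   # positive: above the control range
        else:
            profile.append(0.0)            # within the control range
    return profile


# Toy data: three control samples and one test sample, three features each.
controls = [[1.0, 5.0, 2.0], [2.0, 6.0, 2.5], [1.5, 5.5, 3.0]]
test = [4.0, 5.2, 0.5]
print(deviation_profile(test, controls))  # [2.0, 0.0, -1.5]
```

In this toy case the first feature lies above the control range, the second lies inside it, and the third lies below it. Profiles of this kind, one per test sample, are what a downstream clustering step would group.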
Michel Bierlaire, Marija Kukic