High-throughput sequencing of DNA molecules has revolutionized biomedical research by enabling quantitative analysis of the genome to study its function, structure and dynamics. It drives sequencing-based experiments across the life sciences, as evidenced by the plethora of emerging omics applications powered by sequence data. However, the capacity to generate massive sequence datasets greatly outpaces our ability to analyze them, which is the notorious bottleneck in omics analyses. With the democratization of computational analyses, practical solutions for storing, distributing and processing sequence data will become necessary for life science research to progress. Quality scores, a form of metadata with intrinsically high entropy, are largely responsible for the substantial size of sequence data files. Despite several efforts to show that lossy representation of quality scores has only marginal impact on downstream analyses, there is no consensus on the limits of "safe" lossy representation. In this work, we study the effect of lossy quality score representation on three applications (variant calling, gene expression and sequence alignment) to assess the relevance of this metadata for omics analyses. We confirmed that the impact is negligible and found that, for sequence alignment, a threshold value for transparent quality score distortion can be computed, which makes it possible to identify a "safe" representation of the quality score scale. These results align with the current trend of sequencing platforms moving to coarser quality score resolutions to reduce the storage footprint of sequence data.
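As a rough illustration of what a lossy, coarser-resolution representation of quality scores can look like, the sketch below applies a uniform quantizer to Phred scores decoded from a FASTQ quality string and compares the empirical entropy before and after. It is not the method used in this work: the bin width of 10, the helper names (`decode`, `encode`, `quantize`, `entropy_bits`) and the sample quality string are all assumptions chosen for illustration, and the Phred+33 offset assumes Sanger-style FASTQ encoding.

```python
import math
from collections import Counter

PHRED_OFFSET = 33   # assumption: Sanger-style (Phred+33) FASTQ encoding
MAX_PHRED = 93      # highest score representable in printable ASCII with offset 33


def decode(qual: str) -> list[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - PHRED_OFFSET for c in qual]


def encode(scores: list[int]) -> str:
    """Convert a list of Phred scores back into a FASTQ quality string."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)


def quantize(scores: list[int], step: int = 10) -> list[int]:
    """Uniform quantizer: replace each score by the midpoint of its bin.

    The bin width `step` is a hypothetical parameter, not a value from the text.
    """
    return [min((q // step) * step + step // 2, MAX_PHRED) for q in scores]


def entropy_bits(scores: list[int]) -> float:
    """Empirical (zeroth-order) entropy of the scores, in bits per symbol."""
    counts = Counter(scores)
    n = len(scores)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    original = decode("IIIIHGFEDCBA@?>=<;:98765")  # made-up quality string
    lossy = quantize(original, step=10)
    print("original :", encode(original), f"{entropy_bits(original):.2f} bits/score")
    print("quantized:", encode(lossy), f"{entropy_bits(lossy):.2f} bits/score")
```

Because the quantizer maps many distinct scores to the same value, the quantized stream never has higher empirical entropy than the original, which is why coarser scales compress better. Whether a given bin width is "safe" for a downstream application is exactly the question the abstract addresses, and this sketch does not answer it.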