Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
The intrinsic high entropy metadata, known as quality scores, are largely the cause of the substantial size of sequence data files. Yet, there is no consensus on a viable reduction of the resolution of the quality score scale, arguably because of collateral side effects. In this paper we leverage on the penalty functions of HISAT2 aligner to rebin the quality score scale in such a way as to avoid any impact on sequence alignment, identifying alongside a distortion threshold. We tested our findings on whole-genome sequence and RNA sequence data, and contrasted the results with three methods for lossy distortion of the quality scores.
David Atienza Alonso, Marina Zapater Sancho, Yasir Mahmood Qureshi, José Manuel Herruzo Ruiz