Parallel and Scalable Bioinformatics

Stuart Anthony Byma
2020
EPFL thesis

Abstract

The field of genomics is likely to become the largest producer of data as a consequence of the large-scale application of next-generation sequencing technology for biological research and personalized medical treatments. The raw sequence data produced by these methods is limited in usefulness and requires computational analysis to unlock its potential. Bioinformatics is a field that combines biology, genomics, and computer science to build algorithms and software to analyze biological data. Some current bioinformatics tools have difficulty keeping up with the increasing rate of data production. For example, raw sequence preprocessing, which involves aligning subsequences to a reference genome, sorting, and other operations, can take many hours. Downstream processing applications also require computational innovation: for protein sequence similarity search, an important tool in protein function characterization and the study of evolution, building high-quality databases can take weeks or months, even for relatively small ones composed of just a few thousand genomes.

This thesis shows that these computational challenges can be solved effectively and efficiently by combining fine-grained parallelism with horizontal scaling on highly parallel compute clusters and data centers. This is shown through three primary contributions.

First, the preprocessing of whole-genome sequencing reads is addressed with Persona. Persona is a high-performance, scalable bioinformatics system that unifies data, tools, algorithms, and processes for alignment, sorting, duplicate marking, and other operations in a common framework that scales linearly. For example, Persona can align 220 million short reads in ~17 seconds using a 32-node cluster. Second, a new technique for measuring and analyzing heap usage is introduced, which can help bioinformatics and other programs make more efficient use of memory, leading to performance gains of up to 10%. Finally, to accelerate protein similarity search, a new clustering algorithm is introduced that exposes parallelism. Combined with dynamic load balancing, it allows for efficient and scalable execution, leading to speedups of more than 1400x over existing methods.

Official source

https://infoscience.epfl.ch/record/277224?ln=en

Multi-well plate lid for single-step pooling of 96 samples for high-throughput barcode-based sequencing

Comparison of Three Viral Nucleic Acid Preamplification Pipelines for Sewage Viral Metagenomics

Single-mitosis dissection of acute and chronic DNA mutagenesis and repair
