In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is necessary because DNA sequencing technology generally cannot read whole genomes in one go; instead it reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing of genomic DNA or of gene transcripts (expressed sequence tags, ESTs).
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
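The idea of aligning and merging overlapping fragments can be illustrated with a minimal sketch of a greedy overlap-merge strategy. This is a toy illustration only, not the method used by any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` parameter and the example reads are all assumptions made for this example, and the reads are assumed to be error-free and to come from a sequence without repeats.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Hypothetical toy assembler: it assumes error-free reads and does not
    handle repeats or reverse-complement strands.
    """
    # Drop exact duplicates and reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]

    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = best_a + best_b[best_len:]
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Discard anything the merged contig now contains, then add it.
        contigs = [c for c in contigs if c not in merged] + [merged]
    return contigs


# Three overlapping "shreds" of the made-up sequence AGCTTGACCGTAT.
reads = ["ACCGTAT", "AGCTTGA", "TTGACCG"]
print(greedy_assemble(reads))
```

With repeated regions or read errors, several merges can look equally good and a greedy choice like this one may join the wrong fragments, which is one reason the practical issues described above make real assembly hard.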
The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs, built to piece together the vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses through plasmids to bacteria and finally eukaryotes), the assembly programs used in these genome projects needed increasingly sophisticated strategies to handle:
terabytes of sequencing data which need processing on computing clusters;
identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms quadratically;
DNA read errors in the fragments from the sequencing instruments, which can confound assembly.
Faced with the challenge of assembling the first larger eukaryotic genomes (the fruit fly Drosophila melanogaster in 2000 and the human genome just a year later), scientists developed assemblers like Celera Assembler and Arachne able to handle genomes of 130 million (e.
This course presents the fundamental principles at work in living organisms. As far as possible, the emphasis is on the contributions of computer science to advances in the life sciences.
This course will train doctoral students to use bioinformatic tools to analyse amplicon and metagenomic sequences. In addition, we will also touch upon meta-transcriptomics and meta-proteomics.
This course will take place from 3rd to 7th June 2024. It will introduce the workflows and techniques that are used for the analysis of bulk and single cell RNA-seq data. It will empower students to
Explores tools and models for Next-Generation Sequencing data analysis, covering DNA sequencing technologies, data analysis pipelines, and statistical models.
Explores single molecule sequencing strategies, including sequencing by synthesis and real-time sequencing based on zero mode waveguides, as well as DNA translocation through nanopores.
Author summary: In recent years, the application of deep learning represented a breakthrough in the mass spectrometry (MS) field by improving the assignment of the correct sequence of amino acids from observable MS spectra without prior knowledge, also known ...
Public Library of Science, 2023
Viral metagenomics is a useful tool for detecting multiple human viruses in urban sewage. However, more refined protocols are required for its effective use in disease surveillance. In this study, we investigated the performance of three different preampli ...
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery. Knowledge of DNA sequences has become indispensable for basic biological research, DNA Genographic Projects and in numerous applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological systematics.
Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome.
Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle, in which the sequencing is performed by detecting the nucleotide incorporated by a DNA polymerase. Pyrosequencing relies on light detection based on a chain reaction when pyrophosphate is released. Hence, the name pyrosequencing.
Single-cell sequencing (sc-seq) provides a species agnostic tool to study cellular processes. However, these technologies are expensive and require sufficient cell quantities and biological replicates to avoid artifactual results. An option to address thes ...