Publication

Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods

Jaspreet Singh Saini
2024
Article

Abstract

Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj >= 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.
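The abstract reports accuracy as the Matthews Correlation Coefficient (MCC). As a minimal sketch of how that metric is computed from a binary viral/non-viral confusion matrix, the following uses the standard MCC formula. The function name `matthews_corrcoef_from_counts` and the example counts are hypothetical illustrations, not values or code from the study.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Compute MCC from confusion-matrix counts.

    tp: viral sequences called viral      fn: viral sequences called non-viral
    tn: non-viral sequences called non-viral  fp: non-viral sequences called viral
    Returns 0.0 when the denominator is zero (a common convention, assumed here).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only (not data from the paper).
example_mcc = matthews_corrcoef_from_counts(tp=80, tn=90, fp=10, fn=20)
print(f"MCC = {example_mcc:.2f}")
```

MCC ranges from -1 (every call wrong) through 0 (no better than chance) to 1 (every call right), and it uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy when viral sequences are a small fraction of a metagenome.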

Ontological proximity

Biology

Genetics: Genomics

Related publications (32)

Comparison of Three Viral Nucleic Acid Preamplification Pipelines for Sewage Viral Metagenomics

Microbial genome collection of aerobic granular sludge cultivated in sequencing batch reactors using different carbon source mixtures

The microbial genomics of glacier-fed streams: adaptations to an extreme ecosystem
