CheckV assesses the quality and completeness of metagenome-assembled viral genomes.

Published on May 1, 2021 in Nature Biotechnology (impact factor 36.558)
· DOI: 10.1038/S41587-020-00774-7
Stephen Nayfach (LBNL: Lawrence Berkeley National Laboratory), Estimated H-index: 18
Antonio P. Camargo (State University of Campinas), Estimated H-index: 6
+ 3 authors
Nikos C. Kyrpides (LBNL: Lawrence Berkeley National Laboratory), Estimated H-index: 100
Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.