NCBI's Virus Discovery Codeathon: Building "FIVE" - The Federated Index of Viral Experiments API Index.

Published on Dec 10, 2020 in Viruses (3.816) · DOI: 10.3390/V12121424
Joan Martí-Carreras (estimated H-index: 6), Alejandro R. Gener (estimated H-index: 2), + 30 authors, Ben Busby (estimated H-index: 11)
Viruses represent important test cases for data federation because of their genome size and the rapid growth of sequence data in publicly available databases. However, previously decentralized (unfederated) data have left feature annotations without consensus and without any means of comparing them. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was developed. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomic annotations and virus–host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof of concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As of the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for efficient access to large quantities of data and is publicly accessible. Many database or index federation projects fail to provide easier ways to access or query information. For this reason, a Python API query system was developed to make FIVE more accessible.
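To make the idea of displaying alternative annotations side by side concrete, here is a minimal sketch of a federated annotation lookup. It is not the FIVE API or its BigQuery schema: an in-memory SQLite table stands in for BigQuery, and the table layout, the function names, the source names (`source_a`, `source_b`), the accessions, and the annotation strings are all hypothetical toy values chosen for illustration.

```python
import sqlite3

# Toy, hypothetical input: source name -> (accession, feature, annotation) rows.
TOY_SOURCES = {
    "source_a": [
        ("ACC0001", "gene_1", "capsid protein"),
        ("ACC0001", "gene_2", "polymerase"),
    ],
    "source_b": [
        ("ACC0001", "gene_1", "structural protein"),
        ("ACC0001", "gene_2", "polymerase"),
        ("ACC0002", "gene_1", "hypothetical protein"),
    ],
}


def build_index(sources):
    """Load per-source annotation rows into one in-memory table.

    SQLite is only a local stand-in here; the schema is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotations "
        "(accession TEXT, feature TEXT, source TEXT, annotation TEXT)"
    )
    for source, rows in sources.items():
        conn.executemany(
            "INSERT INTO annotations VALUES (?, ?, ?, ?)",
            [(acc, feat, source, ann) for acc, feat, ann in rows],
        )
    return conn


def alternative_annotations(conn, accession):
    """Return {feature: {source: annotation}} for one accession."""
    cursor = conn.execute(
        "SELECT feature, source, annotation FROM annotations "
        "WHERE accession = ? ORDER BY feature, source",
        (accession,),
    )
    result = {}
    for feature, source, annotation in cursor:
        result.setdefault(feature, {})[source] = annotation
    return result


def conflicting_features(conn, accession):
    """List features whose annotation differs between sources."""
    by_feature = alternative_annotations(conn, accession)
    return sorted(
        feature
        for feature, by_source in by_feature.items()
        if len(set(by_source.values())) > 1
    )
```

Calling `conflicting_features(build_index(TOY_SOURCES), "ACC0001")` on the toy data reports only `gene_1`, the one feature on which the two hypothetical sources disagree. Each annotation stays tagged with its source rather than being merged, so disagreements remain visible to the person querying.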