NCBI's Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements.

Published on Sep 16, 2019 in Genes (3.759)
DOI: 10.3390/GENES10090714
Ryan Connor (NIH: National Institutes of Health), estimated H-index: 1
Rodney Brister (NIH: National Institutes of Health), estimated H-index: 1
+ 35 authors
Ben Busby (NIH: National Institutes of Health), estimated H-index: 11
A wealth of viral data sits untapped in publicly available metagenomic data sets, although it could be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams, comprising over 40 participants from six countries, assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimal contig length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains, and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. The main lessons were: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be augmented with wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event.
Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
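Two of the steps named in the abstract, the 1 kb minimum-length filter and the use of wrapper scripts to spread a weakly multithreaded tool across all cores of a node, can be illustrated with a minimal Python sketch. This is not the hackathon's actual pipeline: every function name here (`read_fasta`, `filter_by_length`, `split_batches`, `run_tool`, `run_parallel`) is hypothetical, and the "tool" is a stand-in Python one-liner that reports contig lengths, where a real workflow would presumably call a program such as BLAST.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep contigs of at least min_len bases (1 kb, as in the abstract)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def split_batches(records, n_batches):
    """Distribute records round-robin into at most n_batches non-empty batches."""
    batches = [records[i::n_batches] for i in range(n_batches)]
    return [b for b in batches if b]


def run_tool(batch):
    """Run one single-threaded process on a batch and return its output lines.

    HYPOTHETICAL stand-in tool: a Python one-liner that prints
    "<contig id>\t<length>"; swap in the real command for actual use.
    """
    fasta = "".join(f">{h}\n{s}\n" for h, s in batch)
    script = (
        "import sys\n"
        "name = None\n"
        "for line in sys.stdin:\n"
        "    line = line.strip()\n"
        "    if line.startswith('>'):\n"
        "        name = line[1:]\n"
        "    elif line:\n"
        "        print(name, len(line), sep='\\t')\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", script],
        input=fasta, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def run_parallel(records, workers=None):
    """Wrapper: one tool process per batch, by default one batch per core."""
    workers = workers or os.cpu_count() or 1
    batches = split_batches(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each thread only waits on its child process, so the child
        # processes themselves run concurrently on separate cores.
        results = list(pool.map(run_tool, batches))
    return sorted(line for lines in results for line in lines)


if __name__ == "__main__":
    demo = [">short", "ACGT" * 10, ">long_a", "ACGT" * 250,
            ">long_b", "A" * 600, "C" * 600]
    kept = filter_by_length(list(read_fasta(demo)))
    for row in run_parallel(kept, workers=2):
        print(row)
```

Threads are used here only to launch and wait on external processes, which is why a thread pool is enough to keep several cores busy; a production wrapper would also need to handle tool failures, temporary files, and merging of per-batch outputs.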