Ready-to-use public infrastructure for global SARS-CoV-2 monitoring

Maier, Wolfgang; Bray, Simon; van den Beek, Marius; Bouvier, Dave; Coraor, Nathan; Miladi, Milad; Singh, Babita; De Argila, Jordi Rambla; Baker, Dannon; Roach, Nathan; Gladman, Simon; Coppens, Frederik; Martin, Darren P.; Lonie, Andrew; Grüning, Björn; Kosakovsky Pond, Sergei L.; Nekrutenko, Anton

doi:10.1038/s41587-021-01069-1

Download PDF

Correspondence
Published: 29 September 2021

Ready-to-use public infrastructure for global SARS-CoV-2 monitoring

Nature Biotechnology volume 39, pages 1178–1179 (2021)Cite this article

2960 Accesses
13 Citations
27 Altmetric
Metrics details

Subjects

To the Editor — The COVID-19 pandemic is the first health crisis characterized by large amounts of genomic data¹. Computational infrastructure can be a bottleneck for data analysis, amplifying global inequalities in ability to track SARS-CoV-2 evolution. This is an issue even in developed countries, as computational infrastructure requires expertise in resource procurement, configuration and maintenance. Commercial computational clouds do not fully address the problem because these resources must still be configured and funded. Furthermore, commercial clouds are predominantly US-based and many countries have policies making payments to foreign providers impractical. In developing countries, research computing infrastructure is rare and researchers often cannot afford commercial cloud-based computation. Here, we present the COVID-19 effort by the Galaxy Project, which pools free worldwide public computational infrastructure, making the analysis of deep sequencing data accessible to anyone while also providing an analytical framework for global pathogen genomic surveillance based on raw sequencing-read data.

Despite the existence of well designed and validated SARS-CoV-2 data analysis approaches^2,3, the ad hoc⁴ nature of their application often complicates the integration and comparison of analysis results. Public computational infrastructure (XSEDE, ELIXIR and Nectar Cloud in the United States, European Union and Australia, respectively) coupled with existing open-source software offers a solution to SARS-CoV-2 analytics challenges. However, glue is required to bind these resources into a unified platform for managing users, allocating storage and pairing analysis tools with appropriate computational resources. Such a platform is best not developed by a single principal investigator, group or institution, but rather supported by an international community of users, developers and educators.

We have developed a two-stage platform (Fig. 1) housed on three public Galaxy instances⁵ in the United States (http://usegalaxy.org), the European Union (http://usegalaxy.eu) and Australia (http://usegalaxy.org.au) and capable of supporting hundreds of thousands of complex analyses per month. Anyone can run effectively unlimited computation with 250 Gb (expandable) of disk space. The COVID-19 Galaxy Project comprises two stages (Fig. 1): the software components of stage 1—mature utilities for quality control, mapping, assembly and allelic variant (AV) calling—run entirely in Galaxy and are distributed via the BioConda project⁶; the software components of stage 2 are snippets of code for data transformation, exploration and visualization running within standard web-browser-based notebook environments. Stage 1 produces variant lists whereas stage 2 uses notebooks to perform descriptive analyses of datasets. In addition, an interactive dashboard is available that tracks temporal AV dynamics. (See https://covid19.galaxyproject.org for data, workflows, notebooks, dashboard and our ongoing automated tracking of large-scale genomic surveillance projects.)

**Fig. 1: Analysis flow for calling SARS-CoV-2 variants using Galaxy.**

Four primary analysis workflows (Supplementary Table 1) support the identification of SARS-CoV-2 AVs from deep-sequencing reads via the production of annotated AVs through a series of steps including quality control, trimming, mapping, deduplication, AV calling and filtering. Their output is processed by the Reporting and Consensus workflows (Supplementary Table 1) to generate standardized data tables describing AVs along with consensus genome sequences. These are further processed to summarize and visualize the data using interactive notebooks.

To illustrate the platform’s utility and scalability, we refer the reader to two large SARS-CoV-2 Illumina datasets (PRJNA622837, 619 samples from early SARS-CoV-2 transmission in the Boston area⁷; and PRJEB37886, ~100,000 samples analyzed as of the time of writing from the COVID-19 Genomics UK (COG-UK) genomic surveillance effort⁸) detailed in Supplementary Tables 1–3 and Supplementary Figs. 1–3. Analysis on COVID-19 Galaxy Project resources provides insights into co-occurrence patterns, presence of mutations defining variants of concern (https://cov-lineages.github.io/lineages-website/global_report.html), and intersection with sites under selection, including non-random associations among common low-frequency AVs that may reflect shared intra-host dynamics (Supplementary Fig. 1 and Supplementary Table 2). It can also highlight the emergence of mutations interfering with binding of polyclonal antibodies⁹ (for example, COG-UK data in Supplementary Fig. 2), suggesting possible intra-host dynamics. These and other interactive notebooks and dashboards on the platform could identify AVs that warrant closer monitoring as the pandemic continues.

Our system is designed to encourage scalable collaborative worldwide genomic surveillance to identify and respond to emerging variants. By relying on raw read data rather than assembled genomes and allowing every result to be traced back to its raw data, it goes a step beyond current surveillance efforts. Specifically, it enables surveillance of intra-patient minor AV frequencies—a distinction that could yield early warnings of epidemiological conditions conducive to the emergence of variants with altered pathogenicity, vaccine sensitivity or drug resistance.

References

Hodcroft, E. B. et al. Nature 591, 30–33 (2021).
Article CAS Google Scholar
Quick, J. et al. Nat. Protoc. 12, 1261–1276 (2017).
Article CAS Google Scholar
Grubaugh, N. D. et al. Genome Biol. 20, 8 (2019).
Article Google Scholar
Baker, D. et al. PLoS Pathog. 16, e1008643 (2020).
Article CAS Google Scholar
Jalili, V. etal. Nucleic Acids Res. 48 W1, W395–W402 (2020).
Grüning, B. et al. Nat. Methods 15, 475–476 (2018).
Article Google Scholar
Lemieux, J. et al. Science https://doi.org/10.1126/science.abe3261 (2021).
du Plessis, L. et al. Science 371, 708–712 (2021).
Article Google Scholar
Greaney, A. J. et al. Cell Host Microbe 29, 463–476.e6 (2021).
Article CAS Google Scholar

Download references

Acknowledgements

The authors are grateful to the broader Galaxy community for their support and software development efforts. This work is funded by NIH grants U41 HG006620 and NSF ABI grant 1661497. Usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to B.G. Galaxy and HyPhy integration is supported by NIH grant R01 AI134384 to A.N. Usegalaxy.org.au is supported by Bioplatforms Australia and the Australian Research Data Commons through funding from the Australian Government National Collaborative Research Infrastructure Strategy. The hyphy.org development team is supported by NIH grant R01GM093939. Usegalaxy.be is supported by the Research Foundation-Flanders (FWO) grant I002919N and the Flemish Supercomputer Center (VSC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

University of Freiburg, Freiburg, Germany
Wolfgang Maier, Simon Bray, Milad Miladi & Björn Grüning
The Pennsylvania State University, University Park, PA, USA
Marius van den Beek, Dave Bouvier, Nathan Coraor & Anton Nekrutenko
GalaxyWorks Inc, Baltimore, MD, USA
Babita Singh & Jordi Rambla De Argila
Centre for Genomic Regulation, Viral Beacon Project, Barcelona, Spain
Dannon Baker
Johns Hopkins University, Baltimore, MD, USA
Nathan Roach
University of Melbourne, Melbourne, Victoria, Australia
Simon Gladman & Andrew Lonie
Ghent University, Ghent, Belgium
Frederik Coppens
VIB Center for Plant Systems Biology, Ghent, Belgium
Frederik Coppens
University of Cape Town, Cape Town, South Africa
Darren P. Martin
Temple University, Philadelphia, PA, USA
Sergei L. Kosakovsky Pond

Authors

Wolfgang Maier
View author publications
You can also search for this author in PubMed Google Scholar
Simon Bray
View author publications
You can also search for this author in PubMed Google Scholar
Marius van den Beek
View author publications
You can also search for this author in PubMed Google Scholar
Dave Bouvier
View author publications
You can also search for this author in PubMed Google Scholar
Nathan Coraor
View author publications
You can also search for this author in PubMed Google Scholar
Milad Miladi
View author publications
You can also search for this author in PubMed Google Scholar
Babita Singh
View author publications
You can also search for this author in PubMed Google Scholar
Jordi Rambla De Argila
View author publications
You can also search for this author in PubMed Google Scholar
Dannon Baker
View author publications
You can also search for this author in PubMed Google Scholar
Nathan Roach
View author publications
You can also search for this author in PubMed Google Scholar
Simon Gladman
View author publications
You can also search for this author in PubMed Google Scholar
Frederik Coppens
View author publications
You can also search for this author in PubMed Google Scholar
Darren P. Martin
View author publications
You can also search for this author in PubMed Google Scholar
Andrew Lonie
View author publications
You can also search for this author in PubMed Google Scholar
Björn Grüning
View author publications
You can also search for this author in PubMed Google Scholar
Sergei L. Kosakovsky Pond
View author publications
You can also search for this author in PubMed Google Scholar
Anton Nekrutenko
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Björn Grüning, Sergei L. Kosakovsky Pond or Anton Nekrutenko.

Additional information

Peer review information Nature Biotechnology thanks Jason Sahl for their contribution to the peer review of this work.

Supplementary information

Supplementary Information

Supplementary Methods, Supplementary Tables 1–3 and Supplementary Figs. 1–5

Rights and permissions

Reprints and permissions

About this article

Cite this article

Maier, W., Bray, S., van den Beek, M. et al. Ready-to-use public infrastructure for global SARS-CoV-2 monitoring. Nat Biotechnol 39, 1178–1179 (2021). https://doi.org/10.1038/s41587-021-01069-1

Download citation

Published: 29 September 2021
Issue Date: October 2021
DOI: https://doi.org/10.1038/s41587-021-01069-1

This article is cited by

Detection of SARS-CoV-2 intra-host recombination during superinfection with Alpha and Epsilon variants in New York City
- Joel O. Wertheim
- Jade C. Wang
- Scott Hughes
Nature Communications (2022)

Ready-to-use public infrastructure for global SARS-CoV-2 monitoring

Subjects

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Supplementary information

Supplementary Information

Rights and permissions

About this article

Cite this article

This article is cited by

Detection of SARS-CoV-2 intra-host recombination during superinfection with Alpha and Epsilon variants in New York City

Search

Quick links

Subjects

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding authors

Additional information

Supplementary information

Supplementary Information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Detection of SARS-CoV-2 intra-host recombination during superinfection with Alpha and Epsilon variants in New York City

Search

Quick links