The WashU Virus Genome Browser is a web-based portal for efficient visualization of viral ‘omics’ data in the context of a variety of annotation tracks and host infection responses. The browser features both a phylogenetic-tree-based view and a genomic-coordinate, track-based view in which users can analyze the sequence features of viral genomes, sequence diversity among viral strains, genomic sites of diagnostic tests, predicted immunogenic epitopes and a continuously updated repository of publicly available genomic datasets.
Coronavirus disease 2019 (COVID-19) is a rapidly spreading viral disease that has become a global health crisis. The first case of COVID-19 was reported on 12 December 2019, in Wuhan, China; by 2 August 2020, the disease had spread to more than 215 countries, territories and areas, resulting in at least 17,660,523 cases and 680,894 deaths (https://covid19.who.int/). COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a zoonotic, enveloped virus with a positive-sense, single-stranded RNA genome 29,903 nucleotides in length. The virus is one of seven coronaviruses known to infect humans, along with severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV)1,2. To better understand the evolutionary history and pathogenesis of this new virus, large amounts of genomic data have been generated for both SARS-CoV-2 and human host cells. The genomes of thousands of SARS-CoV-2 strains have now been fully sequenced, the SARS-CoV-2 genome and transcriptome have been functionally annotated3,4, and the genomic activity of host cells in response to viral infection is beginning to be elucidated5,6. This explosion of omics data has created a need for platforms to store, process, analyze and visualize these data, to gain insights into the genomic basis of viral infection. Sequencing databases, such as National Center for Biotechnology Information (NCBI) GenBank7 and Global Initiative on Sharing All Influenza Data (GISAID)8, currently store most of the known genome sequences of individual SARS-CoV-2 strains. The pathogen genomics platform Nextstrain currently analyzes the genomic diversity and epidemiology of a subset of these strains and provides a useful overview of the phylogeography of SARS-CoV-2 transmission9.
Although these databases have amassed a wealth of genomic and phylogenetic information for SARS-CoV-2, comprehensive analysis of the SARS-CoV-2 genome will require a high-performance genome browser designed for storing and visualizing diverse viral and host omics datasets.
To address this need, we created the WashU Virus Genome Browser (https://virusgateway.wustl.edu), a web-based portal adapted from the WashU Epigenome Browser10,11,12,13, which is specifically designed for efficient visualization of viral genomic sequencing data (Supplementary Note). The browser contains the genomes of four different pathogenic virus species: SARS-CoV-2, SARS-CoV, MERS-CoV and Ebola. A reference-genome sequence is provided for each species, as well as a comprehensive collection of genome sequences of individual strains that have been isolated from patients from different geographical regions and time periods. In total, we have collected genomes of >80,000 SARS-CoV-2 strains, 332 SARS-CoV strains, 551 MERS-CoV strains and 1,574 Ebola strains, and we are continuously updating our database (Supplementary Table 1). The genome of each strain is automatically aligned in pairwise fashion to the reference genome, and sequence variants relative to the reference are visualized through single-nucleotide variant (SNV) tracks, which provide a simple and effective way to visualize sequence variation across multiple viral strains (Figs. 1b, 2 and 3b and Supplementary Note). Users can search for viral strains in the browser by using the data table (Supplementary Fig. 1 and Supplementary Note). Here, strains can be filtered by several metadata features, including country, continent, data source, collection date, tree-view availability and clade. Additionally, the data table’s search feature allows users to locate strains of interest by querying individual accession IDs or mutations. Strains of interest can be added to the user’s cart and displayed in the form of SNV tracks in the track-based browser view or highlighted in the tree-based view (Supplementary Note). To accommodate users who wish to visualize sequence variation within their own viral strains, the browser supports user upload of viral sequences in SNV format. 
Users are provided with easy-to-use scripts to convert their alignment results into SNV format for visualization (Supplementary Note). Additionally, the browser offers various precomputed genome-annotation tracks, such as gene annotations, protein annotations, recombination sites, sequence diversity, mutation frequencies, comparative tracks between species and GC density. Along with the traditional track-based view, our genome browser also features a phylogenetic-tree-based view, which allows users to analyze the evolutionary relationships and history of viral strains (Supplementary Fig. 2 and Supplementary Note). In this view, users can annotate strains on the phylogenetic tree with different metadata categories (for example, country of origin, collection date or Nextstrain clade designations) and can also highlight strains of interest within the tree to determine their relative relationships to other strains or clades (Supplementary Note).
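The conversion of a pairwise alignment into SNV records can be illustrated with a minimal sketch. This is not the browser's own conversion script, which is described in the Supplementary Note. The function names, the BED-like output layout and the use of NC_045512.2 as the reference sequence name are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Snv(NamedTuple):
    position: int  # 1-based coordinate on the ungapped reference
    ref: str       # reference base
    alt: str       # strain base


def alignment_to_snvs(ref_aln: str, query_aln: str) -> List[Snv]:
    """Walk two gapped alignment rows of equal length and report substitutions."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")
    snvs = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), query_aln.upper()):
        if r != "-":
            ref_pos += 1          # advance only on reference bases
        if r == "-" or q == "-":
            continue              # insertions and deletions are not SNVs
        if q == "N":
            continue              # skip ambiguous base calls
        if r != q:
            snvs.append(Snv(ref_pos, r, q))
    return snvs


def to_bed_like(snvs: List[Snv], chrom: str = "NC_045512.2") -> List[str]:
    """Hypothetical BED-like rows: chrom, 0-based start, end, REF>ALT."""
    return [f"{chrom}\t{s.position - 1}\t{s.position}\t{s.ref}>{s.alt}" for s in snvs]


if __name__ == "__main__":
    # Toy alignment rows, not real viral sequence
    for row in to_bed_like(alignment_to_snvs("ACGT-ACGT", "ACTTGAC-A")):
        print(row)
```

Positions are counted on the ungapped reference, so insertions in the strain do not shift the coordinates of downstream variants.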
The WashU Virus Genome Browser also hosts a large set of publicly available genomic datasets for both SARS-CoV-2 and human host cells. These datasets are organized as data hubs consisting of one or more tracks of data related to a specific aspect of viral biology, host biology, disease diagnosis or disease therapeutics. The browser currently hosts data hubs for viral transcription, viral recombination sites, viral RNA modifications, host transcriptional responses to infection, predicted antigenic epitopes, binding sites of diagnostic primers and genomic targets of CRISPR-based diagnostic tests (Supplementary Note and Supplementary Table 1). These data hubs can be easily loaded into the genome browser, providing users with an efficient way to analyze multiple data tracks of interest in tandem. As new omics data become publicly available, these datasets are promptly uploaded to the browser in data hubs. Emerging evidence now links host genomic variation with disease severity14, and further research efforts are focused on understanding the host response to viral infection, thus necessitating a platform for visualizing cross-species genomic data. The compatibility between the WashU Virus Genome Browser and the WashU Epigenome Browser provides such a platform. Our browser offers the unique capability of seamlessly visualizing viral genomic data alongside the corresponding host genomic response data, likewise organized in a modular data-hub format (Supplementary Fig. 3).
As the SARS-CoV-2 virus continues to evolve, one major task is studying how mutations accumulate within diagnostic PCR-primer-binding sites, which could potentially decrease test efficacy. To efficiently track mutation hotspots within a viral species, we developed two annotation tracks: a mutation-alert track, which displays the number of strains with a mutation at each genomic position, and a sequence-diversity (Shannon entropy) track, which displays the Shannon entropy (variation) at each genomic position across strains (Fig. 1a). We also provide a data hub of the genomic binding sites of diagnostic PCR primers from the Centers for Disease Control and Prevention (CDC) and World Health Organization, which users can load into the genome-browser view to determine whether primers overlap with mutation hotspots or mutations in certain strains (Supplementary Note). As shown in Fig. 1a, after loading the track for China-CDC detection primers, users can readily see a mutation hotspot overlapping primer locations within the N gene. Zooming into the region of interest and adding SNV tracks for individual strains offers a color-coded display of individual mutations and accession IDs of strains with a mutation at the given position (Fig. 1b). In addition to the preloaded primer-location tracks, the browser supports user upload of novel primer locations in standard bed format15, thereby allowing users to determine whether their primers overlap with mutations.
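The two hotspot tracks and the primer-overlap check can be sketched as follows. This is a minimal illustration, assuming that all strains are already aligned to the reference without gaps, that entropy is measured in bits over the four unambiguous bases, and that primers are given as 0-based, half-open BED intervals. The browser's exact entropy normalization is not stated in the text, and all function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def shannon_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (bits) of the A/C/G/T bases observed in one alignment column."""
    counts = Counter(b for b in column if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over bases of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_track(strains: List[str]) -> List[float]:
    """Per-position entropy across equally long, pre-aligned strain sequences."""
    return [shannon_entropy(col) for col in zip(*strains)]


def mutation_counts(reference: str, strains: List[str]) -> List[int]:
    """Per-position number of strains whose base differs from the reference."""
    return [
        sum(1 for s in strains if s[i] in "ACGT" and s[i] != reference[i])
        for i in range(len(reference))
    ]


def primers_over_hotspots(
    primers: List[Tuple[int, int, str]], counts: List[int], min_strains: int
) -> List[str]:
    """Names of primers (start, end, name) covering a position mutated in >= min_strains strains."""
    return [
        name
        for start, end, name in primers
        if any(c >= min_strains for c in counts[start:end])
    ]


if __name__ == "__main__":
    # Toy data, not real viral sequence or real primer coordinates
    reference = "ACGTAC"
    strains = ["ACGTAC", "ACTTAC", "ACTTAC", "ACGTAA"]
    counts = mutation_counts(reference, strains)
    print(counts)
    print(entropy_track(strains))
    print(primers_over_hotspots([(0, 2, "p1"), (1, 4, "p2"), (4, 6, "p3")], counts, 2))
```

The mutation count depends on the choice of reference, whereas the entropy depends only on the base frequencies among strains, which is why the two tracks can highlight slightly different positions.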
The WashU Virus Genome Browser can also be used to identify antigenic epitopes that are conserved across viral species. We demonstrate this utility by performing a similar analysis to that in Extended Data Fig. 5 in ref. 3, showing a genomic alignment of two SARS-CoV strains and five SARS-CoV-2 strains to the reference SARS-CoV-2 N gene, along with a track of putative immune epitopes identified in SARS-CoV (Fig. 2). Peptides in SARS-CoV-2 that are homologous to putative antigenic epitopes in SARS-CoV serve as useful targets for SARS-CoV-2 vaccine development. However, the presence of sequence variants within these peptides may compromise vaccine effectiveness (Fig. 2b,c). Motivated by the overall high sequence similarity between the SARS-CoV and SARS-CoV-2 genomes, we analyzed the full list of experimentally identified linear epitopes from the Immune Epitope Database and Analysis Resource (IEDB)16, and identified a list of 320 high-confidence linear epitopes whose amino acid sequences are identical to those of the predicted translated products from the SARS-CoV-2 reference strain (Supplementary Table 2). This list provides a catalog of epitopes for researchers testing immunological targets that can potentially elicit T-cell and B-cell responses to SARS-CoV-2.
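The identity criterion used to build the epitope list amounts to an exact substring search, which can be sketched as below. This is an illustration only, not the pipeline used to produce Supplementary Table 2. The function name and the short peptide and protein strings are hypothetical placeholders; a real analysis would load IEDB epitopes and the full translated products of the reference genome.

```python
from typing import Dict, List, Tuple


def conserved_epitopes(
    epitopes: List[str], proteins: Dict[str, str]
) -> List[Tuple[str, str, int]]:
    """Return (epitope, protein name, 1-based start) for each exact match.

    Only the first occurrence of an epitope within each protein is reported.
    """
    hits = []
    for epitope in epitopes:
        for name, sequence in proteins.items():
            index = sequence.find(epitope)
            if index != -1:
                hits.append((epitope, name, index + 1))
    return hits


if __name__ == "__main__":
    # Short placeholder strings standing in for translated products
    proteins = {"N": "MSDNGPQNQRNAPRITF", "S": "MFVFLVLLPLVSSQ"}
    epitopes = ["GPQNQ", "LVLLP", "AAAAA"]
    for hit in conserved_epitopes(epitopes, proteins):
        print(hit)
```

Requiring exact identity is deliberately strict: a peptide with a single amino acid difference is excluded, which matches the "identical" criterion stated above but would miss near-identical epitopes.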
The browser supports multiple file formats, thus allowing users to visualize sequencing data in a variety of ways to better understand viral sequence evolution. Recent work suggests that the amino acid change from aspartate to glycine at position 614 within the spike (S) protein of SARS-CoV-2 has become dominant in non-Asian countries17. Although this mutation is not located within the core receptor-binding domain of the protein, it is nonetheless thought to contribute to the transmissibility of the virus17. In the browser track view, the prevalence of the p.Asp614Gly alteration among SARS-CoV-2 strains is immediately evident when viewing the preloaded sequence-diversity (Shannon entropy) track and the mutation-alert track, which highlight the entropy across strains and the number of strains with mutations, respectively (Fig. 3a). A finer-scale visualization of the p.Asp614 codon within the browser shows that some accessions of non-Asian origin contain the p.Asp614Gly alteration, and a dynamic track of the mutation’s weekly prevalence shows that more than 90% of strains isolated during the week of 27 April 2020 contained the p.Asp614Gly alteration (Fig. 3b). Further characterization of the prevalence confirmed the higher frequency of the p.Asp614Gly alteration in non-Asian accessions (Fig. 3c) and the increase in prevalence of the p.Asp614Gly alteration over time (Fig. 3d).
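The weekly-prevalence calculation behind such a dynamic track can be sketched as follows. This is a minimal illustration, assuming that each strain is reduced to a collection date and a flag for whether it carries the alteration, and that weeks start on Monday. The function name and the toy records are hypothetical and do not reproduce the data in Fig. 3.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def weekly_prevalence(records: Iterable[Tuple[date, bool]]) -> Dict[date, float]:
    """Map each week (keyed by its Monday) to the fraction of strains carrying the mutation."""
    totals = defaultdict(int)
    mutated = defaultdict(int)
    for collected, has_mutation in records:
        week_start = collected - timedelta(days=collected.weekday())
        totals[week_start] += 1
        if has_mutation:
            mutated[week_start] += 1
    return {week: mutated[week] / totals[week] for week in sorted(totals)}


if __name__ == "__main__":
    # Toy records: (collection date, carries the alteration?)
    records = [
        (date(2020, 4, 27), True),
        (date(2020, 4, 29), True),
        (date(2020, 5, 3), False),
        (date(2020, 5, 4), True),
    ]
    for week, fraction in weekly_prevalence(records).items():
        print(week.isoformat(), round(fraction, 3))
```

Weeks with few sequenced strains give noisy fractions, so in practice the per-week strain count is worth displaying next to the prevalence.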
Using a similar approach, we were also able to confirm the recent observation of a heightened mutation rate at the amino terminus of the S protein among betacoronaviruses3. The genomes of pangolin and bat coronaviruses were aligned to the SARS-CoV-2 reference genome (Supplementary Note), and sequence variation was displayed in the browser in the form of SNV tracks and genome-comparison tracks (Fig. 4 and Supplementary Note). This view showed that the 5′ end of the S gene is highly divergent between the pangolin coronavirus and SARS-CoV-2 (Fig. 4). This finding was further supported by displaying raw next-generation sequence reads in bam format in the browser. Coronavirus reads from a pangolin viral metagenomic dataset18 were extracted and aligned to the SARS-CoV-2 reference genome (Supplementary Note). Most pangolin reads (1,468 of 2,288) aligned to the SARS-CoV-2 genome, and the average mismatch rate between pangolin-CoV and the SARS-CoV-2 sequences was only ~7% (Supplementary Note), thus indicating that the pangolin-CoV dataset is closely related to the SARS-CoV-2 dataset. Nevertheless, we found that no reads from the pangolin library aligned to the 5′ end of the S gene of SARS-CoV-2 (Fig. 4), in agreement with the observation that the N-terminal end of the S protein is one of the most divergent regions among betacoronaviruses3.
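The two quantities discussed above, the fraction of reads that align and the absence of coverage over a region, can be sketched in a few lines. This is an illustration under stated assumptions: reads are represented as hypothetical 0-based, half-open (start, end) intervals on the reference instead of being parsed from a BAM file, and the function names and region bounds are made up for the example.

```python
from typing import List, Tuple


def aligned_fraction(aligned_reads: int, total_reads: int) -> float:
    """Fraction of reads that aligned to the reference."""
    return aligned_reads / total_reads


def uncovered_positions(reads: List[Tuple[int, int]], start: int, end: int) -> int:
    """Count positions in [start, end) that no read interval overlaps."""
    covered = [False] * (end - start)
    for read_start, read_end in reads:
        # Clip each read to the region before marking coverage
        for pos in range(max(read_start, start), min(read_end, end)):
            covered[pos - start] = True
    return covered.count(False)


if __name__ == "__main__":
    # Read counts reported in the text: 1,468 of 2,288 pangolin reads aligned
    print(round(aligned_fraction(1468, 2288), 3))
    # Toy read intervals and a toy region, not real coordinates
    reads = [(0, 50), (40, 100), (200, 260)]
    print(uncovered_positions(reads, 90, 210))
```

A region in which every position is uncovered, despite good coverage elsewhere, is the pattern described for the 5′ end of the S gene.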
All analyzed viral sequences are available from the NCBI GenBank7 (https://www.ncbi.nlm.nih.gov/nuccore), GISAID8 (https://www.gisaid.org/) and Nextstrain9 (https://nextstrain.org/sars-cov-2) public repositories, with the exception of the pangolin viral metagenomic dataset18, which is available at NCBI BioProject (PRJNA573298). Additional datasets used to create public data hubs hosted in the browser are listed in Supplementary Table 1.
Notably, the UCSC SARS-CoV-2 Genome Browser has recently been developed in parallel to the work described here19, highlighting the need for comprehensive omic visualization resources as well as community interest and contribution. We hope that the WashU Virus Genome Browser will enable rapid sharing of processed data, facilitate collaboration and accelerate research on existing and novel pathogenic viruses. Moreover, the portable nature of the underlying technology enables us to swiftly spin up viral browser instances in response to other emerging zoonotic viruses. Our browser portal can be accessed at https://virusgateway.wustl.edu; documentation is available at https://virusgateway.readthedocs.io/; and general feedback, suggestions and bug reports may be sent to https://github.com/twlab/virusbrowser/issues.
de Wit, E., van Doremalen, N., Falzarano, D. & Munster, V. J. Nat. Rev. Microbiol. 14, 523–534 (2016).
Cui, J., Li, F. & Shi, Z. L. Nat. Rev. Microbiol. 17, 181–192 (2019).
Zhou, P. et al. Nature 579, 270–273 (2020).
Kim, D. et al. Cell 181, 914–921.e10 (2020).
Blanco-Melo, D. et al. Cell 181, 1036–1045.e9 (2020).
Bojkova, D. et al. Nature 583, 469–472 (2020).
NCBI Resource Coordinators. Nucleic Acids Res. 46, D8–D13 (2018).
Shu, Y. & McCauley, J. Eur. Surveill. 22, 30494 (2017).
Hadfield, J. et al. Bioinformatics 34, 4121–4123 (2018).
Li, D., Hsu, S., Purushotham, D., Sears, R. L. & Wang, T. Nucleic Acids Res. 47, W158–W165 (2019).
Zhou, X. et al. Nat. Biotechnol. 33, 345–346 (2015).
Zhou, X. et al. Nat. Methods 10, 375–376 (2013).
Zhou, X. et al. Nat. Methods 8, 989–990 (2011).
van der Made, C. I. et al. JAMA 324, 1–11 (2020).
Kent, W. J. et al. Genome Res. 12, 996–1006 (2002).
Vita, R. et al. Nucleic Acids Res. 47, D339–D343 (2019).
Korber, B. et al. Cell 182, 812–827.e19 (2020).
Liu, P., Chen, W. & Chen, J. P. Viruses 11, 979 (2019).
Fernandes, J. D. et al. Preprint at bioRxiv https://doi.org/10.1101/2020.05.04.075945 (2020).
J.A.F. is supported in part by the Siteman Cancer Center Precision Medicine Pathway (T32CA113275). X.Z. is supported in part by NIH 5R25DA027995. T.W. is supported by NIH grants R01HG007175, U24ES026699, U01CA200060, U01HG009391 and U41HG010972, and by American Cancer Society Research Scholar grant RSG-14-049-01-DMC.
The authors declare no competing interests.
Supplementary Figures 1–3, Notes 1–4 and Table 1
Linear antibody-binding epitopes curated by the Immune Epitope Database and Analysis Resource (IEDB) identified in SARS-CoV with amino acid sequences identical to predicted translated products from the SARS-CoV-2 reference.
Flynn, J.A., Purushotham, D., Choudhary, M.N.K. et al. Exploring the coronavirus pandemic with the WashU Virus Genome Browser. Nat Genet (2020). https://doi.org/10.1038/s41588-020-0697-z