In December 2019, a series of cases of pneumonia of unknown origin appeared in Wuhan, China and on 7 January 2020, the virus responsible for the disease was identified as a novel coronavirus, SARS-CoV-2 (ref. 1). The first SARS-CoV-2 genome was made publicly available on 10 January 2020 (refs. 2,3). Since then, the global scientific community, through an unprecedented effort, has sequenced and shared over 11 million genomes through GISAID as of May 2022 (ref. 4). To keep track of the evolving genetic diversity of SARS-CoV-2, Rambaut et al. developed a dynamic phylogeny-informed nomenclature (Pango) to classify SARS-CoV-2 lineages5. As of May 2022, over 2,000 lineages have been designated, which has enabled public health agencies such as Public Health England, the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO) to identify variants of concern (VOCs), variants of interest/variants under investigation (VOIs/VUIs) and variants under monitoring/variants being monitored (VUMs/VBMs) based on the phenotypical characterization of these lineages6. Currently, the only designated VOC is the B.1.1.529* (Omicron; * denotes the lineage and any of its sublineages) lineage, which exhibited very rapid growth and the ability to substantially avoid antibody neutralization7,8. Four VOCs were previously designated over the course of the pandemic: the B.1.1.7* (Alpha) lineage, exhibiting increased transmissibility9; the B.1.351* (Beta) lineage, exhibiting immune evasion10; the P.1* (Gamma) lineage, exhibiting immune evasion11; and the B.1.617.2* (Delta) lineage, exhibiting increased transmissibility due to the P681R mutation in the spike gene12.

The emergence of VOCs with fitness advantages has led to global ‘sweeps’, with newly emerged VOCs displacing previously circulating variants. More notably, the growth of each VOC has led to a renewed surge in infections worldwide. This has prompted the need for near real-time genomic surveillance to inform early public health interventions to control the rise of infections. In response to this need, thousands of academic, non-academic and public health laboratories have been depositing sequences predominantly on the sharing platform of the GISAID Initiative4,13. The extraordinarily high sampling rate of infecting viruses provides a unique opportunity to track the evolution of the virus in near real-time. For example, in December 2021 alone, over a million new genomes were submitted to GISAID14. Traditionally, phylodynamic approaches have been employed to retrospectively characterize lineage dynamics during outbreaks of viruses such as Zika15,16,17, West Nile18 and Ebola19,20. Existing tools such as NextStrain21 and frameworks such as Microreact22 primarily rely on a phylogeny to elucidate transmission chains and monitor the evolution of the virus. These tools, however, were not designed to track thousands of new genomes per day and given that building phylogenies for large sets of genomes is computationally intensive and time consuming, obtaining timely insights from the data is often problematic23; however, the high sampling rate of the virus has opened up the possibility of tracking the pandemic using available near real-time genomic data without the need for computationally intensive modeling.

Here, we present a platform that currently tracks over 40 million combinations of Pango lineages and individual mutations, across over 7,000 locations, to provide insights for researchers, public health officials and the general public. In the following sections, we describe the data pipelines that enable the scalable ingestion and standardization of heterogeneous data on SARS-CoV-2 variants, the server infrastructure that enables the dissemination of the processed data and the client-side applications that provide intuitive visualizations of the underlying data.


The growth rate of a given viral lineage is a function of epidemiology and its intrinsic biological properties (Fig. 1a). For example, the B.1.177 lineage, characterized by an A222V amino acid substitution in the spike gene, increased in prevalence in Europe during the summer of 2020 (ref. 24). While initially thought to be more transmissible, it was eventually shown that the increase in prevalence was due to a resurgence in travel and not due to increased transmissibility. In contrast, a few months later, the B.1.1.7 lineage was shown to be 40–60% more transmissible than previously circulating lineages and this intrinsic biological property led to the rapid growth in its prevalence worldwide25,26. Epidemiological factors such as mobility27,28, mask usage29 and public health interventions30 vary over time and across geographies worldwide, whereas biological properties are a function of the mutations found in a given lineage (Fig. 1a). Hence, to maximize the utility of genomic data for surveillance, we built the platform to enable the exploration of genomic data across three dimensions: geography, time and lineages/mutations. We use the Pango nomenclature to estimate the prevalence of SARS-CoV-2 lineages over time and at varying geographic scales. Using a phylogenetically informed nomenclature allows us to determine genetic features such as the ‘characteristic mutations’ of a lineage without directly building a global phylogeny. By avoiding a global phylogeny, we can update our databases daily using the continuously growing number of SARS-CoV-2 genomes. In addition, we closely track reports from health agencies such as Public Health England, CDC and WHO that designate VOCs/VOIs/VUMs based on epidemiological analyses.
In addition to genomic data, the server also ingests two other types of data: (1) epidemiological data curated by Johns Hopkins University31 and (2) public literature, clinical trial, protocol and dataset metadata from sources such as bioRxiv, medRxiv and LitCovid32. Here, we describe how each of these data sources is used collectively to assist in genomic surveillance.

Fig. 1: The platform enables the exploration of genomic data across three dimensions.

a, Growth rate of a lineage is a function of epidemiology and intrinsic biological properties of a lineage. Further, epidemiology varies over time and by geography, whereas intrinsic biological properties are determined by the mutations present in a given lineage. b, Genomic data are ingested from GISAID, processed using the custom-built data pipeline (Bjorn) and stored on a server that can be accessed via an API. The API is consumed by two clients: a JavaScript-based web client and an R package that provides programmatic access by authenticating against GISAID credentials. c, The web interface contains three tools that allow exploration of genomic data across three different dimensions: lineage/mutation, time and geography.

The overall workflow of genomic data is shown in Fig. 1b. Genomic data are ingested from GISAID, processed via a custom-built data pipeline, Bjorn, and stored on a server that can be accessed via an application programming interface (API). We built two client-side applications, a web interface and an R package that consume this API (Fig. 1b). The web interface consists of three main tools focusing on different facets of the underlying genomic data: (1) Lineage and/or Mutation Tracker, (2) Location Tracker and (3) Lineage Comparison Tool. We designed an interface for each tool that focuses on one primary dimension of the genomic data with additional customizability of one or more secondary dimensions (Fig. 1c). The Lineage and/or Mutation Tracker focuses on a specific lineage, mutation or a combination of these. The Location Tracker focuses on a given location and provides a snapshot of currently circulating lineages. Finally, the Lineage Comparison Tool can be used to explore the prevalence of mutations across different lineages. In addition to the web interface, we built an R package that authenticates against GISAID credentials and allows programmatic access to the processed data for downstream analyses.

Lineage and/or Mutation Tracker

The ongoing SARS-CoV-2 pandemic has been punctuated by the emergence of VOCs with fitness advantages over previously circulating variants, resulting in ‘waves’ of infections. Figure 2a shows the changing prevalence of the three most-dominant VOCs in the United Kingdom, but this phenomenon is observed globally with heterogeneity across geography. A fundamental part of genomic surveillance is to identify the emergence of such variants by closely tracking the growth of circulating lineages. Given the geographic variation in epidemiological, social and economic factors, it is important to estimate variant prevalence at varying geographic scales. The Lineage/Mutation Tracker can be used to dynamically query the temporal and geographic variation in the prevalence of a (1) VOC/VOI and its sublineages (for example, Delta and its sublineages); (2) a lineage (for example, B.1.1.7); (3) a lineage and one or more mutations (for example, B.1.1.7 with S:E484K); (4) a mutation (for example, S:E484K); or (5) a group of mutations (for example, S:E484K and S:N501Y) (Fig. 2b). When a mutation is part of a query, we also generate and display a link to Aquaria33 so that users can visualize the amino acid site in the context of the three-dimensional structure of the relevant protein. In addition to customizing the lineages and mutations, users can specify various location scales, such as a country, state or county (or their local equivalents), to estimate the prevalence of a given lineage and/or mutations. To provide meaningful insights from these prevalence estimates, we designed an interface to address a specific set of questions listed in Table 1.
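The five query types above all reduce to the same computation: among the sequences sampled in a location on a given day, the fraction that match the requested lineage and carry every requested mutation. The following is a minimal sketch of that idea, not the platform's implementation. The record layout (`date`, `country`, `lineage`, `mutations`) and the function name `daily_prevalence` are assumptions made for illustration, and lineage matching is exact, so sublineage expansion (the `*` notation) is not handled.

```python
from collections import defaultdict


def daily_prevalence(records, lineage=None, mutations=(), country=None):
    """Return {date: (matching, total, proportion)} for one query.

    Each record is a dict with hypothetical keys: 'date' (ISO string),
    'country', 'lineage' and 'mutations' (an iterable of strings such
    as 'S:E484K'). Passing lineage=None queries mutations only.
    """
    required = set(mutations)
    counts = defaultdict(lambda: [0, 0])  # date -> [matching, total]
    for rec in records:
        # Restrict the denominator to the requested location, if any.
        if country is not None and rec["country"] != country:
            continue
        counts[rec["date"]][1] += 1
        lineage_ok = lineage is None or rec["lineage"] == lineage
        # A record matches only if it carries all requested mutations.
        if lineage_ok and required <= set(rec["mutations"]):
            counts[rec["date"]][0] += 1
    return {d: (m, t, m / t) for d, (m, t) in sorted(counts.items())}
```

Calling `daily_prevalence(records, lineage="B.1.1.7", mutations=["S:E484K"], country="GBR")` corresponds to query type (3); omitting `lineage` gives types (4) and (5).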

Fig. 2: Lineage and/or Mutation Tracker.

a, Prevalence of VOCs in the United Kingdom from Sep 2020 to May 2022. The error bands show the 95% binomial proportion confidence interval calculated using the Jeffreys interval. b, Search and filter options for Lineage/Variant of Concern tracker. c, Prevalence of S:Y145H + S:A222V mutations across different lineages globally. d, Prevalence of BA.2 in the United Kingdom. The error bands show the 95% binomial proportion confidence interval calculated using the Jeffreys interval. e, Mutation map showing the characteristic mutations of AY.4. f, Summary statistics of BA.2 lineage. g, Geographic distribution of the cumulative prevalence of BA.2 lineage over the last 60 d globally. h, Cumulative prevalence of BA.2 in each country over the last 60 d globally. i, Research articles and datasets related to BA.2.

Fig. 3: Location report.

a, Relative prevalence of all lineages over time in South Africa. The total number of sequenced samples collected per day is shown in the bar chart below. b, Relative cumulative prevalence of all lineages over the last 60 d in South Africa. c, Mutation prevalence across the most prevalent lineages in South Africa over the last 60 d. d, Comparison of the prevalence of VOCs grouped by WHO classification: Alpha, Beta, Delta and Omicron over time in South Africa. The error bands show the 95% binomial proportion confidence interval calculated using the Jeffreys interval. e, Daily reported cases in South Africa are shown in the line chart.

Table 1 Questions addressed by the Lineage and/or Mutation Tracker

Location Tracker

Some variants of the virus have only been dominant in certain regions of the world. For example, the VOCs Beta and Gamma were dominant in South Africa10 and Brazil34, respectively. Similarly, B.1.621 (ref. 35) was only dominant in Colombia, A.2.5 was only dominant in Panama and B.1.177 exhibited a high growth rate only in European countries due to a resurgence of travel in the summer of 2020 (refs. 24,36). Factors such as the attack rate, population immunity due to previous infection or vaccination and social mobility vary by geographic region and have an impact on the growth rates exhibited by a given lineage. To account for such localized factors, it is important to have the ability to track the growth of lineages at different geographic scales. We built the Location Tracker to facilitate the surveillance of SARS-CoV-2 lineages at a country, state/province or county/city level. The Location Tracker provides a snapshot of circulating lineages with a focus on the last 60 d and allows users to compare the prevalence of a customizable set of lineages/mutations over time in that location. Furthermore, the tracker also integrates reported cases over time to provide insights on the impact of growth of various lineages on caseloads in the region. As with the Lineage/Mutation Tracker, we designed the user interface to answer a set of specific questions as shown in Table 2.

Table 2 Questions addressed by the Location Tracker

Case study: the platform as a hypothesis-generation tool

As the pandemic has continued to progress, we have seen the emergence of VOCs with fitness advantages that were able to outcompete previously circulating lineages. As of May 2022, there is one currently designated VOC (Omicron, B.1.1.529*; * denotes the lineage and its sublineages) and four previously designated VOCs: Alpha (B.1.1.7*), Beta (B.1.351*), Gamma (P.1*) and Delta (B.1.617.2*). Of these, Alpha, Beta and Gamma were estimated to have emerged between September and December 2020 (refs. 10,11,37) and were subsequently outcompeted globally by the Delta variant that was first detected in December 2020 (ref. 38). The Omicron lineage, first detected in November 2021 (ref. 7), was able to outcompete Delta and grew much more rapidly relative to previous VOCs during their emergence (Fig. 4a). Whereas Delta and Omicron variants exhibited high growth rates with little variation globally, Alpha continued to circulate at low prevalence in Brazil and South Africa, where Gamma and Beta variants were dominant, respectively (Fig. 4b,c). Additionally, the prevalence of sublineages within Delta and Omicron variants varies geographically. The Location Tracker can be used to track the growth of VOCs within a given location, thus facilitating the comparison of lineage growth rates across locations. The Location Tracker can also be used to track the relative prevalence of sublineages within these VOCs, shedding light on any geographic variation in these dynamics. Here, we examine trends in the prevalence of the five VOCs globally and highlight the geographic variation in growth rates of Alpha, Beta, Gamma, Delta and Omicron variants.

Fig. 4: Prevalence of VOCs Alpha, Beta, Gamma, Delta and Omicron lineages over time.

a–d, Prevalence worldwide (a), in South Africa (b), in Brazil (c) and in the United States (d). Error bands in a–d show 95% binomial proportion confidence intervals calculated using the Jeffreys interval. e–h, Lineages with a prevalence over 3% over the last 60 d in Denmark (e), United Kingdom (f), United States (g) and South Africa (h).

The earliest samples of the Alpha variant were sequenced in southern England in late September 2020 (ref. 37). There were multiple introductions of the lineage into the United States as early as late November26. The Alpha variant showed a transmission advantage of 40–50% in the United States26, in line with observations in the United Kingdom and the Netherlands. In the United States, Alpha was able to outcompete previously circulating lineages and continued to increase in prevalence until the introduction of the Delta variant around April 2021 (Fig. 4d). In contrast to the United States, the Alpha variant circulated at very low prevalence in Brazil, whereas the Gamma variant remained dominant in the country11 until the introduction of the Delta variant around April 2021 (Fig. 4c). Similarly, in South Africa, the Beta variant continued to spread until the emergence of the Delta variant and the Alpha variant never became dominant (Fig. 4b). While the Beta and Gamma variants were able to outcompete Alpha in South Africa and Brazil, respectively, Gamma only reached a maximum prevalence of 8% in the United States in May 2021 and Beta circulated at a prevalence of <1% (Fig. 4d). The growth of a lineage is determined by epidemiological factors such as the number of introductions and travel between locations, and by intrinsic biological properties such as transmission advantage or immune evasion. Both Beta and Gamma variants show varying degrees of immune evasion39. Regions of Brazil had attack rates as high as 75% in October 2020 (ref. 40), suggesting that immune evasion was the primary reason for the rapid growth of the P.1 lineage in Brazil. In contrast, states in the United States had an estimated attack rate between 0.1% and 16% in June 2020 (ref. 41).
Given this difference in attack rates, we can hypothesize that the intrinsic transmission advantage of the Alpha variant was able to outcompete the advantage conferred by immune evasion of Gamma in the United States, but the opposite was true in Brazil and South Africa. In all three countries, the introduction of the Delta lineage displaced previously circulating Alpha, Beta and/or Gamma lineages in the summer of 2021.

The Delta variant of SARS-CoV-2 was first detected in Maharashtra, India in December 2020 (ref. 38), has been shown to be 40%–60% more transmissible than Alpha42,43 and causes a reduction in vaccine efficacy relative to previously circulating lineages44. Vaccination campaigns against COVID-19 started in December 2020 and despite the progress of these campaigns45, the Delta variant continued to cause a renewed surge in infections globally. The Delta variant report, which can be accessed directly on the landing page of the Lineage and/or Mutation Tracker, can be used to understand the dynamics of its sublineages. Figure 4a shows the global prevalence of the Delta variant over time. This growth reflects the transmission advantage that Delta has over previously circulating lineages including VOCs Alpha, Beta and Gamma. As the Delta variant continued to spread, its genetic diversity increased and as of May 2022, over 200 sublineages of Delta have been designated46.

The Omicron variant was first detected in November 2021 by genomic surveillance teams in South Africa and Botswana. This variant was associated with a rapid resurgence of infections in Gauteng Province, South Africa and was designated a VOC by the WHO within 3 d of uploading the first genome7. The variant grew in prevalence very rapidly: within 3 weeks, the variant was detected in 87 countries and as of May 2022, Omicron has a prevalence of over 95% globally (Fig. 4a). While increased transmissibility confers a bigger fitness advantage compared to immune evasion when population immunity is low, the opposite is true as population immunity increases either due to vaccination or previous infection47. The Omicron variant was found to have a fivefold higher chance of reinfection compared to Delta48 and Omicron infections presented with a higher viral load than wild type but still lower than Delta49. As viral load is one of the determinants of transmissibility, this indicates that Omicron is intrinsically not as transmissible as Delta, but it exhibits better immune evasion. This combination gave Omicron a large fitness advantage over Delta as evidenced by its rapid growth rate worldwide (Fig. 4a). The continued spread of the variant has resulted in the emergence of many sublineages and as of May 2022, over 100 sublineages of Omicron have been designated. Notably, there is considerable geographic variation in the relative prevalence of newly designated sublineages such as BA.2.12.1, BA.4 and BA.5. While BA.2 continues to be the dominant sublineage within Omicron in countries such as Denmark and the United Kingdom (Fig. 4e,f), we see the BA.2.12.1 sublineage slowly displacing BA.2 in the United States (Fig. 4g). In South Africa, sublineages BA.4 and BA.5 have completely displaced the previously dominant BA.2 (Fig. 4h) and have led to another surge in reported cases (Fig. 3e). 
The three variants, BA.2.12.1, BA.4 and BA.5, have been shown to evade antibodies elicited by previous BA.1 infection in in vitro neutralization studies50,51. This observed escape was higher than what was observed for BA.2 (ref. 52), highlighting the possibility that these variants may lead to a renewed surge in infections as they continue to spread globally. While the growth of Alpha and Delta variants globally was driven primarily by higher intrinsic transmissibility, the growth of the new variants within Omicron seems to be driven primarily by enhanced immune evasion. The increasing prevalence of immunity due to vaccination or previous infection worldwide further supports this hypothesis.

This case study illustrates how the platform can be used not only to track and compare the prevalence of lineages across locations, but also to derive and support hypotheses regarding the complex interplay between epidemiology and the intrinsic phenotypic characteristics of emerging SARS-CoV-2 lineages as the virus continues to spread.


The Omicron variant, first detected in late November 2021, has outcompeted Delta and as of May 2022, it is the dominant lineage globally; however, it is important to note that regardless of how prevalent previously circulating VOCs were, all five VOCs emerged independent of each other. While the current hypothesis for the emergence of VOCs is prolonged virus evolution in a chronically infected individual53, we still lack a thorough understanding of this process. Given the underlying stochasticity of this process, predicting the emergence of a new VOC is not currently feasible. As a result, continued surveillance of all currently circulating lineages is of utmost importance to public health globally—particularly as SARS-CoV-2 continues to spread and evolve worldwide.

The global community has generated over 11 million genomes of SARS-CoV-2 as of May 2022, shared on platforms such as GISAID14. The availability of such a large volume of genomic data has led to the development of several applications that combine genomic data with associated datasets to track the evolution of variants and their properties such as immune evasion and transmission advantage. Databases such as the Stanford Coronavirus Antiviral & Resistance Database54 rely on manual curation of publications to identify and extract relevant datasets. Applications such as track the evolution of the virus by focusing on a specific set of VOCs/VOIs and key mutations. COG-UK Mutation Explorer55 on the other hand tracks the virus within a specific location. Most of these applications rely on varying degrees of manual and automated workflows.

In terms of the degree of automation and the generalized approach to tracking SARS-CoV-2 variants globally, the applications that are most comparable to our platform include COVID-19 CG56, CoV-Spectrum57 and BV-BRC SARS-CoV-2 Early Warning System58. While each of these applications has its own strengths, which can complement each other, we developed the platform to serve as a template for tracking the spread of any pathogen across varying geographic and temporal scales, worldwide and in near real-time. Our unique paradigm centralizes the computation of key statistics based on the analysis of disparate data streams.

We designed the server infrastructure with two goals in mind: scalability of the API as existing data sources increase in size and new data sources are incorporated, and reusability of the computed data by providing programmatic access through an R package (Extended Data Fig. 2). We separated our data ingestion pipelines from the server-side application so that new data sources could be easily incorporated by instantiating a new instance of the same server-side database and API. This approach differs from other existing applications and enables us to quickly incorporate and link metadata across existing data and new modes of surveillance such as the CDC’s National Wastewater Surveillance System59. Furthermore, the easy dissemination of any computed data via the R package enables registered GISAID users to not only recreate visualizations available on the web interface but to further interrogate and utilize the processed data for more sophisticated downstream analyses. Extensive documentation of the R package is available online. To the best of our knowledge, ours is the only application that has incorporated an authentication workflow to enable users to directly access and analyze the processed data. To maximize accessibility of these data, the web interface was designed to offer a high degree of customizability, allowing users to answer specific biological questions and use the platform as a hypothesis-generation tool.

The guiding principles for the web interface were interactivity via responsive user interface elements powered by a high-performance API and interpretability via intuitive visualization of data based on discussions with researchers, epidemiologists and public health officials. Besides informal discussions, we also conducted two usability tests to evaluate user experience and redesigned parts of the user interface based on the feedback we received (see Usability studies in Methods). This rigorous approach to designing the interface and the overarching guiding principles have enabled us to create an interface that can be effectively used by the general public in addition to public health officials and academic and non-academic researchers. The platform has been enabled by unprecedented global genomic sequencing efforts and we developed every element of the application to fully leverage this capacity; however, genomic sampling varies globally, with the vast majority of sequences coming from high-income countries; even within well-sampled regions, there is geographic and temporal variation13. To communicate the increased uncertainty due to low sampling, we calculate confidence intervals of estimates wherever applicable, provide histograms of sampling density and mask data when there are very few data points available. Sampling strategies for sequencing could also bias the prevalence of a given variant. For example, samples that are prescreened for variants such as B.1.1.7 (Alpha) or BA.1 (Omicron) that show S gene dropout on a widely used qPCR assay could artificially inflate the prevalence of these variants. Efforts to investigate specific outbreaks within a location or preferentially sampling individuals based on their travel history could also bias prevalence estimates.
The impact of such factors on prevalence estimates could be substantial, especially when the variant has very few sequences and the prevalence of the variant is low; however, this bias decreases as more sequencing data are made available. We communicate these limitations in interpreting our reports through a dedicated ‘caveats’ page on the website and have warnings linking to this page in relevant sections throughout the interface (Extended Data Fig. 3a,b). Limited sequencing data can also impact the ‘characteristic mutations’ of a variant, identified using the 75% threshold. Often, when a new lineage is designated, there are very few genomes assigned to it. In such cases, the 75% threshold might not be adequate to identify all the characteristic mutations because many positions are masked using Ns or the genomes are otherwise incomplete. We highlight this limitation in the ‘methods’ page of our website and have a link to this page above the ‘mutation map’ in the interface (Extended Data Fig. 3c). Further, we also display a warning regarding this limitation if a lineage has fewer than 1,000 sequences assigned to it (Extended Data Fig. 3d).
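To make the threshold-based definition concrete, the sketch below computes characteristic mutations for one lineage. It rests on assumptions not spelled out in this section: that the 75% threshold means a mutation is present in at least 75% of the sequences assigned to the lineage, and that each genome is represented as a set of mutation strings. The function name and the `min_sequences` parameter are hypothetical; the 1,000-sequence cutoff is the warning threshold mentioned above.

```python
from collections import Counter


def characteristic_mutations(genomes, threshold=0.75, min_sequences=1000):
    """Return (mutations, low_sample_warning) for one lineage.

    genomes: list of iterables of mutation strings, one per sequence
    assigned to the lineage (hypothetical representation).
    """
    n = len(genomes)
    if n == 0:
        return [], True
    # Count each mutation at most once per genome.
    counts = Counter(m for g in genomes for m in set(g))
    selected = sorted(m for m, c in counts.items() if c / n >= threshold)
    # Flag lineages with few sequences, where masked positions (Ns)
    # can push a true characteristic mutation below the threshold.
    return selected, n < min_sequences
```

With few genomes, a single masked position can remove a mutation from the result, which is why the second return value is worth surfacing alongside the list.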

Since its launch, the platform has been widely used not only by the scientific community but also as a source of information by a broader audience. On average, the website received over 270,000 page views per month, of which 75% were unique visitors. The average amount of time spent by users on a page was around 1 min. The API received an average of over 3 million unique hits per month, which included any requests from the website itself. The Lineage and/or Mutation Tracker, Location Tracker and the Lineage Comparison Tool accounted for 45%, 14% and 15% of overall traffic, respectively. The platform continues to provide a mechanism for researchers, epidemiologists and public health officials to easily track the growth of variants across any number of locations. The platform, backed by robust infrastructure, allows users to quickly access key statistics for known VOCs, emerging variants and any combination of mutations without having to run any time-consuming analyses. While these statistics are useful to monitor the evolution of the virus, there are several avenues to extend and improve it as a platform to comprehensively monitor the evolution of the virus. First among these is to use existing statistics to calculate new quantities such as genetic divergence and the growth rate of emerging variants to aid in the early detection of VOCs and VOIs. The sparse availability of genomes, especially for newly detected variants, requires methods that can provide reliable confidence intervals in addition to point estimates of such quantities. Given that VOCs and VOIs of SARS-CoV-2 were first detected in many different locations globally, including the United Kingdom, Brazil, South Africa and India, it is important to develop or incorporate new methods that can scale across thousands of locations and millions of combinations of lineages and mutations.
The second avenue to improve the platform is to leverage our ability to incorporate new types of data that could be analyzed in conjunction with sequencing data sampled in clinical settings. Such data sources include wastewater surveillance, which has emerged as an accurate and cost-efficient method for surveillance, and human mobility data that could be used to track the geographic spread of the virus. Finally, we aim to improve our API by adding the ability to construct complex queries. For instance, a query such as ‘CALCULATE prevalence FROM lineage:BA.1 NOT mutation:S:K417N AND location:USA’ would calculate prevalence of BA.1 genomes without the S:K417N mutation sampled within the United States. This would allow researchers to build queries in an easy and intuitive manner so they can focus on data exploration, hypothesis generation and complex downstream analyses. Beyond the SARS-CoV-2 pandemic, the platform serves as a model for providing scalable and reusable metrics to track the spread of any pathogen during an outbreak via interactive and interpretable visualizations.


Ingestion of genomic data

We built a data pipeline, Bjorn, to preprocess metadata and count mutations from a given set of genomes in a scalable manner on a daily basis (Extended Data Fig. 4). The pipeline consists of the following steps: (1) download SARS-CoV-2 genomes from the GISAID provision; (2) filter out records based on erroneous or incomplete genomes and associated metadata; (3) compare with cache to only process new and updated records; (4) divide sequences into chunks of 10,000 records (each chunk is subsequently processed in parallel using GNU parallel62); (5) align these sequences using minimap2 v.2.24 (ref. 63); (6) convert the alignment into a FASTA file using gofasta v.1.1.0 (ref. 64); (7) count mutations and deletions from this alignment; (8) standardize and filter the metadata: country, division, location, Pango lineage, date of collection and date of submission; and (9) combine results from all chunks and convert to a JSON Lines object. We standardized geographic identifiers using shapefiles from GADM. The final JSON Lines object was loaded into an Elasticsearch v.1.7 index within the BioThings framework v.0.11.1 (ref. 65). The source code for Bjorn, including a Docker container for easy deployment, is available online.
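Steps (3), (4) and (9) are generic bookkeeping and can be illustrated without the alignment tools. The sketch below is not Bjorn's code: the helper names, the `accession_id` key and the use of a SHA-256 fingerprint to detect updated records are all assumptions made for illustration. Only the chunk size of 10,000 and the JSON Lines output format come from the description above.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's content (assumed change-detection scheme)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def select_new_or_updated(records, cache):
    """Step (3): keep records that are absent from the cache or changed.

    cache maps a hypothetical 'accession_id' to the last seen fingerprint
    and is updated in place.
    """
    fresh = []
    for rec in records:
        fp = fingerprint(rec)
        if cache.get(rec["accession_id"]) != fp:
            fresh.append(rec)
            cache[rec["accession_id"]] = fp
    return fresh


def chunked(items, size=10_000):
    """Step (4): yield consecutive chunks of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_json_lines(results):
    """Step (9): serialize combined results, one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in results)
```

In a real run each chunk would be handed to a separate worker for steps (5) to (8); here the chunks are simply yielded in order.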

To exclude records with incomplete or erroneous genomes and associated metadata in step (2), we use the following conditions:

  • Length of the genome is less than 20,000 nucleotides or greater than 40,000 nucleotides.

  • Greater than one-eighth of nucleotides in the genome are masked using Ns.

  • The genome differs from the reference genome at more than 50% of nucleotides.

  • The genome has more than 500 insertions or deletions relative to the reference.

  • The collection date is after the submission date.

  • The submission date is in the future (beyond the current date).

  • The collection date is before November 2019.

  • The host is not human.

  • The location string cannot be matched to any country in the GADM shapefiles.
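The conditions above translate directly into a per-record check. The sketch below is illustrative only: the record layout (field names such as `n_count`, `mismatches` and `country_matched`) is hypothetical, the quantities are assumed to have been computed upstream, and the 50% divergence threshold is assumed to be relative to the genome length.

```python
from datetime import date


def exclusion_reasons(rec, today):
    """Return every exclusion condition the record meets; an empty list means 'keep'.

    All field names are hypothetical placeholders for values computed upstream.
    """
    reasons = []
    if not 20_000 <= rec["length"] <= 40_000:
        reasons.append("length")
    if rec["n_count"] > rec["length"] / 8:
        reasons.append("masked")
    if rec["mismatches"] > 0.5 * rec["length"]:  # assumed relative to genome length
        reasons.append("divergent")
    if rec["indels"] > 500:
        reasons.append("indels")
    if rec["collected"] > rec["submitted"]:
        reasons.append("collected_after_submitted")
    if rec["submitted"] > today:
        reasons.append("submitted_in_future")
    if rec["collected"] < date(2019, 11, 1):
        reasons.append("collected_too_early")
    if rec["host"].strip().lower() != "human":
        reasons.append("non_human_host")
    if not rec["country_matched"]:
        reasons.append("unmatched_location")
    return reasons


record = {
    "length": 29_800, "n_count": 100, "mismatches": 40, "indels": 3,
    "collected": date(2021, 3, 1), "submitted": date(2021, 3, 15),
    "host": "Human", "country_matched": True,
}
print(exclusion_reasons(record, today=date(2022, 5, 1)))
```

Returning the full list of reasons, rather than a single Boolean, makes it possible to tabulate why records were dropped in a given run.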

Ingestion of epidemiological data

We built the EpiData pipeline to ingest reported global cases and deaths from Johns Hopkins University (ref. 31). We used shapefiles from Natural Earth to standardize geographic identifiers and obtain populations for countries and states outside the United States. For the United States, we used county-level shapefiles and the 2019 population estimates from the Census Bureau to standardize geographic identifiers and obtain populations. We standardized reported date formats and geographic identifiers across the two resources. The code for the EpiData pipeline is available at

Calculation of confidence intervals on prevalence

Most of the prevalence estimates we report are binomial proportions. We calculated 95% confidence intervals for these estimates using the Jeffreys interval, whose bounds are the 0.025 and 0.975 quantiles of the beta distribution \(\mathrm{Beta}\left( {x + 0.5,n - x + 0.5} \right)\), where x is the number of successes and n is the number of trials.
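For readers who wish to reproduce these bounds, the sketch below computes the Jeffreys interval using only the Python standard library. It is an illustration, not the code used in our pipelines: the beta cumulative distribution function is evaluated with the standard continued-fraction expansion of the regularized incomplete beta function, and the quantiles are found by bisection. All function names are ours. In practice a statistics library that provides beta quantiles would be the simpler choice.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (
            m * (b - m) * x / ((qam + m2) * (a + m2)),           # even step
            -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2)),  # odd step
        ):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b), the CDF of Beta(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log1p(-x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def beta_quantile(p, a, b, iterations=100):
    """Invert the Beta(a, b) CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def jeffreys_interval(x, n, level=0.95):
    """Equal-tailed interval from Beta(x + 0.5, n - x + 0.5)."""
    alpha = 1.0 - level
    a, b = x + 0.5, n - x + 0.5
    return beta_quantile(alpha / 2.0, a, b), beta_quantile(1.0 - alpha / 2.0, a, b)


lower, upper = jeffreys_interval(50, 100)
print(f"50/100 sequences: 95% CI = ({lower:.4f}, {upper:.4f})")
```

Because the interval is read from a beta distribution on [0, 1], its bounds never fall outside the range of a valid proportion, even for counts of zero or n.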

Creation of API

To scale with the increasing size of existing data sources and the heterogeneity of newly emerging data sources, we used the BioThings framework (ref. 65). The JSON outputs of our data pipelines are ingested by the BioThings framework and the processed data are stored in individual Elasticsearch indices. A Tornado server is used to create API endpoints that leverage the search capabilities of Elasticsearch to perform complex aggregations of the underlying data. These API endpoints allow the client-side applications to query the underlying data within reasonable query times while accounting for the scale of the ingested data. The BioThings Hub maintains historical data by default, allowing us to roll back to previous data backups if issues are discovered with new data after they are deployed. The code for the server-side application is available at

Web application

The web application was built using Vue.js v.2.7.14, a model–view–viewmodel JavaScript framework that supports two-way binding between user interface elements and the underlying data, so that the user interface reflects any changes in the underlying data and vice versa. The client-side application uses the high-performance API to interactively perform operations on the database. Customized data visualizations on the client were built using D3.js v.5.16.0 (ref. 66), giving us the ability to develop novel and intuitive visual elements as part of the user interface. We designed these visualizations to answer specific questions of interest to epidemiologists, researchers and public health officials. We further added functionality to enable the one-click copy or download of every chart in the interface as a PNG or SVG. The code for the client-side application is available at

R package

We developed an R package to allow researchers and other individuals to easily access the data via the API for downstream analyses and visualizations. The R package is composed of three parts: functions to access the genomic data, functions to access the epidemiological data and functions to access the Research Library metadata. Each part is built around a base function that contains arguments for all possible parameters that can be used to query the API. While users can call this base function directly, several wrapper functions are available that inherit the arguments of the base function and add prespecified parameters to simplify the process of querying the API. For example, while getGenomicData() can be used directly to access the daily global prevalence of a specified lineage, doing so would require the user to be familiar with the name of the end point as specified in the API URL (in this case, global prevalence). Instead, the user can access these data with the more intuitively named getPrevalence(). These wrapper functions thus allow users to obtain the data they need easily and quickly. The R package also contains an authenticateUser() function that allows users to authenticate against their GISAID credentials and access computed statistics from the primary genomic data provided by GISAID.
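The base-and-wrapper design can be illustrated with a short sketch. It is written in Python rather than R and is not the package's code: the function names are snake_case analogues chosen for illustration, the end point string and parameter names are placeholders, and the base function only assembles a request description instead of calling the API.

```python
from functools import partial


def get_genomic_data(endpoint, **params):
    """Base function: accepts any query parameter and returns a request description.

    Hypothetical analogue of a base query function; no network call is made.
    """
    query = {key: value for key, value in params.items() if value is not None}
    return {"endpoint": endpoint, "params": query}


# Wrapper: presets the end point so that callers need not know its name.
# "global-prevalence" is a placeholder string, not a confirmed end point name.
get_prevalence = partial(get_genomic_data, "global-prevalence")

print(get_prevalence(pangolin_lineage="BA.1", location=None))
```

Because the wrapper forwards every keyword argument to the base function, a parameter added to the base function becomes available in all wrappers without further changes.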

In addition, because the API queries locations by ISO3 code rather than by location name, two functions allow users to forgo the step of searching for the ISO3 code themselves: getISO3Code() and getLocationIdGenomic(). The latter function uses the genomics API end point to obtain the ISO3 code for a given location. The ISO3 code can be obtained with either a full or incomplete location name; in the latter case, the user is provided with a list of matching locations and must specify the location they are interested in. This function is embedded in the parent getGenomicData() function and is therefore inherited by all wrapper functions. As a result, searching for data by location in the R package replicates the experience on the client-side web application. Documentation is available at with vignettes located at The R package can be downloaded and installed using the install_github() function of the remotes package: remotes::install_github("outbreak-info/R-outbreak-info").
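The lookup behavior described here (an exact name resolves directly, whereas an incomplete name returns candidates from which the user must choose) can be sketched as follows. This is an illustrative Python analogue and not the R package's code: the small in-memory table stands in for the genomics API end point, and the function names are hypothetical.

```python
# Hypothetical lookup table; the R package obtains matches from the API instead.
LOCATIONS = {
    "United States": "USA",
    "United Kingdom": "GBR",
    "United Arab Emirates": "ARE",
    "India": "IND",
}


def find_locations(name):
    """Return the exact match if there is one, else all names containing `name`."""
    needle = name.strip().lower()
    exact = [loc for loc in LOCATIONS if loc.lower() == needle]
    if exact:
        return exact
    return sorted(loc for loc in LOCATIONS if needle in loc.lower())


def get_iso3_code(name, choice=None):
    """Resolve a full or partial location name to its ISO3 code."""
    matches = find_locations(name)
    if not matches:
        raise LookupError(f"no location matches {name!r}")
    if len(matches) == 1:
        return LOCATIONS[matches[0]]
    if choice not in matches:
        # Ambiguous: the caller must pick one of the candidates.
        raise LookupError(f"{name!r} is ambiguous; choose one of {matches}")
    return LOCATIONS[choice]


print(get_iso3_code("india"))
print(find_locations("united"))
print(get_iso3_code("united", choice="United Kingdom"))
```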

Usability studies

Two remote moderated usability tests of the site were conducted between 4 March 2021 and 2 April 2021 over Zoom. These tests explored the ease with which researchers could use the site’s genomic tools to find information about SARS-CoV-2 lineages and mutations, including prevalence data, mutation characteristics and publications. In the first test, five biomedical researchers were recruited to complete a series of seven tasks using the Lineage and/or Mutation Tracker:

  1. Please export a visualization of the average daily prevalence of the B.1.1.7 lineage in the United Kingdom.

  2. Please use the site to find publications about the B.1.1.7 lineage.

  3. Please tell me which lineages are VOCs and how to distinguish the difference between VOCs and VOIs using the website.

  4. Please use the site to find which lineages contain the E484K mutation.

  5. Please use the site to find the total number of E484K sequences that have been detected in New York state.

  6. Please use the site to find the global prevalence of the P.2 lineage.

  7. Please use the site to find the global prevalence of the S:D614G mutation.

In the second test, five biomedical researchers were recruited to complete a series of five tasks using the location reports:

  1. Please find the prevalence of lineages in California. Subsequently, please find the prevalence of specific mutations (for example L452R) for the most common lineages in California.

  2. Please compare the prevalence of these lineages/mutations in California over time: B.1.1.7, B.1.427, B.1.429 and S:L452R.

  3. Please find cumulative B.1.429 prevalence by county in California.

  4. Please find the global prevalence of the B.1.526 lineage with the S:E484K mutation.

  5. Please find all publications related to B.1.1.7.

For both tests, five post-test questions were asked:

  1. Overall, please rate how easy or difficult it is to use this app on a scale of 1–5, where 1 is very difficult and 5 is very easy.

  2. How likely are you to continue using the Mutation Situation Reports to regularly access data or find information about the variants, on a scale of 1–5 where 1 is very unlikely and 5 is very likely?

  3. What do you like most about the Mutation Situation Reports?

  4. What would you improve about the app? Or what would you add to the app?

  5. How would you compare this site to other sites you’ve used to find information about the variants?

The following metrics were recorded:

  1. Ease and satisfaction about each task (five-point Likert scale).

  2. Time on each task.

  3. Number of successful task completions and errors.

  4. Overall ease and satisfaction (five-point Likert scale).

  5. Likelihood to use (five-point Likert scale).

  6. Suggestions for improvement (likes, dislikes and recommendations).

  7. Error-free rate.

  8. Noted observations about the users’ process.

The successful completion rate for both tests was 100%, although users made non-critical errors: the error-free rate for the first and second tests was 10% and 20%, respectively. The average amount of time spent on tasks was a little over 1 min. Users rated the overall ease of using the interface with an average score of 80% and all users responded that they were very likely to continue using the site. Users specifically liked the site’s interactivity, the visualizations and the wide range of features that could be found in one place.

Based on these usability tests, the following changes to the interface were made:

  1. New descriptions of report categories were added, as well as the option to filter VOCs and VOIs.

  2. The user interface to select mutations and lineage + mutation combinations was streamlined.

  3. A second search bar for finding location reports was added to the home page.

  4. New options were added to the navigation bar at the top of the interface.

  5. An FAQs page was created.

  6. A new site description and introductory video were added.

  7. Several other design changes were made to the home page to highlight major features.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.