Data-driven analysis of amino acid change dynamics timely reveals SARS-CoV-2 variant emergence

Since its emergence in late 2019, the diffusion of SARS-CoV-2 is associated with the evolution of its viral genome. The co-occurrence of specific amino acid changes, collectively named ‘virus variant’, requires scrutiny (as variants may hugely impact the agent’s transmission, pathogenesis, or antigenicity); variant evolution is studied using phylogenetics. Yet, never has this problem been tackled by digging into data with ad hoc analysis techniques. Here we show that the emergence of variants can in fact be traced through data-driven methods, further capitalizing on the value of large collections of SARS-CoV-2 sequences. For all countries with sufficient data, we compute weekly counts of amino acid changes, unveil time-varying clusters of changes with similar—rapidly growing—dynamics, and then follow their evolution. Our method succeeds in timely associating clusters to variants of interest/concern, provided their change composition is well characterized. This allows us to detect variants’ emergence, rise, peak, and eventual decline under competitive pressure of another variant. Our early warning system, exclusively relying on deposited sequences, shows the power of big data in this context, and concurs to calling for the wide spreading of public SARS-CoV-2 genome sequencing for improved surveillance and control of the COVID-19 pandemic.


Scientific Reports
| (2021) 11:21068 | https://doi.org/10.1038/s41598-021-00496-z www.nature.com/scientificreports/ that will eventually characterize a variant typically show an exponential-like temporal growth 29 ; one meaningful example is shown in Supplementary Fig. S1. We harnessed the peculiarity of these temporal trends to scout emerging variants. To that end, we applied standard time-series clustering techniques to group together changes showing similar prevalence behavior over a one-month-long period. We regarded as warnings of possible variant emergence those clusters characterized by (1) a positive trend in the prevalence time-series of the constituting amino acid changes, and (2) being sufficiently different from clusters that caused previous warnings (to avoid extracting twice a set describing the same candidate variant). To assess whether these clusters successfully match known variants (named 'hit'), we tested their similarity against the characterizing changes of the most widespread SARS-CoV-2 lineages, as defined by Pangolin, collected in the 'lineage dictionary' (see Supplementary Tables S1 and S2). The dictionary contains all Pangolin lineages that, at the calculation date, have appeared in at least 5000 sequences in the full dataset, with the exception of a few variants of interest highlighted by the news, for which we allowed a lower threshold of 1000 sequences. We also used community analysis to assess whether and how the emergence of new variants can lead to a reorganization in the dynamics of amino acid changes, as observed through their co-occurrence within clusters.
Our approach must not be considered an alternative to phylogenetic analysis, which takes into account the entire evolutionary history of the viral genome. Indeed, the phylogenetic approach works by accumulating individual sequences along the tree of the species, which is built incrementally. When a branch is particularly rich of sequences, it becomes a candidate variant to be investigated (see Supplementary Fig. S2). Our approach, instead, is purely big data-driven; single sequences are not considered (and are not even known). We study aggregate counts of sequences and perform a posteriori analysis (as hinted in Supplementary Fig. S3). Other methods have previously complemented phylogenesis, namely by describing typical SARS-CoV-2 mutational profiles across different countries and regions 30,31 , proposing statistical indicators for location-based mutation evolution 32 , and observing changes that become recurrently prevalent in different locations, thus suggesting selective advantages 2 . Time-based analyses have been considered for phenetic clustering of prevalent SARS-CoV-2 mutations over time 33,34 , trend detection in SARS-CoV-2 short nucleotide sequences 35 , and single amino acid changes 36 .

Searching variants by mining time-series
To show our methods in action, Fig. 1 displays the case of Japan, featuring a matrix of 594 amino acid changes observed over 63 weeks. We identify 132 clusters, ten of which are included in the early warning system, tracking the emergence of possible variants. Four of these clusters revealed to be early hits, i.e. showing strong clusterlineage similarity at the time of clustering. These hits occur on Jun-wk4-20, Aug-wk2-20, and March-wk3-21 (two instances), respectively associated with variants JP2, JP1, and JP3/Alpha. We use JP1, JP2, and JP3 for lineages B.1.1.214, B.1.1.284, and R.1 since, according to Pango lineages reports 37 , they originated in Japan. Each hit is qualified by the growth of the change prevalences in the identified cluster (left panels a-c) and by the similarity between cluster composition and the lineage dictionaries (central panels a-c). Partial overlaps of identified clusters with more than one dictionary are due to overlaps between dictionaries. We find that the emergence of a new variant is typically associated with a drastic reorganization of the within-cluster co-occurrence patterns of amino acid changes (right panels a-c), which may suggest profound effects of variant emergence. Panels (d-g) of Fig. 1 show the temporal dynamics of the changes associated with each identified variant. Strong clusterdictionary matches, illustrated by circle size and color, are found for all variants, spanning several months after the moments when warnings are issued. By contrasting variant dynamics with recorded cases (shaded bars), we observe that JP2, JP1, and Alpha prevalences peaked ahead of the first, second, and third COVID-19 waves in Japan. By contrast, JP3 shows a prevalence peak when reported cases were at a minimum. Taken all together, Fig. 1 constitutes a syntactic fingerprint of how variants have evolved in Japan. In addition to these four hits, our method identified other six candidates that could be rapidly dismissed, because clusters identified in subsequent weeks showed low similarity with the original candidate. Four of them were discarded two weeks after the warning, the other two within five weeks. Supplementary Fig. S4 shows all ten warnings. ; values J cd < 0.1 are not shown. Right panels: Community detection applied to the within-cluster change co-occurrence matrix evaluated over the period of time between the emergence of two subsequent variants. The resulting interaction network is plotted using a forcedirected layout 65 , with edge lengths inversely proportional to link weights. Colors indicate nodes belonging to communities that strongly match known lineages, other communities are in gray-scale shades. (d-g): Temporal dynamics of the four identified variants. In each scatter plot, the left y-axis value represents the average prevalence of the changes in the cluster-dictionary intersection, circle size represents the cluster-dictionary similarity ( J cd < 0.1 not shown), while circle color is proportional to the log-ratio between the number of changes in the cluster and in the dictionary. Warnings are marked with a labeled arrow. In the background, the number of reported COVID-19 infections (thousand cases, right axis) in Japan 66 . Plots were created using MATLAB R2021a (http:// www. mathw orks. com). All graphics were further processed using Adobe Illustrator 2021 (http:// www. adobe. com). www.nature.com/scientificreports/ matched the temporal evolution of the second COVID-19 wave in the country, and Gamma shortly anticipated the third large wave of infections and became locally dominant. Similarly, variants Kappa, Delta, and Beta, grew quite significantly before the second wave of cases in their respective countries of origin. Note that some weaker matches are found also before the relevant warnings, essentially because some amino acid changes included in the lineage dictionaries did indeed emerge individually well before the diffusion of variants. Also, note that the WHO-named variants Eta, Theta, and Lambda do not appear in this figure as their countries of origin did not meet our criteria for minimum data availability and were not included in the analysis. Figure 3 shows the temporal patterns of notable variants as identified by our method in Europe and in the US. For all European countries with at least 16k deposited sequences, panel (a) displays the growth curves of changes associated with the Alpha variant at the time when the relevant warnings were issued by our method. The spread of Alpha outside the UK was delayed by five to seven weeks; warnings concentrate on three consecutive weeks in late January (country shading). Interestingly, the growth curves of the change prevalences associated with the Alpha variant outside the UK are more noisy than inside the UK, likely due to availability of fewer sequences and spatial heterogeneities in sequence deposition.
Panel (b) of Fig. 3 shows instead the dynamics of the most widespread US-born SARS-CoV-2 variants in all US states where at least 5k sequences were deposited and their interplay with WHO-named variants. Our clustering and warning methods partitioned these states into three distinct, geographically quite compact, groups. A central/eastern group of states (blue) displays a baseline temporal pattern where common US variants first rose until Jan-wk1-2021, then started declining under pressure of variant Alpha. In the Western group of states (orange), the Californian-born variant Epsilon appeared in-between the US2 variant and Alpha, whereas in the north-eastern group (green) the New-York-born Iota appeared after the rise of Alpha, none of them reaching the same prevalence as Alpha. Although quite general, the patterns above described in groups may show countrydependent peculiarities. In Louisiana, for example, the hit for Epsilon followed the one for Alpha, and we did not issue warnings for Alpha in New York, nor for US1/US2 in New Jersey and Connecticut.

Discussion
We have presented a novel methodology to scout SARS-CoV-2 variants entirely based on the analysis of data from the GISAID sequence repository; specifically, we have shown that emerging lineages can be detected via cluster analyses of the prevalence time-series of amino acid changes. Our scouting method is not in competition with phylogenetic analysis, which is fundamental for properly defining variants as lineages through monitoring of the evolution of viral sequences. Nevertheless, the proposed big-data-driven approach proved to be quite effective.  www.nature.com/scientificreports/ First, the major WHO-named variants of SARS-CoV-2 were well and timely captured in their country of origin, an indication that the early dynamics of change prevalence is a distinguishing property of emerging variants. All hits were fast (requiring no more than five weeks, the length of the time window used for cluster analysis), except the hits of Beta (five plus two weeks) and Iota (five plus one week)-see Table 1. Second, our plots reveal that the method can effectively trace and visualize the variants' natural evolution. Variants emerge, reach their maximum prevalence, and eventually start a declining phase, typically when a competing variant emerges. Our method also issues warnings that are not found to match major lineages down the road. We do not see this outcome as an intrinsic limitation of our approach, though. In fact, even when not selected as hits, warnings are informative of the dynamics underlying variant emergence and inter-variant competition; in addition, similarity analysis can clarify, typically in a matter of a few weeks at most, whether an emerging variant is gaining traction. We performed our study retrospectively, without the aim of using our method for predicting the variants. Relying exclusively on big data for variant scouting is hampered by three problems: (1) consistency of sampling, (2) delay of deposition, and (3) biases in sampling. As for (1), the ratio between the number of sequences and reported COVID-19 cases widely varies among countries and drops from 77% in Iceland (clearly facilitated by small number of cases) to below 0.1% for many countries, also including some large ones like India and Brazil. Among the countries with lots of cases, the UK stands with an exceptionally high ratio exceeding 9%. Indeed, US and UK have contributed the largest number of sequences worldwide (517 k and 425 k, respectively). Supplementary Table S3 shows these statistics for all countries contributing to GISAID with more than 1000 sequences, whereas Supplementary Table S4 shows statistics for the 50 US states. Regarding (2), as an example, in the UK the average delay between collection and deposition amounts to 24 days. This delay tended to reduce as the pandemic unfolded, from 38 days in 2020 to just 16 days in 2021. Iceland is again striking the best performance, with 11 days of average delay in 2021. Concerning (3), heterogeneities in surveillance may introduce Being aware of these obstacles, the potential of developing and using big-data-driven approaches to keep track of variant emergence may be proved by comparing the timings of warnings issued by our method for the eight major WHO-named variants (as of June 3rd, 2021) with the dates of their initial communication in the media or in relevant scientific outlets (see again Table 1 and "Methods" section). In five cases (Alpha, Beta, Epsilon, Zeta, Kappa), our data-driven warnings anticipated the press notifications, in some cases (Beta, Epsilon, Kappa) with a lead time of several weeks. In other two cases (Delta, Iota), the warning would have been issued around the same time as the first communications. Only in the case of the Gamma variant our warning was late (by a couple of weeks) compared to its first communication. It must be remarked that Brazil, where both Gamma and Zeta variants originated, was sequencing a low number of samples at that time.
In conclusion, our warnings were issued at times which favourably compare with most press/research announcements, in some cases even anticipating them. The above figures illustrating variant hits and dynamics collectively provide an effective fingerprint of the location-specific temporal evolution of variants, and well support comparative displays in time and space. Our exercise thus strongly corroborates the urgent call, raising from the entire scientific community, for faster and wider public publishing of viral sequences 39 .

Methods
Data collection from the GISAID data source. Since January 5, 2020, when the complete genome sequence of SARS-CoV-2 was first released on GenBank (Access number: NC_045512.2), there has been a rapid accumulation of viral sequences. Many tools are available for quickly extracting the assignment of sequences to lineages or the prevalence of each amino acid change in specific geographical areas, including ViruSurf 40 , outbreak.info 41 , CoVariants 42 , and cov-lineages.org lineage report 37 . Here, we considered the proteinlevel mutations (hereon named 'amino acid changes') of 1,819,996 complete SARS-CoV-2 genome sequences from infected individuals from all over the world that were downloaded on June 3rd, 2021, using EpiCov TM data from the GISAID database (https:// www. gisaid. org/ 1 ) on the basis of a specific Data Connectivity Agreement. The original data file contains: sequence accession ID, collection date, submission date, Pangolin lineage, collection location, and the list of amino acid changes. All records belonging to non-human samples or not reporting the day (but only year, or year and month) in the collection date field were discarded, resulting in a dataset of 1,763,923 records (about 97% of the initial size). To denote amino acid changes, we adopt the common notation used by GISAID, representing them as the concatenation of the following elements: the protein acronym, the reference amino acid residue, the position (using the coordinate system of the single protein), and the alternative amino acid residue exhibited by the specific mutated sequence (in case of substitutions) or a dash (in case of deletions). Proteins include, for example, Spike, Envelope (E), Membrane (M), Nucleocapside (N), as well as non-structural proteins forming the ORF1ab polyprotein (NSP1, NSP2, ..., NSP16). Changes can occur at any point within a protein; they are evaluated with respect to the reference sequence WIV04 (used by GISAID 43 ). Lists of amino acid changes per sequence are extracted directly from GISAID, calculated by their internal pipelines based on the full consensus sequences received from submitters. GISAID has a rigorous submission process and performs checks both directly with submitters and internally, carefully processing their quality-related flags.

Temporal aggregation of sequence mutations. Original collection dates were temporally binned into
four approximately weekly periods per month, namely days 1-7 (shortcut as wk1), 8-15 (wk2), 16-23 (wk3), and 24-end of month (wk4). We verified that the small differences in bin sizes among the four periods do not have any impact on the results obtained using our data mining methods, as they rely on change prevalence (relative count of sequences) rather than change abundance (absolute count). We transformed the GISAID data into Table 1. Capture of the emergence of the major WHO-named SARS-CoV-2 variants. WHO variant name; its country of origin; hit date, corresponding to the warning date for early hits, extended by a short delay of two weeks for Beta and one week for Iota (late hits, see "Methods" section); first communication date retrieved from institutional or research outlets (see "Methods" section); temporal comparison between hit date and 1st communication (in weeks). www.nature.com/scientificreports/ a table containing tuples of the following kind: amino acid change, country of collection, week of collection, absolute count of sequences holding that change, and total collected sequences in the same country-week. As locations of interest, we selected all European countries and US states for which at least 16 k and 5 k sequences were available, respectively, and a small number of other countries where variants of concern/interest have originated according to the World Health Organization 16 . For each of those countries, we prepared a matrix where: each row r represents an amino acid change that has been observed in at least five sequences and for at least three weeks, each column c represents an observation week, each cell r, c contains the number of sequences exhibiting the amino acid change r observed in week c. The full account of matrix sizes for the selected countries is shown in Supplementary Table S5. We also extracted from the GISAID dataset the total number of sequences collected at given locations for each of the considered weekly periods. Data extraction and aggregation was performed using PostgreSQL (Version 12.5) and the Pandas library (Version 1.2.1) of Python (Version 3.8.5). The steps of data extraction and aggregation are summarized in Supplementary Fig. S3.
Construction of lineage dictionary. We extracted all lineages that are expressed in at least 5k sequences worldwide and a small number of lineages that deserved attention by the news and/or other online sources, provided they appeared in at least 1k sequences worldwide. The assignment of amino acid changes to lineages is not uniquely/uniformly defined by all sources; following the choice made by Mullen et al. 41 , we computed the set of characterizing changes, pragmatically defined as those that appear in at least 75% of the lineage sequences in GISAID, for all retained lineages. This rule produces partially overlapping lists of amino acid changes, with lengths ranging from five to 24, which we regard as our 'lineage dictionary' (Supplementary Table S2). To improve readability, the naming convention starts with the Pangolin lineage 13 , followed by the WHO name 16 (using Greek alphabet) when available. In selected cases, we add a geographical characterization, based on the country of origin (https:// cov-linea ges. org/ 37 ), resulting into US1/US2 and JP1/JP2/JP3 aliases. Only exceptionally we merged multiple lineages into one label, namely when lineages share the same set of characterizing amino acid changes. Supplementary Table S1 contains a comprehensive list of lineages considered in our study.
Cleaning of too frequent, confounding changes. We found that few changes (globally six, four worldwide-S_D614G, NSP12_P323L, N_R203K, and N_G204R-and two specific to North-America-NS3_Q57H and NSP2_T85I, out of a total of 78,650 recorded changes) are disproportionately frequent in the dataset of each continent, being present in more than 60% of GISAID sequences and populating the dictionary of more than 30% of the most frequent lineages (i.e. those represented by at least 80 sequences). To avoid confounding effects in data analysis, we removed these changes from both the aggregated data matrices and the dictionary entries.

Cluster analysis.
To assess whether the temporal patterns of changes' prevalence occur in a coherent manner in the dataset, as expected in the case of changes belonging to closely related variants, we apply a relatively simple time-series clustering algorithm. All the following analyses have been performed using MATLAB R2021a. At each time t, we retain the n(t) time-series of change prevalence (current ratio between change counts and total counts of changes) that are observed continuously over a time interval of four weeks prior to t ( [t − w + 1 · · · t] , with w = 5 ) and partition them via k-medoids clustering 44 (PAM algorithm, kmedoids function in MATLAB), with pairwise distances between time-series being evaluated via dynamic time warping 45 (dtw in MATLAB). The optimal value of k is exhaustively searched over the range [1 . . . min(20, n(t)/4)] and is selected as the one that maximizes the average silhouette score 46 evaluated over the entire set of n(t) change prevalence time-series.
Our early-warning system for variant emergence. To identify clusters of changes characterized by an increasing trend in their prevalence time-series, as expected in the case of emerging variants, we use the non-parametric Kendall's τ B statistic 47 , as implemented in the ktaub MATLAB package 48 , namely to evaluate the ordinal association between change prevalence and sampling time. More precisely, a cluster of changes is considered to deserve further attention as a candidate for variant emergence if its average prevalence time-series shows a positive trend ( τ B > 0 at significance level α = 0.05 ). Among these candidate clusters, we are particularly interested in those that are sufficiently different from previously observed trending clusters, because they could signal the emergence of a new virus variant. Indeed, the partitioning of change time-series is performed independently at each time step, thus the clusters identified at a given time are in principle not related to those identified at any previous step. However, because of the temporal autocorrelation of change prevalence dynamics, similarities are expected (and found) to exist in the composition of clusters identified at subsequent time steps. To quantify similarity for each pair of clusters, we use the classical Jaccard index 49 , defined as the ratio J cc between the cardinality of the intersection and the cardinality of the union of the sets of changes constituting the two clusters. Among the candidate clusters with a positive trend in their prevalence time-series, only those that are sufficiently different from any previous candidates ( J cc < 0.5 ) are selected to form an early warning system for monitoring the possible emergence of new variants.
Cluster-dictionary comparison. We use the similarity between the composition of early warning clusters and the lineage dictionaries to assign a posteriori the observed change time-series to known lineages, thereby building a ground truth for assessing the performance of data analysis. Specifically, we use again Jaccard similarity index, this time applied to cluster vs. dictionary change composition ( J cd ). A threshold J cd > 0.5 is used to identify clusters for which there is a close compositional match to a known lineage. If a match exists, we conclude that the method has identified the lineage (thus, the underlying variant) at the same time as when clustering occurred (early hit). If not, we keep monitoring the cluster-dictionary similarity J cd in the following time steps, each time using the cluster with the highest compositional similarity J cc with that of the previous step, pro- www.nature.com/scientificreports/ vided that the inter-step similarity remains sufficiently strong ( J cc > 0.5 ), until the threshold J cd = 0.5 is possibly exceeded (late hit). If the inter-step similarity test fails without producing a late hit, the candidate cluster causing the early warning is discarded. Rapid convergence towards a decision between late hit or candidate discarding constitutes a desirable property of our method. The study of the changes associated with the successfully identified variants can be deepened by analyzing their temporal dynamics along with the country-specific epidemiological patterns of the pandemics. Specifically, we match the dictionary composition of each identified variant to cluster composition over time and evaluate: (1) the average prevalence at the time of clustering of the changes in the intersection between cluster composition and lineage dictionary; (2) the cluster-dictionary similarity J cd ; and (3) the log-ratio between cluster and dictionary cardinalities. This methodology is applied to the whole dataset, i.e. also prior to the emergence of variants or after their possible disappearance; in fact, some changes belonging to a lineage dictionary may be individually present in the population before/after the diffusion of the relevant variant.
Assessment of the early warning system. We collected information about the first Institutional Communications (ICs), Research Communications (RCs) captured at their first version, and Published Papers (PPs) about the most important WHO-named variants to assess whether the points in time when warnings are issued by our method compare well with the points in time when variants actually became known. The following references provide the dates indicated in Table 1: • Alpha: IC on 18-Dec-20 22 18 . • Gamma: IC on 06-Jan-21 53 (National Institute of Infectious Diseases of Japan); RCs on 11-Jan-21 54 and 12-Jan-21 55 (two independent posts on Virological.org forum); PP by Naveca et al. 11 . • Delta: IC on 21-Apr-21 22 (Technical Briefing 10 of PHE); RC on 28-Jun-21 19 .
• Iota: IC on 10-Mar-21 22 (Technical Briefing 7 of PHE); RCs on 15-Feb-21 24 by Caltech group and 25-Feb-21 59 by Columbia University group. Community analysis. The points in time when lineage dictionaries are for the first time successfully ( J cd > 0.5 ) matched to cluster composition are used to partition temporally the dataset of change prevalence. For each resulting time window, we explore the within-cluster co-occurrence dynamics of the different changes by building a weighted, undirected graph where the nodes are the different changes and the edges represent pairwise interactions between changes; edge labels are evaluated as the number of clusters in which two changes occur together 62 . To assess whether changes can be grouped into densely connected sets, thereby demonstrating robust co-occurrence patterns within the clusters identified over the time window of interest, we apply the Louvain method of community detection 63 using the MATLAB function GCModulMax1 (Community Detection Toolbox 64 ). The change compositions of the resulting communities are compared with the lineage dictionaries by using again Jaccard index.

Data availability
All employed data matrices (one for each country or US state), the lineage dictionary, and custom scripts to reproduce the results presented in this study, have been deposited in Zenodo at https:// doi. org/ 10. 5281/ zenodo. 50901 69. Original sequence metadata and amino acid changes are publicly accessible through the GISAID platform.