Analysis and forecasting of global real time RT-PCR primers and probes for SARS-CoV-2

Rapid tests for active SARS-CoV-2 infections rely on reverse transcription polymerase chain reaction (RT-PCR). RT-PCR uses reverse transcription of RNA into complementary DNA (cDNA) and amplification of specific DNA (primer and probe) targets using polymerase chain reaction (PCR). The technology makes rapid and specific identification of the virus possible based on sequence homology of nucleic acid sequence and is much faster than tissue culture or animal cell models. However the technique can lose sensitivity over time as the virus evolves and the target sequences diverge from the selective primer sequences. Different primer sequences have been adopted in different geographic regions. As we rely on these existing RT-PCR primers to track and manage the spread of the Coronavirus, it is imperative to understand how SARS-CoV-2 mutations, over time and geographically, diverge from existing primers used today. In this study, we analyze the performance of the SARS-CoV-2 primers in use today by measuring the number of mismatches between primer sequence and genome targets over time and spatially. We find that there is a growing number of mismatches, an increase by 2% per month, as well as a high specificity of virus based on geographic location.

Deltacoronavirus. This test resulted in 0 matches, providing evidence for the specificity of the test sets to the Betacoronavirus genus, as no test matched the other coronaviruses. These three other genera of coronavirus are, by sequence, the closest in relation to betacoronavirus, and thus to the SARS-CoV-2 genome. Therefore, these results suggest a low probability in false positives for all the global primer-probe sets. Table 2 shows the percent of genomes hit by each PCR test, labelled by the country and target gene region. The America RP is an additional housekeeping primer/probe set to detect the human RNase P gene to control for non-viral genes in the sample, and therefore, as expected, 0% of the SARS-CoV-2 genomes match with this set. Each primer-probe set will produce a positive assay for at least 95% of the genomes within the test set, suggesting that all the global primer-probe sets function similarly. However, when we study instead the number of mismatches for each primer-probe set and the hit genomes, we can see a disparity in the test sets. The Japan|NIID 2019-nCOV N primer-probe set has 1 mismatch with 99% of the genomes and the Germany|RdRP-SARSr set has 3 mismatches with 99% of genomes. Both these tests have at least 1 mismatch with every genome. A correction for the Japanese set was published, but this consistent mismatch persists with the corrected set as well. Several of the tests contain degenerate nucleotides in order to allow for a larger set of possible positive assays, but this could have the oppostite effect. For example, the Germany|RdRP-SARSr reverse sequence contains an S, which denotes either an C or G, but every genome in the corpus contains a T at this locus, which causes the mismatch with every genome. Figure 1 shows the number of mismatches for all genomes created by each PCR, where we can see the range varying from 1796, created by the American N1 primer, to 42 mismatches, created by the French IP2 primer. Thus we observe that the measure of mismatches can be used as a proxy to identify the variation found within the specific gene fragments targeted the tests.

Time analysis
Following the methods described in "Time analysis methods" section, all genomes that fall within the 207 day range are segmented by date of collection and analyzed for mismatches to the various primer-probe sets. Using the metadata provided by GISAID, we were able to recover the date of collection for 50,687 genomes from the corpus. Figure 2 shows the average number of mismatches seen for all primers each day within this range, normalized by the number of genomes sampled in each day. From this analysis, we can see an average of 1.1 mismatches, with a 14% increase in mismatches over the 207 day time range. This corresponds to a ∼ 2% increase per month. To estimate the mutation rate, from Fig. 2, we calculate the best-fit line using least squares, which results in an R 2 value of 0.6. This mutation rate is consistent with the expected rate of mutation of the SARS-CoV-2 virus 1-4 . Thus on average, we can conclude that the regions targetted by the primer-probe sets undergo a similar mutation probability as found across the entire viral genome. Figure 3 shows the distribution of total mismatches and time averaged mismatches for each primer set over the defined time interval. Each violin shows the distribution of the mismatches, total and normalized, for each day and is colored by the gene targetted by that primer-probe set. These results suggest that the tests that target the nucleoprotein exhibit a larger distribution of total and normalized mismatches. However, it is important that it is the China and American tests only that are contributing to this larger distribution, while the Hong Kong and Thailand tests, which also target nucleoprotein, are consist with the other gene targets.
It is important to note that the total number of mismatches occurring is increasing and that many of these mismatches are being sustained in the evolving population. In order to identify a trend, genomes that occur close in time should have smaller change in mismatches than genomes that occur further apart in time. Figure 4 shows this comparison between delta time and delta mismatches for every pair of genomes for the France PCR targeting the RdRP gene (IP4). The graphs for the other PCRs may be found in the supplemental files. Each Table 2. Percent of genomes that are hit by the described PCR test, identified by the country and target gene. *Indicates that the primer is designed to separate the any errant samples within the assay. www.nature.com/scientificreports/ point represents a pairwise comparison of the difference in mismatch plotted over the difference in time. We observe that the delta mismatches grows in variance as the genomes occur further apart in time. Furthermore, the Pearson coefficient is 0.99 between mismatches and the number of genomes sampled in a day for each PCR. This positive linear relationship between the number of genomes and the number of mismatches per day shows that the mismatches occur uniformly across the genomes sampled within a day (rather than a few genomes creating noise in the signal). The data indicates that the virus demonstrated sequence variability in the targeted gene regions and that this variability causes sequence mismatches to increase over time.

Geographical analysis
Geographical stratification is occurring as the SARS-CoV-2 mutates within each geographic location. Following the methods described in "Geographical analysis methods" section, geospatial analysis is conducted to identify patterns in mismatches found in genomes sequenced within versus outside the country of primer origin. Figure 5 shows that number of mismatches segmented by genomes that occur within the test's country of origin and outside the test's country of origin. We normalize the number of mismatches by the number of genomes seen within each category, in order to account for the bias in amount of sequencing performed various regions. For the majority of countries, the number of mismatches in the country is lower than the number of mismatches that occur with genomes sampled outside of the country. This shows that the virus displays localized tendencies within the targeted gene regions, in addition to the spike glycoprotein region that defines the common clade analysis. The two outliers, the Hong Kong and France primers, show a higher percent of mismatches within the country rather than from different countries. This however may be due to the bias in data, as we have the least number of sequences from within Hong Kong, resulting in a higher proportion of in-country mismatches. For the French primer, they perform among the best, with less than 50 mismatches in total, so we can conclude that these mismatches are negligible. We can also analyze how the country's primer-probe set change in performance over time, as they experience different variants emerging at different time points within the pandemic. Figure 6 shows the average number www.nature.com/scientificreports/ of mismatches over time, grouped by the genomes sampled within and outside the country, for one American primer-probe. While the in-country average number of mismatches shows low variability, the out-country average number of mismatches show an increasing diversity in these targeted regions. This elucidates the increasing stratification of the virus depending on the location in which it is replicating. We can see the number of mismatches within the country growing linearly, while the number of mismatches outside the countries have much larger spikes. This is evidence that the high geospatial localization of this virus is reflected in the gene regions targeted by the primer-probe sets, and must be considered when administering the tests globally. The full set of graphs for each primer-probe set tested are available in the supplement.

Clade analysis
Geospatial analysis leads to and overlaps with clade definitions, as the clades result in variants emerging in various localities. We can thus examine the number of mismatches segmented by the clades, as defined by Nextstrain, in order to determine if primer-probe sets are skewed in performance to a particular clade. 7125 genomes were placed in Clade 19A, 3706 genomes were placed in Clade 19B, 13437 genomes were placed in Clade 20A, 14,190 genomes were placed in Clade 20B, and 8852 genomes were placed in Clade 20C. Figure 7 shows the number of mismatches for each test per clade, normalized by the number of genomes in clade. This shows definite trends which confirm the geographic specificity of the virus; for example, the American nucleoprotein primers have the highest number of mismatches for clade 19A, which Nextstrain defines as originating from predominantly Asian genomes, while the Chinese primer has the lowest number of mismatches for this clade.
It is important to consider what mutations and gene regions define the clades, as this will impact the significance of a disproportionate number of mismatches. The clades are defined by specific mutations at nucleotide locations, detailed by Nextstrain documentation, which only overlaps with the primer binding regions for 3.37% of the genomes. 58% of the genomes in Clade 20B have a mismatch in the region defining Clade 20B and 1% of the genomes in Clade 19B have a mismatch in the region defining Calde 19B. The genomes in the other clades had no mismatches in the corresponding genomic regions. Therefore, the relationship between the primer mismatches and the genome clades are correlational rather than causational. Thus, the mismatches created by a primer-probe set do not define the clade, but rather are a result of mutations at different locations on the genome that is also being persisted. From this, we provide further evidence that the entire genome is undergoing mutation, and that the targeted gene regions are not resistant to variation.

Discussion
By taking a global perspective on both the SARS-CoV-2 genomes and the common RT-PCR protocols, we are able to highlight important trends within the data. We observe an increasing number of mismatches between the primer and target genome sequence as time progresses. We can also see that the number of mismatches is higher when we compare genomes sampled outside of the country that designed the test compared to within the country. While these metrics do not quantify the performance of the test, they demonstrate a growing divergence between the targeted gene sequences and the test primers. As shown by D. Bru et al. 12 , a single mutation can result in an underestimation of the gene copy number by up to 1000-fold. Our results reveal, today, an average of 1.1 mismatches between the primer and target sequences, with a growth of 2% each month. Understanding copy number is critical to correct interpretation of a PCR assay.  www.nature.com/scientificreports/ If the genome being tested has sufficient mismatches this can lead to an erroneous copy number and, therefore, a misinterpretation of the assay result. In the case of SARS-CoV-2, for each targeted gene sequence, there are at least 10 different sequence variants and with this sequence diversity of the targeted genes, the mismatches in PCR primers may not be amplifying each example at the same rate, leading to false negatives. The given primers average a base length of 20 primers, and it has been demonstrated for primers with such base pair length that 2 to 3 mismatches reduces the yield by approximately 22 percent 13 . Our data indicates that this level of mismatches will be reached within 19 months or fewer if the rate of infection, and thus mutation, increases significantly. The results of this study also demonstrate that each primer target develops a different number of mismatches over time (see: Fig. 3). From the total number of mismatches created by primer target, we can see that the nucleoprotein targets from America, China, Hong Kong, and Thailand develop the greatest number of mismatches. Furthermore, when looking at the distribution of average number of mismatches over time, the primers targeting nucleoprotein have the largest distribution. As shown by Carter et al., there is a benefit to targetting the nucleoprotein, as it is one of the larger proteins within the genome 14 . However, this work aims to highlight the potential need for eventual redesign of primer-probe sets as nucleoprotein increases in mismatches with the current tests. The results indicate that primers targetting the envelope small membrane protein and the RNAdependent RNA polymerase are the most resistant to mismatches. This may suggest more stable targets for future primer test designs.
The mutations that lead to mismatches between gene PCR primers and their targets reflect the sequence evolution of the virus. Comparing the difference in time of collection of two genomes with the number of mismatches by which they differ shows evidence for this evolution (Fig. 4). Genomes that occur on the same day (delta time = 0) have approximately zero difference, while genomes that occur at delta time = 100 [days] have an average of 0.01 mismatches per nucleotide. This is consistent with the observed increasing number of mismatches over time, and shows that evolution of SARS-CoV-2 genomes is being sustained.
The continual branching of the genetic tree due to mutation is further supported by the analysis of the number of mutations within and outside the country that designed the particular primer. Figure 5 shows that most countries primers perform better when tested against genomes sequenced within the country rather than globally sequences genomes. In two cases, Hong Kong and France, the primers have a smaller percent of mismatches with genomes outside the country. For France, the IP2, a region of the RdRp gene, primer target creates a disproportionate number of mismatches when compared to genomes sequenced within France. This suggests that this region of the genome has deviated more from the original reference used to generate the primer set. For Hong Kong, they have the least number of genomes sequenced within the country in this dataset, so it is possible that the larger percent of mismatches for genomes within versus outside the country is an artifact of bias in data.
Nextstrain categorizes the various genetic phylogenies by clade, which is designed to denote long-term genetic changes based on mutation. Each clade defined requires significant geographical and frequency. This study shows that less than 3.5% of the regions on the genome that define the clades overlap with the region that the primers With the emergence of specific mutations that are spreading at faster rates, this analysis becomes more important in evaluating the possible need for primer re-design. As shown with emerging tools, such as PrimerScan and CoV-GLUE, the community is supporting the findings that as the virus mutates, the primer-probe sets must be kept up-to-date. This work further elucidates this as we perform statistical analysis on this data, showing the increasing rate of mismatches across time and location and that these mismatches are being sustained as the virus spreads. The emergence of the B.1.1.7 strain contains mutation in the regions encoding for the envelope small membrane protein and the nucleoprotein, both targeted by the current primers. With the number of cases of SARS-CoV-2 globally, it is highly probable that the genome will mutate in the primer target regions.

Methods
Data description. GISAID has emerged as a leading source of SARS-CoV-2 genomes, containing the largest number of genomes sequences around the world with metadata about the location and time of collection 15 . SARS-CoV-2 genomes from the GISAID repository were curated, collecting high quality genomes within the date range Aug 24, 2017-July 31, 2020 16 . While this date range precedes the start of the current outbreak, the genome sequences from the earlier points and time serve as a control for comparison. We define high quality genomes as those with less than 1% N within the sequence and less 0.05% unique non-synonymous mutation. By taking these measures, we reduce the noise generated from random mutations or sequencing errors found within the genome. This resulted in a set of 61,996 SARS-CoV-2 genomes, for which we evaluated primer homology.
The WHO has published primers from six countries -China, France, USA, Japan, Germany, Hong Kong, and Thailand 5 . Each protocol published is a RT-PCR assay method, and for each primer set, a forward, reverse and probe sequence is provided 5 . For this study, we use the sequences as provided with no modifications made. PCR primer comparison. Using the primer sequences and SARS-CoV-2 genomes described above, we perform a sequence comparison. Specifically, we used BLASTN with parameters similar to Primer-BLAST 17 . This procedure was verified to account for full alignments of the forward, reverse, and probe sequences of primers 18 . The BLAST results are then parsed, ensuring that the forward, reverse, and probe sequences match www.nature.com/scientificreports/ a given genome and that the probe sequence is matched spatially in the forward and reverse directions on the genome, and the number of mismatches is aggregated for each PCR sequence and genome. This metric does not necessarily predict whether the PCR test would generate a positive or negative outcome for the particular genome, but rather measures variability within the targeted gene region. Since all genomes included in this corpus are associated with SARS-CoV-2, its can be assumed that they were collected by a positive assay. Mutations in the targeted gene region, over time, can affect the sensitivity of the primers.
Time analysis methods. For each regional test, the primers each target a particular section of the genome derived from various reference genomes. However, as replication and mutation of the virus occurs, these targeted regions of circulating virus genomes accumulate sequence differences from the reference. Thus, the efficacy of the primer may decrease over time. As more mutations accumulate, it is important to measure the rate of mismatch growth between primer sequence and targeted section as a function of time. From this rate it is possible to anticipate when target sequences used in a regional test should be updated. To estimate the mutation rate of the targeted genes over time, we group the genomes by their date of sampling and aggregate the number of mismatches for each day. In order to reduce noise from days with few genomes collected, for any time-based analysis, we consider only those days that have over 100 unique genomes sequenced. With this restriction data is available for a time range between Jan 1, 2020-July 25, 2020, for a total of 207 days. This process removes outlier data that was sequenced prior to the start of the pandemic, including sequences that were collected from non-human hosts.

Geographical analysis methods.
As the virus has spread throughout the world, we see particular mutations that are specific to outbreaks by geospatial location. As studies using Bayesian coalescent analysis have shown, high evolutionary rates and fast population growth of the SARS-CoV-2 virus results in increasing diversification of the virus by geographic location 19 . To understand how the PCR tests respond differently for genomes collected by country, we first extract the country of sampling for each genome from the fasta header provided by GISAID and then group the number of mismatches found in the genome by in country versus out of country.
Clade analysis methods. SARS-CoV-2 genomes have been categorized into clades to define groups of mutations. For this analysis, we use the clades as indicated by NextStrain, which are defined by frequency and geographic spread. Their script to categorize genomes within the specific clade definitions was used to classify each genome within the dataset 20 . Furthermore, NextStrain publishes the genome locus that defines each clade, and these loci were compared to the genome location the primer targets bind to. By grouping the number of mismatches for each PCR by the genomes' clade we see how different genetic variations affect the PCR test performance.