Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was identified in December 2019 in Wuhan, China1, as a new strain of coronavirus that causes COVID-19 disease2. Since then, COVID-19 has evolved into a pandemic with nearly 540 million confirmed cases and over 6.3 million deaths worldwide3, as of June 2022. Although now vaccines are available for SARS-CoV-2, the outbreak still represents a massive challenge because of the characteristics of the COVID-19 disease, e.g. long incubation period, wide range of symptoms, high infection rate, high false negative rate (FNR) in detection tests and high mutation rate4.

An early estimate of SARS-CoV-2 mutation rate is \(1.12\times 10^{-3}\) mutations per site-year5. The high mutation rate of SARS-CoV-2, resulted in different variants, where some of them showed a higher rate of transmissibility, virulence, clinical presentation, mortality and/or vaccine/therapeutics resistance6. Given their acquired mutations, and aforementioned characteristics some variants were deemed variants of concern (VOCs) by the WHO. The previously circulating VOCs were B.1.1.7 (Alpha), B.1.351 (Beta), P.1 (Gamma) and B.1.617.2 (Delta) in Pango lineage7. These were designated as previous VOCs on 09-Mar-2022 (B.1.1.7, B.1.351 and P.1) and 7-Jun-2022 (B.1.617.2). At present the dominant VOC circulating in the world stems from B.1.1.529 (Omicron) designated on 26-Nov-2021, including the sublineages BA.1, BA.2, BA.3, BA.4 and BA.58. The initial Omicron outbreak was caused by BA.1. More than 60 non-synonymous mutations were discovered in the BA.1 and BA.2 variants, according to a whole-genome sequencing analysis, including base substitutions, insertions, and deletions.

During a pandemic, having access to a precise diagnosis tool can help decision-makers take appropriate measures, e.g. isolation, monitoring and quarantine of patients to reduce infections9. Nevertheless, given the magnitude of the outbreak, health systems and health workers found themselves overwhelmed by the number of patients needed to be tested. As a possible solution, several research lines proposed the use of artificial intelligence (AI) to automate the detection of infections10. Despite the great potential of AI, a recent meta-analysis of these methodologies strikingly found that classical methods, such as Quantitative reverse transcription PCR (RT-qPCR), have been more accurate during the COVID-19 pandemic, than AI-based detection techniques, e.g. diagnosis based on imaging11.

RT-qPCR is both an effective tool to identify SARS-CoV-2 infections, and the most widespread approach to diagnose COVID-19 in symptomatic and asymptomatic patients12. In summary, the methodology requires a sample from the patient, and then a specific sequence of the viral genome is targeted. If the sample contains the expected sequence, then this will start the polymerase chain reaction to amplify the virus-specific sequence and thereby the virus will be detected. Nevertheless, the high mutation rate of SARS-CoV-2, made it difficult to design tests for each of the VOCs. For example, VOC B.1.1.7 was identified through an increase in the S-gene target failure in a three-target gene assay (N+, ORF1ab+, S−), coupled with sequencing of the virus and RT-qPCR amplicons products13. The S-gene target failure occurs when one of the RT-qPCR probes fails to bind, as a result of the \(\bigtriangleup \)69−70 deletion in the SARS-CoV-2 spike protein, present in B.1.1.713. This \(\bigtriangleup \)69−70 deletion, which affects its N-terminal domain, has been occurring in several different SARS-CoV-2 variants around the world14,15 and has been associated with other spike protein receptor binding domain changes16. This is consistent with other existing primer designs like CoV2R-3 in the S-gene17, that will also yield negative results for the B.1.1.7 variant, as the reverse primer sequence is in the region of mutation P681H. A more in-depth analysis of S-dropout positive results can be found in Kidd et al.18. Due to the likeliness of mutations in the S-gene, assays relying solely on its detection are no longer recommended, and a multiplex approach is required to detect SARS-CoV-2 and the VOCs19,20,21,22,23,24.

As the pandemic evolved and new VOCs appeared, it became clear that we need more specific and efficient primer design to detect the virus and to compensate for the high mutation rate were necessary. We propose an approach based on evolutionary algorithms (EAs)25, an AI technique, and available sequences from the Global Initiative on Sharing Avian Influenza Data (GISAID)26, to find specific sub-sequences that could be used as primers for RT-qPCR to detect SARS-CoV-2 and VOCs in human samples. The basic idea of the approach is to uncover 21-bps sequences specific to a given virus strain (in this case SARS-CoV-2 or a variant), rank them by their suitability as primers, and measure their capacity of discriminating samples belonging to the target virus from other strains. In a previous work, we generated a set of primers to detect SARS-CoV-2 with Convolutional Neural Networks (CNNs)25 using 52,645 SARS-CoV-2 sequences from the GISAID database and 20,572 sequences of other taxa from the National Center for Biotechnology Information (NCBI) database27. Then, we tested the resulting sequences in silico using the software Primer3Plus28 to verify their suitability as primer sets using the SARS-CoV-2 reference genome NC045512 from NCBI29. The process generated a primer set in the ORF3a gene of the SARS-CoV-2 genome, that was named UtrechtU-ORF3a, and was successfully tested in clinical settings with 10 patients’ samples25. Then, we improved the AI method to be faster, and at the same time specific enough to generate primer sets for the VOCs30.

In the context of justified skepticism of AI-based techniques, we then decided to perform a more robust validation of the RTq-PCR diagnostic tool we developed using AI. First, we clinically validated our main lineage primer UtrechtU-ORF3a by comparing them against 2 commercial kits for 20 patients’ samples, 15 positive and 5 negative. Next, we designed specific primer sets for VOCs B.1.1.7 (Alpha) and B.1.1.529 (Omicron) and clinically validated them against government approved tests. The proof-of-concept of our AI-Based Primer Design Tool can aid in the creation of accurate detection RT-qPCR tools for this and future pandemics.

Results

Main lineage comparison to commercial kits

Nasopharingeal and oropharingeal samples were evaluated with three different PCR tests: the UtrechtU-ORF3a, DeCoV19 Kit Triplex and GeneFinder COVID-19 PLUS RealAmp Kit primers. The UtrechtU-ORF3a primer set amplified correctly 15 positive samples with cycle threshold (CT) values similar to DeCoV19 Kit Triplex and GeneFinder COVID-19 PLUS RealAmp Kit primers (Table 1). Conversely, the 5 negative samples were correctly undetected by all three primer kits. We found that our primer set had a comparable performance and demonstrated its specificity and accuracy in detecting SARS-CoV-2 infections.

Table 1 CT (cylce threshold) values obtained from AI primers (Utrecht primers) and commercial kits (Genefinder, Decov-19), where positive SARS-CoV-2 samples are marked as \((+)\), and negative samples are marked as \((-)\). E, RdRP, N, IC are the targets of Genefinder. N1, N2, N3 are the targets of Decov-19. 1, 2 are the replicates of the UtrechtU-ORF3a primers.
Figure 1
figure 1

GeneFinder COVID-19 Plus RealAmp standard viral load curve (Left). UtrechtU-ORF3a standard viral load curve (Right).

Next, we compared viral loads from the 15 positive samples using GeneFinder COVID-19 PLUS RealAmp Kit and UtrechtU-ORF3a primer set (Fig. 1). For GeneFinder COVID-19 PLUS RealAmp Kit, we used E gene CT values for comparison. We observed that there are variations in log and copies/µl between methodologies. For the total of positive samples, the viral load measurement was higher using the UtrechtU-ORF3a primers. On the other hand, when interpolating the E gene CT values in the standard curve, it was observed that the result was less than 1 copy/µl in samples 3, 11, and 18 using the GeneFinder COVID-19 PLUS RealAmp Kit. This shows that UtrechtU-ORF3a primer set had a comparable or superior sensitivity compared to Genefinder, where lower viral loads are the limiting factor for accurate detection of SARS-CoV-2 infection, as seen in patient sample 11.

Experimental evaluation of B.1.1.7 primer specificity

The RT-qPCR amplification curves obtained with B.1.1.7 specific primer set B.1.1.7-1 generated using AI and generic SARS-CoV-2 primers IP2 (RdRp gene/nCoV_IP2) and IP4 (RdRp gene / nCoV_IP4), following the Pasteur Institute Protocol31 on two SARS-CoV-2 strains are shown in Fig. 2. As expected, only the B.1.1.7 strain is amplified by the B.1.1.7-1 primers, while both the Wuhan reference strain (NC_045512.2) and the B.1.1.7 strain are detected by the generic primers (Table 2).

Table 2 Cycle threshold (CT) values obtained from the specific B.1.1.7 (Alpha) primer set B.1.1.7-1, where positive B.1.1.7 samples are marked as \((+)\), and negative control and other lineage samples are marked as \((-)\).
Figure 2
figure 2

Comparison between non-specific primer set IP2 and IP4 and our designed primer set B.1.1.7-1 for B.1.1.7 variant and others.

Experimental evaluation of B.1.1.529 primer specificity

Laboratory testing of the generated specific primer set B.1.1.529-1, as part of the UniCoV study32, where we used raw saliva samples, proved to be successful (see Table 3) in identifying the Omicron variant. Thus, these results further demonstrated that the use of AI to generate specific diagnostic tests can be used as a fast measure to tackle the on-going SARS-CoV-2 pandemic and the appearance of new VOCs.

Table 3 Cycle threshold (CT) values obtained from the generated B.1.1.529 (Omicron) specific primers B.1.1.529-1, where positive SARS-CoV-2 samples are marked as \((+)\), and negative samples are marked as \((-)\). N1 is the 2019-nCoV-N1 primer set. The sample names remain the same as in the Unicov study32.

Discussion

Since the COVID-19 pandemic was declared in March 202033, new VOCs, such as B.1.1.7(Alpha), B.1.351 (Beta)34, P.1 (Gamma)35, B.1.617.2 (Delta)36 and B.1.1.529 (Omicron)37, have been emerging. As the SARS-CoV-2 mutated, some of the first primer sets designs failed to detect the mutated virus and certain characteristics made it necessary to separate the different lineages, this can be seen in supplementary figure 1. Thus, it was necessary to develop new VOC-specific primer sets as fast as possible and specific to certain variants. As a solution, we propose to use AI, and specifically Evolutionary Algorithms (EAs), to develop a tool to find candidate VOC-specific primers.

Our experiments show that the specificity and sensitivity of our AI-based primers can be a viable way of detecting SARS-CoV-2 but also specific VOCs, proving the usefulness of AI techniques in clinical settings. Our methodology is not only applicable to the design of primer sets to detect SARS-CoV-2 and its variants, but can also be used for other upcoming viruses, as long as a minimum of 10 viral sequences are available38. Critiques related to the use of AI in medicine often focus on the poor replicability of the results, on the lack of proper follow-ups with laboratory validation, or on the absence of comparison against more traditional techniques11,39,40. In order to improve the state of the art in the AI field, we performed a thorough validation of our AI-based technique, in a clinical setting involving 20 patients and a comparison against commercial diagnostic kits (for primers developed for the main SARS-CoV-2 lineage); for variant primers, we tested their specificity in laboratory settings for B.1.1.7 (Alpha) and B.1.1.529 (Omicron). Although, a limitation of our study is that we need to make further analyses with more samples from the VOCs.

We believe that all the proposed primer sets can be employed in a multiplexed approach in samples for the initial diagnosis of COVID-19 patients, or used as a second step of diagnosis in cases already verified positive to SARS-CoV-2, to identify specific VOCs. In addition, the methodology can be in principle applied to other detection techniques, such as loop-mediated isothermal amplification (LAMP), but this will require further testing.

In this way, health authorities can better evaluate the medical outcomes of patients, and adapt or inform new policies that can help curve the rise of variants of interest, and new potential viruses. For example, in May 2022, several cases of a human Monkeypox virus were identified outside endemic countries41, and has since been spreading across the globe prompting the WHO to declare it a global emergency last 23 of July 202242. Thus, using our AI-based methodology and 191 Monkeypox sequences available in the GISAID repository, along with 20,603 sequences from other viruses from the NCBI dataset, we obtained 4 different primer sets with 100% in-silico specificity and 100% sensitivity using sequence EPI_ISL_13053218 as reference. The results are reported in supplementary table 5. Although future primer sets delivered by our automated methodology will still require laboratory testing to be validated, our methodology can enable the timely, rapid, and low-cost operations needed for the design of new primer sets to accurately diagnose new emerging SARS-CoV-2 variants and emerging viral infectious diseases.

Methods

The methodology to create a primer set is summarized in Fig. 3. First, we generate a dataset containing the target viral sequence labeled as “1” and other viral sequences labeled as “0”. In the case of the main lineage we will have a dataset with SARS-CoV-2 sequences labeled as “1” and other viruses labeled as “0”. For the VOCs we create the dataset with the target variant as “1” and a set of different lineages of SARS-CoV-2 lineages as “0”. Then, using evolutionary algorithms we find a candidate forward primer, by finding a suitable 21-bps sequence that can distinguish the target variant/virus from the rest. Next, assisted by Primer3Plus28, we verify that the forward primer is viable and create the reverse primer. In the case of SARS-CoV-2, we validate the generated primers in silico in 2,107,300 SARS-CoV-2 sequences from the GISAID repository. Finally, we test the primer set in laboratory settings.

Figure 3
figure 3

Summary of the procedure to create, validate and test a primer set designed using evolutionary algorithms.

Evolutionary algorithm

We will present a summary of the use of artificial intelligence to generate the different primer sets to detect the main lineage of SARS-CoV2, and the VOCs B.1.1.7 and B.1.1.529. A more detailed description is available in25. Evolutionary Algorithms (EAs) are a stochastic optimization technique, able to find candidate solutions that maximize or minimize a given cost function for a specific problem. EAs are suited for searching among a vast number of different alternatives, that would be impossible to process exhaustively. In our case, candidate solutions are 21-bps sub-sequences found inside samples or sequences of the target virus of lineage, e.g. sub-sequences of B.1.1.7, that separate it from the other lineages. In our particular case, the cost function will be a measure of how probable is that a found 21-bps sequence in a position p in a sample k could be used as a forward primer. This given measure considers the GC content, no missing values (N) in the sequence, the specificity and the temperature of the found forward primer.

The cost function in our study evaluates the suitability of a candidate sub-sequence as primer, is to be maximized, and is given by the following:

$$\begin{aligned} F(I)=w_p \cdot \textit{Specificity}(I) + w_c \cdot \textit{GC}(I) + w_n \cdot \textit{N}_{val}(I) + w_t \cdot \textit{Temp}(I) \end{aligned}$$
(1)

with \(w_p, w_c, w_n, w_t\) representing the weights associated to each term.

$$\begin{aligned} \textit{Specificity}(I)=\sum _{i=0}^{T}{P(I, s_i)}, \end{aligned}$$
(2)

\(\textit{Specificity}(I)\) is evaluating the presence of the sequence selected as candidate primer I inside training samples labeled with the variant of interest, and its absence from samples of other variants, T is the number of samples in the training set, \(s_i\) is the i-th sample in the training set. Function P is defined as:

$$\begin{aligned} P(I, s_i) = {\left\{ \begin{array}{ll} 0,&{} \text {if }I\text { is found inside }s_i\text { and }L(s_i) == L(s_k) \\ 1,&{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(3)

where L(s) returns the class label of sample s. In other words, \(P(I,s_i)\) equals 1 if sequence I is found inside a sample with the same class label as sample \(s_k\), the origin of sequence I. So, if the 21-bps sequence I is found inside a sample that does not belong to the variant of interest, or is not found in a sample that belongs to the variant of interest, the solution is penalized.

The second term of the weighted sum takes into account the GC content of the candidate primer:

$$\begin{aligned} \textit{GC}(I) = 0.5 - \sum _{i=0}^{21}{\dfrac{C(I(i))}{21}}&\text {where }C(b)= {\left\{ \begin{array}{ll} 1,&{} \text {if base }b\text { is C or G} \\ 0,&{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(4)

where I(i) represents the base in position i inside sequence I. \(\textit{N}_{val}\) is defined as the following equation, that takes into account the presence of N symbols in the sequence, indicating an error in the read. The ideal primer candidate should only contain A, C, G, or T values.

$$\begin{aligned} \textit{N}_{val}(I) = \sum _{i=0}^{21}{N(I(i))}&\text {where}&N(b) = {\left\{ \begin{array}{ll} 1,&{} \text {if base }b\text { is N} \\ 0,&{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(5)

The final term tackles the requirement of having a melting temperature \(T_m\) centered around \(60^{\circ }\). Specialized literature28 provides an equation to compute \(T_m\) for a sequence I:

$$\begin{aligned} T_m(I) = 81.5 + 16.6 * log_{10}([Na+]) + 41 * \textit{GC}(I) - 600 / l(I) \end{aligned}$$
(6)

where \(\textit{GC}(I)\) is the content of C and G bases in sequence I, as described in Equation 4, \([Na+]\) is the molar sodium concentration, and l(I) is the length of sequence I, in bps. We used the value of \([Na+]=0.2\) as described in28, while \(l(I)=21\) by design. The term taking into account \(T_m\) will then be:

$$\begin{aligned} \textit{Temp}(I)=|60-T_m(I)| \end{aligned}$$
(7)
Figure 4
figure 4

Summary of the EA where we expect to find a target 21-bps sequence in position p and in sample k to be used as a forward primer.

The EA is set with a population of size \(\mu =200\), generating offspring of size \(\lambda =200\). The entire population is replaced by its offspring at each generation, using a (\(\lambda ,\mu \)) replacement strategy, with a tournament selection of size \(\tau =2\), a mutation acting on integer values, a one-point crossover, and a stop condition set on 100 generations, the value of 100 was selected from prelimonary testing and comparison to the results in30. A summary of the used EA is in Fig. 4. In contrast to previous methods for finding primers, that used Convolutional Neural Networks (CNN)25, this approach has the advantage of reducing the time required to obtain candidates. In comparison to the 16 hours required by the CNN approach, each single run of the EA lasts around 62 minutes with 5 threads on a 64-bit Windows 10 laptop with an Intel Xeon E-2186M microprocessor. Furthermore, CNN-based approaches require a considerable amount of post-processing to filter out sub-sequences without the desired requirements as primers, while the cost function already selects for most of the necessary requirements.

In-silico validation of the designed primers

After we have created the forward primer using the EAs, we will verify it using and generate the reverse primer with Primer3Pus28 in the reference sequences for each virus or variant, this is done automatically by our tool. Then, we test for specificity in silico. This test involves verifying if the sequence (or sample) contains the targeted sequences from the primer set. In comparison to other tools like the Primer-Blast tool43, we design the primer set using several sequences at the same time, instead of only one as a reference. For the validation, we will expect that the main lineage primer set UtrechtU-Orf3a appears in all of the lineages. In contrast, we expect for the B.1.1.7-1 and B.1.1.529-1 primer sets to be specific to the target lineages and not the rest.

For the in silico validation, we downloaded 2,107,300 sequences of SARS-CoV-2 from 27 different lineages and sub-lineages from the GISAID repository, in June \(11^{th}\), 2022. The number of sequences by variant of SARS-CoV-2 is available in supplementary table 4.

Main lineage comparison to commercial kits

Create primer set

We downloaded 583 sequences (*.fasta files) from the National Genomics Data Center (NGDC) repository44 on March 15th, 2020, with labels: MERS-CoV (236), HCoV-EMC (4), HCoV-OC43 (138), HCoV-229E (22), HCoV-4408 (2), HCoV-NL63 (58), HCoV-HKU1 (17), SARS-CoV (10) and SARS-CoV-2 (96). We assign SARS-CoV-2 as class 1, and the rest as class 0. Then, we run the EA algorithm 20 times, limiting its search to the sequence around gene ORF3a from SARS-CoV-2, which yields as candidate forward primers 14 non-repeated sequences. The results of the 20 runs of the EA by generation is in Fig. 5 (left).

Figure 5
figure 5

(Left) Cost function in 20 runs of the EA for 100 generations to find a forward primer in the main lineage. (Right) Percentage of appearance of the primer set in the 2,107,300 SARS-CoV-2 sequences.

The resulting sequences and the Primer3Plus28 analysis are in supplementary table 1. From the results, the sequence 5′-TAG CAC TCT CCA AGG GTG TTC-3′ appears 4 times, and can be used as a forward primer and 3′-GCA AAG CCA AAG CCT CAT TA-5′ as reverse primer simulated using Primer3Plus28, this is the same result as using a CNN25.

In silico validation

Although the B.1.1.529 (Omicron) VOC did not exist when we created our primers for the main SARS-CoV-2 lineage (UtrechtU-ORF3a), in an initial in-silico analysis of 10,907 Omicron samples from GISAID26, the primers return a detection rate above 95% over the 5 Omicron sublineages BA.1-BA.5. In contrast, the E-Sarbeco45 (Charite-E) and CHINA-CDC-N primer sets do not detect these Omicron variants, see supplementary figure 1. Over a total of 2,107,300 sequences of different SARS-CoV-2 variants from GISAID, the UtrechtU-ORF3a forward primer has an accuracy of 98.15%, and the reverse primer of 97.99%. In Fig. 5 ( right) we can see the expected result that the created primer does appear in almost all sequences from B.1.1.7, B.1.1.529 and other lineages.

Laboratory testing

Next, in the Hospital Civil de Guadalajara, México we select 20 samples from patients, to perform the evaluation of the AI-designed RT-qPCR tests. The molecular diagnosis in this hospital was carried out with the DeCoV19 Kit Triplex (Genes2Life) endorsed by the Institute of Epidemiological Diagnosis and Reference (InDRE) in Mexico. We choose 15 positive and 5 negative nasopharyngeal and oropharyngeal swabs from patients with acute respiratory tract infection. The samples are tested with DeCoV19 Kit Triplex, GeneFinder COVID-19 PLUS RealAmp Kit, and the AI-designed primers UtrechtU-ORF3a.

Then, viral RNA is extracted from nasopharyngeal and oropharyngeal swabs using Qiamp Viral RNA mini kit. Approximately 75 µL of viral RNA is recovered and tested immediately. The DeCoV19 Kit Triplex detects 3 regions of the N gene and the human RNAse P gene. The GeneFinder COVID-19 PLUS RealAmp Kit detects regions of the N, E, and RdRp genes and an internal control. The AI designed primers bind to the ORF3a gene obtaining a 179 bp amplicon, this primers set are identified as UtrechtU-ORF3a primers. For commercial Kits, RT-qPCR is performed following the manufacturer’s recommendations, while for UtrechtU-ORF3a primers SuperScript III Platinum One-Step qRT-PCR Kit, forward (5′-TAG CAC TCT CCA AGG GTG TTC-3′), reverse (3′-GCA AAG CCA AAG CCT CAT TA-5′) and TaqMan probe (5′-FAM CCT TGA AGC CCC TTT TCT CT NFQ-3′) are used. All the RT-qPCR tests are developed in QuantStudio 5 Real-Time PCR System (Applied Biosystems).

Finally, to evaluate PCR efficiency we run a duplicate 10-fold dilution series of viral RNA for each assay using Twist Synthetic SARS-CoV-2 RNA Control 1. We calculate the slope by linear regression and defined the required levels for efficiency \(>80\%\) and \(R^2 \ge 0.98\), respectively. Thereafter, for the UtrechtU-ORF3a primers all the samples were evaluated with a replicate to obtain an indication of clinical sensitivity (n = 2). Furthermore, we calculated the viral load of samples according to standard curve of Twist Synthetic SARS-CoV-2 RNA Control 1.

Experimental evaluation of B.1.1.7 Primer specificity

Create primer Set

We downloaded 10,712 SARS-CoV-2 sequences from the GISAID repository on December 23, 2020. After removing repeated sequences, we obtained a total of 2104 sequences labeled as B.1.1.7 and 6819 sequences from other variants, for a total of 8923 samples. Next, we select 1000 sequences for training the algorithm, where 605 are B.1.1.7 and 395 are other SARS-CoV-2 variants. Then, we assigned label 1 to B.1.1.7 variant samples, and the rest are assigned label 0, and we ran the EA algorithm 10 times, Fig. 6 (left). The results are reported in supplementary table 2.

Figure 6
figure 6

(Left) Cost function in 10 runs of the EA for 100 generations to find a forward primer in the the variant B.1.1.7. (Right) Percentage of appearance of the primer set in the 2,107,300 SARS-CoV-2 sequences.

Again, we simulated the candidate primers using Primer3Plus and EPI_ISL_601443 as the reference sequence for B.1.1.7. From the results, we selected the forward primer 5′-CAT GCT ATC TCT GGG ACC AAT-3′, because the sequence has more than 1 single nucleotide mutation, and therefore has a higher probability of being a successful specific primer. In addition, given the position of the proposed reverse primer, we could use the Y144 deletion as reverse primer 3′-TGT TGT TTT TGT GGT AAA CAC C-5′ by displacing the result 10 bp. This results in an amplicon product size of 244 bp. This primer set is identified as B.1.1.7-1.

In silico validation

From an in-silico analysis, the B.1.1.7-1 forward primer, tested on 1,051,740 B.1.1.7 sequences, has a sensitivity of 95.25% and specificity of 98.67%. The forward primer appears in 45% of Omicron BA.3, 61% of Omicron BA.5, and 69% of B.1.258 variant samples, that all did not exist when the primer was created. The reverse primer in the set does not appear in these lineages, although it appears in 22% of B.1.526 variant samples. Nevertheless, the forward and reverse primers appear at the same time only in the B.1.1.7 lineage, meaning that the primer set can be still effectively used to detect B.1.1.7, at least at the time of writing (June 2022). From Fig. 6 (right), we can see that forward and reverse sequences appear mostly in B.1.1.7 sequences, as expected, in comparison to samples from B.1.1.529 and other lineages.

Laboratory testing

Amplification efficiency of the designed primer sets was evaluated using viral RNA extracts from two sequenced SARS-CoV-2 strains: the original Wuhan strain 210207 (GISAID \(\text {N}^{\circ }\) EPI_ISL_437689) and VOC B.1.1.7 strain (GISAID \(\text {N}^{\circ }\) EPI_ISL_683466) in the Pasteur Institute. Viral RNA was extracted from infected cell culture supernatants using the NucleoSpin Dx Virus kit (Macherey-Nagel), following the manufacturers’ protocol. Viral RNA extracts (5 µL) were analyzed either using the IP2/IP4 dualplex real-time reverse-transcriptase (RT)-PCR assay, developed by following the Pasteur Institute and targeting conserved regions of the SARS-CoV-2 RdRP gene31, or primer set B.1.1.7-1, using the LightCycler EvoScript RNA SYBR Green I Master kit (Roche). Both RT-PCR assays were conducted on a LightCycler® 480 System (Roche), using the thermal cycling program described in the Pasteur Institute protocol31.

Experimental evaluation of B.1.1.529 primer specificity

Create primers

From the GISAID repository26, we downloaded a total of 2100 sequences, in *.fasta format, with 123 sequences identified as B.1.1.529 (Omicron) and 100 sequences of each of the following variants, labeled following the Pango lineage46: AY.3 (Delta Sublineage), B.1, B.1.1.7 (Alpha), B.1.1.214, B.1.1.519, B.1.2, B.1.160, B.1.177, B.1.177.21, B.1.221, B.1.243, B.1.258, B.1.351 (Beta), B.1.427, B.1.429, B.1.526, B.1.596, B.1.617.2 (Delta), D.2, P.1 (Gamma), and R.1. Using EAs we found candidate forward primers from 10 runs of the algoritm, Fig. 7 (left). The results are reported in supplementary table 3.

Figure 7
figure 7

(Left) Cost function in 10 runs of the EA for 100 generations to find a forward primer in the the variant B.1.1.529. (Right) Percentage of appearance of the primer set in the 2,107,300 SARS-CoV-2 sequences.

In silico validation

Then, we simulated the candidate forward primers using Primer3Plus28 using the accession \(EPI\_ISL\_6590782\) as reference for B.1.1.519 (Omicron) variant. Although sequence GACCCACTTATGGTGTTGGTC resulted in a warning for High end self complementarity, it presented 3 characteristics mutations of the B.1.1.529 (Omicron) variant47: Q498R (A23055G), N501Y (A23063T) and Y505H (T23075C), position 23,054 to 23,075 in the reference accession NC_045512.248. A single-nucleotide mutation may not be enough to work as a specific primer for detecting SARS-CoV-2 variants. Thus, to solve the High end self complementarity, we increased the size of the primer by adding a base pair at the end (GACCCACTTATGGTGTTGGTCA), which resulted in an acceptable primer candidate with a \(T_m\) of 62.0 \(^{\circ }\text {C}\). We then generated the internal probe CACCAGCAACTGTTTGTGGA and reverse primer CTGCCAAATTGTTGGAAAGG with a \(T_m\) of 60.8 \(^{\circ }\text {C}\) and 60.5 \(^{\circ }\text {C}\) respectively, with a product size of 208 bp. This primer set is identified as B.1.1.529-1.

The B.1.1.529-1 forward primer, tested in-silico on 10,907 B.1.1.529 (Omicron) sequences, including the 5 sub-lineages, shows a sensitivity of 82.17%, and a specificity of 100% when tested on 2,096,393 sequences of other SARS-CoV-2 variants. The lower sensitivity is due to fact that several of the BA.1 samples in the repository contain sequencing errors (N) in the part of the genome that should match the forward primer. As seen on Fig. 7 (right) the reverse primer do appear in almost all SARS-CoV-2 sequences. Nevertheless, it is necessary to have both primers in order to give a positive.

Laboratory testing

To test the specificity of the primer set, raw saliva samples, approximately 1-2 ml, were collected in 5 ml screwcap containers from volunteers in the UniCoV study32 from the University College Cork. Following collections samples were heat inactivated at 95 °C for 5 min, cooled, vortexed and then 20 ml of saliva was added to 20 ml of Saliva Ready\(^{\textrm{TM}}\) Solution in a 0.2 ml 96 well plate. The plate was vortexed and centrifuged and then heated at 62 °C for 5 min, 92 °C for 5 min and then cooled at \(4^{\circ }\text {C}\). Saliva was then screened for SARS-CoV-2 (ORF1a, ORF1b and N gene) and Human RNase P using the TaqMan\(^{\textrm{TM}}\) 1-Step Mutiplex SARS-CoV-2 Fast PCR Kit 2.0 on the Applied Biosystems\(^{\textrm{TM}}\) QuantStudio\(^{\textrm{TM}}\) 5 Real Time PCR Instrument, 96 well, 0.2-mL block (Thermo Fisher Scientific) according to the manufacturer’s instructions including a Positive Control: reverse transcription 53 °C for 5 min, 1 preincubation 85 °C for 5 min, activation at 95 °C 2 min with 40 cycles of denaturation 95 °C for 1 second and anneal/extension 62 °C for 30 seconds.

Positive SARS-CoV-2 samples were subsequently screened for the presence of the Omicron variant using our specific Omicron primer set (B.1.1.529-1). Briefly cDNA was synthesised using saliva from the Saliva Ready\(^{\textrm{TM}}\) step above with LunaScript RT-Supermix (NEB), briefly for primer annealing for 2 min at 25 °C , cDNA synthesis for 10 minutes \(55\,^{\circ }\text {C}\) 2 minutes and denaturation 95 °C for 1 minute. qPCR using followed by second and anneal/extension 62 °C for 30 seconds.

Finally, qPCR was performed using cDNA from above with the Luna Universal Probe qPCR mix (NEB), with Omicron specific primers and N1 (2019-nCoV RUO) primers/probes (Integrated DNA Technologies (IDT)) at a concentration 500 nM with probes at 250 nM(FAM-labelled). PCR conditions are denaturation at 95 °C fro 1 min and 40 cycles of denaturation 95 °C for 15 second and anneal/extension 60 °C for 30 seconds on the Applied Biosystems\(^{\textrm{TM}}\) QuantStudio\(^{\textrm{TM}}\) 5 Real Time PCR Instrument, 96 well, 0.2-mL block. All PCR reactions were performed in duplicate with a technical replicate performed following initial analysis.