An Innovative AI-based primer design tool for precise and accurate detection of SARS-CoV-2 variants of concern

As the COVID-19 pandemic winds down, it leaves behind the serious concern that future, even more disruptive pandemics may eventually surface. One of the crucial steps in handling the SARS-CoV-2 pandemic was being able to detect the presence of the virus in an accurate and timely manner, to then develop policies counteracting the spread. Nevertheless, as the pandemic evolved, new variants with potentially dangerous mutations appeared. Faced with these developments, it became clear that there is a need for fast and reliable techniques to create highly specific molecular tests, able to uniquely identify variants of concern (VOCs). Using an automated pipeline built around evolutionary algorithms, we designed primer sets for SARS-CoV-2 (main lineage) and for the VOCs B.1.1.7 (Alpha) and B.1.1.529 (Omicron). Starting from sequences openly available in the GISAID repository, our pipeline was able to deliver the primer sets for the main lineage and each variant in a matter of hours. Preliminary in-silico validation showed that the sequences in the primer sets featured high accuracy. A pilot test in a laboratory setting confirmed the results: the developed primers compared favorably against existing commercial versions for the main lineage, and the specific versions for the VOCs B.1.1.7 and B.1.1.529 were successfully tested clinically.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was identified in December 2019 in Wuhan, China 1 , as a new strain of coronavirus that causes COVID-19 disease 2 . Since then, COVID-19 has evolved into a pandemic with nearly 540 million confirmed cases and over 6.3 million deaths worldwide 3 , as of June 2022. Although vaccines are now available for SARS-CoV-2, the outbreak still represents a massive challenge because of the characteristics of the COVID-19 disease, e.g., long incubation period, wide range of symptoms, high infection rate, high false negative rate (FNR) in detection tests and high mutation rate 4 .
An early estimate of the SARS-CoV-2 mutation rate is $1.12 \times 10^{-3}$ mutations per site-year 5 . The high mutation rate of SARS-CoV-2 resulted in different variants, some of which showed higher transmissibility, virulence or mortality, a different clinical presentation, and/or resistance to vaccines and therapeutics 6 . Given their acquired mutations and the aforementioned characteristics, some variants were deemed variants of concern (VOCs) by the WHO.

Discussion
Since the COVID-19 pandemic was declared in March 2020 33 , new VOCs, such as B.1.1.7 (Alpha), B.1.351 (Beta) 34 , P.1 (Gamma) 35 , B.1.617.2 (Delta) 36 and B.1.1.529 (Omicron) 37 , have emerged. As SARS-CoV-2 mutated, some of the first primer set designs failed to detect the mutated virus, and certain characteristics made it necessary to distinguish between the different lineages (see supplementary figure 1). Thus, it was necessary to develop new VOC-specific primer sets as fast as possible. As a solution, we propose to use AI, and specifically Evolutionary Algorithms (EAs), to develop a tool to find candidate VOC-specific primers.
Our experiments show that our AI-based primers have the specificity and sensitivity to be a viable way of detecting not only SARS-CoV-2 but also specific VOCs, demonstrating the usefulness of AI techniques in clinical settings. Our methodology is not only applicable to the design of primer sets to detect SARS-CoV-2 and its variants, but can also be used for other emerging viruses, as long as a minimum of 10 viral sequences are available 38 . Critiques related to the use of AI in medicine often focus on the poor replicability of the results, on the lack of proper follow-ups with laboratory validation, or on the absence of comparison against more traditional techniques 11,39,40 . In order to improve the state of the art in the AI field, we performed a thorough validation of our AI-based technique in a clinical setting involving 20 patients, with a comparison against commercial diagnostic kits (for primers developed for the main SARS-CoV-2 lineage); for variant primers, we tested their specificity in laboratory settings for B.1.1.7 (Alpha) and B.1.1.529 (Omicron). A limitation of our study is that further analyses with more samples from the VOCs are still needed.
We believe that all the proposed primer sets can be employed in a multiplexed approach in samples for the initial diagnosis of COVID-19 patients, or used as a second step of diagnosis in cases already verified positive for SARS-CoV-2, to identify specific VOCs. In addition, the methodology can in principle be applied to other detection techniques, such as loop-mediated isothermal amplification (LAMP), but this will require further testing.
In this way, health authorities can better evaluate the medical outcomes of patients, and adapt or inform new policies that can help curb the rise of variants of interest and of potential new viruses. For example, in May 2022, several cases of human Monkeypox virus infection were identified outside endemic countries 41 , and the virus has since been spreading across the globe, prompting the WHO to declare it a global emergency on 23 July 2022 42 . Using our AI-based methodology and 191 Monkeypox sequences available in the GISAID repository, along with 20,603 sequences from other viruses from the NCBI dataset, we obtained 4 different primer sets with 100% in-silico specificity and 100% sensitivity using sequence EPI_ISL_13053218 as reference. The results are reported in supplementary table 5. Although future primer sets delivered by our automated methodology will still require laboratory testing to be validated, our methodology can enable the timely, rapid, and low-cost operations needed for the design of new primer sets to accurately diagnose newly emerging SARS-CoV-2 variants and emerging viral infectious diseases.

Methods
The methodology to create a primer set is summarized in Fig. 3. First, we generate a dataset containing the target viral sequences labeled as "1" and other viral sequences labeled as "0". In the case of the main lineage, the dataset contains SARS-CoV-2 sequences labeled as "1" and other viruses labeled as "0". For the VOCs, we create the dataset with the target variant labeled as "1" and a set of other SARS-CoV-2 lineages labeled as "0". Then, using evolutionary algorithms, we find a candidate forward primer by searching for a suitable 21-bp sequence that can distinguish the target variant/virus from the rest. Next, assisted by Primer3Plus 28 , we verify that the forward primer is viable and create the reverse primer. In the case of SARS-CoV-2, we validate the generated primers in silico on 2,107,300 SARS-CoV-2 sequences from the GISAID repository. Finally, we test the primer set in laboratory settings.

Evolutionary algorithm
We present a summary of the use of artificial intelligence to generate the different primer sets to detect the main lineage of SARS-CoV-2, and the VOCs B.1.1.7 and B.1.1.529. A more detailed description is available in 25 . Evolutionary Algorithms (EAs) are a stochastic optimization technique, able to find candidate solutions that maximize or minimize a given cost function for a specific problem. EAs are suited for searching among a vast number of different alternatives that would be impossible to process exhaustively. In our case, candidate solutions are 21-bp sub-sequences found inside samples or sequences of the target virus or lineage, e.g., sub-sequences of B.1.1.7 that separate it from the other lineages. The cost function is a measure of how probable it is that a 21-bp sequence found at a position p in a sample k could be used as a forward primer. This measure considers the GC content, the absence of missing values (N) in the sequence, the specificity, and the melting temperature of the candidate forward primer.
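The search described above can be sketched as a small generational EA. This is a minimal illustration and not the authors' implementation: the encoding of a candidate as a pair (sample index k, position p), the function names, the mutation step size, and the toy fitness in the usage note are all assumptions. The numeric settings (population and offspring of 200, tournament of size 2, one-point crossover, integer mutation, 100 generations) are the ones reported in this Methods section.

```python
import random


def tournament(population, scores, rng, tau=2):
    """Pick tau individuals at random and return the one with the best score."""
    picks = rng.sample(range(len(population)), tau)
    return population[max(picks, key=lambda i: scores[i])]


def crossover(a, b):
    """One-point crossover on the two-gene genome (k, p): k from a, p from b."""
    return (a[0], b[1])


def mutate(ind, n_samples, max_pos, rng, step=50):
    """Integer mutation: redraw the sample index or shift the position (assumed form)."""
    k, p = ind
    if rng.random() < 0.5:
        k = rng.randrange(n_samples)
    else:
        p = min(max(p + rng.randint(-step, step), 0), max_pos)
    return (k, p)


def evolve(fitness, n_samples, max_pos, mu=200, lam=200, generations=100, seed=0):
    """Generational EA; returns the best (k, p) seen during the run."""
    rng = random.Random(seed)
    population = [(rng.randrange(n_samples), rng.randint(0, max_pos)) for _ in range(mu)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        offspring = []
        for _ in range(lam):
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            offspring.append(mutate(crossover(a, b), n_samples, max_pos, rng))
        # Comma replacement: parents are discarded, the mu best offspring survive.
        population = sorted(offspring, key=fitness, reverse=True)[:mu]
        best = max([best] + population, key=fitness)
    return best
```

With real data, `fitness` would extract the 21-bp window `samples[k][p:p + 21]` and score it with the cost function; a toy call such as `evolve(lambda ind: -abs(ind[1] - 1234), n_samples=10, max_pos=29000)` only exercises the loop.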
The cost function in our study evaluates the suitability of a candidate sub-sequence I as a primer and is to be maximized. It is the weighted sum given in Equation 1, with $w_p$, $w_c$, $w_n$, $w_t$ representing the weights associated with each term. In the definition of the function P used by the specificity term, L(s) returns the class label of sample s. In other words, $P(I, s_i)$ equals 1 if sequence I is found inside a sample $s_i$ with the same class label as sample $s_k$, the origin of sequence I. So, if the 21-bp sequence I is found inside a sample that does not belong to the variant of interest, or is not found in a sample that belongs to the variant of interest, the solution is penalized.
The second term of the weighted sum, GC(I), takes into account the GC content of the candidate primer, where I(i) represents the base in position i inside sequence I. The third term, $N_{val}(I)$, takes into account the presence of N symbols in the sequence, which indicate an error in the read. The ideal primer candidate should only contain A, C, G, or T values.
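The first three terms of the weighted sum can be sketched as follows. The exact equations are not reproduced in this text, so the forms below are assumptions chosen to follow the verbal description: the specificity term counts a sample as correct when the presence of the candidate agrees with its label, the GC term is the fraction of G and C bases, and the N term is 1 only when the candidate contains no N. The function names and the unit weights are hypothetical, and the temperature term is passed in as a precomputed score.

```python
def specificity(candidate, samples, labels, target_label=1):
    """Fraction of samples where presence of the candidate agrees with the label (assumed form)."""
    correct = 0
    for sequence, label in zip(samples, labels):
        present = candidate in sequence
        if present == (label == target_label):
            correct += 1
    return correct / len(samples)


def gc_content(candidate):
    """Fraction of bases that are G or C (assumes upper-case input)."""
    return sum(base in "GC" for base in candidate) / len(candidate)


def n_free(candidate):
    """1.0 if the candidate has no unread base (N), else 0.0 (assumed form)."""
    return 0.0 if "N" in candidate else 1.0


def cost(candidate, samples, labels, temp_score=0.0, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum F(I); the weights here are placeholders, not the paper's values."""
    w_p, w_c, w_n, w_t = weights
    return (w_p * specificity(candidate, samples, labels)
            + w_c * gc_content(candidate)
            + w_n * n_free(candidate)
            + w_t * temp_score)
```

Keeping each term as a separate function makes it easy to swap in the published definitions once they are at hand without touching the weighted sum.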
The final term addresses the requirement of having a melting temperature $T_m$ centered around 60 °C. Specialized literature 28 provides an equation to compute $T_m$ for a sequence I from GC(I), the content of C and G bases in sequence I as described in Equation 4; [Na+], the molar sodium concentration; and l(I), the length of sequence I in bp. We used the value [Na+] = 0.2 as described in 28 , while l(I) = 21 by design. The term Temp(I) of the cost function is then computed from $T_m$.

The EA is set with a population of size $\mu = 200$, generating $\lambda = 200$ offspring. The entire population is replaced by its offspring at each generation, using a ($\mu$, $\lambda$) replacement strategy, with a tournament selection of size $\tau = 2$, a mutation acting on integer values, a one-point crossover, and a stop condition set at 100 generations; the value of 100 was selected from preliminary testing and comparison to the results in 30 . A summary of the EA is in Fig. 4.

In contrast to previous methods for finding primers that used Convolutional Neural Networks (CNNs) 25 , this approach has the advantage of reducing the time required to obtain candidates. Compared with the 16 hours required by the CNN approach, each single run of the EA lasts around 62 minutes with 5 threads on a 64-bit Windows 10 laptop with an Intel Xeon E-2186M microprocessor. Furthermore, CNN-based approaches require a considerable amount of post-processing to filter out sub-sequences that lack the desired properties of primers, while the cost function already selects for most of the necessary requirements.
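The melting-temperature term described above can be illustrated with a short calculation. The sketch assumes that the equation referred to is the widely used salt-adjusted formula given in the Primer3 documentation, $T_m = 81.5 + 16.6 \log_{10}([Na^+]) + 0.41 \cdot \%GC - 600 / l$, which uses exactly the three quantities named in the text. The function names are hypothetical, and the scoring function `temp_term` is purely illustrative, since the form of Temp(I) is not reproduced here.

```python
import math


def gc_percent(seq):
    """Percentage of G and C bases in the sequence (assumes upper-case input)."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def melting_temperature(seq, sodium_molar=0.2):
    """Salt-adjusted Tm in degrees Celsius (assumed to be the formula cited in the text)."""
    return (81.5
            + 16.6 * math.log10(sodium_molar)
            + 0.41 * gc_percent(seq)
            - 600.0 / len(seq))


def temp_term(seq, target=60.0):
    """Hypothetical score in (0, 1]: 1 at the target Tm, decreasing with distance."""
    return 1.0 / (1.0 + abs(melting_temperature(seq) - target))
```

Under these assumptions, the 21-bp forward primer TAGCACTCTCCAAGGGTGTTC (11 G/C bases) with [Na+] = 0.2 evaluates to roughly 62.8 °C. Primer3Plus may report a different value because it can use other thermodynamic models.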

In-silico validation of the designed primers
After creating the forward primer with the EA, we verify it and generate the reverse primer with Primer3Plus 28 on the reference sequence for each virus or variant; this is done automatically by our tool. Then, we test for specificity in silico. This test involves verifying whether each sequence (or sample) contains the target sequences from the primer set. In contrast to other tools such as Primer-BLAST 43 , we design the primer set using several sequences at the same time, instead of only one as a reference. For the validation, we expect the main lineage primer set UtrechtU-ORF3a to appear in all of the lineages. In contrast, we expect the B.1.1.7-1 and B.1.1.529-1 primer sets to be specific to their target lineages and absent from the rest.
For the in silico validation, we downloaded 2,107,300 sequences of SARS-CoV-2 from 27 different lineages and sub-lineages from the GISAID repository on June 11, 2022. The number of sequences by variant of SARS-CoV-2 is available in supplementary table 4.
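The in silico check described above amounts to testing whether each genome contains the primer sequences and then summarizing the hits per class. The following is a minimal sketch under stated assumptions: matching is exact, a primer is accepted on either strand (its sequence or its reverse complement), and a sample is called positive only when both primers are found. The function names and the toy sequences in the test are hypothetical.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement; any symbol other than A/C/G/T becomes N."""
    return "".join(_COMPLEMENT.get(base, "N") for base in reversed(seq))


def contains(genome, primer):
    """Exact match of the primer on either strand (an assumption of this sketch)."""
    return primer in genome or reverse_complement(primer) in genome


def detection_rates(genomes, is_target, forward, reverse):
    """Return (sensitivity, specificity) for a forward/reverse primer pair."""
    tp = fn = tn = fp = 0
    for genome, target in zip(genomes, is_target):
        positive = contains(genome, forward) and contains(genome, reverse)
        if target and positive:
            tp += 1
        elif target:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Because matching is exact, a genome with an unread base (N) inside the primer region counts as a miss, which lowers the measured sensitivity even when the virus itself carries the target sequence.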
The resulting sequences and the Primer3Plus 28 analysis are in supplementary table 1. From the results, the sequence 5′-TAG CAC TCT CCA AGG GTG TTC-3′ appears 4 times and can be used as a forward primer, with 3′-GCA AAG CCA AAG CCT CAT TA-5′ as the reverse primer, simulated using Primer3Plus 28 . This is the same result as that obtained using a CNN 25 .

In silico validation
Although the B.1.1.529 (Omicron) VOC did not exist when we created our primers for the main SARS-CoV-2 lineage (UtrechtU-ORF3a), in an initial in-silico analysis of 10,907 Omicron samples from GISAID 26 , the primers return a detection rate above 95% over the 5 Omicron sublineages BA.1-BA.5. In contrast, the E-Sarbeco 45 (Charite-E) and CHINA-CDC-N primer sets do not detect these Omicron variants, see supplementary figure 1. Over a total of 2,107,300 sequences of different SARS-CoV-2 variants from GISAID, the UtrechtU-ORF3a forward primer has an accuracy of 98.15%, and the reverse primer of 97.99%. The percentage of appearance of the primer set across these sequences is shown in Fig. 5 (right).

Then, viral RNA is extracted from nasopharyngeal and oropharyngeal swabs using the QIAamp Viral RNA Mini Kit. Approximately 75 µL of viral RNA is recovered and tested immediately. The DeCoV19 Kit Triplex detects 3 regions of the N gene and the human RNase P gene. The GeneFinder COVID-19 PLUS RealAmp Kit detects regions of the N, E, and RdRp genes and an internal control. The AI-designed primers bind to the ORF3a gene, yielding a 179 bp amplicon; this primer set is identified as the UtrechtU-ORF3a primers. For the commercial kits, RT-qPCR is performed following the manufacturer's recommendations, while for the UtrechtU-ORF3a primers the SuperScript III Platinum One-Step qRT-PCR Kit, forward primer (5′-TAG CAC TCT CCA AGG GTG TTC-3′), reverse primer (3′-GCA AAG CCA AAG CCT CAT TA-5′) and TaqMan probe (5′-FAM CCT TGA AGC CCC TTT TCT CT NFQ-3′) are used. All the RT-qPCR tests are run on the QuantStudio 5 Real-Time PCR System (Applied Biosystems).
Finally, to evaluate PCR efficiency, we ran a duplicate 10-fold dilution series of viral RNA for each assay using Twist Synthetic SARS-CoV-2 RNA Control 1. We calculated the slope by linear regression and defined the required levels as efficiency > 80% and $R^2 \geq 0.98$. Thereafter, for the UtrechtU-ORF3a primers, all the samples were evaluated with a replicate to obtain an indication of clinical sensitivity (n = 2). Furthermore, we calculated the viral load of the samples according to the standard curve of Twist Synthetic SARS-CoV-2 RNA Control 1.
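The efficiency calculation described above can be sketched with the standard relation between the slope of the standard curve (Ct against log10 of the input copies) and amplification efficiency, $E = 10^{-1/\text{slope}} - 1$. The sketch assumes this standard relation is the one used. The function names are hypothetical, and any Ct values passed in would come from the dilution series rather than from this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def pcr_efficiency(slope):
    """Amplification efficiency from the standard-curve slope (1.0 means 100%)."""
    return 10 ** (-1.0 / slope) - 1.0


def copies_from_ct(ct, slope, intercept):
    """Estimate input copies for a sample Ct by inverting the standard curve."""
    return 10 ** ((ct - intercept) / slope)


def passes_criteria(slope, r_squared, min_efficiency=0.80, min_r_squared=0.98):
    """Acceptance thresholds stated in the text: efficiency > 80% and R^2 >= 0.98."""
    return pcr_efficiency(slope) > min_efficiency and r_squared >= min_r_squared
```

A slope of about -3.32 corresponds to a doubling of product per cycle (100% efficiency); shallower or steeper slopes indicate over- or under-estimated efficiency, respectively.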

Create primer Set
We downloaded 10,712 SARS-CoV-2 sequences from the GISAID repository on December 23, 2020. After removing repeated sequences, we obtained a total of 2104 sequences labeled as B.1.1.7 and 6819 sequences from other variants, for a total of 8923 samples. Next, we selected 1000 sequences for training the algorithm, of which 605 are B.1.1.7 and 395 are other SARS-CoV-2 variants. Then, we assigned label 1 to the B.1.1.7 variant samples and label 0 to the rest, and we ran the EA 10 times (Fig. 6, left). The results are reported in supplementary table 2.
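The dataset preparation described above (removing repeated sequences, labeling, and selecting a training subset) can be sketched as follows. How the 1000 training sequences were chosen is not stated, so the random draw here is an assumption, as are the function name and the representation of each record as a (sequence, lineage) pair.

```python
import random


def build_training_set(records, target_lineage, n_train, seed=0):
    """Deduplicate sequences, label them 1/0, and draw a random training subset."""
    seen = set()
    labelled = []
    for sequence, lineage in records:
        if sequence in seen:
            continue  # drop exact repeats of a sequence already kept
        seen.add(sequence)
        labelled.append((sequence, 1 if lineage == target_lineage else 0))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    rng.shuffle(labelled)
    return labelled[:n_train]
```

A call such as `build_training_set(records, "B.1.1.7", 1000)` would return up to 1000 labelled pairs. Reproducing a fixed class split such as 605/395 would need a stratified draw instead of the plain shuffle used here.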
Again, we simulated the candidate primers using Primer3Plus and EPI_ISL_601443 as the reference sequence for B.1.1.7.From the results, we selected the forward primer 5′-CAT GCT ATC TCT GGG ACC AAT-3′, because the sequence has more than 1 single nucleotide mutation, and therefore has a higher probability of being a successful specific primer.In addition, given the position of the proposed reverse primer, we could use the Y144 deletion as reverse primer 3′-TGT TGT TTT TGT GGT AAA CAC C-5′ by displacing the result 10 bp.This results in an amplicon product size of 244 bp.This primer set is identified as B.1.1.7-1.

In silico validation
Then, we simulated the candidate forward primers in Primer3Plus 28 , using the accession EPI_ISL_6590782 as reference for the B.1.1.529 (Omicron) variant. Although the sequence GAC CCA CTT ATG GTG TTG GTC resulted in a warning for high end self-complementarity, it presented 3 characteristic mutations of the B.1.1.529 (Omicron) variant 47 : Q498R (A23055G), N501Y (A23063T) and Y505H (T23075C), at positions 23,054 to 23,075 in the reference accession NC_045512.2 48 . Because a single-nucleotide mutation may not be enough to work as a specific primer for detecting SARS-CoV-2 variants, we kept this candidate and, to resolve the high end self-complementarity, increased the size of the primer by adding a base at the end (GAC CCA CTT ATG GTG TTG GTC A), which resulted in an acceptable primer candidate with a $T_m$ of 62.0 °C. We then generated the internal probe CAC CAG CAA CTG TTT GTG GA and the reverse primer CTG CCA AAT TGT TGG AAA GG, with a $T_m$ of 60.8 °C and 60.5 °C respectively, and a product size of 208 bp. This primer set is identified as B.1.1.529-1.
The B.1.1.529-1 forward primer, tested in silico on 10,907 B.1.1.529 (Omicron) sequences, including the 5 sub-lineages, shows a sensitivity of 82.17%, and a specificity of 100% when tested on 2,096,393 sequences of other SARS-CoV-2 variants. The lower sensitivity is due to the fact that several of the BA.1 samples in the repository contain sequencing errors (N) in the part of the genome that should match the forward primer. As seen in Fig. 7 (right), the reverse primer appears in almost all SARS-CoV-2 sequences. Nevertheless, both primers must be present in order to give a positive result.

Laboratory testing
To test the specificity of the primer set, raw saliva samples of approximately 1-2 ml were collected in 5 ml screw-cap containers from volunteers in the UniCoV study 32 from University College Cork. Following collection, samples were heat inactivated at 95 °C for 5 min, cooled and vortexed, and then 20 µl of saliva was added to 20 µl of Saliva Ready™ Solution in a 0.2 ml 96-well plate. The plate was vortexed and centrifuged, heated at 62 °C for 5 min and at 92 °C for 5 min, and then cooled to 4 °C. Saliva was then screened for SARS-CoV-2 (ORF1a, ORF1b and N gene) and human RNase P using the TaqMan™ 1-Step Multiplex SARS-CoV-2 Fast PCR Kit 2.0. Positive SARS-CoV-2 samples were subsequently screened for the presence of the Omicron variant using our specific Omicron primer set (B.1.1.529-1). Briefly, cDNA was synthesised using saliva from the Saliva Ready™ step above with LunaScript RT-Supermix (NEB): primer annealing at 25 °C for 2 minutes, cDNA synthesis at 55 °C for 10 minutes, and denaturation at 95 °C for 1 minute.
Finally, qPCR was performed using the cDNA from above with the Luna Universal Probe qPCR mix (NEB), with Omicron-specific primers and N1 (2019-nCoV RUO) primers/probes (Integrated DNA Technologies (IDT)) at a concentration of 500 nM, with probes at 250 nM (FAM-labelled). PCR conditions were denaturation at 95 °C for 1 min and 40 cycles of denaturation at 95 °C for 15 seconds and anneal/extension at 60 °C for 30 seconds on the Applied Biosystems™ QuantStudio™ 5 Real Time PCR Instrument, 96 well, 0.2-mL block. All PCR reactions were performed in duplicate, with a technical replicate performed following initial analysis.
(1) $F(I) = w_p \cdot \mathrm{Specificity}(I) + w_c \cdot \mathrm{GC}(I) + w_n \cdot N_{val}(I) + w_t \cdot \mathrm{Temp}(I)$

Specificity(I) evaluates the presence of the sequence selected as candidate primer I inside training samples labeled with the variant of interest, and its absence from samples of other variants; T is the number of samples in the training set, and $s_i$ is the i-th sample in the training set. Function P is defined as:

$P(I, s_i) = \begin{cases} 1, & \text{if } I \text{ is found inside } s_i \text{ and } L(s_i) = L(s_k) \\ 0, & \text{otherwise.} \end{cases}$

For the GC content term, each base b contributes:

$\begin{cases} 1, & \text{if base } b \text{ is C or G} \\ 0, & \text{otherwise.} \end{cases}$

Figure 3. Summary of the procedure to create, validate and test a primer set designed using evolutionary algorithms.

Figure 4. Summary of the EA where we expect to find a target 21-bps sequence in position p and in sample k to be used as a forward primer.
Fig. 5 (right) shows the expected result: the created primer appears in almost all sequences from B.1.1.7, B.1.1.529 and other lineages.

Laboratory testing
Next, in the Hospital Civil de Guadalajara, México, we select 20 samples from patients to evaluate the AI-designed RT-qPCR tests. The molecular diagnosis in this hospital was carried out with the DeCoV19 Kit Triplex (Genes2Life), endorsed by the Institute of Epidemiological Diagnosis and Reference (InDRE) in Mexico. We choose 15 positive and 5 negative nasopharyngeal and oropharyngeal swabs from patients with acute respiratory tract infection. The samples are tested with the DeCoV19 Kit Triplex, the GeneFinder COVID-19 PLUS RealAmp Kit, and the AI-designed UtrechtU-ORF3a primers.

Figure 5. (Left) Cost function in 20 runs of the EA for 100 generations to find a forward primer in the main lineage. (Right) Percentage of appearance of the primer set in the 2,107,300 SARS-CoV-2 sequences.

Figure 7. (Left) Cost function in 10 runs of the EA for 100 generations to find a forward primer in the variant B.1.1.529. (Right) Percentage of appearance of the primer set in the 2,107,300 SARS-CoV-2 sequences.
The TaqMan™ 1-Step Multiplex SARS-CoV-2 Fast PCR Kit 2.0 was run on the Applied Biosystems™ QuantStudio™ 5 Real Time PCR Instrument, 96 well, 0.2-mL block (Thermo Fisher Scientific) according to the manufacturer's instructions, including a positive control: reverse transcription at 53 °C for 5 min, preincubation at 85 °C for 5 min, and activation at 95 °C for 2 min, followed by 40 cycles of denaturation at 95 °C for 1 second and anneal/extension at 62 °C for 30 seconds.