Design and in silico validation of polymerase chain reaction primers to detect severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

Accurate design of polymerase chain reaction (PCR) primers targeting conserved segments in viral genomes is desirable for preventing false-negative results and decreasing the need for standardization across different PCR protocols. In this work, we designed and describe a set of primers and probes targeting conserved regions identified from a multiple sequence alignment of 2341 Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genomes from the Global Initiative on Sharing All Influenza Data (GISAID). We subsequently validated those primers and probes in 211,833 SARS-CoV-2 whole-genome sequences. From these analyses, we obtained nine systems (forward primer + reverse primer + probe) that potentially anneal to highly conserved regions of the virus genome. In silico predictions also indicated that those primers do not bind to nonspecific targets in human, bacterial, fungal, and apicomplexan sequences, or in other Betacoronaviruses and less pathogenic sub-strains of coronavirus. The availability of these primer and probe sequences will make it possible to validate more efficient protocols for identifying SARS-CoV-2.

Table 1. Primers designed in this study. The percentage of the total number of sequences that anneal without mismatches or allowing 10% mismatches is shown in parentheses. F forward primer; R reverse primer; P probe; Tm melting temperature; GC% G + C percentage; SC self complementarity; SC 3' self 3'-complementarity; No mis number of sequences that anneal to the primer without mismatches; 10% mis number of sequences that anneal to the primer allowing 10% mismatches.

For the B.1.1.7 variant, the primers UFRN_3, UFRN_5, and UFRN_8 annealed to the vast majority of its sequences (Table 3). Still, two primers (2019-nCoV_N2 and nCoV_IP2-12669Fw) from the PD_primer set had the same performance as the three UFRN_primers mentioned above (Table 4). Concerning specificity, both primer sets performed well. Tests allowing 20% mismatches against Apicomplexa targets revealed that the 2019-nCoV_N2-F / 2019-nCoV_N2-R and UFRN_8_F / UFRN_8_R primer pairs could generate 746 bp and 755 bp amplicons with Toxoplasma gondii sequences from accession codes XTG08368.2 and XM_002364674.2, respectively. The other primer pairs did not produce nonspecific amplicons at mismatch allowances between 0 and 20%.

Discussion
Early detection of pathogens is crucial to disease prevention 5 and containment, especially during epidemic outbreaks 6 . PCR is a reliable and relatively accessible molecular method that directly recognizes pathogen-derived material from patient samples 7 . However, the optimization of PCR protocols is strongly dependent on primer specificity and efficiency 8 . This dependence, combined with the increasing number of available SARS-CoV-2 sequences and the virus's growing polymorphism, led us to design a set of new primers that target highly conserved regions of the virus genome. To aid PCR optimization, the UFRN_primers were designed to have Tm values that are as close to one another as possible. This should allow at least two systems to be run with the same thermal cycling parameters. In this way, it would be possible to perform a PCR test that identifies different viral genome regions simultaneously, according to the protocols already described for the PD_primers. The UFRN_3 and UFRN_4 systems may require different thermal cycling parameters from the other systems because, in these two cases, the probe Tm is similar to that of the primers (Table 1). These systems will probably need a longer annealing time to ensure that the probe has bound to the DNA template before amplification starts.
Table 2. Primers released by WHO to detect SARS-CoV-2 using polymerase chain reaction. The percentage of the total number of sequences that anneal without mismatches or allowing 10% mismatches is shown in parentheses. F forward primer; R reverse primer; P probe; Tm melting temperature; GC% G + C percentage; SC self complementarity; SC 3' self 3'-complementarity; No mis number of sequences that anneal to the primer without mismatches; 10% mis number of sequences that anneal to the primer allowing 10% mismatches.
The higher specificity of UFRN_primers confirmed by in silico analysis is mainly due to the availability of 2,341 genome sequences, which made it possible to identify the conserved regions with greater accuracy from the alignment. The UFRN_6 and UFRN_7 primers differ by only one base and have overlapping probes. However, these slight differences were sufficient to alter the set of sequences to which these primers anneal (Table 1). Only 12 sequences did not anneal with the designed primers. Among them, seven were isolated from pangolins and [...] Another striking result is that UFRN_primers showed a higher potential than the PD_primers to identify the main recent SARS-CoV-2 variants of concern, notably B.1.351, B.1.427, B.1.429, B.1.525, and P.1. In silico predictions indicate that the UFRN_primers are potentially less prone to generating false-negative results. Their application could make a significant difference to COVID-19 diagnostics and epidemiology, since the Food and Drug Administration (FDA) has recently warned of the negative impact of SARS-CoV-2 genetic variants on available molecular detection tests 9 .
Table 3. Analysis of potential annealing (in silico PCR) of UFRN primers (UFRN_primers) to the genomes of the main SARS-CoV-2 variants. The percentage of the total number of sequences that anneal without mismatches or allowing 10% mismatches is shown in parentheses. No mis number of sequences that anneal to the primer without mismatches; 10% mis number of sequences that anneal to the primer allowing 10% mismatches.
The use of universal primers makes it possible to identify several virus variants using the same PCR protocol. UFRN_primers are strong candidates to simplify the procedures and supply chain for detecting SARS-CoV-2, allowing, for example, the mass production of primers and kits that could be applied in different parts of the world with equivalent efficiency. However, the primers presented here still depend on in vitro validation. The availability of these sequences at this time will be crucial so that these new protocols can be validated promptly to assist in the control of the SARS-CoV-2 pandemic. Another critical point is that the primers presented here were tested against the updated RNA sequence databases from bacteria, fungi, and protozoa and did not generate nonspecific amplicons in any case. Although obtained only through in silico analyses, this absence of predicted nonspecific amplicons increases the potential for applying these primers to different samples such as blood, feces, or even environmental samples. Currently, the most suitable sample for detecting SARS-CoV-2 is the human nasal swab; however, studies have already reported digestive symptoms (e.g. diarrhea and vomiting) 10,11 and other less frequent symptoms (e.g. conjunctivitis) in patients who tested positive for SARS-CoV-2 [12][13][14] . This diversity of symptoms makes clinical diagnosis difficult, and testing of new sample types may be needed quickly. The application of UFRN_primers to detect SARS-CoV-2 in blood or fecal samples is likely to be efficient, since these primers should not interact nonspecifically with RNAs of the main protozoa and bacteria that cause health problems in humans.
Quite possibly, at the time of publication of this work, a considerably larger number of additional sequences will be available, which may reveal new polymorphic sites in the target regions of UFRN_primers and PD_primers. In this way, our research group will continue this bioinformatics work, and whenever relevant, we will report new updates on the primer sequences or new primers.

Methods
Whole-genome sequences of SARS-CoV-2 from human isolates were retrieved from the Global Initiative on Sharing All Influenza Data (GISAID; gisaid.org) 15 and Virus Variation from the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/genome/viruses/variation/) 16 databases, between Mar 30 and Nov 24, 2020. To minimize sequencing errors and artifacts, we activated the filters "complete (> 29,000 bp)", "high coverage only" and "low coverage excl" at sequence retrieval in the GISAID database and the filter "Complete" under the option "Nucleotide completeness" in the Virus Variation database. The full list of authors and laboratories of GISAID submissions and the Virus Variation sequence accessions are available in Supplementary Table 3.
Complete FASTA sequences were then aligned using Clustal-Omega, version 1.2.4 17 , with standard parameters, using a supercomputer. To avoid excessive misaligned gaps and to better identify conserved and polymorphic sites, we trimmed the multiple sequence alignments (MSAs) using the trimAl tool, version 1.2 18 , with the "-automated1" option. We used the sequence from a Wuhan seafood market pneumonia virus (GenBank accession code MN908947) 19 as a reference for all alignments to identify site and region positions.
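As an illustration of how candidate conserved regions can be located in a trimmed alignment, the following minimal Python sketch scans the MSA column by column and reports runs in which every sequence carries the same unambiguous base. It is not the procedure used in this study (regions were evaluated individually in Geneious); the function name conserved_runs, the min_len threshold, and the toy alignment are hypothetical, and the coordinates are alignment columns rather than positions in the MN908947 reference.

```python
from typing import List, Tuple


def conserved_runs(alignment: List[str], min_len: int = 20) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges in which all aligned
    sequences share the same unambiguous base (A, C, G, or T)."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    runs: List[Tuple[int, int]] = []
    run_start = None
    for col in range(width):
        column = {seq[col].upper() for seq in alignment}
        # A column is conserved only if it holds a single base and no gap or N.
        conserved = len(column) == 1 and next(iter(column)) in "ACGT"
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    # Close a run that extends to the last column.
    if run_start is not None and width - run_start >= min_len:
        runs.append((run_start, width))
    return runs


# Toy alignment (hypothetical data): column 6 is polymorphic, column 10 has gaps.
toy_msa = [
    "ACGTACGTAC-GTTA",
    "ACGTACGTACAGTTA",
    "ACGTACCTAC-GTTA",
]
print(conserved_runs(toy_msa, min_len=4))
```

With real data, min_len would be chosen to fit at least one primer, and the runs would still need to be mapped back to reference coordinates.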
The CSs were submitted to online Primer-BLAST 20 to design primer pairs adopting the following criteria: PCR product size = 90-150 nt; primer melting temperatures (°C) minimum = 55, optimum = 58, maximum = 63; and maximum melting temperature (Tm) difference = 2 °C. The specificity check was performed using the complete RefSeq RNA databases for Homo sapiens (taxid: 9606), Bacteria (taxid: 2), Fungi (taxid: 4751), and Apicomplexa (taxid: 5794). We set the primer specificity stringency so that the primer must have at least 3 total mismatches to unintended targets, including at least 2 mismatches within the last 5 bps at the 3' end, ignoring targets with 5 or more mismatches to the primer. The remaining Primer-BLAST parameters were kept at their defaults to confirm the features of the newly designed primer pairs.
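The specificity stringency rule above can be expressed as a small check on a primer aligned to a candidate unintended target. The sketch below is our reading of that rule for an ungapped alignment, not Primer-BLAST's implementation; the function names, the requirement that primer and site are both written 5' to 3' on the same strand, and the example sequences are assumptions made for illustration.

```python
from typing import List


def mismatch_positions(primer: str, site: str) -> List[int]:
    """0-based positions (5' to 3' along the primer) where primer and site differ."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length (ungapped)")
    return [i for i, (p, s) in enumerate(zip(primer.upper(), site.upper())) if p != s]


def is_potential_off_target(
    primer: str,
    site: str,
    min_total: int = 3,          # at least 3 total mismatches required
    min_three_prime: int = 2,    # at least 2 of them near the 3' end
    three_prime_window: int = 5, # "last 5 bps at the 3' end"
    ignore_at: int = 5,          # targets with 5 or more mismatches are ignored
) -> bool:
    """True if the site would be flagged as an unintended target under the rule."""
    mismatches = mismatch_positions(primer, site)
    total = len(mismatches)
    if total >= ignore_at:
        return False  # too divergent to be considered at all
    window_start = len(primer) - three_prime_window
    three_prime = sum(1 for i in mismatches if i >= window_start)
    primer_is_specific = total >= min_total and three_prime >= min_three_prime
    return not primer_is_specific


# Hypothetical 20-nt primer and three candidate sites.
primer = "ACGTACGTACGTACGTACGT"
perfect_site = "ACGTACGTACGTACGTACGT"        # 0 mismatches
three_prime_diverged = "ACATACGTACGTACGTAAGA"  # 3 mismatches, 2 in the last 5 bases
five_prime_diverged = "TGATACGTACGTACGTACGT"   # 3 mismatches, none in the last 5 bases

for site in (perfect_site, three_prime_diverged, five_prime_diverged):
    print(site, is_potential_off_target(primer, site))
```

The third example shows why the 3' window matters: three mismatches at the 5' end alone do not satisfy the rule, so that site is still flagged.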
From all the primers generated by Primer-BLAST, we selected 124 primer pairs that presented low self-complementarity for total annealing (max 5 nt) and also for annealing in the 3' region (max 3 nt). After individual evaluation using the Geneious suite (version 9.1.8, 2017), we selected 9 primer pairs that target regions with 100% identity among all 2143 initial genomes. These primers target the ORF1a, ORF1b, and S regions of the SARS-CoV-2 genome. TaqMan probes for each primer pair were also designed using the same alignment, prioritizing conserved regions inside each of the predicted amplicons.
To compare and assess the annealing specificity of the already used and newly designed primers and probes, we used three different tools: PrimerSearch version 6.6.0 from the EMBOSS package 21 , the stand-alone BLAST+ 22 , and the online Primer-BLAST. For the first two tools, we used five different custom databases: (1) SARS-CoV-2 sequences from GISAID (211,833 genome sequences retrieved on Nov 24, 2020), with the filters mentioned earlier activated; (2) SARS-CoV-2 sequences from Virus Variation; (3) RefSeq RNAs from the Apicomplexa taxon, retrieved from GenBank on Mar 30, 2020; (4) RefSeq RNAs from the Toxoplasma taxon, also from GenBank (Mar 30, 2020); and (5) [...]. The first step of the specificity test was to search all 5' and 3' primer pair sequences with PrimerSearch against each of the databases mentioned above to check for possible amplicons. We used three different mismatch allowance percentages (0, 10, and 20%). For the probe similarity searches, we also evaluated the number of subject-sequence hits from standalone BLAST+, the aligned start and end positions, and the number of mismatches in each alignment.
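The idea behind a mismatch-tolerant amplicon search can be sketched in a few lines of Python. This is a simplified illustration, not PrimerSearch itself: it searches only one template orientation, ignores gaps, and assumes that the allowed number of mismatches is the primer length times the percentage, rounded down (PrimerSearch's own rounding may differ). All names and sequences below are hypothetical.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def allowed_mismatches(primer: str, mismatch_pct: int) -> int:
    # Assumption: round down, e.g. 10% of a 20-nt primer allows 2 mismatches.
    return len(primer) * mismatch_pct // 100


def binding_sites(primer: str, template: str, mismatch_pct: int) -> List[int]:
    """Start positions where the primer matches the template within the limit."""
    primer, template = primer.upper(), template.upper()
    limit = allowed_mismatches(primer, mismatch_pct)
    sites = []
    for start in range(len(template) - len(primer) + 1):
        window = template[start:start + len(primer)]
        mismatches = sum(1 for p, t in zip(primer, window) if p != t)
        if mismatches <= limit:
            sites.append(start)
    return sites


def predicted_amplicon_sizes(forward: str, reverse: str, template: str,
                             mismatch_pct: int = 0) -> List[int]:
    """Sizes of amplicons bounded by a forward site and a downstream reverse site."""
    fwd_sites = binding_sites(forward, template, mismatch_pct)
    # The reverse primer anneals to the opposite strand, so its site on the
    # given strand is its reverse complement.
    rev_sites = binding_sites(reverse_complement(reverse), template, mismatch_pct)
    sizes = []
    for f in fwd_sites:
        for r in rev_sites:
            if r >= f + len(forward):  # reverse site must lie downstream
                sizes.append(r + len(reverse) - f)
    return sorted(sizes)


# Toy data (hypothetical): a 42-nt template and a variant with one
# substitution inside the forward primer site.
forward = "ACGTGACCTGAA"
reverse = "TCCGATGCTAAG"
toy_template = "TTTT" + forward + "G" * 10 + reverse_complement(reverse) + "TTTT"
variant_template = toy_template[:6] + "T" + toy_template[7:]

for pct in (0, 10, 20):
    print(pct, predicted_amplicon_sizes(forward, reverse, variant_template, pct))
```

In this toy case, the single substitution removes the amplicon at 0% mismatches but not at 10%, which is the kind of difference the "No mis" and "10% mis" columns of the tables capture.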
The genome sequences of the B.1.1.7, B.1.351, B.1.427, B.1.429, B.1.525, and P.1 variants were retrieved from the GISAID database with the following filters activated: "complete sequence", "excl low coverage", "high coverage", and "w/ patient status". The total number of sequences for each variant was: 1931 for B.1.1.7, 495 for B.1.351, 94 for B.1.427, B.1.429, and B.1.525, and 177 for P.1. The primer pairs were aligned with each set of sequences using PrimerSearch, with the parameters of 0% mismatches and 10% mismatches allowed. The results were processed and recorded for each primer pair and variant using a custom shell script.
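The tabulation step can be pictured with the short Python sketch below, which turns per-variant hit counts into the "count (percentage)" form used in the tables. It does not reproduce the custom shell script; the variant totals are taken from the text above, whereas the primer pair name and the hit counts are placeholders and not results of this study.

```python
from typing import Dict, Tuple

# Sequence totals per variant, as stated in the Methods.
VARIANT_TOTALS: Dict[str, int] = {"B.1.1.7": 1931, "B.1.351": 495, "P.1": 177}


def summarize_hits(hits: Dict[Tuple[str, str], int],
                   totals: Dict[str, int]) -> Dict[Tuple[str, str], str]:
    """Format each (primer pair, variant) hit count as 'count (xx.xx%)'."""
    summary = {}
    for (pair, variant), count in hits.items():
        total = totals[variant]
        if not 0 <= count <= total:
            raise ValueError(f"hit count {count} out of range for {variant}")
        summary[(pair, variant)] = f"{count} ({100 * count / total:.2f}%)"
    return summary


# Placeholder hit counts for a hypothetical primer pair (not real data).
example_hits = {
    ("example_pair", "B.1.1.7"): 1900,
    ("example_pair", "B.1.351"): 495,
    ("example_pair", "P.1"): 0,
}

for key, cell in summarize_hits(example_hits, VARIANT_TOTALS).items():
    print(key, cell)
```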

Data availability
The sequences utilized during the current study are publicly available in the GISAID (https://www.gisaid.org/) and Virus Variation (https://www.ncbi.nlm.nih.gov/genome/viruses/variation/) databases. Sequence codes are available in Supplementary Material 3. Any other data or protocol is available upon request to the corresponding author.