A small interfering RNA (siRNA) database for SARS-CoV-2

Coronavirus disease 2019 (COVID-19) rapidly transformed into a global pandemic, for which a demand for developing antivirals capable of targeting the SARS-CoV-2 RNA genome and blocking the activity of its genes has emerged. In this work, we presented a database of SARS-CoV-2 targets for small interference RNA (siRNA) based approaches, aiming to speed the design process by providing a broad set of possible targets and siRNA sequences. The siRNAs sequences are characterized and evaluated by more than 170 features, including thermodynamic information, base context, target genes and alignment information of sequences against the human genome, and diverse SARS-CoV-2 strains, to assess possible bindings to off-target sequences. This dataset is available as a set of four tables, available in a spreadsheet and CSV (Comma-Separated Values) formats, each one corresponding to sequences of 18, 19, 20, and 21 nucleotides length, aiming to meet the diversity of technology and expertise among laboratories around the world. A metadata table (Supplementary Table S1), which describes each feature, is also provided in the aforementioned formats. We hope that this database helps to speed up the development of new target antivirals for SARS-CoV-2, contributing to a possible strategy for a faster and effective response to the COVID-19 pandemic.

www.nature.com/scientificreports/ based on siRNAs with a focus on SARS-Cov-2. One of them (Gu et al 27 ) performed in vitro and in vivo experiments (Syrian hamster and Rhesus macaque) with siRNA that targets RNA-dependent RNA polymerase (RdRp) gene. Two other works that developed siRNAs to target ORF-1 28 and RdRp 29 genes also performed, respectively, in vitro and in vivo experiments. All studies reported effective gene suppression activity of SARS-CoV-2 suggesting a promising approach for treating COVID-19. Furthermore, as occurred during the SARS-CoV epidemic in 2003 30,31 , many patent applications for SARS-CoV-2 32 Table S3). A critical step in the development of RNAi-based therapies is the design of siRNAs. To find potential regions in diverse coronaviruses with matches to SARS-CoV-2, identifying many of them in SARS-CoV, the closest homolog, researchers 34 used Immune Epitope Database and Analysis Resource (IEDB) 34 . Chen et al 35 applied a window of 3000 nucleotides with a step of 1500 over the reference SARS-COV-2 genome seeking 1-25nt regions called 'free segments' . Besides, siRNAs databases targeting a broad range of viruses [36][37][38] have been developed. Recently, researchers developed a SARS-CoV-2 oligonucleotide sequence database, to improve the SARS-CoV-2 detection and treatment methods, providing sequences with the lowest and highest conservation levels 39 .
In this work, we presented a SARS-CoV-2 targets database to support the development of siRNA approaches and speed up RNAi design, by providing a set of possible targets and siRNA sequences with the required information for choosing the most appropriate targets for new siRNAs. Unlikely cited databases, which are manually curated and provide only a small set of siRNAs chosen for specific targets (see siRNA computational identification and design papers at Supplementary Table S3), we apply a sliding-window approach for covering whole SARS-CoV-2 genomic space, extracting every possible siRNA sequence of 18, 19, 20, and 21 nucleotides. This methodology generated a comprehensive database that enables researchers to assess solutions capable of targeting any region of the virus but also to select homologous regions between the circulating variants. It also enabled 100% of matches with siRNAs published by similar works (see column % of siRNAs present in the proposed database at Table S3). The database presents more than 170 features, including thermodynamic information, base context, target genes, and alignment information against diverse SARS-CoV-2 strains, together with scores and predictions collected from three siRNA efficiency prediction tools. It is worth mentioning that the various laboratories around the world have distinct expertise and goals for siRNAs development, therefore, all this coordinated information will enable users to select, with higher confidence, targets that best match a broad set of conditions for designing even more efficient siRNAs.

Results
Database analysis and statistics. The proposed database displays a total of 119,526 siRNAs divided into four different sizes, ranging from 18 to 21 nucleotides (see Table 1 for the number of siRNAs of each length), and covering more than 170 features (Supplementary Table S1 describes each feature). The column Annot, for example, indicates which gene (or genes) a siRNA can target, and should be consulted if the user wants to design siRNAs focused on inhibiting the activity of a single gene or a group of overlapping genes. Figure 1 provides the distribution of 21nt siRNAs across the twenty-most siRNA-abundant SARS-CoV-2 genes. It can be noticed that gene overlapping pp1ab,pp1a,nsp3 comprises about 20% of all siRNAs (5811), more than the double of the next most siRNA-abundant gene, the gene overlapping pp1ab, Pol, which can be targeted by 2732 siRNAs (about 9% of all 21nt ones). Otherwise, gene overlapping S_glycoprotein, Spike_protein_S2 and nsp8 holds the lowest number of siRNAs: 366 and 335, respectively (about 1% each).
We also aligned all siRNAs to the Human genome, Human coding and non-coding transcriptomes, SARS, MERS, and H1N1 genomes, with Bowtie version 1.1.0, to identify if siRNAs could off-target regions in those organisms, thus presenting cross-reactivity with them. According to Yamada et al. 40 , a minimum of three mismatches against the human genome is necessary to guarantee that the siRNA will not anneal off-target regions, hence increasing its effectiveness. Figure 2a-d illustrate the growth of siRNAs quantities as the minimum number of necessary mismatches to have alignment increases. The results show that virtually all 18nt siRNAs can match the human genome and transcriptomes (coding and non-coding) with three mismatches. This number, however, increases to four mismatches considering 19nt and 20nt (Fig. 2b, c), and to five considering the 21nt length (Fig. 2d). It can be observed that in each length, about 2500 siRNAs match perfectly with some region of SARS. Regarding MERS and H1N1, about 2500 18nt siRNAs can match regions of those viruses, when the minimum number of allowed mismatches is two (Fig. 2a). This number is overcome by 19-21nt siRNAs only when the number of mismatches is increased to three (Figs. 2b-d). Finally, it is also important to note that while all 18nt www.nature.com/scientificreports/ and 20nt siRNAs match some regions from MERS, SARS, and H1N1 using at least six mismatches, the number of mismatches increases to seven for 19nt and 21nt siRNAs.
To analyze the siRNAs' effectivity to address strains sets from different populations, other alignments with Bowtie version 1.1.0 were performed, this time with siRNAs against SARS-CoV-2 strains available at the Global Initiative of Sharing All Influenza Data (GISAID) database (https:// www. gisaid. org/), coming from nine countries (see Methods). This analysis indicate, for example, which siRNAs are more suitable for a specific country, given its matches with the strains. The columns BV to CD, CN to CV, and DF to DN from database spreadsheet files (see Supplementary Table S1) provide the number of genomes from each strains country set that natural sense, synthetic sense, and antisense respectively have a perfect match with (the total number of strains genomes is available at columns' headers).  The majority of siRNAs encloses more than 95% of the strains from all countries. In spite of that, a lesser but significant portion of siRNAs encloses between 50 and 95% of the strains from all countries, with China and Brazil presenting the highest numbers (6796 and 5881 siRNAs, respectively). Russia has the smallest portion of such siRNAs (524 siRNAs). Figure 3b displays an intersection matrix from siRNAs that enclose more than 95% of the strains from each country, providing the number of siRNAs that have a match simultaneously with each country pair. Brazil and China produce similar profiles: while the former shares between 18.000 and 24.000 siRNAs with more than 95% of country coverage with other countries, China shares between 18.000 and 23.000 siRNAs. On the other hand, England's pairs vary between 20.000 and 26.500 siRNAs, and it is possible to see a "cluster" formed between Germany, Italy, Russia, Spain, and USA, wherein each pair shares more than 27.000 siRNAs. These results suggest that many of the 21nt siRNAs in our proposed database have potential to be used worldwide, given such sharing power.
Supplementary Table S2 shows the number of 21nt siRNAs with more than 95% of coverage against the variants presented in each country. It can be noticed that same number of siRNAs (570) against the overlapping genes pp1ab, pp1a, nsp8 is observed across all countries but China (where it has 410), and the overlapping genes pp1ab, pp1a, 3CL-PRO, across all countries (with 894 siRNAs in each one) except China (828 siRNAs) and Russia (873). Otherwise, the overlapping genes pp1ab, pp1a, nsp10 present the same number of siRNAs (393) in all countries but China (358) and Spain (334). Although Russia and China do not share the same number of siRNAs in any gene, Russia does with Wuhan in the overlappings pp1ab, ExoN (1557), pp1ab, nsp16 (894), and pp1ab, pp1a, nsp10 (393 siRNAs). Besides, the number of siRNAs with more than 95% of coverage in Russia that target S_glycoprotein is the same number of at least one other country. These results indicate that it is possible, with the proposed database, to screen for siRNAs with effectivity directed to a specific gene (or group of overlapping genes) with either a potential global application or to a specific set of populations of interest.
The possible toxicity of a siRNA in humans is an important aspect that must be taken into account during design processes, and can be handled at diverse levels, from the molecular one to the issue of having off-targeting capabilities (as previously mentioned and discussed). Regarding sequence level, where the proposed database is located, a set of proposals can be found in the literature 40,41 related to how to assess the toxicity of a siRNA. Seeking to evaluate the toxicity level of 21nt siRNAs from the proposed database, to check its ability to handle these issues, we applied a filtering set of four masks based on previous works 40,41 (see Table 2, section Toxicity). A total of 26,629 siRNAs (about 89% of the 21nt database) were considered toxic, and 3251 atoxic (about 11%). Country coverage of atoxic siRNAs (Supplementary Figure S1a,b) follows the same visual pattern from Fig. 3, this time with pairs involving Brazil or China each one varying from 1,800 to 2,500 siRNAs; England, from 2,000 to 2,900, and the "cluster" formed between Germany, Italy, Russia, Spain, and USA, with more than 2,800 siRNAs. Their distribution across Top 20 siRNA-abundant SARS-CoV-2 genes (Supplementary Figure S2) repeats the pattern of overlappings pp1ab, pp1a, nsp3 and pp1ab, Pol with the highest quantities (436 and 330, respectively), and now nsp8 gene is the third least, covering 64 siRNAs (overlappings pp1ab, nsp16, and pp1ab, pp1a, nsp9 fill the list with 62 and 59 siRNAs, respectively).
As stated in Methods, we applied over all the siRNAs three efficiency prediction tools to assess their inhibition power. Figure 4 illustrates the number of 21nt antisense siRNA sequences predicted as effective by every single predictor, and the quantities predicted by more than one. It can be seen that no siRNA was unanimously considered effective, while approximately 53% of them (15,821 siRNAs) were considered as such just by SSD. Besides, a single siRNA was predicted as effective by both ThermoComposition21 and si_shRNA_selector.
The literature [40][41][42][43][44][45][46] regarding siRNA effectiveness indicates that three main characteristics should be considered during the design processes: Toxicity, Stability, and Effectivity. We examined whether these criteria could be assessed using the variables of our proposed database. Table 2 addresses the mapping between the features of the database and literature filtering criteria. It is possible to see that there is at least one available feature for    www.nature.com/scientificreports/ Database use and access. The proposed database is distributed as a set of five files, available in spreadsheet and CSV formats. One of them is a metadata table, containing the description of each column (Supplementary Table S1 replicates such table), and the remaining ones correspond to target region sequences of a specific length. Here we will present how a researcher can use this database with an illustrative example. Suppose a user wants to select siRNAs with 21 nucleotides length. In this case, the user will access either the file 21bases.xlsx or the file 21bases.csv (for purposes of this example, it will be considered that the user has accessed 21bases.xlsx).
After opening it in a spreadsheets editor, the next step is selecting siRNAs whose properties match the user requirements. Assume that the user wants a siRNA that has little or no homology with the human genome, can act over as much as possible British SARS-CoV-2 strains, and its first dinucleotide is AA. This last requirement is achievable by applying a filter over column sAA (see Supplementary Text S1 and Supplementary  Table S1) can be filtered to display only the three highest values, for example, which reduces candidates from 999 to 10 candidates. Such a reduction not only saves wet-lab test costs but also ensures that selected siRNAs meet the main user requirements.

Discussion
Small interfering RNAs (siRNAs) are double-stranded non-coding RNA molecules of 18-25 base pairs long, which regulate the expression of genes by a phenomenon known as RNA interference (RNAi). Although their therapeutic use had imposed many challenges to overcome limited effectiveness and potential toxicity effects in the early applications 12,47 , several siRNAs have been developed as potential therapies against viral infections with limited treatment options and accessible target cells, like hepatitis B virus 48 , Ebola 49 , and respiratory syncytial virus 50 . For SARS-CoV, the effect of prophylactic and therapeutic activities of siRNAs in Rhesus monkeys (Macaca mulatta) was evaluated in Li et al 22 . The researchers used two duplex siRNAs, targeting the SARS-CoV genome in the spike and NSP12 protein-coding regions. They found that siRNAs provide relief of fever caused by SARS-CoV infection, reducing the viral load and decreasing acute diffuse alveolar damage. In addition to proving the effectiveness of these siRNAs in prophylactic and therapeutic activity, the experiments did not show any signs of toxicity related to the use of siRNAs as therapy.
Regarding SARS-CoV-2, three different in vitro and in vivo studies [27][28][29] successfully reported siRNAs application as a potential treatment for COVID-19 (see Supplementary Table S3, Pre-clinical or human-clinical test studies line). Furthermore, researchers have applied for a large number of patents related to vaccines or drug development reported siRNAs targeting M, N, and E protein genes, PI4KB, RdRp, ORF3a gene, among others (reviewed in [30][31][32] ). As expected, many of these and new studies have been reversed in patent applications applied to COVID-19. Two patents reported a preparation method of a CoViD-19 antisense RNA multivalent vaccine (CN111330003A) and a dsRNA vaccine (CN111321142A), targeting ORF1ab, 3′ UTR, and S, E, M, or N genome region. For the development of biological medicines aiming to prevent and/or treat Covid-19, siRNAs were developed for conserved regions of the SARS-CoV-2 (CN111139241A and CN111139242A). According to the authors, siRNA modified by the invention has an obvious inhibition effect on the gene, with a great clinical significance for treating COVID-19 pneumonia. Also applying siRNA molecules, other patents report the suppression of SARS-CoV-2 replication, by targeting the ORF1 (ORF1a, ORF1b) and N genes (RU2733361C1) or described an effective inhibition of the expression of virus key protein by targeting the gene sequence of RDRP enzyme or S protein (CN111518809A). Meanwhile, biotechnology companies have invested heavily to consolidate RNAi therapeutics for COVID-19. A collaborative team of Alnylam Pharmaceuticals and Vir Biotechnology developed an aerosolized delivery of siRNAs optimized for lung uptake, and are conducting in vitro and in vivo tests, whereas Sirnaomics perform preclinical studies with a respiratory-specific siRNA formulation that is delivered by a customized handheld nebulizer device 51 . Based on these studies and evidence, we believe that siRNA-based therapies are a promising tool for fighting epidemics. Furthermore, the siRNAs experimentally validated in above papers and also presented in our bank indicate that the proposed database can effectively help to achieve this goal.
Despite the exciting results obtained by this technique, researchers still face many challenges, and one of the most critical is the avoidance of nonspecific toxicity in therapeutic applications. According to Setten et al. 12 , the main sources of toxicity that have considerably affected clinical RNAi drug development are related to (1) Immunogenic reactions to dsRNA, (2) Toxicity of excipients, (3) Unintended RNAi activity, and (4) On-target RNAi activity in non-target tissues. Some of these problems are largely mitigated by the development of excipients limited to a small number of chemical components that are individually verified for low toxicity, or by choosing specific delivery routes, like lipidic nanoparticles complexes and other non-viral vectors 52,53 . Although the sequence features from siRNAs are insufficient to evaluate the assertion of delivery at the intended target area, it is an essential information to evaluate efficiency and possible toxic effects 43,44 . In this paper we worked on specific parameters based on siRNAs sequence features, that evaluate molecule stability and its potential to interact with off-target regions and pathways from human coding and non-coding transcriptome. Careful evaluation of these parameters will help to optimize the design and effective development of SiRNAs for each given objective.
The design of siRNAs is a challenging procedure because sometimes minor changes in its nucleotide sequence can alter its functionality 42 . As reported by Alagia et al 54 , specificity, potency, and efficacy of siRNA-mediated gene silencing can be determined by analyzing the siRNA nucleotide sequence, hence its inability to bind to unintended regions (off-targets) is an important factor that must be strongly taken into consideration. Therefore, we proposed a SARS-CoV-2 targeted siRNAs database with sequence and thermodynamic stability information, www.nature.com/scientificreports/ to help the evaluation of important factors related to their efficacy and optimize the decision process towards choosing the best ones as target antiviral solutions. Considering that each laboratory has its own technology context and expertise in designing siRNAs of specific lengths, we provide a list of siRNAs varying from 18 to 21 nucleotides-length, aiming to meet the range of possible lengths used in the design process. The analysis of 21nt siRNAs showed the overlapping genes with most siRNAs (5811, 20% of the total number) involve pp1ab. Once this gene covers about 70% of the virus genome (21287nt) 3 , it is natural that most of the siRNAs fall in it. Thus, the database gives the option to screen either for siRNAs with higher or lesser genespecificity, in which the higher the number of overlapping genes that a siRNA can target, the higher the chances are to it be more effective because a larger set of viral functions will be compromised. On the other hand, gene nsp8 covers only about 2% of the genome (594nt) 3 , which may explain the reduced number of siRNAs that target it. The gene nsp3, which is present in the most target pp1ab, participates in the process of viral transcription and replication 55,56 . Since gene distribution analysis of siRNAs considered atoxic (see Results and Figure S2) revealed the same distribution pattern from the whole dataset, it can be suggested that nsp3 is a good target candidate for siRNAs design and development, given its abundance and function.
An early 2020 variation analysis study 57 over SARS-CoV-2 strains from diverse countries reported homology levels between 99 and 100% for all strains. These countries presented the highest numbers of siRNA sharing pairs (Fig. 3), thus supporting the idea of high conservation areas in the SARS-CoV-2 genome. This can also resemble in Supplementary Table S2, where some genes have the same number of siRNAs that are capable to target at least 95% of strains from diverse countries. These results indicate that, although siRNAs from the proposed database can not target mutation sites from new SARS-CoV-2 strains, the fields ranging from BV to CD spreadsheet columns (see Supplementary Table S1) help to identify homology regions common to all strains from a specific country.
Numerous works have been proposing methods and guidelines for choosing the best siRNAs by analyzing their sequence characteristics 43,44,58,59 , for which two broad reviews are available 42,60 (some of them are briefly discussed at Supplementary Text S1). Given the importance of such guidelines and also the characteristics involved in their formulation, we decided to insert all this information into the database, so that users can select their best siRNAs from instructions already published, or by drafting their own rules from their expertise and specific objective. In this way, all the information contained in the database can be used in a customized and cost-effective manner. For example, our proposed database provides information regarding the bases, GC, and AU context, so as the quantities of each RNA nitrogenated base in sequences, besides information about the presence of UUUU and GCCA, considered toxic motifs 41 , so any user with a proper efficacy evaluation method (or anyone provided by literature) can easily evaluate siRNAs with this database at disposal. It also provides thermodynamic information collected from the application of three predictors 45,61,62 , thus enabling users to have a deeper look at siRNAs' properties, and choose the best ones according to their specificities. As it can be seen in Fig. 4, they have high divergence when setting a siRNA as efficient or not, which suggests that they must be used in a complementary way. Due to the genetic diversity and variability of SARS-CoV-2 63 , a siRNA that is highly efficient over one strain may not be when applied to another. Hence, we also provide information about similarity within strains from diverse countries, such that users will benefit from the opportunity of input geographical specificity and even more customization to their decision process.
Ensuring that siRNAs are not capable of targeting human sequences (off-targets) is also another important requirement, for which a minimum of three mismatches is necessary to meet it 40 . Thus, similarity information with the human genome, coding, and non-coding transcriptome, is also available in our database. As it was shown in the Database Analysis & Statistics session, virtually all 18nt-long siRNAs matched with such genome and transcriptomes with at least three mismatches, corroborating with the literature statement 40 . To the best of our knowledge, this is the first database to figure SARS-CoV-2 siRNAs similarity information against human coding and non-coding transcriptomes, giving to users even more confidence power over siRNAs specificity. To investigate how it is possible to use the database for customized efficient siRNAs selection, we have elicited from literature filtering criteria regarding siRNA activity, such as Toxicity, Stability, and Efficiency. Table 2 showed that all the listed criteria can be handled using database features, which allows users to delimit thresholds and look at features that maximize desired skills that a siRNA must have to fulfill their requirements, to specialize the search space for their needs. It is important to note, however, that this database has not its use limited to biomedical applications: users can exploit their features to other biotechnological applications that have distinct requirements and explore database information in different ways. This is a clear advantage of our database over others already developed-the fact the whole possible siRNAs set are available opens the potential for groups with diverse specialties to work with them for different applications beyond healthcare ones.
In this work, we presented a database to support the development of new target antivirals for SARS-CoV-2 using RNAi technology. We hope that the development of new antiviral products can not only be facilitated and accelerated but that the presented database helps to generate even more efficient solutions to silence the virus, contributing to the control of the pandemic. Given the urgency to provide this information for the scientific community, we made available the database as a set of tables files in spreadsheet and CSV formats, however, a webpage for more user-friendly and interactive access to the data will be released soon. Finally, it is important to stress that the approach presented here can be successfully applied for exploring the genomic information of other viruses, including the ones that may represent a threat to new pandemic events.

Methods
Although siRNAs length can vary from 18 to 25 nucleotides 64 , synthetic ones should range from 19 to 21nt 65  www.nature.com/scientificreports/ tools employed for assessing siRNAs efficiency 45,61,62 operate over sequences lying in that range, which reinforces our choice. Since they present the same columns, we explain here the development process only for the 21-length table. SARS-CoV-2 reference genome was collected from NCBI (code NC_045512) and a sliding window of 21ntlong and step 1 (this parameter is used in all tables, independently of length) were used to traverse the genome. Table 1 indicates the total number of sequences obtained for each length. Seven new sequences sets were then generated from the obtained sequences set (called target region), following the aforementioned ThermoFisher guidelines, and suggestions from our collaborators: (1) natural sense, by removing the first 5′-end dinucleotide from target region sequences; (2) oligo natural sense, replacing thymine with uracil over natural sense set; (3) synthetic sense, by replacing first 3′-end dinucleotide from natural sense sequences with TT; (4) oligo synthetic sense, by replacing of thymine with uracil over synthetic sense sequences; (5) antisense, from the reverse complement of target region; (6) oligo antisense, by replacing thymine with uracil over antisense sequences; and (7) oligo antisense rev set, by reversing antisense sequences.
Natural sense sequences were then aligned against (a) SARS-CoV-2 reference genome, to verify which genes they align with; (b) the human genome (NCBI accession code GRCh37) and coding and non-coding transcriptome, to verify potential cross-reaction with off-target transcripts; (c) SARS-CoV-2 strains available at Global Initiative of Sharing All Influenza Data (GISAID) database (https:// www. gisaid. org/) coming from Brazil, China (Wuhan region only and whole country less Wuhan), England, Germany, Italy, Russia, Spain and USA; and (d) reference genomes of MERS virus (NCBI MG987420), SARS-CoV (NCBI NC_004718) and Influenza virus genome (NCBI NC_026438), aiming to assess whether siRNAs are capable to target regions from those viruses and strains. Bowtie 67 version 1.1.0 was used as the aligner, using the following flags: -a, -S, --pairtries equals to 4, -p equals to 40, -n equals to 3, -l equals to 7 and -f. Flag -e was used, being equals to 150 when aligning against the human genome, 10 for GISAID strains, and 220 for remaining genomes. Sequence properties regarding base context and alignment information were calculated from the above sequences sets and performed alignments (see Supplementary Text S1). Thermodynamic information and expected efficiency of candidates siRNA designed for targeting those regions was calculated with OligoCalc 68 and three predictors, namely ThermoComposition21 61 , SSD 62 , and si-shRNA Selector 45 . Finally, we elicited criteria filtering sets from literature regarding (a) Toxicity, (b) Efficiency, (c) Populations and genomes coverages, (d) Stability, and (e) Effectiveness prediction, which are summarized at Table 2, and then analyzed whether they could be assessed with the features of proposed database.