Analysis of the full genome sequences of SARS-CoV-2 isolates to determine antigenic proteins and epitopes to be used for the development of a vaccine or a diagnostic approach for COVID-19

In genome of SARS-CoV-2, the 5’-terminus encodes a polyprotein (pp1ab), which is further cleaved into 15 non-structural proteins (nsp-1 to nsp-10 and nsp-12 to nsp-16) whereas the 3’ terminus encodes four structural proteins (spike, envelope, membrane, and nucleocapsid) and eight accessory proteins (3a, 3b, p6, 7a, 7b, 8b, 9b, and orf14). Among these 27 proteins, the present study aimed to discover likely antigenic proteins and epitopes to be used for during the development of a vaccine or serodiagnostic assay using a reverse vaccinology in silico approach. For this purpose, after the full genome analyses of SARS-CoV-2 isolates, viral surface proteins including spike, envelope and membrane proteins as well as proteins with predicted signal peptide were determined as probable vaccine candidates whereas the remaining were considered as possible antigens to be used during development of serodiagnostic assays. According to results, the phylogenetic analysis of SARS-CoV-2 isolates from 31 different countries showed two significant clusters in which one was clustered with China-Wuhan and the other one with China-Yunnan isolates. During the analyses, 105 SNPs were identified that resulted in change in 70 amino acid positions. Among the 27 proteins, 26 of them were predicted as probable antigen, except nsp-16. In 26 proteins, spike protein was selected as the best vaccine candidate because of having a signal peptide, negative grand average of hydropathicity value, one transmembrane helix, moderate aliphatic index, a big molecular weight, a long-estimated half-life, beta wrap motifs as well as having a stable, soluble and non-allergic features. In addition, orf7a, orf8 and nsp-10 proteins were considered as potential vaccine candidates because of having signal peptides. Nucleocapsid protein and a highly antigenic GGDGKMKD epitope of nucleocapsid protein were identified as ideal antigens to be used in development of serodiagnostic assays. Moreover, considering MHC-I alleles, highly antigenic KLNDLCFTNV


Introduction
Coronaviruses belonging to the family Coronaviridae and the order Nidovirales are a large family of enveloped positive-strand RNA viruses. Coronaviruses are zoonotic pathogens that infect both 3 animals and humans, and may cause diseases in intestinal, liver, respiratory, and nervous systems.
In genome of SARS-CoV-2, the 5'-terminus encodes a polyprotein (pp1ab), which is further cleaved into 15 non-structural proteins (nsp-1 to nsp-10 and nsp-12 to nsp-16) whereas the 3' terminus encodes four structural proteins including spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins and eight accessory proteins (3a, 3b, p6, 7a, 7b, 8b, 9b, and orf14) 1 . Comparative genomic analyses have revealed that SARS-CoV-2 shared more nucleotide homology with SARS-CoV than MERS-CoV 2,3 . SARS-CoV-2 has been also suggested as more adaptive to humans with a higher mutation rate than SARS-CoV 4 . The novel coronavirus, SARS-CoV-2 causing COVID-19 has been first reported in Wuhan City of Hubei Province in China on December, 2019 and then, COVID-19 has spread from China to 211 different countries with more than 2.8 million cases including more than 195 thousand deaths in record time 5 .
Since it has been known that no specific therapeutic agents that target SARS-CoV-2 are currently available, the development of an urgent vaccine against SARS-CoV-2 is inevitable. In vaccine development, traditional or recombinant vaccine methods are being used. Traditional approaches, based on inactivated or live attenuated viruses, can be applied for vaccine development but it has been reported that these approaches have some limitations such as being time-consuming, having problems in the production of non-abundant proteins and pathogens 6 . This condition prevents the development of new vaccines against pathogens causing the outbreaks leading to pandemic. On the other hand, to overcome these problems, new recombinant vaccine development strategies allowing several genes obtained from different pathogenic agents to be cloned, expressed and purified to be used as vaccine candidates are applied 7 . During new recombinant vaccine design, reverse vaccinology (RV) in silico approach provides detailed preliminary prediction about vaccine 4 candidates by using genome sequences that can be ultimately translated into proteins. Utilisation of RV in silico approach is rather crucial because of offering a prediction for the antigenicity, the epitope regions of B and T cells as well as other parameters such as signal peptide, subcellular localisation, and solubility about targeted proteins 6,8 . Currently, docking analysis demonstrating the binding among predicted epitopes and selected alleles of MHC-I and MHC-II became an important part of RV in silico approach 6 . Result obtained from in silico prediction has utmost importance for preventing the failures that can be encountered at the end of wet-lab studies or even late stages of clinical trials.
In this context, the present study aimed to analyse the full genome of SARS-CoV-2 isolates from different countries compared with a seed genome (reference isolate Wuhan-Hu-1; Accession number: NC_045512.2) in order to determine the nucleotide variation(s). Secondly, we aimed to discover likely antigenic proteins and epitopes to be used for the development of a vaccine candidate or serodiagnostic assay using a RV in silico approach as previously described 6,[9][10][11] . For this purpose, surface proteins including S, E and M proteins as well as proteins that were predicted to have a signal peptide were identified as probable vaccine candidates whereas the remaining were considered as probable antigens to be used in development of a serodiagnostic assay. During in silico analyses, physico-chemical parameters, secondary structure, subcellular localisation, transmembrane helices, antigenicity and signal peptide were predicted for 27 proteins of the seed genome. Also, signal peptide and antigenicity predictions were conducted for variant proteins whose reference proteins are structural protein and/or have a signal peptide. Variant proteins were determined according to amino acid alterations detected at least in two countries. Thereafter, allergenicity, BetaWrap motifs, similarity with host proteins, post translational modifications and B/T cell epitopes were predicted for proteins which are structural, variants of structural proteins and/or have a signal peptide. Finally, selected epitopes were docked with receptors of MHC-I/II alleles.

Results
Variations. The full genomes of 41 isolates from six continents were compared with a reference seed genome and a lot of codon alterations were detected in all open reading frames (orfs), except orf10. Orf1ab that contain 15 different coding regions was the largest fragment in genome and codon alterations were detected in 66 positions, resulting in 39 amino acid changes. For orf3a, codon alterations were detected in eight positions and seven of them were found to cause the amino acid changes. Fragments of orf6, orf7a and orf8 contained codon alterations in one, two and two positions, respectively. All codon alterations were detected to cause amino acid changes, except one in orf6.
The codon alterations were also detected in gene fragments encoding structural proteins including S, E, M and N proteins. These codon alterations detected in only structural proteins were depicted comprehensively in Table 1, 2 and 3. In the gene fragment of S, codon alterations were detected in eleven positions and nine of them were detected to cause amino acid changes. Especially codon alteration caused by the substitution of adenine for guanine at position 1841 was prevalent and detected in 17 different isolates from 14 countries ( Table 1). The altered codon caused a change from aspartic acid to glycine. In the gene fragment of N, codon alterations were detected in 12 different positions. Among these codon alterations, three were prevalent and two of them were detected in the same six isolates from five countries (Georgia, Greece, Argentina, Colombia, Nigeria). One of them was caused by consecutive guanine to adenine transition at two positions 608 and 609, and resulted in arginine to lysine substitution whereas the other one was caused by a changing from guanine to cytosine at position 610 and resulted in glycine to arginine substitution. The third prevalent variation caused by substitution of cytosine to thymine was detected in 36 isolates from 29 countries and did not result in amino acid alteration ( Table 2).
The fragments belonging to E and M were the smallest when compared to S and N proteins. In the gene fragment of E, two variations caused change in amino acid sequence and one of them was detected in two isolates whereas a codon alteration detected in the gene fragment of M protein did not result in amino acid change (Table 3).
Physico-chemical parameters. The number of amino acids varied from 75 to 1273 among structural proteins. The largest one was S protein with ~142 kDa whereas E protein with ~8.4 kDa was the smallest one (Table 4). Among non-structural proteins, except orf1ab, the amount of amino acids varied from 43 to 275. Orf3a with ~32 kDa was the largest protein whereas orf7b with 5.2 kDa was the smallest protein (Table 4). Each non-structural protein that is encoded by orf1ab was also analysed, and the amount of amino acids was detected to vary from 83 to 1945. Nsp-3 with ~218 kDa molecular weight was one of the largest proteins whereas the smallest one was nsp-7 with ~9.3 kDa size (Table 4). When all proteins encoded by full genome were analysed, theoretical PI value was between 4.6 and 10.07. Among structural proteins, only S protein was negatively charged whereas E, M and N protein were positively charged. In addition, orf7a, orf10, nsp-6, nsp-9, nsp-13, nsp-14 and nsp-16 proteins were positively charged whereas the remaining proteins were negatively charged except nsp-4 and nsp-8 that were neutral. The estimated half-life was 30 h for all proteins, except proteins that were encoded by orf1ab. Only nsp-1 in orf1ab had 30 h estimated half-life. According to instability index, N protein was found as instable while S, E, M structural proteins and most of the non-structural proteins were found as stable. Aliphatic index showed a significant variation ranging between 52.53 to 144 among all proteins. Grand average of hydropathicity value was found negative in S and N proteins as well as in most of the non-structural proteins that were encoded by orf1ab (Table 4).
Secondary structure. According to results obtained from structural proteins, alpha helix was between ~22 and 47%, that of extended strand was between ~10 and 22%, and that of random coil was between ~40 and 60%. For non-structural proteins, alpha helix varied between 0 and 69%, that of extended strand varied between ~3 and 47%, and that of random coil varied between ~28 and 58% (Table 5).
Antigenicity. All structural proteins were predicted as probable antigen. Antigenicity value varied from 0.4638 to 0.6298. E protein and its variant L37H had the highest antigenicity value whereas S protein had the lowest antigenicity value. Interestingly, all non-structural proteins were also predicted as probable antigen, except nsp-16 encoded by orf1ab. In addition, orf7b had the highest antigenicity value with 0.8462 among all proteins ( Table 6).
Solubility. According to solubility prediction, S, E and N proteins were soluble. Among non-structural proteins, orf3a as well as nsp-2, 4, 7, 10, 12, 13, 14, 15 and 16 proteins encoded by orf1ab were predicted as insoluble. The solubility prediction of another protein, nsp-3 encoded by orf1ab, could not be retrieved due to large fragment size (Table 6).  (Table 6). When subcellular localisation predictions were examined, S, M and N proteins were predicted to be in host endoplasmic reticulum. E as well as M proteins were also predicted to be in host cell membrane. Non-structural proteins were predicted to locate in cell membrane, endoplasmic reticulum, cytoplasm, nucleus (Table 6).
Signal peptide. According to prediction of signal peptide based on four different parameters, only S protein and its variant were predicted to have a signal peptide during the analyses of structural proteins. Among non-structural proteins, orf7a, orf8, variant of orf8 and nsp-10 were predicted to have a signal peptide (Table 7).
Allergenicity. None of the proteins analysed showed allergenic properties for MEME/MAST motif and IgE epitopes (Table 8).
BetaWrap motifs. Among all proteins analysed, only S protein and its variant (D614G) were predicted to contain BetaWrap motifs (Table 8).
Similarity with Host Proteome. No significant similarity was predicted between analysed viral proteins and host proteins (Table 8).
B cell epitopes. A lot of linear B cell epitopes were predicted for S, variant S (D614G), N and variants of N (S197L and R203K/G204R), E and variant E (L37H), orf8 and nsp-10 proteins using Bcepred and IEDB. Epitopes that were predicted in both Bcepred and IEDB, and detected as probable antigen were presented in Table 9. Obtained predictions showed that nearly all epitopes had more antigenicity value than those of their own proteins. Among these analysed proteins, the highest antigenicity value (1.4530) was predicted for an epitope (GGDGKMKD) belonging to N protein and its variants. Another epitope (THTGTGQ) that had a high antigenicity value of 1.0789 was predicted in Nsp-10 encoded by orf1ab. Also, any antigenic epitope was not predicted for M and orf7a proteins. All predicted probable antigenic epitopes were depicted in Table 9.

MHC-I and MHC-II epitopes.
A lot of MHC-I epitopes were predicted as probable antigen (Table   10). Antigenicity values belonging to epitopes were generally predicted higher than those of their own proteins. Among structural proteins, an epitope (KLNDLCFTNV) that had the highest antigenicity value (2.6927) was predicted in S protein and its variant (D614G). For non-structural proteins, an epitope (ITLCFTLKRK) in orf7a had the highest antigenicity value (2.5150). Any antigenic epitope was not predicted for nsp-10. On the other hand, KWPWYIWLGF, FLAFVVFLLV, FARTRSMWSF and RNRFLYIIKL, AQFAPSASAF and LGIITTVAAF epitopes belonging to S (including variant D614G), E (including variant L37H), M, N (including S197L and R203K/G204R) and orf8 (including L84S), respectively, had an IC50 value lower than 10 and a percentile rank varying from 0.02 from 0.1, indicating a strong binding among the epitope and MHC-I alleles.
Similarly, a lot of MHC-II epitopes were predicted as probable antigen (Table 11). Also, nearly all epitopes had higher antigenicity values than those of their own proteins. Among structural proteins, PTNFTISVTTEILPV and VTLAILTAHRLCAYC epitopes predicted in S protein (including variant D614G) and variant L37H had the highest antigenicity value. For non-structural proteins, orf7a had an epitope (IVFITLCFTLKRKTE) that was predicted as a probable antigen with a high antigenicity value (1.8597). Any antigenic epitope was not predicted for nsp-10. Among MHC-II epitopes, although there were a lot of epitopes with low percentile rank, only one epitope that had an IC50 value lower than 10, indicating a strong binding among epitope and MHC-II alleles, was detected in orf8.
Post-translational modifications. S protein and its variant (D614G) were predicted to have highly N-glycosylated and phosphorylated sites as well as a few O-glycosylated and acetylated sites. M, E (including L37H), orf7a, and nsp10 proteins were predicted to have N-glycosylated and phosphorylated sites while orf7a was predicted to have an acetylation site. Orf8 and its variant (L84S) were predicted to have N-glycosylated and phosphorylated site whereas two additional phosphorylation sites, one of which locate in exposed surface and the other one is buried, were predicted in only variant L84S. In addition, N protein and its two variants (S197L and R203K/G204R)  For this purpose, firstly full genome sequences belonging to 41 isolates from 31 different countries were compared in order to find the variations causing codon alterations and/or amino acid substitutions, and the association with each other. Viral distribution caused by human mobility or migration was predicted according to phylogenetic tree drawn. Accordingly, isolates from Turkey (EPI_ISL_424366), Canada-Ontario (EPI_ISL_418384), and Australia-Western Australia (EPI_ISL_420539) shared a common ancestor with isolates from Wuhan SARS-CoV-2 (NC_045512), and Kuwait (EPI_ ISL_416458, EPI_ISL_416541). Also, it was thought that the isolate from Chile (EPI_ISL_415661) might be sourced from Japan (LC_528233) because of their closest clustering.
Variants were detected in S, N, E and orf8 proteins. Accordingly, in S proteins, D614G variation was detected to be prevalent ( Table 1). The comparison of reference S protein and its variant (D614G) showed no difference in antigenicity values, epitope regions and antigenicity values of epitopes (Table 9, 10 and 11). In addition, three variations (S197L, R203K/G204R) causing amino acid alterations were detected in N protein ( Table 2). S197L variation was predicted to increase antigenicity value of N protein (Table 9) and thus, utilise of S197L variant was thought to be a better antigen for serodiagnostic studies conducted in countries harboring SARS-CoV-2 isolates with S197L variant. Similar result was also detected in E protein and higher antigenicity value was predicted in variant L37H (Table 3 and 9). Contrary to these results, for variant orf8 (L84S) detected in Japan, China (Yunnan), India, Chile, Australia-Queensland and New Zealand, a lower antigenicity value was predicted (Table 9). These results show the importance of assessment of sequence data obtained from regional isolates before initiated vaccine studies.
Although, there was no major difference between predicted antigenicity values for probable vaccine candidate proteins, S protein was selected as a better vaccine candidate protein compared to others depending on in silico analyses results. Physico-chemical analysis showed that S protein had a negative GRAVY value indicating that S protein is hydrophilic and has a better interaction with surrounding water molecules 36 . Also, it had stable and soluble characteristics which are important parameters for biophysical studies on epitope-based vaccine design. Moreover, S protein had a moderate aliphatic index which indicates stability in a wide spectrum of temperature 37 , fewer than two transmembrane helices facilitating cloning, expression and purification 9 and a big molecular weight and long estimated half-life (more than 10 h). These properties show that S protein can be used as a vaccine candidate antigen. In addition to these physico-chemical properties, other predicted parameters such as the presence of a signal peptide that increase the immune response and the presence of betawrap motifs that are a virulence factor, as well as a non-allergic property also showed that S protein was a better vaccine candidate. In addition, orf7a, 8 and nsp-10 proteins were predicted to have a signal peptide and it is known that the presence of a signal peptide on any protein is an important parameter which indicates that the protein can be destined towards the secretory pathway 38,39 . Accordingly, these probable secreted three proteins (orf7a, 8 and nsp-10) were also considered as potential vaccine candidate proteins. As S, orf7a, orf8 and nsp-10 proteins were examined with regard to secondary structure, random coil were detected higher than 49%. The presence of this highly predicted random coil shows that these proteins can be preferably recognised by an antibody 40 . Another critical point for these proteins was the prediction of post-translational modifications. The presence of these modifications indicates that if these proteins are produced by recombinant technology, eukaryotic expression systems such as yeast, insect or mammalian should be preferred instead of bacterial systems 41 .
In previous vaccine studies, S and M proteins have been used for the development of DNA or recombinant protein vaccines against SARS-CoV that affected 30 countries in five continents 42,43 .
Also, S protein has been used to develop a vaccine against MERS CoV which is another zoonotic pathogen that has infected approximately 2500 people in over 25 countries 5,44 . According to the results obtained from these studies, S and M proteins were reported to induce a strong immune response. Consequently, these findings of recent studies and our in silico study support that only S proteins can be strong vaccine candidate protein in development of recombinant vaccine against SARS-CoV-2 causing COVID-19.
Since N protein does not locate at the surface of SARS-CoV-2, it was thought that N protein may not be a proper vaccine candidate but could be a good antigen for serodiagnosis of COVID-19 because of having a negative GRAVY value and soluble characteristics and not transmembrane helices. There were several previous coronavirus (SARS-CoV) related studies supporting our predictions. For example, a previous study reported strong antibody response against recombinant N protein in 10 of 12 SARS patients 45 . In a different study, a B cell epitope region between 156 and 175 positions of N protein reacted strongly with sera from SARS patients 46 .
In the second part of our study, epitope regions specific to B and T cells were predicted in all structural proteins, variants of structural proteins and non-structural proteins that have a signal peptide and, antigenicity control was performed for all predicted epitopes. Results associated with B cell epitopes showed that there were a lot of highly antigenic epitopes. Antigenicity value was very high for GGDGKMKD, THTGTGQ, and NLDSKV epitopes corresponding to N, nsp-10 encoded by orf1ab and S proteins. Similarly, epitopes that have high antigenicity values were also predicted for MHC-I and II alleles. Among these predicted epitopes, for MHC-I alleles, KLNDLCFTNV (Fig. 2) and ITLCFTLKRK epitopes belonging to S and orf7a proteins had very high antigenicity values whereas for MHC-II alleles, PTNFTISVTTEILPV and IVFITLCFTLKRKTE epitopes belonging to S and orf7a 13 proteins also had significant antigenicity values.
These findings indicate that a cocktail/mixture composed of these epitopes may induce neutralizing antibody response or can be used in development of an epitope-based peptide vaccine because of their association with both B and T cells. Also, it was thought that they can be used as antigens that capture IgM and IgG antibodies against SARS-CoV-2 during viral infection in ELISA or Western blotting tests. In previous wet lab studies, the presence of neutralizing epitopes has been reported to bind with S protein of SARS-CoV [47][48][49] . For example, in a study conducted in mice, major neutralisation determinant was reported in receptor binding domain (RBD) of S protein in SARS-CoV 47 . Another study reported that NYNWKR epitope in S protein had a neutralizing effect against SARS-CoV 48 . There are also some new studies using wet lab techniques and in silico approaches associated with SARS-CoV-2. In a study, splenocytes were stimulated with plenty of T cell epitopes belonging to S protein and nine of them were reported to induce cellular immune response. Among these epitopes, only one of them (VGGNYNYLYRLFRKS; between 445 and 459 positions) was inside RBD, five of them (YNYKLPDDFTGCVIA; DDFTGCVIAWNSNNL;mVVLSFELLHAPATVC; LLHAPATVCGPKKST; KNKCVNFNFNGLTGT) were located nearby RBD whereas the remaining three (SFPQSAPHGVVFLHV; PHGVVFLHVTYVPAQ; FTTAPAICHDGKAHF) were inside S2 segment of S protein 50 . Interestingly, in our study, an epitope (CYFPLQSYGF; between 488 and 497 positions) with a relatively lower antigenicity value were predicted in RBD of S protein and four epitopes (NLDSKV, KLNDLCFTNV, RQIAPGQTGK, GDEVRQ) were also predicted in a very close region. Docking results supported that KLNDLCFTNV epitope was targeted by HLA-A*02:01 allele (Fig. 2). These findings indicate that the abovementioned epitopes may have promising neutralizing effect against SARS-CoV-2. In a previous in silico study, five different epitopes (SYGFQPTNGVGYQPY; SQSIIAYTMSLGAEN; IPTNFTISVTTEILP; AAAYYVGYLQPRTFL; APHGVVFLHVTYVPA) related to both MHC-I and II were predicted in S protein 51 and only one of them overlapped with a highly antigenic epitope (PTNFTISVTTEILPV) predicted in our study. In another in silico study, 14 epitopes were predicted in S protein for T cells 52 and six of them were detected to overlap with epitopes predicted in our study. However, none of these 14 overlapped epitopes were among the significant epitopes identified in this study.

Conclusion
In conclusion, our study shows that vaccine candidate proteins can be selected from among a lot of proteins of SARS-CoV-2 using a reverse vaccinology in silico analyses approach. Depending on our in silico results, S protein was the best vaccine candidate protein. In addition, orf7a, orf8 and nsp-10 proteins were promising vaccine candidate proteins because of having a signal peptide. Moreover, epitopes predicted in S protein and other proteins having a signal peptide may have a potential neutralizing effect and can be used to develop an epitope-based peptide vaccine or a serodiagnostic assay. Prediction of Signal Peptide. Seed genome proteins and variant proteins whose reference protein is a structural protein and/or has a signal peptide were analysed by Signal-BLAST (http://sigpep.services.came.sbg.ac.at/signalblast.html) 21 .

Strains of SARS-
Prediction of Allergenicity. The allergenicity of proteins which are structural, variants of structural and/or have a signal peptide was predicted by Algpred online server (http://crdd.osdd.net/raghava/algpred/) using a prediction approach of MEME/MAST motif and IgE receptor alleles that were specific to each epitope were retrieved from The Protein Data Bank (PDB; http://www.rcsb.org/pdb/). In selection of MHC-I receptor models, the presence of free (undocked) 3D protein structures were considered. Models of epitopes that were selected based on low IC50 value and being probable antigen were predicted by I-TASSER Server (http://zhanglab.ccmb.med.umich.edu/I-TASSER) 31 . In addition, epitopes that have the highest antigenicity value also selected for docking. Each modelled epitope ligand was docked to its specific MHC-I allele receptor by ClusPro Server (https://cluspro.bu.edu/home.php) 32 and visualised on UCSF Chimera 1.14 tool 33 . For docking analyses conducted with MHC-II alleles, epitope models that were selected based on low IC50 values and being probable antigen were predicted by I-TASSER Server (http://zhanglab.ccmb.med.umich.edu/I-TASSER). Each modelled epitope ligand was docked to its specific MHC-II allele by selecting specific alleles from the EpiDock Server (http://www.ddgpharmfac.net/epidock/EpiDockPage.html) 34   Phylogenetic tree results belonging to 42 SARS-CoV-2 isolates from six continents.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.