In silico discovery of antigenic proteins and epitopes of SARS-CoV-2 for the development of a vaccine or a diagnostic approach for COVID-19

In the genome of SARS-CoV-2, the 5′-terminus encodes a polyprotein, which is further cleaved into 15 non-structural proteins whereas the 3′ terminus encodes four structural proteins and eight accessory proteins. Among these 27 proteins, the present study aimed to discover likely antigenic proteins and epitopes to be used for the development of a vaccine or serodiagnostic assay using an in silico approach. For this purpose, after the full genome analysis of SARS-CoV-2 Wuhan isolate and variant proteins that are detected frequently, surface proteins including spike, envelope, and membrane proteins as well as proteins with signal peptide were determined as probable vaccine candidates whereas the remaining were considered as possible antigens to be used during the development of serodiagnostic assays. According to results obtained, among 27 proteins, 26 of them were predicted as probable antigen. In 26 proteins, spike protein was selected as the best vaccine candidate because of having a signal peptide, negative GRAVY value, one transmembrane helix, moderate aliphatic index, a big molecular weight, a long-estimated half-life, beta wrap motifs as well as having stable, soluble and non-allergic features. In addition, orf7a, orf8, and nsp-10 proteins with signal peptide were considered as potential vaccine candidates. Nucleocapsid protein and a highly antigenic GGDGKMKD epitope were identified as ideal antigens to be used in the development of serodiagnostic assays. Moreover, considering MHC-I alleles, highly antigenic KLNDLCFTNV and ITLCFTLKRK epitopes can be used to develop an epitope-based peptide vaccine.


Results
Physico-chemical parameters. The number of amino acids varied from 75 to 1273 among structural proteins. The largest one was S protein with ~ 142 kDa whereas E protein with ~ 8.4 kDa was the smallest one (Table 1). Among non-structural proteins, except orf1ab, the number of amino acids varied from 43 to 275. Orf3a with ~ 32 kDa was the largest protein whereas orf7b with 5.2 kDa was the smallest protein (Table 1). Each non-structural protein that is encoded by orf1ab was also analysed, and the number of amino acids was detected to vary from 83 to 1945. Nsp-3 with ~ 218 kDa molecular weight was one of the largest proteins whereas the smallest one was nsp-7 with ~ 9.3 kDa size (Table 1). When all proteins encoded by the full genome were analysed, the theoretical PI value was between 4.6 and 10.07. Among structural proteins, only S protein was negatively charged whereas E, M, and N protein were positively charged. In addition, orf7a, orf10, nsp-6, nsp-9, nsp-13, nsp-14, and nsp-16 proteins were positively charged whereas the remaining proteins were negatively charged except nsp-4 and nsp-8 that were neutral. The estimated half-life was 30 h for all proteins, except proteins that were encoded by orf1ab. Only nsp-1 in orf1ab had 30 h estimated half-life. According to the instability index, N protein was found as instable while S, E, M structural proteins and most of the non-structural proteins were found as stable. The aliphatic index showed a significant variation ranging between 52.53 to 144 among all proteins. The grand average of hydropathicity value was found negative in S and N proteins as well as in most of the non-structural proteins that were encoded by orf1ab (Table 1). Secondary structure. According to results obtained from structural proteins, the alpha helix was between ~ 22 and 47%, that of the extended strand was between ~ 10 and 22%, and that of the random coil was between ~ 40 and 60%. For non-structural proteins, the alpha helix varied between 0 and 69%, that of the extended strand varied between ~ 3 and 47%, and that of the random coil varied between ~ 28 and 58% (Table 2).
Antigenicity. All structural proteins were predicted as probable antigen. Antigenicity value varied from 0.4661 to 0.6025. E protein had the highest antigenicity value whereas S protein had the lowest antigenicity value. Antigenicity values did not dramatically change among the original Wuhan and variant proteins. Interestingly, all non-structural proteins were also predicted as probable antigen, except nsp-16 encoded by orf1ab. In addition, orf7b had the highest antigenicity value with 0.8462 among all proteins. According to Wuhan orf8 antigenicity value, variant V62L had a higher antigenicity value whereas variant L84S had a lower antigenicity value (Table 3).
(P13L, S194L, S197L, and R203K/G204R), E, and variant E (L37H), orf8 and variants of orf8 (S24L, V62L and L84S) and nsp-10 proteins using Bcepred and IEDB. Epitopes that were predicted in both Bcepred and IEDB, and detected as probable antigen were presented in Table 6. Obtained predictions showed that nearly all epitopes had more antigenicity value than those of their own proteins. Interestingly, an epitope (VDEAGSKS) corresponding to variant orf8 (S24L) was predicted non-antigen because of adding valine amino acid to lead of epitope as differ- www.nature.com/scientificreports/ ent from the original Wuhan sequence. Variant N (P13L) had a specific epitope (AEGSRGGSQASSRSSSRSRNS) with a high antigenicity value that was not predicted for N or other variant N proteins. Among these analysed proteins, the highest antigenicity value (1.4530) was predicted for an epitope (GGDGKMKD) belonging to N protein and its variants. Another epitope (THTGTGQ) that had a high antigenicity value of 1.0789 was predicted in Nsp-10 encoded by orf1ab. Also, any antigenic epitope was not predicted for M and orf7a proteins. All predicted probable antigenic epitopes were depicted in Table 6.

MHC-I and MHC-II epitopes.
A lot of MHC-I epitopes were predicted as probable antigen (Table 7).
Antigenicity values belonging to epitopes were generally predicted higher than those of their own proteins. Among structural proteins, an epitope (KLNDLCFTNV) that had the highest antigenicity value (2.6927) was predicted in S protein and its variant (D614G). For non-structural proteins, an epitope (ITLCFTLKRK) in orf7a had the highest antigenicity value (2.5150). Any antigenic epitope was not predicted for nsp-10. On the other hand, KWPWYIWLGF, FLAFVVFLLV, FARTRSMWSF, and RNRFLYIIKL, AQFAPSASAF and LGIITTVAAF epitopes belonging to S (including variant D614G), E (including variant L37H), M (including variant T175M), N (including variants P13L, S194L, S197L, and R203K/G204R) and orf8 (including S24L, V62L, and L84S), respectively, had an IC50 value lower than 10 and a percentile rank varying from 0.02 from 0.1, indicating a strong binding among the epitope and MHC-I alleles. Also, T175M and S194L variations in M and N proteins caused the prediction of additional epitopes that are specific to own themselves. Similarly, a lot of MHC-II epitopes were predicted as probable antigen (Table 8). Also, nearly all epitopes had higher antigenicity values than those of their own proteins. Among structural proteins, PTNFTISVTTEILPV and VTLAILTAHRLCAYC epitopes predicted in S protein (including variant D614G) and variant L37H had the highest antigenicity value. For non-structural proteins, orf7a had an epitope (IVFITLCFTLKRKTE) that was predicted as a probable antigen with a high antigenicity value (1.8597). Any antigenic epitope was not predicted for nsp-10. Among MHC-II epitopes, although there were a lot of epitopes with low percentile rank, only two epitopes (SKWYIRVGARKSAPL and KWYIRVGARKSAPLI) that had an IC50 value lower than 10, indicating a strong binding among epitope and MHC-II alleles, was detected in orf8 and its variants. In addition, variant M protein (T175M) and orf8 variants (S24L and V62L) had specific epitopes with high antigenicity values as different from original Wuhan M and orf8 proteins.
Post-translational modifications. S protein and its variant (D614G) were predicted to have highly N-glycosylated and phosphorylated sites as well as a few O-glycosylated and acetylated sites. M (including T175M), E (including L37H), orf7a, and nsp10 proteins were predicted to have N-glycosylated and phosphorylated sites www.nature.com/scientificreports/ while orf7a was predicted to have an acetylation site. Orf8 and its variants were predicted to have N-glycosylated and phosphorylated sites whereas two additional phosphorylation sites, one of which locate in the exposed surface and the other one is buried, were predicted in only variant L84S. In addition, N protein and its four variants were predicted to have N-/O-glycosylated, phosphorylated and acetylated sites. When N protein and its variant were compared, the number of O-glycosylation, acetylation, or phosphorylation sites showed minor alterations. In addition, post-translational modifications within significant epitopes were shown in Table 9.
Docking analysis. All probable antigenic epitopes that have a low IC50 value and percentile rank could not be docked with their MHC-I or MHC-II alleles because of limitations associated with available MHC-I and MHC-II alleles variations in data bank or server. Accordingly, KWPWYIWLGF, KLNDLCFTNV, FLAFV-VFLLV, LIFLWLLWPV, MEVTPSGTWL, FLIVAAIVFI, and LEYHDVRVVL epitopes belonging to S (including variant D614G), E (including variant L37H), M (including variant T175M, N (including variants P13L, S194L, S197L, and R203K/G204R), orf7a and orf8 (including variants S24L, V62L, and L84S), respectively, were docked with receptors of selected MHC-I alleles (Figs. 1, 2, and 3). During docking analysis conducted by MHC-II alleles, in S protein, core regions of PTNFTISVTTEILPV, SIIAYTMSLGAENSV, and GYFKIYSKHTPINLV epitopes were docked with the receptor of HLA-DRB1*07:01. Also, the core region of another epitope (QDLFLPFFSNVTWFH) in S protein was docked with the receptor of HLA-DRB1*15:01. In M protein, core regions of ASFRLFARTRSMWSF, RTLSYYKLGASQRVA and PKEIT-VATSRTLSYY epitopes were docked with the receptor of HLA-DRB1*07:01. Also, the core region of an epitope (QIAQFAPSASAFFGM) in N protein was docked with the receptor of HLA-DRB1*07:01. Similarly, the core region of an epitope (VTLAILTAHRLCAYC) in variant L37H was docking to the receptor of HLA-DRB1*1501. These epitopes that were docked with own MHC-II alleles were also available in their variant proteins. Solvent-exposure positions in epitopes. The amino acids exposed to the solvent were detected in all significant epitopes. Among these epitopes, the whole of only two epitopes (FLAFVVFLLV and KWPWYI-WLGF) were in the solvent accessible region on protein structure (Table 9). Table 3. Solubility, transmembrane helices, localisation and antigenicity results predicted by SolPro, TMHMM, Virus-mPLoc and Vaxijen, respectively. a Could not be retrieved. www.nature.com/scientificreports/

Discussion
Reverse vaccinology plays an important role in the development of recombinant vaccines by allowing in silico analyses of the genome of pathogens. In silico analyses enables identifying the highly antigenic and secreted proteins which are crucial in vaccine development before the beginning of the wet lab studies 8,14 . Using this approach, the present study aimed to discover likely antigenic proteins as well as epitope regions that are targeted by both B and T cell arms of the adaptive immune response for the development of a vaccine or serodiagnostic assay as described by Dangi et al. 8 and Goodswen et al. 14 .
All proteins of SARS-CoV-2, except nsp-16 encoded by orf1ab, were predicted as probable antigen. Although, there was no major difference between predicted antigenicity values for probable vaccine candidate proteins, S protein was selected as a better vaccine candidate protein compared to others depending on in silico analyses results. The physico-chemical analysis showed that S protein had a negative GRAVY value indicating that S protein is hydrophilic and has a better interaction with surrounding water molecules 15 . Also, it had stable and soluble characteristics which are important parameters for biophysical studies on epitope-based vaccine design. Moreover, S protein had a moderate aliphatic index which indicates stability in a wide spectrum of temperature 16 , fewer than two transmembrane helices facilitating cloning, expression, and purification 11 , and a big molecular weight and long estimated half-life (more than 10 h). These properties show that S protein can be used as a vaccine candidate antigen. In addition to these physico-chemical properties, other predicted parameters such as the presence of a signal peptide that increase the immune response and the presence of betawrap motifs that are a virulence factor, as well as a non-allergic property also showed that S protein was a better vaccine candidate. In addition, orf7a, orf8 and nsp-10 proteins were predicted to have a signal peptide. This feature is an important www.nature.com/scientificreports/ parameter which indicates that the protein can be destined towards the secretory pathway 17,18 . Moreover, the signal peptide promotes protein secretion, and thus, the signal peptide is used to improve the protein secretion level in recombinant techniques 19 . For example, a study showed that vaccination with an unconventionally secreted viral nonstructural protein (NS1) protected mouse from murine norovirus 20 . Also, it has been reported that proteins with the signal peptide should be taken into consideration as vaccine candidates both they have been targeted to the secretory pathway and have high antigenicity and specificity 21 . Accordingly, these probable secreted and antigenic three proteins (orf7a, orf8, and nsp-10) can also be considered as potential vaccine candidate proteins. As S, orf7a, orf8, and nsp-10 proteins examined with regard to secondary structure, random coils were detected higher than 49%. The presence of this highly predicted random coil shows that these proteins can be preferably recognised by an antibody 22 . Another critical point for these proteins was the prediction of posttranslational modifications. The presence of these modifications indicates that if these proteins are produced by  www.nature.com/scientificreports/ recombinant technology, eukaryotic expression systems such as yeast, insect or mammalian should be preferred instead of bacterial systems 23 .
In previous vaccine studies, S and M proteins have been used for the development of DNA or recombinant protein vaccines against SARS-CoV that affected 30 countries in five continents 24,25 . Also, S protein has been used to develop a vaccine against MERS CoV which is another zoonotic pathogen that has infected approximately 2500 people in over 25 countries 7,26 . According to the results obtained from these studies, S and M proteins were reported to induce a strong immune response. For the vaccine development against SARS-CoV-2, it has been stated that S protein is a promising candidate because it plays role in viral attachment, fusion, and entry [27][28][29] . In addition, a report showing that antibodies against S protein of SARS-CoV inhibit the SARS-CoV-2  www.nature.com/scientificreports/ S protein-mediated entry into cells encourages the use of this molecular target for vaccination 28,29 . Currently, a lot of companies or research groups target the S protein to develop a vaccine against SARS-CoV-2 using various recombinant vaccine technologies. For example, Inovio using S protein with a DNA vaccine technology is in Phase I. Another company, Moderna, is in Phase I/II with an RNA based vaccine targeting S protein 30 . Consequently, these findings of recent studies and our in silico study support that only S proteins can be a strong vaccine candidate protein in the development of a recombinant vaccine against SARS-CoV-2 causing COVID-19. Since N protein does not locate at the surface of SARS-CoV-2, it was thought that N protein may not be a proper vaccine candidate but could be a good antigen for serodiagnosis of COVID-19 because of having a negative GRAVY value and soluble characteristics and not transmembrane helices. There were several studies for the previous coronavirus (SARS-CoV) supporting our predictions. For example, a previous study reported a strong antibody response against recombinant N protein in 10 of 12 SARS patients 31 . In a different study, a B cell epitope region between 156 and 175 positions of N protein reacted strongly with sera from SARS patients 32 .
Among structural proteins, subcellular localisations of S and E proteins were predicted as endoplasmic reticulum using in silico methods in the current study, and this result was found to be compatible with the results of SARS-CoV performed with an in vitro immunofluorescent analysis showing the localisation of S protein in several compartments of host secretory pathway from the endoplasmic reticulum to cell membrane as well as E protein in endoplasmic reticulum 33 . However, subcellular localisation of M protein was predicted as the host cell membrane and endoplasmic reticulum in the current in silico analysis while it was shown in the Golgi apparatus in the same in vitro analysis 33 . In fact it was also thought to be compatible with in silico results because endoplasmic reticulum, Golgi, and cell membrane are parts of the same host secretory pathway and all surface proteins may be detected in each part of the pathway. www.nature.com/scientificreports/  www.nature.com/scientificreports/ In this study, the immunological effects of prevalent variant proteins belonging to E, M, N, S, and orf8 proteins were also analysed. Accordingly, the comparison of reference S protein and its variant (D614G) showed no difference in antigenicity values, epitope regions, and antigenicity values of epitopes. However, detecting D614G variation as prevalent has been associated with selection advantage and random founder effect 34,35 . In addition to these, a study reported that the D614G variant was more stable and enhanced its infectious nature 36 whereas another study reported that there was not enough evidence to express that the variant is more infectious 37 . For N protein, among five variations (P13L, S194L, S197L, R203K/G204R), P13L and S197L variations were predicted to increase the antigenicity value of N protein and thus, utilise of P13L and S197L variants was thought to be a better antigen for studies conducted in countries harboring SARS-CoV-2 isolates with P13L or S197L variant. A similar result was also detected in E protein and a higher antigenicity value was predicted in variant L37H. Variant orf8 (L84S) had a lower antigenicity value whereas a higher antigenicity value was predicted in variant orf8 (V62L) compared to orf8 of Wuhan isolate. Also, variant M protein (T175M) had a lower antigenicity value. As depending on these results, since a higher antigenicity value is associated with a stronger immune response in the host, selection of the proteins with high antigenicity values in vaccinological or serodiagnostic studies would be useful.
In the second part of our study, epitope regions specific to B and T cells were predicted in all structural proteins, variants of structural proteins, and non-structural proteins that have a signal peptide, and antigenicity control was performed for all predicted epitopes. Results associated with B cell epitopes showed that there were a lot of highly antigenic epitopes. Antigenicity value was very high for GGDGKMKD, THTGTGQ, and NLDSKV epitopes corresponding to N, nsp-10 encoded by orf1ab and S proteins. Similarly, epitopes that have high antigenicity values were also predicted for MHC-I and II alleles. Among these predicted epitopes, for MHC-I alleles, KLNDLCFTNV (Fig. 1) and ITLCFTLKRK epitopes belonging to S and orf7a proteins had very high antigenicity values whereas for MHC-II alleles, PTNFTISVTTEILPV and IVFITLCFTLKRKTE epitopes belonging to S and orf7a proteins also had significant antigenicity values. www.nature.com/scientificreports/ These findings indicate that a cocktail/mixture composed of these epitopes may induce a neutralising antibody response or can be used in the development of an epitope-based peptide vaccine because of their association with both B and T cells. Also, it was thought that they can be used as antigens that capture IgM and IgG antibodies against SARS-CoV-2 during viral infection in ELISA or Western blotting tests. In previous wet lab studies, the presence of neutralising epitopes has been reported to bind with S protein of SARS-CoV [38][39][40] . For example, in a study conducted in mice, a major neutralisation determinant was reported in receptor-binding domain (RBD) of S protein in SARS-CoV 38 . Another study reported that the epitope NYNWKR in S protein had a neutralising effect against SARS-CoV 39 . There are also some new studies using wet lab techniques and in silico approaches associated with SARS-CoV-2. In a study, splenocytes were stimulated with plenty of T cell epitopes belonging to S protein, and nine of them were reported to induce a cellular immune response. Among these epitopes, only one of them (VGGNYNYLYRLFRKS; between 445 and 459 positions) was inside RBD, five of them (YNYKLPDD-FTGCVIA; DDFTGCVIAWNSNNL; VVLSFELLHAPATVC; LLHAPATVCGPKKST; KNKCVNFNFNGLTGT) were located nearby RBD whereas the remaining three (SFPQSAPHGVVFLHV; PHGVVFLHVTYVPAQ; FTTA-PAICHDGKAHF) were inside S2 segment of S protein 41 . Interestingly, in our study, an epitope (CYFPLQSYGF; between 488 and 497 positions) with a relatively lower antigenicity value was predicted in RBD of S protein and four epitopes (NLDSKV, KLNDLCFTNV, RQIAPGQTGK, GDEVRQ) were also predicted in a very close region. Docking results supported that the epitope KLNDLCFTNV was targeted by HLA-A*02:01 allele (Fig. 1). These findings indicate that the above-mentioned epitopes may have a promising neutralising effect against SARS-CoV-2. In a previous in silico study, five different epitopes (SYGFQPTNGVGYQPY; SQSIIAYTMSLGAEN; IPTNFTISVTTEILP; AAAYYVGYLQPRTFL; APHGVVFLHVTYVPA) related to both MHC-I and II were predicted in S protein 42 and only one of them overlapped with a highly antigenic epitope (PTNFTISVTTEILPV) predicted in our study. In another in silico study, 14 epitopes were predicted in S protein for T cells 43 and six of them were detected to overlap with epitopes predicted in our study. However, none of these overlapped epitopes were among the significant epitopes identified in this study.
Docking analysis confirming the interaction among predicted epitopes and MHC-I/II alleles allows the prediction of more reliable epitopes that can be used for wet lab studies. In this study, docking analysis could not be performed for each of the all MHC-I or MHC-II epitopes due to the lack of 3D structures of some MHC-I/ II alleles in protein database or server. This situation limits the docking analysis part of this study as preventing the analysis of all epitopes. Therefore, it was thought that increasing the number of 3D models of MHC-I or II alleles in PDB or servers would be useful for the analysis of a more robust epitope.
Development of computer-based methodologies enhances the credibility of in silico approaches in biological studies. Based on that, the methods of predicting vaccine candidate proteins are always favored even though they are not expressed in vitro. In silico methods also have the advantage of being able to make a fast and cost-efficient analysis. The other advantage of these methods is that they make predictions depending on the structure of vaccine candidate proteins and constitute a major way for vaccine design. Thus, in silico methods can be used at a www.nature.com/scientificreports/ very early stage in the vaccine development process and this makes in silico methods essential as a pre-analysis approach before starting wet lab studies. Although the multiple numbers of proteins can be analysed by in silico methods for vaccine design studies, there are some limitations that should be taken into consideration. The lack of information in databases, inaccuracies of software algorithms, and usage of inappropriate tools for data are known limitations in terms of in silico studies. Therefore, it is important to select the right tools for analysis and utilize different parameters to find the correct results for in silico-based studies.

Conclusion
As spike protein has important roles such as viral attachment, fusion and, entry, it is a very significant strategic target for vaccine studies and a lot of companies and research groups use the protein for vaccine development.
Our reverse vaccinology in silico approach also supports that S protein is the best vaccine candidate protein.
In addition, probable secreted orf7a, orf8, and nsp-10 proteins with signal peptide can be promising vaccine candidates. Epitopes predicted in S protein and other proteins having a signal peptide may have a potential neutralising effect and can be used to develop an epitope-based peptide vaccine or a serodiagnostic assay. In the future, in addition to the currently studied S protein, antigenicity of orf7a, orf8, and nsp-10 proteins as well as significant epitopes selected in this study should be checked by wet lab studies and antigenic proteins/epitopes should be studied as vaccine or serodiagnostic candidates. Prediction of physico-chemical parameters and secondary structures. The reference genome proteins were investigated using Expasy ProtParam online server (https ://web.expas y.org/protp aram/) for the prediction of physico-chemical properties 46 . The prediction of solubility was performed by SolPro (http://scrat ch.prote omics .ics.uci.edu/) 47 . Also, prediction of secondary structures was performed by GOR IV online server (https ://npsa-prabi .ibcp.fr/cgi-bin/npsa_autom at.pl?page=/NPSA/npsa_gor4.html) 48 . Prediction of signal peptide. The reference genome proteins and variant proteins were analysed by Signal-BLAST (http://sigpe p.servi ces.came.sbg.ac.at/signa lblas t.html) 52 .

Prediction
Prediction of allergenicity. The allergenicity of the reference genome structural proteins, variant proteins, and the proteins that have a signal peptide was predicted by Algpred online server (http://crdd.osdd.net/ragha va/algpr ed/) using a prediction approach of MEME/MAST motif and IgE epitopes 53 .
Prediction of BetaWrap motifs. The prediction of BetaWrap motifs of the reference genome structural proteins, variant proteins, and the proteins that have a signal peptide was carried out by BetaWrap online server (http://cb.csail .mit.edu/cb/betaw rap/betaw rap.html) 54 .
Prediction of similarity with host proteome. The reference genome structural proteins, variant proteins, and the proteins that have a signal peptide were examined by BlastP (https ://blast .ncbi.nlm.nih.gov/Blast .cgi?PAGE=Prote ins) to predict the similarity with the host proteome. In analysis, Homo sapiens was selected as a host organism.
Prediction of post-translational modifications. The prediction of post-translational modifications of the reference genome structural proteins, variant proteins, and the proteins that have a signal peptide were carried out using NetNGlyc 1.0 server (http://www.cbs.dtu.dk/servi ces/NetNG lyc/) 55  www.nature.com/scientificreports/ e.php) 58 . In addition, NetSurfP 2.0 (http://www.cbs.dtu.dk/servi ces/NetSu rfP/) was used for the prediction of surface accessibility of post-translational modification sites in proteins 59 .
Prediction of B cell epitopes. Linear B cell epitopes of the reference genome structural proteins, variant proteins, and the proteins that have a signal peptide were predicted by Bcepred (http://crdd.osdd.net/ragha va/ bcepr ed/) 60  Docking analysis with MHC-I and II alleles. For docking analyses conducted with MHC-I alleles, receptor alleles that were specific to each epitope were retrieved from Protein Data Bank (PDB; http://www.rcsb. org/pdb/). In selection of MHC-I receptor models, the presence of free (undocked) 3D protein structures were considered. Models of epitopes that were selected based on low IC50 value and being probable antigen were predicted by I-TASSER Server (http://zhang lab.ccmb.med.umich .edu/I-TASSE R) 62 . In addition, epitopes that have the highest antigenicity value were also selected for docking. Each modelled epitope ligand was docked to its specific MHC-I allele receptor by ClusPro Server (https ://clusp ro.bu.edu/home.php) 63 and visualised on UCSF Chimera 1.14 tool 64 . For docking analyses conducted with MHC-II alleles, each epitope that was selected based on low IC50 value and being probable antigen was docked to its specific MHC-II allele by selecting specific alleles from the EpiDock Server (http://www.ddg-pharm fac.net/epido ck/EpiDo ckPag e.html) 65 .

Prediction of solvent-exposure positions in epitopes. Each protein was aligned with its variants by
Clustal Omega (https ://www.ebi.ac.uk/Tools /msa/clust alo/) 66 . Then, a 3D structure pdb file for each protein was downloaded from the Protein Data Bank (PDB; http://www.rcsb.org/pdb/) or constructed by modelling using Swiss-Model (https ://swiss model .expas y.org). As an input for each protein, an alignment file and a 3D protein model file were uploaded and run on ESPript 3.0 (http://espri pt.ibcp.fr/ESPri pt/ESPri pt/) 67 to predict amino acid solvent-exposure (accessibility) properties for epitopes.