Introduction

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)1,2. First detected in December 2019 in Wuhan, the virus has spread globally, with basic reproduction number (R0) reaching 5.73, millions of deaths, and unprecedented financial, social and political impacts all over the world4. Efficacious vaccines are therefore desperately needed5. The main clinical features of the COVID-19 are fever, cough and myalgia or fatigue6; the virus has caused clusters of severe respiratory illness similar to severe acute respiratory syndrome coronavirus and is associated with ICU (Intensive Care Unit) admission and high mortality rates7.

Currently, without a single specific antiviral therapy for SARS-CoV-2, the control methods of the COVID-19 are early diagnosis, reporting, isolation, supportive treatments, and timely publishing epidemic information with only limited impact on the coronavirus8,9. Researchers have proposed several approaches to develop vaccines for the SARS-CoV-210. Traditional process of vaccine design is based on growing pathogens, which represents a very time-consuming process of isolating, inactivating and injecting the virus that causes the disease11,12. Such process usually takes more than a year to result in efficacious vaccines and hence contributes very little to avoid the current spread of the disease13,14. Recently, researchers have worked on constructing multi-epitope vaccines by in silico methods based on immunoinformatics without the need to grow pathogens to accelerate the vaccine design process15,16,17. Multi-epitope vaccines are constructed by multiple virus protein fragments rich in overlapping epitopes. They contain the vital part of the virus to elicit either a cellular or a humoral immune response and they reduce unwanted components that can trigger adverse effects18. Multi-epitope vaccines can be powerful for fighting viral infections, providing excellent vaccine candidates for clinical trials. The genome sequencing of the SARS-CoV-2 is completed8 and researchers have studied the details in the SARS-CoV-2 proteins19. Coronavirus is studded on its exterior with spike proteins, which are key components to infect and attack human cells20. The spike protein of the SARS-CoV-2 can latch onto cells and force the virus through the cell membrane, which enables the virus entry. Previous studies reveal that the spike protein of the SARS-CoV-2 plays a decisive role during the infection. Proteolytic activation of spike protein by host cell proteases is also a critical determinant21. It is promising to combat the COVID-19 by inducing the B-cells and T-cells that can perform immune responses against the SARS-CoV-2 spike protein. Hence, in this study, we choose the spike protein sequence of the SARS-CoV-2 as the main subject to design our multi-epitope vaccine.

Although the in silico vaccine design approaches are looked at as fairly efficient, they may not be sufficiently fast to keep pace with the emergence of various pandemics. Figure 1A shows the schematic diagram of a traditional in silico vaccine design process. Researchers usually use numerous in silico tools to predict the B-cell, CTL and HTL epitopes on the whole virus proteins22,23. The antigenicity and other physicochemical properties of the overlapping fragments are also necessary to be evaluated24. To select the best virus protein regions for constructing an efficacious vaccine, we need to carefully and comprehensively evaluate all the predicted results, which creates a large overhead and can be very time consuming. Currently, each in silico vaccine design tool can only achieve one single prediction goal. For example, BepiPred25 is a very popular B-cell epitope prediction tool and many researchers use this tool to predict the B-cell epitopes. However, BepiPred can only be used to address the one step of B-cell epitope prediction, and when it comes to T-cell epitope prediction, a different tool such as NetMHCpan26 is needed. No current tool is able to conduct multiple predictions and comprehensively analyze the results for us at once to directly identify the best vaccine subunits for further construction and evaluation.

Figure 1
figure 1

Schematic Diagram of In Silico Vaccine Design Process. (A) Traditional in silico vaccine design process. We have to use numerous vaccine design tools. The evaluation and subunits selection is very time consuming. No current tool is able to include all the predictions to comprehensively analyze and select out the best vaccine subunits directly. (B) In silico vaccine design by DeepVacPred framework. By replacing the many predictions, evaluations and selections with a DNN architecture inside the DeepVacPred framework, we are able to directly predict a very small number of potential vaccine subunits within a second and start the following evaluation and vaccine construction on a much smaller amount of data.

To overcome the above challenges of the in silico vaccine design, we propose DeepVacPred, a novel AI-based in silico multi-epitope vaccine design framework. We successfully replace the multiple necessary predictions and the comprehensive evaluations with a deep neural network (DNN) architecture. When the DNN takes one peptide sequence as input, it can then judge whether this input sequence can be a potential vaccine subunit. In the DeepVacPred framework, the number of potential vaccine subunits can be firstly reduced to around 30, then further evaluation and vaccine construction is done on the predicted subunits by reliable and popular in silico methods to construct the final vaccine. Our novel approach aims to achieve a much better efficiency of the in silico vaccine design.

With DeepVacPred, this study designs a multi-epitope vaccine in a novel in silico fashion. We first use the DNN architecture to lock down 26 fragments in the SARS-CoV-2 spike protein as vaccine subunit candidates. Next, we predict the linear B-cell epitopes, CTL epitopes and HTL epitopes to select and construct our final vaccine. We further analyze the human population coverage, antigenicity, allergenicity, toxicity and other physicochemical properties to validate the quality. We also predict the secondary structure and 3D structure model. This model is eventually refined and validated. Finally, the codon optimization and in silico cloning are performed to check the vaccine genome and protein constructions and ensure its effective expression. In addition, DeepVacPred allows us to quickly check for newly emerging threats caused by the RNA mutations of the SARS-CoV-2. We prove that our vaccine can tackle the virus RNA mutations.

DeepVacPred

Background

An in silico vaccine design process can be seen as selecting good fragments of the virus proteins, then constructing them together into a final vaccine24. A fragment with multiple merits can be selected as a subunit of the final vaccine. For example, an ideal subunit should contain multiple B-cell epitopes and T-cell epitopes and it should have high antigenicity to trigger human protective reactions22,23. These merits can be predicted by in silico approaches and currently there are numerous in silico vaccine design tools. However, these tools are designed to address only one of the several predictions at a time. Consequently, researchers have to overcome the time-consuming tasks of analyzing each individual prediction result from different tools while adopting a comprehensive view of the vaccine design. No current tool can take all the necessary merits into consideration and directly predict the vaccine subunit candidates from the virus proteins.

There are two drawbacks to the current situation: (i) We usually need only the best 10–20 subunits to construct the final vaccine while each prediction tool may provide us with hundreds or even thousands of potential locations to choose, which creates a large overhead to comprehensively select out the subunits we need and no current tool can achieve both the prediction and the selection for us. (ii) Nearly 90% prediction results are eventually discarded because they have only part of the merits, resulting in too much of unnecessary analysis and wasting many computing resources. Consequently, traditional approaches may produce vaccines that are too late or ineffective for pandemics.

In order to improve the efficiency and reliability of the vaccine design process, we improve over state-of-the-art tools by providing a DNN approach, DeepVacPred, an efficient in silico vaccine design process to address the afore-mentioned concerns. DeepVacPred directly predicts the best vaccine subunit candidates (the number is within 30) from the virus protein sequences within a second by replacing the prediction and selection with deep neural network architecture, hence promising much higher efficiencies for the vaccine design and test process.

Data collection and dataset design

Reliable data is essential for the performance of supervised learning27, thus, it plays a crucial role in the outcome of the vaccine design process. We collected 5000 latest known B-cell epitopes (B) and 2000 known T-cell epitopes containing both MHC (major histocompatibility complex)-1 and MHC-2 binders28 (T) from the IEDB database, combining with the same number of proteins which are not T-cell or B-cell epitopes, forming a dataset of epitopes and non-epitopes. 100 known latest viral protective antigens are selected from the IEDB database, and the same number of proteins without protective functions are randomly selected, combining with the 400 antigens from previous work29, forming a dataset with 600 antigens.

DeepVacPred is built based on supervised learning on a subtly designed dataset. To directly predict the vaccine subunit candidates, the protein sequences in the positive dataset must contain at least one T-cell epitope and one B-cell epitope and must be protective antigens. Cartesian Product30 is the set that contains all ordered pairs from two sets. Thus, the two Cartesian Products, T × B and B × T, which are formed between the collected B-cell epitopes dataset and the T-cell epitopes dataset can cover all the possible combinations of the known B-cell and T-cell epitopes. We use the 600 antigens to train a neural network that can identify protective antigens. We use this neural network on the Cartesian Product to sieve out 706,970 peptides sequences that are predicted to be protective antigens. Those 706,970 peptides contain both B-cell epitopes and T-cell epitopes and are protective antigens, referred in this paper as the positive vaccine dataset. The same number of peptides randomly bridged by negative T-cell and B-cell epitopes form our negative vaccine dataset. The dataset we design addresses the three most important predictions, the B-cell epitopes, T-cell epitopes and antigenicity in the vaccine design process.

All the datasets we collected, designed and created for the DNNs training can be found in the Data Availability section. The descriptions of each dataset are shown in Table 1.

Table 1 Description of the datasets used for analysis and DNN training.

Network training

A multi-layer convolutional neural network (CNN) and a four-layer linear neural network connect together, forming a deep neural network (DNN) with a two-class output. The positive and negative datasets are annotated by Z-descriptors31, then converted to the same length of 45 vectors with auto cross covariance (ACC) transformation32. Trained by the transformed dataset above, the DNN achieves the classification function to predict whether the input is a protective antigen containing both the B-cell and T-cell epitopes, realizing the ability to directly judge whether a sequence can be a potential vaccine subunit. This DNN is the core part of the rapid vaccine design process of our DeepVacPred framework and we name it as DNN-V. In addition, we train another DNN with the same structure on the T-cell epitope dataset which can judge whether an input sequence can be a T-cell epitope and we name it as DNN-T. The detailed neural network structures, training process and hyper-parameters can be found in “DNN Design and Training in DeepVacPred Framework” in the Methods section.

Validation

ROC curves

Receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied33. DNN-V is a novel approach that needs to be validated. We use the ROC curves to evaluate the DNN-V in DeepVacPred. We test the trained DNN-V with two datasets, namely the train set and the test set, each of which containing 200 protein sequences. The training set contains 200 proteins randomly selected from the dataset; we use to train the DNN-V, with 100 positive and 100 negative protein sequences. We also selected known B-cell epitopes and T-cell epitopes that are not in our collected data and use the above steps to form the testing set, also with 100 positive and 100 negative protein sequences. The ROC curves are shown in Fig. 2. The validation data appears in Table 2. The thresholds are ranged from 0 to 1. The accuracy reported in Table 2 is the greatest value among all thresholds. The sensitivity and specificity values in Table 2 are reported for the case with the highest accuracy. The AUC (Area Under the ROC Curve) value of 0.9703 for the test set which indicates the high accuracy of the classification of DNN-V to identify potential vaccine subunits.

Figure 2
figure 2

ROC Curves for the DNN-V in DeepVacPred. The area under the ROC curves represent the ability of the DNN-V to classify potential vaccine subunits and non-potential vaccine subunits. The high area under the ROC curves suggests that the DNN-V has strong classification ability and high accuracy at most threshold values.

Table 2 DeepVacPred Validation. For the Training Set, we reach the highest accuracy of 0.995 if the threshold value is set at 0.32. At this threshold value, the sensitivity and specificity are 0.99 each. For the Testing Set, we reach the highest accuracy of 0.95 if the threshold value is set at 0.5. At this threshold value, the sensitivity and specificity are 0.95 each. The experimental data shows high accuracy and strong classification ability of the proposed DeepVacPred framework.

Vaccine design test

The false positive rate (FPR) will fall down to 0 if we set the threshold to a very low value, e.g., 0.0003, since we only care about discarding all the non-candidates. We use the DNN-V in our DeepVacPred framework on the 1273aa spike protein sequence of the SARS-CoV-2. 132 vaccine candidates are predicted. We use BepiPred25, NetMHCpan26 and Vaxijen34 to examine each candidate. All of the candidates contain both T-cell and B-cell epitopes and only 14 of them are predicted by Vaxijen to be non-protective antigens.

DeepVacPred framework

Figure 1B provides the schematic diagram of the vaccine design process using DeepVacPred framework. DeepVacPred first uses DNN-V to predict a very small number of potential vaccine subunits directly from the virus protein sequences. DeepVacPred further uses DNN-T to examine all the overlapping sequences in these subunits and select the subunit candidates which have multiple T-cell epitopes. These two prediction rounds take less than 1 s and reduce the number of potential vaccine subunits to around 30. Compared to traditional approaches, the most time-consuming subunits selection part can be easily done by DeepVacPred within less than a second, saving a large amount of time and computational resources.

The following steps in the DeepVacPred framework are as follows: (i) selecting the best subunits from only about 30 candidates and (ii) constructing the final vaccine based on the evaluations by various reliable in silico tools, including Linear B-cell epitopes prediction, CTL and HTL epitopes prediction, population coverage analysis, vaccine construction, evaluation of antigenicity, allergenicity, solubility, immunogenicity, toxicity and other physicochemical properties, structure prediction, 3D modeling, in silico cloning, molecular docking and molecular dynamics simulation. Compared to the popular computational process, those evaluations are done on a much smaller amount of data, hence improving the efficiency.

Results

Data retrieval

The genome sequence of SARS-CoV-2 isolate Wuhan-Hu-1 is retrieved from the NCBI database with accession number MN90894735. The protein sequences are retrieved according to their translation. Especially, the spike protein (protein ID: QHD43416.1) has a length of 1273 amino acids (aa), and the receptor binding domain (RBD) is from 347 to 520aa20. The following experiments are mainly focused on the spike protein region.

DeepVacPred vaccine subunits prediction

All the overlapping protein fragments with a length of 30aa are generated out of the 1273aa SARs-CoV-2 spike protein sequence. DeepVacPred first tests these 1244 30aa protein sequences and predicts 132 potential vaccine subunits (see Table 3). The DeepVacPred framework further predicts the T-cell epitopes at these locations and discards the subunits which have less than 8 T-cell epitopes36. After this prediction, our DeepVacPred provides us with 26 potential vaccine subunits for further evaluation and construction (see Table 4). These subunits are very likely to contain B-cell epitopes and multiple T-cell epitopes. They are also very likely to have high antigenicity and low allergenicity. We start the following in silico vaccine design process directly from the predicted 26 vaccine subunits, which is very efficient.

Table 3 DeepVacPred first round prediction results. Here we show the number of predicted vaccine subunits for each location.
Table 4 DeepVacPred second round prediction results. Here we get 26 vaccine subunits for further evaluation and construction. Those 26 vaccine subunits are very likely to have high antigenicity and contain multiple B-cell and T-cell epitopes. With DeepVacPred, those 26 vaccine subunits are reached within less than a second, while it can take days to select those subunits from the virus protein if we use traditional methods. Next, DeepVacPred simply checks the epitopes and other merits on those 26 subunits and constructs the multi-epitope vaccine directly from those 26 candidates, which is much more efficient than traditional approaches.

Linear B-cell epitopes prediction

B-cell epitopes are portions of antigens binding to immunoglobulin or antibody to trigger the B-cells to provide immune response37. Linear B-cell epitopes are predicted on the 26 vaccine subunits. Linear B-cell epitopes are predicted by four online servers including BepiPred25, SVMtrip38, ABCPred39 and BCPreds40. We first use BepiPred for the main prediction and we use the other three servers to check the prediction results by BepiPred. A B-cell epitope predicted by the BepiPred will be discarded if it is not predicted by any of the other three servers. B-cell epitopes must be located in the solvent-exposed region of the antigens to be possible to combine with the B-cell37, thus it is essential to predict the surface availability of the structural protein sequence. The surface availability is predicted by Emini tool41,42 on the whole SARS-CoV-2 spike protein sequence, and we discarded the epitopes that are not exposed on the surface. After the predictions, we select out 14 vaccine subunits (see Table 5). We further use the RaptorX Property server to evaluate the surface accessibility of the SARS-CoV-2 to validate that the B-cell epitopes in those subunits are well-exposed (see Fig. 3).

Table 5 Linear B-cell Epitopes Prediction Results. Here, we show the selected 14 vaccine subunits, the contained B-cell epitopes and their Emini scores.
Figure 3
figure 3

Surface accessibility of the SARS-CoV-2. The red color represents the exposed residues, the yellow color represents the medium exposed residues and the blue color represents the buried residues. In the SARS-CoV-2 spike protein, the B-cell epitopes in the 14 vaccine subunits are well-exposed according to the surface accessibility prediction, showing good potential that the B-cell receptor is able to interact with the virus to trigger the immune response.

Cytotoxic T lymphocytes (CTL) epitopes prediction

Cytotoxic T Lymphocytes (CTL) recognize the infected cells by using the MHC class I molecules to bind with certain CTL epitopes26. We use NetMHCpan 4.1 server43 to predict potential CTL epitopes. All the overlapping 9aa peptide sequences in the 14 vaccine subunits are tested with the most common 12 human-leukocyte-antigen (HLA) Class I alleles including HLA-A1, HLA-A2, HLA-A3, HLA-A24, HLA-A26, HLA-B7, HLA-B8, HLA-B27, HLA-B39, HLA-B44, HLA-B58 and HLA-B62 to evaluate their binding affinities and predict potential CTL epitopes26,44. The total HLA score is calculated for each vaccine subunits. The results are shown in Table 6.

Table 6 CTL epitopes prediction results.

Helper T lymphocytes (HTL) epitopes prediction

Helper T Lymphocytes (HTL) help the activity of other immune cells and they recognize the infection by using MHC class II molecules to bind with certain HTL epitopes45. We use NetMHCIIpan 4.0 server46 to predict potential HTL epitopes. All the overlapping 15aa peptide sequences in the 14 vaccine subunits are tested with the most common 13 HLA Class II alleles including HLA-DRB1-0101, HLA-DRB1-0301, HLA-DRB1-0401, HLA-DRB1-0701, HLA-DRB1-0801, HLA-DRB1-0901, HLA-DRB1-1001, HLA-DRB1-1101, HLA-DRB1-1201, HLA-DRB1-1301, HLA-DRB1-1401, HLA-DRB1-1501, HLA-DRB1-1601 to evaluate their binding affinities and predict the potential HTL epitopes45,47. The total HLA score is calculated for each vaccine subunits. The results appears in Table 7.

Table 7 HTL epitopes prediction results.

Worldwide human population coverage analysis

The vaccine we design should have wide human population coverage. We use the IEDB population coverage analysis tool48 to evaluate the worldwide human population coverage of the 14 vaccine subunits. The 25 HLA alleles we used to predict the T-cell epitopes can cover 98.39% of the human population. The human population coverage of each vaccine subunit is shown in Table 8. The results suggest that our 14 vaccine subunits can cover a very wide range of human population.

Table 8 Worldwide human population coverage analysis results.

Multi-epitope vaccine construction

We discard Subunits 9, 15 and 26 for their poor performance in the CTL and HTL epitope predictions. We use the remaining 11 vaccine subunits to construct a final multi-epitope vaccine (see Fig. 4). To avoid potential autoimmunity, we perform a BLASTp screening against the Uniprot database on those 11 vaccine subunits. A subunit with a higher-than-35% identity will be considered as homologous protein with human proteome. Among the 11 vaccine subunits we choose for the final vaccine construction, none of them show high degree of homology with the human proteome. The final vaccine contains an adjuvant, 50S ribosomal protein L249,50 (accession no. AXI95322.1), to improve the immune response51, linked with the amino (N) terminum of the multi-subunit sequence through an EAAAK linker52. The multi-subunit sequence has a CTL multi-epitope peptides region followed by an HTL multi-epitope peptides region. The CTL region is constructed by 6 subunits which have better performance in the CTL epitopes prediction. AAY linkers52 are used in this region to fuse the subunits. The HTL region is constructed by 6 subunits which have better performance in the HTL epitopes prediction. GPGPG linkers52 are used in this region to fuse the subunits. The two regions are linked through a GPGPG linker. In addition, Subunit 5 is used twice in both CTL and HTL region for its good performance in both CTL and HTL epitope predictions. In the end, a 6xHis tag is added at the C-terminal to help purify and identify the protein53. The final vaccine consists of 694 amino acid residues. It contains 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes.

Figure 4
figure 4

Schematic Presentation of the final Multi-epitope Vaccine. The vaccine is constructed by 11 subunits (Subunit 5 is used twice in both CTL and HTL region for its good performance), an adjuvant and a 6xHis tag, linked by EAAAK, AAY and GPGPG linkers. The final vaccine consists of 694 amino acid residues. It contains 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes.

Antigenicity, allergenicity and solubility evaluation

The antigenicity of the final multi-epitope vaccine sequence is evaluated by the Vaxijen 2.0 online server34,54 and the AntigenPro server55. We also evaluate the antigenicity of each vaccine subunit, including the adjuvant (see Table 9). The Vaxijen score for the whole final vaccine is 0.5705 with a virus model at a threshold of 0.4, suggesting a high antigenicity of our final vaccine. The AllergenFP 1.0 server and AllerTOP 2.0 server56 predict the final vaccine and its every subunit to be non-allergenic (see Table 9). The solubility of the final vaccine and its every subunit is evaluated by SolPro57 and Protein-sol server58. The predicted values suggest that our final vaccine and its every subunit have good solubility (see Table 9).

Table 9 Antigenicity, allergenicity and solubility Evaluation Results. NA: non-allergen. Higher Vaxijen and Antigen Pro scores suggest higher antigenicity. Higher SolPro and Protein-sol scores suggest higher solubility.

Toxicity and physicochemical properties analysis

The vaccine must not have toxicity potential and the physicochemical properties are also important to evaluate how the vaccine interacts with the environments59. We use the ToxinPred server60 to predict the toxicity. Other physicochemical properties, including hydropathicity, charge, half-life, instability index, pI (theoretical isoelectric point value) and molecule wheight, are predicted by ExPASy ProtParam Tool61. For the whole final vaccine sequence and the adjuvant sequence, we use the protein screening mode in the ToxinPred server to check all its overlapping peptides with length no more than 50 aa. The whole vaccine and the adjuvant do not contain any toxic part peptide. Other subunits and the 6xHis tag are checked by the SVM prediction mode in the ToxinPred server and all the subunits and the 6xHis tag are non-toxicity. The hydropathicity value of the final vaccine is predicted to be − 0.521. This negative value suggests that our final vaccine is hydrophilic in nature and can interact with water molecules easily62. The charge is 37.00; this value will decrease in alkaline environment so usually it is better if the charge values are positive. The half-life of the final vaccine is predicted to be 30 h in vitro and > 20 h in vivo. An Instability Index of 34.01 is predicted; this being less than 40 threshold value suggests that our final vaccine is stable. The pI of the final vaccine is calculated to be 9.75, which is an alkaline value, indicating its highly basic existence in nature. The molecular weight of the final vaccine is calculated to be 76 kDa. We also check the toxocity and physicochemical properties of every subunit and the results are shown in Table 10.

Table 10 Toxicity and physicochemical properties prediction results. NT: none-toxicity. We use the protein screening mode in the ToxinPred server to check the overlapping peptides in the final vaccine and adjuvant sequence and they do not contain any toxic peptide. For the rest subunits, we directly use the SVM based prediction to predict their toxicity.

Secondary structure prediction

We use PSIPRED63 to generate the secondary structure of our final vaccine. Graphical representation of the secondary structure features are shown in Fig. 5. The predicted secondary stucture indicates that the final vaccine constitutes 10.8% alpha helix, 24.6% beta strand, and 64.6% coil. The solvent accessibility (ACC), and disorder regions (DISO) are predicted by RaptorX Property server64,65 (see Fig. 6). Among the 694 amino acid residues in our final vaccine, 44% are predicted to be exposed, 27% medium exposed, and 27% are predicted to be buried. The peptides marked in red boxes in Fig. 6 are the B-cell epitopes, showing good surface accessibility and they are not close to each other. A total of 60 residues (8%) are predicted to be located in disordered regions.

Figure 5
figure 5

Graphical Representation of the Secondary Structure Features. The alpha helix residues are in pink, the beta strand residues are in yellow and the coil residues are in grey. The predicted secondary stucture indicates that the final vaccine constitutes 10.8% alpha helix, 24.6% beta strand, and 64.6% coil, respectively.

Figure 6
figure 6

Solvent Accessibility and Disorder Regions Prediction Results. In the solvent accessibility prediction results, the red color represents the exposed residues, the yellow color represents the medium exposed residues and the blue color represents the buried residues. The peptides marked in red boxes are B-cell epitopes. The prediction results show that the B-cell epitopes in the final vaccine have good surface accessibility and also they are not close to each other. In the disorder regions prediction results, the ordered regions are in blue while the disordered regions are in red. A total of 60 residues (8%) are in disordered regions, showing good order in structure.

Vaccine 3D structure modeling

We use the RaptorX server66 to build the 3D structure models of our final vaccine. The protein structure with PDB ID 3j3vC is predicted by RaptorX to be the best template, based on which this server constructs the 3D structure model of our final vaccine (see Fig. 7). In this model, 100% (694) amino acids in the final vaccine are modeled in four domains. The P-value quantifies the likelihood of the predicted model being worse than other models generated randomly. The P-value for this model is calculated to be 4.13 × 10−14, which is a very low value, suggesting high quality of this 3D model. The unnormalized Global Distance Test (uGDT) score measures the absolute model quality. The overall uGDT score is predicted to be 506 and being greater than the 50 threshold value for a protein with more than 100 amino acid residues indicates that the 3D model of our final vaccine is good for further refinement.

Figure 7
figure 7

Vaccine 3D Structure Modeling by RaptorX based on the template with PDB ID 3j3vC. All the 694 amino acids in the final vaccine are modeled. The P-value of this model is 4.13 × 10 − 14 and this very low value indicates high quality of this 3D model. The unnormalized Global Distance Test (uGDT) score of this model is 506 (> 50), indicating good absolute model quality.

Vaccine 3D structure refinement

We use GalaxyRefine server67 to refine the 3D structure model of our final vaccine. Among the 5 refined models predicted by GalaxyRefine, we choose the Model 2 shown in Fig. 8 as our final vaccine model based on its model quality scores (see Table 11). The predicted B-cell epitopes are highlighted in yellow, showing good surface accessibility. Global Distance Test—High Accuracy (GDT-HA) score measures the similarity between two protein structures. The GDT-HA score between this refined model and the initial model reaches a high value of 0.900, indicating that they have high similarity. The distance between atoms is measured by the Root Mean Square Deviation (RMSD) score. Lower RMSD value suggests better stability and usually an RMSD score ranges between 0 and 1.2 is acceptable. This model has an RMSD score of 0.580. Such RMSD score indicates stable protein structure. Molprobity score reflects the crystallographic resolution of the model. The MolProbity score of our identified vaccine model is 2.618, which is much lower than the initial model, showing that the refinement has lowered the critical errors of the 3D model. The Clash Score reflects the number of unfavorable all-atom steric overlaps and the refinement reduced the clash score of the model from 137.8 to 33.5, improving the model stability to a high level. The Ramachandran plot score represents the size of energetically favoured regions and usually a value greater than 85% is acceptable. The Ramachandran plot score has been improved from 78.3 to 87.5% by the refinement. The quality scores of the refined model shows good overall quality.

Figure 8
figure 8

Refined Vaccine 3D Structure Model by GalaxyRefine. This model has a Global Distance Test—High Accuracy (GDT-HA) score of 0.900, a Root Mean Square Deviation (RMSD) score of 0.580, a MolProbity score of 2.618, a clash score of 33.5 and a Ramachandran plot score of 87.5%, showing great overall model quality. The B-cell epitopes in this final vaccine 3D model are highlighted in yellow.

Table 11 Quality scores of the models predicted by GalaxyRefine.

Vaccine 3D structure validation

We use ProSA-web68 to validate the overall model quality of the refined final vaccine model. ProSA predicts a Z-score of -6.51 (see Fig. 9) for the refined model, which is lying inside the score range of the comparable sized native proteins, indicating good overall model quality. ProSA also checks the local model quality and the residue scores are plotted in Fig. 9. Negative values suggest no erroneous parts of the model structure. We also use RAMPAGE server to do the Ramachandran plot analysis and it reveals a Ramachandran plot score of 87.5%, which is consistent with the results of GalaxyRefine.

Figure 9
figure 9

Vaccine 3D Structure Validation by ProSA-web. The Z-score of the refined model is -6.51 which is lying inside the score range. ProSA-web also plots the residues scores to check the local model quality and the negative values suggest no erroneous parts of the model structure.

Conformational B-cell epitope prediction

The structure and folding of the new protein can result in new conformational B-cell epitopes which requires additional predictions. We use ElliPro server69 to predict the conformational B-cell epitopes in the refined 3D model. The ElliPro server predicts 6 new conformational B-cell epitopes which involved 387 residues with scores ranging from 0.531 to 0.963. The detailed 3D model and information of those 6 epitopes are shown in Fig. 10.

Figure 10
figure 10

The 3D model of the 6 predicted conformational B-cell epitopes in the refined final vaccine structure. The yellow parts are the conformational B-cell epitopes and the grey parts are the rest of the residues. (a) 3 residues with a score of 0.963. (b) 30 residues with a score of 0.757. (c) 167 residues with a score of 0.711. (d) 161 residues with a score of 0.688. (e) 23 residues with a score of 0.59. (f) 3 residues with a score of 0.531.

Codon optimization and in silico cloning

We analyze the cloning and expression efficiency and optimize the codon usage of vaccine construct in E. coli (Escherichia coli) strain K12) by Java Codon Adaptation Tool70. The length of the optimized codon sequence is 2082 nucleotides. Its Codon Adaptation Index (CAI) is 0.997, and the average GC content is 50.73%, indicating a great potential of good expression of the final vaccine in the E. coli host. After the optimization, we use the SnapGene tool to insert the codon sequences into pET28a( +) vector for cloning71 (see Fig. 11). The codon sequence of the final vaccine is presented in red, which is the 2082 bp gene sequence generated by the JCat server. The pET28a( +) expression vector is in black. The codon sequence is inserted between Eco53KI (188) and EcoRV (1573), forming a clone with a total length of 6066 bp.

Figure 11
figure 11

available at https://www.snapgene.com).

Final Vaccine in silico cloning into the pET28a( +) vector. The codon sequence of the final vaccine is in red, which is a 2082 bp gene sequence generated by the JCat server. The pET28a( +) expression vector is in black. The codon sequence is inserted between Eco53KI (188) and EcoRV (1573), forming a clone with a toal length of 6066 bp. This image was created by SnapGene 5.1.5 software (from Insightful Science;

Molecular docking

Molecular docking can evaluate the interactions between a ligand molecule and the receptor molecule to check the stability and binding affinity of their docked complex. Toll-like receptor 4 is an important human protein for pathogen recognition and immune response. Consequently, we choose TLR4 as the immune receptor to perform the molecular docking. We use the ClusPro 2.0 server72 to perform the molecular docking between the refined 3D model of our final vaccine and the TLR4 (PDB ID: 4G8A) immune receptor. Among all the generated docking model, we select the one with the lowest energy score of -1311.5 as the best docked complex, suggesting that the vaccine model occupies the receptor properly and indicating good binding affinity (see Fig. 12).

Figure 12
figure 12

The docked complex of the vaccine model and the TLR4 immune receptor. The vaccine protein is in yellow and the rest of the residues is the TLR4 receptor. The lowest energy score of this complex model is -1311.5, indiating good binding affinity.

Molecular dynamics simulation of the vaccine-receptor complex

To evaluate the stability and physical movements of the vaccine-TLR4 docked complex17,73, we perform molecular dynamics simulation by the iMOD server74. The main-chain deformability is shown in Fig. 13a. The locations with hinges are regions with high deformability. The B-factor values calculated by normal mode analysis are proportional to root mean square (see Fig. 13b). B-factor values quantify the uncertainty of each atom. Figure 13c presents the eigenvalues which are closely related to the energy required to deform the structure and the eigenvalue of the complex is 5.426 × 10−6. The covariance matrix between the pairs of residues is shown in Fig. 13d, indicating their correlations (red: correlated, white: uncorrelated, blue: anti-correlated). The elastic network model is shown in Fig. 13e, suggesting the connection between atoms and springs. The molecular dynamic simulation results suggest that our vaccine model is stable.

Figure 13
figure 13

The molecular dynamics simulation of the vaccine-TLR4 docked complex. (a) Main-chain deformability simulation, the hinges are regions with high deformability. (b) B-factor values calculated by normal mode analysis, quantifying the uncertainty of each atom. (c) The eigenvalue of the docked complex, showing the energy required to deform the structure. (d) The covariance matrix between pairs of residues (red: correlated, white: uncorrelated, blue: anti-correlated). (e) The elastic network model, suggesting the connection between atoms and springs. The springs are more rigid if their greys are darker.

RNA mutations

As the SARS-CoV-2 spreads all over the world, its RNA sequence is going through mutations, translating out different virus proteins. Such mutations can have influences on the epitope based vaccines, since a single amino acid difference can change the epitope prediction results. Therefore it is important to prove that the proposed final multi-epitope vaccine can tackle the mutations. With our DeepVacPred, we are also able to quickly examine the mutated protein sequences to search for new potential vaccine subunits.

The RNA sequence we use to translate the spike protein and design the vaccines is from Wuhan, which is the place of the original virus35. The RNA mutations lead to three most frequent changes in the spike protein area of the SARS-CoV-2 and each of the changes contains one amino acid change75. Table 12 shows the mutation details.

Table 12 Spike protein mutations. Occurrence is the number of isolates that showed the mutation. Region is the origin of the isolates.

The mutation at the 614aa in spike protein from D to G is the most frequent mutation with 116 known isolates75. This mutation is very common in many cities in North America. In Europe and South America the D614G mutation occurs in less than 10 isolates. This change has no influence on the final multi-epitope vaccine since it does not contain the 614aa of the spike protein. With DeepVacPred, we are also able to quickly check and identify whether the mutation can create new potential vaccine subunits. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.

At 476aa in spike protein there is a frequent mutation from G to S, which occurs in 3 isolates from Washington DC75. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 476aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.

At 483aa in spike protein there is a frequent mutation from V to A, which occurs in 6 isolates from Washington DC75. This mutation has no influence on the final multi-epitope vaccine since it does not contain the 483aa of the spike protein. We input the mutated protein sequence into DeepVacPred and the predicted subunits are the same as the original virus.

In conclusion, our designed multi-epitope vaccine can tackle the current RNA mutations of the coronavirus. The current RNA mutations of the coronavirus create no new potential vaccine subunits.

Discussion

In silico vaccine design has high value of efficacy and it strongly emphasizes the multi-epitope in the vaccine peptides. In this study, we develop DeepVacPred, an efficient vaccine subunit sieving framework, that exploits an AI-based approach to rapidly select 26 potential vaccine subunit candidates, introducing a new way for achieving a much higher speed and efficiency in in silico vaccine design. The goal is to directly predict the potential vaccine subunit sequence without the need to do a large number of different predictions, as well as to evaluate and select the predicted results manually. With this AI-based framework, we are able to skip at least 95% of unnecessary predictions and let the computer analyze and select the best vaccine subunits for us. DeepVacPred predicts the 26 vaccine subunits within less than a second, which enables us to skip the most time consuming part of the in silico vaccine design. With DeepVacPred, a researcher can construct a multi-epitope vaccine for a new virus and validate its quality within an hour.

This approach can be further developed by enhancing the complexity and coverage of the dataset. In this study, we selected a part of known epitopes and protective antigens to form the dataset and use it for training the DNN architecture. We use the simple bridging of one B-cell epitopes and one T-cell epitopes. With a more comprehensive dataset and more possibilities of epitope combinations, we will be able to develop a better, more comprehensive and quicker vaccine design tool. In spite of limited available datasets, the current framework can still deal with most of the situations now and provide an efficacious vaccine design.

The application of AI, and DNN methodology in particular, to protein sequences classification shows great potential. Most of the online tools rely on the SVM learning approaches. In the highly popular protective antigens prediction tool Vaxijen34, the AUC of the ROC curve can only reach 0.743, which cannot perform very accurate predictions. The dataset to train Vaxijen only contains 200 proteins, so it becomes more time consuming and challenging to rely on the SVM model with increasing number of discovered protective antigens. Consequently, the proposed DeepVacPred proves that DNN can perform a very accurate prediction with over 700,000 different proteins in the dataset.

This study eventually results in a novel multi-epitope vaccine with a length of 649aa against the SARS-CoV-2. It contains an adjuvant, 11 subunits with 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes. It shows good antigenicity, population coverage and good physichochemical properties and structures, providing great potential for the next step COVID-19 vaccine design with actual experiments and clinical studies.

Furthermore, we trace the RNA mutations of the SARS-CoV-2 virus. Basically, the RNA mutations can result in one amino acid change in the spike protein or other related proteins. The proposed vaccine design framework can also tackle the three most frequently observed mutations as well as it can be extended to deal with other potentially unknown mutations. The investigation on the RNA mutations also proves the high efficiency of our DeepVacPred. As future work, we will investigate novel AI algorithms and architectures capable of constructing multi-epitope vaccine designs that can overcome the unknown unknowns of viruses evolution.

Methods

DNN design and training in DeepVacPred framework

Each data input to the DNN architecture is a sequence with a length of 45 vectors which is converted from its protein sequence by Z-descriptors31 and ACC transformation32. Convolutional Neural Network (CNN) exhibits good performance to identify and process such vectors while multi-layer linear neural network is broadly connected to the ouput layer of the CNN, forming a complex DNN to enhance the classification ability. Hence, our DNN is constructed by the following layers and the parameters of each layer is decided using a random search to obtain high accuracy while maintaining good computing speed:

i. CNN, in channels = 1, out channels = 16, kernel size = 3, stride = 2, padding = 1, Tanh function;

ii. CNN, in channels = 16, out channels = 16, kernel size = 3, stride = 2, padding = 1, Tanh function;

iii. CNN, in channels = 16, out channels = 1, kernel size = 3, stride = 2, padding = 1, Tanh function, average pooling;

iv. Linear, in features = 32, out features = 64 , Tanh function;

v. Linear, in features = 64, out features = 32, Tanh function;

vi. Linear, in features = 32, out features = 16, Tanh function;

vii. Linear, in features = 16, out features = 2, Sigmoid function.

The hyper-parameters of the DNN training are listed below. The selected hyper-parameter values are marked in bold. We choose the hyper-parameters with good accuracy while maintaining good computing speed by using a random search.

i. Learning rate: [0.0001, 0.0005, 0.001, 0.0015, 0.002];

ii. Optimizer: [SGD, RMSProp, Adam];

iii. Epochs: [2000, 4000, 6000, 8000, 10000];

iv. Batch size: [1024, 2048, 4096, 8192].

Linear B-cell epitopes prediction

We use four popular server to predict the linear B-cell epitopes on each vaccine subunit candidates. (1) BepiPred-2.0 web server (http://www.cbs.dtu.dk/services/BepiPred/). BepiPred is a reliable machine learning based tool trained by random forest algorithm and its training dataset covers a large number of known linear B-cell epitopes from the IEDB database25. (2) ABCpred (http://www.imtech.res.in/raghava/abcpred/). ABCPred applies recurrent neural network to the classification of epitopes and non-epitopes to improve the accuracy39. (3) SVMTrip (http://sysbio.unl.edu/SVMTriP/). SVMTrip uses support vector machine to predict antigenic epitopes and its AUC reaches a value of 0.70238. (4) BCPreds (http://ailab.ist.psu.edu/bcpred/). BCPreds is also based on SVM model with an AUC value of 0.758 and its prediction relies on kernel methods40. The B-cell surface accessibility is checked by IEDB Emini tool42.

Cytotoxic T lymphocytes (CTL) epitopes prediction

We use NetMHCpan 4.1 server (http://www.cbs.dtu.dk/services/NetMHCpan/) to predict the CTL epitopes on each vaccine subunit candidates. We predict the CTL epitopes with a length of 9aa. All the parameters are set at default. NetMHCpan predicts peptide binding to any MHC Class I molecule of known sequence using artificial neural networks (ANNs) which is trained on a combination of more than 850,000 quantitative Binding Affinity (BA) and Mass-Spectrometry Eluted Ligands (EL) peptides, providing reliable prediction results43.

Helper T lymphocytes (HTL) epitopes prediction

We use NetMHCIIpan 4.0 server (http://www.cbs.dtu.dk/services/NetMHCIIpan/) to predict the HTL epitopes on each vaccine subunit candidates. We predict the HTL epitopes with a length of 15aa. All the parameters are set at default. NetMHCIIpan predicts peptide binding to any MHC II molecule of known sequence using artificial neural networks (ANNs) which is trained on an extensive dataset of over 500,000 measurements of Binding Affinity (BA) and Eluted Ligand mass spectrometry (EL), covering the three human HLA-DR, HLA-DQ and HLA-DP alleles, providing reliable prediction results46.

Multi-epitope vaccine construction

In this section, the BLASTp screening is done by the Uniprot server (https://www.uniprot.org/blast). BLASTp can identify similar regions between two sequences.

Worldwide human population coverage analysis

The worldwide human population coverage of each subunit is evaluated by IEDB population coverage analysis tool (http://tools.iedb.org/population/). The evluation is done on the worldwide human population.

Antigenicity, allergenicity and solubility evaluation

The antigenicity of the final vaccine and its every subunit is predicted by VaxiJen 2.0 server (http://www.ddg-pharmfac.net/ vaxijen/VaxiJen/VaxiJen.html) and AntigenPro server (http://scratch.proteomics.ics.uci.edu). Vaxijen is based on auto cross covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties34. Antigenpro is a sequence-based, alignment-free and pathogen-independant predictor of protein antigenicity55. The allergenicity of the final vaccine and its every subunit is checked by AllergenFP 1.0 server (http://ddg-pharmfac.net/AllergenFP/) and AllerTOP 2.0 server (https://www.ddg-pharmfac.net/AllerTOP/). AllergenFP and is a binary classfier between allergens and non-allergens. The dataset is described by five E-descriptors and the strings are transformed into uniform vectors by auto-cross covariance (ACC) transformation76. AllerTop is also based on ACC transformation and E-descriptors56. The solubility is evaluated by SolPro server (http://scratch.proteomics.ics.uci.edu) and Protein-sol server (https://protein-sol.manchester.ac.uk). SolPro is an SVM based tool to predict the solubility of a protein sequence with an overall accuracy of over 74% estimated by tenfold cross-validation57. Protein-sol is based on available data for Escherichia coli protein solubility in a cell-free expression system58.

Toxicity and physicochemical properties analysis

The toxicity of the final vaccine and its every subunit is predicted by ToxinPred server (http://crdd.osdd.net/raghava/toxinpred/). TonxinPred is based on SVM model to classify toxicity and non-toxicity. The dataset used in its method consists of 1805 toxic peptides (≤ 35 residues)60. The physicochemical properties of the final vaccine and its every subunit is predicted by ExPASy ProtParam server (https://web.expasy.org/protparam/). The physicochemical properties include hydropathicity, charge, half-life, instability index, pI (Theoretical isoelectric point value) and molecule wheight61.

Secondary structure prediction

PSIPRED is used for the secondary structure prediction of our final vaccine (http://bioinf.cs.ucl.ac.uk/psipred/). PSIPRED incorporates two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position Specific Iterated—BLAST). It achieves an average Q3 score of 81.6%, which can achieve accurate secondary structure prediction63. We also use RaptorX Property web server (http://raptorx.uchicago.edu/StructurePropertyPred/predict/) to predict the solvent accessibility (ACC) and disorder regions (DISO). RaptorX employs an emerging machine learning model called DeepCNF (Deep Convolutional Neural Fields) to predict secondary structure (SS), solvent accessibility (ACC), and disorder regions (DISO) simultaneously65.

Vaccine 3D structure modeling

The 3D model of the final vaccine is constructed by RaptorX server (http://raptorx.uchicago.edu/ContactMap). RaptorX provides distance-based protein folding powered by deep learning. This server was officially ranked 1st in contact prediction in both CASP12 and CASP13 and initiated the revolution of protein structure prediction by deep learning66.

Vaccine 3D structure refinement

The 3D model built by RaptorX server is refined by GalaxyRefine (http://galaxy.seoklab.org/cgi-bin/submit.cgi?type=REFINE). GalaxyRefine first rebuilds side chains and performs side-chain repacking and subsequent overall structure relaxation by molecular dynamics simulation. According to the CASP10 assessment, the GalaxyRefine server method performed the best in improving local structure quality67 The quality of the refined model is evaluated in terms of its GDT-HA socre, RMSD score, Molprobity score, clash score and Ramachandran plot score.

Vaccine 3D structure validation

The final refined 3D model of our final vaccine is validated by ProSA-web server(https://prosa.services.came.sbg.ac.at/prosa.php). ProSA calculates an overall quality score for a specific input structure. If this score is outside a range characteristic for native proteins the structure probably contains errors. A plot of local quality scores points to problematic parts of the model which are also highlighted in a 3D molecule viewer to facilitate their detection68.

Conformational B-cell epitope prediction

The conformational B-cell epitopes in the refined final vaccine 3D structure model are predicted by the ElliPro Server (http: //tools.iedb.org/ellipro). ElliPro is based on the geometrical properties of protein structure. Among the current conformational B-cell epitope prediction tools, ElliPro has the best AUC score of 0.732, which is a very reliable tool for identifying antibody epitopes in protein antigens69.

Codon optimization and in silico cloning

Java Codon Adaptation Tool (JCat) server is used for codon optimization (https://urldefense.com/v3/__http://www.jcat.de/LIr3w8kk_Xxm!7wRJ08pRiYapODc_l0a3Lu91JwL-k63K5zWwthwiCfq_ctg6SmoWSkB2JxUzyRA). JCat adapts the codon usage to most sequenced prokaryotic organisms and selected eukaryotic organisms70. The optimized codon sequence is insert into pET28a( +) vector with SnapGene 5.1.5 software (from Insightful Science; available at https://www.snapgene.com).

Molecular docking

The molecular docking is done by ClusPro 2.0 server (https://cluspro.bu.edu). ClusPro is a widely used tool for protein–protein docking. Docking with each energy parameter set results in ten models defined by centers of highly populated clusters of low-energy docked structures72. We choose TLR4 (PDB ID: 4G8A) as the immune receptor. We select the docked complex with the lowest energy score.

Molecular dynamics simulation of the vaccine-receptor complex

The molecular dynamics simulation is done by iMOD server (iMODS) (http://imods.chaconlab.org). iMODS facilitates the exploration of such modes and generates feasible transition pathways between two homologous structures74. The iMOD server evaluates the protein stability by computing its internal coordinates through normal mode analysis (NMA). The stability of the protein is represented in terms of its main-chain deformability plot, B-factor values, eigenvalue, covariance matrix and elastic network model.