Bioinformatic prediction of potential T cell epitopes for SARS-CoV-2

To control and prevent the current COVID-19 pandemic, the development of novel vaccines is an urgent issue. In addition, we need to develop tools that can measure and monitor T-cell and B-cell responses to understand how our immune system responds to this deleterious virus. However, little information is currently available about the epitopes of the novel coronavirus (SARS-CoV-2) that induce host immune responses. Through a comprehensive bioinformatic screening of potential epitopes derived from the SARS-CoV-2 sequences for HLAs commonly present in the Japanese population, we identified 2013 and 1399 possible peptide epitopes that are likely to have high affinity (<0.5%- and 2%-rank, respectively) to HLA class I and II molecules, respectively, and that may induce CD8+ and CD4+ T-cell responses. These epitopes were distributed across the structural (spike, envelope, membrane, and nucleocapsid proteins) and the nonstructural proteins (proteins corresponding to six open reading frames); however, we found several regions where high-affinity epitopes were significantly enriched. By comparing the sequences of these predicted T cell epitopes with those of other coronaviruses, we identified 781 HLA-class I and 418 HLA-class II epitopes that have high homology to SARS-CoV. To further select commonly available epitopes that would be applicable to larger populations, we calculated population coverage based on the allele frequencies of HLA molecules, and found 2 HLA-class I epitopes covering 83.8% of the Japanese population. The findings of the current study provide valuable information for designing widely available vaccine epitopes against SARS-CoV-2 and for monitoring T-cell responses.


Introduction
In December 2019, a cluster of severe pneumonia cases of unknown etiology was found in the city of Wuhan in Hubei province, China. Shortly thereafter, a novel Betacoronavirus, SARS-CoV-2, was identified as the causative agent of this severe acute respiratory disease. The World Health Organization (WHO) declared the outbreak of coronavirus disease 2019 (COVID-19) a public health emergency of international concern and put in place a series of temporary recommendations on January 30, 2020. The current outbreak of COVID-19 has nearly 3 million confirmed cases worldwide with more than 200,000 deaths, as of April 27, 2020, according to the WHO. The genome sequence of SARS-CoV-2 was reported to consist of ~30,000 nucleotides with high sequence similarity to other Betacoronaviruses, including severe acute respiratory syndrome coronavirus (SARS-CoV; 79%) and Middle East respiratory syndrome coronavirus (MERS-CoV; 50%) [1][2][3]. The SARS-CoV-2 genome, like those of other coronaviruses, encodes multiple structural and nonstructural proteins. The structural proteins include spike protein (S), envelope protein (E), membrane glycoprotein (M), and nucleocapsid phosphoprotein (N), and the nonstructural proteins include open reading frame 1ab (ORF1ab), ORF3a, ORF6, ORF7a, ORF8, and ORF10. Previous studies have suggested that SARS-CoV-2 putatively has a cell entry mechanism and human cell receptor usage similar to those of SARS-CoV [4,5].
Much research is now underway to develop effective interventions for controlling and preventing the COVID-19 pandemic, including therapeutic drugs such as inhibitors of the RNA-dependent RNA polymerase or the viral protease and blockers of virus-cell membrane fusion, as well as vaccines, and large-scale clinical trials have just begun [6,7]. For vaccine design against SARS-CoV-2 and the evaluation of the immunogenicity of candidate vaccines, it is important to predict epitopes of SARS-CoV-2 and to detect the immune responses they induce. However, little information is currently available on which parts of the SARS-CoV-2 sequence are important for our immune responses.
Therefore, in the current study, we comprehensively screened potential T cell epitopes from the SARS-CoV-2 sequence using bioinformatic tools, and also assessed the conservation of these epitopes across different coronavirus species, including SARS-CoV and MERS-CoV.

Comparison of coronavirus sequences
Alignment of downloaded sequences was done with Genetyx software (version 8.0.0). The similarity among the sequences was visualized using SimPlot software (version 3.5.1) [8], with the consensus sequence of SARS-CoV-2 isolated from Wuhan-Hu-1 (MN908947) as the query.
Binding affinity to HLA class I molecules was calculated for all 9- and 10-mer peptides from SARS-CoV-2 proteins using NetMHC 4.0 and NetMHCpan 4.0 software [11,12]. We selected the top 0.5%-ranked epitopes based on the prediction score as strongly binding epitopes. Binding affinity to HLA class II molecules was calculated for all 15-mer peptides from SARS-CoV-2 proteins using NetMHCIIpan 3.1 software [13]. For class II, we applied a threshold of the top 2%-ranked epitopes based on the prediction score to define strong binders.
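The peptide enumeration and rank filtering described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the `Peptide` record, and the `(peptide, allele, percent_rank)` tuple layout are hypothetical, and the predictor output is assumed to have already been parsed into that layout. Peptide positions are 1-based, matching the notation used for epitopes such as S1060-1068.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Peptide(NamedTuple):
    protein: str    # protein label, e.g. "S" or "ORF1ab"
    start: int      # 1-based position of the first residue
    end: int        # 1-based position of the last residue
    sequence: str


def enumerate_peptides(protein: str, sequence: str,
                       lengths: Tuple[int, ...] = (9, 10)) -> Iterator[Peptide]:
    """Yield every overlapping peptide of the requested lengths."""
    for k in lengths:
        for i in range(len(sequence) - k + 1):
            yield Peptide(protein, i + 1, i + k, sequence[i:i + k])


# Hypothetical layout of one parsed prediction: (peptide, allele, percent_rank)
Prediction = Tuple[Peptide, str, float]


def strong_binders(predictions: Iterable[Prediction],
                   max_rank: float) -> List[Prediction]:
    """Keep predictions whose %rank is below the threshold.

    Assumption: a strict '<' comparison; use 0.5 for class I (9-/10-mers)
    and 2.0 for class II (15-mers, lengths=(15,)).
    """
    return [p for p in predictions if p[2] < max_rank]
```

For class II, the same enumeration would be called with `lengths=(15,)` and the filter with `max_rank=2.0`.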

Mutation analysis
To identify mutations of SARS-CoV-2, we used a total of 6421 SARS-CoV-2 sequences isolated in different areas, including 587 sequences from Asia, 1918 from North America, 3190 from Europe, and 726 from Oceania, which were deposited in the Global Initiative on Sharing All Influenza Data (GISAID) as of 18 April 2020. We first aligned each of these SARS-CoV-2 sequences to the reference sequence SARS-CoV-2_Wuhan-Hu-1 (accession number MN908947) using BLAT software [14]. After the alignment, we extracted the nucleotide sequences corresponding to individual proteins of SARS-CoV-2, translated them into amino acid sequences, and then compared them with the reference amino acid sequences of SARS-CoV-2_Wuhan-Hu-1 (accession numbers QHD43415-QHD43423, QHI42199).
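The final comparison step can be illustrated with a short sketch. It assumes that each isolate's protein has already been extracted and translated, and that it has the same length as the reference (no insertions or deletions); both are simplifying assumptions, and the function names are hypothetical. Residues reported as "X" (ambiguous) are skipped, which is also an assumption rather than a documented choice of the study.

```python
from collections import Counter
from typing import Dict, Iterable, List


def substitutions(reference: str, query: str) -> List[str]:
    """List amino acid substitutions such as 'K2R' (reference, position, query).

    Assumes equal-length, already aligned protein sequences (no indels).
    Ambiguous residues ('X') in the query are ignored.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must have the same length")
    return [
        f"{ref_aa}{pos}{query_aa}"
        for pos, (ref_aa, query_aa) in enumerate(zip(reference, query), start=1)
        if ref_aa != query_aa and query_aa != "X"
    ]


def substitution_frequencies(reference: str,
                             queries: Iterable[str]) -> Dict[str, float]:
    """Fraction of isolates carrying each substitution."""
    counts: Counter = Counter()
    total = 0
    for query in queries:
        counts.update(substitutions(reference, query))
        total += 1
    if total == 0:
        return {}
    return {mutation: n / total for mutation, n in counts.items()}
```

Running `substitution_frequencies` separately on the isolates from each region would give per-region frequencies that can then be compared against a cutoff.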

Statistical analysis
Fisher's exact test was used to analyze the enrichment of epitopes and differences of mutation rates of SARS-CoV-2 isolated from different areas. Statistical analysis was carried out using the R statistical environment version 3.6.1.
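As an illustration of the enrichment test, the sketch below implements a two-sided Fisher's exact test for a 2x2 table using only the Python standard library. The study itself used R; this is a stand-in for exposition, and how the 2x2 table is built from epitope counts is not specified in the text. The odds ratio returned here is the simple sample odds ratio (ad/bc), whereas R's `fisher.test` reports a conditional maximum likelihood estimate, so the two values can differ slightly.

```python
from math import comb
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample_odds_ratio, p_value). The p-value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Toy counts for illustration only (not data from this study).
odds_ratio, p_value = fisher_exact_two_sided(3, 1, 1, 3)
```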

Results
We first screened potential epitopes that are likely to be presented on HLA class I molecules (HLA-A, B, and C) that are commonly observed (frequencies of more than 5%) in the Japanese population [9], using the NetMHC 4.0 and NetMHCpan 4.0 algorithms [11,12]. We selected the top 0.5%-ranked (high affinity) peptides derived from the SARS-CoV-2 protein sequences and obtained a total of 2013 unique predicted epitopes (Fig. 1, Table 1 and Supplementary Table 2). The predicted epitopes were significantly enriched in the M protein (P = 0.00062, odds ratio = 1.64), whereas they were less enriched in the N protein (P = 0.0074, odds ratio = 0.69). We then screened for HLA-class II candidate peptide epitopes that show high affinity to HLA-DPA1, DPB1, DQA1, DQB1, and DRB1 alleles that are common (frequencies of more than 5%) in the Japanese population [9,10], using the NetMHCIIpan 3.1 algorithm [13]. We found a total of 1399 possible HLA-class II epitopes after selecting the top 2%-ranked peptides (Fig. 1, Table 1 and Supplementary Table 3). The predicted HLA-class II epitopes were enriched in the M protein (P = 0.000051, odds ratio = 1.92), ORF3a (P = 0.0000016, odds ratio = 2.00), and ORF6 (P = 0.000000039, odds ratio = 4.22), whereas they were less enriched in the N protein (P = 0.00031, odds ratio = 0.53).
Since it is reported that SARS-CoV-2 has about 79% and 50% nucleotide sequence homology to SARS-CoV and MERS-CoV, respectively [1][2][3], we compared the homology of the predicted SARS-CoV-2 epitope sequences to 3 SARS-CoVs (BJ01, GZ02, and Tor2) and MERS-CoV to evaluate their cross-reactivities (Table 2 and Supplementary Tables 2 and 3).
T cell epitopes that are likely to be presented commonly on multiple HLA molecules could cover a larger proportion of individuals/patients. Therefore, we estimated the population coverage of SARS-CoV-2-derived, HLA-class I- and II-presented epitopes with high binding affinity on the basis of the allele/haplotype frequencies of HLA (Table 3 and Supplementary Tables 4, 5 and 6). Two epitopes in ORF1ab, ORF1ab2168-2176 and ORF1ab4089-4098, that were predicted to have strong affinity to HLA-A*24:02, HLA-A*02:01, and HLA-A*02:06 showed the highest coverage of 83.8% of the Japanese population.

[Table: Number of common epitopes to SARS-CoV or MERS-CoV; columns: HLA-class I epitopes, HLA-class II epitopes]

Discussion
To control the current COVID-19 pandemic and prevent a second pandemic in the near future, the development of new drugs and vaccines and the establishment of tools for investigating immune responses in patients or silently-infected individuals are urgent issues. In particular, effective vaccination or immunotherapy could play a significant role in suppressing the spread of the virus. Since antibodies recognize cell surface proteins, the targets of antibody-based vaccines are limited. Upon viral infection, the viral proteins are expressed in the infected cells and processed into small peptides by proteasomes. These peptides are then presented by HLA molecules on the surface of the infected cells and recognized by T cells through their T cell receptors. Thus, potential T cell epitopes can be derived from any of the viral structural and nonstructural proteins. In this study, using bioinformatics tools, we comprehensively screened potential SARS-CoV-2-derived, HLA-class I- and II-presented epitopes for 43 HLA alleles that are common in the Japanese population, and identified 2013 and 1399 epitopes, respectively. Of these, 781 HLA-class I and 418 HLA-class II epitopes are considered to be common between SARS-CoV-2 and SARS-CoV (Table 2). We found that four epitopes of SARS-CoV-2, S1060-1068, S1220-1229, N222-230, and N315-324, have exactly the same sequences as epitopes reported to be immunogenic SARS-CoV-derived epitopes for HLA-A*02:01, which correspond to S1042-1050, S1203-1211, N223-231, and N317-325 of SARS-CoV, respectively [17,18]. Interestingly, the SARS-CoV-2-derived S1060-1068 epitope is also predicted to be a high-affinity epitope for HLA-A*02:06 (belonging to the same HLA-A*02 family), HLA-B*52:01, and HLA-C*12:02. In the Japanese population, HLA-A*24:02 is the most common HLA-A molecule with an allelic frequency of 37.8% (implying that 61.3% of the Japanese have at least one HLA-A*24:02 allele), followed by HLA-A*02:01 and HLA-A*02:06 at 12.3% and 9.6%, respectively.
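The step from allele frequency to carrier frequency can be made explicit with a small calculation. The sketch below assumes Hardy-Weinberg proportions at a single locus, so that the fraction of individuals carrying at least one of a set of alleles with combined frequency f is 1 - (1 - f)^2; the function name is hypothetical, and the study's exact coverage procedure (which also uses haplotype frequencies) may differ. Under this assumption, the three HLA-A allele frequencies quoted above reproduce both the 61.3% carrier figure and the 83.8% coverage figure.

```python
from typing import Iterable


def carrier_fraction(allele_frequencies: Iterable[float]) -> float:
    """Fraction of individuals carrying at least one of the given alleles.

    Assumes the alleles belong to one locus and the population is in
    Hardy-Weinberg equilibrium: coverage = 1 - (1 - sum of frequencies)^2.
    """
    combined = sum(allele_frequencies)
    if not 0.0 <= combined <= 1.0:
        raise ValueError("combined allele frequency must lie in [0, 1]")
    return 1.0 - (1.0 - combined) ** 2


# HLA-A*24:02 alone (allele frequency 37.8%): about 61.3% carriers
a2402_only = carrier_fraction([0.378])

# HLA-A*24:02, HLA-A*02:01, and HLA-A*02:06 combined: about 83.8%
three_alleles = carrier_fraction([0.378, 0.123, 0.096])
```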
Two epitopes in ORF1ab, ORF1ab2168-2176 and ORF1ab4089-4098, the latter of which is conserved in SARS-CoV, were predicted to have strong affinity to HLA-A*24:02 as well as HLA-A*02:01 and HLA-A*02:06. Based on the allele frequencies, these epitopes could cover 83.8% of Japanese individuals. Since no mutation was identified in these epitope sequences, these candidate epitopes could contribute to the development of rationally designed epitope-based peptide vaccines against SARS-CoV-2. If these epitopes are immunogenic, HLA-oligomers loaded with each of these peptides could be used for monitoring T-cell responses in patients and silently-infected individuals. In addition, several reports have suggested that a subset of patients with severe COVID-19 might have a cytokine release syndrome [19,20]. These HLA-oligomers might be useful to predict and monitor acute T-cell responses in COVID-19 patients who develop these life-threatening symptoms.
[Figure legend fragment: ... were compared with the reference protein sequence of SARS-CoV-2_Wuhan-Hu-1 [1]. 156 amino acid mutations, which were observed at more than 0.5% frequencies in at least one region, were plotted.]
In conclusion, through bioinformatic screening, we identified a large number of potential T cell epitopes, some of which could cover other coronavirus species, including SARS-CoV. These peptides can possibly cover a large proportion of the Japanese population. Although further experimental proof of the immunogenicity of the predicted peptides is required, we hope that the findings of the current study will contribute to designing vaccines (DNA or RNA vaccines, inactivated viral vaccines, or peptide vaccines), to evaluating these vaccines, and to immune monitoring of SARS-CoV-2-infected patients.