Introduction

Human leukocyte antigens (HLAs) are the major histocompatibility complexes (MHCs) in humans. They are expressed on the surfaces of antigen presenting cells to recognize endogenous or foreign peptides for immunological reactions1,2. The genes that encode HLAs are a gene system located at the short arm of Chromosome 6. They are highly polymorphic across populations3,4,5. There are different classes of HLAs, including Class I, II and III, according to their genetic locations. Different classes of HLAs have divergent structures and functions. Both Class I and Class II HLAs have a long binding groove that can bind peptides and present them onto T-cell receptors6,7,8, while Class III HLAs are a part of the complement system to help with pathogen clearance9. Class I HLAs capture the endogenous peptides degraded from cytosolic proteins and present them to the T-cell receptors on the surface of CD8+ T-cells for cytotoxic responses, while the Class II HLAs present exogenous peptides from extracellular sources to the CD4+ T-cells to trigger acquired responses including antibody synthesis10,11. The binding between Class I/II HLAs and peptides is an important process for immune responses. Studying HLA-peptide binding will help us better understand the immune system and the mechanisms of autoimmune diseases and adverse drug reactions12,13 and will also provide important information needed in the development of vaccines and protein therapeutics14,15.

Since HLA-peptide binding is important for immune-related applications, experimental binding assays were developed to test in vitro binding affinities between HLAs and peptides and the data were collected in databases such as AntiJen16, IEDB17, MHCBN18 and SYFPEITHI19. The IMGT/HLA database recorded more than 13,000 HLA alleles by August 201520. Since it is time-consuming and costly to experimentally test the binding between large numbers of HLAs and peptides, computational methods have been developed to predict HLA-peptide binding21. The current widely used methods are machine learning methods; however, several challenges limit their applicability. First, many machine learning methods can only predict a limited number of HLAs or peptides with a specific length. Second, an HLA-specific model would be unreliable if the training samples were not large enough21. Therefore, we developed the neighbor-edges based and unbiased leverage algorithm (Nebula) based on network analysis to overcome the limitations of machine learning methods22,23. We successfully applied Nebula to predict HLA-peptide binding and found that it delivered a reasonable performance. However, Nebula is not applicable to predict the binding between a peptide and an HLA if no experimental data are available between the peptide and other HLAs or no binding assay has been developed for the HLA. Thus, Nebula is not able to predict binding for unstudied peptides and HLAs, limiting its application. Nebula is an algorithm purely based on the topology of a network; alternatively, the network is treated as a colorless graph where the nodes are not differentiated (colorless). Actually, the nodes (HLAs and peptides) in the bipartite network of HLA-peptide could be differentiated in many ways. Thus, appropriate consideration of node difference in a prediction algorithm is expected to improve its performance. In this study, we developed a new network-based prediction algorithm called similar neighbor-edges based and unbiased leverage algorithm (sNebula) by presenting the bipartite network of HLA-peptide binding data in a color graph. By introducing color to the network as additional information, sNebula can predict binding activity for peptides and HLAs that are not included the training network, overcoming the limitation of Nebula. We used the qualitative binding data between Class I HLAs and peptides as an example. We demonstrated that sNebula is a reliable algorithm for prediction of HLA-peptide binding and can be applied to HLAs or peptides with or without experimental binding data.

Results

Data curation

We curated 43,935 peptides, 135 Class I HLAs and 141,224 qualitative HLA-peptide binding data from the four databases. The binding data are given in Supplementary Table S1. Among the 43,935 distinct peptides, the peptide length varies from 6 to 30. Most of the peptides are 9-mers (65%) and 10-mers (25%), which is consistent with the experimental discovery of Class I HLA-binding peptides24. The distribution of peptide lengths is summarized in Supplementary Table S2. The 135 HLAs include 49 HLA-A alleles, 75 HLA-B alleles, 9 HLA-C alleles and 2 HLA-E alleles. The HLA alleles and their pseudo-sequences are listed in Supplementary Table S3. Among the 141,224 HLA-peptide binding data, 47% are bindings and 53% are non-bindings.

Leave-one-out (LOO) cross-validation

Parameter n indicates the maximum of neighbors from the peptide and HLA that are used for sNebula to make a prediction of the binding between the HLA and the peptide. Fifty leave-one-out (LOO) cross-validations were conducted on the HLA-peptide binding data using parameter n = 1 to 50. The prediction accuracy values yielded from the 50 LOO cross-validations are shown in Fig. 1. When n = 13, the accuracy reached the maximal value 0.841, and the corresponding sensitivity, specificity and area under the receiver operating characteristic curve (AUC) values were 0.818, 0.862 and 0.841, respectively. As n increases after this point, the accuracy of the model gradually dropped. Thus we used n = 13 for LOO cross-validations.

Figure 1: Determination of parameter n using LOO cross-validations for sNebula.
figure 1

The y-axis is the prediction accuracy and the x-axis indicates n.

Five-fold cross-validations

One thousand iterations of five-fold cross-validations were conducted on the HLA-peptide binding data using sNebula. The prediction values in each of the five-fold cross-validations were compared with the experimental values and a set of accuracy, sensitivity and specificity values was calculated. The distributions of 1,000 values of these performance metrics are shown in Fig. 2. The average sensitivity, specificity and accuracy values are 0.816, 0.852 and 0.835, respectively, with the same standard deviation of 0.001.

Figure 2
figure 2

The distributions of sensitivity, specificity and accuracy seen in the 1,000 iterations of five-fold cross-validations.

Confidence analysis

The sNebula predictions are continuous values that not only indicate binding status of binder/non-binder but also represent the prediction confidence levels. The confidence levels of sNebula predictions from the 1,000 iterations of five-fold cross-validations were calculated and used to place the predictions into 10 groups by confidence. The performance of sNebula was assessed for each of the 10 groups of predictions. The performance in terms of accuracy, sensitivity, specificity and AUC at different confidence levels were plotted in Fig. 3. As the confidence increased, the AUC, accuracy, sensitivity and specificity (indicated by the left y-axis) improved, and the predictions in number (indicated by the right y-axis) also increased. The confidence analysis results revealed that the higher the prediction confidence level the better the prediction performance of sNebula. Moreover, most predictions from sNebula were at high confidence.

Figure 3: The relationships between prediction confidence and sNebula performance (AUC, accuracy, sensitivity, specificity) and prediction frequency.
figure 3

The confidence values ranging from 0 to 1 are grouped into 10 bins. The X-axis represents the upper boundary of each confidence bin. The left Y-axis indicates AUC, accuracy, sensitivity and specificity. The right Y-axis gives prediction frequency.

Benchmark

The IEDB website contains the performance comparison of various prediction methods for HLA-peptide binding (http://tools.iedb.org/auto_bench/mhci/weekly/). We used NetMHCpan25,26,27 to compare with sNebula and Nebula. The performance comparison is shown in Table 1. Different methods had different performance depending on the dataset and the HLA. While NetMHCpan performed well on some datasets such as B*07:02, B*15:02 and B*27:04 in terms of AUC values, sNebula had better results on datasets such as C*03:04 and A*02:06. As a comparison, Nebula also had high AUC values on some datasets such as A*68:02 and B*38:01. However, because Nebula could not make predictions on HLAs and peptides that are not included in the training network, the results of Nebula were not complete for some datasets as B*27:04 and B*27:05.

Table 1 Performance comparison of NetMHCpan, sNebula and Nebula on IEDB benchmark datasets.

Discussion

The human HLA loci are in a genomic region that is among the most polymorphic. The HLA loci have retained much variation28,29,30. Thousands of HLA alleles have been discovered, including approximately 9000 alleles of Class I HLAs20,31. The proteins encoded by HLAs are used by the immune system to recognize invaders such as foreign pathogens. However, the proteins themselves are not able to display biological functions. The binding groove of HLA proteins holds a peptide that can exhibit functions of HLAs such as social recognition skills32. It follows that knowledge of HLA-peptide binding plays a key role in understanding related biomedical questions such as autoimmune diseases and HLA-mediated adverse drug reactions33,34. Many in vitro experiments have been designed to assay HLA-peptide binding35. However, due to the huge number of possible binding interactions between thousands of HLAs and millions of peptides, it is difficult, if not impossible, to comprehensively ascertain the binding interactions between HLAs and peptides. Thus, computational methods can play a crucial role in the study of HLA-peptide binding. Though some computational approaches have been proposed for prediction of HLA-peptide binding21,36, the practicability is limited since many methods do not support HLAs with few binding peptides or peptides that are diverse in length. Some recently developed methods such as NetMHCpan25,26,27, NetMHC37 and kernel functions25,38 can predict for peptides with different lengths; however, extra processes39,40 are usually required to identify core binding sequences within the peptides so that they can be converted to a fixed length. Though such extra processes may not necessarily reduce performance of such methods, algorithms that can overcome some restrictions of the current computational methods and handle peptides with different lengths are expected to have wider applications. Using sNebula, one can generate a comprehensive atlas of binding interactions between HLAs and peptides. Based on bipartite network analysis, sNebula has no limitations on the number of HLA molecules used or the size of the peptides utilized for training and, thus, provides a promising solution for the construction of a comprehensive atlas of HLA-peptide binding. However, different from machine learning-based methods such as NetMHCpan25,26,27, sNebula is unable to directly predict HLA-peptide binding if neither the peptide nor the HLA exists in the training network.

The results of this current study suggest that sNebula can accurately predict the binding activity between HLAs and peptides, even though this is a very sparse dataset. The algorithm is useful because it can not only make predictions for untested peptides and HLAs given sequence information, but also can make more accurate predictions with a higher confidence. In addition, it does not set any limitation on the peptide length or the number of HLA alleles. With all these advantages, sNebula can help us study the binding between HLAs and peptides and improve our understanding of the immune system. Like HLA-peptide binding data, a lot of big data are diverse and incomplete41 such as gene expression data42,43, drug-target binding44 and clinical information45. Methods have been developed to impute the missing values for analysis including unsupervised and supervised classifications43,46,47. However, unlike the classification models, sNebula can deal with sparse or incomplete data without requiring the process of missing data imputation. It also accepts the diversity and flexibility of data so an assured length of features is not required. With the arising of big data era and increasing needs of big data analysis, we believe sNebula is one of the possible solutions to deal with large, diverse and incomplete data for predictions and novel discoveries.

Future applications of sNebula remain to be explored. In this study, sequences were utilized to calculate the similarity between nodes. It is possible to use sNebula in the development of similar algorithms for other applications. For example, it is possible to utilize the 2D structural fingerprints of drugs to replace sequences of peptides for similarity calculation and, thus, modify sNebula to predict drug-HLA binding or even drug-target binding that may underlie some observed genetic links to adverse drug events. Network-based inference (NBI) is a powerful network approach that can integrate a variety of data sources for a wide spectrum of applications such as drug-target predictions48,49, drug safety assessment50,51,52, driver mutations prioritization in cancer genomics53, RNA network prediction54 and xenobiotics gene and disease prediction55. Cheng et al. utilized Node Weighted Network-based Inference (NWNBI) to predict drug-target interactions using a node-weighted network and observed a better performance than the unweighted NBI49,56. They calculated the drug similarity by 2D fingerprints and target similarity by sequences. However, their method uses all the neighbors for prediction instead of selecting the top similar ones. It is possible to improve the prediction performance by selecting top similar neighbors using sNebula. Another possible application for sNebula is to predict drug-disease association for drug repurposing. Gottlieb et al. collected a network of drug-disease associations as well as information of drug-drug similarities and disease-disease similarities to predict novel drug-disease associations using logistic regression57. The machine learning method is useful; however, there are some challenges such as problems to deal with a flexible length of features21 or missing data45. Since sNebula is based on similarity and does not require the completeness or an assured length of features, it is possible to extend sNebula to predict drug-disease associations while overcoming those problems.

Another potential application of sNebula is to develop new therapeutics such as tumor immunotherapy. The neoantigens are peptides in the human body that are not encoded by the normal human genome. In tumors, they are generated by the tumor-specific DNA alternations58. When the gene expression data for patients are available, predicting HLA-peptide binding may help to identify or filter patient-specific neoantigens, which are a major factor for clinical immunotherapy development58,59. As more HLA-peptide binding data and patient-specific RNA sequencing data are becoming available, we believe sNebula can potentially help with neoantigen identification and the development of immunotherapies.

In addition to predictions values, sNebula also provides the confidence values. Confidence values are estimations about how likely the result is true; therefore, users can differentiate the results using confidence values and select the most confident predictions for validation. A good method not only makes more predictions in number, but also predicts with higher accuracy at higher confidence. From the confidence analysis result of sNebula, we saw sNebula predicted more and performed better with higher confidence. We believe the confidence values are useful information that can potentially help with the selection of prediction results for experimental validation in applications such as HLA-peptide binding, drug-target binding and drug-disease associations.

Conclusion

We developed a network-based prediction algorithm, sNebula, to predict the binding potential between HLAs and peptides. We found this algorithm exhibited a good performance in both the LOO cross-validation and five-fold cross-validations using the experimental HLA-peptide binding data curated from major databases. The confidence analysis indicated its ability to make predictions with more accuracy when the confidence level is higher. This algorithm not only overcomes the limitations of the current machine learning methods on the number of HLAs and lengths of peptides, but also makes it possible to predict HLA-peptide binding for new peptides or HLAs. It could be expected that sNebula can help with the construction of a comprehensive atlas of HLA-peptide binding that, in turn, facilitates better understanding of the immune system.

Methods

Study design

The workflow of the study is shown in Fig. 4. Qualitative Class I HLA-peptide binding data were collected and curated from four databases: AntiJen16, IEDB17, MHCBN18 and SYFPEITHI19. A bipartite network of HLA-peptide binding data was then constructed. The binding data network was used to assess the performance of sNebula using leave-one-out (LOO) cross-validation and 1,000 iterations of five-fold cross-validations. The prediction confidence analysis was conducted based on the results of five-fold cross-validations.

Figure 4: Study workflow.
figure 4

The qualitative Class I HLA-peptide binding data were curated from four major databases (AntiJen, IEDB, MHCBN and SYFPEITHI). A HLA-peptide binding data network was constructed based on the curated data. To assess the performance of sNebula, LOO cross-validation and 1,000 iterations of five-fold cross-validations were executed. Based on the results of five-fold cross-validations, confidence analysis was conducted to evaluate the relationship between the confidence levels and the prediction performance of sNebula.

Data curation

The experimental Class I HLA-peptide binding data were collected from four databases (AntiJen16, IEDB17, MHCBN18 and SYFPEITHI19) as described in our previous study22. The databases provide qualitative binding categories (binding versus non-binding) for each HLA-peptide pair. We merged the four databases and recorded only one qualitative datum for each of HLA-peptide pairs using the majority voting strategy22. Since sNebula is based on a color graph of the network in which the nodes (HLAs and peptides) are colored using their amino acid sequences, we downloaded HLA sequences from the IMGT/HLA database20 and removed the HLAs that do not have an affirmative protein sequence such as HLA serotypes and allele groups and their binding peptides. The curated HLA-peptide binding data were used to construct a bipartite binding network, where HLAs and peptides are nodes, and the binding data between them, either binding or non-binding, are edges.

sNebula

To predict the binding between peptide px and HLA hy, sNebula first identifies the peptides that are connected to HLA hy, pi (i = 1, 2, …). The similarity between peptide px and each of the connected peptides of hy, pi, is calculated based on their sequences. To calculate the similarity between peptide px and peptide pi, sNebula first aligns the sequences of the two peptides without opening gaps and then calculates the similarity score spx,i using equations (1–2).

In equation (2), lx and li are the lengths of peptides px and pi, respectively. In equation (1), px,i is the sequence similarity between peptides px and pi and is calculated using BLOSUM50 matrix60, and px,x is the sequence similarity between peptide px and itself. BLOSUM50 matrix was used in this study because it has been used in many other methods for predicting HLA-peptide binding26,39,61,62. For each of possible alignments between peptide px and peptide pi, a sequence similarity value, px,i, is calculated. The alignment with the largest sequence similarity value is then selected and its similarity, px,i, is used for the calculation of equation (1).

In the same way, sNebula recognizes the HLAs that are connected to peptide px, hj (j = 1, 2, …). The similarity between HLA hy and each of the connected HLAs of peptide px, hj, is then calculated. For the HLA similarity calculation, instead of using HLA full sequences, sNebula utilizes the 48 unique residues on HLAs that closely interact with peptides which were identified by Chelvanayagam63 as HLA pseudo-sequences. The pseudo-sequence similarity score shy,j between HLA hy and HLA hj is calculated using equation (3).

In equation (3), hy,j is the pseudo-sequence similarity between HLA hy and HLA hj and is calculated using BLOSUM50 matrix, and hy,y is the pseudo-sequence similarity between HLA hy and itself. It is noted that both spx,i and shy,j are directional; thus, spx,I (shy,j) is not necessarily equal to spj,x (shj,y).

After the similarity scores for the nodes (HLAs and peptides) of the neighbor edges are calculated, sNebula ranks the peptides and HLAs using their similarity scores spx,i and shy,j to select n top ranked peptides and HLAs for px and hy, respectively, for the calculation of a continuous value px,y using equation (4) as the prediction of binding between px and hy. Here n is a parameter to be determined. The n with the best prediction of accuracy of LOO cross-validation based on the network of HLA-peptide binding data is used.

In equation (4), ei,y is the binding edge weight between peptide i and HLA y in the HLA-peptide binding data network given by equation (5).

As an unbiased approach, sNebula considers the contribution from the peptides px and HLAs hy of the neighbor edges equally. When peptide px do not contain neighbor edges, sNebula predicts the binding between peptide px and HLA hy using equation (6).

Therefore, if a peptide has no experiment data against any HLA in the training network, sNebula is still able to predict its binding towards the HLAs based on its sequence and the topological feature of the training network using equation (6). In the same way, if HLA hy do not have neighbor edges, sNebula predicts the binding between peptide px and HLA hy using equation (7).

When the number of available connecting nodes (HLAs or peptides) for peptide px or HLA hy is less than n, all of the nodes are used. When multiple connecting nodes for peptide px or HLA hy have the same similarity score to be selected as the top n similar nodes, the average of their binding edge weights is used in equations (4), (6) and (7).

The binding prediction px,y is a continuous value between −1 and 1 and is converted into a categorical prediction value cx,y using equation (8).

The goodness of the prediction cx,y is assessed using a metric term as prediction confidence confx,y that is defined in equation (9).

LOO cross-validation

In the LOO cross-validation, each of the HLA-peptide binding data was taken out and the remaining HLA-peptide binding data were used to construct the HLA-peptide binding network for prediction of the binding value for the taken-out HLA-peptide pair using sNebula. This process was repeated until every HLA-peptide binding data were used as a test sample. The predicted values were then compared with the actual experimental binding data and sensitivity, specificity, accuracy and area under receiver operating characteristic curve (AUC) were calculated to evaluate the performance of sNebula. To determine the parameter n, we repeated the LOO cross-validation for n = 1, 2, 3 … 50. The n value with the highest prediction accuracy of LOO cross-validation was selected to be the parameter for sNebula.

Five-fold cross-validations

We conducted 1,000 iterations of five-fold cross-validations on the HLA-peptide binding data to obtain a statistically robust estimation of sNebula performance. In a five-fold cross-validation, the HLA-peptide binding data were randomly divided into five parts as equal as possible. One part of HLA-peptide binding data was taken out to be used as test samples and the remaining four parts of HLA-peptide binding data were used as the training samples to construct a bipartite network. LOO cross-validations were conducted on training samples to determine the parameter n in sNebula. The network constructed from the training samples and the determined parameter n was used by sNebula to predict HLA-peptide binding of the test samples. This process was repeated five times so that each of the five parts of the HLA-peptide binding data was used once and only once as test samples. The categorical prediction results from all the five folds of test samples were compared to the actual experimental HLA-peptide binding data to calculate the sensitivity, specificity and accuracy to estimate the performance of sNebula.

Prediction confidence analysis

Prediction confidence has been proposed as one of the metrics to measure performance of predictive models developed in the FDA’s endocrine disruptors knowledge based project64,65,66,67,68,69,70,71 using variety of machine learning methods such as decision tree72, Decision Forest models73,74,75,76,77,78 based on molecular descriptors79 that are calculated using the algorithm developed for the expert systems of structure elucidation80,81,82,83,84,85, support vector machine86,87 and principal component analysis based algorithm88,89. The continuous value output from sNebula for prediction of binding between an HLA and a peptide is the measure of likelihood of the peptide is a binder or non-binder of the HLA and indicates the confidence for the prediction. A good prediction method should not only show an overall high prediction accuracy but also is expected to 1) predict most unknown samples with a high confidence and 2) show a higher accuracy for the predictions with a higher confidence than the predictions with a lower confidence. We examined the relationship between prediction confidence and accuracy using all predictions from the 1,000 iterations of five-fold cross-validations. The prediction confidence was calculated using equation (9) for every prediction. The predictions were then placed into 10 groups with even confidence bins according to their confidence values. For each of the 10 groups of predictions, we calculated the sensitivity, specificity, accuracy and AUC by comparing the predictions with the actual experimental HLA-peptide binding data. At last, the performance of sNebula at difference confidence levels was analyzed.

Benchmark

IEDB has an automatic server benchmark page (http://tools.iedb.org/auto_bench/mhci/weekly/) that evaluates different prediction methods for HLA-peptide binding based on new data submitted in the past three months or last week. We used the latest three datasets of 3-month period (2016-05-03, 2016-02-19 and 2015-08-07) as an example to compare sNebula with existing methods as well as its predecessor, Nebula. The parameter n was set to 13 for sNebula to make predictions. Two performance metrics, Spearman’s Rank Correlation Coefficient (SRCC) and AUC, were calculated between the predicted values and experimental values using R.

Additional Information

How to cite this article: Luo, H. et al. sNebula, a network-based algorithm to predict binding between human leukocyte antigens and peptides. Sci. Rep. 6, 32115; doi: 10.1038/srep32115 (2016).