sNebula, a network-based algorithm to predict binding between human leukocyte antigens and peptides

Luo, Heng; Ye, Hao; Ng, Hui Wen; Sakkiah, Sugunadevi; Mendrick, Donna L.; Hong, Huixiao

doi:10.1038/srep32115

Article
Open access
Published: 25 August 2016

Scientific Reports volume 6, Article number: 32115 (2016) Cite this article

Abstract

Understanding the binding between human leukocyte antigens (HLAs) and peptides is important to understand the functioning of the immune system. Since it is time-consuming and costly to measure the binding between large numbers of HLAs and peptides, computational methods including machine learning models and network approaches have been developed to predict HLA-peptide binding. However, there are several limitations for the existing methods. We developed a network-based algorithm called sNebula to address these limitations. We curated qualitative Class I HLA-peptide binding data and demonstrated the prediction performance of sNebula on this dataset using leave-one-out cross-validation and five-fold cross-validations. This algorithm can predict not only peptides of different lengths and different types of HLAs, but also the peptides or HLAs that have no existing binding data. We believe sNebula is an effective method to predict HLA-peptide binding and thus improve our understanding of the immune system.

Introduction

Human leukocyte antigens (HLAs) are the major histocompatibility complexes (MHCs) in humans. They are expressed on the surfaces of antigen presenting cells to recognize endogenous or foreign peptides for immunological reactions^1,2. The genes that encode HLAs form a gene system located on the short arm of Chromosome 6 and are highly polymorphic across populations^3,4,5. HLAs are divided into Class I, II and III according to their genetic locations, and the classes have divergent structures and functions. Both Class I and Class II HLAs have a long binding groove that can bind peptides and present them to T-cell receptors^6,7,8, while Class III HLAs are a part of the complement system and help with pathogen clearance⁹. Class I HLAs capture endogenous peptides degraded from cytosolic proteins and present them to the T-cell receptors on the surface of CD8+ T-cells for cytotoxic responses, while Class II HLAs present exogenous peptides from extracellular sources to CD4+ T-cells to trigger acquired responses including antibody synthesis^10,11. The binding between Class I/II HLAs and peptides is therefore an important process for immune responses. Studying HLA-peptide binding will help us better understand the immune system and the mechanisms of autoimmune diseases and adverse drug reactions^12,13 and will also provide important information needed in the development of vaccines and protein therapeutics^14,15.

Since HLA-peptide binding is important for immune-related applications, experimental binding assays were developed to test in vitro binding affinities between HLAs and peptides and the data were collected in databases such as AntiJen¹⁶, IEDB¹⁷, MHCBN¹⁸ and SYFPEITHI¹⁹. The IMGT/HLA database recorded more than 13,000 HLA alleles by August 2015²⁰. Since it is time-consuming and costly to experimentally test the binding between large numbers of HLAs and peptides, computational methods have been developed to predict HLA-peptide binding²¹. The current widely used methods are machine learning methods; however, several challenges limit their applicability. First, many machine learning methods can only make predictions for a limited number of HLAs or for peptides of a specific length. Second, an HLA-specific model would be unreliable if the training samples were not large enough²¹. Therefore, we developed the neighbor-edges based and unbiased leverage algorithm (Nebula) based on network analysis to overcome the limitations of machine learning methods^22,23. We successfully applied Nebula to predict HLA-peptide binding and found that it delivered a reasonable performance. However, Nebula cannot predict the binding between a peptide and an HLA if no experimental data are available between the peptide and other HLAs or if no binding assay has been developed for the HLA. Thus, Nebula is not able to predict binding for unstudied peptides and HLAs, which limits its application. Nebula is an algorithm purely based on the topology of a network; that is, the network is treated as a colorless graph in which the nodes are not differentiated. In fact, the nodes (HLAs and peptides) in the bipartite HLA-peptide network can be differentiated in many ways. Thus, appropriate consideration of node differences in a prediction algorithm is expected to improve its performance.

In this study, we developed a new network-based prediction algorithm called similar neighbor-edges based and unbiased leverage algorithm (sNebula) by representing the bipartite network of HLA-peptide binding data as a colored graph. By introducing color to the network as additional information, sNebula can predict binding activity for peptides and HLAs that are not included in the training network, overcoming the limitation of Nebula. We used the qualitative binding data between Class I HLAs and peptides as an example. We demonstrated that sNebula is a reliable algorithm for prediction of HLA-peptide binding and can be applied to HLAs or peptides with or without experimental binding data.

Results

Data curation

We curated 43,935 peptides, 135 Class I HLAs and 141,224 qualitative HLA-peptide binding data from the four databases. The binding data are given in Supplementary Table S1. Among the 43,935 distinct peptides, the peptide length varies from 6 to 30. Most of the peptides are 9-mers (65%) and 10-mers (25%), which is consistent with the experimental discovery of Class I HLA-binding peptides²⁴. The distribution of peptide lengths is summarized in Supplementary Table S2. The 135 HLAs include 49 HLA-A alleles, 75 HLA-B alleles, 9 HLA-C alleles and 2 HLA-E alleles. The HLA alleles and their pseudo-sequences are listed in Supplementary Table S3. Among the 141,224 HLA-peptide binding data, 47% are bindings and 53% are non-bindings.

Leave-one-out (LOO) cross-validation

Parameter n is the maximum number of neighbors of the peptide and of the HLA that sNebula uses to predict the binding between the HLA and the peptide. Fifty leave-one-out (LOO) cross-validations were conducted on the HLA-peptide binding data, one for each value of n from 1 to 50. The prediction accuracy values from the 50 LOO cross-validations are shown in Fig. 1. At n = 13, the accuracy reached its maximum of 0.841, and the corresponding sensitivity, specificity and area under the receiver operating characteristic curve (AUC) values were 0.818, 0.862 and 0.841, respectively. As n increased beyond this point, the accuracy of the model gradually dropped. Thus we used n = 13 for LOO cross-validations.

**Figure 1: Determination of parameter n using LOO cross-validations for sNebula.**
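The role of parameter n can be illustrated with a small sketch. It is not the published sNebula scoring function: it assumes, purely for illustration, that each neighbor edge carries a similarity value and a binary label and that the prediction is the similarity-weighted mean label of the n most similar neighbors. The names `top_n_vote` and `sweep_n` and the toy data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Neighbor = Tuple[float, int]  # (similarity, label); label 1 = binder, 0 = non-binder


def top_n_vote(neighbors: List[Neighbor], n: int) -> float:
    """Similarity-weighted mean label of the n most similar neighbor edges.

    Assumed scoring rule for illustration only. Returns a score in [0, 1],
    or 0.5 when there is no neighbor with positive similarity.
    """
    top = sorted(neighbors, key=lambda item: item[0], reverse=True)[:n]
    total = sum(sim for sim, _ in top)
    if total <= 0:
        return 0.5
    return sum(sim * label for sim, label in top) / total


def sweep_n(cases: List[Tuple[List[Neighbor], int]],
            n_values: Iterable[int]) -> Dict[int, float]:
    """Accuracy of the thresholded vote (score >= 0.5) for each candidate n."""
    accuracy = {}
    for n in n_values:
        correct = sum(
            int((top_n_vote(neighbors, n) >= 0.5) == bool(label))
            for neighbors, label in cases
        )
        accuracy[n] = correct / len(cases)
    return accuracy


# Toy held-out cases: one very similar binder edge plus several weakly
# similar non-binder edges, so large n dilutes the informative neighbor.
cases = [
    ([(0.9, 1), (0.3, 0), (0.3, 0), (0.3, 0), (0.3, 0)], 1),
    ([(0.5, 0), (0.4, 0)], 0),
]
print(sweep_n(cases, [1, 3, 5]))
```

In this toy setting accuracy falls once n is large enough for weakly similar neighbors to outweigh the most similar one, which is the kind of trade-off a sweep over n is meant to expose.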

Five-fold cross-validations

One thousand iterations of five-fold cross-validations were conducted on the HLA-peptide binding data using sNebula. The prediction values in each of the five-fold cross-validations were compared with the experimental values and a set of accuracy, sensitivity and specificity values was calculated. The distributions of 1,000 values of these performance metrics are shown in Fig. 2. The average sensitivity, specificity and accuracy values are 0.816, 0.852 and 0.835, respectively, with the same standard deviation of 0.001.
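The metrics reported above follow the standard confusion-matrix definitions. The sketch below shows one way to split sample indices into five folds and to compute accuracy, sensitivity and specificity for a set of predictions; the function names and the seeded shuffling scheme are assumptions for illustration, not the authors' code.

```python
import random
from typing import Dict, List, Sequence


def five_fold_indices(n_samples: int, seed: int) -> List[List[int]]:
    """Shuffle indices 0..n_samples-1 and deal them into five folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::5] for k in range(5)]


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for 0/1 labels (1 = binder)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    total = tp + fn + tn + fp
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else nan,  # true negative rate
    }


# Toy labels and predictions (made up for the example).
print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Repeating the split with a different seed in each iteration would give a distribution of metric values like the one summarized above.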

Confidence analysis

The sNebula predictions are continuous values that not only indicate binding status (binder or non-binder) but also represent the prediction confidence level. The confidence levels of the sNebula predictions from the 1,000 iterations of five-fold cross-validations were calculated and used to place the predictions into 10 groups by confidence. The performance of sNebula was assessed for each of the 10 groups. The accuracy, sensitivity, specificity and AUC at different confidence levels are plotted in Fig. 3. As the confidence increased, the AUC, accuracy, sensitivity and specificity (left y-axis) improved, and the number of predictions (right y-axis) also increased. The confidence analysis thus revealed that the higher the prediction confidence level, the better the prediction performance of sNebula. Moreover, most predictions from sNebula were at high confidence.

**Figure 3: The relationships between prediction confidence and sNebula performance (AUC, accuracy, sensitivity, specificity) and prediction frequency.**
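A grouping of this kind can be sketched as follows. The section above does not define how confidence is derived from the prediction value, so the sketch assumes that a score lies in [0, 1] with 0.5 as the decision threshold and that confidence is the scaled distance from that threshold, 2·|score − 0.5|. Both the formula and the name `confidence_bins` are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def confidence_bins(scores: Sequence[float], labels: Sequence[int],
                    n_bins: int = 10) -> List[Tuple[int, Optional[float]]]:
    """Group predictions by confidence and report (count, accuracy) per bin.

    Assumed confidence: 2 * |score - 0.5|, so 0 is least and 1 is most confident.
    Accuracy is None for empty bins.
    """
    counts = [0] * n_bins
    correct = [0] * n_bins
    for score, label in zip(scores, labels):
        confidence = abs(score - 0.5) * 2
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        predicted = 1 if score >= 0.5 else 0
        counts[index] += 1
        correct[index] += int(predicted == label)
    return [
        (counts[i], correct[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]


# Toy scores and labels (made up for the example).
bins = confidence_bins([0.97, 0.03, 0.52, 0.48, 1.0], [1, 0, 0, 0, 1])
for level, (count, accuracy) in enumerate(bins):
    print(level, count, accuracy)
```

Reporting the count alongside the accuracy for each bin mirrors the two y-axes described for Fig. 3.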

Benchmark

The IEDB website contains a performance comparison of various prediction methods for HLA-peptide binding (http://tools.iedb.org/auto_bench/mhci/weekly/). We compared NetMHCpan^25,26,27 with sNebula and Nebula. The performance comparison is shown in Table 1. The methods performed differently depending on the dataset and the HLA. While NetMHCpan performed well in terms of AUC on some datasets such as B*07:02, B*15:02 and B*27:04, sNebula had better results on datasets such as C*03:04 and A*02:06. Nebula also had high AUC values on some datasets such as A*68:02 and B*38:01. However, because Nebula could not make predictions for HLAs and peptides that are not included in the training network, the results of Nebula were not complete for some datasets such as B*27:04 and B*27:05.

Table 1 Performance comparison of NetMHCpan, sNebula and Nebula on IEDB benchmark datasets.

Discussion

The human HLA loci are in a genomic region that is among the most polymorphic. The HLA loci have retained much variation^28,29,30. Thousands of HLA alleles have been discovered, including approximately 9000 alleles of Class I HLAs^20,31. The proteins encoded by HLAs are used by the immune system to recognize invaders such as foreign pathogens. However, the proteins themselves are not able to display biological functions. The binding groove of HLA proteins holds a peptide that can exhibit functions of HLAs such as social recognition skills³². It follows that knowledge of HLA-peptide binding plays a key role in understanding related biomedical questions such as autoimmune diseases and HLA-mediated adverse drug reactions^33,34. Many in vitro experiments have been designed to assay HLA-peptide binding³⁵. However, due to the huge number of possible binding interactions between thousands of HLAs and millions of peptides, it is difficult, if not impossible, to comprehensively ascertain the binding interactions between HLAs and peptides. Thus, computational methods can play a crucial role in the study of HLA-peptide binding. Though some computational approaches have been proposed for prediction of HLA-peptide binding^21,36, their practicability is limited since many methods do not support HLAs with few binding peptides or peptides that are diverse in length. Some recently developed methods such as NetMHCpan^25,26,27, NetMHC³⁷ and kernel functions^25,38 can make predictions for peptides of different lengths; however, extra processes^39,40 are usually required to identify core binding sequences within the peptides so that they can be converted to a fixed length. Though such extra processes may not necessarily reduce the performance of these methods, algorithms that can overcome some restrictions of the current computational methods and handle peptides of different lengths are expected to have wider applications.

Using sNebula, one can generate a comprehensive atlas of binding interactions between HLAs and peptides. Based on bipartite network analysis, sNebula has no limitations on the number of HLA molecules used or the size of the peptides utilized for training and, thus, provides a promising solution for the construction of a comprehensive atlas of HLA-peptide binding. However, unlike machine learning-based methods such as NetMHCpan^25,26,27, sNebula is unable to directly predict HLA-peptide binding if neither the peptide nor the HLA exists in the training network.

The results of this study suggest that sNebula can accurately predict the binding activity between HLAs and peptides, even though the dataset is very sparse. The algorithm is useful because it can not only make predictions for untested peptides and HLAs given sequence information, but can also make more accurate predictions at higher confidence. In addition, it does not set any limitation on the peptide length or the number of HLA alleles. With all these advantages, sNebula can help us study the binding between HLAs and peptides and improve our understanding of the immune system. Like HLA-peptide binding data, many large datasets are diverse and incomplete⁴¹, such as gene expression data^42,43, drug-target binding⁴⁴ and clinical information⁴⁵. Methods have been developed to impute the missing values for analysis, including unsupervised and supervised classifications^43,46,47. However, unlike the classification models, sNebula can deal with sparse or incomplete data without requiring missing data imputation. It also accommodates diverse and flexible data, so a fixed length of features is not required. With the arrival of the big data era and the increasing need for big data analysis, we believe sNebula is one of the possible solutions for handling large, diverse and incomplete data for predictions and novel discoveries.

Future applications of sNebula remain to be explored. In this study, sequences were utilized to calculate the similarity between nodes. It is possible to use sNebula in the development of similar algorithms for other applications. For example, it is possible to utilize the 2D structural fingerprints of drugs to replace sequences of peptides for similarity calculation and, thus, modify sNebula to predict drug-HLA binding or even drug-target binding that may underlie some observed genetic links to adverse drug events. Network-based inference (NBI) is a powerful network approach that can integrate a variety of data sources for a wide spectrum of applications such as drug-target predictions^48,49, drug safety assessment^50,51,52, driver mutation prioritization in cancer genomics⁵³, RNA network prediction⁵⁴ and xenobiotics gene and disease prediction⁵⁵. Cheng et al. utilized Node Weighted Network-based Inference (NWNBI) to predict drug-target interactions using a node-weighted network and observed a better performance than the unweighted NBI^49,56. They calculated the drug similarity by 2D fingerprints and target similarity by sequences. However, their method uses all the neighbors for prediction instead of selecting the most similar ones. It is possible to improve the prediction performance by selecting the most similar neighbors using sNebula. Another possible application for sNebula is to predict drug-disease associations for drug repurposing. Gottlieb et al. collected a network of drug-disease associations as well as information on drug-drug similarities and disease-disease similarities to predict novel drug-disease associations using logistic regression⁵⁷. The machine learning method is useful; however, it faces challenges such as handling features of flexible length²¹ or missing data⁴⁵. Since sNebula is based on similarity and does not require complete data or a fixed length of features, it is possible to extend sNebula to predict drug-disease associations while overcoming those problems.

Another potential application of sNebula is the development of new therapeutics such as tumor immunotherapy. Neoantigens are peptides in the human body that are not encoded by the normal human genome. In tumors, they are generated by tumor-specific DNA alterations⁵⁸. When gene expression data for patients are available, predicting HLA-peptide binding may help to identify or filter patient-specific neoantigens, which are a major factor in clinical immunotherapy development^58,59. As more HLA-peptide binding data and patient-specific RNA sequencing data become available, we believe sNebula can potentially help with neoantigen identification and the development of immunotherapies.

In addition to prediction values, sNebula also provides confidence values. Confidence values are estimates of how likely a result is to be true; therefore, users can differentiate the results using confidence values and select the most confident predictions for validation. A good method not only makes a larger number of predictions, but also predicts with higher accuracy at higher confidence. The confidence analysis showed that sNebula made more predictions and performed better at higher confidence. We believe the confidence values are useful information that can potentially help with the selection of prediction results for experimental validation in applications such as HLA-peptide binding, drug-target binding and drug-disease associations.

Conclusion

We developed a network-based prediction algorithm, sNebula, to predict the binding potential between HLAs and peptides. The algorithm performed well in both the LOO cross-validation and the five-fold cross-validations using the experimental HLA-peptide binding data curated from major databases. The confidence analysis indicated that its predictions are more accurate when the confidence level is higher. This algorithm not only overcomes the limitations of the current machine learning methods on the number of HLAs and lengths of peptides, but also makes it possible to predict HLA-peptide binding for new peptides or HLAs. We expect that sNebula can help with the construction of a comprehensive atlas of HLA-peptide binding that, in turn, facilitates better understanding of the immune system.

Methods

Study design

The workflow of the study is shown in Fig. 4. Qualitative Class I HLA-peptide binding data were collected and curated from four databases: AntiJen¹⁶, IEDB¹⁷, MHCBN¹⁸ and SYFPEITHI¹⁹. A bipartite network of HLA-peptide binding data was then constructed. The binding data network was used to assess the performance of sNebula using leave-one-out (LOO) cross-validation and 1,000 iterations of five-fold cross-validations. The prediction confidence analysis was conducted based on the results of five-fold cross-validations.
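As a rough illustration of the network construction step, the sketch below stores qualitative binding records as a bipartite graph with one adjacency map per node type and shows how a single edge can be held out for LOO cross-validation. The record layout, the function names and the example records are hypothetical; the peptide strings and labels are placeholders, not data from the curated set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, int]  # (HLA allele, peptide sequence, label); 1 = binder
Adjacency = Dict[str, Dict[str, int]]


def build_bipartite(records: Iterable[Record]) -> Tuple[Adjacency, Adjacency]:
    """Build HLA->peptide and peptide->HLA adjacency maps from binding records.

    Conflicting or duplicate records are not reconciled here: the last record
    for a given pair simply overwrites earlier ones (a simplifying assumption).
    """
    hla_adj: Adjacency = defaultdict(dict)
    pep_adj: Adjacency = defaultdict(dict)
    for hla, peptide, label in records:
        hla_adj[hla][peptide] = label
        pep_adj[peptide][hla] = label
    return hla_adj, pep_adj


def hold_out_edge(hla_adj: Adjacency, pep_adj: Adjacency,
                  hla: str, peptide: str) -> int:
    """Remove one edge from both maps and return its label (one LOO step)."""
    label = hla_adj[hla].pop(peptide)
    pep_adj[peptide].pop(hla)
    return label


# Placeholder records for illustration only.
records = [
    ("A*02:06", "PEPTIDEAA", 1),
    ("A*02:06", "PEPTIDEBB", 0),
    ("B*07:02", "PEPTIDEAA", 0),
]
hla_adj, pep_adj = build_bipartite(records)
held_out = hold_out_edge(hla_adj, pep_adj, "A*02:06", "PEPTIDEAA")
print(held_out, dict(hla_adj["A*02:06"]), dict(pep_adj["PEPTIDEAA"]))
```

Keeping both adjacency maps makes it cheap to look up the neighbors of either an HLA or a peptide, which is what a neighbor-based prediction needs.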