sNebula, a network-based algorithm to predict binding between human leukocyte antigens and peptides

Understanding the binding between human leukocyte antigens (HLAs) and peptides is important to understand the functioning of the immune system. Since it is time-consuming and costly to measure the binding between large numbers of HLAs and peptides, computational methods including machine learning models and network approaches have been developed to predict HLA-peptide binding. However, there are several limitations for the existing methods. We developed a network-based algorithm called sNebula to address these limitations. We curated qualitative Class I HLA-peptide binding data and demonstrated the prediction performance of sNebula on this dataset using leave-one-out cross-validation and five-fold cross-validations. This algorithm can predict not only peptides of different lengths and different types of HLAs, but also the peptides or HLAs that have no existing binding data. We believe sNebula is an effective method to predict HLA-peptide binding and thus improve our understanding of the immune system.

network analysis to overcome the limitations of machine learning methods 22,23 . We successfully applied Nebula to predict HLA-peptide binding and found that it delivered a reasonable performance. However, Nebula is not applicable to predict the binding between a peptide and an HLA if no experimental data are available between the peptide and other HLAs or no binding assay has been developed for the HLA. Thus, Nebula is not able to predict binding for unstudied peptides and HLAs, limiting its application. Nebula is an algorithm purely based on the topology of a network; alternatively, the network is treated as a colorless graph where the nodes are not differentiated (colorless). Actually, the nodes (HLAs and peptides) in the bipartite network of HLA-peptide could be differentiated in many ways. Thus, appropriate consideration of node difference in a prediction algorithm is expected to improve its performance. In this study, we developed a new network-based prediction algorithm called similar neighbor-edges based and unbiased leverage algorithm (sNebula) by presenting the bipartite network of HLA-peptide binding data in a color graph. By introducing color to the network as additional information, sNebula can predict binding activity for peptides and HLAs that are not included the training network, overcoming the limitation of Nebula. We used the qualitative binding data between Class I HLAs and peptides as an example. We demonstrated that sNebula is a reliable algorithm for prediction of HLA-peptide binding and can be applied to HLAs or peptides with or without experimental binding data.

Results
Data curation. We curated 43,935 peptides, 135 Class I HLAs and 141,224 qualitative HLA-peptide binding data from the four databases. The binding data are given in Supplementary Table S1. Among the 43,935 distinct peptides, the peptide length varies from 6 to 30. Most of the peptides are 9-mers (65%) and 10-mers (25%), which is consistent with the experimental discovery of Class I HLA-binding peptides 24 . The distribution of peptide lengths is summarized in Supplementary Table S2. The 135 HLAs include 49 HLA-A alleles, 75 HLA-B alleles, 9 HLA-C alleles and 2 HLA-E alleles. The HLA alleles and their pseudo-sequences are listed in Supplementary Table S3. Among the 141,224 HLA-peptide binding data, 47% are bindings and 53% are non-bindings.
Leave-one-out (LOO) cross-validation. Parameter n indicates the maximum of neighbors from the peptide and HLA that are used for sNebula to make a prediction of the binding between the HLA and the peptide. Fifty leave-one-out (LOO) cross-validations were conducted on the HLA-peptide binding data using parameter n = 1 to 50. The prediction accuracy values yielded from the 50 LOO cross-validations are shown in Fig. 1. When n = 13, the accuracy reached the maximal value 0.841, and the corresponding sensitivity, specificity and area under the receiver operating characteristic curve (AUC) values were 0.818, 0.862 and 0.841, respectively. As n increases after this point, the accuracy of the model gradually dropped. Thus we used n = 13 for LOO cross-validations. Five-fold cross-validations. One thousand iterations of five-fold cross-validations were conducted on the HLA-peptide binding data using sNebula. The prediction values in each of the five-fold cross-validations were compared with the experimental values and a set of accuracy, sensitivity and specificity values was calculated. The distributions of 1,000 values of these performance metrics are shown in Fig. 2. The average sensitivity, specificity and accuracy values are 0.816, 0.852 and 0.835, respectively, with the same standard deviation of 0.001.
Confidence analysis. The sNebula predictions are continuous values that not only indicate binding status of binder/non-binder but also represent the prediction confidence levels. The confidence levels of sNebula predictions from the 1,000 iterations of five-fold cross-validations were calculated and used to place the predictions into 10 groups by confidence. The performance of sNebula was assessed for each of the 10 groups of predictions. The performance in terms of accuracy, sensitivity, specificity and AUC at different confidence levels were plotted in Fig. 3. As the confidence increased, the AUC, accuracy, sensitivity and specificity (indicated by the left y-axis) improved, and the predictions in number (indicated by the right y-axis) also increased. The confidence analysis results revealed that the higher the prediction confidence level the better the prediction performance of sNebula. Moreover, most predictions from sNebula were at high confidence.
Benchmark. The IEDB website contains the performance comparison of various prediction methods for HLA-peptide binding (http://tools.iedb.org/auto_bench/mhci/weekly/). We used NetMHCpan [25][26][27] to compare with sNebula and Nebula. The performance comparison is shown in Table 1. Different methods had different performance depending on the dataset and the HLA. While NetMHCpan performed well on some datasets such

Discussion
The human HLA loci are in a genomic region that is among the most polymorphic. The HLA loci have retained much variation [28][29][30] . Thousands of HLA alleles have been discovered, including approximately 9000 alleles of Class I HLAs 20,31 . The proteins encoded by HLAs are used by the immune system to recognize invaders such as foreign pathogens. However, the proteins themselves are not able to display biological functions. The binding groove of HLA proteins holds a peptide that can exhibit functions of HLAs such as social recognition skills 32 . It follows that knowledge of HLA-peptide binding plays a key role in understanding related biomedical questions such as autoimmune diseases and HLA-mediated adverse drug reactions 33,34 . Many in vitro experiments have been designed to assay HLA-peptide binding 35 . However, due to the huge number of possible binding interactions between thousands of HLAs and millions of peptides, it is difficult, if not impossible, to comprehensively ascertain the binding interactions between HLAs and peptides. Thus, computational methods can play a crucial role in the study of HLA-peptide binding. Though some computational approaches have been proposed for prediction of HLA-peptide binding 21,36 , the practicability is limited since many methods do not support HLAs with few binding peptides or peptides that are diverse in length. Some recently developed methods such as NetMHCpan 25-27 , NetMHC 37 and kernel functions 25,38 can predict for peptides with different lengths; however, extra processes 39,40 are usually required to identify core binding sequences within the peptides so that they can be converted to a fixed length. Though such extra processes may not necessarily reduce performance of such methods, algorithms that can overcome some restrictions of the current computational methods and handle peptides with different lengths are expected to have wider applications. Using sNebula, one can generate a comprehensive atlas of binding interactions between HLAs and peptides. Based on bipartite network analysis, sNebula has no limitations on the number of HLA molecules used or the size of the peptides utilized for training and, thus, provides a promising solution for the construction of a comprehensive atlas of HLA-peptide binding. However, different from machine learning-based methods such as NetMHCpan [25][26][27] , sNebula is unable to directly predict HLA-peptide binding if neither the peptide nor the HLA exists in the training network.
The results of this current study suggest that sNebula can accurately predict the binding activity between HLAs and peptides, even though this is a very sparse dataset. The algorithm is useful because it can not only make predictions for untested peptides and HLAs given sequence information, but also can make more accurate predictions with a higher confidence. In addition, it does not set any limitation on the peptide length or the number of HLA alleles. With all these advantages, sNebula can help us study the binding between HLAs and peptides and improve our understanding of the immune system. Like HLA-peptide binding data, a lot of big data are diverse and incomplete 41 such as gene expression data 42,43 , drug-target binding 44 and clinical information 45 . Methods have been developed to impute the missing values for analysis including unsupervised and supervised classifications 43,46,47 . However, unlike the classification models, sNebula can deal with sparse or incomplete data without requiring the process of missing data imputation. It also accepts the diversity and flexibility of data so an assured length of features is not required. With the arising of big data era and increasing needs of big data analysis, we believe sNebula is one of the possible solutions to deal with large, diverse and incomplete data for predictions and novel discoveries. Future applications of sNebula remain to be explored. In this study, sequences were utilized to calculate the similarity between nodes. It is possible to use sNebula in the development of similar algorithms for other applications. For example, it is possible to utilize the 2D structural fingerprints of drugs to replace sequences of peptides for similarity calculation and, thus, modify sNebula to predict drug-HLA binding or even drug-target binding that may underlie some observed genetic links to adverse drug events. Network-based inference (NBI) is a powerful network approach that can integrate a variety of data sources for a wide spectrum of applications such as drug-target predictions 48,49 , drug safety assessment [50][51][52] , driver mutations prioritization in cancer genomics 53 , RNA network prediction 54 and xenobiotics gene and disease prediction 55 . Cheng et al. utilized Node Weighted Network-based Inference (NWNBI) to predict drug-target interactions using a node-weighted network and observed a better performance than the unweighted NBI 49,56 . They calculated the drug similarity by 2D fingerprints and target similarity by sequences. However, their method uses all the neighbors for prediction instead of selecting the top similar ones. It is possible to improve the prediction performance by selecting top similar neighbors using sNebula. Another possible application for sNebula is to predict drug-disease association for drug repurposing. Gottlieb et al. collected a network of drug-disease associations as well as information of drug-drug similarities and disease-disease similarities to predict novel drug-disease associations using logistic regression 57 . The machine learning method is useful; however, there are some challenges such as problems to deal with a flexible length of features 21 or missing data 45 . Since sNebula is based on similarity and does not require the completeness or an assured length of features, it is possible to extend sNebula to predict drug-disease associations while overcoming those problems.
Another potential application of sNebula is to develop new therapeutics such as tumor immunotherapy. The neoantigens are peptides in the human body that are not encoded by the normal human genome. In tumors, they are generated by the tumor-specific DNA alternations 58 . When the gene expression data for patients are available, predicting HLA-peptide binding may help to identify or filter patient-specific neoantigens, which are a major factor for clinical immunotherapy development 58,59 . As more HLA-peptide binding data and patient-specific RNA sequencing data are becoming available, we believe sNebula can potentially help with neoantigen identification and the development of immunotherapies.
In addition to predictions values, sNebula also provides the confidence values. Confidence values are estimations about how likely the result is true; therefore, users can differentiate the results using confidence values and select the most confident predictions for validation. A good method not only makes more predictions in number, but also predicts with higher accuracy at higher confidence. From the confidence analysis result of sNebula, we saw sNebula predicted more and performed better with higher confidence. We believe the confidence values are useful information that can potentially help with the selection of prediction results for experimental validation in applications such as HLA-peptide binding, drug-target binding and drug-disease associations.

Conclusion
We developed a network-based prediction algorithm, sNebula, to predict the binding potential between HLAs and peptides. We found this algorithm exhibited a good performance in both the LOO cross-validation and five-fold cross-validations using the experimental HLA-peptide binding data curated from major databases. The confidence analysis indicated its ability to make predictions with more accuracy when the confidence level is higher. This algorithm not only overcomes the limitations of the current machine learning methods on the number of HLAs and lengths of peptides, but also makes it possible to predict HLA-peptide binding for new peptides or HLAs. It could be expected that sNebula can help with the construction of a comprehensive atlas of HLA-peptide binding that, in turn, facilitates better understanding of the immune system. Fig. 4. Qualitative Class I HLA-peptide binding data were collected and curated from four databases: AntiJen 16 , IEDB 17 , MHCBN 18 and SYFPEITHI 19 . A bipartite network of HLA-peptide binding data was then constructed. The binding data network was used to assess the performance of sNebula using leave-one-out (LOO) cross-validation and 1,000 iterations of five-fold cross-validations. The prediction confidence analysis was conducted based on the results of five-fold cross-validations. Data curation. The experimental Class I HLA-peptide binding data were collected from four databases (AntiJen 16 , IEDB 17 , MHCBN 18 and SYFPEITHI 19 ) as described in our previous study 22 . The databases provide qualitative binding categories (binding versus non-binding) for each HLA-peptide pair. We merged the four databases and recorded only one qualitative datum for each of HLA-peptide pairs using the majority voting strategy 22 . Since sNebula is based on a color graph of the network in which the nodes (HLAs and peptides) are colored using their amino acid sequences, we downloaded HLA sequences from the IMGT/HLA database 20 and removed the HLAs that do not have an affirmative protein sequence such as HLA serotypes and allele groups and their binding peptides. The curated HLA-peptide binding data were used to construct a bipartite binding network, where HLAs and peptides are nodes, and the binding data between them, either binding or non-binding, are edges.

Study design. The workflow of the study is shown in
Scientific RepoRts | 6:32115 | DOI: 10.1038/srep32115 sNebula. To predict the binding between peptide p x and HLA h y , sNebula first identifies the peptides that are connected to HLA h y , p i (i = 1, 2, … ). The similarity between peptide p x and each of the connected peptides of h y , p i , is calculated based on their sequences. To calculate the similarity between peptide p x and peptide p i , sNebula first aligns the sequences of the two peptides without opening gaps and then calculates the similarity score sp x,i using equations (1-2).
In equation (2), l x and l i are the lengths of peptides p x and p i , respectively. In equation (1), p x,i is the sequence similarity between peptides p x and p i and is calculated using BLOSUM50 matrix 60 , and p x,x is the sequence similarity between peptide p x and itself. BLOSUM50 matrix was used in this study because it has been used in many other methods for predicting HLA-peptide binding 26,39,61,62 . For each of possible alignments between peptide p x and peptide p i , a sequence similarity value, p x,i , is calculated. The alignment with the largest sequence similarity value is then selected and its similarity, p x,i , is used for the calculation of equation (1). In the same way, sNebula recognizes the HLAs that are connected to peptide p x , h j (j = 1, 2, … ). The similarity between HLA h y and each of the connected HLAs of peptide p x , h j , is then calculated. For the HLA similarity calculation, instead of using HLA full sequences, sNebula utilizes the 48 unique residues on HLAs that closely interact with peptides which were identified by Chelvanayagam 63 as HLA pseudo-sequences. The pseudo-sequence similarity score sh y,j between HLA h y and HLA h j is calculated using equation (3). In equation (3), h y,j is the pseudo-sequence similarity between HLA h y and HLA h j and is calculated using BLOSUM50 matrix, and h y,y is the pseudo-sequence similarity between HLA h y and itself. It is noted that both sp x,i and sh y,j are directional; thus, sp x,I (sh y,j ) is not necessarily equal to sp j,x (sh j,y ). After the similarity scores for the nodes (HLAs and peptides) of the neighbor edges are calculated, sNebula ranks the peptides and HLAs using their similarity scores sp x,i and sh y,j to select n top ranked peptides and HLAs for p x and h y , respectively, for the calculation of a continuous value p x,y using equation (4) as the prediction of binding between p x and h y . Here n is a parameter to be determined. The n with the best prediction of accuracy of LOO cross-validation based on the network of HLA-peptide binding data is used.
In equation (4), e i,y is the binding edge weight between peptide i and HLA y in the HLA-peptide binding data network given by equation (5).
if binding between peptide i and HLA y if non binding between peptide i and HLA y 1, 1, i y , As an unbiased approach, sNebula considers the contribution from the peptides p x and HLAs h y of the neighbor edges equally. When peptide p x do not contain neighbor edges, sNebula predicts the binding between peptide p x and HLA h y using equation (6).
Therefore, if a peptide has no experiment data against any HLA in the training network, sNebula is still able to predict its binding towards the HLAs based on its sequence and the topological feature of the training network using equation (6). In the same way, if HLA h y do not have neighbor edges, sNebula predicts the binding between peptide p x and HLA h y using equation (7).
x y j n y j x j j n y j , 1 , When the number of available connecting nodes (HLAs or peptides) for peptide p x or HLA h y is less than n, all of the nodes are used. When multiple connecting nodes for peptide p x or HLA h y have the same similarity score to be selected as the top n similar nodes, the average of their binding edge weights is used in equations (4), (6) and (7). The binding prediction p x,y is a continuous value between − 1 and 1 and is converted into a categorical prediction value c x,y using equation (8).
x y x y x y , , , The goodness of the prediction c x,y is assessed using a metric term as prediction confidence conf x,y that is defined in equation (9). = con f p (9) x y x y , , LOO cross-validation. In the LOO cross-validation, each of the HLA-peptide binding data was taken out and the remaining HLA-peptide binding data were used to construct the HLA-peptide binding network for prediction of the binding value for the taken-out HLA-peptide pair using sNebula. This process was repeated until every HLA-peptide binding data were used as a test sample. The predicted values were then compared with the actual experimental binding data and sensitivity, specificity, accuracy and area under receiver operating characteristic curve (AUC) were calculated to evaluate the performance of sNebula. To determine the parameter n, we repeated the LOO cross-validation for n = 1, 2, 3 … 50. The n value with the highest prediction accuracy of LOO cross-validation was selected to be the parameter for sNebula.

Five-fold cross-validations.
We conducted 1,000 iterations of five-fold cross-validations on the HLA-peptide binding data to obtain a statistically robust estimation of sNebula performance. In a five-fold cross-validation, the HLA-peptide binding data were randomly divided into five parts as equal as possible. One part of HLA-peptide binding data was taken out to be used as test samples and the remaining four parts of HLA-peptide binding data were used as the training samples to construct a bipartite network. LOO cross-validations were conducted on training samples to determine the parameter n in sNebula. The network constructed from the training samples and the determined parameter n was used by sNebula to predict HLA-peptide binding of the test samples. This process was repeated five times so that each of the five parts of the HLA-peptide binding data was used once and only once as test samples. The categorical prediction results from all the five folds of test samples were compared to the actual experimental HLA-peptide binding data to calculate the sensitivity, specificity and accuracy to estimate the performance of sNebula.
Prediction confidence analysis. Prediction confidence has been proposed as one of the metrics to measure performance of predictive models developed in the FDA's endocrine disruptors knowledge based project [64][65][66][67][68][69][70][71] using variety of machine learning methods such as decision tree 72 , Decision Forest models 73-78 based on molecular descriptors 79 that are calculated using the algorithm developed for the expert systems of structure elucidation [80][81][82][83][84][85] , support vector machine 86,87 and principal component analysis based algorithm 88,89 . The continuous value output from sNebula for prediction of binding between an HLA and a peptide is the measure of likelihood of the peptide is a binder or non-binder of the HLA and indicates the confidence for the prediction. A good prediction method should not only show an overall high prediction accuracy but also is expected to 1) predict most unknown samples with a high confidence and 2) show a higher accuracy for the predictions with a higher confidence than the predictions with a lower confidence. We examined the relationship between prediction confidence and accuracy using all predictions from the 1,000 iterations of five-fold cross-validations. The prediction confidence was calculated using equation (9) for every prediction. The predictions were then placed into 10 groups with even confidence bins according to their confidence values. For each of the 10 groups of predictions, we calculated the sensitivity, specificity, accuracy and AUC by comparing the predictions with the actual experimental HLA-peptide binding data. At last, the performance of sNebula at difference confidence levels was analyzed.
Benchmark. IEDB has an automatic server benchmark page (http://tools.iedb.org/auto_bench/mhci/ weekly/) that evaluates different prediction methods for HLA-peptide binding based on new data submitted in the past three months or last week. We used the latest three datasets of 3-month period (2016-05-03, 2016-02-19 and 2015-08-07) as an example to compare sNebula with existing methods as well as its predecessor, Nebula. The parameter n was set to 13 for sNebula to make predictions. Two performance metrics, Spearman's Rank Correlation Coefficient (SRCC) and AUC, were calculated between the predicted values and experimental values using R.