Computer-aided designing of immunosuppressive peptides based on IL-10 inducing potential

In the past, numerous methods have been developed to predict MHC class II binders or T-helper epitopes for designing the epitope-based vaccines against pathogens. In contrast, limited attempts have been made to develop methods for predicting T-helper epitopes/peptides that can induce a specific type of cytokine. This paper describes a method, developed for predicting interleukin-10 (IL-10) inducing peptides, a cytokine responsible for suppressing the immune system. All models were trained and tested on experimentally validated 394 IL-10 inducing and 848 non-inducing peptides. It was observed that certain types of residues and motifs are more frequent in IL-10 inducing peptides than in non-inducing peptides. Based on this analysis, we developed composition-based models using various machine-learning techniques. Random Forest-based model achieved the maximum Matthews’s Correlation Coefficient (MCC) value of 0.59 with an accuracy of 81.24% developed using dipeptide composition. In order to facilitate the community, we developed a web server “IL-10pred”, standalone packages and a mobile app for designing IL-10 inducing peptides (http://crdd.osdd.net/raghava/IL-10pred/).

of cytokines. Recently our group has taken the initiative to develop cytokine-specific prediction methods (e.g., IFNepitope 40 , IL4Pred 41 ). To the best of author's knowledge, there is no method for the prediction of IL-10 inducing epitopes. This study is an attempt to develop computational models for predicting peptides that can induce cytokine IL-10 production.

Results
In this study, we used 394 MHC class-II binders, which have the ability to induce cytokine IL-10, as positive instances. On the other hand, we used 848 MHC class-II binders, which do not have the ability to induce cytokine IL-10, as negative examples. Thus, our dataset consisted of 394 IL-10 inducing and 848 non-inducing peptides or epitopes. We performed all the analysis on this dataset to understand the preference of residues and motifs in IL-10 inducing peptides. It was observed that all the peptides contained at least 8 residues. The maximum length  Positional Conservation Analysis. In order to understand the preference of specific residues at certain positions, we generated a two-sample logo (TSL) for the positive and negative peptides (Fig. 3). In a TSL, the height of the amino acid symbol is indicative of its relative abundance. The number of terminal residues was selected on the basis of the minimum peptide length in the dataset and is not associated with any biological function. It has been observed that R is highly preferred at position 2 nd , 4 th , 5 th , 6 th , 7 th , 11 th , 13 th , and 16 th in IL-10 inducing peptides. Similarly, L is more dominant at position 3 rd , 4 th , 5 th , 7 th and 10 th in IL-10 inducing peptides. On the other hand, the residue A was found to be predominant in non-IL-10 inducing peptides at 1 st , 4 th , 5 th , 9 th and 12 th position.
Compositional Analysis. The Amino Acid Composition (AAC) was computed for IL-10 inducing and non-inducing peptides; the average composition is shown in the bar plot (Fig. 4). As shown in Fig. 4, certain residues (like A, G and P) have a higher average composition in non-inducing or negative peptides than in positive peptides. In contrast, the residues L and R are more abundant in IL-10 inducing peptides.

Motif based analysis.
In the present work, we used MERCI program 42 for searching motifs occurring exclusively in IL-10 inducing peptides but not found in non-inducing peptides. Similarly, we searched motifs exclusively found in IL-10 non-inducing peptides. As shown in Table 1, the motifs found in IL-10 inducing peptides are rich in R, K and L while the exclusive motifs found in non-inducing peptides are dominated by residues A, G and P. Notably, the residue V is prevalent in the exclusive motifs of both the negative as well as the positive sets. Support Vector Machine-based models. We developed prediction models using Support Vector Machine (SVM) for discriminating IL-10 inducing and non-inducing sequences. Various sequence-based features of the peptides were used as input for developing SVM-based prediction models. Amongst the amino acid composition (AAC) models, we obtained the highest accuracy of 72.30% with Matthews's correlation coefficient (MCC) value 0.41 ( Table 2). The performance of our prediction model improved significantly using the dipeptide composition (DPC) as input feature instead of the AAC. As shown in Table 2, SVM model achieves maximum accuracy 78.42% with MCC value 0.55 using dipeptide composition. In addition, we developed models using terminal composition of peptides 43 . Since the minimum length of the peptides in our dataset is 8, we extracted 8 residues from N-terminus and developed the model called NT8 using AAC and DPC; these models achieved the maximum accuracy values of 63.45% and 66.75% respectively as shown in Table 2. Further, the models developed using the binary profiles of amino acids in the peptides, attained the accuracy of 67.15% with MCC of 0.31 for NT8. In the case of the CT8 models (involving the input features of terminal 8 residues of the C-terminus of the peptides), the AAC and DPC features obtained the accuracies 63.85% and 65.22% respectively. The binary model for the CT8 showed an accuracy of 62.88%. Additionally, we concatenated 8 residue sequences each at the N and C terminals to develop the NT8CT8 model, where a slight increase in the performance was observed as compared to models developed separately for NT8 or CT8 terminal input features. The maximum MCC value obtained here was 0.54 with DPC input vector.
In this study, various models were also developed using split composition 44 , where the peptide sequence is split into two equal parts. The compositions of the two parts are used as the input features for developing models. These models achieved the accuracy values of 73.67% and 72.71% for split-AAC and split-DPC respectively. In order to reduce the noise in models, we removed less significant or insignificant features. The CfSubSetEval algorithm of WEKA was used for selecting important features from AAC and DPC, with16 and 57 features respectively being selected by the algorithm, as enlisted in Table S1. These selected features were further used for   developing SVM-based models. The models developed on the selected features performed less than the models based on all features taken together (Table S2).

Models using WEKA classifiers.
We have also used the WEKA suite, which is a collection of various machine-learning algorithms. Out of many algorithms available in WEKA, we have employed four classifiers IBK, SMO, J48 and Random Forest. IBK (a K-nearest neighbors classifier) based model using AAC achieved the maximum accuracy 73.51% with MCC 0.44. Sequential minimal optimization (SMO) reached the maximum accuracy of 74.40%, 78.66% and MCC of 0.37, 0.49 using AAC and DPC respectively. J48 is a tree-based machine learning classifier in the WEKA package that attained the accuracy values of 68.28% and 67.15% for AAC and DPC respectively. Notably, the models based on Random Forest achieved the maximum accuracy value of 80.11% with MCC 0.58 for AAC. In the case of DPC, a Random Forest-based model achieved an accuracy of 81.24% with MCC value of 0.59 (Table 3).
We also developed models based on IBK, SMO and J48 classifier using split-AAC and achieved the maximum accuracy of ~71%. In the case of split-DPC, the performances achieved using these classifiers were comparable to split-AAC. Models based on Random Forest performed better than other classifiers and attained the maximum MCC 0.50 using split-AAC (Table S3). We also developed Random Forest-based models using 16 selected features from AAC and achieved the maximum MCC of 0.55 (Table 4). Similarly, we developed models based on WEKA classifiers using selected features from DPC. External Validation. The external validation technique is one of the most rigorous techniques commonly used to evaluate the realistic performance of a model. In this technique, the performance of a model is evaluated on a dataset not used for its training or testing; this dataset is called independent or validation dataset. In order to evaluate the performance of our models we extracted 66 IL-10-inducing peptides, recently added in IEDB. These peptides are not available in our original dataset used for building models. The best SVM model correctly predicted 45 out of 66 peptides newly included by IEDB as IL-10-inducing MHC-II binding peptides. The Random Forest model with the best performance found in our study correctly predicted 55 out of these 66 peptides. This demonstrates that our models are rigorous and their performance is reasonably good on the independent dataset.
Classification of IL-10-inducing and MHC II non-binding peptides. The prediction models described above are suitable to classify IL-10 inducing and non-inducing peptides in MHC II binding peptides. This means the user cannot use these models to predict IL-10 inducing peptides if MHC II binding status of the query peptide is not known, as we have not used MHC II non-binders in our dataset. Thus it is possible that our model may predict a MHC II non-binder as IL-10 inducing peptide. In order to overcome this problem, we also developed   models using the alternate dataset to discriminate IL-10 inducing and MHC II non-binders. We tested two of the machine learning methods -SVM and Random Forest that showed the best results on the dataset of IL-10-inducing MHC II binders and IL-10 non-inducing MHC II binders. We used 80% of the data for training and testing our models using five-fold cross validation technique. The remaining 20% data called independent dataset was used for external validation of our models. Our best SVM model achieved an accuracy of 76.44% with the MCC of 0.54, when evaluated using five-fold cross-validation. We also tested the performance of this model on the independent dataset and achieved an accuracy of 75.93% with MCC of 0.54. The Random Forest-based method showed a similar performance with 76.33% accuracy and 0.53 MCC, when tested using five-fold cross validation. The performance of the above model on the independent dataset was 77.31% accuracy and 0.58 MCC.
Service to the scientific community. One of the major goals of our group is to provide service to the community based on research carried out in our group. Thus, we developed a user-friendly webserver that integrates models developed in this study. The web-interface developed for the users predicts a query peptide to be IL-10 inducer or non-inducer based on the prediction models developed on the dataset containing IL-10-inducing MHC II binding peptides as positives and IL-10 non-inducing MHC II binders as negatives. However, such a model can falsely predict an MHC II non-binder to be an IL-10-inducing peptide. Thus, we developed separate prediction models that distinguish IL-10-inducing MHC II binders from MHC II non-binders. The web-interface designates a query peptide to be IL-10-inducing only if it is predicted to be positive by both of the above-mentioned models. The web interface of the server has three main modules; i) Predict, ii) Design and iii) Protein Scan. The 'Predict' tool allows a user to identify IL-10 inducing peptides in a given library of peptides. The 'Design' module facilitates the user to generate all possible analogs of the query peptide and identify the best analogs for inducing cytokine IL-10. The 'Protein Scan' module was developed for scanning IL-10 inducing regions in a query protein.
Our web server has been designed using a responsive HTML template for adjusting to the browsing device. Thus, our webserver is compatible with a wide range of devices including the desktops, tablets and smartphones.
In addition to the webserver, we also developed a standalone version of IL-10pred using wxPython. Keeping in view the exponential growth of usage of smart phone users in last decade, we also developed an Android-based mobile app using the Kivy package. The workflow of the IL-10 mobile app has been summarized in the Fig. 5. All these applications are accessible at the URL http://crdd.osdd.net/raghava/IL-10pred/.

Discussion
Immunosuppression is a systemic response that may be desired in some cases like asthma therapy and inappropriate in some other conditions like cancer. Peptide-based immunotherapy has been shown to be capable of capitalizing on both of these flip sides by removal or introduction of IL-10 inducing epitopes in the antigen. In an attempt to develop a therapy for asthma treatment, the IL-10 inducing epitopes were shown to suppress the immune response evoked by other epitopes of the same antigen 45 . On the other hand, removal of IL-10 inducing T cell epitopes from the insulin-like growth factor-binding protein 2 (IGFBP2) vaccine conferred potent anti-tumor activity 46 . With an increased understanding of IL-10 inducing epitopes, their inclusion or exclusion becomes an important consideration in a vaccine design. In the present study, we have made a systematic attempt to understand the nature of IL-10 inducing peptides and to develop models for predicting IL-10 inducing peptides. This is the first in silico study on IL-10 peptides though there is limited information available in the literature. In order to perform this type of study, one needs to have a dataset of inducing and non-inducing peptides. Thus, we examined the experimentally validated MHC class-II binders in IEDB database 47 and extracted IL-10 inducing and non-inducing MHC class-II binders. The dataset of experimentally validated IL-10 inducing and non-inducing peptides is the backbone of this study. We analyzed these peptides to understand compositional and positional preferences of residues in IL-10 inducing peptides using Two-Sample Logo and compositional analysis. As shown in the Results section, certain types of residues are more abundant in IL-10 inducing peptides. In addition, positional preferences of certain types of residues were also observed in the IL-10 inducing peptides. This indicates that IL-10 inducing and non-inducing peptides differ in terms of residue composition. Thus composition can be used to discriminate these two types of peptides.
We tried a wide range of classifiers to build models for predicting IL-10 inducing peptides. Further, we also used a wide range of features particularly compositional features for discriminating IL-10 inducing and non-inducing peptides. As anticipated, models based on compositional features particularly based on DPC, classify IL-10 inducing and non-inducing peptides with high performance. Initially, SVM-based models were developed using different sequence features and achieved reasonably good performances. We also tried popular classifiers available in the software package WEKA and achieved moderate performances using different classifiers. Our Random Forest-based model developed using DPC attained the highest performance among all the classifiers used in the present study (Fig. 6).

Conclusion
In a scenario where direct use of IL-10 as a therapeutic model has revealed toxic effects, peptide-based epitopes that induce IL-10 provide a promising alternative. It has been shown in previous studies that blocking the IL-10 receptor using antibodies could enhance the efficiency of subunit vaccines, for example, in the case of mycobacteria 48,49 . Thus, blocking the IL-10 induced immunosuppression could be an important aspect of subunit vaccine design. Although numerous methods are available for in silico prediction of T cell epitopes 33 , computational methods are not available for predicting IL-10 inducing epitopes. The present work is an attempt to provide a platform for addressing this important aspect. In order to facilitate the scientific community in developing better methods for prediction of IL-10 inducing peptides, we have provided our datasets used in the present study.

Methods
Building Dataset. One of the major challenges for this type of work is to create an authentic dataset containing experimentally validated IL-10 inducing and non-inducing peptides. In this study, the dataset is derived from the IEDB database 47 , which is the largest repository of immune epitopes. The MHC class II binders that were reported to trigger IL-10 release were extracted from the IEDB. We extracted experimentally validated MHC class II binders that elicit cytokine IL-10; these peptides were assigned as IL-10 inducing peptides. We also extracted MHC class II binders reported not to trigger IL-10 release from IEDB. We assigned these MHC class II binding peptides as non-inducing peptides. In order to remove redundancy, we removed identical peptides from both, IL-10 inducing and non-inducing peptides. Our final dataset called the main dataset consists of 394 IL-10 inducing and 848 non-inducing peptide sequences enlisted in Table S4, with unique positive and negative sequences.
In addition to the main dataset, we also created another dataset called the alternate dataset (sequences provided in Table S5). This dataset contains different negative instances than in the main dataset. This dataset contains MHC II non-binders as negative instances instead of MHC II binders. The alternate dataset contains 461 IL-10-inducing MHC II binders as positive instances. In order to create a dataset of negative instances, we extracted 621 MHC II non-binders from the MHCBN database 50 . In summary, our alternate dataset consists of 461 IL-10 inducing peptides and 621 MHC II non-binders. We built this dataset to classify IL-10 inducers and MHC II non-binders.

Computation of the Residue Composition.
In the past, compositional features of the peptide sequences have been used successfully for developing methods for predicting the function of peptides 43,51 . Thus in this study also models have been developed using different types of composition that includes amino acid and dipeptide composition. The composition features (AAC and DPC) were calculated using the in-house Perl scripts based on the following equations 1 and 2.
In the above equations, AAC(i) is the percent amino acid or residue composition of the residue type i. R(i) is the number of residues type i and N is the total number of residues in a peptide sequence. DPC(i) is the percent of dipeptide composition for residue type i. D(i) is the number of dipeptides of type i and N is the total number of dipeptides in a peptide sequence.
Binary Profile. It is another important feature for representing peptide sequences. In the case of binary profile, each of the 20 types of natural amino acid is represented as binary vectors of dimension twenty (e.g. Ala by 1,0 ,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0; Cys by 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). The sequence length in the positive and the negative datasets is variable, but the input vector for applying the machine learning techniques should be of fixed length. Since the minimum length of the sequences is 8 for both the positive and the negative sequences, substrings of length 8 were taken from the N-terminus as well as C-terminus of each sequence and concatenated to have derived sequences of fixed length (16) for each of original sequences. Such derived sequences were used to generate the binary profile.
Two-sample logo. The sequences derived for obtaining the binary profile were also used for generating a Two-Sample logo (TSL) 52 using the web tool available at http://www.twosamplelogo.org/cgi-bin/tsl/tsl.cgi, since this tool also requires a fixed length input sequence criterion. Since the minimum length of the peptides in the dataset was 8 amino acids, the TSL consists of 8 residue positions from each of the N and C termini leading to a profile of 16 residue positions.
Machine-learning Techniques. The Support Vector Machine (SVM)-based prediction models were developed using the package SVM light 53 . The radial basis function kernel was mainly used in this study; different parameters were optimized to get the best performance on the training dataset. In addition, some commonly used classifiers were also used for developing prediction models. These classifiers (e.g., Random Forest, IBK, SMO and J48) were implemented using the WEKA package 54 .
Feature Selection. In this study, we also used the WEKA 54 package for selecting important features from different compositional features. We used CFSubSetEval algorithm with default parameters for the selection of significantly relevant features. These selected features were examined to understand nature of IL-10 inducing peptides as well as for developing the prediction models (Table S3).

Cross-validation.
In order to train, test and evaluate our models, we used the five-fold cross validation technique. This is a standard technique, commonly used in this type of studies; details are available in the previous studies 51 . In summary, the whole dataset is divided into five equal parts, with all five sets having an equal number of positive and negative instances. The four sets are used for training, while the remaining set is used for testing. This process is iterated five times so that each set is used for testing.

Evaluation parameters.
Model evaluation is an important step to estimate the efficiency of the model. We have used well-established evaluation parameters that include sensitivity, specificity, accuracy and MCC.