Introduction

Protein phosphorylation is the most widespread and well-studied post-translational modification (PTM) in eukaryotes and plays a crucial role in the regulation of virtually every cellular behaviour, including DNA repair1, regulation of transcription2, apoptosis3, immune response4, metabolism5 and cellular differentiation6. In addition, protein phosphorylation catalysed by kinase plays significant regulatory roles in intracellular signal transduction7. Therefore, annotation of phosphorylation in proteomes is a first-critical step toward decoding protein function and downstream regulatory networks. Historically, primarily though using low-throughout biological techniques, such as the site-directed mutagenesis, 32P-labeling and degenerate peptide library screening, many novel and specific phosphorylation sites have been discovered. In recent years, the high-throughput studies-large scale mass spectrometry-of protein phosphorylation in different organisms have rapidly accelerated the identification of novel protein phosphorylation data8,9,10. For example, Wiśniewski et al.11 identified nearly 12,035 unique phosphorylation sites in 4,579 mouse brain proteins using mass spectrometry. Unfortunately, both low-throughput and high-throughput biological techniques for identifying phosphorylation events are costly, time-consuming and biased toward abundant proteins and proteotypic peptides and cannot provide information regarding the protein kinases that catalyse phosphorylation substrates in detail. Hence, in silico prediction of phosphorylation sites is potentially a useful alternative strategy for whole proteome annotation.

Until now, more than a dozen phosphorylation site prediction tools have been developed, such as NetworKIN12 and recently developed Musite13, GPS 2.114 and PKIS15. Musite combined three different features to extract the sequence information and finally provided general prediction models for six organisms and kinase-specific prediction models for 13 kinases or kinase families. GPS 2.1 reserved the old version of GPS and integrated a novel approach of motif length selection. PKIS incorporated the composition of monomer spectrum (CMS) encoding strategy and support vector machines (SVMs) to identify kinase-specific phosphorylation sites and obtained good prediction results. In addition, to our knowledge, there are five review articles so far that have comprehensively and systematically discussed the methods of computational phosphorylation site prediction and more information about these tools can be obtained from these five articles16,17,18,19,20. Although there are a number of computational methods of phosphorylation prediction and they have made great progress in prediction, some drawbacks of these models in this field still should be noted: (i) Many predictors are based on non-kinase-specific methods and could not give the predicted kinase information for users, such as PPRED21 and PhosPhAt22. (ii) For some kinase-specific tools, the number of experimental phosphorylation substrates for certain kinase is too small so that lacking of statistical significance, for example, GPS 2.114 and KinasePhos 2.023 considered the minimum phosphorylation substrates of only three and only one, respectively. (iii) The predicted coverage of kinase types for some kinase-specific tools is too narrow to provide comprehensive kinase prediction and the kinase hierarchical classification is not considered, for example, Li et al.24 only considered eight kinase families for predictor construction. PKIS15 only involved single kinases prediction but prediction for kinase family and kinase group were not provided. (iv) The provided stringency for some predictors is relatively high as they paid more attention to specificity than to sensitivity, although the prediction accuracy is promising, the true positive rate is low, such as Scansite25 and Musite13. Therefore, new methods must be established and used for more effective and comprehensive identification of kinase-specific phosphorylation site.

In addition, protein phosphorylation regulates many aspects of cell life, whereas abnormal phosphorylation is a cause or consequence of disease26. As we all know, protein kinases are one of the most ubiquitous families of signalling molecules in the human cell, accounting for approximately 2% of the proteins encoded by the human genome27. The family of genes most frequently contributing to different diseases such as neurodegenerative disease and cancer is the protein kinase gene family28 which is implicated in a huge number of tumorigenic functions including immune evasion, proliferation, antiapoptotic activity, metastasis and angiogenesis. It is now well-appreciated that many disease pathways involve abnormal regulation of phosphorylation events. And recently, increasingly more experimental observations have suggested that protein kinase could indirectly or directly influence the abnormal protein phosphorylation and further result in diseases29. For example, Bose et al.30 have proven that Double-stranded RNA dependent kinase (PKR) could induce Glycogen Synthase Kinase Aβ (GSK-3β) activation, tau phosphorylation and apoptosis in human neuroblastoma cells exposed to tunicamycin or Ab peptide 1–42 and activated PKR is increased in brains with Alzheimer's disease (AD). Wong et al.31 have found that Cyclin-dependent kinase 5 (Cdk5)-mediated phosphorylation of Endophilin B1 is essential for autophagy induction and neuronal loss in models of Parkinson's disease (PD). By analysing a unique randomized tamoxifen trial including breast cancer patients receiving no adjuvant treatment, Busch et al.32 showed for the first time that patients low in ERK phosphorylation in cancer associated fibroblasts (CAFs) did not respond to tamoxifen treatment despite having estrogen-receptor alpha. Paccez et al.33 in his recent review described that the receptor tyrosine kinase Axl has been implicated in the malignancy of different types of cancer. Saini et al.34 described that the phosphatidylinositol 3-kinase (PI3K)/AKT/mammalian target of rapamycin (mTOR) and the Raf/mitogen-activated and extracellular signal-regulated kinase (MEK)/extracellular signal-regulated kinase (ERK) signalling pathways are critical for normal human physiology and also commonly dysregulated in several human cancers, including breast cancer (BC). Ariadna et al.35 demonstrated for the first time that the implication of DYRK1A overexpression in a developmental alteration of the central nervous system associated with Down syndrome (DS). In this regard, comprehensive analysis of abnormal phosphorylation associated with different types of kinases will be helpful to promote the understanding of how abnormal actions of the kinase are involved in regulating biological processes and how they affect susceptibility to diseases.

As described above that the kinase-specific methods could give the kinase information for predicted phosphorylation substrates and further help researches to explore the regulatory mechanism between kinase and abnormal phosphorylation substrates. So in this paper, we firstly constructed an efficient kinase-specific phosphorylation predictor and then used this tool to predict and analyse the types of kinases for all disease-related phosphorylation substrates.

Results

The ability of PSEA to recognize true phosphorylation sites

We first checked the ability of PSEA to correctly recognize the phosphorylation sites. To enlarge the prediction coverage of different types of kinase and to gain insights into function and evolution of kinase, we classified all kinases into a hierarchy of single kinase, kinase family and kinase group (for more information please see supporting information for data preparation and Tables S1–S4). A leave-one-out method was used to test the performance for each positive and negative set. For each test, a known peptide was picked and the others were treated as the predefined peptide set. The P-values for positive and negative peptides were then calculated by the PSEA method. The results are represented in Tables S5–S13. From these tables we can see that for all single kinases of phosphoserine, at the high stringency cut-off, there are 13 terms and 18 terms whose sensitivities and specificities are larger than 80%, respectively. When decreasing the stringency, the corresponding specificity reduces, but most of single kinases can still obtain high performance with both sensitivity and specificity larger than 80%. From the receiver operating characteristic (ROC) evaluation, we can find there are 15 kinases with areas under ROC curves (AUCs) larger than 80% and all kinases' AUCs are larger than 70% except the CaMK2-α, as shown in Figure 1. For single kinases of phosphothreonine, most of the predictors also can obtain promising performance except the PKCA (Table S6, Figure S1). The predictive sensitivity of single kinases of phosphotyrosine is a little lower compared with that of phosphothreonine and phosphoserine (Table S7 and Figure S2), mainly because the sequence conservation of phosphotyrosine kinase is lower than that of phosphothreonine and phosphoserine and the threshold setting for phosphotyrosine, which is the same as that of phosphothreonine and phosphoserine, is too stringent. Compared with single kinase method for phosphorylation prediction, the prediction performance of kinase family and kinase group methods is a little depressed, which could express that although some kinases in the same family or group have the similar structure and function, the slight substantial difference is still exist.

Figure 1
figure 1

The ROC curve and the corresponding AUCs for phosphoserine prediction of different single kinases.

Test on independent data and comparison with other existing methods

To evaluate the performance of our method, we made comparisons with other existing predictors. Here we put our independent test set into three recently developed and online available methods: GPS 2.114, Musite13 and NetworKIN12. GPS 2.1 provides three levels of stringency (High, Medium and Low) with different choices of threshold values. Musite and NetworKIN support continuous stringency adjustment to meet different confidence requirements for users. In order to compare our method with Musite and NetworKIN conveniently, we chose three appropriate levels of stringency (also called High, Medium and Low) for Musite and NetworKIN with specificity (or score) as 85%, 90% and 95% respectively through testing. All these stringency in GPS 2.1, Musite and NetworKIN are all relatively high as they all paid more attention to specificity than to sensitivity. The comparisons of predictive performance between our method and other prediction methods are shown in Figure 2. For different levels of stringency, although the Acc and MCC of the CK2 in GPS 2.1, the CDK and MAPK in Musite and the CDK in NetworKIN are slightly higher than those of our method, most of kinase families in our method are superior to other methods. For example, for the CK1 family, when the stringency level of these four methods is High, the MCC of our method reaches to 43.77%, which is about 19.58%, 28.20% and 44.09% higher than that in GPS 2.1, Musite and NetworKIN, respectively. Also for the Src family, the MCC of our method in these three levels of stringency is about 50%, which is about 25% ~ 50% higher than that of three other methods. The results show that the above three tools can achieve high specificity, but sacrificing sensitivity would therefore result in a low MCC. Our method offers good Sp as well as high Sn, which also illustrates that our method is superior to the current methods. Note that, when performing the comparisons, we used a prediction model that was trained from a dataset excluding the protein sequences in the independent test dataset. However, for GPS 2.1, Musite and NetworKIN, some of the test proteins might have been included in their training processes and thus, the prediction performances may be biased favourably toward these tools in the comparisons. This possibility means that the performance improvement of our method over these tools might be underestimated. Compared with these existing methods, it is worth mentioning that the formula of our method is much more concise or at least comparable with previous established programs. More importantly, the reasonably good performance of our method reflects that our method can effectively evaluate the sequence similarity of phosphorylation substrates for different types of kinases.

Figure 2
figure 2

Comparison of our method with other existing methods on independent set for different kinase families.

(A), compared by accuracy (Acc); (B), compared by sensitivity (Sn); (C), compared by specificity (Sp); (D), compared by Matthews correlation coefficient (MCC).

Predicting the types of kinase for disease-related phosphorylation substrates

Protein kinases are a superfamily of proteins involved in crucial cellular processes such as cell cycle regulation and signal transduction. Accordingly, they play an important role in disease biology. To contribute to the study of the relation between kinases and diseases, we performed a prediction analysis by predicting corresponding kinases of all disease-related phosphorylation substrates which could result in different human diseases. To have large prediction coverage of protein kinases, we determined to use the kinase family predictors to predict the kinase families of all disease-related phosphorylation substrates one by one. The results are shown as orange bars in Figure 3 (we didn't consider the IKK because the prediction performance is not good enough). We could find that MAPK family could catalyse 63.02% of disease-related phosphoserine, while PLK family could only catalyse 32.45% of disease-related phosphoserine. We also checked the effects of these disease-related phosphorylation substrates from different databases (as described in the Methods) and found there are 464 substrates that have annotation information of kinases. From the known information, we found 111 (23.92%) disease-related phosphorylation substrates can be catalysed by the MAPK. After processing these substrates by using our predictors with High stringency, we predicted 103 (92.78%) substrates that can be catalysed by MAPK (all predicted and known kinase information, the source and other detailed information of all collected disease-related phosphorylation substrates can be downloaded from our web site).

Figure 3
figure 3

The data statistics of predicted phosphoserine kinase family types for disease-related and normal phosphorylation substrates.

Significant differences (P-value) refer to the Two-sided category. Statistical significance was calculated with a Fisher exact test.

Besides, the protein-protein interaction network in STRING database was used to analyse the relationship between the MAPK3 and PLK1 kinases (considering the limitation of the data of collected kinase-specific phosphorylation, we only analysed the largest quantity of MAPK3 and PLK1 kinases which contained in MAPK and PLK families, respectively) and all disease-related phosphorylation substrates, as shown in Figure 4. We could find that MAPK3 contacts much more disease-related phosphorylation substrates than that of PLK1 (P = 4.66 × 10e−8). It only displayed the direct contacted interactions, considering the important roles in the regulation of phosphorylation, it is reasonable to believe that there must be many other indirect interactions which MAPK3 kinase reacts with disease-related phosphorylation substrates. From the above analysis, not only could we conclude that MAPK kinase family might occupy a relatively large proportion in abnormal phosphorylation and further result in different diseases but our method of kinase-specific prediction of phosphorylation can effectively predict the corresponding kinase type of phosphorylation substrates.

Figure 4
figure 4

The relationship between specified kinases (MAPK3 and PLK1) and disease-related phosphorylation substrates (only displayed directly contacted interactions).

The nodes with green border line represent the specified kinases and the nodes with blue border line represent the diseased phosphorylation substrates. Bar plot represents the degrees of MAPK3 and PLK1.

Significance analysis for the predicted kinase families of disease-related and normal phosphorylation substrates

To compare the difference of kinase families between disease-related and normal phosphorylation substrates, we also predicted the same size of normal phosphorylation substrates randomly selected from all collected phosphorylation sites. For kinase family of phosphoserine, the predicted results for disease-related and normal phosphorylation substrates are shown in Figure 3. By using the Fisher exact test, we could clearly find that MAPK family (P = 2.76 × 10e−2) and GSK family (P = 3.37 × 10e−4) have significant difference in catalysing disease-related and normal phosphorylation substrates, which might infer that there two kinase families have more direct or indirect association compared with other kinase families. Although we cannot conclude that these two kinase families must involve in the pathological processes, it provided a useful resource and guidance for further experimental validation. Moreover, the analysis for phosphothreonine and phosphotyrosine was also carried out in the same way, but we could not find the significant difference (P > 0.05) between the disease-related and normal phosphorylation substrates among different types of phosphothreonine or phosphotyrosine kinase families, the results are shown in Figures S3–S4.

Function analysis of disease-related phosphorylation substrates

From the above results, we found the predicted MAPK and GSK kinase families have significant difference in catalysing disease-related and normal phosphorylation substrates. The question is whether our prediction results are reasonable and reliable enough. So in this part, we carefully analysed the characteristic of all disease-related phosphorylation proteins and found whether there are connection between these proteins and the kinase families of MAPK and GSK.

To better understand the distribution of the disease-related phosphorylation substrates in function protein groups, we analysed our data to see if there is over- or under- representation (compared to the normal phosphorylation substrates) of function elements such as pathways and gene ontology (GO). Firstly, DAVID program36,37 was used to analyse the pathway to further explore functional aspects of disease-related and normal phosphorylation substrates. Here the top 10 significant results (P < 1.00 × 10e−10) were shown in Figure 5. We could find that 36.50% of disease-related phosphorylation substrates are involved in cancer pathways (containing pathways in cancer, prostate cancer, non-small cell lung cancer and pancreatic cancer) and 8.25% are involved in neurotrophin signalling pathway. After carefully analysing the characteristics of MAPK and GSK kinase families, we found that MAPK kinase family can regulate the related actions to induce the human colon, lung and breast cancers32,38,39 and the glycogen synthase kinase 3β (GSK3β), one member of GSK kinase family, can catalyse Tau phosphorylation and plays an important roles in the genesis and maintenance of neurodegenerative changes associated with Parkinson's disease40. In addition, after carefully analysing the effects of all disease-related phosphorylation substrates we collected, we found those disease-related substrates could mainly cause neurodegenerative diseases (Parkinson's disease, Alzheimer's disease and Huntington's disease) and cancers (Carcinoma and Cancer) and the total numbers of those which could result in these two types of disease are 170 (the percentage of the whole disease-related phosphorylation substrates (806) which this disease (170) accounts for is 21.09%) and 424 (52.61%), respectively, which are also consistent with the results of pathway analyses.

Figure 5
figure 5

The data statistics of pathway terms for disease-related and normal phosphorylation substrates.

Significant differences (P-value) refer to the Two-sided category. Statistical significance was calculated with a Fisher exact test.

As a result of the continuing advances made in previous studies, protein phosphorylation was found to target broad substrates in different biological processes. The collection of disease-related and normal phosphorylation substrates from databases provided an opportunity to analyse the functional abundance and diversity of protein phosphorylation. Here, we statistically analysed the enriched biological processes, molecular functions and cellular components with the gene ontology (GO) annotations and compared the differentiated GO terms with Fisher exact test (Two-sided category) for the human disease-related and normal phosphorylation substrates. The 10 most over-represented terms of these three criteria are shown in Table 1. We could find that chemical stimulus (GO:0070887, GO:0042221) and other stimulus (GO:0009719, GO:0009725, GO:0009605) in biologic processes are the most differentiated GO terms and they are all over-represented in diseased phosphorylation substrates. The results are consistent with the roles of MAPK kinase family, as Giuseppe et al.41 have provided the evidence that p38(MAPK) and ERK1/2 dictate cell death/survival response to different pro-oxidant stimuli via p53 and Nrf2 in neuroblastoma cells SH-SY5Y, also, Min et al.42 have discovered that NF kappa B and JNK/MAPK activation mediates the production of major macrophage- or dendritic cell-recruiting chemokine in human first trimester decidual cells in response to proinflammatory stimuli. For molecular functions, the binding functions such as receptor binding (GO:0005102), protein complex binding (GO:0032403) and identical protein binding (GO:0042802) are the most differentiated GO terms, which are also consistent with previous studies of MAPK and GSK families43,44. The diseased phosphorylation substrates are mainly localized in cytoplasm and membrane (such as GO:0005829, GO:0044444 and GO:0005886) according to the cellular component analysis. The above analysis showed that MAPK and GSK families might have significant differences in biological processes, molecular functions and cellular components for diseased and normal phosphorylation substrates, which could be helpful to design the protein kinase inhibitors for abnormal phosphorylation related diseases. Therefore, from the pathway and GO analysis we could further confirm that the MAPK and GSK families might have some really regulatory effects in disease-related phosphorylation.

Table 1 Statistical comparison of the GO terms of the disease-related and normal phosphorylation substrates

Computation programs construction and web server

Our main aim is to develop an open platform for computational analysis of phosphorylation of human proteins. We chose the C# programming language to execute all of calculation for its powerful computing capability and excellent portability. Besides that, we have constructed web service platform by using the Asp.net (C#). This web service of predicting kinase-specific phosphorylation site is freely accessible for academic researchers at: http://bioinfo.ncu.edu.cn/PKPred_Home.aspx. In download page, all of data used in this paper can be downloaded from this web site, such as all phosphorylation data, independent data and disease-related phosphorylation data. Bug fixing and minor changes of phosphorylation prediction model will be done. The improved phosphorylation prediction model will be constructed when the new phosphorylation sites data and kinase data become available.

With the web service of predictor, each query phosphorylation site (S/T/Y) in sequences can get a score. A higher score indicates a higher probability of the phosphorylation site by the selected kinase term. To control the false-positive predictions, we suggest users pay more attention to the predicted phosphorylation sites with P-values lower than the top 10%. In this case, the estimated specificity will be higher than 90%. Specific phosphorylation sites (S/T/Y) passing the suggested cut-off are highlighted by colour in the table of prediction results on the web site. The background set should contain unreported phosphorylation sites; as such, the specificity is very likely underestimated. In our opinion, this cut-off should be loosened once interaction between kinase and query protein occurs. In applications, users can adjust the cut-off values according to the trade-off between discovering more putative kinase-specific phosphorylation sites and making fewer false-positive predictions.

Discussion

Protein phosphorylation regulates most aspects of cell life, whereas abnormal phosphorylation is a cause or consequence of disease. A growing interest in developing orally active protein kinase inhibitors has recently culminated in the approval of the first of these drugs for clinical use. Protein kinases have now become the second most important group of drug targets, after G-protein-coupled receptors26. With increasingly more disease-related phosphorylation substrates were discovered in the clinic or in clinical trials, it is crucial to explore the relationship between the protein kinases and these disease-related phosphorylation substrates, such as whether the specific kinase has the potential relation to specific disease and which type of kinase can specifically result in the abnormal phosphorylation and further cause disease. On the basis of large amount of kinase-specific phosphorylation data and disease-related phosphorylation data, it has become both a possibility and a priority to determine what the functional implication of protein kinases are and how to use the abnormal regulatory information of phosphorylation to develop the corresponding protein kinase inhibitors for related diseases.

In this paper, we analysed the kinases' characteristic of all collected disease-related phosphorylation substrates for the first time on the basis of our kinase-specific prediction method and predicted that MAPK and GSK kinase families are enriched in the environment of disease-related phosphorylation, which could be helpful to design the corresponding protein kinase inhibitors to cure the diseases. Our findings about MAPK and GSK kinase families also have been confirmed by some experimental observations. For example, Noble et al.45 have reported that activation of p38α MAPK can lead to increased activities of proinflammatory cytokines, such as tumor necrosis factor-α and interleukin 1β. This observation suggested that p38 selective inhibition could be a therapeutically useful route to treatment of a number of inflammatory and autoimmune diseases. Meijer et al.43 have reviewed that more than 30 inhibitors of GSK have been identified to treat several diseases, including Alzheimer's disease and other neurodegenerative diseases.

Note that the capacity of disease-related phosphorylation data we collected is not lager enough, we cannot rule out the possibility that these data were gleaned from small-scale studies, which often aimed at studying only the better-known protein kinases and not their less explored relatives, while the data in background set almost come from high-throughput studies. So the statistics analysis between disease-related and normal phosphorylation data might have bias. But to authors' knowledge, there is no efficient method for large-scale analysis of the relationship between kinases and phosphorylation related diseases with computational model. Although we are not completely sure that this statistics analysis is reliable, it provided a useful resource for further experimental medicinal considerations. In addition, with the rapid development of the biotechnology, more and more diseased-related phosphorylation data will be detected and our idea of analysing the disease-related phosphorylation could give another choice for the pathogenesis research.

When personalized medicine is the next frontier of scientists, industry and the general population, it is important to develop computational approaches that can lead to a better understanding of the etiology of a disease. Considering the essential roles of phosphorylation in protein functions, integration of phosphorylation kinases and molecular information is a sensible step in this direction because it provides a structural and functional perspective to both the human protein kinase and abnormal phosphorylation. Our work can not only be used in pathophysiological diagnosis researches of abnormal phosphorylation but also in the selection of protein kinase inhibitors of clinical applications.

Based on the existing data, we constructed the kinase-specific predictors and carried out the systematic analysis of all disease-related phosphorylation and obtained some satisfactory result, but several issues must be solved in future researches. (i) The phosphorylation sites used in the predictors were mostly identified by high-throughput methods, which may have inherent bias in terms of representing the global phosphorylation events and consequently affect the performance of prediction. As techniques like electron transfer dissociation and alternative proteases are helping to resolve technology limitations, more complete phosphorylation data sets will be released. We will update our predictors as the new data become available. (ii) We have only labelled positive data, but we do not have labelled background and negative data (i.e. we still cannot ensure the non-phosphorylation substrates are truly negatives, although we have discarded the inaccessible substrate sites by using the structural filters). (iii) The types of kinase for disease-related phosphorylation substrates are only predicted from the aspect of kinase family due to the data limitation of single kinases (we only considered those kinases whose substrates is larger than 50), then the analysis for these diseased phosphorylation might not be elaborate enough. More kinases will be included in the system with the availability of more kinase-specific phosphorylation data and more detailed analysis for disease-related phosphorylation substrates will be carried out.

Methods

The PSEA method

Gene Set Enrichment Analysis (GSEA) was developed and used on DNA microarray data to detect coordinated expression changes in a group of functionally related genes and then was applied to find the putative functions of the long non-coding RNAs46,47,48. Taking advantage of the idea of GSEA, we proposed a new method called PSEA (Phosphorylation Set Enrichment Analysis) to detect new sites phosphorylated by a specific kinase, kinase family and kinase group. For each term, we focused on finding sites which were similar in sequence with discovered ones. We treated the phosphorylation sites and their surrounding amino acids as phosphorylated peptides. Phosphorylated peptides from the above three levels of kinase hierarchical classification formed kinase specific peptides sets (see Tables S1–S3). To determine whether a query peptide could be phosphorylated or not, we just needed to judge whether this query peptide was similar to the phosphorylated peptides in that set. The PSEA method we developed can efficiently estimate this similarity and the significance of the similarity. The following calculation was based on the conception of the 15 amino acid long peptides. Compared with the original GSEA method several necessary changes have been made49. The details of the PSEA method (shown in Figure 6) are described as follows:

  1. i

    Prepare the input data. Predefined each kinase group, kinase family and single kinase specific phosphorylation site peptide set SP containing N peptides, for example, SP (CDK2/S) containing 318 peptides and SP (CDK2/T) containing 182 peptides, as shown in Table S1. Predefined background peptide set Sb containing 10,000 randomly selected peptides from whole background set.

  2. ii

    Calculate Similarity Score. First, similarity scores between the query peptide (denoted as Pquery) and each peptide in Sp and Sb were calculated according to local sequence similarity13. For example, for two local sequence fragments S1 and S2 (the window size is 2n + 1), define the distance D(S1,S2) between S1 and S2 as:

    where, Sim is derived from the normalized amino acid substitution matrix. a and b are the two amino acids, M is the substitution matrix (BLOSUM62 was used in this paper). Then, all similarity scores were mixed together and ranked from high to low. According to the above steps, we know that if Pquery is similar to the peptides in Sp for a specific kinase level (single kinase, kinase family or kinase group), these peptides in Sp should be enriched at the top of the ranked similarity score list and Pquery is likely a novel phosphorylation site for that kinase level.

  3. iii

    Calculate Enrichment Score (ES). To test the enrichment of the peptides in Sp at the top of the ranked list, a running sum score was calculated by walking down the ranked list. Here, we denote di as the similarity scores between peptide Pquery and peptide Pi (contained in Sp) and D as the sum of |di| for all Pi. While walking down the ranked list, the running sum score increased |dj |/D when encountering a peptide Pj in Sp and decreased 1/10,000 when encountering a peptide in Sb. Finally, ES was defined as the maximum of the running sum score.

  4. iv

    Evaluate Significance of ES. To evaluate the significance of the ES for a given peptide, a total of 999 peptide sets with the same size as Sp were randomly selected from the background peptides and denoted as Sb1 to Sb999. Then the ES for each set was calculated one by one by treating each set as predefined peptide set. Finally, all of these 1000 ES (ES(Sb1), ES(Sb2)…ES(Sb999), plus ES(Sp)) were ranked from high to low. If the rank of ES(Sp) is L, the nominal P-value of the given peptide was calculated as L/1000. The P-value should be ranked from 0.001 to 1, with the minimum interval of 0.001. The smaller the P-value is, the more significant the chance that the given peptides were phosphorylated by the specific kinase. In practice, the number of randomly selected peptide sets can be changed according to different needs.

Figure 6
figure 6

Detailed processes of the PSEA method.

For the PSEA method described above, there are several advantages for classification. Firstly, it does not require the balanced number of positive and negative datasets, as many machine learning methods did. Secondly, it directly calculates the sequence similarity between two single peptides. In most of the cases, peptides in the kinase set can be divided into several subsets which share little variation with each other. The PSEA method is particularly suitable for these cases, because the ES of a given peptide would be significant as soon as it is similar to some but not all peptides in the kinase set.

Predicting the disease-related phosphorylation substrates

The information about disease-related phosphorylation which could cause severe diseases were obtained from the PhosphoSitePlus50, which is an online systems of biology resource providing comprehensive information and tools for the study of post-translational modifications (PTMs) of the protein and providing MS/MS records for sets of modification sites observed in specified diseases, cell lines and tissues. We collected 320 human disease-related phosphorylation proteins, which contain 806 disease-related phosphorylation terms. After collecting all the data, we further consulted the SwissVariant51, UniProtKB/Swiss-Prot and PubMed databases about their effects and the references to the variations, phosphorylation information and related diseases of corresponding proteins. The main alteration of these disease-related phosphorylation data are hyperphosphorylation and hypophosphorylation, which could further result in a series of diseases, such as Alzheimer's disease, breast cancer and so on. As one disease-related phosphorylation site could cause two or more diseases and we only analyse the relationship between kinase and disease-related phosphorylation substrate, then we only count one time for these phosphorylation sites. So after removing the redundant disease-related phosphorylation terms, we lastly collected 265 disease-related phosphoserines, 63 disease-related phosphothreonines and 237 disease-related phosphotyrosines (the data can be obtained from our online web site). With the kinase-specific predictor, PSEA, we predicted the corresponding kinase family for each disease-related phosphorylation site (because of the data limitation, the single kinase predictors only cover a small part of all kinase in PSEA, so we use kinase family predictors to predict the disease-related phosphorylation sites).

Enrichment analysis of normal and disease-related phosphorylation sites

Gene Ontology (GO) is a major bioinformatics initiative. It meets the need for consistent descriptions of gene products in different databases. It has been developed to manage the overwhelming mass of current biological data from a computational perspective and become a standard tool to annotate gene products for various databases52. Enrichment analysis was performed to identify over- or under-represented GO terms in the loss and gain of disease-related phosphorylation, compared to the normal phosphorylation in the entire human phosphorylation proteome. According to the two-sided category of Fisher exact test, the P-value of 0.05 was considered significant and was calculated. Besides, the KEGG pathway was analysed by using online DAVID program36,37, which provides a comprehensive set of functional annotation tools for investigators to understand biological meaning behind large list of genes. We also used the STRING database (version 9.1) of protein-protein interaction network to analyse and explain the relationship between the kinase and the disease-related phosphorylation substrates.

Assessment of the performance

The performance of PSEA was evaluated with the positive and negative test sets, with the accuracy (Acc), specificity (Sp), sensitivity (Sn) and the Matthews correlation coefficient (MCC) measurements defined in the following way:

where, TP is the number of true positive predictions, TN is the number of true negative predictions, FN is the number of false negatives and FP is the number of false positives. The accuracy denotes the percent of correct prediction in both the positive and negative sets. The sensitivity and specificity depend on the threshold used for the prediction. A highly stringent threshold will improve the specificity but reduce the sensitivity, whereas less stringent threshold will increase the sensitivity at the price of lower specificity. The MCC accounts for the true and false positives and negatives and is usually regarded as a balanced measure that can be used even if the classes are of very different sizes. Besides, receiver operating characteristic (ROC) curves were calculated and plotted based on Sp and Sn to evaluate the prediction performance of single kinases and areas under ROC curves (AUCs) were also calculated based on the trapezoidal approximation.

Threshold setting

Threshold setting is also a difficult problem. In general, most of us chose different threshold for every protein kinase. Here we proposed that a uniform rule to choose cut-off values based on calculated P-values. For all single kinases, kinase families and kinase groups, the high, medium and low thresholds were established with P-values smaller than 0.002, 0.005 and 0.015, respectively. The high threshold is recommended to test a large-scale prediction of human phosphorylation sites. The medium threshold often reduces the stringency to be useful in small-scare experiments. Also, the low threshold reduces the specificity to improve sensitivity considerably which is very useful in extensively experimental identification of all potential phosphorylation sites in substrates.