With the recent exponential increase in protein phosphorylation sites identified by mass spectrometry, a unique opportunity has arisen to understand the motifs surrounding such sites. Here we present an algorithm designed to extract motifs from large data sets of naturally occurring phosphorylation sites. The methodology relies on the intrinsic alignment of phospho-residues and the extraction of motifs through iterative comparison to a dynamic statistical background. Results show the identification of dozens of novel and known phosphorylation motifs from recently published serine, threonine and tyrosine phosphorylation studies. When applied to a linguistic data set to test the versatility of the approach, the algorithm successfully extracted hundreds of language motifs. This method, in addition to shedding light on the consensus sequences of identified and as yet unidentified kinases and modular protein domains, may also eventually be used as a tool to determine potential phosphorylation sites in proteins of interest.
As research in molecular biology moves forward it has become increasingly clear that few cellular processes are unaffected by protein phosphorylation. Protein degradation, localization and conformation as well as protein/protein interactions are only some of the functions in which protein phosphorylation has been implicated1,2. Furthermore, protein phosphorylation levels are central to our current understanding of cell division and signal transduction pathways in both normal and diseased cell states3. Yet, relatively little is known about the majority of protein kinases in the human proteome. Only approximately one-tenth of the estimated 500–600 human protein serine, threonine and tyrosine kinases have known consensus sequences for their sites of phosphorylation4. Even when consensus sequences are known, in vivo protein substrates are often lacking.
To date, the task of understanding kinase recognition sequences has progressed mainly by a 'kinase-driven' approach whereby a kinase of interest is incubated with a combinatorial peptide library and ATP. Edman degradation of the phosphorylated peptides, which have been enriched using a ferric column, leads to the creation of a position-weight matrix of the data and hence the consensus sequence5. Though the kinase-driven approach has had much success in identifying optimal kinase consensus sequences and substrates, it has suffered from the fact that optimal in vitro binding is often kinetically unfavorable in the cellular environment, thus leading to motifs that are rarely found in the proteome.
Here we present an attempt to start with known biologically phosphorylated substrates from unknown kinases and discover motifs through a 'substrate-driven' approach. In the past, the low number of localized phosphorylation sites cited in the literature made substrate-driven approaches to determining kinase consensus motifs difficult. However, refinements of several affinity-based strategies such as immunoaffinity6, immobilized metal affinity chromatography (IMAC)7 and strong cation exchange (SCX) chromatography8, coupled with the enabling technology of tandem mass spectrometry, have more than doubled the number of phosphorylation sites identified in the past year alone, with several studies reporting from several hundred to several thousand sites6,8,9,10,11,12,13.
Two of these recently published large-scale mass spectrometry studies were chosen as test sets for our motif-building algorithm. The first study used SCX for the enrichment of phosphopeptides from HeLa cell nuclei, resulting in the elucidation of 1,594 unique phosphoserine and 195 unique phosphothreonine sites8. The second study used an antiphosphotyrosine antibody to enrich for phosphorylated tyrosine residues in pervanadate-treated Jurkat cells (151 sites), cells expressing constitutively active NPM-ALK fusion kinase (237 sites) and cells expressing constitutively active Src kinase (185 sites)6.
Overview of the method
A schematic of the motif extraction algorithm is shown in Figure 1. The method commences with the establishment of two parallel sequence data sets: the phosphorylated peptide data set from which motifs will be built, and a peptide data set used for background probability calculations. Next, the two data sets are converted into position-weight matrices of equal dimensions whereby each matrix contains information on the frequency of all residues at the six positions upstream and downstream of the phosphorylation site. Using the information encoded in these two matrices, a third matrix, the binomial probability matrix, is created. Specifically, this matrix contains the probability of observing s or more occurrences of residue x at position j (taken from the phosphorylation matrix), given a background probability P for residue x at position j (taken from the background matrix).
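The construction of the binomial probability matrix described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Perl implementation; the function names (`count_matrix`, `binomial_tail`, `probability_matrix`) are hypothetical, and evaluating the binomial terms in log space is an implementation choice made here to avoid overflow on data sets with thousands of sequences.

```python
from collections import Counter
from math import exp, lgamma, log


def count_matrix(seqs):
    """Per-position residue counts for equal-length, centre-aligned sequences."""
    width = len(seqs[0])
    return [Counter(s[j] for s in seqs) for j in range(width)]


def binomial_tail(m, c, p):
    """P(X >= c) for X ~ Binomial(m, p), summed term by term in log space."""
    if c <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(c, m + 1):
        log_pmf = (lgamma(m + 1) - lgamma(i + 1) - lgamma(m - i + 1)
                   + i * log(p) + (m - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def probability_matrix(fg, bg):
    """Map (position, residue) -> probability of seeing at least the observed
    count in the phosphorylation set `fg`, given the frequency in `bg`."""
    fg_counts, bg_counts = count_matrix(fg), count_matrix(bg)
    m, n_bg = len(fg), len(bg)
    probs = {}
    for j, counts in enumerate(fg_counts):
        for residue, c in counts.items():
            p_bg = bg_counts[j][residue] / n_bg  # background frequency
            probs[(j, residue)] = binomial_tail(m, c, p_bg)
    return probs
```

Under these assumptions, `probability_matrix(fg, bg)` takes two lists of 13-residue strings and returns one probability per residue/position pair observed in the phosphorylation set; the central position always yields 1.0 because both sets are aligned on the same residue.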
The motif-building step of the algorithm is a greedy recursive search of the sequence space to identify highly correlated residue/position pairs with the lowest P values. Each recursive iteration identifies the most statistically significant residue/position pair meeting a user-defined binomial probability threshold (in this study taken as P < 10⁻⁶) and occurrence threshold (which represents the minimal number of sequences in the phosphorylation data set needed to match the residue/position pair). When such a pair is found, the sequence spaces of the phosphorylation and background matrices are reduced by retaining only those sequences containing the selected residue/position pair, and a new binomial probability matrix is calculated (see Fig. 1). This recursive pruning procedure is repeated until no more statistically significant residue/position pairs that meet the occurrence threshold are detected. At this point the motif is identified by the tally of residue/position pairs selected during this step.
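The greedy motif-building step can be illustrated with a short Python sketch. It is written iteratively rather than recursively, and the names `build_motif`, `p_threshold` and `min_occurrences` are hypothetical; the default occurrence threshold of 10 is borrowed from the linguistic analysis and is only a placeholder. Ties in probability are broken in favor of the more frequent pair, as the Methods describe.

```python
from math import exp, lgamma, log


def binomial_tail(m, c, p):
    """P(X >= c) for X ~ Binomial(m, p), evaluated in log space."""
    if c <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    return min(1.0, sum(
        exp(lgamma(m + 1) - lgamma(i + 1) - lgamma(m - i + 1)
            + i * log(p) + (m - i) * log(1.0 - p))
        for i in range(c, m + 1)))


def build_motif(fg, bg, centre, p_threshold=1e-6, min_occurrences=10):
    """Greedily fix the most significant residue/position pair, prune both
    data sets to the sequences carrying it, and repeat until none qualifies.
    Returns a dict mapping position -> fixed residue."""
    motif = {}
    while fg and bg:
        best = None  # (probability, -count, position, residue)
        for j in range(len(fg[0])):
            if j == centre or j in motif:
                continue
            for residue in {s[j] for s in fg}:
                c = sum(s[j] == residue for s in fg)
                if c < min_occurrences:
                    continue
                p_bg = sum(s[j] == residue for s in bg) / len(bg)
                prob = binomial_tail(len(fg), c, p_bg)
                candidate = (prob, -c, j, residue)
                if prob < p_threshold and (best is None or candidate < best):
                    best = candidate
        if best is None:
            break  # no remaining pair meets both thresholds
        _, _, j, residue = best
        motif[j] = residue
        # Dynamic background: both sets shrink to the matching sequences.
        fg = [s for s in fg if s[j] == residue]
        bg = [s for s in bg if s[j] == residue]
    return motif
```

Because the background is pruned together with the phosphorylation set, each later probability is conditioned on the pairs already fixed, which is what lets the search avoid assuming independence between positions.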
The next major step of the algorithm involves set reduction of the phosphorylation and background data sets by removing all of those sequences that match the motif identified in the motif-building step. The purpose of this step is to remove the effects of those peptides with identified motifs from confounding the search for other significant motifs. Thus, performing the sequential loop of motif building followed by set reduction results in a decomposition of the phosphorylation sequence database into a list of significant motifs. The algorithm is complete (that is, the loop exits) when the motif-building step fails to identify any significant residue/position pairs.
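The outer loop of motif building followed by set reduction might look like the sketch below. The motif builder is passed in as a callable so that the example stays self-contained; `matches`, `decompose` and the `build_motif` parameter are hypothetical names, and a motif is represented here as a dict mapping position to residue.

```python
def matches(seq, motif):
    """True if `seq` carries every fixed residue of `motif` (position -> residue)."""
    return all(seq[j] == residue for j, residue in motif.items())


def decompose(fg, bg, build_motif):
    """Alternate motif building and set reduction until no motif is found.

    `build_motif` is any callable (fg, bg) -> dict; an empty dict signals that
    no significant residue/position pair was detected, which ends the loop.
    Returns the list of motifs and the unexplained remainder of `fg`.
    """
    motifs = []
    while fg:
        motif = build_motif(fg, bg)
        if not motif:
            break
        motifs.append(motif)
        # Set reduction: drop every sequence explained by the new motif so it
        # cannot confound the search for the next one.
        fg = [s for s in fg if not matches(s, motif)]
        bg = [s for s in bg if not matches(s, motif)]
    return motifs, fg
```

The second return value holds the phosphorylation sequences that matched none of the extracted motifs, which makes it easy to report what fraction of the data set was accounted for.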
To test the effectiveness of our algorithm, we applied it to a linguistic data set to determine its ability to extract English language motifs, that is, English words or common word fragments. Using a framework previously conceptualized by Bussemaker et al.14 to test their algorithm for the detection of regulatory DNA motifs, we ran our motif-building strategy on the first ten chapters of the classic English novel Moby Dick15 with random characters (at frequencies identical to those found in the original text) inserted between words14. Using the criteria P < 10⁻⁶ and occurrences ≥ 10, we extracted 384 unique motifs of which 371 mapped back to English words in the original text, indicating a false positive rate of 3.4% (Supplementary Tables 1–4 online). Additionally, the motifs extracted covered 93.4% of the English words in the original text (3,886 out of 4,160). If we required the motifs to have at least two fixed letter/position pairs (aside from the central letter), a scenario more suitable for a large linguistic data set, then 317 unique motifs were extracted with only one false positive (FP) (FP rate = 0.32%) and a coverage of 67.1%. It is important to note that because approximately half of the data in this analysis were composed of random characters, the false-positive rate was substantially higher than would be expected in the phosphorylation data set analysis where all of the data were centered on true phosphorylation sites. Nevertheless, to remain conservative, we retained these same stringent P value thresholds in our biological analyses.
In order to more closely mimic the biological situation, we further validated our approach using two additional data sets. The first of these consisted of 300 in silico–generated artificial proteins (Supplementary Table 5 online). These synthetic proteins were created using human proteome residue frequencies and varied between 50 and 700 residues in length. The proteins were then studded at random positions with the following five motifs: DxxSQxN, RxSxxL, TVxSxE, RxSxxP and KSxxxI ('x' residues retained background residue frequencies). To ensure that the artificial data set was sufficiently challenging, we inserted each motif at most once in only ∼50% of the proteins. To deal with the difficulty of an unaligned data set, we created a 'pseudo-alignment' by taking a sliding window of all 13-mers in the data set and dividing this into 20 subsets based on each of the central residues. The motif-extraction algorithm was then run independently on each of these subsets (with the set of all 13-mers as a background data set). Using the same parameters established in the linguistic analysis, and with a run time of under 5 min, the method was able to build and extract all five motifs in various alignments with no false positives. Because each of the motifs contained an 'S,' the S-centered analysis extracted all five motifs at once (Table 1). However, the D-, E-, Q-, N-, R-, L-, T-, P-, K-, V- and I-centered analyses also extracted the correct motifs containing those particular residues (data not shown). For example, the Q-centered analysis extracted only the motif DxxSQxN while the R-centered analysis extracted the motifs RxSxxL and RxSxxP. Thus the algorithm does not depend on a priori knowledge of any specific residues contained in the motif, which may allow for the discovery of biologically important residues.
As in the linguistic analysis, it should be noted that this was a much more complex data set than would be seen in a phosphorylation study because only ∼1–2% of the 13-mers contained a given motif.
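The 'pseudo-alignment' used for the unaligned artificial proteins can be sketched in a few lines of Python. The function name `pseudo_alignment` and the `flank` parameter are hypothetical; the sketch simply enumerates every 13-mer and groups the windows by their central residue, with the full list of windows serving as the background.

```python
from collections import defaultdict


def pseudo_alignment(proteins, flank=6):
    """Return (all windows, windows grouped by central residue).

    Each window is 2 * flank + 1 residues wide (13 for flank=6); proteins
    shorter than one window contribute nothing.
    """
    width = 2 * flank + 1
    windows = [p[i:i + width]
               for p in proteins
               for i in range(len(p) - width + 1)]
    subsets = defaultdict(list)
    for w in windows:
        subsets[w[flank]].append(w)  # index `flank` is the central residue
    return windows, dict(subsets)
```

Each subset can then be analyzed independently against the full window list, which is how a motif such as RxSxxL can surface in the R-, S- and L-centered analyses alike.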
Though the next data set used to validate our algorithm was significantly less complex than the aforementioned ones, it closely resembled the intended application of the algorithm (Supplementary Table 6 online). To generate this data set we used the Phospho.ELM database16 to extract serine-phosphorylated peptides experimentally determined to be substrates of the following four kinases: Ataxia Telangiectasia Mutated (ATM) (43 sites), Casein II (184 sites), Calcium/Calmodulin-dependent protein Kinase II (CaMK II) (41 sites) and Mitogen-Activated Protein Kinase (MAPK) (30 sites). Application of the motif-building algorithm to this combined data set resulted in the extraction of six motifs corresponding to the precise consensus sequences for the ATM, Casein II and CaMK II kinases (Table 2). In the case of MAP kinase, the small size of the initial MAPK data set yielded an sP motif instead of the canonical PxsP motif.
Comparison to other algorithms
Despite the wealth of motif-discovery algorithms designed to predict transcription factor binding sites, tools for the extraction of protein motifs have not kept pace. Though no algorithms exist with the specific intention of extracting protein phosphorylation motifs, we have chosen four of the most popular protein motif discovery programs against which to benchmark our algorithm. We applied the TEIRESIAS17, Pratt18, Gibbs motif sampler19 and eMOTIF20 algorithms through their online servers to the aforementioned in silico and Phospho.ELM-derived data sets used to test our approach.
The Gibbs motif sampler is an iterative Monte Carlo procedure, which results in a position-weight matrix representation of a motif. When applied to the two test data sets the Gibbs sampler extracted only DxxSQxN from the artificial-protein data set and HS[IS][PY][SPHE] (a false positive) from the Phospho.ELM-derived data set.
Pratt operates through pruned depth-first search of the sequence space and returns an almost unlimited number of highly generalized patterns, which vary drastically in accordance with the large number of parameters. Of the top 1,000 motifs returned by the Pratt algorithm on our artificial-protein data set, several appeared to be related to our inputted motifs, but none were exact matches and the overwhelming majority did not resemble any of the five inserted motifs. When applied to our Phospho.ELM-derived test set, Pratt returned 1,000 motifs which were almost entirely acidic in nature. The first ATM-like motif appeared as motif number 492 (SQxxxS).
The TEIRESIAS algorithm is based on an exhaustive search of small patterns followed by a convolution phase in which the small patterns are joined to form longer ones. Using our artificial protein data set TEIRESIAS extracted the motifs TVxSxE and DxxSQxN as the 4th and 6th hits, respectively. These were then followed by a long list of motifs containing serine and leucine, presumably because of their increased frequency in the human proteome and the lack of background filtering in the algorithm. None of the other three motifs were found in the top 500 patterns returned by TEIRESIAS. When tested against our Phospho.ELM-derived data set, the program returned all four kinase motifs: Casein II kinase motif sxxE (hit no. 1), ATM motif sQ (hit no. 17), CaMK II motif Rxxs (hit no. 30) and MAP kinase motif sP (hit no. 40). While TEIRESIAS was the only algorithm tested that returned all four kinase motifs, the high number of false-positive motifs between these true positives unfortunately limits the applicability of the algorithm to most biological data sets of this nature.
The final protein motif discovery tool we tested was eMOTIF. Given a prealigned data set, this program uses a pruned exhaustive search to find motifs with high specificity and coverage. Since eMOTIF requires a prealigned data set, we used the same set of 13-mers centered on 'S' used for our motif-building analysis as input for the artificial-protein data set. Probably because this data set was not a true alignment, eMOTIF was unable to form any motifs. However, when applied to the Phospho.ELM-derived data set, eMOTIF returned 497 motifs. Inspection of these motifs revealed a majority of highly generalized acidic motifs. The specific motifs for two of the expected kinases, namely ATM and CaMK II, were found in the middle of the list.
Analysis of mass spectrometry phosphorylation data sets
Table 3 shows the results of the motif extraction algorithm applied to the set of threonine phosphorylated peptides from the Beausoleil, Jedrychowski et al.8 data set. The most significant motif identified, tPP (where lowercase letters indicate phosphorylated residues) appeared in 62 unique sequences from the data set, representing ∼32% of the phosphorylation data. The same motif was identified in only 0.7% of the background data set, thereby highlighting the statistical significance of this motif. To further visualize the identified motifs, we used sequences from the subset of the phosphorylation data set containing the motifs to construct sequence logos in which the height of each residue in the logo was proportional to its frequency in the subset21,22. It is evident from the sequence logo for the tPP motif that a preference for basic residues exists at the −3 position of the motif (Fig. 2a).
To address the issue of conservative amino acid substitutions, often found in kinase motifs, we created degenerate phosphorylation and background data sets whereby the 20 amino acids were condensed into an 11-amino-acid code based on residue characteristics (see Methods and shaded portions of Table 3). Not surprisingly, the degenerate analysis yielded a more specific analog of the initial motif, [RK]xxtPP (where 'x' denotes any residue, 't' represents the phosphorylated threonine and [RK] denotes R or K at that position), which was 168-fold overrepresented in the phosphorylation data set as compared to the background. Sequences from the data set that contained this phosphorylated motif indicated a significant number of transcription-related proteins including eIF4γ2, eIF3, HOMEZ, PPRB, TCF20, RUNX1 and DRPLA. Despite their overwhelming statistical significance in this data set, it is interesting that the biological significance of these motifs has yet to be reported in the literature.
The motif analysis of serine phosphorylated peptides from this data set indicated successful decomposition into 12 previously identified kinase motifs and 6 novel motifs (Table 3, second nonshaded region) which together were able to account for 85% of the starting phosphorylated data set. A computational analysis of the sequence space of the 1,594 phosphorylated peptides revealed an upper bound of 246,383 potential motifs (containing 1, 2 or 3 nonwildcard positions). If we make the highly conservative assumptions that ∼100 serine phosphorylation motifs exist in the literature and all are represented in the data set, then the probability of extracting a single known motif by chance is 0.0004 (100/246,383) and the odds of extracting 12 known motifs are vanishingly small. Thus, the identification of 12 known motifs served both as a validation of the methodology, and a positive control for the data. Among these were motifs for MAPK, Cyclin-Dependent Kinase (CDK), Casein II kinase, AKT, CaMK II and Golgi Casein Kinase (G-CK) (see Figs. 2b,c and Table 3). Surprisingly, however, the novel motifs RxRxxsP, RxxsPxP and sPxxxRR were among the most significant motifs identified. Although they share similarities with both basophilic and proline-directed kinases, these motifs have not been previously characterized (example in Fig. 2d). Inspection of the proteins identified from the data set containing these novel motifs revealed a disproportionate number of the so-called RS domain–containing proteins involved in RNA binding and splicing23. It should also be noted that the serine motifs were robustly identified even when differing background data sets were chosen (see Box 1, Table 4 and Supplementary Tables 7–10 online).
Although we demonstrated the ability of the motif-building algorithm to decompose a large data set into constitutive motifs, we also wished to test the performance of the algorithm on small data sets containing fewer motifs. To this end we used the data sets from the Rush et al.6 tyrosine phosphorylation immunoaffinity-tandem mass spectrometry (MS/MS) study. The first of these data sets contained tyrosine phosphorylated peptides from two cell lines, Karpas 299 and SU-DHL-1, both of which are known to express constitutively active Nucleophosmin-Anaplastic Lymphoma Kinase (NPM-ALK)24, an oncogenic fusion tyrosine kinase. The degenerate motif-building analysis was able to extract four novel motif classes (Table 5, first shaded region). These motifs represent candidate consensus sequences for NPM-ALK, a kinase which currently has no known consensus.
Figure 2e shows the sequence logo for the Exxy NPM-ALK motif. Inspection of this sequence logo indicated a clear C2H2 zinc finger domain consensus sequence, with the phosphorylation site falling between the second histidine (position −6) and first cysteine (position +2) of the domain25. Interestingly, there are 14 unique phosphorylated C2H2 zinc fingers identified by the data set, all of which contain an invariable glutamic acid at the −3 position despite the fact that this residue is not well conserved among all members of the C2H2 zinc finger family of proteins. Though not mentioned in the Rush et al.6 study, this represents the first example of tyrosine phosphorylation in a zinc finger domain. The possibility exists that this phosphorylation event interferes with zinc coordination and may shed light on the poorly understood process by which zinc fingers associate with and dissociate from their cognate DNA sequences.
The next data set we analyzed from this study contained tyrosine phosphorylated sequences from cells expressing constitutively active c-Src kinase. The optimal substrate sequence for c-Src has been determined by combinatorial library screening methods to be DEEIy[GE]EFF26. Comparison of the consensus to the motifs identified in this study yielded some striking similarities and differences (Table 5). The sequence logo for the c-Src motif yD (Fig. 2f), despite containing only 26 unique sequences, shares similar residue characteristics at nearly all positions with the in vitro–determined consensus, whereas the most significant motif identified, yS, bears only slight resemblance to the library-based c-Src motif (Fig. 2g).
The normal and degenerate motif analysis from the pervanadate-treated Jurkat cells in the third Rush et al.6 data set also revealed several motifs, all of which are indicative of the known Src activation in these cells27. One such significant motif (Fig. 2h) contained a proline at the +3 position (accounting for approximately one-fifth of the data set), consistent with recent work that has indicated the ability of Src kinase to also phosphorylate YxxP motifs28.
As proteomic sequence data sets grow ever larger, tools for the extraction of biologically relevant motifs will become even more useful. The algorithm presented here represents an attempt to extract biologically relevant motifs based on sequence information from large-scale mass spectrometry–based data sets, and is meant to serve as a starting point for future research. Using a statistical framework that does not assume independence of positions in the motif owing to a dynamic statistical background, a two-step methodology of motif building and set reduction is used to decompose a given data set into its constitutive motifs. The strategy taken is substantially different from previous work in the realm of protein motif discovery, and its validity has been demonstrated through direct comparison to other motif discovery algorithms. Furthermore, the approach's usefulness has been exemplified through its ability to extract known and novel motifs from several large-scale MS/MS-based phosphorylation data sets. Since the motifs and their cognate position weight matrices are generated from actual in vivo phosphorylation sites as opposed to synthetic peptide libraries, the method may lead to an improvement in phospho-site prediction. It should be noted however, that the derived position weight matrices, and not consensus sequences alone, should be used in the search for potential phosphorylation sites, as they contain information on residue frequencies surrounding the 'locked' sites and will help reduce false-positive rates. We also envision the novel motifs to be used as peptide baits to identify corresponding uncharacterized kinases.
Given the success of the algorithm on a linguistic data set, it is apparent that the method has the versatility to extract motifs from a wide range of data sets including, but not limited to, other post-translational modifications and genomic data. Additionally, the described algorithm may lead to the discovery of novel biologically relevant protein motifs directly from a raw proteome.
Phosphorylation data formatting.
The Beausoleil, Jedrychowski et al.8 phosphoserine and phosphothreonine data set and the Rush et al.6 phosphotyrosine data sets were used as starting points for the analysis. Only those tryptic peptides where the exact site of phosphorylation was known were selected. Peptides were mapped back to their respective proteins and six residues upstream and downstream of the phosphorylation sites were reextracted from the proteome sequence. This step removed the nonuniformity of tryptic peptide fragments to produce a data set of known phosphorylation sites in a uniform 13-residue context. In cases where the phosphorylation site was within six residues of a protein terminus, the sequence was discarded. Thus, only those sites which had sequence information for 12 residues surrounding the phosphorylation site were used. The sequences were then filtered for redundancy, so that only unique sequences remained. This formatting procedure gave rise to 1,594 phosphoserine-centered sequences, 195 phosphothreonine-centered sequences and 573 phosphotyrosine-centered sequences.
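The formatting procedure above amounts to cutting a fixed window around each localized site, dropping sites too close to a terminus, and removing duplicates. A minimal Python sketch follows; the function name `site_contexts`, the dictionary-of-sequences input and the zero-based site indices are assumptions made for illustration.

```python
def site_contexts(proteins, sites, flank=6):
    """Extract unique (2 * flank + 1)-residue windows centred on each site.

    proteins: dict mapping a protein identifier to its sequence.
    sites:    iterable of (protein identifier, zero-based site index).
    Sites with fewer than `flank` residues on either side are discarded.
    """
    seen = set()
    contexts = []
    for protein_id, pos in sites:
        seq = proteins[protein_id]
        if pos < flank or pos + flank >= len(seq):
            continue  # too close to the N or C terminus
        window = seq[pos - flank:pos + flank + 1]
        if window not in seen:  # redundancy filter
            seen.add(window)
            contexts.append(window)
    return contexts
```

The same routine, applied to every nonphosphorylated serine, threonine or tyrosine of a protein set, would yield background windows in the identical 13-residue format.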
Background data set creation.
This analysis was dependent upon the ability to compare the distribution of residue frequencies in the phosphorylation data set with a background model. Background data sets were created in three ways. For the phosphoserine and phosphothreonine analyses, background data sets were created using fully-tryptic peptides generated from mass spectrometry data in the Gygi lab and searched using the human protein database with high Sequest29 thresholds (Xcorr > 2.5, dCn > 0.1). Data were centered on nonphosphorylated serine or threonine residues and formatted according to the same procedure described in the previous section, thus yielding two similarly aligned background data sets containing 27,980 sequences centered on serine and 20,208 sequences centered on threonine. Owing to the lower abundance of tyrosine residues, a second type of background data set for the phosphotyrosine analysis was created by taking six residues upstream and downstream of every tyrosine residue in the human proteome. This resulted in a background data set centered on tyrosine, containing 441,343 sequences. It should be noted, however, that despite our intention of using a mass spectrometry–based background to avoid mass spectrometry–specific residue biases, performing the serine and threonine analyses with a proteomic background as opposed to a mass spectrometry–based background did not significantly alter the motifs extracted (Supplementary Table 9 online). Finally, a third type of background data set with randomized residue positions for each of the serine-centered mass spectrometry peptides was created (see Box 1).
Degenerate residue positions.
To allow for conservative amino acid substitutions at various positions, we condensed our phosphorylated peptide and background lists from a 20-amino-acid code to a degenerate 11-amino-acid code based on chemical properties as follows: A = AG, D = DE, F = FY, K = KR, I = ILMV, Q = QN, S = ST, C = C, H = H, P = P, W = W. The analysis was then carried out as described for the nondegenerate analysis (see shaded regions of Tables 3 and 5).
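The 11-letter degenerate code listed above maps directly onto a lookup table. The Python sketch below uses hypothetical names (`GROUPS`, `TO_DEGENERATE`, `degenerate`) and makes one assumption not stated in the text: characters outside the 20 standard residues are passed through unchanged.

```python
# Representative letter -> residues it stands for, as listed in the text.
GROUPS = {
    "A": "AG", "D": "DE", "F": "FY", "K": "KR", "I": "ILMV", "Q": "QN",
    "S": "ST", "C": "C", "H": "H", "P": "P", "W": "W",
}

# Inverted table: each of the 20 amino acids -> its representative letter.
TO_DEGENERATE = {aa: rep for rep, members in GROUPS.items() for aa in members}


def degenerate(seq):
    """Rewrite a sequence in the 11-letter code; unknown characters are kept."""
    return "".join(TO_DEGENERATE.get(aa, aa) for aa in seq)
```

Applying `degenerate` to both the phosphorylation and background lists before motif building reproduces the condensed analysis, in which a fixed K stands for [KR] and a fixed D for [DE].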
The motif-building strategy was carried out by finding successive significant residue/position pairs. Though the significance threshold is a parameter of the algorithm, for all analyses in this paper, residue/position pairs were deemed significant if they had random probabilities less than 10⁻⁶ according to a binomially distributed model (equation (1) below),
$$P(m, c_{xj}, p_{xj}) = \sum_{i=c_{xj}}^{m} \binom{m}{i}\, p_{xj}^{\,i}\,(1 - p_{xj})^{m-i} \qquad (1)$$
where $m$ equaled the number of sequences in the data set matrix, $c_{xj}$ equaled the count of residue x at position j in the data set matrix, and $p_{xj}$ equaled the fractional frequency of residue x at position j in the current background matrix. The result was calculated using the pbinom function in the Math::CDF Perl module. The function could not calculate probabilities below 10⁻¹⁶. Since each recursive iteration of the algorithm chose the residue/position pair with the lowest binomial probability, if more than one pair had probabilities of 10⁻¹⁶, then the pair with the greater frequency in the data set matrix was selected.
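Despite the statistical significance of every motif extracted, heuristic scores for the motifs were calculated as the sum of the negative log of the binomial probabilities used to generate the motifs (equation (2) below),
$$\text{score} = \sum_{i} -\log\left(P_{i}\right) \qquad (2)$$
where the sum runs over the residue/position pairs fixed during motif building and $P_{i}$ is the binomial probability of the i-th pair.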
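The heuristic score described above can be computed with a one-line sum. In this Python sketch the base-10 logarithm and the clamping of each probability at 10⁻¹⁶ are assumptions: the text does not state the base, and the floor merely mirrors the numerical limit reported for the probability calculation. The name `motif_score` is hypothetical.

```python
from math import log10

PROBABILITY_FLOOR = 1e-16  # assumed floor, mirroring the reported numerical limit


def motif_score(probabilities, floor=PROBABILITY_FLOOR):
    """Sum of -log10(p) over the residue/position pairs fixed in a motif."""
    return sum(-log10(max(p, floor)) for p in probabilities)
```

Under these assumptions, a motif built from two pairs with probabilities 10⁻⁸ and 10⁻¹⁰ scores 18, and no single pair can contribute more than 16.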
Using the analytical framework previously created by Bussemaker et al.14, text from the first ten chapters of Moby Dick by Herman Melville15 with random characters inserted between words was retrieved from http://www.physics.rockefeller.edu/siggia/projects/mobydick/. By taking a sliding 13-character window, the text was then transformed into a matrix of all 13-character strings, thus constituting the background data set. From this background data set, 26 subsets were created, each being centered on a different letter of the alphabet. Using the background data set and each of the subsets, the motif-building methodology (with P < 10⁻⁶, and occurrences ≥ 10) was carried out 26 times, thus yielding motifs centered on every letter of the alphabet (Supplementary Tables 1–4 online).
Comparison to other algorithms.
To compare our algorithm to other motif discovery tools, we input our synthetically generated list of 300 proteins and the manually curated phosphorylation list containing 298 13-mers to four websites: Pratt at http://www.ebi.ac.uk/pratt/ with parameters C% = 2%, PL = 13, PN = 50, PX = 5, FN = 2, and FL = 1; TEIRESIAS at http://cbcsrv.watson.ibm.com/Tspd.html with option 'exact discovery' and parameters L = 2 or 3, W = 5 and K = 2; eMOTIF at http://motif.stanford.edu/emotif/emotif-maker.html with a 10% match threshold; Gibbs motif sampler at http://bayesweb.wadsworth.org/cgi-bin/gibbs.9.pl?data_type=protein with number of patterns = 5, max sites per sequences = 1, motif width = 5, estimated total sites = 40.
Public access to algorithm.
Access to the algorithm will be available through a website currently under construction at http://motif-X.med.harvard.edu/, which will allow users to input their sequence data and adjust the various algorithm parameters to retrieve motif results.
Programming and sequence logos.
All programming and analysis were done using the Perl programming language on a Linux workstation (2.2 GHz microprocessor with 1.5 GB RAM). Sequence logos were generated online using Weblogo21 at http://weblogo.berkeley.edu/
Note: Supplementary information is available on the Nature Biotechnology website.
The authors thank John Rush and Cell Signaling Technology for providing access to the tyrosine phosphorylation data sets prior to their publication. Additionally, D.S. wishes to thank Michael Chou for assistance with the Moby Dick analysis as well as numerous stimulating conversations regarding the algorithm and critical reading of the manuscript. This work was supported in part by National Institutes of Health grant HG03456 (S.P.G.).