An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets


With the recent exponential increase in protein phosphorylation sites identified by mass spectrometry, a unique opportunity has arisen to understand the motifs surrounding such sites. Here we present an algorithm designed to extract motifs from large data sets of naturally occurring phosphorylation sites. The methodology relies on the intrinsic alignment of phospho-residues and the extraction of motifs through iterative comparison to a dynamic statistical background. Results show the identification of dozens of novel and known phosphorylation motifs from recently published serine, threonine and tyrosine phosphorylation studies. When applied to a linguistic data set to test the versatility of the approach, the algorithm successfully extracted hundreds of language motifs. This method, in addition to shedding light on the consensus sequences of identified and as yet unidentified kinases and modular protein domains, may also eventually be used as a tool to determine potential phosphorylation sites in proteins of interest.
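The iterative comparison described above can be illustrated with a minimal sketch. This is not the authors' implementation: the binomial tail test, the pseudocount, the `alpha` and `min_count` thresholds, and the names `binom_tail`, `best_pair` and `extract_motif` are all assumptions made for illustration. It takes foreground sequences already aligned on the phospho-residue and a background set of the same width, repeatedly fixes the most over-represented (position, residue) pair, and narrows both sets to the sequences that match it, so the background changes with each step.

```python
from collections import Counter
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_pair(fg, bg, fixed, min_count):
    """Find the most over-represented (position, residue) pair not yet fixed.

    Returns (p_value, position, residue), or None if no pair reaches min_count.
    """
    best = None
    for pos in range(len(fg[0])):
        if pos in fixed:
            continue
        fg_counts = Counter(seq[pos] for seq in fg)
        bg_counts = Counter(seq[pos] for seq in bg)
        for residue, k in fg_counts.items():
            if k < min_count:
                continue
            # Background frequency, with a pseudocount so it is never 0 or 1
            # (an assumption of this sketch).
            p = (bg_counts[residue] + 1) / (len(bg) + 2)
            p_value = binom_tail(k, len(fg), p)
            if best is None or p_value < best[0]:
                best = (p_value, pos, residue)
    return best


def extract_motif(fg, bg, center, alpha=1e-3, min_count=2):
    """Greedily build one motif string, e.g. '.R.SP..', from aligned sequences."""
    width = len(fg[0])
    fixed = {center: fg[0][center]}  # the aligned phospho-residue is always fixed
    while fg and bg:
        hit = best_pair(fg, bg, fixed, min_count)
        if hit is None or hit[0] >= alpha:
            break
        _, pos, residue = hit
        fixed[pos] = residue
        # Dynamic background: both sets shrink to the sequences matching the pair.
        fg = [seq for seq in fg if seq[pos] == residue]
        bg = [seq for seq in bg if seq[pos] == residue]
    return "".join(fixed.get(i, ".") for i in range(width))
```

Calling `extract_motif(foreground, background, center=3)` on 7-residue windows returns a pattern in which `.` marks unconstrained positions. The sketch builds a single motif per call and uses an exact binomial sum, which is only practical for modest foreground sizes.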


Figure 1: Overview of motif-building strategy.
Figure 2: Sequence logo representations of various extracted motifs.


The authors thank John Rush and Cell Signaling Technology for providing access to the tyrosine phosphorylation data sets prior to their publication. Additionally, D.S. wishes to thank Michael Chou for assistance with the Moby Dick analysis as well as numerous stimulating conversations regarding the algorithm and critical reading of the manuscript. This work was supported in part by National Institutes of Health grant HG03456 (S.P.G.).

Corresponding author

Correspondence to Daniel Schwartz.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

About this article

Cite this article

Schwartz, D., Gygi, S. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol 23, 1391–1398 (2005).
