  • Review Article
Predicting protein function from sequence and structure

Key Points

  • 'Inheritance through homology' is the most common and generally the most accessible approach to function prediction, but orthology should be established where possible to improve confidence in predictions.

  • The body of functional annotations of proteins is becoming increasingly computer-readable and is being organized in ways that can enhance the scope of in silico prediction methods.

  • Significant advances in complete genome sequencing have resulted in a new generation of methods that exploit sequence analysis on the genome level.

  • Curated protein family resources can often guide the assignment of protein functions and the detection of motifs or sequence patterns.

  • New approaches are being developed to identify functional residues in proteins; these can then be applied to divide larger protein families into more specific functional subfamilies.

  • There have been exciting new developments in databases of experimentally determined protein–protein interactions, as well as genomic inference methods for predicting these interactions.

  • Non-homology-based function prediction methods that exploit the properties of sequences and not their evolutionary history are also becoming more successful.

  • Recent Structural Genomics Initiatives (SGIs) are attempting to target functionally diverse relatives within protein families.

  • Function prediction from structure can be achieved by global comparison of protein structures to detect homology or through the use of structural templates derived from the active sites of enzymes. It is also possible to explore the protein surface for sequence-conserved patches, clefts and electrostatic potentials.

  • In general terms, it is best to seek and compare the results of several methods to predict the function of novel proteins. Meta-servers simplify this by providing easy access to a range of the best-performing methods.

  • Future developments will see more efficient integration of prediction methods with experimental data from, for example, microarrays, yeast two-hybrid screens and tandem affinity purification. A better understanding of how function diversifies within protein families will permit more sophisticated means of predicting function and functional networks.
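As a rough illustration of comparing the results of several methods, the sketch below tallies how many independent predictors support each candidate annotation. The method names, the term labels and the `min_support` threshold are hypothetical placeholders; real meta-servers apply their own scoring schemes, which are not reproduced here.

```python
from collections import Counter


def consensus_annotations(predictions, min_support=2):
    """Return (term, support) pairs backed by at least `min_support` methods.

    `predictions` maps a method name to the list of terms it predicted.
    Results are sorted by decreasing support, then alphabetically.
    """
    support = Counter()
    for terms in predictions.values():
        # A set ensures each method votes at most once per term.
        support.update(set(terms))
    ranked = sorted(support.items(), key=lambda item: (-item[1], item[0]))
    return [(term, count) for term, count in ranked if count >= min_support]


# Hypothetical outputs from three prediction methods for one query protein.
example_predictions = {
    "homology_search": ["kinase activity", "ATP binding"],
    "family_profile": ["ATP binding", "kinase activity"],
    "structural_template": ["ATP binding", "hydrolase activity"],
}

for term, count in consensus_annotations(example_predictions):
    print(f"{term}: supported by {count} methods")
```

A simple vote treats all methods as equally reliable; weighting each method by its benchmarked accuracy would be a natural refinement.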

Abstract

While the number of sequenced genomes continues to grow, experimentally verified functional annotation of whole genomes remains patchy. Structural genomics projects are yielding many protein structures that have unknown function. Because subsequent experimental investigation is costly and time-consuming, computational methods for predicting protein function are very attractive. A growing number of noteworthy methods predict protein function from sequence and structural data alone, and many of them are readily available to cell biologists, provided that they are aware of the strengths and pitfalls of each technique.

Figure 1: Flow chart suggesting a possible strategy for molecular function prediction from a protein sequence and some possible outcomes.
Figure 2: The evolutionary trace (ET) method for identifying specificity residues.
Figure 3: Flow chart suggesting a possible strategy for function prediction from a protein structure and some possible outcomes.
Figure 4: Change of protein function in the ATP-grasp superfamily by insertion of secondary structure elements.
Figure 5: Using surface features and physico-chemical properties to recognize similarities between binding sites.

Acknowledgements

We would particularly like to acknowledge E. Sideris for help with the figures in this manuscript.

Supplementary information

Supplementary information S1 (table): Online resources

This site is not intended to list all available online resources but rather those that are widely used, of high quality, and publicly available as of June 2007. (PDF 382 kb)

Related links

DATABASES

Protein Data Bank

1ATP

1EHI

1MJH

FURTHER INFORMATION

Christine Orengo's homepage

Programs

3did

BIND

BLAST EBI

BLAST NCBI

Catalytic Site Atlas

CATH

CATHEDRAL

CE

ClustalW

COG

DALI

DIP

DRESPAT

EFICAz

eF-Site

ELM

ENZYME

Evolutionary Trace

FunCat

FunShift

Gene3D

GO

HAMAP

Human Protein Reference Database

IMG

InParanoid

IntAct

InterPro

iPfam

KEGG

MetaCyc

MIPS

MINT

Nest

Orthostrapper

PANTHER

PDBSiteScan

Pfam

PhyloFacts

PIBASE

PINTS

PRINTS

ProDom

ProFunc

ProKnow

PROSITE

ProteinKeys

ProtFun

ProtoNet

PSIMAP

PUMA2

pvSOAR

Reactome

SCOP

SCOPPI

SIFT

SiteEngine

SMART

SNAPPI-DB

SSAP

SSM

STRING

STRUCTAL

SUPERFAMILY

Surfnet

Swiss-Prot

SYSTERS

TIGRFAMs

TRANSPATH

Glossary

Orthologue

A homologue that is found in separate species and has been separated by speciation rather than by a gene duplication event.

Homologue

Protein sequences are homologous if they have descended, usually with divergence, from a common ancestral sequence.

Paralogue

A homologue that is the product of a gene duplication event within a species.
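The distinction between orthologues and paralogues can be sketched with a toy gene tree in which every internal node is labelled with the event that created it. This is only an illustration under stated assumptions: the nested-tuple tree encoding, the gene names and the function names are invented here, and real orthology resources infer the event labels from data rather than taking them as given.

```python
def leaves(node):
    """Return the set of gene names found below a tree node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def classify_pair(tree, gene_a, gene_b):
    """Classify two genes by the event at their most recent common ancestor.

    Internal nodes are tuples (event, left, right) with event equal to
    "speciation" or "duplication"; leaves are gene-name strings.
    """
    node = tree
    while not isinstance(node, str):
        event, left, right = node
        left_leaves, right_leaves = leaves(left), leaves(right)
        pair = {gene_a, gene_b}
        if pair <= left_leaves:
            node = left          # both genes lie in the left subtree
        elif pair <= right_leaves:
            node = right         # both genes lie in the right subtree
        elif pair <= (left_leaves | right_leaves):
            # The genes split here, so this node is their common ancestor.
            return "orthologue" if event == "speciation" else "paralogue"
        else:
            raise KeyError("gene not found in tree")
    raise ValueError("the two genes must be different")


# Hypothetical tree: one ancestral duplication, then a speciation in each copy.
toy_tree = (
    "duplication",
    ("speciation", "human_A", "mouse_A"),
    ("speciation", "human_B", "mouse_B"),
)

print(classify_pair(toy_tree, "human_A", "mouse_A"))  # separated by speciation
print(classify_pair(toy_tree, "human_A", "human_B"))  # separated by duplication
```

Note that `human_A` and `mouse_B` are also separated by the duplication node, so this sketch labels them paralogues even though they come from different species.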

Phylogenetic tree

Shows the evolutionary inter-relationships among various species or other entities that are believed to have a common ancestor. Each node that has descendants represents the most recent common ancestor of those descendants, with edge lengths sometimes corresponding to time estimates.

TIM barrel

Consists of eight α-helices and eight parallel β-strands that alternate along the peptide backbone. The structure is named after triose phosphate isomerase, a conserved glycolytic enzyme.

Superposition

After equivalent residues in two protein structures have been determined, the coordinates of one protein can be transformed onto the other.

Rossmann fold

Composed of three or more parallel β-strands linked by two α-helices; found in proteins that bind nucleotides, such as the NAD and FMN co-factors.

Superfamily

A group of evolutionarily related proteins that often have the same overall domain structure, but may have diverged beyond recognition at the sequence level.

Structural template

Many methods of predicting function from structure involve listing specific residues and expected inter-atom distances in a template file, which can then be compared against other structures.
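A minimal sketch of template matching along these lines is shown below. It assumes a template made of residue types plus expected distances between one representative atom per residue, and it searches a structure by brute force. The data layout, the tolerance and all coordinates are invented for illustration and do not correspond to any particular template library or server.

```python
from itertools import product
from math import dist


def match_template(template_residues, template_distances, structure, tolerance=0.5):
    """Find residue sets in `structure` that satisfy a distance template.

    template_residues: residue names, e.g. ["SER", "HIS", "ASP"].
    template_distances: {(i, j): expected distance} between template positions.
    structure: list of (residue_name, residue_number, (x, y, z)) tuples,
               one representative atom per residue.
    Returns a list of residue-number tuples, one per match.
    """
    # For each template position, collect residues of the required type.
    candidates = [
        [res for res in structure if res[0] == name] for name in template_residues
    ]
    hits = []
    for combo in product(*candidates):
        numbers = [res[1] for res in combo]
        if len(set(numbers)) != len(numbers):
            continue  # the same residue cannot fill two template positions
        if all(
            abs(dist(combo[i][2], combo[j][2]) - expected) <= tolerance
            for (i, j), expected in template_distances.items()
        ):
            hits.append(tuple(numbers))
    return hits


# Toy template and structure with made-up coordinates (not real geometry).
toy_residues = ["SER", "HIS", "ASP"]
toy_distances = {(0, 1): 3.0, (1, 2): 4.0, (0, 2): 5.0}
toy_structure = [
    ("SER", 195, (0.0, 0.0, 0.0)),
    ("HIS", 57, (3.0, 0.0, 0.0)),
    ("ASP", 102, (3.0, 4.0, 0.0)),
    ("SER", 214, (20.0, 0.0, 0.0)),
    ("HIS", 40, (10.0, 10.0, 10.0)),
]

print(match_template(toy_residues, toy_distances, toy_structure))
```

The exhaustive search is only practical for small templates; constraint-based or graph-based matching scales better on whole structures.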

SITE record

Part of a Protein Data Bank file containing details of which residues are relevant to the protein function (for example, those involved in substrate binding).
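As an illustration, SITE records can be read with fixed-column slicing, since the PDB file format places each field at set column positions (up to four residues per line). The sketch below is a minimal reader under that assumption; the function name and the sample record are hand-written for this example and are not taken from a specific PDB entry.

```python
def parse_site_records(pdb_lines):
    """Collect residues listed in SITE records, grouped by site identifier.

    Returns {site_id: [(residue_name, chain_id, residue_number), ...]}.
    """
    sites = {}
    for line in pdb_lines:
        if not line.startswith("SITE  "):
            continue
        # Pad so that slicing is safe on lines with fewer than four residues.
        line = line.rstrip("\n").ljust(61)
        site_id = line[11:14].strip()          # columns 12-14
        residues = sites.setdefault(site_id, [])
        # Residue fields start at columns 19, 30, 41 and 52 (0-based 18, 29, ...).
        for start in (18, 29, 40, 51):
            res_name = line[start:start + 3].strip()
            if not res_name:
                continue                       # unused residue slot
            chain_id = line[start + 4]
            res_number = int(line[start + 5:start + 9])
            residues.append((res_name, chain_id, res_number))
    return sites


# Hand-written sample record, assembled field by field to keep columns aligned.
SAMPLE_SITE = (
    "SITE   " "  1" " " "AC1" " " " 3" " "
    "HIS A  94 " " " "HIS A  96 " " " "HIS A 119 "
)

print(parse_site_records(["HEADER    EXAMPLE", SAMPLE_SITE]))
```

Records that continue a site over several lines share the same site identifier, so their residues accumulate in one list.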

De novo sequence method

A method that does not rely upon homology between sequences for transferring functional annotations but rather on the recognition of features such as residue composition and subcellular localization signals.

Meta-server

In the context of this review, a meta-server is a gateway to a well-benchmarked set of prediction methods.

Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005 (2007). https://doi.org/10.1038/nrm2281
