Predicting protein function from sequence and structure

Lee, David; Redfern, Oliver; Orengo, Christine

doi:10.1038/nrm2281

Review Article
Published: December 2007

Nature Reviews Molecular Cell Biology volume 8, pages 995–1005 (2007)

Key Points

'Inheritance through homology' is the most common and generally more accessible approach to function prediction, but orthology should be established where possible to improve confidence in predictions.
The body of functional annotations of proteins is becoming increasingly computer-readable and is being organized in ways that can enhance the scope of in silico prediction methods.
Significant advances in complete genome sequencing have resulted in a new generation of methods that exploit sequence analysis on the genome level.
Curated protein family resources can often guide the assignment of protein functions and the detection of motifs or sequence patterns.
New approaches are being developed to identify functional residues in proteins; these can then be applied to divide larger protein families into more specific functional subfamilies.
There have been exciting new developments in databases of experimentally determined protein–protein interactions, as well as genomic inference methods for predicting these interactions.
Non-homology-based function prediction methods that exploit the properties of sequences and not their evolutionary history are also becoming more successful.
Recent Structural Genomics Initiatives (SGIs) are attempting to target functionally diverse relatives within protein families.
Function prediction from structure can be achieved by global comparison of protein structures to detect homology or through the use of structural templates derived from the active sites of enzymes. It is also possible to explore the protein surface for sequence-conserved patches, clefts and electrostatic potentials.
In general terms, it is best to seek and compare the results of several methods to predict the function of novel proteins. Meta-servers simplify this by providing easy access to a range of the best-performing methods.
Future developments will see more efficient integration of prediction methods with experimental data from, for example, microarrays, yeast two-hybrid screens and tandem affinity purification. A better understanding of how function diversifies within protein families will permit more sophisticated means of predicting function and functional networks.
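
The first key point, inheritance through homology with orthology established where possible, can be illustrated with a minimal sketch. It assumes reciprocal best hits as the orthology test, which is one common heuristic and not a method prescribed by this article. The protein names, similarity scores, annotations and function names below are all invented for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical similarity scores (for example, alignment bit scores).
# SCORES_A_TO_B[query in genome A][hit in genome B]; all values are invented.
SCORES_A_TO_B: Dict[str, Dict[str, float]] = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b1": 120.0, "b2": 80.0},
}
SCORES_B_TO_A: Dict[str, Dict[str, float]] = {
    "b1": {"a1": 245.0, "a2": 118.0},
    "b2": {"a1": 88.0, "a2": 79.0},
}
# Hypothetical known annotations for proteins in genome B.
ANNOTATIONS_B: Dict[str, str] = {
    "b1": "kinase activity",
    "b2": "hydrolase activity",
}


def best_hit(scores: Dict[str, Dict[str, float]], query: str) -> Optional[str]:
    """Return the highest-scoring hit for the query, or None if it has no hits."""
    hits = scores.get(query, {})
    if not hits:
        return None
    return max(hits, key=hits.get)


def transfer_annotation(
    query: str,
    forward: Dict[str, Dict[str, float]],
    backward: Dict[str, Dict[str, float]],
    annotations: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Inherit the annotation of the best hit and label the supporting evidence.

    A reciprocal best hit is treated here as (assumed) evidence of orthology;
    a one-way best hit is reported as homology only, with lower confidence.
    """
    hit = best_hit(forward, query)
    if hit is None or hit not in annotations:
        return None, "no annotated homologue"
    if best_hit(backward, hit) == query:
        return annotations[hit], "reciprocal best hit"
    return annotations[hit], "homologue only"


if __name__ == "__main__":
    for protein in ("a1", "a2"):
        print(protein, transfer_annotation(protein, SCORES_A_TO_B, SCORES_B_TO_A, ANNOTATIONS_B))
```

In this toy data both queries inherit the same annotation from `b1`, but only `a1` is supported by a reciprocal best hit, so the prediction for `a2` would deserve less confidence.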

Abstract

While the number of sequenced genomes continues to grow, experimentally verified functional annotation of whole genomes remains patchy. Structural genomics projects are yielding many protein structures that have unknown function. Nevertheless, subsequent experimental investigation is costly and time-consuming, which makes computational methods for predicting protein function very attractive. There is an increasing number of noteworthy methods for predicting protein function from sequence and structural data alone, many of which are readily available to cell biologists who are aware of the strengths and pitfalls of each available technique.

**Figure 1: Flow chart suggesting a possible strategy for molecular function prediction from a protein sequence and some possible outcomes.**

**Figure 2: The evolutionary trace (ET) method for identifying specificity residues.**

**Figure 3: Flow chart suggesting a possible strategy for function prediction from a protein structure and some possible outcomes.**

**Figure 4: Change of protein function in the ATP-grasp superfamily by insertion of secondary structure elements.**

**Figure 5: Using surface features and physico-chemical properties to recognize similarities between binding sites.**

Acknowledgements

We would particularly like to acknowledge E. Sideris for help with the figures in this manuscript.

Author information

Authors and Affiliations

Department of Biochemistry and Molecular Biology, Biomolecular Structure and Modelling Group, University College London, Gower Street, London, WC1E 6BT, UK
David Lee, Oliver Redfern & Christine Orengo

Supplementary information

Supplementary information S1 (table): Online resources

This site is not intended to list all available online resources but rather those that are widely used, of high quality, and publicly available as of June 2007. (PDF 382 kb)

Glossary

Orthologue: A homologue that is found in separate species and has been separated by speciation rather than by a gene duplication event.
Homologue: Protein sequences are homologous if they have descended, usually with divergence, from a common ancestral sequence.
Paralogue: A homologue that is the product of a gene duplication event within a species.
Phylogenetic tree: Shows the evolutionary inter-relationships among various species or other entities that are believed to have a common ancestor. Each node that has descendants represents the most recent common ancestor of those descendants, with edge lengths sometimes corresponding to time estimates.
TIM barrel: Consists of eight α-helices and eight parallel β-strands that alternate along the peptide backbone. The structure is named after triose phosphate isomerase, a conserved glycolytic enzyme.
Superposition: After equivalent residues in two protein structures have been determined, the coordinates of one protein can be transformed (rotated and translated) onto those of the other.
Rossmann fold: A fold composed of three or more parallel β-strands linked by two α-helices. It is found in proteins that bind nucleotides, such as the NAD and FMN co-factors.
Superfamily: A group of evolutionarily related proteins that often have the same overall domain structure, but may have diverged beyond recognition at the sequence level.
Structural template: Many methods of predicting function from structure involve listing specific residues and expected inter-atom distances in a template file, which can then be compared against other structures.
SITE record: Part of a Protein Data Bank file containing details of which residues are relevant to the protein function (for example, those involved in substrate binding).
De novo sequence method: A method that does not rely upon homology between sequences for transferring functional annotations but rather on the recognition of features such as residue composition and subcellular localization signals.
Meta-server: In the context of this review, a meta-server is a gateway to a well-benchmarked set of prediction methods.
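
The 'Structural template' entry above describes listing specific residues and expected inter-atom distances and comparing them against other structures. The following is a minimal sketch of that idea, not the algorithm of any particular published tool. It assumes one representative point per residue, a fixed distance tolerance, and a brute-force search; the template, coordinates and residue identifiers are invented for illustration.

```python
import math
from itertools import permutations
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Residue = Tuple[str, str, Point]  # (residue id, residue type, representative point)

# Hypothetical three-residue template: residue types plus expected distances
# (in angstroms) between template positions i and j. Values are illustrative only.
TEMPLATE_RESIDUES: List[str] = ["SER", "HIS", "ASP"]
TEMPLATE_DISTANCES: Dict[Tuple[int, int], float] = {
    (0, 1): 3.0,
    (1, 2): 3.0,
    (0, 2): 4.2,
}


def match_template(
    structure: List[Residue],
    residues: List[str],
    distances: Dict[Tuple[int, int], float],
    tolerance: float = 0.5,
) -> List[Tuple[str, ...]]:
    """Return residue-id tuples whose types and pairwise distances fit the template."""
    matches = []
    # Brute force over ordered residue combinations; adequate for a small sketch.
    for combo in permutations(structure, len(residues)):
        # Residue types must agree with the template, position by position.
        if any(res[1] != wanted for res, wanted in zip(combo, residues)):
            continue
        # Every listed distance must fall within the tolerance of its expected value.
        if all(
            abs(math.dist(combo[i][2], combo[j][2]) - expected) <= tolerance
            for (i, j), expected in distances.items()
        ):
            matches.append(tuple(res[0] for res in combo))
    return matches


# Invented toy structure: three residues arranged to fit the template, plus two decoys.
TOY_STRUCTURE: List[Residue] = [
    ("ser_a", "SER", (0.0, 0.0, 0.0)),
    ("his_a", "HIS", (3.0, 0.0, 0.0)),
    ("asp_a", "ASP", (3.0, 3.0, 0.0)),
    ("ser_b", "SER", (20.0, 20.0, 20.0)),
    ("asp_b", "ASP", (0.0, 0.0, 9.0)),
]

if __name__ == "__main__":
    print(match_template(TOY_STRUCTURE, TEMPLATE_RESIDUES, TEMPLATE_DISTANCES))
```

Because only distances are compared, a match found this way does not depend on how the two structures are oriented, so no prior superposition is needed. A practical tool would also need to handle multiple atoms per residue and to assess the statistical significance of each hit.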

About this article

Cite this article

Lee, D., Redfern, O. & Orengo, C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8, 995–1005 (2007). https://doi.org/10.1038/nrm2281

Issue Date: December 2007
DOI: https://doi.org/10.1038/nrm2281
