Abstract
Despite the practically unlimited number of possible protein sequences, the number of basic shapes in which proteins fold seems not only to be finite, but also to be relatively small, with probably no more than 10,000 folds in existence. Moreover, the distribution of proteins among these folds is highly non-homogeneous — some folds and superfamilies are extremely abundant, but most are rare. Protein folds and families encoded in diverse genomes show similar size distributions with notable mathematical properties, which also extend to the number of connections between domains in multidomain proteins. All these distributions follow asymptotic power laws, such as have been identified in a wide variety of biological and physical systems, and which are typically associated with scale-free networks. These findings suggest that genome evolution is driven by extremely general mechanisms based on the preferential attachment principle.
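The link between preferential attachment and a skewed, heavy-tailed distribution of family sizes can be illustrated with a small simulation. The sketch below is not the authors' model. It is a generic Simon/Yule-type toy process in which each new gene either founds a new family or duplicates an existing gene chosen uniformly at random, so that large families grow in proportion to their current size. The function names, the parameter `p_new`, and the chosen values are all hypothetical.

```python
import random
from collections import Counter


def simulate_family_sizes(n_genes, p_new, seed=0):
    """Grow a toy genome one gene at a time (illustrative assumption only).

    With probability p_new the new gene founds a new family. Otherwise it
    copies the family of a uniformly chosen existing gene, which means a
    family gains members at a rate proportional to its current size
    (preferential attachment).
    """
    rng = random.Random(seed)
    membership = [0]  # family id of every gene; start with a single gene
    n_families = 1
    while len(membership) < n_genes:
        if rng.random() < p_new:
            membership.append(n_families)  # innovation: a new family
            n_families += 1
        else:
            membership.append(rng.choice(membership))  # duplication
    return list(Counter(membership).values())


def size_histogram(sizes):
    """Map family size -> number of families of that size, sorted by size."""
    return dict(sorted(Counter(sizes).items()))


sizes = simulate_family_sizes(n_genes=20000, p_new=0.1, seed=1)
hist = size_histogram(sizes)

# Print the low end of the distribution and the largest family.
for size in list(hist)[:5]:
    print(size, hist[size])
print("largest family:", max(sizes))
```

Plotting `hist` on log-log axes should give a roughly straight line over part of the range, which is the signature of a power-law tail. Real genome data would need a proper statistical fit rather than a visual check.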
Acknowledgements
We thank A. Panchenko and S. He (NCBI) for help with the use of the Conserved Domain Database, and A. Rzhetsky and V. Kuznetsov for helpful discussions.
Cite this article
Koonin, E., Wolf, Y. & Karev, G. The structure of the protein universe and genome evolution. Nature 420, 218–223 (2002). https://doi.org/10.1038/nature01256
DOI: https://doi.org/10.1038/nature01256