  • Review Article

The structure of the protein universe and genome evolution

Abstract

Despite the practically unlimited number of possible protein sequences, the number of basic shapes in which proteins fold seems not only to be finite, but also to be relatively small, with probably no more than 10,000 folds in existence. Moreover, the distribution of proteins among these folds is highly non-homogeneous — some folds and superfamilies are extremely abundant, but most are rare. Protein folds and families encoded in diverse genomes show similar size distributions with notable mathematical properties, which also extend to the number of connections between domains in multidomain proteins. All these distributions follow asymptotic power laws, such as have been identified in a wide variety of biological and physical systems, and which are typically associated with scale-free networks. These findings suggest that genome evolution is driven by extremely general mechanisms based on the preferential attachment principle.
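
The preferential attachment principle mentioned above can be illustrated with a minimal sketch. This is a generic Simon/Yule-style growth process and is not claimed to be the authors' model: at each step either a new single-member family appears (with an assumed probability `p_new`) or an existing gene is duplicated, so that a family gains members in proportion to its current size. The function names, the parameter values, and the crude least-squares slope on the log-log histogram are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def simulate_family_sizes(n_steps, p_new=0.1, seed=0):
    """Grow gene families by innovation and duplication; return family sizes.

    Hypothetical toy model: p_new is an assumed innovation probability.
    """
    rng = random.Random(seed)
    sizes = [1]    # sizes[i] = number of genes in family i
    members = [0]  # one entry per gene, holding the index of its family
    for _ in range(n_steps):
        if rng.random() < p_new:
            # Innovation: a new family appears with a single member.
            sizes.append(1)
            members.append(len(sizes) - 1)
        else:
            # Duplication: choosing a gene uniformly at random selects a
            # family with probability proportional to its current size.
            fam = rng.choice(members)
            sizes[fam] += 1
            members.append(fam)
    return sizes


def loglog_slope(sizes):
    """Least-squares slope of log(number of families) against log(family size).

    A rough diagnostic only; it is not a rigorous power-law fit.
    """
    hist = Counter(sizes)
    xs = [math.log(size) for size in hist]
    ys = [math.log(count) for count in hist.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    family_sizes = simulate_family_sizes(20000)
    print("families:", len(family_sizes))
    print("largest family:", max(family_sizes))
    print("log-log slope:", round(loglog_slope(family_sizes), 2))
```

In a run of this sketch, most families stay small while a few grow very large, and the size histogram falls off roughly linearly on a double-logarithmic plot, which is the qualitative pattern the abstract describes.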

Figure 1: Double-logarithmic plot of the distribution of protein folds by the number of families.
Figure 2: Double-logarithmic plots of the size distribution of protein domain families in genomes.
Figure 3
Figure 4: Distributions of the number of domains in proteins from the three primary kingdoms of life.
Figure 5: Double-logarithmic plot of the distribution of protein domains by the number of links in multidomain proteins.
Figure 6: A fragment of the network of multidomain connections.

Acknowledgements

We thank A. Panchenko and S. He (NCBI) for help with the use of the Conserved Domain Database, and A. Rzhetsky and V. Kuznetsov for helpful discussions.

Author information

Corresponding author

Correspondence to Eugene V. Koonin.

About this article

Cite this article

Koonin, E., Wolf, Y. & Karev, G. The structure of the protein universe and genome evolution. Nature 420, 218–223 (2002). https://doi.org/10.1038/nature01256

  • DOI: https://doi.org/10.1038/nature01256
