The structure of the protein universe and genome evolution


Despite the practically unlimited number of possible protein sequences, the number of basic shapes in which proteins fold seems not only to be finite, but also to be relatively small, with probably no more than 10,000 folds in existence. Moreover, the distribution of proteins among these folds is highly non-homogeneous — some folds and superfamilies are extremely abundant, but most are rare. Protein folds and families encoded in diverse genomes show similar size distributions with notable mathematical properties, which also extend to the number of connections between domains in multidomain proteins. All these distributions follow asymptotic power laws, such as have been identified in a wide variety of biological and physical systems, and which are typically associated with scale-free networks. These findings suggest that genome evolution is driven by extremely general mechanisms based on the preferential attachment principle.
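The link between preferential attachment and power-law-like family size distributions can be illustrated with a toy simulation. The sketch below is a minimal Simon/Yule-style growth process, not the model analysed in the article: at each step either a new family of size 1 appears (innovation) or an existing domain copy, chosen uniformly at random, is duplicated, so that a family gains members in proportion to its current size. The function name `simulate_family_sizes` and the parameters `n_steps`, `p_new` and `seed` are assumptions made for illustration only.

```python
import random
from collections import Counter


def simulate_family_sizes(n_steps=20000, p_new=0.1, seed=0):
    """Toy preferential-attachment growth of domain families.

    Hypothetical illustration: parameter values are arbitrary and are
    not taken from the article.
    """
    rng = random.Random(seed)
    members = [0]          # members[i] = family id of the i-th domain copy
    n_families = 1
    for _ in range(n_steps):
        if rng.random() < p_new:
            # Innovation: a new family of size 1 appears.
            members.append(n_families)
            n_families += 1
        else:
            # Duplication: picking a random existing copy selects a
            # family with probability proportional to its size.
            members.append(rng.choice(members))
    return Counter(members)  # family id -> family size


sizes = simulate_family_sizes()
size_hist = Counter(sizes.values())  # family size -> number of families

# Print the head of the distribution; on a double-logarithmic plot the
# counts would be expected to fall roughly along a straight line.
for size in sorted(size_hist)[:10]:
    print(size, size_hist[size])
print("largest family:", max(sizes.values()))
```

In a run of this sketch, most families should stay very small while a few grow very large, which is the qualitative pattern described above. Fitting an exponent to real genome data would need a proper statistical treatment that this example does not attempt.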


Figure 1: Double-logarithmic plot of the distribution of protein folds by the number of families.
Figure 2: Double-logarithmic plots of the size distribution of protein domain families in genomes.
Figure 3
Figure 4: Distributions of the number of domains in proteins from the three primary kingdoms of life.
Figure 5: Double-logarithmic plot of the distribution of protein domains by the number of links in multidomain proteins.
Figure 6: A fragment of the network of multidomain connections.




We thank A. Panchenko and S. He (NCBI) for help with the use of the Conserved Domain Database, and A. Rzhetsky and V. Kuznetsov for helpful discussions.

Author information



Corresponding author

Correspondence to Eugene V. Koonin.


About this article

Cite this article

Koonin, E., Wolf, Y. & Karev, G. The structure of the protein universe and genome evolution. Nature 420, 218–223 (2002).
