Abstract
Structural genomics aims to obtain useful three-dimensional models of all proteins by combining experimental structure determination with comparative model building. We evaluate different target-selection strategies for optimizing information return on effort. The strategy that maximizes structural coverage requires about seven times fewer structure determinations than a strategy in which targets are selected at random. Assuming a reasonable model quality and a goal of 90% coverage, we extrapolate an estimate of the total effort of structural genomics. It would take ∼16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins. In practice, unless target selection is globally coordinated, the total effort will likely increase by a factor of three. The task can be accomplished within a decade, provided that target selection is highly coordinated and significant funding is available.
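The effort figures quoted above can be turned into rough totals with a few lines of arithmetic. The Python sketch below is illustrative only and is not the authors' extrapolation procedure: it uses the ∼16,000 determinations and the factor-of-three penalty stated in the abstract, and it assumes (an assumption not made in the text) that the work is spread evenly over ten years. The function name `effort` and the constant names are hypothetical.

```python
# Figures quoted in the abstract.
COORDINATED_TARGETS = 16_000   # carefully selected structure determinations
UNCOORDINATED_FACTOR = 3       # likely increase without global coordination
YEARS = 10                     # "within a decade"; even spread is an assumption


def effort(targets: int, factor: int = 1, years: int = YEARS) -> tuple[int, float]:
    """Return (total determinations, determinations per year)."""
    total = targets * factor
    return total, total / years


coordinated = effort(COORDINATED_TARGETS)
uncoordinated = effort(COORDINATED_TARGETS, UNCOORDINATED_FACTOR)

print(f"coordinated:   {coordinated[0]} total, {coordinated[1]:.0f} per year")
print(f"uncoordinated: {uncoordinated[0]} total, {uncoordinated[1]:.0f} per year")
```

Under these assumptions the coordinated effort averages 1,600 determinations per year and the uncoordinated effort 4,800 per year. The abstract gives the total as approximate, so these values indicate scale only.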
Acknowledgements
We thank L. Holm, D. Marks and J. Norvell for discussions and C. Venclosas for providing CASP template/target sequence identity data. This work was supported in part by research grants from the US National Institutes of Health (NIGMS) and the Department of Energy to J.M. and C.S.
Cite this article
Vitkup, D., Melamud, E., Moult, J. et al. Completeness in structural genomics. Nat Struct Mol Biol 8, 559–566 (2001). https://doi.org/10.1038/88640
DOI: https://doi.org/10.1038/88640