Completeness in structural genomics

Vitkup, Dennis; Melamud, Eugene; Moult, John; Sander, Chris

doi:10.1038/88640

Article
Published: June 2001


Nature Structural Biology volume 8, pages 559–566 (2001)


Abstract

Structural genomics has the goal of obtaining useful, three-dimensional models of all proteins by a combination of experimental structure determination and comparative model building. We evaluate different strategies for optimizing information return on effort. The strategy that maximizes structural coverage requires about seven times fewer structure determinations compared with the strategy in which targets are selected at random. With a choice of reasonable model quality and the goal of 90% coverage, we extrapolate the estimate of the total effort of structural genomics. It would take ∼16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins. In practice, unless there is global coordination of target selection, the total effort will likely increase by a factor of three. The task can be accomplished within a decade provided that selection of targets is highly coordinated and significant funding is available.
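The effort figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and reading "within a decade" as exactly 10 years is an assumption made here, not a figure stated by the authors.

```python
# Figures quoted in the abstract.
COORDINATED_TARGETS = 16_000  # carefully selected structure determinations
UNCOORDINATED_FACTOR = 3      # likely increase without global coordination

# Assumption for illustration: "within a decade" is read as 10 years.
ASSUMED_YEARS = 10


def uncoordinated_effort(targets: int = COORDINATED_TARGETS,
                         factor: int = UNCOORDINATED_FACTOR) -> int:
    """Total structure determinations if target selection is not coordinated."""
    return targets * factor


def required_per_year(total: int, years: int = ASSUMED_YEARS) -> float:
    """Average yearly throughput needed to finish `total` structures in `years`."""
    return total / years


if __name__ == "__main__":
    coordinated = COORDINATED_TARGETS
    uncoordinated = uncoordinated_effort()
    print("coordinated total:", coordinated)
    print("uncoordinated total:", uncoordinated)
    print("coordinated per year:", required_per_year(coordinated))
    print("uncoordinated per year:", required_per_year(uncoordinated))
```

Under these assumptions, coordinated target selection implies an average of about 1,600 structures per year, while uncoordinated selection would imply about 4,800 per year over the same period.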


**Figure 1: Accuracy of CASP protein structure models as a function of target-template sequence identity.**

**Figure 2: Current structural coverage of proteins in SP + TrEMBL.**

**Figure 3: Structural coverage of a protein family, illustrated using the Ras family in yeast as an example.**

**Figure 4: Scope of structural coverage as a function of model quality.**

**Figure 5: Two factors affecting the scale of structural genomics.**



Acknowledgements

We thank L. Holm, D. Marks and J. Norvell for discussions and C. Venclovas for providing CASP template/target sequence identity data. This work was supported in part by research grants from the US National Institutes of Health (NIGMS) and the Department of Energy to J.M. and C.S.

Author information

Authors and Affiliations

MIT Center for Genome Research, One Kendall Square, Building 300, Cambridge, 02139, Massachusetts, USA
Dennis Vitkup & Chris Sander
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, 02138, Massachusetts, USA
Dennis Vitkup
Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, 20850, Maryland, USA
Eugene Melamud & John Moult
Millennium Pharmaceuticals, 640 Memorial Drive, Cambridge, 02139, Massachusetts, USA
Chris Sander


Corresponding author

Correspondence to Chris Sander.


About this article

Cite this article

Vitkup, D., Melamud, E., Moult, J. et al. Completeness in structural genomics. Nat Struct Mol Biol 8, 559–566 (2001). https://doi.org/10.1038/88640


Received: 06 March 2001
Accepted: 23 March 2001
Issue Date: June 2001
DOI: https://doi.org/10.1038/88640

