Journal home
Advance online publication
Current issue
Archive
Press releases
Supplements
Focus
Guide to authors
Online submissionOnline submission
Permissions
For referees
Free online issue
Contact the journal
Subscribe
Advertising
work@npg
naturereprints
About this site
For librarians
 
NPG Resources
Nature
Nature Cell Biology
Nature Reviews Molecular Cell Biology
The EMBO Journal
Nature Reports Stem Cells
Nature Reports Avian Flu
NPG Subject areas
Biotechnology
Cancer
Chemistry
Clinical Medicine
Dentistry
Development
Drug Discovery
Earth Sciences
Evolution & Ecology
Genetics
Immunology
Materials Science
Medical Research
Microbiology
Molecular Cell Biology
Neuroscience
Pharmacology
Physics
Browse all publications
Meeting Report
Nature Structural Biology  5, 1029 - 1032 (1998)
doi:10.1038/4136

100,000 protein structures for the biologist

Andrej S caronali

Andrej S caronali is with the Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA. sali@rockefeller.edu

Structural genomics promises to deliver experimentally determined three-dimensional structures for many thousands of protein domains. These domains will be carefully selected, so that the methods of fold assignment and comparative protein structure modeling will result in useful models for most other protein sequences. The impact on biology will be dramatic.
Recently, some 200 structural biologists from both academia and industry gathered in Avalon, New Jersey to explore the science and organization of the new field of structural genomics1. The aim of the structural genomics project is to deliver structural information about most proteins. While it is not feasible to determine the structure of every protein by experiment, useful models can be obtained by fold assignment and comparative modeling for those protein sequences that are related to at least one known protein structure. Structure determination of 10,000 properly chosen proteins should result in useful three-dimensional models for hundreds of thousands of other protein sequences. In other words, structural genomics will put each protein within a comparative modeling distance of a known protein structure (Eaton Lattman, Johns Hopkins University).

Knowing the structures of biological macromolecules is almost always useful for predicting, interpreting, modifying and designing their functions. The high-throughput, coordinated structure determination efforts should be more efficient for obtaining 10,000 structures than the current distributed and largely uncoordinated efforts in individual structural biology laboratories. Structural genomics will build on the human genome project and will be integrated with the functional genomics project. Here, I review the recent meeting in Avalon. Due to space restrictions, the review omits many of the presentations. However, a complete list of speakers may be found on the meeting web site1.

The structural genomics project will deliver improved methods and a process for high-throughput protein structure determination. The process involves: (i) selection of the target proteins or domains, (ii) cloning, expression, and purification of the targets, (iii) crystallization and structure determination by X-ray crystallography or by nuclear magnetic resonance (NMR) spectroscopy, and (iv) archiving and annotation of the new structures. Several pilot projects in structural genomics that implement all of these steps are already underway (Table 1 http://structbio.nature.com/web_specials/nsb1298_supp/). It is feasible to determine at least 10,000 protein structures within the next five years. No new discoveries or methods are needed for this task. The average cost per structure is expected to decrease significantly from approx$200,000 per structure to perhaps less than $20,000 per structure. These savings will be achieved by the economies of scale, new or improved methods, and their parallelization in most of the steps in the process.

Table 1. List of pilot projects in structural genomics1
Table 1 thumbnail

Full TableFull Table
Target selection
The targets for structure determination generally will be individual domains rather than multi-domain proteins (Chris Sander, Millenium Information). Such targets will be easier to determine by X-ray crystallography or NMR spectroscopy than the more flexible multi-domain proteins, although it would be beneficial to determine whole proteins whenever possible. Target selection will begin by pairwise comparison of all protein sequences, followed by clustering into groups of domains. The targets for the structural genomics project will be the representatives of the groups without structurally defined members. The groups of domains should contain members that share at least approx30% sequence identity (30-seq families), rather than those that correspond to superfamiles or fold families.

This relatively high granularity is needed because of the limitations in the current methods for comparative modeling and for inferring function from structure. Errors in the comparative models become relatively small as the sequence identity increases above 30% (Andrej S caronali, Rockefeller University) and function is generally conserved within families (John Moult, Center for Advanced Research in Biotechnology, Rockville, Maryland); Christine Orengo, University College, London). Given the clustering into the 30-seq families, the total number of targets for experimental structure determination has been estimated to be on the order of 10,000. As an alternative, it has also been suggested that 100,000 protein structures need to be determined by experiment (David Eisenberg, UCLA); this would allow calculation of models with higher accuracy than is possible with only 10,000 known structures. After a list of the 30-seq families is obtained, the representatives will be picked and prioritized. A variety of different criteria likely will be used for this task; for example, size of the family, biological knowledge about the family, distribution of the members among various organisms, the pharmaceutical relevance, the likelihood of a successful structure determination, and so forth.

A preliminary analysis suggests that among recent protein structures determined in the hope of discovering a new fold, more than two-thirds actually have exhibited an already known fold (Steven Brenner, Stanford). While this unveils weaknesses in the fold assignment methods, it is not a waste of resources. In practical terms, the structure of a protein is worth determining if we do not know it, for whatever reason. In fact, it may even be an advantage when the newly determined protein is related to other known structures because such structure is likely to be more informative about its function than in the case of structural 'orphans'.

Given a fixed number of experimentally determined protein structures, the quality and the number of models for the rest of the proteins will be influenced strongly by the methods for fold assignment and comparative protein structure modeling. The impact of expected improvements in these methods may be equivalent to several years of the experimental protein structure output (Andrej S caronali). A number of different approaches to fold assignment have been described, applied, and evaluated (Ron Levy, Rutgers University; George Rose, John Hopkins University; Barry Honig, Columbia University; Manfred Sippl, CAME, Salzburg; David Eisenberg; Eugene Koonin, NCBI; Steven Brenner). The methods that rely on multiple sequence information, such as PSI-BLAST, appear surprisingly sensitive in detecting remote relationships (Liisa Holm, EMBL-EBI, Cambridge; Eugene Koonin; Steven Brenner).

Cloning, expression and purification
Obtaining protein samples for crystallization trials and NMR measurements likely will be the bottleneck in the structural genomics process. Although the current methods are sufficiently powerful to begin the project, advances are expected in parallelization and automation of the existing methods for cloning, expression, and purification of proteins. In addition, entirely new technologies may arise, such as an in vitro system that already has allowed researchers to produce proteins up to concentrations of 6 mg ml −1 (Shigeyuki Yokoyama, Tokyo University).

Crystallization and X-ray crystallography
Stephen Burley (Rockefeller University) described how to test protein samples for their propensity to form useful crystals. Crystals have a higher chance of being formed when the protein species in solution is conformationally homogeneous. This can be evaluated by dynamic light scattering, as well as by limited proteolysis combined with mass spectroscopy. For example, candidates that are monodisperse under standard aqueous conditions have a probability of at least 75% to yield crystals suitable for X-ray studies, in comparison to 10% for poly-disperse samples.

Wayne Hendrickson (Columbia University) described several pivotal advances that have matured only recently and form the basis of crystallographic analysis at a markedly accelerated level. These include: undulator sources at synchrotrons, CCD detectors, cryo-crystallography, MAD phasing analysis, and selenomethionyl proteins2. It is now possible to sustain a data collection throughput of four structures per day on a single undulator beam line. Assuming 180 operational days per year and that only one out of three data sets results in a useful final structure, this gives a conservative estimate of approximately 360 structures per year per single undulator beam line. This means that only several dedicated undulated beam lines are needed over the course of five years to achieve the goal of determining 10,000 new protein structures.

Tom Terwilliger (LANL) has developed the SOLVE computer program for obtaining electron density maps of proteins directly from multiwavelength or multiple isomorphous replacement X-ray diffraction data. An optimization of a scoring function replaces a complex rule-based process followed manually by crystallographers. This system has already allowed crystallographers to obtain images of proteins within hours of the end of X-ray data collection. The procedure has been successfully applied to MAD data with up to 26 selenium sites per derivative.

Sung-Hou Kim (University of California, Berkeley) described results from the first structural genomics pilot project, focusing on the Methanococcus janaschii genome (Table 1 http://www.nature.com/nsmb/index.html). He presented the structures of two proteins, an ATP dependent molecular switch with a new ATP binding motif, and a nucleotide pyrophosphatase, with a new fold. For each protein, the type of a ligand bound to it, and thus its biochemical function, were predicted from the protein structure alone. The predictions were subsequently confirmed by direct experiment. These results support the idea that protein function can be anticipated from structural information alone and thus that high-throughput structure determination will make a valuable contribution to biology.

Nuclear magnetic resonance spectroscopy
Kurt Wüthrich (ETH, Zürich), Gaetano Monte-lione and Gerhardt Wagner (Harvard Medical School) described recent and prospective advances in NMR spectroscopy that will allow structure determination of larger proteins as well as increase the accuracy and speed of the analysis. These advances include partial deuteration of protein samples, new magnets (up to 1 GHz), superconducting probes, the TROSY detection method, triple resonance methods for resonance assignment, spatial information from residual dipolar coupling, and software for resonance assignments and structure determination. For example, Montelione and colleagues have developed the program AUTOASSIGN to perform automated backbone assignments in a few minutes of CPU time for proteins smaller than 200 residues. This tool, integrated with the program AUTO_STRUCTURE for automatic analysis of NOESY data and structure generation, decreased the time needed for structure refinement of a Z-domain from several months to several hours. If the NMR data collection for a 120 residue protein takes 4−5 weeks on a 600 MHz spectrometer and if automated data processing takes 1 day, 120 structures can be determined in 1 year by 12 machines (Montelione), approximately at a cost of using one synchrotron beam line. By using higher field 800 MHz to 1 GHz magnets and other sensitivity-enhancement methods, this throughput could be significantly increased.

Anticipating the usefulness of NMR for structural genomics, Shigeyuki Yokoyama of Tokyo University announced the Japanese Genomic Sciences Center that will include a large NMR facility. The center will also support computational analysis of the human genome, sequencing and mapping of the mouse genome, and collaborations with high-throughput protein crystallography efforts elsewhere in Japan. The center will have 20 NMR instruments by 2001 and will be open to international collaboration. A 1 GHz magnet is being constructed in collaboration with the Tsukuba laboratory, the engineers of the magnets used for the bullet train.

NMR spectroscopy has become increasingly more useful and efficient. However, given the synchrotrons, X-ray crystallography is likely to retain an edge in throughput, accuracy, maximal protein size, and possibly cost per protein structure determination. Nevertheless, NMR spectroscopy will be indispensable for maximizing the benefits of structural genomics because of its unique ability to obtain information about the internal dynamics of a protein on multiple time scales as well as about its ligand binding properties. For example, Kurt Wüthrich described the importance of flexible tails for the function of several proteins. Another example is the routine use of chemical shift perturbation to enumerate binding sites on a protein for a thousand ligands per day from a library of 15,000 compounds, with up to 1 mM sensitivity in the dissociation constant (Steven Fesik, Abbott Laboratories). Also potentially feasible is NMR structure determination of membrane protein structures, which are exceptionally difficult to crystallize.

Databases
Several different databases will be useful in facilitating the structural genomics project (Table 2 http://www.nature.com/nsmb/index.html). These databases will be used for target selection, coordination of the experimental efforts, archiving of the new structures, dissemination of the results, and for performing new research with the old and new data. Helen Berman of Rutgers University described a new macromolecular structure database that will replace the Brookhaven Protein Data Bank3. Liisa Holm, Christine Orengo, and Steven Brenner described CATH, FSSP, and SCOP databases respectively. These databases classify protein structures in a hierarchical manner, generally consisting of families (proteins that share at least approx30% sequence identity), superfamilies (proteins that are probably related by divergent evolution but can have very little sequence similarity), and fold families (proteins that share similar folds). Orengo pointed out a great need for an exhaustive, systematic and computer friendly database of protein functional annotations. Such a database is necessary for the development and use of the methods that correlate structure to function. Andrej S caronali described ModBase, a large database of comparative protein structure models. This database is calculated with program Modeller and currently contains protein models for five genomes. It will continue to grow with the availability of new protein structures and sequences, as well as the improvements in the modeling software. Steven Brenner introduced PRESAGE, a database that stands to become the central nervous system for coordinating the efforts in structural genomics. PRESAGE includes entries for proteins; each entry contains information about the protein's homologs, current status of structure determination, links to three-dimensional models, and so forth. The entries in PRESAGE will be contributed by the community at large. Biologists will use it as a portal for accessing the most up-to-date information about proteins of interest. For example, PRESAGE would have eliminated the duplicated structure determination of elongation factor 5A by two structural genomics pilot projects (Sung-Hou Kim; Tom Terwilliger).

Table 2. Structural genomics web sites
Table 2 thumbnail

Full TableFull Table
Using the structures
Development of the databases and software will be necessary to integrate raw structural information and structure-derived results with other information, such as genetic and biochemical analysis, gene expression patterns, mutational analysis, binding data from protein chips, interaction data from two-hybrid systems, metabolic pathway information, and so forth (Chris Sander). These tools will then allow researchers to address efficiently a host of new questions. Some such studies, illustrating but the tip of the iceberg, have already been performed and were described at the meeting. For example, questions about the distribution of different structures among the genomes, and the implications for models of evolution, have been asked by Mark Gerstein (Yale University) and Eugene Koonin. In addition, David Eisenberg focused on understanding the extreme thermostability of proteins in thermophilic organisms in terms of the amino acid composition and structural features of the proteins. The study was based on a comparison of proteins that have homologs in both thermophilic and mesophilic organisms and which have homologs of known structure. It appears that thermostability of proteins originates in part from salt bridges and from deletions of loop regions.

Jeff Skolnick (Scripps Research Institute) described a method for predicting protein function based on structure prediction and three-dimensional motif identification. This method could push the detection of protein function much further into the twilight zone of sequence identity. First, the tertiary structures of the sequences of interest are predicted either ab initio or by threading. Then, the resulting models are screened using a library of three-dimensional descriptors of protein active sites, termed 'fuzzy functional forms'. If the geometry and residue types in the predicted structures match a three-dimensional motif, then the protein is predicted to have the specified molecular function. The idea and its power relative to local sequence pattern searches were illustrated by searching for proteins with a disulfide oxidoreductase active site in three bacterial genomes.

Impact of structural genomics
Suggestions about how best to proceed with the structural genomics project were discussed. Ideas ranging from continuing 'research as usual' to establishing large centers for structural genomics were explored. The pilot projects will help shape the full scale effort. It is likely that relatively large centers will provide the infrastructure for X-ray crystallography and NMR spectroscopy, allow the economy of scale, foster collaboration, and also ensure that the technical and repetitive jobs are not performed by students and postdocs in the academic laboratories. Centers will make it possible for academic and pharmaceutical researchers to obtain protein samples and research information, such as purification protocols and crystallization conditions. Such an infrastructure will enable the individual laboratories to focus on more challenging research problems, including functional characterization of their favorite proteins, structure determination of disease targets, large proteins and protein complexes, study of post-translational modifications, and systemic questions involving networks of proteins. Structural biologists in the pharmaceutical industry will increase their impact on the drug discovery process because they will be in a position to determine the structures of many more protein−ligand complexes, at a lower cost and faster, using the structures, samples and protocols from the high-throughput structure determination. The impact of structural genomics on the structural and functional characterization of proteins, as well as on our understanding of the machinery of life and its evolution, will surely result in profusion of new drug targets and better drugs. Structural genomics will refocus biology, just as the advent of of high throughput DNA sequencing machines freed researchers to focus on more elaborate experiments rather than on accumulating sequence data.

For a set of links related to structural genomics, please see the web special at http://www.nature.com/nsmb/index.html.

 Top
REFERENCES
  1. Structure-based functional genomics, October 4−7, Avalon, New Jersey (1998). http://lion.cabm.rutgers.edu/bioinformatics_meeting/. The meeting was organized by Gaetano Montelione, Stephen Anderson, Edward Arnold, and Ann Stock of the Center for Advanced Biotechnology and Medicine at Rutgers University (CABM), with the backing of the New Jersey Commission on Science and Technology, CABM, the Merck Genome Research Institute, and the Burroughs Wellcome Fund.
  2. Synchrotron supplement. Nature Struct. Biol. 5, 614−656 (1998). | Article | PubMed | ISI | ChemPort |
  3. Editorial, Nature Struct. Biol. 5, 925−926 (1998). | Article | PubMed | ISI | ChemPort |
 Top
FULL TEXT
Previous | Next
Table of contents
Download PDFDownload PDF
Send to a friendSend to a friend
Save this linkSave this link

Open Innovation Challenges

naturejobs

Abstract
Figures & Tables
References
Export citation
Export references
natureproducts

Search buyers guide:

 
ADVERTISEMENT
 
Nature Structural & Molecular Biology
ISSN: 1545-9993
EISSN: 1545-9985
Journal home | Advance online publication | Current issue | Archive | Press releases | Supplements | For authors | Online submission | Permissions | For referees | Free online issue | About the journal | Contact the journal | Subscribe | Advertising | work@npg | naturereprints | About this site | For librarians
Nature Publishing Group, publisher of Nature, and other science journals and reference works©1998 Nature Publishing Group | Privacy policy