It took 22 years of painstaking effort for Max Perutz to produce his structure for the protein haemoglobin (ref. 1). Published in Nature in 1960, the structure was calculated from the way crystals of the purified molecule diffracted a beam of X-rays. Since this pioneering work, the number of protein structures that have been determined has increased exponentially, so much so that the field's main database, the Protein Data Bank, now contains some 13,000 unique structures into which proteins can fold (see Figure 2).

Figure 2: Data deluge: the number of structures in the Protein Data Bank has grown steadily. Credit: PROTEIN DATA BANK

But such rapid progress will soon appear sluggish. The rate at which protein structures are cranked out is poised to rise by one, or even two, orders of magnitude. As the DNA sequencing stages of the various genome projects come to a close, attention is turning to the functions of the proteins encoded by the genes. The same brute-force philosophy that has turned genetics into genomics is now beginning to transform structural biology into structural genomics. “Structure and sequence information together are more insightful than sequence information alone,” says Gaetano Montelione, a structural biologist at Rutgers University in New Jersey.

This new discipline, which is the focus of a special Nature Structural Biology supplement this month (ref. 2), promises to change fundamentally the way structural biologists work. Getting basic structural information will largely be automated as huge 'protein structure factories' transform raw genomic data into finished protein structures on an unprecedented scale. This will leave academic structural biologists free to concentrate on interpreting the biological meaning of the new information, while those in the pharmaceutical industry apply the resulting knowledge to drug design.

On the production line

A structural genomics production line (see Figure 1) starts with a gene. Interest focuses on the parts of the gene that are transcribed into messenger RNA, and so encode proteins. Synthesizing sequences of DNA, known as complementary DNA (cDNA), that correspond to these sections paves the way for generating the relevant protein on a large scale. By adding short pieces of marker DNA to the cDNAs, the final protein gets 'tagged' with a particular peptide, which makes extraction of the product easier.

Figure 1: On the conveyor belt: structural genomics will turn protein structure discovery into a factory production-line process. Credit: MOLECULAR STRUCTURE CORP/LEONARD LESSIN/SPL/KAI BIENERT/OXFORD INSTRUMENTS/JAMES HOLMES/SPL

Each cDNA is then transferred to a suitable expression system. At present, this usually involves splicing the cDNA into a loop of DNA known as a plasmid, which in turn is inserted into Escherichia coli bacteria. Culturing the bacteria produces the protein. To purify the product, the homogenized culture is normally poured through an affinity chromatography column containing molecules that bind specifically to the peptide tag. For example, proteins tagged with a series of histidine residues bind tightly to nickel ions.
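The tagging step described above can be illustrated with a toy example. The following is a minimal sketch, not any centre's actual cloning protocol: the 12-base cDNA, the choice of six histidines and the placement of the tag at the end of the protein are all assumptions made for illustration. It inserts histidine codons (CAT is one of the two codons for histidine) just before the stop codon and translates the result with the standard genetic code.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def add_his_tag(cdna, n_his=6):
    """Insert n_his histidine codons (CAT) just before the final stop codon."""
    if len(cdna) % 3 != 0 or cdna[-3:] not in STOP_CODONS:
        raise ValueError("expected an in-frame coding sequence ending in a stop codon")
    return cdna[:-3] + "CAT" * n_his + cdna[-3:]


def translate(cdna):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_cdna = "ATGGCTAAATAA"           # hypothetical sequence: Met-Ala-Lys-stop
tagged_cdna = add_his_tag(toy_cdna)

print(translate(toy_cdna))          # MAK
print(translate(tagged_cdna))       # MAKHHHHHH
```

The run of histidines at the end of the translated product is what would bind to the nickel ions in the affinity column.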

The protein is then washed from the column and its structure determined using one of two main techniques: X-ray diffraction or nuclear magnetic resonance (NMR) spectroscopy. Each has its advantages, and most protein structure factories will use both methods. Because it works on molecules in solution, NMR is used to study the dynamics of proteins 'in action', for example recording the structural changes that occur when a protein binds to a small molecule, such as a hormone or a drug.

At the moment, NMR cannot be used with molecules larger than about 30 kilodaltons, and so cannot study large proteins in their entirety. But the technique is good enough for studying individual 'domain folds', twists and turns within distinct areas of any protein. There may be fewer than 20,000 possible configurations of these folds.

X-ray diffraction is not constrained by the size of the protein, and has the added advantage that producing a structure from the raw data requires much less computation than for NMR. But it does depend on being able to crystallize the protein — and determining the precise conditions to achieve this can be very laborious.
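The trade-off between the two techniques can be expressed as a simple rule of thumb. The sketch below is an illustration only: the function names are hypothetical, the figure of about 110 daltons per amino-acid residue is a rough textbook average rather than a number from this article, and the `crystallizes` flag stands in for what is in practice a laborious experimental question.

```python
AVERAGE_RESIDUE_MASS_DA = 110   # rough average mass of one residue (assumption)
NMR_LIMIT_KDA = 30              # approximate NMR ceiling quoted in the text


def estimated_mass_kda(n_residues):
    """Estimate a protein's mass in kilodaltons from its length."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000


def candidate_methods(n_residues, crystallizes):
    """List the techniques that could, in principle, be applied."""
    methods = []
    if estimated_mass_kda(n_residues) <= NMR_LIMIT_KDA:
        methods.append("NMR")
    if crystallizes:
        methods.append("X-ray diffraction")
    return methods


# A small 150-residue domain that will not crystallize.
print(candidate_methods(150, crystallizes=False))   # ['NMR']

# A 574-residue protein (about the size of the haemoglobin tetramer) with crystals.
print(candidate_methods(574, crystallizes=True))    # ['X-ray diffraction']
```

A large protein that refuses to crystallize returns an empty list, which is the situation that pushes researchers towards studying individual domains instead.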

The great potential of structural genomics has already been demonstrated. In the past couple of years, around 70 proteins have had their structures determined using DNA sequence data as the starting material. All that remains is for the 'factory' element to be established. And structural genomics centres worldwide are making progress towards automating the entire production line.

Taking the initiative

Heinemann: plans to focus on amenable human proteins with therapeutic potential. Credit: KAI BIENERT

Surprisingly, Germany was among the first to take the leap. Research ministry cash started flowing into the Protein Structure Factory, at the Max Delbrück Centre for Molecular Medicine in Berlin, more than a year ago. “Usually, German agencies decline to invest in big projects which have not already 'proven their importance' by being supported in the United States first,” says Udo Heinemann, scientific spokesman for the initiative.

This time, the US National Institutes of Health (NIH) followed Germany's lead. In September it funded seven centres to develop technologies for high-throughput structural genomics. Elsewhere, the RIKEN Structural Genomics Initiative in Yokohama, Japan, last month switched on its battery of 16 NMR machines; the Ontario Center for Structural Proteomics was launched in Toronto in July; and other initiatives are getting under way in France, Switzerland and Italy.

Automating the protein structure production line is reasonably straightforward: even the fiddly step of crystallization can be handed over to computer-controlled robots, which subtly vary the conditions in hundreds of thousands of wells containing protein solutions. The problem is that many proteins cannot readily be expressed in culture, either failing to fold correctly or precipitating out into messy clumps rather than remaining in solution. Proteins that are bound within cell membranes are particularly troublesome, as they do not fold correctly in aqueous solution. And even those proteins that are soluble can sometimes be difficult to crystallize.
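The robotic crystallization screens mentioned above are, in essence, a systematic walk through combinations of conditions. The following sketch shows why the number of wells climbs so quickly. The variables and their values are hypothetical, chosen only to illustrate the combinatorics, and the 96-well plate is assumed as a common laboratory format rather than taken from the article.

```python
import math
from itertools import product

# Hypothetical screening variables; real screens differ from lab to lab.
ph_values = [4.5, 5.5, 6.5, 7.5, 8.5]
precipitant_percent = [5, 10, 15, 20, 25, 30]
salt_molar = [0.0, 0.1, 0.2, 0.5]
temperatures_c = [4, 20]

# One dictionary per well: every combination of the four variables.
screen = [
    {"pH": ph, "precipitant_%": pct, "salt_M": salt, "temp_C": temp}
    for ph, pct, salt, temp in product(
        ph_values, precipitant_percent, salt_molar, temperatures_c
    )
]

WELLS_PER_PLATE = 96  # assumed plate format
plates_needed = math.ceil(len(screen) / WELLS_PER_PLATE)

print(len(screen))      # 240
print(plates_needed)    # 3
```

Even this coarse grid needs 240 wells for a single protein. Finer steps or extra variables multiply the count, which is why the task suits robots better than people.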

Given these problems, some of the structural genomics initiatives have decided to concentrate on the 'low-hanging fruit' of amenable proteins first. Guido Granti of Chiron Vaccines in Siena, who is coordinating an Italian structural genomics consortium now being set up, hopes to identify new target proteins suitable for vaccine development, particularly against meningococcus B, which causes meningitis. The consortium will concentrate on the 25% of proteins from cloned bacteria that Chiron has managed to express easily. “It means abandoning many proteins that might have been interesting,” says Granti. “But we will have plenty to be working on while deciding if we want to attack insoluble proteins.”

In Berlin, the focus is on human proteins, which are notoriously difficult to study because they are extensively modified after being translated from messenger RNA. “We select our target proteins on the basis of their potential novelty, and their potential for involvement in disease,” says Heinemann. But, like the Italian effort, the initial focus is on proteins that are, as Heinemann puts it, “nice to us”.

Many researchers are experimenting with a number of tricks to coax greater cooperation from intractable proteins. These include randomly mutating the gene in question to see whether any one of numerous point mutations produces a more soluble protein. For example, Tom Terwilliger of the Los Alamos National Laboratory in New Mexico, who is coordinating one of the NIH-funded structural genomics centres, is using this approach in his assault on 1,000 of the 4,000 proteins produced by the tuberculosis bacterium.

Alternative techniques include finding molecules that will bind to a target protein, stabilizing it in its correct structure, or expressing the protein of interest alongside 'chaperone' proteins which help direct its folding. Shigeyuki Yokoyama, leader of the Japanese initiative, plans to use a cell-free expression system incorporating a 'universal chaperone' protein. Initially, the Japanese effort will focus on mouse proteins, as the RIKEN Genomic Sciences Center, which last week hosted the International Conference on Structural Genomics 2000, has already sequenced thousands of full-length mouse cDNAs.

Other researchers are targeting entire protein families — the difficult molecules as well as the easy. Dino Moras, of the Institute of Genetics, Molecular and Cellular Biology in Strasbourg, is coordinator of the French structural genomics initiative. As a whole, the initiative has a broad outlook, but Moras's own group will concentrate on receptors found in the cell nucleus — proteins such as the oestrogen and glucocorticoid receptors — which are already the focus for work in his lab.

Yet others, such as the planned Swiss consortium, coordinated by Markus Grütter of the University of Zurich and Kurt Wüthrich of the Swiss Federal Institute of Technology, also in Zurich, have taken the bold decision to specialize in difficult proteins. The Swiss effort will pay particular attention to membrane proteins, which account for at least half of all cellular proteins.

On target

In general, the targets that each initiative has selected reflect the research interests of the scientists involved. Initially, it was suggested that proteins might be distributed across the various structural genomics initiatives just as responsibility for sequencing chromosomes was divided up among the teams working on the international Human Genome Project. “But the idea that targets could be imposed on the community evaporated when it was realized that there was never going to be any agreement on an ideal target,” says Stephen Burley of the Rockefeller University in New York, who directs one of the NIH structural genomics centres.

The resulting lack of an overall strategy for the field is one reason why the pharmaceutical industry is still largely watching from the sidelines (see 'Drug companies bide their time', above). But a series of task forces established at the First International Structural Genomics Meeting held in April in Hinxton, near Cambridge, UK, is now considering how overlap can be minimized and cooperation maximized. And once things take off, protein structures will roll in thick and fast. The target throughput for the German and Japanese initiatives is 100–200 proteins per year; for the NIH initiative, 10,000 new domain folds over ten years. “It will be at the very minimum five years before structural genomics starts to have a big impact on biology, and then it will explode,” predicts Yokoyama.
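The throughput targets quoted above can be put on a common yearly footing with a little arithmetic. This is a rough comparison only: it treats the NIH figure as if the work were spread evenly over the decade, and it sets 'domain folds' beside 'proteins', which are not strictly the same unit.

```python
# Targets as quoted in the text.
nih_folds_total = 10_000
nih_years = 10
german_japanese_per_year = (100, 200)   # low and high ends of the range

# Assumes an even spread over the ten years (a simplification).
nih_per_year = nih_folds_total / nih_years

# How many times larger the NIH yearly rate is than each end of the range.
ratios = [nih_per_year / rate for rate in german_japanese_per_year]

print(nih_per_year)   # 1000.0
print(ratios)         # [10.0, 5.0]
```

On these assumptions the NIH target works out at about 1,000 folds a year, five to ten times the yearly target of the German and Japanese initiatives.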

For today's structural biologists, used to determining protein structures in their own labs, the idea of most of this work being sucked into the new protein structure factories may seem disconcerting. But the enthusiasts for structural genomics say that the really difficult work — deciphering biological functions from the outpouring of structural information — will still need to be done. And that will place insightful and creative structural biologists in hot demand. “Factory science does not spell the end for structural biology,” stresses Heinemann. “Things will be even better.”

Web links

Protein Data Bank: http://www.rcsb.org/pdb

Protein Structure Factory, Berlin: http://userpage.chemie.fu-berlin.de/~psf/ifv_psfx.htm

NIH Structural Genomics Initiative: http://www.nigms.nih.gov/funding/psi.html

First International Conference on Structural Genomics: http://www.nigms.nih.gov/news/meetings/hinxton.html

International Conference on Structural Genomics 2000: http://icsg2000.riken.go.jp