Nature 422, 115-116 (13 March 2003) | doi:10.1038/422115a

Constellations in a cellular universe

Ruedi Aebersold1

  1. Ruedi Aebersold is at the Department of Molecular Biology, University of Washington, Seattle, Washington, 98195-7730, USA.


We must agree now on strategies to search the proteome.

Proteomics — the systematic analysis of proteins — has raised grand expectations for biology and medicine, exceeding those of the other genomics technologies. Yet these expectations will be difficult to realize unless proteomic technologies become robust and are disseminated beyond specialized laboratories. A substantial shift in focus is needed, and here I suggest how this might be accomplished through a proteome-browsing technology that, once adopted, successfully implemented and supported, could achieve these goals.

Success with genes

One of the most striking results obtained from completed genome sequencing projects is the knowledge of the precise number of genes in the genome of a species. Although the gene counts determined to date are relatively small — ranging from a few hundred for bacteria to tens of thousands for mammalian species — the number of possible products encoded by these genes is much higher. In particular, the number of encoded proteins is enormous, as the same gene can generate multiple protein products that differ as a result of combinatorial splicing, processing and modification.

As the universe of a species' biological processes and the molecules that constitute these processes is finite and knowable, there are several important consequences for experimental biology. First, even the most complex biological phenomena such as development, differentiation, metabolism and memory will be explained using known genes, their products and their interplay with the external conditions that the organism encounters. Second, projects to discover genes and their products in a species have defined end points. Although this end point has so far been reached only for gene discovery (sequencing), at some point in the future all of a species' gene products — including messenger RNAs and proteins — will also be comprehensively described. Third, once all the possible molecules and activities within a species have been discovered and described, biological experimentation will be transformed from a discovery mode of identifying and describing molecules, to a 'browsing' mode, in which the universe of possible events is searched to find constellations that correlate with a particular state or function. Genomics-style biology can therefore be separated into two distinct phases: a discovery phase to characterize the universe; and a browsing phase, in which system-wide biological assays navigate the universe.

Proteins are involved in all biological processes and can therefore be considered the functionally most important biological molecules. They are also particularly rich in biological information. In addition to the amino-acid sequence defining a protein, properties such as the amount of a protein expressed, its specific activity, its state of modification and its association with other proteins or molecules of different types are crucial for the description of biological systems. The systematic identification and characterization of proteins, called proteomics, carries with it huge expectations: the discovery of diagnostic and prognostic markers in blood serum and other body fluids; the identification of targets for pharmaceutical drugs; and an improved understanding of fundamental biological processes. Hence the development of technologies to search the 'proteome' routinely and systematically would be a significant achievement.

Unfortunately, the same properties that make proteins information-rich also significantly complicate their experimental analysis. No experimental platform, even under development, can systematically measure the diverse properties of proteins at high throughput. Currently, the most mature and versatile proteomic methods are based on mass spectrometry. For the study of complex protein mixtures, most methods follow a common outline. Sample proteins are proteolytically cleaved into smaller peptides. The resultant peptides are separated and analysed in a mass spectrometer, and the data are processed through a series of computer algorithms that determine the sequence identity of the proteins and, to some extent, their state of modification. Mass-spectrometry-based analyses are extremely accurate quantitatively, provided that suitable reference standards are available. These standards are usually generated by labelling a reference sample at a known concentration with stable isotopes, generating pairs of molecules that are chemically identical but of different mass, so that they can be distinguished by a mass spectrometer. Thus the signal-intensity ratio observed between the known standard and unknown sample versions of the same molecule provides an accurate measure of the abundance of each species in the original sample.
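The quantification step described above reduces to simple arithmetic: because the heavy-labelled standard is spiked in at a known amount, the light/heavy signal-intensity ratio scales directly to the analyte amount. A minimal sketch, with hypothetical intensities and amounts (the function name and units are illustrative, not from any real instrument software):

```python
# Minimal sketch of stable-isotope ratio quantification (hypothetical values).
# A reference peptide carrying a heavy isotope tag is spiked in at a known
# amount; the analyte carries the light form. The two forms are chemically
# identical, so their signal-intensity ratio reflects their abundance ratio.

def quantify(light_intensity: float, heavy_intensity: float,
             reference_fmol: float) -> float:
    """Estimate the analyte amount (fmol) from a light/heavy signal pair."""
    if heavy_intensity <= 0:
        raise ValueError("reference (heavy) signal must be positive")
    return (light_intensity / heavy_intensity) * reference_fmol

# Example: the analyte peak is 2.5x the 100-fmol reference peak.
amount = quantify(light_intensity=5.0e5, heavy_intensity=2.0e5,
                  reference_fmol=100.0)
print(round(amount, 1))  # 250.0
```

The accuracy of the estimate therefore rests entirely on how precisely the reference amount is known, which is why calibrated standards are central to the proposal below.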

As this rather complex process requires expensive instrumentation, information-technology infrastructure and highly specialized personnel, it is unlikely that it will spread widely from specialized proteomics laboratories to biological and clinical research communities. There are two other factors that make it highly doubtful that even large, specialized proteomics laboratories will achieve the automation and sample throughput that is required for undertaking huge (for example, population-based clinical) studies. First is the enormous complexity of the proteome — the human genome is assumed to contain around 30,000 genes, but it has been estimated that human serum alone contains in excess of 500,000 different protein species. Second, current mass-spectrometry-based proteomics operates in discovery mode, which requires the re-discovery of each protein species in each experiment. It is therefore essential to develop a robust proteome-browsing technology to meet the high expectations.

Current applications

In some areas of molecular biology research, the transition from a discovery to a browsing mode of operation is already in progress, notably in gene-expression array analysis and in the analysis of single nucleotide polymorphisms (SNPs). In the former, the discovery phase establishes the transcripts potentially generated by a genome (transcriptome) using experimental and computational methods. In the browsing phase, probes specific for each transcript are arranged in ordered microarrays; the transcripts of the genes expressed by a cell in a specific state are concurrently identified and quantified in a single experiment.

Comparisons of the transcript patterns obtained from cells representing different states are expected to provide detailed diagnostic patterns classifying cellular or pathological states and to provide new insights into the function and control of biological processes. This method has already yielded biologically and clinically significant results. For example, transcript profiling has been used for predicting clinical outcome and survival in breast cancer; for predicting survival after chemotherapy for diffuse large B-cell lymphoma; and for concurrently monitoring hundreds of cellular functions. In the discovery phase of SNP analysis, an international consortium of scientists is generating a comprehensive catalogue of this universe in the human population. In the browsing mode, called 'scoring', associations of specific patterns of SNPs with specific phenotypes are being established. Such patterns are thought to be useful for predicting the propensity for the onset of a variety of disease conditions.

For both gene-expression-array and SNP analyses, successful studies in the browsing mode were carried out before the discovery phase was complete. High-throughput analysis is inconceivable if the objects studied have to be discovered de novo each time. Segregation of the process into a discovery phase followed by a browsing phase was vital for the implementation of these two powerful genomic technologies and will be even more so if proteomics is to succeed.

My proposal for a browsing technology is conceptually simple. For each protein, protein isoform or specifically modified form of a protein, a peptide sequence that uniquely identifies that polypeptide is selected, chemically synthesized and labelled with a heavy stable-isotope tag. These peptides are the definitive markers for the proteins to be studied. Precisely measured amounts of these reference peptides are then added to a sample in which the proteins or peptides have been labelled with a light stable-isotope tag. The combined peptide sample can be separated reproducibly, and fractions deposited on the sample plate of a mass spectrometer, effectively generating an ordered peptide array. Each array element can then be interrogated by a mass spectrometer and will generate two types of signal: single peaks, for peptides to which no reference peptide has been added; and paired peaks, for peptides to which a reference peptide has been added, separated by a mass difference that precisely corresponds to the mass differential encoded in the stable-isotope tag. In this method, a protein is identified by correlating the position and the accurately measured mass of each isotope–peptide pair in the array. Proteins are quantified by determining the ratio of the signal intensity of a peptide derived from the protein mixture to that of the corresponding reference peptide.
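The browsing step above can be sketched in code: observed peaks are matched against a table of reference-peptide masses by the known tag mass difference, so identification is a look-up rather than a de novo search. All names, masses and amounts below are hypothetical illustrations, not real data:

```python
# Illustrative sketch of the proposed proteome-browsing step. A peptide paired
# with its heavy reference appears as two peaks separated by the tag's known
# mass difference; identification is a table look-up, not de novo discovery.

TAG_DELTA = 8.05   # assumed heavy/light mass difference (Da) of the isotope tag
TOL = 0.02         # assumed mass tolerance (Da)

# Hypothetical look-up table: heavy reference mass -> (protein, spiked fmol).
REFERENCE_TABLE = {
    1264.70: ("ProteinA", 100.0),
    1587.92: ("ProteinB", 100.0),
}

def browse(peaks):
    """peaks: list of (mass, intensity) pairs from one array element.
    Returns {protein: estimated amount in fmol} for matched peak pairs."""
    results = {}
    for mass, intensity in peaks:
        for ref_mass, (protein, fmol) in REFERENCE_TABLE.items():
            # The light analyte peak sits TAG_DELTA below its heavy reference.
            if abs(mass + TAG_DELTA - ref_mass) <= TOL:
                heavy = next((i for m, i in peaks
                              if abs(m - ref_mass) <= TOL), None)
                if heavy:
                    results[protein] = (intensity / heavy) * fmol
    return results

peaks = [(1256.65, 3.0e5), (1264.70, 1.0e5), (1579.87, 5.0e4), (1587.92, 1.0e5)]
print(browse(peaks))  # {'ProteinA': 300.0, 'ProteinB': 50.0}
```

The essential design point is that the reference table is built once, during the discovery phase, after which every subsequent experiment reduces to this kind of pattern matching.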

There are several advantages of this proteome-browsing method. First, one peptide is sufficient for the unambiguous identification and quantification of each protein. Therefore, the number of peptides that need to be analysed to identify and quantify the product of every gene approaches the number of genes in a genome. Second, data analysis becomes trivial, as each protein is identified and quantified by correlating the acquired data with a look-up table, rather than by de novo identification. Third, the method is easily standardized between laboratories. Fourth, the absolute quantity of each protein is determined, making data sets easily comparable. Fifth, any subset of proteins, for example proteins contained in organelles, subcellular fractions or differentiated cells, can be selectively interrogated. Sixth, splice isoforms and differentially modified or processed proteins can similarly be absolutely quantified, provided that appropriate reference peptides can be synthesized. Finally, the method is relatively cheap, as only minuscule (nanogram to sub-nanogram) amounts of the peptide standards are used per assay. However, as with other genomics technologies, the proposed proteomics technology requires a sizable initial investment of cost and labour that will pay dividends through the wide dissemination of a rapid, robust and simple quantitative technology.

An initial investment is required for the synthesis and calibration of thousands of isotopically labelled peptides. A project of this scope not only exceeds the ambitions of a typical laboratory, but mandates a collaborative approach, for if different research groups were to generate their own sets of reference peptides, correlation of data between studies would be difficult. I therefore call for a community-based effort or a public–private partnership for the design, synthesis and distribution of suitable sets of reference peptides. Emulating the process pioneered by the yeast community for the design of polymerase chain reaction primers for each gene, a set of unique peptide standards for each yeast protein could be generated at an estimated cost of $1–2 million and made generally accessible for the systematic study of yeast biology.

In the arena of clinical research, the quantitative profiling of serum proteins lends itself particularly well to an early application of the technology. Serum protein profiling holds enormous promise as a tool for the early detection and stratified diagnosis of disease, and for the assessment of the success of therapeutic regimens. In such studies, there are potentially thousands of samples that need to be profiled using reproducible and transparent methods of exquisite quantitative accuracy — attributes that match those of the proposed method. By means of a robust proteome-browsing technology, proteomics will be able to fulfil its potential.