By globally cataloging cellular protein content and state, proteomics promises
to complement genomics in drug discovery and basic research.
If there is one criticism that can be lodged against genomics as a tool
for drug discovery, it is that DNA sequence information provides only a static
snapshot of all the possible ways a cell might use its genes. In fact,
the life of a cell is a dynamic process in which it is constantly reacting
to its environment. If, for example, a disease-inducing element is introduced,
it may change how much gene product is made, when the genes are turned on,
the type and extent of post-translational modifications that occur, and how
these events affect other genes. These effects will determine whether the organism
successfully defends itself or succumbs to the disease.
Because the study of this dynamic has the potential to reveal new targets
for drug intervention in disease processes, emphasis is now being placed on
understanding how and when genome-encoded events (e.g., protein translation)
occur and what relationship non-genome-encoded events (e.g., posttranslational
modifications of proteins and interactions between proteins, nucleic acids,
lipids, carbohydrates, and combinations thereof) have to particular physiological
states. This endeavor is becoming known as proteomics because it focuses on
the protein products of the genome and their interactions rather than on simple
DNA sequence. It is being undertaken using powerful analytical tools, such
as two-dimensional electrophoresis (2-DE) and ultrasensitive mass spectrometry
(MS), coupled with high-throughput functional screening assays.
Historical perspective
From a technological standpoint, the essence of proteomics is protein characterization,
which has been a mainstay of traditional biochemistry since the beginning
of the twentieth century. In 1950, protein characterization acquired a very powerful
tool, known as the Edman degradation method for protein sequencing, enabling
laboratories around the world to systematically sequence proteins they could
isolate in pure enough form1. Many proteins were sequenced by
the Edman method, including hemoglobin, insulin, and myosin, and descendants
of the method are still very much in use today. In addition to protein sequencing
tools, the other central component of proteomics research is 2-DE, which is
capable of resolving total protein extracts from cells into about 10,000 individual
protein spots. First described 25 years ago2,3, the technique has advanced
dramatically as a result of its coupling with MS and of
the development of appropriate software for the analysis of resolved proteins
and protein fragments in high-throughput modes.
The word proteome was first introduced in July 1995 and was defined as
the "total protein complement of a genome."4 This
original definition summarized what was known about protein expression in
relationship to the genome by pointing out that, because not all encoded proteins
are expressed at any one time, the pattern of protein expression changes
depending on factors such as the stage of development of an organism and the
organism's physiological state. It went on to hypothesize that the more
complex the genome, the less of the total possible proteome will be expressed
at any particular moment. Linking expression of the proteome to physiological
changes associated with healthy or diseased conditions would then be a new
way to identify clinically relevant molecular disease targets and to develop
novel drugs against them.
The use of proteomics to identify drug leads need not be based only on
human protein expression patterns. Large-scale sequencing efforts of bacterial,
viral, yeast, and other higher-organism genomes provide important information
about the life cycle of the invading organism and where its weaknesses lie.
For example, the 1995 sequencing of the entire genome of Haemophilus influenzae
is likely to provide a foundation for proteomics in the study of bronchial
infections5.
Since the sequencing of H. influenzae, 31 nonhuman genomes have
been deciphered and 59 are underway. For example, the entire genomes of pathogens
such as Staphylococcus aureus or of extremophiles such as Aquifex
(a single-cell organism that lives at 90°C and may harbor novel enzymes
for industrial applications) are being sequenced, the intention being to mine
the genomic information using proteomic data that will help develop novel
antibiotics, anticancer agents, industrial catalysts, and other types of desired
molecules6. The success of these ongoing efforts is a key catalytic
component of proteomics because it enables the seamless linkage of gene sequence
and expressed phenotype in these organisms under varying physiological states.
Current state
The workhorse for obtaining protein expression patterns from cells and
tissues is 2-DE. In high-format mode, this powerful technique produces gels
containing up to 10,000 distinct protein and peptide spots7.
The major problem with this technique is that over 95% of the spots cannot
be sequenced because they are beyond the limits of current high-sensitivity
Edman sequencers. By comparison, standard-format 2-D gels yield up to about
2,000 spots, which can all be sequenced by Edman methods4 or
attomole-range MS8. These MS methods have been applied successfully
to the development of an alternative to Edman sequencing, the
so-called peptide mass fingerprinting approach. In this technique, proteins
and peptides are digested either chemically or enzymatically to produce a
unique degradation fingerprint that can then be analyzed by MS9.
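The digestion step can be illustrated in silico. The following is a minimal sketch, not the procedure used in the cited work: it assumes trypsin as the enzyme (cleavage after lysine or arginine unless the next residue is proline), uses standard monoisotopic residue masses, and ignores missed cleavages and post-translational modifications. The function names and the example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the residues of each free peptide


def tryptic_digest(sequence):
    """Split a protein after every K or R that is not followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mass_fingerprint(sequence):
    """Return (peptide, mass) pairs for a tryptic digest, lightest first."""
    pairs = [(p, round(peptide_mass(p), 4)) for p in tryptic_digest(sequence)]
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical sequence, used only to show the shape of the output.
for peptide, mass in mass_fingerprint("GASKPRAGKWTFER"):
    print(peptide, mass)
```

The list of peptide masses is the "fingerprint": two proteins with different sequences will generally yield different sets of masses, which is what makes the pattern usable for identification.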
Analysis of the data, whether generated by 2-DE or MS, is also being streamlined.
At present, 2-DE gel patterns are scanned into a computer and then analyzed
by computer algorithms that quantify the different gel patterns that arise
when the proteome complement of a cell is obtained under normal or physiologically
altered states10. Databases of 2-D gels obtained from cells
or body fluids under varying conditions are now readily available over the
Internet. Examples include the SwissProt Swiss-2DPAGE and Swiss-2DISEASE databases11, which include gel patterns obtained from renal cells in renal
failure, myeloma cells, liver cells, and many others. Computer algorithms
are also used to analyze the data from peptide mass fingerprints. One method
relies on comparing the actual MS spectrum obtained from a test sample to
a database of predicted spectra, and can accomplish this comparison in a high-throughput
mode12.
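The database comparison can be sketched in a few lines. This is an illustration under simplifying assumptions, not the algorithm of any published tool: each database entry is reduced to a list of predicted peptide masses, a candidate is scored by the number of observed masses that fall within a fixed tolerance of a predicted mass, and the protein names and mass values are hypothetical.

```python
def count_matches(observed, predicted, tolerance=0.5):
    """Count observed masses lying within `tolerance` Da of any predicted mass."""
    return sum(
        any(abs(obs - pred) <= tolerance for pred in predicted)
        for obs in observed
    )


def rank_candidates(observed, database, tolerance=0.5):
    """Score every database entry and return (name, score) pairs, best first."""
    scores = {
        name: count_matches(observed, masses, tolerance)
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical predicted peptide masses for three database entries.
database = {
    "protein_A": [500.3, 842.5, 1045.6, 1479.8],
    "protein_B": [612.4, 842.5, 1200.7],
    "protein_C": [455.2, 730.4],
}

# Hypothetical masses read from the spectrum of an unknown gel spot.
observed = [842.4, 1045.7, 1479.9, 2211.1]

ranking = rank_candidates(observed, database)
print(ranking)
```

A simple match count is easy to compute for thousands of spectra, which is the point of the high-throughput mode; practical search software uses more elaborate scoring to account for peptide length, database size, and chance matches.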
Industry challenges
The biggest challenges faced by the proteomics industry are technical and
validational. The technical challenges revolve around the ability to resolve
reproducibly and accurately the 10,000 proteins and peptides obtained from
whole cell extracts by large-format 2-DE. Various electrophoresis methods
are constantly being refined to accomplish this goal, including isoelectric
focusing followed by mass-based separation (ISO-DALT), nonequilibrium pH gradient
electrophoresis (NEPHGE), and immobilized first-dimension pH gradients (IPG-DALT)13.
Resolution of very complex protein mixtures is crucial to the success of
proteomics, and this must be coupled to the accurate sequencing of the proteins
and peptides in the mixtures. This is the domain of high-throughput MS and
interpretation software. Methods such as so-called tandem MS, electrospray
ionization MS, and matrix-assisted laser desorption/ionization MS are increasing
the sensitivity and versatility of MS as the method of choice for identifying
proteins and peptides in mixtures. This is also helped by parallel developments
in MS interpretation software, such as SEQUEST, which enables the high-throughput
analysis of MS spectra of unknown samples against known and predicted standards14.
Because proteomics is considered a powerful platform for novel drug development,
a number of companies are working in this area to meet these challenges (see Table 1). The January 1998 strategic alliance between Incyte
(Palo Alto, CA) and Oxford GlycoSciences demonstrates how established biotechnology
companies can leverage their unique capabilities to generate a system capable
of handling the complexity of interpreting these data. In this case, Incyte's
genomics databases were linked to Oxford GlycoSciences' proteomics databases.
Table 1. Selected companies with programs in proteomics
The future
According to databases based on expressed sequence tags, the human genome
consists of about 60,000−100,000 genes, scattered among 3 billion nucleotides
of chromosome-based DNA code, the sequencing of which has been essentially
completed. This represents an enormous amount of static information that needs
to be correlated with dynamic information about gene products and their interactions.
Proteomics will provide methods to correlate the vast amount of genomics information
that is becoming available with the equally vast protein information that
is being produced through analysis of cells under normal versus altered states.
The key here is high throughput, and perhaps the most promising advances
in proteomics today are being carried out not just at the MS level, but also
at the algorithm level. It is now possible to automate the acquisition of
hundreds to thousands of mass spectra from peptides resolved by 2-DE and additional
capillary electrophoresis methods. These spectra can be analyzed automatically
for the presence of peptide fragments that are then used to reconstruct parent
proteins, in a method akin to that used for the large-scale sequencing of
genomes13. The future of proteomics should see continuous improvement
in this methodology and its seamless linkage to genomics information if rapid
progress is to be made.
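The reconstruction step, in which identified peptide fragments are assigned back to parent proteins, can be sketched as follows. This is a simplified illustration rather than the method of the cited work: it assumes the peptides have already been identified as sequences, matches them to candidate proteins by exact substring search, and reports the fraction of each protein covered. All names and sequences are hypothetical.

```python
def assign_peptides(peptides, proteins):
    """Map each protein name to the identified peptides found in its sequence."""
    return {
        name: [p for p in peptides if p and p in sequence]
        for name, sequence in proteins.items()
    }


def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence, including overlaps
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Hypothetical candidate proteins and identified peptides.
proteins = {"p1": "MKTAYIAKQR", "p2": "GGQRGG"}
peptides = ["MK", "QR", "ZZ"]

for name, hits in assign_peptides(peptides, proteins).items():
    print(name, hits, round(sequence_coverage(proteins[name], hits), 2))
```

A peptide such as "QR" that occurs in more than one candidate is assigned to both, which shows why short or shared fragments are weak evidence on their own and why coverage across several peptides is a more useful indicator.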
As proteomics matures, it is inevitable that it will form connections with
other emerging fields, and one of the most closely related is that of structural
genomics, which links gene sequences to specific proteins and structures15. For example, a recent structural genomics report describes how
protein folds in model organisms compare to each other, with the worm being
more closely related in terms of protein fold patterns to yeast than to
Escherichia coli16. These approaches may also help identify
open reading frames of unknown fold, thus enriching proteome databases of
these model organisms, and are likely to be extended to the human genome as
well.
The future of proteomics will also see the development of specialist disciplines
within the larger field, one being subproteomics. This focuses on the proteomes
of specific cellular locations, on the proteomes obtained by fractionation
based on solubility, and in general on specific proteomes obtained by any
method that simplifies the complex protein load of a cell17.
This powerful approach promises to refine our understanding of protein expression
and dynamics relative to the exact state of a cell, including its sub-cellular
compartments.
Finally, proteomics will benefit in the future from continuous improvements
in the various techniques used to obtain and analyze proteomes18.
New methods that complement the traditional 2-D gels and that have very significant
promise are the isotope-coded affinity tags approach19, two-dimensional
liquid chromatography-tandem mass spectrometry20, and head column
stacking capillary zone electrophoresis21.
Conclusions
Proteomics aims to supplement gene sequence data with information on what
proteins are being made where, in what amounts, and under what conditions.
It aims to show how protein cascades inside cells change as a result of specific
diseases, thereby identifying novel potential drug targets. It then aims to
validate particular drug leads against those targets by providing information
on how those leads affect the proteome cascades (see Lead validation, pp. 47−49). Therefore, in addition
to providing answers to fundamental questions about the molecular basis of
a cell's state at any point in time, proteomics promises to accelerate
novel drug discovery through automated analysis of clinically relevant molecular
phenomena.
Reprinted from Nature Biotechnology 16, 393−394
(1998).