Here I describe the Human Genome Project (HGP) as five overlapping stages, defined mostly by their biomedical goals. My somewhat arbitrary staging is informed by what I perceive as a steady turnover of ideas about a "program for life": what it means and where it is located. Initially the program was assumed to be located in the genome and to be isomorphic with phenotype, but this location is gradually being transferred, through intermediate stages, to the level of the organism itself. Each succeeding stage of the HGP is driven by results not anticipated in the prior stage.

While the HGP is not yet complete, enough data have been collected from it and from other genome projects (mouse, worm, fly) to allow some tentative conclusions: (1) There is not sufficient information in genomic databases to provide explanations for complex functional attributes of cells and organisms. (2) Therefore, there must be other informational systems and operating rules that complement genomic systems. (3) Epigenesis is identified as one such system. (4) Program rules by which regulation is produced are extragenomic and are most likely to be found not in molecular mechanisms per se, but in their integration into complex gene circuits and, more peripherally, into their connectedness with regulatory networks (metabolic and other) of cellular dimensions.

Evidence establishing epigenetic networks as sites of control over cellular phenotype, including control over covalent marking of DNA and chromatin (the phenotype of the genotype)1, has shifted attention from a narrow focus on DNA to the more complex dynamics of gene circuits and their integration into larger, environmentally open networks of cellular dimensions2. This change in emphasis from linear "causal" molecules to the regulatory dynamics of molecular networks is recognized by a growing number of molecular biologists and geneticists working within the various genome projects3, as well as by researchers in biochemistry4, integrative biology and physiology5, developmental biology6, and medical7 and behavioral genetics8.

The following paragraphs summarize the five stages of the HGP. Stages I (monogenic causality) and II (polygenic causality) deal, respectively, with rare monogenic diseases and with polygenic diseases associated with an unknown complex of genes coupled to individual experience and environmental history. Because monogenic diseases account for only a small fraction of noninfectious diseases (2%)8, the emphasis has shifted toward stage II, in which the goal is to focus on candidate genes or small clusters of key genes among the many (possibly tens, hundreds, or thousands) involved in producing a chronic disease or any other complex phenotype. Gene maps for each physiologic function and disease will begin to close the gap between genetic information and functional outcome in cells and organisms. However, as noted previously, these maps mostly assume additive and dominant effects and do not include dynamic rules governing deployment, interaction (epistasis), redundancy (pleiotropy), and connectedness of these genes.
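The limitation of purely additive gene maps can be illustrated with a toy calculation. The sketch below is a minimal illustration under stated assumptions, not a model of any real disease: two hypothetical biallelic loci, an additive phenotype with invented effect sizes, and an epistatic phenotype that is expressed only when the two loci differ. All function names and numbers are assumptions introduced here.

```python
def additive(g1, g2, e1=1.0, e2=2.0):
    """Additive map: each locus contributes a fixed, independent effect.
    The effect sizes e1 and e2 are arbitrary illustrative values."""
    return e1 * g1 + e2 * g2


def epistatic(g1, g2):
    """Purely interactive map: the phenotype appears only when the
    two loci carry different alleles (an XOR-like rule)."""
    return float(g1 != g2)


def marginal_effect(phenotype, locus):
    """Average change in phenotype when `locus` switches 0 -> 1,
    averaged over both alleles at the other locus."""
    diffs = []
    for other in (0, 1):
        on = (1, other) if locus == 0 else (other, 1)
        off = (0, other) if locus == 0 else (other, 0)
        diffs.append(phenotype(*on) - phenotype(*off))
    return sum(diffs) / len(diffs)


additive_effects = [marginal_effect(additive, i) for i in (0, 1)]
epistatic_effects = [marginal_effect(epistatic, i) for i in (0, 1)]

print("additive marginal effects: ", additive_effects)
print("epistatic marginal effects:", epistatic_effects)
```

Under the additive map, each locus shows its own effect when examined alone. Under the epistatic map, both single-locus effects average to zero even though the two loci jointly determine the phenotype completely, so a map built one gene at a time would register neither locus.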

The transition to stage III, analysis of the proteome (the entire protein complement of a genome), focuses on expressed genes (proteins), thereby avoiding some of the problems associated with genomic complexity. However, as in stage II, the approach still relies on describing large numbers of additive agents (proteins) without recourse to rules of interaction, redundancy, and connectedness.

Transgenic analysis (stage IV), in acknowledging the many problems just outlined for the earlier stages, will rely on the normal dynamics of the organism in conjunction with gene transfer between species to produce "novel" phenotypes and thereby reveal programmatic aspects of morphogenetic processes. But here, too, there are unexplored areas, as with pleiotropic genes and proteins, in which developmental and other higher levels of organization remain to be taken into account4. Therefore, with this strategy, although detailed genetic maps for a variety of cellular functions will be established, the nature of the processes being perturbed by gene manipulation remains a black box.

The fifth and final stage, complexity, is the logical though unpredictable extension of stage IV, and perhaps represents a new approach recognizing that higher levels of cellular organization and regulation impose constraints on the genome, and that genes and environments are inseparably integrated. Epigenetic regulation of the genome1 is seen as the most proximal of a hierarchy of constraints extending outward from DNA structure to the cell boundary and beyond. This stage of the project is now engaged with describing the molecular events involved in DNA and chromatin marking and seeks to understand, among other things, how marking constrains and orders patterns of gene expression. While these studies remain descriptive, the next logical stage is already under way. Connectedness is restated in terms of gene circuits2 or metabolic networks4, so that the large amount of information inherent in systems of genetic or biochemical activity may be collapsed into a logic of circuits and networks.
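What it might mean to collapse genetic activity into a logic of circuits can be sketched with a Boolean toy model. The example below assumes a hypothetical two-gene circuit in which each gene represses the other, updated synchronously. It is not drawn from the studies cited above; the rule and all names are illustrative assumptions.

```python
from itertools import product


def step(state):
    """One synchronous update of a hypothetical mutual-repression circuit:
    each gene is ON (1) in the next step only if its repressor is OFF (0)."""
    a, b = state
    return (int(not b), int(not a))


# Enumerate all four expression states and keep those the rule maps to itself.
all_states = list(product((0, 1), repeat=2))
fixed_points = [s for s in all_states if step(s) == s]

for s in all_states:
    print(s, "->", step(s))
print("stable expression states:", fixed_points)
```

The same two genes and the same rule admit two distinct stable expression states, (0, 1) and (1, 0). Which one a cell occupies depends on its history rather than on its sequence, which is the sense in which circuit logic adds information not present in the genomic database alone.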

The main message here for a biotechnology devoted to finding specific causes and cures for complex diseases is that the desired specificity is severely compromised by a profound genetic, molecular, and informational complexity. For example, coronary artery disease involves several hundred genes. A complex disease like colon cancer is now acknowledged to include not only large-scale mutation but also profound changes in patterns of gene expression9. Genetic instability in the forms of loss of heterozygosity10 and aneuploidy11 also complicates the simple single or even multiple gene mutation theories of cancer12. Considering in addition the classical but mostly unrecognized uncertainties inherent in widespread epistasis and pleiotropy, the present emphasis on dominant gene effects and on single-gene or protein-based diagnosis and therapy for common human diseases must be seen as unrealistic. The gene–protein circuits and network logic studies cited earlier represent some starting points for the development of new understanding and new technologies for complex human phenotypes.