“Though facts are inherently less satisfying than the intellectual conclusions drawn from them, their importance should never be questioned.” James D. Watson, 2002.

DNA carries all of the genetic information for life. One enormously long DNA molecule forms each of the chromosomes of an organism, 23 of them in a human. The fundamental living unit is the single cell. A cell gives rise to many more cells through serial repetitions of a process known as cell division. Before each division, new copies must be made of each of the many molecules that form the cell, including the duplication of all DNA molecules. DNA replication is the name given to this duplication process, which enables an organism's genetic information — its genes — to be passed to the two daughter cells created when a cell divides. Only slightly less central to life is a process that requires dynamic DNA acrobatics, called homologous DNA recombination, which reshuffles the genes on chromosomes. In reactions closely linked to DNA replication, the recombination machinery also repairs damage that inevitably occurs to the long, fragile DNA molecules inside cells (see article in this issue by Friedberg, page 436).

The model for the DNA double helix1 proposed by James Watson and Francis Crick is based on two paired DNA strands that are complementary in their nucleotide sequence. The model had striking implications for the processes of DNA replication and DNA recombination. Before 1953, there had been no meaningful way of even speculating about the molecular mechanisms of these two central genetic processes. But the proposal that each nucleotide in one strand of DNA was tightly base-paired with its complementary nucleotide on the opposite strand — either adenine (A) with thymine (T), or guanine (G) with cytosine (C) — meant that any part of the nucleotide sequence could act as a direct template for the corresponding portion of the other strand. As a result, any part of the sequence can be used either to create or to recognize its partner nucleotide sequence — the two functions that are central for DNA replication and DNA recombination, respectively.
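The template idea can be made concrete with a few lines of code. The following is a minimal sketch, assuming DNA strands are represented as plain strings over A, T, G and C written in the 5′ to 3′ direction; the function names are illustrative only and do not come from the original work. It shows the two uses of complementarity described above: creating a partner sequence and recognizing one.

```python
# Watson-Crick pairing rules stated in the text: A with T, G with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base at each position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Create the partner strand, written 5' to 3'.

    The two strands of the helix run in opposite chemical directions,
    so the partner is the complement read backwards.
    """
    return complement(strand)[::-1]


def is_partner(strand_a: str, strand_b: str) -> bool:
    """Recognize whether strand_b can pair with strand_a along its full length."""
    return strand_b == reverse_complement(strand_a)


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(is_partner("ATGC", "GCAT"))
```

Creating a partner corresponds to replication, and testing for a partner corresponds to the sequence recognition used in recombination.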

In this review, I discuss how the discovery of the structure of DNA half a century ago opened new avenues for understanding the processes of DNA replication and recombination. I shall also emphasize how, as our understanding of complex biological molecules and their interactions increased over the years, there have been profound changes in the way that biologists view the chemistry of life.

Structural features of DNA

The research that immediately followed the discovery of the double helix focused primarily on understanding the structural properties of the molecule. DNA specifies RNA through the process of gene transcription, and the RNA molecules in turn specify all of the proteins of a cell. This is the 'central dogma' of genetic information transfer2. Any read-out of genetic information — whether it be during DNA replication or gene transcription — requires access to the sequence of the bases buried in the interior of the double helix. DNA strand separation is therefore critical to DNA function. Thus, the Watson–Crick model drove scientists to a search for conditions that would disrupt the hydrogen bonds joining the complementary base pairs, so as to separate the two strands of the DNA double helix.

Physical chemists found that heating a solution of DNA to temperatures near boiling (100 °C), or subjecting it to extremes of pH, would cause the strands to separate — a change termed 'DNA denaturation'. The so-called 'melting temperature' (or Tm) of a stretch of DNA sequence depends on its nucleotide composition: those DNAs with a larger proportion of G–C base pairs exhibit a higher Tm because of the three hydrogen bonds that Watson and Crick had predicted to hold a G–C base pair together, compared with only two for the A–T base pair. At physiological salt concentrations, the Tm of mammalian DNA is nearly 90 °C, owing to the particular mix of its base pairs (47% G–C and 53% A–T)3.
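The dependence of Tm on base composition can be illustrated numerically. The sketch below uses the empirical linear relation commonly attributed to Marmur and Doty for long DNA in standard saline citrate, Tm = 69.3 + 0.41 × (%G–C). That formula and its coefficients are not stated in this article and are brought in here as an assumption; the function names are likewise hypothetical.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a DNA sequence."""
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def melting_temperature_c(percent_gc: float) -> float:
    """Approximate Tm in degrees Celsius for long DNA.

    ASSUMPTION: empirical linear fit (Marmur-Doty form), valid only for
    long molecules at roughly physiological salt; not taken from the article.
    """
    return 69.3 + 0.41 * percent_gc


if __name__ == "__main__":
    # Composition quoted in the text for mammalian DNA: 47% G-C.
    print(round(melting_temperature_c(47.0), 1))
    # A short toy sequence with half G-C content.
    print(gc_percent("GGCCAATT"))
```

Under this assumed relation, a 47% G–C composition gives a value in the high 80s °C, in line with the figure of nearly 90 °C quoted above.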

Initially it seemed inconceivable that, once separated from its complementary partner, a DNA strand could reform a double helix again. In a complex mixture of DNA molecules, such a feat would require finding the one sequence match amongst millions during random collisions with other sequences, and then rapidly rewinding with a new partner strand. The dramatic discovery of this unexpected phenomenon4, called 'DNA renaturation', shed light on how sequences could be rearranged by DNA recombination. And it also provided a critical means by which DNA could be manipulated in the laboratory. The annealing of complementary nucleotide sequences, a process called hybridization, forms the basis of several DNA technologies that helped launch the biotechnology industry and modern genomics. These include gene cloning, genomic sequencing, and DNA copying by the polymerase chain reaction (see article by Hood and Galas on page 444).

The arrangement of DNA molecules in chromosomes presented another mystery for scientists: a long, thin molecule would be highly sensitive to shear-induced breakage, and it was hard to imagine that a mammalian chromosome might contain only a single DNA molecule. This would require that a typical chromosome be formed from a continuous DNA helix more than 100 million nucleotide pairs long — a massive molecule weighing more than 100 billion daltons, with an end-to-end distance of more than 3 cm. How could such a giant molecule be protected from accidental fragmentation in a cell only microns in diameter, while keeping it organized for efficient gene readout and other genetic functions?
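The end-to-end distance quoted above follows from simple arithmetic. The sketch below is a rough check, assuming a rise of 0.34 nm per nucleotide pair along the helix axis (the standard B-form value, which is not given in this article) and a hypothetical cell diameter of 10 µm chosen only to illustrate the scale mismatch.

```python
# ASSUMPTION: axial rise per nucleotide pair for B-form DNA, in nanometres.
RISE_PER_PAIR_NM = 0.34
NM_PER_CM = 1e7


def contour_length_cm(nucleotide_pairs: float) -> float:
    """End-to-end length of a fully extended double helix, in centimetres."""
    return nucleotide_pairs * RISE_PER_PAIR_NM / NM_PER_CM


if __name__ == "__main__":
    length_cm = contour_length_cm(100_000_000)  # 100 million pairs, as in the text
    print(length_cm)

    # HYPOTHETICAL cell diameter of 10 micrometres (1e-3 cm), for scale only.
    cell_diameter_cm = 1e-3
    print(length_cm / cell_diameter_cm)
```

On these assumptions the extended molecule is a few thousand times longer than the cell that contains it, which is the packaging problem the paragraph describes.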

There was no precedent for such giant molecules outside the world of biology. But in the early 1960s, autoradiographic studies revealed that the chromosome of the bacterium Escherichia coli was in fact a single DNA molecule, more than 3 million nucleotide pairs in length5. And when — more than a decade later — innovative physical techniques demonstrated that a single huge DNA molecule formed the basis for each mammalian chromosome6, the result was welcomed by scientists with little surprise.

DNA replication forks

How is the enormously long double-stranded DNA molecule that forms a chromosome accurately copied to produce a second identical chromosome each time a cell divides? The template model for DNA replication, proposed by Watson and Crick in 1953 (ref. 7), gained universal acceptance after two discoveries in the late 1950s. One was an elegant experiment using density-labelled bacterial DNAs that confirmed the predicted template–anti-template scheme8. The other was the discovery of an enzyme called DNA polymerase, which uses one strand of DNA as a template to synthesize a new complementary strand9. Four deoxyribonucleoside triphosphate nucleotides — dATP, dTTP, dGTP and dCTP — are the precursors to a new daughter DNA strand, each nucleotide selected by pairing with its complementary nucleotide (T, A, C or G, respectively) on the parental template strand. The DNA polymerase was shown to use these triphosphates to add nucleotides one at a time to the 3′ end of the newly synthesized DNA molecule, thereby catalysing DNA chain growth in the 5′ to 3′ chemical direction.
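The polymerase reaction described here can be sketched as a loop that reads a template and extends a new strand one nucleotide at a time. This is a toy illustration under stated assumptions, not a model of the enzyme: strands are plain strings, the template is supplied in its 3′ to 5′ reading order, and the function name is hypothetical.

```python
# Base-pairing rules used to select each incoming nucleotide.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize(template_3_to_5: str) -> str:
    """Grow a daughter strand in the 5' to 3' direction.

    The template is read from its 3' end towards its 5' end. At each step
    the nucleotide complementary to the current template base is selected
    and appended to the 3' end of the growing strand.
    """
    new_strand = ""
    for template_base in template_3_to_5:
        incoming = PAIRS[template_base]  # dATP, dTTP, dGTP or dCTP chosen by pairing
        new_strand += incoming           # chain grows only at its 3' end
    return new_strand


if __name__ == "__main__":
    print(synthesize("TACG"))
```

Because nucleotides are only ever appended to the 3′ end, the sketch cannot grow a strand in the 3′ to 5′ direction, which mirrors the constraint on the real enzyme.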

Although the synthesis of short stretches of DNA sequence on a single-stranded template could be demonstrated in a test tube, how an enormous, twisted double-stranded DNA molecule is replicated was a puzzle. Inside the cell, DNA replication was observed to occur at a Y-shaped structure, called a 'replication fork', which moves steadily along a parental DNA helix, spinning out two daughter DNA helices behind it (the two arms of the 'Y')5. As predicted by Watson and Crick, the two strands of the double helix run in opposite chemical directions. Therefore, as a replication fork moves, DNA polymerase can move continuously along only one arm of the Y — the arm on which the new daughter strand is being elongated in the 5′ to 3′ chemical direction. On the other arm, the new daughter strand would need to be produced in the opposite, 3′ to 5′ chemical direction (Fig. 1a). So, whereas Watson and Crick's central predictions were confirmed at the end of the first decade of research that followed their landmark discovery, the details of the DNA replication process remained a mystery.

Figure 1: The DNA replication fork.

a, Nucleoside triphosphates serve as a substrate for DNA polymerase, according to the mechanism shown on the top strand. Each nucleoside triphosphate is made up of three phosphates (represented here by yellow spheres), a deoxyribose sugar (beige rectangle) and one of four bases (differently coloured cylinders). The three phosphates are joined to each other by high-energy bonds, and the cleavage of these bonds during the polymerization reaction releases the free energy needed to drive the incorporation of each nucleotide into the growing DNA chain. The reaction shown on the bottom strand, which would cause DNA chain growth in the 3′ to 5′ chemical direction, does not occur in nature. b, DNA polymerases catalyse chain growth only in the 5′ to 3′ chemical direction, but both new daughter strands grow at the fork, so a dilemma of the 1960s was how the bottom strand in this diagram was synthesized. The asymmetric nature of the replication fork was recognized by the early 1970s: the 'leading strand' grows continuously, whereas the 'lagging strand' is synthesized by a DNA polymerase through the backstitching mechanism illustrated. Thus, both strands are produced by DNA synthesis in the 5′ to 3′ direction. (Redrawn from ref. 27, with permission.)

Reconstructing replication

The mystery was solved over the course of the next two decades, a period in which the proteins that constitute the central players in the DNA replication process were identified. Scientists used a variety of experimental approaches to identify an ever-growing set of gene products thought to be critical for DNA replication. For example, mutant organisms were identified in which DNA replication was defective, and genetic techniques could then be used to identify specific sets of genes required for the replication process10,11,12. With the aid of the proteins specified by these genes, 'cell-free' systems were established, where the process was re-created in vitro using purified components. Initially, proteins were tested in a 'partial replication reaction', where only a subset of the protein machinery required for the full replication process was present, and the DNA template was provided in a single-stranded form13. New proteins that were identified were added one at a time or in combination to test their effects on the catalytic activity of DNA polymerase. Further advances in understanding replication then depended on creating more complex in vitro systems, in which, through the addition of a larger set of purified proteins, double-stranded DNA could eventually be replicated14,15.

Today, nearly every process inside cells — from DNA replication and recombination to membrane vesicle transport — is being studied in an in vitro system reconstructed from purified components. Although laborious to establish, such systems enable the precise control of both the concentration and the detailed structure of each component. Moreover, the 'noise' in the natural system caused by side reactions — because most molecules in a cell are engaged in more than one type of reaction — is avoided by eliminating the proteins that catalyse these other reactions. In essence, a small fraction of the cell can be re-created as a bounded set of chemical reactions, making it fully amenable to precise study using all of the tools of physics and chemistry.

By 1980, multiprotein in vitro systems had enabled a detailed characterization of the replication machinery and solved the problem of how DNA is synthesized on both sides of the replication fork (Fig. 1b). One daughter DNA strand is synthesized continuously by a DNA polymerase molecule moving along the 'leading strand', while a second DNA polymerase molecule on the 'lagging strand' produces a long series of fragments (called Okazaki fragments)16 which are joined together by the enzyme DNA ligase to produce a continuous DNA strand. As might be expected, there is a difference in the proteins required for leading- and lagging-strand DNA synthesis (see Box 1). Remarkably, the replication forks formed in these artificial systems could be shown to move at the same rapid rates as the forks inside cells (500 to 1,000 nucleotides per second), and the DNA template was copied with incredibly high fidelity15.
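The asymmetry of the fork can be captured in a short sketch. The code below is a simplified illustration, assuming the parental top strand is a string written 5′ to 3′ with the fork moving left to right; the fragment length is a hypothetical parameter, and RNA primers and all accessory proteins are omitted. The leading strand is copied in one pass, whereas the lagging strand is made as separate fragments, each synthesized 5′ to 3′, that are then joined in a step standing in for DNA ligase.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Partner strand written 5' to 3'."""
    return "".join(PAIRS[base] for base in reversed(strand))


def replicate_fork(top_5_to_3: str, fragment_len: int = 4):
    """Return (leading strand, Okazaki-like fragments, joined lagging strand).

    fragment_len is a HYPOTHETICAL value chosen for illustration only.
    """
    # The bottom parental strand, read 3' to 5' from left to right.
    bottom_3_to_5 = "".join(PAIRS[base] for base in top_5_to_3)

    # Leading strand: copied continuously from the bottom template,
    # growing 5' to 3' in the direction of fork movement.
    leading = "".join(PAIRS[base] for base in bottom_3_to_5)

    # Lagging strand: each newly exposed window of the top template is
    # copied backwards (away from the fork), still in the 5' to 3' direction.
    fragments = []
    for start in range(0, len(top_5_to_3), fragment_len):
        window = top_5_to_3[start:start + fragment_len]
        fragments.append(reverse_complement(window))

    # Joining step (the role of DNA ligase): the most recent fragment lies
    # at the 5' end of the finished lagging strand.
    lagging = "".join(reversed(fragments))
    return leading, fragments, lagging


if __name__ == "__main__":
    leading, fragments, lagging = replicate_fork("ATGCCGTTAG")
    print(leading)
    print(fragments)
    print(lagging)
```

Both daughter strands come out identical to what a single continuous copy would give, even though the lagging strand was assembled from pieces made in the opposite direction to fork movement.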

As more and more proteins were found to function at the replication fork, comparisons could be made between the replication machinery of different organisms. Studies of the replication machinery in viruses, bacteria and eukaryotes revealed that a common set of protein activities drives the replication forks in each organism (Box 1). Each system consists of: a leading- and a lagging-strand DNA polymerase molecule; a DNA primase to produce the RNA primers that start each Okazaki fragment; single-strand DNA binding proteins that coat the template DNA and hold it in position; a DNA helicase that unwinds the double helix; and additional polymerase accessory proteins that tie the polymerases to each other and to the DNA template. As one progresses from a simple virus to more complex organisms, such as yeasts or mammals, the number of subunits that make up each type of protein activity tends to increase. For example, the total number of polypeptide subunits that form the core of the replication apparatus increases from four and seven in bacteriophages T7 and T4, respectively, to 13 in the bacterium E. coli. And it expands to at least 27 in the yeast Saccharomyces cerevisiae and in mammals. Thus, as organisms with larger genomes evolved, the replication machinery added new protein subunits, without any change in the basic mechanisms15,18,19,20.

While the work I have described on DNA replication was advancing, other groups of researchers were establishing in vitro systems in which homologous DNA recombination could be reconstructed. The central player in these reactions was the RecA type of protein17, named after the bacterial mutant defective in recombination that led to its discovery (Box 2).

Protein machines

As for all other aspects of cell biochemistry, the DNA replication apparatus has evolved over billions of years through 'trial and error', that is, by random variation followed by natural selection. With time, one protein after another could be added to the mix of proteins active at the replication fork, presumably because the new protein increased the speed, control or accuracy of the overall replication process. In addition, the structure of each protein was fine-tuned by mutations that altered its amino acid sequence so as to increase its effectiveness. The end results of this unusual engineering process are the replication systems that we observe today in different organisms. The mechanism of DNA replication might therefore be expected to be highly dependent on random past events. But did evolution select for whatever works, with no need for elegance?

For the first 30 years after Watson and Crick's discovery, most researchers seemed to hold the view that cell processes could be sloppy. This view was encouraged by knowledge of the tremendous speed of movements at the molecular level (for example, it was known that a typical protein collides with a second molecule present at a concentration of 1 mM about 10^6 times per second). The rapid rates of molecular movement were thought initially to allow a process like DNA replication to occur without any organization of the proteins involved in three-dimensional space.

Quite to the contrary, molecular biologists now recognize that evolution has selected for highly ordered systems. Thus, for example, not only are the parts of the replication machinery held together in precise alignments to optimize their mutual interactions, but energy-driven changes in protein conformations are used to generate coordinated movements. This ensures that each of the successive steps in a complex process like DNA replication is closely coordinated with the next one. The result is an assembly that can be viewed as a 'protein machine'. For example, the DNA polymerase molecule on the lagging side of the replication fork remains bound to the leading-strand DNA polymerase molecule to ensure that the same lagging-strand polymerase is used over and over again for efficient synthesis of Okazaki fragments18,20,21 (Box 1). And DNA replication is by no means unique. We now believe that nearly every biological process is catalysed by a set of ten or more spatially positioned, interacting proteins that undergo highly ordered movements in a machine-like assembly22.

Protein machines generally form at specific sites in response to particular signals, and this is particularly true for protein machines that act on DNA. The replication, repair and recombination of the DNA double helix are often considered as separate, isolated processes. But inside the cell, the same DNA molecule is able to undergo any one of these reactions. Moreover, specific combinations of the three types of reactions occur. For instance, DNA recombination is often linked directly to either DNA replication or DNA repair23. For the integrity of a chromosome to be properly maintained, each specific reaction must be carefully directed and controlled. This requires that sets of proteins be assembled on the DNA and activated only where and when they are needed. Although much remains to be learned about how these choices are made, it seems that different types of DNA structures are recognized explicitly by specialized proteins that serve as 'assembly factors'. Each assembly factor then serves to nucleate a cooperative assembly of the set of proteins that forms a particular protein machine, as needed for catalysing a reaction appropriate to that time and place in the cell.

A view of the future

It has become customary, both in textbooks and in the regular scientific literature, to explain molecular mechanisms through simple two-dimensional drawings or 'cartoons'. Such drawings are useful for consolidating large amounts of data into a simple scheme, as illustrated in this review. But a whole generation of biologists may have become lulled into believing that the essence of a biological mechanism has been captured, and the entire problem therefore solved, once a researcher has deciphered enough of the puzzle to be able to draw a meaningful cartoon of this type.

In the past few years, it has become abundantly clear that much more will be demanded of scientists before we can claim to fully understand a process such as DNA replication or DNA recombination. Recent genome sequencing projects, protein-interaction mapping efforts and studies in cell signalling have revealed many more components and molecular interactions than were previously realized. For example, according to one recent analysis, S. cerevisiae, a single-celled 'simple' eukaryotic organism (which has about 6,000 genes compared with 30,000 in humans), uses 88 genes for its DNA replication and 49 genes for its DNA recombination24.

To focus on DNA replication, fully understanding the mechanism will require returning to where the studies of DNA first began — in the realms of chemistry and physics. Detailed atomic structures of all relevant proteins and nucleic acids will be needed, and spectacular progress is being made by structural biologists, owing to increasingly powerful X-ray crystallography and nuclear magnetic resonance techniques. But the ability to reconstruct biological processes in a test tube with molecules whose precise structures are known is not enough. The replication process is both very rapid and incredibly accurate, achieving a final error rate of about one nucleotide in a billion. Understanding how the reactions between the many different proteins and other molecules are coordinated to create this result will require that experimentalists determine all of the rate constants for the interactions between the various components, something that is rarely done by molecular biologists today. They can then use genetic engineering techniques to alter selected sets of these parameters, carefully monitoring the effect of these changes on the replication process.
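The scale of the accuracy quoted above can be made tangible with a small calculation. The sketch below assumes that errors occur independently at each position with the same probability, which is a simplification introduced here and not a claim from the article; the function names are hypothetical. It uses the error rate of one in a billion and the chromosome size of 100 million nucleotide pairs mentioned earlier in this review.

```python
import math

ERROR_RATE = 1e-9  # about one nucleotide in a billion, as quoted in the text


def expected_errors(nucleotides_copied: float, rate: float = ERROR_RATE) -> float:
    """Mean number of errors when copying the given number of nucleotides."""
    return nucleotides_copied * rate


def probability_error_free(nucleotides_copied: float, rate: float = ERROR_RATE) -> float:
    """Chance that every position is copied correctly.

    ASSUMPTION: errors are independent and equally likely at each position.
    log1p keeps the calculation accurate for very small rates.
    """
    return math.exp(nucleotides_copied * math.log1p(-rate))


if __name__ == "__main__":
    chromosome = 100_000_000  # 100 million pairs, the typical chromosome in the text
    print(expected_errors(chromosome))
    print(probability_error_free(chromosome))
```

On these assumptions, a chromosome of this size would be copied without a single error in roughly nine replications out of ten.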

Scientists will be able to claim that they truly understand a complex process such as DNA replication only when they can precisely predict the effect of changes in each of the various rate constants on the overall reaction. Because the range of experimental manipulations is enormous, we will need more powerful ways of deciding which such alterations are the most likely to increase our understanding. New approaches from the rapidly developing field of computational biology must therefore be developed — both to guide experimentation and to interpret the results.

The Watson–Crick model of DNA catalysed dramatic advances in our molecular understanding of biology. At the same time, its enormous success gave rise to the misleading view that many other complex aspects of biology might be similarly reduced to elegant simplicity through insightful theoretical analysis and model building. This view has been supplanted over subsequent decades, because most biological subsystems have turned out to be far too complex for their details to be predicted. We now know that nothing can substitute for rigorous experimental analyses. But traditional molecular and cell biology alone cannot bring a problem like DNA replication to closure. New types of approaches will be required, involving not only new computational tools, but also a greater integration of chemistry and physics20,25. For this reason, we urgently need to rethink the education that we are providing to the next generation of biological scientists22,26.