This issue of Nature celebrates a halfway point in the implementation of the 'map first, sequence later' strategy adopted by the Human Genome Project in the mid-1980s1. The results suggest that the strategy was basically sound. It led, as hoped, to a project that could be distributed internationally across many genome-sequencing centres, and that would allow sequenced fragments of the human genome to be anchored to mapped genomic landmarks long before the complete sequence coalesced into one long string of Gs, As, Ts and Cs.

The centrepiece of the suite of mapping papers in this issue is on page 934, where the International Human Genome Mapping Consortium describes a 'clone-based' physical map of the human genome2. A map like this not only charts the genome, giving a structure on which to hang sequence data, but also provides a starting point for sequencing. Figure 1 shows the basics of the approach. I drew this figure in 1981, using India ink and a Leroy lettering set. Both graphical and mapping technologies have come a long way since then, but the principles behind clone-based physical mapping have not changed.

Figure 1: Clone-based physical mapping.

The top line shows the location of 'restriction' sites (vertical bars) in a particular region of the genome. Restriction sites are places at which a site-specific restriction endonuclease cleaves DNA. The fragments produced by cleavage at every possible point in this region are numbered 1 to 9. Below the line are several clones with random end points, labelled A to F. Clones are produced by first partially digesting many copies of the genome with different restriction endonucleases; the resulting large segments are then inserted into bacteria and replicated (cloned). Each clone is digested with a restriction endonuclease, and the resulting fragments are separated, by size, on an electrophoretic gel ('gel analysis of inserts'). This process yields a distinctive pattern ('fingerprint') for each clone. The map-assembly problem requires working backwards (upwards in this figure) from the fingerprints to a clone-overlap map and restriction-site map of the chromosome segment. To finish the analysis of this region of the genome, the natural choice of clones to sequence would be A and B.

The clone-based approach works as follows. Many copies of the genome are cut up into segments of about 150,000 base pairs by partial digestion with site-specific restriction endonucleases — enzymes that cleave DNA in specific places. ('Partial' digestion means that the reaction is not carried out for long enough to allow every possible cleavage to be made.) The large DNA segments are plugged into bacterial 'artificial chromosomes' (BACs) and inserted into bacteria, where they are copied exactly each time the bacteria divide. The process produces 'clones' of identical DNA molecules that can be purified for further analysis. Next, each clone is completely digested with a restriction endonuclease, chosen to produce a characteristic pattern of small fragments, or a 'fingerprint', for each clone. Comparison of the patterns reveals overlap between the clones, allowing them to be lined up in order, while the sites in the genome at which the restriction endonuclease cleaves are charted. The result is a physical map.
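The fingerprint comparison described above can be illustrated with a toy sketch. This is not the consortium's software. It assumes exact fragment sizes (real gels measure sizes with error), clone ends that fall on restriction sites, and a cut placed at the start of each recognition site. The names `digest` and `shared_fragments`, the toy genome and the choice of the EcoRI site GAATTC are all illustrative assumptions.

```python
from collections import Counter

SITE = "GAATTC"  # EcoRI recognition site, used here purely as an example


def digest(sequence, site=SITE):
    """Return sorted fragment lengths from a complete digest of `sequence`.

    Simplification: the cut is placed at the start of each recognition site.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop zero-length pieces that arise when a site sits at position 0.
    return sorted(b - a for a, b in zip(bounds, bounds[1:]) if b > a)


def shared_fragments(fingerprint_a, fingerprint_b):
    """Count fragment sizes common to two fingerprints (multiset intersection)."""
    return sum((Counter(fingerprint_a) & Counter(fingerprint_b)).values())


# Toy genome: six sites separated by filler of increasing length.
genome = "".join(SITE + "A" * n for n in (4, 9, 14, 19, 24, 29))

# Hypothetical clones, defined by their end points in the toy genome.
clones = {
    "A": genome[0:70],
    "B": genome[25:135],
    "C": genome[70:135],
}
fingerprints = {name: digest(seq) for name, seq in clones.items()}

for first, second in (("A", "B"), ("B", "C"), ("A", "C")):
    n_shared = shared_fragments(fingerprints[first], fingerprints[second])
    print(first, second, n_shared)
```

In this toy case, clones A and B share two fragment sizes, as do B and C, whereas A and C share none, so the inferred order is A, B, C. A real analysis must also allow for measurement error and for unrelated fragments that happen to be the same size.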

Individual BAC clones are then sheared into smaller fragments and cloned; the resulting 'small-insert' subclones are sequenced. The sequence of an individual BAC clone is assembled from the sequences of an 'oversampled' set of subclones (in other words, enough subclones are sequenced to ensure that each part of the original clone is analysed several times). Finally, the whole genome sequence is assembled by melding together the sequences of a set of BACs that spans the genome.
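A rough calculation shows why oversampling matters. The sketch below uses the standard idealized model in which subclone reads land uniformly at random on the clone, so that the chance of a given base being missed falls off as e^(-depth). The model is an assumption brought in for illustration and is not taken from the papers in this issue. The read length and read count are likewise hypothetical. Only the clone size of about 150,000 base pairs comes from the text.

```python
import math


def coverage_depth(n_reads, read_length, clone_length):
    """Average number of times each base of the clone is sequenced."""
    return n_reads * read_length / clone_length


def expected_uncovered_fraction(depth):
    """Expected fraction of bases hit by no read, under the idealized
    assumption that reads land uniformly at random (Poisson model)."""
    return math.exp(-depth)


CLONE_LENGTH = 150_000  # base pairs, as quoted for BAC inserts in the text
READ_LENGTH = 500       # hypothetical read length
N_READS = 1_500         # hypothetical number of subclone reads

depth = coverage_depth(N_READS, READ_LENGTH, CLONE_LENGTH)
missed = expected_uncovered_fraction(depth) * CLONE_LENGTH
print(f"depth = {depth:.1f}x, expected uncovered bases = {missed:.0f}")
```

Under these assumptions, even fivefold oversampling leaves a small expected number of bases unsequenced. This is one reason why gaps must be closed deliberately rather than by random sampling alone.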

This approach is similar to that used in the 1980s and early 1990s to map and sequence the genomes of the nematode Caenorhabditis elegans and the yeast Saccharomyces cerevisiae3,4. What is new in the human project is its staggering scale, and the speed with which it has been completed. By way of comparison, although the nematode and yeast genomes are, respectively, only 3% and 0.5% the size of the human genome, these early mapping projects spanned the better part of a decade, as opposed to two years for the much larger human project.

One weakness of clone-based physical mapping is that the maps often have poor continuity. For example, there is not always a BAC clone to cover every part of the genome; and overlaps between clones can be obscured by data errors or the presence of large-scale repeats in the genome. The current map2 has more than 1,000 discontinuities. These will cause some difficulties as the Human Genome Project moves to its next phase, which will involve ensuring accuracy and filling in any gaps in the sequence. Nonetheless, the current map typically maintains continuity for several million base pairs at a stretch. These continuous segments are big enough to allow the clone-based map to be overlaid on various lower-resolution maps. In this way, the mapped segments can be ordered and orientated, much as a discontinuous patchwork of high-resolution maps of the Earth's surface can be orientated by overlaying them on a satellite photograph of the whole Earth.
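The overlay step can be sketched in a few lines. Assume, hypothetically, that each continuous segment of the clone-based map carries a few landmarks whose positions on a lower-resolution map are known. The segment can then be placed by the median landmark position and orientated by comparing its first and last landmarks. The names `place_contig` and `order_contigs`, the segment labels and the positions are all invented for illustration.

```python
from statistics import median


def place_contig(marker_positions):
    """Place one segment on the low-resolution map.

    `marker_positions` lists the low-resolution positions of the segment's
    landmarks, in the order they occur along the segment itself.
    Returns (anchor position, orientation), where orientation is "+", "-",
    or "?" when the landmarks cannot decide it.
    """
    anchor = median(marker_positions)
    first, last = marker_positions[0], marker_positions[-1]
    if len(marker_positions) < 2 or first == last:
        orientation = "?"
    elif first < last:
        orientation = "+"
    else:
        orientation = "-"
    return anchor, orientation


def order_contigs(contigs):
    """Sort segments along the low-resolution map by their anchor position."""
    placed = [(name, *place_contig(pos)) for name, pos in contigs.items()]
    return sorted(placed, key=lambda item: item[1])


# Hypothetical segments with landmark positions in arbitrary map units.
contigs = {
    "ctg_B": [42.0, 40.5, 38.0],
    "ctg_A": [3.0, 5.5, 9.0],
    "ctg_C": [61.0],
}

for name, anchor, orientation in order_contigs(contigs):
    print(name, anchor, orientation)
```

A segment with a single landmark, such as the hypothetical `ctg_C`, can be positioned but not orientated. This is why segments spanning several million base pairs, which tend to carry several landmarks, are so much more useful than short ones.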

Two particularly interesting low-resolution maps are the genetic and cytogenetic maps, on pages 951 and 953 of this issue5,6. The genetic map5 is based on the probability of the occurrence of recombination — the swapping of corresponding, nearly identical segments of DNA between maternally and paternally derived chromosomes as the genome is passed from one generation to the next. The cytogenetic map6 is based on subtle variations in the staining properties of different regions of the genome, as viewed by light microscopy. Yet more papers describe different approaches to clone-based mapping7,8,9. These methods were applied to particular chromosomes simply because different sequencing centres chose to rely on the whole-genome map to different degrees.

But was all this cartography even necessary? Another draft of the human genome sequence is described in this week's Science by Celera Genomics10. This group adopted a different approach, which involved preparing small-insert clones directly from genomic DNA rather than from mapped BACs. The major rationale for the BAC-by-BAC approach2 was to ease the finishing phase of the Human Genome Project, which lies ahead. The consortium now plans to upgrade the 30,000 BAC sequences by sequencing more subclones from each BAC (the 'topping-up' phase) and then resolving internal gaps and discrepancies (the 'finishing' phase). Segmenting the finishing phase into BAC-sized portions provides an enormous advantage in dealing with blocks of sequence that are repeated at many different places within the genome. The power of this strategy is nicely illustrated by the mapping of the Y chromosome, whose repetitive structure is unusually complex (page 943 of this issue11).

Nature readers should not expect any real answer to the question of which of these two approaches is the better one. But it is likely that the only players still on the field when the toughest finishing issues are confronted will be the public consortium's BAC brigade. In the future, as genome sequencing moves on to other mammals, the context will have changed; the human sequence will provide an invaluable guide to assembling long stretches of sequence that are shared among all mammalian genomes. So the sequencing of the human genome is likely to be the only large sequencing project carried to completion by the methods described in this issue. Genome sequencing will get easier from here.

Looking ahead, there are two threats to producing a quality finished product. One is simple exhaustion on the part of the consortium's members: each new round of press conferences announcing that the human genome has been sequenced saps the morale of those who must come to work each day actually to do what they read in the newspapers has already been done.

The other threat is the argument, which we may expect to hear, that the current sequence is good enough for most purposes, and that remaining problems should be resolved by users as the need for accurate sequence in specific regions arises. What we have now is certainly a lot better than what we had yesterday. But biologists in the future will be comparing vast data sets to the reference sequence of the human genome. They must be able to do so with confidence that the discrepancies they encounter are due to the limitations of their own data or, more interestingly, to biology. They should not need to expend time, energy and imagination compensating for a failure now to pursue the Human Genome Project to a grand conclusion. We must move on and finish the job, even as the bright lights of media attention shift elsewhere.