The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000–25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
The Human Genome Project (HGP) was launched in 1990 with the goal of obtaining a highly accurate sequence of the vast majority of the euchromatic portion of the human genome. The initial work followed a two-pronged approach: (1) the mapping of the human and mouse genomes1,2,3,4,5,6,7,8,9 to allow the study of inherited disease and provide a crucial scaffold for genome assembly; and (2) the sequencing of organisms with smaller, simpler genomes10,11,12,13,14 to serve as a testbed for method development and assist in interpreting the human genome. With success along both paths, the sequencing of the human genome itself eventually became feasible. The International Human Genome Sequencing Consortium (IHGSC), an open collaboration involving twenty centres in six countries, was formed to carry out this component of the HGP.
In February 2001, the IHGSC15 and Celera Genomics16 each reported draft sequences providing a first overall view of the human genome. These sequences allowed systematic study of the human genome itself, including identification of genes, combinatorial architecture of proteins, regional differences in genome composition, distribution and history of transposable elements, distribution of polymorphism and relationship between genetic recombination and physical distance. Moreover, systematic knowledge of the human genome has enabled new tools and approaches that have markedly accelerated biomedical research.
Both draft sequences, however, had important shortcomings. The IHGSC sequence, for example, omitted ∼10% of the euchromatic genome; it was interrupted by ∼150,000 gaps; and the order and orientation of many segments within local regions had not been established. The IHGSC thus turned to the challenge of completing the sequence of the euchromatic genome. Operationally, a finished sequence was defined as having an error rate of, at most, one event per 10⁴ bases, and the goal for completion was coverage in finished sequence of at least 95% of the euchromatic genome, with the only gaps being those refractory to all available techniques17 (see http://www.genome.gov/10000923). The goal was challenging because the human genome is replete with such features as dispersed repeats and large segmental duplications, which greatly complicate the determination of genome structure and sequence. In fact, near-complete sequences have been obtained so far only for three multicellular organisms: the nematode13, mustard weed18 and the fruitfly19. These genomes are all roughly 30-fold smaller than the human genome and have much simpler structure.
We describe here the results of a multiyear effort by the IHGSC towards the goal of a complete human sequence. The number of gaps has been reduced 400-fold to only 341, most of which are associated with segmental duplications and will require new methods for resolution. The assembled near-complete genome sequence has an error rate of only ∼1 event per 100,000 bases; it contains 2.85 billion nucleotides and covers ∼99% of the euchromatic genome. This paper describes the current genome sequence and the process used to produce it; examines the accuracy and completeness of the sequence; and illustrates biological analyses made possible by the sequence. We do not attempt here a comprehensive analysis of the contents of the human genome. An initial analysis was previously reported15 and a series of papers is being written describing the individual chromosomes17,20,21,22,23,24,25,26,27,28,29,30, including annotation of genes and other features.
Current genome sequence
The process of converting the initial draft sequence into a near-complete sequence is referred to as ‘finishing’. It is a complex iterative process that proceeds simultaneously at multiple scales, ranging from single nucleotides to the integrity of whole chromosomes. The fundamental challenge is that genomic regions that are not well represented or readily resolved through random shotgun sequencing tend to be highly enriched in problematic sequences. Resolving such regions required the development of special approaches, which evolved substantially over time and varied among centres.
Broadly, the finishing process involved two distinct components: (1) producing finished maps, consisting of continuous and accurate paths of overlapping large-insert clones spanning the euchromatic region of each chromosome arm; and (2) producing finished clones, consisting of continuous and accurate nucleotide sequence across each large-insert clone. In practice, these two components were tightly intertwined in that progress in each often depended on results from the other. The components are described in Boxes 1 and 2. Further information about the finishing process and finishing standards can be found in the Supplementary Information (Note 1) and at http://www.genome.gov/10000923.
In total, we generated a shotgun sequence from 59,208 large-insert clones (total length ∼5.84 gigabases (Gb)) and finished the sequence from 45,742 of these clones (total length ∼3.67 Gb). The clones consisted primarily of bacterial artificial chromosomes (BACs), but also included some P1-artificial chromosomes (PACs), yeast artificial chromosomes (YACs), fosmids and cosmids; they carried DNA from multiple anonymous sources15. We then chose a ‘clone tiling path’ of 26,720 overlapping clones across the genome, selected a ‘sequence tiling path’ of directly adjacent, non-overlapping segments from consecutive clones and concatenated these segments to create a near-complete genome sequence. Contributions of the IHGSC centres to this finishing phase are shown in Table 1.
The human sequence reported here consists of 2,851,330,913 nucleotides, lying almost entirely within the euchromatic portion of the genome (Table 2). It is interrupted by only 341 gaps, of which 33 gaps (totalling ∼198 megabases (Mb)) reflect heterochromatin, which was not targeted by the HGP, and 308 gaps (totalling ∼28 Mb) are euchromatic. The euchromatic genome is thus ∼2.88 Gb and the overall human genome is ∼3.08 Gb. The long-range continuity of the current genome sequence is high by various measures (Table 3). The N50 length is 38.5 Mb and the N-average length is 40.9 Mb; these values are ∼1,000-fold larger than the size of a typical human gene. (The first statistic is the length x such that at least 50% of nucleotides lie in a continuous segment of length ≥x, whereas the second is the average length of the contiguous segment containing a randomly chosen nucleotide.) Focusing on individual chromosome arms, the N50 length exceeds half the length of the arm in three-quarters of cases (Table 3).
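The two continuity statistics defined above can be illustrated with a minimal sketch. The function names and the toy segment lengths are illustrative assumptions and are not taken from the Build 35 data; the code simply applies the two definitions given in the text.

```python
def n50(lengths):
    """Length x such that at least 50% of nucleotides lie in a segment of length >= x."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one segment of positive length")
    running = 0
    # Walk from the longest segment downwards until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def n_average(lengths):
    """Average length of the segment containing a randomly chosen nucleotide."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("need at least one segment of positive length")
    # A segment of length L is hit with probability L / total, so the
    # expectation is sum(L * L) / total.
    return sum(length * length for length in lengths) / total


# Hypothetical contiguous-segment lengths (arbitrary units), for illustration only.
toy_segments = [50, 30, 10, 10]
print(n50(toy_segments))        # 50
print(n_average(toy_segments))  # 36.0
```

Both statistics weight segments by the number of bases they contain, which is why a small number of long segments dominates them even when many short segments remain.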
The sequence is denoted as NCBI Human Build 35 (May 2004), with the individual chromosomes having accession numbers NC_000001 to NC_000024 (see Supplementary Information Note 3 concerning additional sequence data). The analyses reported here were performed on Build 35 or, in a few cases, its immediate predecessor, Build 34 (which differed only slightly). The poster accompanying this paper displays the 24 human chromosomes, together with various biological annotations. These include GC content, repeat content, segmental duplications, protein-coding genes, sequence similarity and synteny conservation with mouse, sequence similarity with the pufferfish, and density of single-nucleotide polymorphisms in the human genome. Many additional annotations can be found on public genome browsers (http://genome.ucsc.edu/; http://www.ensembl.org/; http://www.ncbi.nlm.nih.gov/genome/guide/human/), which are regularly updated.
Comparison with draft sequence
The near-complete sequence is a great improvement over the earlier draft sequence. It has substantially fewer gaps (341 versus 147,821) and greater continuity (38,500 kilobases (kb) versus 81 kb for N50 contig size), reflecting an overall improvement of ∼475-fold. The draft sequence contained regions in which the local order and orientation were unknown; these have now been resolved. The case of chromosome 7 is illustrated in Fig. 1. Additionally, the draft sequence contained substantial artefactual duplication, including local events caused by errors in merging some adjacent BAC-based sequences, made by the first-generation global assembly program, and global events caused by contamination of shotgun assemblies of some BACs with data from other clones. These artefacts have now been eliminated.
Accuracy and completeness
Because the human genome sequence is intended to serve as a permanent foundation for biomedical research, it was important to assess its quality and to characterize its remaining defects. For this purpose, we used a number of comparisons and consistency checks.
Assessment of accuracy
Tests of accuracy were designed to detect potential problems that may have occurred in clone-based sequencing. These may include errors in assembling the finished sequence within individual clones, and errors in concatenating adjacent finished clones to create the final product. The analysis was complicated by the presence of polymorphism in the human population, because differences between sequence clones may reflect either errors or polymorphism.
Independent quality assessment. Quality assessment (QA) exercises were performed regularly throughout the HGP31. In the final stages, an independent group examined a random sample of finished clones by generating additional data and generating new assemblies32. Briefly, this QA analysis examined ∼34 Mb and found an error rate of 1.1 per 100 kb for small events (≤ 50 bp, with average size of 1.3 bp) and 0.03 per 100 kb for large events (> 50 bp). The small events consisted largely of single-base substitutions, whereas the remaining small and large events primarily concerned the number of consecutive copies of a tandem repeat32.
Analysis of clone overlap. We extended the QA analysis to a larger region (∼174 Mb), by examining overlapping sequence between consecutive finished large-insert clones. If two such clones derive from the same copy of the human genome, any sequence differences in the overlap must reflect an error in one of the two clones. By comparing independent clones, this quality assessment method also has the ability to detect cloning artefacts. We examined 4,356 substantially overlapping clones derived from the same library; half are expected to be derived from the same haplotype and half from a different haplotype. We counted the number of single-base mismatches (ignoring insertion/deletions (indels)) in the overlapping regions. The resulting distribution (Fig. 2a) is bimodal. The first peak is consistent with expectation for clones from the same haplotype, with a sequencing error rate of ∼10⁻⁵ per bp. The second peak is consistent with the expectation for clones from different haplotypes, with a polymorphism rate of ∼10⁻³ per bp; this peak matches the distribution seen for clones from different libraries.
We then examined overlapping clones likely to be from the same haplotype (with no single-base mismatches) and counted the discrepancy rate for indels (Fig. 2b). The error rate (estimated as half the discrepancy rate) is ∼0.55 events per 100 kb, with the vast majority being in tandem repeats. By contrast, clones from different libraries show a discrepancy rate that is at least 20-fold higher. Overall, the analysis indicates that the error rate (reflecting both sequence error and cloning artefacts) is 20–100-fold lower than the human polymorphism rate.
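The reasoning behind the overlap analysis can be sketched as follows. Each base in an overlap has been sequenced in two clones, so a discrepancy between clones from the same haplotype is attributed to an error in one of the two, and the per-clone error rate is taken as half the discrepancy rate. The function names, the threshold of 10⁻⁴ mismatches per bp (chosen here only because it lies between the two peaks reported in the text) and the example counts are illustrative assumptions, not values or procedures from the analysis itself.

```python
def classify_overlap(mismatches, overlap_bp, threshold=1e-4):
    """Assign an overlap to one of the two modes of the mismatch distribution.

    The threshold is an assumed cut-off lying between the reported peaks
    (~1e-5 per bp for sequencing error, ~1e-3 per bp for polymorphism).
    """
    if overlap_bp <= 0:
        raise ValueError("overlap length must be positive")
    rate = mismatches / overlap_bp
    return "same haplotype" if rate < threshold else "different haplotype"


def error_rate_per_100kb(discrepancies, overlap_bp):
    """Per-clone error rate, estimated as half the discrepancy rate in overlaps."""
    if overlap_bp <= 0:
        raise ValueError("overlap length must be positive")
    discrepancy_rate = discrepancies / overlap_bp
    return 0.5 * discrepancy_rate * 100_000


# Hypothetical counts: 11 indel discrepancies across 1 Mb of same-haplotype overlap
# correspond to roughly 0.55 errors per 100 kb.
print(classify_overlap(0, 40_000))
print(classify_overlap(40, 40_000))
print(round(error_rate_per_100kb(11, 1_000_000), 2))
```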
Analysis of junctions. We assessed longer-range integrity of the genome sequence by studying read pairs from large insert clones. Specifically, we created a fosmid library carrying randomly sheared human DNA and sequenced both ends of the insert of ∼750,000 clones. Fosmid clones are particularly useful because their insert sizes cluster tightly around 40 kb, due to packaging constraints. We aligned the fosmid end sequences to the genome sequence. Both ends could be mapped to unique locations in the human genome in most cases (86%), and these two locations were within 39.5 ± 7.5 kb in 99% of cases. Some fosmids could not be uniquely placed because one or both ends consisted almost entirely of repeat sequence. Using the uniquely placed fosmids (which provide about eightfold clone coverage of the euchromatic genome), we sought to obtain independent confirmation of the order, orientation and adjacency of the junction between consecutive finished large-insert clones used to construct the genome sequence. The junction was considered ‘supported’ if spanned by one or more consistently placed fosmids. In all, ∼97% of junctions were supported. About half of the remaining junctions were supported by fosmids with unique placement at one end but multiple placements at the other end. Overall, the analysis provided strong support for accuracy of the junctions underlying the current genome sequence.
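The junction test can be sketched in a few lines. Here a fosmid is treated as 'consistently placed' when the distance between its mapped ends lies within the 39.5 ± 7.5 kb window quoted above, and a junction is 'supported' when at least one such fosmid spans it. The coordinate representation, the function names and the example placements are assumptions made for illustration; the actual criteria are described in the Supplementary Information.

```python
MIN_SPAN = 32_000  # 39.5 kb - 7.5 kb
MAX_SPAN = 47_000  # 39.5 kb + 7.5 kb


def is_consistent(start, end):
    """True if the apparent insert size falls in the expected fosmid size window."""
    return MIN_SPAN <= end - start <= MAX_SPAN


def supported_fraction(junctions, fosmids):
    """Fraction of junction positions spanned by one or more consistent fosmids.

    junctions: positions (bp) on one chromosome where adjacent clones were joined.
    fosmids:   (start, end) positions (bp) of uniquely placed end-sequence pairs.
    """
    if not junctions:
        raise ValueError("no junctions to evaluate")
    consistent = [(s, e) for s, e in fosmids if is_consistent(s, e)]
    supported = sum(
        1 for j in junctions if any(s < j < e for s, e in consistent)
    )
    return supported / len(junctions)


# Hypothetical placements: only the first junction is spanned by a fosmid of plausible size.
junctions = [100_000, 200_000, 300_000]
fosmids = [(80_000, 120_000), (190_000, 195_000), (280_000, 400_000)]
print(supported_fraction(junctions, fosmids))
```

The linear scan over fosmids is adequate for a sketch; a genome-scale analysis would use an interval index instead.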
Search for deletions. We next scanned the genome sequence for evidence of deletions of several kilobases in size, using the same fosmid data set. At each point, we calculated the ‘apparent size’ of each fosmid spanning the point (defined as the distance between the location of the end sequences in the current genome sequence) and then calculated the ‘average apparent size’ for all the fosmids spanning the point. We searched for regions where the observed size fell far below expectation (by more than 3.5 standard deviations (s.d.)), suggesting a large difference between the genome sequence and the source DNA for the fosmid library (Fig. 3). Such differences could reflect either an error in the genome sequence, a deletion in the fosmid clone, or a deletion polymorphism between the DNA sources. (Given the number of fosmids used, this analysis has ∼50% sensitivity to detect deletions of 3–30 kb. Because the methodology cannot detect deletions larger than a fosmid, we also analysed discrepant fosmid links, which could reflect deletions. See Methods in Supplementary Information.)
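The scan described above can be illustrated with a minimal sketch. At each test point, the apparent sizes of all spanning fosmids are averaged and compared with the expected insert size. The expected mean of 40 kb follows the text; the library s.d. of 3 kb, the use of the standard error of the mean for the comparison, and all names and example coordinates are assumptions made for illustration rather than the procedure given in the Supplementary Information.

```python
import math


def average_apparent_size(point, fosmids):
    """Mean end-to-end distance of the fosmids spanning a point, and their count."""
    sizes = [end - start for start, end in fosmids if start <= point <= end]
    if not sizes:
        return None, 0
    return sum(sizes) / len(sizes), len(sizes)


def deletion_candidates(points, fosmids, mean=40_000, sd=3_000, z_cut=3.5):
    """Points where the average apparent size falls more than z_cut s.d. below expectation.

    sd is an assumed per-fosmid standard deviation; the s.d. of an average
    over n spanning fosmids is taken as sd / sqrt(n).
    """
    hits = []
    for point in points:
        avg, n = average_apparent_size(point, fosmids)
        if n == 0:
            continue  # no spanning fosmids, so nothing can be concluded here
        z = (avg - mean) / (sd / math.sqrt(n))
        if z < -z_cut:
            hits.append((point, avg, n))
    return hits


# Hypothetical data: eight fosmids with ends mapping only 35 kb apart around
# position 50,000, and eight fosmids of the expected 40 kb around position 150,000.
short = [(30_000 + i * 1_000, 65_000 + i * 1_000) for i in range(8)]
normal = [(130_000 + i * 1_000, 170_000 + i * 1_000) for i in range(8)]
print(deletion_candidates([50_000, 150_000], short + normal))
```

Fosmid ends that map closer together than expected indicate sequence present in the source DNA but absent from the assembly, which is why only the lower tail is tested.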
We found 242 candidate regions, with suggestive evidence for deletions (average apparent size ∼5 kb). These regions were then scrutinized by alignment with the recently obtained draft sequence of the chimpanzee genome (R. H. Waterston, personal communication). Because the human and chimp genomes align with relatively few large indels (indels >2 kb occur at ∼1 per 100 kb), this comparison should highlight true deletions. The chimpanzee comparison supported the presence of deletions in 35% of cases. A subset of these was then tested by polymerase chain reaction (PCR) analysis of genomic DNA from multiple individuals. Roughly two-thirds appear to represent polymorphic deletions in the human population and one-third represent actual errors in the current genome sequence. Overall, the results indicate that the current genome sequence is likely to contain perhaps 50–100 erroneous deletions (average size ∼5 kb), which could be due to assembly errors or mutations occurring during propagation of large insert clones. Analysis of a larger collection of fosmids could probably pinpoint the majority of these errors, allowing them to be corrected.
Assessment of coverage
Tests of coverage were designed to measure the proportion of the euchromatic genome missing from the current genome sequence, by assessing the presence of independently sampled human sequences such as complementary DNA clones and random genomic clones.
Analysis of cDNAs. We tested for the presence of known cDNA sequences from public databases (REFSEQ33 and MGC34). The analysis35 involved 17,458 distinct gene loci spanning 925 Mb of genomic sequence. The vast majority (99.74%) could be confidently aligned to the current genome sequence over virtually their complete length with high sequence identity (at a level consistent with the expected polymorphism rate and the performance of the alignment program). A few of these (0.5%) showed strong alignment to more than one locus. A few others (0.04%) showed unusually high sequence difference (> 2%), but these were nearly all immunologically related genes (such as major histocompatibility loci and immunoglobulin-related loci) known to be highly polymorphic.
We examined the remaining cases (0.28%). The cDNA sequence appeared to be completely absent in 0.06% of cases and partially absent, with a contiguous segment missing, in 0.23% of cases. For almost all of completely absent cDNAs, the genomic location of the gene was known or could be inferred and corresponds to a gap in the current genome sequence. For the partially absent cDNAs, more than half of the cases lie adjacent to gaps. The remainder may represent either errors in the current genome sequence or polymorphic deletions; these are being investigated further. Overall, the proportion of cDNA sequence that is missing from the genome sequence is only 0.08% of the total. This may underestimate the proportion of genome missing from the finished sequence, however, because focused efforts were made to capture genomic sequence containing missing messenger RNAs.
Analysis of random genomic plasmids. As an additional and broader test of coverage, we analysed paired end-sequences from 5,000 small-insert (3–4 kb) plasmids generated as part of a human single nucleotide polymorphism (SNP) discovery project (see Methods). After excluding heterochromatic repeats and other artefacts, we found that 99.3% of the reads could be reliably aligned to the finished sequence. For 0.6% of the reads, neither end could be aligned; these probably lie in known gaps. For another 0.1% of the reads, exactly one end could be placed; some fell next to known gaps, whereas others appear to represent indel differences between the reference sequence and the source DNA for the plasmid library. The overall analysis indicates that <1% of the euchromatic genome is missing from the finished sequence. Together, the cDNA and plasmid analyses indicate that the current genome sequence contains more than 99% of the euchromatic portion of the human genome.
Characterization of remaining gaps
The current genome sequence contains 341 gaps, which could not be closed with available techniques. We briefly describe the nature of these gaps and discuss the prospects for eventual closure. (See Supplementary Information Notes 2 and 4.)
Heterochromatic regions (33 gaps). The heterochromatic regions of the human genome were not targeted by the HGP, because their highly repetitive properties make them largely refractory to current cloning and sequencing strategies. There are 33 heterochromatic regions falling into four types. The 24 centromeres (∼50 Mb) consist largely of alpha satellite repeats, of which ∼15 types exist; these monomeric repeats are arranged into higher-order arrays distinct to specific chromosomes, which are tandemly repeated with slight sequence variations. The three secondary constrictions are immediately adjacent to the centromere on chromosome arms 1q, 9q and 16q and contain various satellite repeats (beta, gamma, satellite I, II, III). The five acrocentric chromosome arms 13p, 14p, 15p, 21p and 22p encode the 5.8S, 18S and 28S ribosomal RNA genes, which lie on a 43-kb sequence present in ∼50 tandem copies on each arm and are flanked by additional repeats arranged in complex structures. Finally, there is a single large region on distal Yq composed primarily of thousands of copies of several repeat families. The heterochromatic regions all tend to be highly polymorphic in length in the human population.
Euchromatic boundary regions (35 gaps). The euchromatic regions of the human genome are bounded proximally by heterochromatin and distally by a telomere consisting of several kilobases of the hexamer repeat TTAGGG. We examined the current genome sequence for evidence of the expected boundaries on the 43 euchromatic arms. (See Supplementary Information Note 4.) At the proximal ends, 30 of the 43 cases show sequence characteristic of either heterochromatin or immediately flanking regions (such as higher-order centromeric repeats, stretches of at least 10 kb of monomeric alpha satellite repeat or other pericentromeric repeats). We cannot exclude the possibility that there is additional unique sequence between this point and the proximal heterochromatin; but efforts to extend the finished sequence further were unsuccessful. In the remaining 13 cases, the finished sequence contains no evidence of heterochromatin-related sequence. At the telomeric ends, 21 of the 43 cases show continuous sequence extending to the telomeric repeat. This sequence was typically obtained by isolation and sequencing of half-YAC clones spanning to the telomere36. An additional 18 cases are sequence gaps, in which half-YACs reaching to the telomere were isolated but finished sequence could not be obtained. The remaining four cases are physical gaps, in which large-insert clones extending to the telomere could not be obtained.
Euchromatic interior regions (273 gaps). The remaining gaps are located within the current genome sequence. These consist of 215 physical gaps for which no clones could be isolated, and 58 sequence gaps for which clones were found but reliable finished sequence could not be obtained. The physical gaps are greatly enriched in regions of segmental duplication (Fig. 4a). Roughly half of these gaps (52%) are flanked by segmental duplications with >90% sequence identity, although such duplications comprise only ∼5.3% of the euchromatic genome (Fig. 4b). Such segmental duplications are especially frequent in pericentromeric regions, and gaps are notably more frequent in these regions. The association of gaps with segmental duplications is examined in detail elsewhere37.
The most extreme case occurs near the centromere of chromosome 9. The most proximal 5 Mb on 9p and 4 Mb on 9q comprise a mere 0.3% of the genome, but account for ∼12% of the physical gaps in the euchromatic sequence. These two pericentric regions are unique in the genome with respect to density of segmental duplication and the average degree of intrachromosomal sequence identity (98.7%), and the two regions have many highly similar sequences in common. The high sequence similarity between the two regions is likely to be the reason for a polymorphic inversion of the centric heterochromatin on chromosome 9, present at a frequency of ∼1% in the human population28. Other proximal regions also show a higher-than-average density of gaps. For example, the proximal 2 Mb on the remaining 41 euchromatic arms comprise 2.9% of the genome but harbour 13.3% of the gaps. Nearly all of these proximal gaps are flanked by segmental duplications (Fig. 5a). There is also a clustering of such gaps in subtelomeric regions. The terminal 1 Mb on the 43 euchromatic arms represents 1.5% of the genome, but contains ∼14% of the total gaps; nearly all of these gaps are also flanked by segmental duplications (Fig. 5b).
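The degree of enrichment implied by these figures can be made explicit by dividing each region's share of gaps by its share of the genome. The sketch below uses only the percentages quoted above; the function name and the table layout are illustrative assumptions. Note that the text quotes the chromosome 9 figure as a share of physical gaps and the others as shares of gaps more generally, so the three ratios are indicative rather than strictly comparable.

```python
def fold_enrichment(gap_share_pct, genome_share_pct):
    """Ratio of a region's share of gaps to its share of the genome."""
    if genome_share_pct <= 0:
        raise ValueError("genome share must be positive")
    return gap_share_pct / genome_share_pct


# (share of gaps in %, share of genome in %), as quoted in the text.
REGIONS = {
    "chr 9 pericentric (5 Mb on 9p + 4 Mb on 9q)": (12.0, 0.3),
    "proximal 2 Mb on the other 41 arms": (13.3, 2.9),
    "terminal 1 Mb on the 43 arms": (14.0, 1.5),
}

for name, (gap_share, genome_share) in REGIONS.items():
    print(f"{name}: ~{fold_enrichment(gap_share, genome_share):.1f}-fold")
```

On these figures the chromosome 9 pericentric regions are enriched for gaps by roughly 40-fold, compared with roughly 5-fold for other proximal regions and roughly 9-fold for subtelomeric regions.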
Closing the remaining gaps. Although the euchromatic genome sequence has reached a much higher degree of completion than had been anticipated, it still remains incomplete with ∼1% of the euchromatin residing in 308 gaps. These represent regions that could not be reliably mapped, cloned and sequenced with current methods. Rather than applying further brute force, it is now time to develop focused strategies to resolve the regions.
The remaining euchromatic gaps probably reflect two major issues. The first pertains to regions harbouring segmentally duplicated sequence. Such regions are challenging to map because it can be extremely difficult to discern whether two clones with small sequence differences represent different loci or different alleles at a single locus. This challenge was eventually resolved for chromosome Y (ref. 23) (which is especially rich in segmental duplication) by exploiting the fact that the chromosome is haploid in males. By using DNA from a single haploid source, it was possible to rely on differences at only a handful of nucleotides to distinguish repeated sequences. This approach could be applied to the rest of the genome by using appropriate haploid sources, such as a hydatidiform mole or monochromosomal hybrids. (In both instances, use of parental controls to guard against being misled by somatic rearrangements would be well advised.) It may be useful to test these approaches on individual chromosomes. The second issue is that some gaps are likely to correspond to regions that cannot be efficiently propagated in current large-insert vectors and hosts. It may be useful to test new kinds of large-insert libraries for clones containing unique sequences not contained in the current human genome sequence (perhaps seeded by probes derived from random small-insert genomic plasmids, as discussed above). In addition, genome completion may benefit from long-range mapping techniques such as optical mapping38, which may provide independent information about difficult regions.
Completing the euchromatic sequence is an important goal, but is clearly now a research effort rather than a high-throughput project. Sequencing the human heterochromatin poses an even greater challenge. The current human sequence penetrates only the periphery of the heterochromatin (for example, the pericentric regions on a few chromosome arms39,40). This progress has required concerted efforts with specialized mapping techniques and painstaking assembly. The fundamental issue is that current shotgun strategies are poorly suited to assembling large, highly repetitive regions. The hierarchical shotgun strategy faces the challenge of accurate assembly of individual BACs and accurate overlap of BAC clones, with the underlying data consisting of nearly identical sequence; the whole-genome shotgun strategy compounds these problems. Conceivably, the hierarchical strategy could be adapted as was done for repetitive regions of chromosome Y. Approaches might include the use of the following: haploid DNA sources to restrict the problem to a single haplotype; single chromosome sources to avoid confusion among related centromeres on different chromosomes; sheared BAC libraries to avoid biases caused by the unusual distribution of restriction sites within the repeat sequences; assembly based on rare base differences that distinguish near-identical repeats; cloning vectors that minimize rearrangements; and subclone libraries of varying insert lengths. Such an approach will also require ensuring accurate recovery and stability of heterochromatic regions in large-insert clones. Even so, the path is likely to be arduous and expensive to obtain regions of uncertain information content. Alternatively, it may be possible to develop new approaches. These might include methods to obtain much longer effective read lengths, directed reads from known locations and long-range mapping information about the location of rare base differences among repeat copies (such as optical mapping38 or padlock probes41).
Examples of utility of near-complete sequence
The present genome sequence enables far more precise analyses of the human genome, especially those that depend sensitively on high accuracy and near-completeness. Rather than revisit all of the analyses in our initial analysis of the human genome, we have chosen four examples that illustrate the utility of the current near-complete sequence.
The human genome is notable for its high proportion of recent segmental duplications. They are of great medical interest because their unusual structure often predisposes them to deletion or rearrangement with consequent phenotypic effects; prominent examples include the Williams syndrome region (7q), Charcot–Marie–Tooth region (17p), DiGeorge syndrome region (22q) and the AZF-C region (Y)42. Some regions of segmental duplication have also recently been shown to be evolutionary nurseries in which coding sequences are undergoing strong positive selection43. Accurate analysis of segmental duplications was previously impossible because the draft sequence also contained a high degree of artefactual duplication. This difficulty was recognized at the time and the approximate proportion of true and artefactual duplication was inferred indirectly. With near-complete sequence, the artefacts are now largely eliminated and true segmental duplications can be reliably studied.
On the basis of the current sequence, segmental duplications cover ∼5.3% of the euchromatic genome. (Here, segmental duplications are counted as regions that are not transposable element copies, are ≥1 kb in length and have sequence identity ≥90%; this corresponds to duplication within the past ∼40 million years.) The proportion of segmental duplication and the degree of sequence identity are clearly substantially higher in the human genome than in the mouse44 or rat45 genomes (although precise figures for the rodent genomes must await finished sequence). The use of large insert clones, representing a single haplotype, was critical in resolving these regions. The distribution of segmental duplication varies widely across chromosomes, as does the proportion of intrachromosomal versus interchromosomal duplications15 (Fig. 4b). The most extreme case is chromosome Y, which carries segmental duplication along >25% of its total length and includes blocks as large as ∼1.45 Mb with sequence identity of ∼99.97% (ref. 23). In addition, many pericentromeric and subtelomeric regions are rich in dispersed segmental duplications (Fig. 5), apparently resulting from a steady bombardment of insertional translocations46. Although most regions of segmental duplication have now been sequenced, ∼10% of them lie in the remaining gaps in the current sequence and will require further work to elucidate, as discussed above.
A central goal of genome analysis is the comprehensive identification of all human genes. This task remains challenging, but is greatly aided by the near-complete sequence together with other improved resources (such as expanded cDNA collections, genome sequence from other organisms and better computational methods). The current version of the human gene catalogue (Ensembl 34d) contains 22,287 gene loci (with a total of 34,214 transcripts, corresponding to 1.54 transcripts per locus), consisting of 19,438 known genes and 2,188 predicted genes. These gene loci have a total of 231,667 exons, with ∼10.4 exons per locus and ∼9.1 exons per transcript. The total length covered by the coding exons is ∼34 Mb or ∼1.2% of the euchromatic genome; the untranslated regions of the transcripts are estimated to cover another ∼21 Mb or ∼0.7% of the euchromatic genome.
Comparison of the initial and current gene catalogues highlights the substantial improvement. Many of the earlier gene models were erroneous due to defects in the draft sequence. Examples resulting from a duplication, inversion and premature stop codon are shown in Fig. 6. The improvement can be quantified by mapping the current gene models onto the draft sequence, to determine whether they could have been accurately identified. Of the transcripts in the current gene catalogue, 58% have at least one error when mapped onto the draft sequence. For 39% of transcripts, there is at least one exon that is absent or incorrectly ordered due to defects in the draft. For the remaining 19% of transcripts, the exons are all present and correctly ordered, but there are one or more nucleotide errors.
Automated gene annotation has now been complemented by manual annotation of most chromosomes, based on a careful review of gene structure and examination of expressed-sequence-tag (EST) and transcript evidence. Such analysis has been completed for 18 chromosomes (2, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 17, 18, 19, 20, 21, 22, X and Y; refs 17, 20–30 and http://vega.sanger.ac.uk/Homo_sapiens/; the remainder are in the press or in preparation), comprising 1.7 Gb of the euchromatic genome. Although this annotation has further improved the quality of the gene models47 (by dealing with special cases and unusual features not yet handled in the automated programs, and resolving instances of conflicting experimental data), it has not significantly affected the total gene count for these chromosomes.
On the basis of available evidence, our best estimate is that the total number of protein-coding genes is in the range 20,000–25,000. The lower bound seems secure, based on the number of currently known genes (19,599). The upper bound is based on estimates of the number of additional genes. Despite intense automated and manual analysis using cDNA, EST and cross-species homology, only 2,188 gene predictions have been added to the known set. This predicted set is likely to represent substantially fewer than 2,000 true genes, owing to fragmentation and false predictions arising from pseudogenes. For example, the predictions tend to have fewer exons per transcript than known genes (∼4.7 versus ∼9.7) and to encode shorter open reading frames (∼847 versus ∼1,487 amino acids). On the other hand, the set is likely to be incomplete because some protein-coding genes have surely continued to escape detection. The most problematic cases would be genes that have very short open reading frames (<100 amino acids), consist of single exons or evolve very rapidly. Even if we assume that such genes comprise 10% of the total (which seems a generous overestimate, given our current understanding of the human and other genomes48,49,50), the total gene count would remain below 25,000. The range of 20,000–25,000 is also consistent with recent estimates (J. Weissenbach, unpublished) of the number of protein-coding genes based on cross-species homology (using the Exofish method51).
In our initial analysis of the draft sequence15, we estimated the count of human protein-coding genes at roughly 30,000. The estimate was derived as follows. We used computational analysis to generate an initial gene catalogue with ∼32,000 entries, consisting of ∼15,000 known genes and ∼17,000 gene predictions. We estimated that the catalogue corresponded to ∼24,500 actual genes, based on estimates of the rate of various types of errors such as fragmentation and false positive predictions (due largely to limitations of the draft sequence, such as imperfect recognition of pseudogenes and unknown order and orientation). We then adjusted the estimate to account for the proportion of genes estimated to be absent from the initial catalogue due to incomplete coverage of the genome and imperfect computational methods, resulting in a figure of ∼31,000 genes.
With the current high-quality sequence, it is now possible to revisit this earlier analysis. We directly compared the previous gene models with the current gene models, to determine whether our previous estimates of the various error rates were correct. It is clear that the main reason for the earlier overestimate is that the fragmentation rate was substantially underestimated. The fragmentation rate is defined as the average number of the previous gene models that map to the same true gene; we assessed it by mapping to the current gene catalogue. The fragmentation rates for ‘known’ and ‘predicted’ genes were estimated in our earlier paper15 at ∼1.0 and ∼1.4, whereas our current analysis indicates that they should have been ∼1.3 and ∼1.7. This correction alone would bring our previous estimate to ∼24,000. Small differences in the estimated rate of false positive and negative predictions account for the remainder of the discrepancy.
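The effect of the revised fragmentation rates can be illustrated with a back-of-envelope calculation. The Python sketch below divides the approximate draft-catalogue counts by each pair of fragmentation rates and rescales the earlier ∼31,000 figure by the resulting ratio. This proportional rescaling is an assumption made here for illustration: it ignores the false-positive and coverage adjustments that were part of the original estimate, so it is expected to match the quoted ∼24,000 only roughly.

```python
def genes_represented(n_models: float, fragmentation_rate: float) -> float:
    """Number of true genes represented by n_models, when each true gene is
    split on average into `fragmentation_rate` models (definition in the text)."""
    return n_models / fragmentation_rate


# Approximate draft-catalogue counts quoted in the text.
known_models = 15_000
predicted_models = 17_000

# Fragmentation rates for 'known' and 'predicted' genes, as quoted in the text.
earlier = genes_represented(known_models, 1.0) + genes_represented(predicted_models, 1.4)
revised = genes_represented(known_models, 1.3) + genes_represented(predicted_models, 1.7)

# ASSUMPTION: rescale the earlier ~31,000 estimate by the ratio of the two totals,
# leaving every other correction factor unchanged.
scaling = revised / earlier
rescaled_estimate = 31_000 * scaling

print(round(earlier), round(revised), round(rescaled_estimate))
```

Under these simplifying assumptions the rescaled figure falls between 24,000 and 25,000, in line with the statement that the fragmentation correction accounts for most of the discrepancy.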
It should be emphasized that the count above refers only to protein-coding genes. It does not include known non-coding RNAs, such as transfer RNAs, ribosomal RNAs, small nucleolar RNAs (snoRNAs) and microRNAs52,53,54. In addition, there is evidence that the human genome gives rise to many additional RNA transcripts55. It is unclear whether most such transcripts have specific biological functions or reflect reproducible transcriptional noise; few contain substantial open reading frames and thus they are unlikely to encode proteins. There is a need for reliable experimental and computational methods for comprehensive identification of non-coding RNAs.
Finally, the near-complete sequence makes it possible to undertake systematic searches for pseudogenes. Automated annotation of chromosomes has focused primarily on identifying large pseudogenes of more recent origin. Recent published studies have used more sensitive methods to detect smaller and older pseudogenes and have already identified ∼20,000 processed and unprocessed pseudogenes56. This is surely still an underestimate, because such analysis will miss pseudogenes that are extremely old or that contain primarily untranslated regions. The total number of pseudogenes is thus likely to exceed the total number of functional genes. A particular type of pseudogene (recently arising non-processed pseudogenes) is discussed in more detail below.
Gene birth in the human lineage
The birth of new genes is of interest because it provides raw material for adaptive evolution, with extra copies of genes able to undergo functional divergence in response to positive selection. The quality and completeness of the current sequence make it possible to study this question; such analysis would have been unreliable with the earlier draft sequence, because the extensive artefactual local duplication would have given rise to many false positives.
We searched for clusters of nearby homologous genes, indicative of local gene duplication. The divergence between such genes was assessed at sites likely to be selectively neutral, by measuring the estimated substitution rate per synonymous site (KS). We looked for nearby human gene pairs differing from one another by KS < 0.30, implying that each differs from the common ancestral source gene by an average KS < 0.15. This threshold corresponds roughly to duplications arising after divergence from the rodent lineage, either by recent gene duplication or perhaps recent gene conversion of older duplications (see Methods in Supplementary Information). A total of 1,183 genes exhibit such divergence from a neighbouring gene (see Methods in Supplementary Information). These genes often fall within larger clusters of paralogous genes including genes with greater divergence and reflecting older duplications. These clusters contain ∼3,300 genes, and those having at least five genes involved in recent duplication events are shown in Table 4. Analysis of phylogenetic trees containing the related human and mouse genes confirms that the genes are more closely related within each species than between the two species in nearly all cases (97%), as would be expected for genes arising by duplication after the divergence of the human and rodent lineages.
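The selection step described above can be sketched as a threshold filter on pairwise KS values. In the Python sketch below, the gene names and KS values are invented for illustration, and the input is assumed to be a precomputed list of neighbouring homologous pairs; estimating KS itself and defining "nearby" follow the Methods in Supplementary Information and are not reproduced here.

```python
KS_PAIR_THRESHOLD = 0.30  # pairwise threshold stated in the text


def per_lineage_ks(pair_ks: float) -> float:
    """Average divergence of each copy from the common ancestral gene."""
    return pair_ks / 2


def recently_duplicated_genes(pairs):
    """Return the set of genes in at least one neighbouring pair with KS < 0.30.

    `pairs` is an iterable of (gene_a, gene_b, ks) tuples (hypothetical format).
    """
    recent = set()
    for gene_a, gene_b, ks in pairs:
        if ks < KS_PAIR_THRESHOLD:
            recent.update((gene_a, gene_b))
    return recent


# Invented example pairs.
pairs = [
    ("geneA1", "geneA2", 0.012),  # very recent duplicate
    ("geneA2", "geneA3", 0.45),   # older duplication in the same cluster
    ("geneB1", "geneB2", 0.29),   # just under the threshold
    ("geneC1", "geneC2", 0.30),   # not strictly below the threshold
]
recent = recently_duplicated_genes(pairs)
print(sorted(recent))
```

In this example `geneA3` belongs to the same cluster as `geneA1` and `geneA2` but is not counted as recently duplicated, mirroring the distinction in the text between the recently duplicated genes and the larger paralogous clusters that contain them.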
The recent duplications are enriched in genes with immune and olfactory function, as well as those likely to be involved in reproductive functions. For example, the gene families encoding the pregnancy-specific beta-1-glycoprotein and choriogonadotropin beta proteins may be involved in the extended gestational period in the human lineage; the latter family is known to have expanded recently within the catarrhine primate lineage57. Another example is the family of cancer/testis (CT) antigen genes, which are normally expressed in the testis and are highly expressed in carcinomas58.
The distribution of KS values (Fig. 7) for recent duplications shows a striking excess of genes with strong similarity (KS ≤ 0.015), corresponding to recent events occurring ∼3–4 million years ago. There are several possible explanations for this peak. First, it may reflect a true explosion in the rate of gene duplication in the primate lineage. (The primate lineage does show an increase in the rate of dispersed segmental duplication, although it is less extreme; the rate of local duplication will need to be carefully evaluated in comparative studies.) Second, it may partly reflect the ongoing process of gene conversion of older gene duplication events. However, we offer a third explanation: the peak primarily reflects a transient population of duplicated genes that are too young, relative to the characteristic time of deletion, to have been removed. If so, most of these new genes are destined to be culled due to lack of functional benefit. In contrast to the first explanation, this would predict that a similar peak would be seen in most mammals.
Gene death in the human lineage
Gene death is another phenomenon that sheds light on lineage-specific evolution, but which was difficult to analyse with the earlier draft sequence. To study gene death, we scanned the genome sequence for recently arising non-processed pseudogenes—that is, nearly intact human genes that appear to have recently acquired an inactivating mutation. Specifically, we examined genomic intervals bounded at each end by two consecutive genes, with each belonging to a 1:1:1 orthology triplet in the human, mouse and rat genomes and the interval containing at most 50 genes (see Methods in Supplementary Information). We then examined the Ensembl gene predictions in the corresponding intervals of the three genomes and identified instances in which the mouse and rat genomes contained 1:1 orthologues, but the human genome appeared to contain no predicted orthologous gene. In each instance, the rodent genes were aligned to the corresponding human genomic interval to look for clear evidence of a human pseudogene—that is, a highly similar sequence containing one or more inactivating mutations in its genomic sequence (see Methods in Supplementary Information). We also required that the inactivating mutation was present in any human mRNA sequences corresponding to the locus. (This analysis excludes many older pseudogenes that do not show sufficient similarity to the rodent homologues because they have substantially degenerated.)
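Two steps of the scan described above lend themselves to a short sketch: selecting rodent orthologue pairs that lack a predicted human counterpart, and checking an aligned coding sequence for a premature stop codon. The Python code below is a minimal illustration under stated assumptions: the `OrthologyRow` record, the gene identifiers and the example sequences are hypothetical, and frameshift detection, alignment and mRNA confirmation are not covered.

```python
from typing import List, NamedTuple, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


class OrthologyRow(NamedTuple):
    """Hypothetical row: one mouse-rat 1:1 orthologue pair within an interval."""
    interval_id: str
    mouse_gene: str
    rat_gene: str
    human_gene: Optional[str]  # None when no human orthologue is predicted


def candidate_loci(rows: List[OrthologyRow]):
    """Rodent pairs with no predicted human orthologue in the same interval."""
    return [(r.interval_id, r.mouse_gene, r.rat_gene) for r in rows if r.human_gene is None]


def premature_stop_positions(cds: str) -> List[int]:
    """0-based codon indices of in-frame stop codons before the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [i for i in range(n_codons - 1) if cds[3 * i:3 * i + 3] in STOP_CODONS]


# Invented example data.
rows = [
    OrthologyRow("int1", "Mm_g1", "Rn_g1", "Hs_g1"),
    OrthologyRow("int2", "Mm_g2", "Rn_g2", None),
]
candidates = candidate_loci(rows)
stops = premature_stop_positions("ATGAAATGACCCTAA")  # codons: ATG AAA TGA CCC TAA
print(candidates, stops)
```

A candidate selected in this way would still need to be aligned to the human genomic interval and checked against human mRNA sequences, as described in the text, before being called a pseudogene.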
A total of 37 candidate pseudogenes were identified, with an average of 0.8 premature stop codons and 1.6 frameshifts (Supplementary Table 1). (Similar analyses performed on the draft sequence yielded a much larger list, including many apparent inactivating mutations that were errors and were corrected in the current sequence.) We carefully examined these candidates to confirm that they did not reflect errors in the current genome sequence (by resequencing or examination of an independently finished clone) and to determine their evolutionary origin (by resequencing in a panel of 24 diverse humans and comparison with a draft sequence of the chimpanzee genome). Complete experimental data could be obtained for 34 cases. The identification of a pseudogene was confirmed in 33 of the 34 cases; one case was due to an error in the current sequence (Table 5). The 19 pseudogenes with two or more inactivating mutations were all found to be pseudogenes in chimpanzee as well. The 14 pseudogenes with exactly one inactivating mutation fell into the following three classes: eight pseudogenes shared with chimpanzee; five pseudogenes fixed in the human population but functional genes in the chimpanzee; and one pseudogene that is a segregating polymorphism in the human population. (In 20 cases, the inactivating mutation occurs in the final or only exon. Although this could in principle be compatible with a functional gene, the truncation removes a functionally important domain in all but one case.)
Of the 32 pseudogenes fixed in the human population, 10 are derived from olfactory receptors. Olfactory receptors thus occur prominently in both birth and death analyses, indicating a dynamic expansion and contraction of this large gene family; the net effect has been an overall significant decrease in the number of functional olfactory receptors in humans compared with rodents59,60. The remaining 22 recent pseudogenes include a wide variety, such as genes homologous to a cationic amino-acid transporter, a serine-threonine kinase, a calreticulin, a putative G-protein coupled receptor and a cystatin.
The Human Genome Project marked a new approach in biomedical research, one in which the scientific community came together to characterize systematically a large domain of important biological knowledge. Because the precise scientific plan and the feasible degree of accuracy and completeness were unclear at the outset, the sequencing of the human genome proceeded in phases: a preliminary phase that developed and refined key approaches; a draft phase that yielded ∼90% of the information (albeit in imperfect form); and a finishing phase reported here that yielded ∼99% in high-quality form. Notably, the finishing phase required roughly as much time and expense as the draft phase.
The sequence of the euchromatic portion of the human genome is still not complete, with ∼1% remaining to be determined. The issue is no longer scale, but rather the need for new approaches to understand and resolve these recalcitrant segments. Continuing efforts should be devoted towards the eventual goal of complete closure. Nonetheless, the euchromatic human genome can now be regarded as effectively known. The accuracy and completeness of the current near-complete human genome sequence have important consequences for biomedical research. The sequence allows systematic searches for the causes of disease (for example, to find all key heritable factors predisposing to diabetes or somatic mutations underlying breast cancer) with confidence that little can escape detection. It facilitates experimental tools to recognize cellular components (for example, detectors for mRNAs based on specific oligonucleotide probes or mass-spectrometric identification of proteins based on specific peptide sequences) with confidence that these features provide a unique signature. It allows sophisticated computational analyses (for example, to study genome structure and evolution) with confidence that subtle results will not be swamped or swayed by noisy data. At a practical level, it eliminates tedious confirmatory work by researchers, who can now rely on highly accurate information. At a conceptual level, the near-complete picture makes it reasonable for the first time to contemplate systems approaches to cellular circuitry, without fear that major components are missing.
The HGP provides an essential foundation for the sequencing and analysis of additional large genomes. With the experience gained from the human genome, it has already become scientifically and economically feasible to produce draft genome sequence from many vertebrates, which will be a crucial tool for identifying the functional elements in the human genome through comparative analysis. Ultimately, we believe that such projects should aim higher to produce genome sequence with even greater accuracy and completeness. This will require digesting the diverse experience from the finishing phase of human sequencing and selecting a subset of techniques that can be most efficiently streamlined and scaled up to improve accuracy and completeness of genome sequence. A good example is the systematic closure of gaps by primer-directed walking on fosmid templates covering each gap, which may be able to close the vast majority of gaps in a draft sequence in an automated fashion.
More generally, the HGP demonstrates the tremendous potential value of coordinated projects to create community resources to propel biomedical research. Key challenges that lie ahead61 include: (1) systematic identification of all genetic polymorphisms carried in the human population, to facilitate the study of their association with disease; this will require comprehensive study of hundreds to thousands of human genomes. (2) Systematic identification of all functional elements in the human genome, including genes, proteins, regulatory controls and structure elements; this will require comparative analysis with many additional mammalian genomes and systematic application of diverse experimental techniques. (3) Systematic identification of all the ‘modules’ in which genes and proteins function together; this will require comprehensive study and improved interpretation of expression, localization and interaction in a temporal and spatial context. Absolute completeness will be elusive but, as with the HGP, obtaining the substantial majority of the information will greatly accelerate the pace of biomedical research in thousands of laboratories.
Correspondence and requests for materials should be addressed to F. S. Collins (firstname.lastname@example.org), E. S. Lander (email@example.com), J. Rogers (firstname.lastname@example.org) or R. H. Waterston (email@example.com). The sequence described here has been deposited in public databases, with the 24 human chromosomes having accession numbers NC_000001 to NC_000024.
We thank D. Leja for graphic design and production of the figures. We would also like to thank the many dedicated support staff at the sequencing centres and funding agencies. In addition to finished sequence produced by the IHGSC between the time of publication of the draft human genome and April 2003, Build 35 contains some significant published finished sequence from other centres as listed in Table 1—for this we would like to acknowledge M. Adams, B. Roe and G. Evans. Build 35 also contains a small number of individual finished deposited accessions from a variety of other groups, which may or may not have been published; we would like to acknowledge all of this work. We acknowledge G. Sisk for help in preparing the manuscript. This work was supported by The Wellcome Trust; The US National Institutes of Health; The US Department of Energy; The Ministry of Education, Culture, Sports, Science and Technology, Japan; The Federal German Ministry of Education, Research, and Technology; Projektträger Biologie, Energie, Umwelt des BMBF und BMWT; the Max-Planck-Society; Deutsche Forschungsgemeinschaft; Thüringer Ministerium für Wissenschaft, Forschung, und Kunst; The Medical Research Council (UK); European Commission, Directorate Science, Research and Development; Chinese Academy of Sciences, Ministry of Science and Technology, National Natural Science Foundation of China.
A summary of evidence supporting the discussion in the main text regarding human pseudogenes.