  • Review Article
An assessment of the sequence gaps: Unfinished business in a finished human genome

Key Points

  • The finished human genome sequence contains two types of gap: those that are associated with heterochromatic sequences and those that are embedded in euchromatin.

  • Approximately 50% of the gaps in euchromatin are flanked by duplications with a high degree of sequence identity. Transition regions between euchromatin and heterochromatin are particularly enriched for such duplications. Not all of these regions are recalcitrant to subcloning.

  • Regions of duplication are sites of genetic instability and large-scale structural polymorphism. These properties have complicated their sequencing and assembly.

  • Specialized genomic technologies, including the construction of hydatidiform mole bacterial artificial chromosome (BAC) libraries, optical mapping, half-YAC (yeast artificial chromosome) mapping and transformation-associated recombination, might resolve a fraction of the euchromatic gaps.

  • Closure of the remaining 1% of the euchromatin will improve gene and SNP annotation in the human reference genome sequence.

Abstract

Biological research increasingly depends on 'finished' genome sequences. Deducing what is absent from these sequences is not trivial. More than 99% of the euchromatic portion of the human genome is now represented as a high-quality finished sequence with each base ordered and oriented. However, two principal types of gap remain: heterochromatic (estimated to be 200 Mb) and euchromatic (23.0 Mb) gaps. Here, we use various global sources of data to help understand the nature of the gaps in the finished human genome. Not all gaps are recalcitrant to subcloning, nor are most heterochromatic. The presence of recent segmental duplications is the most important predictor of gap location in euchromatic sequences. The resolution of these regions remains an important challenge for the completion of the human genome, gene annotation and SNP assignment.
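The association between gaps and segmental duplications can be made concrete by counting how many gaps have a duplication within some distance of either edge. The following is a minimal sketch of that idea, not the authors' method: the function names, the 50-kb window and all coordinates are hypothetical, and the toy data were chosen so that half of the gaps are flanked. It says nothing about the real genome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, half-open, in bp


def is_flanked(gap: Interval, dups: List[Interval], window: int) -> bool:
    """Return True if any duplication overlaps the window on either side of the gap."""
    gap_start, gap_end = gap
    left = (max(0, gap_start - window), gap_start)
    right = (gap_end, gap_end + window)
    for d_start, d_end in dups:
        for w_start, w_end in (left, right):
            # Standard half-open interval overlap test.
            if d_start < w_end and w_start < d_end:
                return True
    return False


def fraction_flanked(gaps: List[Interval], dups: List[Interval], window: int) -> float:
    """Fraction of gaps with a duplication within `window` bp of either edge."""
    if not gaps:
        return 0.0
    return sum(is_flanked(g, dups, window) for g in gaps) / len(gaps)


# Toy coordinates, invented for illustration only (assumption, not real data).
gaps = [
    (1_000_000, 1_050_000),
    (5_000_000, 5_010_000),
    (9_000_000, 9_100_000),
    (12_000_000, 12_020_000),
]
dups = [
    (980_000, 1_000_000),    # abuts the left edge of the first gap
    (9_100_000, 9_150_000),  # abuts the right edge of the third gap
    (7_000_000, 7_040_000),  # far from every gap
]

print(fraction_flanked(gaps, dups, window=50_000))  # 0.5 for this toy input
```

In practice the window size and the identity threshold used to call a duplication would both affect the resulting fraction, so any real analysis would need to state them.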

Figure 1: Chromosomal distribution of sequence gaps.
Figure 2: Duplications and sequence gaps.
Figure 3: In silico versus cytogenetic analysis of the human genome.
Figure 4: Gaps, duplications and structural variation.
Figure 5: SNP enrichment in duplicated sequences.

Acknowledgements

We would like to thank F. Collins, A. Felsenfeld, B. Waterston and E. Lander for helpful comments in the preparation of this manuscript. This work was supported, in part, by grants from the National Institutes of Health to E.E.E.

Author information

Corresponding author

Correspondence to Evan E. Eichler.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

DATABASES

LocusLink

TPTE

FURTHER INFORMATION

GenBank Database

Human Paralogy Server Segmental Duplication Database

Proposal for Construction of a Human Haploid BAC Library from Hydatidiform Mole Source Material

University of California Santa Cruz Genome Bioinformatics

Glossary

HETEROCHROMATIN

Parts of chromosomes with an unusual degree of contraction and that consequently have different staining properties from the euchromatin at nuclear divisions. Largely composed of repetitive DNA, heterochromatin forms dark bands after Giemsa staining.

EUCHROMATIN

Parts of chromosomes that show the normal cycle of condensation and normal staining properties at nuclear divisions. Euchromatin generally contains active or potentially active genes and falls within light bands after Giemsa staining.

SATELLITE DNA

Various classes of highly repetitive DNA that are tandemly repeated and most often associated with centromeric or pericentromeric regions of the genome. α-Satellite DNA is a class of centromeric satellite in which the monomeric unit is 171 bp. Higher-order structures of this satellite define the DNA component of human centromeres. β-Satellite DNA is a class of pericentromeric satellite in which the basic repeat unit is a 68-bp monomer.

ACROCENTRIC

A chromosome in which the centromere is located subterminally, and concomitantly the chromosome arms are unequal in length.

PARALOGOUS

The quality of having sequence similarity as a result of duplication.

CONTIG

A set of contiguous overlapping clones that span a genomic region.
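The definition above can be sketched in code: overlapping clone intervals merge into contigs, and whatever lies between adjacent contigs is a gap. This is an illustrative sketch only; the function names and clone coordinates are hypothetical, and real assemblies rely on sequence overlap rather than map coordinates alone.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) in bp, half-open


def build_contigs(clones: List[Interval]) -> List[Interval]:
    """Merge clones that truly overlap into contigs (abutting clones are not merged)."""
    contigs: List[Interval] = []
    for start, end in sorted(clones):
        if contigs and start < contigs[-1][1]:
            # Clone overlaps the current contig: extend the contig if needed.
            contigs[-1] = (contigs[-1][0], max(contigs[-1][1], end))
        else:
            # No overlap with the previous contig: start a new one.
            contigs.append((start, end))
    return contigs


def gaps_between(contigs: List[Interval]) -> List[Interval]:
    """Return the uncovered intervals between consecutive contigs."""
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(contigs, contigs[1:])]


# Hypothetical clone placements, for illustration only.
clones = [(0, 150_000), (120_000, 300_000), (340_000, 500_000), (480_000, 620_000)]
contigs = build_contigs(clones)
print(contigs)                # [(0, 300000), (340000, 620000)]
print(gaps_between(contigs))  # [(300000, 340000)]
```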

MULTISITE SIGNALS

Multiple fluorescence in situ hybridization (FISH) signals on metaphase chromosomal preparations.

GIEMSA

A cytogenetic stain that is applied to metaphase chromosomes after limited digestion with trypsin. Individual chromosomes are distinguished on the basis of a characteristic banding pattern of dark and light bands.

MONOCHROMOSOMAL HYBRID

A cell line that carries a single, intact human chromosome in a rodent somatic cell background.

FOSMID

A low-copy vector for the construction of stable genomic libraries that uses the Escherichia coli F-factor origin for replication.

NON-ROBERTSONIAN TRANSLOCATION

A chromosomal rearrangement that is characterized by the fusion of different chromosomes, but does not involve the fusion of whole long arms of two acrocentric chromosomes (chromosomes 13, 14, 15, 21 and 22 in humans).

PALINDROME

DNA sequence in which the 5′–3′ composition is identical on each strand with respect to its midpoint.
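In computational terms, this definition means that a sequence is a palindrome when it equals its own reverse complement. A minimal sketch follows; the helper names are illustrative, and only the four unambiguous bases are handled.

```python
# Watson-Crick complement for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, read 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindrome(seq: str) -> bool:
    """True if seq reads the same 5'->3' on both strands."""
    return seq.upper() == reverse_complement(seq)


print(is_palindrome("GAATTC"))   # True: the EcoRI recognition site
print(is_palindrome("GATTACA"))  # False
```

Note that this differs from a text palindrome: GAATTC is not the same read backwards, but its complementary strand read 5′–3′ is also GAATTC.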

ORTHOLOGOUS

The quality of having sequence similarity as a result of speciation.

BALANCING SELECTION

Natural selection in which heterozygotes have increased evolutionary fitness with respect to either homozygous condition.

SPHEROPLAST FUSION

A method for transferring DNA into cells or between cells whereby the host cell wall is removed (spheroplast) before polyethylene glycol fusion.

TRANSFORMATION-ASSOCIATED RECOMBINATION-BASED GENOMIC LIBRARIES

Genomic libraries that are constructed by a transformation-associated recombination (TAR) cloning system (see box 1).

About this article

Cite this article

Eichler, E., Clark, R. & She, X. An assessment of the sequence gaps: Unfinished business in a finished human genome. Nat Rev Genet 5, 345–354 (2004). https://doi.org/10.1038/nrg1322

  • DOI: https://doi.org/10.1038/nrg1322
