Resolving the complexity of the human genome using single-molecule sequencing

Journal name: Nature
Volume: 517
Pages: 608–611
DOI: 10.1038/nature13907

The human genome is arguably the most complete mammalian reference assembly (refs 1–3), yet more than 160 euchromatic gaps remain (refs 4–6) and aspects of its structural variation remain poorly understood ten years after its completion (refs 7–9). To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time (SMRT) DNA sequencing (ref. 10). We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats (STRs), often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long STRs. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.
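As a small illustration of the (G+C) content measure mentioned above, the following Python sketch computes the G+C fraction of a sequence in sliding windows. The function names, window size and step are illustrative assumptions and are not taken from the study's pipeline.

```python
def gc_fraction(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T).

    Returns None when the sequence has no unambiguous bases (e.g. all N).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq, window=100, step=50):
    """Yield (start, gc_fraction) for each full window along seq.

    window and step are hypothetical defaults chosen for illustration.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_fraction(seq[start:start + window])
```

Comparing each window against a genome-wide G+C fraction would, under these assumptions, flag (G+C)-rich flanks and (A+T)-rich repeat cores of the kind the abstract describes.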

Figures

  1. Figure 1: Sequence content of gap closures.

    a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.

  2. Figure 2: Structural variation analyses.

    a, Histograms display the distribution of novel insertions (black/grey) and deletions (red/pink) between CHM1 and GRCh37 haplotypes compared to copy number variants identified from other studies for insertions and deletions less than 1 kb (left) and greater than or equal to 1 kb (right). Most of the increased sensitivity occurs below 5 kb. Peaks at ~300 bp and 6 kb correspond to Alu and L1 insertions, respectively. b, STR insertions in CHM1 (green) are longer than their counterparts in the human reference genome (blue; GRCh37), and this effect becomes more pronounced with increasing length (x axis). c, The percentage repeat composition (x axis) of 1-kb sequences flanking insertion sites for Alu, L1 and SVA mobile element insertions. Insertion calls from the 1000 Genomes Project (pink; ref. 21) compared to calls from CHM1 using SMRT reads (blue) show increased sensitivity for repeat-rich insertions.

  3. Figure 3: CHM1 clone-based assembly of the human 10q11 genomic region.

    The clone-based assembly is composed primarily of BACs from the CH17 library as shown in the tiling path below the internal repeat structure of the region. Coloured arrows indicate large segmental duplications (SDs) with homologous sequences connected by lines generated by Miropeats (ref. 23).

  4. Extended Data Fig. 1: Sequence content of gap closures.

    a–c, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37; examples of the organization of these regions are shown using Miropeats for chromosome 4 (GRCh37, chr4:59724333–59804333) (a), chromosome 11 (GRCh37, chr11:87673378–87753378) (b), and chromosome X (GRCh37, chrX:143492324–143572324) (c). Dotplots show the architecture of the degenerate STRs with the core motif highlighted below. Shared sequence motifs between blocks are indicated by colour.

  5. Extended Data Fig. 2: Variant detection pipeline.

    At every variant locus, we collected the full-length reads that overlap the locus, performed de novo assembly using the Celera assembler, and called a consensus using Quiver after remapping reads used in the assembly as well as reads flanking the assembly (yellow reads) to increase consensus quality at the boundaries of the assembly. BLASR is used to align the assembly consensus sequences to the reference, and insertions and deletions in the alignments are output as variants. Reads spanning a deletion event within a single alignment are shown as bars connected by a solid line, and double hard-stop reads spanning a larger deletion event and split into two separate alignments of the same read are shown as a dotted line.

  6. Extended Data Fig. 3: Genome distribution of closed gaps and insertions.

    Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap sequences (red), inversions (green), and complex events (blue).

  7. Extended Data Fig. 4: Confirmation of complex insertions in additional genomes.

    Top, genotypes of polymorphic complex regions using read depth of unique k-mers (blue: present; white: absent). Bottom, extended examples of complex insertion events: alignment to chimpanzee panTro4 reference (dark blue); existing human reference hg19 (light teal); inserted sequence (dark teal). The bottom rows show repeat annotations, with darker hues for repeats overlapping the inserted region.

  8. Extended Data Fig. 5: Inversion validation by BAC-insert sequencing.

    Inversions detected by alignment of single long reads were validated by sequencing clones from the CHM1 BAC library (CHORI17), in which end mappings to GRCh37 spanned the putative inversions. Inversions were validated by aligning the corresponding BAC sequences to GRCh37 with Miropeats. Shared sequence between the BACs and GRCh37 is shown in black; inversion events are indicated in red.

  9. Extended Data Fig. 6: CHM1 clone-based assembly of the human 10q11 genomic region.

    a, The clone-based assembly is composed primarily of BACs from the CH17 library as shown in the tiling path below the internal repeat structure of the region. Coloured arrows indicate large segmental duplications with homologous sequences connected by coloured lines (Miropeats). Genes annotated from alignment of RefSeq messenger RNA sequences with GMAP (ref. 27) are shown. b, Miropeats comparisons of the 10q11 clone-based assembly against the corresponding sequence from GRCh37, with gaps shown in red, highlight the degree to which the reference was misassembled.
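
The variant detection pipeline caption (Extended Data Fig. 2) states that insertions and deletions in the alignments of assembly consensus sequences to the reference are output as variants. A minimal Python sketch of that last step follows, assuming the alignments are available with SAM-style CIGAR strings. The function name `indels_from_cigar` and the `min_len` threshold of 50 bases are hypothetical choices for illustration, not details given in the source.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(ref_start, cigar, min_len=50):
    """Return (type, ref_pos, length) for each insertion or deletion
    of at least min_len bases (min_len is an assumed cut-off).

    ref_start is the reference coordinate where the alignment begins.
    """
    variants = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # Insertions consume query bases only; reference position is unchanged.
            if length >= min_len:
                variants.append(("insertion", ref_pos, length))
        elif op == "D":
            if length >= min_len:
                variants.append(("deletion", ref_pos, length))
            ref_pos += length
        elif op in "M=XN":
            # Matches, mismatches and skips advance along the reference.
            ref_pos += length
        # S, H and P do not consume reference bases.
    return variants
```

For example, `indels_from_cigar(1000, "100M60I200M75D50M")` reports a 60-base insertion at reference position 1100 and a 75-base deletion at position 1300.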

References

  1. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
  2. The International HapMap Project Consortium. The International HapMap Project. Nature 426, 789–796 (2003)
  3. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004)
  4. Kurahashi, H. et al. Molecular cloning of a translocation breakpoint hotspot in 22q11. Genome Res. 17, 461–469 (2007)
  5. Genovese, G. et al. Using population admixture to help complete maps of the human genome. Nature Genet. 45, 406–414 (2013)
  6. Bovee, D. et al. Closing gaps in the human genome with fosmid resources generated from multiple individuals. Nature Genet. 40, 96–101 (2008)
  7. Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65 (2011)
  8. Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837–847 (2010)
  9. Eichler, E. E., Clark, R. A. & She, X. An assessment of the sequence gaps: unfinished business in a finished human genome. Nature Rev. Genet. 5, 345–354 (2004)
  10. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009)
  11. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012)
  12. Lee, H. & Schatz, M. C. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score. Bioinformatics 28, 2097–2105 (2012)
  13. Myers, E. W. et al. A whole-genome assembly of Drosophila. Science 287, 2196–2204 (2000)
  14. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10, 563–569 (2013)
  15. Huddleston, J. et al. Reconstructing complex regions of genomes using long-read sequencing technology. Genome Res. 24, 688–696 (2014)
  16. Kimelman, A. et al. A vast collection of microbial genes that are toxic to bacteria. Genome Res. 22, 802–809 (2012)
  17. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001)
  18. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001)
  19. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010)
  20. Kong, A. et al. A high-resolution recombination map of the human genome. Nature Genet. 31, 241–247 (2002)
  21. Stewart, C. et al. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 7, e1002236 (2011)
  22. Steinberg, K. M. et al. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. (in press)
  23. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995)
  24. Jurka, J., Klonowski, P., Dagman, V. & Pelton, P. CENSOR–a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20, 119–121 (1996)
  25. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-3.0 http://www.repeatmasker.org (1996–2010)
  26. Adey, A. et al. Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 11, R119 (2010)
  27. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005)

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA

    • Mark J. P. Chaisson,
    • John Huddleston,
    • Megan Y. Dennis,
    • Peter H. Sudmant,
    • Maika Malig,
    • Fereydoun Hormozdiari,
    • Richard Sandstrom,
    • John A. Stamatoyannopoulos &
    • Evan E. Eichler
  2. Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA

    • John Huddleston &
    • Evan E. Eichler
  3. Dipartimento di Biologia, Università degli Studi di Bari ‘Aldo Moro’, Bari 70125, Italy

    • Francesca Antonacci
  4. Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA

    • Urvashi Surti
  5. Pacific Biosciences of California, Inc., Menlo Park, California 94025, USA

    • Matthew Boitano,
    • Jane M. Landolin,
    • Michael W. Hunkapiller &
    • Jonas Korlach

Contributions

E.E.E., M.J.P.C., M.Y.D., J.H. and J.K. designed experiments; M.M. prepared DNA; M.M. and M.B. prepared libraries and generated sequence data; P.H.S., J.H. and M.Y.D. identified clones for sequencing; J.H., P.H.S., M.Y.D., F.H. and M.J.P.C. performed bioinformatics analyses; M.Y.D., F.A. and M.M. performed targeted sequencing of clones; M.J.P.C. designed algorithms and pipelines for mapping SMRT sequence data and detection of structural variants; M.W.H., U.S., R.S. and J.A.S. provided access to critical resources; J.M.L. deposited SMRT sequence data into SRA; M.J.P.C., J.H. and E.E.E. wrote the manuscript.

Competing financial interests

M.B., J.M.L., M.W.H. and J.K. are employees of Pacific Biosciences, Inc., a company commercializing DNA sequencing technologies; E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc. and was formerly an SAB member of Pacific Biosciences, Inc. (2009–2013) and SynapDx Corp. (2011–2013); and M.J.P.C. is a former employee of Pacific Biosciences, Inc.

Corresponding author

Correspondence to:

All underlying SMRT WGS read data have been released within the NCBI Sequence Read Archive (SRA) under accession SRX533609 and may also be accessed as part of all the SMRT data sets (NCBI SRA accession SRP040522). Illumina WGS data for CHM1 are available in the NCBI SRA under accession SRP044331. Finished BAC and fosmid clone inserts sequenced using SMRT data are also available (GenBank accessions in Supplementary Table 35). For the purpose of mapping and annotation, we developed a patched GRCh37 reference genome including a track hub for upload into the UCSC Genome Browser. A complete list of all inaccessible regions of the human genome and a database of heterochromatic and subtelomeric sequence reads that could not be assembled are available at http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation.

Supplementary information

PDF files

  1. Supplementary Information (4.9 MB)

    This file contains Supplementary Methods, Text and Data, Supplementary Figures 1-29, Supplementary Tables 1-35 and additional references. Tables shown in this file represent views of the full tables given in the Supplementary Tables file.

Excel files

  1. Supplementary Tables (442 KB)

    This file contains the full table values for the Supplementary Tables 1-35 (see separate Supplementary information file).
