Abstract

We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
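For readers unfamiliar with the contiguity metric quoted above: scaffold N50 is the length L such that scaffolds of length at least L together contain half of the total assembled bases. The following is a minimal sketch of that calculation, not the authors' pipeline. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    account for at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths (hypothetical, arbitrary units).
scaffold_lengths = [40, 30, 20, 10]
print(n50(scaffold_lengths))
```

Because the longest scaffolds are counted first, a few very long scaffolds raise N50 sharply, which is why the metric is commonly used to compare assembly contiguity.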

Acknowledgements

This work was supported in part by institutional support from the Icahn Institute for Genomics and Multiscale Biology, R01 HG005946, U01 HL107388, R01 DK098242-01, R01 MH106531, US National Institutes of Health (NIH) U41HG007497, the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, the STARR Consortium, the WorldQuant Foundation, the Pershing Square Foundation, the Genomics & Epigenomics Core Facilities and SMRT Sequencing Center at Weill Cornell Medical College, and through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai. DNA samples were provided by the Coriell Institute for Medical Research and the US National Institute of Standards and Technology (NIST). We also thank T. Zichner for assistance with the design of validations and M. Chaisson for assistance with running BLASR and the assembly-based SV pipeline and with performing the CHM1 comparison.

Author information

Author notes

    Matthew Pendleton, Robert Sebra, Andy Wing Chun Pang & Ajay Ummat

    These authors contributed equally to this work.

Affiliations

  1. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA.

    Matthew Pendleton, Robert Sebra, Ajay Ummat, Oscar Franzen, Ariella Cohain, Gintaras Deikus, Eric E Schadt & Ali Bashir

  2. BioNano Genomics, San Diego, California, USA.

    Andy Wing Chun Pang, William Stedman, Thomas Anantharaman, Alex Hastie, Heng Dai & Han Cao

  3. Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.

    Tobias Rausch, Adrian M Stütz, Markus Hsi-Yang Fritz & Jan O Korbel

  4. The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, USA.

    Russell E Durrett, Roger Altman & Christopher E Mason

  5. Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA.

    Scott C Blanchard

  6. Pacific Biosciences, Menlo Park, California, USA.

    Chen-Shan Chin, Yan Guo & Ellen E Paxinos

  7. European Bioinformatics Institute, European Molecular Biology Laboratory, Hinxton, UK.

    Jan O Korbel

  8. Laboratory of Neuro-Oncology, The Rockefeller University, New York, New York, USA.

    Robert B Darnell

  9. Howard Hughes Medical Institute, New York, New York, USA.

    Robert B Darnell

  10. The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    W Richard McCombie

  11. The Watson School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, USA.

    W Richard McCombie

  12. Institute for Human Genetics, University of California–San Francisco, San Francisco, California, USA.

    Pui-Yan Kwok

  13. Department of Medicine, Division of Hematology/Oncology, Weill Cornell Medical College, New York, New York, USA.

    Christopher E Mason

  14. The Feil Family Brain and Mind Research Institute, Weill Cornell Medical College, New York, New York, USA.

    Christopher E Mason

Contributions

E.E.S., A.B., R.S., C.E.M., W.R.M. and R.B.D. conceived the project and provided resources for sequencing and algorithmic analysis. A.B. and E.E.S. provided bioinformatics oversight. J.O.K., M.H.-Y.F., A.M.S. and T.R. performed Illumina SV analysis and PCR validation. R.S., M.P. and E.E.P. prepared long libraries for PacBio sequencing. R.S., Y.G., A.C., S.C.B., R.A. and R.E.D. performed PacBio sequencing and primary analysis of hdf5 data. A.W.C.P., H.D., A.H., T.A., W.S., H.C. and P.-Y.K. generated the BioNano data, built initial genome maps and performed BioNano alignment and SV calling. O.F., A.B. and M.P. performed PacBio SV analysis and validation. A.U., A.B. and C.-S.C. performed error correction and assembly. A.W.C.P. and H.D. built the initial hybrid scaffolding pipeline. A.B. and M.P. refined the hybrid scaffolding pipeline. A.W.C.P., H.D. and A.B. performed scaffold analysis and phasing. A.B., A.W.C.P., M.P. and A.U. generated figures for the main text. A.B., E.E.S., R.S., M.P. and A.W.C.P. primarily wrote the manuscript, though many coauthors provided edits and methods sections.

Competing interests

A.W.C.P., H.C., W.S., T.A., A.H. and H.D. are employees of BioNano Genomics. C.-S.C., Y.G. and E.E.P. are employees of Pacific Biosciences, and E.E.S. is on the scientific advisory board of Pacific Biosciences.

Corresponding author

Correspondence to Ali Bashir.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–15, Supplementary Tables 1–4 and 6–12, Supplementary Results and Supplementary Notes 1–3

Zip files

  1. Supplementary Software

    Custom scripts for performing hybrid scaffolding and SV analysis

Excel files

  1. Supplementary Table 5

    Insertion and deletion SVs with phasing

  2. Supplementary Table 13

    Alignment coordinates between sequence contigs and V2 hybrid scaffolds

  3. Supplementary Table 14

    Alignment coordinates between BioNano genome maps and V2 hybrid scaffolds

About this article

DOI

https://doi.org/10.1038/nmeth.3454
