Assembly and diploid architecture of an individual human genome via single-molecule technologies

Abstract

We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
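The scaffold N50 quoted above is a standard contiguity statistic: the length L such that scaffolds of length L or longer together make up at least half of the total assembly length. The following is a minimal sketch of that definition, not the authors' pipeline; the function name `n50` and the toy lengths are illustrative assumptions.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Hypothetical helper for illustration only: it walks the lengths from
    longest to shortest and returns the first length at which the running
    sum reaches at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example with made-up lengths (not data from the paper).
toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_scaffolds)
```

With the toy lengths, the total is 32 and the two longest scaffolds (8 + 8) already cover half of it, so the sketch returns 8. A larger N50 means that more of the assembly sits in long scaffolds.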


Figure 1: De novo assembly and scaffold layout.
Figure 2: Tandem-repeat detection from single molecules predicts a large divergence from reference.
Figure 3: De novo maps identify large structural variants.
Figure 4: CLRs highlight multiple colocated SVs and complex SV structures.

Accession codes

Primary accessions

BioProject

Sequence Read Archive

Acknowledgements

This work was supported in part by institutional support from the Icahn Institute for Genomics and Multiscale Biology, R01 HG005946, U01 HL107388, R01 DK098242-01, R01 MH106531, US National Institutes of Health (NIH) U41HG007497, the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, the STARR Consortium, the WorldQuant Foundation, the Pershing Square Foundation, the Genomics & Epigenomics Core Facilities and SMRT Sequencing Center at Weill Cornell Medical College, and through the computational resources and staff expertise provided by the Department of Scientific Computing at the Icahn School of Medicine at Mount Sinai. DNA samples were provided by the Coriell Institute for Medical Research and the US National Institute of Standards and Technology (NIST). We would also like to thank T. Zichner for assistance with the design of validations and M. Chaisson for assistance with running BLASR and the assembly-based SV pipeline and with performing the CHM1 comparison.

Author information

Contributions

E.E.S., A.B., R.S., C.E.M., W.R.M. and R.B.D. conceived the project and provided resources for sequencing and algorithmic analysis. A.B. and E.E.S. provided bioinformatics oversight. J.O.K., M.H.-Y.F., A.M.S. and T.R. performed Illumina SV analysis and PCR validation. R.S., M.P. and E.E.P. prepared long libraries for PacBio sequencing. R.S., Y.G., A.C., S.C.B., R.A. and R.E.D. performed PacBio sequencing and primary analysis of hdf5 data. A.W.C.P., H.D., A.H., T.A., W.S., H.C. and P.-Y.K. generated the BioNano data, built initial genome maps and performed BioNano alignment and SV calling. O.F., A.B. and M.P. performed PacBio SV analysis and validation. A.U., A.B. and C.-S.C. performed error correction and assembly. A.W.C.P. and H.D. built the initial hybrid scaffolding pipeline. A.B. and M.P. refined the hybrid scaffolding pipeline. A.W.C.P., H.D. and A.B. performed scaffold analysis and phasing. A.B., A.W.C.P., M.P. and A.U. generated figures for the main text. A.B., E.E.S., R.S., M.P. and A.W.C.P. primarily wrote the manuscript, though many coauthors provided edits and methods sections.

Corresponding author

Correspondence to Ali Bashir.

Ethics declarations

Competing interests

A.W.C.P., H.C., W.S., T.A., A.H. and H.D. are employees of BioNano Genomics. C.-S.C., Y.G. and E.E.P. are employees of Pacific Biosciences, and E.E.S. is on the scientific advisory board of Pacific Biosciences.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–4 and 6–12, Supplementary Results and Supplementary Notes 1–3 (PDF 16327 kb)

Supplementary Software

Custom scripts for performing hybrid scaffolding and SV analysis (ZIP 23658 kb)

Supplementary Table 5

Insertion and deletion SVs with phasing (XLSX 1079 kb)

Supplementary Table 13

Alignment coordinates between sequence contigs and V2 hybrid scaffolds (XLSX 260 kb)

Supplementary Table 14

Alignment coordinates between BioNano genome maps and V2 hybrid scaffolds (XLSX 89 kb)

About this article

Cite this article

Pendleton, M., Sebra, R., Pang, A. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nat Methods 12, 780–786 (2015). https://doi.org/10.1038/nmeth.3454

  • DOI: https://doi.org/10.1038/nmeth.3454
