Abstract

Complex allelic variation hampers the assembly of haplotype-resolved sequences from diploid genomes. We developed trio binning, an approach that simplifies haplotype assembly by resolving allelic variation before assembly. In contrast with prior approaches, the effectiveness of our method improved with increasing heterozygosity. Trio binning uses short reads from two parental genomes to first partition long reads from an offspring into haplotype-specific sets. Each haplotype is then assembled independently, resulting in a complete diploid reconstruction. We used trio binning to recover both haplotypes of a diploid human genome and identified complex structural variants missed by alternative approaches. We sequenced an F1 cross between the cattle subspecies Bos taurus taurus and Bos taurus indicus and completely assembled both parental haplotypes with NG50 haplotig sizes of >20 Mb and 99.998% accuracy, surpassing the quality of current cattle reference genomes. We suggest that trio binning improves diploid genome assembly and will facilitate new studies of haplotype variation and inheritance.
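The partitioning step described above can be illustrated with a minimal sketch. The abstract does not say how reads are assigned to a parent, so the k-mer approach here is an assumption made for illustration: the child's reads are compared against k-mers seen in only one parent. The function names (`canonical_kmers`, `kmer_set`, `bin_reads`), the default `k`, and the majority-vote rule are all hypothetical and are not taken from the authors' software.

```python
# Illustrative sketch only: bins offspring reads by parent-specific k-mers.
# All names and the majority-vote rule are assumptions, not the authors' code.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical_kmers(seq, k):
    """Yield each k-mer of seq as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _BASES:  # skip k-mers containing N or other symbols
            reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, reverse_complement)


def kmer_set(reads, k):
    """Collect the canonical k-mers observed in a collection of reads."""
    return {kmer for read in reads for kmer in canonical_kmers(read, k)}


def bin_reads(child_reads, maternal_reads, paternal_reads, k=21):
    """Split child reads into maternal, paternal and unassigned bins."""
    maternal = kmer_set(maternal_reads, k)
    paternal = kmer_set(paternal_reads, k)
    maternal_only = maternal - paternal  # k-mers seen only in the mother
    paternal_only = paternal - maternal  # k-mers seen only in the father

    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in child_reads:
        read_kmers = set(canonical_kmers(read, k))
        maternal_hits = len(read_kmers & maternal_only)
        paternal_hits = len(read_kmers & paternal_only)
        if maternal_hits > paternal_hits:
            bins["maternal"].append(read)
        elif paternal_hits > maternal_hits:
            bins["paternal"].append(read)
        else:
            # No parent-specific evidence, or a tie: leave the read unassigned.
            bins["unassigned"].append(read)
    return bins
```

Under this sketch, each bin would then be passed to a separate assembly run. A more heterozygous genome yields more parent-specific k-mers per read, which is one way to read the claim that the method improves with increasing heterozygosity. A real implementation would also need to handle sequencing errors in both the short and long reads, which this toy version ignores.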

Accessions

Primary accessions

BioProject

Acknowledgements

We thank W. Thompson, K. Kuhn, K. McClure and R. Lee for technical assistance, and T. Graves-Lindsay and Washington University in St. Louis for public release of the PacBio NA12878 data. S.K., A.R., B.P.W. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, US National Institutes of Health. S.H. and J.L.W. were funded by the JS Davies bequest to the University of Adelaide. T.P.L.S. was supported by USDA-ARS Project 3040-31000-100-00D. D.M.B. was supported by USDA-ARS Project 5090-31000-026-00-D. This research was also supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI17C2098). This work used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Author information

Author notes

    • Sergey Koren
    • Arang Rhie

    These authors contributed equally to this work.

Affiliations

  1. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, Maryland, USA.

    • Sergey Koren
    • Arang Rhie
    • Brian P Walenz
    • Alexander T Dilthey
    • Adam M Phillippy
  2. Institute of Medical Microbiology, Heinrich-Heine-University Düsseldorf, Düsseldorf, North Rhine-Westphalia, Germany.

    • Alexander T Dilthey
  3. Cell Wall Biology and Utilization Laboratory, ARS USDA, Madison, Wisconsin, USA.

    • Derek M Bickhart
  4. Pacific Biosciences, Menlo Park, California, USA.

    • Sarah B Kingan
  5. Davies Research Centre, School of Animal and Veterinary Sciences, The University of Adelaide, Roseworthy SA, Australia.

    • Stefan Hiendleder
    • John L Williams
  6. Robinson Research Institute, The University of Adelaide, Adelaide SA, Australia.

    • Stefan Hiendleder
  7. US Meat Animal Research Center, ARS USDA, Clay Center, Nebraska, USA.

    • Timothy P L Smith

Contributions

A.M.P. and T.P.L.S. conceived and coordinated the project. S.K. and A.R. designed the trio-binning method. S.K., A.R. and B.P.W. implemented the software. S.K., A.R., B.P.W., A.T.D., D.M.B., S.B.K. and A.M.P. performed analyses. S.H. designed and performed breeding experiments and sample collections. J.L.W. contributed to development of the concept and provision of samples. T.P.L.S. performed sequencing. S.K., A.R., T.P.L.S., J.L.W. and A.M.P. wrote the manuscript. All of the authors approved the final manuscript.

Competing interests

S.B.K. is a current employee of Pacific Biosciences.

Corresponding authors

Correspondence to Timothy P L Smith or Adam M Phillippy.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1-21, Supplementary Tables 1-10, and Supplementary Note 1

  2. Life Sciences Reporting Summary

About this article

DOI

https://doi.org/10.1038/nbt.4277