Abstract

We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33–79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
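To make the graph-partitioning idea concrete, the following is a minimal sketch of pivot-based correlation clustering on a toy PSV graph. It is not SDA's implementation: the function name, the toy PSV labels, and the simplification that any pair without an attraction edge counts as repelling are all assumptions made for illustration.

```python
import random


def pivot_correlation_clustering(nodes, attraction, seed=0):
    """Greedy pivot clustering (illustrative only, not the SDA code).

    Pick a random unassigned pivot, group it with every unassigned node
    joined to it by an attraction edge, and repeat until no nodes remain.
    Pairs without an attraction edge are treated as repelling.
    """
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    unassigned = set(order)
    clusters = []
    for pivot in order:
        if pivot not in unassigned:
            continue
        cluster = {pivot} | {
            n for n in unassigned if frozenset((pivot, n)) in attraction
        }
        unassigned -= cluster
        clusters.append(cluster)
    return clusters


# Hypothetical toy graph: six PSVs, three from each of two paralogs.
psvs = ["p1", "p2", "p3", "q1", "q2", "q3"]
attraction = {
    frozenset(edge)
    for edge in [
        ("p1", "p2"), ("p1", "p3"), ("p2", "p3"),
        ("q1", "q2"), ("q1", "q3"), ("q2", "q3"),
    ]
}
clusters = pivot_correlation_clustering(psvs, attraction)
```

In this toy case each paralog's PSVs form a clique of attraction edges, so any choice of pivot recovers the two groups; reads carrying the PSVs of one group would then be partitioned and assembled together.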

Data availability

SMRT WGS data for CHM1, CHM13, and NA19240 from this study are available at the NCBI Sequence Read Archive (SRA) under accession numbers SRP044331 for CHM1; SRX818607, SRX825542, and SRX825575–SRX825579 for CHM13; and SRX1093000, SRX1093555, SRX1093654, SRX1094289, SRX1094374, SRX1094388, and SRX1096798 for NA19240. ONT WGS data are available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md. De novo assemblies of CHM1, CHM13, NA19240, and NA12878 from this study are available at the NCBI Assembly database under accession numbers GCA_001297185.1, GCA_000983455.2, GCA_001524155.4, and GCA_900232925.1, respectively. Assembled CHORI-17 BACs are available at the NCBI Clone DB (https://www.ncbi.nlm.nih.gov/clone/) under the accession numbers listed in Supplementary Table 4. The length, PSVs, and GRCh38 mapping location of every SDA contig generated are listed in Supplementary Table 8. Additional data that support the findings of this study are available from the corresponding author upon request.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).

  2. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nat. Methods 8, 61–65 (2011).

  3. Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).

  4. Shi, L. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016).

  5. Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650 (2017).

  6. Gordon, D. et al. Long-read sequence assembly of the gorilla genome. Science 352, aae0344 (2016).

  7. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).

  8. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

  9. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).

  10. Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360, eaar6343 (2018).

  11. Kelley, D. R. & Salzberg, S. L. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol. 11, R28 (2010).

  12. Pop, M. Shotgun sequence assembly. Adv. Comput. 60, 193–248 (2004).

  13. Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001).

  14. Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).

  15. Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).

  16. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002).

  17. Sharp, A. J. et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042 (2006).

  18. Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).

  19. Chen, J. et al. Bovine NK-lysin: copy number variation and functional diversification. Proc. Natl Acad. Sci. USA 112, E7223–E7229 (2015).

  20. Dennis, M. Y. & Eichler, E. E. Human adaptation and evolution by segmental duplication. Curr. Opin. Genet. Dev. 41, 44–52 (2016).

  21. Abegglen, L. M. et al. Potential mechanisms for cancer resistance in elephants and comparative cellular response to DNA damage in humans. J. Am. Med. Assoc. 314, 1850–1860 (2015).

  22. Church, D. M. et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 7, e1000112 (2009).

  23. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

  24. Emanuel, B. S. & Shaikh, T. H. Segmental duplications: an ‘expanding’ role in genomic instability and disease. Nat. Rev. Genet. 2, 791–800 (2001).

  25. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).

  26. Chaisson, M. J., Mukherjee, S., Kannan, S. & Eichler, E. E. Resolving multicopy duplications de novo using polyploid phasing. RECOMB 10229, 117–133 (2017).

  27. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).

  28. Ailon, N., Charikar, M. & Newman, A. Aggregating inconsistent information. J. Assoc. Comput. Mach. 55, 1–27 (2008).

  29. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  30. Fiddes, I. T. et al. Human-specific NOTCH2NL genes affect notch signaling and cortical neurogenesis. Cell 173, 1356–1369 (2018).

  31. Florio, M. et al. Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortex. eLife 7, e32332 (2018).

  32. Dennis, M. Y. et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912–922 (2012).

  33. Nuttle, X. et al. Rapid and accurate large-scale genotyping of duplicated genes and discovery of interlocus gene conversions. Nat. Methods 10, 903–909 (2013).

  34. Dennis, M. Y. et al. The evolution and population diversity of human-specific segmental duplications. Nat. Ecol. Evol. 1, 0069 (2017).

  35. Steinberg, K. M. et al. High-quality assembly of an individual of Yoruban descent. bioRxiv Preprint at https://www.biorxiv.org/content/early/2016/08/02/067447 (2016).

  36. Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).

  37. BACPAC Resources. The CHORI-17 BAC library from a hydatidiform (haploid) mole. CloneDB https://www.ncbi.nlm.nih.gov/clone/library/genomic/76/ (2018).

  38. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).

  39. Nuttle, X. et al. Emergence of a Homo sapiens–specific gene family and chromosome 16p11.2 CNV susceptibility. Nature 536, 205–209 (2016).

  40. Dougherty, M. L. et al. Transcriptional fates of human-specific segmental duplications in brain. Genome Res. 28, 1566–1576 (2018).

  41. Das, S. & Vikalo, H. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 16, 260 (2015).

  42. Aguiar, D. & Istrail, S. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29, i352–i360 (2013).

  43. Berger, E., Yorukoglu, D., Peng, J. & Berger, B. in Research in Computational Molecular Biology: RECOMB 2014 (ed Sharan, R.) 18–19 (Springer, 2014).

  44. Puljiz, Z. & Vikalo, H. Decoding genetic variations: communications-inspired haplotype assembly. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 518–530 (2016).

  45. Bonizzoni, P. et al. On the minimum error correction problem for haplotype assembly in diploid and polyploid genomes. J. Comput. Biol. 23, 718–736 (2016).

  46. Artyomenko, A. et al. Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants. J. Comput. Biol. 24, 558–570 (2017).

  47. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995).

  48. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).

  49. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).

  50. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  51. Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 22, 498–509 (2015).

  52. Steinberg, K. M. et al. Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat. Genet. 44, 872–880 (2012).

  53. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).

Acknowledgements

The authors thank S. Cantsilieris and D. Gordon for technical assistance, J. Underwood for recommendations regarding the analysis of HSDs and Iso-Seq data, and T. Brown for help in editing this manuscript. This work was supported, in part, by grants from the US National Institutes of Health (NIH) (HG002385 to E.E.E., HG007635 to R.K.W. and E.E.E., and HG003079 to R.K.W.). M.R.V. was supported by a National Library of Medicine (NLM) Big Data Training Grant for Genomics and Neuroscience (5T32LM012419-04). P.C.D. was supported by a National Human Genome Research Institute (NHGRI) training grant (5T32HG000035-23). E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA

    • Mitchell R. Vollger
    • Philip C. Dishuck
    • Melanie Sorensen
    • AnneMarie E. Welch
    • Vy Dang
    • Max L. Dougherty
    • Evan E. Eichler
  2. The McDonnell Genome Institute at Washington University, Washington University School of Medicine, St. Louis, MO, USA

    • Tina A. Graves-Lindsay
  3. Institute for Genomic Medicine, Nationwide Children’s Hospital, Columbus, OH, USA

    • Richard K. Wilson
  4. Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, USA

    • Richard K. Wilson
  5. University of Southern California, Los Angeles, CA, USA

    • Mark J. P. Chaisson
  6. Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

    • Evan E. Eichler

Contributions

M.R.V., M.J.P.C., and E.E.E. developed the SDA method; R.K.W. and T.A.G.-L. generated the PacBio genome sequence; M.S., A.E.W., M.R.V., and V.D. sequenced and analyzed the BAC clone insert; P.C.D., M.R.V., and M.L.D. carried out Iso-Seq analysis; M.R.V. organized the supplementary material; M.R.V., E.E.E., and M.J.P.C. wrote the manuscript; M.R.V. and P.C.D. produced the display items.

Competing interests

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc.

Corresponding authors

Correspondence to Mark J. P. Chaisson or Evan E. Eichler.

Integrated supplementary information

  1. Supplementary Figure 1 Proportion of resolved SDs in different PacBio (PB)/ONT genome assemblies.

    The figure shows the percent of SD bases that are resolved in human genome assemblies plotted as a function of the length of minimum extension of the alignment past the duplication. The number of resolved SD base pairs is relatively constant irrespective of the requirement of flanking unique base pairs. The dashed red line indicates the threshold chosen for our analysis used to generate the first panel in Supplementary Fig. 2 and the fraction of resolved SDs in Supplementary Table 1.

  2. Supplementary Figure 2 Resolution of SDs in SMRT genome assemblies.

    a) SDs (as a function of percent identity and length) in GRCh38 are marked as resolved (black) if they are present in the CHM1 assembly, or unresolved (red) if they appear only in the reference. The stacked marginal histograms show the relative number of resolved and unresolved SDs within each bin. Resolved duplications are defined as those mapping with high sequence identity, being completely contained, and extending at least 50 kb into unique sequence on either side of the duplication block (Methods). See Supplementary Fig. 1 and Supplementary Table 1 for the fraction of unresolved duplications across different genomes, assemblers, and technologies. Note that resolved and unresolved SDs are offset from one another along the y-axis to avoid overlapping. b) This plot shows the number of genes that exist within unresolved SD blocks in the CHM1 assembly versus the maximum percent identity SD within that block.

  3. Supplementary Figure 3 Length of collapsed SDs and SDA assemblies.

    Correlation of collapse length and SDA assembly length in a) CHM1 (n = 590), b) CHM13 (n = 1,440), and c) NA19240 (n = 1,772) genome assemblies. In all three assemblies there is a strong correlation (Pearson’s correlation) between the length of a collapsed SD and the length of the resulting SDA assembly. SDA is not restricted to assembling duplications less than the maximum read length (like other assemblers), but rather it is restricted by the size of the collapsed duplication.

  4. Supplementary Figure 4 Sequence and assembly of NOTCH2 loci in the CHM1 human genome.

    a) A collapsed representation of a portion of the NOTCH2 loci is shown. Plotted is the read-depth profile over a collapsed representation of NOTCH2. Each black dot represents the coverage of the most frequent base pair at that position, while each red dot is the second most frequent. Secondary bases at low frequency represent sequencing error; however, those at high frequency represent PSV candidates. b) NOTCH2 PSV graph resolves the collapse into five potential loci. c) The alignment of each SDA contig back to the loci for NOTCH2 (./NLA/NLB/NLC/NLD) using Miropeats. Our assembled sequence is 99.88% identical over all five loci and >99.995% identical if only mismatched bases are counted as errors.

  5. Supplementary Figure 5 SDA results for the CHM13 assembly.

    a) SDA analysis of the CHM13 FALCON assembly generates 1,848 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 40.4 Mb of diverged assembly (gray) and 43.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) Copy number difference (CND) between CHM13 and the reference genome (CHM13 copy number – reference genome copy number) comparing n = 186 SD regions that match (>99.8%) versus n = 374 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 1.61 and the mean CND of the diverged sequence is 5.98, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 2.77 × 10⁻⁵). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)

  6. Supplementary Figure 6 SDA results for the NA19240 (African Yoruban) assembly.

    a) SDA analysis of the NA19240 FALCON assembly generates 2,136 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 46.1 Mb of diverged assembly (gray) and 41.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) CND between NA19240 and the reference genome (NA19240 copy number – reference genome copy number) comparing n = 177 SD regions that match (>99.8%) versus n = 384 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 4.11 and the mean CND of the diverged sequence is 10.87, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 1.88 × 10⁻⁴). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)

  7. Supplementary Figure 7 Comparison of SDA on ONT versus SMRT data.

    The left half of the figure shows the results of SDA applied to the ONT assembly of NA12878; on the right is the PacBio assembly of NA19240. a) SDA analysis of the NA12878 assembly generated 38 assemblies that mapped with >99.8% identity (matched) to GRCh38 and 792 mapped with <99.8% sequence identity (diverged). Failed clusters (n = 1,052) did not result in an assembly, while multiple assemblies were PSV clusters with more than one contig produced by the Canu assembly. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. The number of assembly Mb is calculated independently of a mapping to the reference. c) Length distribution of the matched and diverged assemblies (NA12878: matched n = 38, diverged n = 792; NA19240: matched n = 789, diverged n = 983). The lines on the violin plots indicate the first and third quartiles as well as the median. d) Sequencing read-depth distribution of the second most common SNV across all collapsed regions of SDs.

  8. Supplementary Figure 8 Sequence and assembly of a missing 16p12.1 duplication.

    The Miropeats alignments compare a BAC-based tiling path assembly of CHM1 (top line) with the human reference genome (GRCh38) (middle line) and with a de novo assembly of CHM1 to which SDA was applied (bottom line). The A/C duplication (red blue) proposed by Sudmant et al. that is present in most humans was correctly assembled using SDA and matches at high sequence identity (99.9%) to the BAC-based assembly structure.

  9. Supplementary Figure 9 Mapping differential of transcripts between SDA and de novo CHM13.

    The percent identity differential of the mapping of full-length Iso-Seq transcripts (n = 14,562) from human-specific segmental duplications (HSDs) to both the de novo assembly of CHM13 and the SDA results on CHM13 is shown. In total, 11 gene families showed significantly (P < 0.001, two-sided Wilcoxon signed-rank test) improved mapping to the SDA-resolved contigs. The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles.

  10. Supplementary Figure 10 Multiple sequence alignment (MSA) between GRCh38 GPRIN2 and SDA GPRIN2A/B.

    Shown is the amino acid MSA between the copies of GPRIN2 resolved by SDA and the copy of GPRIN2 in GRCh38. Of the 15 differences in the MSA, 12 are annotated in dbSNP as variants in GPRIN2 when they are in fact differences between GPRIN2A and GPRIN2B. At p.Ser104Gly, p.Arg242Gly, and p.Val375Ala, the reference has the minor allele. Supplementary Table 7 shows the allele frequencies for all variants seen in this alignment.

  11. Supplementary Figure 11 CHM1 SDA contigs that overlap with unique sequence.

    This ideogram shows where SDA contigs could extend the FALCON assembly. The bottom panel of each chromosome shows the FALCON assembly (contigs > 1 Mb (dark blue), contigs < 1 Mb (light blue)). The top panel shows where SDA contigs with unique overlaps map along the reference (contigs with > 10 kb of overlap (green), contigs with < 10 kb of overlap (red)).

  12. Supplementary Figure 12 PSV graph without attraction edges.

    Reproduced above is the PSV graph shown in Fig. 3 for SRGAP2. The left-hand side shows the attraction edges used in correlation clustering (CC). On the right-hand side, the edges are removed so that the transparency of the nodes is visible. The opacity of each node scales from 0.25 to 1, with 0.25 reflecting the start position on the contig and 1 representing the final position on the contig.
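The definition of a resolved duplication given in the caption of Supplementary Fig. 2 (high sequence identity, complete containment, and at least 50 kb of extension into unique sequence on either side) can be sketched as a simple predicate. This is an illustrative reading of the caption, not the authors' pipeline: the class and field names are hypothetical, and the identity cutoff is an assumed placeholder because the caption does not state a numeric value.

```python
from dataclasses import dataclass

MIN_FLANK_BP = 50_000  # from the caption: at least 50 kb into unique sequence


@dataclass
class SdAlignment:
    """Hypothetical record of one contig alignment spanning an SD block."""
    sd_start: int    # duplication block start on the reference
    sd_end: int      # duplication block end on the reference
    aln_start: int   # contig alignment start on the reference
    aln_end: int     # contig alignment end on the reference
    identity: float  # alignment identity as a fraction (0 to 1)


def is_resolved(aln, min_identity=0.99, min_flank=MIN_FLANK_BP):
    """Return True if the alignment meets the three caption criteria.

    min_identity is an assumed placeholder for "high sequence identity".
    A flank of at least min_flank on both sides also implies that the
    duplication block is completely contained in the alignment.
    """
    left_flank = aln.sd_start - aln.aln_start
    right_flank = aln.aln_end - aln.sd_end
    return (
        aln.identity >= min_identity
        and left_flank >= min_flank
        and right_flank >= min_flank
    )


# Toy coordinates (hypothetical): a 100 kb SD with 60 kb flanks on each
# side, and the same SD with only 10 kb of left flank.
spanning = SdAlignment(1_000_000, 1_100_000, 940_000, 1_160_000, 0.998)
short_left = SdAlignment(1_000_000, 1_100_000, 990_000, 1_160_000, 0.998)
```

Under these assumptions `is_resolved(spanning)` is true and `is_resolved(short_left)` is false, since the second alignment extends only 10 kb past the left edge of the duplication.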

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figs. 1–12, Supplementary Tables 1–3 and 7, and Supplementary Note 1

  2. Reporting Summary

  3. Supplementary Table 4

    CHORI-17 BAC clone sequences.

  4. Supplementary Table 5

    BAC clone sequence analysis.

  5. Supplementary Table 6

    Gene content analysis.

  6. Supplementary Table 8

    Summary of all SDA assemblies from CHM1, CHM13, and NA19240.

About this article

DOI

https://doi.org/10.1038/s41592-018-0236-3