Long-read sequence and assembly of segmental duplications


We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33–79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
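The graph construction described above can be illustrated with a small sketch. This is not the SDA implementation: the read representation, the rule that an edge counts as attraction when supporting reads outnumber conflicting reads, and the function names `edge_weights` and `pivot_cluster` are all assumptions made for illustration. The clustering step uses a simple pivot heuristic in the style of correlation clustering, not necessarily the procedure SDA uses.

```python
import random
from collections import Counter
from itertools import combinations


def edge_weights(reads):
    """Count attraction and repulsion support for every pair of PSVs.

    reads: list of dicts mapping a PSV id to True when the read carries the
    paralog-specific allele, or False when it spans the site without it.
    """
    attract, repel = Counter(), Counter()
    for calls in reads:
        for a, b in combinations(sorted(calls), 2):
            if calls[a] and calls[b]:
                attract[(a, b)] += 1   # both variants seen on one molecule
            elif calls[a] != calls[b]:
                repel[(a, b)] += 1     # only one of the two variants present
    return attract, repel


def pivot_cluster(nodes, attract, repel, seed=0):
    """Greedy pivot clustering: group each pivot with its attracted neighbours."""
    rng = random.Random(seed)
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        cluster = {pivot}
        for node in remaining:
            if node == pivot:
                continue
            key = tuple(sorted((pivot, node)))
            if attract[key] > repel[key]:  # assumed sign rule for the edge
                cluster.add(node)
        clusters.append(cluster)
        remaining = [n for n in remaining if n not in cluster]
    return clusters


# Hypothetical toy data: PSVs 1-3 lie on one paralog, PSVs 4-5 on another.
reads = [
    {1: True, 2: True, 4: False},
    {2: True, 3: True, 5: False},
    {1: True, 3: True, 4: False},
    {1: False, 4: True, 5: True},
    {2: False, 3: False, 4: True, 5: True},
]
attract, repel = edge_weights(reads)
clusters = pivot_cluster(range(1, 6), attract, repel)
```

In this toy case the PSVs separate into two groups, one per paralog, and reads could then be assigned to the group whose PSVs they carry before each group is assembled on its own.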


Fig. 1: Flowchart of the SDA method.
Fig. 2: SDA results of the CHM1 human genome assembly.
Fig. 3: Sequence and assembly of SRGAP2 loci in the CHM13 human genome.
Fig. 4: Correspondence between SDA sequence-diverged contigs and BACs.
Fig. 5: Gene discovery.

Data availability

SMRT WGS data for CHM1, CHM13, and NA19240 from this study are available at the NCBI Sequence Read Archive (SRA) under accession numbers SRP044331 for CHM1; SRX818607, SRX825542, and SRX825575–SRX825579 for CHM13; and SRX1093000, SRX1093555, SRX1093654, SRX1094289, SRX1094374, SRX1094388, and SRX1096798 for NA19240. ONT WGS data are available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md. De novo assemblies of CHM1, CHM13, NA19240, and NA12878 from this study are available at the NCBI Assembly database under accession numbers GCA_001297185.1, GCA_000983455.2, GCA_001524155.4, and GCA_900232925.1, respectively. Assembled CHORI-17 BACs are available at the NCBI Clone DB (https://www.ncbi.nlm.nih.gov/clone/) under the accession numbers listed in Supplementary Table 4. The length, PSVs, and GRCh38 mapping location of every SDA contig generated are listed in Supplementary Table 8. Additional data that support the findings of this study are available from the corresponding author upon request.


  1. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
  2. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nat. Methods 8, 61–65 (2011).
  3. Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).
  4. Shi, L. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016).
  5. Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650 (2017).
  6. Gordon, D. et al. Long-read sequence assembly of the gorilla genome. Science 352, aae0344 (2016).
  7. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
  8. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
  9. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
  10. Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360, eaar6343 (2018).
  11. Kelley, D. R. & Salzberg, S. L. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol. 11, R28 (2010).
  12. Pop, M. Shotgun sequence assembly. Adv. Comput. 60, 193–248 (2004).
  13. Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001).
  14. Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).
  15. Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).
  16. Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002).
  17. Sharp, A. J. et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042 (2006).
  18. Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).
  19. Chen, J. et al. Bovine NK-lysin: copy number variation and functional diversification. Proc. Natl Acad. Sci. USA 112, E7223–E7229 (2015).
  20. Dennis, M. Y. & Eichler, E. E. Human adaptation and evolution by segmental duplication. Curr. Opin. Genet. Dev. 41, 44–52 (2016).
  21. Abegglen, L. M. et al. Potential mechanisms for cancer resistance in elephants and comparative cellular response to DNA damage in humans. J. Am. Med. Assoc. 314, 1850–1860 (2015).
  22. Church, D. M. et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 7, e1000112 (2009).
  23. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
  24. Emanuel, B. S. & Shaikh, T. H. Segmental duplications: an ‘expanding’ role in genomic instability and disease. Nat. Rev. Genet. 2, 791–800 (2001).
  25. Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
  26. Chaisson, M. J., Mukherjee, S., Kannan, S. & Eichler, E. E. Resolving multicopy duplications de novo using polyploid phasing. RECOMB 10229, 117–133 (2017).
  27. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
  28. Ailon, N., Charikar, M. & Newman, A. Aggregating inconsistent information. J. Assoc. Comput. Mach. 55, 1–27 (2008).
  29. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
  30. Fiddes, I. T. et al. Human-specific NOTCH2NL genes affect notch signaling and cortical neurogenesis. Cell 173, 1356–1369 (2018).
  31. Florio, M. et al. Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortex. eLife 7, e32332 (2018).
  32. Dennis, M. Y. et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912–922 (2012).
  33. Nuttle, X. et al. Rapid and accurate large-scale genotyping of duplicated genes and discovery of interlocus gene conversions. Nat. Methods 10, 903–909 (2013).
  34. Dennis, M. Y. et al. The evolution and population diversity of human-specific segmental duplications. Nat. Ecol. Evol. 1, 0069 (2017).
  35. Steinberg, K. M. et al. High-quality assembly of an individual of Yoruban descent. bioRxiv Preprint at https://www.biorxiv.org/content/early/2016/08/02/067447 (2016).
  36. Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
  37. BACPAC Resources. The CHORI-17 BAC library from a hydatidiform (haploid) mole. CloneDB https://www.ncbi.nlm.nih.gov/clone/library/genomic/76/ (2018).
  38. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
  39. Nuttle, X. et al. Emergence of a Homo sapiens–specific gene family and chromosome 16p11.2 CNV susceptibility. Nature 536, 205–209 (2016).
  40. Dougherty, M. L. et al. Transcriptional fates of human-specific segmental duplications in brain. Genome Res. 28, 1566–1576 (2018).
  41. Das, S. & Vikalo, H. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 16, 260 (2015).
  42. Aguiar, D. & Istrail, S. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29, i352–i360 (2013).
  43. Berger, E., Yorukoglu, D., Peng, J. & Berger, B. in Research in Computational Molecular Biology: RECOMB 2014 (ed Sharan, R.) 18–19 (Springer, 2014).
  44. Puljiz, Z. & Vikalo, H. Decoding genetic variations: communications-inspired haplotype assembly. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 518–530 (2016).
  45. Bonizzoni, P. et al. On the minimum error correction problem for haplotype assembly in diploid and polyploid genomes. J. Comput. Biol. 23, 718–736 (2016).
  46. Artyomenko, A. et al. Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants. J. Comput. Biol. 24, 558–570 (2017).
  47. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995).
  48. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
  49. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).
  50. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  51. Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 22, 498–509 (2015).
  52. Steinberg, K. M. et al. Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat. Genet. 44, 872–880 (2012).
  53. Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).


The authors thank S. Cantsilieris and D. Gordon for technical assistance, J. Underwood for recommendations regarding the analysis of HSDs and Iso-Seq data, and T. Brown for help in editing this manuscript. This work was supported, in part, by grants from the US National Institutes of Health (NIH) (HG002385 to E.E.E., HG007635 to R.K.W. and E.E.E., and HG003079 to R.K.W.). M.R.V. was supported by a National Library of Medicine (NLM) Big Data Training Grant for Genomics and Neuroscience (5T32LM012419-04). P.C.D. was supported by a National Human Genome Research Institute (NHGRI) training grant (5T32HG000035-23). E.E.E. is an investigator of the Howard Hughes Medical Institute.

Author information




M.R.V., M.J.P.C., and E.E.E. developed the SDA method; R.K.W. and T.A.G.-L. generated the PacBio genome sequence; M.S., A.E.W., M.R.V., and V.D. sequenced and analyzed the BAC clone insert; P.C.D., M.R.V., and M.L.D. carried out Iso-Seq analysis; M.R.V. organized the supplementary material; M.R.V., E.E.E., and M.J.P.C. wrote the manuscript; M.R.V. and P.C.D. produced the display items.

Corresponding authors

Correspondence to Mark J. P. Chaisson or Evan E. Eichler.

Ethics declarations

Competing interests

E.E.E. is on the scientific advisory board (SAB) of DNAnexus, Inc.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Proportion of resolved SDs in different PacBio (PB)/ONT genome assemblies.

The figure shows the percent of SD bases that are resolved in human genome assemblies plotted as a function of the length of minimum extension of the alignment past the duplication. The number of resolved SD base pairs is relatively constant irrespective of the requirement of flanking unique base pairs. The dashed red line indicates the threshold chosen for our analysis used to generate the first panel in Supplementary Fig. 2 and the fraction of resolved SDs in Supplementary Table 1.

Supplementary Figure 2 Resolution of SDs in SMRT genome assemblies.

a) SDs (as a function of percent identity and length) in GRCh38 are marked as resolved (black) if present in the CHM1 assembly, or unresolved (red) if they appear only in the reference. The stacked marginal histograms show the relative number of resolved and unresolved SDs within each bin. Resolved duplications are defined as those mapping with high sequence identity, being completely contained, and extending at least 50 kb into unique sequence on either side of the duplication block (Methods). See Supplementary Fig. 1 and Supplementary Table 1 for the fraction of unresolved duplications across different genomes, assemblers, and technologies. Note that resolved and unresolved SDs are offset from one another along the y-axis to avoid overlapping. b) This plot shows the number of genes that exist within unresolved SD blocks in the CHM1 assembly versus the maximum percent identity SD within that block.
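The definition of a resolved duplication given in this caption can be written as a short predicate. The sketch below is illustrative only: the `Interval` type, the function name `is_resolved`, and the `min_identity` default are hypothetical (the caption defers the identity cutoff to the Methods), and "either side" is read here as requiring the flank on both sides.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Half-open reference interval [start, end) in base pairs."""
    start: int
    end: int


def is_resolved(sd, aln, identity, min_flank=50_000, min_identity=99.0):
    """Return True when an assembly alignment resolves a duplication block.

    sd:        reference coordinates of the duplication block
    aln:       reference coordinates covered by the assembly alignment
    identity:  percent identity of the alignment
    min_identity is a placeholder value, not the published threshold.
    """
    left_flank = sd.start - aln.start    # unique sequence before the block
    right_flank = aln.end - sd.end       # unique sequence after the block
    contained_with_flanks = left_flank >= min_flank and right_flank >= min_flank
    return identity >= min_identity and contained_with_flanks


# Hypothetical coordinates for a 200 kb duplication block.
block = Interval(1_000_000, 1_200_000)
spanning = Interval(940_000, 1_260_000)   # 60 kb of flank on each side
short_left = Interval(980_000, 1_260_000) # only 20 kb of flank on the left
```

Varying `min_flank` corresponds to the minimum-extension axis plotted in Supplementary Fig. 1.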

Supplementary Figure 3 Length of collapsed SDs and SDA assemblies.

Correlation of collapse length and SDA assembly length in a) CHM1 (n = 590), b) CHM13 (n = 1,440), and c) NA19240 (n = 1,772) genome assemblies. In all three assemblies there is a strong correlation (Pearson’s correlation) between the length of a collapsed SD and the length of the resulting SDA assembly. SDA is not restricted to assembling duplications less than the maximum read length (like other assemblers), but rather it is restricted by the size of the collapsed duplication.

Supplementary Figure 4 Sequence and assembly of NOTCH2 loci in the CHM1 human genome.

a) A collapsed representation of a portion of the NOTCH2 loci is shown. Plotted is the read-depth profile over a collapsed representation of NOTCH2. Each black dot represents the coverage of the most frequent base pair at that position, while each red dot is the second most frequent. Secondary bases at low frequency represent sequencing error; however, those at high frequency represent PSV candidates. b) NOTCH2 PSV graph resolves the collapse into five potential loci. c) The alignment of each SDA contig back to the loci for NOTCH2 (./NLA/NLB/NLC/NLD) using Miropeats. Our assembled sequence is 99.88% identical over all five loci and >99.995% identical if only mismatched bases are counted as errors.
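The distinction drawn in panel a between sequencing error and PSV candidates can be sketched as a filter on the depth of the second most frequent base. The pileup representation, the function names, and the `min_depth` cutoff below are assumptions for illustration and are not taken from SDA.

```python
def second_base_depth(counts):
    """Depth of the second most frequent base at one position (0 if none)."""
    ordered = sorted(counts.values(), reverse=True)
    return ordered[1] if len(ordered) > 1 else 0


def psv_candidates(pileup, min_depth=10):
    """Return positions whose second most frequent base is well supported.

    pileup:    iterable of (position, {base: read count}) pairs
    min_depth: hypothetical cutoff separating error from PSV candidates
    """
    return [pos for pos, counts in pileup if second_base_depth(counts) >= min_depth]


# Hypothetical pileup over a collapsed region.
pileup = [
    (100, {"A": 58, "C": 2}),   # low-frequency secondary base: likely error
    (101, {"G": 40, "T": 21}),  # high-frequency secondary base: PSV candidate
    (102, {"C": 60}),           # no secondary base observed
]
candidates = psv_candidates(pileup)  # [101]
```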

Supplementary Figure 5 SDA results for the CHM13 assembly.

a) SDA analysis of the CHM13 FALCON assembly generates 1,848 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 40.4 Mb of diverged assembly (gray) and 43.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) Copy number difference (CND) between CHM13 and the reference genome (CHM13 copy number – reference genome copy number) comparing n = 186 SD regions that match (>99.8%) versus n = 374 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 1.61 and the mean CND of the diverged sequence is 5.98, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 2.77 × 10⁻⁵). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)
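The CND comparison in panel d amounts to a subtraction followed by a split at the 99.8% identity threshold. A minimal sketch follows, using invented numbers; the tuple layout and function name are hypothetical, and assigning a region at exactly 99.8% to the diverged class is an assumption, since the caption does not say where that boundary case falls.

```python
from statistics import mean

IDENTITY_THRESHOLD = 99.8  # percent identity separating matched from diverged


def mean_cnd_by_class(regions):
    """Mean copy number difference for matched and diverged SD regions.

    regions: iterable of (percent_identity, sample_copy_number,
             reference_copy_number) tuples.
    """
    groups = {"matched": [], "diverged": []}
    for identity, sample_cn, reference_cn in regions:
        label = "matched" if identity > IDENTITY_THRESHOLD else "diverged"
        groups[label].append(sample_cn - reference_cn)  # CND for this region
    return {label: mean(values) for label, values in groups.items() if values}


# Invented example values, not data from the study.
regions = [
    (99.95, 4, 2),
    (99.90, 3, 2),
    (99.10, 10, 4),
    (98.70, 8, 2),
]
summary = mean_cnd_by_class(regions)
```

A rank-based test such as the two-sided Mann-Whitney test named in the caption would then be applied to the two lists of per-region CND values rather than to their means.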

Supplementary Figure 6 SDA results for the NA19240 (African Yoruban) assembly.

a) SDA analysis of the NA19240 FALCON assembly generates 2,136 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 46.1 Mb of diverged assembly (gray) and 41.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) CND between NA19240 and the reference genome (NA19240 copy number – reference genome copy number) comparing n = 177 SD regions that match (>99.8%) versus n = 384 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 4.11 and the mean CND of the diverged sequence is 10.87, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 1.88 × 10⁻⁴). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)

Supplementary Figure 7 Comparison of SDA on ONT versus SMRT data.

The left half of the figure shows the results of SDA applied to the ONT assembly of NA12878; on the right is the PacBio assembly of NA19240. a) SDA analysis of the NA12878 assembly generated 38 assemblies that mapped with >99.8% identity (matched) to GRCh38 and 792 that mapped with <99.8% sequence identity (diverged). Failed clusters (n = 1,052) did not result in an assembly, while multiple assemblies were PSV clusters for which the Canu assembly produced more than one contig. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. The number of assembly Mb is calculated independently of a mapping to the reference. c) Length distribution of the matched and diverged assemblies (NA12878: matched n = 38, diverged n = 792; NA19240: matched n = 789, diverged n = 983). The lines on the violin plots indicate the first and third quartiles as well as the median. d) Sequencing read-depth distribution of the second most common SNV across all collapsed regions of SDs.

Supplementary Figure 8 Sequence and assembly of a missing 16p12.1 duplication.

The Miropeats alignments compare a BAC-based tiling path assembly of CHM1 (top line) to the human reference genome (GRCh38) (middle line) and to a de novo assembly of CHM1 to which SDA was applied (bottom line). The A/C duplication (red/blue) proposed by Sudmant et al., which is present in most humans, was correctly assembled using SDA and matches the BAC-based assembly structure at high sequence identity (99.9%).

Supplementary Figure 9 Mapping differential of transcripts between SDA and de novo CHM13.

The percent identity differential of the mapping of full-length Iso-Seq transcripts (n = 14,562) from human-specific segmental duplications (HSDs) to both the de novo assembly of CHM13 and the SDA results on CHM13 is shown. In total, 11 gene families showed significantly (P < 0.001, two-sided Wilcoxon signed-rank test) improved mapping to the SDA-resolved contigs. The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles.

Supplementary Figure 10 Multiple sequence alignment (MSA) between GRCh38 GPRIN2 and SDA GPRIN2A/B.

Shown is the amino acid MSA between the copies of GPRIN2 resolved by SDA and the copy of GPRIN2 in GRCh38. Of the 15 differences in the MSA, 12 are annotated in dbSNP as variants in GPRIN2 when they are in fact differences between GPRIN2A and GPRIN2B. At p.Ser104Gly, p.Arg242Gly, and p.Val375Ala, the reference has the minor allele. Supplementary Table 7 shows the allele frequencies for all variants seen in this alignment.

Supplementary Figure 11 CHM1 SDA contigs that overlap with unique sequence.

This ideogram shows where SDA contigs could extend the FALCON assembly. The bottom panel of each chromosome shows the FALCON assembly (contigs > 1 Mb (dark blue), contigs < 1 Mb (light blue)). The top panel shows where SDA contigs with unique overlaps map along the reference (contigs with > 10 kb of overlap (green), contigs with < 10 kb of overlap (red)).

Supplementary Figure 12 PSV graph without attraction edges.

Reproduced above is the PSV graph shown in Fig. 3 for SRGAP2. The left-hand side shows the attraction edges used in correlation clustering (CC). On the right-hand side, the edges are removed so that the transparency of the nodes is visible. The opacity of each node scales from 0.25 to 1, with 0.25 reflecting the start position on the contig and 1 representing the final position on the contig.
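The opacity scale described in this caption can be expressed as a simple mapping from contig position to a value between 0.25 and 1. The sketch below assumes the scaling is linear in position, which the caption does not state; the function name and arguments are hypothetical.

```python
def node_opacity(position, contig_length, lowest=0.25, highest=1.0):
    """Map a PSV position on a contig to a plotting opacity.

    Assumes linear scaling: position 0 gives `lowest` and the final
    position (contig_length) gives `highest`.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    fraction = min(max(position / contig_length, 0.0), 1.0)  # clamp to [0, 1]
    return lowest + (highest - lowest) * fraction


start_opacity = node_opacity(0, 1_000)       # 0.25
middle_opacity = node_opacity(500, 1_000)    # 0.625
end_opacity = node_opacity(1_000, 1_000)     # 1.0
```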

Supplementary information

Supplementary Text and Figures

Supplementary Figs. 1–12, Supplementary Tables 1–3 and 7, and Supplementary Note 1

Reporting Summary

Supplementary Table 4

CHORI-17 BAC clone sequences.

Supplementary Table 5

BAC clone sequence analysis.

Supplementary Table 6

Gene content analysis.

Supplementary Table 8

Summary of all SDA assemblies from CHM1, CHM13, and NA19240.


Cite this article

Vollger, M.R., Dishuck, P.C., Sorensen, M. et al. Long-read sequence and assembly of segmental duplications. Nat Methods 16, 88–94 (2019). https://doi.org/10.1038/s41592-018-0236-3
