We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33–79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8% sequence identity) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
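The graph idea summarized above can be illustrated with a minimal sketch. This is not the SDA implementation: it assumes each read has already been reduced to the set of PSV identifiers it carries, that an attraction edge links two PSVs seen together in at least `min_shared` reads (a hypothetical threshold), and that every other PSV pair is treated as repulsion. The clustering step is a simple random-pivot heuristic for correlation clustering, and all function names are illustrative.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_attraction_edges(reads, min_shared=2):
    """Link two PSVs when at least `min_shared` reads carry both (assumed rule)."""
    support = defaultdict(int)
    for psvs in reads:
        for a, b in combinations(sorted(set(psvs)), 2):
            support[(a, b)] += 1
    return {pair for pair, n in support.items() if n >= min_shared}


def pivot_clustering(nodes, attraction, seed=0):
    """Pick a random pivot, group it with its unclustered attraction neighbors, repeat."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for a, b in attraction:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | (neighbors[pivot] & remaining)
        clusters.append(cluster)
        remaining -= cluster
    return clusters


def assign_reads(reads, clusters):
    """Assign each read to the cluster sharing the most PSVs with it (None if no PSV matches)."""
    assignment = []
    for psvs in reads:
        overlaps = [len(set(psvs) & c) for c in clusters]
        best = max(range(len(clusters)), key=overlaps.__getitem__)
        assignment.append(best if overlaps[best] > 0 else None)
    return assignment


# Toy data: two paralogs, with PSVs p1-p3 on one and p4-p6 on the other.
reads = [
    ["p1", "p2", "p3"], ["p1", "p2"], ["p2", "p3"], ["p1", "p3"],
    ["p4", "p5", "p6"], ["p4", "p5"], ["p5", "p6"], ["p4", "p6"],
]
nodes = sorted({p for r in reads for p in r})
edges = build_attraction_edges(reads)
clusters = pivot_clustering(nodes, edges)
read_groups = assign_reads(reads, clusters)
```

In this toy input the PSVs separate into two groups and the reads partition accordingly; each read group would then be assembled on its own. Real data would need error-tolerant edge weights rather than a fixed co-occurrence count.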
SMRT WGS for CHM1, CHM13, and NA19240 from this study are available at the NCBI Sequence Read Archive (SRA) under accession numbers SRP044331 for CHM1; SRX818607, SRX825542, and SRX825575–SRX825579 for CHM13; and SRX1093000, SRX1093555, SRX1093654, SRX1094289, SRX1094374, SRX1094388, and SRX1096798 for NA19240. ONT WGS data are available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md. De novo assemblies of CHM1, CHM13, NA19240, and NA12878 from this study are available at the NCBI Assembly database under accession numbers GCA_001297185.1, GCA_000983455.2, GCA_001524155.4, and GCA_900232925.1, respectively. Assembled CHORI-17 BACs are available at the NCBI Clone DB (https://www.ncbi.nlm.nih.gov/clone/) under the accession numbers listed in Supplementary Table 4. Information about length, PSVs, and mapping location in GRCh38 for all SDA contigs generated can be found in Supplementary Table 8. Additional data that support the findings of this study are available from the corresponding author upon request.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome sequence assembly. Nat. Methods 8, 61–65 (2011).
Seo, J. S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).
Shi, L. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016).
Bickhart, D. M. et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat. Genet. 49, 643–650 (2017).
Gordon, D. et al. Long-read sequence assembly of the gorilla genome. Science 352, aae0344 (2016).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
Kronenberg, Z. N. et al. High-resolution comparative analysis of great ape genomes. Science 360, eaar6343 (2018).
Kelley, D. R. & Salzberg, S. L. Detection and correction of false segmental duplications caused by genome mis-assembly. Genome Biol. 11, R28 (2010).
Pop, M. Shotgun sequence assembly. Adv. Comput. 60, 193–248 (2004).
Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001).
Pevzner, P. A., Tang, H. & Tesler, G. De novo repeat classification and fragment assembly. Genome Res. 14, 1786–1796 (2004).
Myers, E. W. The fragment assembly string graph. Bioinformatics 21, ii79–ii85 (2005).
Stankiewicz, P. & Lupski, J. R. Genome architecture, rearrangements and genomic disorders. Trends Genet. 18, 74–82 (2002).
Sharp, A. J. et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 38, 1038–1042 (2006).
Sudmant, P. H. et al. Global diversity, population stratification, and selection of human copy-number variation. Science 349, aab3761 (2015).
Chen, J. et al. Bovine NK-lysin: copy number variation and functional diversification. Proc. Natl Acad. Sci. USA 112, E7223–E7229 (2015).
Dennis, M. Y. & Eichler, E. E. Human adaptation and evolution by segmental duplication. Curr. Opin. Genet. Dev. 41, 44–52 (2016).
Abegglen, L. M. et al. Potential mechanisms for cancer resistance in elephants and comparative cellular response to DNA damage in humans. J. Am. Med. Assoc. 314, 1850–1860 (2015).
Church, D. M. et al. Lineage-specific biology revealed by a finished genome assembly of the mouse. PLoS Biol. 7, e1000112 (2009).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Emanuel, B. S. & Shaikh, T. H. Segmental duplications: an ‘expanding’ role in genomic instability and disease. Nat. Rev. Genet. 2, 791–800 (2001).
Jain, M. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345 (2018).
Chaisson, M. J., Mukherjee, S., Kannan, S. & Eichler, E. E. Resolving multicopy duplications de novo using polyploid phasing. RECOMB 10229, 117–133 (2017).
Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
Ailon, N., Charikar, M. & Newman, A. Aggregating inconsistent information. J. Assoc. Comput. Mach. 55, 1–27 (2008).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Fiddes, I. T. et al. Human-specific NOTCH2NL genes affect notch signaling and cortical neurogenesis. Cell 173, 1356–1369 (2018).
Florio, M. et al. Evolution and cell-type specificity of human-specific genes preferentially expressed in progenitors of fetal neocortex. eLife 7, e32332 (2018).
Dennis, M. Y. et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell 149, 912–922 (2012).
Nuttle, X. et al. Rapid and accurate large-scale genotyping of duplicated genes and discovery of interlocus gene conversions. Nat. Methods 10, 903–909 (2013).
Dennis, M. Y. et al. The evolution and population diversity of human-specific segmental duplications. Nat. Ecol. Evol. 1, 0069 (2017).
Steinberg, K. M. et al. High-quality assembly of an individual of Yoruban descent. bioRxiv Preprint at https://www.biorxiv.org/content/early/2016/08/02/067447 (2016).
Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608–611 (2015).
BACPAC Resources. The CHORI-17 BAC library from a hydatidiform (haploid) mole. CloneDB https://www.ncbi.nlm.nih.gov/clone/library/genomic/76/ (2018).
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
Nuttle, X. et al. Emergence of a Homo sapiens–specific gene family and chromosome 16p11.2 CNV susceptibility. Nature 536, 205–209 (2016).
Dougherty, M. L. et al. Transcriptional fates of human-specific segmental duplications in brain. Genome Res. 28, 1566–1576 (2018).
Das, S. & Vikalo, H. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics 16, 260 (2015).
Aguiar, D. & Istrail, S. Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 29, i352–i360 (2013).
Berger, E., Yorukoglu, D., Peng, J. & Berger, B. in Research in Computational Molecular Biology: RECOMB 2014 (ed Sharan, R.) 18–19 (Springer, 2014).
Puljiz, Z. & Vikalo, H. Decoding genetic variations: communications-inspired haplotype assembly. IEEE/ACM Trans. Comput. Biol. Bioinform. 13, 518–530 (2016).
Bonizzoni, P. et al. On the minimum error correction problem for haplotype assembly in diploid and polyploid genomes. J. Comput. Biol. 23, 718–736 (2016).
Artyomenko, A. et al. Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants. J. Comput. Biol. 24, 558–570 (2017).
Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615–619 (1995).
Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Patterson, M. et al. WhatsHap: weighted haplotype assembly for future-generation sequencing reads. J. Comput. Biol. 22, 498–509 (2015).
Steinberg, K. M. et al. Structural diversity and African origin of the 17q21.31 inversion polymorphism. Nat. Genet. 44, 872–880 (2012).
Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).
The authors thank S. Cantsilieris and D. Gordon for technical assistance, J. Underwood for recommendations regarding the analysis of HSDs and Iso-Seq data, and T. Brown for help in editing this manuscript. This work was supported, in part, by grants from the US National Institutes of Health (NIH) (HG002385 to E.E.E., HG007635 to R.K.W. and E.E.E., and HG003079 to R.K.W.). M.R.V. was supported by a National Library of Medicine (NLM) Big Data Training Grant for Genomics and Neuroscience (5T32LM012419-04). P.C.D. was supported by a National Human Genome Research Institute (NHGRI) training grant (5T32HG000035-23). E.E.E. is an investigator of the Howard Hughes Medical Institute.
Integrated supplementary information
The figure shows the percent of SD bases that are resolved in human genome assemblies plotted as a function of the length of minimum extension of the alignment past the duplication. The number of resolved SD base pairs is relatively constant irrespective of the requirement of flanking unique base pairs. The dashed red line indicates the threshold chosen for our analysis used to generate the first panel in Supplementary Fig. 2 and the fraction of resolved SDs in Supplementary Table 1.
a) SDs (as a function of percent identity and length) in GRCh38 are marked as resolved (black) if present in the CHM1 assembly, or unresolved (red) if they appear only in the reference. The stacked marginal histograms show the relative number of resolved and unresolved SDs within each bin. Resolved duplications are defined as those mapping with high sequence identity, being completely contained, and extending at least 50 kb into unique sequence on either side of the duplication block (Methods). See Supplementary Fig. 1 and Supplementary Table 1 for the fraction of unresolved duplications across different genomes, assemblers, and technologies. Note that resolved and unresolved SDs are offset from one another along the y-axis to avoid overlapping. b) This plot shows the number of genes that exist within unresolved SD blocks in the CHM1 assembly versus the maximum percent identity SD within that block.
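The definition of a resolved duplication given above can be written as a small predicate. The sketch below is illustrative only: the `Alignment` record and the function name are hypothetical, and `min_identity` is an assumed placeholder because the caption says only "high sequence identity" (the actual threshold is in the Methods).

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Reference span covered by one assembly-contig alignment (hypothetical record)."""
    chrom: str
    start: int
    end: int
    identity: float  # fraction of matching bases, between 0 and 1


def is_resolved(sd_chrom, sd_start, sd_end, aln, min_flank=50_000, min_identity=0.99):
    """True when the alignment contains the SD and extends `min_flank` bp past both ends.

    min_identity is an assumed value, not taken from the source.
    """
    return (
        aln.chrom == sd_chrom
        and aln.identity >= min_identity
        and aln.start <= sd_start - min_flank
        and aln.end >= sd_end + min_flank
    )


# An SD at chr1:1,000,000-1,100,000 and two candidate alignments.
spanning = Alignment("chr1", 900_000, 1_200_000, 0.999)
too_short = Alignment("chr1", 980_000, 1_200_000, 0.999)
resolved_spanning = is_resolved("chr1", 1_000_000, 1_100_000, spanning)
resolved_short = is_resolved("chr1", 1_000_000, 1_100_000, too_short)
```

The second alignment fails because it reaches only 20 kb into the unique sequence on the left side. Varying `min_flank` corresponds to the x-axis of the minimum-extension plot described in the first supplementary caption.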
Correlation of collapse length and SDA assembly length in a) CHM1 (n = 590), b) CHM13 (n = 1,440), and c) NA19240 (n = 1,772) genome assemblies. In all three assemblies there is a strong correlation (Pearson’s correlation) between the length of a collapsed SD and the length of the resulting SDA assembly. SDA is not restricted to assembling duplications less than the maximum read length (like other assemblers), but rather it is restricted by the size of the collapsed duplication.
a) The read-depth profile over a collapsed representation of a portion of the NOTCH2 loci is shown. Each black dot represents the coverage of the most frequent base pair at that position, while each red dot is the second most frequent. Secondary bases at low frequency represent sequencing error; however, those at high frequency represent PSV candidates. b) NOTCH2 PSV graph resolves the collapse into five potential loci. c) The alignment of each SDA contig back to the loci for NOTCH2 (./NLA/NLB/NLC/NLD) using Miropeats. Our assembled sequence is 99.88% identical over all five loci and >99.995% identical if only mismatched bases are counted as errors.
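The distinction drawn in panel a between low-frequency secondary bases (sequencing error) and high-frequency ones (PSV candidates) can be sketched as a per-column count. This is a simplified illustration, not the SDA code: each pileup column is assumed to be a string of the bases observed at one position, and the `min_second` cutoff is a hypothetical value.

```python
from collections import Counter


def base_depths(column):
    """Return (count of most frequent base, count of second most frequent base)."""
    counts = Counter(column).most_common(2)
    first = counts[0][1] if counts else 0
    second = counts[1][1] if len(counts) > 1 else 0
    return first, second


def psv_candidates(columns, min_second=10):
    """Positions whose second most frequent base appears in at least `min_second` reads."""
    return [i for i, col in enumerate(columns) if base_depths(col)[1] >= min_second]


columns = [
    "A" * 60,             # no secondary base
    "A" * 57 + "C" * 3,   # low-frequency secondary base, likely sequencing error
    "G" * 35 + "T" * 25,  # high-frequency secondary base, PSV candidate
]
candidates = psv_candidates(columns)
```

Only the third column passes the cutoff here. In practice the cutoff would depend on the coverage expected for a single paralog.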
a) SDA analysis of the CHM13 FALCON assembly generates 1,848 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 40.4 Mb of diverged assembly (gray) and 43.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) Copy number difference (CND) between CHM13 and the reference genome (CHM13 copy number – reference genome copy number) comparing n = 186 SD regions that match (>99.8%) versus n = 374 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 1.61 and the mean CND of the diverged sequence is 5.98, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 2.77 × 10⁻⁵). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)
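The CND and box-plot conventions described in panel d can be made concrete with a short sketch. The copy numbers below are invented toy values, the function names are illustrative, and the "inclusive" quantile method is an assumption; the plotting software used for the figures may compute quartiles slightly differently.

```python
import statistics


def copy_number_differences(sample_cn, reference_cn):
    """CND per region: sample copy number minus reference copy number."""
    return [s - r for s, r in zip(sample_cn, reference_cn)]


def box_summary(values):
    """Quartiles plus whiskers at the extreme values within 1.5 x IQR of the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Toy copy numbers for nine regions (not data from the study).
sample_cn = [2, 3, 4, 5, 6, 7, 8, 9, 22]
reference_cn = [2, 2, 2, 2, 2, 2, 2, 2, 2]
cnd = copy_number_differences(sample_cn, reference_cn)
summary = box_summary(cnd)
```

For these values the box spans 2 to 6 with a median of 4, the whiskers end at 0 and 7, and the region with a CND of 20 falls outside the upper whisker.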
a) SDA analysis of the NA19240 FALCON assembly generates 2,136 PSV clusters. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. There are 46.1 Mb of diverged assembly (gray) and 41.0 Mb that map to the reference at high identity (black). c) A density plot of SDs plotted by length and percent identity. d) CND between NA19240 and the reference genome (NA19240 copy number – reference genome copy number) comparing n = 177 SD regions that match (>99.8%) versus n = 384 diverged SD regions (<99.8% identity). The mean CND of the matched sequence is 4.11 and the mean CND of the diverged sequence is 10.87, indicating that the diverged sequences are much more likely to represent additional duplicate copies that are unrepresented in the reference genome (GRCh38) (two-sided Mann-Whitney test; P = 1.88 × 10⁻⁴). The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles. (See Fig. 2 for more details.)
The left half of the figure shows the results of SDA applied to the ONT assembly of NA12878; on the right is the PacBio assembly of NA19240. a) SDA analysis of the NA12878 assembly generated 38 assemblies that mapped with >99.8% identity (matched) to GRCh38 and 792 that mapped with <99.8% sequence identity (diverged). Failed clusters (n = 1,052) did not result in an assembly, while clusters labeled as multiple assemblies were PSV clusters for which the Canu assembly produced more than one contig. b) Cumulative distribution of the assemblies and their percent identity to their best match in the reference. The number of assembly Mb is calculated independently of a mapping to the reference. c) Length distribution of the matched and diverged assemblies (NA12878: matched n = 38, diverged n = 792; NA19240: matched n = 789, diverged n = 983). The lines on the violin plots indicate the first and third quartiles as well as the median. d) Sequencing read-depth distribution of the second most common SNV across all collapsed regions of SDs.
The Miropeats alignments compare a BAC-based tiling path assembly of CHM1 (top line), the human reference genome (GRCh38) (middle line), and a de novo assembly of CHM1 to which SDA was applied (bottom line). The A/C duplication (red blue) proposed by Sudmant et al. that is present in most humans was correctly assembled using SDA and matches the BAC-based assembly structure at high sequence identity (99.9%).
The percent identity differential of the mapping of full-length Iso-Seq transcripts (n = 14,562) from human-specific segmental duplications (HSDs) to both the de novo assembly of CHM13 and the SDA results on CHM13 is shown. In total, 11 gene families showed significantly (P < 0.001, two-sided Wilcoxon signed-rank test) improved mapping to the SDA-resolved contigs. The boxes indicate the range between the first and third quartiles, with the bold line specifying the median. The whiskers show the minimum and maximum within 1.5 times the interquartile range extending from the first and third quartiles.
Shown is the amino acid MSA between the copies of GPRIN2 resolved by SDA and the copy of GPRIN2 in GRCh38. Of the 15 differences in the MSA, 12 are annotated in dbSNP as variants in GPRIN2 when they are in fact differences between GPRIN2A and GPRIN2B. At p.Ser104Gly, p.Arg242Gly, and p.Val375Ala, the reference has the minor allele. Supplementary Table 7 shows the allele frequencies for all variants seen in this alignment.
This ideogram shows where SDA contigs could extend the FALCON assembly. The bottom panel of each chromosome shows the FALCON assembly (contigs > 1 Mb (dark blue), contigs < 1 Mb (light blue)). The top panel shows where SDA contigs with unique overlaps map along the reference (contigs with > 10 kb of overlap (green), contigs with < 10 kb (red)).
Reproduced above is the PSV graph shown in Fig. 3 for SRGAP2. The left-hand side shows the attraction edges used in correlation clustering (CC). On the right-hand side, the edges are removed so that the transparency of the nodes is visible. The opacity of each node scales from 0.25 to 1, with 0.25 reflecting the start position on the contig and 1 representing the final position on the contig.
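The opacity encoding described above can be expressed as a one-line mapping. The sketch assumes the scaling is linear in contig position, which the caption does not state explicitly; the function and parameter names are illustrative.

```python
def node_opacity(position, contig_start, contig_end, low=0.25, high=1.0):
    """Map a PSV position on the contig to an opacity between `low` and `high` (assumed linear)."""
    if contig_end == contig_start:
        return high
    fraction = (position - contig_start) / (contig_end - contig_start)
    fraction = min(1.0, max(0.0, fraction))  # clamp positions outside the contig
    return low + (high - low) * fraction


start_opacity = node_opacity(0, 0, 1000)
middle_opacity = node_opacity(500, 0, 1000)
end_opacity = node_opacity(1000, 0, 1000)
```

Under this assumption a PSV halfway along the contig is drawn at an opacity of 0.625.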
Supplementary Figs. 1–12, Supplementary Tables 1–3 and 7, and Supplementary Note 1
CHORI-17 BAC clone sequences.
BAC clone sequence analysis.
Gene content analysis.
Summary of all SDA assemblies from CHM1, CHM13, and NA19240.