Increased mutation and gene conversion within human segmental duplications

Vollger, Mitchell R.; Dishuck, Philip C.; Harvey, William T.; DeWitt, William S.; Guitart, Xavi; Goldberg, Michael E.; Rozanski, Allison N.; Lucas, Julian; Asri, Mobin; Munson, Katherine M.; Lewis, Alexandra P.; Hoekzema, Kendra; Logsdon, Glennis A.; Porubsky, David; Paten, Benedict; Harris, Kelley; Hsieh, PingHsun; Eichler, Evan E.

doi:10.1038/s41586-023-05895-y

Article
Open access
Published: 10 May 2023

Nature volume 617, pages 325–334 (2023)

Abstract

Single-nucleotide variants (SNVs) in segmental duplications (SDs) have not been systematically assessed because of the limitations of mapping short-read sequencing data^1,2. Here we constructed 1:1 unambiguous alignments spanning high-identity SDs across 102 human haplotypes and compared the pattern of SNVs between unique and duplicated regions^3,4. We find that human SNVs are elevated 60% in SDs compared to unique regions and estimate that at least 23% of this increase is due to interlocus gene conversion (IGC) with up to 4.3 megabase pairs of SD sequence converted on average per human haplotype. We develop a genome-wide map of IGC donors and acceptors, including 498 acceptor and 454 donor hotspots affecting the exons of about 800 protein-coding genes. These include 171 genes that have ‘relocated’ on average 1.61 megabase pairs in a subset of human haplotypes. Using a coalescent framework, we show that SD regions are slightly evolutionarily older when compared to unique sequences, probably owing to IGC. SNVs in SDs, however, show a distinct mutational spectrum: a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts and a 7.6% reduction in the frequency of CpG-associated mutations when compared to unique DNA. We reason that these distinct mutational properties help to maintain an overall higher GC content of SD DNA compared to that of unique DNA, probably driven by GC-biased conversion between paralogous sequences^5,6.

Main

The landscape of human SNVs has been well characterized for more than a decade in large part owing to wide-reaching efforts such as the International HapMap Project and the 1000 Genomes Project^7,8. Although these consortia helped to establish the genome-wide pattern of SNVs (as low as 0.1% allele frequency) and linkage disequilibrium on the basis of sequencing and genotyping thousands of human genomes, not all parts of the human genome could be equally ascertained. Approximately 10–15% of the human genome⁸ has remained inaccessible to these types of analysis either because of gaps in the human genome sequence or, more frequently, the low mapping quality associated with aligning short-read whole-genome sequencing data. This is because short-read sequence data are of insufficient length (<200 base pairs (bp)) to unambiguously assign reads and, therefore, variants to specific loci⁹. Although certain classes of large, highly identical repeats (for example, α-satellites in centromeres) were readily recognized, others, especially SDs¹ and their 859 associated genes¹⁰, in euchromatin were much more problematic to recognize.

Operationally, SDs are defined as interchromosomal or intrachromosomal homologous regions in any genome that are >1 kbp in length and >90% identical in sequence^1,11. As such regions arise by duplication as opposed to retrotransposition, they were initially difficult to identify and early versions of the human genome sequence had either missed or misassembled these regions owing to their high sequence identity^12,13. Large-insert BAC clones ultimately led to many of these regions being resolved. Subsequent analyses showed that SDs contribute disproportionately to copy number polymorphisms and disease structural variation^9,14, are hotspots for gene conversion¹⁵, are substantially enriched in GC-rich DNA and Alu repeats^16,17, and are transcriptionally diverse leading to the emergence, in some cases, of human-specific genes thought to be important for human adaptation^18,19,20,21. Despite their importance, the pattern of SNVs among humans has remained poorly characterized. Early on, paralogous sequence variants were misclassified as SNVs² and, as a result, later high-identity SDs became blacklisted from SNV analyses because short-read sequence data could not be uniquely placed^22,23. This exclusion has translated into a fundamental lack of understanding in mutational processes precisely in regions predicted to be more mutable owing to the action of IGC^24,25,26,27,28. Previously, we noted an increase in SNV density in duplicated regions when compared to unique regions of the genome on the basis of our comparison of GRCh38 and the complete telomere-to-telomere (T2T) human reference genome¹⁰. Leveraging high-quality phased genome assemblies from 47 humans generated as part of the Human Pangenome Reference Consortium (HPRC)³, we sought to investigate this difference more systematically and compare the SNV landscape of duplicated and unique DNA in the human genome revealing distinct mutational properties.

Strategy and quality control

Unlike previous SNV discovery efforts, which catalogued SNVs on the basis of the alignment of sequence reads, our strategy was assembly driven (Extended Data Fig. 1). We focused on the comparison of 102 haplotype-resolved genomes (Supplementary Table 1) generated as part of the HPRC (n = 94) or other efforts (n = 8)^3,4,12,29 in which phased genome assemblies had been assembled using high-fidelity (HiFi) long-read sequencing³⁰. The extraordinary assembly contiguity of these haplotypes (contig N50, defined as the sequence length of the shortest contig at 50% of the total assembly length, > 40 Mbp) provided an unprecedented opportunity to align large swathes (>1 Mbp) of the genome, including high-identity SD repeats anchored by megabases of synteny.

As SD regions are often enriched in assembly errors even among long-read assemblies^3,4,31, we carried out a series of analyses to assess the integrity and quality of these regions in each assembled haplotype. First, we searched for regions of collapse¹¹ by identifying unusual increases or decreases in sequence read depth³. We determined that, on average, only 1.64 Mbp (1.37%) of the analysed SD sequence was suspect owing to unusually high or low sequence read depth on the basis of mapping of underlying read data, as such patterns are often indicative of a misassembly³ (Methods). Next, for all SD regions used in our analysis we compared the predicted copy number by Illumina sequence read depth with the sum based on the total copy number from the two assembled haplotypes. These orthogonal copy number estimates were highly correlated (Pearson’s R = 0.99, P < 2.2 × 10⁻¹⁶; Supplementary Fig. 1) implying that most SD sequences in the assemblies have the correct copy number. To confirm these results in even the most difficult to assemble duplications, we selected 19 of the largest and most identical SDs across 47 haplotypes for a total of 893 tests. These estimates were also highly correlated (Pearson’s R = 0.99, P < 2.2 × 10⁻¹⁶; Supplementary Figs. 2 and 3), and of the 893 tests conducted, 756 were identical. For the 137 tests for which estimates differed, most (n = 125) differed by only one copy. Finally, most of these discrepancies came from just three large (>140 kbp) and highly identical (>99.3%) SDs (Supplementary Fig. 3).

To validate the base-level accuracy, we next compared the quality value for both SD and unique sequences using Illumina sequencing data for 45 of the HPRC samples (Methods). Both unique (average quality value = 59, s.d. = 1.9) and SD (average quality value = 53, s.d. = 1.9) regions are remarkably high quality, which in the case of SDs translates into less than 1 SNV error every 200 kbp (Supplementary Fig. 4). We further show that these high-quality assemblies result in accurate variant calls (Supplementary Notes and Supplementary Figs. 5–9). We also assessed the contiguity of the underlying assemblies using a recently developed tool, GAVISUNK, which compares unique k-mer distributions between HiFi-based assemblies and orthogonal Oxford Nanopore Technologies sequencing data from the same samples. We found that, on average, only 0.11% of assayable SD sequence was in error compared to 0.14% of unique regions assayed (Supplementary Table 2), implying high and comparable assembly contiguity. As a final control for potential haplotype-phasing errors introduced by trio HiFi assembly of diploid samples, we generated deep Oxford Nanopore Technologies and HiFi data from a second complete hydatidiform mole (CHM1) for which a single paternal haplotype was present and applied a different assembly algorithm³² (Verkko 1.0; Extended Data Fig. 2). We show across our many analyses that the results from the CHM1 Verkko assembly are consistent with individual haplotypes obtained from diploid HPRC samples produced by trio hifiasm^3,32 (Supplementary Fig. 10). We therefore conclude that phasing errors have, at most, a negligible effect on our results and that most (>98%) SDs analysed were accurately assembled from multiple human genomes allowing the pattern of SNV diversity in SDs to be systematically interrogated.
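The link between a quality value and an error rate can be made explicit with a short calculation. The sketch below assumes the quality values are Phred-scaled (QV = −10 log₁₀ of the per-base error probability), which is the usual convention for assembly quality but is not spelled out in the text; the function names are illustrative only.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability.

    Assumption: QV = -10 * log10(error probability).
    """
    return 10 ** (-qv / 10)


def bases_per_error(qv: float) -> float:
    """Expected number of bases between consecutive errors at a given QV."""
    return 1 / qv_to_error_rate(qv)


# Average quality values reported in the text for unique and SD regions.
for label, qv in [("unique", 59), ("SD", 53)]:
    print(f"{label}: QV {qv} -> about one error per {bases_per_error(qv):,.0f} bp")
```

Under this assumption, a quality value of 53 corresponds to roughly one error per 200 kbp, in line with the figure quoted for SDs.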

Increased SNV density in SD regions

To assess SNVs, we limited our analysis to portions of the genome where a 1:1 orthologous relationship could be unambiguously assigned (as opposed to regions with extensive copy number variation). Using the T2T-CHM13 reference genome, we aligned the HPRC haplotypes requiring alignments to be a minimum of 1 Mbp in length and carry no structural variation events greater than 10 kbp (Methods and Extended Data Fig. 1). Although the proportion of haplotypes compared for any locus varied (Fig. 1a), the procedure allowed us to establish, on average, 120.2 Mbp 1:1 fully aligned sequence per genome for SD regions out of a total of 217 Mbp from the finished human genome (T2T-CHM13 v1.1). We repeated the analysis for ‘unique’ (or single-copy) regions of the genome and recovered by comparison 2,508 Mbp as 1:1 alignments (Fig. 1a). All downstream analyses were then carried out using this orthologous alignment set. We first compared the SNV diversity between unique and duplicated regions excluding suboptimal alignments mapping to tandem repeats or homopolymer stretches. Overall, we observe a significant 60% increase in SNVs in SD regions (Methods; Pearson’s chi-squared test with Yates’s continuity correction P < 2.2 × 10⁻¹⁶; Fig. 1b). Specifically, we observe an average of 15.3 SNVs per 10 kbp versus 9.57 SNVs per 10 kbp for unique sequences (Fig. 1d). An empirical cumulative distribution comparing the number of SNVs in 10-kbp windows between SD and unique sequence confirms that this is a general property and not driven simply by outliers. The empirical cumulative distribution shows that more than half of the SD sequences have more SNVs than their unique counterparts (Fig. 1b). Moreover, for all haplotypes we divided the unique portions of the genome into 125-Mbp bins and found that all SD bins of equivalent size have more SNVs than any of the bins of unique sequence (empirical P value < 0.0005; Extended Data Fig. 3). 
This elevation in SNVs is only modestly affected by the sequence identity of the underlying SDs (Pearson’s correlation of only 0.008; Supplementary Fig. 11). The increase in SNVs (60%) in SDs is greater than that in all other assayable classes of repeats: Alu (23%), L1 (−9.4%), human endogenous retroviruses (−9.4%) and ancient SDs for which the divergence is greater than 10% (12%) (Extended Data Fig. 4 and Supplementary Table 3). We find, however, that SNV density correlates with increasing GC content (Supplementary Fig. 12) consistent with Alu repeats representing the only other class of common repeat to show an elevation.

**Fig. 1: Increased single-nucleotide variation in SDs.**

Previous publications have shown that African haplotypes are genetically more diverse, having on average about 20% more variant sites compared to non-African haplotypes⁸. To confirm this observation in our data, we examined the number of SNVs per 10 kbp of unique sequence in African versus non-African haplotypes (Fig. 1c,d) and observed a 27% (10.8 versus 8.5) excess in African haplotypes. As a result, among African haplotypes, we see that the average distance between SNVs (979 bp) is 19.4% closer than in non-African haplotypes (1,215 bp), as expected^8,12. African genomes also show increased variation in SDs, but it is less pronounced with an average distance of 784 bases between consecutive SNVs as compared to 909 bases in non-African haplotypes (13.8%). Although elevated in African haplotypes, SNV density is higher in SD sequence across populations and these properties are not driven by a few sites but, once again, are a genome-wide feature. We put forward three possible hypotheses to account for this increase although note these are not mutually exclusive: SDs have unique mutational mechanisms that increase SNVs; SDs have a deeper average coalescence than unique parts of the genome; and differences in sequence composition (for example, GC richness) make SDs more prone to particular classes of mutation.

Putative IGC

One possible explanation for increased diversity in SDs is IGC in which sequence that is orthologous by position no longer shares an evolutionary history because a paralogue from a different location has ‘donated’ its sequence through ectopic template-driven conversion³³, also known as nonallelic gene conversion²⁷. To identify regions of IGC, we developed a method that compares two independent alignment strategies to pinpoint regions where the orthologous alignment of an SD sequence is inferior to an independent alignment of the sequence without flanking information (Fig. 2a and Methods). We note several limitations of our approach (Supplementary Notes); however, we show that our high-confidence IGC calls (20+ supporting SNVs) have strong overlap with other methods for identifying IGC (Supplementary Notes and Supplementary Fig. 13). Using this approach, we created a genome-wide map of putative large IGC events for all of the HPRC haplotypes for which 1:1 orthologous relationships could be established (Fig. 2).

Across all 102 haplotypes, we observe 121,631 putative IGC events for an average of 1,192 events per human haplotype (Fig. 2b,c and Supplementary Table 4). Of these events, 17,949 are rare and restricted to a single haplotype (singletons) whereas the remaining events are observed in several human haplotypes grouping into 14,663 distinct events (50% reciprocal overlap at both the donor and acceptor site). In total, we estimate that there is evidence for 32,612 different putative IGC events (Supplementary Table 5) among the SD regions that are assessed at present. Considering the redundant IGC callset (n = 121,631), the average IGC length observed in our data is 6.26 kbp with the largest event observed being 504 kbp (Extended Data Fig. 5). On average, each IGC event has 13.3 SNVs that support the conversion event and 2.03 supporting SNVs per kilobase pair, and as expected, there is strong correlation (Pearson’s R = 0.63, P < 2.2 × 10⁻¹⁶; Fig. 2d) between the length of the events and supporting SNVs. Furthermore, we validated these supporting SNVs against Illumina sequencing data and find that on average only 1% (12/1,192) of IGC events contain even one erroneous SNV (Supplementary Fig. 4). The putative IGC events detected with our method are largely restricted to higher identity duplications with only 325 events detected in 66.1 Mbp of SDs with >10% sequence divergence (Supplementary Figs. 14 and 15). We further stratify these results by callset, minimum number of supporting SNVs and haplotype (Supplementary Table 6). Finally, we use the number of supporting informative SNVs to estimate the statistical confidence of every putative IGC call (Fig. 2c, Supplementary Table 7 and Methods). Using these P values, we identify a subset of the high-confidence (P value < 0.05) IGC calls with 31,910 IGC events and 10,102 nonredundant events.
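The 50% reciprocal-overlap criterion used to collapse redundant calls can be illustrated as follows. This is a sketch under stated assumptions: intervals are half-open, and the greedy grouping against the first member of each group is a simplification chosen for brevity, not necessarily the procedure used to build the callset. The type and function names are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


class IGCEvent(NamedTuple):
    donor: Interval
    acceptor: Interval


def reciprocal_overlap(a: Interval, b: Interval, min_frac: float = 0.5) -> bool:
    """True if the overlap covers at least `min_frac` of both intervals."""
    if a.chrom != b.chrom:
        return False
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return False
    return (overlap >= min_frac * (a.end - a.start)
            and overlap >= min_frac * (b.end - b.start))


def same_event(x: IGCEvent, y: IGCEvent, min_frac: float = 0.5) -> bool:
    """Two calls match only if donor AND acceptor sites both overlap reciprocally."""
    return (reciprocal_overlap(x.donor, y.donor, min_frac)
            and reciprocal_overlap(x.acceptor, y.acceptor, min_frac))


def group_events(events: list, min_frac: float = 0.5) -> list:
    """Greedy grouping: each call joins the first group whose representative it matches."""
    groups = []
    for event in events:
        for group in groups:
            if same_event(group[0], event, min_frac):
                group.append(event)
                break
        else:
            groups.append([event])
    return groups
```

In this scheme a group of size one corresponds to a singleton, and the number of groups corresponds to the nonredundant event count.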

On average, we identify 7.5 Mbp of sequence per haplotype affected by putative IGC and 4.3 Mbp in our high-confidence callset (Fig. 2b). Overall, 33.8% (60.77/180.0 Mbp) of the analysed SD sequence is affected by putative IGC in at least one human haplotype. Furthermore, among all SDs covered by at least 20 assembled haplotypes, we identify 498 acceptor and 454 donor IGC hotspots with at least 20 distinct IGC events (Fig. 3 and Supplementary Table 8). IGC hotspots are more likely to associate with higher copy number SDs compared to a random sample of SD windows of equal size (median of 9 overlaps compared to 3, one-sided Wilcoxon rank sum test P < 2.2 × 10⁻¹⁶) and regions with more IGC events are moderately correlated with the copy number of the SD (Pearson’s R = 0.23, P < 2.2 × 10⁻¹⁶; Supplementary Fig. 16). IGC hotspots also preferentially overlap higher identity duplications (median 99.4%) compared to randomly sampled windows (median 98.0%, one-sided Wilcoxon rank sum test P < 2.2 × 10⁻¹⁶).

These events intersect 1,179 protein-coding genes, and of these genes, 799 have at least one coding exon affected by IGC (Supplementary Tables 9 and 10). As a measure of functional constraint, we used the probability of being loss-of-function intolerant (pLI) for each of the 799 genes³⁴ (Fig. 4a). Among these, 314 (39.3%) have never been assessed for mutation intolerance (that is, no pLI) owing to the limitations of mapping short-read data from population samples³⁴. Of the remaining genes, we identify 38 with a pLI greater than 0.5, including genes associated with disease (F8, HBG1 and C4B) and human evolution (NOTCH2 and TCAF). Of the genes with high pLI scores, 12 are the acceptor site for at least 50 IGC events, including C4B, NOTCH2 and OPN1LW, a locus for red–green colour blindness (Fig. 4b–e). We identify a subset of 418 nonredundant IGC events that are predicted to copy the entirety of a gene body to a ‘new location’ in the genome (Fig. 4f,g). As a result, 171 different protein-coding genes with at least 2 exons and 200 coding base pairs are converted in their entirety by putative IGC events in a subset of human haplotypes (Supplementary Table 11), and we refer to this phenomenon as gene repositioning. These gene-repositioning events are large (average 26 kbp; median 16.7 kbp) and supported by a high number of SNVs (average 64.7; median 15.3 SNVs), suggesting that they are unlikely to be mapping artefacts. Markedly, these putative IGC events copy the reference gene model on average a distance of 1.66 Mbp (median 216 kbp) from its original location. These include several disease-associated genes (for example, TAOK2, C4A, C4B, PDPK1 and IL27) as well as genes that have eluded complete characterization owing to their duplicative nature^35,36,37.

**Fig. 4: Protein-coding genes affected by IGC.**

Evolutionary age of SDs

Our analysis suggests that putative IGC contributes modestly to the significant increase of human SNV diversity in SDs. For example, if we apply the least conservative definition of IGC (1 supporting SNV) and exclude all putative IGC events from the human haplotypes, we estimate that it accounts for only 23% of the increase (Extended Data Fig. 6). If we restrict to higher confidence IGC events (P < 0.05), only 19.6% of the increase could be accounted for. An alternative explanation may be that the SDs are evolutionarily older, perhaps owing to reduced selective constraint on duplicated copies^38,39. To test whether SD sequences seem to have a deeper average coalescence than unique regions, we constructed a high-quality, locally phased assembly (hifiasm v0.15.2) of a chimpanzee (Pan troglodytes) genome to calibrate age since the time of divergence and to distinguish ancestral versus derived alleles in human SD regions (Methods). Constraining our analysis to syntenic regions between human and chimpanzee genomes (Methods), we characterized 4,316 SD regions (10 kbp in size) where we had variant calls from at least 50 human and one chimpanzee haplotype. We selected at random 9,247 analogous windows from unique regions for comparison. We constructed a multiple sequence alignment for each window and estimated the time to the most recent common ancestor (TMRCA) for each 10-kbp window independently. We infer that SDs are significantly older than the corresponding unique regions of similar size (Supplementary Figs. 17 and 18; one-sided Wilcoxon rank sum test P value = 4.3 × 10⁻¹⁴), assuming that mutation rates have remained constant over time within these regions since the human–chimpanzee divergence. The TMRCAs inferred from SD regions are, on average, 22% more ancient when compared to unique regions (650 versus 530 thousand years ago (ka)), but only a 5% difference is noted when comparing the median (520 versus 490 ka). 
However, this effect all but disappears (only a 0.2% increase) after excluding windows classified as IGC (Supplementary Fig. 19; one-sided Wilcoxon rank sum test P = 0.05; mean TMRCA_unique = 528 ka, mean TMRCA_SD = 581 ka, median TMRCA_unique = 495 ka, median TMRCA_SD = 496 ka).

SNV mutational spectra in SDs

As a third possibility, we considered potential differences in the sequence context of unique and duplicated DNA. It has been recognized for almost two decades that human SDs are particularly biased towards Alu repeats and GC-rich DNA of the human genome^16,40. Notably, among the SNVs in SDs, we observed a significant excess of transversions (transition/transversion ratio (Ti/Tv) = 1.78) when compared to unique sequence (Ti/Tv = 2.06; P < 2.2 × 10⁻¹⁶, Pearson’s chi-squared test with Yates’s continuity correction). Increased mutability of GC-rich DNA is expected and may explain, in part, the increased variation in SDs and transversion bias^6,27,41. Using a more complete genome, we compared the GC composition of unique and duplicated DNA specifically for the regions considered in this analysis. We find that, on average, 42.4% of the analysed SD regions are guanine or cytosine (43.0% across all SDs) when compared to 40.8% of the unique DNA (P value < 2.2 × 10⁻¹⁶, one-sided t-test). Notably, this enrichment drops slightly (41.8%) if we exclude IGC regions. Consequently, we observe an increase of all GC-containing triplets in SD sequences compared to unique regions of the genome (Fig. 5a). Furthermore, the enrichment levels of particular triplet contexts in SD sequence correlate with the mutability of the same triplet sequence in unique regions of the genome (Pearson’s R = 0.77, P = 2.4 × 10⁻⁷; Fig. 5b). This effect is primarily driven by CpG-containing triplets, which are enriched between 14 and 30% in SD sequences. Note, we observe a weaker and insignificant correlation for the non-CpG-containing triplets (Pearson’s R = 0.22, P = 0.27). Extrapolating from the mutational frequencies seen in unique sequences, we estimate that there is 3.21% more variation with SDs due to their sequence composition alone.
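The two summary statistics used in this comparison, the Ti/Tv ratio and GC content, are simple to compute. The sketch below assumes SNVs are supplied as (reference, alternate) base pairs with the two bases different; the function names are illustrative.

```python
PURINES = frozenset("AG")


def is_transition(ref: str, alt: str) -> bool:
    """A<->G and C<->T are transitions; every other base change is a transversion."""
    return ref != alt and ((ref in PURINES) == (alt in PURINES))


def ti_tv(snvs: list) -> float:
    """Transition/transversion ratio for a list of (ref, alt) pairs.

    Raises ZeroDivisionError if the list contains no transversions.
    """
    transitions = sum(is_transition(ref, alt) for ref, alt in snvs)
    transversions = len(snvs) - transitions
    return transitions / transversions


def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called
```

A lower Ti/Tv, as reported for SDs (1.78 versus 2.06), means proportionally more transversions among the observed SNVs.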

**Fig. 5: Sequence composition and mutational spectra of SD SNVs.**

To further investigate the changes in GC content and their effect on variation in SDs, we compared the triplet mutational spectra of SNVs from unique and duplicated regions of the genome to determine whether the predominant modes of SNV mutation differed (Methods). We considered all possible triplet changes, first quantifying the number of ancestral GC bases and triplets in SDs (Fig. 5a). A principal component analysis (PCA) of these normalized mutational spectra shows clear discrimination (Fig. 5c) between unique and SD regions (PC1) beyond that of African and non-African diversity, with the first principal component capturing 80.2% of the variation separating the mutational spectrum of SDs and unique DNA. We observe several differences when comparing the triplet-normalized mutation frequency of particular mutational events in SD and unique sequences (Fig. 5d). Most notable is a 7.6% reduction in CpG transition mutations—the most predominant mode of mutation in unique regions of the genome due to spontaneous deamination of methylated CpGs⁶ (Supplementary Tables 12 and 13).
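A triplet-normalised mutation frequency divides the number of SNVs observed in each triplet context by the number of times that triplet occurs in the ancestral sequence. The sketch below illustrates the idea only; it does not collapse reverse-complement contexts and takes SNVs as hypothetical (position, derived base) pairs, both of which are simplifying assumptions rather than details taken from the Methods.

```python
from collections import Counter

BASES = set("ACGT")


def triplet_counts(seq: str) -> Counter:
    """Count every overlapping triplet made only of A, C, G and T."""
    seq = seq.upper()
    triplets = (seq[i - 1:i + 2] for i in range(1, len(seq) - 1))
    return Counter(t for t in triplets if set(t) <= BASES)


def normalized_spectrum(ancestral: str, snvs: list) -> dict:
    """Mutation frequency per (triplet context, derived base).

    `snvs` holds (0-based position, derived base) pairs against `ancestral`.
    Each count is divided by how often the triplet occurs in `ancestral`.
    """
    ancestral = ancestral.upper()
    background = triplet_counts(ancestral)
    observed = Counter()
    for pos, derived in snvs:
        if 0 < pos < len(ancestral) - 1:
            context = ancestral[pos - 1:pos + 2]
            if context in background:
                observed[(context, derived)] += 1
    return {key: n / background[key[0]] for key, n in observed.items()}
```

Spectra normalised this way can be compared between SD and unique sequence without being confounded by the different triplet composition of the two compartments.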

The most notable changes in mutational spectra in SD sequences are a 27.1% increase in C>G mutations, a 15.3% increase in C>A mutations and a 10.5% increase in A>C mutations. C>G mutations are associated with double-strand breaks in humans and some other apes^42,43. This effect becomes more pronounced (+40.4%) in our candidate IGC regions consistent with previous observations showing increases in C>G mutations in regions of non-crossover gene conversion and double-strand breaks^43,44,45. However, the increase remains in SD regions without IGC (+20.0%) perhaps owing to extensive nonallelic homologous recombination associated with SDs or undetected IGC events^4,9.

To further investigate the potential effect of GC-biased gene conversion (gBGC) on the mutational spectra in SDs, we measured the frequency of (A,T)>(C,G) mutations in SD regions with evidence of IGC to determine whether cytosine and guanine bases are being preferentially maintained as might be expected in regions undergoing gBGC. If we measure the frequency of (A,T)>(C,G) in windows with at least one haplotype showing evidence of IGC, then we observe that the frequency is 4.7% higher than in unique regions of the genome; notably, in SDs without IGC, this rate is reduced compared to that of unique sequence (−3.5%). Additionally, there is a 5.8% reduction in (G,C)>(A,T) mutations consistent with IGC preferentially restoring CG bases that have mutated to AT bases through gBGC. These results indicate that gBGC between paralogous sequences may be a strong factor in shaping the mutational landscape of SDs. Although the (A,T)>(C,G) frequency is comparable in SD regions not affected by IGC, the mutational landscape at large is still very distinct between SDs and unique parts of the genome. In PCA of the mutational spectra in SDs without IGC, the first principal component distinguishing the mutational spectrum of SDs and unique DNA captures a larger fraction of the variation (94.6%) than in the PCA including IGC sites (80.2%; Supplementary Fig. 20).
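The (A,T)>(C,G) and (G,C)>(A,T) classes discussed here are often called weak-to-strong and strong-to-weak mutations. A minimal sketch of how SNVs could be binned into these classes is shown below; it assumes each SNV is polarised as an (ancestral, derived) pair, and the labels and function names are illustrative.

```python
from collections import Counter

WEAK = frozenset("AT")    # bases forming two hydrogen bonds
STRONG = frozenset("GC")  # bases forming three hydrogen bonds


def gbgc_class(ancestral: str, derived: str) -> str:
    """Classify a polarised SNV by its effect on GC content."""
    if ancestral in WEAK and derived in STRONG:
        return "W>S"    # (A,T)>(C,G): gains a G or C
    if ancestral in STRONG and derived in WEAK:
        return "S>W"    # (G,C)>(A,T): loses a G or C
    return "other"      # A<->T or C<->G: GC content unchanged


def class_fractions(snvs: list) -> dict:
    """Fraction of SNVs falling in each class."""
    counts = Counter(gbgc_class(anc, der) for anc, der in snvs)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("W>S", "S>W", "other")}
```

Comparing these fractions between windows with and without IGC is one way to look for the GC-favouring signature described above.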

Modelling of elevated SNV frequency

To model the combined effect of unique mutational properties, evolutionary age and sequence content on the frequency of SNVs, we developed a multivariable linear regression using copy number, SD identity, number of unique IGC events, GC content and TMRCA to predict the number of SNVs seen in a 10-kbp window. A linear model containing all pairwise interactions of these predictors was able to explain 10.5% of the variation in SNVs per 10 kbp (adjusted R²), whereas a model containing only the number of IGC events explained only 1.8% of the variation. We note that this measure of variance is related but not directly comparable to the finding that the elevation in the number of SNVs is reduced by 23% when excluding IGC regions. All of the random variables, including their pairwise interactions, were significant (P value < 0.05) predictors of SNVs per 10 kbp except the interaction of number of IGC events with GC content, copy number and TMRCA. The strongest single predictors were the number of unique IGC events and the divergence of the overlapping SD (Supplementary Table 14).
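A model with all pairwise interactions of five predictors has an intercept, five main effects and ten interaction terms. The sketch below only builds the corresponding design-matrix row for one 10-kbp window and leaves the fitting to any least-squares routine; the predictor names and the example window are hypothetical placeholders, not values from the study.

```python
from itertools import combinations

# Hypothetical column names for the five predictors named in the text.
PREDICTORS = ("copy_number", "sd_identity", "n_igc_events", "gc_content", "tmrca")


def column_names() -> list:
    """Intercept, main effects, then every pairwise interaction."""
    pairs = [f"{a}:{b}" for a, b in combinations(PREDICTORS, 2)]
    return ["intercept", *PREDICTORS, *pairs]


def design_row(window: dict) -> list:
    """One row of the design matrix for a single 10-kbp window."""
    main = [window[name] for name in PREDICTORS]
    interactions = [a * b for a, b in combinations(main, 2)]
    return [1.0, *main, *interactions]


# Made-up window, for illustration only.
example = {"copy_number": 2, "sd_identity": 0.5, "n_igc_events": 3,
           "gc_content": 0.4, "tmrca": 10}
print(dict(zip(column_names(), design_row(example))))
```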

Discussion

Since the first publications of the human genome^12,13, the pattern of single-nucleotide variation in recently duplicated sequence has been difficult to ascertain, leading to errors^2,11. Later, indirect approaches were used to infer true SNVs in SDs, but these were far from complete⁴⁰. More often than not, large-scale sequencing efforts simply excluded such regions in an effort to prevent paralogous sequence variants from contaminating single-nucleotide polymorphism databases and leading to false genetic associations^8,23. The use of phased genome assemblies as opposed to aligned sequence reads had the advantage of allowing us to establish 1:1 orthologous relationships as well as the ability to discern the effect of IGC while comparing the pattern of single-nucleotide variation for both duplicated and unique DNA within the same haplotypes. As a result, we identify over 1.99 million nonredundant SNVs in a gene-rich portion of the genome previously considered largely inaccessible.

SNV density is significantly elevated (60%) in duplicated DNA when compared to unique DNA consistent with suggestions from primate genome comparisons and more recent de novo mutation studies from long-read sequencing data^46,47,48. Furthermore, an increased de novo mutation rate in SDs could support our observation of an elevated SNV density without the need for an increase in TMRCA. We estimate that at least 23% of this increase is due to the action of IGC between paralogous sequences that essentially diversify allelic copies through concerted evolution. IGC in SDs seems to be more pervasive in the human genome compared to earlier estimates^15,27, which owing to mapping uncertainties or gaps could assay only a smaller subset of regions^15,27. We estimate more than 32,000 candidate regions (including 799 protein-coding genes) with the average human haplotype showing 1,192 events when compared to the reference. The putative IGC events are also much larger (mean 6.26 kbp) than those of most previous reports^28,49, with the top 10% of the size distribution >14.4 kbp in length. This has the net effect that entire genes are copied hundreds of kilobase pairs into a new genomic context when compared to the reference. The effect of such ‘repositioning events’ on gene regulation will be an interesting avenue of future research.

As for allelic gene conversion, our predicted nonallelic gene conversion events are abundant, cluster into larger regional hotspots and favour G and C mutations, although this last property is not restricted to IGC regions^45,50. Although we classify these regions as putative IGC events, other mutational processes such as deletion followed by duplicative transposition could, in principle, generate the same signal creating large tracts of ‘repositioned’ DNA. It should also be stressed that our method simply relies on the discovery of a closer match within the reference; by definition, this limits the detection of IGC events to regions where the donor sequence is already present in the reference as opposed to an alternative. Moreover, we interrogated only regions where 1:1 synteny could be unambiguously established. As more of the genome is assessed in the context of a pangenome reference framework, we anticipate that the proportion of IGC will increase, especially as large-copy-number polymorphic SDs, centromeres and acrocentric DNA become fully sequence resolved³. Although we estimate 4.3 Mbp of IGC in SDs on average per human haplotype, we caution that this almost certainly represents a lower bound and should not yet be regarded as a rate until more of the genome is surveyed and studies are carried out in the context of parent–child trios to observe germline events.

One of the most notable features of duplicated DNA is its higher GC content. In this study, we show that there is a clear skew in the mutational spectrum of SNVs to maintain this property of SDs beyond expectations from unique DNA. This property and the unexpected Ti/Tv ratio cannot be explained by lower accuracy of the assembly of SD regions. We find a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts. GC-rich DNA has long been regarded as hypermutable. For example, C>G mutations preferentially associate with double-strand breaks in humans and apes^42,43 and GC-rich regions in yeast show about 2–5 times more mutations depending on sequence context compared to AT-rich DNA⁴¹. Notably, in human SD regions, we observe a paucity of CpG transition mutations, characteristically associated with spontaneous deamination of CpG dinucleotides and concomitant transitions⁶. The basis for this is unclear, but it may be partially explained by the recent observation that duplicated genes show a greater degree of hypomethylation when compared to their unique counterparts¹⁰. We propose that excess of guanosine and cytosine transversions is a direct consequence of GC-biased gene conversion⁵ driven by an excess of double-strand breaks that result from a high rate of nonallelic homologous recombination events and other break-induced replication mechanisms among paralogous sequences.

Methods

Defining unique and SD regions

To define regions of SD, we used the annotations available for T2T-CHM13 v1.1 (ref. ¹⁰), which include all nonallelic intrachromosomal and interchromosomal pairwise alignments >1 kbp and with >90% sequence identity that do not consist entirely of common repeats or satellite sequences¹¹. To define unique regions, we found the coordinates in T2T-CHM13 that were not SDs, ancient SDs (<90% sequence identity), centromeres or satellite arrays⁵¹ and defined these areas to be the non-duplicated (unique) parts of the genome. For both SDs and unique regions, variants in tandem repeat elements as identified by Tandem Repeats Finder⁵² were excluded because many SNVs called in these regions are ultimately alignment artefacts. RepeatMasker v4.1.2 was used to annotate SNVs with additional repeat classes beyond SDs⁵³.

Copy number estimate validation

The goal of this analysis was to validate copy number from the assembled HPRC haplotypes compared to estimates from read-depth analysis of the same samples sequenced using Illumina whole-genome sequencing (WGS). Large, recently duplicated segments are prone to copy number variation and are also susceptible to collapse and misassembly owing to their repetitive nature. HPRC haplotypes were assembled using PacBio HiFi with hifiasm^3,54 creating contiguous long-read assemblies. We selected 19 SD loci corresponding to genes that were known to be duplicated and copy number variable in the human species. We k-merized the two haplotype assemblies corresponding to each locus for each individual into k-mers of 31 base pairs in length. We then computed copy number estimates over each locus for the sum of the two haplotype assemblies and calculated the difference from the estimate based on Illumina WGS from the same sample. For both datasets, we derived these estimates using FastCN, an algorithm implementing whole-genome shotgun sequence detection⁵⁵. When averaging across each region and comparing differences in assembly copy versus Illumina WGS copy estimate, we observe that 756 out of 893 tests were perfectly matched (δ = 0), suggesting that most of these assemblies correctly represent the underlying genomic sequence of the samples.
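As a minimal illustration of the k-merization and the δ comparison described above, the following Python sketch counts overlapping 31-mers and compares two copy number estimates. It is not FastCN and does not reproduce its depth model; the function names, the handling of N bases and the rounding of copy number to the nearest integer are assumptions made for this example only.

```python
from collections import Counter


def kmerize(sequence: str, k: int = 31) -> Counter:
    """Count every overlapping k-mer in a sequence.

    k-mers containing N are skipped (an assumption for this sketch).
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return counts


def copy_number_delta(assembly_cn: float, illumina_cn: float) -> int:
    """Difference between rounded copy number estimates.

    Rounding to integers is a hypothetical choice; delta == 0 is
    treated as a perfect match between assembly and read depth.
    """
    return round(assembly_cn) - round(illumina_cn)
```

Here `assembly_cn` would stand for the estimate summed over both haplotype assemblies of a sample and `illumina_cn` for the read-depth estimate over the same locus.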

Quality value estimations with Merqury

Estimates of the quality value of SD and unique regions were made using Merqury v1.1 and parental Illumina sequencing data⁵⁶. We first used Meryl to create k-mer databases (with a k-mer length of 21) using the parental sequencing data following the instructions in the Merqury documentation. Then Merqury was run with default parameters (merqury.sh {k-mer meryl database} {paternal sequence} {maternal sequence}) to generate quality value estimates for the hifiasm assemblies.

Haplotype integrity analysis using inter-SUNK approach

For the 35 HPRC assemblies with matched ultralong Oxford Nanopore Technologies (ONT) data, we applied GAVISUNK v1.0.0 as an orthogonal validation of HiFi assembly integrity⁵⁷. In brief, candidate haplotype-specific singly unique nucleotide k-mers (SUNKs) of length 20 are determined from the HiFi assembly and compared to ONT reads phased with parental Illumina data. Inter-SUNK distances are required to be consistent between the assembly and ONT reads, and regions that can be spanned and tiled with consistent ONT reads are considered validated. ONT read dropouts do not necessarily correspond to misassembly—they are also caused by large regions devoid of haplotype-specific SUNKs from recent duplications, homozygosity or over-assembly of the region, as well as Poisson dropout of read coverage.

Read-depth analysis using the HPRC unreliable callset

For the 94 assembled HPRC haplotypes, we downloaded the regions identified to have abnormal coverage from S3 (s3://human-pangenomics/submissions/e9ad8022-1b30-11ec-ab04-0a13c5208311--COVERAGE_ANALYSIS_Y1_GENBANK/FLAGGER/JAN_09_2022/FINAL_HIFI_BASED/FLAGGER_HIFI_ASM_SIMPLIFIED_BEDS/ALL/). We then intersected these regions with the callable SD regions in each assembly to determine the number of collapsed, falsely duplicated and low-coverage base pairs in each assembly. The unreliable regions were determined by the HPRC using Flagger v0.1 (https://github.com/mobinasri/flagger/)³.

Whole-genome alignments and synteny definition

Whole-genome alignments were calculated against T2T-CHM13 v1.1 with a copy of GRCh38 chrY using minimap2 v2.24 (ref. ⁵⁸) with the parameters -a -x asm20 --secondary=no -s 25000 -K 8G. The alignments were further processed with rustybam v0.1.29 (ref. ⁵⁹) using the subcommands trim-paf to remove redundant alignments in the query sequence and break-paf to split alignments on structural variants over 10 kbp. After these steps, the remaining alignments over 1 Mbp of continuously aligned sequence were defined to be syntenic. The software pipeline is available on GitHub at https://github.com/mrvollger/asm-to-reference-alignment/ (refs. ^{58,59,60,61,62,63,64,65,66,67}).
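The final synteny filter can be sketched in Python as follows, assuming the trimmed and split alignments are available as tab-separated PAF records. The function name is hypothetical, and measuring alignment length as the span on the reference (PAF columns 8 and 9) is an assumption; the published pipeline may measure it differently.

```python
def syntenic_records(paf_lines, min_aligned_bp=1_000_000):
    """Yield PAF records whose aligned span on the reference is long enough.

    paf_lines: iterable of tab-separated PAF lines (12 mandatory columns).
    min_aligned_bp: threshold for calling an alignment syntenic (1 Mbp here).
    """
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated records
        target_start, target_end = int(fields[7]), int(fields[8])
        if target_end - target_start >= min_aligned_bp:
            yield fields
```

Keeping the filter as a generator means a whole-genome PAF file can be streamed without loading it into memory.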

Estimating the diversity of SNVs in SDs and unique sequences

When enumerating the number of SNVs, we count all pairwise differences between the haplotypes and the reference, counting events observed in multiple haplotypes multiple times. Therefore, except when otherwise indicated, we are referring to the total number of pairwise differences rather than the total number of nonredundant SNVs (number of segregation sites). The software pipeline is available on GitHub at https://github.com/mrvollger/sd-divergence (refs. ^{60,61,62,63,65,66,68}).
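The distinction between the two counts can be made concrete with a short Python sketch. The data layout (a dictionary of per-haplotype variant sets) and the function name are assumptions chosen for illustration and are not taken from the sd-divergence pipeline.

```python
def count_snvs(haplotype_variants):
    """Return (pairwise differences, nonredundant SNVs).

    haplotype_variants: dict mapping a haplotype name to the set of
    (chrom, pos, ref, alt) differences between it and the reference.
    """
    # A variant seen in several haplotypes is counted once per haplotype.
    pairwise = sum(len(variants) for variants in haplotype_variants.values())
    # The same variant is counted only once across all haplotypes.
    nonredundant = len(set().union(*haplotype_variants.values()))
    return pairwise, nonredundant
```

For example, a variant carried by two of three haplotypes contributes two pairwise differences but only one nonredundant SNV.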

Defining IGC events

Each query haplotype genome sequence was aligned to the reference genome (T2T-CHM13 v1.1) using minimap2 v2.24 (ref. ⁵⁸) considering only those regions that align in a 1:1 fashion for >1 Mbp without any evidence of gaps or discontinuities greater than 10 kbp in size. This eliminates large forms of structural variation, including copy number variants or regions of large-scale inversion restricting the analysis to largely copy number invariant SD regions (about 120 Mbp) and flanking unique sequence. Once these syntenic alignments were defined, we carried out a second alignment fragmenting the 1:1 synteny blocks into 1-kbp windows (100-bp increments) and remapped back to T2T-CHM13 to identify each window’s single best alignment position. These second alignments were then compared to original syntenic ones and if they no longer overlapped, we considered them to be candidate IGC regions. Adjacent IGC windows were subsequently merged into larger intervals when windows continued to be mapped non-syntenically with respect to the original alignment. We then used the CIGAR string to identify the number of matching and mismatching bases at the ‘donor’ site and compared that to the number of matching and mismatching bases at the acceptor site determined by the syntenic alignment. A donor sequence is, thus, defined as a segment in T2T-CHM13 that now maps with higher sequence identity to a new location in the human haplotype (alignment method 2) and the acceptor sequence is the segment in T2T-CHM13 that has an orthologous mapping to the same region in the human haplotype (alignment method 1). As such, there is dependence on both the reference genome and the haplotype being compared. The software pipeline is available on GitHub at https://github.com/mrvollger/asm-to-reference-alignment/ (refs. ^{58,59,60,61,62,63,64,65,66,67}).
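The windowing, comparison and merging steps can be sketched in Python as below. This is a simplified illustration under stated assumptions: positions are half-open intervals on the reference, the best remapped position of each window is supplied by an external aligner, and the function names are hypothetical rather than those of the published pipeline.

```python
def sliding_windows(start, end, size=1000, step=100):
    """Return (start, end) windows of `size` bp every `step` bp within [start, end)."""
    return [(s, s + size) for s in range(start, end - size + 1, step)]


def is_igc_candidate(syntenic_hit, best_hit):
    """A window is a candidate when its best hit no longer overlaps the syntenic one.

    Each hit is a (chrom, start, end) tuple in reference coordinates.
    """
    s_chrom, s_start, s_end = syntenic_hit
    b_chrom, b_start, b_end = best_hit
    overlapping = s_chrom == b_chrom and s_start < b_end and b_start < s_end
    return not overlapping


def merge_candidates(windows):
    """Merge candidate windows (start, end) that overlap or touch into intervals."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With 1-kbp windows spaced 100 bp apart, consecutive candidate windows always overlap, so a run of non-syntenic windows collapses into a single interval.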

Assigning confidence to IGC events

To assign confidence measures to our IGC events, we adapted a previously described method⁶⁹ to calculate a P value for every one of our candidate IGC calls. Our method uses a cumulative binomial distribution constructed from the number of SNVs supporting the IGC event and the total number of informative sites between two paralogues to assign a one-sided P value to each event. Specifically:

$$P(X\le k)=B(k,n,p)$$

in which B is the binomial cumulative distribution, n is the number of informative sites between paralogues, k is the number of informative sites that agree with the non-converted sequence (acceptor site), and p is the probability that at an informative site the base matches the acceptor sequence. We assume p to be 0.5 reflecting that a supporting base change can come from one of two sources: the donor or acceptor paralogue. With these assumptions, our binomial model reports the probability that we observe k or fewer sites that support the acceptor site (that is, no IGC) at random given the data, giving us a one-sided P value for each IGC event. No adjustments were made for multiple comparisons.
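The one-sided P value defined above can be computed directly from the binomial cumulative distribution. The following Python sketch uses only the standard library; the function name is hypothetical and the default p = 0.5 follows the assumption stated in the text.

```python
from math import comb


def igc_p_value(k: int, n: int, p: float = 0.5) -> float:
    """One-sided P(X <= k) for X ~ Binomial(n, p).

    n: number of informative sites between the paralogues.
    k: number of informative sites that agree with the acceptor sequence.
    p: probability that an informative site matches the acceptor.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
```

For instance, if all 10 informative sites support the donor (k = 0, n = 10), the P value is 0.5 to the power of 10, or about 0.001.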

Testing for IGC in unique regions

To test the specificity of our method, we applied it to an equivalent total of unique sequence (125 Mbp) on each haplotype, which we expected to show no or low levels of IGC. On average, we identify only 33.5 IGC events affecting 38.2 kbp of sequence per haplotype. If we restrict this to high-confidence IGC events, we see only 5.93 events on average affecting 7.29 kbp. This implies that our method is detecting IGC above background in SDs and that the frequency of IGC in SDs is more than 50 times higher in the high-confidence callsets (31,910 versus 605).

Additional genome assemblies

We assembled HG00514, NA12878 and HG03125 using HiFi long-read data and hifiasm v0.15.2 with parental Illumina data⁵⁴. Using HiFi long-read data and hifiasm v0.15.2 we also assembled the genome of the now-deceased chimpanzee Clint (sample S006007). The assembly is locally phased as trio-binning and HiC data were unavailable. Data are available on the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProjects PRJNA551670 (ref. ⁴), PRJNA540705 (ref. ⁷⁰), PRJEB36100 (ref. ⁴) and PRJNA659034 (ref. ⁴⁷). These assemblies are made available on Zenodo (https://doi.org/10.5281/zenodo.6792653)⁷¹.

Determining the composition of triplet mutations in SD and unique sequences

The mutational spectra for unique and SD regions from each individual were computed using mutyper on the basis of derived SNVs polarized against the chimpanzee genome assembly described above^72,73,74. These spectra were normalized to the triplet content of the respective unique or SD regions by dividing the count of each triplet mutation type by the total count of each triplet context in the ancestral region and normalizing the number of counts in SD and unique sequences to be the same. For PCA, the data were further normalized using the centred log-ratio transformation, which is commonly used for compositional measurements⁷⁵. The code is available on GitHub at https://github.com/mrvollger/mutyper_workflow/ (refs. ^{61,62,63,65,72,76}).
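The normalization and the centred log-ratio transformation can be sketched in Python as follows. This is not the mutyper workflow itself: the function names and the `ACG>ATG` labelling of mutation types are assumptions, and rescaling each spectrum to sum to one is only one way of making the SD and unique totals equal.

```python
import math


def normalize_spectrum(mutation_counts, context_counts):
    """Divide each triplet mutation count by its ancestral triplet count,
    then rescale so that the spectrum sums to 1.

    mutation_counts: e.g. {"ACG>ATG": 10}; context_counts: e.g. {"ACG": 100}.
    """
    rates = {
        mutation: count / context_counts[mutation.split(">")[0]]
        for mutation, count in mutation_counts.items()
    }
    total = sum(rates.values())
    return {mutation: rate / total for mutation, rate in rates.items()}


def clr(proportions):
    """Centred log-ratio: log of each part minus the mean log of all parts.

    All proportions must be strictly positive.
    """
    logs = {key: math.log(value) for key, value in proportions.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {key: value - mean_log for key, value in logs.items()}
```

The transformed values sum to zero by construction, which removes the unit-sum constraint of compositional data before PCA is applied.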

Estimation of TMRCA

To estimate TMRCA for a locus of interest, we focus on orthologous sequences (10-kbp windows) identified in synteny among human and chimpanzee haplotypes. Under an assumption of infinite sites, the number of mutations $x_{i}$ between a human sequence and its most recent common ancestor is Poisson distributed with a mean of $\mu \times T$, in which $\mu$ is the mutation rate scaled with respect to the substitutions between human and chimpanzee lineages, and T is the TMRCA. That is, $T=\sum_{i=1}^{n}x_{i}/(n\mu)$, in which n is the number of human haplotypes. To convert TMRCA to time in years, we assume six million years of divergence between human and chimpanzee lineages. We note that the TMRCA estimates reported in the present study account for mutation variation across loci (that is, if the mutation rate is elevated for a locus, the effect would be accounted for). Thus, for each individual locus, an independent (not uniform) mutation rate is applied depending on the observed pattern of mutations compared to the chimpanzee outgroup.
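The estimator for a single locus, the mean number of mutations per haplotype divided by the locus-specific mutation rate, can be written as a short Python sketch. The function name is hypothetical, and the mutation rate is taken as an input because its scaling against human–chimpanzee substitutions is not reproduced here.

```python
def estimate_tmrca(mutation_counts, mu):
    """Estimate T = sum(x_i) / (n * mu) for one locus.

    mutation_counts: x_i, the number of mutations between each of the n human
        haplotypes and their most recent common ancestor at this locus.
    mu: locus-specific mutation rate, in mutations per unit time for the window.
    """
    n = len(mutation_counts)
    if n == 0 or mu <= 0:
        raise ValueError("need at least one haplotype and a positive rate")
    return sum(mutation_counts) / (n * mu)
```

If `mu` is expressed per year for the window, the returned value is in years; for example, counts of 10, 12 and 8 mutations with a rate of 1e-5 per year give an estimate of about one million years.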

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

PacBio HiFi and ONT data have been deposited into NCBI SRA under the following BioProject IDs: PRJNA850430, PRJNA731524, PRJNA551670, PRJNA540705 and PRJEB36100. PacBio HiFi data for CHM1 are available under the following SRA accessions: SRX10759865 and SRX10759866. Sequencing data for Clint PTR are available on NCBI SRA under the BioProject PRJNA659034. The T2T-CHM13 v1.1 assembly can be found on NCBI (GCA_009914755.3). Cell lines obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research are listed in Supplementary Table 1. Assemblies of HPRC samples are available on NCBI under the BioProject PRJNA730822. All additional assemblies used in this work (Clint PTR, CHM1, HG00514, NA12878 and HG03125), variant calls, assembly alignments, and other annotation data used in analysis are available on Zenodo (https://doi.org/10.5281/zenodo.6792653)⁷¹.

Code availability

The software pipeline for aligning assemblies and calling IGC is available on GitHub (https://github.com/mrvollger/asm-to-reference-alignment, v0.1) and Zenodo (https://zenodo.org/record/7653446)⁶⁷. Code for analysing variants called against T2T-CHM13 v1.1 is available on GitHub (https://github.com/mrvollger/sd-divergence, v0.1) and Zenodo (https://zenodo.org/record/7653464)⁶⁸. The software pipeline for analysing the triplet context of SNVs is available on GitHub (https://github.com/mrvollger/mutyper_workflow, v0.1) and Zenodo (https://zenodo.org/record/7653472)⁷⁶. Scripts for figure and table generation are available on GitHub (https://github.com/mrvollger/sd-divergence-and-igc-figures, v0.1) and Zenodo (https://zenodo.org/record/7653486)⁷⁷. GAVISUNK is available on GitHub (https://github.com/pdishuck/GAVISUNK) and Zenodo (https://zenodo.org/record/7655335)⁵⁷.

References

1. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005–1017 (2001).
2. Fredman, D. et al. Complex SNP-related sequence variation in segmental genome duplications. Nat. Genet. 36, 861–866 (2004).
3. Liao, W.-W. et al. A draft human pangenome reference. Nature, https://doi.org/10.1038/s41586-023-05896-x (2023).
4. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, eabf7117 (2021).
5. Duret, L. & Galtier, N. Biased gene conversion and the evolution of mammalian genomic landscapes. Annu. Rev. Genomics Hum. Genet. 10, 285–311 (2009).
6. Duncan, B. K. & Miller, J. H. Mutagenic deamination of cytosine residues in DNA. Nature 287, 560–561 (1980).
7. International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
8. 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
9. Sudmant, P. H. et al. Diversity of human copy number. Science 11184, 2–7 (2010).
10. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).
11. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003–1007 (2002).
12. IHGSC. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
13. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
14. Sharp, A. J. et al. Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 77, 78–88 (2005).
15. Dumont, B. L. Interlocus gene conversion explains at least 2.7% of single nucleotide variants in human segmental duplications. BMC Genomics 16, 456 (2015).
16. Bailey, J. A., Liu, G. & Eichler, E. E. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73, 823–834 (2003).
17. Jiang, Z. et al. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39, 1361–1368 (2007).
18. Nuttle, X. et al. Emergence of a Homo sapiens-specific gene family and chromosome 16p11.2 CNV susceptibility. Nature 536, 205–209 (2016).
19. Dougherty, M. L. et al. Transcriptional fates of human-specific segmental duplications in brain. Genome Res. 28, 1566–1576 (2018).
20. Fiddes, I. T. et al. Human-specific NOTCH2NL genes affect notch signaling and cortical neurogenesis. Cell 173, 1356–1369 (2018).
21. Ju, X.-C. et al. The hominoid-specific gene TBC1D3 promotes generation of basal neural progenitors and induces cortical folding in mice. eLife 5, e18197 (2016).
22. Amemiya, H. M., Kundaje, A. & Boyle, A. P. The ENCODE blacklist: identification of problematic regions of the genome. Sci. Rep. 9, 9354 (2019).
23. Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37, 561–566 (2019).
24. Teshima, K. M. & Innan, H. The coalescent with selection on copy number variants. Genetics 190, 1077–1086 (2012).
25. Innan, H. The coalescent and infinite-site model of a small multigene family. Genetics 163, 803–810 (2003).
26. Hartasánchez, D. A., Vallès-Codina, O., Brasó-Vives, M. & Navarro, A. Interplay of interlocus gene conversion and crossover in segmental duplications under a neutral scenario. G3 Genes Genomes Genet. 4, 1479–1489 (2014).
27. Harpak, A., Lan, X., Gao, Z. & Pritchard, J. K. Frequent nonallelic gene conversion on the human lineage and its effect on the divergence of gene duplicates. Proc. Natl Acad. Sci. USA 114, 201708151 (2017).
28. Mansai, S. P., Kado, T. & Innan, H. The rate and tract length of gene conversion between duplicated genes. Genes 2, 313–331 (2011).
29. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
30. Jarvis, E. D. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
31. Porubsky, D. et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. https://doi.org/10.1101/gr.277334.122 (2023).
32. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01662-6 (2023).
33. Bosch, E., Hurles, M. E., Navarro, A. & Jobling, M. A. Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14, 835–844 (2004).
34. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
35. Richter, M. et al. Altered TAOK2 activity causes autism-related neurodevelopmental and cognitive abnormalities through RhoA signaling. Mol. Psychiatry 24, 1329–1350 (2019).
36. Sekar, A. et al. Schizophrenia risk from complex variation of complement component 4. Nature 530, 177–183 (2016).
37. Pietri, M. et al. PDK1 decreases TACE-mediated α-secretase activity and promotes disease progression in prion and Alzheimer’s diseases. Nat. Med. 19, 1124–1131 (2013).
38. Force, A. et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151, 1531–1545 (1999).
39. Conant, G. C. & Wagner, A. Asymmetric sequence divergence of duplicate genes. Genome Res. 13, 2052–2058 (2003).
40. Nakken, S., Rødland, E. A., Rognes, T. & Hovig, E. Large-scale inference of the point mutational spectrum in human segmental duplications. BMC Genomics 10, 43 (2009).
41. Kiktev, D. A., Sheng, Z., Lobachev, K. S. & Petes, T. D. GC content elevates mutation and recombination rates in the yeast Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA 115, E7109–E7118 (2018).
42. Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).
43. Gao, Z. et al. Overlooked roles of DNA damage and maternal age in generating human germline mutations. Proc. Natl Acad. Sci. USA 116, 9491–9500 (2019).
44. Elliott, B., Richardson, C., Winderbaum, J., Nickoloff, J. A. & Jasin, M. Gene conversion tracts from double-strand break repair in mammalian cells. Mol. Cell. Biol. 18, 93–101 (1998).
45. Williams, A. L. et al. Non-crossover gene conversions show strong GC bias and unexpected clustering in humans. eLife 4, e04637 (2015).
46. Liu, G. et al. Analysis of primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res. 13, 358–368 (2003).
47. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).
48. Noyes, M. D. et al. Familial long-read sequencing increases yield of de novo mutations. Am. J. Hum. Genet. 109, 631–646 (2022).
49. Ji, X. & Thorne, J. L. A phylogenetic approach disentangles interlocus gene conversion tract length and initiation rate. Preprint at https://arxiv.org/abs/1908.08608 (2019).
50. Narasimhan, V. M. et al. Estimating the human mutation rate from autozygous segments reveals population differences in human mutational processes. Nat. Commun. 8, 303 (2017).
51. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).
52. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
53. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0, http://www.repeatmasker.org (2013–2015).
54. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
55. Pendleton, A. L. et al. Comparison of village dog and wolf genomes highlights the role of the neural crest in dog domestication. BMC Biol. 16, 64 (2018).
56. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
57. Dishuck, P. C., Rozanski, A. N., Logsdon, G. A., Porubsky, D. & Eichler, E. E. GAVISUNK: genome assembly validation via inter-SUNK distances in Oxford Nanopore reads. Bioinformatics 39, btac714 (2022).
58. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
59. Vollger, M. R. mrvollger/rustybam: v0.1.29. Zenodo, https://doi.org/10.5281/ZENODO.6342176 (2022).
60. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
61. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
62. Bonfield, J. K. et al. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience 10, giab007 (2021).
63. Mölder, F. et al. Sustainable data analysis with Snakemake. F1000Res. 10, 33 (2021).
64. pysam: a Python module for reading and manipulating SAM/BAM/VCF/BCF files. GitHub, https://github.com/pysam-developers/pysam (2021).
65. Quinlan, A. R. BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics 47, 11.12.1-34 (2014).
66. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
67. Vollger, M. R. mrvollger/asm-to-reference-alignment: v0.1. Zenodo, https://doi.org/10.5281/ZENODO.7653446 (2023).
68. Vollger, M. R. mrvollger/sd-divergence: v0.1. Zenodo, https://doi.org/10.5281/ZENODO.7653464 (2023).
69. Carey, K. M., Patterson, G. & Wheeler, T. J. Transposable element subfamily annotation has a reproducibility problem. Mob. DNA 12, 4 (2021).
70. Porubsky, D. et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat. Biotechnol. 39, 302–308 (2021).
71. Vollger, M. Supplementary data for: Increased mutation and gene conversion within human segmental duplications. Zenodo, https://doi.org/10.5281/zenodo.7651064 (2023).
72. DeWitt, W. S. mutyper: assigning and summarizing mutation types for analyzing germline mutation spectra. Preprint at https://doi.org/10.1101/2020.07.01.183392 (2020).
73. Carlson, J., DeWitt, W. S. & Harris, K. Inferring evolutionary dynamics of mutation rates through the lens of mutation spectrum variation. Curr. Opin. Genet. Dev. 62, 50–57 (2020).
74. Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. Proc. Natl Acad. Sci. USA 112, 3439–3444 (2015).
75. Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. 44, 139–160 (1982).
76. Vollger, M. R. mrvollger/mutyper_workflow: v0.1. Zenodo, https://doi.org/10.5281/ZENODO.7653472 (2023).
77. Vollger, M. R. mrvollger/sd-divergence-and-igc-figures: v0.1. Zenodo, https://doi.org/10.5281/ZENODO.7653486 (2023).

Acknowledgements

We thank T. Brown for help in editing this manuscript, P. Green for valuable suggestions, and R. Seroussi and his staff for their generous donation of time and resources. This work was supported in part by grants from the US National Institutes of Health (NIH 5R01HG002385, 5U01HG010971 and 1U01HG010973 to E.E.E.; K99HG011041 to P.H.; and F31AI150163 to W.S.D.). W.S.D. was supported in part by a Fellowship in Understanding Dynamic and Multi-scale Systems from the James S. McDonnell Foundation. E.E.E. is an investigator of the Howard Hughes Medical Institute (HHMI). This article is subject to HHMI’s Open Access to Publications policy. HHMI laboratory heads have previously granted a nonexclusive CC BY 4.0 licence to the public and a sublicensable licence to HHMI in their research articles. Pursuant to those licences, the author-accepted manuscript of this article can be made freely available under a CC BY 4.0 licence immediately on publication.

Author information

Authors and Affiliations

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Mitchell R. Vollger, Philip C. Dishuck, William T. Harvey, William S. DeWitt, Xavi Guitart, Michael E. Goldberg, Allison N. Rozanski, Carl A. Baker, Jennifer Kordosky, Katherine M. Munson, Alexandra P. Lewis, Kendra Hoekzema, Glennis A. Logsdon, David Porubsky, Kelley Harris, PingHsun Hsieh & Evan E. Eichler
Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
Mitchell R. Vollger
Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
William S. DeWitt
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
William S. DeWitt
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Julian Lucas, Mobin Asri, Xian H. Chang, Mark Diekhans, Jordan M. Eizenga, Marina Haukness, David Haussler, Glenn Hickey, Julian K. Lucas, Charles Markello, Karen H. Miga, Jean Monlong, Adam M. Novak, Hugh E. Olsen, Trevor Pesout, Jouni Sirén & Benedict Paten
Howard Hughes Medical Institute, Chevy Chase, MD, USA
David Haussler, Erich D. Jarvis & Evan E. Eichler
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St Louis, MO, USA
Haley J. Abel
McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
Lucinda L. Antonacci-Fulton, Sarah Cody, Robert S. Fulton, Allison A. Regier, Chad Tomlinson & Ting Wang
Google LLC, Mountain View, CA, USA
Gunjan Baid, Anastasiya Belyaeva, Andrew Carroll, Pi-Chuan Chang, Daniel E. Cook, Alexey Kolesnikov, Maria Nattestad & Kishwar Shafin
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
Konstantinos Billis, Susan Fairley, Paul Flicek, Adam Frankish, Carlos Garcia Giron, Leanne Haggerty, Thibaut Hourlier, Jan O. Korbel, Fergal J. Martin & Francesca Floriana Tricomi
Department of Human Genetics, McGill University, Montreal, Quebec, Canada
Guillaume Bourque
Canadian Center for Computational Genomics, McGill University, Montreal, Quebec, Canada
Guillaume Bourque
Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
Guillaume Bourque
Institute of Genetics and Biophysics, National Research Council, Naples, Italy
Silvia Buonaiuto & Vincenza Colonna
Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
Mark J. P. Chaisson & Tsung-Yu Lu
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Haoyu Cheng, Justin Chu, Xiaowen Feng & Heng Li
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Haoyu Cheng, Xiaowen Feng & Heng Li
Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
Vincenza Colonna, Christian Fischer, Erik Garrison, Andrea Guarracino, Pjotr Prins & Flavia Villani
Barrett and O’Connor Washington Center, Arizona State University, Washington DC, USA
Robert M. Cook-Deegan
Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
Omar E. Cornejo
Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Daniel Doerr, Peter Ebert, Jana Ebler, Hugo Magalhães, Pierre Marijon & Tobias Marschall
Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Daniel Doerr, Peter Ebert, Jana Ebler, Hugo Magalhães, Pierre Marijon & Tobias Marschall
Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Peter Ebert
Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
Olivier Fedrigo, Giulio Formenti, Erich D. Jarvis & Jacquelyn Mountcastle
National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD, USA
Adam L. Felsenfeld, Baergen I. Schultz, Michael W. Smith & Heidi J. Sofia
Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
Robert S. Fulton & Ting Wang
Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Yan Gao
Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
Shilpa Garg
Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
Richard E. Green
Dovetail Genomics, Scotts Valley, CA, USA
Richard E. Green
Quantitative Life Sciences, McGill University, Montreal, Quebec, Canada
Cristian Groza
Genomics Research Centre, Human Technopole, Milan, Italy
Andrea Guarracino
Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
Ira M. Hall, Wen-Wei Liao & Shuangjia Lu
Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
Ira M. Hall & Wen-Wei Liao
Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
Simon Heumos
Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
Simon Heumos
Tree of Life, Wellcome Sanger Institute, Hinxton, UK
Kerstin Howe & Jonathan M. D. Wood
Northeastern University, Boston, MA, USA
Miten Jain
Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
Erich D. Jarvis
Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Hanlee P. Ji & HoJoon Lee
Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eimear E. Kenny
Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
Barbara A. Koenig
Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Jan O. Korbel
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Sergey Koren, Ann McCartney, Sergey Nurk, Adam M. Phillippy, Mikko Rautiainen, Arang Rhie & Brian Walenz
Division of Biology and Biomedical Sciences, Washington University School of Medicine, St Louis, MO, USA
Wen-Wei Liao
Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
Santiago Marco-Sola
Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
Santiago Marco-Sola
Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Jennifer McDaniel, Nathan D. Olson, Justin Wagner & Justin M. Zook
Coriell Institute for Medical Research, Camden, NJ, USA
Matthew W. Mitchell
Department of Computer Science, University of Pisa, Pisa, Italy
Moses Njagi Mwaniki
Department of Public Health Sciences, University of California, Davis, Davis, CA, USA
Alice B. Popejoy
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Daniela Puiu & Aleksey V. Zimin
Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
Samuel Sacco
Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
Ashley D. Sanders
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Valerie A. Schneider & Françoise Thibaud-Nissen
Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
Jonas A. Sibbesen
Al Jalila Genomics Center of Excellence, Al Jalila Children’s Specialty Hospital, Dubai, United Arab Emirates
Ahmad N. Abou Tayoun
Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
Ahmad N. Abou Tayoun
Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Aleksey V. Zimin

Authors

Mitchell R. Vollger
Philip C. Dishuck
William T. Harvey
William S. DeWitt
Xavi Guitart
Michael E. Goldberg
Allison N. Rozanski
Julian Lucas
Mobin Asri
Katherine M. Munson
Alexandra P. Lewis
Kendra Hoekzema
Glennis A. Logsdon
David Porubsky
Benedict Paten
Kelley Harris
PingHsun Hsieh
Evan E. Eichler

Consortia

Human Pangenome Reference Consortium

Haley J. Abel, Lucinda L. Antonacci-Fulton, Mobin Asri, Gunjan Baid, Carl A. Baker, Anastasiya Belyaeva, Konstantinos Billis, Guillaume Bourque, Silvia Buonaiuto, Andrew Carroll, Mark J. P. Chaisson, Pi-Chuan Chang, Xian H. Chang, Haoyu Cheng, Justin Chu, Sarah Cody, Vincenza Colonna, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Daniel Doerr, Peter Ebert, Jana Ebler, Evan E. Eichler, Jordan M. Eizenga, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Xiaowen Feng, Christian Fischer, Paul Flicek, Giulio Formenti, Adam Frankish, Robert S. Fulton, Yan Gao, Shilpa Garg, Erik Garrison, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Cristian Groza, Andrea Guarracino, Leanne Haggerty, Ira M. Hall, William T. Harvey, Marina Haukness, David Haussler, Simon Heumos, Glenn Hickey, Kendra Hoekzema, Thibaut Hourlier, Kerstin Howe, Miten Jain, Erich D. Jarvis, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Heng Li, Wen-Wei Liao, Shuangjia Lu, Tsung-Yu Lu, Julian K. Lucas, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Charles Markello, Tobias Marschall, Fergal J. Martin, Ann McCartney, Jennifer McDaniel, Karen H. Miga, Matthew W. Mitchell, Jean Monlong, Jacquelyn Mountcastle, Katherine M. Munson, Moses Njagi Mwaniki, Maria Nattestad, Adam M. Novak, Sergey Nurk, Hugh E. Olsen, Nathan D. Olson, Benedict Paten, Trevor Pesout, Adam M. Phillippy, Alice B. Popejoy, David Porubsky, Pjotr Prins, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Jonas A. Sibbesen, Jouni Sirén, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Chad Tomlinson, Francesca Floriana Tricomi, Flavia Villani, Mitchell R. Vollger, Justin Wagner, Brian Walenz, Ting Wang, Jonathan M. D. Wood, Aleksey V. Zimin & Justin M. Zook

Contributions

Conceptualization and design: M.R.V., K. Harris, W.S.D., P.H. and E.E.E. Identification and analysis of SNVs from phased assemblies: M.R.V. Mutational spectrum analysis: M.R.V., W.S.D., M.E.G. and K. Harris. Evolutionary age analysis: M.R.V. and P.H. Assembly generation: M.A., J.L., B.P. and HPRC. PacBio genome sequence generation: K.M.M., A.P.L., K. Hoekzema and G.A.L. Copy number analysis and validation: P.C.D., X.G., W.T.H., A.N.R., D. Porubsky and M.R.V. Table organization: M.R.V. Supplementary material organization: M.R.V. Display items: M.R.V., X.G., P.H. and P.C.D. Resources: HPRC, K. Harris, B.P. and E.E.E. Manuscript writing: M.R.V. and E.E.E. with input from all authors.

Corresponding author

Correspondence to Evan E. Eichler.

Ethics declarations

Competing interests

E.E.E. is a scientific advisory board member of Variant Bio, Inc. All other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Anna Lindstrand and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Analysis schema for variant and IGC calling.

Whole-genome alignments were calculated for the HPRC assemblies against T2T-CHM13 v1.1 with a copy of GRCh38 chrY using minimap2 v2.24. The alignments were further processed to remove alignments that were redundant in query sequence or that had structural variants over 10 kbp in length. After these steps, the remaining alignments over 1 Mbp were defined to be syntenic and used in downstream analyses. We then counted all pairwise single-nucleotide differences between the haplotypes and the reference and stratified these results into unique regions versus SD regions based on the SD annotations from T2T-CHM13 v1.1. All variants intersecting tandem repeats were filtered to avoid spurious SNV calls. To detect candidate regions of IGC, the query sequence with syntenic alignments was fragmented into 1 kbp windows with a 100 bp slide and realigned back to T2T-CHM13 v1.1 independent of the flanking sequence using minimap2 v2.24 to identify each window’s single best alignment position. These alignments were compared to their original syntenic alignment positions, and if they were not overlapping, we considered them to be candidate IGC windows. Candidate IGC windows were then merged into larger intervals and realigned when windows were overlapping in both the donor and the acceptor sequence. We then used the CIGAR string to identify the number of matching and mismatching bases at the “donor” site and compared that to the number of matching and mismatching bases at the acceptor site determined by the syntenic alignment to calculate the number of supporting SNVs.
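
The candidate-window logic described in this legend can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, intervals are assumed to be half-open (chromosome, start, end) tuples, and the CIGAR strings are assumed to use the extended `=`/`X` operators (minimap2 emits these with `--eqx`; the legend does not state which flags were used). The final subtraction is only one plausible reading of how the supporting SNVs are counted.

```python
import re
from typing import Iterator, Tuple

WINDOW = 1_000  # 1 kbp window size, as stated in the legend
SLIDE = 100     # 100 bp slide, as stated in the legend

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open; an assumed representation


def sliding_windows(length: int, window: int = WINDOW, slide: int = SLIDE) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows along a query sequence of the given length."""
    for start in range(0, max(length - window, 0) + 1, slide):
        yield start, min(start + window, length)


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals are on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def is_candidate_igc(syntenic: Interval, best: Interval) -> bool:
    """A window is a candidate if its best independent alignment does not overlap its syntenic position."""
    return not overlaps(syntenic, best)


def cigar_match_mismatch(cigar: str) -> Tuple[int, int]:
    """Count matching (=) and mismatching (X) bases in an extended CIGAR string."""
    matches = mismatches = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "=":
            matches += int(count)
        elif op == "X":
            mismatches += int(count)
    return matches, mismatches


def supporting_snvs(donor_cigar: str, acceptor_cigar: str) -> int:
    """Assumption: support is the number of mismatches removed by aligning to the donor instead of the acceptor."""
    _, donor_mismatches = cigar_match_mismatch(donor_cigar)
    _, acceptor_mismatches = cigar_match_mismatch(acceptor_cigar)
    return acceptor_mismatches - donor_mismatches
```

For example, a window whose syntenic position is ("chr15", 100, 1100) and whose best independent alignment is ("chr15", 5000, 6000) would be flagged as a candidate, whereas a best alignment at ("chr15", 1000, 2000) would not, because the two intervals overlap.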

Extended Data Fig. 2 Ideogram of an assembly of CHM1 aligned to T2T-CHM13.

The ideogram depicts the contiguity (alternating blue and orange contigs) of a CHM1 assembly generated by Verkko as compared to T2T-CHM13. The overall contig N50 is 105.2 Mbp providing near chromosome arm contiguity with the exception of breaks at the centromere (red) and other large satellite arrays. Because the sequence is derived from a monoploid complete hydatidiform mole, there is no opportunity for assembly errors due to inadvertent haplotype switching.

Extended Data Fig. 3 Increased variation in SD sequences and African haplotypes.

Histograms of the average number of SNVs per 10 kbp over all 125 Mbp bins of unique (blue) and SD (red) sequence for all haplotypes. African haplotypes (bottom) are compared separately to non-African (top) haplotypes. All SD bins (125 Mbp each) have more SNVs than any unique bin irrespective of human superpopulation.

Extended Data Fig. 4 Average number of SNVs across different repeat classes.

Shown are the average number of SNVs per 10 kbp within SDs (red), unique (blue), and additional sequence classes (gray) across the HPRC haplotypes. These classes include exonic regions, ancient SDs (SD with <90% sequence identity) and all elements identified by RepeatMasker (RM) with Alu, L1 LINE, and HERV elements broken out separately. Below each sequence class we show the average number of SNVs per 10 kbp for the median haplotype. Standard deviations and measurements for additional repeat classes are provided in Table S3.

Extended Data Fig. 5 Largest IGC events in the human genome.

The ideogram depicts as red arcs the positions of the largest IGC events between and within human chromosomes (top 10% of the length distribution).

Extended Data Fig. 6 Percent of increased single-nucleotide variation explained by IGC.

Shown is the fraction of the increased SNV diversity in SDs that can be attributed to IGC for each of the HPRC haplotypes, stratified by global superpopulation. The value shown as text in the plot is the average across all haplotypes (23%).

Extended Data Fig. 7 IGC hotspots.

a) Density of IGC acceptor (top, blue) and donor (bottom, orange) sites across the “SD genome”. The SD genome consists of all main SD regions (>50 kbp) minus the intervening unique sequences. b) All intrachromosomal IGC events from 102 human haplotypes analyzed for chromosome 15. Arcs drawn in blue (top) have the acceptor site on the left-hand side and the donor site on the right. Arcs drawn in orange (bottom) are arranged oppositely. Protein-coding genes are drawn as vertical black lines above the ideogram, and large duplication (blue) and deletion (red) events associated with human diseases are drawn as horizontal lines just above the ideogram. c) Zoom of the 100 highest confidence (lowest p-value) IGC events identified on chromosome 15 between 17 and 31 Mbp. Genes that are intersected by IGC events are highlighted in red.

Supplementary information

Supplementary Information

This file contains Supplementary Figs. 1–20, Notes and References.

Reporting Summary

Supplementary Tables

This file contains Supplementary Tables 1–14.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vollger, M.R., Dishuck, P.C., Harvey, W.T. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023). https://doi.org/10.1038/s41586-023-05895-y

Received: 06 July 2022
Accepted: 28 February 2023
Published: 10 May 2023
Issue Date: 11 May 2023
DOI: https://doi.org/10.1038/s41586-023-05895-y
