Recombination between heterologous human acrocentric chromosomes

Guarracino, Andrea; Buonaiuto, Silvia; de Lima, Leonardo Gomes; Potapova, Tamara; Rhie, Arang; Koren, Sergey; Rubinstein, Boris; Fischer, Christian; Gerton, Jennifer L.; Phillippy, Adam M.; Colonna, Vincenza; Garrison, Erik

doi:10.1038/s41586-023-05976-y

Article
Open access
Published: 10 May 2023

Nature volume 617, pages 335–343 (2023)

Abstract

The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications^1,2. Although the resolution of these regions in the first complete assembly of a human genome—the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13)—provided a model of their homology³, it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologous sequences. Utilizing an all-to-all comparison of the human pangenome from the Human Pangenome Reference Consortium⁴ (HPRC), we find that contigs from all of the SAACs form a community. A variation graph⁵ constructed from centromere-spanning acrocentric contigs indicates the presence of regions in which most contigs appear nearly identical between heterologous acrocentric chromosomes in T2T-CHM13. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the pseudo-homologous regions than in the corresponding short and long arms, indicating higher rates of recombination^6,7. The pseudo-homologous regions include sequences that have previously been shown to lie at the breakpoint of Robertsonian translocations⁸, and their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14 and 21. The ubiquity of signals of recombination between heterologous acrocentric chromosomes seen in the HPRC draft pangenome suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed from cytogenetic studies 50 years ago⁹.

Main

Although the human reference genome is now 22 years old¹⁰, fundamental limitations of the bacterial artificial chromosome (BAC) libraries on which it was built prevented its completion. Incomplete regions amount to 8% of the Genome Reference Consortium’s Human Build 38, and include heterochromatic regions in the centromeres and the SAACs. Advances in long-read DNA sequencing have recently enabled the creation of a complete reference assembly—T2T-CHM13³—from a homozygous human cell line, providing a reference system for these regions for the first time. In parallel, our ongoing work in the HPRC has yielded 94 haplotype-resolved assemblies for human cell lines (HPRCy1) based on the same Pacific Biosciences circular consensus (HiFi) sequencing that forms the foundation of T2T-CHM13⁴. These resources enable us to characterize patterns of variation in these previously invisible regions. Here we study variation in the largest non-centromeric regions made visible in T2T-CHM13 and HPRCy1—those between the centromere and the ribosomal DNA (rDNA) on the SAACs, where Robertsonian translocations¹¹ (ROBs), the most common human translocation events, frequently occur.

Eighteen of the twenty-three human chromosomes are metacentric, with the centromere found in a median position between short (p) and long (q) arms, whereas five are acrocentric, featuring one arm that is substantially shorter than the other. The SAACs (chromosome (chr.) 13p, chr. 14p, chr. 15p, chr. 21p and chr. 22p) host the nucleolus organizer regions, the genomic segments that contain rDNA genes and that give rise to the interphase nucleoli^12,13. Owing to their repetitive nature, rDNA repeat arrays facilitate intramolecular recombination¹⁴. rDNA repeats incur double-strand breaks at a high rate owing to transcription–replication conflicts¹³. Moreover, rDNA from multiple acrocentric chromosomes can be co-located in nucleoli during interphase, and multiple acrocentric chromosomes often co-localize to a single nucleolus during the pachytene stage of meiosis¹⁵, when chromosomes synapse and recombine. As it causes them to occupy the same constrained physical space, the positioning of rDNA-adjacent sequences in proximity to the nucleolus could be a driver of genetic exchange between heterologous chromosomes (Supplementary Note 1), given that estimates¹⁵ of the probability of two regions adjacent to the nucleolus-organizing region being colocalized are 120,000 times higher than that of random colocalization in a human spermatocyte nucleus. In line with this, sequences distal and proximal to rDNA repeat arrays are conserved among the acrocentric chromosomes, suggesting that recombination homogenizes them^1,2. Experimental and sequence-based evidence indicates that the acrocentric pairs chr. 13–chr. 21 and chr. 14–chr. 22 each share a common subfamily of alpha satellite DNA, which is consistent with an evolutionary process involving recombination between heterologous chromosomes¹⁶. Furthermore, ROBs, which occur in 1 out of 800 births, are most common between chr. 13 and chr. 14 (around 75% of cases), and between chr. 14 and chr. 21 (around 10% of cases), but the underlying sequences and recombination processes that drive them remain unknown⁸.

The T2T-CHM13 reference fully resolves the genomic structure of the SAACs, confirming their strong similarity and providing a complete view of the homologies in this single genome³. However, T2T-CHM13 does not provide information on how SAACs vary among the human population and additional genomes are needed to understand whether the representation in T2T-CHM13 is typical. Notably, alignments of HPRCy1 assemblies to T2T-CHM13 reveal individual contigs with optimal alignments to multiple CHM13 acrocentric chromosomes, suggesting possible translocations⁴. This analysis was necessarily relative to only a single frame of reference used as target in alignment, leaving open questions regarding the relationships between pairs of HPRCy1 haplotypes. A complete study of this region thus requires improvements in both sequence assembly and pangenome analysis to enable an unbiased assessment of its structure and variation in the population. Here we combine T2T-CHM13 and HPRCy1 assemblies in a reference-free pangenome variation graph (PVG) model of the SAACs. Using this model and other symmetric analyses of T2T-CHM13 and the HPRCy1 assemblies, we establish a coherent model of population-scale variation in the SAACs.

Chromosome community detection

We sought to study the chromosome groupings implied by the homologies found in all 94 assemblies of the HPRCy1 pangenome. We used homology mapping to build a reference-free model of the HPRCy1 pangenome, represented as a mapping graph with nodes as contigs and edges as mappings between them. The graph was built using chains of 50-kb seeds of 95% average nucleotide identity—features that we expect to support homologous recombination¹⁷—with up to 93 alternative mappings allowed per contig. After applying this process to all 38,325 HPRCy1 contigs and narrowing our focus to only mappings involving contigs that are at least 1 Mb in size, we built a reduced mapping graph by selecting the best 3 mappings per contig segment and labelling each contig with its reference-relative assignment (Fig. 1a). This simplified graph showed clusters that generally matched our expectations of higher similarity between certain chromosomes^18,19 (Fig. 1b). For a more quantitatively rigorous interpretation, we used a community detection algorithm (Methods) to divide the full mapping graph into 31 communities (Supplementary File 1). These communities were consistent with our expectations based on mapping the contigs to reference chromosomes T2T-CHM13 and GRCh38 and known patterns of similarity between chromosomes. We found that the community of the SAACs contained the most distinct chromosomes and the most contigs (Fig. 1c,d). Many contigs from the pseudoautosomal regions (PARs) and X-transposed regions (XTRs) of chromosome X²⁰ and all of those from chromosome Y formed one community, and others from the short arm of chromosome X—including all of those from evolutionary strata 4 and 5²¹—formed another (Extended Data Fig. 1). A few additional communities were identified that did not correspond to individual chromosomes, but typically represent single chromosome arms.

**Fig. 1: Community detection in the HPRCy1 pangenome.**

An all-acrocentric PVG

We constructed a pangenome graph from acrocentric contigs in the HPRCy1 draft pangenome to evaluate the hypothesis that heterologous SAACs recombine. We first collected long HPRCy1 contigs that span the acrocentric centromeres and can be assigned to specific acrocentric chromosomes (Extended Data Fig. 2). We then used the PanGenome Graph Builder²² (PGGB) to construct a single PVG from these contigs (Methods). PVG nodes represent sequences and edges indicate when concatenations of the nodes they connect occur in the contigs represented by the graph²³. By relating pangenome sequences to the graph as paths of nodes⁵, PVGs support base-level analysis of variation and homology between genomes^4,24,25,26. The symmetric all-to-all alignment²⁷ and graph induction²⁸ of PGGB avoid sources of bias such as reference choice and genome inclusion order that affect progressive PVG construction methods²⁸. For cross-validation of our results, we additionally include two assemblies of HG002 in the PVG: HG002-HPRCy1⁴—obtained from HiFi reads, and HG002-Verkko—a T2T diploid assembly constructed from both HiFi and Oxford Nanopore Technologies (ONT) reads as described in Methods.

The resulting acrocentric PVG (acro-PVG) presents structures that echo those observed in T2T-CHM13 and the community structure of the homology mapping graph (Fig. 2a and Supplementary Files 2 and 3). In more detail, the main connected component including all chromosomes presented a tangled region, anchored at the rDNA repeats and extending towards the centromere-proximal end of the short arms. The alpha satellite higher-order repeat arrays in the centromeres of chr. 13–chr. 21 and chr. 14–chr. 22 pairs shared high similarity within each pair^18,19, leading to collapsed motifs in the graph (Fig. 2b). The chr. 13–chr. 21 and chr. 14–chr. 22 pairs diverge in centromere-proximal regions of the q-arms. Furthermore, a region in the pangenome graph centred on the GC-rich SST1 array was present in a single copy in chr. 13, chr. 14 and chr. 21, indicating a high degree of similarity of genomes in those regions (Fig. 2c and Supplementary Fig. 1). This is compatible with the frequent involvement of these regions of chr. 13, chr. 14 and chr. 21 in ROBs^8,29. The SST1 elements in the segmentally duplicated region are GC-rich 1.4- to 2.4-kb-long sequences arranged in tandem clusters³⁰, located throughout the genome including near the centromeres of the SAACs chr. 13, chr. 14 and chr. 21³¹. The SST1 array size is variable in the human population³² and its methylation status is clinically relevant to cancer³³. SST1 repeats on chr. 13, chr. 14 and chr. 21 in T2T-CHM13 are highly similar to each other³¹, consistent with homogenization via recombination. All the graph motifs described in the acro-PVG were also confirmed by building a pangenome graph without including the T2T-CHM13 and GRCh38 references (Supplementary Figs. 2 and 3), indicating that the observed structure is independent of the reference assemblies.

**Fig. 2: The acro-PVG derived from the HPRCy1 assembly.**

Exchange among heterologous acrocentric regions

The acro-PVG provides a representation of the multiple alignment of SAACs found in the human population. In the acro-PVG, we observe many regions in the graph where multiple T2T-CHM13 chromosomes are aligned. We expect these regions to potentially support homologous recombination, which largely depends on sequence homology and physical proximity, both of which are common among heterologous SAACs^15,34.

HPRCy1 contigs are homology mosaics

We sought to test the hypothesis that homologous regions of the SAACs feature ongoing sequence exchange by searching for regions in the acro-PVG where individual contigs are best described as a mosaic of diverse T2T-CHM13 acrocentric chromosomes. We derived a pairwise alignment from the acro-PVG through ‘untangling’²⁶, a process that projects the graph into an alignment between a set of query (HPRCy1 acrocentric (HPRCy1-acro)) and reference (T2T-CHM13) sequences, jointly considering all possible alignments represented by the pangenome graph. The untangling of the acro-PVG against multiple T2T-CHM13 chromosome reference sequences simultaneously shows the best match of segments within contigs to multiple reference chromosomes.

The hypothesis of recombination between heterologous acrocentric chromosomes implies that the HPRCy1-acro contigs untangled from the acro-PVG will be a mosaic of diverse acrocentric chromosomes in the regions undergoing homologous recombination. The same would not be true for flanking regions that should map to one specific chromosome.

We queried the PVG²⁶ to obtain a mapping from segments of all PVG paths onto T2T-CHM13. This segments the graph, and for each HPRCy1 contig (query) subpath through each graph segment, we find the most similar reference segment (Extended Data Fig. 3). To reduce the possibility of error, we focused the alignment projection only on the confidently assembled regions of the HPRCy1-acro contigs⁴ (Methods) and we filtered the mappings to retain only those at greater than 90% estimated identity, removing a total of 1.17 Gbp, or 2.52% of the total SAAC contig segments (Supplementary File 4 and Supplementary Figs. 4–8).

For a reference-relative interpretation of the results, we anchored the contigs to the single T2T-CHM13 reference chromosomes to which the q-arm maps (Methods), providing a reference-relative positioning of contigs in the PVG. We find that the q-arm of each contig maps to a single chromosome, whereas the p-arm is a mosaic of segments mapping to several acrocentric chromosomes (Fig. 3a,b). Results for all the acrocentric chromosomes are shown in Extended Data Figs. 4–8.

**Fig. 3: Characteristics of the PHRs of acrocentric chromosomes.**

We cross-validated homology mosaic patterns by comparing the reference-relative anchored untangling of HG002-Verkko to HG002-HiFi, obtaining an 87.45% concordance rate in the SAACs and a 99.93% concordance rate in the acrocentric q-arms (Methods). Although HG002-HiFi contains only one contig that would meet our HPRCy1 contiguity requirements, we observe broadly concordant patterns in the two assemblies (Supplementary Figs. 9–13) and visually confirm patterns, such as those between chr. 13p and chr. 21p, that are seen in many HPRCy1 assemblies (Supplementary Fig. 9).

Homology mosaicism grows across SAACs

We counted the number of contigs that best match each of the T2T-CHM13 acrocentric chromosomes within the PVG (Fig. 3c). On the q-arm, all contigs best match their homologous T2T-CHM13 chromosome, agreeing with the observed structure of the PVG (Fig. 2a). However, as we approach the centromere from the q-arm, we observe regions of homology between chr. 13 and chr. 21 (Fig. 3c and Extended Data Figs. 4b and 7b) and between chr. 14 and chr. 22 (Extended Data Figs. 5b and 8b). By contrast, homology with other acrocentric chromosomes begins closer to the rDNA in chr. 15 (Extended Data Fig. 6b), corroborating the pattern observed in the PVG topology (Fig. 2b).

Although the higher-order repeat arrays on chr. 13 and chr. 21 and on chr. 14 and chr. 22 are both collapsed in the PVG (Fig. 2b), we observe sparse identity mappings higher than 90% within the centromeres (Fig. 3b and Extended Data Figs. 4a, 5a, 6a, 7a and 8a). This is consistent with other reports of high divergence within centromeric satellites³⁵. HPRCy1 contigs anchored on the q-arms of chr. 13, chr. 14 and chr. 21 share a segmental duplication (or homologous region) centred on the SST1 array (Fig. 3b), in line with what is seen in the pangenome graph topology (Fig. 2c and Supplementary Figs. 1 and 3). Furthermore, as in T2T-CHM13, this region is in the same orientation on chr. 13 and chr. 21, but is inverted on chr. 14 (Supplementary Figs. 14–16). All chromosomes provide similarly good matches for contigs in the regions immediately proximal and distal to the rDNA. However, this is supported by relatively few (nine) q-arm-anchored contigs that purport to cross the rDNA—loci that we do not expect to assemble correctly using current sequencing approaches³.

To assess the homology between the acrocentric chromosomes, we developed a metric that captures the degree of disorder in the untangling of HPRCy1-acro contigs over 50-kb regions of T2T-CHM13 (Methods). This metric, ‘regional homology entropy’, is greater than 0 in regions where contigs match multiple T2T-CHM13 chromosomes—a pattern indicative of recombination. We find that regional homology entropy increases as we progress over each short arm and reaches a maximum immediately on the proximal flanks of the rDNA arrays (Fig. 3d). We observed an equivalent increase of regional homology entropy in the PARs on chr. X and chr. Y (Supplementary Fig. 17), which are known to actively recombine.
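
The idea behind regional homology entropy can be illustrated with a small sketch. The exact definition is given in Methods and is not reproduced in this section, so the code below assumes a plain Shannon entropy (in bits) over the best-match T2T-CHM13 chromosome labels observed in each window; the function names and the toy windows are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (bits) of a list of categorical labels; 0.0 for empty input."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over labels; equals 0.0 when only one label is present.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def regional_entropy(best_hits_by_window):
    """Entropy per window, given {window_start: [best-match chromosome per contig]}.

    Assumption: one best-match label per contig per window (for example, 50-kb windows).
    """
    return {w: shannon_entropy(hits) for w, hits in sorted(best_hits_by_window.items())}


# Toy data (made up): entropy is 0 where all contigs match one chromosome
# and grows as the best matches spread over more chromosomes.
toy_windows = {
    0: ["chr13", "chr13", "chr13", "chr13"],
    50_000: ["chr13", "chr13", "chr21", "chr21"],
    100_000: ["chr13", "chr14", "chr21", "chr22"],
}
entropy_by_window = regional_entropy(toy_windows)
```

In this toy input the entropy rises from 0 to 1 to 2 bits across the three windows, mirroring the qualitative trend described for the short arms.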

Acrocentric PHRs

Our analyses suggest that regions of near-identity between multiple T2T-CHM13 chromosomes are capable of supporting large-scale homologous recombination. To study the boundaries of these regions, we derived a multiple untangling, which orders by identity multiple T2T-CHM13 matches for every contig segment (Supplementary Figs. 18–22). The order of T2T-CHM13 hits captures the leaf order of a HPRCy1 contig-rooted phylogeny³⁶. Differences in chromosome-relative phylogenies across haplotypes indicate different evolutionary histories and imply ongoing recombination³⁷, which leads to chequerboard patterns in the multiple untangling plots (Fig. 4c). To delineate regions where heterologous chromosomes are likely to recombine, we computed ‘positional homology entropy’—a measure of the diversity of reference-relative phylogenies—for each position in T2T-CHM13 (Fig. 3e and Supplementary Fig. 23). We consider regions with positional homology entropy greater than 0 over more than 30 kb to be candidates for ongoing recombination (Methods).
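
The thresholding rule stated above (positional homology entropy greater than 0 over more than 30 kb) can be sketched as a simple run-merging pass. This is an illustrative sketch only: the interval representation, the function name `candidate_regions` and the toy values are assumptions, not the procedure from Methods.

```python
def candidate_regions(intervals, min_len=30_000):
    """Merge adjacent intervals with entropy > 0 and keep runs longer than min_len.

    intervals: list of (start, end, entropy), half-open, sorted, non-overlapping.
    Returns a list of (start, end) tuples for the retained runs.
    """
    regions = []
    run_start = run_end = None
    for start, end, h in intervals:
        if h > 0 and run_start is not None and start == run_end:
            run_end = end  # extend the current run of positive entropy
            continue
        # The current run (if any) ends here; keep it if it is long enough.
        if run_start is not None and run_end - run_start > min_len:
            regions.append((run_start, run_end))
        run_start, run_end = (start, end) if h > 0 else (None, None)
    if run_start is not None and run_end - run_start > min_len:
        regions.append((run_start, run_end))
    return regions


# Toy positional entropy track (made-up coordinates and values).
toy_track = [
    (0, 10_000, 0.0),
    (10_000, 30_000, 0.5),
    (30_000, 50_000, 1.2),
    (50_000, 60_000, 0.0),
    (60_000, 80_000, 0.9),
    (80_000, 120_000, 0.3),
    (130_000, 150_000, 0.4),  # separated by a gap and only 20 kb long
]
regions = candidate_regions(toy_track)
```

Here two runs (40 kb and 60 kb) pass the threshold, whereas the isolated 20-kb interval does not.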

**Fig. 4: PHRs of chr. 13, chr. 14 and chr. 21, centred on the SST1 array.**

These PHRs total 18.329 Mb in length (Supplementary File 5) and differ in size by chromosome: chr. 13, 4.53 Mb (Fig. 3b); chr. 14, 6.48 Mb (Extended Data Fig. 5a); chr. 15, 719.25 kb (Extended Data Fig. 6a); chr. 21, 3.79 Mb (Extended Data Fig. 7a); and chr. 22, 2.81 Mb (Extended Data Fig. 8a). We term them PHRs by analogy to the PARs of sex chromosomes, because these homology domains could enable non-homologous chromosomes to pair like homologous chromosomes. Notably, the chromosomes involved in the most common ROBs (chr. 13–chr. 14 and chr. 14–chr. 21) have larger PHRs, which could promote the recombination events that lead to these translocations. Supporting this, BAC clones surrounding common recurrent ROB breakpoints⁸ map to T2T-CHM13 PHRs (Supplementary Fig. 24). A genome-wide phylogenetic analysis of SST1 array elements indicates the expected pattern of concerted evolution by chromosome, but the repeats from chr. 13, chr. 14 and chr. 21 display a unique pattern of concerted evolution between chromosomes (Fig. 4a) and furthermore, share a deletion of around 1.0 kb relative to all other SST1 repeats (Fig. 4b and Supplementary Fig. 25), suggesting inter-array recombination similar to the surrounding non-satellite sequences of the PHRs (Fig. 4c). We confirmed that patterns observed by fluorescent in situ hybridization of these BACs are compatible with breakpoints occurring in the PHRs centred at the SST1 array (Fig. 4d).

To provide a positive control, we applied the same method to the sex chromosome PVG to identify their PHRs (chr. X, 2.75 Mb and chr. Y, 2.73 Mb; Supplementary File 6 and Supplementary Fig. 26). These regions precisely match the established boundaries for the PARs and contain sparse hits in the XTRs, which would be compatible with reports of X–Y interchange in the XTR³⁸. Biallelic SNP calls from a whole-genome HPRCy1 graph released in the accompanying Article⁴ show that variant density in the PARs and the acrocentric p-arms is markedly higher than elsewhere in these chromosomes (Supplementary Fig. 27), which is consistent with increased rates of recombination in these regions²⁰.

In humans and many other mammals, the sequence specificities of the DNA-binding zinc finger protein PRDM9 regulate the formation of double-stranded breaks that drive meiotic homologue synapsis and recombination^39,40. We scanned T2T-CHM13 acrocentric chromosomes for PRDM9 motifs detected by chromatin immunoprecipitation with sequencing⁴¹ (Extended Data Fig. 9 and Supplementary File 7), finding that both rDNA and SST1 arrays are enriched for PRDM9 motifs relative to the surrounding sequence (Fig. 3f and Extended Data Figs. 4e, 5e, 6e, 7e and 8e). By contrast, we find almost no PRDM9 motifs in the centromeres, where meiotic recombination is harmful and suppressed by diverse mechanisms⁴².

Linkage disequilibrium decay in PHRs

To quantify the magnitude of putative recombination in the PHRs, we calculated the rate of linkage disequilibrium decay between SNPs detected in the acro-PVG⁴³ (Supplementary Fig. 28). For each acrocentric chromosome, we plot the R² allele correlation versus distance for three sets of pairs of variants separated by up to 4 kb: variants on the q-arm, on the p-arm, and within the PHRs. The overall trend of linkage disequilibrium decay on the q-arms is similar to trends seen in other datasets that evaluate linkage disequilibrium in humans^44,45. On chr. 13, the decay of linkage disequilibrium in the PHRs and on the p-arm is similar and faster than on the q-arm, and on chr. 14, chr. 21 and chr. 22, linkage disequilibrium decay is even faster in the PHRs than on the p-arm. The same trend does not apply to chr. 15 (Extended Data Fig. 10).
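
As a rough illustration of the quantity being plotted, the sketch below computes the standard r² statistic between pairs of biallelic sites from phased 0/1 alleles and averages it in distance bins up to 4 kb. It is a minimal sketch, not the pipeline used for the figures: the function names, the 1-kb bin size and the toy haplotypes are assumptions.

```python
def r_squared(a, b):
    """r^2 between two biallelic sites, given 0/1 alleles on the same haplotypes."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele vectors must be non-empty and of equal length")
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # undefined when either site is monomorphic
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    return d * d / denom


def ld_decay(sites, max_dist=4_000, bin_size=1_000):
    """Mean r^2 per distance bin for all site pairs at most max_dist bp apart.

    sites: list of (position, alleles), sorted by position.
    Returns {bin_start: mean r^2}.
    """
    sums, counts = {}, {}
    for i, (pos_i, a) in enumerate(sites):
        for pos_j, b in sites[i + 1:]:
            dist = pos_j - pos_i
            if dist > max_dist:
                break
            r2 = r_squared(a, b)
            if r2 != r2:  # skip NaN (monomorphic site)
                continue
            key = (dist // bin_size) * bin_size
            sums[key] = sums.get(key, 0.0) + r2
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Toy data (made up): two perfectly correlated nearby sites and one
# uncorrelated site further away.
toy_sites = [
    (100, [0, 0, 1, 1]),
    (600, [0, 0, 1, 1]),
    (2_600, [0, 1, 0, 1]),
]
decay = ld_decay(toy_sites)
```

A faster drop of the binned mean r² with distance corresponds to the faster linkage disequilibrium decay discussed in the text.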

The fast linkage disequilibrium decay in the PHRs compared to q-arms supports the hypothesis of ongoing recombination exchange. In general, there is a higher level of linkage disequilibrium on the p-arms than on the q-arms, perhaps owing to lower recombination in heterochromatic regions⁴⁶. However, for the majority of acrocentric chromosomes, we observe the opposite in the PHRs. This effect is clearest within the chromosomes that other analyses suggest share the most homologous sequence: chr. 13, chr. 14, chr. 21 and chr. 22, whereas chr. 15—which appears to be an outlier⁷ (as also observed in Figs. 1 and 2)—contains shorter PHRs and we have less confidence in the linkage disequilibrium decay trends (error bars on Extended Data Fig. 10, chr. 15 PHRs). This pattern is consistent with an increased recombination rate in the PHRs^6,7.

Discussion

Here we develop multiple lines of evidence indicating active recombination between heterologous human acrocentric chromosomes. First, we find that a symmetric comparison of the sequences of a draft human pangenome contains multi-chromosome communities corresponding to both the sex chromosomes and acrocentric chromosomes (Fig. 1). An acrocentric pangenome graph reveals base-level homologies that outline patterns of exchange between the heterologous chromosomes (Fig. 2). The graph highlights regions featuring a diverse patchwork of best-match patterns involving non-homologous T2T-CHM13 chromosomes (Fig. 3), and we cross-validate these findings with a T2T diploid assembly of a target sample. We develop an entropy metric sensitive to recombination between heterologous chromosomes—such as that seen between chr. X and chr. Y—and apply it to delineate PHRs where heterologous SAACs recombine (Fig. 4 and Supplementary File 5). Finally, we show that on chr. 13, chr. 14, chr. 21 and chr. 22, the resulting 18 Mb of sequence in the PHRs presents a higher rate of linkage disequilibrium decay than seen in sequences from non-PHR regions of the same chromosomes (Extended Data Fig. 10). These lines of evidence all suggest that heterologous SAACs recombine.

BACs used in a previous cytogenetic study of Robertsonian chromosomes map to the PHRs of chr. 14 and chr. 21, with the recurrent breakpoint region found in a highly homologous region on chr. 13p, chr. 14p and chr. 21p centred on the PRDM9 motif-enriched SST1 array. This leads us to propose that PHRs are maintained by recombination between heterologous chromosomes, which occasionally results in an ROB (Fig. 5). We posit that these homologous regions (Fig. 5a) might share a biological function as sequences proximal to the nucleolar organizing regions. Their proximity (Fig. 5b) can facilitate inter-chromosomal recombination (Fig. 5c), which may be of either crossover or non-crossover type and may occur during meiosis or mitosis. Owing to an inversion of this region on chr. 14p relative to chr. 13p and chr. 21p, crossover-type recombination between pairs chr. 13–chr. 14 and chr. 14–chr. 21 leads to ROBs (Fig. 5d), which our study suggests are a pathological outcome of otherwise benign recombination between heterologous chromosomes.

**Fig. 5: The PHRs of human acrocentric chromosomes.**

The HPRC pangenome provides base-level resolution of homology patterns across many SAAC haplotypes, enabling us to examine in detail the regions in which ROBs occur. We observe that the GC-rich SST1 array lies at the centre of a segmentally duplicated region on chr. 13p, chr. 14p and chr. 21p, which shows a clear pattern of haplotype mixing between these chromosomes. This may also implicate this array as a nucleation point for recombination, as suggested by the observation (in 1 out of 220 oocytes) of heterologous chr. 14p–chr. 21p synapse formation in pachytene oocytes²⁹. We speculate that these segmentally duplicated regions are where common ROBs occur, a hypothesis supported by our reanalysis of previous cytogenetic mapping of the common ROB breakpoint for chr. 14 and chr. 21 (Fig. 4d).

Although we find that SAACs present challenges for assembly methods⁴, our validation based on ONT and HiFi data integration shows that the patterns that we observe in HiFi-only assemblies are consistent with ONT reads from the same sample. HG002-Verkko recapitulates key T2T-CHM13-relative untangling patterns also seen in HPRCy1 HiFi assemblies, such as the SST1-linked PHR at chr. 13p11.2 and chr. 21p11.2, rDNA-proximal mixing of all SAACs, and mixing of chr. 22q11.21 and chr. 14q11.2. Our analyses rest on the extensive assembly validation carried out by the HPRC, and observations used to establish signals for recombination are based only on assembly regions deemed to be reliable by mapping analyses⁴. Our study confirms previous hypotheses based on decades of diverse inquiry, which provides additional assurance that patterns observed bioinformatically are biologically grounded. This body of evidence suggests that our definition of the PHRs is likely to evolve with improved resolution of rDNA arrays and distal regions of the SAACs, which remain among the most challenging regions of the human genome to assemble and lie beyond the scope of our presented work.

Our study is fundamentally population-based. We cannot directly observe recombination of SAACs in this context, leaving open questions about recombination mechanisms that may be difficult to resolve from sequence information alone⁴⁷. However, similar to mutation, recombination is a rare event, which makes it easier to measure its distribution over chromosomes in a population genetic context as we have done here. This addresses key issues with many previous studies of recombination in the SAACs, which often feature small numbers of individuals^29,48 selected on the basis of medically relevant genomic states such as trisomy and ROB⁴⁹. Our resolution of the acrocentric PHRs confirms reported homologies between the SAACs^2,3, providing a reference for their structure that will be useful for future genomic and cytogenetic studies. In principle, recombination in the PHRs may be of either crossover or non-crossover type. Our data support both, but outside of recurrent ROBs and our expectation that non-crossover recombination is substantially more common (by a ratio of around 10:1) than crossover recombination^50,51, we lack distinguishing evidence for either. To estimate the relative rates of each type of event, we can use linkage disequilibrium patterns⁵⁰ to study the PHRs in large genomic cohorts^52,53, which will require realigning cohort short read data to T2T-CHM13 or the HPRC pangenome. Future improvements to assembly of the SAACs and the planned increase in the number of individuals included in the HPRC should allow for confident estimates of the relative rates of recombination types.

The co-location of rDNA repeats from different acrocentric chromosomes in a nucleolus provides physical proximity that can facilitate recombination events, both between rDNA repeats and between the adjacent PHRs. Our analyses suggest that the rate of recombination between heterologous pairs of acrocentric chromosomes varies, leading to characteristic patterns in the homology spaces that we have explored. Human cells generally have fewer nucleoli (between one and five) than acrocentric chromosomes⁵⁴ (ten). One possibility is that groups of acrocentric chromosomes between which we observe stronger homology and recombination—such as chr. 13, chr. 14 and chr. 21—may be more likely to co-localize to the same nucleolus, as observed in pachytene spermatocytes^15,55. Proximity, homology, recombination initiation sites and sequence orientation are likely to be factors in the high rate of ROBs between these chromosomes. The HPRC draft human pangenome has enabled us to approach genome evolution from a chromosome scale. By stepping away from a reference-centric model and directly comparing whole-chromosome assemblies of the acrocentric regions, we have obtained sequence-resolved responses to long-standing questions first posed in early cytogenetic studies of human genomes.

Methods

Genome assemblies

We analysed the 47 T2T phased diploid de novo assemblies (94 haplotypes in total) produced by the HPRC⁴. We included both T2T-CHM13 version 2³ and GRCh38.

Chromosome communities overview

The homology graph

We first used all-to-all mapping to build a reference-free model of homology relationships in the HPRCy1 pangenome. This models the full HPRCy1 as a mapping graph in which nodes are contigs and edges represent mappings between them. To build the HPRCy1 mapping graph, we generated homology mappings based on chains of 50-kb seeds of 95% average nucleotide identity, which we expect to support homologous recombination¹⁷, allowing up to (n − 1) = 93 alternative mappings over any part of each contig. We first applied this process to map all 38,325 HPRCy1 contigs against all others, obtaining mappings for 38,036 of them, covering 99.9% of the total assembly sequence. This indicated that 38,036 of the 38,325 HPRCy1 assembly contigs (99.2%) are homologous to at least 1 other contig. Complex tangles in the assembly graphs used to build the HPRCy1 generate short contigs and tend to result in higher rates of error⁴. Thus, to simplify later analysis and focus on well-resolved regions of the assemblies, we narrowed our focus to consider only mappings involving the 16,118 contigs at least 1 Mb long, covering 98.72% of the total assembly sequence.

We then built a graph where nodes are contigs and edges represent the mappings between them—the ‘mapping graph’. Edges in this mapping graph have weights equal to the estimated sequence identity multiplied by the length of the mapping. To infer the chromosome represented by each contig, we mapped all contigs against both T2T-CHM13 and GRCh38 references and assigned them a chromosome identity based on this mapping. This mapping graph is very dense, with up to 93 mappings per contig, making it difficult to directly visualize with existing methods. To develop intuition about patterns in this graph, we instead viewed a reduced mapping graph built from the best three mappings per contig segment, labelling each contig with its reference-relative assignment (Fig. 1a). The acrocentric cluster (Fig. 1b) generally matches our prior expectations of higher similarity between chr. 13 and chr. 21, and between chr. 14 and chr. 22^18,19.
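
The edge-weighting rule described above (estimated identity multiplied by mapping length, restricted to contigs of at least 1 Mb) can be sketched as follows. This is a minimal illustration, not the paf2net.py script: the `Mapping` record, the contig names and the choice to sum weights when a contig pair has several mappings are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical, simplified mapping record (not the full PAF specification).
Mapping = namedtuple(
    "Mapping", "query query_len target target_len aln_len identity"
)

MIN_CONTIG_LEN = 1_000_000  # contigs shorter than 1 Mb are excluded


def build_mapping_graph(mappings, min_len=MIN_CONTIG_LEN):
    """Return {(contig_a, contig_b): weight}, weight = identity * mapped length."""
    edges = {}
    for m in mappings:
        if m.query_len < min_len or m.target_len < min_len:
            continue  # drop mappings that involve a short contig
        key = tuple(sorted((m.query, m.target)))  # undirected edge
        weight = m.identity * m.aln_len
        # Assumption: several mappings between the same pair add up.
        edges[key] = edges.get(key, 0.0) + weight
    return edges


# Made-up contigs and mappings, for illustration only.
example = [
    Mapping("sampleA#1#ctg1", 5_000_000, "sampleB#2#ctg7", 4_000_000, 300_000, 0.98),
    Mapping("sampleA#1#ctg1", 5_000_000, "sampleB#2#ctg7", 4_000_000, 250_000, 0.96),
    Mapping("sampleA#1#ctg1", 5_000_000, "sampleC#1#tiny", 400_000, 260_000, 0.99),
]
graph = build_mapping_graph(example)
```

In this toy input the third mapping is discarded because its target contig is shorter than 1 Mb, so the graph contains a single weighted edge.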

Community detection

To quantify the significance of these patterns, we then applied a community detection algorithm⁵⁶ to the full mapping graph. The algorithm assigns each contig to a community such that the total assignment maximizes modularity, which can be understood as the density of (weighted) links inside communities compared to links between communities. This process yielded 31 communities (Supplementary File 1). We hypothesized that each community represented one chromosome or chromosome arm. Around half of the chromosomes (n = 11) were each represented by a single community. Chromosomes 1, 2, 3, 6 and 18 were each represented in two communities corresponding to their short and long arms, likely due to frequent assembly breaks across their centromeres (Fig. 1c,d). Contigs from chromosomes X and Y fell in the same community, although the short arm of chromosome X was represented in two communities (Fig. 1d). The SAACs formed the community with the most distinct chromosomes and most contigs (1,706 contigs containing 3.91% of HPRCy1 sequence), composed of contigs belonging to the short arms of all five acrocentric chromosomes plus chr. 21q and chr. 22q (Fig. 1c,d). chr. 13q, chr. 14q and chr. 15q each had their own community. The inclusion of the q-arms of chromosomes 21 and 22 in the community composed of p-arm contigs is likely related to their short lengths compared to chromosomes 13, 14 and 15. We obtained similar results when we increased the sensitivity of the mappings (Supplementary Fig. 29).
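
To make the modularity criterion concrete, the sketch below evaluates the standard weighted (Newman) modularity of a given partition on a toy graph. It only scores a partition and does not search for the best one, so it is not the Leiden algorithm; the function name and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Weighted modularity of a partition.

    edges: {(u, v): weight} for an undirected graph without self-loops.
    community: {node: community_label}.
    """
    total = sum(edges.values())  # total edge weight W
    inside = {}    # weight of edges with both ends in the community
    strength = {}  # summed weighted degree of the community's nodes
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        strength[cu] = strength.get(cu, 0.0) + w
        strength[cv] = strength.get(cv, 0.0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    # Q = sum_c [ W_in(c) / W - (S(c) / 2W)^2 ]
    return sum(
        inside.get(c, 0.0) / total - (s / (2 * total)) ** 2
        for c, s in strength.items()
    )


# Toy graph: two triangles joined by one edge.
toy_edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(toy_edges, two_groups)
q_single = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which is the behaviour a modularity-maximizing method exploits.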

In the homology mapping graph of the HPRCy1, only the acrocentric and sex chromosomes form combined communities containing multiple chromosomes. The sex chromosome community reflects the PARs on X and Y⁵⁷, which are telomeric regions where these otherwise non-homologous chromosomes recombine as if they were homologues. We hypothesized that the acrocentric community might also reflect ongoing pseudo-homologous recombination.

Community detection workflow

We performed pairwise mapping for all contigs from the 47 T2T phased diploid de novo assemblies with the WFMASH sequence aligner⁵⁸ (commit ad8aeba). We set the following parameters:

wfmash HPRCy1.fa -s 50k -l 250k -p 95 -n 93 -Y '#' -H 0.001 -m

We used a segment seed length of 50 kb (-s), requiring homologous regions at least ~250 kb long (-l) and an estimated nucleotide identity of at least ~95% (-p). With 94 haplotypes in total, we kept up to 93 mappings for each contig (-n). Moreover, we skipped mappings when the query and target had the same prefix before the ‘#’ character (-Y), that is, when both belonged to the same haplotype. To properly map through repetitive regions, only the most frequent 0.001% of kmers were ignored (-H). We skipped the base-level alignment (-m). We also generated pairwise mappings with the same parameters, but using a segment seed length of 10 kb and requiring homologous regions at least ~50 kb long.

From the resulting mappings, we excluded those involving contigs shorter than 1 Mb to reduce the possibility of spurious matches. We then used the paf2net.py Python script (delivered in the PGGB repository) to build a graph representation of the result (a mapping graph), with nodes and edges representing contigs and mappings between them, respectively.

python3 ~/pggb/scripts/paf2net.py -p HPRCy1.1Mbps.paf

The script produces a file representing the edges, a file representing the edge weights, and a file to map graph nodes to sequence names. The weight of an edge is given by the product of the length and the nucleotide identity of the corresponding mapping (higher weights were associated with longer mappings at higher identities). Finally, we used the net2communities.py Python script (delivered in the PGGB repository) to apply the Leiden algorithm⁵⁶, implemented in the igraph tools⁵⁹, to detect the underlying communities in the mapping graph.

python3 ~/pggb/scripts/net2communities.py \
  -e HPRCy1.1Mbps.edges.list.txt \
  -w HPRCy1.1Mbps.edges.weights.txt \
  -n HPRCy1.1Mbps.vertices.id2name.chr.txt --accurate-detection

To identify which chromosomes were represented in each community, we partitioned all contigs by mapping them against both T2T-CHM13v1.1 and GRCh38 human reference genomes with WFMASH, this time requiring homologous regions at least 150 kb long and nucleotide identity of at least 90%.

wfmash chm13+grch38.fa HPRCy1.fa -s 50k -l 150k -p 90 -n 1 -H 0.001 -m -N

We disabled the contig splitting (-N) during mapping to obtain homologous regions covering the whole contigs. For the unmapped contigs, we repeated the mapping with the same parameters, but allowing the contig splitting (without specifying -N). We labelled contigs ‘p’ or ‘q’ depending on whether they cover the short arm or the long arm of the chromosome they belonged to. Contigs fully spanning the centromeres were labelled ‘pq’. We used such labels to identify the chromosome composition of the communities detected in the mapping graph obtained without reference sequences, and to annotate the nodes in the mapping graph.

To obtain a clean visualization of the homology relationships between the HPRC assemblies, we generated a simpler mapping graph by using the same parameters used for the main graph, but keeping up to 3 mappings for each contig and adding the T2T-CHM13 reference genome version 2, which also includes the complete HG002 chromosome Y (https://www.ncbi.nlm.nih.gov/assembly/GCF_009914755.1):

wfmash HPRCy1+chm13v2.fa -s 50k -l 250k -p 95 -n 3 -Y '#' -H 0.001 -m -w 5000

We set window size for sketching equal to 5000 (-w) to reduce the runtime by sampling fewer kmers. We used the paf2net.py Python script to build the mapping graph and then used Gephi⁶⁰ (version 0.9.4) to visualize it. We computed the mapping graph layout by running ‘Random Layout’ and then the ‘Yifan Hu’ algorithm.

Pangenome graph building

For each of the 47 T2T phased diploid de novo assemblies, we mapped all contigs against the T2T-CHM13 human reference genome with the WFMASH sequence aligner (commit ad8aeba). For the HG002 sample, we included two assemblies: the HG002-HPRCy1 phased diploid de novo assembly (built with HiFi reads) and a phased diploid de novo assembly based on both HiFi and ONT reads, built with the Verkko assembler. We set the following parameters:

wfmash chm13.fa assembly.fa -s 50k -l 150k -p 90 -n 1 -H 0.001 -m

We used segment seed length of 50 kb (-s), requiring homologous regions at least ~150 kb long (-l) and estimated nucleotide identity of at least ~90% (-p). We kept only one mapping (the best one) for each contig (-n). To properly map through repetitive regions, only the 0.001% of the most frequent kmers were ignored (-H). We skipped the base-level alignment (-m). For the HG002-HPRCy1 contigs, we disabled the contig splitting (-N).

Then, we identified contigs originating from acrocentric chromosomes and covering both the short and long arms of the chromosome they belonged to. We considered only contigs with mappings at least 1 kb long on both arms and at least 1 Mb away from the centromere. We call such contigs ‘p–q acrocentric contigs’. For HG002-HPRCy1, we considered all contigs at least 300 kb long, regardless of whether they covered both arms of their chromosome.
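
The p–q selection rule can be illustrated with a small sketch. It assumes that each mapping is given as a (start, end) interval on the reference chromosome, that the p-arm precedes the centromere in reference coordinates, and that ‘at least 1 Mb away’ means the whole mapping lies at least 1 Mb from the centromere boundary. The centromere coordinates and function names below are invented for the example.

```python
MIN_MAPPING_LEN = 1_000       # mappings must be at least 1 kb long
MIN_CEN_DISTANCE = 1_000_000  # and at least 1 Mb away from the centromere


def arms_covered(mappings, cen_start, cen_end):
    """Return the set of arms ('p', 'q') supported by qualifying mappings.

    mappings: iterable of (start, end) reference intervals for one contig.
    cen_start, cen_end: centromere interval on the same reference chromosome.
    """
    arms = set()
    for start, end in mappings:
        if end - start < MIN_MAPPING_LEN:
            continue  # too short to count
        if end <= cen_start - MIN_CEN_DISTANCE:
            arms.add("p")
        elif start >= cen_end + MIN_CEN_DISTANCE:
            arms.add("q")
    return arms


def is_pq_contig(mappings, cen_start, cen_end):
    """True if the contig has qualifying mappings on both arms."""
    return arms_covered(mappings, cen_start, cen_end) == {"p", "q"}


# Hypothetical centromere at 15-17 Mb.
CEN = (15_000_000, 17_000_000)
spanning = [(10_000_000, 12_000_000), (19_000_000, 30_000_000)]
too_close = [(10_000_000, 12_000_000), (17_500_000, 17_900_000)]
```

Here `spanning` qualifies as a p–q contig, whereas `too_close` does not because its only q-arm mapping lies within 1 Mb of the centromere.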

Finally, we built a pangenome graph with all the p–q acrocentric contigs and both T2T-CHM13 and GRCh38 human reference genomes by applying PGGB²² (commit a4a6668). We set the following parameters:

pggb -i contigs.fa.gz -s 50k -l 250k -p 98 -n 162 -F 0.001 -k 311 -G 13117,13219 -O 0.03

We used a segment seed length of 50 kb (-s), requiring homologous regions at least ~250 kb long (-l) and an estimated nucleotide identity of at least ~98% (-p). Given 142 p–q acrocentric contigs in input (132 from HPRCy1 and 10 from HG002-Verkko), plus 10 acrocentric chromosomes from the T2T-CHM13 and GRCh38 reference genomes, plus 49 HG002-HPRCy1 contigs representing 10 further acrocentric haplotypes (5 maternal and 5 paternal), we kept up to 162 mappings (142 + 10 + 10) for each contig (-n). To properly map through repetitive regions, only the most frequent 0.001% of kmers were ignored (-F). We filtered out alignment matches shorter than 311 bp to remove possible spurious relationships caused by short repeated homologies (-k). We set large target sequence lengths and a small sequence padding for two rounds of graph normalization (-G and -O). To visualize the acrocentric pangenome graph, we built the graph layout with ODGI LAYOUT²⁶ (commit e2de6cb) and visualized it with GFAESTUS⁶¹ (commit 50fe37a). This renders sequences and chains of small variants as linear structures, while repeats caused by segmental duplications, inversions and other structural variants tend to form loops.

Pangenome graph untangling

We untangled the pangenome graph by applying ODGI UNTANGLE (commit e2de6cb). Practically, we projected the graph into an alignment between a set of query (HPRCy1 contigs) and reference (T2T-CHM13) sequences. We set the following parameters:

odgi untangle -i graph.og -e 50000 -m 1000 -n 100 -j 0 -R targets.txt -d cuts.txt

We segmented the graph into regular-sized regions of ~50 kb (-e), merging regions shorter than 1 kb (-m). We reported up to the 100th best target mapping for each query segment (-n), not applying any threshold for the Jaccard similarity (-j). We used all paths in the graphs as queries and projected them against the five acrocentric chromosomes of the T2T-CHM13 genome (-R). Moreover, we emitted the cut points used to segment the graph (-d).

For each query segment, if there were multiple best hits against different targets (that is, hits with the same, highest Jaccard similarity), we put as the first one the hit having as target the chromosome of origin of the query (obtained from the chromosome partitioning of the contigs).
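
The tie-breaking rule above can be sketched as follows. The hit representation (a list of target and Jaccard pairs) and the function name are assumptions made for illustration and do not correspond to the ODGI output format.

```python
def best_hit_first(hits, chrom_of_origin):
    """Rank hits by Jaccard similarity; among hits tied for the best score,
    move the one targeting the query's chromosome of origin to the front.

    hits: list of (target_chromosome, jaccard) pairs for one query segment.
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: -h[1])  # stable: ties keep input order
    top_score = ranked[0][1]
    for i, (target, score) in enumerate(ranked):
        if score != top_score:
            break  # past the block of tied best hits
        if target == chrom_of_origin:
            ranked.insert(0, ranked.pop(i))
            break
    return ranked


hits = [("chr14", 0.9), ("chr21", 0.9), ("chr13", 0.7)]
ranked = best_hit_first(hits, "chr21")
```

With these toy values, chr14 and chr21 tie for the best score, so the hit on chr21 (the chromosome of origin) is placed first. A lower-scoring hit on the chromosome of origin is never promoted above a better hit.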

We repeated the graph untangling another five times, but constrained the algorithm to use only one of the acrocentric chromosomes of T2T-CHM13 as a target at a time (-r) and return the best-matching hit (-n).

odgi untangle -i graph.og -e 50000 -m 1000 -n 1 -j 0 -r chr13 -c cuts.txt
odgi untangle -i graph.og -e 50000 -m 1000 -n 1 -j 0 -r chr14 -c cuts.txt
odgi untangle -i graph.og -e 50000 -m 1000 -n 1 -j 0 -r chr15 -c cuts.txt
odgi untangle -i graph.og -e 50000 -m 1000 -n 1 -j 0 -r chr21 -c cuts.txt
odgi untangle -i graph.og -e 50000 -m 1000 -n 1 -j 0 -r chr22 -c cuts.txt

We used the cut points generated when using all of the acrocentric chromosomes of T2T-CHM13 as targets (-c). In this way, all untangling runs (six in total) used the same cut points for the segment boundaries.

Finally, we ‘grounded’ the untangled output generated with all acrocentric chromosomes as targets: in more detail, each untangled query segment was placed against a particular acrocentric chromosome (not only the best-matching one) by using the untangled outputs constrained to a single target. We split the result by acrocentric chromosome and kept only queries untangling both p- and q-arms of the targets. Furthermore, we removed query segments overlapping regions flagged in the assemblies as unreliable (that is, having coverage issues) by FLAGGER⁴. FLAGGER is a HiFi read-based pipeline that detects different types of mis-assemblies within a phased diploid assembly by identifying read-mapping coverage inconsistencies across the maternal and paternal haplotypes. To focus on the more similar query-target hits, we used the Jaccard metric to estimate the sequence identity by applying the corrected formula reported in ref. ⁶², and retained only results at greater than 90% estimated identity. To analyse the orientation status of HPRCy1 contigs in the segmental duplication centred on the SST1 array, we generated a new pangenome graph with ODGI FLIP (commit 0b21b35). In more detail, we first flipped paths around if they tend to be in the reverse complement orientation relative to the pangenome graph. This leads to having a uniform orientation for the HPRCy1 contigs, all in forward orientation with respect to the graph. Then, we untangled the flipped graph in the same way as described above. We displayed the untangling results for each acrocentric chromosome with the R development environment (version 3.6.3), equipped with the following packages: tidyverse (version 1.3.0), RColorBrewer (version 1.1.2), ggplot2 (version 3.3.3) and ggrepel (version 0.9.1).

Recombination pattern analysis

Aggregating best-hit untangle results

For each group of HPRCy1-acro contigs anchored to a T2T-CHM13 acrocentric q-arm, we counted the number of contigs having as best-hit each one of the acrocentrics. In particular, for each base position of the T2T-CHM13 acrocentric chromosome of each group, we quantified how many times each of the acrocentrics appeared as best-hit in the pangenome graph untangling. We considered only best hits with an estimated identity of at least 90%.

Regional homology entropy

To quantify the degree of disorder in the untangling result, we calculated the diversity entropy across the different acrocentrics that were present as best-hit. In more detail, we projected each HPRCy1 acrocentric p–q contig against the T2T-CHM13 acrocentric to which it is anchored via the q-arm and associated each reference base position to the corresponding acrocentric best-hit. We considered only best hits with an estimated identity of at least 90%. Then, we computed the Shannon diversity index (SDI) in windows 50 kb long over the contigs. We used −1 as missing SDI value in the regions where the contigs do not match any targets. For each group of contigs, we aggregated the SDI results by computing their average (ignoring the missing SDI values) for each reference base position. We call this metric the regional homology entropy, and it serves to show regions where contigs can be described as mosaics of different reference chromosomes. However, it cannot distinguish regions where there are different orders of reference chromosome similarity (which might be indicative of recombination exchange) from those where there is regional diversity in each contig’s relationship to T2T-CHM13. The latter case could occur if T2T-CHM13 itself contains rare recombinations between acrocentrics, or where ancient homology might result in ‘noise’ in untangling alignments as contigs pick from two equally good alternative mappings. To avoid these pitfalls and establish a more stringent graph-space recombination metric, we then extended the untangling diversity metric to operate on multiple mappings.
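
A windowed Shannon diversity index of this kind can be sketched as below. The segment representation, the use of natural logarithms and the weighting of each best-hit chromosome by the number of bases it covers within the window are assumptions made for the example, not details confirmed by the text.

```python
import math

WINDOW = 50_000  # 50-kb windows
MISSING = -1.0   # value used where nothing matches


def shannon(weights):
    """Shannon diversity index H = sum_i p_i * ln(1 / p_i)."""
    total = sum(weights)
    return sum((w / total) * math.log(total / w) for w in weights if w > 0)


def window_sdi(segments, win_start, win_size=WINDOW):
    """SDI of best-hit chromosomes within one window of one contig.

    segments: (start, end, best_hit_chromosome) in reference coordinates,
    already filtered to hits with an estimated identity of at least 90%.
    """
    win_end = win_start + win_size
    covered = {}
    for start, end, hit in segments:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            covered[hit] = covered.get(hit, 0) + overlap
    if not covered:
        return MISSING
    return shannon(list(covered.values()))


def mean_sdi(values):
    """Average SDI across contigs, ignoring missing values."""
    kept = [v for v in values if v != MISSING]
    return sum(kept) / len(kept) if kept else MISSING


# Toy contig: half of the first window matches chr13, half matches chr21.
mosaic = [(0, 25_000, "chr13"), (25_000, 50_000, "chr21")]
uniform = [(0, 50_000, "chr13")]
```

A window split evenly between two best-hit chromosomes has entropy ln 2, a window with a single best-hit chromosome has entropy 0, and a window with no hits is reported as −1 and skipped when averaging.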

Positional homology entropy

To take into account the other hits in addition to the first one, including their order, we generalized the diversity entropy metric to work over orders of the top 5 untangling hits and consider all contigs jointly. For each reference segment, we collected the corresponding best 5 untangling hits for each of the HPRCy1 acrocentric p–q contigs; this is possible because the reference segments are stable across all contigs. We considered only best hits with an estimated identity of at least 90%. To avoid driving the untangle entropy by intra-chromosomal similarity caused by segmental duplications modelled in the structure of the PVG (as seen in loops on chr. 13q, chr. 15q and chr. 22q; Fig. 2), we ignored consecutive duplicate target hits; in other words, we took the ordered set of unique reference targets. When multiple contig segments were grounded against the same reference segment, we considered the first contig segment having the best grounding, that is, having the highest estimated identity when placed against the current reference segment. Then, we ranked the five best hits by estimated similarity. Finally, for each reference segment, we computed the SDI across all available five best hits orders. We used −1 as missing SDI value in the reference regions without any contig matches. We kept in the output also the information about how many HPRCy1 acrocentric p–q contigs contributed to the entropy computation in each reference segment. This yielded the positional homology entropy.
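
The order-based entropy can be sketched as follows, under stated assumptions: hits are (target, identity) pairs, duplicates are removed by keeping the first occurrence of each target among the five best hits, and the SDI is computed with natural logarithms over the distinct hit orders observed across contigs. The function names are hypothetical.

```python
import math
from collections import Counter

MISSING = -1.0


def hit_order(hits, top_n=5, min_identity=0.90):
    """Ordered tuple of unique targets among the top_n hits of one contig.

    hits: list of (target_chromosome, estimated_identity) pairs.
    """
    kept = [h for h in hits if h[1] >= min_identity]
    ranked = sorted(kept, key=lambda h: -h[1])[:top_n]
    # dict.fromkeys keeps the first occurrence of each target, in rank order.
    return tuple(dict.fromkeys(target for target, _ in ranked))


def positional_entropy(orders):
    """SDI over the hit orders that contigs show at one reference segment."""
    orders = [o for o in orders if o]  # contigs without any hit are skipped
    if not orders:
        return MISSING
    n = len(orders)
    counts = Counter(orders)
    return sum((c / n) * math.log(n / c) for c in counts.values())


# Toy reference segment seen by four contigs.
same_order = [("chr13", "chr21")] * 4
mixed_order = [("chr13", "chr21")] * 2 + [("chr21", "chr13")] * 2
example_order = hit_order(
    [("chr13", 0.99), ("chr13", 0.98), ("chr21", 0.95), ("chr14", 0.80)]
)
```

When every contig ranks the reference chromosomes in the same order the entropy is 0. When contigs disagree on the order, as in `mixed_order`, the entropy is positive, which is the signal used to flag candidate regions.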

PHR derivation

To obtain the PHRs, we aggregated the final results by considering regions with positional homology entropy greater than 0 and supported by at least 1 contig, merging with BEDtools⁶³ those that were less than 30 kb away, and removing merged regions shorter than 30 kb.
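
The merge-and-filter step can be sketched in a few lines. This is an illustration of the interval logic only and does not reproduce the BEDtools invocation. It assumes half-open (start, end) intervals on a single chromosome that have already been filtered for positive entropy and contig support.

```python
def derive_regions(intervals, max_gap=30_000, min_len=30_000):
    """Merge intervals less than max_gap apart, then drop short merged regions.

    intervals: iterable of (start, end) on one chromosome.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_len]


candidates = [(0, 20_000), (40_000, 60_000), (200_000, 210_000)]
regions = derive_regions(candidates)
```

In the toy input, the first two intervals are 20 kb apart and merge into one 60-kb region, while the isolated 10-kb interval is discarded.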

Display of untangle mosaics

We displayed the aggregated results for each acrocentric chromosome. We used genome annotations for the first 25 Mb of each acrocentric chromosome, using T2T-CHM13v2.0 UCSC trackhub (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1). We made the figures with the scripts available at https://github.com/pangenome/chromosome_communities/tree/main/scripts. To plot the figures, we used R (version 3.6.3), equipped with the following packages: tidyverse (version 1.3.0), RColorBrewer (version 1.1.2), ggplot2 (version 3.3.3) and ggrepel (version 0.9.1). Finally, we used Inkscape (https://inkscape.org/) to compose main text figures based on the results, and provide supplementary figures directly produced by these methods.

HPRCy1 SNP density plots

We displayed biallelic SNP density in the full HPRCy1 draft pangenome built with PGGB⁴ versus both GRCh38 and T2T-CHM13. To do so, we extracted biallelic SNPs from the released VCF files versus each chromosome, for both references (get_bisnp.sh). Because T2T-CHM13 version 1.1, which was used in HPRCy1, does not have a Y chromosome, we used that of GRCh38, which includes masked PAR1 and PAR2 regions. We displayed biallelic SNP density in bins of 100 kb, using R (version 4.1.1) and tidyverse (version 1.3.1) package (plot_bisnp_dens.R).

ROB breakpoints

We mapped BAC clones from ref. ⁸ against the T2T-CHM13 human reference genome with the WFMASH sequence aligner (commit ad8aeba). We kept only mappings covering acrocentric chromosomes and with an estimated identity of at least 90%. To plot the figures, we used R (version 3.6.3), equipped with the following packages: tidyverse (version 1.3.0), RColorBrewer (version 1.1.2), ggplot2 (version 3.3.3) and ggrepel (version 0.9.1). We coloured BAC clones’ mappings according to ref. ⁸.

Maximum likelihood phylogenetic analysis

We conducted the phylogenetic analysis by using the maximum likelihood method based on the best-fit substitution model (Kimura 2-parameter +G, parameter = 5.5047) inferred by Jmodeltest2⁶⁴ with 1,000 bootstrap replicates. Bootstrap values higher than 75 are indicated at the base of each node.

Recombination hotspots analysis

We obtained the human PRDM9 binding motifs (17 in total) from ref. ⁴¹ and used FIMO⁶⁵ to scan their occurrences in T2T-CHM13v2.0 human reference genome:

fimo --thresh 1.0E-4 PRDM9_motifs.human.txt chm13v2.fa

FIMO computes a log-likelihood ratio score for each motif with respect to each sequence position and converts these scores to P values using dynamic programming (assuming a zero-order null model in which sequences are generated at random with user-specified per-letter background frequencies) and then estimates false discovery rates⁶⁵. Each motif is associated with a measure of how likely it represents a true binding target for PRDM9. We retained for downstream analyses only motifs for which such a measure is at least 70% (14 of 17). For each motif, we counted the number of occurrences present in windows 20 kb long across each T2T-CHM13v2.0 chromosome by using BEDtools⁶³.

bedtools intersect -a chm13v2.windows_20kbp.bed -b fimo_output.motif$i.bed
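
The log-likelihood ratio scoring that underlies the motif scan can be illustrated with a toy position weight matrix. This sketch is not FIMO's implementation: it scans one strand only, omits the conversion to P values, and uses a two-column motif and uniform background frequencies that are invented for the example.

```python
import math


def log_odds_scores(sequence, pwm, background):
    """Log-likelihood ratio score of a motif at every start position.

    pwm: list of {base: probability} dictionaries, one per motif column
         (all probabilities must be greater than zero).
    background: {base: probability} for the zero-order null model.
    """
    width = len(pwm)
    scores = []
    for i in range(len(sequence) - width + 1):
        score = 0.0
        for j, column in enumerate(pwm):
            base = sequence[i + j]
            score += math.log2(column[base] / background[base])
        scores.append(score)
    return scores


# Hypothetical two-column motif that prefers "AC".
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
scores = log_odds_scores("GACT", toy_pwm, uniform_bg)
best_start = scores.index(max(scores))
```

Positions whose score is positive resemble the motif more than the background. In the toy sequence the highest score falls at the start of the "AC" dinucleotide.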

To plot the figures, we used R (version 3.6.3), equipped with the following packages: tidyverse (version 1.3.0), RColorBrewer (version 1.1.2), ggplot2 (version 3.3.3) and ggrepel (version 0.9.1).

Linkage disequilibrium analysis

We identified variants embedded in the pangenome graph by using VG DECONSTRUCT⁵:

vg deconstruct -P chm13 -H '?' --ploidy 1 -e -a graph.gfa > variants.vcf

We called variants with respect to the T2T-CHM13 reference genome (-P), reporting variants for each HPRCy1 acrocentric p–q contig (-H and --ploidy). We considered only traversals that correspond to paths (that is, contigs) in the graph (-e) and also reported nested variation (-a). From the variant set, we considered only single nucleotide variants. We estimated linkage disequilibrium between pairs of markers within 70 kb by using PLINK v1.9⁶⁶ upon specification of haploid sets and retaining all values of r² > 0 (plot_ld_1.R). Finally, we generated binned linkage disequilibrium decay plots with confidence intervals using R (version 3.6.3), focusing on pairs less than 4 kb apart.
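
For haploid data, the r² statistic between two biallelic markers follows directly from allele and haplotype frequencies. The sketch below shows that textbook calculation on toy haplotypes. It is not the PLINK implementation, and the function name and inputs are hypothetical.

```python
def r_squared(a, b):
    """Linkage disequilibrium r^2 between two biallelic markers.

    a, b: equal-length lists of 0/1 alleles, one entry per haplotype.
    Returns None when either marker is monomorphic.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium D
    return d * d / denom


linked = r_squared([0, 0, 1, 1], [0, 0, 1, 1])
unlinked = r_squared([0, 0, 1, 1], [0, 1, 0, 1])
```

Markers that always occur together give r² = 1, while markers whose alleles are independent across haplotypes give r² = 0. Faster decay of r² with distance is the pattern interpreted here as a higher recombination rate.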

Validating homology mosaicism

HG002-Verkko assembly

We applied an earlier version of Verkko (beta 1, commit vd3f0b941b5facf5807c303b0c0171202d83b7c74) to build a diploid assembly graph for the HG002 cell line using the HiFi (105x) and ONT (85x) reads as described⁶⁷. The resulting assembly graph resolves the proximal junction in single contigs for each haplotype, spanning up to multiple megabases, while the distal junctions remain to be resolved. We used homopolymer-compressed markers from the parental Illumina reads to assign unitigs to the maternal or paternal haplotype, or marked them as ambiguous when not enough markers supported either haplotype. To estimate the number of times a unitig has to be visited, we aligned HiFi and ONT reads to the assembly graph using GraphAligner with the following parameters: --seeds-mxm-length 30 --seeds-mem-count 10000 -b 15 --multimap-score-fraction 0.99 --precise-clipping 0.85 --min-alignment-score 5000 --clip-ambiguous-ends 100 --overlap-incompatible-cutoff 0.15 --max-trace-count 5 --hpc-collapse-reads --discard-cigar⁶⁸. Four distal junctions were connected to the rDNA arrays with ambiguous nodes connecting the maternal and paternal nodes, supporting that they belong to the same chromosome. Two distal junction unitigs, one maternal and one paternal, were disconnected from each other but connected to the rDNA arrays, which were assigned to the same chromosome. Using the marker and ONT alignments, we identified paths in the graph and assigned them according to the most supported haplotype. If only ambiguous nodes were present between the haplotype-assigned unitigs, with no ONT reads to resolve the path, nodes were randomly assigned to one haplotype to build the contig. After all paths were identified, we produced the consensus using verkko --assembly <path-to-original-assembly> --paths <path-to-paths>. The entire procedure to produce parental markers, tag unitigs according to their haplotype on the assembly graph and find the path using ONT reads is now available in the latest Verkko (v1) in a more automated way.

Untangling validation

To provide cross-validation of the HiFi contig assemblies and our analysis of them, we compared the untangling of two assemblies of the same sample (HG002). One was made with the HPRCy1 pipeline, while the other used the Verkko diploid T2T assembler. Verkko employs ONT to untangle ambiguous regions in a HiFi-based assembly graph, automating techniques first developed in production of T2T-CHM13. Verkko’s assembly aggregates information from ONT, thus providing a single integrated target for cross-validation of our analysis using an alternative sequencing and assembly approach.

We validated the results of the pangenome untangling by comparing the best hits of the two HG002 assemblies, the one built with HiFi reads and the other based on both HiFi and ONT reads, built with the Verkko assembler⁶⁷. For each base position of each T2T-CHM13 acrocentric chromosome, we compared the untangling best-hit of the HG002-HPRCy1 contigs with the best-hit supported by HG002-Verkko contigs. We considered only best hits with an estimated identity of at least 90%. We defined reference regions as concordant when both HG002 assemblies supported the same T2T-CHM13 acrocentric as best-hit. We treated the two haplotypes (maternal and paternal) separately.
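
The concordance measure can be sketched as the fraction of compared reference positions at which both assemblies report the same best-hit chromosome. The per-position dictionaries and the function name are simplifications assumed for the example. A practical implementation would more likely work on intervals.

```python
def concordance(track_a, track_b, min_identity=0.90):
    """Fraction of shared positions where two assemblies agree on the best hit.

    track_a, track_b: {reference_position: (best_hit_chromosome, identity)}.
    Positions missing from either track, or with an estimated identity below
    min_identity in either track, are not compared.
    Returns None if no position can be compared.
    """
    compared = 0
    agreeing = 0
    for pos, (hit_a, id_a) in track_a.items():
        if pos not in track_b:
            continue
        hit_b, id_b = track_b[pos]
        if id_a < min_identity or id_b < min_identity:
            continue
        compared += 1
        if hit_a == hit_b:
            agreeing += 1
    return agreeing / compared if compared else None


# Toy tracks for one haplotype of one reference chromosome.
hifi_only = {1: ("chr13", 0.99), 2: ("chr21", 0.97), 3: ("chr14", 0.95), 4: ("chr13", 0.80)}
hifi_ont = {1: ("chr13", 0.99), 2: ("chr21", 0.98), 3: ("chr13", 0.96), 4: ("chr13", 0.99)}
agreement = concordance(hifi_only, hifi_ont)
```

In the toy tracks, position 4 is excluded by the identity filter and two of the remaining three positions agree.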

We observe a high degree of concordance between the two methods at the level of the chromosome homology mosaicism plots. The best-hit untangling shows similar patterns in the HG002-Verkko assembly as those seen in HG002-HPRCy1 (Supplementary Figs. 9–13). However, some SAAC haplotypes appear to be poorly assembled in HG002-HPRCy1. On the q-arms we measured 99.93% concordance between HG002-HPRCy1 and HG002-Verkko untangling results, but only 87.45% concordance on the p-arms (Supplementary File 8). This lower level is consistent with greater difficulty in assembling the SAACs due to their multiple duplicated sequences (including the PHRs), satellite arrays and the rDNA. We found the discordance was driven by a single chromosome haplotype: while most p-arms achieve around 90% concordance, HG002-HPRCy1 14p-maternal exhibits a high degree of discordance in the assemblies (66.19% concordance) (Supplementary Fig. 19). Although this n = 1 validation focuses on only 10 haplotypes, it incorporates many independent reads provided by deep HiFi (105x) and ONT (85x) data used in HG002-Verkko. We thus have compared the concordance between structures observed in single molecule reads across the SAACs and the HG002-HPRCy1 assembly that represents the HiFi-only assembly process that produced our pangenome.

However, this analysis should be seen as presenting a lower bound on our process accuracy. We are considering all HG002-HPRCy1 contigs that map to the acrocentrics—not only those that would meet our centromere-crossing requirement, and HG002-HPRCy1 is itself more fragmented than the other assemblies which we have selected for the acro-PVG⁴. The fragmented nature of its contigs (only one from chr. 22 meets our p–q mapping requirement) may introduce additional disagreements with HG002-Verkko. The overall result indicates that most patterns observed in HiFi-only assemblies are likely to be supported by an automated near-T2T assembly of the same sample.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Assemblies produced by the HPRC are available at AnVIL (https://anvilproject.org/), in the AnVIL_HPRC workspace. Data are also available as part of the AWS Open Data Program (https://registry.opendata.aws/) in the human-pangenomics S3 bucket (https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html). In addition, the data have been uploaded to the International Nucleotide Sequence Database Collaboration (INSDC) for long-term storage and availability. Supporting information about the data (including index files with S3 and GCP file locations) can be found at the following GitHub repository: https://github.com/human-pangenomics/HPP_Year1_Assemblies. All supplementary files, including the PVG and its layout, are available on Zenodo at https://doi.org/10.5281/zenodo.7692554.

Code availability

Code and links to methods and tools used to perform all the analyses and produce all the figures can be found on Zenodo at https://doi.org/10.5281/zenodo.7697614.

References

Floutsakou, I. et al. The shared genomic architecture of human nucleolar organizer regions. Genome Res. 23, 2003–2012 (2013).
van Sluis, M. et al. Human NORs, comprising rDNA arrays and functionally conserved distal elements, are located within dynamic chromosomal regions. Genes Dev. 33, 1688–1701 (2019).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Liao, W.-W. et al. A draft human pangenome reference. Nature https://doi.org/10.1038/s41586-023-05896-x (2023).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
Huttley, G. A., Smith, M. W., Carrington, M. & O’Brien, S. J. A scan for linkage disequilibrium across the human genome. Genetics 152, 1711–1722 (1999).
Jarmuz-Szymczak, M., Janiszewska, J., Szyfter, K. & Shaffer, L. G. Narrowing the localization of the region breakpoint in most frequent Robertsonian translocations. Chromosome Res. 22, 517–532 (2014).
Hamerton, J. L., Canning, N., Ray, M. & Smith, S. A cytogenetic survey of 14,069 newborn infants. I. Incidence of chromosome abnormalities. Clin. Genet. 8, 223–243 (1975).
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Mack, H. & Swisshelm, K. in Brenner’s Encyclopedia of Genetics 2nd edn (eds Maloy, S. & Hughes, K.) 301–305 (Academic Press, 2013).
Spinner, N. B. in Brenner’s Encyclopedia of Genetics 2nd edn (eds Maloy, S. & Hughes, K.) 546–548 (Academic Press, 2013).
Lindström, M. S. et al. Nucleolus as an emerging hub in maintenance of genome stability and cancer pathogenesis. Oncogene 37, 2351–2366 (2018).
Kobayashi, T. Regulation of ribosomal RNA gene copy number and its role in modulating genome integrity and evolutionary adaptability in yeast. Cell. Mol. Life Sci. 68, 1395–1403 (2011).
Holm, P. B. & Rasmussen, S. W. Human meiosis I. The human pachytene karyotype analyzed by three dimensional reconstruction of the synaptonemal complex. Carlsberg Res. Commun. 42, 283 (1977).
Choo, K. H., Vissel, B., Brown, R., Filby, R. G. & Earle, E. Homologous alpha satellite sequences on human acrocentric chromosomes with selectivity for chromosomes 13, 14 and 21: implications for recombination between nonhomologues and Robertsonian translocations. Nucleic Acids Res. 16, 1273–1284 (1988).
Peng, Z. et al. Correlation between frequency of non-allelic homologous recombination and homology properties: evidence from homology-mediated CNV mutations in the human genome. Hum. Mol. Genet. 24, 1225–1233 (2015).
Greig, G. M., Warburton, P. E. & Willard, H. F. Organization and evolution of an alpha satellite DNA subset shared by human chromosomes 13 and 21. J. Mol. Evol. 37, 464–475 (1993).
Jørgensen, A. L., Kølvraa, S., Jones, C. & Bak, A. L. A subfamily of alphoid repetitive DNA shared by the NOR-bearing human chromosomes 14 and 22. Genomics 3, 100–109 (1988).
Cotter, D. J., Brotman, S. M. & Wilson Sayres, M. A. Genetic diversity on the human X chromosome does not support a strict pseudoautosomal boundary. Genetics 203, 485–492 (2016).
Ross, M. T. et al. The DNA sequence of the human X chromosome. Nature 434, 325–337 (2005).
Garrison, E. & Guarracino, A. et al. Building pangenome graphs. Preprint at bioRxiv 2023.04.05.535718 https://doi.org/10.1101/2023.04.05.535718 (2023).
Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
Eizenga, J. M. et al. Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020).
Sirén, J. et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 374, abg8871 (2021).
Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).
Marco-Sola, S. et al. Optimal gap-affine alignment in O(s) space. Bioinformatics 39, btad074 (2023).
Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics 39, btac743 (2023).
Cheng, E. Y. & Naluai-Cecchini, T. FISHing for acrocentric associations between chromosomes 14 and 21 in human oogenesis. Am. J. Obstet. Gynecol. 190, 1781–1785 (2004).
Epstein, N. D. et al. A new moderately repetitive DNA sequence family of novel organization. Nucleic Acids Res. 15, 2327–2341 (1987).
Hoyt, S. J. et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376, eabk3112 (2022).
Tremblay, D. C., Alexander, G. Jr, Moseley, S. & Chadwick, B. P. Expression, tandem repeat copy number variation and stability of four macrosatellite arrays in the human genome. BMC Genomics 11, 632 (2010).
González, B. et al. Somatic hypomethylation of pericentromeric SST1 repeats and tetraploidization in human colorectal cancer cells. Cancers 13, 5353 (2021).
Henderson, A. S., Warburton, D. & Atwood, K. C. Ribosomal DNA connectives between human acrocentric chromosomes. Nature 245, 95–97 (1973).
Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).
Kinene, T., Wainaina, J., Maina, S. & Boykin, L. M. in Encyclopedia of Evolutionary Biology (ed. Kliman, R. M.) 489–493 (Academic Press, 2016).
Arenas, M. The importance and application of the ancestral recombination graph. Front. Genet. 4, 206 (2013).
Veerappa, A. M., Padakannaya, P. & Ramachandra, N. B. Copy number variation-based polymorphism in a new pseudoautosomal region 3 (PAR3) of a human X-chromosome-transposed region (XTR) in the Y chromosome. Funct. Integr. Genomics 13, 285–293 (2013).
Paigen, K. & Petkov, P. M. PRDM9 and its role in genetic recombination. Trends Genet. 34, 291–300 (2018).
Zickler, D. & Kleckner, N. Recombination, pairing, and synapsis of homologs during meiosis. Cold Spring Harb. Perspect. Biol. 7, a016626 (2015).
Altemose, N. et al. A map of human PRDM9 binding provides evidence for novel behaviors of PRDM9 and other zinc-finger proteins in meiosis. eLife 6, e28383 (2017).
Nambiar, M. & Smith, G. R. Repression of harmful meiotic recombination in centromeric regions. Semin. Cell Dev. Biol. 54, 188–197 (2016).
Paten, B. et al. Superbubbles, ultrabubbles, and cacti. J. Comput. Biol. 25, 649–663 (2018).
Beichman, A. C., Phung, T. N. & Lohmueller, K. E. Comparison of single genome and allele frequency data reveals discordant demographic histories. G3 7, 3605–3620 (2017).
Bosch, E. et al. Decay of linkage disequilibrium within genes across HGDP-CEPH human samples: most population isolates do not show increased LD. BMC Genomics 10, 338 (2009).
Roberts, P. A. Difference in the behaviour of eu- and hetero-chromatin: crossing-over. Nature 205, 725–726 (1965).
Ahuja, J. S., Harvey, C. S., Wheeler, D. L. & Lichten, M. Repeated strand invasion and extensive branch migration are hallmarks of meiotic recombination. Mol. Cell 81, 4258–4270.e4 (2021).
Guissani, U., Facchinetti, B., Cassina, G. & Zuffardi, O. Mitotic recombination among acrocentric chromosomes’ short arms. Ann. Hum. Genet. 60, 91–97 (1996).
Bandyopadhyay, R. et al. Mosaicism in a patient with Down syndrome reveals post-fertilization formation of a Robertsonian translocation and isochromosome. Am. J. Med. Genet. A 116A, 159–163 (2003).
Gay, J., Myers, S. & McVean, G. Estimating meiotic gene conversion rates from population genetic data. Genetics 177, 881–894 (2007).
Cole, F., Keeney, S. & Jasin, M. Comprehensive, fine-scale dissection of homologous recombination outcomes at a hot spot in mouse meiosis. Mol. Cell 39, 700–710 (2010).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Chen, S. et al. A genome-wide mutational constraint map quantified from variation in 76,156 human genomes. Preprint at bioRxiv 2022.03.20.485034 https://doi.org/10.1101/2022.03.20.485034 (2022).
Berríos, S. et al. Number and nuclear localisation of nucleoli in mammalian spermatocytes. Genetica 121, 219–228 (2004).
Berríos, S. & Fernández-Donoso, R. Nuclear architecture of human pachytene spermatocytes: quantitative analysis of associations between nucleolar and XY bivalents. Hum. Genet. 86, 103–116 (1990).
Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Helena Mangs, A. & Morris, B. J. The human pseudoautosomal region (PAR): origin, function and future. Curr. Genomics 8, 129–136 (2007).
Guarracino, A., Mwaniki, N., Marco-Sola, S. & Garrison, E. wfmash: a pangenome-scale pairwise aligner. Zenodo https://doi.org/10.5281/zenodo.6949373 (2021).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. Int. J. Complex Syst. 1695 (2006).
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. ICWSM 3, 361–362 (2009).
Fischer, C. & Garrison, E. chfi/gfaestus: a pangenome graph browser. Zenodo https://doi.org/10.5281/zenodo.6954036 (2022).
Belbasi, M., Blanca, A., Harris, R. S., Koslicki, D. & Medvedev, P. The minimizer Jaccard estimator is biased and inconsistent. Bioinformatics 38, i169–i176 (2022).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Darriba, D., Taboada, G. L., Doallo, R. & Posada, D. jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9, 772 (2012).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01662-6 (2023).
Rautiainen, M. & Marschall, T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020).


Acknowledgements

Our work depends on the HPRC draft human pangenome resource established in the accompanying Article⁴, and we thank the production and assembly groups for their efforts in establishing this resource. This work used the computational resources of the UTHSC Octopus cluster and NIH HPC Biowulf cluster. We acknowledge support in maintaining these systems that was critical to our analyses. The authors thank M. Miller for the development of a graphical synopsis of our study (Fig. 5); and R. Williams and N. Soranzo for support and guidance in the design and discussion of our work. This work was supported, in part, by National Institutes of Health/NIDA U01DA047638 (E.G.), National Institutes of Health/NIGMS R01GM123489 (E.G.), NSF PPoSS Award no. 2118709 (E.G. and C.F.), the Tennessee Governor’s Chairs programme (C.F. and E.G.), National Institutes of Health/NCI R01CA266339 (T.P., L.G.d.L. and J.L.G.), and the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (A.R., S.K. and A.M.P.). We acknowledge support from Human Technopole (A.G.), Consiglio Nazionale delle Ricerche, Italy (S.B. and V.C.), and Stowers Institute for Medical Research (T.P., L.G.d.L., B.R. and J.L.G.).

Author information

Authors and Affiliations

Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
Andrea Guarracino, Christian Fischer, Pjotr Prins, Flavia Villani, Vincenza Colonna & Erik Garrison
Genomics Research Centre, Human Technopole, Milan, Italy
Andrea Guarracino
Institute of Genetics and Biophysics, National Research Council, Naples, Italy
Silvia Buonaiuto & Vincenza Colonna
Stowers Institute for Medical Research, Kansas City, MO, USA
Leonardo Gomes de Lima, Tamara Potapova, Boris Rubinstein & Jennifer L. Gerton
Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Arang Rhie, Sergey Koren, Ann McCartney, Sergey Nurk, Mikko Rautiainen, Brian Walenz & Adam M. Phillippy
Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St Louis, MO, USA
Haley J. Abel
McDonnell Genome Institute, Washington University School of Medicine, St Louis, MO, USA
Lucinda L. Antonacci-Fulton, Sarah Cody, Robert S. Fulton, Wen-Wei Liao, Allison A. Regier & Chad Tomlinson
UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
Mobin Asri, Xian H. Chang, Mark Diekhans, Jordan M. Eizenga, Marina Haukness, David Haussler, Glenn Hickey, Julian K. Lucas, Charles Markello, Karen H. Miga, Jean Monlong, Adam M. Novak, Hugh E. Olsen, Benedict Paten, Trevor Pesout & Jouni Sirén
Google, Mountain View, CA, USA
Gunjan Baid, Anastasiya Belyaeva, Andrew Carroll, Pi-Chuan Chang, Daniel E. Cook, Alexey Kolesnikov, Maria Nattestad & Kishwar Shafin
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
Carl A. Baker, Evan E. Eichler, William T. Harvey, Kendra Hoekzema, Jennifer Kordosky, Alexandra P. Lewis, Katherine M. Munson, David Porubsky & Mitchell R. Vollger
European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
Konstantinos Billis, Susan Fairley, Paul Flicek, Adam Frankish, Carlos Garcia Giron, Leanne Haggerty, Thibaut Hourlier, Fergal J. Martin & Francesca Floriana Tricomi
Department of Human Genetics, McGill University, Montreal, Quebec, Canada
Guillaume Bourque
Canadian Center for Computational Genomics, McGill University, Montreal, Quebec, Canada
Guillaume Bourque
Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
Guillaume Bourque
Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
Mark J. P. Chaisson & Tsung-Yu Lu
Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
Haoyu Cheng, Justin Chu, Xiaowen Feng & Heng Li
Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Haoyu Cheng, Xiaowen Feng & Heng Li
Arizona State University, Barrett and O’Connor Washington Center, Washington, DC, USA
Robert M. Cook-Deegan
School of Biological Sciences, Washington State University, Pullman, WA, USA
Omar E. Cornejo
Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Daniel Doerr, Peter Ebert, Jana Ebler, Hugo Magalhães, Pierre Marijon & Tobias Marschall
Howard Hughes Medical Institute, Chevy Chase, MD, USA
Evan E. Eichler, David Haussler & Erich D. Jarvis
The Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
Olivier Fedrigo, Giulio Formenti & Jacquelyn Mountcastle
National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Adam L. Felsenfeld, Baergen I. Schultz, Michael W. Smith & Heidi J. Sofia
Center for Computational and Genomic Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
Yan Gao
NNF Center for Biosustainability, Technical University of Denmark, Copenhagen, Denmark
Shilpa Garg
Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
Nanibaa’ A. Garrison
Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA
Richard E. Green
Dovetail Genomics, Scotts Valley, CA, USA
Richard E. Green
Quantitative Life Sciences, McGill University, Montreal, Québec, Canada
Cristian Groza
Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
Ira Hall, Wen-Wei Liao & Shuangjia Lu
Center for Genomic Health, Yale University School of Medicine, New Haven, CT, USA
Ira Hall
Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
Simon Heumos
Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, Germany
Simon Heumos
Tree of Life, Wellcome Sanger Institute, Cambridge, UK
Kerstin Howe & Jonathan M. D. Wood
Northeastern University, Boston, MA, USA
Miten Jain
The Rockefeller University, New York, NY, USA
Erich D. Jarvis
Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
Hanlee P. Ji & HoJoon Lee
Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Eimear E. Kenny
Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
Barbara A. Koenig
Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
Jan O. Korbel
Department of Medicine, Washington University School of Medicine, St Louis, MO, USA
Wen-Wei Liao
Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
Santiago Marco-Sola
Departament d’Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, Barcelona, Spain
Santiago Marco-Sola
Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
Tobias Marschall
Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
Jennifer McDaniel, Nathan D. Olson, Justin Wagner & Justin M. Zook
Coriell Institute for Medical Research, Camden, NJ, USA
Matthew W. Mitchell
Department of Computer Science, University of Pisa, Pisa, Italy
Moses Njagi Mwaniki
Department of Public Health Sciences, University of California, Davis, Davis, CA, USA
Alice B. Popejoy
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
Daniela Puiu & Aleksey V. Zimin
Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA, USA
Samuel Sacco
Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
Ashley D. Sanders
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Valerie A. Schneider & Françoise Thibaud-Nissen
Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
Jonas A. Sibbesen
Al Jalila Genomics Center of Excellence, Al Jalila Children’s Specialty Hospital, Dubai, United Arab Emirates
Ahmad N. Abou Tayoun
Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
Ahmad N. Abou Tayoun
Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA, USA
Mitchell R. Vollger
Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
Ting Wang
Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
Aleksey V. Zimin

Authors

Andrea Guarracino
Silvia Buonaiuto
Leonardo Gomes de Lima
Tamara Potapova
Arang Rhie
Sergey Koren
Boris Rubinstein
Christian Fischer
Jennifer L. Gerton
Adam M. Phillippy
Vincenza Colonna
Erik Garrison

Consortia

Human Pangenome Reference Consortium

Haley J. Abel, Lucinda L. Antonacci-Fulton, Mobin Asri, Gunjan Baid, Carl A. Baker, Anastasiya Belyaeva, Konstantinos Billis, Guillaume Bourque, Silvia Buonaiuto, Andrew Carroll, Mark J. P. Chaisson, Pi-Chuan Chang, Xian H. Chang, Haoyu Cheng, Justin Chu, Sarah Cody, Vincenza Colonna, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Daniel Doerr, Peter Ebert, Jana Ebler, Evan E. Eichler, Jordan M. Eizenga, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Xiaowen Feng, Christian Fischer, Paul Flicek, Giulio Formenti, Adam Frankish, Robert S. Fulton, Yan Gao, Shilpa Garg, Erik Garrison, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Cristian Groza, Andrea Guarracino, Leanne Haggerty, Ira Hall, William T. Harvey, Marina Haukness, David Haussler, Simon Heumos, Glenn Hickey, Kendra Hoekzema, Thibaut Hourlier, Kerstin Howe, Miten Jain, Erich D. Jarvis, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Heng Li, Wen-Wei Liao, Shuangjia Lu, Tsung-Yu Lu, Julian K. Lucas, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Charles Markello, Tobias Marschall, Fergal J. Martin, Ann McCartney, Jennifer McDaniel, Karen H. Miga, Matthew W. Mitchell, Jean Monlong, Jacquelyn Mountcastle, Katherine M. Munson, Moses Njagi Mwaniki, Maria Nattestad, Adam M. Novak, Sergey Nurk, Hugh E. Olsen, Nathan D. Olson, Benedict Paten, Trevor Pesout, Adam M. Phillippy, Alice B. Popejoy, David Porubsky, Pjotr Prins, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Jonas A. Sibbesen, Jouni Sirén, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Chad Tomlinson, Francesca Floriana Tricomi, Flavia Villani, Mitchell R. Vollger, Justin Wagner, Brian Walenz, Ting Wang, Jonathan M. D. Wood, Aleksey V. Zimin & Justin M. Zook

Contributions

Paper writing: A.G. and E.G. Paper editing: A.G., S.B., L.G.d.L., T.P., A.R., S.K., B.R., C.F., J.L.G., A.M.P., V.C. and E.G. Development of algorithms and software: A.G. and E.G. Chromosome community detection: A.G. and E.G. Pangenome graph building and analyses: A.G. and E.G. Pangenome visualization: A.G., C.F. and E.G. ROB breakpoints: A.G., T.P., J.L.G. and E.G. Maximum likelihood phylogenetic analysis: L.G.d.L. Recombination hotspots analysis: A.G. Linkage disequilibrium analysis: A.G., S.B. and V.C. Homology mosaicism analysis: A.G. and E.G. Physical distance modelling: B.R. and J.L.G. HG002-Verkko assembly: A.R., S.K. and A.M.P. Untangling validation: A.G. and E.G.

Corresponding author

Correspondence to Erik Garrison.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 (A) Evolutionary strata 5 and 4.

Visualization with SafFire (https://mrvollger.github.io/SafFire/) of the alignment between T2T-CHM13 X and Y reveals that strata 5 and 4 feature low identity (~90%), numerous inversions, and some rearrangements. (B) X chromosome ideogram according to ref. ²¹. At the bottom are its evolutionary domains: the X-added region (XAR), the X-conserved region (XCR; dotted region in proximal Xp does not appear to be part of the XCR), the pseudoautosomal region PAR1, and evolutionary strata S5–S1. (C) The reduced all-to-all mapping graph of HPRCy1 versus itself, with contigs represented as nodes and mappings as edges. Contigs covering evolutionary strata 5 and 4 on chromosome X are shown in red. (D) The reduced homology mapping graph in C, colored by community assignment. Panels C and D use the same layout as Fig. 1 but focus only on the X and Y region of the visualization.

Extended Data Fig. 2 An overview of our approach to build a PVG for HPRCy1 contigs that can be anchored to a specific acrocentric q-arm.

(A) As input, we take the entire HPRCy1 and map it to T2T-CHM13. (B) This yields mappings to acrocentric chromosomes, which we filter to select contigs that map across the centromeres (red cytobands) between non-centromeric regions (over-labeled green). We include two HG002 assemblies based on standard HiFi (from HPRCy1) and on both HiFi and ONT data (from Verkko). (C) We then apply PGGB to build a PVG from the HPRCy1-acro collection. PGGB first obtains an all-to-all alignment of the input (C.a.), which is converted to a variation graph with SEQWISH²⁸ (C.b.), then normalized with sorting and multiple sequence alignment steps in SMOOTHXG (C.c-f). (D) The resulting PVG expresses genomes as paths, or walks, through a common sequence graph. This model thus contains all input sequences and their relative alignments to all others—in the example we see a CTGG/AAGTA block substitution between genomes 1 and 2.
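The path model in panel D can be illustrated with a toy graph. The following is a minimal sketch, not PGGB output: only the CTGG/AAGTA substitution comes from the caption, while the node identifiers, flanking sequences, and genome names are hypothetical.

```python
# Toy variation graph: nodes hold sequence, paths are walks over node ids.
# Only the CTGG/AAGTA alleles come from the caption; everything else is made up.
nodes = {
    1: "ACGT",   # shared left flank (hypothetical)
    2: "CTGG",   # allele carried by genome 1
    3: "AAGTA",  # allele carried by genome 2
    4: "TTCA",   # shared right flank (hypothetical)
}
paths = {
    "genome1": [1, 2, 4],
    "genome2": [1, 3, 4],
}


def spell(path_name):
    """Reconstruct a genome's sequence by concatenating nodes along its walk."""
    return "".join(nodes[n] for n in paths[path_name])


def private_nodes(a, b):
    """Nodes visited by path a but not by path b (one side of the bubble)."""
    other = set(paths[b])
    return [n for n in paths[a] if n not in other]


print(spell("genome1"))
print(spell("genome2"))
print(private_nodes("genome1", "genome2"), private_nodes("genome2", "genome1"))
```

Because every input sequence is stored as a walk, each genome can be recovered exactly from the graph, and the nodes that are private to one walk mark where the genomes differ.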

Extended Data Fig. 3 Scheme of the graph untangling.

We applied ODGI UNTANGLE to obtain a mapping from segments of all PVG paths onto T2T-CHM13. The segmentation cuts the graph into regular-sized regions whose boundaries occur at structural variant breakpoints. For each query subpath through a graph segment, we use a Jaccard metric over the sequence space of the subpaths to find the best-matching reference segment.
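The matching step can be sketched as follows. This is a simplified illustration, not the ODGI UNTANGLE implementation: it assumes each subpath is reduced to the set of graph nodes it visits and that the Jaccard metric is weighted by node length in base pairs. The node lengths and segment names are hypothetical.

```python
def jaccard(query_nodes, target_nodes, node_len):
    """Length-weighted Jaccard similarity between two subpaths.

    Assumption: each subpath is treated as a set of node ids, and each node
    contributes its sequence length (bp) to the intersection and union.
    """
    q, t = set(query_nodes), set(target_nodes)
    shared = sum(node_len[n] for n in q & t)
    total = sum(node_len[n] for n in q | t)
    return shared / total if total else 0.0


def best_reference_segment(query_nodes, reference_segments, node_len):
    """Return (segment name, score) of the best-matching reference segment."""
    scored = {
        name: jaccard(query_nodes, seg_nodes, node_len)
        for name, seg_nodes in reference_segments.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical node lengths (bp) and reference segments within one graph region.
node_len = {1: 100, 2: 50, 3: 200, 4: 10}
reference_segments = {
    "chr13:seg7": [1, 2],
    "chr21:seg3": [1, 3, 4],
}
best, score = best_reference_segment([1, 3], reference_segments, node_len)
print(best, round(score, 3))
```

Weighting by sequence length rather than by node count keeps long shared nodes from being outvoted by many short variant nodes.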

Extended Data Fig. 4 Characteristics of the pseudo-homologous regions of acrocentric chromosomes on chromosome 13.

(A) We focus on the first 25 Mbp of chromosome 13, shown here as a red box over T2T-CHM13 cytobands. Pseudo-homologous regions (PHRs), where diverse sets of acrocentric chromosomes recombine, are highlighted relative to T2T-CHM13 genome annotations for repeats, GC percentage, and genes. Above, we indicate regions of interest described in the main text: rDNA, SST1 array, centromere, and q-arm. Below, we show T2T-CHM13-relative homology mosaics for each chromosome 13-matched contig from HPRCy1-acro, with the most-similar reference chromosome at each region shown using the given colors (Target). (B) Aggregated untangle results in the SAACs. For each acrocentric chromosome, we show the count of its HPRCy1 q-arm-anchored contigs mapping to itself and to all other acrocentrics (Contigs), (C) as well as the regional (50 kbp) untangle entropy metric (Regional homology entropy) computed over the contigs’ T2T-CHM13-relative untanglings. (D) By considering the multiple untanglings of each HPRCy1-acro contig, we develop a point-wise metric that captures diversity in T2T-CHM13-relative homology patterns (Positional homology entropy), leading to our definition of the PHRs. (E) The patterns of homology mosaicism suggest ongoing recombination exchange in the SAACs. A scan over T2T-CHM13 reveals that the rDNA units are enriched for PRDM9 binding motifs, and thus may host frequent double-strand breaks during meiosis. In (B-D) a gray background indicates regions with missing data due to the lack of non-T2T-CHM13 contigs. We provide the Centromeric Satellite Annotation (CenSat Annotation) track legend in Extended Data Table 1.
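The idea behind a homology entropy track can be sketched with Shannon entropy over the best-matching reference chromosomes at each position. This is an illustrative assumption rather than the exact definition used for panels C and D: here each contig contributes one best-hit label per window, and the toy mosaics are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (bits) of a collection of categorical labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values()) if n else 0.0


def positional_entropy(mosaics):
    """Entropy per window across contigs.

    mosaics: list of equal-length lists, one per contig, giving the
    best-matching reference chromosome in each window (hypothetical input).
    """
    return [shannon_entropy(column) for column in zip(*mosaics)]


# Four hypothetical chromosome 13 contigs, three windows each.
mosaics = [
    ["chr13", "chr13", "chr13"],
    ["chr13", "chr21", "chr14"],
    ["chr13", "chr21", "chr22"],
    ["chr13", "chr13", "chr21"],
]
print(positional_entropy(mosaics))
```

A window in which every contig matches its own chromosome scores 0 bits, whereas a window in which contigs match several different acrocentrics scores higher, which is the kind of signal that flags a candidate PHR.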

Extended Data Fig. 5 Characteristics of the pseudo-homologous regions of acrocentric chromosomes on chromosome 14.

(A) We focus on the first 25 Mbp of chromosome 14, shown here as a red box over T2T-CHM13 cytobands. Pseudo-homologous regions (PHRs), where diverse sets of acrocentric chromosomes recombine, are highlighted relative to T2T-CHM13 genome annotations for repeats, GC percentage, and genes. Above, we indicate regions of interest described in the main text: rDNA, SST1 array, centromere, and q-arm. Below, we show T2T-CHM13-relative homology mosaics for each chromosome 14-matched contig from HPRCy1-acro, with the most-similar reference chromosome at each region shown using the given colors (Target). (B) Aggregated untangle results in the SAACs. For each acrocentric chromosome, we show the count of its HPRCy1 q-arm-anchored contigs mapping to itself and to all other acrocentrics (Contigs), (C) as well as the regional (50 kbp) untangle entropy metric (Regional homology entropy) computed over the contigs’ T2T-CHM13-relative untanglings. (D) By considering the multiple untanglings of each HPRCy1-acro contig, we develop a point-wise metric that captures diversity in T2T-CHM13-relative homology patterns (Positional homology entropy), leading to our definition of the PHRs. (E) The patterns of homology mosaicism suggest ongoing recombination exchange in the SAACs. A scan over T2T-CHM13 reveals that the rDNA units are enriched for PRDM9 binding motifs, and thus may host frequent double-strand breaks during meiosis. In (B-D) a gray background indicates regions with missing data due to the lack of non-T2T-CHM13 contigs. We provide the Centromeric Satellite Annotation (CenSat Annotation) track legend in Extended Data Table 1.

Extended Data Fig. 6 Characteristics of the pseudo-homologous regions of acrocentric chromosomes on chromosome 15.

(A) We focus on the first 25 Mbp of chromosome 15, shown here as a red box over T2T-CHM13 cytobands. Pseudo-homologous regions (PHRs), where diverse sets of acrocentric chromosomes recombine, are highlighted relative to T2T-CHM13 genome annotations for repeats, GC percentage, and genes. Above, we indicate regions of interest described in the main text: rDNA, SST1 array, centromere, and q-arm. Below, we show T2T-CHM13-relative homology mosaics for each chromosome 15-matched contig from HPRCy1-acro, with the most-similar reference chromosome at each region shown using the given colors (Target). (B) Aggregated untangle results in the SAACs. For each acrocentric chromosome, we show the count of its HPRCy1 q-arm-anchored contigs mapping to itself and to all other acrocentrics (Contigs), (C) as well as the regional (50 kbp) untangle entropy metric (Regional homology entropy) computed over the contigs’ T2T-CHM13-relative untanglings. (D) By considering the multiple untanglings of each HPRCy1-acro contig, we develop a point-wise metric that captures diversity in T2T-CHM13-relative homology patterns (Positional homology entropy), leading to our definition of the PHRs. (E) The patterns of homology mosaicism suggest ongoing recombination exchange in the SAACs. A scan over T2T-CHM13 reveals that the rDNA units are enriched for PRDM9 binding motifs, and thus may host frequent double-strand breaks during meiosis. In (B-D) a gray background indicates regions with missing data due to the lack of non-T2T-CHM13 contigs. We provide the Centromeric Satellite Annotation (CenSat Annotation) track legend in Extended Data Table 1.

Extended Data Fig. 7 Characteristics of the pseudo-homologous regions of acrocentric chromosomes on chromosome 21.

(A) We focus on the first 25 Mbp of chromosome 21, shown here as a red box over T2T-CHM13 cytobands. Pseudo-homologous regions (PHRs), where diverse sets of acrocentric chromosomes recombine, are highlighted relative to T2T-CHM13 genome annotations for repeats, GC percentage, and genes. Above, we indicate regions of interest described in the main text: rDNA, SST1 array, centromere, and q-arm. Below, we show T2T-CHM13-relative homology mosaics for each chromosome 21-matched contig from HPRCy1-acro, with the most-similar reference chromosome at each region shown using the given colors (Target). (B) Aggregated untangle results in the SAACs. For each acrocentric chromosome, we show the count of its HPRCy1 q-arm-anchored contigs mapping to itself and to all other acrocentrics (Contigs), (C) as well as the regional (50 kbp) untangle entropy metric (Regional homology entropy) computed over the contigs’ T2T-CHM13-relative untanglings. (D) By considering the multiple untanglings of each HPRCy1-acro contig, we develop a point-wise metric that captures diversity in T2T-CHM13-relative homology patterns (Positional homology entropy), leading to our definition of the PHRs. (E) The patterns of homology mosaicism suggest ongoing recombination exchange in the SAACs. A scan over T2T-CHM13 reveals that the rDNA units are enriched for PRDM9 binding motifs, and thus may host frequent double-strand breaks during meiosis. In (B-D) a gray background indicates regions with missing data due to the lack of non-T2T-CHM13 contigs. We provide the Centromeric Satellite Annotation (CenSat Annotation) track legend in Extended Data Table 1.

Extended Data Fig. 8 Characteristics of the pseudo-homologous regions of acrocentric chromosomes on chromosome 22.

(A) We focus on the first 25 Mbp of chromosome 22, shown here as a red box over T2T-CHM13 cytobands. Pseudo-homologous regions (PHRs), where diverse sets of acrocentric chromosomes recombine, are highlighted relative to T2T-CHM13 genome annotations for repeats, GC percentage, and genes. Above, we indicate regions of interest described in the main text: rDNA, SST1 array, centromere, and q-arm. Below, we show T2T-CHM13-relative homology mosaics for each chromosome 22-matched contig from HPRCy1-acro, with the most-similar reference chromosome at each region shown using the given colors (Target). (B) Aggregated untangle results in the SAACs. For each acrocentric chromosome, we show the count of its HPRCy1 q-arm-anchored contigs mapping to itself and to all other acrocentrics (Contigs), (C) as well as the regional (50 kbp) untangle entropy metric (Regional homology entropy) computed over the contigs’ T2T-CHM13-relative untanglings. (D) By considering the multiple untanglings of each HPRCy1-acro contig, we develop a point-wise metric that captures diversity in T2T-CHM13-relative homology patterns (Positional homology entropy), leading to our definition of the PHRs. (E) The patterns of homology mosaicism suggest ongoing recombination exchange in the SAACs. A scan over T2T-CHM13 reveals that the rDNA units are enriched for PRDM9 binding motifs, and thus may host frequent double-strand breaks during meiosis. In (B-D) a gray background indicates regions with missing data due to the lack of non-T2T-CHM13 contigs. We provide the Centromeric Satellite Annotation (CenSat Annotation) track legend in Extended Data Table 1.

Extended Data Fig. 9 PRDM9 binding motif in the acrocentric chromosomes.

For each T2T-CHM13 acrocentric chromosome, we show the number of human PRDM9 binding motif hits in 20-kbp windows.
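
The windowed counts can be sketched as follows, assuming the motif hit positions have already been obtained from a separate motif scan (not shown) as 0-based coordinates, and that windows are non-overlapping. The function name `count_hits_per_window` and its parameters are hypothetical.

```python
def count_hits_per_window(hit_positions, chrom_length, window=20_000):
    """Count motif hits in consecutive non-overlapping windows.

    `hit_positions` are 0-based start coordinates of motif hits (assumed to
    come from an upstream motif scan). The last window may be shorter than
    `window` if `chrom_length` is not a multiple of it.
    """
    n_windows = (chrom_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in hit_positions:
        if 0 <= pos < chrom_length:  # ignore out-of-range coordinates
            counts[pos // window] += 1
    return counts


if __name__ == "__main__":
    print(count_hits_per_window([0, 19_999, 20_000, 45_000], 50_000))
```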

Extended Data Fig. 10 Linkage disequilibrium decay with distance between markers per acrocentric chromosome.

Each LD decay plot shows the p-arm (purple), q-arm (pink), and PHR (blue) mean r² (points) and 95% confidence intervals (error bars) for marker pairs binned by the given inter-marker distance range (x-axis). Dot size is proportional to the number of pairwise comparisons within a bin. LD decay is faster in PHRs for chromosomes 13, 14, and 22. No notable LD decay is observed in PHRs for chromosome 15.
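
The binned summary shown in each plot can be sketched as below. This is an illustration only: it assumes pairwise r² values are already computed, and it uses a normal-approximation confidence interval (mean ± 1.96 standard errors), which may differ from the interval used for the figure. The function name and input layout are hypothetical.

```python
import math
from statistics import mean, stdev


def ld_decay_bins(pairs, bin_edges):
    """Summarize r^2 by inter-marker distance bin.

    `pairs` is an iterable of (distance_bp, r2) tuples; `bin_edges` is an
    ascending list of edges defining half-open bins [edge_i, edge_i+1).
    Returns one dict per bin with the pair count, mean r^2, and an
    approximate 95% confidence interval (normal approximation, assumed).
    """
    bins = [[] for _ in range(len(bin_edges) - 1)]
    for dist, r2 in pairs:
        for i in range(len(bins)):
            if bin_edges[i] <= dist < bin_edges[i + 1]:
                bins[i].append(r2)
                break  # each pair belongs to at most one bin

    summary = []
    for i, values in enumerate(bins):
        n = len(values)
        entry = {"range": (bin_edges[i], bin_edges[i + 1]), "n": n,
                 "mean_r2": None, "ci95": None}
        if n > 0:
            m = mean(values)
            # With a single pair the standard error is undefined; use 0.
            half = 1.96 * stdev(values) / math.sqrt(n) if n > 1 else 0.0
            entry["mean_r2"] = m
            entry["ci95"] = (m - half, m + half)
        summary.append(entry)
    return summary


if __name__ == "__main__":
    demo = [(100, 0.8), (500, 0.6), (1500, 0.2)]
    for row in ld_decay_bins(demo, [0, 1000, 2000]):
        print(row)
```

The `n` field corresponds to the number of pairwise comparisons per bin, which the plots encode as dot size.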

Extended Data Table 1 Centromeric Satellite Annotation (CenSat Annotation) track legend

Supplementary information

Supplementary Information

This file contains Supplementary Figs. 1–29.

Reporting Summary

Peer Review File

Supplementary Note 1

Physical proximity modelling of acrocentric short arms.

Supplementary Files

This file contains Supplementary Files 1–8.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Guarracino, A., Buonaiuto, S., de Lima, L.G. et al. Recombination between heterologous human acrocentric chromosomes. Nature 617, 335–343 (2023). https://doi.org/10.1038/s41586-023-05976-y

Received: 15 August 2022
Accepted: 17 March 2023
Published: 10 May 2023
Issue Date: 11 May 2023
DOI: https://doi.org/10.1038/s41586-023-05976-y
