Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes

Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.


Supplementary Material
6 Effect of the Parameter Choice to The MAP Vertex Sizes 10 Supplementary Figure 1 10 Supplementary Table 5 12 Suggested Parameter Choice for Region of Size Up to 5Mb 13 Supplementary Table 6 13 Performance Evaluation

Sequence Query
While PGR-TK is not designed for creating sequence alignments, the query sequence to database query provided function to identify homologuos sequences in the database to the query sequences. We compare the computing resource for such utility in PGR-TK to minimap2, currently the state of are for fast sequence alignment. For a set of ten selected regions for querying 11 haplotype pangenome references, PGR-TK can index all genome in parallel and provide comparable query time.

Supplementary Table 2
We compared the query results for two selected regions and found them to be consistent. In these two cases, due to differences in their design, the "pgr-query" command only produced a single aligned region for each reference assembly, rather than multiple supplementary alignments. Similar to minimap2, "pgr-query" provides additional information about the hits, allowing the user to apply filters and define criteria to eliminate false positive alignments caused by repeats in more complex scenarios. Supplementary Table 3 In Supplementary Table 3, we compared the results of the pgr-query and minimap2 for a test set of pangenomic sequences. While pgr-query is designed to fetch homologous sequences from the database using long query sequences, rather than as a general sequence aligner, it is important to demonstrate its performance in fetching sequences accurately. We compared all hits larger than 10 kb from the pgr-query output for a set of 395 query sequences from the CMRG to the minimap2 output and found that the results are highly consistent in most cases. Some discrepancies are due to (1) short hits and (2) low minimap2 mapQV output. The pgrquery output might be more sensitive, such that some repetitive sequences are in the query output without proper filtering.
On the other hand, PGR-TK demonstrated a significant advantage in terms of computational efficiency, with a construction time of the pangenome graphs that was 500x faster in terms of user CPU time and 60x faster in terms of wall clock time compared to Seqwish, and 75x faster and 160x faster, respectively, compared to Minigraph using default pgr-pbundle-decomp parameters.

Effect of the Parameter Choice to The MAP Vertex Sizes
Supplementary Figure 1 Vertex Sizes of the Chromosome 1 of Chm13 with different w and r, (min_span = 64).
Supplementary Figure 1a: the vertex size influences the resolution of the sequence being analyzed. Observing the vertex size distributions using various parameter sets, a flat region is observed followed by an exponential tail. To effectively query the database, it's best to ensure that the query sequence length is a multiple of the average vertex size.
Supplementary Figure 1b: this vertex size distribution plot with the same parameter sets as those in 1a, but with a small min_span. To reduce excessing minimizer in long simple repeat regions, we remove all pairs of minimizer anchors that are smaller than min_span to reduce unnecessary additional computation resources for processing simple repeat regions. We use the pgr-fetch-seqs command in PGR-TK to get the two references: pgr-fetch-seqs pgr-tk-HGRP-y1-evaluation-set-v0 \ -r regions_interest > ROI_seq.fa Then, we use the pgr-query command to get the sequences in the pangenome reference. We merge the hits that are less than 100kb apart from each other: pgr-query /wd/data/pgr-tk-HGRP-y1-evaluation-set-v0 \ /wd/results/pgr-out/ROI_seq.fa /wd/results/pgr-out/pg_seqs --merge-range-tol 100000 After fetching the sequence in the database, we filter out partial aligned contigs. With in the aligned contig, we generate the MAP-graph and the principal bundle decomposition by Please see more concrete examples in the git repo: https://github.com/GeneDx/pgr-tk

Comparing Principal Bundle Decomposition with Different Set of Parameters
We provide two illustrations of principal bundle decompositions, each with varying scales by changing the parameter sets. The first illustration, shown in Supplementary Figures 3a and 3b, pertains to a 130 kb region of interest containing the LPA KIV-II repeats. The second illustration, also shown in Supplementary Figures 3c and 3d, is of a 2.85 Mbp region located on chromosome 7 from positions 72752602 to 75600937 on GRCh38. This region is known to contain a microdeletion caused by nested repeats, which results in Williams-Beuren syndrome.
In Supplementary Figure 2a, there are 231 bundle segments spanning the 103,538 bp CHM13 LPA sequence, while only 93 bundle segments are present in Supplementary Figure 2b. The sparser decomposition with r=12 in Supplementary Figure 2b for this region may not provide sufficient detail for analyzing repeat elements in the sequences In contrast, for the large 2.85 Mbp region, the choice r=12 provides a better representation of the overall structure (Supplementary Figure 2c), as it contains only 251 bundle segments out of the 2,916,749 bp chromosome 13 sequence, compared to the r=6 choice ( Supplementary  Figure 2d), which has 685 bundle segments. The higher number of bundle segments in the r=6 choice results in over-fragmentation of the sequences, making it more difficult to identify interesting repeats.
The pgr-pbundle-decomp command-line tool generates a summary of all contigs, providing valuable information for analysis by reporting the total number and average lengths of repetitive and non-repetitive fragments. This information can help the user make informed