Main

Recent, single-molecule, long-read sequencing technologies have enabled breakthroughs in producing de novo haplotype-resolved genome assemblies1,2,3,4. Major efforts are under way5 (https://www.genome.gov/news/news-release/NIH-funds-centers-for-advancing-sequence-of-human-genome-reference) to generate hundreds of human genome assemblies, with the intention of deriving a variation-aware pangenome representation that replaces the current linear reference genome, GRCh38. Although long-read technologies are rapidly advancing, advantages of cost and scalability, and the requirement for large study cohorts, will make short reads a more practical approach for the foreseeable future.

Diploid organisms have two copies of each autosomal chromosome, each of which carries genetic variation. The process of determining whether a known variant allele is located on none, one or both of these copies is referred to as genotyping. Variant genotyping is an essential step in genetic studies, enabling population analysis, quantitative trait locus studies or trait association analysis. Large studies have produced comprehensive catalogs of human variation ranging from single-nucleotide polymorphisms (SNPs) and indels (insertions and deletions up to 49 bp in size) to larger structural variants (SVs)6,7,8,9, and many such variants have been linked to diseases and other traits10,11,12,13,14,15.

Widely used genotyping methods for sequencing data16,17,18,19,20 are based on short-read alignments to a reference genome or pangenome graphs, which include possible alternative alleles21,22,23,24,25,26,27. Graph-based approaches have been shown to improve genotyping accuracy over methods that rely on a linear reference genome. However, aligning sequencing reads is time-consuming even for linear reference genomes, where mapping 30× short-read sequencing data of a single human sample takes around 100 CPU hours. This problem is amplified when transitioning to graph-based pangenome references, where the read-mapping problem is even more computationally expensive.

A much faster alternative is to genotype known variants based on k-mers, short sequences of a fixed length k, in the raw sequencing reads without alignment to a reference. Counts of reference- and allele-specific k-mers allow fast and accurate genotyping of various types of genetic variation28,29,30,31,32,33. However, these methods can struggle in repetitive and duplicated regions of the genome not covered by unique k-mers. This is especially problematic for SVs, which are often located in repeat-rich or duplicated regions of the genome8,34 that are generally difficult to access by short-read sequencing35.

This problem has been addressed previously by leveraging long-range connectivity information from sequencing reads36. In a similar manner, haplotype-resolved assemblies of known samples could improve k-mer-based genotyping, especially in difficult-to-access regions of large diploid genomes, but methods for this have so far been lacking. Known haplotypes have been used to construct population-based reference panels to phase small variants (Li–Stephens model)37 as well as impute missing genotypes38,39,40,41, but accurate reference panels that include SVs are still lacking.

In this report, we describe an algorithm, PanGenie (for Pangenome-based Genome Inference), that makes use of haplotype information from an assembly-derived pangenome representation in combination with read k-mer counts to efficiently genotype a wide spectrum of variants. That is, our method can leverage short and longer linkage disequilibrium (LD) structures inherent in the assemblies to infer the genome of a new sample for which only short reads are available. PanGenie bypasses read mapping and is entirely based on k-mers, which allows it to rapidly proceed from the input short reads to a final callset including SNPs, indels and SVs, enabling analysis of variants typically not accessible in short-read workflows—including many deletions <1 kb and most insertions ≥50 bp. We applied our method to genotype variants called from haplotype-resolved assemblies of 11 individuals, revealing a substantial advance in terms of runtime, genotyping accuracy and number of accessible variants.

Results

Algorithm overview

We call variants from haplotype-resolved assemblies (see Constructing a pangenome reference) of several samples and construct a pangenome graph in which these variants are represented as bubbles and each haplotype as a path (Fig. 1, step 1). This graph is given as input to PanGenie, together with short-read sequencing data of a new sample to be genotyped. The k-mers contained in the graph are counted in the reads and k-mers unique to bubble regions are identified (step 2 in Fig. 1; Methods). PanGenie combines two sources of information to genotype bubbles: read k-mer counts and the already known haplotype sequences. The distribution of k-mer counts along the allele paths of a bubble can provide evidence for the genotype of the sample. Figure 1 (right panel) provides an example: k-mers corresponding to the second allele of the first bubble are absent from the reads, indicating that the individual carries the alleles of the green and blue haplotypes. However, bubbles may be poorly covered by k-mers or no unique k-mers may exist in repetitive regions of the genome. Such positions cannot be reliably genotyped based on the k-mer counts alone, but known haplotypes can help to infer genotypes based on neighboring bubbles (Fig. 1).

Fig. 1: Overview.
figure 1

a, Step 1: variants are called from haplotype-resolved assemblies of a set of known samples and a pangenome graph is constructed, which represents variants as bubbles and contains one path per haplotype. b, Step 2: the k-mers (represented by circles) contained in the graph are counted in the short-read sequencing data of the target sample to be genotyped. The color of the nodes indicates copy number estimates for the k-mers. c, Step 3: PanGenie uses k-mer counts and haplotype paths to infer the unknown genome. For the first bubble, k-mer counts suggest that the sample probably carries the alleles of the green and blue haplotypes. The second bubble is poorly covered by k-mers; however, linkage to adjacent bubbles can be used to infer the two local haplotype paths.

For genotyping, we integrate information from k-mer counts and haplotypes by constructing a hidden Markov model (HMM), which models the unknown genome as a mosaic of the provided haplotypes and reconstructs it based on the read k-mer counts observed in the sample’s sequencing reads (Methods). Hidden states represent pairs of haplotype paths that can be chosen at each bubble position and emit counts for the unique k-mers of the respective region. State transitions between adjacent bubbles correspond to recombination events. Using the forward–backward algorithm, genotype likelihoods are computed for each bubble, from which a genotype is derived.

Constructing a pangenome reference

We generated haplotype-resolved assemblies of 14 individuals including 3 mother–father–child trios (Fig. 2a and Methods; samples include: Yoruban trio: NA19238, NA19239, NA19240; Puerto Rican trio: HG00731, HG00732, HG00733; southern Han Chinese trio: HG00512, HG00513, HG00514; and NA12878, HG02818, HG03125, NA24385 and HG03486) and used all 11 unrelated samples to call variants on each haplotype of all autosomes and chromosome X. We computed the transition:transversion (ti:tv) ratio for SNPs and the heterozygous:homozygous (het:hom) ratio as quality control measures42,43. Our SNP calls contained around twice as many transitions as transversions (Fig. 2b) resulting in ti:tv ratios between 2.01 and 2.02 for all samples. We obtained het:hom ratios between 1.37 and 2.20 for all our 11 callset samples. These numbers are in line with respective results for African (AFR), American (AMR), Asian (EAS) and European (EUR) individuals reported in previous studies43,44. Furthermore, our callset contains comparable numbers of insertions and deletions (Fig. 2c), except for the expected enrichment for insertion alleles for SVs8. We show detailed counts of distinct variant alleles for all types in Fig. 2d (first row) and Supplementary Tables 2 and 3. We distinguish small variants (1–19 bp), midsize variants (20–49 bp) and large variants (≥50 bp).

Fig. 2: Callset statistics.
figure 2

a, Overview of the samples for which variants are called from haplotype-resolved assemblies as well as their het:hom ratios. Color corresponds to the population from which the samples originate. b, The number of different substitutions reported for all samples. c, Length distribution of insertions and deletions across all samples (in basepairs). d, Total number of distinct variant alleles detected across all 11 samples (first row), as well as the number of bubbles in the corresponding pangenome graph (second row). We distinguished small (1–19 bp), midsize (20–49 bp) and large (≥50 bp) variants. Biallelic bubbles were classified as SNPs, insertions or deletions; complex corresponds to all remaining bubbles with more than two branches resulting from inserting overlapping variant calls into the graph.

We created an acyclic and directed pangenome graph containing bubbles representing our variant callset (Methods and Extended Data Fig. 1). Sets of overlapping variant alleles are merged into a single bubble representing all alternative sequences observed across the haplotypes (Fig. 2d). The haplotypes themselves are represented as paths through the resulting pangenome. We distinguish biallelic from complex bubbles. The latter corresponds to bubbles with more than two branches and the former to all bubbles with two branches (reference and alternative sequence). Based on the type of bubbles, we define genomic regions as ‘biallelic’ or ‘complex’ (Methods and Extended Data Fig. 1).

Comparison to existing genotyping methods

We conducted a ‘leave-one-out experiment’ (Methods and Extended Data Fig. 2) based on Illumina reads from the Genome in a Bottle (GIAB) consortium45 and 1000 Genomes Project high-coverage data46. In the same way as described above, we created a callset containing variants detected from haplotype-resolved assemblies of a subset of ten samples and re-genotyped these variants using Illumina data of the remaining sample. Variants called from the assemblies of the left-out sample are used as the ground truth for evaluation. We ran this experiment twice, leaving out samples NA12878 and NA24385 for evaluation, respectively. In addition to running PanGenie, we ran BayesTyper32 (k-mer based), Platypus19, GATK HaplotypeCaller16, GraphTyper22, Paragraph25 and Giraffe27 (all mapping based) to re-genotype the same set of variants (Methods and Extended Data Fig. 2). We ran our experiments on coverage levels 30×, 20×, 10× and 5×.

Not all tools are designed to handle all types of variants. Therefore, we ran GATK only on SNPs, small and midsize variants and Paragraph only on midsize and large variants. GraphTyper and Giraffe were run on large variants only.

Results for NA12878 (Fig. 3 and Extended Data Figs. 38) and NA24385 (Supplementary Figs. 49) are similar, showcasing consistency of results across samples. To analyze genotyping performance, we introduced the weighted genotype concordance (wGC) which puts equal emphasis on the ability to detect all three possible genotypes (Supplementary Note). As an alternative view on the performance of the individual methods, we offer precision, recall and F score, all in an unadjusted version and an adjusted version that does not penalize methods for ‘missing’ variants that are undetectable because they are not in the input set (Supplementary Note). Furthermore, we stratify our analyses by considering variants outside and inside short-tandem repeats (STRs) and variable-number tandem repeats (VNTRs)47. We annotated variants according to their repeat status and observed that between 68% and 72% of midsize (20–49 bp) and large variants (≥50 bp) are repeat associated, respectively (Supplementary Table 4). We consider two configurations for PanGenie: ‘high-gq’ filtering, where we use only genotypes reported with high quality scores and treat all other variants as not genotyped, and ‘all’, where we consider all reported genotypes regardless of their quality.

Fig. 3: Leave-one-out experiment.
figure 3

The wGC at different coverages for sample NA12878 and F scores for coverage 30× in nonrepetitive (top) and STR/VNTR regions (bottom). We ran PanGenie, BayesTyper, Paragraph, Platypus, GATK, GraphTyper and Giraffe to re-genotype all callset variants. Besides not applying any filter on the reported genotype qualities (‘all’), we additionally report genotyping statistics for PanGenie when using ‘high-gq’ filtering (genotype quality ≥200). Insertions and deletions include all respective variants in biallelic regions of the genome, whereas complex contains all variant alleles falling into regions with complex bubbles in the pangenome graph representation.

For biallelic SNPs in nonrepetitive regions, all methods reach excellent levels of genotype concordance (Extended Data Fig. 3) and F scores (Extended Data Fig. 7), with all F scores >0.95 at coverage 30×. For biallelic SNPs in repetitive regions, PanGenie still achieves an F score of 0.85, whereas the second-best tool GATK reaches only 0.75 (Extended Data Fig. 8). In repetitive regions, BayesTyper has the largest fraction of untyped SNPs of all tools, resulting in lowest recall of 0.6 for biallelic SNPs and 0.17 for SNPs inside of complex bubbles (Extended Data Fig. 6).

For small insertions and deletions, PanGenie (‘all’) outperforms the mapping-based approaches, in particular in STR/VNTR regions (wGC of 90.4% for insertions and 92.8% for deletions; Extended Data Fig. 4), where the best mapping-based tools (GATK) achieved a wGC of 83% and 86.9% for biallelic insertions and deletions, respectively, at coverage 30×. BayesTyper and PanGenie using ‘high-gq’ filtering achieved the highest wGCs, both >99% for nonrepetitve (Extended Data Fig. 3) and >97% for repetitive regions (Extended Data Fig. 4). For both tools, these good wGCs came at the expense of relatively few genotyped variants, with PanGenie being able to genotype slightly more. We also evaluated our results for SNPs, small and midsize variants using the GIAB high-confidence small variant callset48 as a ground truth (Supplementary Fig. 11).

Performance differences were largest for midsize and large variants (Fig. 3). PanGenie clearly outperforms the mapping-based approaches, especially in repeat regions. Here, PanGenie (‘all’) reaches wGCs for large SVs of 85%, 92% and 76% for biallelic insertions, biallelic deletions and variants in complex multiallelic regions, respectively, at coverage 30×. This is in contrast with the performance of the best mapping-based tool, achieving only 64%, 79% and 51%, respectively. BayesTyper reached high wGCs, but left 42%, 39% and 77% of these variants untyped, respectively. Using ‘high-gq’ filtering, PanGenie can reach concordances similar or superior to BayesTyper, while still being able to type much larger fractions of variants (Fig. 3). PanGenie’s genotyping performance for large SVs in repetitive regions is underscored also by the F score (Fig. 3): for large biallelic insertions, for example, PanGenie (‘all’) shows an F score of 0.7 whereas all other tools reach F scores <0.5. We additionally used the SVs contained in the syndip benchmark set49 to evaluate genotyping performance. Although the absolute results tend to be slightly worse for all tools, PanGenie again produced the most accurate genotype predictions and outperformed the other tools (Supplementary Fig. 12).

Runtimes

For each method, we measured the time required to produce genotypes given variants and raw, unaligned sequencing reads (Supplementary Table 5). The k-mer-based methods PanGenie and BayesTyper were much faster compared with the remaining, mapping-based methods that were combined with BWA50 for read mapping. PanGenie was fastest on all coverages, being between 3.97× and 4.6× faster than the fastest mapping-based approach at 30×.

Accuracy in the major histocompatibility complex

To evaluate the accuracy of all 14 haplotype-resolved assemblies in the human leukocyte antigen (HLA) region, we used HLA*ASM51 to determine assembly HLA types (Supplementary Table 6). HLA*ASM successfully processed 27 of 28 input assemblies and identified perfect (edit distance 0) HLA-G group matches52 for all classic HLA loci (HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1) in all processed input assemblies with one exception (HLA-DRB1 in NA19238), which was resolved by manual curation with minimap253. To verify the accuracy of the assembly HLA types, we integrated publicly available HLA genotype data for samples from the 1000 Genomes Project54,55,56 for HLA-A, -B, -C, -DQB1 and -DRB1, intersected these with the assembly-implied HLA types, and found perfect agreement in all evaluated cases (9 samples and 85 individual genotype comparisons; Supplementary Table 6).

We additionally evaluated PanGenie’s genotyping performance in the HLA region based on a ‘leave-one-out’ experiment for samples HG00731, NA12878 and NA24385, and observed high levels of wGCs across commonly studied HLA genes. Although the average wGC across all three samples was lowest for HLA-DRB1 and -C4 (58% and 79%, respectively in biallelic regions), it was between 98% and 100% for HLA-C, -DPA1, -DPB1 and -DRA in biallelic regions, and between 93% and 100% for all variants in complex regions (Extended Data Fig. 9).

Genotyping larger cohorts

The low runtime of PanGenie makes it well suited to genotype larger cohorts. As an example use case, we applied it to a set of 300 samples consisting of 100 randomly selected trios from the 1000 Genomes Project using high-coverage data46. We used our pangenome graph containing all 2 × 11 haplotypes to compute genotypes for all detected variants. Similar to the approach introduced previously4, we employed Mendelian consistency of the genotyped trios and the genotype quality reported by PanGenie to compute an integrated score for genotyping reliability of each variant. To this end, we defined different filters for a positive set with the most reliable (termed ‘strict’ set) and a negative set with the most unreliable variants. Using a machine-learning approach trained on these two subsets, we computed scores for all remaining variants, reflecting how confident we were about their genotyping, and used those to derive a ‘lenient’ set of variants containing 78% and 83% of all insertion SVs and deletion SVs, respectively (Supplementary Note and Supplementary Table 8). To confirm that the lenient set still offers very good genotyping performance, we analyzed allele frequencies and heterozygosities observed from the predicted genotypes for all variants in the lenient set and observed a relationship close to what is expected from the Hardy–Weinberg equilibrium (HWE; Fig. 4a and Methods). When testing for HWE, 90.7% of SV alleles inside of repeats, and 90.9% outside of repeats, showed no significant deviation. Furthermore, observed allele frequencies (AFs) across all 200 unrelated samples are in excellent agreement with coarse-grained AF estimates obtained from the 22 haplotype assemblies of our 11 input samples (Fig. 4b). Note that neither of these two measures, HWE and agreement in estimated AFs, has been used when defining the lenient set and therefore serves as independent evidence for PanGenie’s performance. PanGenie on average only took about 30 single-core CPU hours per sample.

Fig. 4: Genotyping large cohorts.
figure 4

a, The hexbin plots show the relationship between AFs and heterozygosities of the PanGenie genotypes for all 200 unrelated samples from the 1000 Genomes Project. The barplots show the one-dimensional distributions of both features (top: AF, right: heterozygosity). All large insertions (≥50 bp, n = 84,836) and deletions (≥50 bp, n = 34,290) contained in our lenient set were taken into account. b, Comparison of AFs computed from the PanGenie genotypes for 200 samples and the corresponding AFs observed in the 11 assembly samples from which variants were called. As in a, we consider all large insertions (≥50 bp, n = 84,836) and deletions (≥50 bp, n = 34,290) contained in our lenient set. In the boxplots, lower and upper limits of the box represent the lower and upper quartiles (Q1 and Q3); the median is marked in yellow. Lower and upper whiskers are defined as Q1 − 1.5 (Q3–Q1) and Q3 + 1.5 (Q3–Q1), respectively, and outliers are marked by dots. c, Length distribution of the number of common insertions and deletions (AF ≥ 5%) contained in the PanGenie lenient callset and gnomAD.

Our callset contains 209 of 250 medically relevant SVs reported by GIAB57. We observed that 174 medically relevant SVs were contained in our lenient set, of which 119 were part of our strictly filtered set. We show the score distribution for these variants as well as AFs and heterozygosities observed across all 200 unrelated samples for the lenient set in Extended Data Fig. 10.

Comparison to gnomAD

We compared the 119,126 SV alleles genotypable by PanGenie (lenient set) with the SVs that are part of the Genome Aggregation Database (gnomAD)9; gnomAD contains SVs collected across 14,891 genomes from different populations9. Requiring a reciprocal overlap of at least 50% or a start, end and variant length deviation of <200 bp, we found that both callsets had 34,468 variants in common (Supplementary Note), whereas 84,658 (71%) of our SV alleles were not contained in gnomAD. This finding is consistent with previous observations that short-read-based SV detection misses most SVs35. Of those 84,658 SVs, around 80% were located in STR/VNTR regions. Furthermore, 43% of these 84,658 variants were common variants with AF ≥ 0.05 across all genotyped samples. The length distribution of common insertions and deletions (Fig. 4c) demonstrates the ability of PanGenie to genotype variants in regions inaccessible by callers based on short-read data alone, and shows its particular impact when genotyping insertions and shorter deletions.

LD analysis

Based on the genotypes obtained across all 200 unrelated samples (Genotyping larger cohorts), we performed an LD analysis (Methods). We selected all SNPs from our callset that were contained at least five times in the NHGRI-EBI GWAS (genome-wide association studies) catalog58. For each resulting variant, we calculated LD, comparing it with all our callset variants within a window of 1 Mb.

For 147 of 3,404 disease-associated SNPs from NHGRI-EBI, we found nearby structural variants that were in LD (r2 ≥ 0.8; see Supplementary Table 9 for all hits with r2 ≥ 0.9). An insertion of length 129 bp located at position 133,278,856 on chromosome 9, close to the ABO gene, looked particularly interesting (Fig. 5). It is in LD with six GWAS variants (rs2519093, rs495828, rs507666, rs579459, rs635634 and rs651007) which are related to low-density lipoprotein-cholesterol levels58. Of note, neither the GWAS SNPs nor the insertions are in LD with blood-type markers present in our callset (rs8176747 (ref. 59), rs8176746 (ref. 60), rs8176743 (ref. 59), rs8176742 (ref. 61), rs8176741 (ref. 61), rs8176740 (ref. 61), rs7853989 (ref. 61), rs1053878 (ref. 61), rs8176720 (ref. 61) and rs8176719 (ref. 60)). The insertion is located in a long tandem repeat (LTR10B2 for ERV1 endogenous retrovirus). Analysis of the insertion sequence revealed that it contains three exact copies of a 43-bp sequence (Methods and Supplementary Fig. 19), which appears with copy number 1 in the reference genome. We thus conclude that this insertion is a repeat expansion, leading to four copies of this repeated subsequence. A comparison with nonhuman primate genomes62,63 shows that the 43-mer occurs as two copies in gorilla (Gorilla gorilla), but is a single copy in chimpanzee (Pan troglodytes), bonobo (Pan paniscus) and the Sumatran orangutan (Pongo abelii). This suggests independent expansion events or incomplete lineage sorting in humans and gorillas.

Fig. 5: LD analysis.
figure 5

We calculated the LD for GWAS variants and SVs that were part of our assembly-based callset. We detected an insertion (marked in blue) close to the ABO gene which was in LD with six GWAS SNPs. The plots show all callset variants in this region; GWAS variants are annotated with their name. Those variants colored in red correspond to blood-type markers.

Another interesting association was an intronic insertion of length 322 bp located at position 28,264,365 on chromosome 12, inside the CCDC91 gene close to a regulatory element reported by ENCODE64 (Supplementary Fig. 20). It was in LD with two GWAS variants (rs10843151 and rs11049566), which are both linked to body fat58. One of these SNPs, rs10843151, is in perfect LD with many other variants in this region, which suggests that it is probably embedded in the same haplotype block. Such perfect LD provides further evidence that PanGenie is accurately genotyping new insertions within short-read sequencing data.

Discussion

We presented an algorithm, PanGenie, that can leverage the long-range haplotype information inherent to a panel of assembled haplotypes in combination with read k-mer counts for genotyping an uncharacterized sample. Although we generated such pangenome reference panels from haplotype-resolved assemblies for the present study, generating these panels was not the main focus of this report and PanGenie is not restricted to panels created in this way. In fact, it can be applied to any acyclic genome graph with fully phased path information.

Traditionally, longer variants are especially difficult to genotype based on short reads only, because such variants are often located in repetitive or duplicated regions of the genome, leading to the difficulty of unambiguously aligning the reads. Approaches based on k-mers additionally lack connectivity information contained in the reads because they do not use the order of k-mers stemming from the same read or read pair. PanGenie overcomes these limitations of short reads because it incorporates long-range haplotype information inherent to the pangenome reference panel that it uses. In comparison to BayesTyper, a graph-based genotyper relying on k-mers, PanGenie genotypes a large fraction of variants not typable by the former. For SVs and indels, PanGenie clearly outperforms mapping-based approaches, which require alignments of reads to a reference genome. Compared with Paragraph, a graph-based method relying on such read alignments, PanGenie produces better genotyping results while additionally providing the ability to jointly genotype SNPs, indels and SVs. Our approach was faster than the other methods, especially when comparing with the mapping-based approaches. The fast runtime makes PanGenie well suited for genotyping larger cohorts, providing the basis for population genetic analysis. In the present study, we have presented an application to a cohort of 300 samples that suggests that SVs in LD with disease-associated SNPs may functionally underlie these associations.

We have hence presented a method that is both fast and leverages a haplotype-resolved pangenome reference to enable genotyping of otherwise inaccessible variants. Although we have tested it only on human data so far, PanGenie can be applied to any diploid genome once corresponding panels of high-quality phased assemblies become available for other species. Still, some limitations remain. Although PanGenie improves results over other methods in repetitive regions of the genome, genotyping within these remains challenging. Although biallelic variants are less problematic, more complex cases such as segmental duplications, α-satellite repeats or acrocentric DNA are hard to access because of the lack of unique k-mers, but also because such regions are still difficult to assemble. Once a panel of telomere-to-telomere assemblies becomes available, future experiments can clarify which additional loci are amenable to genotyping with PanGenie.

Our model assumes that the unknown haplotypes of the sample to be genotyped are mosaics of the given panel haplotypes. Therefore, currently it cannot be used to genotype rare variants that are present only in the sample, but in none of the other haplotypes. We believe that there are exciting opportunities to develop methods to discover variation that our approach has not captured because it was not present in the reference panel. For example, one could either filter the reads for as yet ‘unexplained’ k-mers and use those for the discovery of rare variants, or utilize PanGenie’s output as a personalized pangenome reference graph to map reads to.

The runtime of our method depends on the number of input haplotypes, because we defined a hidden state for each possible pair of haplotypes that can be selected for each bubble. Therefore, additional engineering would be required to use much larger panels, which could be approached similarly to how statistical phasing packages prune the solution space and/or proceed iteratively65,66,67. Such techniques could also pave the way toward a version of PanGenie for polyploid genomes, which would be prohibitively slow when implemented without such additional optimization.

In summary, we have presented a method that, in combination with high-quality phased reference assemblies, offers a powerful approach for genotyping and association studies, on ever-larger cohorts, for all variant types—including those currently understudied due to technical limitations.

Methods

Sequencing data

We used publicly available sequencing data from the GIAB consortium45, 1000 Genomes Project high-coverage data46 and Human Genome Structural Variation Consortium (HGSVC)4. All datasets include only samples consented for public dissemination of the full genomes.

Statistics and reproducibility

For generating the assemblies, we used all 14 samples for which PacBio HiFi-data were available. For variant calling, the three children (HG00733, HG00514 and NA19240) were used for quality control and were not included in the final callsets/graphs, because they do not provide any additional information for genotyping. Code and pipelines to reproduce our analysis are available on Zenodo68,69.

Variant calling and pangenome construction

Assemblies

Fully phased assemblies for 14 samples (HG00731, HG00732, HG00733, HG00512, HG00513, HG00514, NA19238, NA19239, NA19240, NA12878, HG02818, HG03125, NA24385 and HG03486) were generated using a development version of the PGAS pipeline2,4 (parameter settings v.13). Compared with the previous PGAS production release (v.12 used in the HGSVC project4), this PGAS development update included a new version of the SaaRclust package70 (v.6cb8c96), controlled for adapter contamination in the input HiFi reads (reimplementation of the process published at https://github.com/sheinasim/HiFiAdapterFilt), and employed hifiasm71 v.0.15.2 as default assembler. In direct comparison to the previously used HiFi assembler Peregrine72, hifiasm substantially reduces the number of sequence collapses, leading to overall more correct assemblies (see the evaluation in Cheng et al.71). We provide assembly statistics in Supplementary Table 1.

Variant calling

We used haplotype-resolved assemblies of all 14 samples to call variants (Extended Data Fig. 1a). The three child samples (HG00733, HG00514 and NA19240) were used only for quality control and filtering, and thus were not part of our final callset/graph. For each sample, we separately mapped contigs of each haplotype (Supplementary Table 1) to the reference genome (GRCh38). This was done using minimap2 (ref. 53) (v.2.18) with parameters -cx asm20 -m 10000 -z 10000,50 -r 50000 --end-bonus=100 -O 5,56 -E 4,1 -B 5 --cs. In the next step, we called variants on each haplotype of all autosomes and chromosome X using paftools (https://github.com/lh3/minimap2/tree/master/misc) with default parameters. We generated a biallelic, VCF file containing variant calls made across all 11 unrelated samples (Extended Data Fig. 1a). If a region was not covered by any contig alignment in a sample, or the sample had multiple overlapping contig alignments, we set all its genotypes in this region to missing (“./.”), because it is unclear what the true genotype alleles are in this case. Furthermore, we removed variants from our callset for which >20% of the samples have missing genotype information. The remaining regions covered 91.8% (2.8 Gbp) of chromosomes 1–22 and chromosome X. Of the 8.2% of regions not covered, 48.3% were gaps in GRCh38 and 24.0% were centromeres.

We computed the Mendelian consistency for the Puerto Rican (HG00731, HG00732, HG00733), Chinese (HG00512, HG00513, HG00514) and Yoruban (NA19238, NA19239, NA19240) trios and observed that 97.9%, 96.8% and 97.6% of all variants were consistent with Mendelian laws, respectively. We removed a variant from our callset if there was a Mendelian conflict in at least one of the three trios. We show the number of variants in our final callset and the intermediate stages of variant calling in the first three columns of Supplementary Table 2.

Pangenome construction

Given the filtered variant calls, our goal was to construct an acyclic and directed graph by inserting the variants of all haplotypes into the linear reference genome. Variants produce bubbles in the graph with branches that define the corresponding alleles. The input haplotypes can be represented as paths through the resulting pangenome. When constructing the graph, we represent sets of variants overlapping across haplotypes as a single bubble, with potentially multiple branches reflecting all the allele sequences observed in the haplotypes in the respective genomic region (Extended Data Fig. 1b). The total number of bubbles in the resulting graph is presented in the last column of Supplementary Table 2. We represent the pangenome in terms of a fully phased, multisample VCF file that contains one entry for each bubble in the graph (Extended Data Fig. 1b). At each site, the number of branches of the bubble is limited by the number of input haplotype sequences and the genotypes of each sample define two paths through this graph, corresponding to the respective haplotypes. We keep track of which individual input variants contribute to each bubble in the graph, so that we can convert our pangenome graph representation back to the set of input variants. In this way, we can convert genotypes computed by a genotyper for all these bubbles to genotypes for each individual callset variants.

PanGenie’s genotyping algorithm

We define a hidden Markov model that can be used to compute the two most likely haplotype sequences of a given sample based on known haplotype paths and the sample reads. The new haplotype sequences are combinations of the existing paths through the graph and are computed based on the copy numbers of unique k-mers observed in the sequencing reads provided for the sample to be genotyped.

Identifying unique k-mers

Sets of bubbles that are less than the k-mer size apart (we use k = 31) are combined and treated as a single bubble. The alleles corresponding to such a combined bubble are defined by the haplotype paths in the respective region. For each bubble position v, we determine a set of k-mers, kmersv, that uniquely characterize the region. This is done by counting all k-mers along haplotype paths in the pangenome graph using Jellyfish73 (v.2.2.10), and then determining a set of k-mers for each bubble that occurs at most once within a single allele sequence and are not found anywhere outside the variant bubble. We additionally counted all k-mers of the graph in the sequencing reads. This allows us to compute the mean k-mer coverage of the data, which we use later to compute emission probabilities (see Observable states).

Hidden states and transitions

We assume being given N haplotype paths Hi, i = 1, …, N, through the graph. Furthermore, for each bubble v, v = 1, …, M, we are given a vector of k-mers, kmersv, that uniquely characterize the alleles of a bubble. We assume some (arbitrary) order of the elements in kmersv and refer to the ith k-mer as kmersv[i]. In addition, we are given sequencing data of the sample to be genotyped and corresponding k-mer counts for all k-mers in kmersv. For each bubble v, we define a set of hidden states \(\eta _v = \left\{ {H_{v,i,j}|i,j \le N} \right\}\) which contain a state for each possible pair of the N given haplotype paths in the graph. Each such state Hv,i,j induces an assignment of copy numbers to all k-mers in kmersv. We define a vector av,i,j such that the kth position contains the copy number assigned to the kth k-mer in kmersv:

$${\mathbf{a}}_{v,i,j}\left[ k \right] = \left\{ {\begin{array}{*{20}{l}} {0\quad kmers_v\left[ k \right] \notin H_i \cup H_j} \\ {1\quad kmers_v\left[ k \right] \in H_i\backslash H_j} \\ {1\quad kmers_v\left[ k \right] \in H_j\backslash H_i} \\ {2\quad kmers_v\left[ k \right] \in H_i \cap H_j}. \end{array}\quad } \right.\forall k = 1, \ldots ,\left| {kmers_v} \right|$$

The idea here is that we expect to see copy number 2 for all k-mers occurring on both haplotype paths. In case only one of the haplotypes contains a k-mer, its copy number must be 1 and k-mers that do not appear in any of the two paths must have copy number 0. From each state Hv,i,j in ηv that corresponds to bubble position v, there is a transition to each state corresponding to the next position, v + 1. In addition, there is a start state, from which there is a transition to each state of the first bubble, and an end state, to which there is a transition from each state that corresponds to the last bubble.

Transition probabilities

Transition probabilities are computed following the Li–Stephens model37. Given a recombination rate r, the effective population size Ne and the distance x (in basepairs) between two ascending bubbles v − 1 and v we define:

$$d = x \times \frac{1}{{1,000,000}} \times 4rN_e.$$

We compute the Li–Stephens transition probabilities as:

$$p_r = \left( {1 - {\mathrm{exp}}\left( { - \frac{d}{N}} \right)} \right) \times \frac{1}{N}$$
$$q_r = {\mathrm{exp}}\left( { - \frac{d}{N}} \right) + p_r.$$

Finally, the transition probability from state Hv–1,k,l to state Hv,i,j is computed as shown below:

$$P\left( {H_{v,i,j}{{{\mathrm{|}}}}H_{v-1,k,l}} \right) = \left\{ {\begin{array}{*{20}{c}} {q_r \times q_r\quad i = k,j = l} \\ {q_r \times p_r\quad i = k,j \ne l} \\ {q_r \times p_r\quad i \ne k,j = l} \\ {p_r \times p_r\quad i \ne k,j \ne l}. \end{array}} \right.$$

Observable states

Each hidden state Hv,i,j in ηv outputs a count for each k-mer in kmersv. Let Ov be a vector of length |kmersv| for bubble v such that Ov[k] contains the observed k-mer count of the kth k-mer in the sequencing reads. To define the emission probabilities, we first need to model the distribution of k-mer counts for each copy number, \(P\left( {{\mathbf{O}}_v\left[ k \right]|a_{v,i,j}\left[ k \right] = c} \right),c = 0,1,2\). For copy number 2, we use a Poisson distribution with mean λ which we set to the mean k-mer coverage that we compute from the k-mer counts of all graph k-mers. Similarly, we approximate the k-mer count distribution for copy number 1 in terms of a Poisson distribution with mean \(\frac{\lambda }{2}\). For copy number 0, we need to model the erroneous k-mers that arise from sequencing errors. This is done using a geometric distribution, the parameter p of which we choose based on the mean k-mer coverage. Finally, we compute the emission probability for a given state and given observed read k-mer counts as shown below, making the assumption that the k-mer counts are independent:

$$P\left( {{\mathbf{O}}_v|H_{v,i,j}} \right) = \mathop {\prod }\limits_{l = 1}^{\left| {kmers_v} \right|} P\left( {{\mathbf{O}}_v\left[ l \right]|{\mathbf{a}}_{v,i,j}\left[ l \right]} \right).$$

Genotypes and haplotypes

In this model, genotypes correspond to pairs of given haplotype paths at each bubble position. Genotype likelihoods can be computed using the forward–backward algorithm.

Forward–backward algorithm

The initial distribution of our HMM is such that we assign probability 1 to the start state and 0 to all others. Forward probabilities αv() are computed in the following way:

$$\alpha _0\left( {{\mathrm{start}}} \right) = 1.$$

For states corresponding to bubbles v = 1, …, M, the forward probabilities are computed as shown below. The set of observed k-mer counts at position v is given by Ov:

$$\begin{array}{l}\alpha _v\left( {H_{v,i,j}} \right) =\\ \mathop {\sum}\nolimits_{H_{v - 1,s,t} \in \eta _{v - 1}} {\alpha _{v - 1}\left( {H_{v - 1,s,t}} \right) \times P\left( {H_{v,i,j}|H_{v - 1,s,t}} \right) \times P\left( {{\mathbf{O}}_v|H_{v,i,j}} \right)\forall i,j}.\end{array}$$

The transition probabilities are computed as described above, except for transitions from the start state to all states in the first column, which we assume to have uniform probabilities. Backward probabilities are computed in a similar manner. We set:

$$\beta _M\left( {{\mathrm{end}}} \right) = 1.$$

For \(v = 1,...,M - 1\), we compute them as:

$$\begin{array}{l}\beta _v\left( {H_{v,i,j}} \right) =\\ \mathop {\sum}\nolimits_{H_{v + 1,s,t} \in \eta _{v + 1}} {\beta _{v + 1}\left( {H_{v + 1,s,t}} \right) \times P\left( {H_{v + 1,s,t}|H_{v,i,j}} \right) \times P\left( {{\mathbf{O}}_{v + 1}|H_{v + 1,s,t}} \right)\forall i,j}.\end{array}$$

Finally, posterior probabilities for the states can be computed:

$$P\left( {H_{v,i,j}|{\mathbf{O}_1,\mathbf{O}_2},\ldots,\mathbf{O}_M} \right) = \frac{{\alpha _v\left( {H_{v,i,j}} \right) \times \beta _v\left( {H_{v,i,j}} \right)}}{{\mathop {\sum}\nolimits_{h \in \eta _v} {\alpha _v\left( h \right)\beta _v\left( h \right)} }}.$$

Several states at a bubble position v can correspond to the same genotype, because different paths can cover the same allele. Also, the alleles in a genotype are unordered, therefore states Hv,i,j and Hv,j,i always lead to the same genotype. To compute genotype likelihoods, we sum up the posterior probabilities for all states that correspond to the same genotype. In this way, we can compute genotype likelihoods for all genotypes at a bubble position, based on which a genotype prediction can be made.

Comparison to existing genotyping methods

We conducted a ‘leave-one-out’ experiment to mimic a realistic scenario in which we genotyped variants detected from haplotype-resolved assemblies of a set of known samples in a new, unknown sample. We collected variants called across all but one sample and used them as input for genotyping the left-out sample (we refer to this set as known variants in the following). We used the set of variants called from the assemblies of the left-out sample for evaluation (evaluation variants). We ran this experiment twice, removing samples NA12878 and NA24385, respectively. As input for PanGenie (commit 1f3d2d2 (ref. 68)), BayesTyper (v.v1.5) and Paragraph (v.2a), we constructed a pangenome graph representation based on the known variants in the same way as described in Constructing a pangenome reference. We kept track of which variant alleles each resulting bubble consists of, so that genotypes derived for all bubbles can later be converted back to the original variant representation. For the other genotypers tested (GATK 4.1.3.0, Platypus 0.8.1, GraphTyper 2.7.1 and Giraffe v.1.30.0), we directly used the set of known variants as input, without generating the graph representation first, because we observed that these tools could better handle variants represented in this way. As a result of running all genotypers, we had one VCF file per tool containing genotypes for all our known variants. We used the evaluation variants to evaluate the genotype predictions of all tools. Extended Data Fig. 2 provides an illustration of the leave-one-out experiment.

Note that re-genotyping a set of known variants in a new sample is different from variant detection. Variants present in the new sample that have not been seen in the callset samples can thus not be genotyped because genotypers can genotype only variants that they have seen before. We provide the number of unique variants of each panel sample in Supplementary Table 3. Most genotyping metrics (weighted genotyping concordance, adjusted precision/recall) explained in detail in Supplementary Note exclude these variants.

Besides re-genotyping our callset variants, we additionally ran GATK and Platypus in discovery mode to detect and genotype their own variants. We evaluated the results by computing precision/recall based on our ground-truth variants (Supplementary Figs. 10 and 11).

Evaluation regions

Some genomic regions are more difficult to genotype than others, such as SVs that tend to be located in repetitive and more complex regions of the genome. Therefore, we looked at variants located inside and outside of STR/VNTR regions which we obtained from the UCSC genome browser (Simple Repeats Track for GRCh38)47. In addition, we classified the genome into ‘complex’ and ‘biallelic’ regions based on the bubble structure of our pangenome graph: all variants located inside of complex bubbles, that is, bubbles with more than two branches, fell into the first category, and the remaining regions into the second. Consider Extended Data Fig. 1 for an example: the first and third bubbles are complex, thus all variants contained inside these bubbles fall into the category ‘complex’. The second bubble is biallelic and therefore the corresponding SNP variant is considered ‘biallelic’.

For our ‘leave-one-out’ experiment for sample NA12878, we show the number of variants falling into the different categories in Fig. 3, Extended Data Figs. 38 and Supplementary Table 4. It can be observed that most complex bubbles are located inside STR/VNTR regions (Supplementary Table 4). In addition, more than half of all midsize and large variants are located in these repetitive regions.

Genotyping larger cohorts

We randomly selected 100 trios (20 of each superpopulation: AFR, AMR, EAS, EUR, South Asian (SAS)) that are part of the 1000 Genomes Project and genotyped all our variant calls across these 300 samples. We used our pangenome graph representation containing all 11 assembly samples as an input for PanGenie, genotyped all bubbles and later converted the resulting genotypes back to obtain genotypes for the individual callset variants. Our callset might contain variants that are difficult to genotype correctly. To identify a high-quality subset of variants that we could reliably genotype, we defined different filters based on the predicted genotypes that we list below. One metric used for defining filters is the Mendelian consistency. We computed the Mendelian consistency for each variant by counting the number of trios for which the predicted genotypes are consistent with Mendelian laws. We considered only trios with at least two different genotypes, that is, we excluded a trio if all three genotypes were 0/0, 0/1 or 1/1. This resulted in a more strict definition of Mendelian consistency (Supplementary Fig. 14). In addition to genotyping all 300 trio samples, we also genotyped all 11 panel samples using the full input panel. Genotyping samples that are also in the panel helped us to find cases where panel haplotypes and reads disagreed and thus was another useful filter criterion. We defined filters as follows: (1) ac0-fail: a variant fails this filter if it was genotyped with AF 0.0 across all samples; (2) mendel-fail: a variant fails this filter if the fraction of Mendelian consistent trios was <90% (our definition of Mendelian consistency excludes all trios with all 0/0, all 0/1 or all 1/1 genotypes and only considers such with at least two different genotypes); (3) gq-fail: a variant failed this filter if it was genotyped with a genotype quality <200 in >5 samples; (4) self-fail: in addition to the 100 trios, we also genotyped the 11 panel samples; a variant failed this filter if the genotype concordance across all panel samples was <90%; and (5) non-ref-fail: the variant was genotyped as 0/0 across all panel samples.

For all combinations of filters, we show the number of large deletions and large insertions in each category in Supplementary Fig. 15. To define a strict, high-quality set of variants, we selected all that passed all five filters (Supplementary Table 7).

For quality control, we analyzed allele frequencies and the fraction of heterozygous genotypes for all variants contained in our unfiltered and strict sets (Supplementary Figs. 16 and 17). In addition, we used VCFTools74 (v.0.1.16) to test the genotype predictions of all variants typed with an AF > 0.0 for conformance with the HWE and corrected for multiple hypothesis testing by applying the Benjamini–Hochberg correction75 (α = 0.05).

In addition to defining a strict set, we constructed a more lenient set for our SV calls (≥50 bp) using a machine-learning approach based on support vector regression. We used the strict set as a positive set and defined a negative set consisting of all variants that were typed with an AF > 0.0 and failed at least three filters. For large insertions, the negative set contained 2,611 variants, and for large deletions 1,125. The model then predicted scores between −1 (worst) and 1 (best) for all variants that were in neither the positive nor the negative set. We show the distribution of scores for our variant calls in Supplementary Fig. 18. The lenient set was then constructed by adding all variants with a score >−0.5 to our strict SV set (Supplementary Table 8 and Supplementary Fig. 18).

LD analysis

We performed an LD analysis based on the genotypes we obtained across all 200 unrelated samples. We used gatk4 (ref. 16) (v.4.1.9.0) to annotate the calls with variant IDs from dbSNP (build 154)76. We selected variants that are contained in the NHGRI-EBU GWAS catalog58 and used plink77 (v.190b618) to determine SVs that are in LD with the GWAS variants (r2 ≥ 0.8). For comparison with other nonhuman primates, human genomic sequence (GRCh38; chr9:133278657-133279020) corresponding to 50 bp flanking the annotated LTR10B2 VNTR was used to retrieve the corresponding orthologous sequence from primate genomes62 or HiFi PacBio sequence data from nonhuman primates63. Multiple sequence alignments were constructed using MAFFT and manually inspected for VNTR copy number.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.