Abstract
Although significant strides have been made in understanding pharmacogenetics (PGx) and gene-drug interactions, population-level PGx variation remains poorly characterized. This study aims to comprehensively profile global star allele (haplotype pattern) and phenotype frequencies in 58 pharmacogenes associated with drug absorption, distribution, metabolism, and excretion. PyPGx, a star-allele calling tool, was employed to identify star alleles within high-coverage whole genome sequencing (WGS) data from the 1000 Genomes Project (N = 2504; 26 global populations). This process involved detecting structural variants (SVs), such as gene deletions, duplications, and hybrids, as well as single nucleotide variants and insertion-deletion variants. The majority of our PyPGx calls for star allele and phenotype frequencies aligned with the Pharmacogenomics Knowledge Base, although notable population-specific frequencies differed at least twofold. Validation efforts confirmed known SVs while uncovering several novel SVs currently undefined as star alleles. Additionally, we identified 210 small nucleotide variants associated with severe functional consequences that are not defined as star alleles. The study serves as a valuable resource, providing updated population-level star allele and phenotype frequencies while incorporating SVs. It also highlights the growing potential of cost-effective WGS for PGx genotyping, offering insights to improve tailored drug therapies across diverse populations.
Introduction
Pharmacogenetics (PGx) explores links between genetic variations and drug responses, offering potential optimization of medication selection, dosage, and efficacy1. It can also help mitigate adverse drug events, which encompass potentially harmful reactions or side effects and are a primary concern for global health systems2. There are growing efforts to characterize global PGx variation to address the current lack of reliable haplotype frequencies across diverse populations.
The "star allele" nomenclature system is used in PGx to standardize genotypes (e.g., CYP2D6*4) and link them to predicted clinical phenotypes (e.g., poor metabolizer)3. There are exceptions to this nomenclature; for example, the G6PD gene uses its own system4. The principle remains to match genetic data (diplotype calls such as CYP2D6*2/*4) with predicted phenotypes based on translation tables stored in PGx databases. Despite its utility, accurate star allele calling is challenged by rare variants5 and structural variants (SVs) like gene deletions, duplications, and hybrids. Previous genotyping methods (TaqMan assays, Sanger sequencing, etc.) are time consuming, heavily biased toward the detection of known variants, and struggle with SV detection and interpretation6. Advancements in next-generation sequencing, particularly whole genome sequencing (WGS), together with a rapid decrease in sequencing cost to as low as $100 per genome7, provide deeper insights into genetic and PGx variation. Computational tools are increasingly becoming available to interpret PGx variation and its functional implications8. However, challenges persist due to non-linearity in sequenced reads, resulting from the high sequence homology among functional genes and nonfunctional pseudogenes, which adversely affects sequence alignment9,10.
Previous studies on population-level PGx have enriched our comprehension of genetic variability and underscored the significance of considering varying frequencies across diverse populations11,12,13,14. However, these studies have often concentrated on only a few genes or a single gene, such as cytochrome P450 enzymes (CYPs)15 like CYP2D6, renowned for its high polymorphism, complex SVs, neighboring pseudogene (CYP2D7), and pivotal involvement in drug metabolism16. Another limitation of past research lies in its underpowered analyses due to smaller and less diverse sample sets. For instance, the Dutch Pharmacogenetics Working Group utilized WGS data to identify SVs within CYP2D6, yet with a relatively small sample size (N = 547)17. Moreover, few studies have examined the role of SVs in pharmacogenes18, and few computational methods exist for examining SVs within pharmacogenes19. Consequently, further work is needed to explore the impact of SVs on pharmacogenes and to enhance population-level PGx profiles.
In this study, we extended the work of PyPGx20, a Python package that can predict PGx genotypes and phenotypes from next-generation sequencing data by detecting SVs using a machine learning-based approach. We then used PyPGx to characterize PGx variation and phenotype landscapes at the population level, analyzing data from 2504 high-coverage WGS samples obtained from the 1000 Genomes Project (1KGP) encompassing five biogeographical populations: African (AFR), American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS). 1KGP is an international effort to characterize global human genetic variation to improve our understanding of genetic contributors to human health and disease21,22. Notably, 1KGP relies on short-read sequencing, which makes it difficult to identify the breakpoints and relative orientation of SVs23; long-read sequencing would be needed to confirm the sequence of pharmacogenes. PyPGx does not identify breakpoints; instead, its machine learning-based approach estimates copy number and uses the copy number signal to detect SVs. This study stands as a valuable resource, providing updated and precise population-level haplotype and phenotype information, while incorporating SVs.
Materials and methods
High-coverage WGS data
We downloaded publicly available, high-coverage WGS data (mean = 30x) for unrelated, healthy 1KGP samples (N = 2504) from Phase III generated by the New York Genome Center24. There are 26 subpopulations that are grouped into five global populations (Supplementary Table 1). In accordance with recommendations25, we continue to use these five global populations, which have been grouped by greater genetic similarity in previous work26. We will refer to the populations as "1KGP" followed by the referenced population (e.g., 1KGP-AFR). Briefly, we obtained FASTQ files from the European Nucleotide Archive (study accession: PRJEB31736) using the SRA Toolkit (v3.0.0; https://github.com/ncbi/sra-tools) to include only sequence reads that had been aligned to PGx regions of interest. Next, we re-aligned those reads to the Genome Reference Consortium Human Build 37 (GRCh37) reference genome using the "ngs-fq2bam" command from the fuc package20.
Star allele identification
From the WGS data we inferred star alleles in 58 PGx genes (Table 1) using the PyPGx package (v0.16.0)20, whose algorithm follows a modified version of the Stargazer genotyping pipeline27,28. The PyPGx pipeline starts by statistically phasing observed small variants (i.e., SNVs and indels) into two haplotypes per individual, which are then matched to candidate star alleles by cross-referencing against the target gene's haplotype translation table. When a given haplotype produces multiple candidate star alleles, PyPGx sorts them by priority to pick the final allele to report. The sorting is performed as follows (in decreasing priority): (1) allele function (e.g., "No Function" > "Normal Function"), (2) number of core variants (e.g., three SNVs > one SNV), (3) number of core variants that impact protein coding (e.g., two missense variants > one missense variant plus one intron variant), and (4) reference allele status (e.g., non-reference allele with two SNVs > reference allele with two SNVs). By default, PyPGx uses the Beagle program29 for statistical phasing with the entire 1KGP haplotype panel21 as the reference panel. Next, per-base copy number is computed from read depth data through intra-sample normalization using a control gene as anchor. SVs are then detected from copy number data using a pre-trained support vector machine (SVM)-based classifier (see "Structural variation detection"). This version of PyPGx (v0.16.0) supports genotyping of 59 genes in total; briefly, these genes were selected based on their drug metabolism/response role (CPIC, FDA), allelic variation catalogs (PharmVar, PharmGKB, DGV), genotyping reference materials (GeT-RM), and overlap with tools like PGRNseq and Stargazer20. All the genes are listed in Table 1, except for GSTT1, which was excluded from the analysis because it is located on an alternative contig (chr1_KI270762v1_alt) in the human genome build GRCh38, and both GRCh37 and GRCh38 builds were required for training the SVM-based classifier. PyPGx outputs copy number and allele fraction profiles to allow users to manually inspect the quality of SV calls. Finally, candidate star alleles and SV results are combined to inform the final diplotype assignment (e.g., CYP2D6*1/*2).
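The four-level candidate ranking described above can be sketched as a tuple sort key. This is a minimal illustration and not PyPGx's actual code: the `Candidate` class, its field names, and the placement of "Decreased Function" between the two function labels quoted in the text are all assumptions made for the example.

```python
from dataclasses import dataclass

# Lower rank = higher priority. Only "No Function" > "Normal Function" comes
# from the text; the position of "Decreased Function" is an assumption.
FUNCTION_RANK = {"No Function": 0, "Decreased Function": 1, "Normal Function": 2}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate star allele of a haplotype."""
    name: str
    function: str
    n_core_variants: int
    n_coding_variants: int  # core variants that impact protein coding
    is_reference: bool


def sort_key(c):
    """Build a key so that the smallest key is the allele to report."""
    return (
        FUNCTION_RANK.get(c.function, len(FUNCTION_RANK)),  # (1) allele function
        -c.n_core_variants,                                 # (2) more core variants first
        -c.n_coding_variants,                               # (3) more coding variants first
        c.is_reference,                                     # (4) non-reference first
    )


def pick_allele(candidates):
    """Return the highest-priority candidate."""
    return min(candidates, key=sort_key)
```

Because Python compares tuples element by element, a later criterion is consulted only when all earlier ones tie, which matches the "decreasing priority" wording.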
To facilitate parallel computing, we divided the samples into ten non-overlapping batches of N = 250, except for the last one with N = 254. For every batch we then ran the "run-ngs-pipeline" command from PyPGx for each target gene with three input files: (1) a multi-sample VCF file, (2) a depth of coverage file, and (3) a control statistics file. These input files were created for every batch of BAM files using the "create-input-vcf", "prepare-depth-of-coverage", and "compute-control-statistics" commands from PyPGx, respectively. All of the genotyping analyses shown in the results section were conducted using the VDR gene as the control locus.
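The batching scheme can be reproduced with a few lines of Python. This is only a sketch of one way to obtain nine batches of 250 and a final batch of 254 from 2504 samples; the function name and the assumption that samples are split in their listed order are illustrative and are not taken from the study's code.

```python
def make_batches(samples, n_batches=10):
    """Split samples into non-overlapping batches; the last one absorbs the remainder."""
    size = len(samples) // n_batches
    batches = [samples[i * size:(i + 1) * size] for i in range(n_batches - 1)]
    batches.append(samples[(n_batches - 1) * size:])
    return batches


# Hypothetical sample identifiers standing in for the 2504 1KGP samples.
sample_ids = [f"sample_{i:04d}" for i in range(2504)]
batch_sizes = [len(b) for b in make_batches(sample_ids)]
```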
Comparison of haplotype and phenotype frequencies
We compared our calculated frequencies of select genes across ancestral populations to the Pharmacogenomics Knowledge Base (PharmGKB)30, in which frequencies were estimated using the formula for Hardy-Weinberg equilibrium based on reported allele frequencies. PharmGKB's grouping system is based on genetic similarity using data from the 1KGP and the Human Genome Diversity Project31. Differences in the population grouping between 1KGP and PharmGKB can introduce biases and may limit comparisons, as PharmGKB has finer stratification. We compared our haplotype frequencies of 12 of the most polymorphic genes (Fig. 1) to haplotype frequencies within PharmGKB; seven of the twelve genes were available: CYP2B6, CYP2C9, CYP2C19, CYP2D6, DPYD, G6PD, and TPMT. Ratios (our values over the literature values) were calculated, and values that differed by at least twofold in either direction were highlighted. PharmGKB has nine biogeographical groups within the dataset31. 1KGP-AFR populations were compared to PharmGKB-Sub-Saharan African (SSA); 1KGP-AMR was compared to PharmGKB-American (AME) when available; 1KGP-EAS was compared to PharmGKB-East Asian (EAS); 1KGP-EUR was compared to PharmGKB-European (EUR) populations; 1KGP-SAS was compared to PharmGKB-Central/South Asian (SAS). PyPGx produces phenotypes for 16 genes (see "Phenotype prediction"). As above, we compared 1KGP phenotype frequencies to PharmGKB; ten of the sixteen genes were available: ABCG2, CYP2B6, CYP2C9, CYP2C19, CYP2D6, CYP3A5, DPYD, NUDT15, TPMT, and UGT1A1. For both haplotype and phenotype comparisons, SLCO1B1 was excluded because PharmGKB and PyPGx used different PGx databases. PharmGKB provided only activity scores for CYP2D6 and DPYD, and translation tables from PyPGx were utilized to convert these activity scores into phenotype predictions.
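The two calculations in this comparison, the fold ratio with its twofold flag and the Hardy-Weinberg expectation used for the PharmGKB diplotype frequencies, can be sketched as follows. The function names are hypothetical, and treating a zero literature frequency as "no ratio available" is an assumption of this sketch rather than something stated in the text.

```python
from itertools import combinations_with_replacement


def fold_ratio(ours, literature):
    """Ratio of our frequency over the literature value; None if undefined."""
    if literature == 0:
        return None
    return ours / literature


def differs_twofold(ours, literature):
    """True when the ratio is at least twofold in either direction."""
    ratio = fold_ratio(ours, literature)
    return ratio is not None and (ratio >= 2.0 or ratio <= 0.5)


def hwe_diplotype_frequencies(allele_freqs):
    """Expected diplotype frequencies under Hardy-Weinberg equilibrium.

    Homozygotes get p*p and heterozygotes get 2*p*q.
    """
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        expected[(a, b)] = p * p if a == b else 2 * p * q
    return expected
```

For instance, the CYP2C19*2 frequencies reported in the Results for 1KGP-SAS (36%) and PharmGKB-SAS (27%) give a ratio of about 1.3, which stays inside the twofold range.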
Structural variation detection
PyPGx supports SV detection in 11 of the 58 genes assessed: CYP2A6, CYP2B6, CYP2D6, CYP2E1, CYP4F2, GSTM1, SLC22A2, SULT1A1, UGT1A4, UGT2B15, and UGT2B17. SVs are detected from per-base copy number data using a pre-trained SVM-based multiclass classifier with the one-vs-rest strategy. This means the classifier cannot detect SVs that it has not seen before. Therefore, for each gene with SVs we manually combed through individual samples and their copy number and allele fraction profiles to generate training and testing datasets for both known and novel SVs. For rare SVs we sought to synthetically increase sample size through simulation using the "pypgx.sdk.utils.simulate_copy_number" method, which introduces random noise to existing samples. Once datasets were gathered, we combined them with existing datasets from PyPGx and then re-trained SVM classifiers for both GRCh37 and GRCh38.
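To make the inputs to the classifier concrete, the sketch below shows one plausible way to turn read depth into a per-base copy number profile and to create a noisy synthetic copy of it. It is not PyPGx's implementation: anchoring on the median control depth, assuming two copies at the control locus, and using Gaussian noise are all assumptions, and both function names are hypothetical.

```python
import random
from statistics import median


def per_base_copy_number(target_depth, control_depth, ploidy=2):
    """Scale target depth by the control locus depth (assumed to carry `ploidy` copies)."""
    anchor = median(control_depth)
    if anchor <= 0:
        raise ValueError("control locus has no coverage")
    return [ploidy * depth / anchor for depth in target_depth]


def add_noise(profile, sd=0.1, seed=None):
    """Return a synthetic profile: the input plus Gaussian noise, clipped at zero."""
    rng = random.Random(seed)
    return [max(0.0, value + rng.gauss(0.0, sd)) for value in profile]


# A heterozygous whole-gene deletion would hover around one copy.
profile = per_base_copy_number([15, 16, 14, 15], [29, 30, 31])
synthetic = add_noise(profile, sd=0.05, seed=1)
```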
The final training and testing datasets and corresponding classification accuracy are summarized in Supplementary Table 2 and can be accessed from https://github.com/sbslee/pypgx-data. After the update, the sample size increased on average three-fold in the training dataset (from N = 93.0 to N = 279.7) and two-fold in the testing dataset (from N = 29.0 to N = 67.1). Similarly, the average number of unique SVs increased from 4.1 to 9.6 for the training set and from 3.6 to 9.6 for the testing set, substantially increasing the complexity of the SV space. The training accuracy was 100% for all of the genes except for CYP2D6 and SULT1A1, which still showed a high accuracy ranging from 99.0% to 99.7% depending on the reference genome build. The testing accuracy was 100% for all of the genes. Of note, PyPGx also supports SV detection in the G6PD gene, but it is solely for sex determination because the gene is located on the X chromosome. Since there was no need for additional training, the SVM classifier for G6PD was not updated in this study. To determine if our novel SVs overlapped with previously characterized SVs, we compared estimated endpoints from the copy number profile produced by PyPGx to the UCSC Genome Browser on Human (GRCh37/hg19) (https://genome.ucsc.edu/), including the Database of Genomic Variants32,33,34.
Phenotype prediction
In addition to diplotype calls, the "run-ngs-pipeline" command from PyPGx automatically produces predicted phenotypes if the target gene is one of the 16 genes with a genotype–phenotype table from CPIC. Nine genes (CYP2B6, CYP2C19, CYP2C9, CYP2D6, CYP3A5, DPYD, NUDT15, TPMT, UGT1A1) produce a prediction of drug metabolizer status ranging from poor to ultrarapid metabolizer35; CFTR, F5, and IFNL3 produce a prediction of responder status to certain therapeutics (i.e., favorable vs. unfavorable response)36,37,38; ABCG2 and SLCO1B1 produce a prediction of functional status ranging from poor to increased function39; CACNA1S and RYR1 produce a prediction of malignant hyperthermia susceptibility40.
More specifically, there are two phenotype prediction methods in PyPGx. The first method uses a simple diplotype-to-phenotype mapping system provided by CPIC. For instance, CYP2B6*1/*29 and *6/*6 diplotypes will be assigned an intermediate metabolizer and a poor metabolizer phenotype, respectively. The second method uses a standard unit of enzyme activity known as an activity score41. For example, the fully functional reference CYP2C9*1 allele is assigned a value of 1, decreased-function alleles such as CYP2C9*2 and *5 receive a value of 0.5, and nonfunctional alleles including CYP2C9*3 and *6 have a value of 0. The sum of values assigned to both alleles constitutes the activity score of a diplotype. Consequently, subjects with CYP2C9*1/*1, *1/*2, and *3/*6 diplotypes have an activity score of 2 (normal metabolizer), 1.5 (intermediate metabolizer), and 0 (poor metabolizer), respectively. PyPGx uses the second method for the CYP2C9, CYP2D6, and DPYD genes and the first method for the rest of the genes. The burden of "abnormal, priority, and high risk" responses according to CPIC (Supplementary Table 3) for each individual and ancestral population was determined by counting the number of genes with these predicted responses (see "Supplementary Materials and Methods"). From the WGS data we identified a large number of SNVs and indels that were not used to define star alleles. To explore the potential contribution of these variants to enzymatic activity, SNVs and indels were functionally annotated using the Combined Annotation Dependent Depletion (CADD) tool42 (Supplementary Table 4) and were uploaded into the web interface of the Ensembl Variant Effect Predictor (VEP) (Assembly: GRCh37.p13)43,44 (see "Supplementary Materials and Methods").
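The activity score method can be illustrated with the CYP2C9 allele values quoted above. The sketch below sums the two allele values and maps the total to a phenotype; the cut-offs (2 is normal, 1 to 1.5 is intermediate, 0 to 0.5 is poor) are an assumption modeled on the CPIC CYP2C9 scheme, and the table and function names are hypothetical rather than PyPGx's API.

```python
# Per-allele values for CYP2C9 as quoted in the text.
CYP2C9_ALLELE_VALUES = {"*1": 1.0, "*2": 0.5, "*5": 0.5, "*3": 0.0, "*6": 0.0}


def activity_score(diplotype, allele_values=CYP2C9_ALLELE_VALUES):
    """Sum the values of both alleles in a diplotype such as '*1/*2'."""
    first, second = diplotype.split("/")
    return allele_values[first] + allele_values[second]


def phenotype_from_score(score):
    """Map an activity score to a metabolizer status (assumed cut-offs)."""
    if score >= 2.0:
        return "Normal Metabolizer"
    if score >= 1.0:
        return "Intermediate Metabolizer"
    return "Poor Metabolizer"


calls = {d: phenotype_from_score(activity_score(d)) for d in ("*1/*1", "*1/*2", "*3/*6")}
```

A lookup table keyed by allele keeps the scoring rule separate from the gene-specific values, so the same two functions could be reused with a different table for CYP2D6 or DPYD.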
Results
Haplotype variation patterns
We applied the PyPGx program to the high-coverage WGS data from 1KGP to generate diplotype calls in 58 pharmacogenes from 2,504 unrelated samples. All individual PyPGx calls (N = 145,232), including diplotypes, SVs, and predicted phenotypes, can be found in Supplementary Table 5. Of those calls, we observed a total of 538 unique star alleles, including reference alleles (Table 1; Table 2). The CYP2D6 gene had the highest number of unique alleles, 64, followed by DPYD with 42, CYP2B6 with 27, CYP2A6 with 26, and G6PD with 24. Conversely, the CACNA1S and CYP17A1 genes showed the least polymorphism with zero non-reference alleles. We then computed star allele frequencies for each gene for each of the five global populations in 1KGP (Supplementary Table 6). Figure 1 shows the relative proportion of observed star alleles for the 12 most polymorphic genes, i.e., those with the highest number of unique alleles.
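Star allele frequencies follow from the diplotype calls by counting each of the two alleles per individual and dividing by the total number of alleles. The sketch below assumes diplotypes are strings with alleles separated by "/" and that uninterpretable calls have been removed beforehand; the function name and the toy calls are illustrative only.

```python
from collections import Counter


def allele_frequencies(diplotypes):
    """Compute allele frequencies from diplotype strings such as '*1/*2'."""
    counts = Counter()
    for diplotype in diplotypes:
        counts.update(diplotype.split("/"))  # two alleles per individual
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical calls for four individuals from one population.
freqs = allele_frequencies(["*1/*2", "*1/*1", "*2/*17", "*1/*17"])
```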
Among the 12 most polymorphic genes, 7 had population-specific allele frequencies available from PharmGKB for comparison (Supplementary Table 7). After comparing 208 unique star alleles across the seven genes, 97 unique alleles overlapped with previously reported PharmGKB frequencies. Over half of the alleles were consistent with the literature and fell within a twofold range (59.6%; N = 58/97). To illustrate, the frequencies for both CYP2D6*17, a reduced-function haplotype that is common in individuals of African ancestry (e.g., c.1023C>T; rs28371706)45,46, and CYP2C19*2, a common loss-of-function haplotype (e.g., c.681G>A; rs4244285)47, were consistent with the previously reported frequencies across populations. For example, the frequency of CYP2D6*17 for both 1KGP-AFR and PharmGKB-SSA was around 19%, and for the rest of the populations, both 1KGP and PharmGKB frequencies were ≤ 1%. For CYP2C19*2, all compared populations had similar frequencies: 1KGP-AFR (17%) and PharmGKB-SSA (16%); 1KGP-EAS (31%) and PharmGKB-EAS (28%); 1KGP-EUR (15%) and PharmGKB-EUR (15%); 1KGP-SAS (36%) and PharmGKB-SAS (27%); 1KGP-AMR (10%) and PharmGKB-AME (12%) (Supplementary Table 7).
Although the majority of our frequency calls fell within a twofold range, our analysis revealed that 40.2% of haplotype calls (N = 39/97) exhibited a difference of twofold or greater. Among these, 30.3% (N = 30/97) showed fold-changes ranging from 2x to 10x. Notably, there were "extreme" cases, accounting for 10.1% (N = 10/97), where fold-changes exceeded 10x. Half of these extreme cases (N = 5) ranged from approximately 15x to 160x, yet all had frequencies below 0.6% in both the 1KGP and PharmGKB databases, notably involving G6PD Ube Konan (EAS), DPYD c.1905C>G (EUR), DPYD c.967G>A (EUR), CYP2C19*4 (AMR), and CYP2D6*3 (AMR). Consequently, these extreme cases lack significant clinical relevance. In PharmGKB, a wide range of reported frequencies resulted in additional extreme cases with fold-changes ranging from approximately 15x to 70x (N = 3). Notable instances include PharmGKB-SSA involving CYP2B6*11 (range: 0–13%; average: 7.1%) and *15 (range: 0–7.7%; average: 1.5%), as well as PharmGKB-AMR, featuring CYP2C9*8 (range: 0–4.4%). For all three cases, a frequency of 0.1% in 1KGP was determined, aligning with studies included in PharmGKB and other databases such as dbSNP. It is important to note that PharmGKB-AMR and SSA represent heterogeneous groups, with sub-populations potentially exhibiting varying frequencies. Furthermore, we observed extreme cases (N = 2) where the DPYD c.1896T>C haplotype (rs17376848)48, resulting in normal enzyme function, exhibited a 125x fold-difference for EAS and an 18x fold-difference for SAS. Given that the reference allele (rs1801265 in GRCh37) and the c.1896T>C allele share (1) normal function and (2) the same number of core variants (N = 1), the deciding factor when PyPGx chose which allele to report, if both rs1801265 and rs17376848 were present in the same haplotype, was the impact on protein coding (refer to "Star allele identification" for more details). Notably, the DPYD reference harbors a missense variant, while c.1896T>C carries a synonymous variant, leading PyPGx to prioritize the reference over c.1896T>C.
Detection of known and novel SVs
We identified 53 known and novel SV-carrying star alleles in the 11 genes that were assessed for the presence of SVs (Table 2). The alleles consisted of gene deletions (e.g., CYP2A6*4, CYP2D6*5, GSTM1*0), duplications (e.g., CYP2A6*1×2, CYP2E1*7×2, GSTM1*Ax2), multiplications (e.g., CYP2D6*36×3+*10, CYP2E1*7×3, SULT1A1*2×3), and hybrids (e.g., CYP2A6*12, CYP2B6*29, CYP2D6*68+*4). These were collectively found in 22.9% (6,305/27,544) of the diplotypes examined for SVs. The relative proportion of detected SVs for the 11 genes is shown in Fig. 2. About 1.2% (329/27,544) of the diplotypes were returned as "Indeterminate" due to the difficulty of interpreting the SV.
Among the SVs identified, one stood out for its high frequency (34.4%) among 1KGP-EAS: CYP2D6*36+*10 (hybrid), which, notably, was not cataloged in PharmGKB for comparison. However, a comprehensive literature review spanning PubMed publications from 1995 to 2015 revealed an average allele frequency of 26.41% (range: 22.45–32.65%)49 for CYP2D6*36+*10 among populations of East Asian descent. In contrast, the average reported frequency for CYP2D6*10 in East Asian populations was 42.58% (range: 8.6–64.1%)49, while our estimation yielded a frequency of 18.8%. This discrepancy suggests challenges in accurately determining the frequency of CYP2D6*36+*10. Moreover, our analysis allowed us to ascertain the frequencies of CYP2D6*36×2+*10 and *36×3+*10 across various populations (Supplementary Table 6). It is worth noting that both PharmGKB and the extensive literature review rely on available studies to estimate SV frequencies. However, these studies often employ genotyping assays that are prone to error and influenced by the diverse methodologies of different laboratories. In contrast, our WGS approach offers a more direct and updated method to determine frequencies for haplotypes containing SVs.
Figure 3 showcases six representative examples of well-known SVs, demonstrating PyPGx's robust detection capabilities. Additionally, Fig. 4 presents six representative examples of novel SVs identified in this study. We categorize these predicted SVs as "novel" because they have not been previously utilized to define star alleles. However, it is likely that some of these novel SVs have been identified by others, not necessarily within the context of pharmacogenomics, given the extensive study of the 1KGP dataset. Therefore, to identify previously reported SVs within the novel SV gene regions, we utilized the UCSC Genome Browser. One representative example is illustrated in Supplementary Fig. 1. Among our findings, we identified gssvL58571, a copy number variation region (chr19:41,352,371–41,397,661), as a potential candidate for a whole-gene deletion of CYP2A7 (Fig. 4A). Additionally, nssv3567503 represents a gain of CNV region (chr19:41,497,274–41,558,271), indicating a whole-gene duplication of CYP2B6 (Fig. 4B). Interestingly, this duplication, CYP2B6*22×2, was previously detected by PyPGx in a Vindija Neanderthal individual50. Another notable finding is esv33893, presenting a gain and loss variation (chr22:42,522,622–42,538,228), suggestive of multiple copies of a complex CYP2D6/CYP2D7 hybrid variation (Fig. 4C). Furthermore, we highlight gssvG14505, delineating a CNV region (chr16:28,609,490–28,626,916), which included the same sample (NA19143) utilized in our study, indicating a whole-gene multiplication in SULT1A1 (Fig. 4D). Lastly, esv25622 represents a loss variation (chr2:234,648,159–234,659,534), one of the few candidates showing a homozygous partial gene deletion in UGT1A4 (Fig. 4E), while gssvG27730 represents a gain of CNV region (chr4:69,217,756–69,592,846), overlapping with a partial gene duplication in UGT2B15 (Fig. 4F).
PGx phenotype frequencies
PyPGx generated predicted phenotypes utilizing genotype–phenotype translation tables from CPIC across 16 genes. The frequencies of these phenotypes are outlined in detail in Supplementary Table 8 and summarized in Fig. 5. Furthermore, we assessed the prevalence of gene-phenotype patterns across thirteen genes, leading to abnormal, priority, and high-risk phenotypes according to CPIC guidelines (Supplementary Table 3; Fig. 6). On average, subjects exhibited approximately 3 non-typical response phenotypes for the thirteen pharmacogenes, with 98.2% of individuals demonstrating at least one non-typical drug response. For example, one individual from 1KGP-AFR exhibited the following predicted phenotypes: CYP2B6-poor metabolizer, CYP2C19-rapid metabolizer, and CYP2D6-intermediate metabolizer.
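The per-individual burden and the cohort summaries reported above amount to a simple count. The sketch below is illustrative only: it treats any phenotype outside a small "typical" set as non-typical, whereas the study used the CPIC "abnormal, priority, and high risk" designations from Supplementary Table 3, and the sample identifiers and function names are hypothetical.

```python
# Assumed set of typical phenotypes; the study relied on CPIC designations instead.
TYPICAL = {"Normal Metabolizer"}


def burden(phenotypes_by_gene):
    """Count the genes whose predicted phenotype is non-typical."""
    return sum(1 for p in phenotypes_by_gene.values() if p not in TYPICAL)


def summarize(cohort):
    """Return (mean burden, fraction of individuals with at least one non-typical call)."""
    burdens = [burden(calls) for calls in cohort.values()]
    mean_burden = sum(burdens) / len(burdens)
    fraction_any = sum(1 for b in burdens if b >= 1) / len(burdens)
    return mean_burden, fraction_any


cohort = {
    "sample_A": {  # mirrors the 1KGP-AFR example in the text
        "CYP2B6": "Poor Metabolizer",
        "CYP2C19": "Rapid Metabolizer",
        "CYP2D6": "Intermediate Metabolizer",
    },
    "sample_B": {
        "CYP2B6": "Normal Metabolizer",
        "CYP2C19": "Normal Metabolizer",
        "CYP2D6": "Normal Metabolizer",
    },
}
mean_burden, fraction_any = summarize(cohort)
```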
Additionally, we compared phenotype frequencies to those in PharmGKB for 10 out of the 16 genes (Supplementary Table 9). Among 88 unique gene-phenotype pairs (e.g., CYP2D6-Normal Metabolizer), 59.1% (N = 52/88) overlapped with the PharmGKB database. Notably, 20 of these pairs exhibited at least a twofold difference in frequency across populations. Among these, 2 showed fold-changes greater than 10x, and 18 showed fold-changes between 2x and 10x. To investigate the identified differences in phenotype frequencies, we examined the largest disparities, in DPYD-poor metabolizer (SAS) and DPYD-intermediate metabolizer (AFR) (Supplementary Table 9). For DPYD-poor metabolizers, SAS exhibited a 53-fold difference, despite both 1KGP and PharmGKB reporting frequencies of less than 0.2%. Similarly, for DPYD-intermediate metabolizers, AFR/SSA displayed a fold-difference of 69.8x. However, within PharmGKB-SSA, the frequency of DPYD-likely intermediate metabolizer is comparable to that of 1KGP-AFR DPYD-intermediate metabolizer; hence, "likely" phenotypes could impact our observed fold-differences.
Clinically relevant PGx variants
From the WGS data we found a total of 190,398 unique small nucleotide variants in 58 genes, of which only 502 (0.26%) were used to define star alleles. To assess the potential clinical relevance of the remaining 189,910 variants not part of any known star allele nomenclature, we filtered the variants based on their annotated consequences ("stop_gained", "frameshift", "missense", "splice_donor", and "regulatory"). This resulted in 210 variants (Supplementary Table 4). The distribution of these variants across populations was: 1KGP-AFR (N = 67) > 1KGP-SAS (N = 56) > 1KGP-EAS (N = 54) > 1KGP-EUR (N = 41) > 1KGP-AMR (N = 27). The consequences for the 210 variants across 48 genes are shown in Supplementary Fig. 2. We identified eight unique SNVs previously reported to have ClinVar significance: "Likely pathogenic" in RYR1 (rs200563280 C>T and rs199895006 C>T); "drug response" in CYP2D6 (rs1058172 C>T) and ABCB1 (rs201122883 G>A); and "Pathogenic" in CYP1B1 (rs377049098 G>A), XPC (rs121965088 G>A), and CFTR (rs397508198 G>T and rs74597325 C>T).
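The consequence-based filter can be sketched as a set intersection over annotated variants. The record layout (dictionaries with "id" and "consequences" keys) and the function name are hypothetical, the consequence labels are copied from the text rather than from any annotation tool's vocabulary, and any additional thresholds the study applied (such as CADD scores) are not modeled here.

```python
# Consequence labels as listed in the text.
SEVERE_CONSEQUENCES = {"stop_gained", "frameshift", "missense", "splice_donor", "regulatory"}


def filter_candidates(variants, star_allele_variant_ids):
    """Keep variants that define no star allele and carry a severe consequence."""
    kept = []
    for variant in variants:
        if variant["id"] in star_allele_variant_ids:
            continue  # already part of the star allele nomenclature
        if SEVERE_CONSEQUENCES & set(variant["consequences"]):
            kept.append(variant)
    return kept


# Toy records: annotations are simplified for illustration.
variants = [
    {"id": "rs4244285", "consequences": ["splice_donor"]},  # defines CYP2C19*2
    {"id": "rs74597325", "consequences": ["stop_gained"]},  # CFTR nonsense variant
    {"id": "example_synonymous", "consequences": ["synonymous"]},
]
candidates = filter_candidates(variants, star_allele_variant_ids={"rs4244285"})
```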
Discussion
In this study, we conducted an in-depth analysis of population-level PGx variation and phenotype landscape using WGS data from the 1KGP and a reliable SV-aware method, PyPGx. Overall, (1) we identified global haplotype and phenotype frequencies by including SVs for seven and ten pharmacogenes, respectively, (2) we validated PyPGx via detection of known SVs and identified novel SVs, including SVs that have not been used to define star alleles in this well-characterized dataset, (3) most of the cohort (98.2%) had at least one non-typical drug response according to CPIC guidelines, consistent with previous findings11, and (4) we identified a large number of variants (210 SNVs and indels) not previously used to define star alleles that were potentially deleterious with high CADD scores and clinical significance. These findings contribute to an enhanced understanding of global PGx variation.
The precise determination of haplotype and phenotype frequencies across populations is crucial for enhancing drug selection and dosage accuracy, with the potential to mitigate adverse drug events, though we acknowledge that these frequencies are generalizations of diverse communities, and that individualized PGx testing is optimal for personalizing care. While we have successfully updated frequencies for specific examples, there remains a gap in reporting SVs comprehensively across all populations, presenting challenges for meaningful comparisons. One of the most comprehensive studies of PGx variation to date11 excluded SV analyses. Further, our study used high-coverage WGS data, in contrast to the previous study's use of imputed, exome, and integrated datasets. Another recent study18, using WGS data, conducted a thorough examination of coding and non-coding SVs influencing drug absorption, distribution, metabolism, and excretion. While the study was comprehensive18, PyPGx adds another dimension by facilitating the detection of "novel" SVs, those not currently used to define star alleles. Our study demonstrates the utility of PyPGx and can enhance the scientific community's ability to accurately detect and measure SV frequencies.
PyPGx's capacity to detect SVs also enables the correction of misidentified haplotype calls, particularly for complex, polymorphic pharmacogenes, such as CYP2D6. While CYP2D6 genotypes are well studied, many haplotype frequencies may exclude SVs. Using our pipeline, we were able to enhance CYP2D6 haplotype calls containing SVs in 1KGP, including *1×2, *4×2, *2×2, *5, and *10×2 (Supplementary Table 5). While beyond the scope of this paper, a more comprehensive exploration of the biological and PGx impacts of SVs is needed.
We also examined variants that had not been used to define star alleles to determine if any were clinically relevant. Notably, we identified 8 variants with previous reports, including a nonsense variant within CFTR (rs74597325), which had the most submissions (n = 26) with the highest review status (4/4; practice guidelines), as well as pathogenic implications in cystic fibrosis51. In addition, a nonsense variant within RYR1 (rs200563280) had the next highest amount of evidence (n = 20) of pathogenicity and likely pathogenicity, with an association with malignant hyperthermia susceptibility52. Both variants have yet to be utilized in defining a star allele. However, the rest of the variants had less supporting evidence and fewer publications (n < 6).
While our work has numerous strengths, including population-level comparisons, SV analyses, and an exploration of potentially clinically significant variants, it is not without limitations. Both the 1KGP and PharmGKB grouping systems rely on genetic similarity; however, due to differences in the grouping system and potential sampling bias, the comparisons may be indirect or incomplete. 1KGP-SAS, 1KGP-EAS, and 1KGP-EUR are consistent with PharmGKB; however, 1KGP-AMR and 1KGP-AFR are not fully consistent31, so findings should be viewed as approximations31. These population descriptors are used to enhance comparability and reproducibility, but are limited. Notably, 1KGP-SSA is underrepresented within the dataset53. While PyPGx cannot detect breakpoints of SVs, applying machine-learning algorithms, like SVM, has helped to improve identification of SVs and enhances our understanding of the complexity of variation within pharmacogenes. We also encountered "extreme" fold changes when comparing haplotype frequencies (N = 10). While most could be explained by low frequencies (N = 5/10) and study outliers within PharmGKB (N = 3/10), we encountered "extreme" instances with DPYD specifically (N = 2/10) due to technical aspects of PyPGx and the PharmVar star allele definition of the DPYD c.1896T>C haplotype, which results in normal function but is not considered a reference haplotype. Additionally, some frequencies detected by PyPGx, for haplotypes including those containing SVs and for phenotypes (e.g., "Likely" or "Possible" status), were not available for comparison within PharmGKB. Yet, this does highlight that PyPGx enables us to expand upon our current characterization of haplotype and phenotype frequencies. Ultimately, while most of our haplotype and phenotype calls are consistent with existing literature, we were able to offer valuable population-level insights into relevant haplotypes and phenotypes, while incorporating SVs.
The use of high-resolution WGS data significantly enhances our understanding of PGx variation across populations and has the potential to optimize precision medicine; yet further efforts are needed to ensure that everyone receives equitable benefits of PGx research54. Further, individual testing remains optimal for precision medicine efforts. Potential future directions of this work include functional studies to validate our findings, especially the 8 variants identified, as well as assays to confirm phenotypic predictions across populations; both are outside the scope of this work. Additionally, the methods used here could be applied to larger cohorts, such as the UK Biobank, which recently released WGS data for all 500,000 participants. In summary, our study expands upon similar efforts14, stands as a valuable resource on global PGx variation, and underscores the potential of WGS data to play a pivotal role in advancing precision medicine.
Data availability
Code and corresponding data supporting the current study are available at: https://github.com/sbslee/1kgp-pgx-paper.
References
1. Papachristos, A., Patel, J., Vasileiou, M. & Patrinos, G. P. Dose optimization in oncology drug development: The emerging role of pharmacogenomics, pharmacokinetics, and pharmacodynamics. Cancers 15, 3233 (2023).
2. Adverse Drug Events in Adults | Medication Safety Program | CDC. https://www.cdc.gov/medicationsafety/adult_adversedrugevents.html (2022).
3. Robarge, J. D., Li, L., Desta, Z., Nguyen, A. & Flockhart, D. A. The star-allele nomenclature: Retooling for translational genomics. Clin. Pharmacol. Ther. 82, 244–248 (2007).
4. Gammal, R. S. et al. Expanded clinical pharmacogenetics implementation consortium guideline for medication use in the context of G6PD genotype. Clin. Pharmacol. Ther. 113, 973–985 (2023).
5. Tafazoli, A., Guggilla, R. K., Kamel-Koleti, Z. & Miltyk, W. Strategies to improve the clinical outcomes for direct-to-consumer pharmacogenomic tests. Genes 12, 361 (2021).
6. Gaedigk, A. et al. Cytochrome P4502D6 (CYP2D6) gene locus heterogeneity: Characterization of gene duplication events. Clin. Pharmacol. Ther. 81, 242–251 (2007).
7. Almogy, G. et al. Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform. 2022.05.29.493900 Preprint at https://doi.org/10.1101/2022.05.29.493900 (2022).
8. Tremmel, R., Pirmann, S., Zhou, Y. & Lauschke, V. M. Translating pharmacogenomic sequencing data into drug response predictions – How to interpret variants of unknown significance. Br. J. Clin. Pharmacol. https://doi.org/10.1111/bcp.15915 (2023).
9. Numanagić, I. et al. Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat. Commun. 9, 828 (2018).
10. Tremmel, R. et al. Copy number variation profiling in pharmacogenes using panel-based exome resequencing and correlation to human liver expression. Hum. Genet. 139, 137–149 (2020).
11. McInnes, G. et al. Pharmacogenetics at Scale: An Analysis of the UK Biobank. Clin. Pharmacol. Ther. 109, 1528–1537 (2021).
12. Santos, M. et al. Novel copy-number variations in pharmacogenes contribute to interindividual differences in drug pharmacokinetics. Genet. Med. 20, 622–629 (2018).
13. Zhou, Y. & Lauschke, V. M. Population pharmacogenomics: An update on ethnogeographic differences and opportunities for precision public health. Hum. Genet. 141, 1113–1136 (2022).
14. Lakiotaki, K. et al. Exploring public genomics data for population pharmacogenomics. PLOS ONE 12, e0182138 (2017).
15. Zhou, Y. & Lauschke, V. M. The genetic landscape of major drug metabolizing cytochrome P450 genes-an updated analysis of population-scale sequencing data. Pharmacogenomics J. 22, 284–293 (2022).
16. Vuppalanchi, R. Metabolism of Drugs and Xenobiotics (Elsevier, 2011).
17. Caspar, S. M., Schneider, T., Meienberg, J. & Matyas, G. Added value of clinical sequencing: WGS-based profiling of pharmacogenes. Int. J. Mol. Sci. 21, 2308 (2020).
18. Tremmel, R., Zhou, Y., Schwab, M. & Lauschke, V. M. Structural variation of the coding and non-coding human pharmacogenome. NPJ Genomic Med. 8, 24 (2023).
19. Zhou, Y. & Lauschke, V. M. Computational tools to assess the functional consequences of rare and noncoding pharmacogenetic variability. Clin. Pharmacol. Ther. 110, 626–636 (2021).
20. Lee, S., Shin, J.-Y., Kwon, N.-J., Kim, C. & Seo, J.-S. ClinPharmSeq: A targeted sequencing panel for clinical pharmacogenetics implementation. PLOS ONE 17, e0272129 (2022).
21. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
22. Devuyst, O. The 1000 genomes project: Welcome to a new world. Perit. Dial. Int. J. Int. Soc. Perit. Dial. 35, 676–677 (2015).
23. Zhao, X. et al. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. Am. J. Hum. Genet. 108, 919–928 (2021).
24. Byrska-Bishop, M. et al. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. 2021.02.06.430068 Preprint at https://doi.org/10.1101/2021.02.06.430068 (2021).
25. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field. (National Academies Press, Washington, D.C., 2023). https://doi.org/10.17226/26902.
26. Gaspar, H. A. & Breen, G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinf. 20, 116 (2019).
27. Lee, S.-B. et al. Stargazer: a software tool for calling star alleles from next-generation sequencing data using CYP2D6 as a model. Genet. Med. Off. J. Am. Coll. Med. Genet. 21, 361–372 (2019).
28. Lee, S.-B., Wheeler, M. M., Thummel, K. E. & Nickerson, D. A. Calling star alleles with stargazer in 28 pharmacogenes with whole genome sequences. Clin. Pharmacol. Ther. 106, 1328–1337 (2019).
29. Browning, S. R. & Browning, B. L. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097 (2007).
30. McDonagh, E. M., Whirl-Carrillo, M., Garten, Y., Altman, R. B. & Klein, T. E. From pharmacogenomic knowledge acquisition to clinical applications: The PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark. Med. 5, 795–806 (2011).
31. Huddart, R. et al. Standardized biogeographic grouping system for annotating populations in pharmacogenetic research. Clin. Pharmacol. Ther. 105, 1256–1262 (2019).
32. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
33. MacDonald, J. R., Ziman, R., Yuen, R. K. C., Feuk, L. & Scherer, S. W. The database of genomic variants: A curated collection of structural variation in the human genome. Nucl. Acids Res. 42, D986–992 (2014).
34. Zhang, J., Feuk, L., Duggan, G. E., Khaja, R. & Scherer, S. W. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res. 115, 205–214 (2006).
35. Caudle, K. E. et al. Standardizing terms for clinical pharmacogenetic test results: Consensus terms from the Clinical Pharmacogenetics Implementation Consortium (CPIC). Genet. Med. 19, 215–223 (2017).
36. Clancy, J. P. et al. Clinical pharmacogenetics implementation consortium (CPIC) Guidelines for ivacaftor therapy in the context of CFTR genotype. Clin. Pharmacol. Ther. 95, 592–597 (2014).
37. Swen, J. J. et al. Pharmacogenetics: From bench to byte – an update of guidelines. Clin. Pharmacol. Ther. 89, 662–673 (2011).
38. Muir, A. J. et al. Clinical pharmacogenetics implementation consortium (CPIC) Guidelines for IFNL3 (IL28B) genotype and peg interferon-α–based regimens. Clin. Pharmacol. Ther. 95, 141–146 (2014).
39. Cooper, Y. A., Guo, Q. & Geschwind, D. H. Multiplexed functional genomic assays to decipher the noncoding genome. Hum. Mol. Genet. https://doi.org/10.1093/hmg/ddac194 (2022).
40. Gonsalves, S. G. et al. Clinical pharmacogenetics implementation consortium (CPIC) guideline for the use of potent volatile anesthetic agents and succinylcholine in the context of RYR1 or CACNA1S genotypes. Clin. Pharmacol. Ther. 105, 1338–1344 (2019).
41. Gaedigk, A. et al. The CYP2D6 activity score: Translating genotype information into a qualitative measure of phenotype. Clin. Pharmacol. Ther. 83, 234–242 (2008).
42. Rentzsch, P., Schubach, M., Shendure, J. & Kircher, M. CADD-Splice – improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med. 13, 31 (2021).
43. Hunt, S. E. et al. Annotating and prioritizing genomic variants using the ensembl variant effect predictor – a tutorial. Hum. Mutat. 43, 986–997 (2022).
44. McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
45. Masimirembwa, C., Persson, I., Bertilsson, L., Hasler, J. & Ingelman-Sundberg, M. A novel mutant variant of the CYP2D6 gene (CYP2D6*17) common in a black African population: Association with diminished debrisoquine hydroxylase activity. Br. J. Clin. Pharmacol. 42, 713–719 (1996).
46. Muyambo, S. et al. Warfarin pharmacogenomics for precision medicine in real-life clinical practice in Southern Africa: Harnessing 73 variants in 29 pharmacogenes. OMICS J. Integr. Biol. 26, 35–50 (2022).
47. de Morais, S. M. et al. The major genetic defect responsible for the polymorphism of S-mephenytoin metabolism in humans. J. Biol. Chem. 269, 15419–15422 (1994).
48. Offer, S. M., Wegner, N. J., Fossum, C., Wang, K. & Diasio, R. B. Phenotypic profiling of DPYD variations relevant to 5-fluorouracil sensitivity using real-time cellular analysis and in vitro measurement of enzyme activity. Cancer Res. 73, 1958 (2013).
49. Gaedigk, A., Sangkuhl, K., Whirl-Carrillo, M., Klein, T. & Leeder, J. S. Prediction of CYP2D6 phenotype from genotype across world populations. Genet. Med. Off. J. Am. Coll. Med. Genet. 19, 69–76 (2017).
50. Wroblewski, T. H. et al. Pharmacogenetic variation in neanderthals and denisovans and implications for human health and response to medications. Genome Biol. Evol. 15, evad222 (2023).
51. Watson, M. S. et al. Cystic fibrosis population carrier screening: 2004 revision of American college of medical genetics mutation panel. Genet. Med. 6, 387–391 (2004).
52. Gonsalves, S. G. et al. Using exome data to identify malignant hyperthermia susceptibility mutations. Anesthesiology 119, 1043–1053 (2013).
53. Sengupta, D., Choudhury, A., Basu, A. & Ramsay, M. Population stratification and underrepresentation of indian subcontinent genetic diversity in the 1000 genomes project dataset. Genome Biol. Evol. 8, 3460–3470 (2016).
54. Magavern, E. F., Gurdasani, D., Ng, F. L. & Lee, S. S. Health equality, race and pharmacogenomics. Br. J. Clin. Pharmacol. 88, 27–33 (2022).
Acknowledgements
The authors acknowledge the New York Genome Center for their generous contribution of high-coverage WGS data.
Funding
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R35HG011319, DEIA Supplement. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Contributions
Conceptualization: S.B.L.; Data curation: S.B.L., C.A.S.; Formal analysis: C.A.S., S.B.L.; Funding acquisition: K.G.C.; Investigation: S.B.L., C.A.S.; Methodology: S.B.L., C.A.S.; Project administration: C.A.S., S.B.L., K.G.C.; Software: S.B.L.; Supervision: K.G.C., S.B.L.; Visualization: S.B.L., C.A.S.; Writing-original draft: C.A.S., S.B.L.; Writing-review & editing: C.A.S., S.B.L., K.G.C.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sherman, C.A., Claw, K.G. & Lee, Sb. Pharmacogenetic analysis of structural variation in the 1000 genomes project using whole genome sequences. Sci Rep 14, 22774 (2024). https://doi.org/10.1038/s41598-024-73748-3
DOI: https://doi.org/10.1038/s41598-024-73748-3