The genetic landscape of major drug metabolizing cytochrome P450 genes—an updated analysis of population-scale sequencing data

Genes encoding cytochrome P450 enzymes (CYPs) are extremely polymorphic and multiple CYP variants constitute clinically relevant biomarkers for the guidance of drug selection and dosing. We previously reported the distribution of the most relevant CYP alleles using population-scale sequencing data. Here, we update these findings by making use of the increasing wealth of data, incorporating whole exome and whole genome sequencing data from 141,614 unrelated individuals across 12 human populations. We furthermore extend our previous studies by systematically considering also uncharacterized rare alleles and reveal that they contribute between 1.5% and 17.5% to the overall genetically encoded functional variability. By using established guidelines, we aggregate and translate the available sequencing data into population-specific patterns of metabolizer phenotypes. Combined, the presented data refine the worldwide landscape of ethnogeographic variability in CYP genes and aspire to provide a relevant resource for the optimization of population-specific genotyping strategies and precision public health.


INTRODUCTION
Drug response can vary substantially between individuals with up to 50% of patients undergoing pharmacotherapy suffering from low treatment efficacy or adverse drug reactions [1,2]. Overall, around 30% of this variability has been attributed to genetic factors and variations in cytochrome P450 (CYP) genes alone have been estimated to be relevant for 10-20% of all drug therapies [3]. Notably, of the 57 CYP enzymes encoded in the human genome, eight (CYP2A6, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, CYP3A4 and CYP3A5) are responsible for the metabolism of most drugs in clinical use [4]. With the exception of CYP3A4, these enzymes lack important endogeneous substrates and, consequently, these CYP genes are extremely polymorphic with a plethora of single nucleotide variations (SNVs) and structural variants [5,6]. Genetic drift, population admixture and isolation compound the genetic complexity and result in substantial ethnogeographic differences in CYP gene variability across human populations [7].
Besides common polymorphisms, recent advances in Next Generation Sequencing (NGS) have facilitated the identification of tens of thousands of rare variants across the human pharmacogenome [8][9][10]. However, the vast majority of studies to date only analyzed the frequency and distribution of common genetic candidate polymorphisms in CYP genes. We previously used exome sequencing data from 56,945 individuals to analyze the distribution of clinically relevant CYP star alleles across five major human populations [11]. Here, we update these data for 98 star alleles across the eight clinically most relevant CYP genes.
Specifically, we analyzed fully consistent and compatible whole exome and whole genome sequencing data from 141,614 individuals and extend our analyses to twelve well-defined ethnogeographic groups. In addition, we comprehensively map and functionally interpret the rare genetic variability within these CYP genes and integrate both star alleles and uncharacterized variants to infer interethnic differences in phenotype distributions and human drug metabolism.

MATERIALS AND METHODS Data sources
Star alleles of the analyzed CYP genes were defined based on PharmVar [12]. Variant frequency data from a total of 141,614 individuals were derived from the aggregated publicly available sequencing resource gnomAD [13]. Haplotype frequencies were derived considering population-specific linkage disequilibria between the respective polymorphisms based on data from the 1000 Genomes Project using LDlink [14]. For Ashkenazim and Koreans, linkage information from Europe and East Asia were used, respectively. As suballeles of a given star allele do not differ in allele function, they were aggregated throughout this study. Variant calls from short-read sequencing data can be problematic for CYP2B6 [15] and CYP2D6 [16], resulting in possible underestimations of variant frequency. To ameliorate these issues, we matched the extracted frequency information from gnomAD to data from the National Center for Biotechnology Information (NCBI) Allele Frequency Aggregator (ALFA), which includes sequencing data generated using longer reads. Notably, of all analyzed CYP2B6 and CYP2D6 variants, only rs2279343 was considerably underestimated in gnomAD (10.7% and 24.1% for Europeans and Africans in gnomAD vs 23.1% and 33.1% in ALFA, respectively; Supplementary

Evaluation of variant functionality
The functional effects of star allele variants were obtained from the literature. For genetic variants for which effects have not been described, functional consequences were estimated using the ADME-optimized prediction framework (APF) [17]. In brief, APF generates an ensemble score for each variant by integrating five computational algorithms (LRT, MutationAssessor, PROVEAN, VEST3 and CADD) using parameter configurations specifically optimized for pharmacogenomic assessments. Notably, while APF is in principle also applicable to non-coding variations, it has not yet been benchmarked for this purpose and is thus applied here only to variants that affect the amino acid sequence of the respective gene product.

Inferring phenotypes from functional variants
To infer phenotypes, we considered exonic variants with known or putative functional effects, copy number variations in CYP2A6 and CYP2D6, as well as selected intronic (CYP3A4*22 and CYP3A5*3) and regulatory (CYP2A6*9, CYP2B6*22 and CYP2C19*17) variants outside of exons. All variations and haplotypes were assigned a functionality score based on the respective CPIC guidelines where available. For uncharacterized variations we used APF scores as activity score predictions. Both established (CPIC) and predicted (APF) activity scores of variants were then aggregated to infer metabolizer phenotype distributions for each population by calculating diplotype frequencies assuming Hardy-Weinberg equilibrium.
For CYP2C8, the genetic variability was overall considerably lower than for CYP2A6 and CYP2B6. European, Middle Eastern and African populations feature similar frequency of decreased   (Fig. 1c). In Africans, the major allele is CYP2C8*2 (MAF = 15.2%), whereas *3 (MAF = 2-15.2%) and *4 (MAF = 1.8-5.8%) are the primary minor alleles in the other populations (Table 3). In contrast, CYP2C8 is extremely conserved in East Asian populations with >99.4% of all alleles corresponding to the reference sequence.
In CYP2C19, only the splice site loss-of-function allele CYP2C19*2 is common in all populations analyzed (MAF = 8.7-32.4%; Table 3). The regulatory increased function allele CYP2C19*17 is frequent in European, admixed American, Middle Eastern and African populations with MAFs between 10.1% and 23.1%, whereas it is rare across East Asia (MAF = 0-0.7%). Notable among the population-specific variations are the loss-of-function variants CYP2C19*3 and *4 as well as the decreased function variant *9, which are exclusively found in East Asian (MAF ≥ 6.3%), Ashkenazim (MAF = 1.6%) and African populations (MAF = 1.3%), respectively. Overall, unlike for the other CYP2C genes, CYP2C19 loss-offunction alleles are most common and increased function alleles are most rare in East Asians, whereas patterns of allele activity are very similar across all other analyzed populations (Fig. 1e).
CYP2D6 constitutes the most polymorphic pharmacogene and harbors a multitude of common variants with clinical relevance, particularly for the treatment with antidepressants, antipsychotics, opioid analgesics and antihypertensives. Here, we studied 17 CYP2D6 alleles, 13 of which are associated with altered allele function. Decreased function alleles are most prevalent in East Asians due to high frequencies of CYP2D6*10 (MAF ≥ 35%), whereas loss-of-function alleles are most abundant in European populations primarily because of high frequencies of the splicing variant CYP2D6*4 (MAF up to 20.3%) ( Table 4 and Fig. 1f). Notably, reduced CYP2D6 allele function are also very common in Africans; however, in these populations the primary drivers are *17 (MAF = 20.5%) and *29 (MAF = 8.9%), which are almost exclusively found in this population. Increased function due to the functional gene duplications CYP2D6*1xN and *2xN is most common in Middle Eastern populations (aggregated MAF = 7%), Ashkenazi Jews (aggregated MAF = 5.6%) and Europeans (aggregated MAF up to 4.7%) but were rare in East Asian populations (aggregated MAF < 1%).
Among the major drug metabolizing CYPs, CYP3A4 is the only enzyme with an important endogenous substrate and, as a consequence, CYP3A4 is the most conserved among the studied CYPs (Table 5 and Fig. 1g). In Europeans, admixed Americans, Ashkenazi Jews and Middle Easterners, CYP3A4*22 is the only common allele of functional relevance with frequencies ranging between 0.9% in South Europeans to 9% in Ashkenazim. In contrast, in East Asian populations the population-specific *16 and *18 alleles are common, the former of which only in Japanese (MAF = 3.9%), whereas *22 is absent. Notably, South Asian populations do not harbor common CYP3A4 alleles.
CYP3A5 activity is primarily governed by the CYP3A5*3 allele, which is defined by the presence of a variant that results in the generation of a cryptic splice site that causes a premature stop codon. The *3 allele constitutes the major allele in most populations studied (MAF = 66.8-93%) except for Africans (MAF = 29.8%; Table 5). The latter also harbor the population-specific lossof-function frameshift allele CYP3A5*7 (MAF = 10.2%) as well as CYP3A5*6 (MAF = 12.9%), a splice variant restricted to African and Middle Eastern populations. Consequently, 53% to 94% of all CYP3A5 alleles are inactive across all major populations (Fig. 1h).

The genetic landscape of CYP variability
In addition to the well-characterized star alleles, CYP genes harbor a multitude of variants with unknown functional consequences. Using whole exome and whole genome sequencing data from 141,614 unrelated individuals, a total of 10,176 genetic variants were identified across the eight CYPs, of which 6016 were exonic (Fig. 2a). Notably, intronic variations are likely underreported as the majority of samples were sequencing using exome sequencing, which does not systematically cover introns. We thus we focused our further analyses exclusively on exonic variants. Among the exonic variations, missense (n = 3560; 59%) and synonymous variants (n = 1364; 23%) were most common, whereas only 19 start-lost variants, 11 in-frame insertions and 5 stop lost variants were identified. Importantly, 98.8% (n = 5891) and 96.8% (n = 5695) of the exonic variants were rare with MAFs < 1% and MAF < 0.1%, respectively (Fig. 2b). Compared to the approximately 150 variants with established functional annotations based on experimental or epidemiological data, these results indicate that the functional impact of the vast majority of variants in drug metabolizing CYPs remains to be determined.
To estimate the functional relevance of these unexplored variations at the population scale, we thus utilized computational predictions. Specifically, we used the APF algorithm (see Methods), which has been specifically developed for the interpretation of genetic variability in pharmacogenes. Overall, we identified 2175 variants that were predicted to reduce enzyme function (Supplementary Table 2). While more variants were identified in CYP2D6  Fig. 2c). Next, we aggregated the functional predictions based on computational inference with the functional variability allotted to star alleles (Fig. 2d). Expectedly, CYP3A5 harbored the overall largest fraction of altered function alleles (81.2%), followed by CYP2D6 (45.7%) and CYP2B6 (41.1%) and lowest was in CYP2C9 (16.7%), CYP2C8 (13.8%) and CYP3A4 (3.7%). Globally, star alleles accounted for majority of functionally relevant alleles with non-star alleles contributing between 1.5% for CYP3A5 and 17.5% in CYP3A4 to the total genetically encoded functional variability. However, when stratifying the analysis by population, cases were observed where the putative functional impact of nonstar alleles was considerable (Fig. 2e and Supplementary Table 3). In East Asian populations, uncharacterized variants in CYP2C8 are predicted to be as relevant as the deleterious CYP2C8 star alleles *2, *3, *4, *5, *7 and *11. Similarly, non-star alleles in CYP3A4 were estimated to have similar impacts compared to star alleles in South Asian and African populations, and more than one fifth of the functional variability of CYP2D6 and CYPA6 can be attributed to nonstar alleles in admixed Americans and Africans, respectively. In contrast, the functional contribution from non-star alleles in Ashkenazi Jews and Finnish are found to be marginal (Fig. 2e).
Translation of CYP genetic variability profiles into populationspecific functional effects Lastly, we used the aggregated inferred functionality data of star alleles and predicted uncharacterized variants to calculate the distributions of population-specific metabolizer phenotypes (Fig. 3). Poor metabolizers (PM) and intermediate metabolizers (IM) of CYP2A6 were most common in East Asian populations (33% PM; 49% IM). In contrast, only 5% and 2% of the population were CYP2A6 PMs in Europe and the Middle East (Fig. 3a). For CYP2B6, non-normal metabolizer phenotypes were most frequent in South Asia and Africa where only 19% and 21% were classified as normalizer metabolizers (NM; Fig. 3b). By contrast, 58%, 49% and 45% of Finnish, non-Finish European and Ashkenazim individuals were inferred to be NM. Moreover, rapid metabolizers (RM) for CYP2B6 were substantially more frequent in Asian and Middle Eastern populations as well as in admixed Americans (14-18% compared to ≤7% in other populations).         In contrast to CYP2A6 and CYP2B6, the majority of CYP2C8 and CYP2C9 classified as NM in every population analyzed (Fig. 3c, d). Impaired activity (IM or PM) of CYP2C8 and CYP2C9 was most prevalent in and Middle Easterners (38% and 30%, respectively) and Africans (35% and 27%, respectively), whereas East Asians are the most conserved among all studied populations (≤3% for both enzymes). In these calculations we considered CYP2C8*3 as a decreased function allele; however, as discussed below, functional effects of this allele are not clear and might be substrate-specific. An alternative map in which CYP2C8*2 is considered as functionally neutral is provided in Supplementary Fig. 1. For CYP2C19, South Asians were estimated to be phenotypically most variable, with 12%, 45% and 16% being classified as PM, IM and RM, respectively. In East Asians the fraction of PM (15%) and IM (47%) individuals were even higher than South Asians. However, only 1% of East Asians were CYP2C19 RM, whereas the respective frequencies in Middle Easterners (36%), Europeans (34%), Africans (29%) and Ashkenazim (29%) was substantially higher (Fig. 3e).
While CYP2D6 has a multitude of population-specific polymorphisms with functional relevance, the fraction of non-normal CYP2D6 metabolizer phenotypes does not vary substantially across different ethnogeographic groups (Fig. 3f). The fraction of CYP2D6 IMs is highest in East Asians (51% of the population), whereas PMs are very rare (<1%). The highest frequency of CYP2D6 PMs is found in Europeans (7%) and Ashkenazim (4%). Normal CYP2D6 metabolizers are most common among individuals of Southeast Asian (65%) and admixed American (63%) descent while RMs are most frequent in the Middle East (9%).
For CYP3A4, the frequency of IMs is highest in Ashkenazi Jews and Europeans with an estimated 18% and 10% of the entire population, whereas the prevalence of IMs in other populations is <10% (Fig. 3g). In contrast, the majority of individuals across all populations are classified as IMs (9-50%) or PMs (29-90%; Fig. 3h). When using the older but still widely used classification scheme of expressors (defined as individuals with at least one active CYP3A5 allele) and non-expressors (defined as individuals with two inactive CYP3A5 alleles), the range of expressors varied between 10% in Europeans and 71% in Africans.

DISCUSSION
CYP genes are long known to be highly polymorphic with distinct genetic population differences. Comprehensive maps of interethnic differences in CYP variability have previously been presented for individual genes or alleles based on literature analyses [19][20][21][22][23][24][25]. Importantly however, genotyping strategies can differ between studies, which can impact allele frequency estimates particularly for genes with complex haplotype structures, such as CYP2B6 and CYP2D6. In addition, there have been multiple efforts in mapping CYP variability by genotyping individuals across populations using consistent profiling approaches [26,27]. However, due to practical limitations, these efforts were limited to relatively small cohorts. By analyzing comprehensive sequencing data from 141,614 individuals across a total of 12 populations we here update our previous metaanalysis [11] and provide a systematic overview of the global landscape of genetic variability for the eight CYP genes of highest clinical relevance.
Among the well-characterized CYP alleles, we find considerable differences between related ethnogeographic groups. In East Asia, despite substantial admixture of Japanese, Korean and Han  Chinese [28], frequencies of functionally relevant alleles can differ substantially. For instance, the reduced function allele CYP3A4*16 is common in Japanese (MAF = 3.9%), whereas it is almost absent in Koreans and other East Asian populations with important implications for the treatment with statins and immunosuppressants. Furthermore, frequencies of CYP2D6*10 and CYP2D6*41 were substantially lower in Japanese compared to other East Asian groups, suggesting that recognition of these differences might aid in the optimization of population-specific dosing recommendations for antipsychotic and antidepressant treatment [29]. Our results moreover confirm considerable differences between Finnish and other European populations. Specifically, we find substantially lower frequencies of the reduced function alleles CYP2D6*41 (3.2% in Finnish compared to 9.3% in other European populations) as well as a lack of CYP2B6*9 (0% vs. 4.4%).
The functional effects of some common alleles remain controversial. For instance, CYP2C8*3, the most frequent CYP2C8 variant allele in most populations, has been shown in vitro studies to result in reduced [30,31], normal [32] or increased metabolism [33,34]. Further complication is added by the fact that CYP2C8*3 is in strong linkage disequilibrium with the decreased activity allele CYP2C9*2 [35]. As both enzymes substantially overlap in their substrate specificity, this might have influenced previous clinical investigations in their interpretation that CYP2C8*3 results in decreased CYP2C8 activity [36].
Of note, phenotype assignments based on diplotypes differs between genes. For instance, for CYP2C9 NMs are defined as an activity score of 2 (i.e. individuals having two functional alleles), whereas IM status is assigned for individuals with activity scores of 1-1.5 and PMs are defined as 0-0.5 [37]. Similarly, an individual carrying a reduced function allele would be classified as IM for CYP2B6 [38] and CYP2C19 [39]. For CYP2D6 however, individuals with one reduced function and one normal function allele are classified as NMs and CYP2D6 IMs and PMs are defined as activity scores of 0-1.25 and 0, respectively [40]. Here, we followed these current established guidelines and conventions for the translation of genotypes into activity scores and metabolizer phenotypes. As a consequence of these differences in phenotype annotations, we would like to point out to the readers that individuals who carry, for instance, a normal function allele and the reduced function allele CYP2D6*10 (activity score = 0. 25   activity score of 0.5, results in an IM definition, whereas the same activity would be classified as PM in other CYPs.
While well-characterized star alleles play important roles in determining CYP function, a considerable amount of heritable variability in drug response remains unexplained. For instance, elegant twin studies have shown that while 90% of the pharmacokinetic variation of metoprolol and torsemide were heritable, known genetic variants in CYP2D6 and CYP2C9 only explained 39% and 2% of the respective variation [41], raising the possibility that additional variants in these or other genes might explain at least part of the "missing heritability". Further evidence for the potential relevance of as of yet uncharacterized variants in CYP genes comes from a retrospective study of 2087 patients taking the antidepressant escitalopram, which revealed that although common CYP2C19 alleles were well-corelated with escitalopram serum concentrations, substantial variability particularly in the *1/*1 group remained [42]. Our results indicate that, as expected, common star alleles indeed constitute the predominant genetic determinants of functional CYP variability. However, rare, as of yet uncharacterized variants are estimated to account for 1.5-17.5% of the overall genetically encoded functional variability in CYP genes. These estimates suggest that the relative contribution of rare variants in CYPs is overall lower compared to phase I and phase II enzymes or drug transporters, for which previous studies suggested that approximately 20-40% of the total variability is allotted to rare variants [43][44][45].
Notably, we here focused on genetic variability in coding regions of the analyzed genes. However, rare variants in regulatory or untranslated regions (UTRs) of CYP genes might also modulate gene activity. Previous studies indicated that expression of multiple CYPs, including CYP2C8, CYP2C9 and CYP2C19, was modulated by miRNAs and variants in the respective UTRs might impact these miRNA-mRNA interactions [46]. Furthermore, rare variations in regulatory regions, which are largely understudied due to the challenges associated with their functional interrogation, might contribute substantially to the missing heritability in CYP activity. Importantly, in recent years the repertoire of methodologies and algorithms to predict the functional impact of regulatory variants is rapidly increasing [47]. However, a systematic application of these tools to CYP variability in non-coding regions has not been presented and the assessment of the relative importance of such variants for inter-individual variability in drug metabolism remains an important frontier of pharmacogenomic research.
Non-star allele variants that were predicted to be deleterious by APF featured multiple population-specific variants with experimental support of reduced activity. For instance, rs181297724, causing a p.Ala161Pro amino acid exchange in CYP2C19, strongly reduced metabolic activity in vitro [48] and was found to be common in Finnish (MAF = 5.6%), whereas it was almost absent in other populations, including Europeans (MAF < 0.1%). In CYP2D6, rs79392742 (p.Ala449Asp) was detected in 3.5% of admixed American alleles and was associated with decreased activity. measured with substrates dextromethorphan and metoprolol [49]. Similarly, rs145014075 (p.Ser467X) and rs145308399 (p.Glu97Lys) in CYP2A6 are common in Ashkenazim and South Asians (MAF = 2.3%), respectively, and have been associated with decreased activity in vitro or in vivo [50,51]. Notably, the relevance of such variants can differ substantially between populations; for instance, uncharacterized variants in CYP2C8 and CYP3A4 are estimated to be as relevant as the characterized star alleles in Asian populations, whereas the contribution of non-star allele variants in CYP2B6 and CYP2Cs in Finnish and Ashkenazi Jewish individuals is very low. These results pinpoint variants and populations in which additional functionally relevant variability in CYP genes might be discovered, which might provide useful guidance for the optimization of population-specific genotyping strategies for related drugs.
In summary, the presented analyses provide CYP allele frequencies using population-scale sequencing data broadening the data base from previous work by increasing the number of individuals from 56,945 to 141,614. We furthermore make use of improved variant calling quality as well as updated functional annotations and linkage information. In addition, we complement our previous studies by systematically considering rare variants into functional effect predictions in CYP genes based on dedicated pharmacogenomic algorithms. The presented data refines the worldwide landscape of ethnogeographic variability in CYP genes and emphasizes the importance of considering genetic differences for the optimization of population-specific pharmacotherapy and precision public health.

DATA AVAILABILITY
All data were available in the main tables and the supplementary information.