Phenotype prediction and characterization of 25 pharmacogenes in Thais from whole genome sequencing for clinical implementation

Publicly available pharmacogenomics (PGx) databases enable translation of genotype data into clinically actionable information. As variation within pharmacogenes is population-specific, this study investigated the spectrum of 25 clinically relevant pharmacogenes in the Thai population (n = 291) from whole genome sequencing. The bioinformatics tool Stargazer was used for phenotype prediction, through assignment of alleles and detection of structural variation. Known and unreported potentially deleterious PGx variants were identified. Over 25% of Thais carried a high-risk diplotype in CYP3A5, CYP2C19, CYP2D6, NAT2, SLCO1B1, and UGT1A1. CYP2D6 structural variants accounted for 83.8% of all high-risk diplotypes. Of 39 known PGx variants identified, six variants associated with adverse drug reactions were common. Allele frequencies of CYP3A5*3 (rs776746), CYP2B6*6 (rs2279343), and NAT2 (rs1041983) were significantly higher in Thais than East-Asian and global populations. 121 unreported variants had potential to exert clinical impact, majority were rare and population-specific, with 60.3% of variants absent from gnomAD database. This study demonstrates the population-specific variation in clinically relevant pharmacogenes, the importance of CYP2D6 structural variation detection in the Thai population, and potential of unreported variants in explaining drug response. These findings are essential in development of dosing guidelines, PGx testing, clinical trials, and drugs.


Scientific Reports
| (2020) 10:18969 | https://doi.org/10.1038/s41598-020-76085-3 www.nature.com/scientificreports/ in incorrect allele assignments, hence phenotype prediction 3 . This create disparities in reports of the same sample that discourage the adaptation of PGx testing. Whole genome sequencing (WGS) has advantages over other platforms as it identifies all variants required for accurate allele assignment and novel clinically relevant PGx variants, which may account for unexplained differences in drug response. As alleles assignment required identifying large number of variants and detection of SV, bioinformatics tool was developed in facilitate calling of star alleles from next-generation sequencing data 4 . Current PGx resources and recommendations are based largely on a population of European descent. Studies have shown differences in pharmacogenes between ethnicities or even in closely related populations [5][6][7][8] . The aims of this study were to use WGS in investigating the prevalence of 25 high-evidence pharmacogenes (CPIC level A) star alleles and diplotypes for accurate phenotypes prediction and to examine allele frequency of known PGx variants, together with identifying potential novel deleterious variants in the Thai population.
Potentially deleterious PGx variants. Of 305 missense variants found in this cohort, 41 variants were previously reported to associate with drug response in PharmGKB database. Novel potentially deleterious missense variants found in Thais were reported in the Supplementary Table S2. One hundred and ten missense variants were considered novel, potentially deleterious, while 5 variants obtained Combined Annotation Dependent Depletion (CADD) PHRED-normalized scores of > 30 (Supplementary Table S2). Seventy-eight percent (n = 86) of novel potentially deleterious missense variants were only found once in this cohort. Ninety-four percent (n = 103) were rare in gnomAD database, while 61% (n = 67) were absent. Thirty percent (n = 33) had not been reported in dbSNP 150 database. Sixty-two percent of Thais carry up to 4 novel potentially deleterious missense variants.
Eleven high-confidence loss-of-function variants were found in 9 pharmacogenes (Supplementary Table S3), 8 variants were only found once in this cohort, 2 variants were rare (minor allele frequency [MAF] < 0.01), and 1 www.nature.com/scientificreports/ variant was found at low frequency (0.01 < MAF < 0.05). According to the gnomAD database, all loss-of-function variants were rare and 6 variants were absent. An enrichment of splice acceptor variant rs373134805 (CYP3A5) was found within the South East-Asian population in GenomeAsia 100 k database 14 .

Discussion
This is the first study to report the prevalence of star alleles, diplotypes, and phenotype predictions of 25 clinically relevant pharmacogenes, including CYP2D6 SV, from WGS in the Thai population. We then identified 39 known PGx variants in Thais and compared allele frequencies to East-Asian and global populations. We further identified 110 novel potentially deleterious missense variants and 11 high-confidence loss-of-function variants circulating within the population. The "star" nomenclature system used in this study is a powerful tool for predicting activity or function of enzymes, transporters, or drug targets, as it accounts for a combination effect of multiple variants within an allele 15 . We found high clinical relevance cytochrome P450 genes (CYP3A5, CYP2C19, and CYP2D6) exhibiting high variation in predicted phenotype. This could reflect the low evolutionary constraint within these enzymes, as they lack essential endogenous function 16 . SV, between CYP2D6 and its pseudogenes (CYP2D7, CYP2D8) established to alter enzymatic activity, found to exerted high importance in the Thai population 17 . It accounted for 60% of CYP2D6 star alleles detected and 83.8% of all high-risk diplotypes in this study. Interestingly, prevalence of CYP2D6 SV was also previously reported to be highest among Asians when compared to African Americans, Caucasians, and Hispanics 17 . Our finding emphasizes the importance of detecting CYP2D6 SV for accurate phenotype prediction especially in Thai population.
Variability in drug response among ethnicities had long been observed, but a recent increase in the number of populations studied unveiled another layer of genetic variability within the sub-population, such as distribution gradient of CYP2C19*17 found from Western to Eastern Europe 5 . In Thais, CYP3A5*3 (rs776746), CYP2B6*6 (rs2279343), and NAT2 (rs1041983) were significantly higher compared with East-Asian and global populations. Varying allele frequency of multiple VKORC1 variants to different populations found in this study supported the previously reported variation of rs9923231 among the East-Asian population where allele frequency of 0.96, 0.94, and 0.90 was observed in North-East Asians (Japanese, South Koreans, and Chinese) in comparison to 0.62, 0.69, 0.75 observed in South-East Asians (Filipinos, Malaysians, and Indonesians) 14 . In addition to known PGx variants, novel potentially deleterious variants were population specific with 94.2% identified were rare in gno-mAD database, and 60.3% were absent. This reflect previous finding that high impact variants are often rare and geographically localized as a result of purifying selection 18 . For example, potentially deleterious splice acceptor variant c.433-1G>C in CYP3A5 found at a low frequency in Thai (0.017) is population-specific South East-Asian populations including Vietnamese (0.018), Malaysian (0.039), and Indonesian (0.015), while extremely rare in the gnomAD database 14 . These variations within subpopulations of East-Asians demonstrate the benefit of PGx testing and highlight the precaution that must be taken when associating PGx with ethnicity labels. www.nature.com/scientificreports/ A focus on rare variants in explaining inter-individual variation in drug response is likely to increase as the cost of sequencing is reduced, making WGS more readily available. An important challenge remains in interpreting these rare variants of unknown significant. Repository SPHINX (Sequence, Phenotype, and pHarmacogenomics INtegration eXchange https ://emerg esphi nx.org), that link PGx variants of unknown significance with patients clinical phenotypes would facilitate researchers on studying these variants of unknown significant for future PGx discovery 19 .
The WGS ability to access PGx variants and SV in a single methodology reduced time and labor involved. This study demonstrates WGS to be a highly efficient platform in research and PGx testing. The current high cost and bioinfomatics required to process and translate large data could limit WGS application as a PGx testing platform in routine clinical setting. Development of bioinformatics tools use in translating genotype data are moving toward a more automated manner, such as under developing PharmCAT 20 . This would make interpreting WGS data more user-friendly and accessible to wider healthcare provider in the near future. In the meantime, an alternative more cost effective platform such as genotyping arrays could currently be a more applicable 21 .
We acknowledge several limitations and ways the study could be improved. A portion of enrolled participants was Brugada syndrome patients. Although none of the genes associated with Brugada syndrome were examined, results could be enhanced with unknown genetic factors influencing the disease. Previous study reported an inconsistent in star alleles calling in samples with complex SV when three bioinformatics tools were compared, this suggest that further confirmation, such as using high-resolution long-read sequencing that allows accurate variant calling and phasing, might be required in some samples with CYP2D6 complex SV 22 . Computational prediction tools like Loss-of-Function Transcript Effect Estimator (LOFTEE) and Combinded Annotation Dependent Depletion (CADD) used in this study and other studies are useful in prioritizing deleterious effects in variants of unknown significance; however, variants must be reported with caution and validated through a functional study before implementation in clinical settings 23,24 .
In conclusion, we reported a comprehensive overview of the PGx spectrum in a Thai population and its differences with East-Asian populations. We demonstrated the utilization of WGS in PGx testing, including accurate phenotype prediction using the "star" nomenclature system, SV detection, and identification of known and unknown potentially deleterious PGx variants. The reported findings and variations within pharmacogenes of the Thai population facilitate PGx-guided clinical decision making in Thailand and contribute to the database of the understudied South-East Asian population for further application of precision public health including dosing guidelines, drug development, clinical trials, and development of population-specific screening.

Materials and methods
WGS of 291 individuals from the Brugada cohort (Clinical Trial Registration Number NCT04232787), consisting of 108 patients with Brugada syndrome and 183 controls were enrolled into the study. Controls had no type I Brugada pattern on standard 12-lead echocardiogram or family history of sudden cardiac arrest and were matched to the cases for place of birth. Controls were volunteers from blood donor at multiple sites of National Blood Center, Thai Red Cross Society (n = 99) or visitors for health check-up and workers at King Chulalongkorn Memorial Hospital (n = 84). All subjects, case and control, were of Thai ethnic origin by self-report from 52 provinces in Thailand which can be separated into 5 major geographical regions: north, northeast, central, east and south as shown in Supplementary Table S1. The study was approved by the Institutional Review Board of the Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand (IRB No. 431/58). Informed consent was obtained from all participants. All methods were performed in accordance with relevant guidelines/regulations.
Sequencing of paired-end 150 bp fragment reads from polymerase chain reaction (PCR)-free sequencing libraries was performed on the HiSeqX (Illumina Ltd, Cambridge, UK). Sequencing, alignment, and variant calling were performed at Illumina Ltd, Cambridge, UK. Reads were aligned to NCBI GRCh38 human reference genome assembly. All samples have minimum mean depth coverage of 30 × with minimum of 97% autosome coverage at 10 × and 94% aligned reads.
Multidimensional scaling analysis was performed on 15,965 single nucleotide polymorphism, excluding indels, within 25 pharmacogenes using Plink (version 1.9). Multidimensional scaling plot illustrate no separation between cases and control for the first 4 components (Supplementary Fig. S1).

Star allele analysis.
Stargazer required a VCF file on genome coordinate GRCh37 and a gdf file for SV detection. Genome coordinates, reference, and alternative allele were converted from GRCh38 to GRCh37 using LiftoverVariants tools available in GATK package (version 4.1.6.0). To generate the gdf file for SV detection of CYP2D6, first, Bazam (version 1.0.1) was used to extract CYP2D6-CYP2D7 region from BAM file and realigned on GRCh37 coordinates 25 . Samtools (version 1.9) was used to extract read depth. Sdf2gdf script, which was available on Stargazer, was used to generate the gdf files. One sample with high missingness within the CYP2D6 region was excluded when calling star allele of CYP2D6. The haplotype, activity score, diplotype, and predicted phenotype called by Stargazer with VDR as a control gene were combined and visualized using R program (version 3.6.3, dplyr and ggplot2 package). High-risk diplotypes were defined as diplotypes which predicted phenotype was not normal.  26 . Annotated variants were classified into common (MAF ≥ 0.05), low frequency (0.05 > MAF ≥ 0.01), rare (MAF < 0.01), and absent (MAF = 0). Variants within each group were counted using VCFtools (version 0.1.15). Number of missense variants per coding sequence was calculated by number of missense variants Ensembl transcript length , where Ensembl transcript length was obtained from BioMart database (https ://www.Ensem bl.org/bioma rt/martv iew/) and had APPRIS annotation value as "primary assembly. " PGx variants were retrieved from PharmGKB database (https ://www.pharm gkb.org, accessed on 06/06/2020). Variants with evidence level 1A, 1B, and 2A were identified as known PGx variants. Allele frequencies were compared with those of the population in gnomAD using Chi-square test or Fisher's exact test. A p-value of < 0.001 was used as a significant level after Bonferroni correction.
Missense variants were extracted to examine novel potentially deleterious variants where variants reported in PharmGKB database or used in star allele analysis were excluded. CADD PHRED-normalized scores were obtained online. CADD PHRED-normalized scores ≥ 20, or 1% most deleterious single nucleotide variants within the reference genome, were considered potentially deleterious variants. Loss-of-function variants include stopgained, splice-site disrupting, frameshift insertion, and frameshift deletion variants. LOFTEE algorithm available in VEP-plugin was used to determine loss-of-function variants, and variants annotated as "high confidence" were considered loss-of-function in this study.

Data availability
The data that support the findings of this study are available from the National Research Council of Thailand but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are; however, available from the authors upon a reasonable request and with a permission of the National Research Council of Thailand.