The analysis of GSTA1 promoter genetic and functional diversity of human populations

GSTA1 encodes a member of a family of enzymes that function to add glutathione to target electrophilic compounds, including carcinogens, therapeutic drugs, environmental toxins, and products of oxidative stress. GSTA1 has several functional SNPs within its promoter region that are responsible for a change in its expression by altering promoter function. This study aims to investigate distributions of GSTA1 promoter haplotypes across different human populations and to assess their impact on the expression of GSTA1. PHASE 2.1.1 was used to infer haplotypes and diplotypes of six GSTA1 promoter SNPs on 2501 individuals from 26 populations classified by the 1000 Genomes Project into five super-populations that included Africa (N = 660), America (N = 347), East Asia (N = 504), Europe (N = 502), and South Asia (N = 488). We used pairwise FST analysis to compare sub-populations and luciferase reporter assay (LRA) to evaluate the impact of each SNP on activation of transcription and interaction with other SNPs. The distributions of GSTA1 promoter haplotypes and diplotypes were significantly different among the different human populations. Three new promoter haplotypes were found in the African super-population. LRA demonstrated that SNPs at -52 and -69 has the most impact on GSTA1 expression, however other SNPs have a significant impact on transcriptional activity. Based on LRA, a new model of cis-elements interaction is presented. Due to the significant differences in GSTA1 diplotype population frequencies, future pharmacogenomics or disease-related studies would benefit from the inclusion of the complete GSTA1 promoter haplotype based on the newly proposed metabolic grouping derived from the LRA results.

The glutathione-S-transferases (GST) are a group of enzymes that catalyse the addition of glutathione to target electrophilic compounds, including carcinogens, therapeutic drugs, environmental toxins, and products of oxidative stress 1,2 . At present, eight distinct classes of the soluble cytoplasmatic mammalian GSTs have been identified: alpha, kappa, mu, omega, pi, sigma, theta, and zeta. The alpha class genes, specifically GSTA1 are most abundantly expressed in the liver (hepatocytes), kidney (proximal tubules), adrenal glands, pancreas, and testis, while expression in a wide range of other tissues is low 3,4 . Aberrant overexpression has been observed in various malignancies such as colorectal 5 and lung cancer 6 , while a decrease in alpha class GSTs have been observed in stomach and liver tumours 7 .
In addition to metabolising bilirubin and certain anti-cancer drugs in the liver, GSTs act as modulators of mitogen-activated protein kinase (MAPK) signal transduction pathway via a mechanism involving protein-protein interactions [8][9][10][11] . GSTA1 itself has been shown to form complexes with JNK and influences the development of apoptosis 12 .
In terms of pharmacogenomics, GSTA1 holds much importance in the field of oncology, as it is involved in the metabolic pathway of many important chemotherapeutic agents such as busulfan (Bu) 13 , thiotepa 14,15 , doxorubicin 16 , cyclophosphamide 17 , and chlorambucil 18 . Furthermore, because GSTA1 is the most abundantly expressed enzyme of its group in the liver, it establishes itself as the top candidate gene for influencing drug clearance 19 . It is also expressed in the breast, thus thought to be the reason why it influences the efficacy of www.nature.com/scientificreports/ cyclophosphamide in breast cancer patients 20 . Furthermore, of great interest in onco-haematology is Bu, an alkylating agent used in the conditioning regimen before haematopoietic stem cell transplant (HSCT) as a treatment for different types of malignancies and non-malignancies 8 . Diverse polymorphisms (SNPs) have been detected after analysis of the proximal promoter of gene encoding GSTA1, which is believed to affect its expression. These genetic polymorphisms consist of 2 leading haplotypes, GSTA1*A and GSTA1*B, containing 3 linked base substitutions in the proximal promoter, at positions -52, -69, and -567. Luciferase reporter assays (LRA) showed that GSTA1*A is more highly expressed than GSTA1*B 21 , most likely due to selective binding of Sp1 transcription factor (TF) at positions -52 and -69 21 .
Although most studies in patients receiving Bu based conditioning regimens show positive significant contributions of these haplotypes with Bu pharmacokinetics (PK), adverse events, or disease risk [22][23][24][25] , there are still several other studies that have reported negative findings, casting doubt on the role of the promoter SNPs. For a review of GSTA1′s role in Bu metabolism see Huezo-Diaz et al. 8 and for those related to disease-related studies see Deng et al. 26 . The reason for those discrepancies may arise, at least in part, from the incomplete knowledge of GSTA1 promoter haplotypes distribution in human populations and their impact on GSTA1 promoter activity, consequently resulting in differences in GSTA1 metabolic potential.
Recently, our group provided additional evidence to suggest that three other SNPs (-513, -631, and -1142) in linkange disequlibrium (LD) with -52, -69, and -567 contribute to the altered GSTA1 promoter activity, enabling the refinement of the haplotypes and thus possibly explaining further the variability between individuals and their impact on Bu metabolism 27 , which could also be true of other substrates of GSTA1.
Thus, with this study, we want to assess the distribution of GSTA1 haplotypes and diplotypes in different human populations, which can potentially explain different metabolic phenotype distribution among populations. Additionally, this study expands and complements our previous report 27 by providing further experimental and in silico data aiming to explain the functional contribution of each SNP and to identify TF binding sites encompassing these SNPs.

Methods
Population genetics study. We analysed the six SNPs (-52: rs3957356; -69: rs3957357; -513: rs11964968; -567: rs4715332; -631: rs4715333; -1142: rs58912740) genotype data of GSTA1 promoter from the 1000 Genomes Project Phase 3 Pipeline, Homo sapiens: GRCh37.p13 (GCF_000001405.25) Chr 6 (NC_000006.11), that included 2504 individuals from 26 human populations as described in NCBI Variation Glossary (http://www. ncbi.nlm.nih.gov/varia tion/docs/gloss ary). The VCF files containing genotype data from each SNP (rs3957356; rs3957357; rs4715332; rs4715333; rs11964968 and rs58912740) and individual's ID that identified ethnic group was downloaded and compiled into SPSS. Populations were classified as specified in the 1000 Genome Project into five 5 super-populations that included East Asia, South Asia, Africa, Americas, and Europe. Within each super-population, sub-populations were established that correspond to the country of origin. We used PHASE 2.1.1 software to infer the haplotypes 28 . All SNPs in the article are written in gene-wise 5′ to 3′ orientation. After establishing the haplotypes, we established their frequencies for each super-population and sub-population and compared them using pair-wise FST analysis. Next, diplotype frequencies per super-population were established. Lastly, frequency charts of the metabolic status groups were developed, as previously reported 27 to evaluate the potential functional impact of GSTA1 genetic diversity worldwide.
Distinguishing between A1B1a and A3B2. As PHASE was not able to distinguish the diplotypes *A1 *B1a vs *A3 *B2, we established a reliable PCR-based genotyping method. Forward GSTA1-F-1336BP 5′-TGG ATC CCT CAG TTT TGT AAGG -3′ and reverse GSTA1-R-1336BP 5′-TAA ACG CTG TCA CCG TCC -3′ oligos were used to specifically amplify the promoter region of GSTA1 using Platinum SuperFi II PCR Master Mix (Ther-moFisher, USA) and the following cycling conditions: initiation for 3 min at 95 °C followed by 38 cycles at 95 °C for 30 s, 64 °C for 30 s and 72 °C for 45 s. PCR products were genotyped using forward and reverse PCR primers and Sanger sequencing service (Fasteris, Switzerland). In the case of ambiguous diplotype, PCR products were TOPO TA cloned using linearized pMiniT 2.0 vector (E1202S, New England Biolabs, USA) according to manufacturer's recommendations. Several colonies were picked for sequencing using standard SP6 and T7 oligo and E. coli NightSeq Sanger sequencing service (Microsynth, Switzerland). This method was validated on a patient showing ambiguous genotype from Ansari et al. 27 cohort (Table 2) and eight 1000 Genomes Project samples ( Supplementary Fig. 2).
Luciferase reporter gene assay (LRA). LRA was used to determine how changes to the GSTA1 promoter sequence due to SNPs affect GSTA1 promoter function. We constructed 18 different plasmids corresponding to all the haplotypes identified from the population analysis. In brief, the human GSTA1 promoter (-1430 to -1 nt) was PCR amplified and cloned directly into the SacI-XhoI site of luciferase reporter gene plasmid pGL4.10basic (Promega, USA). Human hepatoblastoma (HepG2) cells were co-transfected with the pGL4.10 GSTA1 constructs and the pRL-SV40 vector that codes for Renilla luciferase for transfection control and normalisation (Promega, USA). Promoterless pGL4.10-basic plasmid was used to determine baseline expression (Promega, USA). Transfections of HepG2 were accomplished by X-tremeGENE HP DNA Transfection Reagent kit (Sigma-Aldrich, USA). Cells were seeded 1 day before transfection and grown for 24 h in 12-well culture plates containing DMEM F12 cell culture medium GlutaMAX supplemented (ThermoFisher, USA). Dual luciferase assay (Promega. USA) and Lumat3 LB 9508 (Berthold Technologies, Germany) were used to measure chemiluminescence as described in the manufacturer's protocol. Data analysis was performed by normalizing luciferase to renilla chemiluminescence. Next, expression values for each plasmid were compared to the expression of reference *A1 haplotype by calculating the ratio. All measurements were performed at least in triplicates. The In silico method for identification of transcription factor binding sites. A non-redundant set of position-weight matrices (PWMs) from binding profiles of known TFs was collected. We used 579 PWMs of vertebrate's TFs provided from the JASPAR2018 database (v1.1.1, http://jaspa r.gener eg.net/). Next, using Bedtools (v2.29.2) we extracted the sequence of a 2 kb region around GSTA1 transcription start site (TSS) from GRCh38.p13 assembly (1.5 kb upstream and 0.5 kb downstream). That region contained the 6 SNPs of interest described above. The sequence was modified to correspond to each haplotype and scanned using HOMER (v4.11). All predictions with a log odd-score higher than 6 were conserved and those having score higher than 10 are presented in Table 3. All haplotype combinations were generated and predictions were sorted by their relative positions to the SNP variation, except for the two nearest SNPs (-52 and -69) from TSS that were packed together due to their proximity.
In our previous study 29 , we have determined three functional GSTA1 metabolic phenotypes based on the combinations of the identified six haplotypes. Using the same categorization, the distribution of these metabolic phenotypes demonstrates clear ethnic differences (Fig. 1B). The frequency of the fast metaboliser phenotype (group 1) is highest in the African super-population (46.9%), followed by East Asian (23.5%), South Asian (13.1%), American (6.4%), and European super-population (3.6%). On the other hand, the frequency of the slow metaboliser phenotype is highest in the South Asian super-population (32.9%) followed by East Asian (25.1%), European (22.4%), American (12.0%), and African (9.6%) super-populations.
The reason for the high frequency of slow metaboliser phenotype in East and South Asians is due mostly to the higher frequency of *B1b haplotype (13.0% and 15.9%, respectively), whereas in the European super-population slow metabolisers carry mostly *B1a and *B1b haplotypes (allele frequency of 36.9% and 5.3%, respectively) ( Tables 2 and 3).
-52 and -69 (rs3957356 and rs3957357). Present results suggest that position -52 is more important for the functioning of the leading haplotype in comparison to -69. Also, the position -1142 interacts with both -52 and -69 SNPs ( Fig. 3A and B). Comparing plasmids GCATTC (*A1), GCATTG, and GTATTC against ACATTC, Table 1. Nine identified haplotypes in the 1000 Genomes Project population and their composition.
-1142 (rs58912740). The results for -1142 further support the above-described interaction between the -52/-69 and the -1142 SNP (Fig. 3F). The C to G change at position -1142 resulted in a 1.62-fold (p = 0.002) higher expression when analysed in the context of haplotype *A (GCATT C (*A1) vs GCATT G (*A3)) while an inverse relationship was found when comparing GCATG C (*A2) to GCATG G (*A1a), whereby a 2.70-fold (p = 0.014) decrease in expression resulted when-631 was also changed. In the context of the *B haplotype, we observed the repetition of the expression pattern when -631 was taken into consideration. However, the differences in expression were much smaller suggesting an interaction between the -52/-69 and the -1142 site. We observed a 1.20fold (p > 0.5) increase of expression between ATAGG G (*B1a) and ATAGG C (*B2), while a 1.14-fold (p = 0.06) increase between ATATT C and ATATT G. Interestingly, ACATT C demonstrated a 2.40-fold (p = 0.001) increase compared to ACATT G. Expected diplotype functions, shown in Fig. 4, are predicted by calculating the mean expression activity. The additive model was based on the measurement of luciferase expression of equimolar mixtures. Results confirmed the additive relationship between haplotypes (Supplementary Fig. 3). Out of 45 possible diplotype combinations of nine haplotypes, we detected 23 diplotypes in the 1000 Genomes Project population. Seventeen were reported previously in a Bu pharmacokinetic association study 27 , which were subsequently incorporated into a population pharmacokinetic model whereby two extreme groups of diplotypes carried by around 30% of the analysed population were confirmed of having extreme metabolic potential in comparison with the reference 30 (Fig. 4).
Considering the classification in those three Bu-based metabolic groups, calculated LRA expressions showed that group 1 (fast metabolisers) present around 1.5-fold higher expression than the reference diplotype *A1*A1, whereas group 3 (slow metabolisers) have around 1.6-fold lower expression than the reference (Fig. 4). Based on our results, new diplotypes (in green) *A1a*A2, *A2*B2a and *A2*B2b are likely to be placed in the intermediate  www.nature.com/scientificreports/ metabolic group while *B1a*B2b, *B2*B2a, and *B2*B2b are most likely resulting in slow GSTA1 activity. Based on the expected LRA expressions, diplotypes containing *B1b haplotype (*A2*B1b and *A3*B1b) are the only diplotypes that did not follow the GSTA1 metabolic categorization, as previously described 27 (Fig. 4). Twenty-two diplotypes were not detected in the 1000 Genomes Project population (light grey) due to the low haplotype frequencies and were also not evaluated for association with BU-clearance. Using LRA results we predicted their ability to express GSTA1. Only *A3*A3 would be placed in the high Bu-metabolising group. Seven diplotypes (*A1*A3, *A1a*A3, *A3*B2a, *A3*B2b, *A3*B1b, *A1*A1a, *A1*B2a) would be placed in intermediate BU-metabolising group (Group 2). Three diplotypes (between *A1*B1a and *A1*B1b): *A1a*A1a, *A1*B2b, and *A1a*B2a are borderline between groups 2 and 3, while the remaining eleven diplotypes are in the slow Bu-metabolising group 3. Group 3 is composed exclusively of *B haplotypes or the lowest expressing form of *A haplotype *A1a. The only exception is *A1*B1b containing *B1b previously associated with slow metaboliser status 27 . In silico method for identification of transcription factor binding sites (DNA-protein). Using in silico analysis, promoter region encompassing all SNPs' positions were analysed for binding of putative TFs (Table 4 and Supplementary Fig. 5). Because -52 and -69 sites are close together, we chose to analyse both locations together as either GC (*A) or TA (*B). The analysis identified RARA, E2F6, Sp1, NR2F1, and KLF5 as potentially binding to this location in the case of GC haplotype. Particularly, the detection of Sp1 was important as Sp1 has previously demonstrated to bind to this position. In the case of haplotype TA, only the NR2E1 binding motif was found to have a high similarity to this position. Position -513 was the most variable position out Table 4. Prediction of TFs binding to 6 different SNP sites. TFs with a score of more than 10 or a difference in score of more than 4 are shown. www.nature.com/scientificreports/ of all six positions. Fourteen different TFs from the FOX family, 4 TFs from Sox, 2 TFs from HOX and CDX2 families were predicted to bind to this site in the case of the A allele, whereas no unique TFs were found to bind when the G allele was present (*B1b). Four TFs were found to bind to both alleles but again preferentially to the A allele with a score difference of more than 4: 2 HOX family TFs, FOXF2, and CDX1. At position -567, it was found that only GATA1 exhibited high binding capacity. The score difference of -5.451 suggests that the G allele is more similar to the consensus GATA binding sequence in comparison to the T allele. Predictions at position -631 suggested binding of PRDM1 and IRF1 but allele-specific binding scores were almost identical. All other TFs had a score lower than 10. It is of note nevertheless that there are possibly 13 TFs binding to the G allele in comparison only to 2 in the T allele. At position -1142 only 2 potential TFs were predicted to bind to the G allele (NR1A4 and NFIC) but had a low score and no TFs were found for the C allele.

Discussion
In the present study, we have analysed the distribution of the GSTA1 promoter haplotypes in varying human populations to understand how ethnicity could impact the GSTA1 functional profile. Six SNPs that were previously investigated for functional impact on the GSTA1 function were selected for this study 4,27,[30][31][32] . We show that there are significant differences between human populations with regards to their potential functionality to express GSTA1 as assessed by LRA.
Although the population data demonstrate a very similar distribution of the GSTA1*A and GSTA1*B across the populations, a significant difference was apparent when a more detailed analysis of the additional SNPs, located in the GSTA1 promoter region, were included in the haplotype analyses. The results suggest that the genotyping of all six polymorphisms in the GSTA1 promoter should be performed in some populations but may not be crucial for others.
One such example where genotyping only for -52/-69 position could be sufficient is African super-population which is fairly homogenous, composed mainly of *A2 (representing 96% of all *A haplotypes) and *B2 (representing 76% of all *B haplotypes) which altogether represent 89,6% of all haplotypes. Nevertheless, even in such a population, a significant number of individuals could be mischaracterized due to the presence of other less frequent alleles. In contrast, in American, European, and South Asian super-populations, the variations of *A and *B are more frequent, leading to diverse diplotype possibilities of potentially very different GSTA1 functionality, making the functional prediction of GSTA1 through promoter genotyping more challenging. East Asian super-population is particular because it has a homogenous frequency of *B1b which accounts for more than 96% of all *B haplotypes. Thus, the detection of one haplotype *B, regardless of the diplotype, will result in poor Bu metabolisation since *B1b has been associated with low expression even in combination with *A 27 . Nevertheless, *A represents 86.5% of all haplotypes and is very diverse in this population, thus the genotyping of positions -631 and -1142 is still necessary to distinguish between rapid and intermediate metabolisers in *A homozygous individuals.
These findings could explain some discordant results reported by other groups concerning the association of Bu clearance and GSTA1 *A/*B haplotypes through population PK modeling. Zwaveling et al. 33 , in a study on children mostly of European origin, failed to detect any association between GSTA1 promoter haplotypes and the clearance of Bu based only on the *A/*B haplotypes 33 . In contrast, Choi et al. 34 , reported a 15% decrease of Bu clearance in Korean adults carrying at least one *B haplotype 34 . Results reported in this study, suggest that the Korean patients genotyped as *B may have a *B1b haplotype. This haplotype is consistently associated with the lowest expression of GSTA1, according to our LRA results. On the other hand, in the European super-population, the most frequent *B haplotype is *B1a (88% of all *B) which has a 1.63-fold higher expression than *B1b. In contrast to *B1b, *B1a leads to a poor GSTA1 expression only when in combination with another *B haplotype, otherwise the combinations with the most common *A haplotypes of European super-population result in an intermediate Bu metabolising capacity.
Although the majority of diplotypes can be deduced with high probability after the use of common genotyping methods, there is a possibility of diplotype mischaracterization given a large number of theoretically possible diplotypes. The most notable example is ambiguity between *A1*B1a and *A3*B2. Even though both diplotypes fall into the same metabolic group with regards to Bu clearance, they were significantly different when compared with LRA and might result in being part of different metabolic groups when other drugs are involved. Our newly optimized genotyping protocol demonstrated that in case of ambiguous or potentially faulty genotyping results real diplotypes could be obtained with ease and without the need to rely on inference ( Supplementary Fig. 3).
Using different variations of the GSTA1 promoter regions, we were able to infer the functional impact of these differences and establish a new model of TF interactions based on these results. As demonstrated in our previous research the GSTA1 metabolic profile as assessed by LRA will not necessarily be completely concordant with metabolic groups established through the use of in vivo data (Figs. 1B and 4 -*B1b combinations with *A haplotypes). Nevertheless, as demonstrated in our previous publications 27,29,30 , three Bu-based metabolic groups that tightly follow LRA stratification of GSTA1 could be established using PK data of Bu clearance in humans.
Lastly, the data from LRA allowed us to better understand the functioning of the GSTA1 promoter region. As demonstrated in the LRA results and as published previously the *A/*B haplotype determines the strength of the promoter 21 . However, other sites (-513, -567, and -1142) interact with the -52/-69 position and in some cases between each other. Of particular interest is the SNP at site -1142, which appears to be modulated by both -631 and by the -52/-69 and site -513 which is modulated by -1142 and by the -52/-69. On the other hand, the -567 position seemed to be of lesser importance. Previous research demonstrated that -52/-69 is a site of Sp1 TF 21 . Sp1 is known to be able to partner with other TFs 35 . To understand which other TFs could be bound to these sites, in silico predictions were performed. Consistent with previous research, we identified an Sp1 binding site at position -52/-69. In accordance with LRA results, GC (*A) allele appears to bind more TFs (6)  www.nature.com/scientificreports/ allele (*B) (1). Some of those like E2F6 suggest the presence of a nearby transcription start site and may not be present if there is an AT allele pointing to the reason why AT allele (*B haplotypes) in general expresses GSTA1 at a lower rate 36 . In our model, site -513 is predicted to interact with the -52/-69 and with -1142. Curiously, the in silico prediction demonstrates a wide range of possible TFs binding at this location. The most notable appears to be TFs from FOX, Hox, Sox, and CDK family. The DNA sequence of an A allele appears to be more similar to the consensus sequences of all 4 families. The least explored site -567 which is in strong LD with the -52/-69, presents itself as a possible GATA binding position. The results of the LRA suggest that the -567 site is in strong co-operation with the -52/-69 possibly through the interaction of Sp1 and GATA1 35 . This interaction could explain why there is a further boost to transcription activity in the context of *A haplotype (GCA TTC vs GCA GTC ) but not *B haplotypes (Fig. 3D). In addition to GATA1, G and T alleles are both predicted to bind 6 unique TFs, however, due to the lower score, it is questionable if this finding could have any functional significance. The last two positions, -631 and -1142, that interact with each other demonstrated no clear TF predictions for -1142 apart from NR1A4 and NFIC that demonstrated only low similarity in presence of G allele. Interestingly, the G allele at site -631 had a higher number of predictions in comparison to T allele, which demonstrated similarity to only 2 unique TFs. In addition to the interaction between each other TFs at the -1142 site appears to be in strong co-operation with the -52/-69 and above discussed -513 site. On the other hand, changing the -52/-69 haplotype or -567 does not appear to change the functioning of -631 suggesting that -631 does not co-operate with those two sites. It has to be noted that there are other SNPs present in the GSTA1 promoter region. Particularly, the African super-population is genetically diverse. It has at least 4 SNPs which have a frequency of more than 4.5% that are very rare or absent in all other human populations (rs3996998, rs9296692, rs3756985, rs56320607). Our approach was to include globally more frequent SNPs in our analysis for which functionality had been previously assessed. Although scientifically relevant, the inclusion of population-specific SNPs would lead to the prohibitively high number of reporter plasmid combinations. In addition to the above-mentioned limitations, it remains unclear why individuals with *B1b haplotype demonstrated low Bu-clearance. One explanation could be that the additive model of haplotypes does not reflect the reality in the case of *B1b and that there is an interaction between 2 alleles.
In conclusion, this study demonstrates a clear ethnic difference in the promoter region of this important pharmaco-gene: GSTA1. We were able to identify new GSTA1 promoter haplotypes, characterize their impact on expression by using LRA, and predict their metabolic impact. A new model of interaction of six polymorphic positions was established with a clear interaction between positions -52/-69, -513, and -1142. While the -52/-69 Sp1 binding element remains the most important determinant of GSTA1 expression, it has to be complemented with the genotyping of additional SNPs to determine a metabolic status even if the ethnicity of the individual is known. We established a specific and optimized method for reliable haplotyping of GSTA1 promoter that does not rely on computational inference but rather precise molecular characterization through cloning. Lastly, the data in the current paper represent the opportunity to establish similar metabolic groups for all drugs metabolised by GSTA1. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.