Introduction

The CYP2C8 and CYP2C9 enzymes are members of the cytochrome-P450 CYP2C subfamily of enzymes that are responsible for metabolizing approximately 20% of pharmaceutical drugs used today. CYP2C8 metabolizes various endogenous compounds including arachidonic acid, retinoic acid, and therapeutic drugs such as the widely used chemotherapeutic agent paclitaxel.1 CYP2C8 and CYP2C9 activities overlap in that both enzymes can metabolize arachidonic acid, several nonsteroidal anti-inflammatory drugs, retinoic acid and others. CYP2C9 separately has been implicated in the metabolism of warfarin, celecoxib, nateglinide and others.2, 3

Both CYP2C8 and CYP2C9, the genes coding for these enzymes, are known to be polymorphic (first reported by Dai4 and Stubbins5), and previously published data suggest frequencies of the variant alleles differ among ethnic populations.6 This genetic variation has been shown to have pharmacological implications. The specific functional consequences of the derived variants depend on the substrate. For CYP2C8 264Met, the activity ranges from 0 to 15% compared to 264Ile,7 and for 269Phe the activity for paclitaxel is 50% that of 269Ile.4 At CYP2C9, activity of 144Cys ranges from 10 to 90% of the activity of 144Arg.2 The derived CYP2C8 variants, 139Lys and 399Arg, which occur together on the same molecule, produce an enzyme with 15–40% of activity compared to the ancestral form,4 depending on substrate. Clearer understanding of the associations can move us a step forward toward individualized drug administration and dosing.8 The two genes are adjacent and within a genomic segment of 130 kb on chromosome 10.

Several studies suggest a relationship between CYP2C8 variants, the metabolic index of paclitaxel and drug-related toxicities.4, 9, 10, 11, 12, 13 Though many of those studies showed interindividual variation in paclitaxel pharmacokinetics, some found no association with specific functional variants at CYP2C8, arguing that those variants were not responsible for the variation. However, other variants at CYP2C8 do seem relevant.13 In addition, ethnic distributions of the CYP2C8 polymorphisms suggest that whichever variant alleles affect paclitaxel metabolism may be more common in a particular ethnic group.14, 15, 16 This observation suggests one explanation for varying degrees of toxicities from therapeutic drugs, not limited to paclitaxel, seen in patients of different ethnic ancestries.17, 18, 19, 20, 21, 22 To test the hypothesis that CYP2C8 variation is related to paclitaxel metabolism and toxicity, we are conducting a clinical trial testing the effects of CYP2C8 haplotypes on drug-related toxicities from paclitaxel in different ethnic groups of breast cancer patients. In designing and conducting a genotype–phenotype association study that can also address interethnic genetic variation, an accurate characterization of global variation in target genes is critical. As such data for CYP2C8 haplotypes are scanty, we have undertaken to assemble the necessary dataset. As linkage disequilibrium (LD) between CYP2C8 and CYP2C9 has been reported23 and can be seen in HapMap data, we included some CYP2C9 polymorphisms in this initial study.

Using available public databases detailing known common coding and noncoding polymorphisms, we selected and studied 10 single nucleotide polymorphisms (SNPs) of the CYP2C8 and CYP2C9 genes in populations around the world (Table 1). We also estimated haplotype frequencies for the CYP2C8–CYP2C9 region in these diverse populations. In this paper, we present our initial data demonstrating population-specific allele and haplotype frequencies of CYP2C8 and CYP2C9. This unique dataset will be instrumental in future investigations of differences in CYP2C8-mediated drug metabolism among ethnic populations and in designing clinical trials to examine the putative functional impact of genetic polymorphisms of CYP2C8 and CYP2C9.

Table 1 The 10 polymorphisms (SNPs) studied across CYP2C8 and CYP2C9

Results

CYP2C8-CYP2C9 polymorphism frequencies

Samples from more than 2500 individuals were typed for 10 selected SNPs across a region of 132 kb on chromosome 10 encompassing CYP2C8 and CYP2C9. SNPs across the CYP2C8–CYP2C9 region showed significant allele frequency variation among populations in different geographic regions (Table 1; Figure 1). None of the 450 Hardy–Weinberg tests (45 populations for 10 SNPs) was significant when multiple testing was taken into account. Thus, our assumption that no cryptic variation interfered with the genotyping assays seems reasonable, though we cannot exclude rare sequence variants that may have interfered with the TaqMan assay resulting in a ‘null’ allele. All of the allele frequency data can be found in ALFRED (the ALlele FREquency Database) using the dbSNP rs# in Table 1 as a keyword. The typing of our non-human primate samples unambiguously identified the ancestral human sequences and was in agreement with the data in the UCSC Genome Browser whenever primate sequences were available.
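To make the scale of the multiple-testing problem concrete, the following minimal sketch computes a per-test threshold for 450 tests. The text does not state which correction was applied; a Bonferroni correction and a family-wise alpha of 0.05 are assumed here purely for illustration.

```python
# Illustrative only: the correction method (Bonferroni) and alpha = 0.05
# are assumptions, not details taken from the study.

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of nominally significant tests if every null is true."""
    return alpha * n_tests


n_populations = 45
n_snps = 10
n_tests = n_populations * n_snps  # 450 Hardy-Weinberg tests

per_test_alpha = bonferroni_threshold(0.05, n_tests)
chance_hits = expected_false_positives(0.05, n_tests)

print(f"tests: {n_tests}")
print(f"Bonferroni per-test threshold: {per_test_alpha:.6f}")
print(f"nominal hits expected by chance at 0.05: {chance_hits:.1f}")
```

Under these assumptions, some nominally significant results would be expected by chance alone, which is why the uncorrected P-values are not interpreted individually.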

Figure 1

A graphical depiction of the frequencies of the ancestral alleles at the 10 single nucleotide polymorphisms (SNPs) studied in the 45 populations. The figure illustrates the frequencies of these alleles in 45 different populations (listed across the bottom) from around the world. Populations are ordered generally as Africa, Europe, East Asia, Americas.

Haplotype frequencies and linkage disequilibrium

Seventeen common haplotypes were found (Table 2). Estimated haplotype frequencies, plotted in Figure 2, showed that considerable geographic variation exists in haplotype frequencies and that some of the commonly studied functional variants exist on more than one haplotype. Moreover, individual haplotypes contain different combinations of some of the functional variants. Haplotype frequencies are available in ALFRED. There is a complex pattern of pairwise LD that varies among populations as different haplotypes become more or less frequent (data not shown). Because of the large sample sizes studied in many populations, and the large number of populations, we have a clearer understanding of the genetic variation seen previously in individual populations24 and at each locus independently.25 Haplotype G, seen at moderate frequencies (0.05–0.23) only in Europe and Southwest Asia, contains three coding variants: CYP2C8*139Lys, CYP2C8*399Arg and CYP2C9*144Cys. Other combinations of those variants occur in haplotypes H and K. In Africa, the combined frequencies of CYP2C8*269Phe-containing haplotypes B, L and M range from 0.06 to 0.28, but these haplotypes are virtually absent elsewhere in the world. Haplotypes C and N have the CYP2C8*264Met variant and occur primarily in European populations, but always at low frequencies. Known functional variation is vanishingly rare in Central Asia through East Asia, the Pacific and the Americas.

Table 2 The haplotype compositions of the 17 common haplotypes, color coded as in Figure 2
Figure 2

CYP2C8–CYP2C9 haplotype frequencies. The frequencies of the 17 common haplotypes of CYP2C8–CYP2C9 single nucleotide polymorphisms (SNPs). The figure illustrates the frequencies of these haplotypes in 45 different populations listed from top to bottom on the left side in the same order as in Figure 1. The length of each colored bar represents the frequency of the corresponding haplotype in that population. The allelic compositions for the haplotypes are given in Table 2; the frequencies are in ALFRED. Haplotypes that are rare (less than 5%) in all populations are combined into a residual class. Note that coding variants can occur in different combinations and as part of different haplotypes.

Discussion

Genetic variation and cancer therapy

Genetic variation has been considered, for some time, to be a major reason for varying susceptibility of individuals to diseases and responses to drugs. Following the completion of the human genome project, the field of pharmacogenomics has captured the imagination of many clinical investigators with the prospect of realizing personalized medicine. Naturally, cancer pharmacologists have invested significant effort and resources into investigating variation among individuals in drug responses and toxicities to chemotherapy agents, as anticancer agents often have a narrow therapeutic window and consequences of suboptimal dosing can be detrimental. For example, the implications of variant alleles of dihydropyrimidine dehydrogenase on toxicity of 5-fluorouracil-based agents are well described in the literature.26 In addition, polymorphism at CYP2D6 has predictive value in treatment of breast cancer with tamoxifen, a known substrate of this enzyme. Furthermore, the observation that genetically variant UDP-glucuronosyltransferase 1A1 accounts for significant toxicity from irinotecan therapy led the US Food and Drug Administration to recommend dose adjustment according to each patient's genomic profile.

CYP2C8, one of the first human cytochrome P450 genes to be cloned,27 is well recognized as a significant enzyme in metabolism of numerous therapeutic drugs. Considerable interindividual variation in the metabolism of CYP2C8-specific substrates does exist12, 13 and this variation has been associated with polymorphism in CYP2C8, which is reported to be the primary enzyme responsible for the elimination and detoxification of paclitaxel. Several association studies have demonstrated a relationship between CYP2C8 variant alleles, the metabolic index of paclitaxel9, 11, 13, 28, 29 and drug-related toxicities. In addition, the ethnic distributions of these polymorphisms had suggested that variant alleles that affect paclitaxel metabolism are more common in particular ethnic groups.6 Extensive ethnic variation in these polymorphisms is now confirmed in this study, but this study does not directly address the relevance of these polymorphisms to paclitaxel metabolism.

Value of haplotypes

It has become clear that the analysis of a single SNP in a candidate gene may be grossly inadequate in examining genotype–phenotype relationships. Analysis of haplotypes, the combinations of SNP alleles on one chromosome, can show that certain alleles at different SNPs often, or usually, occur on the same chromosomes in the population at frequencies greater than expected by chance, that is, there is LD. Proteins are encoded in cis from each single chromosome; so the combination of coding variants on a chromosome determines the resulting properties of the enzyme encoded by that chromosome. In addition, many cis-acting regulatory sequences can show variation and determine how much of and in which tissues a protein is synthesized. Hence, haplotype analysis is an essential method for detecting associations between sets of genetic variation and gene function whenever more than a single functional SNP may be involved. The value of haplotypes is illustrated by the work of Rodríguez-Antona et al.,13 who studied 12 SNPs across the CYP2C8 gene in a moderate sample of 54 unrelated individuals defined loosely only as ‘Caucasians’. They could not find an effect on paclitaxel metabolism associated with heterozygosity for either the haplotype containing CYP2C8*399Arg and CYP2C8*139Lys or the haplotype containing CYP2C8*264Met. However, a different haplotype, labeled B in their study, did show increased 6α-hydroxylation of paclitaxel. This haplotype B had derived nucleotides in the promoter region, in intron 2 and in intron 7.
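The notion of alleles co-occurring "at frequencies greater than expected by chance" can be quantified with the standard pairwise LD statistics D, D′ and r². The sketch below is a minimal illustration for two biallelic SNPs; the function name and the input frequencies are hypothetical and are not values from this study.

```python
# Minimal sketch of pairwise LD statistics for two biallelic SNPs.
# Inputs are the four two-SNP haplotype frequencies (they must sum to 1):
#   h11: derived allele at both SNPs     h10: derived at SNP 1 only
#   h01: derived at SNP 2 only           h00: ancestral at both
# The numbers used below are hypothetical, chosen only to show the arithmetic.

def pairwise_ld(h11: float, h10: float, h01: float, h00: float):
    """Return (D, |D'|, r^2) for two biallelic SNPs."""
    total = h11 + h10 + h01 + h00
    if abs(total - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")

    p1 = h11 + h10          # derived-allele frequency at SNP 1
    p2 = h11 + h01          # derived-allele frequency at SNP 2
    d = h11 - p1 * p2       # excess of the double-derived haplotype over chance

    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:          # at least one SNP is monomorphic: LD is undefined
        return 0.0, 0.0, 0.0

    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))

    return d, abs(d) / d_max, d * d / denom


# Two derived alleles that are always found together (complete LD).
d, d_prime, r2 = pairwise_ld(0.10, 0.0, 0.0, 0.90)
print(f"D = {d:.3f}, |D'| = {d_prime:.3f}, r^2 = {r2:.3f}")

# Alleles combining at random (no LD).
d0, d_prime0, r2_0 = pairwise_ld(0.25, 0.25, 0.25, 0.25)
print(f"D = {d0:.3f}, |D'| = {d_prime0:.3f}, r^2 = {r2_0:.3f}")
```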

Haplotypes in our study

In this study, we genotyped and haplotyped CYP2C8 and CYP2C9 alleles in a large population cohort representing 45 populations from around the world. This dataset represents the most comprehensive study of frequencies of multiple CYP2C8 and CYP2C9 SNPs in diverse populations thus far. We observed striking allele frequency variation of individual polymorphisms among different populations (Figure 1; data in ALFRED). This finding is in line with general expectations for human polymorphisms and with the long-standing clinical observation of ethnic differences in response to and toxicity from various drugs metabolized by the CYP2C subfamily. The global variation in haplotype frequencies (Figure 2; data in ALFRED) similarly shows large interpopulation variation. For example, haplotype G reaches an average frequency of 0.10 in Europe, whereas its frequency averages less than 0.01 in other regions of the world. It is reasonable to hypothesize that some of these population-specific distributions of individual alleles and/or haplotypes may have functional impact, resulting in interindividual and interethnic heterogeneity in drug metabolism, response and toxicity.

Linkage disequilibrium between CYP2C9 and CYP2C8

Three of the coding missense variants are especially interesting from a population genetics perspective and have significant research implications, and potentially clinical implications, in European populations. The three are CYP2C9*144Cys, CYP2C8*399Arg and CYP2C8*139Lys. These three occur in only three haplotypes: G, H and K (Table 2 and Figure 2). Although CYP2C8*399Arg and CYP2C8*139Lys are always seen in cis, they occur with the CYP2C9*144Cys variant in haplotype G and without it in haplotype K. The CYP2C9*144Cys allele occurs by itself in haplotype H. Only in the European and Southwest Asian populations do we see haplotypes G, H and K simultaneously present. In these populations on average 90% of chromosomes with the CYP2C9*144Cys allele also carry these two CYP2C8 variants (Table 3). Thus, an exploratory study in Europeans and/or Southwest Asians of association of the CYP2C9*144Cys allele with a trait (disease risk, therapy response and others) will usually be simultaneously studying CYP2C8 variation. Conversely, on average 89% of chromosomes with either of the two CYP2C8 variants also carry the CYP2C9*144Cys allele. This strong association implies that if any of the three relevant SNPs gives a positive result in a case–control study, it will be difficult to distinguish any association with CYP2C9 from an association with CYP2C8 and vice versa. Given studies of CYP2C8/CYP2C9 pharmacokinetics of ibuprofen,29 it is likely that haplotype effects will be seen for other joint substrates of CYP2C8 and CYP2C9, such as verapamil, isotretinoin, retinoic acid, arachidonic acid and fluvastatin.1, 30 Moreover, given this pattern of LD, there could well be other important variation yet to be identified, especially on haplotype G.
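The co-occurrence percentages discussed above follow directly from the frequencies of haplotypes G, H and K. The sketch below shows the arithmetic; the function name and the frequencies are hypothetical placeholders and are not the values underlying Table 3.

```python
# Sketch of how co-occurrence fractions follow from haplotype frequencies.
#   G: CYP2C9*144Cys together with CYP2C8*399Arg and CYP2C8*139Lys
#   H: CYP2C9*144Cys alone
#   K: the two CYP2C8 variants without CYP2C9*144Cys
# The frequencies below are hypothetical and NOT taken from Table 3.

def co_occurrence(freq_g: float, freq_h: float, freq_k: float):
    """Return (P(2C8 variants | 144Cys), P(144Cys | 2C8 variants))."""
    with_144cys = freq_g + freq_h        # all chromosomes carrying 144Cys
    with_2c8_variants = freq_g + freq_k  # all chromosomes carrying 399Arg + 139Lys
    if with_144cys == 0 or with_2c8_variants == 0:
        raise ValueError("variant absent from the population; fraction undefined")
    return freq_g / with_144cys, freq_g / with_2c8_variants


given_2c9, given_2c8 = co_occurrence(freq_g=0.09, freq_h=0.01, freq_k=0.02)
print(f"144Cys chromosomes that also carry the CYP2C8 variants: {given_2c9:.0%}")
print(f"CYP2C8-variant chromosomes that also carry 144Cys: {given_2c8:.0%}")
```

When both fractions are high, a signal at any one of the three SNPs cannot be attributed to one gene rather than the other without additional data.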

Table 3 The interdependencies of the CYP2C9*144Cys and CYP2C8*399Arg plus CYP2C8*139Lys alleles in European and Southwest Asian populations

Other functional variation?

Previous studies of variation at CYP2C8 and CYP2C9 may not have uncovered the true wealth of African-specific functional variation. For example, haplotypes B, L and M, containing the CYP2C8*269Phe allele, reach a combined average frequency of 0.17 in Africa, while occurring at less than 0.01 in the rest of the world. Thus, because about 17% of chromosomes in African populations carry the variant CYP2C8*269Phe allele, a substantial fraction of African patients may have a very different drug response. In addition, at least three other haplotypes (D, F and O) reach an average frequency of greater than 0.10 in Africa, but are less frequent, on average, in Europe and East Asia, where most previous resequencing efforts have been concentrated. Resequencing a number of African individuals with these haplotypes may discover frequent coding variants, which may provide explanations for pharmacogenomic diversity unexplained by the Arg139Lys, Ile264Met, Ile269Phe and Lys399Arg polymorphisms in CYP2C8 and the Arg144Cys polymorphism in CYP2C9. As we approach the ability to truly examine each individual's DNA sequence to predict optimal therapy, we will need to know what actual genetic variation exists in the population to interpret an individual's DNA sequence.
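A haplotype frequency is a per-chromosome quantity, so translating it into the fraction of individuals affected requires an assumption about how chromosomes pair. The sketch below assumes Hardy–Weinberg proportions, which is an assumption made here for illustration; it uses the combined average frequency of 0.17 quoted above.

```python
# Sketch: converting a haplotype (per-chromosome) frequency into the expected
# fraction of individuals carrying it. Hardy-Weinberg proportions are ASSUMED.

def carrier_fractions(haplotype_freq: float):
    """Return (at least one copy, exactly one copy, two copies) per individual."""
    if not 0.0 <= haplotype_freq <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    f = haplotype_freq
    two_copies = f * f
    one_copy = 2 * f * (1 - f)
    at_least_one = one_copy + two_copies   # equals 1 - (1 - f)**2
    return at_least_one, one_copy, two_copies


carriers, heterozygotes, homozygotes = carrier_fractions(0.17)
print(f"carry at least one copy: {carriers:.1%}")
print(f"heterozygous:            {heterozygotes:.1%}")
print(f"homozygous:              {homozygotes:.1%}")
```

Under this assumption the proportion of individuals carrying at least one CYP2C8*269Phe-containing haplotype is larger than the haplotype frequency itself, and a small proportion would carry two copies.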

As a further guide to optimizing design of resequencing efforts to identify additional sequence variation, it is useful to identify the distinct evolutionary sequences that represent lineages that may have independently accumulated variation. To that end we have examined the existing haplotypes we have identified in conjunction with knowledge of the ancestral sequences. Three segments of this 132 kb region of chromosome 10 show no obligate crossover products. The 17 common haplotypes can be explained by a series of historical crossovers among the different haplotypes of the three segments (Supplementary Material). In addition to testing multiple populations for whatever new polymorphic variation is found, we need to test more existing SNPs in known regulatory regions as well as those that might have less obvious functional consequences: synonymous coding SNPs that could affect mRNA folding and/or mRNA stability as well as intronic SNPs that could affect splicing or may have regulatory function. SNPs recently implicated in paclitaxel pharmacokinetics have especially high priority (for example, ref. 13). Whatever complexity additional SNPs may add to the initial haplotype frequency data we have presented here, additional SNPs can only subdivide the haplotypes defined by the 10 SNPs we have studied and cannot alter the frequencies of the existing combinations of alleles at the missense SNPs.

Lack of comparability across studies

The Rodríguez-Antona et al. study13 also illustrates a general problem: the need to consider these loci as haplotypes based on a common set of polymorphisms with each haplotype treated as an allele. We are not able to rigorously compare our 8-SNP haplotypes across CYP2C8 (ignoring the 2 SNPs at CYP2C9) with the 12-SNP haplotypes of Rodríguez-Antona et al.13 because only four SNPs were studied in common. Based on the four SNPs studied in common, their haplotype B (which shows altered paclitaxel metabolism) falls within a pool of 10 of our 17 common haplotypes (see Supplementary Data). This pool of haplotypes has a total frequency ranging from 68 to 85% among our 10 specific European samples. An obvious future study for us is to add into our dataset the SNPs that specifically identified their haplotype B to allow meaningful comparison of datasets. Similarly, many other studies of CYP2C8 have used only subsets or overlapping sets of polymorphisms to define the relevant haplotypes. We believe that a database of dense marker coverage of each gene in multiple populations will be necessary to define what haplotypes are common in different parts of the world. Such data allow reasonable inference from a few markers of the actual haplotypes present in a future study and allow meaningful comparisons among studies. Our study is an initial step toward such a database.
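The comparison described above amounts to projecting each haplotype onto the SNPs shared by two studies and pooling the haplotypes that become indistinguishable. The sketch below illustrates that operation on toy data; the haplotype labels, allele strings, SNP positions and frequencies are all hypothetical and do not correspond to Table 2 or to the Supplementary Data.

```python
# Sketch of collapsing haplotypes onto the SNPs two studies have in common.
# Alleles are coded "0" (ancestral) or "1" (derived), one character per SNP.
# All labels, allele strings and frequencies below are hypothetical toy data.

def project(haplotype: str, shared_positions):
    """Keep only the alleles at the SNP positions shared by both studies."""
    return tuple(haplotype[i] for i in shared_positions)


def pool_matching(haplotypes, frequencies, shared_positions, target):
    """Return the haplotypes matching `target` on the shared SNPs and their total frequency."""
    names = [
        name
        for name, alleles in haplotypes.items()
        if project(alleles, shared_positions) == tuple(target)
    ]
    return names, sum(frequencies[name] for name in names)


toy_haplotypes = {"W": "0000", "X": "0100", "Y": "1000", "Z": "0011"}
toy_frequencies = {"W": 0.5, "X": 0.2, "Y": 0.2, "Z": 0.1}

# Suppose only the first two SNPs were typed in both studies, and the other
# study's haplotype of interest is ancestral at both of them.
pool, pooled_freq = pool_matching(toy_haplotypes, toy_frequencies, [0, 1], "00")
print(pool, round(pooled_freq, 3))
```

With few shared SNPs, many distinct haplotypes collapse into the same pool, which is why the pooled frequency can be large even if the haplotype of interest is itself uncommon.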

We believe this study also provides a basis for systematic searches for additional functional variation that may occur at a moderate frequency in one or more regions of the world. When more SNPs are systematically typed across CYP2C8 and CYP2C9 (and the nearby CYP2C18 and CYP2C19), a better picture of the population genetic structure of this region will emerge.

Methods

Markers studied

Using dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), the International HapMap Project (www.hapmap.org), Applied Biosystems' SNPBrowser (http://marketing.appliedbiosystems.com/mk/get/snpb_landing) and the UCSC Genome Browser (http://genome.ucsc.edu/) we compiled data on confirmed SNPs, including the more common functional variants at CYP2C8 and CYP2C9. From this set we selected 10 SNPs likely to be reasonably heterozygous in some part of the world, or confirmed to have functional consequences and frequencies in the polymorphic range, that is, the frequency of the less common allele was >0.01 (Table 1). Of the 10 SNPs, 5 were coding SNPs (cSNPs), 4 in CYP2C8 and 1 in CYP2C9. We emphasized SNPs at CYP2C8 but, based on some reports of LD with CYP2C9,6, 23 we included two SNPs at the 5′ end of CYP2C9. The five noncoding SNPs were chosen to show different frequency patterns among populations already studied in an attempt to maximize haplotype diversity with a small number of SNPs. The final set covered CYP2C8 from just upstream of the coding sequence to just shy of the last exon. The two CYP2C9 SNPs extended the total interval covered to just past the 5′ end of CYP2C9. We did not study variants that had only been reported in single individuals.

Populations studied

We used our resource at Yale University that contains DNA of more than 2500 unrelated individuals representing populations from around the world; this resource has been used for many genetic studies.31, 32, 33, 34 The 45 populations represented in this study include 10 African (Biaka, Mbuti, Yoruba, Ibo, Hausa, Chagga, Masai, Sandawe, Ethiopian Jews and African Americans), 3 Southwest Asian (Yemenite Jews, Druze and Samaritans), 10 European (Adygei, Chuvash, Vologda Russians, Archangel Russians, Ashkenazi Jews, Finns, Hungarians, Danes, Irish and European Americans), 2 Northwest Asian (Komi Zyriane and Khanty), 8 East Asian (Chinese from San Francisco, Taiwan Han Chinese, Hakka, Koreans, Japanese, Ami, Atayal and Cambodians), 1 Northeast Siberian (Yakut), 2 from Pacific Islands (Nasioi Melanesians and Micronesians), 4 North American (Cheyenne, Pima from Arizona, Pima from Mexico, Maya) and 4 South American (Quechua, Ticuna, Rondonia Surui, Karitiana). All subjects gave informed consent under protocols approved by the committees governing human subjects research relevant to each of the population samples. Sample descriptions and sample sizes can be found in ALFRED, the ALlele FREquency Database (http://alfred.med.yale.edu) and in a previous publication.34 We also typed three individuals from each of five other primate groups: Pan troglodytes, Pan paniscus, Gorilla gorilla, Pongo pygmaeus and Hylobates sp.

Marker typing

All markers were typed using TaqMan assays purchased from Applied Biosystems; the assay numbers are given in Table 1. Manufacturer's protocols were followed, with reaction volumes reduced to 3 μl, run in 384-well plates and read on an AB7900HT using Applied Biosystems' SDS (sequence detection system) software.

Statistical analyses

Genotype and allele frequencies for each individual site were estimated by simple gene counting, assuming codominant inheritance with no silent alleles; data were consistent with that assumption. Hardy–Weinberg ratios were tested by χ2 and by an ‘exact’ test using 1000 simulations when small (<5) observed numbers were present for one or more genotypes in a population. Haplotypes were estimated using PHASE35 and fastPHASE36 with subpopulation information. There were <1% discrepancies in estimation between the two methods; discrepancies were individually examined. Differences in estimation of missing data for a SNP were responsible for all phasing discrepancies, with the data from PHASE generally being more consistent (fewer rare haplotypes), so the PHASE haplotypes were used. Genotypes of individuals were analyzed by HAPLOT37 to identify regions with strong LD, sometimes called ‘LD blocks’ (data not shown). We have not attempted to relate these haplotypes to the CYPAlleles nomenclature (www.cypalleles.ki.se) because we have not studied all of the SNPs used in some of the definitions and, as demonstrated here, the allelic definitions for CYP2C9 are not independent of those for CYP2C8.
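Gene counting and the χ2 test of Hardy–Weinberg proportions can be sketched for a single biallelic SNP as follows. This is a minimal illustration rather than the software used in the study; the function names and genotype counts are hypothetical, and the simulation-based ‘exact’ test used for small counts is not reproduced here.

```python
import math

# Minimal sketch: allele frequency by gene counting and a one-degree-of-freedom
# chi-square test of Hardy-Weinberg proportions for one biallelic SNP.
# The genotype counts used below are hypothetical.

def allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of allele A by gene counting (two alleles per individual)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    return (2 * n_aa + n_ab) / (2 * n)


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi-square statistic, P-value) for departure from Hardy-Weinberg."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


freq_a = allele_frequency(30, 40, 30)
chi2, p_value = hwe_chi_square(30, 40, 30)
print(f"allele A frequency: {freq_a:.2f}")
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```

The χ2 approximation is unreliable when expected counts are small, which is the situation in which the text switches to the simulation-based test.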

Conflict of interest

The authors declare no conflict of interest.