Introduction

Single nucleotide polymorphism (SNP) is the most common form of genetic variation in the human genome. SNPs mirror preceding mutations that were theoretically occasional and not recurrent. Hence, two individuals sharing a variant allele are more likely to be descended from a more recent common ancestor. Moreover, since particular alleles at adjacent loci tend to be coinherited, for tightly linked loci, this might lead to a nonrandom association among SNPs, termed “linkage disequilibrium” (LD), along chromosomes such that the presence of one variant predicts the presence of other variants (Ardlie et al. 2002; Weiss and Clark 2002). Consequently, the study of the pattern of LD enables us to gain new insight about ancestry of the examined population since LD mapping relies on detection of recombinationally conserved regions around an ancestral mutation (Reich et al. 2001). It is now possible to determine SNPs and linkage disequilibrium of SNPs in different individuals’ genomes, paving the way to the practice of personalized medicine (Kleyn and Vesell 1998; Nebert 1999; McCarthy and Hilfiker 2000; The International SNP Map Working Group 2001).

It is well known that any drug can elicit different responses in different individuals, and this is a major obstacle in drug development and clinical practice (Meyer 2000). Previous pharmacogenetic studies have shown striking differences between populations in the allele frequency distributions of genes known to encode drug-metabolizing enzymes (DMEs), drug receptors, drug transporters, and ion channels (Kleyn and Vesell 1998; Pirmohamed and Park 2001; Xie et al. 2001), underscoring the need to understand the ethnic background underlying these genes so as to accelerate the development of personalized medicine.

Many published pharmacogenetic studies (Meyer 1994; Ainoon et al. 1999; Hustert et al. 2001; Iwai et al. 2001; Xie et al. 2001; Ainoon et al. 2001; Chang et al. 2002; Chowbay et al. 2003; Tang et al. 2002; Kroetz et al. 2003; Tan et al. 2003) focused on individual genes and a limited number of mutations. Information on the allele frequency and LD distributions of these genes on a genome-wide scale is limited. A pioneering study by Goddard et al. (2000) elucidated the allele frequency and LD distributions of 114 SNPs in five human populations. However, little information is available to date on populations living in Asia.

In this study, we genotyped 240 sites known to be polymorphic in the Japanese population and located in 109 drug-related or candidate drug-related genes in each of 270 unrelated healthy individuals comprising 90 each of Malays, Indians, and Chinese living in Malaysia. Allele frequency and linkage disequilibrium distributions of these 240 candidate SNPs were determined, analyzed, and compared among themselves and with genotyping data of 752 unrelated healthy Japanese comprising 449 males and 303 females (Kamatani et al. 2004; Unpublished data of the Laboratory for SNP Genotyping, SRC, RIKEN). The implication of the similarities and differences in the number of polymorphic loci, as well as allele frequency and intermarker LD distributions, within and between populations of different ancestry for future pharmacogenetic studies was interpreted.

Materials and methods

Population samples

Blood samples were obtained with consent during various blood donation campaigns held in Kuala Lumpur and the Klang Valley from 270 unrelated healthy individuals living in Malaysia (90 each of Malays, Indians, and Chinese, with each group comprising 30 females and 60 males). Each donor was given a short interview before the donation to confirm their ethnicity and family history (no intermarriage for at least three generations). Although the samples were collected in the capital city of Malaysia, the donors originated from various states in Malaysia.

Marker selection

For each individual, 240 sites known to be polymorphic in the Japanese population were screened. These candidate SNPs are located in 109 drug-related genes, such as genes that encode DMEs and drug transporters, and are distributed on 20 human chromosomes excluding the sex chromosomes (Appendix). For each gene, one to four sites were genotyped. Allele frequencies and genotype frequencies were both evaluated for these 240 sites. For SNPs located within the same gene, pairwise intermarker LD was estimated through standard measurements (LD coefficient, D′, and difference in proportions, d). LD was not estimated for SNP pairs located on the same chromosome but in different genes.

Genotyping methods

Genomic DNA was extracted from white blood cells by using the Perfect gDNA Blood Mini Kit (Eppendorf, Germany). For each DNA sample, multiplex PCR was performed with 40 ng of genomic DNA in a total reaction volume of 20 μl. In each multiplex reaction, up to 96 pairs of primers were used for amplification with Ex-Taq DNA polymerase (Takara Bio Inc, Japan) in the presence of TaqStart Antibody (Clontech Laboratories, BD Biosciences, USA), which reduces the formation of primer dimers (Ohnishi et al. 2001). PCR was carried out with an initial denaturation step at 94°C for 2 min followed by 40 cycles of denaturation at 94°C for 30 s, annealing at 60°C for 30 s, and extension at 72°C for 2 min. These cycles were followed by a final extension at 72°C for 7 min. All primer sequences were the same as those developed for PCR amplification in the Japanese Single Nucleotide Polymorphisms (JSNP) discovery project (Hirakawa et al. 2001; Haga et al. 2002) and were provided by RIKEN. Information about these SNPs can be found at http://snp.ims.u-tokyo.ac.jp/.

All SNPs were genotyped by the Invader assay, which combines structure-specific cleavage enzymes with a universal fluorescence resonance energy transfer (FRET) system (Ryan et al. 1999; Mein et al. 2000). Allele-specific oligonucleotide pairs were designed and supplied by Third Wave Technologies and provided by RIKEN. FRET probes were labeled with FAM or VIC corresponding to each allele. Signal intensity was expressed as the ratio of FAM or VIC to ROX, an internal reference. Each 5-μl reaction contained 0.25 μl of FRET probe, 0.25 μl of signal buffer, 0.25 μl of structure-specific cleavage enzyme, 0.5 μl of allele-specific probes, 1.75 μl of distilled water, and 2 μl of PCR product diluted 1:20. Each PCR-Invader reaction was set up in a 96-well format. Samples were incubated in a GeneAmp PCR System 9700 (Applied Biosystems, USA) with an initial denaturation at 94°C for 5 min followed by incubation at 63°C for 15 min (Ohnishi et al. 2001). Signal was detected at the end point with the ABI PRISM 7700 Sequence Detection System.

Statistical analyses

Genotype and allele frequencies were calculated for each marker, and each locus was tested for Hardy–Weinberg equilibrium (HWE) with the standard Chi-square test (Weir 1996). SNPs showing deviation from HWE were excluded from the subsequent LD calculations and analyses. Haplotype frequencies for each locus were estimated by using the Expectation–Maximization (EM) algorithm (Slatkin and Excoffier 1996). Pairwise intermarker LD was expressed by D′ and d. D′ = D/Dmax is the LD coefficient (Devlin and Risch 1995), whereas d is the difference in proportions described by Nei and Li (1980), $d = \pi_{11}/\pi_{\cdot 1} - \pi_{12}/\pi_{\cdot 2}$, where $\pi_{ij}$ is the frequency of haplotypes with allele i at the first marker and allele j at the second, and $\pi_{\cdot j}$ is the frequency of haplotypes with either allele at the first marker and allele j at the second. The measures of LD were not calculated when only one allele was observed for at least one SNP in the pair.
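The two LD measures can be illustrated with a short Python sketch. It assumes that the four haplotype frequencies of a SNP pair have already been estimated, takes d as the difference between the two conditional proportions π11/π·1 and π12/π·2, and returns a signed D′. The function name and the example frequencies are hypothetical and are not taken from the study data.

```python
def ld_measures(p11, p12, p21, p22):
    """Return (D, D_prime, d) for two biallelic markers.

    pij is the frequency of haplotypes carrying allele i at the first
    marker and allele j at the second; the four values must sum to 1.
    """
    if abs(p11 + p12 + p21 + p22 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    # Marginal allele frequencies at each marker.
    a1, a2 = p11 + p12, p21 + p22   # first marker
    b1, b2 = p11 + p21, p12 + p22   # second marker (pi_.1 and pi_.2)
    if min(a1, a2, b1, b2) <= 0.0:
        # Mirrors the rule that LD is not calculated for monomorphic SNPs.
        raise ValueError("LD is undefined when a marker is monomorphic")
    d_raw = p11 * p22 - p12 * p21   # classical coefficient D
    # Largest |D| attainable for these allele frequencies.
    if d_raw > 0:
        d_max = min(a1 * b2, a2 * b1)
    else:
        d_max = min(a1 * b1, a2 * b2)
    d_prime = d_raw / d_max
    # Difference in proportions: P(allele 1 | B1) - P(allele 1 | B2).
    d_prop = p11 / b1 - p12 / b2
    return d_raw, d_prime, d_prop


# Hypothetical haplotype frequencies for one SNP pair.
D, D_prime, d = ld_measures(0.5, 0.1, 0.1, 0.3)
```

Under these definitions d equals D divided by the product π·1π·2, so both measures rescale the same underlying coefficient D.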

Results

Population differences in number of polymorphic loci

Of the 240 sites screened in each individual of the three Malaysian populations investigated, we found that some of these sites were nonpolymorphic in certain Malaysian populations. In Chinese, 226 sites were polymorphic while in Malays and Indians, 222 sites were polymorphic. On the whole, these three populations and the Japanese population shared 216 out of the 240 sites investigated. However, even though the Japanese population has the greatest variability of SNPs, most of them are of low minor-allele frequencies.

Departures from the Hardy–Weinberg equilibrium

Of the 240 sites genotyped, eight SNPs in each Malaysian population deviated from HWE at a significance level of 0.01, similar to the Japanese population, in which seven markers deviated from HWE. Most of these deviations were private to the corresponding population; only three of the 27 deviating SNPs were observed in more than one population. This Hardy–Weinberg disequilibrium (HWD) might indicate the existence of population structure in these subject populations. Allele frequencies of all genotyped SNPs are summarized in the Appendix, with SNPs showing deviation from HWE marked with double asterisks (**).
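As an illustration of the test applied to each locus, the following Python sketch computes the Chi-square statistic for HWE from the three genotype counts of a biallelic SNP. It assumes one degree of freedom, for which the upper-tail probability can be written with the complementary error function. The function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic SNP.

    Returns (chi2, p_value), with the p value taken from a chi-square
    distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("the test is undefined for a monomorphic site")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for a sample of 90 individuals.
chi2, p_value = hwe_chi_square(40, 20, 30)
deviates = p_value < 0.01   # significance level used in the text
```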

Differences in minor-allele frequencies according to SNP locations

We excluded the 27 SNPs that were not in HWE from the following analyses because the subsequent EM-algorithm-based haplotype estimation requires the HWE assumption (Weir 1996). Among the remaining 213 SNPs, 152 are in the introns of genes and 61 in the exonic regions. Of the 61 exonic SNPs, 33 are in the coding regions. Among them, 16 caused synonymous mutations, 16 resulted in nonsynonymous mutations, and the remaining one caused a nonsense mutation.
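The role of the HWE assumption can be seen in a minimal Python sketch of EM haplotype estimation for a single pair of biallelic SNPs. This is a simplified two-locus illustration and not the implementation used in the study. Only double heterozygotes are phase-ambiguous, and the expectation step splits them between the two possible phases in proportion to the products of the current haplotype frequencies, which is where random union of haplotypes is assumed. The function name and the genotype table layout are hypothetical.

```python
def em_haplotype_frequencies(g, max_iter=1000, tol=1e-10):
    """Estimate [p11, p12, p21, p22] for two biallelic SNPs by EM.

    g[a][b] is the number of individuals carrying a copies of allele 1
    at the first SNP and b copies of allele 1 at the second (a, b in 0..2).
    """
    n = sum(sum(row) for row in g)
    if n == 0:
        raise ValueError("no genotypes observed")
    # Haplotype counts from genotypes whose phase is unambiguous.
    base = [
        2 * g[2][2] + g[2][1] + g[1][2],   # haplotype 1-1
        2 * g[2][0] + g[2][1] + g[1][0],   # haplotype 1-2
        2 * g[0][2] + g[1][2] + g[0][1],   # haplotype 2-1
        2 * g[0][0] + g[1][0] + g[0][1],   # haplotype 2-2
    ]
    n_dh = g[1][1]                          # double heterozygotes
    freqs = [0.25, 0.25, 0.25, 0.25]
    for _ in range(max_iter):
        # E step: expected share of double heterozygotes in phase 11/22.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        counts = [
            base[0] + n_dh * share,
            base[1] + n_dh * (1.0 - share),
            base[2] + n_dh * (1.0 - share),
            base[3] + n_dh * share,
        ]
        # M step: renormalize over the 2n chromosomes.
        new_freqs = [c / (2.0 * n) for c in counts]
        if max(abs(x - y) for x, y in zip(new_freqs, freqs)) < tol:
            return new_freqs
        freqs = new_freqs
    return freqs


# Hypothetical genotype table: 10 individuals in each of three classes.
table = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
estimates = em_haplotype_frequencies(table)
```

When a locus departs from HWE, the phase probabilities used in the expectation step no longer hold, which is why such SNPs are excluded before haplotype estimation.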

On the whole, intronic SNPs had higher minor-allele frequencies than exonic SNPs. The average minor-allele frequencies for the intronic SNPs of the four populations ranged between 0.224 and 0.230. In contrast, the average minor-allele frequencies of the exonic SNPs ranged between 0.188 and 0.198. Similarly, SNPs that caused synonymous mutations had greater minor-allele frequencies than those that resulted in nonsynonymous mutations; the average minor-allele frequencies of the former and latter SNPs were 0.228 and 0.151, respectively. These observations agree with the assumption that there might be stronger selection pressure acting on the exonic regions of the genes, especially against nonsynonymous mutations. Generally, we found that minor-allele frequency distributions were similar in the four populations (Table 1), suggesting that the dependence of minor-allele frequency on SNP location within genes is independent of ethnicity.

Table 1 Average minor-allele frequencies of single nucleotide polymorphisms (SNPs) according to SNP locations and types of mutations

When these SNPs were resolved into six groups according to their minor-allele frequencies (Table 2), we found that 54% of them had minor-allele frequencies of greater than 0.2 in both Japanese and Malaysian Chinese. On the other hand, for Malaysian Malays and Indians, 50% of these SNPs had minor-allele frequencies of more than 0.2. Since a previous study (Kruglyak 1997) revealed that markers with minor-allele frequencies of 0.2–0.5 have equivalent information content whereas markers with minor-allele frequencies below 0.2 are less informative, we further eliminated those SNPs with minor-allele frequencies of less than 0.2 in the Japanese population. As a result, 114 SNPs remained, and we observed that 79% and 87% of them had minor-allele frequencies of 0.2–0.5 in Malaysian Malays and Chinese, respectively. In contrast, only 63% of the remaining SNPs in Malaysian Indians had minor-allele frequencies of 0.2–0.5.
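The frequency filter described above can be sketched in Python as follows. The sketch assumes that the minor-allele frequency of a SNP is obtained from its three genotype counts and that a SNP counts as informative when its minor-allele frequency lies between 0.2 and 0.5 inclusive. Both function names and all input values are hypothetical.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor-allele frequency of a biallelic SNP from genotype counts."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes observed")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


def fraction_in_range(mafs, low=0.2, high=0.5):
    """Fraction of SNPs whose minor-allele frequency lies in [low, high]."""
    if not mafs:
        raise ValueError("no SNPs supplied")
    kept = [m for m in mafs if low <= m <= high]
    return len(kept) / len(mafs)


# Hypothetical genotype counts for one SNP in 90 individuals.
maf = minor_allele_frequency(50, 30, 10)
# Hypothetical minor-allele frequencies for four SNPs in one population.
share = fraction_in_range([0.05, 0.25, 0.40, 0.19])
```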

Table 2 Minor-allele frequency distributions of single nucleotide polymorphisms (SNPs) in Hardy–Weinberg equilibrium (HWE)

Population differences in allele frequency distributions

When the minor-allele frequencies of these SNPs were arranged in ascending order and plotted as histograms (Fig. 1), striking patterns were observed. These population-specific histograms show at a glance that the minor-allele frequency distributions in Japanese and Malaysian Chinese were the most similar. With exceptions for some markers, the minor-allele frequency distribution in Malaysian Malays resembled those in Japanese and Malaysian Chinese. However, the minor-allele frequency distribution in Malaysian Indians was distinctly different from those of the other three populations under investigation. A similar pattern was shown by pairwise correlation graphs comparing allele frequencies of each marker in these four populations (Appendix).

Fig. 1

Minor-allele frequencies of 213 single nucleotide polymorphisms (SNPs) that remained after elimination of SNPs showing deviations from the Hardy–Weinberg equilibrium (HWE). SNPs were arranged in ascending order of their minor-allele frequencies in the Japanese population. The minor-allele frequency distributions in a Japanese, d Malaysian Chinese, and b Malaysian Malays showed great similarities. c Malaysian Indians showed minor-allele frequency distribution distinctly different from the other three populations

In general, allele frequencies were highly correlated among Japanese, Malaysian Malays, and Chinese. In contrast, the comparisons of Japanese and of Malaysian Chinese with Malaysian Indians both showed distinct differences. However, the correlation of allele frequencies between the Malay and Indian populations was comparatively high. We also performed the standard Chi-square test for pairwise comparisons of allele and genotype frequencies between populations. The results are shown in the Appendix, with significant differences between populations indicated by an asterisk (*).
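A pairwise comparison of allele frequencies at a single SNP can be sketched in Python as a Chi-square test on a 2 × 2 table of allele counts. The sketch assumes one degree of freedom and no continuity correction; the source does not specify these details. The function name and the counts are hypothetical.

```python
import math


def allele_count_chi_square(a1, a2, b1, b2):
    """Chi-square test on a 2 x 2 table of allele counts.

    a1, a2: counts of allele 1 and allele 2 in the first population.
    b1, b2: counts of allele 1 and allele 2 in the second population.
    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = a1 + a2 + b1 + b2
    row_a, row_b = a1 + a2, b1 + b2
    col_1, col_2 = a1 + b1, a2 + b2
    if 0 in (row_a, row_b, col_1, col_2):
        raise ValueError("the test is undefined when a row or column is empty")
    chi2 = n * (a1 * b2 - a2 * b1) ** 2 / (row_a * row_b * col_1 * col_2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: 90 individuals (180 chromosomes) per population.
chi2, p_value = allele_count_chi_square(150, 30, 60, 120)
```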

Linkage disequilibrium and distances between SNPs

By considering only those SNPs shared by the four populations and excluding SNPs that deviated from HWE, we calculated the LD coefficient, D′, for each SNP pair located in the same gene. Plots of D′ versus the physical distance between the two SNPs of each pair (Fig. 2) show that D′ decayed rapidly with increasing distance. We observed a dramatic drop in D′ when the physical distance between SNP pairs increased to 50–100 kb: the D′ value for markers 50–100 kb apart was about half that for markers 10–50 kb apart. This finding is similar to that reported by Dunning et al. (2000), though to a different extent. However, we also found closely located markers that were not in high LD. Similar results were observed for all four populations, suggesting that the relationship between distance and D′ is a universal characteristic that is independent of ethnicity. Another interesting observation was the sudden increase in LD for markers more than 100 kb apart. This phenomenon was observed in all populations under investigation and may reflect false LD. Therefore, the 11 SNP pairs separated by more than 100 kb were excluded from the subsequent analyses.

Fig. 2

Relationship between LD coefficient (D′) and physical distance separating two single nucleotide polymorphisms (SNPs) in a pair

Differences in linkage disequilibrium distributions

After SNP pairs with inter-SNP distance of more than 100 kb were removed, 101 SNP pairs were left for further analysis. The d values were calculated for different SNP pairs in individuals of each population and then resolved into plots of correlation, as shown by the six graphs in the lower triangle in Fig. 3. Subsequently, 38 SNP pairs were removed by using a minor allele frequency cutoff at 0.1. Similarly, d values for the remaining 63 SNP pairs were resolved into plots of correlation and were represented by six graphs in the upper triangle in Fig. 3. There was an apparent increase in the coefficient of determination values, R2, for each correlation graph when those SNPs with minor allele frequencies of less than 0.1 were eliminated from the plots.
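The comparison with and without the frequency cutoff can be sketched in Python as follows. The sketch assumes that R2 is the squared Pearson correlation between the d values of two populations and that a SNP pair is kept when the lower of its two minor-allele frequencies is at least 0.1. The source does not state in which population the cutoff was applied, so the record layout, the function names, and the values are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy * sxy / (sxx * syy)


def apply_maf_cutoff(records, cutoff=0.1):
    """Keep (d_pop_a, d_pop_b) for SNP pairs whose lowest MAF >= cutoff.

    Each record is (d_pop_a, d_pop_b, min_maf).
    """
    return [(da, db) for da, db, min_maf in records if min_maf >= cutoff]


# Hypothetical d values of four SNP pairs in two populations.
records = [(0.10, 0.20, 0.30), (0.50, 0.10, 0.05),
           (0.90, 1.00, 0.40), (0.40, 0.45, 0.20)]
all_pairs = [(da, db) for da, db, _ in records]
kept_pairs = apply_maf_cutoff(records)
r2_all = r_squared([p[0] for p in all_pairs], [p[1] for p in all_pairs])
r2_kept = r_squared([p[0] for p in kept_pairs], [p[1] for p in kept_pairs])
```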

Fig. 3

Pairwise comparisons of difference in proportions (d) values among four investigated populations after elimination of single nucleotide polymorphisms (SNPs) in Hardy–Weinberg disequilibrium (HWD) and SNPs with an inter-SNP distance of more than 100 kb. Each of six graphs in the lower triangle represents correlation of d between two populations by using 101 SNP pairs without minor-allele frequency cutoff. Each of six graphs in the upper triangle represents correlation of d between two populations by using the remaining 63 SNP pairs after elimination of SNPs with minor-allele frequencies of less than 0.1. The coefficient of determination value for each correlation, R2, is shown at the lower right corner of each plot

Similarly, all correlation graphs that compared Japanese and Malaysian Chinese, Malaysian Malays and Chinese, Malaysian Malays and Indians, and Japanese and Malaysian Malays showed higher similarity regardless of minor-allele frequency cutoff. In contrast, both comparisons that involved Japanese and Malaysian Chinese with Malaysian Indians showed comparatively lower similarity. Again, correlation between Malays and Indians remained high.

Discussion

The fact that not all investigated sites were polymorphic in the three Malaysian populations does not necessarily imply that the Japanese population has more variation or greater antiquity than the other populations (Stephens et al. 2001). Rather, it most probably results from ascertainment bias during SNP discovery. Since the investigated sites were previously discovered by using only 48 individuals of Japanese descent (Iida et al. 2001a,b,c, 2002a,b,c; Saito et al. 2001a,b, 2002a,b,c,d, 2003; Unpublished data of Laboratory for Genotyping, SRC) and then scored in 752 Japanese individuals (Kamatani et al. 2004; Unpublished data of Laboratory for Genotyping, SRC), more often than not, they tend to have higher population frequencies (Clark et al. 2003). Furthermore, since they were genotyped but not resequenced in the three Malaysian populations, there is a tendency to introduce a bias toward lower frequencies in other populations (Weiss and Clark 2002).

Because DMEs are evolutionarily very old enzymes that existed before the divergence of eubacteria from eukaryotes (Nebert 1997), those seemingly population-specific SNPs might have arisen recently, at least after the populations' divergence (Kalow 2001). In addition, since genetic drift usually accumulates with time, this interpopulation variability might also indicate that these SNPs represent ancient polymorphisms for which one allele has become fixed while the other has drifted out of a particular population (Cargill et al. 1999; Halushka et al. 1999).

However, although the populations differ in the number of polymorphic sites, the extent of interpopulation variation is limited. This is reasonable, as DMEs are responsible for critical life functions (Ingelman-Sundberg et al. 1999). Therefore, the genes encoding them might have undergone limited change since their emergence because they might be subject to stronger selective constraints, especially in the exonic regions. This assumption is supported by our observation that for all four populations studied, exonic SNPs, especially those causing nonsynonymous mutations, always have lower minor-allele frequencies compared with intronic SNPs. This observation is also compatible with previous reports (Cargill et al. 1999; Halushka et al. 1999; Goddard et al. 2000).

When we confined the analysis to SNPs with minor-allele frequencies greater than 0.2 in the Japanese population, we observed that about 80% of them had minor-allele frequencies of 0.2–0.5 in Malaysian Malays and Chinese. Since these SNPs have equivalent information content in association studies (Kruglyak 1997), and since it is currently laborious and costly to establish population-specific SNP databases, this observation suggests the feasibility and rationale of using the existing JSNP database as a reference to accelerate future pharmacogenetic studies involving Malaysian Malays and Chinese.

Population-specific histograms and pairwise correlation graphs of allele frequencies of these SNPs for the four subject populations (graphs not shown; the coefficients of determination values, R2, for each correlation graph were shown in Appendix) suggested classifying them into two distinctive genetic clusters of different ancestries. The first group consists of three populations: Japanese, Malaysian Malays, and Chinese, while the second comprises Malaysian Indians. In both cases, greater similarities have been observed between Japanese, Malaysian Chinese, and Malays. Correlations between Japanese and Malaysian Chinese with Indians remained low all the time.

Generally, allele frequencies among populations are correlated overall, suggesting that these SNPs originated from an identical ancestral haplotype. But with time, these populations deviated not only from the ancestral condition but also from each other under the pressure of mutation, random genetic drift, and natural selection. Similarly, the number of common alleles shared by these populations gradually declined. Hence, various degrees of correlation of allele frequency observed for pairwise comparisons are good indicators to infer the degree of drift from the point when two groups separated.

The result agrees well with evidence from recent paleontological and anthropological studies, which suggests that Japanese [with a Northeast Asian origin (Nei 1995)], Malaysian Malays [the oldest indigenous people who inhabit the Malay Peninsula, believed to have originated from the northwestern part of Yunnan in China (Skeat 1902) and to be a distant offshoot of the Mongolian stock (Winstedt 1961)], and Chinese [descendants of immigrants from the southern part of China who settled in large numbers in the Malay Peninsula in the nineteenth century (Ryan 1971)] share a more recent origin (Nei 1985), while Malaysian Indians, who originated from India, belong to another distinct genetic group: Caucasoid. The result is further supported by studies of ancient remains found in the southern part of China, Minatogawa in Okinawa, Japan, and Niah Cave in Borneo, Malaysia, which showed that there were genetic contacts between people living in these areas (Bodmer and Cavalli-Sforza 1976; Cavalli-Sforza et al. 1994; Etler 1996).

Although the observation that Malaysian Malays also shared a high similarity with Malaysian Indians may appear to obscure the above classification, it can be explained by the histories of these populations in the Peninsula of Malaysia. The geographic location of Malaysia, which is halfway on the sea route between China and India, has established her political and cultural contacts with both of these countries over the past 2,000 years (Ryan 1971). Although Indians only settled en masse in the Peninsula of Malaysia in the nineteenth century (Nagata 1979), Malaysia received her early civilization and two great religions, Hinduism and Islam, from India. Apart from the evidence of early Indian settlements in some of the states in Malaysia, the unremitting traffic of traders between the Archipelago and India resulted in frequent intermarriages between the Indians and Malays of earlier times along the whole of the west coast of the peninsula (Ryan 1971).

In contrast, early contacts between the Malay Peninsula and China were largely confined to trade. Cultural differences limited intermarriage between Chinese and Malays. In addition, emigration from China was discouraged by the Manchu government (1641–1911). A noticeable increase in emigration came only in the mid-nineteenth century (Nagata 1979), and the majority of the emigrants came from the southern provinces of Kwangtung and Fukien, which were nearest to Southeast Asia and most opposed to Manchu rule (Ryan 1971).

In fact, our preliminary analyses of the genotyping data by using the program STRUCTURE (Pritchard et al. 2000) revealed that Japanese, Malaysian Malays, and Chinese had similar genetic structures, with Malays genetically heterogeneous and showing substantial similarities with Malaysian Indians. On the other hand, Malaysian Indians were clustered into another distinct genetic group, similar to that reported by Rosenberg et al. (2002). Consistent with their population histories, the admixture event between Malays and Indians, as mentioned above, is believed to have occurred quite some time ago between the Malays and Indians of earlier times, with contemporary Malays representing the descendants of this admixture. However, present-day Malaysian Indians are descendants of Indian emigrants during the nineteenth century. This is supported by the fact that all donors in the present study were interviewed before blood donation to ensure that they were not descendants of mixed marriages for at least the nearest two generations.

Another plausible explanation for the substantial similarities shared between the Malaysian Malay and Indian populations may be related to ecogenetic factors. Although genetic variations observed between populations were created by spontaneous mutations and passed on to the next generation under the chance event of random genetic drift, both of these events per se can hardly explain the existence of substantial allele frequency differences between populations (Nebert 1997). Although the higher generation turnover of the Indian population owing to early age at marriage may have some effect, given the relatively low rates of occurrence of both events mentioned above and the long generation time of human populations, other forces such as migration and natural selection play a more important role in shaping the distribution patterns of these genetic variations in different populations with regard to the environment.

For instance, the differences observed across various populations may reflect the different impacts that ecogenetic factors have on people from different geographic regions. In this case, the highest selection pressure is believed to be of dietary origin (Ingelman-Sundberg et al. 1999). It is possible for striking ethnic differences in DME polymorphisms to have arisen from selective pressures caused by tribal differences in diet or exposure to other environmental agents/signals in a span of 200–2,000 years that would represent 10–50 generations for the human population (Nebert 1997). This might also suggest the possibility of the existence of balanced polymorphisms, which are unappreciated at the moment.

Here we presented two sets of correlation graphs to compare the d values of the four populations by using SNP pairs with and without minor-allele frequency cutoff. Since SNPs of lower minor-allele frequencies are in general those of younger age (Watterson and Guess 1977), applying minor-allele frequency cutoff is likened to eliminating recent SNPs of younger age. Though this may also risk losing some ancient SNPs with one of their alleles fixed in the population, Halushka et al. (1999) found that only 5% of the rare SNPs that they investigated represent ancient allele that is being lost through evolution.

The increase in the coefficient of determination values, R2, for all correlation graphs after removal of SNPs with minor-allele frequencies of less than 0.1 further supports the single-origin hypothesis of human populations (Lewin 1987). It also suggests that, although those young variants are shared by all populations and arose prior to population divergence, their fate in evolution varies as different environmental factors (cultural and habitual) act on them in different populations. Their elimination in this study therefore provides higher stringency to the study of population ancestry.

On the whole, inference from correlations of intermarker LD among these populations further supports the results that we obtained from the allele frequency distributions both on population ancestries and the single-origin hypothesis of the investigated populations. The results also showed that the genetic distances suggested by the coefficient of determination values for correlations of allele frequency as well as LD index of different populations seemed to agree with the historical facts of their population origins.

In conclusion, we have shown extensive similarities between populations of similar ancestry in the number of polymorphic loci shared, allele frequency distributions, and LD index, despite some minor variations that might be due to extensive interactions among genetic (population ancestry), environmental (diet), and habitual (such as smoking and alcohol consumption) factors, which will require further investigation. We have also shown that most of the shared SNPs are of equivalent information content. Because it is laborious and costly to establish population-specific SNP databases with present-day technologies, our observations suggest the feasibility and rationale of using an existing SNP database as a reference to hasten future pharmacogenetic studies involving populations of similar ancestry.

For instance, a genetic study of a clinically significant variant in drug response that uses Japanese samples will allow better and more relevant predictions for Malays and Chinese than for Indians. Therefore, a better understanding of pharmacoanthropology not only promises to reduce the time and costs associated with producing effective, marketable, and competitive new drugs (Kleyn and Vesell 1998) but also allows the a priori development of drugs that avoid metabolic pathways with adverse genetic variability, as well as the modification of drug selection and dosing in patients with clinically significant genetic variation.