Introduction

In an ideal situation, researchers involved in mapping variants predisposing to disease would study most of the genetic variation present in the population by resequencing a representative sample of cases and controls from the same population.1, 2 Since polymorphisms that predispose to human disease are expected to be of small effect,3 large-scale association studies are required to detect such effects. However, resequencing is at present impracticable on such a large scale, and so researchers working on large-scale association studies rely on the correlation among linked loci (linkage disequilibrium (LD)) to find those polymorphisms that predispose to human disease. Since LD is a population-dependent parameter, there has been controversy over whether LD patterns observed in one population would be observed in a different population, and doubts have therefore been expressed about the utility of large-scale genotyping efforts such as the HapMap project.4, 5 The utility of such an effort for mapping studies will ultimately depend on the power that the subset of loci typed in the study population has to detect the disease locus (DL) or loci, and not necessarily on whether LD patterns or block boundaries are conserved among populations. Here, we present a simulation study in which tagging SNPs (tSNPs) for a 10 Mb region of human chromosome 20 were selected independently for each population and tested for association both within the same population and in an independent population. Power (defined as the proportion of all tSNPs that showed a significant association with disease, at the 5% level) did not depend on which population was used for tagging.

Materials and methods

Data

We used available genotype data for a 10 Mb region on chromosome 20 (http://www.well.ox.ac.uk/~xiayi/data/chr20/10Mb/index.html). A total of 4427, 5324, 4160 and 4160 SNPs were available for n=96, 47, 20 and 22 unrelated individuals from four different populations (UK Caucasian, CEPH, Han Chinese and Japanese, respectively). For each population pair, we first selected those loci that were segregating in the two populations and did not have any missing values. The number of loci analysed for each population comparison varied from 1012 (for UK Caucasian and CEPH) to 2100 (for Han Chinese and Japanese).
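The locus filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genotype coding (0, 1 or 2 copies of one allele, `None` for a missing value), the dictionary layout and the function names `is_segregating` and `shared_loci` are hypothetical and are not taken from the original analysis.

```python
def is_segregating(genotypes):
    """True if both alleles are present; genotypes are allele counts (0, 1 or 2)."""
    total = sum(genotypes)
    return 0 < total < 2 * len(genotypes)


def shared_loci(pop_a, pop_b):
    """Return the SNP ids that are complete and segregating in both populations.

    pop_a, pop_b: dict mapping SNP id -> list of genotypes (None marks a missing value).
    """
    keep = []
    for snp, ga in pop_a.items():
        gb = pop_b.get(snp)
        if gb is None:
            continue  # SNP not typed in the second population
        if None in ga or None in gb:
            continue  # drop loci with any missing value
        if is_segregating(ga) and is_segregating(gb):
            keep.append(snp)
    return keep
```

Applied to each population pair in turn, a filter of this kind yields a different number of usable loci per comparison, as reported above.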

Tagging methods

SpD

tSNPs were selected using the SpD method described by Meng et al.6 This method is based on calculating the pair-wise LD matrix among m candidate markers, where the components of the m × m matrix are the composite LD measure7 divided by its SD. Standard principal components analysis is performed on this matrix, and those SNPs that contribute most to the eigenvectors of the largest eigenvalues are selected as tags.

For large genomic regions such as the one considered here, the method is applied on sliding windows of much smaller size, and SNPs are selected as tSNPs if they were selected in a given proportion of the windows in which they were included.

tSNPs were selected using 20 individuals with a sliding window size of five SNPs. The selected SNPs explained 85% of the variation within each sliding window and were finally selected if they were preselected in 80% of the sliding windows in which they were present. These same parameters were used for all the results concerning the SpD tagging method. The results shown below are based on SpD unless otherwise stated.
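The window-voting step can be illustrated as follows. This is a minimal sketch, not the SpD implementation: the per-window principal components selection is left to a hypothetical callable `select_in_window`, the windows are assumed to advance one SNP at a time (the step size is not stated above), and the function name `vote_across_windows` is our own.

```python
def vote_across_windows(n_snps, window_size, select_in_window, min_fraction=0.8):
    """Keep SNPs preselected in at least min_fraction of the windows that contain them.

    select_in_window: hypothetical callable taking a list of SNP indices (one window)
    and returning the indices preselected within that window.
    """
    covered = [0] * n_snps  # number of windows that include each SNP
    chosen = [0] * n_snps   # number of windows in which each SNP was preselected
    for start in range(n_snps - window_size + 1):  # assumed step of one SNP
        window = list(range(start, start + window_size))
        picked = set(select_in_window(window))
        for i in window:
            covered[i] += 1
            if i in picked:
                chosen[i] += 1
    return [i for i in range(n_snps)
            if covered[i] and chosen[i] / covered[i] >= min_fraction]
```

With `window_size=5` and `min_fraction=0.8`, this mirrors the window size and the 80% threshold quoted above; the 85% variance criterion would sit inside `select_in_window`.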

ldSelect

In order to investigate the effect of the tagging method on the portability of tSNPs across populations, ldSelect8 was also used to select tSNPs. As with SpD, 20 individuals were selected with replacement among the n available individuals to be used as the selection panel. The program was then run with the following parameters: r2 value9 of 0.8 and no selection of SNPs by allele frequency.
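For readers unfamiliar with r2-threshold tagging, the following is a simplified sketch of greedy binning in the spirit of ldSelect; it is not the ldSelect program. As assumptions of this sketch, r2 is approximated by the squared correlation between unphased genotype counts (0, 1 or 2), one tag is reported per bin, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (allele counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def greedy_tags(genotypes, threshold=0.8):
    """Repeatedly pick the untagged SNP that covers the most untagged SNPs.

    genotypes: dict mapping SNP id -> list of allele counts for the selection panel.
    """
    names = list(genotypes)
    partners = {
        a: {b for b in names
            if b == a or r_squared(genotypes[a], genotypes[b]) >= threshold}
        for a in names
    }
    untagged, tags = set(names), []
    while untagged:
        best = max(names,
                   key=lambda s: len(partners[s] & untagged) if s in untagged else -1)
        tags.append(best)
        untagged -= partners[best]  # everything in the bin is now tagged
    return tags
```

The default `threshold=0.8` corresponds to the r2 value used above; no allele-frequency filter is applied, in line with the stated parameters.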

Simulations

After selecting tSNPs, case–control samples were generated by first selecting one of the available SNPs at random and assuming it to be the DL. The DL could be a tSNP or not. Then, one of the two alleles at the DL was selected at random to be the one that increased disease susceptibility. Finally, a sample of N cases and controls was generated as follows:

Simulation of cases

Assume Q is the allele that increases disease susceptibility at the DL. Genotype frequencies at the DL were estimated from the n available individuals.

The prevalence of the disease in the population (K) is equal to

$$K = \sum_{G} P(D \mid G)\,P(G), \qquad G \in \{QQ, Qq, qq\},$$

where P(D∣G) is the penetrance and P(G) is the frequency of the G genotype.

The genotype relative risks for QQ, Qq and qq genotypes are respectively GRRQQ=P(D∣QQ)/P(D∣qq), GRRQq=P(D∣Qq)/P(D∣qq) and GRRqq=1.

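Then, solving the prevalence equation for the penetrances,

$$P(D \mid qq) = \frac{K}{GRR_{QQ}\,P(QQ) + GRR_{Qq}\,P(Qq) + P(qq)}, \qquad P(D \mid Qq) = GRR_{Qq}\,P(D \mid qq), \qquad P(D \mid QQ) = GRR_{QQ}\,P(D \mid qq).$$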

Case subjects were simulated by selecting individuals at random, checking the genotype (G) at the preassigned DL and drawing a random number between 0 and 1 from a uniform distribution. If this number was less than or equal to P(D∣G), the individual was considered a case; otherwise, the individual was rejected and a new individual was drawn at random. The process was repeated until N cases were obtained.

Simulation of controls

It was assumed that the genotype frequency of the control subjects was the same as the general population. Therefore, controls were simulated by selecting with replacement N individuals from the n available samples.
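The case and control sampling schemes described above can be sketched as follows. This is an illustration only: the penetrances are obtained by solving the prevalence definition for P(D∣qq) given the genotype relative risks, and the function names, the genotype labels used as dictionary keys and the numerical values in the usage example are assumptions of this sketch, not values from the study.

```python
import random


def penetrances(K, freqs, grr):
    """Solve K = sum_G P(D|G) P(G) for P(D|G), with GRRs expressed relative to qq."""
    base = K / sum(grr[g] * freqs[g] for g in freqs)  # this is P(D|qq)
    return {g: grr[g] * base for g in freqs}


def simulate_cases(dl_genotypes, pen, n_cases, rng):
    """Rejection sampling: draw an individual, accept it with probability P(D|G)."""
    cases = []
    while len(cases) < n_cases:
        i = rng.randrange(len(dl_genotypes))
        if rng.random() <= pen[dl_genotypes[i]]:
            cases.append(i)  # index of the accepted individual
    return cases


def simulate_controls(n_individuals, n_controls, rng):
    """Controls: individuals sampled with replacement from the available panel."""
    return [rng.randrange(n_individuals) for _ in range(n_controls)]


# Illustrative usage with made-up values (not those of the study).
freqs = {"QQ": 0.25, "Qq": 0.5, "qq": 0.25}
grr = {"QQ": 4.0, "Qq": 2.0, "qq": 1.0}
pen = penetrances(0.009, freqs, grr)
rng = random.Random(1)
case_ids = simulate_cases(["QQ", "Qq", "Qq", "qq"], pen, 20, rng)
control_ids = simulate_controls(4, 20, rng)
```

Because acceptance probabilities are of the order of the prevalence, most draws are rejected; the returned indices can then be used to look up the full multilocus genotype of each simulated subject.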

Replicates

In order to account for:

  1) Variation among samples used for tagging: 10 different samples were obtained by sampling with replacement 20 individuals among the n individuals from each population.

  2) Variation due to different case–control samples: four replicates were sampled with replacement within each tSNP sample.

  3) Differences in LD patterns among different regions within the 10 Mb region: 100 different loci were simulated within each case–control sample.

In summary, we used resampling methods within one population in order to investigate the behaviour of tags within that population. We compared these results to those obtained by testing those tags in different populations. If similar results were obtained from tags developed in one population (tagging population) and applied to the second population (ie case–control population), then these simulations would suggest that tSNPs would have validity between populations.
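Power, as defined in the Introduction, is the proportion of tSNPs showing a significant association at the 5% level. The text does not name the association test, so the sketch below assumes an allelic 2 × 2 chi-square test with one degree of freedom purely for illustration; the function names and the data layout are likewise hypothetical.

```python
CHI2_CRIT_1DF_5PCT = 3.841  # upper 5% point of the chi-square distribution, 1 df


def allelic_chi2(case_genos, control_genos):
    """Chi-square statistic on the 2 x 2 table of allele counts.

    Genotypes are coded as the number of copies (0, 1 or 2) of one allele.
    """
    a = sum(case_genos)
    b = 2 * len(case_genos) - a
    c = sum(control_genos)
    d = 2 * len(control_genos) - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def proportion_significant(cases, controls, tag_ids):
    """Proportion of tSNPs whose test statistic exceeds the 5% critical value.

    cases, controls: dict mapping SNP id -> list of genotypes of the sampled subjects.
    """
    hits = sum(allelic_chi2(cases[t], controls[t]) > CHI2_CRIT_1DF_5PCT
               for t in tag_ids)
    return hits / len(tag_ids)
```

Averaging this proportion over tagging samples, case–control replicates and simulated DLs gives the kind of summary reported in the tables, under the assumed test.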

Results

Table 1 shows the average power obtained for a case–control study with 5000 cases and 5000 controls when the assumed disease model is multiplicative (ie genotype relative risks were 1, and 2.5) and the prevalence was equal to 0.005.

Table 1 Mean across replicates of the proportion of SNPs selected as tags (SpD tagging method), and the mean and the SD of the proportion of tSNPs (over the total of tSNPs) that were significant at the 5% level

The percentage of tSNPs that showed statistical significance in a case–control study did not substantially vary with the population on which the tagging was performed.

The higher percentage of significant tSNPs observed for the Han Chinese and Japanese compared to UK Caucasian, and to a lesser extent to CEPH, reflects the fact that there is a much smaller sample size (n) for those two populations, and cases and controls tend to be more alike, which increases the chances of detecting an association. In order to investigate this further, n=20 individuals for each of the populations were selected at random without replacement and the same analysis was performed. Table 2 shows that if n is kept equal in all populations then this effect disappears.

Table 2 Mean across replicates of the proportion of SNPs selected as tags (SpD tagging method), and the mean and the SD of the proportion of tSNPs (over the total of tSNPs) that were significant at the 5% level

We used a different tagging method (ldSelect) to assess the effect the tagging method has on the portability of tSNPs across populations. ldSelect-selected tags also showed that tSNPs selected in one population are valid for another (Table 3). The proportion of significant tSNPs differs between SpD and ldSelect because the proportion of variation explained by the tSNPs is not exactly comparable for the two methods. Table 3 also shows that the differences are small whether or not the DL is selected as a tSNP.

Table 3 The mean and the SD of the proportion of tSNPs (over the total of tSNPs) that were significant at the 5% level when using ldSelect

A higher proportion of tSNPs shows a significant association when the DL is not a tSNP than when it is. This is probably because more than one tSNP is selected to tag the DL when the DL is not a tag, whereas only one (the DL) is required when the DL is a tSNP.

Table 4 shows results for three different scenarios: (A) the DL is a tSNP; (B) the DL is not a tSNP but is in the original set used for tSNP selection; and (C) the DL is not in the original set used for tSNP selection. Differences in power are very small under the three scenarios. As shown above for ldSelect, scenario B has a slight increase in the proportion of significant tSNPs compared to scenario A. The same holds for scenario C compared to scenario A.

Table 4 The mean and the SD of the proportion of tSNPs (over the total of tSNPs) that were significant at the 5% level when using SpD

We repeated the analysis, this time splitting the SNPs into two categories: (A) SNPs within genes and (B) SNPs outside genes. Table 5 shows a summary of these results. Overall, SNPs within genes exhibited a higher proportion of significant tSNPs, although the difference was very small. The population from which tSNPs were selected had little or no effect on the proportion of significant tSNPs, for SNPs both within and outside genes.

Table 5 The mean and the SD of the proportion of tSNPs (over the total of tSNPs) that were significant at the 5% level when using SpD

Figure 1 shows the second-degree polynomial fitted to the percentage of significant tSNPs as a function of the DL frequency when the tagging and the case–control study were performed on the same population or on different populations. The lines mostly overlap over the whole DL frequency spectrum studied. Taken together, these results suggest that SNPs tagged by the HapMap project on the CEPH, Han Chinese and Japanese samples would yield similar power in populations of European and Asian descent.

Figure 1 Proportion of significant tSNPs over all populations as a function of the disease locus frequency when the tagging and the case–control study were done on the same population or on different populations.

Discussion

We have shown that tSNPs selected in Han Chinese, Japanese, CEPH and UK Caucasian populations perform similarly (in terms of power) in all these populations.

Ahmadi et al10 studied the performance of tSNPs selected in CEPH samples on a Japanese population by evaluating how well tags represent the variation present in another sample. Their study differs from ours in that they focused on 55 genes involved in drug metabolism and transport, whereas we focused on a large chromosomal region. Mueller et al11 studied LD patterns across different European populations and concluded that CEPH-derived tags were of restricted applicability to European populations. However, the comparisons were based on only four genes spanning about 749 kb. Evans and Cardon12 studied the same samples presented here but used only adjacent markers for their analysis. In addition, all of the previous studies based their evaluation of performance on the LD measure r2,9 and Evans and Cardon12 suggested that the differences they observed might be because of the dependence of r2 on, for example, allele frequencies.12

In our simulation study only SNPs that were segregating in the two populations were used. This will surely overestimate the performance of across-population tags, since a DL may be segregating in one population but not in another, or a SNP may be a tag in one population and not segregate in another. Nonetheless, we have shown (Figure 1) that our conclusions are consistent even for relatively extreme DL allele frequencies. In any case, if a DL was not segregating in a population, it could not be mapped in this population even if the tags had been obtained from it. If a tSNP selected in one population was not segregating in a case–control population, then there would be a reduction in power since a smaller proportion of the genetic variation would be represented. Researchers using the HapMap data might want to check that the SNPs they have selected as tags are segregating in different populations.

The allele frequency distribution of the unselected SNPs and tSNPs was uniform (results not shown); however, in the absence of ascertainment bias one would expect an L-shaped distribution. Hence, ascertainment bias may have an important effect on our general conclusions, since a uniform allele frequency distribution will tend to overestimate the amount of LD and therefore decrease the number of tSNPs required. Also, low-frequency variants could be missed if they were segregating at low frequency in one population but not segregating at all in another. Hence, the conclusions drawn here will be more appropriate for common genetic variation as studied by the HapMap project, which will help genetic association studies on populations other than those studied by it.