Introduction

Genetic associations of single-nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) and subsequent replication efforts or meta-analyses as being robustly associated with most complex diseases are of relatively small to modest magnitude (odds ratios (ORs) <1.50).1, 2, 3 Genetic association studies typically require a very large sample size to attain the desired power to detect associations of such magnitude, as a stringent significance level (usually α=5 × 10−8 for genome-wide studies) is generally applied to minimize the detection of false associations. To attain the required sample size, large-scale multi-team collaborative studies with participants recruited from distinct populations defined by country of origin, regional ancestry, ethnicity, or study center, or meta-analyses of individual studies, are necessary.4 Meta-analyses of genome-wide and/or replication studies have been successful in identifying novel genetic variants for complex diseases not previously identified by single studies.5, 6, 7, 8

One important challenge that remains, however, is that multi-team collaborative studies or meta-analyses of distinct populations, hereafter called cohorts, are more likely to demonstrate inconsistent estimates of SNP associations across cohorts because of genuine diversity in genetic associations, or because of differential errors or biases across cohorts.9, 10, 11, 12 Between-cohort heterogeneity may arise because associations truly exist in one, some, or all cohorts with different magnitudes (eg, due to local gene-environment interactions, which might be further exaggerated by sampling variation); it may be a false signal due to methodological errors and biases (eg, different linkage disequilibrium (LD) patterns of tagged markers with causal variants across cohorts, the phenotype of interest being correlated with another phenotype with which the SNP is associated, population stratification, different study designs with differential ascertainment of phenotype across cohorts, or differential genotyping errors); or it may occur merely by chance.10, 11 Therefore, heterogeneity can be a signal rather than noise in genetic association studies. Even if the associations are modestly or highly heterogeneous across cohorts, associations in some or all cohorts may be genuine and of interest.

Traditionally, in meta-analyses of clinical trials and epidemiological studies, the fixed-effect (FE) approach13 has been used when cohort-specific associations are more or less similar, and the random-effects (RE) approach14 has been used when heterogeneity is suspected, to test whether there exists an average effect of a treatment or an exposure. In genetic association studies, however, as heterogeneity may arise for any of these reasons, detecting an ‘association’ where one truly exists, even in a single cohort, is of prime interest rather than detecting a non-null ‘average effect’ over cohorts.15, 16 Unfortunately, as between-cohort heterogeneity increases, an even larger sample size is needed to detect associations using traditional meta-analytic approaches.17 When heterogeneity is suspected, the traditionally preferred approach, the RE method, is less powerful in detecting a genuine association because it produces more conservative P-values than the FE approach18, 19 and would be too conservative when a stringent significance level is used. Hence, even large multi-cohort studies or meta-analyses employing traditional approaches might fail to demonstrate associations for genetic variants that do not have consistent associations across cohorts. So rather than increasing the sample size by including additional data from more cohorts, or waiting until sufficient data are generated, it is more desirable to find statistical methods with increased potential to detect associations in the presence of heterogeneity.

To overcome this limitation of traditional methods, Lebrec et al15 recently proposed global methods as new screening tools for associations under heterogeneous conditions in multi-cohort genetic association studies. The new FE global method tests whether an association exists in at least one cohort, and the new RE global method tests whether the overall association or the between-cohort variance of associations is non-zero. They argued that because detecting a genuine association is what matters, the choice of method is a matter of efficiency rather than principle. In their simulation study, these global methods had higher power than both traditional methods at the nominal significance level when there was moderate to substantial between-cohort heterogeneity. More recently, Han and Eskin16 compared the power of the new global RE method with traditional methods and found similar results in the presence of heterogeneity, suggesting that the new RE method can be used to discover genes with robust associations in meta-analysis. However, Lebrec et al15 reported results for a very simple scenario, and did not assess the comparative performance of these methods under more realistic scenarios or using real genomic data. Han and Eskin16 did not assess the performance of the new FE method, which was shown to have higher power than the new RE method in the presence of high heterogeneity in Lebrec et al’s15 simulation. Earlier, Pereira et al20 investigated, in a simulation study, the impact of heterogeneity and genetic model mis-specification on power and other issues for the traditional FE and RE methods. But to date, no comparative studies of both of the newly proposed meta-analytic approaches have been carried out over a wide range of realistic scenarios.
It is therefore not clear under which circumstances these new methods perform better, or are of greater practical utility, than traditional methods in screening for or discovering genetic associations.

In this study, our objective was to assess the performance of the new and traditional meta-analytic methods with respect to type I error and statistical power through extensive simulations over a wide range of realistic applications. For instance, our simulations included scenarios in which: (1) the genetic variant is less common, (2) only a few cohorts are available, (3) important prognostic factors are not adjusted for, and (4) a wrong genetic model is assumed. To determine the practical utility of these global methods in real data applications, we applied them to West Nile virus infection complications data from a multi-center association study.

Materials and methods

Hypotheses and tests

Here, we briefly describe the methods given by Lebrec and colleagues (see Lebrec et al15 for detailed descriptions). In a multi-cohort study or meta-analysis with k distinct cohorts for a binary phenotype, Y (y=0 for control, y=1 for case in a case–control study), suppose the information on the number of copies of the minor allele in a genotype, X (x=0, 1, 2), at an autosomal biallelic SNP locus and a set of covariates, Z, are available. Let the SNP effect in the ith cohort be βi=ln(ORi) and its SE be si (i=1,2,…,k). Then the multiplicative (log-additive) genetic model of phenotype risk in the ith cohort is

logit P(y=1 | x, z) = αi + βix + γi′z,

where the βi’s are the parameters of interest whereas the αi’s and γi’s are nuisance parameters. Similarly, for a quantitative phenotype, Y, the additive genetic model in the ith cohort is

y = αi + βix + γi′z + ei,

where ei∼N(0, σ2) for all x and i. The different hypotheses and corresponding test statistics are given below.

FE and RE methods in traditional meta-analysis

FE level method (the traditional FE method)

Under the FE assumption, effects are assumed to be similar across cohorts and the hypothesis is tested for the common effect β across all cohorts: H0: β=0 vs H1: β≠0. The overall effect β is typically estimated as a weighted average of cohort-specific effects using inverse-variance weights, β̂=ΣiWiβ̂i/ΣiWi, with variance Var(β̂)=1/ΣiWi, where the weight Wi=1/si2 for cohort i. The corresponding test statistic under H0 is Z=β̂/√Var(β̂)∼N(0, 1) (asymptotically).
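The inverse-variance FE pooling described above can be sketched in a few lines. The paper's analyses were run in R; this is an independent illustrative sketch in Python using NumPy and SciPy, with made-up cohort estimates:

```python
import numpy as np
from scipy.stats import norm

def fixed_effect_meta(beta, se):
    """Inverse-variance fixed-effect pooling of per-cohort log-ORs.
    Returns the pooled estimate, its SE, the Z statistic, and the
    two-sided P-value of H0: beta = 0."""
    beta = np.asarray(beta, float)
    w = 1.0 / np.asarray(se, float) ** 2         # W_i = 1 / s_i^2
    beta_hat = np.sum(w * beta) / np.sum(w)      # pooled estimate
    se_hat = np.sqrt(1.0 / np.sum(w))            # SE of pooled estimate
    z = beta_hat / se_hat                        # Wald/Z statistic
    p = 2 * norm.sf(abs(z))                      # two-sided P-value
    return beta_hat, se_hat, z, p

# three hypothetical cohorts with similar log-OR estimates
beta_hat, se_hat, z, p = fixed_effect_meta([0.20, 0.25, 0.15], [0.10, 0.12, 0.09])
print(beta_hat, se_hat, p)
```

Because the three estimates agree in direction and magnitude, the pooled Z statistic is larger than any single cohort's, which is exactly the setting where this method is most powerful.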

RE level method (the traditional RE method)

Under the RE assumption, the cohort-specific effects, βi’s, are considered a random sample from a grand normal population with overall mean μ (for example, μ=ln(OR), with OR being the overall OR across cohorts for a binary trait, or the average mean difference across cohorts for a quantitative trait, per one-copy increase in the number of minor alleles in a genotype under the multiplicative genetic model) and between-cohort variance τ2; that is, βi∼N(μ, τ2). Here, τ2 represents the extent of heterogeneity in effects across cohorts. The overall effect is estimated as μ̂=ΣiW*iβ̂i/ΣiW*i with variance Var(μ̂)=1/ΣiW*i, where the weight W*i=1/(si2+τ̂2) for cohort i and τ̂2 is the estimate of τ2. The hypothesis for the overall effect across all cohorts is then tested as H0: μ=0 vs H1: μ≠0. The Wald test, Z=μ̂/√Var(μ̂)∼N(0, 1) under H0, is the standard test of the hypothesis.
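A sketch of the RE level method using the DerSimonian-Laird moment estimator of τ2 (one of several possible estimators; the input values are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def random_effects_meta(beta, se):
    """DerSimonian-Laird random-effects pooling of per-cohort log-ORs.
    Returns the pooled mean, its SE, the moment estimate of tau^2,
    and the two-sided Wald P-value of H0: mu = 0."""
    beta, se = np.asarray(beta, float), np.asarray(se, float)
    w = 1.0 / se**2                                   # FE weights W_i = 1/s_i^2
    beta_fe = np.sum(w * beta) / np.sum(w)            # FE pooled estimate
    q = np.sum(w * (beta - beta_fe) ** 2)             # Cochran's Q
    k = len(beta)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # moment estimate of tau^2
    w_star = 1.0 / (se**2 + tau2)                     # RE weights W*_i
    mu_hat = np.sum(w_star * beta) / np.sum(w_star)   # pooled mean
    se_mu = np.sqrt(1.0 / np.sum(w_star))             # SE inflated by tau^2
    p = 2 * norm.sf(abs(mu_hat / se_mu))              # Wald test of H0: mu = 0
    return mu_hat, se_mu, tau2, p

# three heterogeneous cohorts: the RE SE is wider than the FE SE would be,
# giving the more conservative P-value discussed in the Introduction
mu_hat, se_mu, tau2, p = random_effects_meta([0.5, -0.1, 0.4], [0.15, 0.15, 0.15])
```

With these inputs τ̂2 is non-zero and the Wald P-value is non-significant, even though two of the three cohorts individually show strong effects.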

New FE and RE global methods for multi-cohort association studies

FE global method (new FE method)

Under the FE assumption, Lebrec et al15 proposed to test whether an association is present in any cohort: H0: βi=0 for all i vs H1: βi≠0 in at least one cohort i. The score test, T=Σi(β̂i/si)2∼χ2k (asymptotically) under H0, can be used to test the hypothesis.
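The score test is a sum of squared per-cohort Z statistics, so effects in opposite directions add up instead of cancelling. The hypothetical example below contrasts it with traditional FE pooling on the same flip-flopped estimates:

```python
import numpy as np
from scipy.stats import chi2, norm

def fe_global_test(beta, se):
    """New FE global score test: T = sum_i (beta_i / s_i)^2, which is
    chi^2 with k degrees of freedom under H0 of no association in any cohort."""
    t = float(np.sum((np.asarray(beta, float) / np.asarray(se, float)) ** 2))
    return t, chi2.sf(t, df=len(beta))

# strong but opposite-direction effects: pooling cancels them,
# the global test does not
beta, se = [0.5, -0.4, 0.45], [0.15, 0.15, 0.15]
t, p_global = fe_global_test(beta, se)

# traditional FE pooling of the same data, for comparison
w = 1.0 / np.array(se) ** 2
beta_fe = np.sum(w * beta) / np.sum(w)
p_fe = 2 * norm.sf(abs(beta_fe) * np.sqrt(np.sum(w)))
```

Here the global test yields a far smaller P-value than the pooled Wald test, illustrating why the global methods gain power under high heterogeneity.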

RE global method (new RE method)

The new RE model tests whether a non-null average association exists or the between-cohort variance is non-zero (ie, a significant between-cohort heterogeneity is present); that is, H0: μ=0 and τ2=0 vs H1: μ≠0 or τ2>0. The likelihood ratio test, LRT=2[l(μ̂, τ̂2)−l(0, 0)], asymptotically distributed under H0 as a mixture of χ2 distributions (since τ2=0 lies on the boundary of the parameter space), can be used to test this hypothesis. Here, for large cohorts, β̂i|βi∼N(βi, si2), but as βi∼N(μ, τ2), the approximate marginal distribution of the estimate in the ith cohort is β̂i∼N(μ, si2+τ2), with the corresponding log-likelihood li(μ, τ2)=−½ln(2π(si2+τ2))−(β̂i−μ)2/(2(si2+τ2)) and the total log-likelihood l(μ, τ2)=Σili(μ, τ2).
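A sketch of the RE global likelihood ratio test: μ has a closed-form maximizer for fixed τ2, so τ2 is maximized over a simple grid. The 50:50 χ21/χ22 mixture used for the null distribution is a common boundary-problem approximation, an assumption of this sketch rather than a prescription from the paper:

```python
import numpy as np
from scipy.stats import chi2

def re_global_lrt(beta, se, tau2_max=1.0, n_grid=2001):
    """New RE global test: LRT = 2[l(mu_hat, tau2_hat) - l(0, 0)], based on
    the marginal model beta_i ~ N(mu, s_i^2 + tau^2). mu is profiled out in
    closed form; tau^2 >= 0 is maximized over a grid."""
    beta, se = np.asarray(beta, float), np.asarray(se, float)

    def loglik(mu, tau2):
        v = se**2 + tau2
        return np.sum(-0.5 * np.log(2 * np.pi * v) - (beta - mu) ** 2 / (2 * v))

    best = -np.inf
    for tau2 in np.linspace(0.0, tau2_max, n_grid):
        w = 1.0 / (se**2 + tau2)
        mu = np.sum(w * beta) / np.sum(w)      # MLE of mu given tau^2
        best = max(best, loglik(mu, tau2))

    lrt = 2.0 * (best - loglik(0.0, 0.0))
    # H0 puts tau^2 = 0 on the boundary; a 50:50 chi^2_1 / chi^2_2 mixture
    # is a common approximation to the null distribution (assumed here)
    p = 0.5 * chi2.sf(lrt, 1) + 0.5 * chi2.sf(lrt, 2)
    return lrt, p

# same flip-flopped hypothetical cohorts as above
lrt, p = re_global_lrt([0.5, -0.4, 0.45], [0.15, 0.15, 0.15])
```

The heterogeneity itself (a large τ̂2) drives the statistic here, so the test detects the signal that the traditional RE Wald test dilutes.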

Tests of heterogeneity

Cochran’s Q statistic for the test of heterogeneity13 is obtained as Q=ΣiWi(β̂i−β̂)2, where β̂ is the FE estimate and Wi=1/si2. An estimate of the between-cohort variance can be obtained by the method of moments as τ̂2=(Q−(k−1))/(ΣiWi−ΣiWi2/ΣiWi), which is set to 0 when Q<k−1. τ2 can also be estimated by maximizing the profile likelihood, or by the method of maximum likelihood or restricted maximum likelihood. I2, an estimate of the proportion of between-cohort heterogeneity due to factors other than chance, is obtained as18, 19 I2=100% × (Q−(k−1))/Q, which is set to 0 when Q<k−1. In meta-analyses of clinical trials and epidemiological studies, heterogeneity is suspected if the P-value is <0.10 in Cochran’s Q-test. Also, 25%≤I2<50% and I2≥50% are considered evidence of modest and large heterogeneity, respectively.11, 19
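These heterogeneity statistics can be computed together, as in this illustrative sketch (same hypothetical flip-flopped cohorts as above):

```python
import numpy as np
from scipy.stats import chi2

def heterogeneity_stats(beta, se):
    """Cochran's Q, its P-value, I^2 (in %), and the moment estimate of tau^2."""
    beta, se = np.asarray(beta, float), np.asarray(se, float)
    w = 1.0 / se**2
    beta_fe = np.sum(w * beta) / np.sum(w)        # FE pooled estimate
    q = np.sum(w * (beta - beta_fe) ** 2)         # Cochran's Q
    k = len(beta)
    p_het = chi2.sf(q, k - 1)                     # Q ~ chi^2_{k-1} under homogeneity
    i2 = max(0.0, 100.0 * (q - (k - 1)) / q) if q > 0 else 0.0
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)            # 0 when Q < k - 1
    return q, p_het, i2, tau2

q, p_het, i2, tau2 = heterogeneity_stats([0.5, -0.4, 0.45], [0.15, 0.15, 0.15])
# p_het < 0.10 and I^2 >= 50% would both flag large heterogeneity
# by the conventional rules described above
```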

Simulation study

For a SNP with effect ln(ORi)=βi, minor allele frequency (MAF)=fi, and proportion of cases=π0i, satisfying Hardy–Weinberg equilibrium (HWE) in the ith cohort (i=1,2,…,k), we generated the case or control status for a subject with genotype x (0,1,2) using the multiplicative genetic model

P(y=1 | x) = exp(αi+βix)/(1+exp(αi+βix)),

where the intercept αi is determined by the proportion of cases π0i, the MAF fi, and the effect βi in cohort i. Furthermore, to assess the impact of genetic model (mis)specification, we generated data under the dominant, recessive, and multiplicative genetic model assumptions. For a quantitative trait, we generated the population data from

y = αi + βix + ei,

where we used αi=0.5 and ei∼N(0, 1) for all x (0,1,2) and i (1,2,…,k). We simulated β1,β2,…,βk from N(μ, τ2). We ran 10 000 simulations for each combination of (μ, τ) under a variety of realistic scenarios listed in Table 1. For instance, we considered the overall association, μ, from null (μ=0) through small to modest in size (μ=0.05, 0.10, 0.15, 0.20, 0.25, and 0.30), with corresponding ORs, exp(μ): 1.00, 1.05, 1.11, 1.16, 1.22, 1.28, and 1.35; and for such effect sizes we considered between-cohort SDs, τ, ranging from none (τ=0) through low (τ=0.1) and moderate (τ=0.2) to substantial (τ=0.3) heterogeneity for a binary trait. We analyzed the binary data using logistic regression assuming a multiplicative genetic risk effect per genotype. In a separate analysis, each of the dominant, recessive, and multiplicative genetic models was assumed while analyzing each of the data sets generated under each of these models using logistic regression. Data generated for quantitative traits were analyzed using linear regression assuming additive genetic risk. We assessed both the type I error rate and the statistical power of these tests at the nominal significance level α=0.05 as well as at the more stringent significance levels α=5.0 × 10−6 and 5.0 × 10−8.
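The binary-trait generation step can be sketched as follows. This is a simplified illustration: it fixes the baseline log-odds αi at an arbitrary value instead of solving for it from π0i as the paper does, and it samples genotypes conditional on case or control status (retrospective sampling), which is equivalent for a case–control design:

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_cohort(n_cases, n_controls, maf, beta, alpha=-1.0):
    """Simulate genotypes for one case-control cohort under HWE and a
    multiplicative (log-additive) risk model, sampling genotypes
    conditional on disease status. alpha is a fixed baseline log-odds
    for illustration; the paper determines it from pi_0i."""
    p_x = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf**2])   # HWE genotype freqs
    risk = 1.0 / (1.0 + np.exp(-(alpha + beta * np.arange(3))))     # P(case | x)
    p_case = p_x * risk / np.sum(p_x * risk)                        # P(x | case)
    p_ctrl = p_x * (1 - risk) / np.sum(p_x * (1 - risk))            # P(x | control)
    x = np.concatenate([rng.choice(3, size=n_cases, p=p_case),
                        rng.choice(3, size=n_controls, p=p_ctrl)])
    y = np.concatenate([np.ones(n_cases), np.zeros(n_controls)])
    return x, y

# one cohort of 1000 cases and 1000 controls, MAF 0.20, OR = 1.35 per allele
x, y = simulate_cohort(1000, 1000, maf=0.20, beta=np.log(1.35))
```

A full simulation would draw βi∼N(μ, τ2) per cohort, fit the logistic regression in each cohort, and feed the resulting (β̂i, si) into the four meta-analytic tests.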

Table 1 Parameters setting for different simulation scenarios

Application to real data

We used the West Nile virus infection severity data set,21 in which SNPs were genotyped on the Illumina HumanNS-12 BeadChip and subjects were recruited from seven study centers (cohorts) in Canada and the United States. We restricted the analysis to the Caucasian population of Northern and Western European origin. Using PLINK: Whole genome data analysis tool set (http://pngu.mgh.harvard.edu/~purcell/plink/), we first applied standard quality control (QC) inclusion criteria for considering SNPs or subjects for analysis: MAF ≥5%, genotyping error rate per SNP <5%, P-value for the HWE exact test in the control group >10−4, and genotyping error rate per subject <5%. Further, a SNP passing these criteria must have had MAF ≥1% and HWE P>10−5 in an individual center for that particular center to be included in the meta-analysis for that SNP. Cryptically related subjects, and subjects whose reported sex did not match that inferred from the DNA sample, were also discarded.21 Then, for each of the remaining SNPs, we obtained the estimates of βi and its SE, si, in center i (i=1,2,…,7) using logistic regression assuming a multiplicative genetic risk model. We also re-estimated si applying genomic control to correct the center-specific P-values for any residual confounding due to population substructure. We then applied all four meta-analytic methods to the center-specific aggregate data in R (http://www.r-project.org/) and compared their power based on association P-values relative to the I2, heterogeneity P-value, and τ estimated from the data. The significance level was adjusted for multiple testing using the Bonferroni adjustment.
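The Bonferroni adjustment here simply divides α by the number of tests; with the 9051 SNPs that passed QC (see Results), it reproduces the threshold reported in that section:

```python
m_snps = 9051                      # SNPs passing QC in this data set
alpha = 0.05
threshold = alpha / m_snps         # Bonferroni-adjusted significance level
print(f"{threshold:.2e}")          # prints 5.52e-06
```

As noted in the Results, this threshold is conservative because nearby SNPs are correlated through LD, so the effective number of independent tests is smaller than 9051.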

Results

Simulation results

Type I error

Type I error rates (ie, when both μ=0 and τ=0) for all four tests at α=0.05 are presented in Supplementary Table 1. Further data on the error rates at μ=0 and τ=0 can be found in Supplementary Table 2. The RE global (new RE) method produced the smallest type I error rates, maintaining the nominal significance level in all simulation scenarios. The other methods slightly exceeded the nominal level in a few simulation scenarios. At the more stringent α=5.0 × 10−6, no method produced any significant associations.

Statistical power

The statistical power of the four methods in different scenarios at α=0.05, 5.0 × 10−6, and 5.0 × 10−8 is presented in Supplementary Table 2 for both binary and quantitative traits. Some of the power comparisons are presented in Figures 1, 2, 3 and Supplementary Figures 1 to 8.

Figure 1

Power of four meta-analytic methods at τ=0.2 and α=0.05. Simulation scenario: binary trait; equal Ni (N/k), equal MAF (0.20), equal case–control ratio (1:1). N, total sample size; k, number of cohorts; Ni, cohort size; τ, between-cohort SD; OR, odds ratio; MAF, minor allele frequency.

Figure 2

Power comparison at different total sample sizes, numbers of cohorts, and minor allele frequencies at τ=0.2 and α=5 × 10−8. Simulation scenario: binary trait; average Ni=N/k (variable); MAF (variable), average case–control ratio=1:1 (variable). k, number of cohorts; N, total sample size; Ni, cohort size; τ, between-cohort SD; OR, odds ratio; MAF, minor allele frequency.

Figure 3

Power of four meta-analytic methods under each underlying and assumed genetic model at τ=0.2 and α=0.05. Simulation scenario: Binary trait, N=6000, k=3, Ni=2000 (equal), MAF=0.20 (equal), case–control ratio=1:1 (equal). k, number of cohorts; N, total sample size; Ni, cohort size; τ, between-cohort SD; OR, odds ratio; MAF, minor allele frequency.

Power for a binary trait: At no heterogeneity (τ=0): the traditional FE and RE methods had very similar power, higher than that of the global methods in all scenarios.

At low heterogeneity (τ=0.1): at the nominal significance level α=0.05 for a common variant (MAF≈0.20), the traditional FE method was the most powerful in almost all scenarios for detecting modest associations (OR≥1.20), followed by the traditional RE method. The new FE method performed as well or slightly better when there were fewer but larger cohorts in large studies (eg, when the number of cohorts k=3 for a total sample size N=8000, or k=5 for N=10 000), especially for smaller overall associations (OR<1.20). But at more stringent significance levels, α=5.0 × 10−6 or smaller, there was no power advantage for the global methods. For a less common variant (MAF≈0.05), the new methods did not perform better even when k=2 for N=10 000 at α=0.05.

At moderate heterogeneity (τ=0.2): for a common variant (MAF≈0.20) at α=0.05, the new FE method had the highest power when fewer but larger cohorts were included (k≤5), while the new RE method had better power when many smaller cohorts (k≥7) were available for a small or modest total sample size (N=2000∼4000) (Figure 1 and Supplementary Figure 7). The new FE method almost always had better power when the overall associations were very small (OR<1.20) (Figure 1 and Supplementary Figures 1 and 7). At α≤5.0 × 10−6, for a given sample size each of the methods gained some power with fewer cohorts of larger sizes for k≤7, but this advantage tended to diminish or even reverse for N≥8000 with larger k (Figure 2 and Supplementary Figure 3). For N≤4000, the new RE method performed better than or similarly to the traditional FE method, and better than the new FE method, for k≥7, while the new FE method performed best for larger cohort sizes (k<7). For N≥6000 the new methods generally performed better (the new FE method had the highest power for k≤5, while the new RE method had the highest power for k≥7) (Figure 2 and Supplementary Figures 5 and 7).

For a less common variant (MAF≈0.05): even at α=0.05, all methods had considerably low power, and the gain in power for the new methods was not as prominent as that observed for more common variants (Supplementary Figures 2 and 7). For example, the new RE method did not perform better than the traditional FE method, and the advantage of the new FE method was not clear either, when N=4000 even with k≤5 (Supplementary Figure 2). At α=5.0 × 10−8, the power of all methods was very low for N≤6000. For N≥8000 with k≥7, the traditional FE method had the highest power, and the new RE method performed better than the new FE method. The new FE method had similar or slightly better power than the traditional FE method for N≥8000 with k≤3 (Figure 2).

At substantial heterogeneity (τ=0.3): at α=0.05, the new FE method generally outperformed all other methods, for common or less common variants, for any number of available cohorts (Supplementary Figures 2 and 3).

At α=5.0 × 10−8 for a common variant, the new FE method outperformed the others for any k when N≥4000 (Supplementary Figure 4), whereas the new RE method performed similarly to or better than the new FE method for many cohorts of small sizes (N=2000 with k≥7), in which situation the traditional FE method performed even better. For a less common variant, power was too low for N≤4000 for all methods to allow any meaningful comparison (Supplementary Figure 4). For N≥6000, the new RE method generally outperformed when k≥7, while the new FE method outperformed when k≤5.

Power for a quantitative trait: At no or low heterogeneity, the traditional methods generally performed better than the new methods. But under higher heterogeneity, the new methods in general performed quite well for quantitative traits (Supplementary Figures 4, 5 and 6). For example, at τ=0.2, quantitative trait analysis was more powerful than binary trait analysis, and the new global methods had higher power even for a less common variant at a more stringent significance level, with a considerable power advantage over the traditional methods for a common variant. Even at τ=0.1 and α=5.0 × 10−8, and even for N=2000, the new FE method had similar or higher power than the traditional methods when k=2, and the new RE method outperformed them when k=10, for a common variant. The new global methods performed better even in the presence of little heterogeneity for larger total sample sizes, and almost always outperformed when the heterogeneity was substantial.

Similar comparative results were observed for the power of these tests irrespective of whether the minor allele frequencies, the proportions of cases, and the cohort sizes were similar or varied across cohorts, and whether or not an important independent prognostic variable was adjusted for in the analysis. The traditional FE method outperformed the traditional RE method in all conditions. In general, as heterogeneity increased, the power of the traditional meta-analytic approaches generally decreased while that of the new global approaches increased at α=0.05 (Supplementary Figure 7). Interestingly, the power of even the traditional methods, and in particular the FE method, increased as heterogeneity increased at α≤5.0 × 10−6 in situations where power is expected to be generally small (eg, when the overall association was very small (OR≤1.20), or the total sample size was small (N≤4000 for a common variant and N≤8000 for a less common variant)) (Supplementary Figure 7).

Impact of genetic model (mis)specification on power: At both α=0.05 and 5.0 × 10−6, the new methods had better power than the traditional methods in the presence of moderate or substantial heterogeneity, regardless of whether the correct or the wrong of the multiplicative and dominant risk models was assumed when the underlying model was one of those two (Figure 3 and Supplementary Figure 8). However, when the underlying or assumed model was recessive, all of these tests had considerably low power, and the power was zero or almost zero at the more stringent significance level, where the new methods (particularly the new RE method) had the least power. However, the sample size (N=6000, k=3) was not sufficient to allow any meaningful comparison under the recessive risk model, since only about 4% of subjects had the risk genotype for MAF=0.20.

Application to real data

In the West Nile virus infection severity data set,21 13 371 SNPs were genotyped in 1346 participants recruited from seven study centers (cohorts) across Canada and the United States. There were 488 cases with neuroinvasive disease (meningitis, encephalitis, acute flaccid paralysis) and 858 controls (infected but without severe complications). After applying the QC criteria and restricting the analysis to the White population, 9051 SNPs with 441 cases and 815 controls were left for analysis. Five cases were further discarded from one center, as it had only cases and no controls for comparison. The Bonferroni-adjusted significance level was set to 5.52 × 10−6. It should be noted, however, that this threshold is too conservative for association analysis, as SNPs are not independent. There was no population substructure within any center except center 2, for which the genomic inflation factor was λ=1.057. Correction for population substructure did not significantly alter the meta-analysis results. About 3.8%, 8.6%, and 14.8% of SNPs had τ>0.30, 0.20, and 0.10, respectively. The estimates of heterogeneity were larger than the sizes of the respective overall associations (τ>log(OR)) for about 13.6% of SNPs, which had some center-specific associations in opposite directions. About 8.2% of SNPs had τ>log(OR) with a very small average OR (1≤OR<1.10 or 1≤1/OR<1.10). About 3.4% and 8.2% of SNPs had heterogeneity P-values <0.05 and <0.10, respectively, in Cochran’s Q-test. About 16% of the SNPs had modest (25%≤I2<50%) and 6% had substantial (I2≥50%) heterogeneity. Estimates of associations, their association P-values, and the extent of heterogeneity for these tests are presented in Table 2 for the three most significant SNPs according to the FE level test and three of the most heterogeneous SNPs as suggested by Cochran’s Q-test, for illustration purposes. In Supplementary Table 3, the three SNPs with the most heterogeneous effects are further explored within each cohort.
In this analysis, none of the methods yielded any SNPs that remained significant at the Bonferroni-adjusted level. The traditional FE method produced the smallest association P-values for SNPs with small heterogeneity (eg, rs2066789). For SNPs with large heterogeneity, the new FE method produced the smallest association P-values, followed by the new RE method, both much lower than those derived using the traditional FE and RE methods.

Table 2 Estimates of ORs and P-values for few SNPs from meta-analysis of West Nile virus data set

Discussion

The new RE global test produced the smallest type I error rates at the nominal significance level. No method produced significant associations at the more stringent significance level, α=5.0 × 10−6. As expected, the traditional RE (level) method, which is proposed for heterogeneous conditions, performed the worst at high heterogeneity. This method assesses the average effect without utilizing the heterogeneity information and is overly conservative for genetic association studies, where associations can be heterogeneous for genuine reasons.15, 16 The new global methods work well for common variants at high heterogeneity, even if a wrong multiplicative or dominant genetic model is assumed when the underlying risk model is one of the two. At high heterogeneity, when fewer but larger individual cohorts are available for a given small to modest total sample size (2000∼4000 subjects), the new FE method performs quite well, but it may fail if cohort sizes are very small.15 When there are many cohorts of smaller sizes, the new RE global method may be a more powerful choice than the traditional methods. However, these global methods offer no clear advantage, even as screening tools, for less common genetic variants, even at substantial heterogeneity, in a small to modest sized study.

One concern is that these global methods, and in particular the global FE method, have a clear power advantage over traditional methods mostly when overall associations are small (OR<1.20) but highly heterogeneous across cohorts (ie, when τ≥μ=ln(OR)). Can we use these global methods for gene discovery in meta-analysis, where even the new RE method might achieve genome-wide significance with much higher power even when the overall OR=1.0 or 1.05 at high heterogeneity? Although some degree of heterogeneity is expected for genuine reasons, the credibility of an association is questionable if very high heterogeneity is observed with such a small overall association. An observed association is unlikely to be robust, even in a single cohort, if associations of large magnitude in individual cohorts flip-flop in opposite directions, suggesting both protective and harmful effects of the same mutant gene in distinct populations, which would result in moderate or substantial heterogeneity. Such an association is more likely a spurious finding resulting from some undetected methodological error (eg, genotyping error) or chance variation.22, 23 In such studies, if the individual cohorts are well designed or large, real biases (eg, population stratification) are unlikely to reverse the direction of a genuine association with large magnitude.11 If associations are highly heterogeneous across cohorts but most of the larger associations are in the same direction, the average association is also likely to be of larger magnitude, and the traditional FE method can perform equally well for large effect sizes. Further, this method also gains some power at more stringent significance levels to detect associations of small magnitude as heterogeneity increases, although the gain is smaller than for the global methods. Thus, it may also better control false positives under extreme heterogeneity caused by errors or chance rather than genuine factors.

Therefore, any perceived advantage, especially of the new FE global method at high heterogeneity, may not directly translate into practice if the purpose is to achieve significance for the discovery of genes with robust associations in at least one cohort, rather than just screening for potential associations in multi-cohort GWA studies or meta-analyses. For example, in our example data set, the SNPs explored in Supplementary Table 3 had very high heterogeneity. They had similar MAFs with no genotyping errors or deviations from HWE across cohorts, the study was conducted on subjects of genetically similar backgrounds, and the same study protocol was followed across centers. What, then, might have caused so much heterogeneity for these SNPs? Here, cohorts were defined based on geographic location and might not be very distinct in terms of genetic and environmental factors. Furthermore, the total sample size and the individual cohort sizes were quite small, and the case–control ratio was quite variable across centers (as the disease complications under study were quite uncommon in North America, it was a challenge to obtain a sufficient number of case and control subjects in each center during the time frame of the study). Therefore, the most plausible explanation for the substantial heterogeneity is sampling variation. In practice, inclusion of such SNPs in an analysis might simply inflate the overall heterogeneity distribution and hence prompt tougher adjustment for population structure than is necessary for other SNPs that are more genuine candidates for analysis.
However, we deliberately did not filter out such SNPs during the QC phase, because our purpose was to assess the utility of these newly proposed methods not only as tools to achieve significance at a more stringent (adjusted or genome-wide) level to identify new genes associated with diseases, but also as screening tools to achieve significance at such a level in the presence of heterogeneity, so that the SNPs could be carried forward for further scrutiny. If there were a genuine small association of such a SNP in some cohorts, because of, say, a gene-local environment interaction, while its association was reversed in other cohorts by chance, then we would have missed the opportunity to test that SNP had we filtered it out before analysis. In our example, for SNPs with very small overall associations and large heterogeneity, which could have been filtered out before analysis, the new FE method produced quite strong association P-values, whereas the new RE method suggested a less impressive association and is less likely to lead to any unnecessary follow-up of such SNPs.

In recent years, large-scale multi-cohort association studies have been carried out collaboratively for many complex diseases. For example, the INTERHEART Study24 assessed the associations of different genetic variants with myocardial infarction risk factors in over 8000 individuals from five ethnic populations. In such studies, many SNPs may be expected to display modestly or highly heterogeneous associations, for myriad reasons, in genetically and environmentally distinct cohorts. Substantial heterogeneity is likely for some variants even in genetically close populations. For example, in a meta-analysis of three GWA studies of type 2 diabetes in the northern European population,10 some SNPs had an I2 as high as 77%. Multi-cohort GWA studies or meta-analyses are, in practice, likely to be much bigger and to include larger individual cohorts than the data set we used. Hence, any strong association in a single cohort might justify further exploration, as genotyping errors or chance might not be the only explanations for such a large association in a cohort. In such studies, these new approaches may be useful for screening genetic variants to assess association in the presence of high heterogeneity and for prioritizing them for further scrutiny. Simple exploration across cohorts can identify methodological and chance errors or biases causing heterogeneity; if heterogeneity remains unexplained, pathway-based analysis could provide better insight into the roles of genes and environments causing the heterogeneity.15 If this suggests the presence of genuine associations in some cohorts, future replication or fine-mapping studies in those cohorts can resolve the issue. A genuine variant is then more likely to be identified in the investigation process.

In considering the potential and pitfalls of these new global methods, some important questions require further research and discussion. Are these new methods useful in practice in small to moderate sized genetic association studies to also validate or discover new genes in the presence of high heterogeneity? Do they work well in meta-analyses of independent research studies that might have employed different genotyping platforms or even recruited subjects of different ethnic backgrounds with different LD patterns, in which case an analyst might have to impute untagged markers? Also, for genetic variants with heterogeneous associations across cohorts, the pathway-based three-point mixture model seems to be a promising tool for resolving the heterogeneity in specific cohorts.15 Although that method might not be feasible for meta-analysis of GWA studies, its prospects could be further explored for certain sets of SNPs known to belong to biologically defined pathways.