Introduction

It is increasingly recognized that common diseases are not consequences of the independent actions of individual genes, but are caused by the complex joint actions of multiple genetic and environmental risk factors. Gene–gene interactions play an essential role in the initiation and development of these diseases.1

Despite current enthusiasm for investigating interactions between genes, the essential issue of how to define and detect gene–gene interaction remains unresolved.2 In the past, statistical and biological interactions have often been defined separately. As Rothman et al3 pointed out, ‘The term statistical interaction is intended to denote the interdependence between the effects of two or more factors within the confines of a given model of risk’, and ‘Biological interaction may be defined as the interdependent operation of two or more causes to produce disease’. A core part of statistical interaction is the specification of a statistical model. The most popular models for statistical interaction between genes are additive models, which define the effect of gene interaction as a statistical deviation from the additive effects of single genes in a linear model (or a logistic regression model for qualitative traits). These models were originally proposed by Fisher4 and further developed into their modern representations by Cockerham5 and Kempthorne.6 Statistical interaction models essentially treat the interaction effect as a residual term in genetic analysis, and hence are likely to have limited power to detect interaction.

As an alternative to statistical interaction models, interactions between two loci (or genes) can be understood as irreducible dependencies between loci causing disease.7 The purpose of this new definition is to develop a mathematical representation of interaction that is close to the true biological interaction. We use the penetrance of risk factors to measure the degree to which they cause disease. In a broad sense, interaction corresponds to the situation in which the effect of one locus (gene) is affected by the presence or absence of the other.8, 9, 10 The presence of interaction between two loci implies that the two loci share something in common in causing the disease (or phenotype). This shared information leads to association of the two loci in the disease population, that is, high dependency or correlation between them. In the language of population genetics, dependency between two loci corresponds to linkage disequilibrium (LD). In other words, although LD between two loci is not itself the interaction of the effects of their alleles on a disease, LD can be used to detect interaction. If we assume that the controls are sampled from a single isolated population, two unlinked loci are in linkage equilibrium in controls, whereas interaction between the two loci will generate LD in the disease population.11 Therefore, we can use the difference in LD between cases and controls to assess whether interaction between two unlinked loci is present. The level of LD due to interaction in the disease population depends on the magnitude of the interaction. This motivated the development of statistics based on deviations from linkage equilibrium in the cases for the detection of interaction between two loci.

Although LD-based statistics have demonstrated high power to detect interaction between two loci, the linkage phase of marker loci in unrelated individuals is generally unknown; only genotype data are available. Experiments that generate haplotype data are expensive and time consuming, and estimating haplotypes from genotype data inevitably incurs errors, which in turn increase the number of false-positive findings in the detection of interaction between two loci.12 The main purpose of this paper is to use unphased genotypes directly to develop statistics for the detection of interaction between two unlinked loci. Hardy–Weinberg disequilibrium at marker loci can be used to develop an association test.13 Similarly, the composite measure of LD,14, 15, 16 which uses genotype data to estimate the nonrandom association of alleles from different loci on chromosomes from the same parent (intragametic LD) and on chromosomes from different parents (intergametic LD), has been used to design association tests that allow for unknown linkage phase.17, 18, 19 We extend the composite measure of LD to test for interaction between two unlinked loci when only genotype data are available. To achieve this, we first develop a general theory of intragametic and intergametic LD patterns under two-locus disease models. We then develop a novel definition and measure of intragametic interaction, which is caused by two interacting alleles from unlinked loci on the same haplotype, and of intergametic interaction, which is caused by two interacting alleles from unlinked loci on different haplotypes. The pattern of intragametic and intergametic LD between two unlinked loci due to gene–gene interaction provides a foundation for developing composite LD-based statistics that test for interaction between two unlinked loci using genotype data.
To study the validity of the composite LD-based statistic for testing interaction, we estimate its type 1 error rates by simulation. To evaluate its performance in detecting interaction between two loci, we compare its power with that of logistic regression and apply it to two real examples.

Methods

Measure of interaction between two loci

Let D1 and d1 be the two alleles at the first disease locus with frequencies PD1 and Pd1, respectively. Let D2 and d2 be the two alleles at the second disease locus with frequencies PD2 and Pd2, respectively. Alleles D1 and d1 can be indexed by 1 and 2, respectively. At the first disease locus, let D1D1 be genotype 11, D1d1 be genotype 12 (or d1D1 be genotype 21) and d1d1 be genotype 22. Thus, the genotypes at the first disease locus can be indexed by ij. The genotypes at the second disease locus are similarly defined and can be indexed by kl. Two-locus genotypes are simply denoted by ijkl for individuals carrying genotype ij at the first disease locus and kl at the second disease locus. Let fijkl be the penetrance of individuals with genotype ijkl. Let P11, P12, P21, and P22 be the frequencies of haplotypes D1D2, D1d2, d1D2, and d1d2 in the general population, respectively. Let P11A, P12A, P21A, and P22A be the corresponding haplotype frequencies in the disease population. Let P1/1, P1/2, P2/1 and P2/2 be the frequencies of the allele pairs D1/D2, D1/d2, d1/D2, and d1/d2, respectively, where the slash denotes that the two alleles lie on the two chromosomes of an individual that come from different parents. Let P1/1A, P1/2A, P2/1A, and P2/2A be the corresponding frequencies in the disease population. Let PD1A, Pd1A, PD2A, and Pd2A be the frequencies of the alleles D1, d1, D2, and d2 in the disease population, respectively.

In general, it is genotypes that have penetrances. For ease of discussion, we introduce the concept of haplotype penetrance. Consider the haplotype D1D2, which carries two alleles at different loci on the same chromosome. Then, the penetrance h11 of haplotype D1D2 is defined as

In other words, the penetrance of haplotype D1D2 is defined as the probability that an individual carrying the haplotype is affected. It is a weighted sum of the penetrances of the genotypes that contain haplotype D1D2. The penetrances h12, h21, and h22 are similarly defined.
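As an illustration of this verbal definition (a sketch, not necessarily the exact expression used here), suppose the second haplotype of an individual who carries D1D2 is drawn at random from the general population with frequencies P11, P12, P21, and P22. Under that random-union assumption, the weighted sum takes the form:

```latex
% Assumption: the partner haplotype is sampled with population frequencies P_{11},...,P_{22}.
% f_{ijkl}: penetrance of genotype ij at locus 1 and kl at locus 2.
h_{11} \;=\; P_{11}\, f_{1111} \;+\; P_{12}\, f_{1112} \;+\; P_{21}\, f_{1211} \;+\; P_{22}\, f_{1212}
```

Each term pairs D1D2 with one possible partner haplotype (D1D2, D1d2, d1D2, or d1d2) and reads off the resulting two-locus genotype; the weights sum to one.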

The penetrance h1/1 of two alleles, D1 and D2, located at different loci on different chromosomes can be defined as

It is a weighted sum of genotypic penetrances. Similarly, we can define the penetrances h1/2, h2/1, and h2/2. If we assume Hardy–Weinberg equilibrium and genotypic equilibrium in the general population, then we have h11=h1/1, h12=h1/2, h21=h2/1, and h22=h2/2. Let δD1D2 be the measure of intragametic LD, which measures the association of alleles from different loci on the same haplotype,17 and let δD1/D2 be the measure of intergametic LD, which measures the association of two alleles from different loci on different haplotypes,17 in the general population. We can show that the haplotype frequencies in the disease population can be expressed as

and

where PA denotes disease prevalence.
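The relation between frequencies in the general population and in the disease population follows from Bayes' rule. As a sketch under the definitions above (an illustration, not necessarily the exact form of the expressions referred to here), the frequency of a haplotype or allele pair among affected individuals is its population frequency weighted by its penetrance and normalized by the prevalence:

```latex
% Bayes' rule: Pr(haplotype | affected) = Pr(haplotype) * Pr(affected | haplotype) / Pr(affected)
P_{11}^{A} \;=\; \frac{P_{11}\, h_{11}}{P_A},
\qquad
P_{1/1}^{A} \;=\; \frac{P_{1/1}\, h_{1/1}}{P_A},
\qquad
P_A \;=\; \sum_{i,j \in \{1,2\}} P_{ij}\, h_{ij}
```

The same form applies to the remaining haplotypes and allele pairs by changing the indices.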

Now we calculate the measures of intragametic and intergametic LD in disease population under a general two-locus disease model. The measures of intragametic and intergametic LD in disease population are denoted by respectively. We can show that they can be given by

and

where and . We define a measure of intragametic interaction, which measures the interaction of two alleles from different loci on the same haplotype, as and a measure of intergametic interaction, which measures the interaction of two alleles from different loci on different haplotypes, as . Then a measure of total interaction between two loci, which consists of intragametic and intergametic interaction, is given by

Equation (5) clearly shows that the interaction between two loci is defined by the penetrance of the two loci. Although the penetrance of the risk factors is not directly related to the biological process, it is related to the causes of the disease. The above definition of interaction may therefore be related to biological interaction. It follows from equations (3), (4) and (5) that the composite measure of LD (Weir 1996) in the disease population is given by

Absence of interaction between two loci is then defined as

Equation (7) indicates that, similar to linkage equilibrium, where the frequency of a haplotype is equal to the product of the frequencies of its component alleles, absence of interaction between two loci implies that the proportion of individuals carrying two alleles (either on the same chromosome or on different chromosomes) in the disease population is equal to the product of the proportions of individuals carrying each single allele in the disease population, if we assume that the disease is caused only by the two investigated disease loci. In other words, interaction between two disease susceptibility loci occurs when the contribution of one locus to the disease depends on the other locus. In contrast to the additive model for interaction introduced by Fisher,4 the interaction model defined by equations (5) and (7) is referred to as a multiplicative interaction model.
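The link between interaction and LD in the disease population can be checked numerically. The following is a minimal sketch, assuming Hardy–Weinberg equilibrium and linkage equilibrium in the general population and penetrances that depend only on the unphased genotype; the function name `case_ld` and all numerical values are hypothetical and chosen for illustration. It enumerates the paternal and maternal haplotypes, weights each configuration by its penetrance, and returns the intragametic and intergametic LD between D1 and D2 among affected individuals.

```python
from itertools import product

def case_ld(p1, p2, f):
    """Intragametic and intergametic LD between D1 and D2 among cases.

    p1, p2: population frequencies of risk alleles D1 and D2 (illustrative inputs).
    f[g1][g2]: penetrance for g1 copies of D1 and g2 copies of D2.
    Assumes HWE and linkage equilibrium in the general population.
    """
    freq1 = {1: p1, 0: 1.0 - p1}  # 1 = D1, 0 = d1
    freq2 = {1: p2, 0: 1.0 - p2}  # 1 = D2, 0 = d2
    prevalence = p_d1 = p_d2 = p_intra = p_inter = 0.0
    # (a1, a2): alleles on the paternal haplotype; (b1, b2): maternal haplotype
    for a1, a2, b1, b2 in product((0, 1), repeat=4):
        w = freq1[a1] * freq2[a2] * freq1[b1] * freq2[b2] * f[a1 + b1][a2 + b2]
        prevalence += w
        p_d1 += w * a1
        p_d2 += w * a2
        p_intra += w * a1 * a2  # D1 and D2 on the same haplotype
        p_inter += w * a1 * b2  # D1 and D2 on different haplotypes
    p_d1, p_d2 = p_d1 / prevalence, p_d2 / prevalence
    intra = p_intra / prevalence - p_d1 * p_d2
    inter = p_inter / prevalence - p_d1 * p_d2
    return intra, inter

# Penetrances that factorize across loci (one simple case without interaction).
u, v = [0.01, 0.02, 0.04], [1.0, 1.5, 3.0]
no_interaction = [[u[i] * v[j] for j in range(3)] for i in range(3)]
# Risk elevated only when risk alleles are present at both loci (hypothetical values).
interaction = [[0.01, 0.01, 0.01], [0.01, 0.05, 0.05], [0.01, 0.05, 0.05]]

print(case_ld(0.2, 0.3, no_interaction))
print(case_ld(0.2, 0.3, interaction))
```

In this setting the factorized penetrance table leaves the two loci independent among cases, whereas the second table induces positive LD of equal size within and between gametes, so the composite LD is twice the intragametic LD.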

Indirect interaction between two unlinked marker loci

In the previous section, we studied interaction between two unlinked disease loci. Now we consider two marker loci, each of which is in LD with either of the two interacting loci. Assume marker M1 is in LD with disease locus D1 and marker M2 is in LD with disease locus D2. Furthermore, we assume that two disease loci D1 and D2 are unlinked. Let be the measures of intragametic and intergametic LD between two marker loci in the disease population, respectively. We denote the composite measure of LD between two marker loci by . Let δi be the LD measure between marker Mi and disease locus Di (i=1,2) in the general population. Then, we can show that (Appendix A)

It is clear that when the marker loci are the disease loci themselves, are reduced to . equation (8) can also be written in terms of the measure of interaction between two unlinked loci

Since , the absolute value of the LD measure between two unlinked marker loci in the disease population, for example, the composite measure of LD between two marker loci will be less than or equal to the absolute value of the composite measure of LD between two unlinked disease loci in the disease population.

Test statistic

In the previous section, we showed that under the multiplicative disease model, interaction between unlinked loci will create LD. Intuitively, we can test interaction by comparing the difference in the composite genotypic disequilibrium between two unlinked loci between cases and controls. Precisely, if we denote the estimators of the composite LD measures in cases and controls by , respectively, then the test statistic can be defined as

where

are similarly defined for controls, the formula for calculations of the composite measure of LD in cases and controls is given in Weir (1996),15 are defined as before, are their estimators, the quantities nA and nG denote the number of sampled individuals in cases and controls, respectively; the variance of the composite LD measure was the large-sample variance.15 Under the null hypothesis and assumption of the Hardy–Weinberg equilibrium, the variance of the composite measure of LD in cases and controls becomes . When the sample size is large enough for large-sample theory to apply, the test statistic TI is asymptotically distributed as a central χ² distribution with 1 degree of freedom under the null hypothesis of no interaction (neither intragametic nor intergametic) between two unlinked loci and under the assumption of Hardy–Weinberg equilibrium.
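The computation described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it assumes genotypes coded as counts of the reference allele (0, 1, 2), uses the composite LD estimator of Weir (n_AB/n − 2·p_A·p_B, which equals half the covariance of the two allele counts), and uses the standard large-sample null variance under Hardy–Weinberg equilibrium, p_A(1 − p_A)·p_B(1 − p_B)/n, which may differ in detail from the expression intended here. The function names are hypothetical, and both loci are assumed to be polymorphic in both samples.

```python
import math

def composite_ld(x, y):
    """Composite LD from unphased genotypes coded as allele counts (0, 1, 2).

    Half the covariance of the two allele counts, which is algebraically
    the same as Weir's estimator n_AB / n - 2 * p_A * p_B.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / 2.0

def null_variance(x, y):
    """Large-sample variance of the composite LD assuming no LD and HWE."""
    n = len(x)
    p, q = sum(x) / (2 * n), sum(y) / (2 * n)
    return p * (1 - p) * q * (1 - q) / n

def interaction_test(case_x, case_y, ctrl_x, ctrl_y):
    """Return (statistic, p_value) contrasting composite LD in cases and controls."""
    diff = composite_ld(case_x, case_y) - composite_ld(ctrl_x, ctrl_y)
    var = null_variance(case_x, case_y) + null_variance(ctrl_x, ctrl_y)
    stat = diff * diff / var
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, p_value

# Toy data: the two loci are perfectly correlated in cases, uncorrelated in controls.
ctrl_x = [0, 1, 2, 0, 1, 2, 0, 1, 2]
ctrl_y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
case_x = list(ctrl_x)
case_y = list(ctrl_x)
print(interaction_test(case_x, case_y, ctrl_x, ctrl_y))
```

Only unphased genotype counts enter the calculation, which is the practical point of using the composite measure.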

In theory, a case-only design can be used to study interaction between two loci. In practice, however, background LD between two unlinked loci may exist in the population because of many unknown factors. Therefore, the test statistic based on a case–control design is more robust than one based on a case-only design.

Results

Type 1 error rates of test statistics

To examine the validity of the statistic for testing interaction, we performed a series of simulation studies. The computer program SNaP20 was used to generate two-locus genotype data for the sampled individuals. A total of 20 000 individuals, equally divided into cases and controls, were generated in the general population, assuming genotypic equilibrium (both intragametic and intergametic equilibria) between the two loci. We randomly sampled 100–400 individuals from each of the cases and controls to calculate the type I error rates, and repeated the simulation 10 000 times. Table 1 shows that the estimated type I error rates of the statistic TI for testing interaction between two unlinked loci were not appreciably different from the nominal levels α=0.05, α=0.01 and α=0.001.
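A simplified version of this check can be sketched as follows. It is not the SNaP-based procedure described above: as a stated simplification, cases and controls are drawn directly from the same population with two independent loci in Hardy–Weinberg equilibrium, and the statistic uses the half-covariance form of the composite LD with the large-sample null variance p(1 − p)q(1 − q)/n. All function names, allele frequencies, and sample sizes are hypothetical.

```python
import random

def draw_genotypes(rng, n, p):
    """Allele counts (0, 1, 2) for n individuals at a locus in HWE with allele frequency p."""
    return [(rng.random() < p) + (rng.random() < p) for _ in range(n)]

def ld_and_variance(x, y):
    """Composite LD (half the covariance of allele counts) and its null variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    delta = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (2 * n)
    p, q = mean_x / 2, mean_y / 2
    return delta, p * (1 - p) * q * (1 - q) / n

def type1_error(n_per_group=200, p=0.3, q=0.3, reps=2000, critical=3.841, seed=1):
    """Fraction of null replicates whose statistic exceeds the chi-square(1) 5% cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        d_case, v_case = ld_and_variance(draw_genotypes(rng, n_per_group, p),
                                         draw_genotypes(rng, n_per_group, q))
        d_ctrl, v_ctrl = ld_and_variance(draw_genotypes(rng, n_per_group, p),
                                         draw_genotypes(rng, n_per_group, q))
        if (d_case - d_ctrl) ** 2 / (v_case + v_ctrl) > critical:
            rejections += 1
    return rejections / reps

print(type1_error())
```

With the loci independent in both groups, the rejection fraction should fall near the nominal 0.05 level, up to Monte Carlo error.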

Table 1 Type 1 error rates of the test statistic TI to test interaction between two unlinked loci in a homogeneous population

Power evaluation

To evaluate the performance of the composite LD-based statistic in testing gene–gene interaction, we compared the power of the statistic employing composite measure of the LD to that of the logistic model. We use the genotype coding scheme in QUANTO21 for power calculations. Specifically, we considered two types of genotype coding (genetic covariate variables). For a dominant model, homozygous wild type, heterozygous, and homozygous mutant genotypes were coded as 0, 1, and 1, respectively. For an additive model, they were coded as 0, 1, and 2, respectively. We considered two loci, denoted as G and H, respectively. We assume the following logistic model:

where is the baseline probability of disease in the population,

and RG, RH, and RGH are the odds ratios for G when H=0, for H when G=0, and for the interaction G × H, respectively.21 Power for both the composite LD-based statistic and logistic regression22 was calculated by simulation. The computer program SNaP20 was used to generate 10 000 cases and 10 000 controls with unlinked two-locus genotype data. Two-locus interaction effects were simulated for two-locus dominant and additive models with penetrance functions as given in Gauderman (2002).21 Five hundred individuals were randomly sampled from each of the cases and controls, and the simulation was repeated 10 000 times. Figures 1a and 1b present the power comparisons between the logistic regression model and the composite LD-based statistic under two genetic interaction models: dominance × dominance and additive × additive. The power of both methods to detect gene–gene interaction was a monotonically increasing function of the interaction odds ratio, a widely used measure of the strength of interaction between two loci. This implies that the proposed interaction measure and test statistic are closely related to the traditional interaction measure. The power of the composite LD-based statistic TI was also higher than that of the logistic regression model.
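The way the genotype coding and odds ratios determine the penetrances can be illustrated with a short sketch. It assumes a logistic model whose log-odds is linear in the coded covariates G, H, and their product, and it interprets the baseline term as the disease probability for the reference genotype (G = H = 0); QUANTO instead calibrates the intercept to the population risk, so this is a simplification. The function and parameter names are hypothetical.

```python
import math

def code_genotype(copies, model):
    """Covariate for a genotype carrying 0, 1, or 2 copies of the risk allele."""
    if model == "dominant":
        return min(copies, 1)  # coded 0, 1, 1
    if model == "additive":
        return copies          # coded 0, 1, 2
    raise ValueError("model must be 'dominant' or 'additive'")

def penetrance(g_copies, h_copies, baseline, r_g, r_h, r_gh, model="dominant"):
    """Disease probability for a two-locus genotype under the logistic model.

    baseline: assumed disease probability when G = H = 0.
    r_g, r_h, r_gh: odds ratios for G, H, and the G x H interaction.
    """
    g, h = code_genotype(g_copies, model), code_genotype(h_copies, model)
    logit = (math.log(baseline / (1 - baseline))
             + g * math.log(r_g) + h * math.log(r_h) + g * h * math.log(r_gh))
    return 1 / (1 + math.exp(-logit))

def odds(prob):
    return prob / (1 - prob)

# With main-effect odds ratios of 1, only carriers at both loci have elevated risk.
for g_copies in range(3):
    print([round(penetrance(g_copies, h_copies, 0.001, 1.0, 1.0, 3.0), 5)
           for h_copies in range(3)])
```

Under this parameterization the interaction odds ratio is recovered from the penetrance table as odds(1,1)·odds(0,0) / (odds(1,0)·odds(0,1)).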

Figure 1

(a) Power of the test statistic TI and logistic regression analysis as a function of the interaction odds ratio under a dominance × dominance model, assuming that the risk allele frequencies at both loci G and H are 0.2, the numbers of cases and controls are both 500, the population risk is 0.001, the significance level is 0.001, and the genetic odds ratios are RG=1 and RH=1. (b) Power of the test statistic TI and logistic regression analysis as a function of the interaction odds ratio under an additive × additive model, with the same allele frequencies, sample sizes, population risk, significance level, and genetic odds ratios as in (a).

Application to real data examples

To further evaluate its performance in detecting interaction between two unlinked loci, the proposed test statistic TI was applied to two real examples. The first example was a breast cancer case–control study. A total of 398 Caucasian breast cancer cases and 372 matched controls were sampled from the Ontario Familial Breast Cancer Registry (OFBCR).23 Nineteen SNPs from 18 key genes in DNA repair, cell cycle, carcinogen/estrogen metabolism, and the immune system were typed. All SNPs were in Hardy–Weinberg equilibrium. Using multivariate logistic analysis under codominant models, four pairs of genes showed significant interactions: XPD and IL10, GSTP1 and COMT, COMT and CCND1, and BARD1 and XPD.23 We used the statistic TI to test for interactions between these four pairs of genes. The test results are summarized in Table 2, where the crude P-values were taken from Table 4 of Onay et al (2006).23 The crude P-values were obtained from multivariate logistic regression analysis that included all main effects and only the interaction of interest under the codominant models. As shown in Table 2, logistic regression analysis identified interactions between XPD-(Lys751Gln) and IL10-(G(-1082)A), BARD1-(Pro24Ser) and XPD-(Lys751Gln), COMT-(Met108/158Val) and CCND1-(Pro241Pro), and GSTP1-(Ile105Val) and COMT-(Met108/158Val). After the more conservative Bonferroni adjustment, however, none of these interactions was significant.23 Table 2 shows that the P-values based on the test statistic TI were smaller than those based on traditional logistic regression analysis for XPD-(Lys751Gln) and IL10-(G(-1082)A), but larger for BARD1-(Pro24Ser) and XPD-(Lys751Gln), and for COMT-(Met108/158Val) and CCND1-(Pro241Pro).

Table 2 Comparison of P-values for testing gene–gene interactions (Example 1)

A popular view is that statistics using haplotype data usually yield smaller P-values than statistics using genotype data. To examine this view, the second example is a coronary heart disease study in Shanghai, China, in which 812 SNPs in 176 genes were typed for 1320 cases and 1129 controls. Atherosclerosis is the primary cause of coronary heart disease.24, 25 Although for a majority of SNP pairs the P-values of the intragametic LD-based statistic are smaller than those of the composite LD-based statistic (data not shown), we can still find many pairs of SNPs for which the P-values of the composite LD-based statistic are smaller than those of the intragametic LD-based statistic and logistic regression. Here, we report the detected interactions between 10 pairs of SNPs in Table 3. For all 10 pairs, the P-values of the composite LD-based statistic are smaller than those of the intragametic LD-based statistic. This indirectly suggests that both intragametic and intergametic interactions may exist, generating intragametic and intergametic LD, respectively. The composite LD is the sum of the intragametic and intergametic LD. When the intragametic and intergametic LD have the same sign, the absolute value of the composite LD is larger than that of either component. Therefore, in this case the P-values of the composite LD-based statistic will be smaller than those of the intragametic LD-based statistic.

Table 3 P-values for testing interaction between unlinked loci in the CAD study (Example 2)

Discussion

For almost a century, interaction between loci has been defined as a deviation from the sum of the main effects of the individual loci. As an alternative to the additive model of interaction, we have shown that interaction between loci can be interpreted as irreducible dependencies between them. In genetics, dependencies between loci can be understood as LD. If two loci are in linkage equilibrium (that is, independent) in the general population, their departure from equilibrium in the disease population is often attributed to the interaction between them. Therefore, the LD due to interaction between two loci can be used to measure the magnitude of the interaction.

The most popular measure of LD is the intragametic LD measure that quantifies nonrandom association of two alleles from different loci on the same haplotype. The major limitation of using the intragametic LD measure to test for interaction is that in practice, haplotype data are often unavailable. Although a number of algorithms for estimation of haplotypes have been developed, the errors of haplotype estimation are inevitable. This will lead to inaccuracy in the detection of interactions between loci. To overcome this limitation, we proposed to use the composite measure of LD based on genotype data for detection of interactions between loci.

To gain a deep understanding of intragametic and intergametic interactions, we first developed the general theory to study composite LD patterns in the disease population under two-locus disease models. We introduced a new concept of intragametic and intergametic penetrance and developed a measure of interaction between two unlinked loci, including both intragametic and intergametic interactions. The theoretic analysis of the intragametic and intergametic LD motivated us to use a composite measure of LD for developing statistics to test interactions.

We examined the distribution of the composite LD-based statistic under the null hypothesis of no interaction and calculated type 1 error rates of the proposed statistic by simulation. Our results showed that the type 1 error rates were close to the nominal significance levels. The composite LD-based statistic has two remarkable features. First, its calculation does not require linkage phase information. Therefore, its results are more reliable than those of the intragametic LD-based statistic. Second, its power may not always be less than that of the intragametic LD-based statistic. Although we showed by simulation that, in general, the composite LD-based statistic has higher power than logistic regression under the dominant and additive two-locus disease models, the critical question is whether there are situations in which the composite LD-based statistic has higher power than the traditional (intragametic) LD-based statistic. The preliminary results of the real data analysis showed that, in some cases, P-values of the composite LD-based statistic may be smaller than those of the intragametic LD-based statistic. Equations (6) and (10) show that, in theory, the composite LD-based statistic varies from half to twice the intragametic LD-based statistic, depending on the ratio of the intergametic LD to the intragametic LD. Therefore, when the intergametic LD is comparable with the intragametic LD, the composite LD-based statistic may have higher power than the intragametic LD-based statistic.

Although the composite LD-based statistic has merit, it also has potential limitations. First, in addition to interaction, Hardy–Weinberg disequilibrium (HWD) may also increase the composite LD, so small P-values of the composite LD-based statistic may be caused by HWD rather than by interaction. Although this is not a problem for association studies of two loci with the disease, it is a problem for gene–gene interaction analysis. Second, as with other population-based methods, population substructure may generate LD and hence create spurious interactions. Third, the methods presented in this report require that the two loci be unlinked.

In summary, our results suggest that the composite LD-based statistic is an alternative to the traditional logistic regression or the haplotype-based LD statistics.