Introduction

Understanding the genetic basis of human diseases is a major goal of modern genetic research. For genetically simple diseases, such as those with a Mendelian pattern of inheritance (that is, high penetrance and early onset), linkage analysis is a simple approach to detect the cosegregation of a marker locus and the disease through a pedigree. Unfortunately, monogenic diseases are only a small fraction of all human diseases; the most common diseases, such as cardiovascular disease, cancer, and neuropsychiatric disorders, have a polygenic basis and show complex interactions with environmental factors.1,2 Other complications, such as unclear Mendelian inheritance, low penetrance, and late onset, restrict the capacity of traditional linkage analysis to uncover the genetic basis of many human diseases. Thus, there is a need to develop novel analytic tools that can extend traditional family-based genetic analysis of human disease to population-based studies of complex disease.3,4

Owing to interindividual genetic variation, population-based studies are a promising approach to increase our understanding of complex genetic diseases. For example, genome-wide association studies using population samples may be used to map loci affecting complex traits.5 However, false-positive associations may be obtained if the population under study is stratified. The transmission disequilibrium test (TDT) avoids the problem of ethnic confounding of association by testing the difference between the frequency of marker alleles transmitted from heterozygous parents to the affected offspring and the frequency of marker alleles not transmitted.6 Although the original use of the TDT was to test for linkage in the presence of population association, it can be used to test any marker even if there is no prior evidence for association.7

However, the TDT is limited by the availability of DNA samples from parents of the affected individuals. This becomes a serious drawback, particularly in epidemiological studies of late-onset diseases. A population-based approach does not need case-relative pairs, and careful matching for ethnic background may circumvent the confounding by ethnicity. Because the description of genetic variation in human populations is an important prerequisite for the development of mapping strategies, an obvious question is whether a test analogous to the TDT can be carried out in population-based studies.

Mitchell8 developed a statistic (T), which measures disease–marker associations, and it can be estimated from case–control data. However, a test of significance for the T-statistic was not developed. Furthermore, this statistic can be estimated only in the unlikely situation of a population that has no serious deviations from Hardy–Weinberg proportions, a premise that cannot be achieved if the population is stratified. Thus, a more general method is needed to estimate the differential transmission of marker alleles to the affected offspring, based on population data. We developed a statistic (T-value) to estimate the proportion of transmission of a potential high-risk marker allele from heterozygous parents to the affected offspring. To estimate the T-value, it is not necessary to assume that Hardy–Weinberg equilibrium holds; in fact, it can be estimated in very diverse scenarios of population structure.

Methods

Consider a genetic marker with two codominant alleles, A1 and A2. The TDT uses parents who are heterozygous for the marker and compares the frequency with which a ‘high-risk’ marker allele (A1) is transmitted to the affected offspring with the frequency with which the alternate marker allele (A2) is transmitted to the same offspring.6 To carry out the TDT in a case–control study, we have to estimate the number of alleles, both A1 and A2, transmitted from heterozygous parents to their affected offspring.

The T-value can be written as:

where x1 and x2 are, respectively, the number of alleles A1 and A2 transmitted from heterozygous parents to the affected offspring. By definition, x1 and x2 can be calculated as

where xi,ij (i, j=1, 2) is the number of alleles Ai inherited from heterozygous parents to the affected offspring with genotype AiAj.
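Read from these definitions, and from the null value μT = ½ used later in the text, equation (1) and the decomposition of x1 and x2 presumably take the following form. This is a reconstruction from the surrounding definitions, not a copy of the typeset equations.

```latex
T = \frac{x_1}{x_1 + x_2}, \qquad
x_1 = x_{1,11} + x_{1,12}, \qquad
x_2 = x_{2,12} + x_{2,22}
```

Here x_{1,11} counts the A1 copies in A1A1 cases that came from a heterozygous parent (zero, one, or two per case), while x_{1,12} and x_{2,12} count the corresponding A1 and A2 copies in A1A2 cases.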

In population-based studies, we do not have information about the genotypes of the parents of the cases; therefore, we can assume that the different xi,ij values, as well as the T-value, are random variables. If we define ϕi,ij as the probability that one allele Ai taken from a patient with genotype AiAj had been inherited from a heterozygous parent, the probability distributions of the different xi,ij are

where P, H, and Q are the number of sampled patients with genotypes A1A1, A1A2, and A2A2, respectively.
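One natural reading of these definitions, assumed here rather than taken from the original equations, treats every allele copy carried by a case as an independent trial that succeeds with probability ϕi,ij. A sample with P, H, and Q cases carries 2P copies of A1 in A1A1 individuals, H copies of each allele in heterozygotes, and 2Q copies of A2 in A2A2 individuals, which would give binomial counts:

```latex
x_{1,11} \sim \mathrm{Bin}(2P,\ \phi_{1,11}), \qquad
x_{1,12} \sim \mathrm{Bin}(H,\ \phi_{1,12}), \qquad
x_{2,12} \sim \mathrm{Bin}(H,\ \phi_{2,12}), \qquad
x_{2,22} \sim \mathrm{Bin}(2Q,\ \phi_{2,22})
```

The independence of the two copies within one individual is a simplifying assumption of this sketch.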

The above probability distributions depend on the system of matings and the genetic structure of the sampled population. Although more complex situations can be envisioned, we will focus our analysis on four general models: (1) one population with random mating, (2) one population without random mating, (3) several populations with random mating within them, and (4) several populations without random mating within them.

One population with random mating (model T1)

The simplest situation occurs when the sampling is made on a single panmictic population; therefore, the ϕi,ij probabilities can be calculated using the Hardy–Weinberg genotype frequencies. Table 1 shows the frequencies of the different matings occurring in the population, the offspring proportions of each mating, as well as the number of A1 or A2 alleles inherited by each offspring from A1A2 heterozygous parents. According to Table 1, the different ϕi,ij probabilities are ϕ1,11(1)=ϕ1,12(1)=p2 and ϕ2,12(1)=ϕ2,22(1)=p1, where p1 and p2 are the frequencies of the A1 and A2 alleles in the sampled population, and the number 1 in parentheses refers to the first considered model. These results show that, within a random mating population, the probability of an A1 allele from a patient with genotype A1A1 or A1A2 being inherited from a heterozygous parent equals the frequency of the allele A2 in the general population. Likewise, the probability of an A2 allele from a patient with either genotype A2A2 or A1A2 being inherited from a heterozygous parent is equal to the frequency of the allele A1 in the general population.
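The result ϕ1,11(1)=ϕ1,12(1)=p2 and ϕ2,12(1)=ϕ2,22(1)=p1 can be checked numerically by enumerating every mating under Hardy–Weinberg proportions, in the spirit of Table 1. The sketch below is an independent check, not the authors' table; the function name `phi_random_mating` is hypothetical, and it assumes 0 < p1 < 1.

```python
from itertools import product


def phi_random_mating(p1):
    """Return (phi_1_11, phi_1_12, phi_2_12, phi_2_22) for a panmictic population.

    Each phi is the share of allele copies, within offspring of a given
    genotype, that were transmitted by a heterozygous (A1A2) parent.
    """
    p2 = 1.0 - p1
    # Hardy-Weinberg genotype frequencies; a genotype is a pair of alleles.
    geno_freq = {(1, 1): p1 * p1, (1, 2): 2 * p1 * p2, (2, 2): p2 * p2}
    total = {}     # all copies of an allele in offspring of a given genotype
    from_het = {}  # the subset of those copies that came from an A1A2 parent
    for (father, f_freq), (mother, m_freq) in product(geno_freq.items(), repeat=2):
        mating_freq = f_freq * m_freq
        # Each parent transmits either of its two alleles with probability 1/2.
        for a_f, a_m in product(father, mother):
            weight = mating_freq * 0.25
            child = tuple(sorted((a_f, a_m)))
            for allele, parent in ((a_f, father), (a_m, mother)):
                key = (allele, child)
                total[key] = total.get(key, 0.0) + weight
                if parent == (1, 2):
                    from_het[key] = from_het.get(key, 0.0) + weight
    keys = [(1, (1, 1)), (1, (1, 2)), (2, (1, 2)), (2, (2, 2))]
    return tuple(from_het[k] / total[k] for k in keys)


phi_T1 = phi_random_mating(0.3)
```

With p1 = 0.3 the four values come out as p2 = 0.7 for the two A1 probabilities and p1 = 0.3 for the two A2 probabilities, matching the statement in the text.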

Table 1 Frequency of the different matings occurring within a panmictic population (model T1) and within an inbred population (model T2), as well as the number of alleles A1 and A2 transmitted from heterozygous parents to the different kinds of offspring

The expected value (μ) and variance (σ2) of the random variables x1 and x2 under this model can be calculated as

The probability distribution of T can be determined by the Monte Carlo method, according to equation (1) and the probability distributions of the xi,ij. An estimator of the proportion of A1 alleles transmitted from heterozygous parents to the affected offspring can be calculated by taking the expectation μT(1) of the distribution of T; therefore
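A Monte Carlo approximation of the distribution of T might look like the following minimal sketch. It assumes that T = x1/(x1+x2) and that each xi,ij is a binomial count with one independent trial per allele copy; both are readings of the text, not the authors' stated equations. The function `simulate_T` and the sample of 100 cases are hypothetical.

```python
import random


def simulate_T(P, H, Q, phi, n_sim=2000, seed=1):
    """Draw n_sim values of T = x1 / (x1 + x2).

    phi = (phi_1_11, phi_1_12, phi_2_12, phi_2_22).
    Assumption: every allele copy is an independent Bernoulli trial.
    """
    rng = random.Random(seed)

    def binom(n, prob):
        # Binomial draw as a sum of n Bernoulli trials (standard library only).
        return sum(rng.random() < prob for _ in range(n))

    phi_1_11, phi_1_12, phi_2_12, phi_2_22 = phi
    draws = []
    for _ in range(n_sim):
        x1 = binom(2 * P, phi_1_11) + binom(H, phi_1_12)
        x2 = binom(H, phi_2_12) + binom(2 * Q, phi_2_22)
        if x1 + x2 > 0:  # T is undefined if no copy came from a heterozygous parent
            draws.append(x1 / (x1 + x2))
    return draws


# Hypothetical sample of 100 cases under model T1 with p1 = 0.40 (so q1 = 0.50).
p1 = 0.40
p2 = 1 - p1
draws = simulate_T(P=25, H=50, Q=25, phi=(p2, p2, p1, p1))
mean_T = sum(draws) / len(draws)
```

For this sample the simulated mean of T lies close to 0.6, the ratio of the expected counts 60/(60+40).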

The null hypothesis (μx1(1)−μx2(1)=0 or μT(1) = ½) of nondifferential transmission of the allele A1 from heterozygous parents to the affected offspring can be tested by the statistic

that follows, approximately, a standard normal distribution if the number of sampled gene copies is large enough. A permutation test can be performed by simulations of the probability distribution of T according to equation (1).
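As a rough illustration of the normal-approximation test, the sketch below computes the moments of x1 and x2 and a standardized difference. It assumes binomial counts, independence of x1 and x2, the first-order approximation μT ≈ μx1/(μx1+μx2), and a statistic of the form Δ = (μx1−μx2)/√(σ²x1+σ²x2). These forms are assumptions consistent with the stated null hypothesis, not the authors' equations (8) to (13); `delta_statistic` is a hypothetical name.

```python
import math


def delta_statistic(P, H, Q, phi):
    """Return (mu_T, delta, two_sided_p) under the assumptions stated above."""
    phi_1_11, phi_1_12, phi_2_12, phi_2_22 = phi
    # Expected numbers of A1 and A2 copies transmitted by heterozygous parents.
    mu_x1 = 2 * P * phi_1_11 + H * phi_1_12
    mu_x2 = H * phi_2_12 + 2 * Q * phi_2_22
    # Binomial variances, summed over the two genotype classes of each allele.
    var_x1 = 2 * P * phi_1_11 * (1 - phi_1_11) + H * phi_1_12 * (1 - phi_1_12)
    var_x2 = H * phi_2_12 * (1 - phi_2_12) + 2 * Q * phi_2_22 * (1 - phi_2_22)
    mu_T = mu_x1 / (mu_x1 + mu_x2)  # first-order approximation of E[T]
    delta = (mu_x1 - mu_x2) / math.sqrt(var_x1 + var_x2)
    two_sided_p = math.erfc(abs(delta) / math.sqrt(2))  # standard normal tails
    return mu_T, delta, two_sided_p


# Hypothetical sample: 25 A1A1, 50 A1A2, 25 A2A2 cases; model T1 with p1 = 0.40.
mu_T, delta, p_value = delta_statistic(25, 50, 25, phi=(0.6, 0.6, 0.4, 0.4))
```

In this example the expected counts are 60 and 40, so μT = 0.6 and Δ = 20/√48 ≈ 2.89, which would reject the null hypothesis at the 0.05 level.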

One population without random mating (model T2)

The departure from the Hardy–Weinberg equilibrium causes a correlation between uniting gametes within the population, which can be measured by the inbreeding index F.9 Table 1 shows the mating frequencies with an inbreeding index F. According to the values in Table 1, the different ϕi,ij probabilities are

where the number 2 in parentheses refers to the second used model. It is noteworthy that the probabilities ϕ1,12(2) and ϕ2,12(2) are the same as those calculated under the one-population-random-mating model (model T1).

Compared to model T1, the above equations show that a general effect of positive inbreeding is to decrease the probabilities of transmission from heterozygous parents to the homozygous offspring. The probabilities of transmission to the heterozygous offspring are not affected by inbreeding. In other words, because of the reduction in heterozygosity under positive inbreeding, the number of alleles transmitted from heterozygous parents will be reduced under model T2 compared to model T1.

The expectations and variances of the number of alleles A1 and A2 transmitted from heterozygous parents to the affected offspring can be estimated by equations (8), (9), (10) and (11). The testing of the null hypothesis of nondifferential transmission can be performed using the method in the one-population-random-mating model.

Several populations with random mating within them (model T3)

Let us suppose that the population under study is divided into k subpopulations of equal size, each of which is in Hardy–Weinberg equilibrium. If the subpopulation source of each sampled individual is unknown, a correction that takes into account the allele frequency differences between the subpopulations must be performed and the ϕi,ij probabilities must be averaged across all the subpopulations. After averaging the ϕi,ij probabilities, we obtain

where p1 and p2 refer to the average gene frequencies of the alleles A1 and A2 in the total population, and the number 3 in parentheses refers to the third studied model. FST is the correlation between two gametes drawn at random from each subpopulation, and it measures the degree of genetic differentiation between subpopulations.9,10 FST is equal to σp²/(p1p2), where σp² is the variance of the allele frequencies across the subpopulations. γ1 and γ2 are values of skewness of the frequency distributions of the alleles A1 and A2 over all the subpopulations; for a diallelic locus γ1=−γ2.
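The quantities entering model T3 can be computed directly from subpopulation allele frequencies. The sketch below assumes k equally sized subpopulations, the population (not sample) variance, and the standardized third moment as the skewness; the text does not say which skewness convention was used, so that choice is an assumption. The function names and the three example frequencies are hypothetical.

```python
def skewness(values):
    """Standardized third central moment (population form)."""
    k = len(values)
    mean = sum(values) / k
    var = sum((v - mean) ** 2 for v in values) / k
    third = sum((v - mean) ** 3 for v in values) / k
    return third / var ** 1.5


def fst_and_skewness(a1_freqs):
    """a1_freqs: frequency of A1 in each of k equally sized subpopulations."""
    k = len(a1_freqs)
    p1 = sum(a1_freqs) / k
    p2 = 1 - p1
    var_p = sum((p - p1) ** 2 for p in a1_freqs) / k
    fst = var_p / (p1 * p2)  # FST = variance of allele frequency / (p1 * p2)
    gamma1 = skewness(a1_freqs)
    gamma2 = skewness([1 - p for p in a1_freqs])  # A2 frequencies mirror A1
    return p1, fst, gamma1, gamma2


p1_bar, fst, gamma1, gamma2 = fst_and_skewness([0.1, 0.2, 0.6])
```

For these three hypothetical subpopulations the mean frequency is 0.3, FST is about 0.22, and the two skewness values are equal in magnitude (about 0.6) with opposite signs, as stated for a diallelic locus.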

The expectations and variances of the number of alleles A1 and A2 transmitted from heterozygous parents to the affected offspring can be estimated using the corresponding ϕi,ij(3) probabilities according to equations (8), (9), (10) and (11). The hypothesis testing of the nondifferential transmission can be done as explained in the last two models.

Several populations without random mating within them (model T4)

If the population under study is divided into k subpopulations of equal size and within them the Hardy–Weinberg equilibrium does not hold, the effects of population structure and inbreeding must be taken into account. The ϕi,ij probabilities within each subpopulation are given according to the one-population-nonrandom-mating model (T2); therefore, the corresponding values in the general population are the weighted averages of the ϕi,ij's across the subpopulations. Namely,

where FIT is the correlation between two uniting gametes relative to the total population and measures the reduction in heterozygosity of an individual relative to the total population. FIT takes into account both the effects of nonrandom mating within subpopulations (FIS) and the effects of population subdivision (FST), and they are related by the equation (1−FIT)=(1−FIS)(1−FST).10 It is noteworthy that the probabilities ϕ1,12(4) and ϕ2,12(4) are not affected by nonrandom mating within subpopulations, and they are the same as the corresponding probabilities under the several-populations-random-mating model (T3).
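The relation (1−FIT)=(1−FIS)(1−FST) can be written as two small helpers, which also make it easy to see that different (FIS, FST) pairs map to the same FIT. The function names are hypothetical, and the sketch assumes FST < 1.

```python
def f_it_from(f_is, f_st):
    """Total inbreeding from (1 - FIT) = (1 - FIS)(1 - FST)."""
    return 1 - (1 - f_is) * (1 - f_st)


def f_is_from(f_it, f_st):
    """Within-subpopulation component implied by a given FIT and FST."""
    return 1 - (1 - f_it) / (1 - f_st)


# Three different decompositions that all yield FIT = 0.10.
only_inbreeding = f_it_from(0.10, 0.0)   # FIS = FIT, FST = 0
only_subdivision = f_it_from(0.0, 0.10)  # FIS = 0, FST = FIT
mixed = f_it_from(f_is_from(0.10, 0.05), 0.05)
```

The first two cases are the extremes in which the whole departure from Hardy–Weinberg proportions is attributed to nonrandom mating or to subdivision, respectively.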

The expectations and variances of the number of alleles A1 and A2 transmitted from heterozygous parents to the affected offspring can be estimated using the corresponding ϕi,ij(4) probabilities according to equations (8), (9), (10) and (11). The hypothesis testing of the nondifferential transmission can be done as explained in the last three models.

Results

As previously described, there are several parameters that determine the value of the probabilities ϕi,ij. These include the frequency of the suspected allele in the general population (p1), the frequency of the same allele in the case sample (q1), and the different F values. To examine the behavior of the T-value, we calculated the probabilities ϕi,ij under three general conditions:

  1. The p1 and q1 frequencies were allowed to change, but the increase in the frequency of the allele A1 in cases relative to the general population was fixed at 20%.

  2. The frequency of A1 in the general population was fixed at 0.40, and the frequency of A1 in the cases was allowed to take various values.

  3. p1 and q1 were fixed at 0.40 and 0.50, respectively, but the vector (P, H, Q) was variable.

A drawback in using the different models in a case–control study is that the total departure from Hardy–Weinberg proportions in the general population can only be estimated using the control genotypes. In other words, only one estimate of the FIT value can be obtained, yet different FIS, FST pairs can result in the same FIT. Therefore, if we only know the total reduction of heterozygosity in the general population (FIT), we are not aware of the differential contributions of both the effects of nonrandom mating within subpopulations (FIS) and the effects of population subdivision (FST). A conceivable solution to this problem is to estimate the T-value under two different assumptions: one, that there is no population subdivision (FST=0) and that all the departure from Hardy–Weinberg proportions is due to the effects of nonrandom mating within the population (FIS=FIT>0); two, the general population is subdivided (FST=FIT>0), but there is random mating within each subpopulation (FIS=0).

Figure 1 shows the calculated T values by fixing the relative increase in the frequency of the allele A1 in cases relative to the general population at 20%. Since the models T2 and T3 are particular cases of the model T4, they are the upper and lower bounds of the estimated T-values according to the model T4. It is noteworthy that the effects of changes in the FIT are greater in model T3 than in model T2. In other words, model T2 is less affected by changes in the structure of the population. One reason for this result is that nonrandom mating within subpopulations does not affect the probabilities that one heterozygous case has received alleles from heterozygous parents (see equations (15) and (16)).

Figure 1

Graphs show the effect of the total population inbreeding (FIT) on the T-value when the increase in the frequency of the high-risk allele in cases is fixed at 20% relative to controls, and the allele frequencies in cases and controls are allowed to change: (a) p1=0.10 and q1=0.12; (b) p1=0.20 and q1=0.24; (c) p1=0.30 and q1=0.36; (d) p1=0.40 and q1=0.48.

If we know the FIT value, but do not know FIS and FST, the difference between the estimates of T according to models T2 and T3 can be understood as the degree of accuracy of our estimates. As the effect of nonrandom mating on the T-value is smaller than the effect of subpopulation differentiation (Figure 1), our main concern is the latter factor. According to Figure 1, model T2 and model T3 produce similar estimates of the T-value when FIT values are less than or equal to 0.25 and allele frequencies are at least 0.10. Slatkin11,12 has shown that the ratio FST/(1−FST) is roughly proportional to the divergence time between two populations, and it is equal to τ/2Ne, where τ is the divergence time in generations and Ne is the effective population size of each subpopulation, assuming equal sizes for both. For example, for an FST equal to 0.25 and an effective size equal to 500, the expected divergence time is approximately 300 generations, or about 7500 years assuming 25 years per generation. Although these estimates of divergence times must be taken with caution because of their unknown error, they do indicate that for small divergence times, the models T2, T3, and T4 produce very similar estimates of the T-value.
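The divergence-time arithmetic can be sketched by inverting FST/(1−FST) = τ/2Ne literally. This is only an illustration of the stated relation, not the authors' calculation; the function name `divergence_time` is hypothetical, and the default of 25 years per generation is the value used in the text.

```python
def divergence_time(f_st, n_e, years_per_generation=25):
    """Invert FST / (1 - FST) = tau / (2 * Ne); return (generations, years)."""
    tau = 2 * n_e * f_st / (1 - f_st)
    return tau, tau * years_per_generation


generations, years = divergence_time(f_st=0.25, n_e=500)
```

Taken literally, the relation gives about 333 generations (roughly 8300 years) for FST = 0.25 and Ne = 500, which is of the same order as the rounded figure quoted in the text; the relation is described as only roughly proportional.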

Another factor affecting the T-value is the degree of increase in the frequency of the high-risk allele in cases relative to the general population. To explore this effect, we fixed p1 at 0.40, while q1 was allowed to change from 0.50 to 0.80. According to Figure 2, an FIT ≤ 0.30 combined with any allele frequency in cases produces very similar estimates of the T-value in the three models T2, T3, and T4. Although we chose a frequency of 0.40 for the high-risk allele in the general population, less frequent alleles gave the same results (data not shown).

Figure 2

Graphs show the effect of the total population inbreeding (FIT) on the T-value when the frequency of the high-risk allele in controls is fixed at 0.40, and the allele frequencies in the cases are allowed to change: (a) q1=0.50; (b) q1=0.60; (c) q1=0.70; (d) q1=0.80.

If the allele frequencies are fixed both in the general population and in the cases, there is still another factor that can influence the estimates of the T-value: the vector (P, H, Q) in cases, that is, the number of people with genotypes A1A1, A1A2, and A2A2, respectively. For example, a sample with P=250, H=500, and Q=250 has a different (P, H, Q) vector from a sample with P=300, H=400, and Q=300; yet both samples have the same allele frequencies. The rationale for analyzing the (P, H, Q) vector is that, if the population from which the cases are sampled is structured, we expect to find a similar degree of structure in the case sample. Figure 3 shows the effect of several combinations of P, H, and Q when the frequency of the high-risk allele is fixed at 0.40 in the general population and at 0.50 in cases. A striking result is that the three models T2, T3, and T4 converge to the same T-value when the degree of departure from Hardy–Weinberg proportions is the same in both cases and the general population. This result has a very important practical implication. For example, a highly recommended first analysis in a case–control study is to test whether Hardy–Weinberg equilibrium holds in both the case and control samples. As shown above, any deviations from Hardy–Weinberg proportions affect the probabilities of transmission from heterozygous parents, but, if the factors controlling the genotype frequencies have the same effect in both the general and the affected populations, the differences between the models T2, T3, and T4 disappear.
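The point that different (P, H, Q) vectors can share one allele frequency while differing in their departure from Hardy–Weinberg proportions can be illustrated with the four vectors of Figure 3. The sketch uses the common heterozygote-deficit estimator F = 1 − Hobs/Hexp; whether the authors used this exact estimator is an assumption, and `allele_freq_and_f` is a hypothetical name.

```python
def allele_freq_and_f(P, H, Q):
    """Return (frequency of A1, heterozygote-deficit estimate F) for one sample."""
    n = P + H + Q
    q1 = (2 * P + H) / (2 * n)            # each A1A1 carries two A1 copies
    expected_het = 2 * q1 * (1 - q1)      # Hardy-Weinberg expectation
    f = 1 - (H / n) / expected_het
    return q1, f


vectors = [(250, 500, 250), (300, 400, 300), (350, 300, 350), (400, 200, 400)]
summary = {vec: allele_freq_and_f(*vec) for vec in vectors}
```

All four vectors have an A1 frequency of 0.50, while the estimated F rises from 0 to 0.2, 0.4, and 0.6 as heterozygotes become rarer.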

Figure 3

Graphs show the effect of the total population inbreeding (FIT) on the T-value when the vector (P, H, Q) is variable in cases. The number of cases was fixed at 1000, with p1=0.40, q1=0.50, and different (P, H, Q) vectors: (a) vector (250, 500, 250); (b) vector (300, 400, 300); (c) vector (350, 300, 350); (d) vector (400, 200, 400).

In most situations, the exact structure of the population under study is unknown and the FIT value is commonly the sole available estimate of population stratification. Since the true model may be unknown in the majority of studies, we assessed the type I error and power of the four models under different population scenarios (Table 2). We assumed the same FIT for both general and affected populations; 100 000 simulations were used and a random sample of 100 cases was taken in each simulation. The frequency of the high-risk allele in the general population was fixed at 0.10. Higher allele frequencies did not appreciably affect type I error, but increased the power of the different models (data not shown). The critical point to reject the null hypothesis (Δ=0, see equation (13)) was chosen to attain a significance level of 0.05 (±1.96 standard deviations) for the true model under the different population scenarios.

In the simplest situation, there is no population subdivision and there is random mating within a unique population (FIS=FST=FIT=0). The four models were equivalent in this first population setting. With nonrandom mating within a single population (FIS=FIT>0, FST=0), the true model T2 was equal to the model T4. In this case, model T1 showed a higher type I error (7%) and model T3 had a lower significance level (2−4%). The power of the four models was similar under this second population scenario and did not change greatly for FIT values greater than 0.10 (data not shown). When the population is subdivided, but there is random mating within each subpopulation (FST=FIT>0, FIS=0), the true model T3 was equal to the model T4. The T1 and T2 models tended to give a higher false-positive rate. Although the power of the four models was similar, model T1 showed slightly superior power due to its high type I error (7−11%). Large differences were not observed for FIT values greater than 0.10 (data not shown). In the last situation, with several subpopulations and nonrandom mating within them (FIS>0, FST>0, FIT>0), the T1, T2, and T3 models showed a higher false-positive rate compared to the true model T4. However, the difference between the model T3 and the true model T4 was at most 1.5% for FIT=0.10. Similar results were observed for greater values of population stratification (data not shown).
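The first scenario (random mating, no subdivision) can be mimicked with a small simulation of the type I error of model T1. This is a simplified sketch, not the authors' simulation design: it assumes the population allele frequency p1 is known rather than estimated from controls, binomial counts for the xi,ij, and a statistic of the form Δ = (μx1−μx2)/√(σ²x1+σ²x2). The function name and the reduced number of replicates are hypothetical choices made to keep the run short.

```python
import math
import random


def type1_error_T1(p1=0.10, n_cases=100, n_sim=5000, z_crit=1.96, seed=7):
    """Share of null samples with |delta| > z_crit under model T1."""
    rng = random.Random(seed)
    p2 = 1 - p1
    rejections = 0
    for _ in range(n_sim):
        P = H = Q = 0
        # Under the null, cases are drawn from Hardy-Weinberg proportions.
        for _ in range(n_cases):
            copies = (rng.random() < p1) + (rng.random() < p1)
            if copies == 2:
                P += 1
            elif copies == 1:
                H += 1
            else:
                Q += 1
        # Model T1: phi for A1 copies is p2, phi for A2 copies is p1.
        mu_x1 = (2 * P + H) * p2
        mu_x2 = (H + 2 * Q) * p1
        var = (2 * P + H) * p2 * p1 + (H + 2 * Q) * p1 * p2
        delta = (mu_x1 - mu_x2) / math.sqrt(var)
        if abs(delta) > z_crit:
            rejections += 1
    return rejections / n_sim


rate = type1_error_T1()
```

Under these assumptions the rejection rate should fall near the nominal 0.05, with some deviation expected because the allele count is discrete in a sample of 100 cases.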

Table 2 Type I error and power of the Δ-statistic with a random sample of 100 cases under four different population scenarios, and a high-risk allele frequency of 0.10 in the general population

Conclusions

We presented a novel population-based method to estimate the differential transmission of marker alleles from heterozygous parents to affected offspring. This method expands the previous procedure proposed by Mitchell8 by providing an estimate of the differential transmission of alleles from heterozygous parents to their affected offspring and a statistical test that can be used under different population structures. Unlike Mitchell's approach, the current test does not assume Hardy–Weinberg proportions and can be adjusted to deal with deeper levels of population stratification. This method attains the same goals as the traditional family-based TDT, but it eliminates the need to recruit family members, who are particularly difficult to obtain in late-onset diseases or large population-based epidemiological studies. Our analyses showed that model T3 was the most robust approach under several scenarios of population structure. Furthermore, for FIT values of 0.30 or less, all the models tested gave very similar estimates of the T-value. The usefulness of this procedure can be confirmed by comparing the results obtained from traditional family-based TDT with those from population-based TDT.

Population association studies between two loci that are linked are widely used to map loci affecting complex traits.13,14 One of the major advantages of the population-based studies over the family approaches is that they avoid recruitment of family members, which is particularly difficult when studying diseases with late onset. Case–control studies have often provided the first line of evidence that a putative disease susceptibility locus or a marker in linkage disequilibrium exists; for example, the observed association between the APOE genotype and Alzheimer's disease.15 However, the use of the case–control approach to uncover disease–marker associations has been disappointing. In general, initial reports of strong associations cannot be reproduced or are not supported by larger well-conducted studies.14,16 The results have been inconsistent perhaps due to modest gene effects per se, but more likely to problems with study design such as low statistical power or the lack of a comparable control population to determine the underlying gene frequencies. Another explanation that can limit the validity of the epidemiological case–control design relates to the potential for confounding that can result from population stratification or genetic admixture.17,18,19 For example, if the population under study is heterogeneous, if mating does not occur randomly, or if the cases and controls are not ethnically balanced, a coincidental allele frequency difference can emerge. Such an artifact is most likely to happen when the disease occurs more frequently in an unidentified subpopulation, which also differs, by chance, in the frequency of the tested allele.

To correct for population stratification, it is necessary to detect it and quantify it. Pritchard and Rosenberg20 have shown that approximately 15–20 unlinked microsatellite loci are needed to test for stratification, and hundreds of markers are required to identify subpopulations of recent divergence time.21,22 Here we propose a simpler approach to measure population stratification. Although it is desirable to know the exact genetic structure of the population under study, a simple measure such as FIT of total departure from Hardy–Weinberg proportions is useful as a rough estimate of population stratification. Since our method, specifically model T3, is robust to major departures from Hardy–Weinberg proportions, either by nonrandom mating within subpopulations or population subdivision, the proposed test can be used in multiple epidemiological settings. By using a few unlinked loci, Wright's F-statistics9,10 provide a simple and reliable way to assess the genetic structure of the population under study. A limitation of the use of the F-statistics is that without a priori information about the population structure, it would be impossible to discriminate between the effects of nonrandom mating within subpopulations and those due to population subdivision. An approximation can be made by assuming that the total departure from Hardy–Weinberg proportions in the general population (FIT) is caused entirely by either nonrandom mating within subpopulations (FST=0, FIS=FIT>0) or population subdivision (FIS=0, FST=FIT>0). Based on the simulations presented in this study, both approximations can provide essentially the same estimates of the T-value when the values for FIT are as large as 0.30, but the type I error of the different models may change depending on the exact population structure.

Specifically, in the presence of nonrandom mating or population subdivision, model T1 overestimates the number of alleles transmitted from heterozygous parents, leading to a higher type I error. Model T3 showed the lowest false-positive rate (2.4–6.5%) with no significant lack of power compared to the true model under the different population scenarios. In other words, when the exact genetic structure of the population is unknown, the most conservative approach is to assume that total departure from Hardy–Weinberg proportions is only due to the effects of population subdivision. Also, for theoretical reasons, we expect that, within subpopulations, most of the matings occur independently of the marker genotypes; therefore, population stratification will be the main factor affecting the FIT value. Although the degree of genetic differentiation between populations depends, in part, on the marker mutation rate, it has been shown that differences among major human groups constitute only 3–9% of the total genetic variation by using either high-mutation-rate loci such as microsatellites23 or low-mutation-rate loci such as allozyme genes.24 Therefore, it is very unlikely that FIT exceeds a value of 0.30 in a population-based approach with careful matching for ethnic background. Even though the proposed method can be used when knowledge about population subdivision is limited, this approach is flexible enough to incorporate demographic or historic data about the structure of the population when available. With this approach, it will also be possible to carry out an analysis of molecular variance (AMOVA)25 to quantify the different components of variance that contribute to the total inbreeding in the general population.

Although the proposed method uses population data, the logistics of this procedure diverge greatly from the traditional association studies. Traditional association studies are designed to test for differences in the gene frequencies between cases and controls. In contrast, the proposed method evaluates the differential transmission of marker alleles from heterozygous parents to the affected offspring. In fact, it is possible, depending on the model used, to have no differences in the marker allele frequencies between cases and controls, and still obtain a T-value different from ½. It is necessary to emphasize that because the controls are used to estimate population parameters (frequency of the high-risk allele and FIT), a properly conducted study should require that controls represent the population that gives rise to the cases. Thus, standard epidemiological methods such as matching by ethnic background must be used in the selection of the controls.

In summary, we presented a population-based TDT that attains the same goals as the traditional family-based TDT but without the recruitment of family members, who are particularly difficult to obtain in late-onset diseases. In simulations, model T3 proved to be the most robust approach under several scenarios of population structure. Although it can be useful to know the exact genetic structure of the sample population, different models gave very similar estimates of the T-value for FIT values of 0.30 or less. Further work is needed to compare the results obtained from traditional family-based TDT and population-based TDT.