Introduction

Linkage disequilibrium has been of great value in two areas of theoretical genetics studies: mapping quantitative trait loci (QTL) (Lander & Schork, 1994) and marker-assisted selection (MAS) (Lande & Thompson, 1990). In principle, the objectives of QTL location and MAS could be essentially achieved by detecting significant linkage disequilibrium between genetic loci affecting quantitative variation and polymorphic genetic marker loci as the first step.

Many researchers have focused on statistical methods for detecting the presence of or estimating the coefficient of linkage disequilibrium between two or more loci, at each of which there may be two or more alleles segregating. Hill (1974) developed likelihood-based procedures for estimating the coefficient of linkage disequilibrium between two loci in a finite random mating population. Brown (1975) established a theoretical framework for the sample sizes required to detect the disequilibrium by the use of data on gametic and zygotic frequencies. When there are multiple alleles segregating at the loci, a statistical procedure was suggested in Weir & Cockerham (1978) for calculating the power of testing the linkage disequilibrium. Furthermore, Weir (1979) presented a comprehensive discussion of the efficiency of using different sorts of data and statistical strategies from which linkage disequilibrium could be detected or estimated. These theories or techniques for inferences about linkage disequilibrium are, however, restricted to the circumstance where gametic or genotypic frequencies are observed directly.

The difficulties encountered in modelling linkage disequilibria involved with quantitative trait loci are mainly caused by the unavailability of genotypic data on the traits. Hill & Robertson (1966, 1968) demonstrated the predictability of the expected dynamics of linkage disequilibrium between a pair of linked QTLs in finite populations with or without selection. Since the abundance of genetic polymorphisms at the DNA molecular level was discovered in nearly all organisms, many statistical methods have been suggested for detecting linkage disequilibrium between a quantitative trait locus and a genetic marker locus segregating in populations with various structures (Soller & Genizi, 1978; Luo, 1993; Knott, 1994; Le Roy & Elsen, 1995). In these studies, linkage disequilibria between a QTL and a marker locus were assumed to be generated by hybridization between inbred lines or strains, i.e. the putative QTL was linked to the marker locus.

In addition to hybridization, linkage disequilibrium between a pair of loci can be produced by random drift, mutation, selection, merging of populations and by nonrandom mating, although the magnitude of the disequilibrium is maintained by the recombination frequency between the loci (e.g. Hartl & Clark, 1989). The aim of this paper is to develop two theoretical procedures with generality for detecting linkage disequilibrium between a QTL and a genetic marker locus and for formulating the statistical powers of the relevant statistical tests. Numerical analyses based upon intensive simulation studies are used to illustrate the validation of theoretical analyses and to confirm the accuracy of theoretical predictions.

Model

The method involves analysing a population of census size n. Two autosomal loci are assumed: one affects a quantitative trait (QTL) whereas the other is a codominant marker which has no direct effect on the trait. The two alleles are denoted by M and m at the marker locus and by A and a at the QTL. The phenotype of the trait (Z) is assumed to have an environmental variance σ2e and to be normally distributed, although this is only necessary where stated. Three genotypes at the QTL, say AA, Aa and aa, are assumed to affect the quantitative trait by d, h and −d, respectively. The frequencies of M and A in the population are denoted by p and q, respectively. Genotypic value at the marker locus is denoted by T which is the number of alleles M.

The distribution of the QTL genotypes within each of three possible marker genotypes is illustrated in Table 1, where Q (or R) is the frequency of allele A at the QTL among chromosomes carrying M (or m), which is a function of allelic frequencies at the marker and QTL and the linkage disequilibrium between the two loci, denoted by D. The relationships among Q, R and D can be derived by simple algebra: Q=q+D/p, R=q−D/(1−p) and D=p(1−p)(Q−R).
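The algebra relating Q, R and D can be checked numerically. The following is a minimal sketch; the function name and the parameter values are illustrative assumptions and are not taken from Table 2.

```python
def conditional_qtl_freqs(p, q, D):
    """Return (Q, R): the frequency of QTL allele A among chromosomes
    carrying marker allele M and marker allele m, respectively."""
    Q = q + D / p
    R = q - D / (1.0 - p)
    if not (0.0 <= Q <= 1.0 and 0.0 <= R <= 1.0):
        raise ValueError("D is outside the range allowed by p and q")
    return Q, R


# Illustrative values only (assumed for this sketch).
p, q, D = 0.3, 0.5, 0.05
Q, R = conditional_qtl_freqs(p, q, D)

# Recover D and q from Q and R to confirm the identities in the text.
D_back = p * (1.0 - p) * (Q - R)
q_back = p * Q + (1.0 - p) * R
```

The range check reflects the fact that, for given p and q, only a bounded interval of D yields valid conditional frequencies.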

Table 1 Distribution of genotypes at the marker locus and QTL. p is the frequency of marker allele M, and Q (or R) represents the frequency of allele A at the QTL among chromosomes carrying the marker allele M (or m); d and h are additive and dominance effects at the QTL

It must be noticed that the theoretical model described in Table 1 implies random union of gametes with respect to the genotypes at both the marker locus and the QTL.
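Under random union of gametes, the joint distribution of marker and QTL genotypes follows from the four gamete frequencies pQ, p(1−Q), (1−p)R and (1−p)(1−R). The sketch below builds that joint distribution by pairing gametes independently; it is an illustration of the assumption stated above, and the function name and example values are hypothetical.

```python
from itertools import product


def joint_genotype_table(p, Q, R):
    """Return P(T = t, G = g), where t is the number of M alleles and
    g is the number of A alleles carried by an individual."""
    # Gametes are coded as (carries M, carries A) with their frequencies.
    gametes = {
        (1, 1): p * Q,
        (1, 0): p * (1.0 - Q),
        (0, 1): (1.0 - p) * R,
        (0, 0): (1.0 - p) * (1.0 - R),
    }
    table = {(t, g): 0.0 for t in range(3) for g in range(3)}
    # Random union of gametes: the two gametes are drawn independently.
    for (g1, f1), (g2, f2) in product(gametes.items(), repeat=2):
        t = g1[0] + g2[0]
        g = g1[1] + g2[1]
        table[(t, g)] += f1 * f2
    return table


# Illustrative values only (assumed for this sketch).
p, Q, R = 0.4, 0.7, 0.3
table = joint_genotype_table(p, Q, R)
marginal_heterozygote = sum(table[(1, g)] for g in range(3))
```

Dividing each row by the corresponding marker genotype frequency gives the within-marker distribution of QTL genotypes, for example Q², 2Q(1−Q) and (1−Q)² within MM.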

Theoretical analyses

The statistical model for the kth individual with the jth QTL genotype and the ith marker genotype in the population can be written as follows

Zijk = μ + βi + ωij + εijk    (1)

where μ is the population mean, βi is the effect of marker genotype i, ωij is the effect of the jth QTL genotype within the ith marker genotype, and εijk is the residual effect whose distribution is normal with mean zero and variance σ2e. The residual variance accounts for variation of polygenes which are in linkage equilibrium with the marker and for the environmental variation. Under model (1), the population can be analysed by either of the following procedures.

Analysis of variance

In model (1), the between- and within-marker genotype effects might be regarded as random effects (e.g. Jayakar, 1970; Hill, 1975), and then it can be worked out that the expected variance component between the marker genotypes is

and the expected within-marker genotype variance component is

It has been shown that expected mean squares in the analysis of variance model with random effects could be biased downwards because of a highly unbalanced hierarchical structure of the data (Soller & Genizi, 1978; Luo, 1993; Knott, 1994). An assumption of fixed QTL effect has already been widely made in these studies and in Knapp & Bridges (1990).

Under the fixed model of the QTL effect, it can be readily shown that the between-marker genotype effects are

The eqns (2) and (3) illustrate that under either the random or fixed model of the QTL effect, significant variation between the marker genotypes in the QTL effect is an indicator of the presence of linkage disequilibrium between the marker and QTL. The expected mean squares between and within marker genotypes under the fixed model are given in Appendix I.
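The link between marker genotype differences and linkage disequilibrium can be illustrated by computing the expected trait value within each marker genotype class directly from the model description (QTL allele A at frequency Q on M chromosomes and R on m chromosomes, genotypic values d, h and −d). This is a sketch under those assumptions rather than a transcription of eqn (3); the function name and numerical values are hypothetical.

```python
def marker_genotype_means(p, q, D, d, h):
    """Expected QTL genotypic value within marker genotypes MM, Mm and mm."""
    Q = q + D / p
    R = q - D / (1.0 - p)

    def mean(u, v):
        # u and v are the frequencies of allele A on the two chromosomes.
        return (u * v * d
                + (u * (1.0 - v) + v * (1.0 - u)) * h
                - (1.0 - u) * (1.0 - v) * d)

    return {"MM": mean(Q, Q), "Mm": mean(Q, R), "mm": mean(R, R)}


# Illustrative values only (assumed for this sketch).
means_equilibrium = marker_genotype_means(p=0.5, q=0.5, D=0.0, d=2.0, h=1.0)
means_disequilibrium = marker_genotype_means(p=0.5, q=0.5, D=0.1, d=2.0, h=0.0)
```

With D = 0 the three class means coincide, so any between-marker variation in the trait is attributable to sampling alone; with D ≠ 0 and an additive QTL the means are ordered by the number of M alleles.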

Regression analysis

Lande & Thompson (1990) suggested the use of regression of phenotypic records of the quantitative trait on the number of alleles of marker loci as a marker score in a selection index of marker-assisted selection of a quantitative trait. In the present model, the regression coefficient is

It is clear that significance of the regression coefficient can be used to infer the presence of linkage disequilibrium. A statistical test of significance of the regression coefficient requires its variance. When the two variables (i.e. Z and T in the present context) involved in the regression analysis are normally distributed, the variance of the regression coefficient is simply calculated as

where r is the correlation coefficient between Z and T, σ2Z is the phenotypic variance of the trait and σ2T is the genetic variance at the marker locus. However, because the number of marker alleles (M) carried by each individual is a discrete random variable, and its median and arithmetic mean may not be consistent if the frequencies of the marker alleles are not equal, the normality approximation of its distribution may not be appropriate. In addition, the distribution of Z is not normal but a mixture of three normal subpopulations as described in model (1). A general formula for calculation of the sample variance of the regression coefficient can be derived following Kendall et al. (1983, p. 325):

where Cov(T, Z) and σ2T are the sample covariance between T and Z and the sample variance of T, respectively, and Cov[X, Y] and Var[X] represent operators of sample covariance and variance. Appropriate use of eqn (5.2) requires that the sample variance and covariance of Cov(T, Z) and σ2T are of order n⁻¹, and this will be investigated in the following numerical analysis. Calculations of the variances and covariances involved in eqn (5.2) are demonstrated in Appendix II.
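The dependence of the regression coefficient on D can be made concrete by computing Cov(T, Z)/Var(T) from the joint distribution of gamete pairs. Under random union of gametes, Var(T) = 2p(1−p) and Cov(T, Z) = 2D[d + (1−2q)h], which gives b = D[d + (1−2q)h]/[p(1−p)]. This closed form is derived here from the model assumptions and is offered as a sketch, not as a quotation of eqn (4); the function names and parameter values are hypothetical.

```python
def regression_coefficient_enumerated(p, q, D, d, h):
    """Population regression of Z on T, obtained by enumerating gamete pairs."""
    Q = q + D / p
    R = q - D / (1.0 - p)
    # Each gamete: (carries M, carries A, frequency).
    gametes = [
        (1, 1, p * Q),
        (1, 0, p * (1.0 - Q)),
        (0, 1, (1.0 - p) * R),
        (0, 0, (1.0 - p) * (1.0 - R)),
    ]
    value = {2: d, 1: h, 0: -d}  # genotypic values of AA, Aa and aa
    e_t = e_z = e_tz = e_tt = 0.0
    for m1, a1, f1 in gametes:
        for m2, a2, f2 in gametes:
            w = f1 * f2
            t = m1 + m2
            z = value[a1 + a2]
            e_t += w * t
            e_z += w * z
            e_tz += w * t * z
            e_tt += w * t * t
    return (e_tz - e_t * e_z) / (e_tt - e_t ** 2)


def regression_coefficient_closed_form(p, q, D, d, h):
    """Closed form derived in this sketch under random union of gametes."""
    return D * (d + (1.0 - 2.0 * q) * h) / (p * (1.0 - p))


# Illustrative values only (assumed for this sketch).
args = dict(p=0.3, q=0.4, D=0.05, d=1.0, h=0.5)
b_enum = regression_coefficient_enumerated(**args)
b_closed = regression_coefficient_closed_form(**args)
```

The residual environmental deviation is independent of T and therefore does not enter the covariance, which is why only the genotypic values appear in the enumeration.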

Prediction of power

In the analysis of variance, it has been shown that the linkage disequilibrium between the marker and QTL can be detected through testing the significance of the expected mean square between marker genotypes (EMSβ) against that within marker genotypes (EMSω). The power of the F statistical test can be predicted from the probability

Pr{Fv1,v2(δF) > Fα;v1,v2}    (6.1)

where Fv1,v2(δF) represents a noncentral F-variable with degrees of freedom v1 and v2 and noncentrality parameter δF, and Fα;v1,v2 stands for the upper α-point of a central F-variable with the same degrees of freedom. These distribution parameters can be determined following Johnson & Kotz (1970, p. 189ff.) as v1=2, v2=n−1 and

The power function (6.1) can be evaluated using the cumulative distribution of the noncentral F-distribution which is expressed in terms of an infinite series of multiples of incomplete beta functions as given in Johnson & Kotz (1970, p. 192).
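For v1 = 2 the series of incomplete beta functions reduces to finite sums, so the power can be evaluated without special-function libraries. The sketch below assumes the common convention in which the noncentral F is a Poisson mixture with mean δF/2; if δF is defined on a different scale it must be converted first. The function name, the truncation tolerance and the example values are assumptions made for illustration.

```python
import math


def f_test_power_v1_2(delta, v2, alpha=0.05, tol=1e-12, max_terms=10000):
    """Power of an F test with 2 and v2 degrees of freedom.

    delta is the noncentrality parameter (Poisson-mixture mean delta / 2).
    """
    lam = delta / 2.0
    if lam > 700.0:
        raise ValueError("delta too large for this simple series")
    b = v2 / 2.0
    # For a central F(2, v2) variable, P(F > f) = (1 + 2f/v2)^(-v2/2),
    # so the upper alpha-point has a closed form.
    f_crit = b * (alpha ** (-1.0 / b) - 1.0)
    y = v2 / (v2 + 2.0 * f_crit)  # satisfies y**b == alpha
    weight = math.exp(-lam)       # Poisson probability of j = 0
    c = 1.0                       # c_i = C(b + i - 1, i) * (1 - y)**i
    partial = 1.0                 # running sum of c_0 .. c_j
    power = 0.0
    cum_weight = 0.0
    for j in range(max_terms):
        # Tail probability of the j-th mixture component is alpha * partial.
        power += weight * alpha * partial
        cum_weight += weight
        if 1.0 - cum_weight < tol:
            break
        i = j + 1
        c *= (b + i - 1.0) / i * (1.0 - y)
        partial += c
        weight *= lam / i
    return power


# Illustrative values only (assumed for this sketch).
power_null = f_test_power_v1_2(0.0, 100)
power_small = f_test_power_v1_2(5.0, 100)
power_large = f_test_power_v1_2(15.0, 100)
```

When δF = 0 the function returns α, as expected for a test of nominal size α, and the power rises monotonically with δF.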

When the linkage disequilibrium is detected by testing the significance of the regression coefficient given by eqn (4) the corresponding power can be predicted from the probability

where tv(δt) represents a random variable with noncentral Student's t-distribution of v degrees of freedom and noncentrality parameter δt, and tα/2;v is the upper α/2 point of a central t-variable with the same degrees of freedom. The value of v equals n−2 and the noncentrality parameter is given by

(Johnson & Kotz, 1970, p. 201ff.). In the expression above, Γ(.) is a gamma function, b and σb are, respectively, the regression coefficient and its standard deviation, which could be estimated using either eqns (5.1) or (5.2). The influence of using these different variance predictors will be discussed in the following numerical studies. The power function (7.1) can be evaluated by calculating the cumulative distribution of the noncentral t-distribution in terms of confluent hypergeometric functions discussed in Amos (1964) or Owen (1968).
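For large n the noncentral t-distribution is close to a normal distribution centred on its noncentrality parameter, which gives a quick approximation to the power of a two-sided test of the regression coefficient. The sketch below uses that large-sample approximation with δ = b/σb; it does not reproduce the exact noncentral t calculation or the gamma-function factor in the expression above, and the function name and example values are assumptions.

```python
from statistics import NormalDist


def regression_test_power_normal_approx(b, sigma_b, alpha=0.05):
    """Approximate power of a two-sided test of b = 0 for large n."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    delta = b / sigma_b
    # Probability that the standardized estimate falls in either tail.
    return std_normal.cdf(delta - z_crit) + std_normal.cdf(-delta - z_crit)


# Illustrative values only (assumed for this sketch).
power_at_null = regression_test_power_normal_approx(0.0, 1.0)
power_weak = regression_test_power_normal_approx(1.0, 1.0)
power_strong = regression_test_power_normal_approx(3.0, 1.0)
```

The approximation is expected to be least reliable when n is small, where the exact noncentral t calculation should be preferred.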

Numerical analyses

Simulation study

In order to confirm the previous theoretical predictions of statistical powers for detection of the linkage disequilibrium, populations were simulated for 12 different sets of parameters as summarized in Table 2. For each set of parameters, the joint genotypes at both the marker locus and the QTL for an individual were sampled from a multinomial distribution with the probability parameters as shown in Table 1 and the given sample size n. Once the marker–QTL joint genotype was determined, the phenotypic record for an individual was generated by its genotypic value of the QTL plus a random number sampled from a normal distribution of mean zero and variance σ2e.
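The sampling scheme can be sketched as follows. Instead of drawing the joint genotype from a multinomial distribution, the sketch draws two gametes independently for each individual, which yields the same joint distribution under random union of gametes. All function names, the seed and the parameter values are assumptions for illustration and do not correspond to any row of Table 2.

```python
import random


def simulate_population(n, p, q, D, d, h, sigma2_e, rng):
    """Return a list of (T, Z) records for n individuals."""
    Q = q + D / p
    R = q - D / (1.0 - p)
    gametes = [(1, 1), (1, 0), (0, 1), (0, 0)]  # (carries M, carries A)
    weights = [p * Q, p * (1.0 - Q), (1.0 - p) * R, (1.0 - p) * (1.0 - R)]
    value = {2: d, 1: h, 0: -d}  # genotypic values of AA, Aa and aa
    sd_e = sigma2_e ** 0.5
    records = []
    for _ in range(n):
        (m1, a1), (m2, a2) = rng.choices(gametes, weights=weights, k=2)
        t = m1 + m2
        z = value[a1 + a2] + rng.gauss(0.0, sd_e)
        records.append((t, z))
    return records


def sample_slope(records):
    """Least-squares regression coefficient of Z on T."""
    n = len(records)
    mean_t = sum(t for t, _ in records) / n
    mean_z = sum(z for _, z in records) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in records)
    s_tz = sum((t - mean_t) * (z - mean_z) for t, z in records)
    return s_tz / s_tt


# Illustrative run (assumed values); expected slope is D*d/(p*(1-p)) = 0.4.
rng = random.Random(1)
records = simulate_population(5000, 0.5, 0.5, 0.1, 1.0, 0.0, 1.0, rng)
b_hat = sample_slope(records)
```

Passing an explicit random.Random instance keeps each replicate reproducible without touching the global random state.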

Table 2 Parameters defining the 12 populations considered in numerical analyses, where n is the census population size, p and q are the frequencies of alleles M and A, D is the coefficient of linkage disequilibrium between the marker locus and QTL, h2 is the heritability of the QTL and φ is the dominance ratio at the QTL

The simulation program used in the present study can be easily run with different values of the allelic frequencies at both the marker locus and QTL, the additive effect and the dominance level at the QTL and the census population size. For simplicity, the QTL genotypic effects were expressed in terms of the QTL heritability (i.e. the proportion of genetic variance at the QTL to a given magnitude of phenotypic variance of the trait, which was assigned a constant value of 100).
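One way to convert a QTL heritability into genotypic effects is to use the standard single-locus decomposition of genetic variance under random mating, σ2G = 2q(1−q)[d + (1−2q)h]² + [2q(1−q)h]². The sketch below assumes that the dominance ratio is φ = h/d and that the residual variance is the phenotypic variance minus the QTL variance; both are assumptions about the parameterization, and the function name and example values are hypothetical.

```python
import math


def qtl_effects_from_heritability(h2, phi, q, sigma2_p=100.0):
    """Return (d, h, sigma2_e) for a QTL explaining a fraction h2 of sigma2_p."""
    het = 2.0 * q * (1.0 - q)
    # Genetic variance per unit d**2: additive part plus dominance part.
    scale = het * (1.0 + (1.0 - 2.0 * q) * phi) ** 2 + (het * phi) ** 2
    sigma2_g = h2 * sigma2_p
    d = math.sqrt(sigma2_g / scale)
    return d, phi * d, sigma2_p - sigma2_g


# Illustrative values only (assumed for this sketch).
h2, phi, q = 0.1, 0.5, 0.3
d, h, sigma2_e = qtl_effects_from_heritability(h2, phi, q)

# Check against the variance computed directly from genotype frequencies.
freqs = [q * q, 2.0 * q * (1.0 - q), (1.0 - q) ** 2]
values = [d, h, -d]
mean_g = sum(f * v for f, v in zip(freqs, values))
var_g = sum(f * v * v for f, v in zip(freqs, values)) - mean_g ** 2
```

The direct check confirms that the returned d and h reproduce the requested genetic variance at the chosen allele frequency.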

Each parameter set was repeated 1000 times. The statistics involved in the power calculation were estimated as the mean of the repeated simulations, and the corresponding standard error of these means. Each set of the simulation data was used to perform analysis of variance and regression analysis. Calculating the frequency of the significant statistical tests of these two different analyses in the repeated simulation trials gives simulated observations of the power, as has been carried out in Carbonell et al. (1992).
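The simulated power is simply the proportion of replicates in which the test is significant. The following sketch does this for the regression test, using the usual least-squares t-statistic and, as a large-sample simplification, a normal critical value in place of the t critical value with n−2 degrees of freedom. The function names, seed, number of replicates and parameter values are assumptions chosen to keep the example fast.

```python
import random
from statistics import NormalDist


def simulate(n, p, q, D, d, h, sd_e, rng):
    """Draw n (T, Z) records by pairing two independent gametes."""
    Q = q + D / p
    R = q - D / (1.0 - p)
    gametes = [(1, 1), (1, 0), (0, 1), (0, 0)]
    weights = [p * Q, p * (1.0 - Q), (1.0 - p) * R, (1.0 - p) * (1.0 - R)]
    value = {2: d, 1: h, 0: -d}
    out = []
    for _ in range(n):
        (m1, a1), (m2, a2) = rng.choices(gametes, weights=weights, k=2)
        out.append((m1 + m2, value[a1 + a2] + rng.gauss(0.0, sd_e)))
    return out


def slope_t_statistic(records):
    """t-statistic for the regression coefficient of Z on T."""
    n = len(records)
    mean_t = sum(t for t, _ in records) / n
    mean_z = sum(z for _, z in records) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in records)
    s_tz = sum((t - mean_t) * (z - mean_z) for t, z in records)
    s_zz = sum((z - mean_z) ** 2 for _, z in records)
    b = s_tz / s_tt
    residual_ms = (s_zz - b * s_tz) / (n - 2)
    return b / (residual_ms / s_tt) ** 0.5


def empirical_power(reps, alpha, rng, **params):
    """Fraction of replicates in which the two-sided test rejects b = 0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = sum(
        abs(slope_t_statistic(simulate(rng=rng, **params))) > z_crit
        for _ in range(reps)
    )
    return hits / reps


# Illustrative settings (assumed): 300 replicates rather than 1000.
rng = random.Random(7)
common = dict(n=200, p=0.5, q=0.5, d=1.0, h=0.0, sd_e=1.0)
power_no_ld = empirical_power(300, 0.05, rng, D=0.0, **common)
power_with_ld = empirical_power(300, 0.05, rng, D=0.15, **common)
```

With D = 0 the rejection rate estimates the type I error and should lie near α, which provides a basic check on the test before powers are compared.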

Results

Table 3 lists the averages of the mean squares and their corresponding standard errors over 1000 replicates of simulations, together with the mean squares predicted from calculations based on the theoretical analyses developed in the present study. The theoretical predictions are in good agreement with the simulated observations, validating the theoretical model presented here. In Table 3, simulated observations of the powers of statistically testing for linkage disequilibrium between the marker and the QTL are also shown together with the theoretical predictions for all 12 populations. The theoretical calculations of the power provided adequate predictions of the corresponding simulated values.

Table 3 Numerical results of analysis of variance: expected mean squares between marker genotypes (EMSβ) and within marker genotype (EMSω) estimated from simulations, together with their corresponding standard errors, and predicted from theoretical calculation, as well as the observed powers and their corresponding theoretical predictions

Table 4 illustrates the estimates of sample variances of the regression coefficients calculated as the average of repeated simulations and by the use of the theoretical predictions. Among the three estimates of the variance observed from the simulation studies, ¯σ2b was the variance of the 1000 regression coefficients calculated from repeated simulations, whereas ^σ2b and σ2b were the averages of the observed variances of the regression coefficient, which were calculated by use of eqns (5.1) and (5.2), respectively. Theoretical predictions of these variances were derived in correspondingly similar ways. Numerical calculation indicates that the sample variance and covariance of the covariance between the phenotypic record (Z) and the number of the marker allele M (T) were in the range of 0.0021 to 0.0659, which is approximately of the order of n⁻¹, for the circumstances considered here, suggesting the appropriateness of using the variance prediction based upon formula (5.2). It can be seen from Table 4 that theoretical prediction of the variance of the regression coefficient using either eqns (5.1) or (5.2) provides an adequate approximation for the simulated value in all 12 simulated populations. This demonstrates that possible violation of normality of these regression variables did not cause significant bias of the variance estimation and thus confirms the reliability of using eqn (5.1) as a simple predictor of the variance of the regression coefficient.

Table 4 Variances of the regression coefficient calculated: (i) from the sample variance of 1000 observed regression coefficients (¯σ2b); (ii) from the average of each simulated value, where ^σ2b and σ2b were derived using eqns (5.1) and (5.2) in the text, respectively, together with their corresponding standard errors. These variances were also predicted using the corresponding theoretical formula

The averages of the regression coefficients over the 1000 replicates of simulations and their corresponding standard errors are shown together with the theoretical predictions of these coefficients in Table 5. Comparisons of the coefficient estimates between the theoretical values and simulation averages show a good agreement. Theoretical calculation of the power provides an accurate prediction of the corresponding simulated values, and the theoretical power predictions using the different estimates of the regression coefficient variance based on eqns (5.1) and (5.2) yielded an almost identical value.

Table 5 Regression coefficients estimated from the average of repeated simulations, together with the standard errors, and predicted from theoretical calculations, as well as the observed statistical powers from simulations and those from theoretical prediction, where power1 was predicted using the sample variance eqn (5.1) and power2 was predicted using the sample variance eqn (5.2)

Comparison between the regression analysis and the analysis of variance shows that the regression test had consistently higher power than the F-statistical test in the analysis of variance.

Discussion

Statistical inference about linkage disequilibrium between polymorphic genetic marker loci and the loci controlling quantitative genetic variation is essential in the identification of genes affecting traits of great economic value in plant/animal breeding schemes or of disease-susceptibility in humans. It has been shown by Lande & Thompson (1990) and Gimelfarb & Lande (1994) that substantial linkage disequilibrium between marker loci and QTL is a prerequisite for marker-assisted selection (MAS) to achieve extra genetic progress. Moreover, the efficiency of MAS is highly dependent on correctly determining the markers which are incorporated in a MAS index. A false positive or negative inference about the disequilibrium, and in turn an erroneous use of the marker information, will result in reducing instead of improving the efficiency (Luo et al., 1997). It has also been hoped that linkage disequilibrium between a marker and a trait locus will lead to the identification of a disease gene in the vicinity of the marker (Weeks & Lathrop, 1995), although great care must be taken in interpreting the data of linkage disequilibrium as an alternative measure for obtaining a fine map for a disease predisposing gene in human populations (Hill & Weir, 1994). Moreover, it is widely agreed that the objectives of screening genes underlying human complex disease have been seriously limited by the difficulties involved in collecting large and informative pedigrees. However, data are more easily obtained from natural populations than from structured pedigrees in human or animal populations or segregating populations in plant or animal species. The present analysis provides a fast screening of individual markers which are significantly associated with the genetic variation for the purposes of using marker information either for improving selection efficiency of quantitative traits or for mapping genes underlying quantitative genetic variation.
The linkage disequilibrium mapping of a gene is based on the study of association between the gene and the marker(s) whose map position is well known. This requires further knowledge about the magnitude of the disequilibrium and an appropriate parameterization of the decay of the disequilibrium in terms of the genetic distance between the target gene and the marker locus (see Baret & Hill, 1997, for a comprehensive review).

A novel quantitative genetics model has been developed in the present paper to detect the presence of linkage disequilibrium between a marker locus and a locus contributing to quantitative genetic variation in natural populations. The model is appropriate for analysing linkage disequilibrium generated from all potential causes. Theoretical analyses demonstrated that this can be achieved by the methods based upon an appropriate statistical method of analysis of variance or analysis of regression. The powers of these statistical analyses were adequately predicted and the factors affecting the powers were investigated. The model differs from others in various respects: the two-loci models of Soller & Genizi (1978), Luo (1993) and Knott (1994) assumed the disequilibrium to be produced from crossing two lines in which the marker and trait loci were linked or completely linked. The MAS model proposed by Lande & Thompson (1990) suggested the use of the regression coefficient of the number of favourable marker alleles in MAS on the trait phenotype as a measure of the magnitude of the disequilibrium, but no attempt was made in their study to investigate the efficiency of the method. The model presented in this paper is appropriate for directly analysing the data of marker genotypes and phenotypic records of a quantitative trait without requiring knowledge of the haplotype frequencies at the two loci which was assumed to be available in Hill & Weir (1994) and in Terwilliger (1995). An important assumption made in the present analysis is random union of gametes with respect to the marker and QTL loci. Any violation of this assumption would result in a reduction of the test statistic and thus a lowering of the power of the disequilibrium test.

The present study has shown the following. (i) There is an important difference in power between the two approaches; the regression analysis is more powerful than the analysis of variance, particularly when the QTL has a low heritability. An examination of the calculation of the test statistics in the two approaches indicates that the test statistic in the analysis of variance essentially tests the significance of the correlation ratio of a continuous quantitative variate Z (i.e. the phenotypic records of the trait) on a discrete variate T (i.e. the number of the marker allele M), whereas the test statistic in the regression analysis virtually tests linearity of the regression of the variate Z on the variate T. It has been pointed out in Kendall & Stuart (1961, pp. 296–300) that the regression test will have higher power than the correlation ratio test (i.e. the test of the analysis of variance) if the alternative hypothesis is that the regression of Z on T is linear. Moreover, it can be seen, from comparing the powers of populations 2, 5 and 6 in Table 3 with those of the corresponding populations in Table 5, that the superiority of the regression analysis to the analysis of variance tends to become less important as the dominance ratio at the QTL increases from zero to one. This, however, is paralleled with the trend that the powers of both the approaches decrease to a very small value (<30 per cent) as the dominance ratio increases. The difference between the two statistical tests will become trivial when both of the tests have very low powers. (ii) Although the variables in the regression analysis do not strictly follow a normal distribution, the variance estimate of the regression coefficient predicted from using formula (5.1), which requires the variables to be normal, was not significantly different from that derived from using prediction eqn (5.2) without the need of invoking normality. 
(iii) Several factors affect the efficiency of the statistical tests of the linkage disequilibrium. Among the parameters considered, the amount of disequilibrium and size of the QTL are most important in determining the powers. A comparison of the powers among populations 2, 5 and 6 indicates that the power decreases with an increase in the dominance ratio at the QTL. The allelic frequencies at the marker and QTL display an important effect on the power in both models (the analysis of variance and the regression analysis). When the allelic frequencies at the two loci are low (e.g. population 7 in which p=q=0.3) or high (e.g. population 8 in which p=q=0.7), the power is increased compared to intermediate values (e.g. population 2 in which p=q=0.5). Moreover, comparison of the power for populations 9 and 10 shows that the frequencies p and q were interchangeable in determining the power. This agrees with the fact that the two loci were symmetric in the theoretical model as described in Table 1. It was found in our previous study (Luo et al., 1997) that the allelic frequencies at the marker locus and QTL play an important role in determining the efficiency of MAS, but the effects of allelic frequencies at the two loci on MAS efficiency were not interchangeable. For a given amount of linkage disequilibrium between the two loci, an increase in the efficiency can be expected when both the frequencies p and q are low. Combining the findings of the present study with those of Luo et al. (1997) suggests that the allelic frequencies display a more important influence on the efficiency of MAS at the stage of selection than at the stage of screening the markers.

The present study has focused on modelling linkage disequilibrium between a single marker locus and a single QTL. This may seem far from realistic for polygenic inheritance of quantitative traits and for availability of marker linkage maps. In contrast to the model presented here, Lande & Thompson (1990) proposed a multiple regression approach in which an infinite loci model of quantitative genetic variation was assumed and the use of multiple markers was allowed. However, it has been shown in a simulation example given by Gimelfarb & Lande (1994) that either a false positive or a false negative inference about the linkage disequilibria between the marker loci and QTLs may be frequently made using multiple regression analysis because the marker-associated quantitative effects could be counterbalanced or could inflate each other among the linked marker loci. These problems, thus, leave the multiple regression model far from being conclusive for the theory of linkage disequilibria among marker loci and QTLs. A full understanding of the multidimensional marker-associated quantitative genetic effects requires a further reparameterization of the multiple regression coefficients in terms of genetic parameters such as the disequilibrium coefficients of different orders. The model studied here is increasingly likely to be a subunit of the sophisticated framework of multiple-loci disequilibria, and the study of the two-locus system in isolation is an important building block for an understanding of the system as a whole.