Introduction

Many characters of biological interest and economic importance vary in a dichotomous form, i.e. presence or absence, but are not inherited in a simple Mendelian fashion. These traits are called complex binary traits. Complex binary traits are presumably controlled by a number of genetic and environmental factors. Because of this, these traits belong to the category of quantitative traits (Falconer & Mackay, 1996).

Complex binary traits are usually analysed using a threshold model, in which the observed category is assumed to be determined by the value of an underlying unobservable continuous variable (Harville & Mee, 1984; McCulloch, 1994). The underlying continuous variable, called the liability, can be considered a regular quantitative trait that can be partitioned into genetic and environmental components. The binary phenotype and the continuous liability are linked through a fixed but unknown threshold (Wright, 1934). Existing quantitative genetics theory developed for continuous traits holds exactly for the continuous liability of binary traits.

Methods of QTL mapping for binary traits have been well developed for line-crossing experiments (Hackett & Weller, 1995; Visscher et al., 1996; Xu & Atchley, 1996; Rebai, 1997; Xu et al., 1998) and four-way crosses (Rao & Xu, 1998). These methods are primarily applied to a single cross or family. The statistical power of QTL mapping with a single family depends strongly on the two parents selected. If the two parents are fixed for the same allele at a putative QTL, the QTL is undetectable, no matter how many offspring are sampled from the family. On the other hand, even if a QTL is segregating in the family and is detected, the estimated variance of the QTL cannot be extrapolated beyond that particular family. To avoid a loss in statistical power as a result of homozygous parents being selected, and to increase the statistical inference space of the estimated QTL parameters, one needs to combine data from multiple families. Methods to handle normally distributed data from multiple families in QTL mapping have already been developed; these include maximum likelihood (Knott & Haley, 1992; Grignola et al., 1996a,b), simple linear regression (Knott et al., 1996) and weighted least squares (Xu, 1998a). However, such methods are lacking for QTL mapping of binary traits.

In this paper, we develop a fixed-model approach to mapping quantitative trait loci for complex binary traits from multiple families of outbred populations. This approach is based on the threshold model, and describes the liability by a single linear model with a heterogeneous residual variance. We treat the genotypic effects of the QTL for each parent as fixed effects. The Fisher-scoring algorithm is adopted here to estimate these genetic parameters. The method automatically generates an asymptotic variance–covariance matrix for the estimated QTL effects, which is then used for hypothesis tests and estimation of QTL variances. The method is tested via analyses of simulated data.

Statistical methods

Threshold model

Consider $n$ independent full-sib families. Let $y_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, n_i$) represent an underlying continuous-response variable associated with the $j$th individual in the $i$th family. Denote the genotypes of a putative QTL by $Q_{i1}^{s}Q_{i2}^{s}$ and $Q_{i1}^{d}Q_{i2}^{d}$ for the two parents (sire and dam) of the $i$th family. The four possible genotypes in the progeny are $Q_{i1}^{s}Q_{i1}^{d}$, $Q_{i1}^{s}Q_{i2}^{d}$, $Q_{i2}^{s}Q_{i1}^{d}$ and $Q_{i2}^{s}Q_{i2}^{d}$. We denote the values of the four genotypes by $G_{i11}$, $G_{i12}$, $G_{i21}$ and $G_{i22}$, respectively. The underlying variable $y_{ij}$ can be treated as a usual quantitative character, which is described by the following linear model:

$$y_{ij} = x_{ij}^{T}\beta + z_{ij}^{T}G_i + \varepsilon_{ij} \qquad (1)$$

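where $\beta$ is a vector of unknown parameters (including the overall mean, common environmental effects shared by family members, polygenic effects and so on), which is related to $y_{ij}$ via a known incidence vector $x_{ij}$; $\varepsilon_{ij}$ is the residual error distributed as $N(0, \sigma_\varepsilon^2)$; and $z_{ij} = (z_{ij1}\ z_{ij2}\ z_{ij3}\ z_{ij4})^{T}$ is a vector of indicators of the four possible genotypes. The variables $z_{ijk}$ ($k = 1, \ldots, 4$) are defined as follows:

$$z_{ijk} = \begin{cases} 1 & \text{if individual } j \text{ of family } i \text{ carries the } k\text{th of the four genotypes listed above} \\ 0 & \text{otherwise} \end{cases}$$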

In quantitative genetic analysis of complex binary traits, $y_{ij}$ itself is not observable. It has been postulated that $y_{ij}$ controls the binary expression of the trait through a threshold model (Wright, 1934). The relationship between the underlying variable $y_{ij}$ and the binary response $s_{ij}$ is assumed to be:

$$s_{ij} = \begin{cases} 1 & \text{if } y_{ij} > t \\ 0 & \text{otherwise} \end{cases}$$

for some threshold value $t$. The threshold model is overparameterized, so some constraints must be imposed. As usual, we set $\sigma_\varepsilon^2 = 1$ and $t = 0$ (Harville & Mee, 1984; Sorensen et al., 1995).
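The overparameterization can be illustrated numerically: the probability of an affected individual depends on the liability mean, the threshold and the residual standard deviation only through (mean − t)/σ, so fixing t = 0 and σ = 1 loses no generality. The following is a minimal Python sketch; the function name `prob_affected` and all numerical values are illustrative assumptions, not part of the original analysis.

```python
from statistics import NormalDist


def prob_affected(mean, threshold=0.0, sigma=1.0):
    """Pr(s = 1) = Pr(y > threshold) for a liability y ~ N(mean, sigma^2)."""
    return 1.0 - NormalDist(mean, sigma).cdf(threshold)


# Three parameterizations of the same binary trait (illustrative values).
p_original = prob_affected(mean=3.0, threshold=2.0, sigma=2.0)
p_shifted = prob_affected(mean=1.0, threshold=0.0, sigma=2.0)   # threshold moved to 0
p_standard = prob_affected(mean=0.5, threshold=0.0, sigma=1.0)  # also rescaled by sigma

print(p_original, p_shifted, p_standard)
```

All three calls describe the same incidence, which is why only the standardized mean is identifiable from binary data.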

As a regular quantitative trait, the genotypic values of the liability can be partitioned into additive and dominance effects, i.e.

$$G_{ikl} = \alpha_{ik}^{s} + \alpha_{il}^{d} + \delta_{ikl}, \qquad k, l = 1, 2,$$

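where $\alpha_{i1}^{s}$ and $\alpha_{i2}^{s}$ are the effects of the two alleles in the sire, $\alpha_{i1}^{d}$ and $\alpha_{i2}^{d}$ are the effects of the two alleles in the dam, and $\delta_{ikl}$ is the dominance deviation. Unfortunately, $\alpha_{ik}^{s}$, $\alpha_{il}^{d}$ and $\delta_{ikl}$ are not all estimable; some constraints are required. If all the allelic and dominance effects are appropriately scaled (standardized), the following restrictions can be applied:

$$\alpha_{i1}^{s} + \alpha_{i2}^{s} = 0, \qquad \alpha_{i1}^{d} + \alpha_{i2}^{d} = 0, \qquad \delta_{i11} = -\delta_{i12} = -\delta_{i21} = \delta_{i22}.$$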

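Under these constraints, there are only three independent estimable effects, which are $\alpha_{i1}^{s}$, $\alpha_{i1}^{d}$ and $\delta_{i11}$. Denote $\alpha_i^{s} = 2\alpha_{i1}^{s}$, $\alpha_i^{d} = 2\alpha_{i1}^{d}$, $\delta_i = \delta_{i11}$, $G_i = (G_{i11}\ G_{i12}\ G_{i21}\ G_{i22})^{T}$ and $\gamma_i = (\alpha_i^{s}\ \alpha_i^{d}\ \delta_i)^{T}$; then $G_i = H\gamma_i$, where

$$H = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & -\tfrac{1}{2} & -1 \\ -\tfrac{1}{2} & \tfrac{1}{2} & -1 \\ -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{pmatrix}$$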

Hence, model (1) can be re-expressed as:

$$y_{ij} = x_{ij}^{T}\beta + w_{ij}^{T}\gamma_i + \varepsilon_{ij} \qquad (5)$$

where $w_{ij}^{T} = z_{ij}^{T}H$.
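The reparameterization can be sketched in a few lines of Python. The matrix `H` below assumes the scaling in which the sire and dam effects are twice the first-allele effects and the dominance effect equals the deviation of the first genotype; the function names and the example effects are hypothetical.

```python
# Rows follow the genotype order Q1sQ1d, Q1sQ2d, Q2sQ1d, Q2sQ2d;
# columns are (alpha_s, alpha_d, delta). The scaling is an assumption:
# alpha_s = 2 * alpha_1s, alpha_d = 2 * alpha_1d, delta = delta_11.
H = [
    [0.5, 0.5, 1.0],
    [0.5, -0.5, -1.0],
    [-0.5, 0.5, -1.0],
    [-0.5, -0.5, 1.0],
]


def genotypic_values(gamma):
    """Return G = H gamma for gamma = (alpha_s, alpha_d, delta)."""
    return [sum(h * g for h, g in zip(row, gamma)) for row in H]


def design_row(z):
    """Return w' = z' H for a genotype indicator (or probability) vector z."""
    return [sum(z[k] * H[k][c] for k in range(4)) for c in range(3)]


gamma = (1.0, 0.4, 0.2)        # illustrative effects, not taken from the paper
G = genotypic_values(gamma)    # four genotypic values, summing to zero
w = design_row([0, 1, 0, 0])   # an individual known to carry the second genotype
print(G, w)
```

Because the constraints force the four genotypic values to sum to zero, three parameters per family are enough to describe them.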

If the QTL is not at a marker locus, its genotype is unobservable, so $z_{ij}$ is missing. However, the distribution of $z_{ij}$ can be inferred from the genotypes of linked markers. Define the probability of $z_{ijk} = 1$ conditional on marker information $I_M$ by

$$p_{ijk} = \Pr(z_{ijk} = 1 \mid I_M), \qquad k = 1, \ldots, 4.$$

These conditional probabilities can be calculated using a multipoint method. Details of the general multipoint method can be found in Kruglyak & Lander (1995) and Rao & Xu (1998).

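Given these conditional probabilities, collected in the vector $p_{ij} = (p_{ij1}\ p_{ij2}\ p_{ij3}\ p_{ij4})^{T}$, the expectation and covariance matrices of $z_{ij}$ are

$$E(z_{ij} \mid I_M) = p_{ij}$$

and

$$\mathrm{Var}(z_{ij} \mid I_M) = \mathrm{diag}(p_{ij}) - p_{ij}p_{ij}^{T}.$$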

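Therefore, the conditional expectation and variance of $y_{ij}$ can be derived as follows:

$$\mu_{ij} = E(y_{ij} \mid I_M) = x_{ij}^{T}\beta + E(z_{ij} \mid I_M)^{T}H\gamma_i$$

and

$$V_{ij} = \mathrm{Var}(y_{ij} \mid I_M) = \gamma_i^{T}H^{T}\,\mathrm{Var}(z_{ij} \mid I_M)\,H\gamma_i + \sigma_\varepsilon^2.$$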

It should be noted that the conditional distribution of $y_{ij}$ is a mixture of four normal distributions. Nevertheless, when the QTL effects are small relative to $\sigma_\varepsilon$ and the marker information content is high, the conditional distribution of $y_{ij}$ can be close to a normal distribution. Therefore, model (5) can be approximated by the following heterogeneous residual variance model

$$y_{ij} = \mu_{ij} + e_{ij} \qquad (6)$$

where $\mu_{ij}$ and $V_{ij}$ are the conditional mean and variance of $y_{ij}$ derived above, and $e_{ij} \sim N(0, V_{ij})$.
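As a numerical illustration of the heterogeneous residual variance, the sketch below computes the conditional mean and variance of the liability from the four genotype probabilities. It assumes the same scaling of `H` as in the reparameterization (dominance effect equal to the deviation of the first genotype) and a residual variance of 1; the function name and the example values are hypothetical.

```python
# Assumed scaling of H: columns (alpha_s, alpha_d, delta), rows in the
# genotype order Q1sQ1d, Q1sQ2d, Q2sQ1d, Q2sQ2d.
H = [
    [0.5, 0.5, 1.0],
    [0.5, -0.5, -1.0],
    [-0.5, 0.5, -1.0],
    [-0.5, -0.5, 1.0],
]


def conditional_moments(xb, p, gamma, sigma2_e=1.0):
    """Mean and variance of the liability given marker-based genotype probabilities.

    xb    : value of x'beta for the individual
    p     : the four conditional genotype probabilities (summing to 1)
    gamma : (alpha_s, alpha_d, delta) for the individual's family
    """
    G = [sum(h * g for h, g in zip(row, gamma)) for row in H]
    mean_g = sum(pk * gk for pk, gk in zip(p, G))                   # p' H gamma
    var_g = sum(pk * gk * gk for pk, gk in zip(p, G)) - mean_g ** 2  # G' (diag(p) - pp') G
    return xb + mean_g, var_g + sigma2_e


gamma = (1.0, 0.4, 0.2)  # illustrative effects
mu_known, V_known = conditional_moments(0.0, [1.0, 0.0, 0.0, 0.0], gamma)
mu_flat, V_flat = conditional_moments(0.0, [0.25, 0.25, 0.25, 0.25], gamma)
print(mu_known, V_known, mu_flat, V_flat)
```

With a fully informative marker the variance reduces to the residual variance, whereas uninformative markers push the whole QTL variance into the residual term.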

Estimating genetic parameters

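Under model (6), i.e. $y_{ij} \mid (I_M, \beta, \gamma_i) \sim N(\mu_{ij}, V_{ij})$, the probability that $s_{ij} = 1$ is

$$P_{ij} = \Pr(s_{ij} = 1) = \Pr(y_{ij} > 0) = \Phi\!\left(\frac{\mu_{ij}}{\sqrt{V_{ij}}}\right)$$

which leads to $\Pr(s_{ij} = 0) = 1 - \Pr(s_{ij} = 1) = 1 - P_{ij}$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function.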

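In the fixed model, conditional on the genetic effects, the $s_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, n_i$) are mutually independent. Therefore, we have the following log-likelihood function:

$$L(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\left[s_{ij}\ln P_{ij} + (1 - s_{ij})\ln(1 - P_{ij})\right]$$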
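The probability of an affected individual and the resulting Bernoulli log-likelihood can be sketched as follows, taking the conditional means and variances of the liability as given inputs. The function names are illustrative assumptions.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def prob_affected(mu, V):
    """Pr(s = 1) = Phi(mu / sqrt(V)) when the threshold is fixed at 0."""
    return PHI(mu / sqrt(V))


def log_likelihood(s, mu, V):
    """Bernoulli log-likelihood summed over mutually independent individuals."""
    total = 0.0
    for s_ij, mu_ij, V_ij in zip(s, mu, V):
        p = prob_affected(mu_ij, V_ij)
        total += s_ij * log(p) + (1 - s_ij) * log(1.0 - p)
    return total


# Two individuals with zero conditional mean: each contributes ln(0.5).
example = log_likelihood([1, 0], [0.0, 0.0], [1.0, 2.0])
print(example)
```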

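The maximum likelihood estimates of $\theta = (\beta^{T}\ \gamma_1^{T} \ldots \gamma_n^{T})^{T}$ can be obtained using any convenient algorithm. We found that the Fisher-scoring algorithm is easy to derive and also extremely fast; the algorithm is described as follows:

$$\theta^{(k+1)} = \theta^{(k)} + F(\theta^{(k)})^{-1}S(\theta^{(k)})$$

where $k$ denotes an iteration index,

$$S(\theta) = \frac{\partial L(\theta)}{\partial\theta}$$

is the score function, with $L(\theta)$ the log-likelihood, and

$$F(\theta) = -E\!\left[\frac{\partial^2 L(\theta)}{\partial\theta\,\partial\theta^{T}}\right]$$

is the Fisher information matrix. The components of the score vector and the Fisher information matrix are given in the Appendix.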
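The iteration can be illustrated with a deliberately simplified Python sketch. It fits an ordinary probit model in which the liability variance is held at 1, so the dependence of the variance on the QTL effects is ignored; it is therefore not the full algorithm given in the Appendix. The helper `solve`, the function `fisher_scoring_probit` and the toy data are all assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fisher_scoring_probit(X, s, n_iter=50, tol=1e-10):
    """Fisher scoring for a probit model with unit liability variance.

    Returns the estimate and the Fisher information evaluated at the
    iterate from which the final step was taken.
    """
    q = len(X[0])
    theta = [0.0] * q
    for _ in range(n_iter):
        score = [0.0] * q
        info = [[0.0] * q for _ in range(q)]
        for x, s_i in zip(X, s):
            eta = sum(x_k * t_k for x_k, t_k in zip(x, theta))
            p = STD_NORMAL.cdf(eta)
            dens = STD_NORMAL.pdf(eta)          # dP/d(eta)
            weight = dens / (p * (1.0 - p))
            for a in range(q):
                score[a] += x[a] * weight * (s_i - p)
                for b in range(q):
                    info[a][b] += x[a] * x[b] * weight * dens
        step = solve(info, score)               # F^{-1} S
        theta = [t + d for t, d in zip(theta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return theta, info


# Toy data: two groups of 10 with 3 and 7 affected individuals.
X = [[1.0, 0.0]] * 10 + [[1.0, 1.0]] * 10
s = [1] * 3 + [0] * 7 + [1] * 7 + [0] * 3
theta_hat, info = fisher_scoring_probit(X, s)
print(theta_hat)
```

In this saturated toy example the fitted probabilities reproduce the observed proportions in each group, which gives a simple check on convergence.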

Each update of the Fisher-scoring algorithm involves inverting $F(\theta)$, which can be time consuming for a large number of families or a high dimension of $\beta$. This cost can be avoided by taking advantage of the partitioned structure of $F(\theta)$. Since the lower-right part of $F(\theta)$ is block diagonal, the Fisher-scoring algorithm can be simplified (see Appendix). After obtaining the maximum likelihood estimates of the parameters $\beta$, $\alpha_i^{s}$, $\alpha_i^{d}$ and $\delta_i$, the inverse of $F(\hat\theta)$ has to be calculated; this inverse also has a simple form (see Appendix).

Given $\hat\alpha_i^{s}$, $\hat\alpha_i^{d}$ and $\hat\delta_i$, one can estimate the cross-family variances. If the parents of the sampled families are randomly sampled from the base population, the cross-family variances provide approximate estimates of the QTL segregation variances in the base population. The additive and dominance variances in the base population are $\sigma_\alpha^2 = \tfrac{1}{4}[\mathrm{Var}(\alpha_i^{s}) + \mathrm{Var}(\alpha_i^{d})]$ and $\sigma_\delta^2 = \mathrm{Var}(\delta_i)$, respectively.

The additive variance σ2α and dominance variance σ2δ can be estimated by:

and

If $\mathrm{Var}(\hat\alpha_i^{s})$, $\mathrm{Var}(\hat\alpha_i^{d})$ and $\mathrm{Var}(\hat\delta_i)$ are obtained exactly, formulas (9) and (10) provide unbiased estimates of $\sigma_\alpha^2$ and $\sigma_\delta^2$, respectively. However, $\mathrm{Var}(\hat\alpha_i^{s})$, $\mathrm{Var}(\hat\alpha_i^{d})$ and $\mathrm{Var}(\hat\delta_i)$ can only be estimated approximately by the inverse of the Fisher information matrix (see the next section), and thus the estimates are only asymptotically unbiased.
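One way to see why the sampling variances enter the estimators is a simple moment argument: if the family-specific effects have mean zero, the expected square of an estimate equals the true cross-family variance plus the sampling variance of that estimate. The Python sketch below implements this bias-corrected moment estimator. It rests on that zero-mean assumption and is not claimed to be identical to formulas (9) and (10); the function name and inputs are hypothetical.

```python
def cross_family_variances(alpha_s, alpha_d, delta,
                           var_alpha_s, var_alpha_d, var_delta):
    """Bias-corrected moment estimates of additive and dominance variances.

    Assumption (not taken from the paper): family-specific effects have mean
    zero, so each squared estimate is corrected by its sampling variance.
    All arguments are lists with one entry per family.
    """
    n = len(delta)

    def corrected_mean_square(estimates, sampling_vars):
        return sum(e * e - v for e, v in zip(estimates, sampling_vars)) / n

    sigma2_alpha = 0.25 * (corrected_mean_square(alpha_s, var_alpha_s)
                           + corrected_mean_square(alpha_d, var_alpha_d))
    sigma2_delta = corrected_mean_square(delta, var_delta)
    return sigma2_alpha, sigma2_delta


# Illustrative estimates for two families.
s2_alpha, s2_delta = cross_family_variances(
    alpha_s=[1.0, -1.0], alpha_d=[2.0, 0.0], delta=[0.5, -0.5],
    var_alpha_s=[0.2, 0.2], var_alpha_d=[0.5, 0.5], var_delta=[0.05, 0.05],
)
print(s2_alpha, s2_delta)
```

Note that with noisy estimates such a corrected estimator can become negative in small samples.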

Hypothesis testing

A useful property of the Fisher-scoring algorithm is that the variance–covariance matrix of $\hat\theta = (\hat\beta^{T}\ \hat\gamma_1^{T} \ldots \hat\gamma_n^{T})^{T}$ can be approximated by the inverse of the Fisher information matrix, i.e. $\mathrm{Var}(\hat\theta) \approx F(\hat\theta)^{-1}$. Because $\hat\theta$ is a maximum likelihood estimate, it approximately follows a multivariate normal distribution if the family sizes $n_i$ are large enough, i.e.

$$\hat\theta \sim N\!\left(\theta,\ F(\hat\theta)^{-1}\right).$$

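As a consequence, the following test statistic, $w$, will follow an approximate chi-squared distribution with $m$ degrees of freedom under the null hypothesis that $C\theta = 0$:

$$w = (C\hat\theta)^{T}\left[C\,F(\hat\theta)^{-1}C^{T}\right]^{-1}(C\hat\theta)$$

where $m$ is the rank of matrix $C$.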

The exact form of matrix $C$ determines the type of hypothesis test. To test the overall null hypothesis $H_0$: $\alpha_i^{s} = \alpha_i^{d} = \delta_i = 0$ for all $i$, we set $C = (0\ \ I_{3n})$. Theoretically, there are many other hypotheses to test. In this study, we carry out only the overall test that no QTL at the locus of interest is segregating.
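For the overall test, the contrast matrix simply selects the QTL effects, so the statistic is a quadratic form in those effects weighted by the inverse of the lower-right block of the inverse information matrix. The following Python sketch assumes that the first `n_beta` entries of the parameter vector are the common fixed effects; the function names and the small example matrices are illustrative.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def wald_statistic(theta_hat, F, n_beta):
    """Wald statistic for C = (0 I): all effects after the first n_beta are zero."""
    q = len(theta_hat)
    # Columns of F^{-1}, obtained by solving F x = e_k for each unit vector e_k.
    inv_cols = [solve(F, [1.0 if r == k else 0.0 for r in range(q)])
                for k in range(q)]
    idx = range(n_beta, q)
    # Lower-right block of F^{-1}: approximate covariance of the tested effects.
    cov = [[inv_cols[c][r] for c in idx] for r in idx]
    effects = [theta_hat[r] for r in idx]
    weighted = solve(cov, effects)  # cov^{-1} times the tested effects
    return sum(e * u for e, u in zip(effects, weighted))


w_diag = wald_statistic([1.0, 2.0, 0.5],
                        [[4.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 16.0]],
                        n_beta=1)
w_corr = wald_statistic([0.0, 3.0], [[2.0, 1.0], [1.0, 2.0]], n_beta=1)
print(w_diag, w_corr)
```

The degrees of freedom equal the number of tested effects, i.e. three per family for the overall test.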

Simulation studies

Design of simulations

Properties of the proposed method were investigated numerically via Monte Carlo simulations. The following properties were examined: the bias and standard error of the parameter estimates, and the statistical power of QTL detection. We considered the effects of the following factors on the performance of the mapping procedure: (i) the variance explained by the QTL; (ii) the sampling strategy (number of families vs. family size); and (iii) the trait incidence (proportion of affected individuals). We simulated one chromosome of length 100 cM with 11 codominant markers evenly spaced along the chromosome. A single QTL was simulated at 25 cM. The simulation was repeated 100 times for each situation. The total number of individuals was set at 750 in all simulations. The standard error of the parameter estimates was calculated as the standard deviation of the estimates among the 100 replicates. The statistical power was determined by counting the number of runs (over the 100 replicates) with test statistics greater than an empirical critical value. The empirical critical values under each condition were obtained as the 95th and 99th percentiles of the highest test statistic over 1000 additional runs under the null model (no QTL segregating).
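The threshold and power calculations described above can be sketched as follows. The percentile rule used here (the smallest order statistic with at least the required share of null runs at or below it) is one common convention and is an assumption, as are the function names and the toy numbers.

```python
import math


def empirical_critical_value(null_max_stats, alpha):
    """Empirical (1 - alpha) quantile of the highest test statistic under the null."""
    ordered = sorted(null_max_stats)
    # Small offset guards against floating-point error before rounding up.
    rank = math.ceil((1.0 - alpha) * len(ordered) - 1e-9)
    return ordered[rank - 1]


def empirical_power(alt_max_stats, critical_value):
    """Proportion of replicates whose highest statistic exceeds the critical value."""
    hits = sum(stat > critical_value for stat in alt_max_stats)
    return hits / len(alt_max_stats)


null_stats = list(range(1, 21))          # toy null statistics 1, 2, ..., 20
threshold = empirical_critical_value(null_stats, alpha=0.05)
power = empirical_power([10, 25, 30, 19], threshold)
print(threshold, power)
```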

Five equally frequent alleles were simulated for each marker locus. This setting allows each parent to have a 20% chance of being homozygous at each marker locus. In all situations, the residual error was assumed to be normally distributed, with a variance set at $\sigma_\varepsilon^2 = 1.0$. The broad-sense heritability of the QTL, $h_q^2 = \sigma_g^2/(\sigma_g^2 + \sigma_\varepsilon^2) = (\sigma_\alpha^2 + \sigma_\delta^2)/(\sigma_\alpha^2 + \sigma_\delta^2 + \sigma_\varepsilon^2)$, was examined at four levels: 0.4, 0.3, 0.2 and 0.1. Only a mixed mode of QTL inheritance was considered, i.e. $\sigma_\alpha^2 = \sigma_\delta^2$. Therefore, $h_q^2 = 0.1$ corresponds to $\sigma_\alpha^2 = \sigma_\delta^2 = 0.06$, $h_q^2 = 0.2$ corresponds to $\sigma_\alpha^2 = \sigma_\delta^2 = 0.125$, $h_q^2 = 0.3$ corresponds to $\sigma_\alpha^2 = \sigma_\delta^2 = 0.215$, and $h_q^2 = 0.4$ corresponds to $\sigma_\alpha^2 = \sigma_\delta^2 = 0.33$. The sampling strategy was simulated at three levels: 5 × 150, 10 × 75 and 15 × 50 (number of families × family size). The trait incidence was set at 50% and 20%.
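The variance components quoted for each heritability level follow from solving the heritability definition for the QTL variance and splitting it equally between the additive and dominance parts. A short Python check (the function name is an illustrative assumption):

```python
def qtl_variances(h2_q, sigma2_e=1.0):
    """Additive and dominance variances implied by h2_q when the two are equal."""
    sigma2_g = h2_q * sigma2_e / (1.0 - h2_q)   # from h2 = g / (g + e)
    return sigma2_g / 2.0, sigma2_g / 2.0


for h2 in (0.1, 0.2, 0.3, 0.4):
    additive, dominance = qtl_variances(h2)
    print(h2, round(additive, 3), round(dominance, 3))
```

The unrounded values are close to 0.056, 0.125, 0.214 and 0.333, in line with the rounded figures given in the text.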

QTL allelic effects were considered to be normally distributed with preassigned additive and dominance variances. Each parent of a family carried two alleles: the first allele was assigned a value sampled from a standard normal distribution, and the second was assigned the negative of the first. The dominance effect was the interaction effect between any two sampled alleles, which was assigned a value sampled from an $N(0, 1)$ distribution, independent of the allelic effects. When offspring were generated, their genetic values at the QTL were rescaled so that they had the assigned variances. The liability of each offspring was the sum of its genetic value, the overall mean $\mu$, and a residual error sampled from the $N(0, 1)$ distribution. The observable binary phenotype was set to 1 if the corresponding liability exceeded 0, and to 0 otherwise. The overall mean of the liability, contained in $\beta$, determines the trait incidence; we chose the mean so that the preassigned level of incidence was obtained.
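How the overall mean controls the incidence can be sketched under a simplifying assumption: if the sum of the genetic value and the residual is treated as normal across the population, the mean giving a target incidence has a closed form. The simulation below draws genetic values independently per individual rather than through parental alleles, so it is only a rough stand-in for the design described above; all names and values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def liability_mean_for_incidence(incidence, sigma2_g, sigma2_e=1.0):
    """Mean mu with Pr(mu + g + e > 0) = incidence, treating g + e as normal."""
    return sqrt(sigma2_g + sigma2_e) * NormalDist().inv_cdf(incidence)


def simulate_phenotypes(n, incidence, sigma2_g, seed=1):
    """Draw n binary phenotypes by thresholding simulated liabilities at 0."""
    rng = random.Random(seed)
    mu = liability_mean_for_incidence(incidence, sigma2_g)
    phenotypes = []
    for _ in range(n):
        liability = mu + rng.gauss(0.0, sqrt(sigma2_g)) + rng.gauss(0.0, 1.0)
        phenotypes.append(1 if liability > 0.0 else 0)
    return phenotypes


mu_20 = liability_mean_for_incidence(0.2, sigma2_g=0.25)
sample = simulate_phenotypes(20000, incidence=0.2, sigma2_g=0.25)
print(mu_20, sum(sample) / len(sample))
```

An incidence of 50% corresponds to a mean of zero, and lower incidences require a negative mean.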

Results of simulation

The empirical critical values at Type I error rates of 0.05 and 0.01 for situations with trait incidences of 50% and 20% are given in Table 1. The trait incidence has a small effect on the empirical critical values. The empirical critical values are slightly higher than the critical values of the $\chi^2$ distribution with the corresponding degrees of freedom ($3n$). As the number of families increases, the empirical critical value increases dramatically. This is expected because increasing the number of families increases the number of parameters tested.

Table 1 Empirical critical values for the significance test at α = 0.05 and α = 0.01, where α is the type I error rate

The estimates of QTL parameters (QTL position, additive and dominance variances, and heritability) and the empirical power are summarized in Tables 2, 3 and 4. The proposed method successfully locates the QTL position and estimates the additive and dominance variance components as well as the heritability with negligible biases. However, the QTL heritability, the sampling strategy and the trait incidence have strong impacts on the performance of the mapping procedure.

Table 2 Estimates of QTL parameters and empirical power (α = 0.05, 0.01) under different levels of heritability of the QTL and different sampling strategies when the trait incidence is 50%. Standard errors of the estimates, given in parentheses, are calculated as the standard deviations among 100 replicated simulations. $\sigma_\alpha^2$, $\sigma_\delta^2$, $\sigma_g^2$ and $h_q^2$ are the additive, dominance and genetic variances and the heritability of the QTL, respectively

Table 3 Estimates of QTL parameters and empirical power (α = 0.05, 0.01) under different levels of heritability of the QTL and different sampling strategies when the trait incidence is 20%. Standard errors of the estimates, given in parentheses, are calculated as the standard deviations among 100 replicated simulations. $\sigma_\alpha^2$, $\sigma_\delta^2$, $\sigma_g^2$ and $h_q^2$ are the additive, dominance and genetic variances and the heritability of the QTL, respectively

Table 4 Estimates of QTL parameters and empirical power (α = 0.05, 0.01) under different sampling strategies when the QTL heritability $h_q^2$ is 0.4 and the trait incidence is 50%. Standard errors of the estimates, given in parentheses, are calculated as the standard deviations among 100 replicated simulations. Parametric values not listed in the table are: QTL position = 25 cM, additive variance $\sigma_\alpha^2 = 0.33$, dominance variance $\sigma_\delta^2 = 0.33$ and genetic variance $\sigma_g^2 = 0.66$

The performance of the proposed method is strongly affected by the trait incidence. In all cases, the estimates of the QTL parameters are more accurate and the statistical power is higher with a trait incidence of 50% than with a trait incidence of 20% (see Tables 2 and 3). Under the threshold model, some information is lost in the translation from the underlying liability to the observed binary phenotype. The trait incidence determines the amount of lost information: the closer the trait incidence is to 50%, the less information is lost. This explains why the proposed method performs better in terms of the accuracy of parameter estimation and statistical power when the trait incidence is 50%.

The proportion of the phenotypic variance explained by the QTL, i.e. the QTL heritability $h_q^2$, has an effect on the accuracy of the estimated parameters and the statistical power (see Tables 2, 3 and 4). As expected, a lower QTL heritability ($h_q^2 = 0.1$) increases the standard deviation of the estimated QTL position and decreases the statistical power. Higher heritability levels tend to be associated with a slightly larger standard deviation in the estimated variances and heritabilities, because of the scaling effect, i.e. the standard deviation is correlated with the mean. In the case of high QTL heritability ($h_q^2 = 0.4$), the genetic variances and heritability are underestimated, although the estimated QTL position is accurate and the statistical power is high (Table 4). This is expected because the mixture of normal distributions cannot be approximated well by a single normal distribution when the QTL heritability is high.

The sampling strategy also has an impact on the estimation of the QTL position and the statistical power (see Tables 2, 3 and 4). When the number of families increases, the standard deviation of the estimated QTL position increases. In the case of low QTL heritability ($h_q^2 = 0.1$), the statistical power decreases as the number of families increases. When the power is already very high (e.g. $h_q^2 > 0.1$), the effect of the sampling strategy is expected to be negligible. In contrast, the sampling strategy has only a small effect on the precision of the estimated QTL variances and heritabilities. Overall, QTL mapping performs well with a few large families.

Discussion

We have developed a general framework of QTL mapping for complex binary traits by combining data from multiple families. Instead of analysing each family separately, this method carries out a joint statistical inference for multiple families. The link among families is reflected by the common fixed-effect vector β (see models 5 and 6), which includes the overall mean μ and some common genetic and nongenetic factors, e.g. common environmental effects shared by these families and maternal effects. Because of these common effects, the estimates and tests of genetic parameters in different families are correlated. Therefore, a joint test for multiple families is more powerful than a test considering each family separately (Rebai & Goffinet, 1993). In addition, there are other reasons to justify the use of the proposed consensus method. Like most human diseases, complex binary traits in animal and plant populations undoubtedly have a complex genetic basis. Some QTLs controlling complex binary traits may be homozygous in any single individual, and different individuals may be heterozygous for different QTLs. Ideally, several families should be selected and analysed jointly. As a result of using multiple families, the method has a wider statistical inference space than one using a single family. Theoretically, the variance attributable to the QTL is better estimated with a large number of families. However, the number of parameters increases dramatically as the number of families sampled increases, which undoubtedly reduces the statistical efficiency of the proposed method. Therefore, with a fixed number of individuals, there is an optimal allocation between the number of families and the number of individuals per family at which QTL mapping reaches its maximum power and minimum estimation error. Limited investigations have shown that a mating design with several parents (5–10) should give a good sample of variance and allow the detection of QTL with reasonable power (Muranty, 1996; Xu, 1998a).

A typical problem in QTL mapping comes from missing QTL genotypes. In QTL mapping using outbred line crosses, when the putative QTL is not at a marker, the liability is actually a mixture of four normal distributions. Theoretically, the optimal treatment of the unobservable genotype is the mixture model maximum likelihood method, which uses all information contained in the data (Lander & Botstein, 1989; Zeng, 1994). The heterogeneous residual variance model proposed here uses a single distribution to approximate the four distributions, assuming that the residual is normally distributed. This approximation is feasible only when the QTL effects are small relative to the residual variance. Some comparisons between the heterogeneous residual variance model and the mixture model have been made in QTL mapping, showing that the two methods are virtually identical for normally distributed traits, even when the QTL effects are large (Xu, 1998b). Our simulations show that the proposed method performs well when the QTL heritability is not overly high. In the analysis of real data, the effect of any individual QTL being tested is usually small for most polygenic traits, which makes the proposed method valid for most situations. There are some advantages of the heterogeneous residual variance model over the mixture model. First, a simple Fisher-scoring algorithm is available for the heterogeneous residual variance model, which provides, as a by-product, an estimate of the variance–covariance matrix of the estimated parameters. Therefore, it is straightforward to conduct hypothesis tests. The Fisher-scoring method is difficult to derive for the mixture model. As a result, computing the variance–covariance matrix of the estimated parameters is difficult with the mixture model. Secondly, the heterogeneous residual variance model implemented via the Fisher-scoring algorithm is fast, which allows a multiple sampling technique, e.g. the permutation test, to be used more conveniently.

The results presented in this study are based on known linkage phases. Therefore, implementation of this method requires the knowledge of marker linkage phases in the parents. There are several ways to deduce the linkage phase in outbred pedigrees (e.g. Maliepaard et al., 1997).

The proposed method is a fixed-model approach because the genetic effects are treated as fixed. As observed in the simulation studies, the method is efficient when there is a small number of families with large family sizes. However, as the number of families increases, the substantial number of parameters to be estimated often generates statistical problems, in particular when the family sizes are small. The random model approach, on the other hand, estimates only a few parameters, because only the variances are estimated and tested. With the random model approach, statistical analysis can be carried out even when the family sizes are small. The random model approach, however, has not yet been developed for complex binary traits and deserves further investigation.