Abstract
Complex binary traits have a dichotomous phenotypic expression but do not show a simple Mendelian segregation ratio. These traits are considered to be jointly controlled by the actions of several genes and a random environmental effect. The binary phenotype and the underlying factor are assumed to be linked through a threshold model. The underlying factor, referred to as the liability, is treated as a regular but unobservable quantitative character. Mapping quantitative trait loci (QTL) can be performed directly on the liability. Methods of QTL mapping for the liability of a complex binary trait have been well developed in line-crossing experiments. However, such a method is not available in outbred populations which usually consist of many independent pedigrees (families). In this study, we develop a method to analyse jointly multiple families of an outbred population. The method is developed based on a fixed-model approach, i.e. the QTL effects, rather than the variance, are estimated and tested. After the test, the estimated effects are then converted into a single estimate of the QTL variance by taking into consideration errors in the estimated effects. The QTL effects and variance–covariance matrix of the estimates are obtained by a fast Fisher-scoring method. Monte Carlo simulations show that the method is not only powerful but also generates very accurate estimates of QTL variances.
Introduction
Many characters of biological interest and economic importance vary in a dichotomous form, i.e. presence or absence, but are not inherited in a simple Mendelian fashion. These traits are called complex binary traits. Complex binary traits are presumably controlled by a number of genetic and environmental factors. Because of this, these traits belong to the category of quantitative traits (Falconer & Mackay, 1996).
Complex binary traits are usually analysed using a threshold model, in which it is assumed that the observed category is determined by the value of an underlying unobservable continuous variable (Harville & Mee, 1984; McCulloch, 1994). The underlying continuous variable, called the liability, can be considered as a regular quantitative trait which can be partitioned into genetic and environmental components. The binary phenotype and the continuous liability are linked through a fixed but unknown threshold (Wright, 1934). Existing quantitative genetics theory developed for continuous traits holds exactly for the continuous liability of binary traits. Methods of QTL mapping for binary traits have been well developed in line-crossing experiments (Hackett & Weller, 1995; Visscher et al., 1996; Xu & Atchley, 1996; Rebai, 1997; Xu et al., 1998) and four-way crosses (Rao & Xu, 1998). These methods are primarily applied to a single cross or family. The statistical power of QTL mapping with a single family strongly depends on the two parents selected. If the two parents are fixed for the same allele at a putative QTL, the QTL is undetectable, no matter how many offspring are sampled from the family. On the other hand, even if a QTL is segregating in the family and is detected, the estimated variance of the QTL cannot be extrapolated beyond the particular family. To avoid a loss in statistical power as a result of homozygous parents being selected and to increase the statistical inference space of the estimated QTL parameters, one needs to combine data from multiple families. Methods to handle normally distributed data from multiple families in QTL mapping have already been developed. Typically, these methods include maximum likelihood (Knott & Haley, 1992; Grignola et al., 1996a,b), simple linear regression (Knott et al., 1996) and weighted least squares (Xu, 1998a). However, such methods are lacking for QTL mapping of binary traits.
In this paper, we develop a fixed-model approach to mapping quantitative trait loci for complex binary traits from multiple families of outbred populations. This approach is based on the threshold model, and describes the liability by a single linear model with a heterogeneous residual variance. We treat the genotypic effects of the QTL for each parent as fixed effects. The Fisher-scoring algorithm is adopted here to estimate these genetic parameters. The method automatically generates an asymptotic variance–covariance matrix for the estimated QTL effects, which are eventually used for hypothesis tests and estimation of QTL variances. The method is tested via analyses of simulated data.
Statistical methods
Threshold model
Consider $n$ independent full-sib families. Let $y_{ij}$ ($i = 1, \ldots, n$; $j = 1, \ldots, n_i$) represent an underlying continuous-response variable associated with the $j$th individual in the $i$th family. Denote the genotypes of a putative QTL by $Q_{i1}^{s}Q_{i2}^{s}$ and $Q_{i1}^{d}Q_{i2}^{d}$ for the two parents (sire and dam) of the $i$th family. The four possible genotypes in the progeny are $Q_{i1}^{s}Q_{i1}^{d}$, $Q_{i1}^{s}Q_{i2}^{d}$, $Q_{i2}^{s}Q_{i1}^{d}$ and $Q_{i2}^{s}Q_{i2}^{d}$. We denote the values of the four genotypes by $G_{i11}$, $G_{i12}$, $G_{i21}$ and $G_{i22}$, respectively, and collect them in the vector $G_i = (G_{i11}\ G_{i12}\ G_{i21}\ G_{i22})^{T}$. The underlying variable $y_{ij}$ can be treated as a usual quantitative character, which is described by the following linear model:
$$y_{ij} = x_{ij}^{T}\beta + z_{ij}^{T}G_i + \varepsilon_{ij}, \tag{1}$$
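where $\beta$ is a vector of unknown parameters (including the overall mean, common environmental effects shared by family members, polygenic effects and so on), which is related to $y_{ij}$ through a known incidence vector $x_{ij}$; $\varepsilon_{ij}$ is the residual error, distributed as $N(0, \sigma_{\varepsilon}^{2})$; and $z_{ij} = (z_{ij1}\ z_{ij2}\ z_{ij3}\ z_{ij4})^{T}$ is a vector of indicators for the four possible genotypes. The variables $z_{ijk}$ ($k = 1, \ldots, 4$) are defined as follows:
$$z_{ijk} = \begin{cases} 1 & \text{if individual } j \text{ of family } i \text{ carries the } k\text{th of the four genotypes,} \\ 0 & \text{otherwise,} \end{cases}$$
with the genotypes taken in the order in which their values are listed above.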
In quantitative genetic analysis of complex binary traits, $y_{ij}$ itself is not observable. It has been postulated that $y_{ij}$ controls the binary expression of the trait through a threshold model (Wright, 1934). The relationship between the underlying variable $y_{ij}$ and the binary response $s_{ij}$ is assumed to be:
$$s_{ij} = \begin{cases} 1 & \text{if } y_{ij} > t, \\ 0 & \text{if } y_{ij} \le t, \end{cases}$$
for some threshold value $t$. The threshold model is overparameterized, so some constraints must be imposed. As usual, we set $\sigma_{\varepsilon}^{2} = 1$ and $t = 0$ (Harville & Mee, 1984; Sorensen et al., 1995).
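The threshold rule can be illustrated with a minimal sketch. It assumes, for illustration only, a liability that is normally distributed with unit variance; the function names are hypothetical and not part of the original method.

```python
from statistics import NormalDist


def binary_phenotype(liability, threshold=0.0):
    """Map a liability value to the observed 0/1 phenotype."""
    return 1 if liability > threshold else 0


def expected_incidence(mean, threshold=0.0, sd=1.0):
    """Pr(liability > threshold) for a normal liability (illustrative assumption)."""
    return 1.0 - NormalDist().cdf((threshold - mean) / sd)
```

With the threshold fixed at 0 and unit residual variance, the trait incidence is governed by the mean of the liability: a mean of 0 gives an incidence of 50%, and a mean of about -0.84 gives roughly 20%.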
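As a regular quantitative trait, the genotypic values of the liability can be partitioned into additive and dominance effects, i.e.
$$G_{ikl} = \alpha_{ik}^{s} + \alpha_{il}^{d} + \delta_{ikl}, \qquad k, l = 1, 2,$$
where $\alpha_{i1}^{s}$ and $\alpha_{i2}^{s}$ are the effects of the two alleles in the sire, $\alpha_{i1}^{d}$ and $\alpha_{i2}^{d}$ are the effects of the two alleles in the dam, and $\delta_{ikl}$ is the dominance deviation. Unfortunately, $\alpha_{ik}^{s}$, $\alpha_{il}^{d}$ and $\delta_{ikl}$ are not all estimable; some constraints are required. If all the allelic and dominance effects are appropriately scaled (standardized), the following restrictions can be applied:
$$\sum_{k=1}^{2}\alpha_{ik}^{s} = 0, \qquad \sum_{l=1}^{2}\alpha_{il}^{d} = 0, \qquad \sum_{k=1}^{2}\delta_{ikl} = 0 \ (l = 1, 2), \qquad \sum_{l=1}^{2}\delta_{ikl} = 0 \ (k = 1, 2).$$
Under these constraints, there are only three independent estimable effects, which are $\alpha_{i1}^{s}$, $\alpha_{i1}^{d}$ and $\delta_{i11}$. Denote $\alpha_{is} = 2\alpha_{i1}^{s}$, $\alpha_{id} = 2\alpha_{i1}^{d}$, $\delta_i = \delta_{i11}$, $G_i = (G_{i11}\ G_{i12}\ G_{i21}\ G_{i22})^{T}$ and $\gamma_i = (\alpha_{is}\ \alpha_{id}\ \delta_i)^{T}$; then $G_i = H\gamma_i$, where
$$H = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & -\tfrac{1}{2} & -1 \\ -\tfrac{1}{2} & \tfrac{1}{2} & -1 \\ -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{pmatrix}.$$
Hence, model (1) can be re-expressed as:
$$y_{ij} = x_{ij}^{T}\beta + w_{ij}^{T}\gamma_i + \varepsilon_{ij}, \tag{5}$$
where $w_{ij}^{T} = z_{ij}^{T}H$.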
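As a small check of this reparameterization, the sketch below builds the 4 × 3 matrix H implied by sum-to-zero constraints on the allelic and dominance effects and maps (αis, αid, δi) to the four genotypic values. The numerical entries of H are derived here under that assumption, and the function names are hypothetical.

```python
# Rows correspond to G_i11, G_i12, G_i21, G_i22; columns to alpha_is, alpha_id, delta_i.
# Entries assume sum-to-zero constraints on allelic and dominance effects.
H = [
    [0.5, 0.5, 1.0],
    [0.5, -0.5, -1.0],
    [-0.5, 0.5, -1.0],
    [-0.5, -0.5, 1.0],
]


def genotypic_values(alpha_s, alpha_d, delta):
    """Return G_i = H * gamma_i for gamma_i = (alpha_s, alpha_d, delta)."""
    gamma = (alpha_s, alpha_d, delta)
    return [sum(h * g for h, g in zip(row, gamma)) for row in H]


def design_row(z):
    """Return w^T = z^T H for a length-4 genotype indicator (or probability) vector z."""
    return [sum(z[k] * H[k][c] for k in range(4)) for c in range(3)]
```

For example, effects (1.0, 0.4, 0.2) give genotypic values (0.9, 0.1, -0.5, -0.5), which sum to zero as the constraints require.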
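If the QTL is not at a marker locus, its genotype is unobservable, so $z_{ij}$ is missing. However, the distribution of $z_{ij}$ can be inferred from the genotypes of linked markers. Define the probability of $z_{ijk} = 1$ conditional on marker information $I_M$ by
$$p_{ijk} = \Pr(z_{ijk} = 1 \mid I_M), \qquad k = 1, \ldots, 4.$$
These conditional probabilities can be calculated using a multipoint method. Details of the general multipoint method can be found in Kruglyak & Lander (1995) and Rao & Xu (1998).
Given these conditional probabilities, the expectation vector and covariance matrix of $z_{ij}$ are
$$E(z_{ij} \mid I_M) = p_{ij} = (p_{ij1}\ p_{ij2}\ p_{ij3}\ p_{ij4})^{T}$$
and
$$\mathrm{Var}(z_{ij} \mid I_M) = \Sigma_{ij} = \mathrm{diag}(p_{ij}) - p_{ij}p_{ij}^{T}.$$
Therefore, the conditional expectation and variance of $y_{ij}$ can be derived as follows:
$$\mu_{ij} = E(y_{ij} \mid I_M, \beta, \gamma_i) = x_{ij}^{T}\beta + p_{ij}^{T}H\gamma_i$$
and
$$V_{ij} = \mathrm{Var}(y_{ij} \mid I_M, \beta, \gamma_i) = \gamma_i^{T}H^{T}\Sigma_{ij}H\gamma_i + \sigma_{\varepsilon}^{2}.$$
It should be noted that the conditional distribution of $y_{ij}$ is a mixture of four normal distributions. Nevertheless, when the QTL effects are small relative to $\sigma_{\varepsilon}$ and the marker information content is high, the conditional distribution of $y_{ij}$ can be close to a normal distribution. Therefore, model (5) can be approximated by the following heterogeneous residual variance model
$$y_{ij} = x_{ij}^{T}\beta + p_{ij}^{T}H\gamma_i + e_{ij}, \tag{6}$$
where $e_{ij} \sim N(0, V_{ij})$.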
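The conditional mean and variance of the liability can be computed directly from the four genotype probabilities. The sketch below is illustrative only: it assumes the matrix H implied by sum-to-zero constraints, a unit residual variance and a threshold of 0, and all function names are hypothetical.

```python
from statistics import NormalDist

# H maps gamma_i = (alpha_is, alpha_id, delta_i) to (G_i11, G_i12, G_i21, G_i22);
# its entries assume sum-to-zero constraints (an assumption of this sketch).
H = [
    [0.5, 0.5, 1.0],
    [0.5, -0.5, -1.0],
    [-0.5, 0.5, -1.0],
    [-0.5, -0.5, 1.0],
]


def liability_moments(p, gamma, xb=0.0, resid_var=1.0):
    """Mean and variance of the liability given genotype probabilities p.

    p: four conditional genotype probabilities (summing to 1).
    gamma: (alpha_is, alpha_id, delta_i); xb: value of x'beta.
    """
    g = [sum(h * c for h, c in zip(row, gamma)) for row in H]
    mean_g = sum(pk * gk for pk, gk in zip(p, g))
    # Variance of the genotypic value over the genotype distribution,
    # equal to gamma' H' (diag(p) - p p') H gamma.
    var_g = sum(pk * gk * gk for pk, gk in zip(p, g)) - mean_g ** 2
    return xb + mean_g, var_g + resid_var


def prob_affected(p, gamma, xb=0.0):
    """Normal approximation to Pr(s = 1) with the threshold fixed at 0."""
    mu, v = liability_moments(p, gamma, xb)
    return NormalDist().cdf(mu / v ** 0.5)
```

When the genotype is known with certainty (one probability equal to 1), the variance reduces to the residual variance; when the markers are uninformative, the full within-family QTL variance is added to it.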
Estimating genetic parameters
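Under model (6), i.e. $y_{ij} \mid (I_M, \beta, \gamma_i) \sim N(\mu_{ij}, V_{ij})$, the probability of $s_{ij} = 1$ is
$$P_{ij} = \Pr(s_{ij} = 1) = \Pr(y_{ij} > 0) = \Phi\!\left(\frac{\mu_{ij}}{\sqrt{V_{ij}}}\right),$$
which leads to $\Pr(s_{ij} = 0) = 1 - \Pr(s_{ij} = 1) = 1 - P_{ij}$, where $\Phi(\cdot)$ is the standardized normal distribution function.
In the fixed model, conditional on the genetic effects, $\{s_{ij}\}$ ($i = 1, \ldots, n$; $j = 1, \ldots, n_i$) are mutually independent. Therefore, we have the following log-likelihood function
$$L(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\left[s_{ij}\log P_{ij} + (1 - s_{ij})\log(1 - P_{ij})\right].$$
The maximum likelihood estimates of $\theta = (\beta^{T}\ \gamma_1^{T} \ldots \gamma_n^{T})^{T}$ can be obtained using any convenient algorithm. We found that the Fisher-scoring algorithm is easy to derive and also extremely fast; the algorithm is described as follows:
$$\theta^{(k+1)} = \theta^{(k)} + F(\theta^{(k)})^{-1}S(\theta^{(k)}), \tag{8}$$
where $k$ denotes an iteration index,
$$S(\theta) = \frac{\partial L(\theta)}{\partial \theta}$$
is the score function, and
$$F(\theta) = E\!\left[-\frac{\partial^{2} L(\theta)}{\partial\theta\,\partial\theta^{T}}\right] = E\!\left[S(\theta)S(\theta)^{T}\right]$$
is the Fisher information matrix. The components of the score vector and the Fisher information matrix are given in the Appendix.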
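The Fisher-scoring update can be sketched for a deliberately simplified case: a probit model with an intercept and a single covariate, where the residual variance of each record is held fixed rather than depending on the QTL effects. This simplification, the two-parameter form and the function name are assumptions made for illustration; the sketch is not the authors' implementation.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fisher_scoring_probit(s, x, v, n_iter=25, tol=1e-10):
    """Fit Pr(s=1) = Phi((b0 + b1*x) / sqrt(v)) by Fisher scoring.

    s: 0/1 responses; x: covariate values; v: fixed residual variances.
    Returns ((b0, b1), inv_info), where inv_info is the inverse Fisher
    information evaluated at the last iterate before the final step.
    """
    b0 = b1 = 0.0
    inv = None
    for _ in range(n_iter):
        s0 = s1 = f00 = f01 = f11 = 0.0
        for si, xi, vi in zip(s, x, v):
            sd = vi ** 0.5
            eta = (b0 + b1 * xi) / sd
            # Clamp probabilities away from 0 and 1 to avoid division by zero.
            p = min(max(_STD_NORMAL.cdf(eta), 1e-12), 1.0 - 1e-12)
            d = _STD_NORMAL.pdf(eta) / sd      # dP/db0; dP/db1 = d * xi
            r = (si - p) / (p * (1.0 - p))
            wgt = d * d / (p * (1.0 - p))
            s0 += r * d                        # score components
            s1 += r * d * xi
            f00 += wgt                         # Fisher information components
            f01 += wgt * xi
            f11 += wgt * xi * xi
        det = f00 * f11 - f01 * f01
        inv = [[f11 / det, -f01 / det], [-f01 / det, f00 / det]]
        step0 = inv[0][0] * s0 + inv[0][1] * s1
        step1 = inv[1][0] * s0 + inv[1][1] * s1
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return (b0, b1), inv
```

The returned inverse information matrix plays the role of the approximate variance–covariance matrix of the estimates. In the full model the variance of each record also depends on the QTL effects, so the derivatives contain additional terms.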
Each update of the Fisher-scoring algorithm involves inverting $F(\theta)$, which can be time consuming for a large number of families or a high-dimensional $\beta$. This cost can be avoided by taking advantage of the partitioned structure of $F(\theta)$: because the lower-right part of $F(\theta)$ is block diagonal, the Fisher-scoring algorithm can be simplified (see Appendix). After obtaining the maximum likelihood estimates of the parameters $\beta$, $\alpha_{is}$, $\alpha_{id}$ and $\delta_i$, the inverse of $F(\hat{\theta})$ has to be calculated to obtain the variance–covariance matrix of the estimates. This inverse also has a simple form (see Appendix).
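Given $\hat{\alpha}_{is}$, $\hat{\alpha}_{id}$ and $\hat{\delta}_i$, one can estimate the cross-family variances. If parents of the sampled families are randomly sampled from the base population, the cross-family variances provide approximate estimates of the QTL segregation variances in the base population. The additive and dominance variances in the base population are $\sigma_{\alpha}^{2} = \tfrac{1}{4}[\mathrm{Var}(\alpha_{is}) + \mathrm{Var}(\alpha_{id})]$ and $\sigma_{\delta}^{2} = \mathrm{Var}(\delta_i)$, respectively.
The additive variance $\sigma_{\alpha}^{2}$ and dominance variance $\sigma_{\delta}^{2}$ can be estimated by:
$$\hat{\sigma}_{\alpha}^{2} = \frac{1}{4n}\sum_{i=1}^{n}\left[\hat{\alpha}_{is}^{2} + \hat{\alpha}_{id}^{2} - \mathrm{Var}(\hat{\alpha}_{is}) - \mathrm{Var}(\hat{\alpha}_{id})\right] \tag{9}$$
and
$$\hat{\sigma}_{\delta}^{2} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\delta}_{i}^{2} - \mathrm{Var}(\hat{\delta}_{i})\right]. \tag{10}$$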
If Var(α^is), Var(α^id) and Var(δ^i) are obtained exactly, formulas (9) and (10) provide unbiased estimates of σ2α and σδ2, respectively. However, Var(α^is), Var(α^id) and Var(δ^i) can only be estimated approximately by the inverse of the Fisher information matrix (see the next section), and thus the estimates are only asymptotically unbiased.
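The conversion from per-family effect estimates to variance components can be sketched as follows. The sketch assumes a bias-corrected form in which the squared estimates are averaged over families after subtracting their sampling variances; this form and the function names are assumptions for illustration.

```python
def qtl_variances(alpha_s, alpha_d, delta, var_alpha_s, var_alpha_d, var_delta):
    """Return (additive variance, dominance variance) across families.

    Each argument is a list with one entry per family: the estimated
    effects and the sampling variances of those estimates.
    """
    n = len(alpha_s)
    additive = sum(
        a * a - va + b * b - vb
        for a, va, b, vb in zip(alpha_s, var_alpha_s, alpha_d, var_alpha_d)
    ) / (4.0 * n)
    dominance = sum(d * d - vd for d, vd in zip(delta, var_delta)) / n
    return additive, dominance


def qtl_heritability(additive, dominance, resid_var=1.0):
    """Broad-sense QTL heritability on the liability scale."""
    genetic = additive + dominance
    return genetic / (genetic + resid_var)
```

Because the sampling variances are subtracted, an estimate can be negative when the effects are small relative to their errors; this sketch returns such values without truncation.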
Hypothesis testing
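A useful property of the Fisher-scoring algorithm is that the variance–covariance matrix of $\hat{\theta} = (\hat{\beta}^{T}\ \hat{\gamma}_1^{T} \ldots \hat{\gamma}_n^{T})^{T}$ can be approximated by the inverse of the Fisher information matrix, i.e. $\mathrm{Var}(\hat{\theta}) \approx F(\hat{\theta})^{-1}$. Because the resulting estimates $\hat{\theta}$ are maximum likelihood estimates, they approximately follow a multivariate normal distribution if the family sizes $n_i$ are large enough, i.e.
$$\hat{\theta} \sim N\!\left(\theta, F(\hat{\theta})^{-1}\right).$$
As a consequence, the following test statistic, $w$, will follow an approximate chi-squared distribution with $m$ degrees of freedom under the null hypothesis that $C\theta = 0$:
$$w = (C\hat{\theta})^{T}\left[C F(\hat{\theta})^{-1} C^{T}\right]^{-1}(C\hat{\theta}),$$
where $m$ is the rank of matrix $C$.
The exact form of matrix $C$ determines the type of hypothesis test. To test the overall null hypothesis $H_0$: $\alpha_{is} = \alpha_{id} = \delta_i = 0$ for all $i$, set $C = (0\ \ I_{3n})$, where the zero block has one column for each element of $\beta$ and $I_{3n}$ is the identity matrix of order $3n$. Theoretically, there are many other hypotheses to test. In this study, we carry out only the overall test that no QTL at the locus of interest is segregating.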
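For the overall test, the statistic reduces to a quadratic form in the estimated QTL effects, weighted by the inverse of their approximate variance–covariance matrix. The following sketch computes that quadratic form with a small Gaussian-elimination solver; the function names are hypothetical, and the covariance matrix is assumed to be the block of the inverse Fisher information that corresponds to the QTL effects.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def wald_statistic(gamma_hat, cov_gamma):
    """Return gamma' * inverse(cov) * gamma for the tested effects."""
    return sum(g * y for g, y in zip(gamma_hat, solve(cov_gamma, gamma_hat)))
```

With a single family the tested vector has three elements (two allelic effects and one dominance effect); with n families it has 3n elements, which is the number of degrees of freedom of the overall test.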
Simulation studies
Design of simulations
Properties of the proposed method were investigated numerically via Monte Carlo simulations. The following properties were examined: the bias and standard error of the parameter estimates, and the statistical power of QTL detection. We considered the effects of the following factors on the performance of the mapping procedure: (i) the variance explained by the QTL; (ii) the sampling strategy (number of families vs. family size); and (iii) the trait incidence (proportion of affected individuals). We simulated one chromosome of length 100 cM with 11 codominant markers evenly spaced along the chromosome. A single QTL was simulated at 25 cM. The simulation was repeated 100 times for each situation. The total number of individuals was set at 750 in all simulations. The standard error of the parameter estimates was calculated as the standard deviation of the estimates among the 100 replicates. The statistical power was determined as the proportion of the 100 replicates with test statistics greater than an empirical critical value. The empirical critical values under each condition were obtained as the 95th and 99th percentiles of the highest test statistic over 1000 additional runs under the null model (no QTL segregating).
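The empirical critical value and power calculation can be sketched as below. The percentile convention (the smallest value with at least the stated percentage of null statistics at or below it) and the function names are assumptions of this sketch; the source does not specify the convention used.

```python
def empirical_critical_value(null_stats, percentile):
    """Return the given percentile (an integer, e.g. 95) of the null statistics."""
    ordered = sorted(null_stats)
    # Integer ceiling of percentile * n / 100, converted to a 0-based index.
    idx = -(-percentile * len(ordered) // 100) - 1
    return ordered[max(idx, 0)]


def empirical_power(test_stats, critical_value):
    """Proportion of replicates whose statistic exceeds the critical value."""
    return sum(t > critical_value for t in test_stats) / len(test_stats)
```

Here `null_stats` would hold the highest test statistic along the chromosome from each null run, and `test_stats` the corresponding maxima from the replicates simulated with a QTL.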
Five equally frequent alleles were simulated for each marker locus. This setting allows each parent to have a 20% chance of being homozygous at each marker locus. In all situations, the residual error was assumed to be normally distributed, with a variance set at σ2ɛ=1.0. The broad-sense heritability of QTL, h2q(=σ2g/ (σ2g+σ2ɛ)= (σ2α+σ2δ)/(σ2α+σ2δ+σ2ɛ)), was examined at four levels: 0.4, 0.3, 0.2 and 0.1. Only a mixed mode of QTL inheritance was considered, i.e. σ2α=σ2δ. Therefore, h2q=0.1 corresponds to σ2α=σ2δ= 0.06, h2q=0.2 corresponds to σ2α=σ2δ= 0.125, h2q=0.3 corresponds to σ2α= σ2δ=0.215, and h2q=0.4 corresponds to σ2α=σ2δ=0.33. The sampling strategy was simulated at three levels: 5 × 150, 10 × 75 and 15 × 50 (number of families × family size). The trait incidence was set at 50% and 20%.
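The correspondence between the QTL heritability and the variance components follows from solving the heritability definition for the genetic variance with the residual variance fixed at 1 and equal additive and dominance variances. The short sketch below reproduces the arithmetic; the function names are hypothetical.

```python
def variance_per_component(h2q, resid_var=1.0):
    """Additive (= dominance) variance implied by a QTL heritability h2q."""
    total_genetic = h2q * resid_var / (1.0 - h2q)
    return total_genetic / 2.0


def prob_homozygous(n_alleles):
    """Chance that a parent is homozygous with equally frequent alleles."""
    return n_alleles * (1.0 / n_alleles) ** 2
```

The exact values are about 0.056, 0.125, 0.214 and 0.333 for heritabilities of 0.1, 0.2, 0.3 and 0.4, which the text reports rounded as 0.06, 0.125, 0.215 and 0.33. Five equally frequent alleles give a homozygosity probability of 0.2.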
QTL allelic effects were considered to be normally distributed with preassigned additive and dominance variances. Each parent of a family was made of two alleles: the first allele was assigned a value sampled from a standard normal distribution, and the second assigned the negative value of the first allele. The dominance effect was the interaction effect between any two sampled alleles, which was assigned a value sampled from a N(0,1) distribution, independent of the allelic effects. When offspring were generated, their genetic values at the QTL were re-scaled so that they had the appropriate assigned variances. The liability of each offspring was the sum of its genetic value, the overall mean μ, and a residual error sampled from the N(0,1) distribution. The observable binary phenotype was set to 1 if the corresponding liability exceeded 0, and 0 otherwise. The overall mean of the liability, contained in β, determines the proportion of the trait incidence. What we did was to select the appropriate mean so that a preassigned level of incidence was obtained.
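A simplified version of this simulation scheme, without markers, can be sketched as follows. The scaling of the effects (chosen so that the across-family variances equal the assigned values), the sign pattern of the dominance deviations, the normal approximation used to pick the mean, and all function names are assumptions of this sketch rather than details given in the text.

```python
import random
from statistics import NormalDist


def mean_for_incidence(incidence, var_add, var_dom):
    """Approximate liability mean giving the target incidence.

    Treats the total liability as normal with variance
    1 + var_add + var_dom (an approximation).
    """
    total_sd = (1.0 + var_add + var_dom) ** 0.5
    return NormalDist().inv_cdf(incidence) * total_sd


def simulate_family(n_offspring, var_add, var_dom, mean, rng):
    """Simulate (sire allele, dam allele, phenotype) for one full-sib family."""
    # Parental effects: first allele ~ N(0, 1), second allele is its negative.
    a_s, a_d, d = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    scale_a = (var_add / 2.0) ** 0.5  # k*a_s + l*a_d has variance 2 before scaling
    scale_d = var_dom ** 0.5
    records = []
    for _ in range(n_offspring):
        k = rng.choice((1, -1))  # +1: first sire allele, -1: second
        l = rng.choice((1, -1))  # +1: first dam allele, -1: second
        g = scale_a * (k * a_s + l * a_d) + scale_d * (k * l * d)
        liability = mean + g + rng.gauss(0, 1)
        records.append((k, l, 1 if liability > 0 else 0))
    return records
```

For example, `simulate_family(150, 0.125, 0.125, 0.0, random.Random(1))` generates one family of the 5 × 150 design at a QTL heritability of 0.2 and an incidence near 50%.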
Results of simulation
The empirical critical values at Type I error rates of 0.05 and 0.01 for situations with disease incidences of 50% and 20% are given in Table 1. The trait incidence has a small effect on the empirical critical values. The empirical critical values are slightly higher than the critical values of the χ2 distribution with corresponding degrees of freedom (3 × n). As the number of families increases, the empirical critical value increases dramatically. This is expected because increasing the number of families increases the number of parameters tested.
The estimates of QTL parameters (QTL position, additive and dominance variances, and heritability) and the empirical power are summarized in Tables 2, 3 and 4. The proposed method successfully locates the QTL position and estimates the additive and dominance variance components as well as the heritability with negligible biases. However, the QTL heritability, the sampling strategy and the trait incidence have strong impacts on the performance of the mapping procedure.
The performance of the proposed method is strongly affected by the trait incidence. In all cases, the estimates of the QTL parameters are more accurate and the statistical power is higher with a trait incidence of 50% than with a trait incidence of 20% (see Tables 2 and 3). Under the threshold model, some information is lost in the translation from the underlying liability into the observed binary phenotype. The trait incidence determines the amount of lost information. The closer the trait incidence is to 50%, the less information is lost. This explains why the proposed method performs better in terms of the accuracy of parameter estimation and statistical power when the trait incidence is 50%.
The proportion of the phenotypic variance explained by the QTL, i.e. the QTL heritability h2q, has an effect on the accuracy of the estimated parameters and the statistical power (see Tables 2, 3 and 4). As expected, a lower QTL heritability (h2q=0.1) increases the standard deviation of the estimated QTL position and decreases the statistical power. Higher heritability levels tend to be associated with a slightly larger standard deviation in the estimated variances and heritabilities, because of the scaling effect, i.e. the standard deviation is correlated with the mean. In the case of high QTL heritability (h2q=0.4), the genetic variances and heritability are underestimated, although the estimated QTL position is accurate and the statistical power is high (Table 4). This is expected because the mixture of normal distributions cannot be approximated well by a single normal distribution when the QTL heritability is high.
The sampling strategy also has an impact on the estimation of the QTL position and the statistical power (see Tables 2, 3 and 4). When the number of families increases, the standard deviation of the estimated QTL position increases. In the case of low QTL heritability (h2q=0.1), the statistical power decreases as the number of families increases. When the power is already very high (e.g. h2q>0.1), the effect of the sampling strategy is expected to be negligible. However, the sampling strategy has a small effect on the precision of the QTL variances and estimated heritabilities. Overall, QTL mapping performs well with a few large families.
Discussion
We have developed a general framework of QTL mapping for complex binary traits by combining data from multiple families. Instead of analysing each family separately, this method carries out a joint statistical inference for multiple families. The link among families is reflected by the common fixed effect vector β (see models 5 and 6), which includes the overall mean μ and some common genetic and nongenetic factors, e.g. common environmental effects shared by these families and maternal effects. Because of these common effects, the estimates and tests of genetic parameters in different families are correlated. Therefore, a joint test for multiple families is more powerful than a test considering each family separately (Rebai & Goffinet, 1993). In addition, there are other reasons to justify the use of the proposed consensus method. Like most human diseases, complex binary traits in animal and plant populations undoubtedly have a complex genetic basis. Some QTLs controlling complex binary traits may be homozygous in any single individual and different individuals may be heterozygous for different QTLs. Ideally, several families should be selected and analysed jointly. As a result of using multiple families, the method has a wider statistical inference space than using a single family. Theoretically, the variance attributable to the QTL is better estimated with a large number of families. However, the number of parameters dramatically increases as the number of families sampled increases, which undoubtedly reduces the statistical efficiency of the proposed method. Therefore, with a fixed number of individuals, there is an optimal allocation between the number of families and the number of individuals per family where QTL mapping reaches its maximum power and minimum estimation error. 
Limited investigations have shown that a mating with several parents (5–10) should give a good sample of variance and allow the detection of QTL with reasonable power (Muranty, 1996; Xu, 1998a).
A typical problem in QTL mapping comes from missing QTL genotypes. In QTL mapping using outbred line crosses, when the putative QTL is not at a marker, the liability is actually a mixture of four normal distributions. Theoretically, the optimal treatment of the unobservable genotype is the mixture model maximum likelihood method, which uses all information contained in the data (Lander & Botstein, 1989; Zeng, 1994). The heterogeneous residual variance model proposed here uses a single distribution to approximate the four distributions, assuming that the residual is normally distributed. This approximation is feasible only when the QTL effects are small relative to the residual variance. Some comparisons between the heterogeneous residual variance model and the mixture model have been made in QTL mapping, showing that the two methods are virtually identical for normally distributed traits, even when the QTL effects are large (Xu, 1998b). Our simulations show that the proposed method performs well when the QTL heritability is not overly high. In the analysis of real data, the effect of any individual QTL being tested is usually small for most polygenic traits, which makes the proposed method valid for most situations. There are some advantages of the heterogeneous residual variance model over the mixture model. First, a simple Fisher-scoring algorithm is available for the heterogeneous residual variance model, which provides, as a by-product, an estimate of the variance–covariance matrix of the estimated parameters. Therefore, it is straightforward to conduct hypothesis tests. The Fisher-scoring method is difficult to derive for the mixture model. As a result, computing the variance–covariance matrix of the estimated parameters is difficult with the mixture model. Secondly, the heterogeneous residual variance model implemented via the Fisher-scoring algorithm is fast, which allows a multiple sampling technique, e.g. the permutation test, to be used more conveniently.
The results presented in this study are based on known linkage phases. Therefore, implementation of this method requires the knowledge of marker linkage phases in the parents. There are several ways to deduce the linkage phase in outbred pedigrees (e.g. Maliepaard et al., 1997).
The proposed method is a fixed-model approach because the genetic effects are treated as fixed. As observed in the simulation studies, the method is efficient when there is a small number of families with large family sizes. However, as the number of families increases, the substantial number of parameters to be estimated often generates some statistical problems, in particular when the family sizes are small. The random model approach, on the other hand, estimates only a few parameters, because only the variances are estimated and tested. With the random model approach, statistical analysis can be carried out even when the family sizes are small. The random model approach, however, has not yet been developed for complex binary traits and deserves further investigation.
References
Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modeling Based on Generalized Linear Models. Springer-Verlag, New York.
Falconer, D. S. and Mackay, T. F. C. (1996). Introduction to Quantitative Genetics, 4th edn. Longman, London.
Grignola, F. E., Hoeschele, I. and Tier, B. (1996a). Mapping quantitative trait loci via residual maximum likelihood: I. Methodology. Génét Sél Évol, 28: 479–490.
Grignola, F. E., Hoeschele, I. and Tier, B. (1996b). Mapping quantitative trait loci via residual maximum likelihood: II. A simulation study. Génét Sél Évol, 28: 491–504.
Hackett, C. A. and Weller, J. I. (1995). Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics, 51: 1252–1263.
Harville, D. A. and Mee, R. W. (1984). A mixed-model procedure for analyzing ordered categorical data. Biometrics, 40: 393–408.
Knott, S. A. and Haley, C. S. (1992). Maximum likelihood mapping of quantitative trait loci using full-sib families. Genetics, 132: 1211–1222.
Knott, S. A., Elsen, J. M. and Haley, C. S. (1996). Methods for multiple-marker mapping of quantitative trait loci in half-sib populations. Theor Appl Genet, 93: 71–80.
Kruglyak, L. and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet, 57: 439–454.
Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121: 185–199.
Maliepaard, C., Jansen, J. and van Ooijen, J. W. (1997). Linkage analysis in a full-sib family of an outbreeding plant species: overview and consequence for applications. Genet Res, 70: 237–250.
McCulloch, C. E. (1994). Maximum likelihood variance components estimation for binary data. J Am Stat Ass, 89: 330–335.
Muranty, H. (1996). Power of tests for quantitative trait loci detection using full-sib families in different schemes. Heredity, 76: 156–165.
Rao, S. and Xu, S. (1998). Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity, 81: 214–224.
Rebai, A. (1997). Comparison of methods for regression interval mapping in QTL analysis with non-normal traits. Genet Res, 69: 69–74.
Rebai, A. and Goffinet, B. (1993). Power of tests for QTL detection using replicated progenies derived from a diallel cross. Theor Appl Genet, 86: 1014–1022.
Sorensen, D. A., Anderson, D., Gianola, D. and Korsgaard, I. (1995). Bayesian inference in threshold models using Gibbs sampling. Génét Sél Évol, 27: 229–249.
Visscher, P. M., Haley, C. S. and Knott, S. A. (1996). Mapping QTLs for binary traits in backcross and F2 populations. Genet Res, 68: 55–63.
Wright, S. (1934). An analysis of variability in number of digits in an inbred strain of guinea pigs. Genetics, 19: 506–536.
Xu, S. (1998a). Mapping quantitative trait loci using multiple families of line crosses. Genetics, 148: 517–524.
Xu, S. (1998b). Iteratively reweighted least squares mapping of quantitative trait loci. Behav Genet, 28: 341–355.
Xu, S. and Atchley, W. R. (1996). Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics, 143: 1417–1424.
Xu, S., Yonash, N., Vallejo, R. L. and Cheng, H. H. (1998). Mapping quantitative trait loci for binary traits using a heterogeneous residual variance model: an application to Marek's disease susceptibility in chickens. Genetica, 104: 171–178.
Zeng, Z. B. (1994). Precision mapping of quantitative trait loci. Genetics, 136: 1457–1468.
Acknowledgements
The authors thank Drs D. Gessler and C. Xie for their helpful comments on the manuscript. This research was supported by the National Institutes of Health Grant GM55321-01 and the USDA National Research Initiative Competitive Grants Program 97-35205-5075.
Appendix
The components of the score vector and the Fisher information matrix
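The components of the score vector and the Fisher information matrix are given by
$$S(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\frac{s_{ij} - P_{ij}}{P_{ij}(1 - P_{ij})}\,\frac{\partial P_{ij}}{\partial\theta}$$
and
$$F(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{n_i}\frac{1}{P_{ij}(1 - P_{ij})}\,\frac{\partial P_{ij}}{\partial\theta}\,\frac{\partial P_{ij}}{\partial\theta^{T}},$$
with
$$\frac{\partial P_{ij}}{\partial\beta} = \phi(\eta_{ij})\,\frac{x_{ij}}{\sqrt{V_{ij}}}, \qquad \frac{\partial P_{ij}}{\partial\gamma_i} = \phi(\eta_{ij})\left[\frac{H^{T}p_{ij}}{\sqrt{V_{ij}}} - \frac{\mu_{ij}\,H^{T}\Sigma_{ij}H\gamma_i}{V_{ij}^{3/2}}\right], \qquad \frac{\partial P_{ij}}{\partial\gamma_{i'}} = 0 \ (i' \ne i),$$
where $\eta_{ij} = \mu_{ij}/\sqrt{V_{ij}}$, $p_{ij}$ and $\Sigma_{ij}$ are the conditional expectation vector and covariance matrix of $z_{ij}$, and $\phi(\cdot)$ is the probability density of the standardized normal distribution.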
A simple algorithm for estimating parameters
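Denote
$$S_{\beta} = \frac{\partial L(\theta)}{\partial\beta}, \qquad S_{\gamma_i} = \frac{\partial L(\theta)}{\partial\gamma_i}$$
and
$$F_{\beta\beta} = E(S_{\beta}S_{\beta}^{T}), \qquad F_{\beta\gamma_i} = F_{\gamma_i\beta}^{T} = E(S_{\beta}S_{\gamma_i}^{T}), \qquad F_{\gamma_i\gamma_i} = E(S_{\gamma_i}S_{\gamma_i}^{T}).$$
The Fisher information matrix is partitioned into
$$F(\theta) = \begin{pmatrix} F_{\beta\beta} & F_{\beta\gamma_1} & \cdots & F_{\beta\gamma_n} \\ F_{\gamma_1\beta} & F_{\gamma_1\gamma_1} & & 0 \\ \vdots & & \ddots & \\ F_{\gamma_n\beta} & 0 & & F_{\gamma_n\gamma_n} \end{pmatrix}.$$
Since the lower-right part of $F(\theta)$ is block diagonal, algorithm (8) can be re-expressed more simply as
$$F_{\beta\beta}\,\Delta\beta^{(k)} + \sum_{i=1}^{n}F_{\beta\gamma_i}\,\Delta\gamma_i^{(k)} = S_{\beta}, \qquad F_{\gamma_i\beta}\,\Delta\beta^{(k)} + F_{\gamma_i\gamma_i}\,\Delta\gamma_i^{(k)} = S_{\gamma_i} \quad (i = 1, \ldots, n),$$
where $\Delta\beta^{(k)} = \beta^{(k+1)} - \beta^{(k)}$ and $\Delta\gamma_i^{(k)} = \gamma_i^{(k+1)} - \gamma_i^{(k)}$.
After some transformations, the following algorithm is obtained, in which each iteration step passes through the data twice, first to obtain the correction for $\beta$ (Fahrmeir & Tutz, 1994):
$$\Delta\beta^{(k)} = \left(F_{\beta\beta} - \sum_{i=1}^{n}F_{\beta\gamma_i}F_{\gamma_i\gamma_i}^{-1}F_{\gamma_i\beta}\right)^{-1}\left(S_{\beta} - \sum_{i=1}^{n}F_{\beta\gamma_i}F_{\gamma_i\gamma_i}^{-1}S_{\gamma_i}\right)$$
and then
$$\Delta\gamma_i^{(k)} = F_{\gamma_i\gamma_i}^{-1}\left(S_{\gamma_i} - F_{\gamma_i\beta}\,\Delta\beta^{(k)}\right), \qquad i = 1, \ldots, n.$$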
A simple method for calculating F(θ)−1
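Since the lower-right part of $F(\theta)$ is block diagonal, $F(\theta)^{-1}$ is obtained using standard formulae for inverting partitioned matrices (Fahrmeir & Tutz, 1994). Writing $F_{\beta\beta}$, $F_{\beta\gamma_i} = F_{\gamma_i\beta}^{T}$ and $F_{\gamma_i\gamma_i}$ for the blocks of $F(\theta)$ corresponding to $\beta$ and $\gamma_i$, the result is summarised as follows:
$$F(\theta)^{-1} = \begin{pmatrix} A & B_1 & \cdots & B_n \\ B_1^{T} & D_{11} & \cdots & D_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ B_n^{T} & D_{n1} & \cdots & D_{nn} \end{pmatrix},$$
where
$$A = \left(F_{\beta\beta} - \sum_{i=1}^{n}F_{\beta\gamma_i}F_{\gamma_i\gamma_i}^{-1}F_{\gamma_i\beta}\right)^{-1}, \qquad B_i = -A\,F_{\beta\gamma_i}F_{\gamma_i\gamma_i}^{-1}$$
and
$$D_{ij} = \mathbb{1}_{\{i = j\}}\,F_{\gamma_i\gamma_i}^{-1} + F_{\gamma_i\gamma_i}^{-1}F_{\gamma_i\beta}\,A\,F_{\beta\gamma_j}F_{\gamma_j\gamma_j}^{-1}.$$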
Cite this article
Yi, N., Xu, S. Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity 82, 668–676 (1999). https://doi.org/10.1046/j.1365-2540.1999.00529.x