Introduction

The mixture model maximum likelihood (ML) method for quantitative trait loci (QTL) mapping (Lander and Botstein, 1989) is the most efficient method for interval mapping (IM) (Kao, 2000). The least squares (LS; Haley and Knott, 1992) and weighted least squares (Xu, 1998a, 1998b) methods are approximations of the ML method but with improved computational speed. Recently, Feenstra et al. (2006) developed an improved weighted regression method by extending the simple regression method of Haley and Knott (1992) and the weighted regression method of Xu (1998a, 1998b). The authors made a simple assumption that, conditional on marker information, the phenotypic value of individual $j$ is normally distributed with mean $\mu_j$ and variance $\sigma_j^2$. Feenstra et al. (2006) adopted an estimating equation (EE) algorithm to solve for the parameters (regression coefficients and residual error variance).

The exact ML method (Lander and Botstein, 1989) takes into consideration the mixture distribution of the phenotypic value conditional on flanking marker information. The mixture distribution occurs because the genotype of a QTL is unknown. The basic assumption is that the residual error has a known distribution, that is, normal. The simple regression method of Haley and Knott (1992) ignores the uncertainty of the QTL genotype and assumes that $\sigma_j^2=\sigma^2$ for all $j=1,\ldots,n$, where $n$ is the sample size. No other assumption is required. The iteratively reweighted LS (IRLS) method of Xu (1998a, 1998b) takes into account the uncertainty of the QTL genotype, so that $\sigma_j^2$ varies across $j=1,\ldots,n$, but ignores the mixture distribution. In addition, when maximizing the objective function, the IRLS method treats $\sigma_j^2$ as a constant, although $\sigma_j^2$ is a function of $\mu_j$ and thus a function of the QTL effects. The parameter values involved in $\sigma_j^2$ are replaced by their values from the previous iteration. The EE method of Feenstra et al. (2006) takes into account the fact that $\sigma_j^2$ is a function of $\mu_j$ and maximizes the objective function with respect to the parameters wherever they occur in the objective function. Feenstra et al. (2006) compared all the methods and showed that ML>EE>IRLS>LS, that is, ML is the most efficient method and LS is the least efficient one.

Feenstra et al. (2006) used the EE algorithm to solve for the parameters, but provided no explicit iterative equation. We found that an explicit expression of the iteration exists by using a Fisher scoring algorithm. The iteration equation is simple and thus easy to program. In addition, the method automatically provides a variance–covariance matrix for the estimated QTL effects. This covariance matrix is required to construct a W-test statistic. This test statistic may replace the likelihood-ratio test statistic with a computational advantage over the latter, in that only a single likelihood function (under the full model) needs to be maximized.

Methods

Model

The model follows Xu (1998b) and Feenstra et al. (2006), which is

$$y_j = U_j\beta + e_j \qquad (1)$$

where $y_j$ is the phenotypic value of individual $j$ ($j=1,\ldots,n$), $U_j$ is the expectation of variables indicating the QTL genotype given marker information, $\beta$ is a vector of QTL effects (including the population mean) and $e_j$ is the residual error. The residual error has mean zero and variance

$$\mathrm{var}(e_j) = \beta^T\Sigma_j\beta + \sigma^2 \qquad (2)$$

where $\Sigma_j$ is the conditional variance–covariance matrix of the QTL indicator variables given marker information (it is not a summation sign). Note that if the QTL genotype is observed (i.e., the QTL overlaps with a fully informative marker), $\Sigma_j$ would be zero and thus the residual error variance would be identical to the environmental error variance $\sigma^2$. Definitions and formulations of $U_j$ and $\Sigma_j$ are given by Xu (1998b) and also described later in this study for an F2 population. The parameter vector is $\theta=\{\beta,\sigma^2\}$ and the data include $y_j$, $U_j$ and $\Sigma_j$. Let

$$\mu_j = U_j\beta \qquad (3)$$

and

$$\sigma_j^2 = \beta^T\Sigma_j\beta + \sigma^2 \qquad (4)$$

The likelihood function is based on the assumption of independent $y_j \sim N(\mu_j, \sigma_j^2)$ for $j=1,\ldots,n$. Therefore, the logarithm of the likelihood function is

$$L(\theta) = \sum_{j=1}^{n} L_j(\theta) \qquad (5)$$

where, ignoring a constant,

$$L_j(\theta) = -\frac{1}{2}\ln\sigma_j^2 - \frac{(y_j-\mu_j)^2}{2\sigma_j^2} \qquad (6)$$
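The log-likelihood described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that each $y_j$ is normal with mean $U_j\beta$ and variance $\beta^T\Sigma_j\beta+\sigma^2$, drops the additive constant, and uses hypothetical function names (`mean_and_variance`, `log_likelihood`).

```python
import math

def mean_and_variance(U_j, Sigma_j, beta, sigma2):
    """Return (mu_j, sigma_j^2) for one individual.

    Assumes mu_j = U_j . beta and sigma_j^2 = beta' Sigma_j beta + sigma2.
    """
    p = len(beta)
    mu = sum(u * b for u, b in zip(U_j, beta))
    quad = sum(beta[r] * Sigma_j[r][c] * beta[c]
               for r in range(p) for c in range(p))
    return mu, quad + sigma2

def log_likelihood(y, U, Sigma, beta, sigma2):
    """Sum of per-individual normal log-densities (additive constant dropped)."""
    total = 0.0
    for y_j, U_j, Sigma_j in zip(y, U, Sigma):
        mu, var = mean_and_variance(U_j, Sigma_j, beta, sigma2)
        total += -0.5 * math.log(var) - (y_j - mu) ** 2 / (2.0 * var)
    return total
```

Because the variance term depends on $\beta$, changing a QTL effect moves both the mean and the weight of every individual, which is the feature that distinguishes this objective function from ordinary LS.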

Fisher scoring algorithm

The partial derivatives of the likelihood for individual j with respect to the parameters are

where . Further manipulation on Equation (7) leads to

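Let $S(\theta)$ be the score vector and $H(\theta)$ be the Hessian matrix. The information matrix is $I(\theta)=-E[H(\theta)]$. Using the following identity (Wedderburn, 1974)

$$E\left[\frac{\partial^2 L_j}{\partial\theta\,\partial\theta^T}\right] = -E\left[\frac{\partial L_j}{\partial\theta}\,\frac{\partial L_j}{\partial\theta^T}\right] \qquad (9)$$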

we obtained

The score vector is

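The derivations of equations (10) and (11) are presented in Appendix A for advanced readers. The information matrix and the score vector are required to perform the following Fisher scoring algorithm for the ML estimation

$$\theta^{(t+1)} = \theta^{(t)} + I^{-1}(\theta^{(t)})\,S(\theta^{(t)}) \qquad (12)$$

where $\theta^{(t)}$ is the parameter value in the $t$-th iteration, $I(\cdot)$ is the information matrix and $S(\cdot)$ is the score vector. Starting with an initial value $\theta^{(0)}$, we iterate equation (12) until a certain criterion of convergence is reached, that is, until the change from $\theta^{(t)}$ to $\theta^{(t+1)}$ falls below a small tolerance. The ML solution is $\hat\theta=\theta^{(t)}$ for $t$ satisfying the convergence criterion. The variance–covariance matrix of the estimated parameters is given by

$$\mathrm{var}(\hat\theta) = I^{-1}(\hat\theta) \qquad (13)$$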
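The iteration can be illustrated with a short, self-contained sketch. It is not the authors' implementation. The score and expected information below are derived directly from the normal log-likelihood with mean $U_j\beta$ and variance $\beta^T\Sigma_j\beta+\sigma^2$ (an assumption of this sketch, using the general result for a normal model with parameter-dependent mean and variance), and the names `solve`, `score_and_information`, `fisher_scoring` and `covariance` are hypothetical. No safeguard against a non-positive variance during iteration is included.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def score_and_information(y, U, Sigma, theta):
    """Score vector and expected information for theta = beta + [sigma2]."""
    p = len(theta) - 1
    beta, sigma2 = theta[:p], theta[p]
    score = [0.0] * (p + 1)
    info = [[0.0] * (p + 1) for _ in range(p + 1)]
    for y_j, U_j, Sig in zip(y, U, Sigma):
        Sb = [sum(Sig[r][c] * beta[c] for c in range(p)) for r in range(p)]
        mu = sum(u * b for u, b in zip(U_j, beta))
        v = sum(b * s for b, s in zip(beta, Sb)) + sigma2
        r = y_j - mu
        dmu = list(U_j) + [0.0]             # d mu_j / d theta
        dv = [2.0 * s for s in Sb] + [1.0]  # d sigma_j^2 / d theta
        w = (r * r - v) / (2.0 * v * v)
        for a in range(p + 1):
            score[a] += r / v * dmu[a] + w * dv[a]
            for b in range(p + 1):
                info[a][b] += dmu[a] * dmu[b] / v + dv[a] * dv[b] / (2.0 * v * v)
    return score, info

def fisher_scoring(y, U, Sigma, theta0, tol=1e-8, max_iter=100):
    """Iterate theta <- theta + I^{-1} S until the largest step is below tol."""
    theta = list(theta0)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        score, info = score_and_information(y, U, Sigma, theta)
        step = solve(info, score)
        theta = [t + d for t, d in zip(theta, step)]
        if max(abs(d) for d in step) < tol:
            break
    return theta, iterations

def covariance(y, U, Sigma, theta):
    """Inverse of the expected information evaluated at theta."""
    _, info = score_and_information(y, U, Sigma, theta)
    n = len(theta)
    cols = [solve(info, [1.0 if i == k else 0.0 for i in range(n)])
            for k in range(n)]
    return [[cols[c][r] for c in range(n)] for r in range(n)]
```

In the degenerate case where every $\Sigma_j$ is zero and $U_j$ contains only the intercept, the iteration reduces to estimating a normal mean and variance, which gives a convenient sanity check before applying it to marker data.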

W-test statistic

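When the variance–covariance matrix of the estimated parameters is known, the likelihood ratio tests are not needed. Instead, a W-test statistic can be used (Wald, 1941). We now use an F2 mating design as an example to show how to construct the W-test statistic. Let

$$X_{j1}=\begin{cases}+1 & \text{for } A_1A_1\\ \phantom{+}0 & \text{for } A_1A_2\\ -1 & \text{for } A_2A_2\end{cases}\qquad\text{and}\qquad X_{j2}=\begin{cases}0 & \text{for } A_1A_1\\ 1 & \text{for } A_1A_2\\ 0 & \text{for } A_2A_2\end{cases}$$

be indicator variables for the QTL genotype. The linear model of the phenotypic value of individual $j$ is

$$y_j = \beta_0 + X_{j1}\beta_1 + X_{j2}\beta_2 + \varepsilon_j$$

where $\beta_0=\mu$ is the population mean (intercept), $\beta_1=a$ is the additive effect, $\beta_2=d$ is the dominance effect and $\varepsilon_j$ is the environmental error. Using marker information, we calculate

$$U_j = E(X_j) = \begin{bmatrix}1 & p_j(+1)-p_j(-1) & p_j(0)\end{bmatrix}$$

where $X_j=\begin{bmatrix}1 & X_{j1} & X_{j2}\end{bmatrix}$. The $\Sigma_j$ matrix is

$$\Sigma_j = \mathrm{var}(X_j) = \begin{bmatrix}0 & 0 & 0\\ 0 & \mathrm{var}(X_{j1}) & \mathrm{cov}(X_{j1},X_{j2})\\ 0 & \mathrm{cov}(X_{j1},X_{j2}) & \mathrm{var}(X_{j2})\end{bmatrix}$$

where

$$\mathrm{var}(X_{j1}) = p_j(+1)+p_j(-1)-[p_j(+1)-p_j(-1)]^2,\qquad \mathrm{var}(X_{j2}) = p_j(0)[1-p_j(0)]$$

and

$$\mathrm{cov}(X_{j1},X_{j2}) = -[p_j(+1)-p_j(-1)]\,p_j(0)$$

The probabilities of QTL genotypes conditional on marker information are $p_j(+1)$, $p_j(0)$ and $p_j(-1)$, respectively, for the three genotypes. They are calculated based on the multipoint method of Jiang and Zeng (1997).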
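As an illustration of how the conditional mean and variance–covariance matrix of the indicator variables follow from the genotype probabilities, here is a minimal sketch. It assumes the common F2 coding in which the additive indicator takes the values +1, 0 and -1 and the dominance indicator takes the values 0, 1 and 0 for the three genotypes; the function name `f2_moments` and the argument names are hypothetical, and the genotype probabilities are taken as given rather than computed from markers.

```python
def f2_moments(p_plus, p_zero, p_minus):
    """Return (U_j, Sigma_j) from the three conditional genotype probabilities.

    Assumed coding: X_j1 = +1, 0, -1 and X_j2 = 0, 1, 0 for the two
    homozygotes and the heterozygote; the intercept column has zero variance.
    """
    e1 = p_plus - p_minus              # E(X_j1)
    e2 = p_zero                        # E(X_j2)
    v11 = p_plus + p_minus - e1 * e1   # var(X_j1); X_j1^2 is 1 for homozygotes
    v22 = p_zero * (1.0 - p_zero)      # var(X_j2); X_j2 is a 0/1 variable
    v12 = -e1 * e2                     # cov; the product X_j1 * X_j2 is always 0
    U = [1.0, e1, e2]
    Sigma = [[0.0, 0.0, 0.0],
             [0.0, v11, v12],
             [0.0, v12, v22]]
    return U, Sigma
```

For example, with the unconditional F2 proportions (1/4, 1/2, 1/4) the additive indicator has mean 0 and variance 1/2, whereas a genotype known with certainty gives a zero matrix, so the residual variance collapses to the environmental variance.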

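The parameter vector is $\theta=\{\beta_0,\beta_1,\beta_2,\sigma^2\}$. Therefore, $\mathrm{var}(\hat\theta)$ is a 4 × 4 matrix. Let $\hat\gamma=[\hat\beta_1\;\;\hat\beta_2]^T$ and

$$V = \begin{bmatrix}\mathrm{var}(\hat\beta_1) & \mathrm{cov}(\hat\beta_1,\hat\beta_2)\\ \mathrm{cov}(\hat\beta_1,\hat\beta_2) & \mathrm{var}(\hat\beta_2)\end{bmatrix}$$

be a subset of matrix $\mathrm{var}(\hat\theta)$. The W-test statistic for hypothesis $H_0\!:\beta_1=\beta_2=0$ is

$$W = \hat\gamma^T V^{-1}\hat\gamma$$

Under $H_0$, this test statistic follows approximately a $\chi^2$ distribution with two degrees of freedom. Therefore, the W-test statistic is comparable to the likelihood ratio test statistic. The W-test statistic has a simple relationship with the F-test statistic, that is, W=2F, where the F-test statistic follows an F distribution with 2 numerator and n−3 denominator degrees of freedom under $H_0$.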
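A W-test for the joint hypothesis that the additive and dominance effects are both zero can be sketched as follows. This is an illustrative helper (the name `wald_test` and its arguments are hypothetical): it takes the two estimated effects and the corresponding 2 × 2 block of the estimated variance–covariance matrix, forms the quadratic form with the inverse written out explicitly, and uses the fact that the upper-tail probability of a chi-square variable with two degrees of freedom is exp(-W/2).

```python
import math

def wald_test(beta1, beta2, v11, v12, v22):
    """Return (W, p_value) for H0: beta1 = beta2 = 0.

    v11, v12, v22 are the variance of beta1, the covariance, and the
    variance of beta2 taken from the estimated covariance matrix.
    """
    det = v11 * v22 - v12 * v12
    # Quadratic form [b1 b2] V^{-1} [b1 b2]' using the closed-form 2x2 inverse.
    W = (beta1 * beta1 * v22
         - 2.0 * beta1 * beta2 * v12
         + beta2 * beta2 * v11) / det
    p_value = math.exp(-W / 2.0)  # chi-square survival function with 2 df
    return W, p_value
```

Single-effect tests (additive only or dominance only) follow the same pattern with a one-element subset, which is why one fitted full model is enough to produce all three statistics.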

Information matrix of the EM algorithm

The mixture model-based ML method implemented through the expectation maximization (EM) algorithm (Lander and Botstein, 1989) does not have a simple method for calculating the variance–covariance matrix of the estimated parameters. Kao and Zeng's (1997) method for calculating the variance–covariance matrix is quite complicated. If the QTL position is fixed at a particular genome location, that is, the QTL position is not a parameter, their formulas may be simplified. The simplified version of the variance–covariance matrix is relatively easy to program. To compare the Fisher scoring algorithm with the mixture model-based ML method for the variance–covariance matrix of the estimated parameters, we introduce the variance–covariance matrix under the EM algorithm here. The EM algorithm is derived based on the following full data QTL model (assuming that QTL genotypes are observable)

$$y_j = X_j\beta + \varepsilon_j$$

The score function when $X_j$ is fully observed is

$$S(\theta,X) = \sum_{j=1}^{n}\begin{bmatrix}\dfrac{1}{\sigma^2}X_j^T(y_j-X_j\beta)\\[2ex] -\dfrac{1}{2\sigma^2}+\dfrac{1}{2\sigma^4}(y_j-X_j\beta)^2\end{bmatrix}$$

The Hessian matrix (second partial derivatives) is

$$H(\theta,X) = \sum_{j=1}^{n}\begin{bmatrix}-\dfrac{1}{\sigma^2}X_j^TX_j & -\dfrac{1}{\sigma^4}X_j^T(y_j-X_j\beta)\\[2ex] -\dfrac{1}{\sigma^4}(y_j-X_j\beta)X_j & \dfrac{1}{2\sigma^4}-\dfrac{1}{\sigma^6}(y_j-X_j\beta)^2\end{bmatrix}$$

The Louis' (1982) information matrix is

$$I(\theta) = -E[H(\theta,X)] - E[S(\theta,X)S^T(\theta,X)]$$

The expectations are taken with respect to the missing value $X$ using the posterior probabilities of QTL genotypes. Calculation of the first term is straightforward, but the second term is hard to compute. The method of Kao and Zeng (1997) for calculating $E[S(\theta,X)S^T(\theta,X)]$ is complicated. Luo et al. (2003) used Monte Carlo simulation to approximate $E[S(\theta,X)S^T(\theta,X)]$, but the method is computationally demanding. We realized that

$$\mathrm{var}[S(\theta,X)] = E[S(\theta,X)S^T(\theta,X)] - E[S(\theta,X)]\,E[S^T(\theta,X)] = E[S(\theta,X)S^T(\theta,X)] \qquad (24)$$

because $E[S(\theta,X)]=0$ when $\theta=\hat\theta$, where $\hat\theta$ is the maximum likelihood estimate (MLE) of $\theta$. The variance–covariance matrix of the scores is

$$\mathrm{var}[S(\theta,X)] = \sum_{j=1}^{n}\mathrm{var}[S_j(\theta,X_j)]$$

which is not difficult to calculate. Note that the expectations and the variance–covariance matrices are calculated with respect to the missing value $X$ using the posterior probabilities of QTL genotypes (conditional on both marker and phenotype information and the estimated parameter value). Therefore, the information matrix for the EM estimated QTL parameters is

$$I(\hat\theta) = -E[H(\hat\theta,X)] - \mathrm{var}[S(\hat\theta,X)]$$

because $E[S(\hat\theta,X)S^T(\hat\theta,X)]=\mathrm{var}[S(\hat\theta,X)]$, as shown in equation 24. The estimated variance–covariance matrix for $\hat\theta$ is $\mathrm{var}(\hat\theta)=I^{-1}(\hat\theta)$. This variance–covariance matrix will be compared to that obtained from the weighted LS method.
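The per-individual variance of the full-data score, taken over the posterior distribution of the unobserved QTL genotype, can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it assumes an F2 design with design rows [1, +1, 0], [1, 0, 1] and [1, -1, 0] for the three genotypes, a normal full-data model with common variance, and hypothetical names (`GENOTYPE_ROWS`, `posterior`, `score_variance`). Summing the returned matrix over individuals gives the variance of the total score.

```python
import math

# Assumed F2 design rows [1, X_j1, X_j2] for the three QTL genotypes.
GENOTYPE_ROWS = [[1.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [1.0, -1.0, 0.0]]

def residual(y_j, X, beta):
    return y_j - sum(x * b for x, b in zip(X, beta))

def posterior(y_j, prior, beta, sigma2):
    """Posterior genotype probabilities given marker priors and phenotype."""
    dens = [p * math.exp(-residual(y_j, X, beta) ** 2 / (2.0 * sigma2))
            for p, X in zip(prior, GENOTYPE_ROWS)]
    total = sum(dens)
    return [d / total for d in dens]

def score_variance(y_j, prior, beta, sigma2):
    """Variance-covariance matrix of the full-data score for one individual."""
    post = posterior(y_j, prior, beta, sigma2)
    scores = []
    for X in GENOTYPE_ROWS:
        r = residual(y_j, X, beta)
        s_beta = [x * r / sigma2 for x in X]
        s_sigma2 = -0.5 / sigma2 + r * r / (2.0 * sigma2 ** 2)
        scores.append(s_beta + [s_sigma2])
    k = len(scores[0])
    mean = [sum(p * s[a] for p, s in zip(post, scores)) for a in range(k)]
    return [[sum(p * s[a] * s[b] for p, s in zip(post, scores))
             - mean[a] * mean[b] for b in range(k)] for a in range(k)]
```

When the genotype is known with certainty the posterior is degenerate and the variance is zero, so such individuals contribute nothing to the correction term.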

Results

Monte Carlo simulation

The purpose of the simulation study is to verify the Fisher scoring algorithm. The final result of the Fisher scoring algorithm is identical to the result of EE (Feenstra et al., 2006), because both algorithms maximize the same likelihood function. Extensive simulation studies for the improved weighted regression method have been performed by Feenstra et al. (2006). Therefore, we only evaluated the Fisher scoring algorithm under one situation. We placed a QTL in the middle of a 10 cM marker interval. The two markers were fully informative. The simulated parameter values were

The additive and dominance effects of the QTL explained 7.3% and 1.43% of the phenotypic variance, respectively. Overall, the QTL contributed 8.73% of the phenotypic variance. The sample size of the simulated F2 population was n=300. The simulation experiment was replicated 200 times. The QTL position was fixed at the true location (5 cM away from either flanking marker) and only θ was estimated.

For comparison, we also analyzed the data using the simple regression or LS method of Haley and Knott (1992), the IRLS method of Xu (1998a, 1998b) and the mixture model ML method of Lander and Botstein (1989). The Fisher scoring method developed in this study is denoted by FISHER. The EE algorithm is simply a different algorithm from the Fisher scoring algorithm for the same problem. Both EE and FISHER maximize the same likelihood function, and thus both generate the same ML estimates of the parameters. For the three methods with iterations, the same initial values of parameters were used, which are

The average estimates of the parameters and their standard deviations obtained from the 200 replicated simulations are listed in Table 1. The estimated parameters for all four methods are very close to the true parameters. This verifies all methods, including the Fisher scoring method developed here. The new method took about five iterations to converge, whereas ML and IRLS took about seven and four iterations, respectively, to converge to the same criterion. The Fisher scoring method took about 0.1 s of computing time per replication, which is approximately the same as the time for the IRLS method. Both FISHER and IRLS are faster than ML.

Table 1 Mean estimates of parameters and the standard deviations (in parentheses) of the estimated parameters obtained from 200 replicated simulations

A special algorithm is required to obtain the variance–covariance matrix of the estimated QTL parameters for the ML method (Kao and Zeng, 1997; Luo et al., 2003). The Fisher scoring method, however, provides such a covariance matrix as a by-product of the iteration process. To evaluate the accuracy of the estimated variance–covariance matrix, we compared the ‘predicted’ covariance matrix with the ‘realized’ covariance matrix. The predicted covariance matrix was obtained as follows. For the kth replicated simulation we calculated $\mathrm{var}(\hat\theta_k)$ using equation (13) and then took the average across replicates as the ‘predicted’ covariance matrix. The realized covariance matrix was calculated using the following approach. Let $\hat\theta_k$ be the estimated parameters from the kth replicated simulation ($k=1,\ldots,200$). The ‘realized’ covariance matrix was defined as the empirical variance–covariance matrix of the $\hat\theta_k$ across replicates, computed about their mean $\bar\theta$. The predicted and realized covariance matrices obtained from 200 replicated simulations are given in Table 2. Clearly, the predicted covariance matrix is very close to the realized covariance matrix.
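The comparison of a ‘predicted’ with a ‘realized’ covariance matrix can be sketched as follows. The function names are hypothetical, and the divisor K - 1 in the realized covariance is an assumption of this sketch; the divisor actually used in the study is not recoverable from the text.

```python
def realized_covariance(estimates):
    """Sample covariance matrix of replicate estimates (divisor K - 1, assumed)."""
    K = len(estimates)
    m = len(estimates[0])
    mean = [sum(e[a] for e in estimates) / K for a in range(m)]
    return [[sum((e[a] - mean[a]) * (e[b] - mean[b]) for e in estimates) / (K - 1)
             for b in range(m)] for a in range(m)]

def predicted_covariance(matrices):
    """Element-wise average of the per-replicate covariance matrices."""
    K = len(matrices)
    m = len(matrices[0])
    return [[sum(M[a][b] for M in matrices) / K for b in range(m)]
            for a in range(m)]
```

Close agreement between the two matrices indicates that the inverse information matrix is a usable estimate of the sampling variability of the parameter estimates.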

Table 2 Comparison of the predicted with the realized variance–covariance matrices of the estimated parameters for the Fisher scoring algorithm

Numerical evaluation

A dataset from an F2 mouse population consisting of 110 individuals was used as an example for demonstration. The data were published by Lan et al. (2006) and are freely available from the internet (see Lan et al., 2006 for the website address). A preliminary analysis showed that there was a QTL for the trait of 10th week body weight on the second chromosome between markers D2Mit194 and D2Mit263 (result not given). The putative position of the QTL is at 95.7 cM, whereas the two flanking markers are located at 85.4 and 98.7 cM, respectively. QTL parameters were estimated assuming that the position of the QTL is known (fixed at 95.7 cM).

The iteration process of the Fisher scoring algorithm is given in Table 3. It took seven iterations to converge when was used as the initial values of the parameters. If the LS estimates of QTL parameters were used as the initial values, only five iterations were required to converge to the same criterion, (data not shown).

Table 3 Iteration process of the Fisher scoring algorithm of QTL mapping for the mouse data

The data were also analyzed using the other three methods (LS, IRLS and ML) and the results are given in Table 4. The estimated parameters from LS and IRLS are more similar to each other than to the parameters estimated from ML and FISHER, which are almost identical to each other. The W-test statistics for all four methods (LS, IRLS, ML and FISHER) were compared with the likelihood-ratio test statistics and they are indeed similar to one another (see the last two rows of Table 4).

Table 4 Estimated QTL parameters and their standard errors (in parentheses) for the mouse data from four different methods

The variance–covariance matrices for the estimated parameters for the four methods are given in Table 5. The covariance matrices for the four methods are very similar. To further validate the accuracy of the covariance matrix of the Fisher scoring method, we performed a bootstrap analysis (Efron, 1979) with 1000 replicated samples to obtain an empirical covariance matrix for the estimated parameters. The bootstrapped covariance matrix for the FISHER method is also presented in Table 5. The bootstrapped covariance matrix is similar to the predicted one from the Fisher scoring method, although relatively large deviations occur for some covariance elements.

Table 5 Variance–covariance matrix of the estimated QTL parameters for the mouse data

Interval mapping and composite interval mapping (CIM)

This section presents the results of IM and CIM (Zeng, 1994) for the same mouse data (Lan et al., 2006). The mouse genome has 19 chromosomes (excluding the sex chromosome). The data investigated contain 110 F2 mice and 193 markers covering about 1800 cM of the entire genome. The trait of interest is still the 10th week body weight. The entire genome was scanned in 1 cM increments using the IM approach (under the single QTL model). The log of odds (LOD) score profiles (converted from the likelihood ratio test statistic profiles) for all four methods (LS, IRLS, ML and FISHER) are given in Figure 1. The four methods generate almost identical results (the profiles overlap). Permutation tests showed that the critical value for the LOD score was 2.985, close to 3 (data not shown). Therefore, we used LOD 3 as the approximate critical value for controlling the genome-wise Type I error at 0.05 for declaration of statistical significance. Three QTL passed the LOD 3 criterion for all four methods. The estimated positions and effects for the three QTL are given in Table 6. The three QTL detected explain 10–20% of the phenotypic variance each. Again, the four methods give almost identical results for the estimated QTL parameters.
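The conversion between the likelihood ratio test statistic and the LOD score, and the counting of peaks above a threshold, can be sketched as follows. The conversion LOD = LRT / (2 ln 10) is the standard relationship; the peak-finding helper `peaks_above` is a hypothetical simplification that treats the profile as one sequence of scan positions and ignores chromosome boundaries.

```python
import math

def lrt_to_lod(lrt):
    """Convert a likelihood ratio test statistic to a LOD score."""
    return lrt / (2.0 * math.log(10.0))

def lod_to_lrt(lod):
    """Convert a LOD score back to the likelihood ratio test statistic."""
    return lod * 2.0 * math.log(10.0)

def peaks_above(lod_profile, threshold=3.0):
    """Indices of local maxima of the profile that exceed the threshold."""
    peaks = []
    for i, value in enumerate(lod_profile):
        left = lod_profile[i - 1] if i > 0 else float("-inf")
        right = lod_profile[i + 1] if i + 1 < len(lod_profile) else float("-inf")
        if value > threshold and value > left and value >= right:
            peaks.append(i)
    return peaks
```

Under this conversion, the LOD 3 threshold corresponds to a likelihood ratio test statistic of about 13.8.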

Figure 1

Log of odds (LOD) test statistic profiles for the four methods (LS, IRLS, ML and FISHER) using the interval mapping (IM) approach. The 19 chromosomes are merged into a single genome. The dash–dot horizontal line along the top of the graph represents the threshold value of LOD 3. The main and minor tick marks on the horizontal axis separate the chromosomes and indicate the position of markers, respectively. IRLS, iteratively reweighted LS; LS, least square; ML, maximum likelihood.

Table 6 Estimated QTL parameters for the mouse data from the interval mapping (IM) analysis

We have identified three QTL using the IM approach. Since IM uses a single QTL model, the estimated QTL parameters are subject to bias (Zeng, 1994). We now adopt the CIM approach to handle multiple QTL. We scanned the entire genome again but used the markers near the three identified QTL (from the IM) as cofactors to control the background noise. The LOD score profiles are presented in Figure 2. Using the same LOD 3 as the approximate critical value, we detected two QTL: one remains on chromosome 2 and the other occurs on chromosome 13. Again, the four methods are almost indistinguishable. The estimated QTL parameters are given in Table 7. Each of the two detected QTL explains a smaller proportion of the phenotypic variance than in the IM results, which is expected for the CIM approach.

Figure 2

Log of odds (LOD) test statistic profiles for the four methods (LS, IRLS, ML and FISHER) using the composite interval mapping (CIM) approach. The 19 chromosomes are merged into a single genome. The dash–dot horizontal line along the top of the graph represents the threshold value of LOD 3. The main and minor tick marks on the horizontal axis separate the chromosomes and indicate the position of markers, respectively. IRLS, iteratively reweighted LS; LS, least square; ML, maximum likelihood.

Table 7 Estimated QTL parameters for the mouse data from the composite interval mapping (CIM) analysis

Figure 3 presents the LOD score profiles for the FISHER method developed in this study under the IM and the CIM frameworks. Again, the IM is a single QTL model, whereas the CIM is a multiple QTL model. The two approaches produce clearly different profiles. The CIM is an improved approach and is highly recommended over the IM approach.

Figure 3

Log of odds (LOD) test statistic profiles for the Fisher scoring method (FISHER) under the interval mapping (IM) and composite interval mapping (CIM) frameworks. The 19 chromosomes are merged into a single genome. The dash–dot horizontal line along the top of the graph represents the threshold value of LOD 3. The main and minor tick marks on the horizontal axis separate the chromosomes and indicate the position of markers, respectively.

Discussion

The Fisher scoring algorithm developed in this study and the EE algorithm developed by Feenstra et al. (2006) produce identical results, because both maximize the same likelihood function. Therefore, they are two different algorithms for the same method, called the improved weighted LS method. Properties of the improved weighted LS method have been investigated thoroughly by Feenstra et al. (2006). Therefore, we only provided results of simulations and real data analysis for a simple situation: the QTL position is known. The purpose of this study is to demonstrate that a Fisher scoring algorithm can be used to estimate QTL parameters. Given that FISHER is identical to EE, why develop such an algorithm? Several reasons justify the new algorithm. One is that Fisher scoring is an important algorithm in genetics and this study gives another example of its application. Another reason is that the EE algorithm introduced by Feenstra et al. (2006) has no explicit expression of the iteration process, whereas the Fisher scoring algorithm does. More importantly, the EE was performed in two steps within an iteration, the $\beta$-step and the $\sigma^2$-step. As a result, the variance–covariance matrix of the estimated parameters is not available for the EE algorithm, or at least is not a by-product of the iteration process, whereas the Fisher scoring algorithm provides such a matrix as a by-product of the iteration process. We were informed that Feenstra B (personal communication) did derive the information matrix for the parameters under the EE algorithm in his thesis, but the result was not published.

A maximization algorithm, like the Fisher scoring algorithm, that provides an easy way of calculating the variance–covariance matrix of the estimated parameters may be preferable to the EM algorithm for the mixture model ML method, because a W-test statistic can be used for significance testing of QTL. The W-test statistic is similar to the likelihood-ratio test, but it is computationally less intensive than the latter. For the Fisher scoring algorithm, three different test statistics (additive, dominance and both) can be generated with only one objective function (the likelihood function for the full model) for maximization, whereas the EM implemented ML method requires maximization of several likelihood functions (under the full model and various reduced models).

The Fisher scoring implemented weighted LS method has some similarity to the quasilikelihood method (Wedderburn, 1974). Both require only the first (mean) and second (variance) moments of an observed data point to be known functions of parameters and no other assumptions are needed. The Fisher scoring method, however, cannot be replaced by the quasilikelihood method, because the latter requires the variance to be a known function of the mean for each observed data point.

Although the EM implemented ML is not the focus of this study, the simple method for calculating the variance of the score functions, $\mathrm{var}[S(\theta,X)]$, is new and cost-efficient relative to the method of Kao and Zeng (1997) and the Monte Carlo method of Luo et al. (2003).

Any quantitative trait may be controlled by more than one QTL. The IM approach uses a model that contains only a single QTL. Therefore, the model will never be the correct one if multiple QTL exist. However, people using the IM approach can still detect multiple QTL, simply by counting the peaks in the test statistic profile that pass the critical value of the test statistic. The estimated QTL effects will be biased if multiple linked QTL exist. There are two ways to handle multiple QTL. One is the multiple IM developed by Kao et al. (1999), where a multiple QTL model is fit and the number of QTL included is determined by a model selection algorithm, for example, stepwise regression. The other method is the CIM, where markers with significant effects are selected and then fit into the model to control the background noise (Zeng, 1994). In this study, we modified the Fisher scoring algorithm to fit the CIM model. We first used the IM approach to identify markers. We found three separate peaks on chromosome 2 of the mouse genome. We then selected three markers that are close to the three peaks as cofactors and fit the composite mapping model. Once we fit the CIM model, the three QTL on chromosome 2 became one and an additional QTL on chromosome 13 was detected. Therefore, the CIM does give different results from the IM. We did not fit a true multiple QTL mapping model, because the focus of this study is to develop the Fisher scoring method rather than to evaluate the multiple QTL mapping model. The CIM does serve as a simple method to handle multiple QTL. The actual multiple QTL model under the Fisher scoring algorithm is not hard to fit, but requires substantial computational time. Let q be the number of QTL fit to the model; the multiple QTL model is

$$y_j = \beta_0 + \sum_{k=1}^{q} U_{jk}\beta_k + e_j$$

where $U_{jk}$ is the conditional expectation of the QTL genotype indicator variable for locus $k$ and $\beta_k$ is a vector of QTL effects for the $k$th locus. The residual error has mean zero and variance

$$\sigma_j^2 = \sum_{k=1}^{q}\beta_k^T\Sigma_{jk}\beta_k + \sigma^2$$

where $\Sigma_{jk}$ is the conditional variance–covariance matrix of the $k$th QTL genotype indicator variable. The score vector and the information matrix are equivalent to those obtained for the single QTL model except that the dimensionalities of the vector and matrix are increased.
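The mean and variance of one individual under a multiple QTL model can be sketched as follows. This is an illustration under assumptions that go beyond the text: the intercept is kept separate from the per-locus effect vectors, the loci contribute additively to both the mean and the variance, and the name `multi_qtl_moments` is hypothetical.

```python
def multi_qtl_moments(beta0, betas, U_list, Sigma_list, sigma2):
    """Return (mu_j, sigma_j^2) for one individual with q QTL.

    betas[k], U_list[k] and Sigma_list[k] hold the effect vector, the
    conditional expectation and the conditional covariance matrix of the
    genotype indicators for locus k (intercept handled separately).
    """
    mu = beta0
    var = sigma2
    for beta_k, U_k, Sigma_k in zip(betas, U_list, Sigma_list):
        m = len(beta_k)
        mu += sum(u * b for u, b in zip(U_k, beta_k))
        var += sum(beta_k[r] * Sigma_k[r][c] * beta_k[c]
                   for r in range(m) for c in range(m))
    return mu, var
```

Stacking the intercept and the per-locus effect vectors into one parameter vector gives the same score and information structure as the single QTL case, only with more dimensions.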

Epistatic effects may also contribute to the genetic variance of a quantitative trait. However, they are not as important as the main QTL effects for many quantitative traits of agronomic crops (Xu and Jia, 2007). Extension to models with epistatic effects is tedious and would digress from the main focus of this study. Therefore, we only introduce the main effect model under the Fisher scoring algorithm. Feenstra et al. (2006) investigated the properties of the EE algorithm and found that it can be substantially more efficient than the LS and weighted LS methods when epistatic effects are fit to the model. Since the Fisher scoring algorithm is identical to the EE algorithm in terms of parameter estimation and significance testing, we expect that the Fisher scoring algorithm will show the same advantage as the EE algorithm in handling epistatic effects.

Finally, we developed a new SAS procedure called PROC QTL, which has an option to conduct QTL mapping using the Fisher scoring method. The SAS code is given in Appendix B.