Introduction

Lander & Botstein (1989) presented an exact maximum likelihood (ML) method for mapping quantitative trait loci (QTL) in line crossing experiments. When the putative position is off the markers, the QTL genotype is not observed, so the model involves missing data. Solutions of the exact maximum likelihood method with missing data are usually obtained using the expectation–maximization (EM) algorithm (Dempster et al., 1977), which requires many iterations. Haley & Knott (1992) found that the ML can be well approximated by the simple regression method (REG). The authors conducted extensive computer simulations, showing no detectable difference between ML and REG in the range of parameters considered in the simulation experiment. A similar argument is also found in Martinez & Curnow (1992). As a consequence, the simple regression method has become widely accepted, especially in European countries, because of its simplicity and convenience of use relative to the ML.

Given that two methods are available for QTL mapping, which method should be chosen for real data analyses? For analysis of a single data set, it does not matter which one is used because the two methods will generate almost identical results. Some researchers may want to avoid the word ‘approximation’ and choose ML, and others may prefer simplicity and thus choose REG. Xu (1995) recently found that the residual variance estimated by the REG method contains part of the QTL variance caused by the uncertainty of QTL genotype. This observation may alert users of the REG that the residual variance should be interpreted with caution. However, the REG method is so much faster computationally than the ML that it may become the choice for multiple data analyses, such as permutation tests (Churchill & Doerge, 1994) and the bootstrap construction of confidence intervals (Visscher et al., 1996). These nonparametric methods involve thousands of analyses of resampled versions of the same data set and could be prohibitive for ML if the data set and genome size are large.

The purposes of this paper are: (i) to investigate further the difference between the REG and ML methods via simulation studies in situations with high heritabilities and dominant and/or missing markers; and (ii) to improve the existing regression method so that the pure environmental variance can be separated from the residual variance, yet the property of high computing speed is retained.

Statistical methods

Linear model

Let yj be the phenotypic value of an F2 individual that can be described by the following linear model:

where xj is a known vector, β is a vector of unknown fixed effects, α and δ are, respectively, the average effect of allelic substitution and the dominance effect of a putative QTL, and εj is the residual error, distributed as N(0, σ_ε²). Note that for a single-QTL model the residual error is purely caused by uncontrollable environmental noise. The independent variables, zj and wj, are defined as:

and

where Q1Q1, Q2Q2 and Q1Q2 are, respectively, the genotypes of the two parental lines and the F1 hybrid. Because the genotype of a QTL is not observable if the QTL is not at a marker, zj and wj are usually missing. However, the conditional distribution of z and w can be inferred from the genotypes of linked markers. Let p(kl)j be the conditional probability that the individual is of genotype QkQl, given marker information. Given the conditional probabilities, yj is considered to be sampled from a mixture of three distributions with means of μ11, μ12 and μ22 and a common variance σ_ε², where:

Statistical tests and parameter estimation are conducted through one of the three methods described below.
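
The displayed equations of this subsection are not reproduced in this copy. The block below is a reconstruction from the verbal definitions above, offered as a sketch rather than the authors' exact typesetting. The coding of zj as 1, 0 and −1 agrees with the values of α and h² quoted in the simulation section (it gives Var(zj) = 1/2 in an F2). The indicator coding of wj is an assumption, since other dominance codings are also in common use.

```latex
% Reconstruction from the text; the 0/1/0 coding of w_j is an assumption.
\begin{align}
y_j &= x_j\beta + z_j\alpha + w_j\delta + \varepsilon_j,
      \qquad \varepsilon_j \sim N(0,\sigma_\varepsilon^2) \\
z_j &= \begin{cases}
          \phantom{-}1 & \text{for } Q_1Q_1 \\
          \phantom{-}0 & \text{for } Q_1Q_2 \\
          -1           & \text{for } Q_2Q_2
       \end{cases}
\qquad
w_j = \begin{cases}
          0 & \text{for } Q_1Q_1 \\
          1 & \text{for } Q_1Q_2 \\
          0 & \text{for } Q_2Q_2
       \end{cases} \\
\mu_{11} &= x_j\beta + \alpha, \qquad
\mu_{12} = x_j\beta + \delta, \qquad
\mu_{22} = x_j\beta - \alpha
\end{align}
```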

Maximum likelihood method (ML)

The likelihood function is:

where φkl(yj) is the normal probability density for those individuals with genotype QkQl. It is well known that the maximum likelihood solution for the unknown parameters, θ = [β α δ σ_ε²]ᵀ, can be obtained via the EM algorithm (Dempster et al., 1977). To test the hypothesis that no QTLs are segregating, i.e. H0: α=δ=0, the following likelihood ratio test statistic is applied:

where θ0 is θ subject to the two constraints α=0 and δ=0.
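
As an illustration of the ML analysis at a single putative position, the sketch below runs EM on the mixture likelihood L(θ) = Πj Σkl p(kl)j φkl(yj) and returns the likelihood ratio statistic 2[ln L(θ̂) − ln L(θ̂0)]. It is a minimal sketch, not the authors' program: the function names (`em_qtl`, `mixture_loglik`), the genotype coding (z = 1, 0, −1 and w = 0, 1, 0) and the stopping rule are assumptions made for the example.

```python
import numpy as np


def normal_pdf(y, mean, var):
    """Normal density with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def mixture_loglik(y, X, P, theta, var, codes):
    """Return ln L = sum_j ln sum_k p_jk * phi_k(y_j) and the n x 3 joint terms."""
    q = X.shape[1]
    means = (X @ theta[:q])[:, None] + codes @ theta[q:]  # n x 3 genotype means
    dens = P * normal_pdf(y[:, None], means, var)
    return np.log(dens.sum(axis=1)).sum(), dens


def em_qtl(y, X, P, zcode=(1.0, 0.0, -1.0), wcode=(0.0, 1.0, 0.0),
           max_iter=500, tol=1e-8):
    """EM for the three-genotype mixture at one putative position.

    P is n x 3: conditional probabilities of Q1Q1, Q1Q2, Q2Q2 given markers.
    The genotype coding (zcode, wcode) is an assumption of this sketch.
    """
    y, X, P = (np.asarray(a, float) for a in (y, X, P))
    n, q = X.shape
    codes = np.column_stack([zcode, wcode]).astype(float)  # 3 x 2

    # Null model (alpha = delta = 0): ordinary least squares on X alone.
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    theta = np.concatenate([beta0, [0.0, 0.0]])
    var = np.mean((y - X @ beta0) ** 2)
    loglik0, _ = mixture_loglik(y, X, P, theta, var, codes)

    old = -np.inf
    for it in range(1, max_iter + 1):
        loglik, dens = mixture_loglik(y, X, P, theta, var, codes)
        if loglik - old < tol:
            break
        old = loglik
        # E-step: posterior probability of each genotype for each individual.
        post = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted least squares over the three genotype classes.
        A = np.zeros((q + 2, q + 2))
        b = np.zeros(q + 2)
        for k in range(3):
            Tk = np.hstack([X, np.tile(codes[k], (n, 1))])
            A += Tk.T @ (Tk * post[:, [k]])
            b += Tk.T @ (post[:, k] * y)
        theta = np.linalg.solve(A, b)
        means = (X @ theta[:q])[:, None] + codes @ theta[q:]
        var = np.sum(post * (y[:, None] - means) ** 2) / n

    loglik, _ = mixture_loglik(y, X, P, theta, var, codes)
    return {"beta": theta[:q], "alpha": theta[q], "delta": theta[q + 1],
            "var": var, "lr": 2.0 * (loglik - loglik0), "iterations": it}
```

Starting EM from the null fit means the log-likelihood can only increase from ln L(θ̂0), so the returned statistic is non-negative up to rounding error.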

Simple regression method (REG)

The regression method of QTL mapping developed by Haley & Knott (1992) and Martinez & Curnow (1992) is an approximation of the ML method. These authors approximate the mixture of three distributions by a single distribution so that the ML solution can be obtained by a simple regression approach. The approximate single model is:

where IM denotes marker information and:

and:

Note that the residual ej is different from that given earlier. This single model has a mean of:

and a variance of:

The unknown parameters are solved using the ordinary least squares method (Haley & Knott, 1992). Under the assumption that yj is normal, the least squares solutions are identical to the maximum likelihood estimators if the likelihood function is defined by:

Two assumptions of the ML are violated by the regression analysis. One is the normal distribution of yj and the other is the homogeneous residual variance. Violation of the normal distribution is not a problem with the regression method because estimation of the parameters does not depend on a normal distribution. Although the hypothesis test depends on the normal assumption, the t- and F-tests are usually very robust. Heterogeneous residual variance may cause a slight problem in the regression analysis (Xu, 1995), but is not likely to change the results qualitatively relative to the true ML analysis (Haley & Knott, 1992). The difference between the true ML and the regression method comes from the difference in the estimation of the residual variance. The regression method generally provides a residual variance estimate that contains part of the QTL variance not explained because of the uncertainty of QTL genotype (Xu, 1995). The F-value can be used as the test statistic for the simple regression method. However, to compare this method with the ML, the test statistic originally used by Haley & Knott (1992) is adopted here:

where RSSfull is the residual sum of squares of the full model and RSSreduced is that of the reduced model. This test statistic can be compared with that given in eqn (3) because they are very similar under the null hypothesis (see Table 5).
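
The regression analysis at one putative position can be sketched as follows. The phenotype is regressed on E(zj|IM) and E(wj|IM), and the full and reduced residual sums of squares are combined as n ln(RSSreduced/RSSfull), which is the form of statistic used by Haley & Knott (1992); whether this matches the omitted display exactly is an assumption. The function names and the genotype coding (z = 1, 0, −1 and w = 0, 1, 0) are likewise illustrative.

```python
import numpy as np


def fit_ols(y, T):
    """Ordinary least squares; returns (residual sum of squares, coefficients)."""
    coef = np.linalg.lstsq(T, y, rcond=None)[0]
    resid = y - T @ coef
    return float(resid @ resid), coef


def reg_qtl(y, X, P, zcode=(1.0, 0.0, -1.0), wcode=(0.0, 1.0, 0.0)):
    """Simple regression (REG) at one putative position.

    P is n x 3: conditional probabilities of Q1Q1, Q1Q2, Q2Q2 given markers.
    The genotype coding is an assumption of this sketch.
    """
    y, X, P = (np.asarray(a, float) for a in (y, X, P))
    n, q = X.shape
    Ez = P @ np.asarray(zcode, float)   # E(z_j | I_M)
    Ew = P @ np.asarray(wcode, float)   # E(w_j | I_M)
    rss_full, coef = fit_ols(y, np.column_stack([X, Ez, Ew]))
    rss_reduced, _ = fit_ols(y, X)
    return {"beta": coef[:q], "alpha": coef[q], "delta": coef[q + 1],
            # Residual variance: still contains QTL variance left unexplained
            # by the uncertainty of the QTL genotype.
            "resid_var": rss_full / (n - q - 2),
            "lr": n * np.log(rss_reduced / rss_full)}
```

When the genotype probabilities are far from 0 and 1, `resid_var` exceeds the environmental variance, which is the confounding described above.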

Table 5 Empirical critical values of the test statistic for testing the presence of a QTL on a chromosome of length 100 cM

Iteratively reweighted least squares method (IRWLS)

To retain the advantages of both the regression method and the ML method, a weighted regression method is investigated here. The mixture model is still approximated by a single model (eqn 4), but the residual variance is further partitioned into several components:

where Var(zj|IM)α² is part of the QTL variance not explained because of the uncertainty of zj, Var(wj|IM)δ² is part of the QTL variance not explained because of the uncertainty of wj, and 2Cov(zj, wj|IM)αδ arises from the uncertainty of both zj and wj. All three additional components in the residual will vanish if the genotype of the QTL is actually observed, i.e. Var(zj|IM) = Var(wj|IM) = Cov(zj, wj|IM) = 0. These additional components are computed as follows:

and:

Let y be an n×1 vector of the data. The model can be expressed in matrix notation as:

where Z is an n×1 vector with the jth element equal to E(zj|IM), W is an n×1 vector with the jth element equal to E(wj|IM), and e is an n×1 vector of residuals. The expectation and variance matrix of the model are:

and:

where R is a diagonal matrix with the jjth element equal to:

and:

The likelihood function is:

The ML solution can be solved via a weighted least squares approach which is described below.
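
The displayed equations of this subsection are not reproduced in this copy. The block below reconstructs them from the surrounding definitions and should be read as a sketch under that assumption. The moments are written generically in terms of the genotype codes z_kl and w_kl, and λα, λδ and λαδ are taken to be the ratios of α², δ² and αδ to the residual variance, which is the reading under which R is free of σ_ε² in Var(y).

```latex
% Reconstruction from the text; notation follows the verbal definitions.
\begin{align}
\mathrm{Var}(y_j \mid I_M) &= \sigma_\varepsilon^2
   + \mathrm{Var}(z_j \mid I_M)\,\alpha^2
   + \mathrm{Var}(w_j \mid I_M)\,\delta^2
   + 2\,\mathrm{Cov}(z_j, w_j \mid I_M)\,\alpha\delta \\
E(z_j \mid I_M) &= \sum_{kl} p_j^{(kl)} z_{kl}, \qquad
\mathrm{Var}(z_j \mid I_M) = \sum_{kl} p_j^{(kl)} z_{kl}^2 - [E(z_j \mid I_M)]^2 \\
\mathrm{Cov}(z_j, w_j \mid I_M) &= \sum_{kl} p_j^{(kl)} z_{kl} w_{kl}
   - E(z_j \mid I_M)\,E(w_j \mid I_M) \\
\mathbf{y} &= X\beta + Z\alpha + W\delta + \mathbf{e}, \qquad
E(\mathbf{y}) = X\beta + Z\alpha + W\delta, \qquad
\mathrm{Var}(\mathbf{y}) = R\,\sigma_\varepsilon^2 \\
r_{jj} &= \mathrm{Var}(z_j \mid I_M)\,\lambda_\alpha
   + \mathrm{Var}(w_j \mid I_M)\,\lambda_\delta
   + 2\,\mathrm{Cov}(z_j, w_j \mid I_M)\,\lambda_{\alpha\delta} + 1 \\
\lambda_\alpha &= \alpha^2/\sigma_\varepsilon^2, \qquad
\lambda_\delta = \delta^2/\sigma_\varepsilon^2, \qquad
\lambda_{\alpha\delta} = \alpha\delta/\sigma_\varepsilon^2 \\
L(\theta) &= (2\pi\sigma_\varepsilon^2)^{-n/2}\,|R|^{-1/2}
   \exp\!\left\{-\frac{1}{2\sigma_\varepsilon^2}
   (\mathbf{y} - X\beta - Z\alpha - W\delta)^{\mathsf T} R^{-1}
   (\mathbf{y} - X\beta - Z\alpha - W\delta)\right\}
\end{align}
```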

Given an initial guess of the values of λα, λδ and λαδ, matrix R is treated as known. With R treated as known, the solution of θ is easily obtained via weighted regression analysis:

and:

Because R depends on unknown parameters, it must be updated by the estimates of α, δ and σ_ε², and the estimation is then repeated until convergence. This algorithm is extremely fast: only two to three iterations are required, in contrast to 80–100 iterations of the EM algorithm at the same accuracy. The likelihood ratio test statistic, Λ, is applied to the weighted regression analysis.
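
The iteration just described can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' program: the function name `irwls_qtl`, the genotype coding (z = 1, 0, −1 and w = 0, 1, 0), the starting value R = I and the convergence tolerance are all choices made for the example. The number of iterations needed depends on that tolerance.

```python
import numpy as np


def irwls_qtl(y, X, P, zcode=(1.0, 0.0, -1.0), wcode=(0.0, 1.0, 0.0),
              max_iter=100, tol=1e-8):
    """Iteratively reweighted least squares at one putative position.

    P is n x 3: conditional probabilities of Q1Q1, Q1Q2, Q2Q2 given markers.
    The genotype coding is an assumption of this sketch.
    """
    y, X, P = (np.asarray(a, float) for a in (y, X, P))
    z, w = np.asarray(zcode, float), np.asarray(wcode, float)
    n, q = X.shape

    # Conditional moments of z_j and w_j given marker information.
    Ez, Ew = P @ z, P @ w
    Vz = P @ z ** 2 - Ez ** 2
    Vw = P @ w ** 2 - Ew ** 2
    Czw = P @ (z * w) - Ez * Ew

    T = np.column_stack([X, Ez, Ew])
    r = np.ones(n)  # initial guess lambda = 0, i.e. R = I (simple regression)
    for it in range(1, max_iter + 1):
        # Weighted least squares with R treated as known.
        Tw = T / r[:, None]
        b = np.linalg.solve(T.T @ Tw, Tw.T @ y)
        resid = y - T @ b
        var = np.sum(resid ** 2 / r) / n
        alpha, delta = b[q], b[q + 1]
        # Update the diagonal of R from the new estimates.
        r_new = 1.0 + (Vz * alpha ** 2 + Vw * delta ** 2
                       + 2.0 * Czw * alpha * delta) / var
        if np.max(np.abs(r_new - r)) < tol:
            break
        r = r_new

    # Log-likelihood at the last weighted fit and under H0 (R = I, no QTL).
    loglik = -0.5 * (n * np.log(2.0 * np.pi * var) + np.sum(np.log(r)) + n)
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    var0 = np.mean((y - X @ b0) ** 2)
    loglik0 = -0.5 * n * (np.log(2.0 * np.pi * var0) + 1.0)
    return {"beta": b[:q], "alpha": alpha, "delta": delta, "var": var,
            "lr": 2.0 * (loglik - loglik0), "iterations": it}
```

When the QTL genotype is observed, all three extra variance components are zero, R stays at the identity and the procedure reduces to ordinary regression after a single pass.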

Dominant and missing markers

The missing marker problem can be solved easily. A missing marker is skipped and the nearest non-missing flanking markers are used instead. Dominant markers provide partial information, which is extracted by using a hidden Markov model algorithm. Details of the hidden Markov model are found in Lander & Green (1987) and Kruglyak et al. (1995).
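
A forward-backward pass over a three-state chain is one way to obtain the conditional QTL genotype probabilities when some markers are dominant or missing. The sketch below is illustrative only and is not taken from the cited papers: the Haldane map function, the observation codes (`"11"`, `"12"`, `"22"`, `"1_"`, `"2_"`, `"--"`) and the function names are assumptions made for the example.

```python
import numpy as np


def haldane(d_cM):
    """Recombination fraction for a map distance in cM (Haldane, assumed here)."""
    return 0.5 * (1.0 - np.exp(-2.0 * np.asarray(d_cM, float) / 100.0))


def transition(r):
    """F2 genotype transition matrix; states are Q1Q1, Q1Q2, Q2Q2."""
    s = 1.0 - r
    return np.array([[s * s, 2 * r * s, r * r],
                     [r * s, s * s + r * r, r * s],
                     [r * r, 2 * r * s, s * s]])


# Hypothetical observation codes: which genotypes each marker score allows.
EMIT = {"11": [1, 0, 0], "12": [0, 1, 0], "22": [0, 0, 1],
        "1_": [1, 1, 0],   # dominant marker showing the line-1 allele
        "2_": [0, 1, 1],   # dominant marker showing the line-2 allele
        "--": [1, 1, 1]}   # missing marker


def qtl_genotype_probs(marker_pos, marker_obs, qtl_pos):
    """P(Q1Q1), P(Q1Q2), P(Q2Q2) at qtl_pos given all markers of one individual."""
    # Treat the putative QTL as one more locus with no observation.
    pos = np.append(np.asarray(marker_pos, float), qtl_pos)
    emit = np.array([EMIT[o] for o in list(marker_obs) + ["--"]], float)
    order = np.argsort(pos, kind="stable")
    pos, emit = pos[order], emit[order]
    k = int(np.where(order == len(marker_pos))[0][0])  # index of the QTL

    L = len(pos)
    fwd = np.zeros((L, 3))
    bwd = np.ones((L, 3))
    fwd[0] = np.array([0.25, 0.5, 0.25]) * emit[0]
    for i in range(1, L):
        fwd[i] = (fwd[i - 1] @ transition(haldane(pos[i] - pos[i - 1]))) * emit[i]
    for i in range(L - 2, -1, -1):
        bwd[i] = transition(haldane(pos[i + 1] - pos[i])) @ (emit[i + 1] * bwd[i + 1])
    post = fwd[k] * bwd[k]
    return post / post.sum()
```

With fully informative flanking markers the result depends only on those two markers, which corresponds to skipping missing markers and using the nearest informative ones.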

Simulation studies

Eleven equally spaced markers were simulated on a single chromosome segment of length 100 cM. A single QTL was located at position 25 cM. The population size (number of F2 individuals) was set at 300. Under the null model, the QTL was assigned a value of zero for both the additive and dominance effects. Simulations were repeated 1000 times and the 95th and 99th percentiles of the test statistics were chosen as the empirical critical values for power calculation. Under the alternative model, a nonzero additive effect was simulated while the dominance effect was still set to zero. Simulations were repeated 100 times. Empirical power was calculated by counting the number of runs in which test statistics were greater than the empirical critical values. In all simulations, the variance of the environmental effect was set at σ_ε² = 1.0.
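
One replicate of such a design could be generated along the following lines. This is a sketch under assumptions the text does not state: crossovers follow the Haldane map function, genotypes are stored as the number of line-2 alleles, and the name `simulate_f2` and its arguments are hypothetical.

```python
import numpy as np


def haldane(d_cM):
    """Recombination fraction for a map distance in cM (Haldane, assumed here)."""
    return 0.5 * (1.0 - np.exp(-2.0 * np.asarray(d_cM, float) / 100.0))


def simulate_f2(n, marker_pos, qtl_pos, alpha, delta=0.0, mu=0.0,
                env_var=1.0, rng=None):
    """Simulate marker genotypes, QTL genotypes and phenotypes for n F2 individuals.

    Genotypes are coded as the number of line-2 alleles (0, 1 or 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    pos = np.sort(np.append(np.asarray(marker_pos, float), qtl_pos))
    qtl_idx = int(np.searchsorted(pos, qtl_pos))
    r = haldane(np.diff(pos))

    def gametes():
        # The F1 parent is heterozygous everywhere: pick a starting strand,
        # then switch strands at each crossover between adjacent loci.
        start = rng.integers(0, 2, size=(n, 1))
        crossover = rng.random((n, len(r))) < r
        return np.hstack([start, (start + np.cumsum(crossover, axis=1)) % 2])

    geno = gametes() + gametes()
    q = geno[:, qtl_idx]
    z = 1.0 - q                     # 1, 0, -1 for Q1Q1, Q1Q2, Q2Q2
    w = (q == 1).astype(float)      # heterozygote indicator (assumed coding)
    y = mu + alpha * z + delta * w + rng.normal(0.0, np.sqrt(env_var), size=n)
    markers = np.delete(geno, qtl_idx, axis=1)
    return markers, q, y


# Example matching the design described above (alpha chosen for illustration).
markers, q, y = simulate_f2(300, np.arange(0, 101, 10), 25.0, alpha=0.82,
                            rng=np.random.default_rng(0))
```

Setting `alpha=0.0` gives a replicate under the null model, from which empirical critical values could be taken as percentiles of the maximum test statistic over replicates.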

Each data set was analysed using the three methods: the exact maximum likelihood method (ML), the simple linear regression analysis (REG) and the iteratively reweighted least squares method (IRWLS). Powers and estimation errors of the three methods were compared, based on averages of 100 runs.

Factors considered include the size of the QTL effect, measured by the average effect of gene substitution (α), and the amount of marker information. The average effect of gene substitution was examined at three levels: α=0.324 leading to h²=0.05; α=0.820 resulting in h²=0.25; and α=1.155 corresponding to h²=0.40. The amount of marker information was investigated in four situations: (i) all markers codominant and no missing markers, the highest level of marker information content; (ii) 50 per cent of loci in the F1 parent randomly set to dominant and no missing markers in the offspring; (iii) 50 per cent of loci in the F2 offspring randomly set to missing values; and (iv) 50 per cent of loci in the parent dominant and 50 per cent of loci in the offspring missing, the lowest level of marker information content.
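
The correspondence between α and h² quoted above can be checked by hand. The working below assumes that the additive variable takes the values 1, 0 and −1 with F2 frequencies 1/4, 1/2 and 1/4, so that its variance is 1/2, with δ = 0 and unit environmental variance as in the simulations.

```latex
% Worked check, assuming z_j in {1, 0, -1} with F2 frequencies 1/4, 1/2, 1/4.
\begin{align}
\mathrm{Var}(z_j) &= \tfrac14(1)^2 + \tfrac12(0)^2 + \tfrac14(-1)^2 = \tfrac12,
\qquad
h^2 = \frac{\alpha^2/2}{\alpha^2/2 + \sigma_\varepsilon^2} \\
\alpha = 0.324 &: \quad h^2 = \frac{0.0525}{1.0525} \approx 0.050 \\
\alpha = 0.820 &: \quad h^2 = \frac{0.3362}{1.3362} \approx 0.252 \\
\alpha = 1.155 &: \quad h^2 = \frac{0.6670}{1.6670} \approx 0.400
\end{align}
```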

Average values of the estimated parameters and their standard deviations, calculated from 100 replicated simulations, are listed in Tables 1, 2, 3 and 4. The three methods show virtually no difference with regard to estimation of the additive effect (α), dominance effect (δ) and the location of the QTL (cMA), which is consistent with Haley & Knott (1992) for the comparison of ML and REG. Another observation is that when both marker information content and the heritability are low, estimation of the QTL position tends to be biased towards the centre of the chromosome for all three methods. This bias occurs because, with smaller QTL effects and less marker information, some of the QTL peaks found may represent not the simulated QTL but a Type I error. The positions of these Type I errors tend to be randomly distributed along the linkage group; thus the mean position of Type I errors is at the centre of the chromosome and their joint effect, along with some real QTL, is to move the estimated position over all simulated replicates towards the centre of the chromosome. The last, and important, observation is that the simulations verify the theoretical prediction that the simple regression provides a confounded estimate of the true residual variance and part of the QTL variance. The level of confounding increases as the marker information content decreases (from Table 1 to Table 4). The confounding, however, no longer exists in the IRWLS method (see the comparison with ML).

Table 1 Comparison of three methods of QTL mapping via Monte Carlo simulations. All markers are codominant and there are no missing values. Parametric values not listed in the table are: QTL position (cMA)=25 cM, δ=0 and σ_ε²=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses
Table 2 Comparison of three methods of QTL mapping via Monte Carlo simulations. There are 50 per cent dominant markers with no missing values. Parametric values not listed in the table are: QTL position (cMA)=25 cM, δ=0 and σ_ε²=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses
Table 3 Comparison of three methods of QTL mapping via Monte Carlo simulations. All markers are codominant and there are, on average, 50 per cent missing markers. Parametric values not listed in the table are: QTL position (cMA)=25 cM, δ=0 and σ_ε²=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses
Table 4 Comparison of three methods of QTL mapping via Monte Carlo simulations. There are, on average, 50 per cent dominant and 50 per cent missing markers. Parametric values not listed in the table are: QTL position (cMA)=25 cM, δ=0 and σ_ε²=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses

The empirical critical values based on 1000 repeated simulations are given in Table 5, showing very little difference between the three methods. These critical values, however, differ across levels of marker information content. The highest critical values occur when all markers are codominant and there are no missing markers. These empirical critical values are then used to compute the empirical statistical powers for the three methods (see Table 6). Again, the three methods have virtually identical statistical powers.

Table 6 Empirical powers of three methods for QTL detection under various situations. α is the Type I error rate

To view the details of the comparison of the three methods, the likelihood ratio test statistics of the three methods are plotted against the chromosome position. Figure 1 shows the likelihood ratio profiles (average of 100 runs) at three levels of heritability in the situation where 50 per cent of the marker loci in the offspring are missing. The IRWLS method is nearly indistinguishable from the ML method, and both methods have higher testing signals than the REG method. Figure 1a, b and c also show that the difference between ML (IRWLS) and REG increases as the heritability increases. With the heritability fixed at 0.25, the likelihood ratio profiles (average of 100 runs) of the three methods are compared at each of the four levels of marker information content (Fig. 2). Again, the IRWLS and ML profiles are virtually identical, but both differ from that of the simple regression method. When all markers are codominant and there are no missing markers, the test statistics of the three methods are identical at marker loci but different off the markers. The ML (IRWLS) curves show marked discontinuities at marker loci (Fig. 2a). When 50 per cent of the marker loci are dominant and there are no missing markers, the discontinuity of the ML (IRWLS) still exists but becomes less obvious (Fig. 2b). The test statistics of the ML (IRWLS) at the marker loci are now different from those of the REG. As the marker information content decreases, the discontinuity of ML (IRWLS) disappears (Fig. 2c, d).

Fig. 1 Comparison of the likelihood ratio profiles (test statistics) of three methods, maximum likelihood (ML), simple regression (REG) and iteratively reweighted least squares (IRWLS). Eleven codominant markers (with a 50 per cent chance of missing) are equally spaced along a chromosome of 100 cM. A single QTL resides at position 25 cM. (a) Variation explained by the QTL is 0.05; (b) variation explained by the QTL is 0.25; (c) variation explained by the QTL is 0.40.

Fig. 2 Comparison of the likelihood ratio profiles (test statistics) of three methods, maximum likelihood (ML), simple regression (REG) and iteratively reweighted least squares (IRWLS). Eleven markers are equally spaced on a chromosome of 100 cM. A single QTL explaining 25 per cent of the phenotypic variation resides at position 25 cM. (a) All markers codominant and no missing markers; (b) 50 per cent of the markers dominant and no missing markers; (c) 50 per cent of the markers missing; (d) 50 per cent dominant and 50 per cent missing markers.

In conclusion, ML and IRWLS show no difference but both differ from REG. However, the difference is detectable only in the fine detail of the test statistic profiles. The advantage of ML and IRWLS over the REG is that they provide a true estimate of σ_ε². The ML, however, is many times slower than REG because many iterations (≈80) are required for the EM algorithm to converge. In contrast, the IRWLS algorithm requires only two to three iterations to converge, making it about two or three times slower than the REG but 30–40 times faster than the ML. Of course, the comparisons in computing speed are based on the algorithms adopted in this particular research. If other algorithms had been used, such as the Newton–Raphson iteration for the ML and the regression on marker-type algorithm (Whittaker et al., 1996) for the REG, the comparisons would produce quantitatively different results, but the conclusion is not anticipated to change qualitatively.

Discussion

In an earlier paper (Xu, 1995) it was pointed out that estimation of the residual variance with the simple regression method is confounded by part of the QTL variance. A simple way was also provided to separate the confounding variances in a backcross design. However, simply correcting the estimated residual variance does not necessarily correct the difference in test statistic between the REG and ML. The improved regression method (IRWLS) corrects both deficiencies yet retains the simplicity and rapidity of the regression method. With the current improvement, the regression method can now be safely applied to all data analyses without any concerns.

The (revised) regression method is particularly useful for permutation tests (Churchill & Doerge, 1994) and construction of confidence intervals by bootstrapping (Visscher et al., 1996) because thousands of analyses of resampled data sets are required. In addition to its simplicity and speed allowing resampling and permutation, the regression method has another major strength that makes it very valuable for use on real data: it can be used to fit relatively complex models and thus include multiple or interacting QTL effects. The weighted regression method retains this strength of regression. If the distribution of residual error is known, the ML is optimal. In some situations, the distribution is unknown and normality is only an approximation, so the ML is also an approximate method. In contrast, REG and IRWLS are independent of the distribution of the residual error. Combined with the permutation test, the regression methods are actually nonparametric methods which may be applied to a wider range of data.

The significant discontinuity of the likelihood ratio profiles at fully informative markers is a drawback of the ML and IRWLS compared with the REG. The peaks within marker intervals have a clear pattern, that is they all face in the direction where the true QTL resides. The strong discontinuity is analogous with linkage analysis (of markers), where the likelihood ratio of zero recombination can show very strong discontinuity (to minus infinity) at a marker once one or more recombination events have been observed, because the probability that the two markers are fully linked is zero. The difference between quantitative change and qualitative change can also explain the discontinuity. When the putative QTL position is off the markers, all three genotypes of the QTL are possible so that the population actually has a mixture of three distributions, no matter how likely a particular genotype is (e.g. 0.999). When the putative position moves to a marker locus, the genotype is actually observed so that the population has a single distribution. The observed genotype has occurred with probability 1.0. The difference between 0.999 and 1.0 is a qualitative change, whereas the change from 0.998 to 0.999 is a quantitative change. The ML and IRWLS methods are extremely sensitive to the qualitative change, whereas the REG method does not distinguish between the two types of change.

It should be noted that the test statistic for the weighted least squares method (IRWLS) cannot be chosen as the reduction of the weighted residual sum of squares. This is in contrast to the simple regression method, where the QTL location is chosen at the position with the minimum residual sum of squares. The residual sum of squares for the IRWLS method is:

which can be made arbitrarily small by increasing the values of the diagonal elements of R. The diagonal elements, however, increase with the uncertainty of the genotype at a putative position, i.e. the variance of the independent variables, zj and wj, as seen in eqn (8). The uncertainty, in turn, takes its maximum value at the position with minimum information content, in the middle of an interval. The estimated QTL position will therefore be biased towards the centre of an interval if RSS is used as the test statistic. For this reason, the likelihood ratio has been chosen as the test statistic in this paper. However, other test statistics might be more appropriate, and this deserves further investigation.
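
The weighted residual sum of squares referred to above is not displayed in this copy. Consistent with the weighted regression described in the Statistical methods section, it would plausibly take the form below; this is a reconstruction, with the hats denoting the weighted least squares estimates and ê_j the jth residual.

```latex
% Reconstruction; R is diagonal with elements r_jj >= 1.
\begin{equation}
\mathrm{RSS}
 = (\mathbf{y} - X\hat\beta - Z\hat\alpha - W\hat\delta)^{\mathsf T} R^{-1}
   (\mathbf{y} - X\hat\beta - Z\hat\alpha - W\hat\delta)
 = \sum_{j=1}^{n} \frac{\hat e_j^{\,2}}{r_{jj}}
\end{equation}
```

Written as a sum, each squared residual is divided by r_jj, so larger diagonal elements of R shrink the RSS regardless of how well the model fits.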