Further investigation on the regression method of mapping quantitative trait loci

XU, Shizhong

doi:10.1046/j.1365-2540.1998.00307.x

Original Article
Published: 01 March 1998

Heredity volume 80, pages 364–373 (1998)

Abstract

The simple regression method of mapping quantitative trait loci (QTL) is further investigated in comparison with the mixture model maximum likelihood method under high heritabilities, dominant and missing markers. No significant difference between the two methods is detected in terms of errors of parameter estimation and statistical powers, with the exception that the estimation of residual variance provided by the regression method is confounded with part of the QTL variance. The test statistic profiles show some difference between the two methods, but the difference is only detectable at the micro level. An alternative method, referred to as iteratively reweighted least squares, is proposed, which can correct the deficiency of parameter confounding in the regression method yet retains the properties of simplicity and rapidity of the ordinary regression method. Like the existing regression method, the weighted least squares method can be useful in QTL mapping in conjunction with the permutation tests and construction of confidence intervals by bootstrapping.

Introduction

Lander & Botstein (1989) presented an exact maximum likelihood (ML) method for mapping quantitative trait loci (QTL) in line-crossing experiments. When the putative position lies between markers, the QTL genotype is not observed, so the model involves missing data. Solutions of the exact maximum likelihood method with missing data are usually obtained using the expectation–maximization (EM) algorithm (Dempster et al., 1977), which requires many cycles of iteration. Haley & Knott (1992) discovered that the ML can be well approximated by a simple regression method (REG). These authors conducted extensive computer simulations and found no detectable difference between ML and REG over the range of parameters considered. A similar argument is found in Martinez & Curnow (1992). As a consequence, the simple regression method has become widely accepted, especially in European countries, because of its simplicity and convenience of use relative to the ML.

Given that two methods are available for QTL mapping, which method should be chosen for real data analyses? For analysis of a single data set, it does not matter which one is used because the two methods will generate almost identical results. Some researchers may want to avoid the word ‘approximation’ and choose ML, and others may prefer simplicity and thus choose REG. Xu (1995) recently found that the residual variance estimated by the REG method contains part of the QTL variance, caused by the uncertainty of QTL genotype. This observation should alert users of the REG that the residual variance must be interpreted with caution. However, the REG method is computationally so much faster than the ML that it may become the method of choice for multiple data analyses, such as permutation tests (Churchill & Doerge, 1994) and the bootstrap construction of confidence intervals (Visscher et al., 1996). These nonparametric methods involve thousands of analyses of resampled versions of the same data set and could be prohibitive for ML if the data set and genome size are large.

The purposes of this paper are: (i) to investigate further the difference between the REG and ML methods via simulation studies in situations with high heritabilities and dominant and/or missing markers; and (ii) to improve the existing regression method so that the pure environmental variance can be separated from the residual variance, yet the property of high computing speed is retained.

Statistical methods

Linear model

Let y_j be the phenotypic value of an F₂ individual that can be described by the following linear model:

$$y_j = x_j\beta + z_j\alpha + w_j\delta + \varepsilon_j \qquad (1)$$

where x_j is a known vector, β is a vector of unknown fixed effects, α and δ are, respectively, the average effect of allelic substitution and the dominance effect of a putative QTL, and ε_j is the residual error, distributed as N(0,σ²_ε). Note that for a single-QTL model the residual error is purely caused by uncontrollable environmental noise. The independent variables, z_j and w_j, are defined as:

and

where Q₁Q₁, Q₂Q₂ and Q₁Q₂ are, respectively, the genotypes of the two parental lines and the F₁ hybrid. Because the genotype of a QTL is not observable if the QTL is not at a marker, z_j and w_j are usually missing. However, the conditional distribution of z and w can be inferred from the genotypes of linked markers. Let p_(kl)j be the conditional probability that the individual is of genotype Q_kQ_l, given marker information. Given the conditional probabilities, y_j is considered to be sampled from a mixture of three distributions with means of μ₁₁, μ₁₂ and μ₂₂ and a common variance σ²_ε, where:

Statistical tests and parameter estimation are conducted through one of the three methods described below.
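The conditional moments of z_j and w_j that the regression-based methods rely on follow directly from the genotype probabilities p_(kl)j. The following is a minimal Python sketch, assuming the coding z = 1, 0, −1 and w = 0, 1, 0 for Q₁Q₁, Q₁Q₂ and Q₂Q₂ (a conventional choice that is an assumption here, not something the text above confirms). The names `conditional_moments` and `Moments` are hypothetical.

```python
from typing import NamedTuple

# Assumed genotype coding for (Q1Q1, Q1Q2, Q2Q2); not confirmed by the source text.
Z_CODES = (1.0, 0.0, -1.0)
W_CODES = (0.0, 1.0, 0.0)


class Moments(NamedTuple):
    ez: float      # E(z | marker information)
    ew: float      # E(w | marker information)
    var_z: float   # Var(z | marker information)
    var_w: float   # Var(w | marker information)
    cov_zw: float  # Cov(z, w | marker information)


def conditional_moments(p11: float, p12: float, p22: float) -> Moments:
    """First and second conditional moments of z and w for one individual."""
    probs = (p11, p12, p22)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("genotype probabilities must sum to 1")
    ez = sum(p * z for p, z in zip(probs, Z_CODES))
    ew = sum(p * w for p, w in zip(probs, W_CODES))
    ezz = sum(p * z * z for p, z in zip(probs, Z_CODES))
    eww = sum(p * w * w for p, w in zip(probs, W_CODES))
    ezw = sum(p * z * w for p, z, w in zip(probs, Z_CODES, W_CODES))
    return Moments(ez, ew, ezz - ez * ez, eww - ew * ew, ezw - ez * ew)
```

When the genotype is observed (one probability equal to 1), all three second moments are zero, which is the situation at a fully informative marker.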

Maximum likelihood method (ML)

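The likelihood function is:

$$L(\theta) = \prod_{j=1}^{n}\left[p_{(11)j}\,\varphi_{11}(y_j) + p_{(12)j}\,\varphi_{12}(y_j) + p_{(22)j}\,\varphi_{22}(y_j)\right] \qquad (2)$$

where n is the number of individuals and φ_kl(y_j) is the normal probability density for individuals with genotype Q_kQ_l, with mean μ_kl and variance σ²_ε. It is well known that the maximum likelihood solution for the unknown parameters, θ=[β α δ σ²_ε]^T, can be obtained via the EM algorithm (Dempster et al., 1977). To test the hypothesis that no QTLs are segregating, i.e. H₀: α=δ=0, the following likelihood ratio test statistic is applied:

$$\Lambda = -2\left[\ln L(\hat{\theta}_0) - \ln L(\hat{\theta})\right] \qquad (3)$$

where θ̂ is the ML estimate of θ and θ̂₀ is the estimate obtained under the two constraints α=0 and δ=0.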
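The EM iteration for this mixture can be sketched in a few lines of Python. The sketch assumes an intercept-only fixed effect and the genotypic means μ₁₁ = μ + α, μ₁₂ = μ + δ and μ₂₂ = μ − α (an assumed parameterization). It estimates the three genotype means directly and converts them to μ, α and δ afterwards, which is equivalent under that one-to-one mapping. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def normal_pdf(y: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(y: Sequence[float], priors: Sequence[Sequence[float]],
                   means: Sequence[float], var: float) -> float:
    """Log of the mixture likelihood: sum_j ln sum_kl p_(kl)j * phi_kl(y_j)."""
    return sum(
        math.log(sum(p * normal_pdf(yj, m, var) for p, m in zip(pj, means)))
        for yj, pj in zip(y, priors)
    )


def em_mixture(y: Sequence[float], priors: Sequence[Sequence[float]],
               max_iter: int = 500, tol: float = 1e-8) -> Dict[str, object]:
    """EM for a three-component normal mixture with known mixing probabilities.

    priors[j] = (p11, p12, p22) for individual j, given marker information.
    Assumes every genotype class has positive total posterior weight.
    """
    n = len(y)
    ybar = sum(y) / n
    var = sum((yj - ybar) ** 2 for yj in y) / n
    means: List[float] = [ybar, ybar, ybar]  # start at the null fit (alpha = delta = 0)
    trace = [log_likelihood(y, priors, means, var)]  # trace[0] is the null log-likelihood
    for _ in range(max_iter):
        # E-step: posterior genotype probabilities given phenotype and markers.
        post = []
        for yj, pj in zip(y, priors):
            dens = [p * normal_pdf(yj, m, var) for p, m in zip(pj, means)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: posterior-weighted genotype means and pooled variance.
        means = [
            sum(q[k] * yj for q, yj in zip(post, y)) / sum(q[k] for q in post)
            for k in range(3)
        ]
        var = sum(
            q[k] * (yj - means[k]) ** 2 for q, yj in zip(post, y) for k in range(3)
        ) / n
        trace.append(log_likelihood(y, priors, means, var))
        if trace[-1] - trace[-2] < tol:
            break
    mu = 0.5 * (means[0] + means[2])
    return {
        "mu": mu,
        "alpha": 0.5 * (means[0] - means[2]),
        "delta": means[1] - mu,
        "sigma2": var,
        "lrt": -2.0 * (trace[0] - trace[-1]),  # likelihood ratio statistic
        "trace": trace,
    }
```

Starting from the null fit means the first entry of `trace` is the maximized log-likelihood under H₀, so the likelihood ratio statistic is available without a second fit.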

Simple regression method (REG)

The regression method of QTL mapping developed by Haley & Knott (1992) and Martinez & Curnow (1992) is an approximation of the ML method. These authors approximate the mixture of three distributions by a single distribution so that the ML solution can be obtained by a simple regression approach. The approximate single model is:

$$y_j = x_j\beta + E(z_j \mid I_M)\,\alpha + E(w_j \mid I_M)\,\delta + e_j \qquad (4)$$

where I_M denotes marker information and:

and:

Note that the residual e_j is different from that given earlier. This single model has a mean of:

and a variance of:

The unknown parameters are solved using the ordinary least squares method (Haley & Knott, 1992). Under the assumption that y_j is normal, the least squares solutions are identical to the maximum likelihood estimators if the likelihood function is defined by:

Two assumptions of the ML are violated by the regression analysis. One is the normal distribution of y_j and the other is the homogeneous residual variance. Violation of normality is not a problem for the regression method because estimation of the parameters does not depend on a normal distribution. Although the hypothesis test depends on the normal assumption, the t- and F-tests are usually very robust. Heterogeneous residual variance may cause a slight problem in the regression analysis (Xu, 1995), but is not likely to change the results qualitatively relative to the true ML analysis (Haley & Knott, 1992). The difference between the true ML and the regression method comes from the difference in the estimation of the residual variance. The regression method generally provides a residual variance estimate that contains the part of the QTL variance left unexplained because of the uncertainty of QTL genotype (Xu, 1995). The F-value can be used as the test statistic for the simple regression method. However, to compare this method with the ML, the test statistic originally used by Haley & Knott (1992) is adopted here:

$$LR = n \ln\left(\frac{RSS_{reduced}}{RSS_{full}}\right)$$

where n is the number of individuals, RSS_full is the residual sum of squares of the full model and RSS_reduced is that of the reduced model. This test statistic can be compared with that given in eqn (3) because they are very similar under the null hypothesis (see Table 5).
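As an illustration of how this statistic is evaluated along a chromosome, the sketch below regresses the phenotype on the conditional expectation of z_j at each putative position and reports the position with the largest statistic. For brevity it fits an additive effect only, so the least squares fit has a closed form. The full model with δ would need a general least squares solver. The function names and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def regression_lr(y: Sequence[float], ez: Sequence[float]) -> float:
    """n * ln(RSS_reduced / RSS_full) for y regressed on E(z | markers).

    Reduced model: mean only. Full model: mean plus additive effect.
    Assumes ez is not constant and the fit is not exact (RSS_full > 0).
    """
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(ez) / n
    sxx = sum((x - xbar) ** 2 for x in ez)
    sxy = sum((x - xbar) * (v - ybar) for x, v in zip(ez, y))
    rss_reduced = sum((v - ybar) ** 2 for v in y)
    rss_full = rss_reduced - sxy ** 2 / sxx
    return n * math.log(rss_reduced / rss_full)


def scan(y: Sequence[float],
         ez_by_position: Dict[float, Sequence[float]]) -> Tuple[float, Dict[float, float]]:
    """Evaluate the statistic at every putative position; return the best one."""
    stats = {pos: regression_lr(y, ez) for pos, ez in ez_by_position.items()}
    best = max(stats, key=stats.get)
    return best, stats
```

Because the reduced model is the same at every position, maximizing this statistic is equivalent to minimizing the residual sum of squares of the full model.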

Table 5 Empirical critical values of the test statistic for testing the presence of a QTL on a chromosome of length 100 cM


Iteratively reweighted least squares method (IRWLS)

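To retain the advantages of both the regression method and the ML method, a weighted regression method is investigated here. The mixture model is still approximated by a single model (eqn 4), but the residual variance is further partitioned into several components:

$$\mathrm{Var}(e_j) = \mathrm{Var}(z_j \mid I_M)\,\alpha^{2} + \mathrm{Var}(w_j \mid I_M)\,\delta^{2} + 2\,\mathrm{Cov}(z_j w_j \mid I_M)\,\alpha\delta + \sigma_{\varepsilon}^{2} \qquad (8)$$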

where Var(z_j|I_M)α² is part of the QTL variance not explained because of the uncertainty of z_j, Var(w_j|I_M)δ² is part of the QTL variance not explained because of the uncertainty of w_j, and 2Cov(z_j w_j|I_M)αδ is because of the uncertainty of both z_j and w_j. All three additional components in the residual will vanish if the genotype of the QTL is actually observed, i.e. Var(z_j|I_M)=Var(w_j|I_M)= Cov(z_j w_j|I_M)=0. These additional components are computed as follows:

and:

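Let y be an n×1 vector of the data. The model can be expressed in matrix notation as:

$$y = X\beta + Z\alpha + W\delta + e$$

where X is the matrix whose jth row is x_j, Z is an n×1 vector with the jth element equal to E(z_j|I_M), W is an n×1 vector with the jth element equal to E(w_j|I_M), and e is an n×1 vector of residuals. The expectation and variance matrix of the model are:

$$E(y) = X\beta + Z\alpha + W\delta$$

and:

$$\mathrm{Var}(y) = R\,\sigma_{\varepsilon}^{2}$$

where R is a diagonal matrix with the jjth element equal to:

$$r_{jj} = \mathrm{Var}(z_j \mid I_M)\,\lambda_{\alpha} + \mathrm{Var}(w_j \mid I_M)\,\lambda_{\delta} + 2\,\mathrm{Cov}(z_j w_j \mid I_M)\,\lambda_{\alpha\delta} + 1$$

and:

$$\lambda_{\alpha} = \alpha^{2}/\sigma_{\varepsilon}^{2}, \qquad \lambda_{\delta} = \delta^{2}/\sigma_{\varepsilon}^{2}, \qquad \lambda_{\alpha\delta} = \alpha\delta/\sigma_{\varepsilon}^{2}$$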

The likelihood function is:

The ML solution can be solved via a weighted least squares approach which is described below.

Given initial values of λ_α, λ_δ and λ_αδ, matrix R is treated as known. With R treated as known, the solution for θ is easily obtained via weighted regression analysis:

and:

Because R depends on unknown parameters, it must be updated using the current estimates of α, δ and σ²_ε, and the estimation is then repeated until convergence. This algorithm is extremely fast: only two to three cycles of iteration are required, in contrast to 80–100 iterations of the EM algorithm at the same accuracy. The likelihood ratio test statistic, Λ, is applied to the weighted regression analysis.
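The iteration just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the author's implementation. It assumes an intercept-only fixed effect, the coding z = 1, 0, −1 and w = 0, 1, 0, weights equal to one plus the unexplained QTL variance divided by σ²_ε, and a residual variance estimate that divides the weighted residual sum of squares by n. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed genotype coding for (Q1Q1, Q1Q2, Q2Q2); not confirmed by the source text.
Z_CODES = (1.0, 0.0, -1.0)
W_CODES = (0.0, 1.0, 0.0)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def irwls(y: Sequence[float], probs: Sequence[Tuple[float, float, float]],
          max_iter: int = 50, tol: float = 1e-8) -> List[Tuple[float, float, float, float]]:
    """Return (mu, alpha, delta, sigma2) after each cycle.

    The first entry uses R = I and is therefore the ordinary regression fit.
    Assumes a non-singular design and a non-zero residual variance.
    """
    rows, extra = [], []
    for p in probs:
        ez = sum(pk * z for pk, z in zip(p, Z_CODES))
        ew = sum(pk * w for pk, w in zip(p, W_CODES))
        vz = sum(pk * z * z for pk, z in zip(p, Z_CODES)) - ez * ez
        vw = sum(pk * w * w for pk, w in zip(p, W_CODES)) - ew * ew
        czw = sum(pk * z * w for pk, z, w in zip(p, Z_CODES, W_CODES)) - ez * ew
        rows.append((1.0, ez, ew))
        extra.append((vz, vw, czw))
    n = len(y)
    r = [1.0] * n  # diagonal of R; all ones gives ordinary least squares
    history = []
    for _ in range(max_iter):
        # Weighted normal equations (H' R^-1 H) b = H' R^-1 y.
        hth = [[sum(h[i] * h[j] / rj for h, rj in zip(rows, r)) for j in range(3)]
               for i in range(3)]
        hty = [sum(h[i] * yj / rj for h, yj, rj in zip(rows, y, r)) for i in range(3)]
        mu, alpha, delta = solve(hth, hty)
        sigma2 = sum(
            (yj - mu - alpha * h[1] - delta * h[2]) ** 2 / rj
            for h, yj, rj in zip(rows, y, r)
        ) / n
        history.append((mu, alpha, delta, sigma2))
        # Update R from the current estimates of alpha, delta and sigma2.
        new_r = [
            1.0 + (vz * alpha ** 2 + vw * delta ** 2 + 2.0 * czw * alpha * delta) / sigma2
            for vz, vw, czw in extra
        ]
        converged = max(abs(a - b) for a, b in zip(new_r, r)) < tol
        r = new_r
        if converged:
            break
    return history
```

Comparing the first and last entries of the returned list shows the effect of the reweighting. Because every weight is at least one, the residual variance from the weighted fit can never exceed the ordinary regression estimate.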

Dominant and missing markers

The missing marker problem can be solved easily. A missing marker is skipped and the nearest nonmissing markers on either side are used instead. Dominant markers provide partial information, which is extracted using a hidden Markov model algorithm. Details of the hidden Markov model are found in Lander & Green (1987) and Kruglyak et al. (1995).
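The rule for missing markers amounts to a search for the nearest scored marker on each side of the putative position. A minimal Python sketch follows. It assumes marker positions sorted in ascending order and uses `None` for a missing genotype. The function name and the genotype labels are hypothetical, and the hidden Markov model treatment of dominant markers is not covered.

```python
from typing import Optional, Sequence, Tuple

Marker = Tuple[float, str]  # (position in cM, genotype label)


def flanking_informative(positions: Sequence[float],
                         genotypes: Sequence[Optional[str]],
                         locus: float) -> Tuple[Optional[Marker], Optional[Marker]]:
    """Nearest nonmissing markers to the left and right of a putative position.

    Positions must be in ascending order. A marker exactly at the locus
    counts as the left flank. Either flank is None if no scored marker exists.
    """
    left: Optional[Marker] = None
    right: Optional[Marker] = None
    for pos, genotype in zip(positions, genotypes):
        if genotype is None:
            continue  # skip over missing markers
        if pos <= locus:
            left = (pos, genotype)
        else:
            right = (pos, genotype)
            break
    return left, right
```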

Simulation studies

Eleven equally spaced markers were simulated on a single chromosome segment of length 100 cM. A single QTL was located at position 25 cM. The population size (number of F₂ individuals) was set at 300. Under the null model, the QTL was assigned a value of zero for both the additive and dominance effects. Simulations were repeated 1000 times and the 95th and 99th percentiles of the test statistics were chosen as the empirical critical values for power calculation. Under the alternative model, a nonzero additive effect was simulated while the dominance effect was still set to zero. Simulations were repeated 100 times. Empirical power was calculated by counting the number of runs in which test statistics were greater than the empirical critical values. In all simulations, the variance of the environmental effect was set at σ²_ε=1.0.
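The two summary steps of this design, taking a percentile of the null statistics and counting exceedances under the alternative, can be sketched in Python as follows. The nearest-rank percentile rule is an assumption, since the text does not say how the percentiles were taken, and the function names are hypothetical.

```python
import math
from typing import Sequence


def empirical_critical_value(null_stats: Sequence[float], percentile: int) -> float:
    """Percentile of the test statistics simulated under the null model.

    Uses the nearest-rank rule: the smallest value with at least
    `percentile` per cent of the statistics at or below it.
    """
    ordered = sorted(null_stats)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def empirical_power(alt_stats: Sequence[float], critical_value: float) -> float:
    """Proportion of alternative-model runs exceeding the critical value."""
    return sum(1 for s in alt_stats if s > critical_value) / len(alt_stats)
```

With 1000 null replicates the 95th and 99th percentiles are the 950th and 990th ordered statistics under this rule.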

Each data set was analysed using the three methods: the exact maximum likelihood method (ML), the simple linear regression analysis (REG) and the iteratively reweighted least squares method (IRWLS). Powers and estimation errors of the three methods were compared, based on averages of 100 runs.

Factors considered include the size of the QTL effect, measured by the average effect of gene substitution (α), and the amount of marker information. The average effect of gene substitution was examined at three levels: α=0.324 leading to h²=0.05; α=0.820 resulting in h²=0.25 and α=1.155 corresponding to h²=0.40. The amount of marker information was investigated in four situations: (i) all markers codominant and no missing markers, the highest level of marker information content; (ii) 50 per cent loci in the F₁ parent randomly set to dominant and no missing markers in the offspring; (iii) 50 per cent loci in the F₂ offspring randomly set to missing values; and (iv) 50 per cent loci in the parent dominant and 50 per cent loci in the offspring missing, the lowest level of marker information content.
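The three heritability levels can be checked from the F₂ genotype frequencies of 1/4, 1/2 and 1/4. Assuming genotypic values of α, δ and −α for Q₁Q₁, Q₁Q₂ and Q₂Q₂ (the standard coding, taken here as an assumption), the QTL variance is α²/2 + δ²/4. A short Python check, with a hypothetical function name:

```python
def f2_heritability(alpha: float, delta: float = 0.0, sigma2_e: float = 1.0) -> float:
    """Proportion of phenotypic variance explained by a single QTL in an F2.

    Assumes genotypic values alpha, delta, -alpha with frequencies 1/4, 1/2, 1/4,
    so the additive variance is alpha^2 / 2 and the dominance variance is delta^2 / 4.
    """
    var_qtl = 0.5 * alpha ** 2 + 0.25 * delta ** 2
    return var_qtl / (var_qtl + sigma2_e)


for a in (0.324, 0.820, 1.155):
    print(f"alpha = {a:.3f}  h2 = {f2_heritability(a):.2f}")
```

With δ=0 and σ²_ε=1.0, the three values of α give heritabilities that round to 0.05, 0.25 and 0.40, in agreement with the levels stated above.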

Average values of the estimated parameters and their standard deviations, calculated from 100 replicated simulations, are listed in Tables 1–4. The three methods show virtually no difference with regard to estimation of the additive effect (α), the dominance effect (δ) and the location of the QTL (cM_A), which is consistent with Haley & Knott (1992) for the comparison of ML and REG. Another observation is that when both marker information content and heritability are low, estimation of the QTL position tends to be biased towards the centre of the chromosome for all three methods. This bias occurs because, with smaller QTL effects and less marker information, some of the QTL peaks found may represent not the simulated QTL but a Type I error. The positions of these Type I errors tend to be randomly distributed along the linkage group; thus their mean position is at the centre of the chromosome, and their joint effect, along with some real QTL, is to move the estimated position averaged over all simulated replicates towards the centre of the chromosome. The last, and important, observation is that the simulations verify the theoretical prediction that the simple regression estimate of the residual variance is confounded with part of the QTL variance. The level of confounding increases as the marker information content decreases (from Table 1 to Table 4). The confounding, however, no longer exists in the IRWLS method (see the comparison with ML).

Table 1 Comparison of three methods of QTL mapping via Monte Carlo simulations. All markers are codominant and there are no missing values. Parametric values not listed in the table are: QTL position (cM_A)=25 cM, δ=0 and σ²_ε=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses


Table 2 Comparison of three methods of QTL mapping via Monte Carlo simulations. There are 50 per cent dominant markers with no missing values. Parametric values not listed in the table are: QTL position (cM_A)=25 cM, δ=0 and σ²_ε=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses


Table 3 Comparison of three methods of QTL mapping via Monte Carlo simulations. All markers are codominant and there are, on average, 50 per cent missing markers. Parametric values not listed in the table are: QTL position (cM_A)=25 cM, δ=0 and σ²_ε=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses


Table 4 Comparison of three methods of QTL mapping via Monte Carlo simulations. There are, on average, 50 per cent dominant and 50 per cent missing markers. Parametric values not listed in the table are: QTL position (cM_A)=25 cM, δ=0, and σ²_ε=1.0. Results are averages of 100 replicated simulations with the standard deviations over the replicates given in parentheses


The empirical critical values based on 1000 repeated simulations are given in Table 5, showing very little difference between the three methods. These critical values, however, differ across levels of marker information content. The highest critical values occur when all markers are codominant and there are no missing markers. These empirical critical values are then used to compute the empirical statistical powers for the three methods (see Table 6). Again, the three methods have virtually identical statistical powers.

Table 6 Empirical powers of three methods for QTL detection under various situations. α is the Type I error rate


To examine the comparison of the three methods in detail, the likelihood ratio test statistics of the three methods are plotted against chromosome position. Figure 1 shows the likelihood ratio profiles (average of 100 runs) at three levels of heritability in the situation where 50 per cent of the marker loci in the offspring are missing. The IRWLS method is nearly indistinguishable from the ML method, and both methods have higher testing signals than the REG method. Figure 1a–c also shows that the difference between ML (IRWLS) and REG increases as the heritability increases. With the heritability fixed at 0.25, the likelihood ratio profiles (average of 100 runs) of the three methods are compared at each of the four levels of marker information content. Again, the IRWLS and ML profiles are virtually identical but both differ from that of the simple regression method. When all markers are codominant and there are no missing markers, the test statistics of the three methods are identical at marker loci but different off the markers. The ML (IRWLS) curves show significant discontinuity at marker loci (Fig. 2a). When 50 per cent of the marker loci are dominant and there are no missing markers, the discontinuity of the ML (IRWLS) still exists but becomes less obvious (Fig. 2b). The test statistics of the ML (IRWLS) at the marker loci are now different from those of the REG. As the marker information content decreases, the discontinuity of ML (IRWLS) disappears (Fig. 2c, d).

In conclusion, ML and IRWLS show no difference but both differ from REG. However, the difference is only detectable at the micro level. The advantage of ML and IRWLS over the REG is that they provide a true estimate of σ²_ε. The ML, however, is many times slower than REG because many cycles of iterations (≈80) are required for the EM algorithm to converge. In contrast, the IRWLS algorithm only requires two to three cycles of iterations to converge, about two or three times slower than the REG but 30–40 times faster than the ML. Of course, the comparisons in computing speed are based on the algorithms adopted here in this particular research. If other algorithms had been used, such as the Newton–Raphson iteration for the ML and the regression on marker-type algorithm (Whittaker et al., 1996) for the REG, the comparisons would produce quantitatively different results, but the conclusion is not anticipated to change qualitatively.

Discussion

In an earlier paper (Xu, 1995) it was pointed out that estimation of the residual variance with the simple regression method is confounded by part of the QTL variance. A simple way was also provided to separate the confounding variances in a backcross design. However, simply correcting the estimated residual variance does not necessarily correct the difference in test statistic between the REG and ML. The improved regression method (IRWLS) corrects both deficiencies yet retains the simplicity and rapidity of the regression method. With the current improvement, the regression method can now be safely applied to all data analyses without any concerns.

The (revised) regression method is particularly useful for permutation tests (Churchill & Doerge, 1994) and construction of confidence intervals by bootstrapping (Visscher et al., 1996) because thousands of analyses of resampled data sets are required. In addition to its simplicity and speed allowing resampling and permutation, the regression method has another major strength that makes it very valuable for use on real data: it can be used to fit relatively complex models and thus include multiple or interacting QTL effects. The weighted regression method retains this strength of regression. If the distribution of residual error is known, the ML is optimal. In some situations, the distribution is unknown and normality is only an approximation, so the ML is also an approximate method. In contrast, REG and IRWLS are independent of the distribution of the residual error. Combined with the permutation test, the regression methods are actually nonparametric methods which may be applied to a wider range of data.
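The permutation procedure of Churchill & Doerge (1994) illustrates why speed matters: each permutation requires a complete scan of the chromosome. The sketch below is a minimal Python outline under stated assumptions. The argument `scan` is a hypothetical user-supplied function that takes a phenotype list and returns the test statistics at all putative positions, with marker data held fixed inside it. The nearest-rank percentile rule is also an assumption.

```python
import math
import random
from typing import Callable, List, Sequence


def permutation_threshold(y: Sequence[float],
                          scan: Callable[[List[float]], Sequence[float]],
                          n_perm: int = 1000,
                          percentile: int = 95,
                          seed: int = 0) -> float:
    """Empirical threshold from the permutation distribution of the maximum statistic.

    Phenotypes are shuffled relative to the marker data, which breaks any
    marker-trait association while keeping the marker structure intact.
    """
    rng = random.Random(seed)
    shuffled = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        maxima.append(max(scan(shuffled)))  # largest statistic over all positions
    maxima.sort()
    rank = max(1, math.ceil(percentile * n_perm / 100))
    return maxima[rank - 1]
```

Any of the three methods can be passed in as `scan`. With n_perm in the thousands, a scan that needs two or three weighted regressions per position is far more practical than one that needs about 80 EM cycles per position.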

The significant discontinuity of the likelihood ratio profiles at fully informative markers is a drawback of the ML and IRWLS compared with the REG. The peaks within marker intervals have a clear pattern, that is they all face in the direction where the true QTL resides. The strong discontinuity is analogous with linkage analysis (of markers), where the likelihood ratio of zero recombination can show very strong discontinuity (to minus infinity) at a marker once one or more recombination events have been observed, because the probability that the two markers are fully linked is zero. The difference between quantitative change and qualitative change can also explain the discontinuity. When the putative QTL position is off the markers, all three genotypes of the QTL are possible so that the population actually has a mixture of three distributions, no matter how likely a particular genotype is (e.g. 0.999). When the putative position moves to a marker locus, the genotype is actually observed so that the population has a single distribution. The observed genotype has occurred with probability 1.0. The difference between 0.999 and 1.0 is a qualitative change, whereas the change from 0.998 to 0.999 is a quantitative change. The ML and IRWLS methods are extremely sensitive to the qualitative change, whereas the REG method does not distinguish between the two types of change.

It should be noted that the test statistic for the weighted least squares method (IRWLS) cannot be chosen as the reduction of the weighted residual sum of squares. This is in contrast to the simple regression method, where the QTL location is chosen at the position with the minimum residual sum of squares. The residual sum of squares for the IRWLS method is:

$$RSS = (y - X\hat{\beta} - Z\hat{\alpha} - W\hat{\delta})^{T} R^{-1} (y - X\hat{\beta} - Z\hat{\alpha} - W\hat{\delta})$$

which can be made as small as possible by increasing the values of the diagonal elements of R. The diagonal elements, however, are proportional to the uncertainty of the genotype at a putative position, i.e. the variance of the independent variables, z_j and w_j, as seen in eqn (8). This uncertainty takes its maximum value at the position with minimum information content, in the middle of an interval. The estimated QTL position would therefore be biased towards the centre of an interval if RSS were used as the test statistic. For this reason, the likelihood ratio has been chosen as the test statistic in this paper. However, other test statistics might be more appropriate, and this deserves further investigation.

References

Churchill, G. A. and Doerge, R. W. (1994). Empirical threshold values for quantitative trait mapping. Genetics, 138: 963–971.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J R Statist Soc B, 39: 1–38.
Haley, C. S. and Knott, S. A. (1992). A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity, 69: 315–324.
Kruglyak, L., Daly, M. J. and Lander, E. S. (1995). Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am J Hum Genet, 56: 519–527.
Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics, 121: 185–199.
Lander, E. S. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA, 84: 2363–2367.
Martinez, O. and Curnow, R. N. (1992). Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers. Theor Appl Genet, 85: 480–488.
Visscher, P. M., Thompson, R. and Haley, C. S. (1996). Confidence intervals in QTL mapping by bootstrapping. Genetics, 143: 1013–1020.
Whittaker, J. C., Thompson, R. and Visscher, P. M. (1996). On the mapping of QTL by regression of phenotype on marker type. Heredity, 77: 23–32.
Xu, S. (1995). A comment on the simple regression method for interval mapping. Genetics, 141: 1657–1659.

Acknowledgements

The author thanks two anonymous reviewers for their helpful comments and criticisms on an earlier version of the manuscript. The reviewers provided elegant explanations for the bias in QTL position estimation and the discontinuity of the test statistic profiles in the ML, which have been incorporated in the revision. This research was supported by the National Institutes of Health Grant GM55321–01 and the USDA National Research Initiative Competitive Grants Program 95–37205–2313.

Author information

Authors and Affiliations

Department of Botany and Plant Sciences, University of California, Riverside, 92521, CA, USA
Shizhong XU

Corresponding author

Correspondence to Shizhong XU.

About this article

Cite this article

XU, S. Further investigation on the regression method of mapping quantitative trait loci. Heredity 80, 364–373 (1998). https://doi.org/10.1046/j.1365-2540.1998.00307.x

Received: 11 April 1997
Published: 01 March 1998
Issue Date: 01 March 1998
DOI: https://doi.org/10.1046/j.1365-2540.1998.00307.x
