Introduction

Novel genes in crop cultivars or animal inbred strains are a critical resource for plant and animal improvement. Therefore it is relevant to ask whether it is also possible to map a quantitative trait locus (QTL) within the crop cultivars or animal inbred strains. Grupe et al. (2001) proposed the use of the mean phenotypic values of inbred strains to map the likely genomic locations of QTL ‘in silico,’ which theoretically would represent a major advance. Chesler et al. (2001) and Darvasi (2001) have questioned the validity of this so-called in silico mapping (ISM) on the basis that it is associated with a relatively high false-positive rate (FPR). Recently, Peltz's group further extended the computational method for mapping phenotypic traits that vary among inbred strains onto haplotypic blocks, known as the haplotype-based ISM (HISM) approach (Liao et al., 2004). This method predicted the genetic basis for strain-specific differences in several biologically important traits of mice (Liao et al., 2004; Guo et al., 2006, 2007; Liang et al., 2006a, 2006b), but it only performs a single-locus analysis. Nevertheless, the advantages of detecting QTL from inbred strains without the need for conventional QTL analysis are strong enough to justify seeking improvements to the ISM methodology.

The ISM and its extension assume the presence of a single QTL per linkage group. This assumption is problematic (Kao et al., 1999; Zhang, 2006), primarily because only the effects of the putative QTL at any particular marker position can be included in the model and all other QTL effects have to be ignored. As a result, estimates of the effects and the positions of QTL will be biased whenever there is more than one QTL present on a given linkage group. On the other hand, when several markers are in strong linkage disequilibrium (LD), the method can fail to reject any of these markers because of strong linkage. This inevitably leads to a relatively high FPR. Several approaches have been proposed to ameliorate this situation. The early approaches to this problem applied composite interval mapping on a large sample size (Jansen, 1993; Zeng, 1993). More recently, multi-QTL mapping has been developed (Kao et al., 1999; Xu, 2003, 2007; Zhang and Xu, 2005; Zhang, 2006; Xu and Jia, 2007). However, all these approaches are focused on segregating populations from controlled crosses rather than from the germplasm resource of inbred lines. Therefore, a current priority is to incorporate the notion of multi-QTL mapping into the extended ISM, which can be used in a natural population.

Another major concern is whether the choice of absolute trait differences as the response variable is optimal. In Haseman–Elston (H–E) regression (Haseman and Elston, 1972), the earliest suggestion was to apply the squared trait difference as the phenotypic difference. As pointed out by Wright (1997), this approach discards some useful information, and some benefit has been seen in using the trait values of both members of a sib pair. In effect, the squared difference and the trait sum together contain exactly the same information as the original two trait values. Drigalenko (1998) developed the idea further and suggested using the trait product as the phenotypic difference. The result has been further confirmed in multi-QTL H–E regression (Zhang et al., 2008), and these methods were exploited as the basis for the extended ISM described in this article.

In this paper, we show that the ISM can be extended to multi-loci ISM (MLISM) using all markers of a given genome. In the current version of MLISM, the parameters were estimated by the penalized maximum likelihood (PML) method; several response variables for phenotypic difference were compared with each other in order to optimize the procedure. The effect of LD markers around the identified QTL was also investigated.

The new method proposed in this paper was tested by simulation. The purposes of the simulation were threefold: (1) to select the best response variable, (2) to identify whether the MLISM FPR was substantially decreased and (3) to understand the effect of LD markers around detected QTL.

Materials and methods

Simulation study design

We conducted seven simulation experiments in this paper. In the first, the simulated pedigree was the maize pedigree described by Zhang et al. (2008) (Figure 1). The maize pedigree contained n = 404 inbred lines. Of these, n0 = 103 were base (founder) lines, which were in linkage equilibrium so that the genotypes for markers and QTL with four alleles could be simulated. The n1 = 301 non-founders were bred by repeated self-pollination of a hybrid between two inbred lines. Thus, each non-founder line represents a recombinant inbred line with respect to a pair of known parents, and its genotype could be generated from the genotypes of its parents in the same way that recombinant inbred lines are simulated from their two parents. All of the non-founder lines can be used to detect QTL. Sixty-one equally spaced markers were simulated on a single-chromosome segment 600 cM long. A single QTL was located at position 200 cM and coincided with a marker. The environmental variance was calculated as

$$\sigma_e^2 = \sigma_g^2\,\frac{1 - h^2}{h^2},$$

where $\sigma_g^2$ is the genetic variance and $h^2$ is the heritability. Allelic effects were calculated by relating the genetic variance of the QTL to both the allelic frequencies and the allelic number. The phenotypic value of each line was the sum of the corresponding QTL genotypic values and a residual error with an assumed $N(0, \sigma_e^2)$ distribution. Each simulation run consisted of 200 replicates. The other simulation experiments were carried out similarly. All simulated parameters are given in Table 1.
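The phenotype simulation step can be illustrated with a short sketch. It assumes the usual definition of heritability as the ratio of genetic variance to total variance, and it uses the sample variance of the supplied genotypic values as the genetic variance. The function names and the seed argument are hypothetical and are not taken from the original simulation code.

```python
import math
import random


def environmental_variance(genetic_variance, heritability):
    """Residual variance implied by h2 = sg2 / (sg2 + se2)."""
    if not 0.0 < heritability < 1.0:
        raise ValueError("heritability must lie strictly between 0 and 1")
    return genetic_variance * (1.0 - heritability) / heritability


def simulate_phenotypes(genotypic_values, heritability, seed=0):
    """Add N(0, se2) noise to each line's genotypic value.

    Assumption: the genetic variance is taken as the (population-style)
    variance of the genotypic values passed in.
    """
    rng = random.Random(seed)
    n = len(genotypic_values)
    mean_g = sum(genotypic_values) / n
    sigma_g2 = sum((g - mean_g) ** 2 for g in genotypic_values) / n
    sigma_e = math.sqrt(environmental_variance(sigma_g2, heritability))
    return [g + rng.gauss(0.0, sigma_e) for g in genotypic_values]
```

For a QTL with unit genetic variance and a heritability of 0.20, the implied residual variance is four times the genetic variance.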

Figure 1

The familial relationships between the 404 inbred maize lines used in all simulation experiments except the last one.

Table 1 Simulated parameters in all the simulation experiments

Genetic model

Let the kth inbred-line pair among all non-founders have trait phenotypic values $(z_{k,1}, z_{k,2})$, and let the average value of $z_{k,1}$ and $z_{k,2}$ over all pairs be $\bar{z}$. The absolute trait difference is $y_k^A = |z_{k,1} - z_{k,2}|$, and the identity-by-state (IBS) sharing at a marker locus for the two lines is $x_k$. On the basis of the ISM of Grupe et al. (2001), the regression of $y^A$ on $x$ is described by:

$$y_k^A = b_0 + b\,x_k + e_k \qquad (1)$$

where $b_0$ is the regression intercept and $b$ is the regression slope (Grupe et al., 2001). Provided that each marker locus on the entire genome can be linked to a putative QTL, model (1) can be extended to the MLISM:

$$y_k = b_0 + \sum_{i=1}^{p} x_{ki}\,b_i + e_k \qquad (2)$$

where $b_i$ is the regression coefficient related to the ith putative QTL; $p$ is the number of markers on the entire genome; $x_{ki}$ is the IBS of the kth inbred-line pair at the ith marker locus; and $e_k$ is the residual error with an assumed $N(0, \sigma^2)$ distribution. With $\bar{z}$ denoting the average trait value over all pairs, the squared trait difference is given by $y_k^D = (z_{k,1} - z_{k,2})^2$, the trait product by $y_k^P = (z_{k,1} - \bar{z})(z_{k,2} - \bar{z})$, and the trait sum by $y_k^S = [(z_{k,1} - \bar{z}) + (z_{k,2} - \bar{z})]^2$. The regressions of $y_k^D$, $y_k^P$ and $y_k^S$ on the IBS can be established similarly. Response variables are denoted generically by $y$.
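The four response variables and the IBS covariate can be computed for every pair of lines as in the following sketch. It assumes that each inbred line is homozygous, so that the IBS at a marker is coded 1 when the two lines carry the same allele and 0 otherwise; this coding and the function names are illustrative assumptions.

```python
from itertools import combinations


def pairwise_responses(z):
    """Return the A, D, P and S response variables for every pair of lines."""
    # Each line appears in the same number of pairs, so the average over
    # all pairs equals the plain mean over lines.
    z_bar = sum(z) / len(z)
    rows = []
    for i, j in combinations(range(len(z)), 2):
        d1, d2 = z[i] - z_bar, z[j] - z_bar
        rows.append({
            "pair": (i, j),
            "A": abs(z[i] - z[j]),     # absolute trait difference
            "D": (z[i] - z[j]) ** 2,   # squared trait difference
            "P": d1 * d2,              # mean-corrected trait product
            "S": (d1 + d2) ** 2,       # squared mean-corrected trait sum
        })
    return rows


def pairwise_ibs(alleles):
    """IBS at one marker for every pair (assumed coding: 1 = same allele)."""
    return [1 if alleles[i] == alleles[j] else 0
            for i, j in combinations(range(len(alleles)), 2)]
```

Because S minus D equals four times P for every pair, the squared difference and the trait sum together carry the information contained in the trait product.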

Parameter estimation

Several methods are available to estimate the parameters in model (2), for example the PML (Zhang and Xu, 2005) and the Bayesian LASSO (Park and Casella, 2008; Yi and Xu, 2008). If the residual variance is heterogeneous, the method proposed by Yi and Banerjee (2009) is available as well. Here we adopt the PML method. Briefly, in the PML method, the penalized likelihood function is the product of a likelihood function $L(\theta \mid Y, M)$ and a penalty function $P(\theta, \xi)$. The likelihood function is calculated by

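$$L(\theta \mid Y, M) = \prod_{k=1}^{m} \phi\!\left(y_k;\; b_0 + \sum_{i=1}^{p} x_{ki}\,b_i,\; \sigma^2\right),$$

where $\theta = (b_0, b_1, \ldots, b_p, c, \sigma^2)$, $m = n_1(n_1 - 1)/2$ is the number of line pairs, $Y = (y_1, y_2, \ldots, y_m)^T$, $M$ represents marker information, and $\phi(y; \alpha, \sigma^2)$ is a normal density function with mean $\alpha$ and variance $\sigma^2$. The penalty function is: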

where $\xi = (\mu_1, \ldots, \mu_p, \sigma_1^2, \ldots, \sigma_p^2)$ is the vector of hyperparameters, and $\eta > 0$ is the prior sample size for assessing $\mu_i$. Note that $p(\sigma_i^2) \propto 1$ for the response variable $y_k^P$, and $p(\sigma_i^2) \propto \text{inv-}\chi^2(\nu, s_i^2)$, with $s_i^2 = 0$ ($i = 1, 2, \ldots, p$), for the other response variables (Zhang et al., 2008). The penalized likelihood function is

Thus, the PML estimates for both model parameters and hyperparameters are

The iterative steps for parameter estimation are identical to those given by Zhang and Xu (2005) and He and Zhang (2008). The convergence criterion was $\sum_i |\theta_i^{(t+1)} - \theta_i^{(t)}| < 10^{-6}$. In equation (10), the value of $\nu$ depends on the response variable when $s_i^2 = 0$ ($i = 1, \ldots, p$). From a wide range of values, we determined empirically that $\nu$ should be set to 6 for $y_k^A$ and 7 for $y_k^D$ (data not shown).
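For orientation, the penalty and the penalized likelihood can be written out under the hierarchical prior used in the PML framework of Zhang and Xu (2005). This is a sketch rather than a transcription of the original displays: it assumes a normal prior $b_i \sim N(\mu_i, \sigma_i^2)$ with $\mu_i \sim N(0, \sigma_i^2/\eta)$, which is consistent with the hyperparameters $\mu_i$, $\sigma_i^2$ and $\eta$ listed in this section, and the symbol $\Psi$ for the penalized likelihood is introduced here only for illustration.

```latex
\begin{align*}
P(\theta, \xi) &= \prod_{i=1}^{p}
    \phi\!\left(b_i;\, \mu_i, \sigma_i^2\right)\,
    \phi\!\left(\mu_i;\, 0, \sigma_i^2/\eta\right)\,
    p\!\left(\sigma_i^2\right), \\
\Psi(\theta, \xi) &= L(\theta \mid Y, M)\, P(\theta, \xi), \\
(\hat{\theta}, \hat{\xi}) &= \arg\max_{\theta,\, \xi}\; \Psi(\theta, \xi).
\end{align*}
```

Under this form, a marker that is not supported by the data has both its prior mean and its prior variance pulled toward zero, which is what allows a model with more markers than would be estimable by ordinary least squares to be fitted.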

Likelihood ratio test

As stated by Zhang and Xu (2005) and Zhang et al. (2008), it is now possible to test the null hypothesis H0:bi=0 that there is no QTL linked to the ith marker locus by using the LR test statistic:

$$LR_i = -2\left[L(\hat{\theta}_{-i}) - L(\hat{\theta}')\right],$$

where $\theta' = \{b_0, b_{(1)}, \ldots, b_{(q)}, \sigma^2\}$ with $|b_{(k)}| > 10^{-6}$, $k = 1, 2, \ldots, q$; $\theta_{-i} = \{b_0, b_{(1)}, \ldots, b_{(i-1)}, b_{(i+1)}, \ldots, b_{(q)}, \sigma_{-i}^2\}$ is the vector of parameters that excludes $b_{(i)}$; $\sigma_{-i}^2$ is the residual variance of the reduced model under H0; and $L(\theta)$ is the log-likelihood function. For simplicity, the usual QTL significance criterion ($\mathrm{LOD}_i \geq 3$) was applied. For each QTL simulated, the samples for which LOD or −lgP(H0) exceeded the threshold of 3.0 were counted. The ratio of the number of samples in which the QTL was detected to the total number of replicates (200) represents the empirical power of the method. The FPR was given by the ratio of the number of false-positive effects to the total number of zero effects. Note that linked false positives were only counted once.
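The bookkeeping behind the LOD criterion, the empirical power and the FPR can be sketched as follows. The conversion LOD = LR / (2 ln 10) is the standard relation between a likelihood ratio statistic and a LOD score; the function names are hypothetical, and grouping linked false positives so that each cluster is counted once is left to the caller.

```python
import math


def lod_from_lr(lr):
    """Convert a likelihood ratio statistic to a LOD score."""
    return lr / (2.0 * math.log(10.0))


def empirical_power(lod_per_replicate, threshold=3.0):
    """Fraction of replicates in which the simulated QTL reaches the threshold."""
    hits = sum(1 for lod in lod_per_replicate if lod >= threshold)
    return hits / len(lod_per_replicate)


def false_positive_rate(n_false_positives, n_zero_effects):
    """False positives divided by the number of markers with zero true effect.

    n_false_positives should already count each cluster of linked false
    positives once.
    """
    return n_false_positives / n_zero_effects
```

A LOD threshold of 3 corresponds to a likelihood ratio statistic of about 13.8.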

Monte Carlo simulation studies

The choice of the best response variable for phenotypic difference

To address the first objective, four response variables, namely the absolute trait difference $y_k^A$, the squared trait difference $y_k^D$, the trait product $y_k^P$ and the trait sum $y_k^S$, were compared in the first and second simulation experiments with the maize pedigree in Figure 1. In the first simulation experiment, the number of non-founders in the maize pedigree was 301. Sixty-one equally spaced markers, each with four alleles, were simulated on a single-chromosome segment of length 600 cM, and a single QTL with 0.20 heritability and four alleles was located at 200 cM. Each of the 200 simulated data sets was analyzed four times by MLISM, once with each response variable (Table 2). This analysis showed that $y_k^P$ gave the lowest FPR, the highest QTL detection power and the smallest standard deviation of the estimated QTL position. In the second simulation experiment, three QTL, each with two alleles, were simulated with heritabilities of 0.05, 0.10 and 0.15 and located at marker positions 100, 300 and 500 cM, respectively. The number of alleles for markers was set at 2, and other parameters were the same as those in the first simulation experiment. The results were consistent with the above trend (Table 3). This implies that the trait product is the optimal response variable in the MLISM. It should be noted that the estimate of $\sigma^2$ for the MLISM differs from the simulated value. This is because the MLISM analyzes pairwise differences between lines rather than the original trait values. In addition, when heritability is low, the power for QTL detection is lower for the HISM than for the MLISM. In the following simulation experiments, only outcomes using $y_k^P$ are reported.

Table 2 Comparisons of in silico mapping (ISM), haplotype-based ISM (HISM) and variance component analysis (VCA) with multiple loci ISM (MLISM) under a single-quantitative trait locus (QTL) with a 0.20 heritability with a maize pedigree of 301 inbred lines (200 replicates)
Table 3 Comparisons of in silico mapping (ISM) and haplotype-based ISM (HISM) with multiple loci ISM (MLISM) under multiple quantitative trait locus (QTL) simulated with a maize pedigree of 301 inbred lines (200 replicates)

The evaluation of FPR for multiple loci ISM

To determine whether the FPR of the MLISM was substantially reduced, the FPR was calculated in five simulation experiments. Each data set was analyzed by three methods: the ISM, the HISM and the MLISM.

In the third simulation experiment, we pruned the maize pedigree so that it contained 100, 200 or 300 non-founders. Eleven equally spaced markers, each with three alleles, were placed on each of two 100 cM chromosome segments; two QTL, each with three alleles, were simulated with heritabilities of 0.10 and 0.20 and located at marker positions 20 and 80 cM, respectively. The results are shown in Figure 2a (Supplementary Table S1). As expected, the FPR of the MLISM was substantially lower than that of the ISM at every sample size. With small sample sizes, the power of QTL detection was lower for the HISM than for the MLISM. In addition, the FPR of the MLISM increased with sample size, because the QTL detection power also increases with sample size.

Figure 2

Box plot of false-positive rate for three methods, including in silico mapping (ISM), haplotype-based ISM (HISM) and multiple loci ISM (MLISM). Results labeled ‘A’ under the x axis are from the entire genome and those labeled ‘N’ from the chromosomes without quantitative trait locus. (a) The number of non-founders (sample size); (b) the number of alleles for markers and QTL; (c) 1:2:3 for the three alleles, denoted by ‘S’; and 1:1:1 for the three alleles, denoted by ‘U’; (d) a mimic genetic linkage map; (e) a high-density genetic linkage map.

The fourth simulation experiment was designed to investigate the effect of the number of alleles (for both markers and QTL) on the FPR of the new method, so the number of alleles was set at 2, 3 or 4. The number of non-founders in the maize pedigree was 301, and other parameters were the same as those in the third simulation experiment. The results are given in Figure 2b (Supplementary Table S2). Compared with the ISM, the MLISM gave a substantially lower FPR for every number of alleles. The QTL detection power decreased as the number of alleles increased. With a small number of alleles, the power for QTL detection was slightly lower for the HISM than for the MLISM.

In the fifth simulation experiment, the effect of the allelic frequency on the FPR of the new method was assessed by letting the frequency ratio of the three alleles (for the two simulated QTL and all markers) be set as either 1:1:1 (uniform distribution) or 1:2:3 (skewed distribution). Other parameters were the same as those in the fourth simulation experiment. The results are given in Figure 2c (Supplementary Table S3). The trend for the FPR obtained by both distributions was the same. In addition, the skewed distribution decreased the statistical power.

The sixth simulation experiment was designed to mimic real data, in which markers are not equally spaced. The corrected genetic linkage map of Zhu et al. (2007) was adopted; this genome consists of 12 chromosomes. We simulated genotypes for 103 founders and 301 non-founders from the maize pedigree at 167 markers, each with two alleles. These markers covered 2121.15 cM of the genome with an average marker interval of 13.68 cM. Six QTL with heritabilities of 0.05, 0.05, 0.025, 0.15, 0.075 and 0.05 were simulated at marker positions 0.00, 60.32, 123.52, 942.78, 1672.05 and 1723.00 cM, respectively. The results are shown in Figure 2d (Supplementary Table S4); the FPR of the MLISM was again substantially reduced.

Another simulation was performed to evaluate the effect of a large genome with a high marker density on the FPR of the new method. The simulated genome consisted of 10 chromosomes, each with 101 evenly spaced markers at 2 cM intervals. We simulated genotypes for 10 founders and 100 non-founders from a simulated pedigree at the 1010 markers (two alleles each) along the genome. Ten QTL with heritabilities of 0.025, 0.05, 0.025, 0.05, 0.15, 0.075, 0.025, 0.10, 0.05 and 0.05 were placed at marker positions 220, 270, 320, 370, 650, 1050, 1150, 1450, 1850 and 1950 cM, respectively. The simulation was repeated 100 times. The results are shown in Figure 2e (Supplementary Table S5), and the same trend as in Figure 2d was observed.
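The layout of this high-density map can be made concrete with a small sketch. It assumes that the QTL positions are cumulative, that is, measured along the 10 chromosomes laid end to end, each 200 cM long; this reading and the helper names are assumptions made for illustration.

```python
CHROMOSOMES = 10
MARKERS_PER_CHROMOSOME = 101
SPACING_CM = 2.0
CHROMOSOME_LENGTH_CM = (MARKERS_PER_CHROMOSOME - 1) * SPACING_CM  # 200 cM

QTL_POSITIONS_CM = [220, 270, 320, 370, 650, 1050, 1150, 1450, 1850, 1950]


def locate(position_cm):
    """Map a cumulative position to (chromosome number, marker index).

    Chromosome numbers start at 1 and marker indices at 0. A position that
    is an exact multiple of the chromosome length is assigned to the start
    of the next chromosome; none of the simulated QTL falls on such a point.
    """
    chromosome = int(position_cm // CHROMOSOME_LENGTH_CM)
    local_cm = position_cm - chromosome * CHROMOSOME_LENGTH_CM
    marker = round(local_cm / SPACING_CM)
    return chromosome + 1, marker


def qtl_free_chromosomes(positions_cm):
    """Chromosomes that carry none of the listed QTL."""
    occupied = {locate(p)[0] for p in positions_cm}
    return [c for c in range(1, CHROMOSOMES + 1) if c not in occupied]
```

Under this assumption the ten QTL fall on chromosomes 2, 4, 6, 8 and 10, leaving the odd-numbered chromosomes free of QTL, which matches the distinction drawn in Figure 2 between the entire genome and the chromosomes without QTL.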

The relationship between QTL detection and LD

To address the last objective, the last simulation experiment was used to investigate the relationship between QTL detection and LD. When a QTL was identified at a marker position, the markers surrounding it were examined. For these markers, we determined not only whether they were in LD with the identified QTL, but also whether false QTL were detected at the markers in LD. The proportion of false QTL among the LD markers could thus be calculated. The results are listed in Table 4. The false QTL proportion for the new method was generally lower than those for the ISM and the HISM. This implies that the new method largely controls for the effect of markers in strong LD noted above.
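The false QTL proportion can be sketched as follows for biallelic markers scored on inbred lines. The LD measure (the squared correlation r² between 0/1 allele codes) and the threshold of 0.8 are assumptions chosen for illustration; the text does not state which measure or cut-off was used.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1-coded markers across lines."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var_a * var_b)


def false_qtl_proportion(qtl_marker, genotypes, detected, threshold=0.8):
    """Share of markers in LD with the true QTL that were declared QTL.

    genotypes maps marker name -> list of 0/1 codes, one per line;
    detected is the set of marker names declared significant.
    """
    ld_markers = [m for m in genotypes
                  if m != qtl_marker
                  and r_squared(genotypes[qtl_marker], genotypes[m]) >= threshold]
    if not ld_markers:
        return 0.0
    false_hits = sum(1 for m in ld_markers if m in detected)
    return false_hits / len(ld_markers)
```

A lower value means that fewer of the markers in LD with the true QTL were themselves reported as QTL.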

Table 4 The relationship between quantitative trait locus (QTL) detection and linkage disequilibrium (LD) for in silico mapping (ISM), haplotype-based ISM (HISM) and multiple loci ISM (MLISM) approaches with a simulated pedigree of 100 non-founders and a high-density genetic linkage map (100 replicates)

Discussion

We used a maize pedigree population of inbred lines of known ancestry as an example to demonstrate the new method of MLISM. This method can be directly applied to inbred strains of mice. When haplotype data are available, the new method can further be extended to haplotype-based MLISM (Liao et al., 2004; Guo et al., 2006, 2007; Liang et al., 2006a, 2006b). When single-nucleotide polymorphism molecular data are available, the new method can be used to improve the precision of QTL mapping and to select the candidate genes. When microarray gene expression marker data are available, it will be possible to elucidate the functional relationships among genes. This study indicates that the MLISM method, with the trait product as the response variable, represents an improvement on the existing methods for QTL mapping in a population of inbred lines.

This method was validated using simulated data. When it is applied to real data, most of its favorable properties should be retained; however, some minor modifications are required. First, population structure generates spurious genotype-phenotype associations (Pritchard and Rosenberg, 1999; Yu et al., 2006). As reported by Yu et al. (2006), false positives can be reduced by incorporating a Q matrix $\{p_{(k,1)h}\}$ obtained from the STRUCTURE software into model (2). The model then becomes

where $H$ is the number of sub-populations; $p_{(k,1)h}$ and $p_{(k,2)h}$ are the posterior probabilities that the two lines of the kth pair, with phenotypes $z_{k,1}$ and $z_{k,2}$, respectively, belong to the hth sub-population; and $s_h$ is the regression coefficient related to the hth sub-population. In the simulation studies, however, we focused on evaluating the performance of the proposed method rather than on the effect of population structure. Second, a polygenic effect is generally included in mixed genetic models, such as those of Yu et al. (2006) and Zhang et al. (2005). However, there are no polygenic effects in the MLISM, because all QTL across the genome are already included in the present genetic model. For ISM with unrelated individuals, one would expect a method that fits all markers in the model to be a better choice. Thus, there is no need to use a mixed model with polygenic background control, which requires testing single loci one by one. Third, in reality, markers may not be evenly distributed along the genome. Although the MLISM does not depend on the uniformity of marker distribution, tightly linked markers may cause poor estimates of the marker effects because of a high degree of multicollinearity. Therefore, it is recommended that only one marker be used from each cluster of tightly linked markers.

The proposed method differs from the ISM and the HISM in several ways. First, it extends the analysis from a single-QTL to a multi-QTL model, which substantially reduces the FPR. Second, the response variable used in the ISM is replaced by the trait product. Third, the parameters are estimated by PML rather than by least squares or analysis of variance; this further decreases the FPR because PML shrinks toward zero the estimated effects of loci that are only loosely linked to a QTL. In particular, the PML approach is able to estimate the parameters of an oversaturated genetic model (Zhang and Xu, 2005; He and Zhang, 2008). It should also be noted that this shrinkage property of the PML method results in a lower power for the new method than for the ISM, but this shortcoming can be overcome by increasing the sample size. In addition, when several markers are in strong LD, the LR test in the single-QTL analysis can fail to reject any of these markers. The relationship between QTL detection and LD was therefore analyzed in the last simulation experiment. The results in Table 4 show that the ratio of the number of false QTL to the number of LD markers is lower for the new method than for the ISM and the HISM. This indicates that the new method largely controls for the effect of strong LD, which is a clear improvement. With small sample sizes, low QTL heritability or a small number of alleles, the power for QTL detection is lower for the HISM than for the MLISM. Overall, the new method therefore performs better than the ISM and its extension.

The new method differs from many QTL mapping methods proposed to date. The MLISM is designed for whole-genome data sets collected from unrelated or very distantly related inbred lines, whereas other mapping methods are designed mostly for segregating populations from controlled crosses. The new method differs from genome-wide association mapping as well. The MLISM analyzes pairwise differences between lines, whereas genome-wide association mapping fits the trait value directly with a linear model in which the independent variables are simply the marker genotypes. Results from the Monte Carlo simulation showed that the former technique performs better than the latter when markers have multiple alleles and the number of non-founders is small (data not shown), because the number of observations (line pairs) is much larger for the former than the number of lines available to the latter. On the other hand, the larger sample size can result in a slightly higher FPR, although the FPR remains low.
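The difference in the number of observations can be illustrated with a short calculation based on the pair count m = n1(n1 − 1)/2 given under Parameter estimation. The helper name is hypothetical.

```python
from math import comb


def observations(n_lines):
    """Observations available to a per-line analysis and to a pairwise one."""
    return {"lines": n_lines, "pairs": comb(n_lines, 2)}
```

With 301 non-founders the pairwise analysis has 45150 observations, and with 100 non-founders it has 4950. Because the pairs share lines, they are not independent observations.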

Although the MLISM is similar to multi-QTL Haseman–Elston regression (Zhang et al., 2008), it differs in several respects. First, the multi-locus genetic model of the MLISM is based on IBS, whereas that of multi-QTL Haseman–Elston regression is based on identity-by-descent. Second, if the pedigree is incomplete, that is, if pedigree records are missing or erroneous, then the MLISM is a better choice than multi-QTL Haseman–Elston regression. Finally, the power of QTL detection is higher for the MLISM than for multi-QTL Haseman–Elston regression (Zhang et al., 2008).

In coarse QTL mapping, mapping precision is relatively low because of a combination of too few generations, limited marker density and too few recombinants. The availability of plentiful single-nucleotide polymorphism markers or haplotypes across a population of inbred lines can address this problem. Single-nucleotide polymorphism marker density can, in principle, be increased to as high a level as necessary. In addition, there are a number of advantages to using inbred lines instead of backcross and F2 populations for QTL mapping. These include a greater allele number and a broader reference population, a much higher mapping resolution and a lesser investment in time (Buckler and Thornsberry, 2002; Flint-Garcia et al., 2003; Zhang et al., 2005; Yu and Buckler, 2006; Iwata et al., 2007). In particular, where single-nucleotide polymorphism genotypes are available, the fine mapping of QTL in a population of inbred lines becomes possible without the earlier need to generate mapping populations.