Introduction

Mapping genes or genomic regions responsible for the variation in quantitative traits (QTLs) is now a feasible task for most economically important species because of the large number of DNA polymorphisms available scattered along the genome. The development of statistical methods to detect QTLs in experimental crosses and in outbred populations has run parallel to advances in molecular methodologies. Thus, a number of approaches like maximum likelihood (ML, Lander & Botstein, 1989), regression (Haley & Knott, 1992; Knott et al., 1996), the method of moments (Darvasi & Weller, 1992; Luo & Woolliams, 1993) or, more recently, nonparametric methods (Kruglyak & Lander, 1995; Coppieters et al., 1998) have been applied to estimate QTL effect and position. Estimation procedures have usually been compared by means of simulation where the trait was distributed as assumed under the analysis, typically a mixture of normal distributions with equal variances within QTL genotypes. ML and regression have been shown to perform similarly under a variety of situations (Haley & Knott, 1992; Knott et al., 1996), whereas the method of moments was less reliable than ML under the heteroscedastic model, i.e. when the variances for each QTL genotype differ (Luo & Woolliams, 1993). Most of these methods have been evaluated solely in terms of bias and accuracy, whereas robustness has received little attention. In particular, the effect of extreme phenotypic observations, i.e. outliers, on QTL effect estimates has not, to our knowledge, been studied in detail. The scarcity of studies on robust QTL estimation persists despite the ample statistical theory and methods available (e.g. Staudte & Sheather, 1990). Nonetheless, Jansen & Stam (1994) presented a strategy to detect outliers within an ML estimation framework.
It should be stressed that outliers may be caused by relatively common phenomena in animal or plant management, like preferential treatment of a subgroup of individuals, or a disease causing lower-than-average performance. This issue is particularly relevant because, if a large-effect gene is segregating, the trait will not be normally distributed. Thus, the researcher may be misled whenever the data show departures from normality that may be caused simply by outliers. This is especially the case with segregation analysis.

In this work we explore the performance of the ‘Minimum Distance’ (MD) estimation method in the context of QTL effect estimation. It should be remarked that other robust approaches like robust regression (e.g. Haley & Knott, 1992) or robust maximum likelihood (e.g. McLachlan & Basford, 1987) are available, but the Minimum Distance approach has been shown to be very efficient when there is data contamination caused by, for instance, outliers, and is especially powerful when applied to mixtures (Parr & Schucany, 1988; Cao et al., 1995). It was first proposed by Wolfowitz (1957), although its use at that time was limited by computational constraints. Nowadays it has become a more popular tool in robust statistics theory and practice. We also present a generalization of the MD methodology to deal with the missing data that occur with, for example, selective genotyping. In this latter instance regression provides biased estimates even without outliers.

Materials and methods

Theory

A minimum distance estimator of a parameter θ is a value that minimizes δ [F(.), F(.|θ)], where F(.|θ) is the distribution of interest, F(.) is an empirical distribution function obtained from the data, and δ [ ] is a distance measure (Titterington et al., 1985). The MD method thus comprises a variety of procedures, depending on the actual distance used. We used the Cramer–von Mises distance, as it has been shown to perform well in a variety of settings (Woodward et al., 1984; García-Dorado, 1997; García-Dorado & Marin, 1998):

δ[F(.), F(.|θ)] = 1/(12N) + Σ_{i=1}^{N} [F(yi|θ) − (2i − 1)/(2N)]²    (1)

where N is the number of observations, and F(yi|θ) is the value of the distribution function for the ith observation when the observations are ranked in ascending order. Now consider two inbred lines fixed for alternative marker alleles (MM vs. mm) and QTL alleles (QQ vs. qq). In a backcross, assume that the trait of interest follows a normal distribution N(μ1, σ1) in individuals homozygous for the QTL (QQ), and that the distribution is N(μ2, σ2) in heterozygous (Qq) individuals. The distribution function of the trait in individuals classified according to a linked marker is, assuming that haplotype QM is the nonrecombinant,

F(y|MM) = (1 − r)Φ[(y − μ1)/σ1] + rΦ[(y − μ2)/σ2]    (2a)

and

F(y|Mm) = rΦ[(y − μ1)/σ1] + (1 − r)Φ[(y − μ2)/σ2]    (2b)

where r is the recombination fraction between the marker and the QTL, and Φ[ ] is the standard normal distribution function. In order to apply eqn (1), individuals are classified and ranked within marker genotype, and the distance minimized is the sum of the within-marker-class distances. If the QTL effect is estimated using flanking markers, an expression similar to eqn (2a,b) is derived, taking into account that there are four marker classes in a backcross.
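As an illustration of how eqn (1) is minimized within marker classes, the following Python sketch estimates the four parameters from the two marker classes of eqn (2a,b). It is a sketch only, with illustrative function names; the analyses reported below used a quasi-Newton NAG routine rather than this code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def cvm_distance(y, cdf):
    """Cramer-von Mises distance between the empirical distribution of a
    sample y and the model distribution function cdf, as in eqn (1)."""
    y = np.sort(y)
    n = len(y)
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((cdf(y) - (2 * i - 1) / (2 * n)) ** 2)

def md_backcross(y_MM, y_Mm, r, theta0):
    """Minimum-distance estimates of (mu1, mu2, sigma1, sigma2) in a
    backcross with one linked marker (recombination fraction r), using the
    mixture distribution functions of eqn (2a,b)."""
    def total_distance(theta):
        mu1, mu2, s1, s2 = theta
        if s1 <= 0 or s2 <= 0:          # enforce sigma_i > 0
            return np.inf
        F_MM = lambda y: (1 - r) * norm.cdf(y, mu1, s1) + r * norm.cdf(y, mu2, s2)
        F_Mm = lambda y: r * norm.cdf(y, mu1, s1) + (1 - r) * norm.cdf(y, mu2, s2)
        # the distance minimized is the sum of within-marker-class distances
        return cvm_distance(y_MM, F_MM) + cvm_distance(y_Mm, F_Mm)
    return minimize(total_distance, theta0, method="Nelder-Mead").x
```

Because the criterion is a sum over marker classes, extending the sketch to flanking markers only means summing the distance over four classes instead of two.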

A common strategy for limiting molecular work is selective genotyping (Lander & Botstein, 1989), which consists of genotyping only the extreme individuals. The effect of a small number of aberrant data may be magnified with this strategy, and its consequences have not been explored. Although ML estimation theory is well developed for missing data, i.e. via the EM algorithm, the MD method has not been extended to this setting. Here we develop a computer-intensive strategy, the Monte Carlo MD (MC-MD), which is similar to the MC-EM algorithm proposed by Wei & Tanner (1990). It consists of a double iteration loop. The MC-MD steps are as follows.

1 Initialize θ^(0) = {μ1(0), μ2(0), σ1(0), σ2(0)}.

2 Do j=1, J

2.1 For each untyped individual with record yi, compute the probability of having QTL genotype G=QQ or Qq:

P(G = QQ|yi, θ^) = (1/σ1)U[(yi − μ1)/σ1] / {(1/σ1)U[(yi − μ1)/σ1] + (1/σ2)U[(yi − μ2)/σ2]},

with P(G = Qq|yi, θ^) = 1 − P(G = QQ|yi, θ^), the two backcross genotypes being equally probable a priori for an individual without marker information.

2.2 For each untyped individual, draw a random marker genotype given P(G=QQ|y, θ^), P(G=Qq|y, θ^) and the distances between the markers and the QTL.

2.3 Rank individuals within the current marker class and estimate θj using a regular MD algorithm.

3 Update θ^ = Σ_{j=1}^{J} θj / J.

4 Repeat from 2 until the distribution of θ^ stabilizes.

J is the number of draws of the missing marker genotypes per individual that are used to compute θ^, and U[ ] is the standard normal density function. As in other Monte Carlo methods, θ^ is not a single point estimate; rather, once convergence has been attained, samples are obtained from the distribution of θ^, which takes into account the uncertainty in the missing genotypes. For a general study on convergence properties of Monte Carlo methods, the reader is referred to Tanner (1993). A preliminary study here showed that convergence was reached in a few iterations because observations were independently distributed.
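The double loop above can be sketched as follows (Python). Here `md_fit` stands for any routine returning the MD estimate from fully classified data, the equal genotype priors are an assumption of the sketch, and all names are illustrative rather than the original implementation.

```python
import numpy as np
from scipy.stats import norm

def mc_md_step(y_MM, y_Mm, y_untyped, theta, r, md_fit, J=30, rng=None):
    """One outer iteration of MC-MD (steps 2.1-2.3 and 3). `md_fit` is any
    routine returning the MD estimate (mu1, mu2, sigma1, sigma2) from fully
    classified data."""
    rng = np.random.default_rng() if rng is None else rng
    mu1, mu2, s1, s2 = theta
    # Step 2.1: P(G = QQ | y, theta) for untyped records, with equal priors
    # for the two backcross genotypes (an assumption of this sketch)
    f1 = norm.pdf(y_untyped, mu1, s1)
    f2 = norm.pdf(y_untyped, mu2, s2)
    p_QQ = f1 / (f1 + f2)
    m = len(y_untyped)
    draws = []
    for _ in range(J):
        # Step 2.2: draw a QTL genotype, then a marker class given r
        # (haplotype QM nonrecombinant, so QQ falls in class MM w.p. 1 - r)
        is_QQ = rng.random(m) < p_QQ
        to_MM = np.where(is_QQ, rng.random(m) < 1 - r, rng.random(m) < r)
        # Step 2.3: ordinary MD estimation on the completed data
        theta_j = md_fit(np.concatenate([y_MM, y_untyped[to_MM]]),
                         np.concatenate([y_Mm, y_untyped[~to_MM]]))
        draws.append(theta_j)
    # Step 3: average the J estimates
    return np.mean(draws, axis=0)
```

In the full algorithm this step is repeated, feeding the updated θ^ back into step 2.1, until the distribution of θ^ stabilizes.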

Computer simulation

The MD and ML methods were compared by means of simulation. Backcross populations of N=200 and 500 individuals were simulated. The mean and standard deviation of the first genotype were set to μ1=σ1=1. Four cases were considered: (i) μ2=1, σ2=1; (ii) μ2=1, σ2=1.25; (iii) μ2=2.252, σ2=1; and (iv) μ2=2.282, σ2=1.25. The values for μ2 in cases (iii) and (iv) were chosen so that the difference between QTL means was about 1.25 phenotypic standard deviations. Two markers 25 cM apart were simulated, and the QTL was located at position 15 cM within the marker bracket. Five outliers were simulated for a population size of 200, and five or 25 outliers with N=500. An outlier was generated by taking a random number from a distribution N(2μ, 2σ) instead of the ‘correct’ N(μ, σ). As argued in the introduction, this way of simulating outliers mimics a directional bias such as may be caused by disease or preferential treatment. With full genotyping, the likelihood was maximized (or the distance minimized) using the E04JAF subroutine of the NAG software (Numerical Algorithm Group, 1995). This subroutine uses a quasi-Newton algorithm that allows constraints to be placed on the variables, i.e. σi > 0. With selective genotyping, the EM algorithm as described in Lander & Botstein (1989) and Luo & Kearsey (1992) was implemented for ML, and the MC-MD algorithm above for MD. The proportion genotyped in this case was 40%, the extreme 20% in each tail.
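A minimal sketch of one simulated replicate under this design is given below. The Haldane map function and all names are assumptions of the sketch, not details stated in the text.

```python
import numpy as np

def simulate_backcross(n, mu2, sigma2, n_outliers, seed=None):
    """One backcross replicate as described in the text: mu1 = sigma1 = 1,
    a QTL at 15 cM inside a 25 cM marker bracket, and outliers drawn from
    N(2*mu, 2*sigma) instead of the 'correct' N(mu, sigma). The Haldane map
    function is an assumption of this sketch."""
    rng = np.random.default_rng(seed)
    r = lambda d_cm: 0.5 * (1.0 - np.exp(-2.0 * d_cm / 100.0))  # Haldane
    r1, r2 = r(15.0), r(10.0)       # marker1-QTL and QTL-marker2 intervals
    # F1 gamete: allele origin at marker 1, then recombination per interval
    m1 = rng.random(n) < 0.5
    q = np.where(rng.random(n) < r1, ~m1, m1)   # True -> gamete carries Q
    m2 = np.where(rng.random(n) < r2, ~q, q)
    mu = np.where(q, 1.0, mu2)      # Q gamete -> QQ offspring, mean mu1 = 1
    sd = np.where(q, 1.0, sigma2)
    y = rng.normal(mu, sd)
    idx = rng.choice(n, size=n_outliers, replace=False)
    y[idx] = rng.normal(2.0 * mu[idx], 2.0 * sd[idx])   # contamination
    return y, m1, m2
```

Note that the contaminated records are drawn conditionally on the true QTL genotype, so the contamination is directional, as argued in the introduction.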

Two hundred replicates were run for each case with full genotyping, and 100 with selective genotyping. Bias, empirical standard deviation (SD) of the parameter estimates, and power were calculated. Significance thresholds were obtained by permutation (Churchill & Doerge, 1994), with 1000 permutations of the data per replicate. Power was calculated as the proportion of replicates where the test statistic exceeded the significance threshold (P=0.05). The hypotheses tested were μ1=μ2 and σ1=σ2.
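The permutation procedure can be sketched as follows (Python; the test statistic is supplied by the caller, and names are illustrative):

```python
import numpy as np

def permutation_threshold(y, markers, statistic, n_perm=1000, alpha=0.05, seed=None):
    """Empirical significance threshold by data permutation (Churchill &
    Doerge, 1994): phenotypes are shuffled against the marker data, which
    destroys any marker-trait association while preserving the phenotypic
    distribution (outliers included), and the (1 - alpha) quantile of the
    permuted statistics is taken as the threshold."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for k in range(n_perm):
        null[k] = statistic(rng.permutation(y), markers)
    return np.quantile(null, 1.0 - alpha)
```

A test then declares significance when the statistic computed on the unpermuted data exceeds this threshold; no distributional assumption about the trait, or about the outliers, is needed.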

Results and discussion

In order to compare both methods in the setting most favourable to ML, simulations were first run without outliers (Table 1). The ML and MD methods provided unbiased estimates of both location and scale parameters. Standard deviations of MD and ML estimates of means were very similar, resulting in equal power of the two methods to detect differences between QTL means. Standard deviations of ML estimates of scale parameters were slightly smaller than those of MD and, consequently, power was slightly larger with ML than with MD in heteroscedastic models. In summary, when the ‘true’ model equals the model used in the analysis, the ML procedure shows, overall, equal or better properties than its MD counterpart, although the difference is not large.

Table 1 Mean estimates of location (μ) and scale (σ) parameters with Minimum Distance (MD) and Maximum Likelihood (ML) methodologies. Values in parentheses are the empirical standard deviations over 200 replicates. The power in detecting differences between QTL means (∏μ) and standard deviations (∏σ) is also shown. No outliers, μ1 = σ1 = 1; (a) population size 200; (b) population size 500

Nonetheless, if the data simulated do not correspond exactly to the model assumed in the analysis, i.e. in all real data analyses, then ML may not be the best choice. Tables 2 and 3 show the results for ‘contaminated’ populations. The percentage of outliers was 1% or 5% for N=500, and 2.5% for N=200. Contamination affected estimation by causing bias and by increasing the empirical standard deviation of the estimates, thus augmenting the estimation error. However, the extent of these phenomena differed between methods: MD methodology was much more robust than its ML counterpart. Even with 1% of outliers and N=500, ML estimates were quite sensitive. For instance, for μ2=1 and σ2=1.25, the empirical SD of σ^2 was almost doubled with ML compared to the case without outliers, whereas the SD remained constant with MD. Increasing the percentage of outliers resulted in a markedly larger SD of the ML estimates, but only a marginally larger SD of the MD estimates. Outliers affected both location and scale parameter estimates; however, scale parameter estimates were the more affected, especially with ML. Bias was about 50–100% larger for scale than for location estimates.

Table 2 Mean estimates of location (μ) and scale (σ) parameters with Minimum Distance (MD) and Maximum Likelihood (ML) methodologies. Values in parentheses are the empirical standard deviations over 200 replicates. The power in detecting differences between QTL means (∏μ) and standard deviations (∏σ) is also shown. Population size 200, 5 outliers, μ1 = σ1 = 1
Table 3 Mean estimates of location (μ) and scale (σ) parameters with Minimum Distance (MD) and Maximum Likelihood (ML) methodologies. Values in parentheses are the empirical standard deviations over 200 replicates. The power in detecting differences between QTL means (∏μ) and standard deviations (∏σ) is also shown. Population size 500, μ1 = σ1 = 1; (a) 5 outliers; (b) 25 outliers

Contamination affects power negatively by increasing the SD of the estimates, but bias tends to spuriously inflate the differences between genotypes, which favours power, especially in ML estimation. All in all, power with respect to location parameters was little affected by outliers with both ML and MD methods. For tests comparing standard deviations, power decreased markedly in contaminated populations with ML, but only moderately with MD estimation. Power depended basically on population size, and was barely affected by increasing the number of outliers.

Setting significance thresholds is not a straightforward issue in genome searches, and non-normal or unknown distributions make it even more complex. For instance, it is not clear how to set significance thresholds by analytical or simulation methods if the nature and percentage of outliers are unknown, as is the case when analysing real data. We used data permutation because of its flexibility (Churchill & Doerge, 1994). Within the limits of the number of replicates run, the rejection rate when μ1=μ2 or σ1=σ2 was very close to the significance level set a priori (5%), showing the adequacy of the permutation strategy. Permutation behaved equally well whether or not a heteroscedastic model was simulated, and irrespective of the presence of outliers.

A further interesting aspect is the convergence of maximization algorithms in heteroscedastic models. Luo & Woolliams (1993) reported that, under the heteroscedastic model, the log-likelihood might be unbounded, so that the ML estimates may not exist. We did not, apparently, encounter this problem: ML provided ‘reasonable’ estimates, irrespective of whether or not σ1 equalled σ2, whenever no outliers were simulated. Different algorithms, e.g. simplex or quasi-Newton, provided identical results.

The performance of EM-ML and MC-MD was compared under selective genotyping in the large population (N=500). The 40% most extreme individuals were genotyped, and zero or 25 outliers were considered. After some exploratory analysis, the total number of iterations was set to 30, as was the number of times missing marker genotypes were simulated per iteration in MC-MD (J). The parameters reported are the means of the last 10 iterations (Table 4). In general, the algorithm was rather stable and quite independent of J. In noncontaminated populations, selective genotyping provided unbiased estimates of location parameters, with only a marginal increase in estimation error compared to full genotyping (compare with Table 2). Scale parameter estimates were, interestingly, slightly biased downwards in the heteroscedastic situations. Here the SDs of the estimates were larger with MD than with ML, suggesting again that ML behaves better than MD when the model of analysis corresponds to that used in simulation. And again, the conclusion is reversed when there is data contamination.

Table 4 Mean estimates of location (μ) and scale (σ) parameters with Minimum Distance (MD) and Maximum Likelihood (ML) methodologies. Values in parentheses are the empirical standard deviations over 100 replicates. Population size 500, μ1 = σ1 = 1; (a) no outliers; (b) 25 outliers. Only the extreme 40% of the distribution is genotyped

Selective genotyping led to biased estimates in a contaminated population (Table 4). It is quite clear that the stochastic MD method proposed is much more robust than ML using a standard EM algorithm. Even when μ1=μ2 and σ1=σ2, ML produced a larger bias than MD for both μ and σ estimates, and the bias increased if μ1 ≠ μ2 and σ1 ≠ σ2. Empirical SDs with ML were twice those of MD for scale parameters, and about 20% larger for location parameter estimates. Overall, selective genotyping caused almost no increase in the SDs of the MD estimates, but it had a more noticeable effect on the ML estimates. Power with selective genotyping could not be calculated using permutation because of the prohibitive amount of CPU time required by MC-MD, although it can be conjectured that the MD and ML patterns would be similar to those with full genotyping (Table 3).

In conclusion, the MC-MD method proposed alleviates to a large extent the bias caused by contamination in selectively genotyped populations. The higher computing costs of MC-MD relative to EM-ML are fully justified in this instance. However, these are only preliminary results on MC-MD, and further studies are needed to evaluate its convergence and statistical properties in a more general framework. These results confirm the expectation that selective genotyping may be a risk-prone strategy in the presence of outliers, because these will almost certainly be included in the genotyped pool, and their weight in the resulting estimates will be larger than if the whole population is genotyped. A further disadvantage of selective genotyping is an increased error in determining the QTL position (Pérez-Enciso, 1998).

General discussion and conclusion

In this work we have focused on phenotypic outliers that can be caused by extreme environmental factors like disease, and we have not studied the effect of incorrect genotyping or wrong pedigree information. If marker information is wrong but compatible with parent genotypes, a bias will occur, and the QTL effect will be underestimated. The effect of wrong marker information on QTL estimation is limited, however, and will cause a bias smaller than 2% in backcrosses unless the percentage of errors is large, e.g. greater than 10% (Pérez-Enciso, 1998).

The MD estimate is the value of the parameter that makes the model closest to the sampling information, which is a very reasonable strategy when the model assumed in the analysis does not represent the ‘true’ model, and it provides an intuitively appealing interpretation of MD estimates. Some minimum distance estimation methods have especially good properties in mixture distribution problems (Titterington et al., 1985). An additional advantage of MD methodology is its robustness. The literature shows that MD is normally more robust than ML when the real distribution does not belong to the assumed parametric family, or when the actual distribution is ‘contaminated’, for example by outliers (Woodward et al., 1984; Parr & Schucany, 1988). This is because MD methods do not give as much weight to extreme data as ML does. García-Dorado (1997) illustrates how the MD method can lead to more sensible estimates of mutation effects than ML. In this work, we have shown that this methodology has clear advantages in some instances that may be encountered in QTL analysis.

Nonetheless, location and scale parameter estimates are not equally sensitive to contamination. Dispersion parameter estimates are much more sensitive to outliers than location parameters, and it is for dispersion parameters that MD methods show comparatively more robust behaviour. This is because scale parameters depend to a larger extent on squared deviations, which are magnified by outliers. As a result, contamination may lead to the erroneous conclusion that variances are heterogeneous if, for some reason, the proportion of outliers differs between QTL genotypes.
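A toy numerical illustration of this asymmetry (the numbers are ours, not from the simulations reported above):

```python
import numpy as np

# A single aberrant record shifts the sample mean by roughly
# (y_out - ybar)/N, but enters the variance through the squared deviation
# (y_out - ybar)**2/N, so the scale estimate moves much further than the
# location estimate.
rng = np.random.default_rng(42)
y = rng.normal(1.0, 1.0, 200)           # 'clean' sample, mu = sigma = 1
y_cont = np.append(y, 8.0)              # add one outlier, ~7 sigma away
shift_mean = abs(y_cont.mean() - y.mean())
shift_sd = abs(y_cont.std(ddof=1) - y.std(ddof=1))
print(shift_mean, shift_sd)             # the SD shift is several-fold larger
```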

The choice of distance in MD methodology is somewhat arbitrary, and it may severely affect the estimates. Nonetheless, some distances have a clearer interpretation than others; for instance, the Kullback–Leibler distance (Kullback & Leibler, 1951) is equivalent to the ML criterion. All distance measures tend to produce asymptotically normally distributed estimators, so the choice of distance is most critical in small samples. Here we used the Cramer–von Mises distance, as it is one of the most widely used (Parr & Schucany, 1988; García-Dorado, 1997). We also tried other measures, like the Kolmogorov–Smirnov distance, but Cramer–von Mises gave identical or better results.

Typically, MD consists of comparing distribution functions, but alternative MD methods based on density rather than distribution functions have been developed by Cao et al. (1995). In this strategy, the distance between a density function and a nonparametric density estimator is minimized. This approach consumes more CPU time than standard ones, because it requires specifying an appropriate smoothing parameter and evaluating the chosen kernel estimator. According to Cao et al. (1995), density-based MD methods are especially suited for testing whether the assumed density belongs to a given parametric family. In the context of QTL studies, this may be relevant for detecting how many QTL genotypes are segregating in a given population, or for detecting departures from normality within QTL genotypes; this issue merits further attention. The possible presence of influential points that are not outliers is a further aspect of robustness not dealt with here. Jansen & Stam (1994) considered changes in the weighted sum of squared residuals as a means of checking for the presence of outliers. A comparison of changes in parameter estimates vs. changes in the sum of squared residuals when a given observation is deleted may provide a means of identifying influential observations that are not outliers.
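A minimal sketch of the density-based variant discussed above, under strong simplifying assumptions (known sigma, a single normal component, default kernel bandwidth; Cao et al. treat general parametric families and data-driven smoothing):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gaussian_kde, norm

def density_md_mean(y, sigma=1.0):
    """Density-based minimum distance in the spirit of Cao et al. (1995):
    choose mu to minimize the integrated squared difference between the
    parametric density N(mu, sigma) and a Gaussian kernel density estimate
    of the data."""
    kde = gaussian_kde(y)
    grid = np.linspace(y.min() - 3.0 * sigma, y.max() + 3.0 * sigma, 512)
    f_hat = kde(grid)                  # nonparametric density estimate
    dx = grid[1] - grid[0]
    def l2(mu):                        # integrated squared difference
        return np.sum((norm.pdf(grid, mu, sigma) - f_hat) ** 2) * dx
    return minimize_scalar(l2, bounds=(grid[0], grid[-1]), method="bounded").x
```

Because the whole estimated density enters the criterion, a poorly fitting parametric family leaves a large residual distance, which is what makes this variant suited to testing the assumed family.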

Minimum distance methods do not come without disadvantages. It is not clear how to take into account a correlated structure in the data (e.g. genetic relationships, common environment), because MD strategies rest on the assumption that observations are independent and identically distributed. We have also found that, in some instances, MD statistics may be unstable along a genome search when the marker interval changes. This problem can be alleviated by using alternatives to the genome scan. We (M. Pérez-Enciso & L. Varona, unpubl. obs.) have studied a strategy where the whole genome is partitioned into segments and the effect of each segment is analysed simultaneously using a multiple regression/ANOVA approach. Among the issues that should be explored in more detail are the behaviour of the MD approach with more than one QTL and with non-normal distributions, as well as the properties of the MC-MD approach.