Introduction

There has been considerable debate about whether recombination occurs in mitochondrial DNA. Recombination has never been directly observed in human mtDNA, although paternal inheritance has recently been observed (Schwartz and Vissing, 2002). Two lines of indirect evidence suggest that recombination might have occurred. The first comes from the excess of homoplasies observed in phylogenetic trees, that is, identical mutation events occurring independently in different parts of a phylogeny (Eyre-Walker et al, 1999). However, this excess of homoplasies could also be due to hypervariable sites, and it thus remains unclear whether recombination or heterogeneity in the mutation rate is involved (McVean, 2001; Wiuf, 2001). The second line of evidence comes from the observation of a negative relationship between linkage disequilibrium (LD) and distance in some human mtDNA data sets (Awadalla et al, 1999). However, the analysis of other data sets has not corroborated this observation (Ingman et al, 2000; Jorde and Bamshad, 2000; Elson et al, 2001; Herrnstadt et al, 2002a), and when the relationship is observed, it is observed only when LD is measured with r2 as opposed to D′ (Jorde and Bamshad, 2000). This has led to an intense discussion about whether recombination occurs in human mitochondria (Jorde and Bamshad, 2000; Kivisild and Villems, 2000; Kumar et al, 2000; Parsons and Irwin, 2000; McVean, 2001; Wiuf, 2001; Innan and Nordborg, 2002). McVean (2001) suggested a way to obtain consistent results from the two statistics: performing the analysis only on pairs of sites that are informative about recombination. However, the choice of informative sites requires prior knowledge of the recombination rate in the data set analysed, which makes the outcome of this test difficult to interpret.
Further work on methods of detecting recombination from polymorphism data showed that the power for detecting recombination, that is, the probability of detecting recombination when there is recombination, by estimating the number of homoplasies (Posada and Crandall, 2001) or the relationship between LD and distance (Meunier and Eyre-Walker, 2001; Wiuf, 2001) is very low for small rates of recombination. However, simulation (Posada and Crandall, 2001) and experimental data analysis (Posada, 2002) showed that the most powerful methods to detect recombination from polymorphism data are the homoplasy test (Maynard-Smith and Smith, 1998), Geneconv (Sawyer, 1999) and Maximum Chi-Square (Maynard-Smith, 1992). Unfortunately, the power of the LD-distance test was not assessed along with these three methods.

We decided to extensively reanalyse the evidence of recombination in human mtDNA because different groups have used different data sets and different methods to analyse their data set. For example, Awadalla et al (1999) restricted their analysis to synonymous variants segregating at greater than 10%, Herrnstadt et al (2002a, 2002b) to variants segregating at greater than 5% whereas Ingman et al (2000) included all polymorphisms. Awadalla et al (1999) reasoned that restricting the analysis to polymorphisms segregating at higher frequency would increase the probability of detecting recombination by focusing the analysis on older mutations. However, this has never been tested, and theoretical work actually suggests that the ability to detect recombination is independent of the frequency of the alleles included in the analysis, at least in populations that are stationary in size (Meunier and Eyre-Walker, unpublished results). Furthermore, several large data sets of complete human mtDNA sequences have recently been published (Ingman et al, 2000; Finnila et al, 2001; Herrnstadt et al, 2002a, 2002b), but these have not been analysed in depth. In this paper we analyse the relationship between LD and distance, using both r2 and D′ to measure LD, excluding mutations segregating below several different thresholds. We also run the homoplasy test (Maynard-Smith and Smith, 1998), Geneconv (Sawyer, 1999) and Maximum Chi-Square (Maynard-Smith, 1992).

Methods

Data

We extracted the protein coding sequences from four recently published data sets of complete mtDNA sequences (45 sequences from Awadalla et al, 1999; 53 sequences from Ingman et al, 2000; 192 sequences from Finnila et al, 2001 and 560 sequences from Herrnstadt et al, 2002a). Elson et al (2001) recently analysed a data set of 64 European sequences and two Africans; this data set was not publicly available for analysis. The sequences compiled by Awadalla et al and those published by Ingman et al and Herrnstadt et al are globally distributed, whereas the sequences published by Finnila et al are from Finnish individuals. Since there are many polymorphisms in the Herrnstadt data we split the data set into three population groups (African 56 sequences, Asian 69 sequences and European 435 sequences), which we analysed separately. From the protein coding sequence we extracted biallelic synonymous polymorphisms segregating in codons which did not contain any nonsynonymous polymorphisms. We discarded nonsynonymous segregating sites because selection on these sites may introduce noise into the relationship between LD and distance. If there are epistatic interactions between nonsynonymous mutations, then two mutations may be favoured together, increase substantially in frequency and generate LD. McVean (2001) also suggested that adaptive substitution could lead to a correlation between LD and distance. The details of the data are given in Table 1. The data are available upon request.

Table 1 Nucleotide polymorphism data set used in the analysis

We also compiled 10 data sets of human mtDNA restriction fragment length polymorphism (RFLP) (Table 2) from various geographic regions. It is not generally possible to determine whether an RFLP is due to a synonymous or a nonsynonymous mutation without additional sequencing; this information was only provided for data set 3. To analyse the relationship between LD and distance, we removed the polymorphic sites contained in the D-loop (500 bp on each side of the replication origin) as a disproportionate number of the polymorphisms are located in the D-loop (Ingman et al, 2000); furthermore, the rate of mutation in the D-loop is higher than at synonymous sites (comparison of the proportion of segregating sites in quartet and in the D-loop by a Fisher exact test in the Finnila et al data set, P<0.001) and this alone may generate trends in LD with distance (Awadalla et al, 1999; Innan and Nordborg, 2002).

Table 2 RFLP data sets used in the analysis

Relationship between LD and distance

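We estimated LD using two measures, D′ and r2, for all pairs of polymorphic sites. For a pair of biallelic loci with alleles A1, A2 and B1, B2, the quantities D, D′ and r2 are defined as follows (Lewontin, 1964; Hill and Robertson, 1968):

\[ D = f_{A1B1} - f_{A1} f_{B1}, \qquad D' = \frac{D}{D_{\max}}, \qquad r^2 = \frac{D^2}{f_{A1} f_{A2} f_{B1} f_{B2}} \]

\[ D_{\max} = \begin{cases} \min(f_{A1} f_{B2},\; f_{A2} f_{B1}) & \text{if } D > 0 \\ \min(f_{A1} f_{B1},\; f_{A2} f_{B2}) & \text{if } D < 0 \end{cases} \]

where fA1, fA2, fB1 and fB2 are the frequencies of the A1, A2, B1 and B2 alleles, and fA1B1 is the frequency of the A1B1 haplotype.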
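As an illustration, the two measures can be computed from a list of two-locus haplotypes as in the following minimal Python sketch. It assumes the standard forms D = fA1B1 - fA1 fB1, D′ = D/Dmax and r2 = D^2/(fA1 fA2 fB1 fB2), where fA1B1 is the frequency of the A1B1 haplotype. The function name `ld_measures`, the two-character haplotype encoding and the choice to report the absolute value of D′ are assumptions made for the example, not details of the analysis reported here.

```python
def ld_measures(haplotypes):
    """Return (D, |D'|, r2) for a list of two-character haplotypes.

    Each haplotype is a string whose first character is the allele at
    locus A and whose second character is the allele at locus B.
    Both loci must be biallelic. Which allele is labelled A1 or B1 is
    arbitrary (here: the first in sorted order); |D'| and r2 do not
    depend on that labelling.
    """
    n = len(haplotypes)
    a_alleles = sorted({h[0] for h in haplotypes})
    b_alleles = sorted({h[1] for h in haplotypes})
    if len(a_alleles) != 2 or len(b_alleles) != 2:
        raise ValueError("both loci must be biallelic")
    a1, b1 = a_alleles[0], b_alleles[0]

    # Allele and haplotype frequencies.
    f_a1 = sum(h[0] == a1 for h in haplotypes) / n
    f_b1 = sum(h[1] == b1 for h in haplotypes) / n
    f_a2, f_b2 = 1 - f_a1, 1 - f_b1
    f_a1b1 = sum(h[0] == a1 and h[1] == b1 for h in haplotypes) / n

    d = f_a1b1 - f_a1 * f_b1
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(f_a1 * f_b2, f_a2 * f_b1)
    else:
        d_max = min(f_a1 * f_b1, f_a2 * f_b2)
    d_prime = abs(d) / d_max  # assumption: absolute value is reported
    r2 = d * d / (f_a1 * f_a2 * f_b1 * f_b2)
    return d, d_prime, r2


# Three of the four possible haplotypes present: |D'| = 1 but r2 < 1.
print(ld_measures(["00", "00", "01", "11"]))
```

The example at the end shows why the two statistics can disagree: when only three of the four possible haplotypes are present, |D′| equals 1 whatever the allele frequencies, whereas r2 reaches 1 only when two haplotypes are present.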

We calculated the correlation coefficient between LD and the distance between polymorphic sites, excluding sites at which the rare allele was segregating below a certain cutoff frequency: no cutoff, no singletons, 0.05, 0.10 and 0.20. Previous studies used Pearson's correlation coefficient to assess the significance of the relationship between LD and distance; we also computed Spearman's rank correlation coefficient because the relationship between LD and distance is not necessarily linear (Wiuf, 2001). We then assessed the significance of the relationship by a Mantel test (Sokal and Rohlf, 1995; Awadalla et al, 1999). The Mantel test was performed by maintaining the data matrix and randomly permuting the positions of sites; for each randomly permuted data set we calculated the correlation between LD and distance to obtain the null distribution of correlation coefficients.
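The permutation scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program used for the analysis: the function names, the use of linear rather than circular distance between sites, the one-tailed P-value (proportion of permutations with a correlation at least as negative as the observed one) and the +1 correction are all choices made for the example. The LD matrix is kept fixed and only the site positions are shuffled.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if either variance is zero


def ranks(values):
    """Average ranks (ties share the mean rank), used for Spearman."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def ld_distance_correlation(positions, ld, rank=False):
    """Correlation between pairwise LD and distance over all site pairs."""
    pairs = list(combinations(range(len(positions)), 2))
    dist = [abs(positions[i] - positions[j]) for i, j in pairs]
    vals = [ld[i][j] for i, j in pairs]
    if rank:  # Spearman = Pearson on ranks
        dist, vals = ranks(dist), ranks(vals)
    return pearson(dist, vals)


def mantel_test(positions, ld, n_perm=1000, rank=False, seed=0):
    """Permute site positions, keeping the LD matrix fixed."""
    rng = random.Random(seed)
    observed = ld_distance_correlation(positions, ld, rank)
    shuffled = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ld_distance_correlation(shuffled, ld, rank) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): LD decays with distance by construction.
positions = [100, 400, 900, 1500, 3000]
ld = [[1.0 / (1 + abs(p - q)) for q in positions] for p in positions]
print(mantel_test(positions, ld, rank=True))
```

Permuting positions rather than individual LD values preserves the dependence among the pairwise LD values that share a site, which is why an ordinary significance test on the correlation coefficient is not appropriate here.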

Homoplasy test (Maynard-Smith and Smith, 1998)

We used MEGA (Kumar et al, 2001) to estimate the number of homoplasies from the most parsimonious tree for the six nucleotide polymorphism data sets using only the synonymous informative variants. We then estimated the number of expected homoplasies under clonality and tested for an excess of homoplasies using the homoplasy test described by Maynard-Smith and Smith (1998). This test requires an estimate of the effective number of sites, which is obtained by weighting the number of sites of each base (nA, nC, nT, nG) by their probabilities of mutation (pA, pC, pT, pG). There are 3411 synonymous sites in human mitochondria; 16% are T, 44% C, 36% A and 5% G. Assuming that the rate of transversion is negligible, and that base composition is at equilibrium, the effective number of sites equals 2172 in human mitochondria (4*3411*[0.6*0.26*0.74+0.4*0.125*0.875]) (Maynard-Smith and Smith, 1998).
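The arithmetic in the expression above can be checked with a few lines of Python. The numbers are those given in the text; the variable names, and the reading of 0.6 and 0.4 as the pyrimidine and purine shares and of the remaining factors as the T/C and G/A proportions within each class, are assumptions made for the example.

```python
# Values taken from the expression 4*3411*[0.6*0.26*0.74 + 0.4*0.125*0.875].
n_syn_sites = 3411

# Assumed interpretation: shares of pyrimidines (T + C) and purines (A + G).
f_pyrimidine, f_purine = 0.6, 0.4
# Assumed interpretation: proportions of T and C among pyrimidines,
# and of G and A among purines.
t_share, c_share = 0.26, 0.74
g_share, a_share = 0.125, 0.875

effective_sites = 4 * n_syn_sites * (
    f_pyrimidine * t_share * c_share + f_purine * g_share * a_share
)
print(round(effective_sites))  # 2172
```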

Geneconv (Sawyer, 1999)

The program GENECONV 1.81 (http://www.math.wustl.edu/~sawyer/geneconv/index.html) was employed. The global permutation P-values are based on BLAST-like global scores (10 000 replicates).

Maximum Chi-Square (Maynard-Smith, 1992)

The computer program MaxChi2, which implements a modification of the Maximum Chi-Square method suggested by Wiuf et al (2001), was kindly provided by David Posada. The statistic employed was the Maximum Chi-Square in the original alignment. For each pair of sequences, this statistic was calculated on a sliding window that moved one nucleotide at a time and included only variable sites. The width of the window was arbitrarily set to the total number of variable sites divided by 1.5 (following Posada and Crandall, 2001). The significance of the putative recombination events identified by Maximum Chi-Square was assessed by randomly permuting the positions of sites 1000 times. We used Bonferroni correction to correct for multiple tests.
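The idea behind the Maximum Chi-Square statistic can be illustrated with a simplified Python sketch for a single pair of sequences. It is not the MaxChi2 program: the function names, the placement of the breakpoint at the centre of the window, and the handling of empty table margins are assumptions made for the example. At each candidate breakpoint, a 2x2 table compares the number of differences and identities between the two sequences on either side of the breakpoint; the statistic is the largest chi-square over all breakpoints, and its significance is assessed by permuting site positions.

```python
import random


def chi2_2x2(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] (0.0 if a margin is empty)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def max_chi2(seq1, seq2, window):
    """Maximum chi-square over breakpoints for one pair of sequences.

    seq1 and seq2 are equal-length strings containing variable sites only.
    The window holds `window` sites, half on each side of the breakpoint.
    """
    diff = [x != y for x, y in zip(seq1, seq2)]
    half = window // 2
    best = 0.0
    for k in range(half, len(diff) - half + 1):
        left = sum(diff[k - half:k])    # differences left of the breakpoint
        right = sum(diff[k:k + half])   # differences right of the breakpoint
        best = max(best, chi2_2x2(left, half - left, right, half - right))
    return best


def permutation_p(seq1, seq2, window, n_perm=1000, seed=0):
    """P-value obtained by randomly permuting the positions of sites."""
    rng = random.Random(seed)
    observed = max_chi2(seq1, seq2, window)
    idx = list(range(len(seq1)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        s1 = "".join(seq1[i] for i in idx)
        s2 = "".join(seq2[i] for i in idx)
        if max_chi2(s1, s2, window) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical pair: identical over the first 20 sites, different over the last 20.
s1 = "A" * 40
s2 = "A" * 20 + "C" * 20
print(permutation_p(s1, s2, window=20))
```

Because the statistic is computed for every pair of sequences, the resulting P-values need a correction for multiple tests, such as the Bonferroni correction mentioned above.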

Results

The relationship between LD and distance

We have investigated the relationship between LD and distance for 16 human mtDNA data sets; six of these are data sets of complete mtDNA sequences while the other 10 are RFLP data sets. We have performed 20 analyses on each nucleotide data set and 16 analyses on each RFLP data set. In each case we have investigated the correlation between LD, as measured by either r2 or D′, and the distance between sites using both Pearson's and Spearman's correlation coefficients, including all sites, and excluding sites at which the rare allele is a singleton, or below a frequency of 5%, 10% or (nucleotide data sets only) 20%.

The results from the analysis of the complete mtDNA data sets are given in Table 3. We restricted our analysis to synonymous polymorphisms to minimise the effects of LD generated by selection. For the data set of Awadalla et al (1999), the correlation between r2 and distance is almost always negative and it is significant for polymorphisms segregating at greater than 20%. In contrast, the correlation between D′ and distance is almost always small and nonsignificant, except for polymorphisms segregating at >5% when Spearman's correlation coefficient is used, when it is significantly positive. In the data set of Ingman et al (2000) there is an overall tendency towards negative correlations but none of the correlations are significant. In the data set of Finnila et al (2001), r2 is generally negatively correlated to distance and often significantly so; however, the D′ is significantly positively correlated to distance for polymorphisms segregating at >20%. In the three data sets from Herrnstadt et al (2002a), the correlations are generally very small, with r2 being almost always negatively correlated with distance; D′ is generally negatively correlated for the Africans and Europeans, but positively correlated for the Asians. Few of the correlations are significant.

Table 3 Relationships between LD and distance for pairs of synonymous sites for different cutoff frequencies

The 10 RFLP data sets show no consistent trend between LD and distance (Table 4). Data set 5 shows no trend; data sets 2 and 4 show a significant positive correlation between r2 and distance. Data sets 3, 6, 7a and 8 largely show negative correlations, with several of the correlations being significant or highly significant. In data set 9 the correlation between r2 and distance is negative for all frequency classes, whereas it is positive for D′. Surprisingly, there is a highly significant negative correlation between r2 and distance in data set 1, but a highly significant positive correlation between D′ and distance, when all mutations are included in the analysis, or only singletons are excluded.

Table 4 Relationships between r2 and D′ and distance between polymorphic sites (PS) for the 10 RFLP data sets (Table 2)

The results of the analyses are summarised in Table 5. Overall, there is an excess of negative correlations for r2 (51 out of 60) for the synonymous polymorphisms, and all significant results are for a negative relationship. For the RFLP polymorphisms, the proportion of negative correlations between r2 and distance is lower (55 out of the 80 tests performed) and the significant relationships are both positive and negative. The correlation between D′ and distance is as likely to be negative as positive, and significant correlations are found equally in both directions for both nucleotide and RFLP polymorphisms. It is important to appreciate that neither the data sets nor the analyses are independent, so it is difficult to assess the overall significance of these findings; none of the correlations are significant if we correct for multiple comparisons.

Table 5 Comparison of the synonymous SNP and RFLP analyses

The correlation between LD and distance is very similar for both Pearson's and Spearman's correlation coefficient, although the correlations using the latter are usually a little larger, and slightly more significant.

Homoplasy test

Recombination is not only expected to lead to a decline in LD with distance but also to generate homoplasies in phylogenetic trees; that is, instances in which the same mutation, or the reverse mutation, occurs at the same site in different parts of the tree. Homoplasies can be generated by both recombination and multiple mutations. Maynard-Smith and Smith (1998) have devised a method to predict the number of homoplasies due to multiple mutations when there is no recombination, taking into account codon usage bias. They have also devised a test of whether the observed number is greater than the number expected under clonality. We have applied their methods to synonymous variants segregating in each of the mtDNA sequence data sets; we restrict our analysis to DNA sequence data sets since it is only possible to predict the number of homoplasies for synonymous polymorphisms. The results are presented in Table 6. In each case the observed number of homoplasies is higher than the number expected under clonality, and with the exception of the data set of Finnila et al, the difference is significant.

Table 6 Homoplasy test for the synonymous nucleotide polymorphism data sets

Geneconv and Maximum Chi-Square

No recombination was detected by Geneconv in any of the six sequence data sets. No recombination event was detected by Maximum Chi-Square after correction for multiple tests.

Discussion

We have analysed 16 human mtDNA data sets for indirect evidence of recombination using four approaches: the correlation between pairwise LD and the distance between sites, the level of homoplasy in phylogenetic trees, and the clustering of substitutions by Geneconv and Maximum Chi-Square. Overall, there is a tendency towards a negative correlation between LD, when it is measured by r2, and distance. Of the 140 analyses we have performed, 104 showed a negative correlation. Furthermore, the significant correlations were all negative for the nucleotide polymorphisms. Unfortunately, because many of the tests are nonindependent, it is not possible to assess the overall significance of these results. In contrast, approximately half the analyses showed a positive correlation between LD, when it was measured using D′, and distance, and the significant correlations were both positive and negative. The analysis of simulated data sets under simple evolutionary models (constant population size, no selection) suggests that there should not be any conflict between the two statistics (Meunier and Eyre-Walker, 2001). Interestingly, McVean et al (2002) showed that the r2 statistic is more powerful than the D′ statistic under the finite sites model. As the finite sites model describes well the high rate of polymorphism at synonymous sites in human mtDNA, the r2 statistic should be favoured.

In contrast to the rather confused picture offered by the analysis of LD, there is a clear excess of homoplasy in five out of the six data sets for which this analysis was performed. This excess of homoplasy could be due to recombination or it could be due to hypervariable sites. Stoneking (2000) has recently demonstrated that hypervariable sites exist in the mtDNA control region (see also additional analysis by Eyre-Walker and Awadalla, 2001). However, the control region has unique mutational dynamics, including a four-fold higher mutation rate than at synonymous positions, so it is not clear whether hypervariable sites exist in the coding region of the mtDNA. Eyre-Walker et al (1999) performed a number of analyses that were aimed at testing whether hypervariable sites exist in the coding region; they found no evidence, but none of their analyses were particularly powerful.

The outcome of the five recombination detection methods used is summarised in Table 7. If we exclude the homoplasy test because it generates a high rate of false positives when there is mutation rate heterogeneity (Posada and Crandall, 2001), no complete sequence data set shows evidence for recombination from two tests. Empirical data analysis suggests that one should not rely on a single test to infer the presence of recombination (Posada, 2002). Therefore, there is a lack of evidence for recombination in human mtDNA from our analysis, although two out of six sequence data sets show evidence for recombination with one method (excluding the homoplasy test). This indicates either no recombination or very rare recombination events. As paternal inheritance of mtDNA in humans has recently been observed (Schwartz and Vissing, 2002), recombination in human mtDNA is a real possibility. However, our study shows that it is hardly detectable from sequence data with the available recombination detection methods.

Table 7 Evidence for recombination from the synonymous polymorphism data (no frequency cutoff) from the five recombination detection methods used