Introduction

The central role of mitochondria in metabolism, allied to the continued rise in the incidence of age-related metabolic and degenerative disease, has led to common variants in mitochondrial DNA (mtDNA) becoming increasingly popular as targets in association studies.1 Whole-mitochondria association studies are attractive because only a small number of SNPs need to be genotyped to get full coverage of all variants with >1% minor allele frequency in the mitochondrial genome.

Mitochondrial DNA has become an important resource for studying the history of populations. The lack of recombination in the mitochondria means that the mitochondrial genome is passed down from mother to offspring completely unchanged except for occasional spontaneous mutations. When mutations occur, they are in complete linkage with all other variants in the mtDNA. Over time, sets of linked variants called haplogroups form. Each haplogroup is related to all others through a mutational tree, which can be traced back to a common ancestor. By studying the patterns of haplogroup frequencies in different populations we can get an insight into the history of modern populations.

A number of diseases in which mitochondrial dysfunction has been demonstrated have differing prevalences depending upon the population examined. Genetic drift has led to differences in mtDNA variation between populations, whereas the contribution of selection to this variation is still under dispute.2, 3, 4 For this reason, it is important to control for population substructure when performing mitochondrial association studies to avoid false-positive associations.

It has been known for some time that mitochondrial DNA shows less geographical structuring in Europe than classical markers and the Y chromosome.5 Furthermore, variation between European populations is smaller compared with populations on other continents after controlling for geographic distance.6 However, recent studies have shown that when large enough sample sizes are used, there is evidence for a distinct East–West gradient and a much less pronounced North–South gradient.5, 7 In addition, mitochondrial population substructure within European populations has also been demonstrated.8, 9

A majority of the Australian population are descendants of immigrants from the British Isles; however, there has been much migration from other areas of Europe and, more recently, from Asia. Given this population history, the Australian population may show evidence of population stratification. This would manifest as differences in haplogroup frequencies between people of different geographical ancestry.

Haplogroup-based association tests assume that an associated variant is more likely to occur on a particular background or that the combination of variants that define the haplogroup may contribute to the phenotype. If, however, the causative variant occurs on a number of different backgrounds, the haplogroup approach may give false-negative results,8 and an SNP-based approach would have more power to detect true-positive associations, although it too may be confounded by population substructure.

Testing for differences in haplogroups and SNP frequencies between ancestral groups can inform about population structure, but there may also be population substructure within the ancestral groups. In a substructured population, we would expect that mates will be more likely to come from the same subpopulation, and therefore will be more similar to one another than random pairs from the population. Hatt and Parsons (1965)10 have previously shown evidence for non-random mating among individuals of British Isles ancestry in Australia. Mates may also be more similar owing to assortative mating, which will also lead to population substructure.

Controlling for population structure is also important in genome-wide association studies (GWAS), and two major methods have been developed that use genotyping data to control for structure in such analyses: ‘genomic control’11 and ‘structured association’.12 Population structure becomes more of a confounding factor as sample size increases, and even small levels of stratification can have a dramatic effect in the kind of large-scale studies that are needed to detect the variants of modest effect that have so far been found in GWAS.13, 14 Analysis of mitochondrial DNA variation can contribute to the detection of population stratification in GWAS cohorts.

In this study, we characterise the extent of maternal population structure in an Australian population sample as a precursor to carrying out mitochondrial association studies, and show that common mitochondrial variants can be used to detect population structure in genome-wide association studies.

Materials and methods

Sample

In all, 3839 individuals from 1037 Australian families were genotyped for a total of 69 mitochondrial SNPs. Families consisted of adolescent twins, their siblings and parents who had been recruited as part of two overlapping studies into the genetic and environmental influences on melanoma risk factors15 and cognition.16 Twins between the ages of 12 and 16 and their families were recruited by contacting principals of schools in the Brisbane area and through media appeals. The sample has been shown to be representative of the general Queensland population with respect to mole counts16 and cognition.17 Further information on the sample, including recruitment strategy, can be found in Zhu et al (2007)18 and Duffy et al (2007).19 The data set contained 343 pairs of monozygotic (genetically identical) twins and 594 sets of dizygotic (non-identical) twins. Data for both parents were available in 622 of the families. There were 34 families for whom only the father's genotype was available, and 228 families in which only the mother's genotype was available. The remaining 153 families had no parental data available. A total of 345 non-twin siblings were included.

Genotyping

Saxena et al (2006)20 identified all coding region mtSNPs found at a frequency of >1% in European populations and found that all of these variants could be tagged at r²>0.8 by 64 tagging SNPs. We genotyped 61 of the 64 tag SNPs, together with an additional nine SNPs that tag, at r²>0.8, variation that would otherwise require the use of multi-SNP haplotypes, using a method described in a previous study.21 The last SNP is a common D-loop variant. Two SNPs were found to be monomorphic in our sample, leaving a total of 69 SNPs. SNPs were typed using iPLEX™ Gold chemistry and analysed using a Sequenom MassARRAY Compact Mass Spectrometer (Sequenom Inc., San Diego, CA, USA). A two-step 200 short cycles programme was used for the iPLEX Gold reaction. The products were spotted on a SpectroChip (Sequenom Inc., San Diego, CA, USA), and data were processed and analysed by MassARRAY TYPER 3.4 software (Sequenom Inc., San Diego, CA, USA).
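To make the tagging criterion concrete, the following is a minimal sketch of how r² between two biallelic mtSNPs could be computed from haploid data. It is not the tag-selection procedure of the cited study, which is not described here. The function names `r_squared` and `is_tagged`, and the 0/1 allele coding, are assumptions made for illustration.

```python
import numpy as np

def r_squared(snp_a, snp_b):
    """Squared correlation (r^2) between two biallelic haploid SNPs.

    Alleles are assumed to be coded 0/1, one entry per mitochondrial
    haplotype, with no missing values.
    """
    a = np.asarray(snp_a, dtype=float)
    b = np.asarray(snp_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()
    # Linkage disequilibrium coefficient D = P(AB) - P(A) P(B)
    d = (a * b).mean() - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denom

def is_tagged(target, tag_snps, threshold=0.8):
    """True if any candidate tag SNP exceeds the r^2 threshold with the target."""
    return any(r_squared(target, tag) > threshold for tag in tag_snps)
```

Because mtDNA is haploid, haplotype frequencies are observed directly and no phasing step is needed before computing D.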

Ancestry

Parents of the twins were asked to list the country of birth and ancestry of their parents and grandparents. In cases where the specific country of birth or ancestry was not known, they were asked to specify a region, for example, The British Isles. For our study, an individual's ancestry was taken as the named ancestry of the maternal great-grandmother, or in cases where this information was not available the maternal grandmother. Assignment to ancestral groups was done according to the criteria listed in Table 1.

Table 1 Ancestral groups in the sample, the countries included in those groups, and the number of individuals assigned to each ancestral group

Haplogroup assignment

Owing to familial relationships in the sample, there were 1693 independent mitochondrial haplotypes. Haplogroups were assigned according to a previously published method.17 To ensure greater accuracy in the assignment of haplogroups, haplotypes that contained missing data were discarded for the haplogroup analyses. A total of 1314 haplotypes had no missing data. Thirteen haplotypes could not be unambiguously assigned to a haplogroup, a majority of which were haplotypes of individuals with non-European ancestry (76.9%). The frequency of the haplogroups was calculated for each ancestry group (Table 2).

Table 2 Distribution (%) of inferred haplogroups by ancestry groups

Between groups structure test

As each individual has been typed for a panel of mitochondrial tagging SNPs designed to tag all common European variation,20 we would expect to see significant differences in mitochondrial DNA variation between European and non-European groups. However, there may also be differences in haplogroup and SNP frequencies between groups from different regions of Europe. For the most part, we would expect that the SNPs that are found to have differential frequencies between groups will be correlated with one another as they are found in the same haplotype. However, the fact that there is sufficiently low linkage disequilibrium between many of these SNPs, all of which needed to be typed to tag most of the variation in mitochondrial DNA as shown in McRae et al,17 suggests that an SNP by SNP analysis may give a finer picture of stratification in the sample.

Significant differences in haplogroup frequencies between groups were tested for using a contingency table test for count data in the statistical package R.22 As the contingency table being analysed contained cells with small values, P-values were estimated by Monte Carlo simulation using 100 000 replicates.
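The idea behind a simulated P-value for a sparse contingency table can be sketched as follows. This is an illustrative permutation version written in Python, not the R routine used in the study. Shuffling haplogroup labels across individuals keeps both table margins fixed, and the P-value is the share of shuffled tables whose chi-square statistic is at least as large as the observed one. The function names, the default number of replicates and the add-one correction are assumptions of this sketch.

```python
import numpy as np

def chi_square_stat(table):
    """Pearson chi-square statistic for a 2-D table of counts."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    mask = expected > 0  # ignore cells in empty rows or columns
    return float(((table - expected)[mask] ** 2 / expected[mask]).sum())

def monte_carlo_p(groups, haplogroups, n_rep=10_000, seed=0):
    """Simulated P-value for association between ancestry group and haplogroup."""
    rng = np.random.default_rng(seed)
    # Recode labels as integer indices 0..k-1
    g_codes = np.unique(groups, return_inverse=True)[1]
    h_codes = np.unique(haplogroups, return_inverse=True)[1]
    n_g, n_h = g_codes.max() + 1, h_codes.max() + 1

    def table_of(h):
        t = np.zeros((n_g, n_h))
        np.add.at(t, (g_codes, h), 1)  # count individuals per cell
        return t

    observed = chi_square_stat(table_of(h_codes))
    hits = 0
    for _ in range(n_rep):
        shuffled = chi_square_stat(table_of(rng.permutation(h_codes)))
        if shuffled >= observed - 1e-9:  # small tolerance for rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_rep + 1)
```

With 100 000 replicates and the add-one correction, the smallest P-value such a procedure can return is about 0.00001.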

Significant differences in individual SNP frequencies between ancestral groups were also tested for using a contingency table test, with a simulated P-value. Although the haplogroup-based test excluded individuals with missing data at any SNP, this test did not.

Assortative mating analyses

Each set of parents was tested to see whether there was an association between the haplogroups of mates, that is, a pair of individuals who have produced offspring in the sample. In families where only the father's genotype was available, the mother's genotype was assumed to be identical to the children's. A 10 × 10 contingency table of haplogroups was constructed (Table 4) and the association was tested for using a contingency table test with a simulated P-value. In a substructured population in which there is assortative mating, we might expect that individuals tend to mate with partners who have the same haplogroup as themselves, or that mates will tend to have a particular combination of haplogroups.

Similarity of mates was further tested by comparing all sets of parents at each of the 69 loci. There were 656 sets of parents in the sample. Three tests were performed to test for the genotypic similarity of mates. Firstly, a similarity score was calculated by counting the number of sites at which the parents had the same allele and dividing by the total number of sites compared. If one of the parents had a missing genotype for an SNP, no comparison was made for that SNP and so the number of SNPs compared decreased by one.

$$S = \frac{1}{n}\sum_{i=1}^{n} \delta_i$$

where S is the summary statistic, n is the total number of non-missing SNPs compared between the pair and δi equals one if the pair are concordant or δi equals 0 if the pair are discordant for the ith SNP.

An overall mean score was calculated from the similarity scores of each pair of parents. A total of 656 pairs of individuals were then drawn randomly from the population of all parents and a similarity score calculated for each. This was done 1000 times and each time an overall mean was calculated. The mean of the parents was then compared with the distribution of means for the randomly drawn pairs to calculate the significance.
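The resampling procedure described above can be sketched as follows, under stated assumptions. The text does not say how random pairs were drawn, so this sketch pools all parents and pairs distinct individuals at random. It also treats the test as one-sided. Alleles are assumed to be coded numerically with `np.nan` for missing genotypes, and the function names are hypothetical.

```python
import numpy as np

def similarity(g1, g2):
    """Proportion of jointly non-missing SNPs at which two haplotypes match."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    both = ~np.isnan(g1) & ~np.isnan(g2)  # SNPs typed in both individuals
    if not both.any():
        return np.nan
    return float((g1[both] == g2[both]).mean())

def mate_sharing_test(fathers, mothers, n_rep=1000, seed=0):
    """Compare mean similarity of true mates with that of random pairs.

    fathers, mothers: arrays of shape (n_pairs, n_snps); row i of each
    array is one couple. Returns (observed mean, mean of null means, P).
    """
    rng = np.random.default_rng(seed)
    fathers = np.asarray(fathers, dtype=float)
    mothers = np.asarray(mothers, dtype=float)
    n_pairs = len(fathers)
    observed = np.nanmean([similarity(f, m) for f, m in zip(fathers, mothers)])

    pool = np.vstack([fathers, mothers])  # assumption: pairs drawn from all parents
    size = len(pool)
    null_means = np.empty(n_rep)
    for r in range(n_rep):
        i = rng.integers(0, size, size=n_pairs)
        # A non-zero offset guarantees the two members of a pair differ
        j = (i + rng.integers(1, size, size=n_pairs)) % size
        scores = [similarity(pool[a], pool[b]) for a, b in zip(i, j)]
        null_means[r] = np.nanmean(scores)

    # One-sided empirical P-value: random pairs at least as similar as mates
    p_value = float(np.mean(null_means >= observed))
    return float(observed), float(null_means.mean()), p_value
```

In a strongly structured population the observed mean lies in the upper tail of the null distribution and the P-value is small. When mates are no more alike than random pairs, the P-value is close to 0.5.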

The above test fails to take into account the sampling variance of each allele; that is, the probability of a pair sharing a given allele depends on the allele frequency. To take into account the sampling variance and the differing allele frequencies, the test was performed again, but each time the parents share the same allele, δi is multiplied by the inverse of the variance, 1/H, where H=2p(1−p) is the probability of the pair being discordant for an allele with frequency p. The weighted sum is then divided by the sum of the inverse variances of all SNPs compared for each pair.

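$$T = \frac{\sum_{i=1}^{n} \delta_i / H_i}{\sum_{i=1}^{n} 1 / H_i}$$

where T is the weighted sum statistic, Hi is the value of H for the ith SNP, n is the total number of non-missing SNPs compared between the pair, and δi equals one if the pair are concordant or δi equals 0 if the pair are discordant for the ith SNP.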
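As an illustration, the weighted statistic can be sketched in code by reading the description above as a sum of concordance indicators, each weighted by 1/H, divided by the sum of those weights over the SNPs compared. That reading is an assumption of this sketch. The function name `weighted_similarity`, the numeric allele coding and the use of `np.nan` for missing genotypes are also assumptions.

```python
import numpy as np

def weighted_similarity(g1, g2, freqs):
    """Concordance between two haplotypes weighted by 1/H per SNP.

    freqs holds one allele frequency p per SNP, and H = 2 p (1 - p).
    SNPs that are missing in either individual, or monomorphic
    (H = 0), are left out of both sums.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    p = np.asarray(freqs, dtype=float)
    h = 2.0 * p * (1.0 - p)
    keep = ~np.isnan(g1) & ~np.isnan(g2) & (h > 0)
    if not keep.any():
        return np.nan
    weights = 1.0 / h[keep]
    delta = (g1[keep] == g2[keep]).astype(float)  # 1 if concordant, else 0
    return float((weights * delta).sum() / weights.sum())
```

Under this weighting, SNPs with a low minor allele frequency have a small H and therefore a large weight, so agreement or disagreement at those sites moves the score more than at common SNPs.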

A third test was carried out which takes into account the significance of sharing a rare variant. Each SNP is weighted by frequency by multiplying δi by a weight that depends on p, where p is the allele frequency of the shared allele.

The analyses were then repeated using only pairs drawn from the British Isles. There were 380 such pairs. There were insufficient numbers of pairs from other ancestral groups to perform the analyses.

Results

Table 1 shows the distribution of self-reported maternal ancestry in our sample. Given the history of immigration to Australia, it is not unexpected to find that the vast majority of the sample have ancestry from the British Isles.

Figure 1 shows the distribution of minor allele frequencies in the sample. SNPs were selected for genotyping if they have minor allele frequencies >1% in Europeans. There are 16 SNPs with a minor allele frequency of 5% or more.

Figure 1 The distribution of minor allele frequencies of mtSNPs in the sample.

Table 2 shows the distribution of inferred haplogroups by ancestry in our sample, and these agree with those previously reported in the Australian population.23 Haplogroup H, the most common haplogroup in Europe, is also the most common haplogroup in the Australian population, with a frequency of 42.0%. The next three most common haplogroups J, T and U, together comprise a further 32.4% of the population.

As would be expected, a significant difference in the distribution of haplogroup frequencies between individuals of European and non-European ancestry was found (P=0.00001). When the four European groups are compared, there is evidence for a significant difference in haplogroup frequencies between them (P=0.00012). Most of this difference appears to be caused by differences between Western and Eastern populations. The UK and Ireland, Northern European and Southern European populations are all significantly different from the Eastern European group (P=0.002, 0.0016, 0.001, respectively). This observation is in agreement with a previous study which showed that the first principal component of variation in haplogroup frequencies in Europe is along an east–west axis.5 Even with the small sample size of Eastern Europeans in our sample, the lower frequency of haplogroup H in that region can be detected. There is also evidence for a difference in frequencies between Northern and Southern populations, although these differences only approach significance (P=0.06). There is no evidence for differences between the British Isles and Northern European groups (P=0.48). It is clear that there is significant evidence for population substructure among European ancestral groups in Australia.

Our SNP panel is composed of SNPs that tag all coding region variants found within Europe at a frequency >1%, so we would expect to see significant differences in the allele frequencies between the European and Asian groups. Using Fisher's exact test, we find that several SNPs (Table 3) are significant even after correction for multiple testing (5% significance with 69 tests=0.0007). Bonferroni correction is an appropriate strategy here because of the low LD between markers.
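The threshold quoted above follows from dividing the nominal significance level by the number of tests. Writing α for the nominal level and m for the number of SNPs tested (symbols introduced here for illustration):

$$\alpha_{\text{Bonferroni}} = \frac{\alpha}{m} = \frac{0.05}{69} \approx 7.2 \times 10^{-4}$$

which rounds to the value of 0.0007 given in the text.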

Table 3 mtSNPs that have significant frequency differences between European and non-European groups and that are significantly different between European groups

Several of these SNPs define major European haplogroups. mt4216 tags haplogroups J and T, whereas mt7028 defines haplogroup H, the most common European haplogroup. A number of the other SNPs tag super haplogroup M, which is found primarily in Asia. Some of these SNPs have low minor allele frequencies in the European population (marked in the haplogroups column in Table 3) but are much more variable in the South Sea Island and Aboriginal group.

The same analysis was performed on only the European ancestral populations. When all four populations are included, five SNPs are found to be significant (Table 3).

None of these SNPs are significant after Bonferroni correction. However, mt10398, mt14798, and mt12372 approach significance. The three SNPs that approach significance are strong predictors of haplogroups in our sample. The mt10398A allele is found at a much lower frequency in the Eastern European group relative to the other European populations. It is a predictor of haplogroup T, which is found to be enriched in Eastern Europeans in our sample. The mt10398G allele is found in haplogroups U and I, both of which are not found in the Eastern European sample. The mt12372A allele is found only in haplogroups U and K, both of which are almost absent in Southern and Eastern Europeans. The mt14798C allele is found in low frequency in individuals of Southern or Eastern European ancestry. This allele is found in individuals with haplogroups J and K, both of which are at much lower frequency in Eastern and Southern Europe. It is clear that the SNP by SNP analysis complements the haplogroup analysis. The SNPs that have significant differences in frequency between groups are also those that contribute to the significant differences in haplogroup frequencies.

Assortative mating analyses

The mean similarity score for all pairs is 0.9105 (SD=0.046, SE of mean=0.0018). The mean score for the 1000 means of randomly drawn pairs is 0.9092 (SE=0.002). Comparing the mean of the true mates to the empirical distribution of random pair means gives a P-value of 0.22, providing no evidence of increased allele sharing among mates. This analysis included all pairs of mates regardless of their population of ancestry. We would assume that, in a stratified population, mates will more likely be from the same subpopulation. As such, there will be a higher probability that a randomly drawn pair will be from different populations and hence will share fewer alleles. For this reason, we might expect that mates will have more similar genotypes than a set of randomly drawn pairs. However, this is not what is found.

When the sampling variance is taken into account, the mean similarity score for mates is 0.9741 (SD=0.018, SE of mean=0.0007), whereas the mean score of the replicates for random pairs is 0.9729 (SE=0.001). The empirical P-value is 0.11. For the third test that takes allele frequencies into account, the mean similarity score for mates was 0.9459 (SD=0.07, SE=0.0027) and the mean similarity score of random pairs is 0.9435 (SD=0.002), and the P-value is 0.07.

We then restricted the analysis to only mates and randomly drawn pairs from the British Isles. As there was no evidence of greater similarity among mates when all populations were included, we expect to find no evidence when we restrict it to one ancestral population. This is because randomly drawn pairs are more likely to be similar as they come from the same population. The mean similarity score for mates was 0.9093, and the mean of the 1000 samples of randomly drawn pairs is 0.9092 (SD=0.002). The empirical P-value was 0.48. There is no evidence for increased allele sharing among mates of British Isles ancestry using this test. There is also no evidence for increased sharing when the tests take sampling variance (P=0.52) and allele frequency into account (P=0.17).

Mates were also compared to see if they tend to share the same haplogroup or a pair of haplogroups more often than would be expected by chance. A 10 × 10 table of male and female haplogroups was constructed (Table 4). The association between haplogroups was tested for using Fisher's exact test.24 No significant association was found (P=0.784).

Table 4 Assigned haplogroups of all pairs of parents in the sample

It is clear that there is little evidence to suggest that, in our sample, individuals tend to mate with individuals whose mitochondrial genotype is more similar to their own.

Discussion

The epidemic of metabolic disease in Western countries and the increasing emphasis on aging research have led to an increasing focus on mitochondrial variants as potential modifiers of disease risk. The data gleaned from mitochondrial association studies will also be very useful for population genetics analysis. As a precursor to carrying out association studies, we have investigated whether any population substructure can be detected in the maternal line of the Australian population.

Our study is limited by the small sample size of some of the ancestral groups, but we believe our sample to be representative of the ancestral make-up of the Australian population. Although most of the population trace their ancestors back to the British Isles, a much smaller number are of native Australian ancestry, and there are large numbers of smaller groups whose ancestors came to Australia from various regions of the world. Even with small sample sizes, significant differences can be found in the distribution of mitochondrial variants in different populations. In populations of European ancestry, these differences follow the same trend as seen previously in Europe.

These analyses demonstrate that population substructure may be a confounding factor in mitochondrial association studies in the Australian population. It would be advisable to get an estimate of population structure in the sample and to adjust the test accordingly. In this study, we have only analysed the frequencies of the major European haplogroups, many of which can be subdivided into subhaplogroups. Analysis of these subhaplogroups may lead to the detection of a finer scale substructuring in the population.

There are clear implications for mitochondrial association studies in the Australian population, but there are also implications for mitochondrial association studies in any population that has experienced significant immigration or is admixed. Although the non-recombining nature of mitochondria means that differences between populations are more pronounced and easily detected when looking at mitochondrial variants, there may also be significant implications for nuclear association studies. An equivalent study using autosomal SNPs is much more difficult, however, as it requires accurate ancestry information for a larger number of individuals in each participant's family, and only leads to an estimate of what proportion of the genome is admixed in cases where an individual's parents or grandparents are of differing ancestry. For association studies in the Australian population, it may be necessary to subdivide people of European ancestry when it comes to matching cases and controls.

In addition to the stratification analysis, the correlation between mtSNPs and nuclear SNPs (nSNPs) was examined (results not shown). Although they have differing modes of inheritance, it is likely that there are significant interactions between variants in nuclear and mitochondrial DNA. Mitochondrial DNA codes for 13 proteins that are subunits of four of the five large complexes that are essential for oxidative phosphorylation. As such, proteins that are encoded in the mitochondria do not act independently, and efficient oxidative phosphorylation requires interaction with nuclear-encoded proteins.25 These interactions have led to adaptive coevolution between the two genomes.26 This coevolution may have led to certain combinations of nuclear DNA (nDNA) and mtDNA variants being highly correlated in the population. Even in the absence of selective pressures, we would expect that mitochondrial variants that define region-specific haplogroups would be correlated with ancestry-informative nSNPs. Therefore, the correlation between mitochondrial and nuclear variants may have implications for the design of association studies, and can give an insight into the relationship between nuclear and mitochondrial variants.

A recent large-scale case–control study in the United Kingdom14 showed that it is possible to separate individuals from the North and South of the country when a large enough panel of SNPs is used. In addition, non-random mating was previously demonstrated among individuals of British Isles ancestry in Australia. As the data set does not contain information on which region of each country individuals can trace their ancestry to, it is difficult for us to test for within-population substructure using mitochondrial data. We have performed a crude test for population structure by testing whether mates tend to be more similar to one another at the genetic level than random pairs drawn from the population. In a structured population, we expect that individuals from the same subpopulation will be more similar to one another at the genetic level. Although, when examining mitochondria, the genotype of the father has no bearing on that of his children, his sons would still be more likely to mate with someone from the same population, who would therefore tend to have a similar mitochondrial genotype. We find no evidence for increased sharing among mates, even after correction for allele frequency. The only ancestral group for which there was a large enough sample size to perform this test was the British Isles group. It is possible that other groups will show evidence of increased sharing among mates if a large enough sample size is examined. Although we were unable to detect any within-group stratification in our sample, it may still be a confounding factor in mitochondrial association studies. The knowledge gleaned from these analyses will greatly aid the interpretation of association studies that will be carried out on the same sample.