Main

Human mitochondria contain a tiny, high-copy-number circular genome, the mitochondrial DNA (mtDNA). Sequencing of the human mtDNA in 1981 (ref. 1) revealed that it encodes 13 core protein components of the oxidative phosphorylation system, as well as 2 ribosomal RNAs and 22 transfer RNAs required for their expression. Tissues can contain tens to thousands of copies of mtDNA per cell, depending on cell type. Variants in mtDNA can be maternally transmitted or arise somatically, and, when they co-exist with wild-type molecules, lead to a state called heteroplasmy. Notably, more than 99% of the mitochondrial proteome, including all proteins required for mtDNA maintenance, replication and transcription, is encoded by the nuclear DNA (nucDNA) and imported4 into the organelle.

Defects in mtDNA are associated with a spectrum of human diseases. Since the first identification of pathogenic mtDNA mutations5,6, scores of maternally inherited syndromes have been reported7. Mendelian forms of mitochondrial disease producing mtDNA deletion or depletion were later identified and mapped to nuclear genes involved in mtDNA replication, maintenance and nucleotide balance8,9,10. More subtle declines in mtDNA copy number (mtCN) and an accumulation of somatic mtDNA mutations have both long been associated with ageing and age-associated disease11,12. Mutations in mtDNA accumulate in many cancers and in a small subset are even considered to be ‘drivers’ of tumorigenesis13.

The dynamics of heteroplasmy are complex and presumed to be shaped by mutation, drift and selection. The mtDNA mutation rate has been reported as 10–100× higher than for the nucDNA14, with the non-coding region (NCR) containing three hypervariable regions thought to be mutational hotspots15. The high copy number, elevated substitution rate and lack of recombination have made mtDNA NCR variants a valuable genetic tool in forensics and anthropology, even leading to the African mitochondrial ‘Eve’ hypothesis16,17. Heteroplasmy can vary across siblings, attributed to germline bottleneck effects, and between cell types and tissues, thought to be due to random segregation and selection18,19. Detailed mechanisms underlying heteroplasmy dynamics in humans remain obscure, although mouse studies20 predict a role for nuclear genetics.

Here, we characterize the spectrum of mtCN and heteroplasmy across approximately 300,000 individuals spanning 6 ancestry groups in the UK Biobank (UKB) and AllofUs (AoU). We find that blood mtCN declines with age, is influenced by blood cell composition and is under the control of numerous nuclear genetic loci. We then turn to mtDNA variation, finding that about 1 in 192 individuals carries 1 of 10 well-known pathogenic mtDNA variants. We characterize the landscape of mtDNA variation across this population and find that nearly every human harbours heteroplasmic mtDNA variants. Whereas heteroplasmic mtDNA single nucleotide variants (SNVs) tend to be somatic in origin and to accumulate with age, we find that heteroplasmic indels tend to be quantitatively maternally inherited, with their relative levels influenced by nuclear genetic variation. These loci provide insights into the mechanisms by which the mitochondrial and nuclear genomes genetically interact to maintain mtDNA homeostasis.

Calling mtCN and variants

We developed mtSwirl, a scalable pipeline for calling mtDNA variants and copy number from whole-genome sequencing (WGS) data (Methods and Supplementary Note 1). We extended a pipeline used to analyse mtDNA variation in gnomAD21, now constructing self-reference sequences for each sample using homoplasmic and homozygous calls on the mtDNA and reference nucDNA regions of mtDNA origin (NUMTs; Extended Data Fig. 1a). mtSwirl shows improved mtDNA coverage, particularly among African haplogroups (Extended Data Fig. 1b–e), and reduced variant calls at very low heteroplasmy (Extended Data Fig. 1f), indicating reduced ancestry- and NUMT-specific mis-mapping. We observe high concordance of heteroplasmy estimates with the previous method used in gnomAD (R² = 0.996 for heteroplasmy > 0.05), with homoplasmies showing allele fractions now closer to 1, suggesting reduced NUMT artefact21 (Extended Data Fig. 1g). We used mtSwirl to quantify mtDNA traits across 274,832 individuals of diverse ancestry across UKB and AoU (Extended Data Fig. 2 and Supplementary Table 1), generating more than 7,800,000 mtDNA variant calls across all samples.

Determinants of variation in human mtCN

We began by identifying covariates of blood mtCN (mtCNraw) in UKB, observing a strong influence of blood cell composition (R² ≈ 23%; Fig. 1a) as previously reported22,23 (Extended Data Fig. 3c). We identified several more unexpected covariates including time of day, month of year and fasting duration (R² ≈ 2.5%; Fig. 1a and Extended Data Fig. 3e–j). Following adjustment for all identified covariates (Methods and Supplementary Notes 2 and 3), we found that covariate-adjusted mtCN (which we term mtCNadj) was unimodal in UKB across 178,134 subjects with an average of 61.66 copies per diploid nuclear genome (Extended Data Fig. 3d). We observed a linear decline in mtCNadj with age (Fig. 1c) of approximately 2% per decade among both males and females.
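To make the unit "copies per diploid nuclear genome" concrete, the sketch below uses a common WGS convention in which mtCN is taken as twice the ratio of mean mtDNA coverage to mean nuclear coverage. This convention, the function name and the example coverage values are assumptions for illustration; the exact mtSwirl computation and the covariate adjustment that yields mtCNadj are specified in Methods.

```python
def mtcn_from_coverage(mean_mt_coverage: float, mean_nuclear_coverage: float) -> float:
    """Estimate mtDNA copies per diploid nuclear genome from WGS coverage.

    Assumed convention (not taken from the text above): each nuclear locus is
    present in two copies, so the mtDNA-to-nuclear coverage ratio is doubled.
    """
    if mean_nuclear_coverage <= 0:
        raise ValueError("nuclear coverage must be positive")
    return 2.0 * mean_mt_coverage / mean_nuclear_coverage


# Hypothetical sample: 925x mean mtDNA coverage, 30x mean nuclear coverage.
print(round(mtcn_from_coverage(925.0, 30.0), 2))
```

Under this convention, a sample whose mtDNA and nuclear coverage are equal would have an estimated mtCN of 2.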

Fig. 1: Genetic and phenotypic determinants of mtCN in UKB.

a, Variance explained in mtCN by correction models. b,c, Mean mtCNraw (b) and mtCNadj (c) as a function of age and genetic sex. For b and c, mtCN is copies per diploid nuclear genome, error bars are mean ± 1 s.e.m., and total n = 178,129 and 164,798, respectively. d, GWAS Manhattan plot from UKB cross-ancestry meta-analysis. Labelled genes were obtained using fine-mapping, rare variant evidence or nearest gene. Red genes encode mitochondrial proteins or are implicated in mtDNA disease; *gene at GWS for the Cauchy P value from RVAS; CS variants proximal to the gene with PIP > 0.1; CS with PIP > 0.9; ‘c’, coding variant in the CS; underline, eQTL colocalization PIP > 0.1. Asterisks above peaks on chr 19 and 21 correspond to GP6 and RUNX1, respectively. e, Variants in the 95% CS with PIP > 0.1 causing a protein-altering change. f, Standardized odds ratios for log mtCNraw, log mtCNadj and major blood composition phenotypes in predicting risk of selected common diseases in UKB. Inset numbers are two-sided raw P values with Bonferroni P value cut-off = 0.0025; error bars are 95% confidence intervals (95% CIs) around odds ratios (ORs); sample sizes are in Supplementary Table 8. HTN, hypertension; MI, myocardial infarction; T2D, type 2 diabetes. g,h, Correlations between effect sizes for lead SNPs detected at GWS for neutrophil count between neutrophil count and mtCNraw (P = 4.4 × 10⁻⁷³) (g) and mtCNadj (P = 0.511) (h). Error bars represent 1 s.e., dotted line is weighted least squares regression line, inset corresponds to regression P value. AF, allele frequency.

We next assessed the degree to which variation in mtCNadj is under nuclear genetic control. Our genome-wide association study (GWAS) identified 92 linkage disequilibrium (LD)-independent nucDNA association signals across 46 loci (Fig. 1d) after cross-ancestry meta-analysis, with an estimated SNP-heritability of approximately 4% (Methods). By contrast, mtDNA haplogroup explained less than 0.5% of the variance in mtCNadj, with only a few associations of small magnitude observed (Extended Data Fig. 4a,b). Thirty-three nuclear loci showed variants with a posterior inclusion probability (PIP) of 0.1 or greater after fine-mapping (Methods); 11 of these had protein-altering variants in the 95% credible set (CS) at PIP > 0.1 (Fig. 1e) and 7 showed expression quantitative trait locus (eQTL) colocalization with the assigned gene at PIP > 0.1, including TFAM, MFN2, NDUFV3 and RRM1. Eight loci contained genes implicated in disorders of mtDNA maintenance, six of which harboured variants with PIP > 0.1. Prioritized genes (Methods) encoded proteins that participate in the mtDNA nucleoid and replisome (TFAM, POLG2, TWINKLE, TOP1MT, LONP1), nucleotide metabolism (RRM1, RRM2B, DGUOK, AK3, SLC25A5) and mitochondrial fusion (MFN1, MFN2). The PNP–APEX1 locus was notable as these adjacent genes encode proteins in nucleotide metabolism and mtDNA repair, neither of which has been implicated in mtCN control. Fine mapping implicated both genes, even identifying a missense variant in APEX1 at PIP > 0.9 (Extended Data Fig. 5a). Several more loci included mitochondrial proteins with no previous links to mtDNA (SLC25A10, MCAT, NDUFV3). Telomerase (TERT) is in the vicinity of one locus; however, fine mapping did not provide further evidence for its causality (Supplementary Table 3).

We also performed a gene-based rare variant association study (RVAS) for mtCNadj in UKB (Methods and Supplementary Table 7). In several instances we find convergence with our GWAS, including associations with ultra-rare (minor allele frequency (MAF) < 0.0001) missense or loss of function (LoF) variation in TWNK and TFAM (Extended Data Fig. 5c). RVAS provided clarity to other GWAS loci with uncertain gene assignments (for example, highlighting TOP3A in a locus containing several genes; Fig. 1d) and identified several associations with genes not identified by GWAS. For instance, we found associations with the burden of rare protein-altering variation in genes previously linked to Mendelian mtDNA deletion or depletion disease (OMA1, SAMHD1), as well as associations with genes unlinked to mitochondria (for example, MILR1) (Extended Data Fig. 5d).

We next tested mtCNadj for heritability enrichment in genes associated with organelles or organs using stratified LD-score regression24,25,26 (S-LDSC; Methods). The most significant organelle enrichment was seen for the mitochondrion (Extended Data Fig. 4c). Across organs, skeletal muscle and whole blood were top scoring (Extended Data Fig. 4d). Whole blood enrichment is expected given the sampling site, but skeletal muscle enrichment was unexpected and may be due to shared patterns of gene expression between blood and muscle, or could indicate non-cell autonomous control of blood mtCN.

Blood composition influences bulk mtCN

Although many previous studies have reported associations between low blood mtCN and common diseases27,28,29,30, we could not replicate these results using mtCNadj in UKB for type 2 diabetes, myocardial infarction, stroke, hypertension or dementia (Fig. 1f). When we repeated this analysis using mtCNraw, that is, without adjusting for blood composition, we could recover these earlier associations (Fig. 1f). We extended these analyses to 24 more common diseases, finding that, in total, 20 showed significantly increased risk with reduced mtCNraw; after correction for blood cell composition, the inverse correlations disappeared for all traits except osteoarthritis (Extended Data Fig. 3k). Associations with four cardiovascular disease traits even changed direction with mtCNadj, now showing a positive correlation with increased risk. In all five cases, Mendelian randomization did not support a causal role for mtCNraw or mtCNadj after correcting for multiple tests (Extended Data Fig. 6). Even the oft-reported elevated mtCN in females31 appears to be largely driven by blood composition (Fig. 1b,c). Our GWAS analyses also underscore the confounding effects of blood composition in previous work. Using mtCNadj, we could replicate (at P < 5 × 10⁻⁵) 70 of the 96 previously reported mtCN GWAS loci32, with 37 at genome-wide significance (GWS) (Methods). Using mtCNraw, we could recover 12 more loci from this previous study at GWS including loci containing HBS1L-MYOB, C2, HLA, GSDMC and CD226, which are linked to blood cell types and inflammation (Extended Data Fig. 4f). By contrast, associations near TFAM, a well-known mtCN-controlling gene33, encouragingly strengthen by about 40 orders of magnitude following blood composition adjustment.

It has long been known that inflammation is associated with cardiometabolic disease34; indeed, elevations in inflammatory blood cell indices predict elevated risk for 26 of 29 tested diseases in UKB (Fig. 1f and Extended Data Fig. 3l). Bidirectional Mendelian randomization showed that the effect sizes of loci at GWS for neutrophil count were strongly positively correlated with the corresponding mtCNraw effect sizes (Fig. 1g), whereas the converse did not convincingly hold (Extended Data Fig. 4g), suggesting that changes in blood cell composition cause mtCNraw changes rather than the reverse. Importantly, neutrophil count effect sizes did not predict corresponding mtCNadj effect sizes (Fig. 1h and Extended Data Fig. 4h).
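The effect-size comparison in Fig. 1g,h is summarized by a weighted least squares regression line. The sketch below shows one way such a line could be fitted to per-SNP effect sizes. The inverse-variance weighting, the function name and the example numbers are assumptions for illustration and are not taken from Methods.

```python
def weighted_least_squares(x, y, weights):
    """Fit y = intercept + slope * x by minimizing sum(w * residual**2)."""
    if not (len(x) == len(y) == len(weights)) or len(x) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    w_sum = sum(weights)
    # Weighted means of the predictor and the response.
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    if sxx == 0:
        raise ValueError("x has no weighted variance")
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical per-SNP effect sizes (neutrophil count vs mtCN) and standard errors.
beta_neutrophil = [0.02, 0.05, -0.03, 0.08]
beta_mtcn = [-0.010, -0.024, 0.016, -0.041]
se_mtcn = [0.004, 0.006, 0.005, 0.010]
weights = [1.0 / se**2 for se in se_mtcn]  # assumed inverse-variance weights
slope, intercept = weighted_least_squares(beta_neutrophil, beta_mtcn, weights)
```

A slope near zero, as reported for mtCNadj, would indicate that the loci affecting neutrophil count carry little information about the adjusted trait.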

The most parsimonious explanation for our observations is that previously reported associations between low blood mtCN and elevated common disease risk are, in many cases, secondary to blood composition changes. For the few associations that survive blood composition corrections (Extended Data Fig. 3k), other mechanisms may be involved. Indeed, Mendelian randomization suggests reverse causation or shows high heterogeneity for these traits, arguing against simple forward causal relationships in these instances (Extended Data Fig. 6).

Nuclear control over mtDNA 7S coverage

We next aimed to use variation in sequencing coverage across the 16,569 bases of the mtDNA to dissect specific molecular mechanisms of mtDNA replication. We observe a coverage dip by over 50% in the major NCR of the mtDNA (Fig. 2a), which contains the light strand promoter (LSP), three conserved sequence blocks (CSBs), the heavy strand origin of replication (OH) and the D-loop, which contains a stable third strand of DNA (7S DNA) (Extended Data Fig. 7). It is believed that mtDNA replication requires an ‘RNA primer’ which forms from the termination of LSP-initiated transcription at CSBII (red dashed arrow, Fig. 2a inset). Primed mtDNA synthesis begins at CSBII, with the nascent DNA between CSBII and OH forming a transient ‘DNA flap’ (black dashed arrow, Fig. 2a inset). Further replication can either continue to full-length or be terminated prematurely to produce the persistent 7S DNA (black solid arrow, Fig. 2a inset; see also ref. 35). In theory, we expect the highest local WGS coverage in the persistently triple-stranded 7S DNA, lower coverage in the transiently triple-stranded DNA flap region and lowest coverage in the RNA primer region. This is what we observe (Fig. 2a).

Fig. 2: Nuclear genetic control of relative mtDNA coverage in the NCR.

a, Mean UKB mtDNA per-base coverage. Dropdown highlights coverage depression in the mtDNA NCR. Arrows refer to mtDNA replication products: red dashed arrow, RNA primer; black dashed arrow, transient DNA ‘flap’; black solid arrow, replicated mtDNA. Grey ribbon is ±1 s.d. CSB, conserved sequence block. b, Two-dimensional (2D) histogram showing mtDNA coverage in the DNA flap region versus RNA primer region. Red line is linear fit, from which the residual is used as a ‘coverage discrepancy’. The distribution of these residuals is shown in the lower panel. c, GWAS Manhattan plot of the discrepancy of mtDNA coverage in the DNA flap region versus RNA primer region (see b). d, 2D histogram showing mtDNA coverage in the DNA flap region versus 7S DNA region. As in b, red line is linear fit, and the residual is shown as a density in the lower panel. e, GWAS Manhattan plot of the discrepancy of mtDNA coverage in the DNA flap region versus 7S DNA region (see d). Red genes are mitochondria-relevant; *gene with Cauchy P value at GWS from RVAS; CS variants proximal to the gene with PIP > 0.1; proximal CS variants with PIP > 0.9; ‘c’, missense variant identified in the CS; underline, eQTL colocalization with PIP > 0.1. f, Structure of MGME1 (5ZYV from RCSB under CC0 license; https://doi.org/10.2210/pdb5zyv/pdb) with bound single-stranded DNA in dark blue, the 3₁₀ helix in pink and the T265 alpha carbon as a red sphere. Inset shows the hydrogen bond between T265 and I262.

We hypothesized that genetic variation in nuclear-encoded mtDNA replication machinery might influence the tendency of replication intermediates in the NCR to persist. To attempt to quantify these intermediates, we computed the discordance in coverage between these three regions across individuals in UKB (that is, residuals; Fig. 2b,d and Methods). Upon performing GWAS and cross-ancestry meta-analysis for these traits, we find that nuclear genetic variants near MGME1 associate with the degree of coverage discordance between the RNA primer and the DNA flap (Fig. 2c), whereas variants near TFAM, POLG, MCAT and MGME1 associate with the discordance between 7S DNA and the DNA flap (Fig. 2e). All four genes encode mitochondrial-localized proteins, and MGME1 and POLG work in concert to resolve flap intermediates (that is, the DNA flap) through exonuclease activity during mtDNA replication36. Missense variants in POLG, MGME1 and MCAT all show PIP > 0.1 after fine-mapping, and the highest PIP variant at the MGME1 locus causes p.Thr265Ile, which is in the MGME1 exonuclease domain (Fig. 2f). We also identify a variant causing p.Ala303Gly in MCAT, which has no previous connection to mtDNA maintenance and encodes a component of mitochondrial type II fatty acid synthase. RVAS identified further associations between the levels of missense or LoF variation in novel genes and the 7S DNA and DNA flap coverage discordance, including OMA1 (Supplementary Table 7).
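The coverage discordance traits are described as residuals from a linear fit between the coverage of two NCR regions across individuals. A minimal sketch of that idea follows, using ordinary least squares. The choice of which region is regressed on which, the function name and the example values are assumptions for illustration; any normalization applied before fitting is described in Methods.

```python
def linear_fit_residuals(x, y):
    """Return per-individual residuals of an ordinary least squares fit y ~ x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variance")
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    # A positive residual means more coverage in region y than the fit predicts.
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical relative coverage per individual in two NCR regions.
flap_coverage = [0.60, 0.72, 0.55, 0.80]
primer_coverage = [0.41, 0.50, 0.42, 0.53]
discrepancy = linear_fit_residuals(flap_coverage, primer_coverage)
```

Each residual can then be treated as a quantitative trait for that individual in a GWAS, in the same way as any other continuous phenotype.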

Phenotypes caused by pathogenic mtDNA mutations

We next considered mtDNA sequence variation in UKB (Methods), with an initial focus on well-established, disease-associated mtDNA variants. We began by assessing the carrier rates for ten common pathogenic mtDNA variants associated with maternally inherited diseases, including Leber’s hereditary optic neuropathy; mitochondrial encephalomyopathy, lactic acidosis and stroke-like episodes (MELAS); and aminoglycoside-induced ototoxicity (Fig. 3). We find that approximately 1 in 192 individuals in UKB carries at least one of the ten pathogenic mtDNA variants, in agreement with a previous estimate of 1 in 200 (ref. 37).

Fig. 3: Carrier frequencies and intermediate phenotypes for pathogenic mtDNA mutations assessed in UKB.

Carrier frequencies for ten pathogenic mutations in UKB, with heteroplasmy distributions plotted as jittered points and annotations corresponding to canonically associated disease(s). Panels show mean triglyceride levels, haemoglobin A1c, auditory threshold (by means of speech-recognition threshold test) and visual impairment (logMAR, by means of vision test) among mtDNA variant carriers. Point size corresponds to number of carriers with available phenotype data (n); only points with more than 10 measurements are shown. Vertical lines represent trait means among individuals not carrying any of the ten variants. Error bars, ±1 s.e.m. AIOT, aminoglycoside-induced ototoxicity; LHON, Leber’s hereditary optic neuropathy; MERRF, myoclonic epilepsy with ragged red fibres; LS, Leigh syndrome; NARP, neuropathy, ataxia, retinitis pigmentosa; FDR, false discovery rate.

An open question is whether individuals carrying rare pathogenic mtDNA variants in the population exhibit intermediate disease phenotypes. We can now address this thanks to the rich phenotyping in UKB. We tested four phenotypes traditionally associated with these mtDNA variants: haemoglobin A1c (chrM:3243:A,G), triglyceride levels (chrM:3243:A,G), hearing impairment (chrM:1555:A,G, chrM:3243:A,G, chrM:7445:A,G) and visual impairment (chrM:3460:G,A, chrM:11778:G,A, chrM:14484:T,C, chrM:14459:G,A). Individuals carrying the chrM:3243:A,G variant show elevated haemoglobin A1c, elevated triglycerides, and hearing and vision impairment (Fig. 3 and Methods) relative to individuals carrying none of these ten mtDNA variants. Owing to their low frequency of detection in the UKB sample, we do not have the statistical power to exclude the presence of more subtle intermediate phenotypes among the other tested variants.

mtDNA variation across 253,583 people

Next, we more broadly examined the entire spectrum of homoplasmic and heteroplasmic mtDNA variation. Our analysis across UKB and AoU yields the largest database of mtDNA SNVs and indels to date to our knowledge (Fig. 4a). Consistent with earlier gnomAD analyses21, we find that the number of homoplasmies per individual is closely related to haplogroup, with haplogroup H (closest to GRCh38 reference) showing the fewest and haplogroup L0 showing the most (Extended Data Fig. 8a). Aggregate heteroplasmy distributions were highly consistent between UKB and AoU (Extended Data Fig. 8d), and most individuals carried 0–1 heteroplasmic SNVs and 0–2 heteroplasmic indels (Extended Data Fig. 8e). The hypervariable regions of the mtDNA, found in the NCR, contain an elevated heteroplasmic SNV rate and most heteroplasmic indel variants (Fig. 4a). Heteroplasmic indels primarily arise near poly-C stretches (for example, chrM:302, chrM:567, chrM:955, chrM:16182) in the non-protein-coding mtDNA, whereas coding mtDNA shows a low indel rate despite the presence of many poly-C tracts (Fig. 4a), consistent with negative selection. We tested the most common heteroplasmies in UKB for association with risk of 29 common diseases (Methods) and found no evidence of association, although sample sizes were limited (Extended Data Fig. 8g).

Fig. 4: Pervasive nuclear genetic control over common mtDNA heteroplasmies.

a, Quality control (QC)-passing mtDNA heteroplasmies in UKB and AoU. From the inside: mtDNA positions of poly-C tracts; genomic annotations (orange, HVR; yellow, rRNA genes; blue, tRNA genes; purple, coding genes); heteroplasmic SNV counts (red); heteroplasmic indel counts (black). The teal arc region is the focus of Fig. 5. Line in outermost track, 100 indels. b, Mean heteroplasmy count per individual across age groups in AoU. Error bars are 1 s.e.m.; total n = 95,328. c, Heteroplasmy transmission in mother versus offspring (left), father versus offspring (middle) and sibling versus sibling (right) for UKB heteroplasmic variants. d, Heteroplasmy transmission in 1000G cell lines in mother versus offspring (left) and father versus offspring (right) pairs. e, Selected heteroplasmy distributions among carriers. For panels a–d, red, SNV; black, indels. f, GWAS lead SNPs from common heteroplasmies with any signals at GWS. Point size corresponds to lead SNP two-sided P value; dark points are at GWS. Vertical lines, SNPs identified for multiple mtDNA variants or near genes of interest. Green, genes also nominated for mtCN; *has Cauchy P value at GWS from RVAS; CS variants with PIP > 0.1; CS variants with PIP > 0.9; ‘c’, coding variant in CS; underline, eQTL colocalization with PIP > 0.1. g, Role of genes identified by heteroplasmy GWAS in mtDNA dynamics. h, chrM:16183:AC,A heteroplasmy versus DGUOK lead SNP genotype. i, Structure of DGUOK (2OCP from RCSB under CC0 license; https://doi.org/10.2210/pdb2ocp/pdb) with Q170 in red, nearby residues participating in hydrogen bonds or stacking interaction in pink, and dATP as black sticks. j, chrM:16183:A,AC heteroplasmy versus POLG2 lead SNP genotype. k, Structure of polymerase gamma (4ZTU from RCSB under CC0 license; https://doi.org/10.2210/pdb4ZTU/pdb) with POLG in light blue and POLG2 subunits in green/yellow. Bound DNA is in dark blue; POLG2 residue G416 is shown as red spheres. In panels h and j, red lines, median.

Heteroplasmy transmission and age accrual

We next investigated the patterns of transmission and age-dependence for mtDNA heteroplasmies. For analysis of age, we focused on AoU given the broader age range of participants (20–90 versus 40–70 for UKB). Although heteroplasmic SNVs tend to accumulate with age (particularly after age 70), this was not the case for indel heteroplasmies (Fig. 4b). Using siblings and parent–offspring pairs in UKB (Methods), we found that nearly all heteroplasmic indels were quantitatively maternally transmitted and shared between siblings, whereas most heteroplasmic SNVs were not (Fig. 4c). We also analysed WGS from 602 trios from the 1000 Genomes Project (1000G), finding a similar pattern (Fig. 4d). Unlike UKB blood samples, 1000G samples underwent Epstein-Barr virus transformation to create cell lines before WGS38,39, implying that the maintenance of these heteroplasmic indels is robust and can be quantitatively maintained through both maternal transmission and cell culture, albeit with some added variance (Fig. 4d). The robust maternal transmission and stability across age leads us to conclude that most indel heteroplasmies are inherited as mixtures; by contrast, for heteroplasmic SNVs, the typical lack of transmission and accumulation with age strongly suggest that they typically arise by means of somatic mutagenesis. In contrast to earlier reports40, we find no evidence of paternal transmission (Fig. 4c,d). Over 80% of heteroplasmic SNVs were transitions, which showed a sharp increase in frequency in older age, consistent with the somatic mtDNA mutational spectrum seen in ageing brains41. Curiously, we observed a decline in heteroplasmic transversions in older individuals (Extended Data Fig. 8f).

Nuclear GWASs for mtDNA heteroplasmy

We then sought to determine the extent to which mtDNA heteroplasmy is influenced by nuclear genetic loci. To our knowledge, nuclear loci influencing individual mtDNA heteroplasmies have never been identified in humans. Given that most common heteroplasmies showed maternal transmission (Extended Data Fig. 9), we restricted to individuals carrying each heteroplasmy and performed GWAS with the heteroplasmy level as a quantitative trait (Fig. 4e and Extended Data Fig. 8h).

We identified 42 LD-independent associations across 39 heteroplasmies after cross-ancestry meta-analysis of UKB GWASs (Supplementary Note 7). Our results revealed a shared nuclear genetic architecture for heteroplasmies across mtDNA sites, with 9 of 20 unique nuclear loci associated with more than one heteroplasmic variant (Fig. 4f and Extended Data Fig. 10a). Cross-mtDNA heterogeneity was also observed: chrM:302:A,AC and chrM:302:A,ACC appeared most associated with loci near SSBP1, TFAM, LONP1 and MCAT, whereas the other heteroplasmies were most strongly associated with loci containing DGUOK, PNP and POLG2. Although many genes implicated in heteroplasmy control were also identified in our mtCN GWAS, others were not (for example, TEFM, MTPAP, SSBP1, ABHD10; Fig. 4f). Many associated loci were near genes with established roles in mtDNA replication and maintenance (Fig. 4g), with missense variants identified in the 95% CS in DGUOK, LONP1, POLRMT, MGME1 and POLG2, and eQTL colocalization PIP > 0.1 seen for POLRMT, POLG2 and TFAM. Of the novel hits, we highlight a locus containing C7orf73 (Fig. 4f and Extended Data Fig. 10f), which encodes a protein recently linked to complex IV (ref. 42), suggesting a moonlighting role for this short protein in mtDNA maintenance.

Zooming in, we see relatively large effect sizes from PIP > 0.9 variants in or near genes related to nucleotide metabolism (PNP, DGUOK) and DNA replication (POLG2). The probable causal variant for PNP (PIP 1, Extended Data Fig. 10g) is intronic and colocalizes with a strong negative cross-tissue eQTL43 (multi-tissue P ≈ 0; colocalization PIP 1; Extended Data Fig. 10h,i). PNP is not yet linked to mtDNA disease but performs an analogous reaction to TYMP (an mtDNA disease gene) on purines. The probable causal variant for DGUOK (PIP 0.99, Fig. 4h) results in a p.Gln170Arg missense change in the kinase domain, potentially affecting the tertiary structure of the protein as this glutamine side chain participates in a number of hydrogen bonds and stacking interactions (Fig. 4i). The putative causal variant for POLG2 (PIP 1, Fig. 4j) results in p.Gly416Ala in a predicted anticodon binding domain. This amino acid is highly conserved (Extended Data Fig. 10j) and the mutation affects a loop near the POLG2 homodimer surface (Fig. 4k). These examples highlight protein-altering variants that appear to substantially affect the levels of specific heteroplasmic mtDNA variants.

To test whether heteroplasmy-associated nuclear loci act through mtDNA mutagenesis, we repeated our GWAS, re-coding heteroplasmy traits as ‘case/control’, in which for each mtDNA variant, cases showed detectable heteroplasmy and controls did not. We observed little signal (Extended Data Fig. 10b), arguing against a mutagenic origin influenced by nucDNA variation and supporting the notion that maternal transmission determines the presence of each tested heteroplasmy, whereas nuclear variation can influence the subsequent relative heteroplasmic fraction.
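The two trait encodings used for each mtDNA variant, a quantitative trait among carriers and a case/control indicator across everyone, can be sketched as below. The detection threshold, the function names and the sample values are hypothetical. How homoplasmic calls are treated is not modelled here and is left to Methods.

```python
MIN_DETECTABLE = 0.05  # hypothetical detection threshold for heteroplasmy


def case_only_trait(heteroplasmy_by_sample, min_detectable=MIN_DETECTABLE):
    """Quantitative trait: heteroplasmy level, restricted to carriers."""
    return {
        sample: level
        for sample, level in heteroplasmy_by_sample.items()
        if level >= min_detectable
    }


def case_control_trait(heteroplasmy_by_sample, min_detectable=MIN_DETECTABLE):
    """Binary trait: 1 if the variant is detectably present, else 0."""
    return {
        sample: int(level >= min_detectable)
        for sample, level in heteroplasmy_by_sample.items()
    }


# Hypothetical heteroplasmy levels for one mtDNA variant.
levels = {"s1": 0.00, "s2": 0.20, "s3": 0.60, "s4": 0.01}
quantitative = case_only_trait(levels)   # association tested among carriers only
binary = case_control_trait(levels)      # association tested across all samples
```

Little signal for the binary trait alongside strong signal for the quantitative trait is the pattern that separates control of heteroplasmy level from control of heteroplasmy origin.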

We took several steps to validate our genetic findings. We performed a replication analysis in AoU across 96,698 diverse individuals and observed high concordance between cross-ancestry meta-analysis effect sizes in UKB and AoU (R² = 0.79; Extended Data Fig. 10c and Supplementary Note 4) with limited attenuation (as expected with winner’s curse44). We investigated potential technical and biological confounders, observing little correlation between these variables and heteroplasmies (Extended Data Fig. 11a–e and Supplementary Note 2). We explicitly tested the robustness of our results to the contaminating effects of NUMTs (Supplementary Notes 5 and 6), finding that GWAS effect sizes were not sensitive to mtDNA coverage as would be expected for NUMT-derived signals (Extended Data Fig. 11j–m). We found strong correlations between UKB meta-analysis effect sizes and those from individual ancestry groups in AoU despite small n (R² = 0.49–0.78; Extended Data Fig. 10d), reducing the likelihood of confounding by recent polymorphic NUMTs. We tested all GWAS hits for LD R² > 0.1 with variants within 20 kilobase (kb) windows of 4,736 reference and polymorphic NUMTs, finding only 1 potentially concerning locus: among the UKB EUR (European) group, the SSBP1 locus had LD R² ≈ 1 with variants in a reference NUMT. Importantly, this locus remained significant for chrM:302:A,AC among the AFR (African) group in AoU despite AFR showing much lower LD with NUMT variants (Extended Data Fig. 10k). Further, the levels of ultra-rare missense/LoF variation in SSBP1 were significantly associated with chrM:302:A,AC heteroplasmy (Fig. 5i and Supplementary Table 7).

Fig. 5: Length heteroplasmies at chrM:302 are inherited maternally as mixtures, co-exist in single cells and are under the influence of variation in the nuclear genome.

a, Scheme of chrM:302 region with associated G-quadruplex and length heteroplasmy (GmAGn) nomenclature. b, Sibling–sibling transmission of chrM:302 length heteroplasmies. c–e, chrM:302 length heteroplasmy composition across UKB (c), within select UKB mtDNA haplogroups (d) and across 171 single cells in whole blood (e). For c–e, each vertical bar corresponds to a single individual (c,d) or cell (e). For b–e, colours correspond to the legend next to panel d. f, Mean mtCNadj as a function of major chrM:302 allele (red line) and TFAM allele (black dot). Error bars, mean ± 1 s.e.m.; mtCN, copies per diploid nuclear genome; total n = 121,816. g, Case-only mtDNA heteroplasmy GWAS Manhattan plot for chrM:302:A,AC. Red genes are mitochondria-related; *gene with RVAS Cauchy P value at GWS; CS variants proximal to the gene with PIP > 0.1; ‘c’, missense variant identified in the CS; underline, eQTL colocalization with PIP > 0.1. h, chrM:302 heteroplasmy as a function of highest PIP SNP genotype in SSBP1 locus. Red line, median. i, Quantile–quantile plots of gene-based SKAT-O P values from RVAS for chrM:302:A,AC. Colours represent max MAF of included variants, black line is null expectation, error band is 95% CI under the null. Ref, reference.

CSBII variation across people and cells

The ‘length heteroplasmy’ at chrM:302, located in the CSBII region of the mtDNA NCR (Fig. 5a), is the most common heteroplasmic site we observed and occurs within a regulatory motif for mtDNA replication2. Although the reference genome corresponds to GmAG7 (nomenclature indicates the length of the poly-G stretch on the GRCh38 opposite strand, Fig. 5a), we frequently observe individuals harbouring GmAG8 (chrM:302:A,AC), GmAG9 (chrM:302:A,ACC) and GmAG10 (chrM:302:A,ACCC). The fractions of mtDNA carrying these variants are quantitatively shared between siblings (Fig. 5b), indicating maternal transmission of mixtures of multiple mtDNA haplotypes at position 302.
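The correspondence between variant notation and the GmAGn nomenclature stated above (reference GmAG7, with each inserted C on the reference strand adding one G on the opposite strand) can be written as a small helper. The function name and the input validation are hypothetical; only the mapping itself comes from the text.

```python
REFERENCE_POLY_G_LENGTH = 7  # the reference genome corresponds to GmAG7


def length_allele_name(variant: str) -> str:
    """Map a chrM:302 insertion such as 'chrM:302:A,ACC' to its GmAGn name."""
    chrom, pos, alleles = variant.split(":")
    ref, alt = alleles.split(",")
    if chrom != "chrM" or pos != "302" or ref != "A":
        raise ValueError(f"not a chrM:302 length variant: {variant}")
    inserted = alt[len(ref):]
    # Only pure C insertions after the reference base are covered by this sketch.
    if not alt.startswith(ref) or set(inserted) - {"C"}:
        raise ValueError(f"unsupported alternate allele: {alt}")
    return f"GmAG{REFERENCE_POLY_G_LENGTH + len(inserted)}"


for v in ("chrM:302:A,AC", "chrM:302:A,ACC", "chrM:302:A,ACCC"):
    print(v, "->", length_allele_name(v))
```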

Most of the 156,885 individuals assessed in UKB harbour a mixture of these length heteroplasmies (Fig. 5c), with individuals from different haplogroups showing different distributions (Fig. 5d). The observed quantitative maternal transmission of heteroplasmy implies that mtDNA mixtures exist in individual cells, and we indeed find mtDNA mixtures at chrM:302 in 171 single cells from one individual (Fig. 5e) by re-analysing previously reported single-cell data (Methods).

We find multiple lines of evidence linking mtDNA replication and length variation at chrM:302. Longer alleles at this site are associated with declining mtCNadj with an effect size comparable to the TFAM locus (Fig. 5f, PIP ≈ 1). Nuclear genetic analyses for chrM:302:A,AC, the most common length heteroplasmy, nominated several genes relevant for mtDNA replication and nucleotide balance (for example, SSBP1, identified by GWAS and corroborated by ultra-rare RVAS; Fig. 5g,i), including several genes not identified in GWASs for other heteroplasmic sites (CDA, MTPAP, TFAM, TEFM, LONP1, MCAT; Figs. 4f and 5g). mtCN and chrM:302:A,AC heteroplasmy even show colocalization at the two most significant mtCN loci: 10:60145079:A,G (a TFAM 5′ untranslated region (UTR) variant) and 19:5711930:C,T (a LONP1 missense variant); both show a PIP ≈ 1 for mtCN and have PIP > 0.3 for chrM:302:A,AC. It is notable that previous studies have suggested that the chrM:302 site serves as a ‘rheostat’ for mtDNA replication versus transcription, which are functionally linked in mitochondria3,45. The G-quadruplex at CSBII (Fig. 5a) is a tertiary RNA/DNA hybrid structure that promotes DNA replication by impairing RNA polymerase progression, promoting the formation of interrupted RNA fragments subsequently used for primed replication2,46. Prior in vitro studies have suggested that CSBII G-quadruplex strength is a function of chrM:302 allele, altering the degree to which RNA transcription switches to DNA synthesis45 (Fig. 5a). We now report that nuclear variants in genes related to the mtDNA replisome can favour one length heteroplasmy over another; for example, variants near SSBP1 favour chrM:302:A,ACC (Fig. 5h). Taken together, our results suggest that nuclear genetic variation can influence the replication efficiency of mtDNA molecules based on chrM:302 allele.

Discussion

Given that all protein machinery for mtDNA replication and maintenance is nucDNA-encoded, it is plausible that commonly occurring nuclear variants can influence mtDNA heteroplasmy, although this has never been demonstrated in humans. Here, by leveraging WGS across two large biobanks, we report pervasive nuclear genetic control of mtDNA abundance and heteroplasmy variation in humans. Many of these nuclear quantitative trait loci (QTLs) correspond to machinery responsible for mtDNA maintenance, which may influence heteroplasmy by directly acting on mtDNA and altering the relative replication efficiency of mtDNA molecules based on mtDNA sequence, whereas several others correspond to genes never before linked to mtDNA biology. High statistical resolution allows us to gain detailed molecular insights into the mechanisms underlying an entire battery of mito-nuclear interactions, with implications for basic physiology, human disease and evolution.

Our ability to dissect the genetic architecture of mtCN and heteroplasmy was possible both because of the statistical power afforded by the scale of large biobanks and because of careful attention given to technical and biological confounders. We analysed mtDNA sequences across 274,832 individuals of diverse ancestries from two biobanks. We were particularly attentive to contamination by mtDNA pseudogenes in the nuclear genome (NUMTs, Supplementary Notes 5 and 6). We explicitly tested many potential confounders of mtDNA traits, finding that correction of mtCN for blood cell composition had a profound effect on the observed association landscape. Many previously reported associations between blood mtCN and cardiometabolic traits27,28 disappear or reverse direction after adjustment for blood cell composition (Fig. 1f). Our corrections reduce and even eliminate certain recently reported GWAS hits32 near genes suspiciously related to blood cell composition and inflammation (for example, HLA, HBS1L). Our data suggest that, in many cases, an inflammatory state in cardiometabolic disease influences blood cell composition, driving the previously observed decline in mtCN.

The resulting GWASs of mtCNadj and mtDNA heteroplasmies provide molecular insights into mtDNA maintenance. The nuclear loci we identify, including those with fine-mapped missense variation (for example, MGME1, POLG, POLG2, DGUOK, LONP1), are enriched for roles in the mtDNA nucleoid, mtDNA replication and nucleotide balance. We show how population-level genetic analysis can produce detailed, mechanistic insights into mtDNA replication: GWAS of the relative mtDNA coverage in the 7S DNA ‘flap’ region highlights missense variants in both MGME1 and POLG, whose products have exonuclease activity that can resolve this replication ‘flap’ intermediate. We speculate that the putatively causal variant in MGME1, p.Thr265Ile, may act by directly affecting DNA binding by disrupting a hydrogen bond within a helix-forming part of the DNA binding pocket of the MGME1 exonuclease domain (Fig. 2f). We observe notable differences in the genetic architecture of mtCNadj versus heteroplasmy: although TFAM, LONP1, DGUOK and PNP are associated with both traits, the former two (encoding components of the mtDNA nucleoid) were the most significant associations for mtCNadj, whereas the latter two (involved in nucleotide balance) were among the strongest associations across many heteroplasmies. QTLs corresponding to TWNK were identified only for mtCNadj, whereas associations near SSBP1, TEFM and POLRMT were specific to heteroplasmy, suggesting that genetic variation in different mtDNA replication genes can have effects specific to mtCN or heteroplasmy. We spotlight many loci with no previous links to mtDNA biology, such as C7orf73, MCAT, ABHD10, NDUFV3, CDA and ADA, implying new roles for their protein products. Future studies are required to evaluate the specific impacts of the candidate causal variants on the function of proteins involved in mtDNA replication and maintenance.

Our study has implications for rare mitochondrial diseases. First, our GWAS nominates candidate genes for unsolved mitochondrial disease. PNP is an excellent example: it has not previously been linked to mtDNA disease; however, we now show that it is associated with mtCNadj and the levels of 13 length heteroplasmic variants at 3 mtDNA sites. It participates in purine catabolism, and defects in analogous steps in pyrimidine catabolism are linked to mtDNA deletion/depletion syndromes. Second, we confirm an earlier estimate that about 1 in 200 individuals carries a known pathogenic mtDNA variant37, but now also report intermediate phenotypes in such individuals—for example, the MELAS A3243G variant is associated with an increased risk for diabetes. Interestingly, the heteroplasmy distribution observed for the MELAS variant appears to be left-shifted, potentially suggesting negative selection as previously observed18. Third, because the number of wild-type mtDNA molecules is key for healthy physiology, it is tempting to speculate that individuals with a higher mtCN polygenic score may be more resilient to pathogenic, heteroplasmic mtDNA mutations, helping to explain some of the striking phenotypic variability observed between family members that carry the same maternally transmitted pathogenic mtDNA mutations47. Larger, rare disease-focused studies will be required to determine the extent to which the nuclear variants we have identified can modify the penetrance of mtDNA mutations.

A striking finding from our work is that nearly every human harbours heteroplasmic mtDNA variants obeying two key principles: (1) heteroplasmic SNVs are typically somatic and accrue with age sharply after age 70, whereas (2) heteroplasmic indels are found in more than 60% of individuals, do not accrue with age and are usually inherited as mixtures in the same maternal lineage. The accrual of point mutations with age has been reported11; however, to our knowledge the stability of indels with age has not previously been appreciated. Consistent with earlier work15, heteroplasmic SNVs tend to occur more in the mtDNA hypervariable regions, but we find that most heteroplasmies detected here are actually inherited indels. Most heteroplasmic indels appear to occur next to poly-C stretches in the non-protein-coding mtDNA; heteroplasmic indel rates are orders of magnitude lower next to poly-C stretches in coding regions, suggesting negative selection in these regions. Strikingly, for any given common indel, we find that maternal heteroplasmy levels quantitatively predict offspring heteroplasmy levels, suggesting neutral transmission. We show that these heteroplasmy levels are also under nuclear genetic control, with associated loci enriched for genes involved in mtDNA biology and nucleotide balance. These loci are similar across heteroplasmies at multiple mtDNA sites, suggesting a shared genetic architecture.

In theory, the nuclear QTLs we identify for mtDNA length heteroplasmies could operate by one of two mechanisms: (1) the associated nuclear variants are ‘mutagenic’ and impair mtDNA copying fidelity resulting in somatic indels due to slippage in poly-C tracts48, or (2) these nuclear variants confer a replicative advantage to maternally inherited mtDNA molecules carrying certain length variants. Our data favour the latter. Case/control GWAS showed very little signal compared with case-only analysis; in concert with the observed maternal transmission this strongly suggests that the identified nuclear QTLs modify existing indel heteroplasmy levels rather than acting through mutagenesis, potentially by altering the replicative efficiency of the mtDNA molecules carrying different alleles.

Our work provides insight into mechanisms by which the nuclear genotype may be able to confer a replicative advantage to specific mtDNA variants. This is perhaps best illustrated by length heteroplasmy at chrM:302. This heteroplasmy occurs within the G-quadruplex in CSBII in the mtDNA NCR, which may induce switching from transcription to replication by blocking transcription progression. Previous in vitro studies have shown that the chrM:302 length polymorphism affects the strength of this G-quadruplex, hence modifying the transcription/replication switch3,45. We find that mixtures of mtDNA with different chrM:302 length variants are found in over half of the population and are maternally inherited. Once inherited, we show that chrM:302 alleles influence mtDNA abundance (acting in cis), and we find that the resulting heteroplasmy levels are influenced (acting in trans) by nuclear QTLs (for example, SSBP1, POLG2, TEFM) whose protein products are thought to directly operate this regulatory switch45. In sum, our results indicate that the associated nuclear variants alter chrM:302 heteroplasmy by influencing factors that interact with the chrM:302 G-quadruplex, thus privileging the replication of mtDNA templates carrying a particular chrM:302 genotype. Recent experiments in embryonic stem cells led to speculation that CSBII length variants may contribute to mtDNA reversion after mitochondrial replacement therapy49 owing to replicative advantage of carryover mtDNA from the intending mother. Our results may provide mechanistic insight into nuclear genetic control of this reversion.

An open question is why mtDNA heteroplasmy is so common in humans, and whether a selective advantage preserves this variation and the observed mito-nuclear interactions. In the current paper, we have shown that quantitative mtDNA traits in the population can be under both cis-acting control (through mtDNA variation) and trans-acting control (through nucDNA variation), and it is possible that these effects balance each other to maintain stable heteroplasmy across generations. As the mtDNA has high mutation rates with little or no recombination, it is prone to the accumulation of disabling mutations that could lead to its ‘meltdown’ through Muller’s ratchet50. However, mtDNA mutation followed by heteroplasmy is a requisite step in evolutionary adaptation. Nuclear QTLs for mtDNA heteroplasmy may represent mechanisms by which a reservoir of such variation can be tolerated and harnessed over evolutionary time-scales.

Methods

Overview of mtSwirl

Here we develop mtSwirl, a scalable pipeline for mtCN and variant calling which makes calls relative to an internally generated per-sample consensus sequence before mapping all calls back to GRCh38. In addition to GRCh38 reference files and WGS data, the mtSwirl pipeline takes as input nuclear genome reference intervals that represent regions with high homology to the mtDNA (reference NUMTs). We constructed a set of 385 putative NUMTs by using a BLAST-based inventory of reference NUMTs published previously51, extending the boundaries of each interval by 500 bases, and merging any overlapping intervals. Initial variant calls in the mtDNA and reference NUMT regions are made from mapped WGS data using Mutect2 and HaplotypeCaller, respectively (using GATK v.4.2.6.0), and haplogroup inference is performed using Haplogrep52. Consensus sequences are subsequently constructed using homoplasmies (mtDNA) and homozygous alternative (nucDNA) calls. Reads are realigned to the new consensus sequence and variants are called on the mtDNA using Mutect2. To avoid the artificial coverage depression at the ends of the mtDNA reference genome, we call variants in the control region after alignment to a shifted mtDNA molecule. All variant calls and per-base coverage estimates are then returned to GRCh38 coordinates and output from the pipeline. See Supplementary Note 1 for more details. We release two versions of our pipeline on GitHub (https://github.com/rahulg603/mtSwirl): mtSwirlSingle, a single-sample pipeline intended for use with Cromwell and on platforms with high worker limits such as Terra and the AoU Workbench, and mtSwirlMulti, a multi-sample version that processes multiple samples serially per machine, intended for use on platforms with a smaller parallel worker limit such as the UKB Research Analysis Platform.
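The construction of the reference NUMT interval set described above (pad each interval by 500 bases, then merge overlaps) can be sketched as follows. This is an illustrative sketch only and is not taken from the mtSwirl code base: the function name `pad_and_merge`, the tuple layout and the decision to merge intervals that merely touch are assumptions, and the sketch clips at coordinate 0 but does not clip at contig ends.

```python
from typing import List, Tuple

# Hypothetical interval representation: (contig, start, end)
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: List[Interval], pad: int = 500) -> List[Interval]:
    """Extend each interval by `pad` bases on both sides, then merge
    intervals on the same contig that overlap (or touch) after padding."""
    padded = sorted((c, max(0, s - pad), e + pad) for c, s, e in intervals)
    merged: List[Interval] = []
    for contig, start, end in padded:
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps the previous interval on the same contig: extend it.
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged


example = [("chr1", 1000, 2000), ("chr1", 2800, 3000), ("chr2", 100, 200)]
numt_intervals = pad_and_merge(example)
```

In the example, the two chr1 intervals are 800 bases apart before padding and therefore collapse into one interval once each is extended by 500 bases.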

Cohorts

UKB

The UKB is a large prospective cohort study of approximately 500,000 individuals in the UK53, about 200,000 of whom had WGS performed at the time of this study. Samples were selected for the first round of WGS using a pseudorandom approach to ensure that included samples were representative of the full cohort. Sequencing data were generated using DNA extracted from buffy coat obtained from participants; more details have been reported previously54. All UKB data were accessed under application 31063 and mtDNA variant calling was performed on the UKB Research Analysis Platform.

AoU

AoU is a large longitudinal cohort study based in the USA, with a central goal of enrolling a diverse cohort of participants providing electronic health record data over time, specimens for genetic analysis, survey responses and standardized biometric measurements55. At the time of this study, 98,590 individuals had completed WGS on samples obtained from whole blood. DNA extraction was completed at the Mayo Clinic, and sequencing was performed at three sequencing centres (Baylor College of Medicine, Broad Institute and University of Washington) using harmonized protocols. Post-sequencing variant and sample QC was performed by the AoU Data and Research Center (DRC). All mtDNA analyses were performed using the AoU Researcher Workbench in the Controlled Tier v6 workspace: ‘Genetic determinants of mitochondrial DNA phenotypes’, using data from the Q2 2022 release. See https://support.researchallofus.org/hc/en-us/article_attachments/7237425684244/All_Of_Us_Q2_2022_Release_Genomic_Quality_Report.pdf for more details on genomics QC and preprocessing.

gnomAD v.3.1 subset

gnomAD v.3.1 is a database aggregating WGS data from 76,156 samples from several experiments and projects around the world, as part of which an mtDNA variant call-set was recently produced21. Samples were sourced from several study designs including case–control studies for common diseases, population-based cohorts and observational studies. Individuals with inborn severe paediatric disease were excluded. Most data were sourced from sequencing performed on either blood samples extracted using study-specific methodologies or cell lines21. We made use of a subset of the gnomAD v.3.1 samples to prototype our pipeline (mtSwirl) and compare its performance with previous mtCN and variant calls (‘Vanilla’). We excluded samples with very high mtCN as done previously21, as these are most likely to be cell line samples rather than whole blood samples; we used a more stringent threshold of 350 as we wanted to maximally enrich for whole blood samples for this analysis. We also removed samples with mtCN < 50 due to elevated NUMT contamination in these samples21 (Extended Data Fig. 8c). We selected approximately 6,300 samples from gnomAD v.3.1 to maximize inclusion of diverse haplogroups including those underrepresented in UKB (Extended Data Fig. 2a). We specifically supplemented samples belonging to the L haplogroups and enforced a cap on the number of samples assigned to either NFE (Non-Finnish European) or FIN (Finnish). For other larger haplogroups we performed random subsampling proportional to the original composition of the gnomAD dataset to achieve our final sample size. All analyses were performed in Terra (https://app.terra.bio/) using the mtSwirl pipeline deployed with Cromwell.

1000G

The expanded 1000G cohort is a foundational collection of 3,202 diverse samples from 26 populations with recently completed high-coverage WGS and 602 trios38,39. Unlike the other cohorts, for which sequencing was performed directly on whole blood or whole blood subfractions, sequencing for 1000G was performed on lymphoblastoid cell cultures which were established from peripheral blood mononuclear cells at the Coriell Cell Repositories39. The expanded 1000G cohort, which includes the full set of unrelated samples from 1000G phase 3 as well as additional samples to complete 602 trios, was recently sequenced; more details are reported elsewhere38. All data were accessed through the ‘1000G-high-coverage-2019’ workspace in Terra, and all analyses were performed using mtSwirl deployed using Cromwell in Terra.

Computing mean nucDNA coverage in UKB

As mean nucDNA coverage was not available for UKB, we used samtools v.1.9 idxstats56, samtools flagstat and GATK v.4.2.6.0 CollectQualityYieldMetrics as part of the mtSwirlMulti pipeline to efficiently and economically estimate mean coverage on the nucDNA. Idxstats-based counts of total mapped reads were computed over autosomes, and the following formula was applied to obtain average nucDNA coverage after removing contributions from duplicate reads:

$$\begin{array}{l}{\rm{Mean}}\,{\rm{coverage}}\,=\\ \frac{({\rm{total}}\,{\rm{mapped}}\,{\rm{reads}}-{\rm{singletons}}-{\rm{reads}}\,{\rm{with}}\,{\rm{discordant}}\,{\rm{mate}}-{\rm{duplicates}})\times {\rm{read}}\,{\rm{length}}}{{\rm{genome}}\,{\rm{length}}}\end{array}$$
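As a worked illustration of the formula above, the following minimal sketch computes mean coverage from read counts. The function and argument names are hypothetical and the input numbers are invented for the arithmetic; in practice the counts would come from the samtools and GATK tools named in the text.

```python
def mean_nuc_coverage(total_mapped: int, singletons: int, discordant_mate: int,
                      duplicates: int, read_length: int, genome_length: int) -> float:
    """Mean coverage = usable reads x read length / genome length, where
    usable reads exclude singletons, discordant-mate reads and duplicates."""
    usable_reads = total_mapped - singletons - discordant_mate - duplicates
    return usable_reads * read_length / genome_length


# Toy numbers: 1,000 mapped reads, 200 excluded, 150-base reads, 4,000-base genome.
coverage = mean_nuc_coverage(1000, 50, 30, 120, read_length=150, genome_length=4000)
```

Here 800 usable reads of 150 bases spread over 4,000 bases give a mean coverage of 30.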

Computing mtCN

Across all cohorts we used the following formula to compute mtCN:

$$2\times \frac{\text{mean or median mtDNA coverage}}{\text{mean nucDNA coverage}}$$

By default, we used mean mtDNA coverage for the main mtCN-related analyses.
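The mtCN formula can be illustrated with a short sketch. The factor of 2 reflects that nuclear coverage is measured over a diploid genome, so the ratio is expressed in mtDNA copies per diploid nuclear genome. The function name `compute_mtcn` and the example values are assumptions for illustration.

```python
def compute_mtcn(mt_coverage: float, nuc_coverage: float) -> float:
    """mtDNA copies per diploid nuclear genome: 2 x mtDNA / nucDNA coverage."""
    if nuc_coverage <= 0:
        raise ValueError("nuclear coverage must be positive")
    return 2 * mt_coverage / nuc_coverage


# Toy example: 1,500x mean mtDNA coverage against 30x mean nucDNA coverage.
mtcn_example = compute_mtcn(1500.0, 30.0)
```

With these toy values the estimate is 100 copies, which would sit above the mtCN < 50 exclusion threshold used in sample QC.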

Post-calling mtDNA phenotype QC

To integrate our variant calls and perform sample and variant QC, we extended a previously developed pipeline21. Single-sample variant call format files (VCFs) emitted from mtSwirl were merged into a single Hail MatrixTable (v.0.2.98 (ref. 57)) upon which all downstream steps were conducted.

For sample QC, any samples showing homoplasmic variant overlap (Supplementary Note 1) were removed. We observed a significant elevation in heteroplasmic SNV calls among samples with mtCN below 50, with a stabilization of heteroplasmic calls above 50 mtDNA copies per cell (Extended Data Fig. 8c), highly suggestive of elevated NUMT contamination in the low copy number samples. Thus, to avoid contamination of our results, all samples with mtCN < 50 were removed. Finally, all samples with evidence of contamination more than 2% were removed, as estimated by (1) mtDNA contamination using Haplocheck 0124 (ref. 58) in mtSwirl, (2) nucDNA contamination or (3) the presence of multiple haplogroup-defining variants at abnormally low allele fraction. Given the small count of samples processed in 2006 and abnormally elevated mtCN estimates in these samples (Extended Data Fig. 3e), we excluded these samples from all UKB analyses.

For variant QC, (1) variants with a very low heteroplasmy (less than 0.01) were called as reference with a heteroplasmy of 0, (2) variants with heteroplasmy below 0.05 were flagged and removed as these are at high risk of being enriched for NUMT-derived signals and (3) all variant calls flagged by Mutect2 were removed (Supplementary Note 1). For all sites, a minimum coverage threshold of 100 was used to distinguish between homoplasmic reference calls and sites without variant calls due to low variant calling confidence as done previously21. mtDNA variants were annotated using the Variant Effect Predictor v.101 (ref. 59) and dbSNP v.151 (ref. 60). Variants with at least 0.1% of samples passing filters and showing a heteroplasmy between 0 and 0.5 were annotated as ‘common low heteroplasmy’. Variant calls failing QC were coded with a missing heteroplasmy.
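The variant QC rules in this paragraph can be summarized as a small decision function. This is a sketch under stated assumptions rather than the pipeline's actual code: the function name, the use of `None` for a missing call and the order in which the rules are applied (the 0.01 rule before the flagging rules) are all assumptions.

```python
from typing import Optional


def qc_heteroplasmy(het: Optional[float], coverage: int,
                    mutect2_flagged: bool = False,
                    min_coverage: int = 100) -> Optional[float]:
    """Return the post-QC heteroplasmy for one sample at one variant.

    `het` is None when no variant was called at the site.
    A return value of None means the call is coded as missing.
    """
    if het is None:
        # No call: confident homoplasmic reference only with enough coverage.
        return 0.0 if coverage >= min_coverage else None
    if het < 0.01:
        return 0.0   # rule (1): very low heteroplasmy is called as reference
    if het < 0.05:
        return None  # rule (2): at high risk of NUMT-derived signal
    if mutect2_flagged:
        return None  # rule (3): flagged by the variant caller
    return het
```

For example, a call at 3% heteroplasmy is set to missing, whereas an uncalled site with coverage of 150 is treated as a reference call with heteroplasmy 0.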

For mtCN, we removed the samples identified during variant call-set sample QC as showing signs of contamination or abnormal overlapping homoplasmy calls, or which were processed in 2006. Because we expect mtDNA-wide coverage measures, such as mtCN, to be robust to NUMTs, we do not enforce hard cut-offs on mtCN measurements.

Construction of mtDNA heteroplasmy phenotypes

We defined our set of common heteroplasmies in UKB as ‘common low heteroplasmy’ variants (Methods) which are present as heteroplasmies in at least 500 individuals, resulting in 39 variants. We produced two main sets of phenotypes: (1) a ‘case-only’ dataset consisting of heteroplasmy values for these variants in which any individuals without the variant detected were coded as missing and (2) a ‘case–control’ dataset in which cases consisted of those with any detectable heteroplasmy and controls consisted of those with the variant not detected. In both phenotype schemes, samples identified as homoplasmic for each variant were always coded as missing. For the case–control dataset, only samples that could be accurately inferred as a reference for each variant were labelled as controls—specifically, the sample was coded as missing for a variant if it had a coverage less than 100 at the site or showed the variant call as QC-fail (Methods).
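The two phenotype encodings can be sketched as follows, starting from a post-QC heteroplasmy value in which `None` marks a missing or QC-fail call and 0 marks a confident reference call. The function names are hypothetical, and the 0.95 cut-off for homoplasmy is an assumption borrowed from the definition used in the PCA section.

```python
from typing import Optional

HOMOPLASMY_THRESHOLD = 0.95  # assumed cut-off for calling a variant homoplasmic


def case_only(het: Optional[float]) -> Optional[float]:
    """Heteroplasmy value for carriers; missing for everyone else."""
    if het is None or het == 0.0 or het >= HOMOPLASMY_THRESHOLD:
        return None
    return het


def case_control(het: Optional[float]) -> Optional[int]:
    """1 for detectable heteroplasmy, 0 for confident reference, else missing."""
    if het is None or het >= HOMOPLASMY_THRESHOLD:
        return None
    return 1 if het > 0.0 else 0
```

Under this encoding a sample at 30% heteroplasmy contributes the value 0.3 to the case-only phenotype and a 1 to the case–control phenotype, while a homoplasmic sample is missing in both.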

For sensitivity analyses, we produced several further case-only heteroplasmy datasets: (1) in which any variant calls supported by an alternative allele depth (AD alt) of less than the mean nucDNA coverage of the sample were made missing; (2) in which heteroplasmy estimates were corrected for the depth of mtDNA coverage at the variant site after re-alignment; and (3) in which length heteroplasmy estimates at chrM:302 were corrected for median coverage at CSBII. All corrections were performed by obtaining residuals from the linear regression of the heteroplasmy onto the covariate for each variant across all samples before genetic analysis.

mtDNA phenotype covariate adjustment approach

We investigated time of day of blood draw, fasting time, assessment date and assessment centre as technical covariates for mtDNA traits. As draw time and assessment date are continuous, we used natural splines in the correction model to flexibly model nonlinear relationships between these covariates and the mtDNA phenotype. For assessment date, we used knots placed roughly seasonally to model seasonal variation in mtDNA phenotypes—these corresponded to 3 month increments starting on 1 July 2007 and ending on 1 July 2010. For draw time, we used a natural spline basis with 5 degrees of freedom. Assessment month and assessment centre were modelled as indicator variables. Fasting times were provided in increments of 1 h and thus were modelled as indicator variables; fasting times of more than 18 h were labelled as 18 and fasting times of 0 were labelled as 1. All terms were included in a joint model for correction.
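Two of the covariate encodings described here are simple enough to sketch directly: the seasonal knot positions for assessment date and the recoding of fasting time. The function names are hypothetical, and whether the first and last dates act as interior or boundary knots in the spline is not specified in the text, so the sketch only enumerates the dates.

```python
from datetime import date
from typing import List


def seasonal_knots(start: date = date(2007, 7, 1), end: date = date(2010, 7, 1),
                   step_months: int = 3) -> List[date]:
    """First-of-month dates from `start` to `end` in `step_months` increments."""
    knots = []
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        knots.append(date(year, month, 1))
        month += step_months
        year += (month - 1) // 12     # roll over into the next year if needed
        month = (month - 1) % 12 + 1
    return knots


def code_fasting_time(hours: int) -> int:
    """Fasting time bin: 0 h is labelled as 1, and more than 18 h as 18."""
    return min(max(hours, 1), 18)


knots = seasonal_knots()
```

With the default arguments this yields 13 dates, one per quarter over the three-year window.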

We also investigated the relationship between mtDNA phenotypes and blood cell type percentages and mean blood cell volumes. We selected all non-redundant traits available: white blood cell leucocyte count, haematocrit percentage, platelet crit, monocyte percentage, neutrophil percentage, eosinophil percentage, basophil percentage, reticulocyte percentage, high light scatter reticulocyte percentage, immature reticulocyte fraction, mean corpuscular volume, mean reticulocyte volume, mean sphered cell volume, mean platelet thrombocyte volume. We did not include nucleated red blood cell percentage as only approximately 1% of the entire UKB cohort has non-zero values for this measure, and we excluded lymphocyte percentage given collinearity with neutrophil percentage (r = 0.92) and the sum-to-1 property of the white blood cell differential measurements. To avoid excess leverage from outlying blood cell measurements, we removed any blood measurements with a Z-score > 4. All terms were included in a joint model for correction.
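The outlier filter on blood cell measurements can be sketched as below. The text states a Z-score > 4 cut-off; applying it to the absolute Z-score, computing the Z-score with the sample standard deviation and setting filtered values to `None` are assumptions of this sketch.

```python
import statistics
from typing import List, Optional


def mask_outliers(values: List[float], z_max: float = 4.0) -> List[Optional[float]]:
    """Replace measurements whose absolute Z-score exceeds `z_max` with None."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return list(values)  # no spread, so nothing can be an outlier
    return [v if abs((v - mu) / sd) <= z_max else None for v in values]


# Toy trait: 30 typical measurements and one extreme value.
measurements = [10.0] * 30 + [100.0]
filtered = mask_outliers(measurements)
```

In the toy example only the extreme value is masked; the 30 typical measurements are returned unchanged.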

For both the technical covariate and blood cell type models, F-test P values were obtained for each of the 40 mtDNA phenotypes (39 case-only heteroplasmies and mtCN). For any phenotypes that showed F-test P < 0.05/40 (Bonferroni corrected), we produced corrected versions of the phenotype by obtaining the residuals from the regression of the mtDNA phenotype onto covariates of interest before genetic analysis. For mtCN, adjustments were performed with log(mtCN) as the response variable. For heteroplasmy estimates, adjustments were performed with case-only heteroplasmies as the response variable. The specific corrections implemented were (where ‘ns’ refers to the natural spline function):

$$\begin{array}{l}\log(\text{mtCN})\sim \text{ns}(\text{blood draw time},5)+\text{assessment centre}\\ \quad+\,\text{fasting time}+\text{ns}(\text{assessment date},\,\text{SEASONAL KNOTS})\\ \quad+\,\text{month of assessment}+\text{blood cell variables}\end{array}$$

As sensitivity analyses for case-only heteroplasmy phenotypes, residuals from the following models were produced:

$$\begin{array}{l}\text{chrM:567:A,ACCCCCC}\sim \text{ns}(\text{blood draw time},5)+\text{assessment centre}\\ \quad+\,\text{fasting time}+\text{ns}(\text{assessment date},\,\text{SEASONAL KNOTS})\\ \quad+\,\text{month of assessment}\end{array}$$
$$(\text{chrM:16093:T,C}\,;\,\text{chrM:16182:A,ACC}\,;\,\text{chrM:16183:A,AC})\sim \text{blood cell variables}$$

For each response variable, residuals were generated using \({\rm{residuals}}\,({\rm{lm}}({\rm{model}}))\) as implemented in R v.4.2.1. In all visualizations of covariate-adjusted variables (for example, mtCNadj), we rescaled the residualized variable by adding the pre-adjustment mean. In the case of mtCNadj, we rescaled and exponentiated the residualized variable to return adjusted values back to an absolute scale. See Supplementary Notes 2 and 3 for more details.
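The residualize, rescale and exponentiate steps for mtCNadj can be illustrated with a deliberately simplified sketch. The analysis in the text uses R and a joint model with many covariates; the version below substitutes a single covariate and a closed-form simple linear regression so that it runs with the standard library alone. All names are hypothetical.

```python
import math
import statistics
from typing import List


def ols_residuals(y: List[float], x: List[float]) -> List[float]:
    """Residuals from regressing y on one covariate x with an intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def adjusted_mtcn(mtcn: List[float], covariate: List[float]) -> List[float]:
    """Residualize log(mtCN), add back the pre-adjustment mean of log(mtCN),
    then exponentiate to return to an absolute copy-number scale."""
    log_cn = [math.log(v) for v in mtcn]
    mean_log = statistics.mean(log_cn)
    return [math.exp(r + mean_log) for r in ols_residuals(log_cn, covariate)]


# Toy data in which log(mtCN) is exactly linear in the covariate.
x = [0.0, 1.0, 2.0, 3.0]
raw = [math.exp(4.0 + 0.1 * xi) for xi in x]
adj = adjusted_mtcn(raw, x)
```

Because the toy covariate explains all variation in log(mtCN), every adjusted value collapses to exp(4.15), the exponentiated mean of the log values, which shows how the rescaling keeps adjusted values on an interpretable scale.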

mtDNA principal component analysis and predictive power for mtDNA haplogroups

To construct a high-quality variant genotype matrix for principal component analysis (PCA), we obtained the set of homoplasmic variants (heteroplasmy ≥ 0.95) passing QC identified at a MAF ≥ 0.001 in UKB. Any samples with a QC-pass homoplasmy detected were coded as 1 for each respective variant; all others were coded as 0. This binary genotype matrix was subsequently filtered to the set of unrelated samples upon which we computed the first 50 principal components after centring and scaling using the efficient truncated singular value decomposition algorithm implemented in the irlba v.2.3.5.1 package in R. Related samples were projected onto these principal components (PCs) to produce a set of mtDNA PC coordinates for each sample. The set of related samples were defined previously in the Pan UK Biobank (Pan UKBB) project61. In brief, PC-relate was used as implemented in Hail within each assigned genetic ancestry group in UKB and the maximal set of unrelated samples were identified using the maximal independent set algorithm implemented in Hail.

To assess the goodness of fit of mtDNA PCs for the prediction of top-level mtDNA haplogroups, we fit a multinomial model with top-level haplogroup as the response variable and the first 30 mtDNA PCs as explanatory variables as implemented in the nnet v.7.3-17 package in R (ref. 62). We included only samples belonging to haplogroups with at least 30 samples in UKB. For assessment of the predictive power of mtDNA PCs for ‘level 2’ haplogroups, we fit multinomial models using a similar approach for each top-level haplogroup, with ‘level 2’ haplogroups as the response variable. In all cases, a null model was fit in parallel with the same response variable with only an intercept term. We computed McFadden’s pseudo R² for each model with the following formula:

$${\rm{Pseudo}}\,{R}^{2}=1-\frac{{\rm{log\; likelihood}}}{{\rm{null\; model\; log\; likelihood}}}$$
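As a worked example of the formula above, the sketch below computes McFadden’s pseudo R² from two log likelihoods. The function name and the numerical values are invented for illustration.

```python
def mcfadden_pseudo_r2(log_lik: float, null_log_lik: float) -> float:
    """McFadden's pseudo R^2 = 1 - (model log likelihood / null log likelihood)."""
    return 1.0 - log_lik / null_log_lik


# Toy values: the fitted model raises the log likelihood from -100 to -80.
r2 = mcfadden_pseudo_r2(-80.0, -100.0)
```

A model that fits no better than the intercept-only null gives a value of 0, and values approach 1 as the fitted log likelihood approaches 0.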

Correlations between mtCN, mtCNadj, blood cell composition, heteroplasmies and disease phenotypes

We obtained 29 common disease diagnoses from UKB from a previously curated set of phecodes and International Classification of Disease–10 (ICD10) codes corresponding to major common diseases61 along with demographic variables (age, sex) and blood cell composition phenotypes (Methods). We obtained mtCNraw, mtCNadj, common (N > 500) case-only heteroplasmies (Methods) and three major blood cell composition traits (platelet crit, monocyte count and neutrophil count), and performed Z-score transformation for each. To test for associations with disease phenotypes, we implemented a logistic regression model using the glm function in R, including age, sex, age², age² × sex, age × sex, top-level haplogroup and genetic ancestry group assignment as covariates:

$$\text{Disease phenotype} \approx \text{trait} + \text{age} + \text{sex} + \text{age}^{2} + \text{age}^{2} \times \text{sex} + \text{age} \times \text{sex} + \text{population} + \text{top-level haplogroup}$$

We included haplogroups with at least 30 individuals represented in UKB. Haplogroup was included in the model only when the trait was mtDNA-derived (for example, it was not included for blood composition phenotypes). Odds ratios were obtained as \(\exp ({\beta }_{{\rm{trait}}})\), and the 95% CI was obtained as \(\exp ({\beta }_{{\rm{trait}}}\pm 1.96\times {\rm{s.}}{{\rm{e.}}}_{{\rm{trait}}})\).
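
The conversion from a fitted coefficient to an odds ratio and its 95% CI can be sketched as follows. This is an illustrative helper, not the code used in the study; the function name and the input values are hypothetical.

```python
import math


def odds_ratio_with_ci(beta: float, se: float, z: float = 1.96):
    """Return (odds ratio, lower bound, upper bound) from a logistic
    regression coefficient and its standard error."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical coefficient for a Z-scored trait, for illustration only.
example_or, example_lower, example_upper = odds_ratio_with_ci(beta=0.0, se=0.1)
```

Because the traits were Z-score transformed, an odds ratio computed this way would be interpreted per standard deviation of the trait.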

Derivation of mtDNA coverage discrepancy phenotypes

We obtained mtDNA intervals corresponding to the 7S DNA, heavy strand origin, CSBII, CSBIII and the LSP45,63,64. We computed per-individual median mtDNA coverages in the regions corresponding to the first third of the 7S DNA (termed ‘7S DNA’), the region between CSBII and the heavy strand origin (‘7S DNA flap’), and the region between CSBIII and the LSP (‘7S RNA primer’). To generate coverage discrepancy phenotypes, we regressed 7S DNA flap coverage onto either 7S DNA coverage or 7S RNA primer coverage. To avoid coverage discrepancies attributable to inherited mtDNA variation in the regions of interest, we included indicator variables for all top-level haplogroups with at least 30 samples as well as their interactions with 7S DNA or 7S RNA primer coverage. We also included terms corresponding to the same blood cell composition and technical variables used for adjustment of mtCN (Methods and Supplementary Note 2) to reduce the degree of variation attributable to these factors. The residuals from the following model were used as the coverage discrepancy phenotype for GWAS:

$$\begin{array}{l}\text{7S DNA flap coverage} \approx (\text{7S RNA primer or 7S DNA coverage})\\ \quad + \text{haplogroup} + (\text{7S RNA primer or 7S DNA coverage}) \times \text{haplogroup}\\ \quad + \text{blood cell composition} + \text{technical variables}\end{array}$$

Relatedness analyses in UKB

Relatedness was computed and sibling–sibling and parent–offspring pairs were inferred as previously described in UKB65. For the assessment of transmission of all QC-pass mtDNA variants, we restricted to only variants found in five or more samples.

Determination of chrM:302 length heteroplasmy composition

To construct length heteroplasmy compositional profiles, we obtained all pre- and post-QC variant calls made at position chrM:302. We generated a ‘QC-fail’ heteroplasmy estimate at position 302 for each individual by summing pre-QC heteroplasmies that failed post-calling QC; all other alleles included in the composition passed QC (Methods). We defined a ‘reference’ call at chrM:302 for each sample as \(1-{\rm{s}}{\rm{u}}{\rm{m}}(\text{heteroplasmy of any allele at chrM:302})\), in which the sum included all QC-pass alleles as well as the ‘QC-fail’ estimate. All samples without variant calls at chrM:302 were assigned a reference fraction of 1, and samples with a depth of less than 100 at chrM:302 (after local re-alignment during variant calling) were excluded. For each sample, we combined all heteroplasmies from calls other than reference, chrM:302:A,AC, chrM:302:A,ACC and chrM:302:A,ACCC into an ‘Other’ category. The ‘QC-fail’ fraction was included in the ‘Other’ category. Any calls with a missing value for a chrM:302 allele (that is, because the allele was removed due to filtering) were imputed as a heteroplasmy of 0 for the purposes of visualizations and analyses. As a final step, any calls with a heteroplasmy fraction less than 0.05 were labelled ‘Other’, as we use this heteroplasmy cut-off throughout our study to avoid contamination from potential NUMT-derived artefacts.
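
The compositional profile described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name and input layout are hypothetical, and the final step is interpreted as folding any named allele with a heteroplasmy below 0.05 into the ‘Other’ category.

```python
NAMED_ALLELES = ("chrM:302:A,AC", "chrM:302:A,ACC", "chrM:302:A,ACCC")
MIN_DEPTH = 100   # samples below this depth at chrM:302 are excluded
MIN_HET = 0.05    # heteroplasmy cut-off used throughout the study


def composition_302(qc_pass, qc_fail=0.0, depth=1000):
    """Build a chrM:302 composition for one sample.

    qc_pass: dict mapping allele name -> QC-pass heteroplasmy.
    qc_fail: summed heteroplasmy of calls that failed post-calling QC.
    Returns None when the sample is excluded for low depth.
    """
    if depth < MIN_DEPTH:
        return None
    # Reference fraction is 1 minus all QC-pass alleles and the QC-fail estimate.
    reference = 1.0 - (sum(qc_pass.values()) + qc_fail)
    # The QC-fail fraction is part of 'Other'; missing named alleles are 0.
    comp = {"reference": reference, "Other": qc_fail}
    for allele in NAMED_ALLELES:
        comp[allele] = 0.0
    for allele, het in qc_pass.items():
        if allele in NAMED_ALLELES and het >= MIN_HET:
            comp[allele] = het
        else:
            # Assumption: sub-threshold and unnamed alleles go to 'Other'.
            comp["Other"] += het
    return comp


# Hypothetical sample, for illustration only.
example_comp = composition_302(
    {"chrM:302:A,AC": 0.60, "chrM:302:A,ACC": 0.25, "chrM:302:A,ACCCC": 0.03},
    qc_fail=0.02,
)
```

A sample with no variant calls at chrM:302 corresponds to an empty `qc_pass` dictionary, which yields a reference fraction of 1.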

Associations between pathogenic variant carrier status and continuous phenotypes in UKB

We obtained continuous phenotypes available in UKB corresponding to classic symptoms of MELAS, namely diabetes-like symptoms (elevated triglycerides (ID 30870), elevated haemoglobin A1c (ID 30750)) and hearing impairment (by means of the speech-reception threshold assessment (IDs 20019 and 20021)), as well as the results from the visual acuity test for analysis of known pathogenic variants for Leber’s hereditary optic neuropathy (logMAR from visual acuity test (IDs 5201 and 5208)). All obtained phenotypes were filtered to samples with available mtDNA variant calls and corrections were applied for age, sex, age², age² × sex, age × sex and genetic ancestry group assignment by obtaining residuals from the following linear regression model using \({\rm{residuals}}\,({\rm{lm}}({\rm{model}}))\) in R:

$$\text{Measurement} \approx \text{age} + \text{sex} + \text{age}^{2} + \text{age}^{2} \times \text{sex} + \text{age} \times \text{sex} + \text{population}$$

As blood biomarkers tend to have log-normal distributions, corrections were applied after log transformation of haemoglobin A1c and triglyceride levels. Post-adjustment, all measurements were returned to their original scale by adding the pre-adjustment dataset-wide means for each measurement modality. Final estimates for the speech-reception threshold and vision logMAR were generated by averaging measurements for the left and right ear and eye, respectively.

Carriers of known pathogenic mtDNA variants were defined as individuals carrying the variant post-QC at any fraction. We defined a set of controls as individuals with none of the ten known pathogenic mtDNA variants tested. Only samples that could be accurately inferred as reference for all ten variants were labelled as controls—the sample was excluded if, for any of the ten variants, it had a coverage of below 100 at the site or showed a QC-fail variant call (Methods).

Comparisons between residual phenotype values among variant carriers versus global controls were performed only for variant–phenotype pairs with more than ten defined phenotype values among variant carriers. P values were obtained by performing a two-sample t-test between phenotype values among variant carriers and the set of global controls, and Q values were obtained by applying the Benjamini–Hochberg procedure.
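
For reference, the Benjamini–Hochberg adjustment used to obtain Q values can be sketched in plain Python as below. The function name and the example P values are hypothetical; this is a generic implementation of the procedure rather than the code used in the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical P values, for illustration only.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```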

Creation of mutational spectrum categories

Heteroplasmic SNV mutation types in AoU were constructed using the set of QC-pass heteroplasmic SNVs. For each SNV type, the set of individuals without any heteroplasmic variants was identified as those with no QC-pass variant call of that type; these individuals were included as zeros in estimates of the mean SNV count of each type.

chrM:302 length heteroplasmy inference in single cells

Mitochondrial single-cell assay for transposase-accessible chromatin with sequencing (mtscATAC-seq) data18 were obtained and analysed with Massachusetts General Hospital Institutional Review Board (IRB) approval under protocol no. 2016P001517. We used the BedTools66 intersect tool (v.2.29.2) to identify read alignments completely spanning the chrM:300–318 locus in the mtscATAC-seq data. We then iterated over these reads and classified their chrM:302 length variant by extracting the poly-C/G tracts using a regular expression, ‘AA(CCC+[CT]CC+)GC’, anchored on the two constant base pairs on either side of the variant region to detect the canonical variant structure of two poly-C/G tracts with or without a single intervening A/T. Alleles in matching reads were classified based on the length of their poly-C/G tracts, whereas alleles in the reads that did not match the regular expression were classified as missing. Next, we filtered out any reads with cell barcodes that were not in the published list of cell calls, and further restricted our analysis to only the cells with at least 20 reads at the chrM:300–318 locus. For each of these high-coverage cells, we calculated the fraction of reads showing each of the top three most common length variants (G6AG8, G6AG9 and G6AG10) and aggregated any other detected alleles into the remainder (‘Other’) for display as a stacked bar plot. We also estimated bulk heteroplasmy by summing the allele counts from the high-coverage cells and re-calculating the fractions for the top three length variants, again with all other alleles being aggregated into the remainder ‘Other’ category.
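
The read classification step can be illustrated with a short Python sketch built around the regular expression quoted above. The function names, the allele label format and the example reads are hypothetical. Labels are written on the strand matched by the expression (C/T-based), which is an assumption made for this illustration and differs from the G/A-based allele names used in the text.

```python
import re
from collections import Counter

# Regular expression quoted in the text, anchored on the flanking bases.
LENGTH_VARIANT = re.compile(r"AA(CCC+[CT]CC+)GC")


def classify_read(sequence):
    """Return a label such as 'C7TC6' or 'C9', or None when the read
    does not match the expected structure (treated as missing)."""
    match = LENGTH_VARIANT.search(sequence)
    if match is None:
        return None
    tract = match.group(1)
    if "T" in tract:
        first, second = tract.split("T")
        return f"C{len(first)}TC{len(second)}"
    return f"C{len(tract)}"


def allele_fractions(reads):
    """Fraction of classified reads supporting each length variant."""
    counts = Counter(
        label for label in map(classify_read, reads) if label is not None
    )
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()} if total else {}


# Hypothetical reads, for illustration only.
example_reads = (
    ["TTAA" + "C" * 7 + "T" + "C" * 6 + "GCAT"] * 3
    + ["TTAA" + "C" * 9 + "GCAT"]
    + ["TTAACCGCAT"]  # tract too short to match, treated as missing
)
example_fractions = allele_fractions(example_reads)
```

In a per-cell analysis, `allele_fractions` would be applied to the reads of each cell barcode that passes the minimum read count.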

UKB GWAS approach

All GWASs were performed in UKB using approaches as performed in the Pan UKBB initiative61. In brief, ancestry assignment was performed by first projecting UKB samples into genotype PC-space constructed from reference samples from 1000G phase 3 and the Human Genome Diversity Project (HGDP), and subsequently using a random forest classifier to assign continental labels trained on the 1000G + HGDP reference data. In each ancestry group, PCA was performed among unrelated samples with related samples projected onto this PC-space. Further sample QC was performed as described as part of the Pan UKBB initiative61, including removal of ancestry outliers using a centroid-based metric, and filtering of individuals with high genotype missingness, sex discordance and sex chromosome aneuploidies. Variant QC was also performed on UKB-provided imputed v3 variants (GRCh37) as part of the Pan UKBB initiative61, including only those with INFO scores greater than 0.8 on autosomes and the X-chromosome. Association tests were performed only on variants with a minor allele count (MAC) > 20. We have constructed and released a mapping from our QC-pass UKB GRCh37 variants to GRCh38 coordinates, built using the bcftools +liftover tool (https://github.com/freeseek/score) with default parameters.

For GWAS, SAIGE v.1.1.5 (ref. 67) was used to perform association tests for each assigned ancestry group using the first ten per-population PCs, age, age × sex, age² and age² × sex as covariates (referred to as ‘baseline’). Ancestry groups were included only if at least 50 individuals had the phenotype defined. The use of the SAIGE GRM-based approach allowed for the inclusion of related samples in the GWAS, and we enabled leave-one-chromosome-out fitting in all steps. For all continuous phenotype GWASs (case-only mtDNA heteroplasmy traits and mtCN), phenotypes were inverse rank normalized before genetic analysis.
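
As an illustration of inverse rank normalization, the sketch below maps values to normal quantiles using only the Python standard library. The rank offset of 0.5 and the averaging of tied ranks are assumptions made for this example; the text does not state which variant of the transformation was used.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, offset=0.5):
    """Map values to standard normal quantiles based on their ranks.

    Ties receive the average of their ranks. The offset is an
    assumption; other conventions (for example 3/8) are also common.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


# Hypothetical phenotype values, for illustration only.
example_z = inverse_rank_normalize([10.0, 2.0, 5.0])
```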

For all main mtDNA heteroplasmy analyses, top-level mtDNA haplogroup was included as an extra set of covariates in the GWAS model as a set of 24 indicator variables with haplogroup A as reference. Any samples belonging to top-level haplogroups with fewer than 30 samples represented were excluded. The same GWAS model was used for sensitivity analysis of case-only heteroplasmies after removing calls with AD alt less than mean nucDNA coverage, after correction for local variant coverage, after correction for CSBII coverage, and after correction for technical or blood trait covariates (Methods). For the main mtCN analyses, we used only the baseline covariates to perform genetic associations with mtCNraw and mtCNadj.

We performed two extra sensitivity analyses for case-only heteroplasmy GWASs: (1) inclusion of 30 mtDNA PCs as covariates in the GWAS model instead of top-level haplogroup for seven variants which showed relatively high heterogeneity across level two haplogroups, and (2) inclusion of mtCNadj as a covariate in the GWAS model for all case-only heteroplasmies in addition to top-level haplogroup. We also tested the effects of including top-level haplogroup indicator variables as extra covariates in GWASs for mtCNraw and mtCNadj.

AoU GWAS approach

We performed a GWAS in AoU as a replication for our main case-only heteroplasmy analyses in UKB. Ancestry inference was performed upstream by the AoU DRC. In brief, AoU samples were projected into the PCA space of genotypes from chromosomes 20 and 21 from HGDP and 1000G, and a random forest classifier trained to identify ancestry labels in 1000G + HGDP was used to assign continental ancestry labels to AoU samples.

We performed sample and variant QC after WGS variant calls (GRCh38) were imported into Hail. Multi-allelic sites were split and sites with very low precomputed allele frequency were removed (MAF > 0.0001 retained). For sample QC, samples flagged by the DRC as population outliers for several metrics or identified as related by the DRC were excluded. For variant QC, we removed any variants filtered by the DRC; in brief, these were filtered because of an absence of high-quality genotypes for the variant (defined as GQ ≥ 20, DP ≥ 10, AB ≥ 0.2 for heterozygotes), excess heterozygotes or a low quality score for the variant. We further removed any variants not in Hardy–Weinberg equilibrium (one-sided P ≤ 1 × 10⁻¹⁰) and variants with a call rate ≤ 0.95. Finally, we removed any variants with MAC < 20 in each assigned ancestry group.

We next extracted covariates relevant for our GWAS model. We used an SQL query to obtain date of birth in the controlled data repository and used the provided QC flat files to obtain sex assigned at birth. As date of sample collection was not provided, approximate age was constructed for all analyses by subtracting the year of birth from the year 2021. To address residual stratification in assigned ancestry groups, we produced PCs in each ancestry group using an approach very similar to that used in UKB (Methods), as we found that the provided PCs did not appropriately handle stratification among positive control phenotypes such as height, blood glucose, diastolic blood pressure and systolic blood pressure (Supplementary Note 4). We included 20 recomputed PCs, in addition to approximate age, age², age × sex and age² × sex as covariates in the final GWAS model. We did not perform genetic association analysis for the MID (Middle Eastern) group as fewer than 400 samples with available WGS data were assigned MID.

We used Hail with the \({\rm{hl.}}{\rm{linear\_regression\_rows}}()\) method to perform GWAS after all QC. As described in the Methods, we performed genetic analysis for all QC-pass case-only mtDNA heteroplasmies with homoplasmic calls set to missing. As this analysis is intended for replication, we included any mtDNA variants found in 300 or more samples across any ancestry group, resulting in 41 variants for genetic analysis. Of these, 36 were also analysed in UKB; 3 UKB variants were not sufficiently common in AoU for genetic analysis. As in UKB, for the analysis of case-only mtDNA heteroplasmies, top-level mtDNA haplogroup was included as covariates in the GWAS model as a set of 27 indicator variables in addition to age, sex and PC covariates. Samples belonging to top-level haplogroups with fewer than 30 samples in AoU were excluded. All case-only mtDNA heteroplasmy phenotypes were inverse rank normalized before analysis.

See the AoU genotype quality report for more information on upstream genotype data and sample QC, ancestry inference and relatedness inference (https://support.researchallofus.org/hc/en-us/article_attachments/7237425684244/All_Of_Us_Q2_2022_Release_Genomic_Quality_Report.pdf).

UKB rare variant analysis approach

Gene-based and single-variant testing of rare variants was performed using SAIGE-GENE+ (ref. 68) as implemented in SAIGE v.1.1.5. Given the analysis of low-frequency variants and the small sizes of the other populations, we focused on the EUR (European) genetic ancestry group for this analysis. Covariates and phenotypes were identical to those used for the common variant GWASs in all cases (Methods). Genetic data were obtained from the UKB OQFE 450k exomes release. We enabled leave-one-chromosome-out fitting in all steps, with default parameters used for estimation of categorical variance ratios. SKAT-O69 was used for set-based testing, with burden and SKAT70 P values reported for each test. Gene- and variant-consequence annotations were used as constructed elsewhere68. For each gene, synonymous, missense, LoF, missense + LoF and synonymous + missense + pLoF variants with maximum MAF 1 × 10⁻⁴, 1 × 10⁻³ and 1 × 10⁻² were included in combinatorial sets (12 variant sets per gene) with aggregate P values combined per gene using the Cauchy combination test71. Rare variant associations were first assessed using P values from the Cauchy test, which combines information across all evaluated categories, with subsequent evaluation of associated variant groups (for example, missense versus synonymous, MAF cut-offs) performed only for results at GWS from the Cauchy test. Thus, for a given phenotype, we defined our GWS threshold based on the primary assessment of the singular Cauchy test (that is, \(\frac{0.05}{\approx \,18000\,\text{genes}}\)).
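
The Cauchy combination of per-set P values and the gene-level threshold can be sketched as follows. This is a generic illustration of the test with equal weights, which is an assumption here; SAIGE-GENE+ performs this step internally, and the function name, constant name and example P values are hypothetical.

```python
import math

# Bonferroni-style threshold for roughly 18,000 genes, as described above.
GENE_LEVEL_THRESHOLD = 0.05 / 18000


def cauchy_combination(pvalues, weights=None):
    """Combine P values with the Cauchy combination test.

    Weights default to equal weights summing to 1 (an assumption).
    """
    if weights is None:
        weights = [1.0 / len(pvalues)] * len(pvalues)
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    )
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical per-set P values for one gene, for illustration only.
example_gene_p = cauchy_combination([1e-8, 0.5, 0.9])
example_is_gws = example_gene_p < GENE_LEVEL_THRESHOLD
```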

Heritability estimation and enrichment analyses for mtCN

S-LDSC25 was used for heritability estimation and enrichment analyses for mtCN in UKB as performed previously24. In brief, we analysed EUR summary statistics in UKB, restricting variants to those in HapMap3 (HM3). We estimated overall SNP-heritability, controlling for 97 annotations corresponding to coding regions, enhancer regions, MAF bins and others72 (referred to as baselineLD v.2.2). For enrichment analyses, we obtained gene-sets corresponding to (1) the top 10% of genes specifically expressed in major tissues from GTEx26 and (2) genes producing protein products that localize to each major organelle with high confidence using COMPARTMENTS73. Variants were mapped to each gene with a 100 kb symmetric window and LD scores for each gene-set annotation were computed using the 1000G EUR reference panel (https://alkesgroup.broadinstitute.org/LDSCORE/). Heritability enrichment for all gene-sets was tested using S-LDSC atop the baseline v.1.1 model, controlling for 53 annotations including coding regions and 5′ and 3′ UTRs25.

Cross-ancestry meta-analysis in UKB and AoU

We conducted a fixed-effect meta-analysis across ancestries in each cohort (UKB and AoU) based on inverse-variance weighted betas and standard errors74. For each ancestry, we excluded low-confidence variants defined as MAC ≤ 20 in either biobank. We computed effect size heterogeneity P values across ancestries using Cochran’s Q-test75. All computation was done using Hail v.0.2.
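
A minimal sketch of an inverse-variance weighted fixed-effect meta-analysis with Cochran’s Q statistic is shown below. It is a plain Python illustration rather than the Hail implementation used in the study; the function name and the example inputs are hypothetical. The heterogeneity P value would be obtained by comparing Q with a chi-squared distribution on the returned degrees of freedom, which is omitted here to keep the example dependency-free.

```python
import math
from statistics import NormalDist


def ivw_meta(betas, ses):
    """Fixed-effect inverse-variance weighted meta-analysis."""
    weights = [1.0 / se ** 2 for se in ses]
    weight_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    se = math.sqrt(1.0 / weight_sum)
    z = beta / se
    p = 2 * NormalDist().cdf(-abs(z))  # two-sided P value
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    return {"beta": beta, "se": se, "p": p, "q": q, "q_df": len(betas) - 1}


# Hypothetical per-ancestry estimates for one variant, for illustration only.
example_meta = ivw_meta(betas=[0.1, 0.3], ses=[0.1, 0.1])
```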

All visualizations of main GWASs (for example, mtCN, coverage discrepancy traits, heteroplasmy traits) are of cross-ancestry meta-analyses after restriction to the set of ‘high-quality’ variants as defined previously61.

Identification of LD-independent lead SNPs and locus definitions

Clumping was performed using Plink v.1.90 (ref. 76) in Hail Batch for GWAS results obtained in UKB after filtering to high-quality variants. We used significance thresholds of 1 for both the index and clumped SNPs, set the LD threshold for clumping at 0.1 and set the distance threshold at 500 kb. We used single-ancestry and multi-ancestry LD reference panels corresponding to the ancestry groups included in the final multi-ancestry meta-analyses for each mtDNA phenotype as well as for blood cell traits. Reference panels were constructed by randomly sampling 5,000 individuals from all samples in any given set of ancestry groups in the UKB. For single-ancestry LD panels corresponding to ancestry groups with fewer than 5,000 individuals assigned (EAS (East Asian) and MID), the full sample available for each ancestry group was used. More details on the LD reference panels can be found as part of the Pan UKBB project61. Clumping output files from Plink were converted to Hail Tables and then combined into MatrixTables using the multi-way-zip-join method as implemented in Hail.

We defined distinct loci conservatively by starting with LD-independent lead SNPs at GWS and merging any SNPs within 2 megabases (Mb) of one another.
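
One way to implement this merging is sketched below. The function name and the input layout are hypothetical, and the sketch assumes chained merging: a lead SNP joins the current locus when it lies within 2 Mb of the last SNP already in that locus.

```python
def merge_loci(lead_snps, max_gap=2_000_000):
    """Merge lead SNPs into loci.

    lead_snps: iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, start, end) tuples.
    """
    loci = []
    for chrom, pos in sorted(lead_snps):
        same_chrom = bool(loci) and loci[-1][0] == chrom
        if same_chrom and pos - loci[-1][2] <= max_gap:
            # Extend the current locus to include this SNP.
            loci[-1] = (chrom, loci[-1][1], pos)
        else:
            loci.append((chrom, pos, pos))
    return loci


# Hypothetical lead SNP positions, for illustration only.
example_loci = merge_loci(
    [("1", 1_000_000), ("1", 2_500_000), ("1", 6_000_000), ("2", 1_200_000)]
)
```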

Replication of previous mtCN GWAS with our study

We performed a comparison of significant loci identified in a previous GWAS of mtCN in UKB32 with our own by performing LD clumping on previously released summary statistics as described (Methods) using 1000G phase 3 EUR reference data for LD. We defined distinct loci as described (Methods), merging any SNPs within 2 Mb of one another, arriving at 96 loci previously identified. We defined a replicated locus with mtCNraw or mtCNadj as one in which our GWAS showed a signal at P < 5 × 10⁻⁵ or 5 × 10⁻⁸ within 2 Mb of the most significant variant identified in the previous study at each locus.

Bidirectional Mendelian randomization between UKB mtCN and selected traits

GWAS effect sizes and LD-independent loci from the UKB cross-ancestry meta-analysis for mtCNraw and mtCNadj were obtained. Summary statistics and LD-independent loci from GWAS among EUR for neutrophil count (ID 30140) and case/control disease traits that showed correlation with mtCNadj, namely osteoarthritis (categorical_20002_both_sexes_1465), angina (categorical_20002_both_sexes_1074), myocardial infarction (phecode_411.2_both_sexes), ischaemic heart disease (phecode_411_both_sexes) and high cholesterol (categorical_20002_both_sexes_1473), were obtained from the Pan UKBB project61. Loci for effect-size comparison were restricted to those passing variant QC as performed in UKB (Methods). For each mtCN phenotype, neutrophil count and disease trait, GWAS effect sizes were obtained for all variants at GWS in the mtCN GWAS, and, vice versa, mtCN, neutrophil count and disease trait GWAS effect sizes were obtained for all neutrophil count and disease trait variants at GWS. We assessed the relationship between pre- and post-adjustment mtCN GWAS effect sizes and neutrophil count/disease trait GWAS effect sizes using inverse-variance weighted linear regression with weights corresponding to \(\frac{1}{{\rm{s.}}{\rm{e.}}{({\rm{m}}{\rm{t}}{\rm{C}}{\rm{N}})}^{2}}\times \frac{1}{{\rm{s.}}{\rm{e.}}{(\text{trait of interest})}^{2}}\), in which effect size standard errors were obtained from the respective GWAS.
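
The weighting scheme can be illustrated with a small weighted least-squares sketch in plain Python. The function names and example values are hypothetical, and fitting an intercept is an assumption made for this illustration; the text does not state whether one was included.

```python
def ivw_weights(se_mtcn, se_trait):
    """Per-variant weights: 1 / s.e.(mtCN)^2 multiplied by 1 / s.e.(trait)^2."""
    return [1.0 / (a ** 2) * 1.0 / (b ** 2) for a, b in zip(se_mtcn, se_trait)]


def weighted_linear_fit(x, y, weights):
    """Weighted least-squares fit of y on x; returns (slope, intercept)."""
    weight_sum = sum(weights)
    x_mean = sum(w * xi for w, xi in zip(weights, x)) / weight_sum
    y_mean = sum(w * yi for w, yi in zip(weights, y)) / weight_sum
    sxx = sum(w * (xi - x_mean) ** 2 for w, xi in zip(weights, x))
    sxy = sum(
        w * (xi - x_mean) * (yi - y_mean) for w, xi, yi in zip(weights, x, y)
    )
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical effect sizes and standard errors, for illustration only.
example_weights = ivw_weights([0.01, 0.02, 0.01], [0.02, 0.02, 0.04])
example_slope, example_intercept = weighted_linear_fit(
    [0.1, 0.2, 0.3], [0.2, 0.4, 0.6], example_weights
)
```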

Fine-mapping in UKB

To identify putative causal variants in associated loci, we conducted statistical fine-mapping of mtDNA traits in UKB using cross-ancestry meta-analysis summary statistics. Although we previously showed that fine-mapping a meta-analysis is often miscalibrated due to heterogeneous characteristics of constituent cohorts (for example, genotyping or imputation)77, a within-cohort cross-ancestry meta-analysis such as the present study is a notable exception given no such heterogeneity systematically exists across ancestries.

We used FINEMAP-inf and SuSiE-inf, which model infinitesimal effects78, with cross-ancestry meta-analysis summary statistics (Methods) and a covariate-adjusted in-sample dosage LD matrix79. We defined fine-mapping regions based on a 3 Mb window around each lead variant and merged regions if they overlapped as described previously79. We excluded the major histocompatibility complex (MHC) region (chr 6: 25–36 Mb) from analysis due to extensive LD structure in the region. For each method, we allowed up to ten causal variants per region and derived PIPs of each variant using a uniform prior probability of causality. To achieve better calibration, we computed min(PIP) across the methods and derived up to 10 independent 95% CSs from SuSiE-inf as described elsewhere79. All reported PIPs are min(PIP) values between the two methods.

Enrichment of functional categories among fine-mapped variants

We computed functional enrichment of fine-mapped variants across the mtDNA traits in UKB. We first annotated each variant with eight functional categories (pLoF, missense, synonymous, 5′ UTR, 3′ UTR, promoter, cis-regulatory element (CRE) and non-genic) as described previously79. We then estimated functional enrichment for each category as a relative risk (that is, a ratio of proportion of variants) between being in an annotation and fine-mapped (PIP ≤ 0.01 or PIP > 0.1). That is, a relative risk = (proportion of variants with PIP > 0.1 that are in the annotation)/(proportion of variants with PIP ≤ 0.01 that are in the annotation). The 95% CIs were calculated using bootstrapping with 5,000 replicates. We note that, to increase statistical power, we combined pLoF/missense and 5′/3′ UTR into single categories, respectively, and used a more lenient threshold (PIP > 0.1 versus >0.9) compared with our previous analysis79.
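
The relative risk and its bootstrap interval can be sketched as follows. The function names and example data are hypothetical, and the use of a percentile bootstrap over variants is an assumption; the text specifies only that 5,000 bootstrap replicates were used.

```python
import random


def relative_risk(pips, annotated, high=0.1, low=0.01):
    """Proportion annotated among PIP > high divided by the proportion
    annotated among PIP <= low. Returns None when undefined."""
    high_set = [a for p, a in zip(pips, annotated) if p > high]
    low_set = [a for p, a in zip(pips, annotated) if p <= low]
    if not high_set or not low_set or sum(low_set) == 0:
        return None
    return (sum(high_set) / len(high_set)) / (sum(low_set) / len(low_set))


def bootstrap_ci(pips, annotated, n_boot=5000, seed=0):
    """Percentile 95% CI from resampling variants with replacement.
    Replicates with an undefined relative risk are skipped (assumption)."""
    rng = random.Random(seed)
    n = len(pips)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rr = relative_risk([pips[i] for i in idx], [annotated[i] for i in idx])
        if rr is not None:
            estimates.append(rr)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int(0.975 * len(estimates)))]
    return lower, upper


# Hypothetical PIPs and annotation indicators (1 = in annotation).
example_pips = [0.5, 0.2, 0.9, 0.001, 0.005, 0.002, 0.003, 0.05]
example_annotated = [1, 1, 0, 1, 0, 0, 0, 1]
example_rr = relative_risk(example_pips, example_annotated)
```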

Gene- and variant-prioritization

To nominate genes using GWAS results for each phenotype, we used the following approach to balance clarity with confidence in the gene assignment.

  1. If the locus had a CS, for each CS:

    a. Filter to variants in the CS and retain variants from the CS that are either maximal PIP or coding, or have PIP > 0.7.

    b. If the variant has PIP > 0.9 and is a coding variant for a gene, assign that gene to the CS.

    c. Otherwise, assign genes within 3 kb of the variant or, if no genes are within 3 kb, assign the nearest gene to the CS.

  2. If the locus had multiple CSs and at least one had a variant with PIP > 0.1, we retained assignments corresponding only to variants with PIP > 0.1.

  3. If the locus did not have a CS, we assigned the gene with a boundary nearest to the most significant variant in the locus.

  4. We also used RVAS to nominate additional, or support existing, gene assignments for all GWAS loci containing genes with SKAT-O Cauchy RVAS P values at GWS for the same phenotype.

If a variant is inside a gene body (but is non-coding), we considered that gene to be nearest. For case-only heteroplasmy GWASs, when the same locus was significant across multiple heteroplasmy phenotypes, we performed manual integration to arrive at a set of genes supported by the most compelling genetic evidence across variants for each locus. The SSBP1 locus was particularly complex, so we assigned SSBP1 (which harbours the max PIP variant) and provided visualization of the full locus (Extended Data Fig. 10k). We did not use fine-mapping evidence from variants with PIP > 0.1 that are not assigned to a CS. All assignments were manually reviewed. In all GWAS visualizations, we labelled the strength of evidence supporting the gene assignment (for example, if supported by moderate- or high-PIP fine-mapped variants, coding variants, RVAS gene-based test association).

Colocalization with eQTLs

We conducted colocalization of fine-mapped variants of mtDNA phenotypes and cis-eQTL associations from GTEx v.8 (ref. 43) and eQTL catalogue release 4 (ref. 80) as described previously79. Briefly, we retrieved fine-mapping results of cis-eQTL associations that were fine-mapped using SuSiE81 with covariate-adjusted in-sample dosage LD-matrices79. We then computed a PIP of colocalization for a variant as a product of PIP for GWAS and for cis-eQTL (CLPP = PIP_GWAS × PIP_cis-eQTL)82. When displaying colocalization across heteroplasmy traits, we indicate colocalization if we see a colocalization PIP > 0.1 for the assigned gene and any variant in the CS for any tissue and for any heteroplasmy trait.
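
The colocalization rule can be sketched in a few lines of Python. The function names and example values are hypothetical; each pair stands for one variant, tissue and trait combination for the assigned gene.

```python
def clpp(pip_gwas: float, pip_eqtl: float) -> float:
    """Colocalization posterior probability as the product of the two PIPs."""
    return pip_gwas * pip_eqtl


def is_colocalized(pip_pairs, threshold=0.1):
    """True when any (GWAS PIP, cis-eQTL PIP) pair exceeds the threshold."""
    return any(clpp(gwas, eqtl) > threshold for gwas, eqtl in pip_pairs)


# Hypothetical PIP pairs for one gene, for illustration only.
example_coloc = is_colocalized([(0.2, 0.3), (0.9, 0.5)])
```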

Replication of UKB heteroplasmy results in AoU

To perform replication analysis in AoU, we used LD-independent lead SNPs from all case-only heteroplasmy GWASs originally performed in UKB (Methods). We filtered association statistics from AoU (Methods) to these lead variants and compared effect sizes when the variants were identified in AoU with MAC > 20. We used Deming regression implemented in the deming v.1.4 package in R to assess the relationship between effect sizes for these lead SNPs in cross-ancestry meta-analyses in the two biobanks while accounting for standard errors in both83,84. We coded alleles such that effect sizes were always positive in UKB.

Assessment of LD with known polymorphic and reference NUMTs

We collated an extensive database of polymorphic and reference NUMT intervals using BLAST, known reference NUMTs51,85 and published polymorphic NUMTs inferred using mate-pair mapping discordance86,87. To search for regions of homology to the mtDNA within the reference nucDNA, we used BLASTn 2.13.0 with the GRCh37 reference genome with a word size of 11, an expected threshold of 0.05, short queries enabled and default values for the other parameters. In total, we obtained 4,736 overlapping reference and polymorphic NUMT intervals. We constructed a 20 kb window around each nucDNA NUMT region (10 kb up, 10 kb down) and then conservatively tested for LD R² > 0.1 between any SNP in the window and each lead variant at GWS for our UKB case-only heteroplasmy GWAS using in-sample genome-wide EUR LD-matrices generated in UKB61. All LD values used to examine individual loci in either biobank were computed in-sample; for example, in AoU we computed LD using the post-QC genotype MatrixTable (Methods) used for GWAS with the Hail function \({\rm{hl.}}{\rm{ld\_matrix}}()\).
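
The window-based LD check can be sketched as follows. This is a simplified illustration: the function names and data layout are hypothetical, and the R² values between each SNP and the lead variant are assumed to have been computed beforehand from an in-sample LD matrix.

```python
def numt_window(start, end, flank=10_000):
    """Extend a NUMT interval by 10 kb on each side (clipped at 0)."""
    return max(0, start - flank), end + flank


def lead_in_ld_with_numt(r2_with_lead, snp_positions, numt, r2_threshold=0.1):
    """Check whether any SNP in the NUMT window is in LD with a lead variant.

    r2_with_lead: dict SNP id -> R^2 with the lead variant.
    snp_positions: dict SNP id -> (chromosome, position).
    numt: (chromosome, start, end) of the NUMT interval.
    """
    chrom, start, end = numt
    window_start, window_end = numt_window(start, end)
    for snp, r2 in r2_with_lead.items():
        snp_chrom, snp_pos = snp_positions[snp]
        in_window = snp_chrom == chrom and window_start <= snp_pos <= window_end
        if in_window and r2 > r2_threshold:
            return True
    return False


# Hypothetical SNPs and NUMT interval, for illustration only.
example_positions = {"snp_a": ("2", 105_000), "snp_b": ("2", 500_000)}
example_r2 = {"snp_a": 0.35, "snp_b": 0.80}
example_flag = lead_in_ld_with_numt(
    example_r2, example_positions, ("2", 110_000, 112_000)
)
```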

Multiple alignment of POLG2 protein sequence

POLG2 homologues were detected using the best bidirectional BlastP hit (expect value < 1 × 10⁻³) from humans and were aligned using MUSCLE88.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.