Introduction

Responses to exercise training depends on the type of exercise stimulus, and varies considerably between individuals1,2,3. This variability is tissue-specific, and may be explained by a combination of genetic variants, epigenetic signatures, other molecular and lifestyle factors4,5. Mitochondria are the key mediators of intracellular energy and are involved in many essential cell metabolism and homeostasis processes6 with exercise training improving mitochondrial function and content6,7,8,9.

The mitochondrial genome encodes 37 genes that are highly conserved but differ slightly amongst different regional isolates (haplogroups)10. Mitochondrial haplogroups and Single Nucleotide Polymorphisms (SNPs), in conjunction with SNPs in mitochondrial-related genes (nuclear encoded mitochondrial proteins: NEMPs) have previously been associated with athletic performance in highly trained populations and response to exercise training in the general population11. While these studies have advanced our understanding, they have primarily utilised targeted genotyping technology such as candidate gene approaches, or Sanger sequencing to investigate specific mitochondrial coding regions and NEMPs, such as NRF2 and PGC1α12,13,14,15. Many of these studies also lacked robust technical measures on aerobic fitness measures9. As such, many of the identified variants have not been replicated, and exercise-related genetic variants remain unknown16. To date, studies assessing mitochondrial DNA (mtDNA) variants and NEMPs pertaining to exercise training have focused on protein-coding variants, with no studies looking at the more subtle effects of synonymous and non-coding changes11,17,18,19,20. Further, these studies have often based haplogroup analyses on sequencing or genotyping of the mitochondrial hypervariable region(s) (~ 500–1,000 bp), with no consideration for the remaining mitochondrial genome (~ 15,000 bp) and the specific haplogroup of exercise participants. For instance, 3`UTR (untranslated regions) variants that do not directly affect protein function may however affect translation, mRNA shuttling to specific organelles, or epigenetic modification such as microRNA silencing21. Intronic variants may also lead to splice site changes directly contributing altered protein structure and function22. As Next Generation Sequencing has become more widely available and affordable, sequencing of the whole mitochondrial genome (16,569 bp) is now feasible to uncover genetic variants associated with physical fitness phenotypes. When used in combination with SNP genotyping arrays, it is possible to examine, not only the 37 mitochondrially-encoded genes, but variants within all nuclear NEMP genes simultaneously.

Genetics may influence exercise response in conjunction with environmental factors such as diet, repeated exercise bouts, and age. Whilst these are modifiable, it is difficult to gauge the contribution of these factors to exercise response within short term exercise studies. Further, the additive effects of genetic variants to exercise response are not well understood as only a few genetic variants have been consistently replicated in the field.

Therefore, the aim of the present study was to examine the association between genetic variants (i.e. mitochondrial variants and NEMPs), and aerobic fitness measures in the well-characterised Gene SMART cohort. We hypothesise that by utilising whole-mitochondrial sequencing, we will uncover novel genetic variants associated with exercise responses.

Results

Exercise responses and principal component analysis (PCA)

Participant characteristics and response to exercise for all phenotypes are detailed in Table 1. P-values shown for delta variables are respective of one tail of a paired samples t-test.

Table 1 Participant characteristics before and after four weeks of high-intensity interval training in the Gene SMART study.

Weeks of HIIT elicited small yet significant improvements in Wpeak, LT, VO2max, and TT (PP: 18.96 ± 16.49 Watts, P = 2.28e−13; LT: 15.69 ± 16.24 Watts, P = 7.47e−11; VO2max: 1.12 ± 3.84 mL/min·kg, P = 0.012; TT: − 101.86 ± 144.64 s, P = 2.81e−6).

From our mtDNA sequencing, we obtained an average depth coverage of 615X over the mitochondrial genome. Following sample annotation with Mitomaster, we found that there were 60 distinct haplogroups within the Gene SMART completed cohort of 62 participants. As such, there were no statistically significant associations between the mitochondrial haplogroups with exercise response traits. A summary table of the mitochondrial haplogroups found within the Gene SMART participants is shown in Table 2. The confidence scores (0–1) represent the number of mtDNA variants found in each participant that belong to their respective haplogroup.

Table 2 Summary of mitochondrial Haplogroups within the Gene SMART study.

Following PCA on the response traits, we found that the first 4 principal components (PC1: 35.49%, PC2: 28.46%, PC3: 16.51%, PC4: 12.74%) cumulatively explained 92.3% of the total variance between individuals; therefore we included only these first 4 PCs in subsequent analyses.

Association between genetic variants (mitochondrial encoded and nuclear encoded) and exercise phenotypes

Following quality control, 170 mitochondrial and 4,124 NEMP genetic variants were included in association testing. A cumulative total of 4,325 NEMP variants and 28 mitochondrial variants passed the nominal threshold of significance (Punadjusted < 0.05) for all tests. A solar plot showing the clustering of mitochondrial genomic variants for each trait is shown in Fig. 1 23. The exonic variants passing the nominal threshold from the mitochondrial association results are summarised in Table 3.

Figure 1
figure 1

Solar plot showing significant hits from mitochondrial association testing. Each dot represents a detected variant. The inner ring of the plot represents the mitochondrial genome and is coloured based on genomic region as summarised in the plot legend. The X-axis represents the mitochondrial base pair location. The Y-axis represents the significance level [− log10 (P-value)] in the Gene SMART population over multiple traits. The significance threshold was set at P < 0.05 and is represented by the circular blue line. The concentric white rings surrounding the genome represent the P-value thresholds − log10 (0.01) and − log10 (0.001) respectively.

Table 3 Exonic mitochondrial SNPs associated with phenotypic traits and PCs.

28 variants passed the nominal significance threshold of Punadjusted < 0.05 in various delta traits and principal components. Of these, 8 were located within the hypervariable control region and therefore discounted from further analyses. A further 2 genetic variants were located within a mitochondrial rRNA gene, 1 within the tRNALeu gene, 1 within the mitochondrially encoded ATP synthase membrane subunit 6 (ATP6) gene, 2 within the mitochondrially encoded NADH: ubiquinone oxidoreductase core subunit 4 (ND4), 2 in mitochondrially encoded NADH: ubiquinone oxidoreductase core subunit 5 (ND5) and 1 in mitochondrially encoded cytochrome B (CYB). None of the mitochondrial genomic variants were associated with composite response traits or individual response traits at FDR < 0.05. A manhattan plot of the NEMP variants is shown in Fig. 2. A summary of the association statistics for the variants passing a nominal threshold of Punadjusted < 1e−4 in both the NEMP associations are shown in Table 4.

Table 4 Summary statistics for exonic variants in the nuclear encoded, mitochondria-related genes.

A full list of variants reaching the nominal P-value threshold (< 0.05) may be found in (Supplementary Table SI). 6 SNPs in 5 distinct genes were associated with ΔTT and 2 SNPs in 2 distinct genes were associated with PC2. The most significant variant was rs2041840 associated with PC2 and located within NDUFAF7; we found that the rs12712528 variant also within NDUFAF7 had a moderate correlation with rs2041840 (R2 = 0.5) Fig. 3e. This variant was also trending towards significance in the Δ-Weight and Δ-VO2max response phenotypes (Table 4). The T allele at rs2041840 was associated with a better response to exercise. The Locus Zoom plot (Fig. 3d) surrounding the MTG2 gene was also gene-rich with 11 proximal genes. The two associated variants (rs6062129 and rs6121949) were moderately correlated (R2 = 0.5), however there were no SNPs found within the proximal genes. The locus zoom plot for the variants found within the AFG3L2 gene (Fig. 3f) was proximal to 6 genes within 200 Kb. There was also a proximal SNP within the SLMO1 gene however this was not in linkage with the variants identified within the AFG3L2 gene.

Figure 2
figure 2

Manhattan plot for all hits from all response phenotypes, biochemical measures, and PCs in the linear dominant and recessive association models. Suggestive significance was set at − log10(Punadjusted = 0.0001, blue line). As all traits were included clusters of variants represent association across multiple traits rather than one significant locus commonly associated with GWAS.

Discussion

In this study, we utilised state-of-the-art mitochondrial sequencing, along with high-throughput targeted genotyping of mitochondrial-related variants encoded by the nucleus (NEMPs) to discover novel genetic variants associated with responses to exercise. A total of 28 mitochondrial and 4,325 nuclear encoded mitochondrial associated variants passed the nominal significance thresholds for the various candidate gene association tests. We did not detect mitochondrial variants associated with exercise response, but we uncovered eight NEMPs in seven distinct genes associated with exercise response. It should be noted that we have attempted to control for the contribution of environmental stimuli on each exercise phenotype within our study. For instance, the arguably largest contributors to exercise response (diet, age, repeated bouts) were carefully scrutinised in order to ascertain the genetic contribution to each phenotypic trait. Further, we have ascertained which genetic variants contribute to multiple phenotypic traits using composite traits built with PCA data reduction.

Novel exercise loci

The most significant variant was associated with the composite exercise response phenotype and located within an intron of NDUFAF7 (rs2041840). The T allele was associated with better exercise response as shown by the positive beta values. NDUFAF7 encodes an arginine methyltransferase that is essential for mitochondrial complex I assembly24. We have showed that this variant was in a gene rich region with 8 proximal genes (Fig. 3a), indicating possible effects for this variant in any of the proximal genes or indeed for genes that may be further away from the loci. Specifically, the interaction between this variant and the Glutaminyl-Peptide Cyclotransferase (QPCT) and Protein Kinase D3 (PRKD3) genes has been previously described in a recent GWAS study25. In a recent RNAseq profiling study of exercise training, it was demonstrated that the QPCT gene was upregulated following 12 weeks of training26. As such, we expected the variant seen within the NDUFAF7 gene to be associated with differing levels of the QPCT transcript following prolonged exercise training.

The two intronic variants within the MTG2 gene were found to be associated with the change in time trial measures and appeared to be moderately linked (Fig. 2b). The MTG2 gene resides in a gene rich locus with 11 proximal genes. When assessed for functionality within the UCSC genome browser, we noted that both the MTG2 variants were located in a regulatory element (GH20J062181) that interacts with the MTG2 transcription start site. Further, there was a large amount of layered H3K27 acetylation at the variant site, and the linked MTG2 promoter region. The MTG protein regulates the assembly and function of the mitochondrial ribosome. As such, dysregulation of the gene could result in the downregulation of mitochondrial translation, and therefore a lower response to exercise training. The variants also showed a 20% recombination rate with the 5` region of the TAF4 gene. The TAF4 protein forms part of the transcription factor II D (TFIID) complex and has a central role in mediating promoter responses to transcriptional activators and repressors. Dysregulation of this gene could introduce global translational repression and therefore lack of response to HIIT training. This is supported by the positive effect size for the C and G alleles of the MTG2 variants rs6062129 and rs6121949 respectively (β = 587.7 s).

An intronic variant within AFG3L2 was also shown to be associated with the composite exercise response phenotype (rs7231304), but this gene has not previously been associated with exercise response. However, mutations in AFG3L2 have been shown to cause spinocerebellar ataxia through the development of mitochondrial proteotoxicity27,28. As such, the intronic variation within this gene might inhibit exercise response through dysregulation of mitochondrial structure and function. Further, this variant is in a locus with 6 proximal genes (Fig. 3c), however no genes within this locus shared a recombination rate above 10%. There were two proximal SNPs with a moderate correlation to the SNP of interest also within the AFG3L2 genic region. When assessed for functionality through the UCSC genome browser, we noted that the SNP was within a DNAseI hypersensitivity region, and therefore may have effects on the mRNA half-life rather than protein functionality in response to training.

The T allele at the exonic rs7085433 variant in the TIMM23 gene was associated with the change in time trial phenotype (Δ-TT) causes a non-coding transcript of the TIMM23 gene. This gene is one of the targets of transcriptional activators NRF-1 and GA binding protein (GABP/NRF-2)29, in which we have previously shown genetic variants associated with athletic performance30,31. TIMM23 is one of the mitochondrial transmembrane subunits that form the mitochondrial protein import (TIM23) complex. Therefore, this subunit is essential for the transport of peptide containing proteins across the inner mitochondrial membrane. The non-coding transcript resulting from the variant would render the complex non-functional and as such impaired transport of biomolecules across the inner mitochondrial membrane may impair exercise potential. The effect size of this variant was very highly positive (β = 587.7 s) and therefore, this non-coding transcript may result in a slower time to complete the time trial.

The rs1063271 variant lies within the 3` Untranslated Region (UTR) of the SPTLC2 gene. UTR variants have been shown to influence transcript half-life; through the dysregulated binding of transcript shuttle proteins; or change the binding site of miRNAs resulting in epigenetic silencing of the gene32. As many current miRNA binding site analysis tools require targeted sequences, we examined the interaction between specific miRNAs previously found within exercise training with the 3`UTR of the SPTLC2 gene33. We noted that all of the miRNAs included in our STarMir curation were able to bind to the 3`UTR of the SPTLC2 gene in both seed and seedless sites. As such, it was not possible to indicate a specific miRNA mechanism within the context of this study although we note that this genetic variant in SPTLC2 should be computationally explored in future studies. The SPTLC2 protein is involved in the de novo biosynthesis of sphingolipids by forming a complex with its counterpart; SPTLC134. Overexpression of this protein has also been shown to cause elevated sphingolipid formation and therefore mitochondrial autophagy35. Much like the TIMM23 rs7085433 variant, the effect size for time to completion in Time Trial (β = 587.7 s) indicated that carriers of T allele/genotype have slower TT and therefore poorer response to exercise when compared to carriers of the C allele/genotype. We hypothesise that the T allele for this variant may induce a novel miRNA binding site in the transcript resulting in the silencing of the SPTLC2 gene.

The rs11061368 variant lies within an intronic region of the DIABLO gene. The protein encoded by this gene functions to induce apoptotic processes through the activation of caspases in the Cytochrome C/Apaf-1/caspase-9 pathway. When viewed in UCSC genome browser, it was evident that the SNP was not affiliated with any regulatory elements and therefore we were unable to determine the true functionality of this intronic variant. However, we postulate a molecular mechanism that should be explored in future exercise related studies. Although the associated variant does not show functionality within this gene, the dysregulation of the DIABLO gene could prevent adequate muscle remodelling resulting in the lack of response to training. Further, the variant also lies ~ 50 Kb away from the Interleukin 31 (IL31) gene, a pro-inflammatory cytokine associated with the activation of Signal Transducer and Activator of Transcription 3 (STAT3) pathways, which have already been extensively studied and implicated in exercise training responses.

The FAM185A gene has to date had limited previous research and as such we were unable to elucidate any specific molecular function within the context of exercise training. However, the gene has been previously associated with plexus-forming angiogenesis within the context of foetal lung tissue36. It is plausible that the gene is involved in angiogenic processes outside embryonic development. Further, the gene is proximal to 9 other genes within 200 Kb. There was no evidence of high recombination rates with any of the proximal genes, and there were no proximal SNPs correlated with the rs113400963 polymorphism. When we assessed the polymorphism using the UCSC Genome browser to determine functional consequence of this variant and found no link between the intronic variant with any regulatory or epigenetic regions.

Mitochondrial

None of the mitochondrial genetic variants identified in this study were associated with exercise response in the present study to a threshold of FDR < 0.05. Additionally, we lacked enough statistical power to associate mitochondrial haplogroup with exercise responses as the cohort was extremely heterogeneous. Although nominal significance was achieved, due to the hypervariable nature of the control region, we chose not to focus on the SNPs within this region.

The g.A8701G variant within the ATP6 gene causes a missense change within its respective protein (p.Thr59Ala) and has been well characterised in hypertensive cases37. This variant was nominally significant in both the Δ-LT phenotype and the PC3 composite trait within the cohort. As the Δ-LT trait provided a smaller contribution to PC3, the variant was assumed to be partially associated with a mixture of the Δ-TT and Δ-VO2max phenotypes. The effect size of this variant indicated a poor response to exercise training (β = − 26.19 LT).

Interestingly, all the variants associated with PC4 were related to the utilisation of the amino acid Leucine. Firstly, the g.A12308G variant within the mitochondrial coding region for the tRNA for Leucine. Whilst the effect of this variant was unclear, it appears to have influenced the composite phenotypes within PC4. Mutations within tRNA genes have previously been associated with reduction in organelle quantity and downregulation of protein synthesis38. Secondly, both synonymous variants in the ND4 (g.A11467G) and ND5 (g.G12372A) genes result in a codon that is used far less frequently (CUA[276] > CUG[42]) in mitochondrial translation processes39. As the biosynthesis of tRNAs is costly with respect to intracellular energy levels, it is possible that the combination of the dysregulation of the tRNAleu and the codon usage frequency change in two subunits of the mitochondrial membrane respiratory chain NADH dehydrogenase (complex I) may result in premature intracellular energy (ATP) deficiency and contribute to the poor response to exercise training associated with these traits. It should be noted that the stringent thresholds for association in the mitochondrial association tests could also have resulted in false negative results. Additionally, mitochondrial genetic variants rarely influence phenotypic traits in isolation.

We have identified novel nuclear-encoded, mitochondrial-related SNPs and loci associated with adaptations to High Intensity Interval Training. Additionally, we have postulated the mode of action for different molecular mechanisms that may be responsible for the variability in response to exercise intervention. It should be noted that performing mitochondrial sequencing on muscle tissue as opposed to blood may yield more informative results with heteroplasmic associations due to the high concentration of mitochondria. We note that while we have utilised comprehensive sequencing and high throughput arrays in combination with robust exercise phenotypes, the variants associated with responses in this study, need to be replicated in larger cohorts of both the general population and elite athletes. Further, the variants assessed within the current study were tag SNPs within the genotyping arrays and further information may be gained from the imputation of additional SNPs within the regions we have discovered. This could be achieved by leveraging on large multi-centre initiatives such as the Athlome consortium40. Additionally, functional genomic analyses are required to determine the effect of these variants on the molecular pathways commonly involved in exercise response. Such studies could include transcriptomics, epigenetics and functional cell work in a multi-omics approach.

Methods

Participants

At the time of analysis, 77 participants had taken part in the study, 62 of whom successfully completed 4 weeks of High-Intensity Interval Training (HIIT) intervention protocol in the Gene SMART (Skeletal Muscle Adaptive Response to Training) study41 at Victoria University, Australia. Ethical clearance for this study was provided by the Human Research Ethics Committee at Victoria University (Approval Number: HRE13-233), and the clearance was transferred to and also provided by the QUT Human Research Ethics Committee (Approval Number: 1600000342). All participants provided informed consent prior to the study and all methods were carried out in accordance with relevant guidelines and regulations. We analysed the 62 participants who did not drop out of the study and all had healthy BMI and were moderately trained with an age range of (31.33 ± 7.94 years).

The Gene SMART study design has been previously reported41. Briefly, participants were required to provide medical clearance to satisfy the inclusion criteria. Following familiarisation, baseline exercise performance was determined on a cycle ergometer during a 20 km time trial (TT), and two graded exercise tests (GXTs); these tests were administered a few days apart and no more than two weeks apart to limit temporal variability in performance.

Molecular methods

Genomic DNA was extracted for 77 participants regardless of completion status from 2.0 mL of whole blood using a QIAmp DNA blood midi kit (QIAGEN, Hilden, Germany). Briefly, the concentration and purity of genomic DNA (gDNA) from all samples was assessed via Nanodrop spectrophotometry and Qubit fluorometry. We used an in-house sequencing method recently developed by our group at the Genomics Research Centre, Queensland University of Technology, Australia to sequence the whole mitochondrial genome of each participant42. Illumina Infinium Microarray was used on HumanCoreExome-24v1.1 bead chip to genotype all samples for ~ 550,000 loci. For all samples, 1 µg total gDNA was sent to the Australian Translational Genomics Centre, Queensland University of Technology Australia, for SNP genotyping on the arrays.

Data filtering

A bioinformatics pipeline (SAMtools, BCFtools) was utilised to generate variant call files (VCF) for all samples as described previously42. VCF files were then aligned to the revised Cambridge Reference Sequence (rCRS) and all sequences were stringently left aligned back to this reference genome to account for the single end (SE) reads generated from Ion Torrent sequence information. FASTA files were generated for all samples and then merged VCF and FASTA files were produced for the entire data set. The merged FASTA files were annexed using MITOMASTER, a mitochondrial sequence database, to call haplogroups and obtain variant annotation information for all samples43,44. The merged VCF file was converted to PLINK (v1.90p) format using the function ‘-make-bed’ for further association analysis. A detailed description of the analysis pipeline may be found in markdown format in the GRC computational genetics GitHub account (https://github.com/GRC-CompGen/mitochondrial_seq_pipeline), including all necessary files (NEMP locations BED file) and scripts (mitochondrial Solarplot R script) to replicate our analyses within other data sets.

The ped file generated from Illumina GenomeStudio v2.0 software was converted into binary format. We did not impute any genotypes to prevent false positive associations and a larger multiple testing burden. There were 551,839 typed SNPs; subsequent SNP and individual filtering and trimming was based on (1) SNPs with > 20% missing data (239 removed), (2) individuals with > 20% missing data (0 removed), (3) minor allele frequency < 0.01 to remove rare variant associations (260,269 removed), (4) SNPs out of Hardy Weinberg equilibrium for quantitative traits (58 removed due to P < 1e−6)45. All samples passed kinship and heterozygosity thresholds after the filtering outlined above, leaving 62 samples and 291,273 SNPs to analyse. A BED file containing the genomic locations (GRCh37) of all known Nuclear Encoded Mitochondrial Protein genes (NEMPs) was obtained from the Broad Institute’s human MitoCarta2.0 website46,47,48,49. PLINK was used to extract the SNPs within the genomic locations from the Omni Express SNP chip data of the same participants. In total, 4,806 SNPs were within the NEMP genomic regions detailed by the Broad Institute MitoCarta2.0 bed file and considered to be mitochondrially related variants.

Exercise-response phenotypes

Participant stratification into high and low response groups lead to a loss of statistical power in association testing. As such, and to avoid classifying responders and non-responders via arbitrary thresholds, we chose to keep the phenotypes as continuous variables for association testing50.

To ascertain variants that were associated with exercise response for key physiological traits, we utilised the delta (Δ) change (Post phenotype–Pre phenotype) quantitative trait data for; peak power output (ΔWpeak in Watts); power at lactate threshold (ΔLT in Watts); peak oxygen uptake (ΔVO2peak in mL/min/kg body weight); and time to completion measurement for a 20 km time trial (ΔTT in seconds). As the quantitative traits were all continuous and to keep maximal statistical power, we did not use arbitrary response thresholds. With multiple, correlated response phenotypes, we conducted a Principal Component Analysis (PCA) of the response phenotypes using the R package FactoMineR51. PCA is a dimensionality reduction method that computes linear combinations of the multiple response phenotypes into principal components (PCs) so that the variance between individuals is maximised. Every individual is then represented by one value for each PC, considered a composite trait of the different response phenotypes. A more detailed description of PCA for composite trait association testing is shown in Supplementary Fig. 1.

Missing phenotypic values were excluded from the phenotype table prior to PCA to prevent skewing of data and to maintain appropriate PCs. Following the PCA, these variables were set as “missing” for the association analysis. We also tested the individual response phenotypes and compared the significance levels of variants between the composite traits with those within each PC. This resulted in 4 PCs that cumulatively explained > 90% of the variance between participants.

Statistical analysis

Analysis of the response traits was performed in SPSS using a paired samples t-test. SPSS was also used to test associations between mitochondrial haplogroups and exercise response with a Wald test. Analyses for the mitochondrial SNPs and NEMP SNPs were kept separate for analysis using different association models. We used PLINK V1.90p to perform quantitative linear association tests (95% CI) with both dominant and recessive models. An additive model was also attempted but yielded the same results as our dominant model. We adjusted all association results for age and effect sizes were determined using raw beta regression coefficient values (i.e. genotype X is associated with β [unit specific to trait of interest] changes in the phenotype). Variants that passed a nominal P-value threshold of P < 0.05 were considered for further analysis whereas variants that passed multiple testing adjustment using the Benjamini–Hochberg False Discovery Rate (FDR < 0.05) method were considered significant associations. We performed adjustment for multiple testing for each phenotype separately. Performance of an apriori power calculation for this study indicated that the linear modelling approach with additive genotypic effects for our sample size (n = 62) was sufficient for at least 80% power to detect SNP-based heritability of 13% or more at the relaxed alpha level of 0.05. SNP-based heritability estimates were approximated by genotype Vs outcome R2 values from linear regressions. We also note that the Gene SMART cohort is a tightly controlled study with rigorous physiological measures, all performed in duplicate, which by itself significantly increase the power of the detected apriori.

All variants from the association tests were plotted in R using the tidyverse, ggplot2, and qqman packages. The script for the mitochondrial solar plot depicted in Fig. 1 has been adapted from Stephen turner’s GitHub account23. Locus zoom plots were generated with the online locus zoom software using a compilation of the dominant and recessive nominal results from our association tests52. As the participants were all Caucasian, SNP linkage r2 values were calculated with the HapMap CEU database. As intronic SNPs may affect genes far away, we termed genes within 200 Kb of the SNP of interest as “proximal” regardless of NEMP status.

Figure 3
figure 3

Locus Zoom plots of significant intronic SNPs from the nuclear mitochondrial association testing. Each panel shows the locus surrounding (a) rs6969054 variant within the FAM185A gene, (b) rs11061368 variant within the DIABLO gene, (c) rs1063271 variant within the SPTLC2 gene, (d) rs6062129 variant within the MTG2 gene, (e) rs2041840 variant within the NDUFAF7 gene, and (f) the rs7231304 variant within the AFG3L2 gene. All panels show the variant of interest ± 200 Kb. Left y-axis shows − log10(P-value) of association results for all traits and right y-axis shows recombination rate across the locus in relation to the variant of interest. X-axis shows genomic position across the respective chromosomal regions. All variants were plotted according to the hg19/GRCh37 reference build and recombination rate was calculated from the default parameters within the locus zoom software.

We utilised the UCSC genome browser (hg19/GRCh37) to ascertain the genomic affect and therefore consequence of all statistically significant variants. The GeneHancer track was utilised to determine the regulatory element affect for each variant53. We chose to postulate molecular mechanisms for variants that were purely intronic and did not show affinity for epigenetic or transcription factor binding. However, these should be interpreted with caution and therefore we note that the variants in this category were found to be association based only and should be confirmed through replication analysis in larger exercise cohorts.