Introduction

Mitochondrial DNA (mtDNA) deletions are large structural variants in the mitochondrial genome where a large piece of DNA (often several kilobases or more) is missing1,2,3,4,5,6. These mtDNA molecules may or may not have the ability to replicate and can lead to metabolic impairment by disrupting regions that code for proteins, ribosomal and/or transfer RNAs essential for the oxidative phosphorylation (OXPHOS) pathway3,7,8,9. The identification of mtDNA deletion breakpoints and the relative quantification of deleted vs. wild-type mitochondrial genomes have traditionally relied on several approaches, including Southern Blot analysis, Sanger sequencing, and quantitative polymerase chain reaction (qPCR)10,11,12. Recently, however, several bioinformatics programs and pipelines have been described to identify mtDNA deletions with high accuracy using next-generation sequencing (NGS) data13,14,15,16,17. Our group developed the Splice-Break pipeline for this purpose, and we previously demonstrated its accuracy and applications to human disease and aging research when used on Illumina sequencing data derived from NGS libraries prepared using mitochondrial-enriched, long-range PCR products as the DNA input13. While PCR amplification of the mitochondrial genome in one or more large fragments is a common approach used by many mitochondrial research groups, it still represents a niche method, and such sequencing data is limited. With the more widespread availability of RNA-sequencing (RNA-Seq) data from a variety of tissues and phenotypes, however, it is of great interest to determine if these datasets can be utilized as a resource for mtDNA deletion investigations.

In this study, we demonstrate how Splice-Break213 may be utilized to evaluate mtDNA deletions from RNA-Seq data. We describe a robust evaluation of 30 common mtDNA deletions in 1570 human samples from 14 RNA-Seq studies18,19,20,21,22,23,24,25,26,27,28,29, including 1107 samples across 11 tissues from the Genotype-Tissue Expression (GTEx) Project29. Studies analyzed include both publicly available and newly presented datasets and cover a variety of methods including those that utilized bulk RNA-Seq (polyA/non-ribosomal depletion vs. ribosomal depletion), laser capture microdissection (LCM) RNA-Seq, spatial transcriptomics, and single-cell RNA sequencing (scRNA-Seq) library preparation methods. These 14 studies include 1570 samples that vary in age, diagnosis and tissue, including: (1) age ranges from prenatal to 96 years; (2) diagnoses of Parkinson’s Disease (PD)18,22,24, Alzheimer’s Disease (AD)19, schizophrenia (SCZ)20,21, major depressive disorder (MDD)21, bipolar disorder (BD)21, and healthy controls (CTRL); and (3) tissues of several brain regions18,19,20,21,24,25,26,28,29, skeletal muscle22,23,29, liver29, blood29, and the pancreas27.

Large mtDNA deletions are often associated with short repeat sequences proximal to the 5′ and 3′ breakpoints; these repeats lead to an incorrect joining of two distal regions of the mitochondrial genome, either through strand-displacement during mtDNA replication or homologous end-joining during DNA repair processes30,31,32,33. Our previous study evaluating mtDNA deletions from brain and blood samples found that all the “Top 30” most frequent mtDNA deletions (which were detected in 69% or more of all samples) were associated with (perfect/imperfect) repeat sequences of 8–22 base pairs in length. These 30 deletions were previously validated by Sanger sequencing and include 12 breakpoints that had been previously described in patients with mitochondrial deletion disorders13,34. Three such examples are the 6335–13999, 7816–14807, and 8471–13449 deletions, which represent 5′-3′ positions 6341–14005, 7814–14805, and 8482–13460, respectively, when adjusted to the first repeat base13,33,35,36,37,38,39. These deleted molecules have been associated with a number of rare diseases including (but not limited to) Kearns-Sayre Syndrome (KSS)36,37, chronic progressive external ophthalmoplegia (CPEO) with or without an associated POLG mutation37,40,41, and diffuse leukodystrophy39; moreover, they have been found to be enriched in metabolically-active, somatic tissues and have been investigated in pathogenic phenotypes associated with cellular aging (e.g., PD, AD)42,43,44.

MtDNA sequences present in RNA-Seq datasets are likely derived from real, transcribed RNA molecules, but may also be attributed to DNA contamination or “off-target” effects26,45,46. The amount of mitochondrial gene transcripts detected in RNA-Seq studies can also be highly variable due to overall RNA quality and library preparation procedures26,45,46,47,48,49,50,51. Due to these concerns, many RNA-Seq and ATAC-Seq bioinformatics pipelines remove mitochondrial reads prior to genome alignment and/or transcript quantification26,45,46,47,48,49,50,51. As such, our first objective was to determine which subset or species of mtDNA deletions demonstrated a significant and positive correlation between RNA-Seq and the aforementioned DNA sequencing processes that utilize long-range PCR amplification of the mitochondrial genome. We PCR-amplified mtDNA from the dorsolateral prefrontal cortex (DLPFC) and processed the NGS data through the traditional Splice-Break2 pipeline13, and compared the deletion read %’s to mRNA-Seq data that was previously published by Zeppillo et al. using the same tissue (GSE224683)20. Next, we evaluated 14 RNA-Seq studies18,19,20,21,22,23,24,25,26,27,28,29 to determine how methodological differences in library preparation and sequencing depth affected our ability to detect mitochondrial reads and common mtDNA deletions. Lastly, taking our library preparation observations into account, we evaluated 11 of the 14 RNA-Seq datasets derived from human brain, muscle tissue, liver and blood, and examined if these common deletions significantly correlated with age20,21,22,23,29, were enriched in disease phenotypes or brain regions18,19,20,21,22,24,25,29, or had differential levels in cortical grey matter layers (I-VI) compared to white matter26. Cortical layer evaluations were performed using two spatial transcriptomics datasets, including one dataset of middle temporal gyrus (MTG) that is newly presented here in this study26 (GSE226663).

Here, we describe the methodological considerations for applying the Splice-Break2 pipeline to RNA-Seq datasets, and present evidence that somatic aging effects, tissue specificity, and metabolic dysfunction can be studied with this approach. Given the wealth of both publicly available and restricted RNA-Seq datasets, this strategy may expedite investigations of common mtDNA deletions, especially for cases where the affected human tissue is not readily available for additional sequencing studies (e.g., rare diseases). Moreover, RNA-Seq studies are already designed to appreciate factors such as tissue and cellular specificity, environmental context, aging and drug exposure, whereas germline DNA studies (e.g., whole genome sequencing (WGS)) often use DNA derived from saliva or blood, which may not accurately recapitulate the somatic structural variation present in metabolically-active tissues or cells52,53. Taken together, we believe this study will open a new door to study common mtDNA deletions and their role in human health and disease.

Methods

Samples

We evaluated 1570 samples from 14 RNA-Seq datasets. Thirteen of these datasets are provided on the National Center for Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO) website or are otherwise publicly available- we refer to these datasets collectively as “GEO+”18,19,20,21,22,23,24,25,26,27,28. This consisted of 7 bulk sequencing with and without ribosomal depletion (GSE226663, GSE114517, GSE159699, GSE224683, GSE140089, GSE164471, and the Stanley Neuropathology Consortium dataset: http://sncid.stanleyresearch.org), 2 LCM RNA-Seq (GSE114918 and GSE166024), 2 spatial transcriptomics (GSE226663 and SpatialLIBD dataset available at Globus endpoint “jhpce#HumanPilot10x”), and 2 single-cell sequencing datasets (GSE81547 and GSE67835). The last dataset came from the Genotype-Tissue Expression Consortium (GTEx)29, where we analyzed bulk sequencing samples from 11 human tissues (dbGaP phs000424.vN.pN). All described studies obtained consent from living donors and/or next-of-kin for postmortem tissues, along with necessary approvals from Institutional Review Boards (IRBs) and/or ethics committees.

The 30 samples derived from mitochondrial-enriched, long-range PCR products as the DNA input are available through dbGaP (phs002395.v1.p1) as part of a larger investigation of blood and brain-derived mtDNA13,54. DNA sequencing was performed on mitochondrial enriched PCR-amplicons fragmented by sonication and prepared for NGS using a whole-genome library preparation kit13,55. Briefly, ~50 mg of frozen brain tissue was homogenized manually with a pestel; total DNA was extracted using the AllPrep DNA/RNA/Protein Kit (Qiagen, Valencia, CA, USA) and quantified using the Qubit Fluorometer and dsDNA BR Assay Kit (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. mtDNA was enriched for each sample using a long-range (LR) PCR. The forward primer used was: 5′-CCGCACAAGAGTGCTACTCTCCTC-3′; the reverse primer used was: 5′-GATATTGATTTCACGGAGGATGGTG-3′ (Integrated DNA Technologies, Coralville, IA, USA)13,55. Each sample was amplified in a 50 μl PCR reaction that contained the following: 50 ng of total DNA, 1 μl of each 10 uM primer, 8 μl of 2.5 mM dNTPs, 0.5 μl of LA Taq DNA Polymerase, Hot-Start Version (Takara Bio USA, Inc., Mountain View, CA, USA), and 5 μl 10x buffer. Thermocycler parameters were as follows: 94 °C for 1 min, followed by 30 cycles of denaturation at 98 °C for 10 s and annealing/extension at 68 °C for 15 min, with a final extension at 72 °C for 10 min. Reactions were then kept at 4 °C. Following PCR, 5 μl (10% of the total reaction volume) was loaded into a 1% agarose gel containing 10 mg/ml ethidium bromide and the gel was run at 100 V for approximately 2 h to confirm amplification of the full-length mitochondrial genome. PCR products were purified using Agencourt AMPure XP beads (Beckman Coulter, Indianapolis, IN, USA) and were quantified using the Qubit Fluorometer and dsDNA BR Assay Kit (Invitrogen, Carlsbad, CA, USA). DNA shearing was performed in S220 Covaris microTUBEs with the following settings: Duty Factor = 5%, Peak Incident Power = 175 W, Cycles per Burst = 200, Time = 35. 200 ng of the sheared LR mitochondrial PCR product was used for library preparation using the TruSeq Nano DNA HT Library Preparation Kit (Illumina, San Diego, CA, USA). Each library was barcoded for multiplex sequencing with 96 samples per lane. NGS libraries were quantified and qualified prior to sequencing using the KAPA Library Quantification Kit (KAPA Biosystems, Wilmington, MA, USA) and Agilent Bioanalyzer DNA 7500 chips (Agilent, Santa Clara, CA, USA), respectively. Libraries were sequenced as 150-mer paired-end reads using the Illumina HiSeq 2500 at the UCI Genomics High Throughput Facility.

All accession numbers for the respective RNA-Seq datasets are provided in the Data Availability, Results section, in Supplementary Data 1 and 2, and are on GitHub (https://github.com/aomidsalar/RNA-Seq_Splice-Break2).

Internal data preparation: bulk RNA-Seq

The internal data obtained for methods comparisons (bulk RNA-Seq and spatial transcriptomics) utilized four subject’s brain tissue obtained from the Banner Sun Health Research Institute (BSHRI) Brain and Body Donation Program. These four subjects consisted of one Parkinson’s Disease (PD) subject (144113–144114), one Alzheimer’s Disease (AD) subject (144111–144112), and two age-matched controls with no neurodegenerative disease (144105–144106 and 144107–144108). Briefly, frozen middle temporal gyrus (MTG) was microdissected into ~5 mm3 blocks to be compatible with the grid size used for the 10x Visium Spatial Gene Expression platform. Each block was dissected to contain a sulcus, all six cortical layers of the grey matter, and white matter. Frozen tissue was affixed to cryostat chucks with OTC and sectioned with a Microm HM525 Cryostat at −20 °C to 10 μm slices. Frozen tissue slides were adhered to the 10x Visium Spatial Gene Expression or Tissue Optimization slides following the manufacturer’s protocols. Additional sections were adhered to Selectfrost microscope slides for bulk RNA-Seq assays. All slides were frozen at −80 °C prior to library preparation.

For bulk RNA-Seq, two frozen slides from one control subject (144107/144108) were extracted for RNA using the Direct-Zol RNA mini prep kit. RNA was quantified and qualified using the Agilent Tape Station-HS RNA Screen Tape and Tape Station and had an RNA Integrity Number (RIN) of 5.6. 240 ng of RNA was used as input for each bulk RNA-Seq prep, which utilized the NEBNext Ultra II Directional RNA library prep kit. The polyA (mRNA) library preparation used the NEBNext Poly(A) mRNA Magnetic Isolation Module, and the ribo depletion library preparation used the NEBNext rRNA Depletion Kit V2 and SBP beads. Libraries were barcoded with the NEBNext Multiplex Oligos for Illumina. Library size and quality was evaluated using the Agilent Tape Station-HS DNA Screen Tape and Tape Station; libraries were quantified with the KAPA qPCR library quantification kit. Both RNA-Seq libraries were sequenced as paired-end 100-mer reads on an Illumina NovaSeq 6000 system using a NovaSeq S1 flowcell. Sequencing was performed by the USC Keck Genomics Platform (KGP).

For spatial transcriptomics, tissue optimization (i.e., permeabilization time) was tested according to the 10x Genomics protocols, with 12 min of permeabilization selected for spatial transcriptomics sequencing. Two sections of each of the four subjects that were adhered to the 10x Visium Spatial Gene Expression were processed according to the 10x Genomics Visium protocols. Hematoxylin and eosin (H&E) imaging was performed at 20X magnification and auto-stitching using a Keyence BZ-X810 Microscope to generate high-resolution TIFF files. The cDNA was amplified using 12 cycles following Cq determination steps. Library size and quality was evaluated using the Agilent Tape Station-HS DNA Screen Tape and Tape Station; libraries were quantified with the KAPA qPCR library quantification kit. Spatial transcriptomics libraries were sequenced as paired-end 150-mer reads on an Illumina NovaSeq 6000 system using a NovaSeq S4 flowcell. Sequencing was performed by the USC Keck Genomics Platform (KGP).

Downloading files

FASTQ files available on GEO were obtained using the NCBI tool fastq-dump, with command options to split files (--split-files) and append the read ID (-I). The LIBD spatial transcriptomics dataset26 is available on Globus and Github; FASTQ and BAM files from this study were downloaded directly from the Globus endpoint jhpce#HumanPilot10x (http://research.libd.org/globus/jhpce_HumanPilot10x/index.html); “filtered_feature_matrix.h5” files and tissue images, which were used for clustering and image annotation, were downloaded from their Github repository (https://github.com/LieberInstitute/HumanPilot). RNA-Seq FASTQ files from cerebellum (CER), hippocampus (HIPP), and prefrontal cortex (PFC) samples were downloaded directly from the Stanley Neuropathology Consortium website after requesting access to the data portal (http://sncid.stanleyresearch.org/). GTEx samples were obtained from the AnVIL Gen3 repository and downloaded as BAM after gaining access through dbGaP; they were sorted using samtools sort56 (version 1.6) and converted to FASTQ using bedtools bamtofastq57 (version 2.25.0).

HISAT2 alignment

In order to isolate reads that uniquely mapped to chrM and did not contain nuclear-mapped reads, we performed HISAT258 alignment to the nuclear genome on the RNA-Seq samples from the Zeppillo et al. study20. To do this, we first downloaded the Ensembl human genome reference FASTA file version GRCh38.10359 and removed the mitochondrial genome (chrM) from that file. Next, we built the index files for the human genome reference using the command “hisat2-build”58. We then performed alignment using HISAT258 default settings and the updated human genome reference file (without chrM), with the additional option “--un-conc” to store unmapped reads to separate FASTQ files. Those FASTQ files (containing reads that did not map to the nuclear genome) were then used as the input for Splice-Break2 analysis13.

Splice-Break2 analysis

RNA-Seq FASTQ files were pushed through Splice-Break2 (https://github.com/brookehjelm/Splice-Break2), using either the single-end or paired-end version based on the data format of the respective study. Default command line options described were used, apart from the pre-alignment steps which were skipped (i.e., command line options: --align=yes, --ref=rCRS, fastq_keep=no, --skip_preAlign=yes). For the spatial transcriptomics datasets, alignment (BCL to BAM file) was performed with 10x Genomics Space Ranger (https://support.10xgenomics.com/spatial-gene-expression/software/overview/welcome) and BAM to FASTQ conversion was performed using the 10x Genomics bamtofastq tool (https://support.10xgenomics.com/docs/bamtofastq). The FASTQ files for Read 2 were processed through the single-end version of Splice-Break2. Single-cell samples were pooled by tissue type (via FASTQ file concatenation) prior to being run through Splice-Break2. Initial read numbers for each study were obtained from the stats file (stats.txt); all other metrics used in this analysis regarding the mtDNA read % for the “Top 30” common deletions were obtained from the “<sample > _LargeMTDeletions_DNAorRNA_Top30_NARpub.txt” files output from Splice-Break2.

Identification of FASTQ reads spanning the 6335–13999, 7816–14807, and 8471–13449 breakpoints

We utilized the Unix command “grep” to identify reads that spanned the three deletion breakpoints described. For 6335–13999, we searched for a string containing the deletion breakpoint and ten bases upstream and downstream from that break (CTCCGTAGACCTAACCTGAC). For the 7816–14807 deletion, we searched for an 18 bp string (TCATCGACCTCCCCACCC) first, followed by strings ATCCTAGT and GAAACTTCGG, which was necessary for specificity. For the 8471–13449 deletion, we searched for a string containing the deletion breakpoint and fifteen bases upstream and downstream from that break (CAAACTACCACCTACCTCCCTCACCATTGG).

Cortical layer imputation

We analyzed two spatial transcriptomics datasets, one from the SpatialLIBD project26 and the other containing internal (USC) samples (GSE226663). Spatial transcriptomics datasets were processed using the 10x Genomics Visium pipeline instructions (https://support.10xgenomics.com/spatial-gene-expression/software/overview/welcome). Scripts for spatial transcriptomics processing and analysis are available on https://github.com/aomidsalar/RNA-Seq_Splice-Break2/.

To evaluate mtDNA deletions in cortical layers using spatial transcriptomics data that did not have stereological assessment, we conducted cortical layer imputation using the SpatialLIBD results as “ground truth” since that dataset included histology. First, we performed Seurat60 clustering starting with “filtered_feature_matrix.h5” files on a pool of our eight USC spatial samples along with four SpatialLIBD samples (151673–151676) to generate 10 clusters, as this resulted in the most similar cortical layer architecture to the SpatialLIBD annotation. We performed further sub-clustering of cluster 3 (using the Seurat60 command “FindSubCluster”, to generate clusters 3_0 and 3_1), which gave us further distinction between cortical layers 2 and 3. The sample barcodes and their assigned clusters were then used to generate cluster-specific BAM files using the script “split_spatial_bam_per_cluster.py” (provided in our GitHub repository). We compared the proportion of spots that appeared in the eleven Seurat clusters to the seven “ground truth” designations (white matter and cortical layers 1–6) to assign each Seurat cluster to its corresponding imputed layer. We assigned clusters that had over 40 percent of the spots overlapping with “ground truth” annotations to that given cortical layer and omitted clusters 8 and 9 which had an overall frequency of less than 5 percent. BAM files were converted to FASTQ files using the 10x Genomics bamtofastq tool (https://support.10xgenomics.com/docs/bamtofastq); the cluster-specific FASTQ files were concatenated (when required) to make imputed cortical layer FASTQ files for each sample, which were then run through Splice-Break2.

To find spot barcodes that contained either the 6335–13999 deletion or 8471–13449 deletion, we used the Unix tool “grep” to search for the strings containing the deletion breakpoints (see Methods section “Identification of FASTQ Reads Spanning the 6335–13999, 7816–14807, and 8471–13449 Breakpoints”). As a note, the 10x Genomics bamtofastq tool will output 3 types of files for each sample: (1) a “Read 1” file, which contains the UMI and barcode; (2) a “Read 2” file, which contains the cDNA sequence; and (3) an “Index” file, which contains the sample index. Reads that contained each of these deletions (from “grep” results of Read 2) were filtered and cross-referenced with matching headers in the Read 1 FASTQ to output their corresponding barcode sequences; barcodes were then stored in separate csv files (this was done using the script “grepfilestodeletionbarcodes.sh”; https://github.com/aomidsalar/RNA-Seq_Splice-Break2), and these barcodes were used for deletion analyses. Two proportion Z-tests were performed by calculating the proportion of spot barcodes in each imputed cortical layer that contained a “deletion spot”; spot counting and statistics were done manually in Microsoft Excel.

Graphical analysis

All figures in this study (Figs. 18, Supplementary Fig. 19) were made using R (v.4.1.3); all tables were made in Microsoft Powerpoint for Mac (v.16.61). Bar graphs, scatter plots, and boxplots (Figs. 1, 3–78d; Supplementary Figs. 1–7, 9) were made using the “ggplot2” package (v.3.3.5)61. All boxplots show the median as a solid black line; the first and third quartiles are captured by the bounds of the box. Boxplot whiskers are defined as the first and third quartiles ± interquartile range times 1.5, respectively, and outliers are denoted as points. The heatmap (Fig. 8a) was made using the “ComplexHeatmap” package (v.2.10.0)62. Spatial images (Fig. 8b, c; Supplementary Fig. 8) were made according to the “Secondary Analysis in R” script available on the 10x Genomics website (https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/rkit?src=event&lss=tradeshow&cnm=ts-2020-02-08-event-ra_g-keystone-banff-amr&cid=NULL). Multiple sequence alignment plots for the 6335–13999, 7816–14807, and 8471–13449 deletions (Fig. 2b–d) were generated using the ClustalOmega multiple sequence alignment tool on the EMBL-EBI website (https://www.ebi.ac.uk/Tools/msa/clustalo/) and the “ggmsa” R command (“ggmsa” package v.0.0.6)63.

Fig. 1: Correlations and comparisons of mtDNA deletions captured by DNA vs RNA-Seq.
figure 1

RNA-Seq data from 30 brain samples20 was processed through the Splice-Break2 pipeline and compared to results using the traditional mtDNA-enrichment and DNA sequencing approach13,55. All analyses included correlations and relative abundance of deletion reads detected in each sample (not normalized) and the deletion read rate for each sample (normalized). a The sum of the Top 30 deletions. The three most common deletions: (b) 6335–13999, (c) 7816–14807, and (d) 8471–13449. Spearman and Pearson’s correlations are shown. Statistical values for box plots are from Welch’s t-tests. All p-values were corrected for multiple tests using Bonferroni. All boxplots show the median as a solid black line; the first and third quartiles are captured by the bounds of the box. Boxplot whiskers are defined as the first and third quartiles ± interquartile range times 1.5, respectively, and outliers are denoted as points.

Fig. 2: MtDNA deletions captured by RNA-Seq.
figure 2

a The Top 30 most frequent mtDNA deletions evaluated in this study and previously described13. Multiple sequence alignment (MSA) plot of RNA-Seq reads from a dataset of 30 brain samples20 containing the (b) 6335–13999 deletion, (c) 7816–14807 deletion, and (d) 8471–13449 deletion.

Fig. 3: Sequencing metrics of 14 RNA-Seq datasets evaluated for mtDNA deletions.
figure 3

ah USC control sample sequenced by three library preparation methods: bulk RNA-Seq with ribosomal depletion, bulk RNA-Seq without ribosomal depletion (polyA), and spatial transcriptomics (10x Visium platform). il All GEO+ samples (n = 463); (m–p) all GTEx samples (n = 1107). Definitions: Total RNA-Seq Reads = FASTQ reads prior to alignment; MT Benchmark Coverage = average mitochondrial sequencing depth measured from two 250 bp segments within the RNR1 and CYB genes13; Deletion Read Rate = deletion reads/MT Benchmark Coverage. MTG (middle temporal gyrus); AM (amygdala); SN (substantia nigra); TL (temporal lobe); DLPFC (dorsolateral prefrontal cortex); CER (cerebellum); HIPP (hippocampus); PFC (prefrontal cortex); VTA (ventral tegmental area); LCM (laser capture microdissection); PD (Parkinson’s Disease); CTRL (control); AD (Alzheimer’s Disease); SCZ (schizophrenia); BD (bipolar disorder); MDD (major depressive disorder). Abbreviations for GTEx tissues are shown on figure.

Fig. 4: Analyses of age in GEO+ samples.
figure 4

Correlations between the Top 30 cumulative deletion, 6335–13999, 7816–14807, and 8471–13449 deletion read rates with age in brain20,21 and skeletal muscle22,23. a DLPFC samples (n = 30)20; (b) skeletal muscle samples (n = 36)22; (c) skeletal muscle samples (n = 30)23. df Stanley Neuropathology Consortium data21. d cerebellum samples (n = 58); (e) hippocampus samples (n = 58); f prefrontal cortex samples (n = 58). P-values shown from linear regression models for Deletion ~ Age, and include MT benchmark coverage and sex as co-variates. P-values in (b) also included diagnosis as a co-variate. All p-values were corrected for multiple tests using Bonferroni.

Fig. 5: Analyses of age in GTEx samples.
figure 5

Correlations between the Top 30 cumulative deletion and 8471–13449 deletion read rates with age in 11 GTEx tissues29. a The cumulative deletion read rate of the “Top 30” mtDNA deletions and (b) the 8471–13449 deletion. Tissues are from three paired datasets: 183 paired samples of cerebellum and cortex, 41 paired samples from multiple brain regions (i.e., amygdala, anterior cingulate cortex, caudate nucleus, frontal cortex, hippocampus, and substantia nigra), and 165 paired samples from non-brain regions (i.e., blood, liver, and skeletal muscle). P-values shown from linear regression models for Deletion ~ Age, and include MT benchmark coverage and sex as co-variates. All p-values were corrected for multiple tests using Bonferroni.

Fig. 6: Analyses of brain region and diagnosis in GEO+ samples.
figure 6

ad Comparisons between brain regions for the “Top 30” cumulative deletions, 6335–13999, 7816–14807, and 8471–13449 deletion in four studies18,21,24,25. el Comparisons of diagnosis in brain and muscle in six studies18,19,20,21,22,24. Bar graphs represent mean ± SEM. Sample size per tissue/diagnosis: (a) MTG (n = 23), AM (n = 23), and SN (n = 29)18; (b) SN (n = 10), VTA (n = 9)24; (c) dorsal SN (n = 7), ventral SN (n = 7)25; (d) CER (n = 58), HIPP (n = 58), PFC (n = 58)21; (e) PD+Dementia (n = 17), CTRL (n = 12)18; (f) PD (n = 5), CTRL (n = 5)24; (g) Basal YA (n = 12), Basal OA (n = 12), Basal PD (n = 12), Post Training PD (n = 5)22; (h) young CTRL (n = 8), old CTRL (n = 10), AD (n = 12)19; (i) SCZ (n = 15), CTRL (n = 15)20; (j) CTRL (n = 15), BD (n = 14), MDD (n = 15), SCZ (n = 14)21; (k) CTRL (n = 14), BD (n = 15), MDD (n = 14), SCZ (n = 15)21; (l) CTRL (n = 15), BD (n = 14), MDD (n = 15), SZ (n = 14))21. Statistical values for brain region tests are from Welch’s t-tests. P-values shown from linear regression models for Deletion ~ Diagnosis, and include MT benchmark coverage, age and sex as co-variates. All p-values were corrected for multiple tests using Bonferroni. MTG (middle temporal gyrus); AM (amygdala); SN (substantia nigra); VTA (ventral tegmental area); CER (cerebellum); HIPP (hippocampus); PFC (prefrontal cortex); TL (temporal lobe); DLPFC (dorsolateral prefrontal cortex); LCM (laser capture microdissection); YA (young adult); OA (older adult); PD (Parkinson’s Disease); CTRL (control); AD (Alzheimer’s Disease); SCZ (schizophrenia); BD (bipolar disorder); MDD (major depressive disorder). Symbols: (^p < 0.05); (*p < 0.01); (**p < 0.001); (***p < 0.0001).

Fig. 7: Analysis of brain region and tissue in GTEx samples.
figure 7

ac Comparisons between brain regions and tissues for the “Top 30” cumulative deletions, 6335–13999, 7816–14807, and 8471–13449 deletion in paired GTEx datasets29. Bar graphs represent mean ± SEM. d, e Comparisons across all 11 GTEx tissues. Sample size per tissue: (a) paired samples from cerebellum and cortex (n = 183 ea.); (b) paired samples from amygdala, anterior cingulate cortex, caudate nucleus, frontal cortex, hippocampus, and substantia nigra (n = 41 ea.); (c) paired samples from blood, liver, and skeletal muscle (n = 165 ea.). d the “Top 30” cumulative deletions for all GTEx tissues and matrix of p-values for individual comparisons. e The 8471–13449 deletion for all GTEx tissues and matrix of p-values for individual comparisons. Statistical values for paired tests (ac) are from repeated measures ANOVA. Statistical values for individual tissue comparisons (d, e) are from Welch’s t-tests. All p-values were corrected for multiple tests using Bonferroni. All boxplots show the median as a solid black line; the first and third quartiles are captured by the bounds of the box. Boxplot whiskers are defined as the first and third quartiles ± interquartile range times 1.5, respectively, and outliers are denoted as points. CER (cerebellum); CORT (cortex); FC (frontal cortex); HIPP (hippocampus); ACC (anterior cingulate cortex); AM (amygdala); CAUD (caudate nucleus); SN (substantia nigra); SM (skeletal muscle).

Fig. 8: Mitochondrial deletions in spatial transcriptomics and cortical layers.
figure 8

a Heatmap showing proportion of spots corresponding to ground truth designations of four replicate SpatialLIBD26 samples (151673–151676) vs. Seurat clusters after integrated clustering of 12x sections (i.e., 4x SpatialLIBD DLPFC + 8x USC MTG). Clusters selected are outlined in white and imputed layers are labeled. b Spatial image of one SpatialLIBD sample (151673) colored by imputed cortical layers from (a). c Annotated H&E images and matching spatial images with imputed cortical from USC MTG dataset. Percentage of spots (mean ± SEM) for each imputed layer that contained (d) the 6335–13999 deletion or (e) the 8471–13449 deletion. d, e Percentages shown are from all 12x sections (4x SpatialLIBD DLPFC + 8x USC MTG). Letters above bar graphs describe significant differences between cortical layers from two-proportion Z-tests. Layers represented by different letters: p < 0.05. L1 (Layer 1); L2 & L3 (Layers 2 + 3); L3 (Layer 3); L5 (Layer 5); L6 (Layer 6); WM (White Matter).

Statistics and reproducibility

Statistical analyses were performed in both R and Microsoft Excel. Spearman’s and Pearson correlation tests were done in R using the command “cor.test”. Linear regression calculations were made using the “lm” command and ANOVA test were done using the command “aov”; t-tests were done using the “t.test” command in RStudio or “TTEST” formula in Microsoft Excel. Equality of variances was testing using the FTEST formula in Microsoft Excel. Two-proportion Z-tests were calculated manually in Microsoft Excel. All statistical tests where four deletion metrics were evaluated include multiple comparisons corrections using the Bonferroni method; this was done using the command “p.adjust” with option “method = “bonferroni” in R.

All sample sizes for the RNA-Seq studies are shown in the figure legends and in Tables 1 and 2. All replicates analyzed were from different subjects, apart from GSE140089 that included PD subjects pre- and post-exercise. All deletion metrics for GEO+ datasets are shown in Supplementary Data 2.

Table 1 Summary of 12 GEO + RNA-Seq datasets evaluated for mtDNA deletions
Table 2 Summary of GTEx RNA-Seq datasets evaluated for mtDNA deletions

Scripts and data processing details

All scripts and details on the above data processing steps are available on GitHub (https://github.com/aomidsalar/RNA-Seq_Splice-Break2). This includes the following sections: (1) “Data_Availability_and_Accession”, which includes references to datasets analyzed in this study and their corresponding accession information; (2) “Processing_FASTQ_from_GEO”, which includes scripts used to format and run single-end and paired-end FASTQ files through Splice-Break2; (3) “Grep_for_Deletions”, which includes scripts used to isolate reads containing the 6335–13999, 7816–14807, or 8471–13449 deletion reads from FASTQ files; (4) “10xSpatial_DataProcessing”, which contains scripts for generating cluster-specific BAM files and identifying barcode sequences for reads containing 6335–13999 or 8471–13449 mtDNA deletions; and (5) “Spatial_Processing_Seurat”, which contains R scripts used to perform Seurat clustering and create annotated tissue images. The GitHub repository for this paper (https://github.com/aomidsalar/RNA-Seq_Splice-Break2) and the Splice-Break2 tool repository (https://github.com/brookehjelm/Splice-Break2) both contain a document for “RNA-Seq Best Practices for Splice-Break2” to help guide users with workflow and command lines.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Common mtDNA deletions in DNA vs RNA sequencing

We used a paired dataset of 30 postmortem brain samples that underwent both DNA and RNA-Sequencing20 to explore Splice-Break2’s utility in analyzing common mtDNA deletions from RNA-Seq data (Fig. 1). These samples came from DLPFC of patients with schizophrenia (SCZ) and healthy controls (CTRL). MtDNA-enriched DNA sequencing was performed as previously described13,55. This DNA dataset had an average of 3.47 million reads per sample, with 90.31 ± 5.2% alignment to the revised Cambridge Reference Sequence (rCRS) version of the human mitochondrial genome (NC_012920.1)64. This resulted in an average MT Benchmark coverage (i.e., mitochondrial depth) of 23,466x. MT benchmark coverage is a measure of the average sequencing depth of the sample, measured from two 250 bp segments within the RNR1 and CYB genes13. RNA-Sequencing was performed as described in Zeppillo et al. 20. This RNA-Seq dataset had an average of 114.5 million reads per sample, with 16.17±5.9% alignment to the mitochondrial genome (NC_012920.1)64. Spearman and Pearson’s correlations between the Splice-Break2 results from the DNA and RNA datasets were performed for three of the “Top 30” most common mtDNA deletions, as well as the summation (cumulative read %) of all these 30 deletions together (Fig. 1). We also evaluated the summation of additional deletions of high frequency (n = 112), but those cumulative read %’s did not have a significant correlation between DNA and RNA (Supplementary Table 1), so we focused exclusively on the “Top 30” deletions for this investigation. The deletion read %’s for the “Top 30” deletions and the cumulative deletion read rate for every sample and this paired RNA/DNA dataset is provided in Supplementary Data 1.

The cumulative deletion read % of these “Top 30” had a significant and positive correlation between the DNA and RNA-Seq datasets after each was normalized to the amount of mitochondrial data (i.e., MT benchmark coverage) (Spearman’s rho. 0.567, p = 4.33e-3; Pearson’s cor. 0.454, p = 0.0468) (Fig. 1a). Not surprisingly, the DNA dataset had a significantly higher cumulative deletion read % than the RNA-Seq dataset (p = 1.09e-10), with 124-fold higher levels in the DNA (Fig. 1a). Three individual deletions were also examined for DNA/RNA correlations: the 6335–13999 deletion, which was the most frequently detected deletion in our previous study and was present in 98.9% of samples there, the 7816–14807 deletion, which was the second most frequently detected deletion and was present in 92.5% of samples there, and the 8471–13449 deletion (known as “The Common Deletion”), which was the third most frequently detected deletion in our previous study and was present in 91.4% of samples there13. The 6335–13999 deletion did not have statistically significant correlations (Spearman’s rho 0.409, p = 0.0993; Pearson’s cor. 0.235, p = 0.845) after multiple comparisons corrections (Fig. 1b). The DNA dataset also had a significantly higher deletion read % for this deletion than the RNA-Seq dataset (p = 1.35e-8), with 512-fold higher levels in the DNA (Fig. 1b). The 7816–14807 deletion did not have a significant correlation between the DNA and RNA-Seq datasets when evaluating the ranks (Spearman’s rho 0.342, p = 0.256); however, the correlation of read values was significant (Pearson’s cor. 0.466, p = 0.0378) (Fig. 1c). The DNA dataset also had a significantly higher deletion read % for this deletion than the RNA-Seq dataset (p = 7.65e-6), with 196-fold higher levels in the DNA (Fig. 1c). The 8471–13449 deletion had a significant and positive correlation between the DNA and RNA-Seq datasets (Spearman’s rho 0.740, p = 1.21e-5; Pearson’s cor. 0.672, p = 1.88e-4) (Fig. 1d). Again, the DNA dataset had a significantly higher deletion read % than the RNA-Seq dataset (p = 5.54e-7), with 22-fold higher levels in the DNA (Fig. 1d). DNA consistently had a higher deletion read % than the RNA-Seq dataset for each of the “Top 30” deletions, and a higher percentage of the “Top 30” deletions were captured in the DNA data (Supplementary Fig. 1).

The “Top 30” mtDNA deletions breakpoints from Splice-Break2, along with their adjusted positions (aligned to the first base based on the repeat sequences), and the mitochondrial genes that encompass these 5′ and 3′ breakpoints are shown (Fig. 2a). Confirmation and visualization of the three mtDNA deletions analyzed was performed using “grep” and the ggmsa R package, respectively63. Reads that contained the sequence CTCCGTAGACCTAACCTGAC, a 20-bp string that encompasses the 6335–13999 breakpoint, aligned as expected to the COX1 and ND5 genes, with retainment of 1 copy of the repeat sequence (Fig. 2b). Reads that contained the sequence TCATCGACCTCCCCACCC an 18-bp string that encompasses the 7816–14807 breakpoint (and that also contained the sequences ATCCTAGT and GAAACTTCGG), aligned as expected to the COX1 and CYB genes, with retainment of 1 copy of the repeat sequence (Fig. 2c). Reads that contained the sequence CAAACTACCACCTACCTCCCTCACCATTGG, a 30-bp string that encompasses the 8471–13449 breakpoint, aligned as expected to the ATP8 and ND5 genes, with retainment of 1 copy of the repeat sequence (Fig. 2d).

To determine if nuclear-mitochondrial DNA segments (NUMTs) were contributing to our detection of common mtDNA deletions, and to assess if it was necessary to process RNA-Seq data with these sequences removed, we compared the mtDNA deletion results from all reads to only those that uniquely mapped to chrM (i.e., after HiSat258 alignment to the nuclear genome and downstream processing of unmapped reads) (Supplementary Fig. 2). The correlations between the mtDNA deletion reads detected (before normalization) was very high for both the “Top 30” sum and the three individual deletions evaluated when comparing all reads to only uniquely mapped reads (Spearman’s rho range: 0.81 to 0.999, p-value range: 2.40e-7 to 2.72e-42; Pearson’s cor range: 0.957–0.998, p-value range: 2.06e-20 to 5.10e-39). There was no significant difference in either the number of deletion reads detected or the deletion read % (after normalization) for any deletion metric (Supplementary Fig. 2). Additional analysis of the uniquely mapped reads from RNA-Seq data compared to the DNA dataset demonstrated that utilizing uniquely mapped reads did not significantly improve the correlations between RNA and DNA (Supplementary Fig. 3); thus, we concluded removal of nuclear mapped reads was not necessary and we processed all RNA-Seq reads through the Splice-Break2 pipeline for the remainder of this study to streamline our workflow. However, both approaches (i.e., using all reads or only chrM uniquely mapped reads) are described in our “RNA-Seq Best Practices for Splice-Break2” document on GitHub.

Total reads, mitochondrial data and common mtDNA deletions across 14 RNA-Seq studies

We processed 14 human RNA-Seq studies18,19,20,21,22,23,24,25,26,27,28,29 through the Splice-Break2 pipeline and evaluated the effects of library preparation method and sequencing depth (i.e., total read numbers and MT benchmark coverage) on our ability to capture common mtDNA deletions (Fig. 3). Our analysis included 12 previously published studies and 2 newly presented here. The 12 previously published studies we included are as follows (see Tables 1 and 2 for additional details): (1) Simchovitz et al. (GSE114517) study of ribosomal-depleted RNA-Seq libraries isolated from substantia nigra (SN), amygdala (AM), and middle temporal gyrus (MTG) brain tissue of patients with PD+Dementia and CTRL18; (2) Nativio et al. (GSE159699) study of ribosomal-depleted RNA-Seq libraries isolated from lateral temporal lobe (TL) tissue of patients with AD and CTRL19; (3) Zeppillo et al. study (GSE224683) of polyA (non-ribosomal depleted) total RNA libraries isolated from DLPFC brain tissue of patients with SCZ and CTRL20; (4) Kim et al. study (i.e., The Stanley Neuropathology Consortium) of polyA (non-ribosomal depleted) total RNA libraries isolated from cerebellum (CER), hippocampus (HIPP), and prefrontal cortex (PFC) of patients with bipolar disorder (BD), SCZ, major depressive disorder (MDD) and CTRL21,65; (5) Lavin et al. (GSE140089) study of polyA (non-ribosomal depleted) total RNA libraries isolated from skeletal muscle (SM) of patients with PD and CTRL22; (6) Tumasian et al. (GSE164471) study of polyA (non-ribosomal depleted) total RNA libraries isolated from SM of CTRL23; (7) Aguila et al. (GSE114918) study of laser-capture microdissection (LCM) SN and ventral tegmental area (VTA) neurons isolated from patients with PD and CTRL24; (8) Monzón-Sandoval et al. (GSE166024) study of LCM ventral and dorsal SN neurons isolated from CTRL subjects25; (9) Maynard et al. spatial transcriptomics sequencing of DLPFC brain tissue isolated from CTRL subjects26; (10) Enge et al. (GSE81547) analysis of 2544 single cells (alpha, acinar, beta, delta, ductal, mesenchymal, and unsure cell types) isolated from pancreas tissue of CTRL subjects27; (11) Darmanis et al. (GSE67835) analysis of 357 single cells (neurons, fetal quiescent, astrocytes, endothelial, oligodendrocyte precursor cells, and microglia cell types) isolated from fresh brain tissue28; and (12) Lonsdale et al. (i.e., GTEx) study, evaluating multiple brain regions, skeletal muscle, liver and blood29. Single-cell samples were pooled by cell type prior to Splice-Break2 analysis, resulting in 7 pools of pancreas cell types and 6 pools of brain cell types (Table 1). The two newly presented studies include one bulk sequencing dataset of postmortem MTG brain tissue with RNA libraries prepared both with (n = 1) and without (n = 1) ribosomal depletion, as well as one spatial transcriptomics dataset of MTG brain tissue from patients with PD (n = 2), AD (n = 2), and CTRL (n = 4) (GSE226663). One of our internal control samples was sequenced using all three preparation methods; this helped us compare the effect of library preparation method on our ability to capture mtDNA deletions without additional effects of tissue type or sequencing site. In total, we examined 1570 samples ranging in age from prenatal to 96 years across the 14 datasets.

We processed all 1570 samples (Fig. 3a–p), including the one internal (MTG) sample that had been processed three ways (Fig. 3a–h), through the Splice-Break2 pipeline and compared the following key metrics to determine which variables influenced common mtDNA deletion levels: total RNA-Seq reads, mitochondrial (MT) benchmark coverage, MT benchmark coverage per million reads, and the “Top 30” cumulative deletion read numbers. Total RNA-Seq reads are the initial reads from each sample prior to alignment. MT benchmark coverage is a measure of the average sequencing depth of the sample, measured from two 250 bp segments within the RNR1 and CYB genes13. The “Top 30” cumulative deletion read amount is the sum of the “Top 30” reads. The deletion read rates for the internal samples processed three ways are also shown (Fig. 3e–h), which are calculated by normalizing the number of reads for each mtDNA deletion to the MT benchmark coverage, and is the metric used to evaluate the effect of age, tissue, and diagnoses (see Figs. 47). The deletion read %’s for the “Top 30” deletions and the cumulative deletion read rate for every sample and all RNA-Seq studies is provided in Supplementary Data 2 for the GEO+ datasets; individual-level data is not shown for the GTEx dataset because of dbGaP restrictions.

We observed that samples that used a ribosomal depletion library preparation method (but had similar abundance of sequencing reads prior to alignment) had less MT benchmark coverage than polyA (non-ribosomal depleted) samples (Fig. 3a–c, i–k). Due to a reduced amount of mitochondrial data in the samples subjected to ribosomal depletion, we observed little to no common mtDNA deletions in these samples; conversely, for samples prepared without ribosomal depletion (i.e., bulk polyA or other), we consistently observed these common mtDNA deletions and more mitochondrial data overall (Fig. 3). Samples prepared by LCM showed the highest number of deletion reads (Fig. 3l); however, it is important to note that these LCM samples were all derived from aged human brain regions with high dopaminergic neuron innervation, the SN and VTA, so these high values may not solely reflect the sample preparation method. Evaluations of our internal control samples prepped by three methods further confirmed our observation of higher abundance of mitochondrial reads and deletions in polyA (non-ribosomal depleted) preparations and showed more MT data was captured by spatial transcriptomics than by bulk RNA-Seq (Fig. 3b–h). This is in alignment with previous observations that this spatial transcriptomics method (i.e., 10x Genomics Visium) may capture more “off-target” and/or non-polyadenylated reads compared to other methods26,66. Single-cell RNA sequencing (scRNA-Seq) samples did not contain the three mtDNA deletions followed in this study, so these samples were not evaluated further (Table 1, Fig. 3l). Taken together, we found that bulk sequencing with polyA (non-ribosomal depletion), spatial transcriptomics, and LCM methods were amenable to mtDNA deletion investigations, while scRNA-Seq and ribosomal depletion methods were generally inutile due to insufficient mitochondrial data.

Common mtDNA deletions and the effects of age and sex

MtDNA mutations and deletions have been positively associated with aging in various tissues such as the brain, skeletal muscle, and heart4,5,43,67,68,69,70. One of the most investigated large mtDNA deletions is the 8471–13449 “common deletion” (also referred to as the 4977 bp deletion), which was found to increase with age in the heart and brain of healthy adult patients43,69,70. In contrast with the nuclear genome, the mitochondrial genome is particularly vulnerable to DNA damage because it is located in close proximity to damaging free radicals and reactive oxygen species generated through the OXPHOS pathway, does not have the protection of histones, and has less efficient DNA repair machinery71,72,73. Over time, these factors can lead to the accumulation of mtDNA deletions within cells70,71,73.

Among the GEO+ datasets, we chose four studies for age analysis because of the wide age distributions of their samples20,21,22,23. In the DLPFC study20, we observed significant positive correlations between age and the “Top 30” cumulative deletion read rate (p = 0.0119) and the 8471–13449 “common deletion” (p = 0.0159) (Fig. 4a). In the SM study from PD and CTRL samples22, we similarly observed significant positive correlations between age and the “Top 30” cumulative deletion read rate (p = 0.0113) as well the 8471–13449 “common deletion” (p = 0.021; Fig. 4b). In the SM study of CTRL samples23, we did not observe a significant correlation between age and any of the four mtDNA deletion metrics evaluated (Fig. 4c); only ~1/2 of the samples were included in the aging analysis because we observed a significant “batch effect” in this study (Supplementary Fig. 4). In the Stanley Neuropathology Consortium dataset21, which contained CER, HIPP, and PFC, we observed significant positive correlations between age and the “Top 30” cumulative deletion read rate (p = 0.0131) and the 8471–13449 “common deletion” (p = 0.0193) in the PFC (Fig. 4f); these deletion metrics were also observed to increase with age in the CER and HIPP, but the correlations with age were not significant after multiple-comparisons corrections (Fig. 4d, e). We did not observe a significant correlation with age for the 6335–13999 or 7816–14807 deletion in any of these datasets (Fig. 4). In addition, none of these datasets exhibited significant differences between biological sexes for any of the four deletion read rate metrics (Supplementary Table 2).

Age analysis for the GTEx RNA-Seq data was also performed on a subset of available samples, including 183 paired samples of cerebellum (CER) and cortex (CORT), 41 paired samples from multiple brain regions (i.e., amygdala (AM), anterior cingulate cortex (ACC), caudate nucleus (CAUD), frontal cortex (FC), hippocampus (HIPP), and substantia nigra (SN)), and 165 paired samples from non-brain regions (i.e., blood, liver and skeletal muscle (SM)) (Fig. 5 and Supplementary Fig. 5). In the paired CER and CORT dataset, we observed significant positive correlations between age and the “Top 30” cumulative deletion read rate (p = 0.00238) and the 8471–13449 “common deletion” (p = 5.88e-4) in the CORT, and also a significant positive correlation between age and the “Top 30” cumulative deletion read rate (p = 0.0342) in the CER (Fig. 5). The other common deletions evaluated were not significant for these brain regions after multiple comparisons corrections (Fig. 5 and Supplementary Fig. 5). In the paired analysis of 6 brain regions, we observed significant positive correlations between age and the mtDNA deletion metrics for the AM, CAUD and SN; we did not observe significant correlations with age in the ACC, FC or HIPP after multiple comparisons corrections (Fig. 5 and Supplementary Fig. 5). Significant age correlations in the AM included the 8471–13449 deletion (p = 0.0172) and the “Top 30” cumulative deletion read rate (p = 0.0351). Significant age correlations in the CAUD included the 6335–13999 deletion (p = 1.94e-3), the 7816–14807 deletion (p = 0.0358), the 8471–13449 deletion (p = 1.11e-5), and the “Top 30” cumulative deletion read rate (p = 1.20e-4). Significant age correlations in the SN included the 8471–13449 deletion (p = 0.00896), and the “Top 30” cumulative deletion read rate (p = 0.0182). Lastly, in the paired analysis of blood, liver and SM, we only observed significant and positive correlations between age and the mtDNA deletion metrics for SM, specifically for the 8471–13449 deletion (p = 0.0210), and the “Top 30” cumulative deletion read rate (p = 0.0210) (Fig. 5). Overall, we were able to recapitulate previously published findings that common mtDNA deletions increase with age in the brain and muscle, and we provide evidence from paired datasets that different brain regions and tissues have variable susceptibility to this increase, some of which, to our knowledge, is illustrated for the first time.

Common mtDNA deletions across brain regions, tissues, and the effect of diagnosis

Evaluation of the MTG, AM, and SN dataset18 of PD and CTRL subjects revealed no significant differences in “Top 30” cumulative deletion read % in the SN compared to the MTG or AM for any of the deletion metrics, after multiple comparisons corrections (Fig. 6a). However, the SN did display higher levels and it should be emphasized that this study used a ribosomal depletion method for library preparation, so it is perhaps not surprising we were unable to detect a significant difference since so little mitochondrial data is captured with that RNA-Seq method. The SN and VTA LCM dataset24 from PD and CTRL samples (Fig. 6b) and the dorsal and ventral SN LCM dataset25 of CTRL subjects (Fig. 6c) showed no significant differences in any of the four mtDNA deletion metrics we evaluated. The Stanley Neuropathology Consortium dataset21 of CER, PFC, and HIPP tissue showed significantly higher deletion read %’s in the HIPP compared to CER and PFC for the 7816–14807 deletion (p ≤ 0.01324) and the “Top 30” cumulative deletion read rate (p ≤ 3.12e-12; Fig. 6d). The HIPP had significantly higher deletion read %’s than the CER for the 6335–13999 deletion (p = 0.00636) and the 8471–13449 deletion (p = 2.92e-5) (Fig. 6d). The 8471–13449 deletion were also significantly higher in the PFC than in the CER (p = 0.0261) (Fig. 6d).

In general, we did not detect significant differences based on diagnosis in any of the GEO+ datasets we evaluated. The one exception to this was the SN and VTA LCM dataset24 from PD and CTRL samples had significantly higher levels of the 7816–14807 deletion in the CTRL subjects in both SN (p = 0.0109) and VTA (p = 0.0064) (Fig.6f and Supplementary Fig. 6). It should be noted that all of these studies had a small sample size per diagnostic group (range: n = 5–17).

Analysis of GTEx included tissue comparisons of the three paired datasets used in our aging analysis (Fig. 5). In paired CER and CORT, the CORT had significantly higher amounts of all four deletion metrics (6335–13999 deletion (p = 5.72e-19); 7816–14807 deletion (p = 1.43e-15); 8471–13449 deletion (p = 3.86e-33); and the “Top 30” cumulative deletion read rate (p = 7.68e-41) (Fig. 7a). Across the six paired brain regions, deletion levels were also significantly different (6335–13999 deletion (p = 1.46e-17); 7816–14807 deletion (p = 2.29e-15); 8471–13449 deletion (p = 6.64e-20); and the “Top 30” cumulative deletion read rate (p = 3.38e-21) (Fig. 7b). Similarly, in paired blood, liver and SM, all four deletion metrics were significantly different (6335–13999 deletion (p = 1.56e-5); 7816–14807 deletion (p = 5.99e-13); 8471–13449 deletion (p = 1.18e-10); and the “Top 30” cumulative deletion read rate (p = 9.13e-38) (Fig. 7c). Cross comparisons of all GTEx tissues revealed the following results: (1) blood had the lowest deletion read % and was significantly lower than all other tissues; (2) CER had significantly lower deletion levels than all other brain regions; (3) liver had significantly lower deletion levels than SM and cortical brain regions; (4) SM had significantly lower deletion levels than some (but not all) brain regions; (5) brain regions with dopaminergic neurons (i.e., the CAUD and SN) had remarkable enrichment of common mtDNA deletions, with significantly higher levels than all other tissues. The SN had by far the highest mtDNA deletion levels and was likewise significantly higher than all tissues including the CAUD. All tissue comparisons and p-values are shown in Fig. 7 and Supplementary Fig. 7.

Common mtDNA deletions across imputed cortical layers in spatial transcriptomics data

To investigate the common mtDNA deletion metrics across (imputed) cortical layers, we used 12 spatial transcriptomics samples from two separate human brain datasets. The two spatial transcriptomics datasets include the Maynard et al. “SpatialLIBD” dataset26, from which we included 4x section replicates of DLPFC tissue from a single healthy control subject (sample numbers 151673–151676), and a new internal dataset of 8x spatial transcriptomics samples of MTG tissue taken from duplicate sections of 4 patients (diagnosed as AD, PD, or CTRL). Both datasets were prepared according to 10x Genomics’s Visium spatial transcriptomics library preparation and sequencing protocols. The SpatialLIBD samples were included to assist in cortical layer imputation since they contained “ground truth” measures of cortical layer geography based on stereology/imaging analysis26. We performed Seurat60 clustering on all 12 spatial samples from both studies and imputed cortical layers based on the Seurat clusters that had the highest overlap with the SpatialLIBD ground truth annotations; this resulted in the following imputed cortical layers: “White Matter”, “Layer 1”, “Layer 2 + 3”, “Layer 3”, “Layer 5”, and “Layer 6” (Fig. 8a–c).

To determine the proportion of spots in each cortical layer that contained the 6335–13999 and/or 8471–13449 mtDNA deletion, we needed the spot barcode data (which is not output as part of the Splice-Break2 process). As such, we used “grep” to identify reads that spanned these breakpoints, as described previously, and then determined which cortical layer these reads mapped to using the spot barcode. The percentage of spots that contained the 6335–13999 deletion (Fig. 8d) or the 8471–13449 deletion (Fig. 8e) for all 12 samples was analyzed for each imputed cortical layer. We observed that the 6335–13999 deletion had a significantly higher percentage of spots (two-proportion Z-test) in imputed cortical layers 3 and 5 compared to the white matter (p ≤ 0.0066; Fig. 8d). The 8471–13449 “common deletion” also had significantly higher geographical representation in cortical layers 3 and 5 compared to white matter (p ≤ 3.30e-5; Fig. 8e). These spots containing deletions were also annotated on tissue images for visualization (Supplementary Fig. 8). Further, we observed similar trends in the percentage of spots with the 6335–13999 or 8471–13449 deletion when analyzing just the USC internal samples (Supplementary Fig. 9). These 8x spatial samples were split according to their imputed cortical layer (i.e., using “cluster BAMS” from Seurat60), and each imputed layer was run through Splice-Break2; we found a significant difference between white and grey matter for the 8471–13449 “common deletion” (p = 0.0393; Supplementary Fig. 9). Taken together, these results suggest cortical layers have different levels of common mtDNA deletions in the aged human brain and are most abundant in layers 3 and 5 where there is an enrichment of pyramidal neurons26,74,75.

Discussion

The aim of this study was to determine whether RNA-Seq data can be used to evaluate common mtDNA deletion levels in somatic tissues using the bioinformatics tool Splice-Break2. Initial analyses were done on a paired dataset where there was DNA and RNA-Seq data available; we observed robust differences in DNA and RNA deletion read %, with the DNA data containing at least 22-fold higher deletion read rates. We hypothesize that the decreased rate of these common deletions in RNA-Seq data may be due to transcript abundance or stability- that mitochondrial genomes with large deletions may transcribe polycistronic transcripts less efficiently than wild-type molecules, or that RNA transcripts containing deletion breakpoints may be less stable, or both. It is worth mentioning that polyadenylation of mitochondrial transcripts serves a different role than it does for nuclear-encoded genes where 3′ polyadenylation helps stabilize mRNA molecules and promotes protein translation; this also occurs for human mitochondrial-encoded genes, but transcripts can also be polyadenylated at many intragenic positions and are subject to polyadenylation-dependent RNA degradation mechanisms that are conserved features of bacteria (the evolutionary origin of mitochondria)76,77,78. In addition, the deletion read %’s in the DNA data are likely increased due to the long-range PCR amplification used for mitochondrial enrichment and the smaller size of the deleted genomes.

The mtDNA deletion metrics that were examined in this study were the cumulative read % of the “Top 30” most frequent deletions that we identified and Sanger-validated in our previous study, and the individual read % of three specific mtDNA deletions: 6335–13999 (adjusted position 6341–14005), 7816–14807 (adjusted position 7814–14805), and 8471–13449 (the “common deletion”; adjusted position 8482–13460). The cumulative read % of the “Top 30” deletions and the 8471–13449 deletion had statistically significant positive correlations between the DNA and RNA-Seq paired dataset examined, with the “common deletion” having the strongest correlations between methods. Future studies may include additional paired datasets (DNA, RNA-Seq, and other genomics methods) and analyses of less common mitochondrial deletions. Analysis of patients with large clonal deletions will also be of interest. The Splice-Break2 pipeline does output additional breakpoints besides the “Top 30” most common deletions, but those metrics should be used with caution for RNA-Seq data and may require further validation (such as qPCR and/or Sanger sequencing).

All Splice-Break2 data is reported as “deletion read %”, which is the number of deletion reads detected divided by the MT Benchmark coverage; however, this should not be interpreted as an absolute estimation of heteroplasmy rate (for DNA or RNA-Seq) because of inherent biases of these NGS methods. In our original methods article, we did observe a significant positive correlation between the Splice-Break2 deletion read % and qPCR quantification of the 8471–13449 “common deletion”, and here we show the deletion read %’s can correlate between DNA and RNA-Seq datasets- thus, we conclude that although absolute heteroplasmy rates cannot be inferred, the fold differences between samples is retained and allows us to use deletion read % for statistical analyses of age, tissue type and diagnoses. Future studies that pair mitochondrial/metabolic assays with NGS measurements will be important to understand the physiological relevance of these deletions and at what threshold deletion read %’s impact cellular function.

Splice-Break2 was used to examine various library preparation methods, such as bulk RNA-Seq, LCM RNA-Seq, spatial transcriptomics, and scRNA-Sequencing. We observed that bulk sequencing without ribosomal depletion allowed for more consistent capturing of mtDNA transcripts and deletions compared to bulk sequencing with ribosomal depletion. Additionally, the scRNA-Seq studies we evaluated had minimal to no mtDNA deletion capture despite containing comparable amounts of mtDNA transcripts (MT Benchmark coverage) compared to the other studies evaluated. Analysis of additional scRNA-Seq datasets and methods, including LCM, along with other non-RNA single-cell genomics methods (such as scATAC-Seq), is warranted in the future. The LCM RNA-Seq samples we evaluated contained the highest deletion read rates; however, it is important to note that these samples originated from aged brain tissue from regions innervated with dopaminergic neurons (SN and VTA), which have a demonstrated significant increase in mtDNA deletion burden79,80. Therefore, we cannot determine from these LCM datasets if the increased mtDNA deletion levels are strictly due to brain region, library preparation method, or are a combination of both effects. Overall, these data indicate that the most valuable RNA-Seq wet lab protocols for mtDNA deletion detection include bulk sequencing without ribosomal depletion (e.g., polyA), LCM RNA-Seq, and spatial transcriptomics.

Our analysis of aging, differences among brain regions, and diagnostic effects revealed trends that were consistent with the effects of mitochondrial dysfunction reported in current literature2,3,4,5,9,79,80. It has been reported in several studies that mtDNA deletions accumulate in some tissues as they age, and studies on the “hallmark” mtDNA deletion disorders (e.g., KSS, Pearson’s Syndrome) have proved these deletions can have functional (and even lethal) consequences if at a high enough threshold to affect cellular function36,37. The age-dependent increase in mtDNA deletions has been observed in various somatic tissues including the brain; in the substantia nigra, high mtDNA deletions have been linked to age-related and disease-related (i.e., PD) neuronal loss43,67,79,80. We observed statistically significant increases in the mtDNA deletion metrics due to age in brain and/or muscle datasets. Our most robust evaluation of aging was performed using paired RNA-Seq data from GTEx. Overall, we were able to recapitulate previously published findings that common mtDNA deletions increase with age in the brain and muscle, and show that different brain regions and tissues have variable susceptibility to this age-dependent increase.

We were able to observe some interesting differences in tissues and brain regions in the paired GTEx dataset, where common mitochondrial deletions were lowest in blood, lower in liver than skeletal muscle, higher in cortex than in cerebellum, and high (but variable) across cortical brain regions. In both GEO+ and GTEx datasets, we observed increased mtDNA deletions in brain regions containing dopaminergic neurons (SN, VTA and caudate nucleus); however, we did not detect an increase in PD. This is consistent with previous reports that have found an increase in SN tissue but no significant increase in PD specifically79,81,82,83. It would be interesting to investigate a larger cohort of patient samples taken at various stages of PD and/or compare these results neuropathological measures of dopaminergic cell death (e.g., SN depigmentation scores), as this data suggests the age effects in PD SN may not be linear. Measures of cell type abundance (e.g., histology, in situ gene expression, cell counts from flow cytometry or CBC tests, or single-cell RNA-Seq data) from paired tissue (preferably the same dissection/sample) maybe also improve analyses of age and diagnosis if that data can be obtained and used as a co-variate.

Our common mtDNA deletion analysis of spatial transcriptomics data, specifically across (imputed) cortical layers of the DLPFC and MTG, revealed increased deletion burden in grey matter compared to white matter, which is not surprising given its high metabolic activity74,84,85. Cortical layers 3 and 5 consistently contained the highest percentage of “spot barcodes” with mtDNA deletions. This supports the hypothesis that brain regions and cortical layers are differentially susceptible to mtDNA damage69,79,80,86. Future studies investigating the effect of mtDNA damage in neurodevelopment, psychiatric disorders or neurodegenerative diseases may want to focus on spatial transcriptomic approaches so that analyses can focus on specific cortical layers and increase the signal-to-noise ratio for these tests.

In summary, this robust analysis of multiple, highly used RNA-Seq methods demonstrated the utility of detecting mtDNA deletions of high frequency through our bioinformatics pipeline. The Splice-Break2 tool was effective in quantifying common mtDNA deletions in polyA (non-ribosomal depleted) bulk sequencing, LCM, and spatial transcriptomic datasets, and these datasets may also be amenable to other bioinformatics approaches of mtDNA deletion quantification. Of note, the current version of Splice-Break2 is only compatible with human data, as the alignment and annotations utilize the rCRS64, our catalogue of human mtDNA deletion breakpoints13, and MitoMap87, respectively. With the wide breadth of publicly available and restricted human RNA-Seq datasets that encompass a variety of tissues, diseases, and experimental/environmental conditions, the ability to incorporate mtDNA deletion metrics into these investigations may provide information about metabolic effects in tissues and their contributions to disease phenotypes and aging.