Introduction

Mitochondria are cytoplasmic organelles that perform major roles in cell operation, including energy generation through oxidative phosphorylation (OXPHOS), cell death, calcium levels control, lipid homeostasis and metabolic cell signaling1. These organelles have their own genome (mtGenome), with 16,569 bp of length and 37 genes, of which 13 are protein-coding genes involved in OXPHOS, 22 are transfer RNA (tRNA) genes and two are ribosomal RNA (rRNA) genes2,3. It also presents a non-coding control region known as displacement loop (D-loop), essential for replication and transcription regulation4. There are many copies of mitochondria in each cell and such copies may present different alleles for the same variant, a state called heteroplasmy.

It is well-known that mitochondrial DNA (mtDNA) is more susceptible to alterations in comparison to nuclear DNA5,6,7. These alterations may lead to mitochondrial dysfunction, which in turn may account for cellular deregulation due to DNA repair defects, leading to the development of different diseases, such as cancer8. It is notable that hallmarks of cancer (i.e. abilities acquired by tumor during carcinogenesis to survive and proliferate) include energy deregulation and evasion of cell death, both directly related to mitochondrial function9. In fact, many studies have shown an association of mtDNA instability and heteroplasmy to different types of cancer4,10,11,12,13,14,15.

One of these types of cancer is gastric cancer (GC), which is currently the fifth most incident and the third most lethal type of cancer worldwide16. In Brazil, GC is also one of the most frequent and aggressive types of cancer, being the sixth most incident and the fifth most lethal cancer16. This is even more alarming in the North region of Brazil, where GC is the second most incident type of cancer among men and the fifth among women17, probably due to eating habits and genetic background of the population.

Regarding genetic ancestry, it is important to highlight that, in addition to nuclear DNA, mtDNA also provides such information through haplogroups, which are basically groups of haplotypes. These provide information about women migration and have also been associated to development and outcome of different diseases, including GC18,19,20,21.

Moreover, GC usually presents an unfavorable clinic evolution because of nonspecific symptoms at early stages, leading to late diagnosis and a poor prognosis22. Thus, it is crucial to search for genetic markers that would allow an earlier detection and improve patient outcome.

In the last decade, with the advent of new technologies such as Next-Generation Sequencing (NGS), one focus of oncologic research has been high-throughput analyses of human genome related to cancer development. However, to this date, not many studies have investigated the association of mtGenome alterations to cancer, especially GC. For instance, a recent review has pointed out 16 studies associating mtDNA alterations to gastric cancer, but none involved NGS approaches23.

In this study, we sequenced the whole mtGenome in order to assess and compare variants and their heteroplasmy levels in FFPE samples from gastric cancer patients (paired samples, i.e. samples of both tumor and non-tumoral tissues) and cancer-free individuals from the North region of Brazil. To the best of our knowledge, this is the first study to perform such analysis of gastric cancer in Brazil.

Results

After processing the mtGenome sequences with the adequate mapping scores (Q30), the average depths of coverage were 195x and 36x for tumor samples and internal control (IC) samples, respectively. For external controls (EC), the average depth of coverage was 351x. All samples with low quality were excluded. Table 1 presents the sample size of all groups for each analysis.

Table 1 Sample number of each group in the different comparison analyses.

Out of the 20 sequenced sample pairs of GC patients, only eight had both samples with enough coverage for a paired comparison. In another eight patients, only one of their samples (five tumors and three internal controls) had enough coverage to be included in further analyses. The remaining four pairs did not pass our quality standards and were excluded from all analyses. As for external controls, 47 samples had enough coverage to be included in the general analyses.

It is noteworthy that (i) mean age was of 66 years old in case group and 41 years old in EC group (Mann-Whitney test, p < 0.001); and that (ii) case group was composed of 77% male and 23% female, while control group was composed of 28% male and 72% female (Chi-squared test, p = 0.002).

Therefore, we evaluated if sex or age range (≥70, 60–69, <60 years-old) in the tumor samples were associated to the number of variants (Kruskal-Wallis Test; H 1.52, 3.18; P 0.467 and 0.075, respectively), primary tumor location (χ2 13.07, 10.18; P 0.364 and 0.117), tumor degree of differentiation (χ2 6.17, 0.82; P 0.187 and P 0.662), pathogenic staging pT (χ2 9.19, 5.68; P 0.514 and 0.339) and pN (χ2 8.44, 6.43; P 0.392 and P 0.169) and found no association. Similarly, we did not find any dependency of age range (≥60, 50–59, 40–49, 30–39, 20–29, <20 years-old) or sex to the number of variants of EC (Kruskal-Wallis Test; H 5.22, 0.03; P 0.390 and 0.856, respectively). Thus, age and sex were disregarded in further analyses.

Paired sample analyses

When we compared the mtGenome of each of the eight paired samples from gastric cancer patients, we found that, in most cases, tumor samples presented more variants than their respective internal control samples (up to 5.5-fold more variants). Table 2 shows the number of variants by pair and the genes in which most variants were found. The full list of variants and proportions is found in Supplementary Material (Table S1).

Table 2 Number of variants (N) and the genes in which most variants were found in each pair of tumor and internal control samples from the same individuals.

Considering exclusively single variants (i.e. each variant individually), there were 50 variants present only in tumor samples, while eight were present only in internal controls. These 50 variants were distributed in 17 regions, but mostly in four regions: MT-DLOOP1 (16%), MT-ND5 (16%), MT-ND4 (10%) and MT-DLOOP2 (10%). As for variants only found in internal controls, most of them were located in MT-ND5 (25%).

By grouping all paired samples from tumor and IC, we found 25 variants that were exclusive to tumor samples and eight variants that were exclusive to internal control samples, as well as 36 variants that were shared by both groups (Fig. 1). The distribution by region and position of these variants is shown in Supplementary Material (Table S2).

Figure 1
figure 1

Number of exclusive and shared variants found in the paired samples (tumor and internal control) of gastric cancer patients. Groups are represented by the darker dots. Set size is the sum of variants for each group.

Hence, tumor samples presented more than 3-fold the number of exclusive variants in comparison to their paired internal control samples. This notable number of variants that were exclusive to tumor was distributed in 13 genes, of which three (MT-ATP8, MT-ND4L and MT-TG) were only affected in this group. The variants exclusive to internal control samples were distributed in seven genes and the shared variants were distributed in 14 genes, of which MT-CO3 and MT-RNR1 were affected in both groups.

In addition, the genes with more variants were: (i) for tumor-exclusive variants, MT-DLOOP1 and MT-DLOOP2 (with 16% each); (ii) for variants exclusive to internal control, MT-ND5 (with 25%); and (iii) for variants shared by both groups, MT-DLOOP1 (with 22%) and MT-ND5 (with 19%).

When comparing heteroplasmy, six out of the eight pairs matched our standards and were included in this analysis. We found that, with one exception, tumor presented notably more heteroplasmic regions than their paired internal control (3 to 9-fold times more; Fig. 2).

Figure 2
figure 2

Heteroplasmy levels for each pair of samples from gastric cancer patients, both tumor (gastric adenocarcinoma) and internal control (control). Each boxplot represents the distribution of heteroplasmy in different mitochondrial genes. Non-applicable (NA) represents unidentified regions.

Among the heteroplasmic regions, the most commonly affected were MT-DLOOP1, MT-ND4 and MT-ND5. It is noteworthy that these three, along with MT-ND6, were heteroplasmic in all tumor samples, but not in all internal controls. Four other genes were heteroplasmic in almost all tumor samples, but they were not as frequently affected among internal control samples (MT-CO1, MT-CYB, MT-ND1 and MT-RNR2). Additionally, seven other regions were heteroplasmic only in the tumor (MT-CO2, MT-DLOOP2, MT-ND4L, MT-TC, MT-TK, MT-TM and MT-TT), and there were no genes that were heteroplasmic in the internal control group but not in the tumor group.

General heteroplasmy analysis

When comparing heteroplasmy levels between all groups of samples (tumor, internal control and external control), we included all tumor (N = 13), internal control (N = 9) and external control (N = 40) samples that met our quality control standards (see Table 1). In addition to the quality scores, only variants with at least 5% heteroplasmy levels were considered for this analysis, in order to control possible artifacts and false-positives due to the limited coverage of the samples24,25.

Using this criterion, we assessed heteroplasmy levels of all three groups. Aiming a fair comparison of heteroplasmy levels, we evaluated the presence of the same distinct 630 variants in all samples considering a heteroplasmy level of 1% when absent or under 5%.

The log heteroplasmy levels were compared using a mixed linear regression assuming a fixed effect of the group and a random effect of the individuals. This model was better fitted to the data than a model without the group effect (Wald test χ2 10.11; P 0.006). The pairwise comparison (with the p-value adjusted according to Tukey) showed that the tumor group presented a significantly higher level when compared to internal group (t 2.60; P 0.025) and external group (t 3.04; P 0.006) both with statistical power above 73%. These control groups did not present statistically significant difference between each other (t −0.42; P 0.906).

Considering the previous analysis, the amount of heteroplasmic variants may be more important to tumor development than heteroplasmy levels of variants per se. In fact, out of 1,664 heteroplasmic variants found after filtering, 653 variants were found in tumor subgroup (representing 39% of the total), 152 in internal control subgroup (9%) and 859 in external control subgroup (52%). These proportions are very expressive, considering that tumor samples account for about only 1/3 of the number of external controls and they still showed almost half of all heteroplasmic variants.

Further, individual’s number of heteroplasmic variants average (i.e. total of heteroplasmic variants divided by number of samples) was of 50.23 for tumor, 16.89 for internal control and 24.54 for external control. Thus, tumor samples presented nearly 3-fold more heteroplasmic variants than internal control samples (Fig. 3). Then, a Poisson regression considering group effects was applied and it showed a significant difference between these groups (ANOVA F 5.46; P 0.007), with a statistical power above 90%.

Figure 3
figure 3

Distribution of heteroplasmic variants in the studied groups: tumor, internal control and external control. Each small dot represents the mean of heteroplasmic variants of a sample. The average number of heteroplasmic variants is also represented for each group.

In Fig. 3, it is possible to see that, even though the external control group was composed of more samples, most of them showed less heteroplasmic variants in comparison to tumor samples. Proportionally, internal control group also showed many samples with less heteroplasmic variants than tumor group. In addition, about half of tumor group presented samples with a high number of heteroplasmic variants, including some samples with more than 2-fold the mean tumor’s heteroplasmy variants found (approximately 50 variants).

This difference became even more evident in the pairwise comparison of number of heteroplasmic variants between all groups using a post-hoc Tukey test. Tumor group presented a statistically significant difference when compared to internal (P 0.027) and external (P 0.008) controls with a statistical power above 70%. However, when both controls were compared no significant difference was found (P 0.903).

Regarding location of the observed heteroplasmic variants, heteroplasmy levels were observed in all protein-coding genes and in almost all non-protein-coding genes – appearing in all DLOOP and rRNA genes, but not in all tRNA genes. It is noteworthy that two heteroplasmic variants were observed in unidentified regions in external control and tumor samples and that they are probably in tRNA genes due to location. These unidentified regions were excluded from further analyses.

Figure 4 shows a representation of all protein-coding and non-coding regions throughout 16,569 bp of the mtGenome. This graph supports our previous result that tumor group presents most heteroplasmic variants, followed by external control and internal control groups.

Figure 4
figure 4

Distribution of heteroplasmic variants in the mitochondrial genome for all groups: tumor, internal control and external control. Peaks represent the number of variants. Genomic length and gene division are also shown.

Moreover, out of the 37 genes in mtGenome, 32 were heteroplasmic in external control group, 28 in tumor group and 15 in internal control group. Hence, in order to compare heteroplasmic regions between all three groups, we analyzed 15 regions affected in all of them. Among such regions, 12 were protein-coding and three were non-protein-coding genes (one DLOOP and two rRNA regions) (Fig. 5). In this analysis, we observed a statistically significant difference between all groups, with tumor group presenting more heteroplasmic variants in MT-RNR1, MT-ND5, MT-ND4, MT-ND2, MT-DLOOP1 and MT-CO1.

Figure 5
figure 5

Joined comparison of all three groups regarding mitochondrial genes with heteroplasmic variants, with the respective p-value obtained with Poisson regression.

Mitochondrial haplogroup analyses

Furthermore, we assessed haplogroup information of all samples in order to investigate a possible association with GC. In some cases, tumor samples presented a different haplogroup than their paired samples, indicating that the tumors might not be an accurate ancestry informative tool, especially considering that MT-DLOOP presented a particularly high rate of variants in this group and these regions are the most informative of maternal ancestry26. Thus, all tumor samples were disregarded from haplogroup analyses and the case group was be represented only by the internal control.

Considering both case (internal control) and control (external control) groups, we observed 17 macro-haplogroups, out of which there were four Native American (A, B, C and D), four European (H, HV, J and V), six African (L0, L1, L2, L3 and L3e) and four Asian (M, N, R and Y) haplogroups (Table 3). Native American haplogroups were the most frequent in both case (36.4%) and control (44.6%). Following, control group presented about the same frequency rate for both European and African haplogroups (25.6% each), while case group presented a similar rate for European (27.3%), but not for African haplogroups (18.2%). Asian haplogroups were around 4-fold more frequent in cases (18.2%) than in controls (4.2%).

Table 3 Distribution of macro-haplogroups in case and control groups.

The most common haplogroup was C (Native American ancestry) in both groups, but it is interesting that case has more than 2-fold the frequency of the control (36.4% and 17.0%, respectively). In case group, the second most common haplogroup was H (European ancestry), with 18.2%, followed by the other five haplogroups with 9.1% of presence each. In control group, haplogroup L1 (African ancestry) is the second most common, with 15%, followed by haplogroup H, with 12.8%, and the other haplogroups.

In addition, our results showed a difference in the distribution of variants by mitochondrial ancestry (Fig. 6). Firstly, we compared it within cases using ANOVA and found that 53% of all variants were in individuals from Native American haplogroups, followed by European (18%), Asian (15%) and African (14%) haplogroups (F 4.48; P 0.047). As for control group, individuals from Native American haplogroups also presented the majority of variants (46%), followed by African (36%), European (16%) and Asian (2%) haplogroups (F 2.57; P 0.066). When we compared case and control there was no significant difference (F 0.93; P 0.338).

Figure 6
figure 6

Number of variants in the four ancestries (African, Asian, European and Native American) of both case and control groups.

Discussion

In this study, we investigated the association of mitochondrial variants with gastric cancer in individuals from a population of the Brazilian Amazon. This was done through whole mtGenome sequencing of FFPE samples from patients and cancer-free individuals. To the best of our knowledge, this is the first study to successfully sequence mtGenome from FFPE samples.

This is particularly important because DNA sequencing from this type of material can be challenging. Although good for long-term storage of tissue samples, the FFPE process may affect the material integrity so that the extracted DNA is generally of low quality, often leading to the loss of samples. However, some studies performing NGS have shown that DNA extracted from FFPE tissue samples can still provide results in concordance to frozen tissues27,28,29. The challenge presented by FFPE samples is probably the reason for the loss of samples and for the limited coverage in our study. In spite of that, we were able to successfully sequence the mtGenome, obtaining interesting results.

When we compared the samples from the same eight gastric cancer patients, i.e. paired tumor and internal control samples, we observed significantly more somatic variants present in tumor samples than in the respective internal control samples (up to 5.5-fold more), a kind of pattern that has been reported in other types of cancer, such as colorectal cancer and penile cancer11,30. In addition, studies comparing mtDNA variants and genomic instability of different types of cancer reported that gastric cancer was among the types with more somatic variants and genomic instability with a tumor-specific pattern10,25.

Regarding the mitochondrial genes in which most of the tumor-exclusive variants were found, we observed that MT-DLOOP2, MT-DLOOP1 and MT-ND5 were among the regions with more variants in at least half of the tumor samples included in this analysis (Table 2). There was no similar pattern observed when we considered the regions with more variants exclusive to internal control. As for shared variants, they were similarly distributed among different genes. Additionally, there was a mean of 2.43-fold increase in the number of genes with exclusive variants when we compared tumor and internal control samples. It is noteworthy that most of these variants are present in more than one sample and are counted as such in this analysis.

Therefore, when we considered single variants only, there were 50 variants present only in tumor samples and most of them was distributed in four regions – MT-DLOOP1 (16%), MT-ND5 (16%), MT-ND4 (10%) and MT-DLOOP2 (10%) –, while there were eight variants present only in internal control samples, of which most was found in MT-ND5 (25%). This suggests that, in tumor, there is an increase not only in the number of variants, but also in the regions in which such variants occur.

This greater presence of variants in tumor samples was reinforced when we grouped and compared the eight pairs. In this analysis, we found 25 tumor-exclusive somatic variants and eight somatic variants that were exclusive to internal control samples, as well as 36 shared variants. While some variants that were considered somatic in the pair-by-pair comparison were then considered germline, tumor samples still showed 3-fold more variants than internal control samples.

Such tumor-exclusive variants were distributed in 13 genes, of which three (MT-ATP8, MT-ND4L and MT-TG) were only observed in this group. Only a few studies have reported variants in MT-ATP831,32 and MT-ND4L31,33,34,35 in different types of cancer and none was in gastric cancer. No studies were found associating variants in MT-TG to cancer.

Furthermore, two genes (MT-CO3 and MT-RNR1) only presented shared variants, which are probably not involved in tumorigenesis. However, five genes (MT-CO1, MT-CYB, MT-DLOOP1, MT-ND1 and MT-ND5) not only presented germline variants, but also different somatic variants exclusive to tumor and internal control groups, indicating a possible association of such genes and variants with tumor development. It is noteworthy that, among these genes, MT-DLOOP1 and MT-ND5 stand out for presenting more germline variants than the others.

In addition to the shared variants, MT-ND5 also presents a high rate of exclusive variants not only in tumor samples but also in internal control samples, suggesting that this gene could also be altered in cells in other adjacent organs of cancer patients, contributing to possible tumor advances. Regardless, MT-ND5 encodes a core subunit of Complex I, essential for electron transport from NADH to ubiquinone in the mitochondrial respiratory chain, so that variants in this gene and others related to this process may lead to impairment in OXPHOS and an increase of reactive oxygen species (ROS) generation, which can contribute to cancer proliferation and metastasis36,37. In fact, a high rate of mutations in MT-ND5 has been reported in esophageal cancer38.

Moreover, the regions with more tumor-exclusive variants were the MT-DLOOP1 and MT-DLOOP2 (16% each). This corroborates previous studies that associated an increase of variants in such control regions to the development of different types of cancer, including hepatocellular carcinoma39, brain tumor40, oral squamous cell carcinoma41, colon cancer42 and gastric cancer43. Further, a study have associated five variants in these genes with gastric cancer, including A73G and T16519C44, which we found frequently in our tumor samples but not in internal control samples.

We also obtained interesting results in the paired analysis of heteroplasmy levels. In most cases, tumor presented notably more heteroplasmic regions than the paired internal control (3 to 9-fold). About these heteroplasmic regions, we highlight that: (i) the four most commonly heteroplasmic genes in tumor samples were not heteroplasmic in all internal control samples; (ii) four genes were heteroplasmic in almost all tumor samples, but not as frequent in internal control genes; (iii) seven genes were heteroplasmic only in tumor samples; and (iv) there were no genes heteroplasmic only in the internal control group. These results are suggestive of an involvement of such heteroplasmic variants in GC development.

This was reinforced in our analyses comparing heteroplasmy distribution between all groups of samples (tumor, internal control and external control). When we compared heteroplasmy level means, the number of variants was different between groups, so we normalized the number of single heteroplasmic variants in all groups and found a statistically significant difference from tumor to internal control (P 0.025) and external control (P 0.006). These control groups did not present such difference between each other (P 0.906).

Therefore, the number of heteroplasmic variants might be as important or even more important to cancer development than heteroplasmy levels per se. In the individual heteroplasmy mean comparison, we observed that individual tumor samples presented almost 3-fold more heteroplasmic variants than internal control samples. A study conducted with patients of hepatocellular carcinoma also reported that tumor samples presented more heteroplasmic variants and a higher heteroplasmy ratio than their matched cancer-free samples45.

Moreover, tumor group presented 39% of all heteroplasmic variants, while internal control group presented 9% and external control group presented 52%. This is especially interesting given that tumor group had about a third of the number of external control samples, but about half of tumor group was composed of samples with a high number of heteroplasmic variants, as seen in Fig. 3. When compared to the control groups, tumor group presented a statistically significant difference regarding this distribution of heteroplasmic variants (P 0.027 for internal control and P 0.008 for external control). There was no significant difference between both controls (P 0.903).

As for location of heteroplasmic variants, out of the 37 mitochondrial genes, 32 presented variants in at least one group, out of which 15 regions carried such variants in all three groups, allowing the comparison shown in Fig. 5. Among these regions, 12 were protein-coding and three were non-protein-coding genes (DLOOP and rRNA only). Further, the higher rate of heteroplasmic variants in tumor groups was statistically significant in six of such regions: MT-RNR1, MT-ND5, MT-ND4, MT-ND2, MT-DLOOP1 and MT-CO1.

In the current specialized literature, not many studies were found associating heteroplasmic variants to the development of gastric cancer, regardless of mitochondrial genomic location, including the six genes above. A study performed showed an association of heteroplasmic variants in DLOOP regions to gastric cancer development in a Chinese population46. Another study have indicated homoplasmic and heteroplasmic variants in mitochondrial 12 S rRNA (encoded by MT-RNR1) to the development of intestinal-type gastric cancer47. No studies were found associating heteroplasmic variants in the four other genes to GC development.

In the mitochondrial haplogroup comparison between case and control, we observed 17 macro-haplogroups, distributed in four main ancestries: Native American (A, B, C and D), European (H, HV, J and V), African (L0, L1, L2, L3 and L3e) and Asian (M, N, R and Y).

Our results showed Native American ancestry with more variants than the other three ancestries in both cases and controls, which could be related to the great number of individuals from this ancestry in each group. Native American haplogroups were the most frequent in both case (36.5%) and control (44.6%). They were followed by European (27.3%), African (18.2%) and Asian (18.2%) in cases and by European and African (25.6% each) and Asian (4.2%) in controls. This ancestry profile based on mtDNA is consistent to the expected for the studied geographic region, considering the admixture process that formed the Brazilian population26,48,49. In addition, it is informative that there was a statistical significance in case group (P 0.047), probably due to the high frequency of Native American haplogroups in this group. This is especially interesting considering that GC is one of the most incident types of cancer in the studied Brazilian region, as previously mentioned.

Moreover, in both case and control groups, the most frequent haplogroup was C (Native American ancestry), but it is important to highlight that this haplogroup accounted for more than 2-fold in case group than in control group (36.4% and 17%, respectively). This suggests that haplogroup C might be related to gastric cancer development, which is reinforced by a study performed with Tibetan patients of gastric cancer, in which haplogroup C was the third most common (accounting for 17%)50.

Considering that the parent haplogroup of C is M51, it is also interesting that this haplogroup is present in our case group, but not in our control group. On the other hand, our study reported sibling haplogroup N in control group, but not in case group. Similarly, it has been suggested that haplogroup N is associated to good outcome of gastric cancer clinical evolution when compared to haplogroup M18. In addition, haplogroup H (European ancestry) was the second most common haplogroup in case group (18.2%). Although this haplogroup has not been previously associated to gastric cancer, it has been indicated that it might influence other types of cancer52.

Conclusions

In this study, we were able to successfully sequence the whole mitochondrial genome of FFPE samples and associate it to gastric cancer development. To the best of our knowledge, this was the first study to achieve that. We found more mitochondrial variants in tumor group than in controls, suggesting that such variants and mitochondrial heteroplasmy might influence gastric cancer development. In addition, haplogroup C might also be important to the development of this type of cancer. Although limited by sample number, our findings contribute to a greater understanding of mitochondrial influence in gastric cancer. Further, functional studies are recommended to clarify the mechanism of this impact.

Methods

Sampling

In this study, we included formalin-fixed and paraffin-embedded (FFPE) tissue samples from patients going through gastric biopsy by endoscopy or surgical resection in the Unit of High Complexity in Oncology of the University Hospital João de Barros Barreto (UNACON/HUJBB-UFPA). Individuals with gastric cancer diagnosis had their samples collected before undergoing chemotherapy or radiotherapy. All samples were analyzed by a pathologist that confirmed the positive or negative diagnosis of cancer.

We investigated 70 individuals (90 samples), divided in two main groups (Fig. 7). The first group (“Case”) consists of samples obtained from 20 individuals with gastric cancer diagnosis before treatment: gastric tumor tissue samples (“Tumor” subgroup) and non-tumor tissue of the duodenum (“Internal Control” subgroup) were collected from each patient, totalizing 40 samples in the Case group. Duodenal portion was extracted during gastrectomy and confirmed to be cancer-free by a pathologist. These samples were chosen as internal control respecting the cancerization field effect. The second group (“External Control”) is composed by gastric tissue samples from 50 individuals with no history of cancer and no infection by Helicobacter pylori.

Figure 7
figure 7

Schematic visualization of the sampling design.

DNA extraction and quantification

The DNA extraction from FFPE samples was performed with DNA IQ™ System (Promega Corporation, Fitchburg, WI, USA), following manufacturer’s instructions. Quantification of DNA was performed with NanoDrop 1000 and Qubit™ dsDNA HS Assay Kit (Thermo Fisher Scientific, Wilmington, DE, USA).

Mitochondrial genome amplification

From the extracted DNA, mtGenome was amplified by PCR with specific sets of primers (Supplementary Table S2). These sets were composed of 33 pairs of designed primers with discontinuous genomic location in order to provide overlaps in order to cover the whole fragmented genome.

Mitochondrial genome sequencing

The mtGenome was sequenced in the MiSeq System (Illumina Inc., Chicago, IL, USA) with MiSeq Reagent Kit v3 (300-cycles) (Illumina). Preparation of libraries was done with Nextera XT DNA Library Preparation Kit (96 samples) (Illumina), according to manufacturer’s instructions. DNA quality during library preparation was assessed with High Sensitivity D1000 ScreenTape at Agilent 2200 TapeStation System (Agilent Technologies, Santa Clara, CA, USA).

Data analysis

Paired-end sequencing reads (.fastq files) were aligned with the human reference mtDNA sequence – revised Cambridge reference sequence, rCRS – using Burrows-Wheeler Alignment tool (BWA)53. SAMtools54 were used for mapping and sorting sequences, while Picard55 was used to mark duplicated reads. After preprocessing sequences with the aforementioned steps, paired-end.bam files were submitted to mtDNA-Server analysis56 for heteroplasmy detection and haplogroup inferences. Phred quality scores (Q scores) of 20, 30 and 20 were considered for mapping quality of reads, alignment quality and base quality, respectively. For heteroplasmy detection at mtDNA-Server, we considered at least 10x of coverage on forward and reverse strands. Using its own pipeline, mtDNA-Server performed automatic mtDNA haplogroup classification using PhyloTree 1751, implemented with HaploGrep57.

Statistical analyses were performed using R language58 and the following tests were employed for different analyses: Mann-Whitney test, chi-squared test, one-way ANOVA with Tukey’s HSD test and Poisson regression. Statistical power was also assessed using R. R packages ggplot59, UpSet60 and RCircos61 were used for graphic representations. In all analyses, p-value was considered statistically significant when ≤0.05.

Ethics approval and consent to participate

All participants signed an informed consent, with approval by the Committee for Research Ethics of Hospital João de Barros Barreto under Protocol No. CAAE: 64399617.7.0000.5634. All methods were conducted in accordance with the relevant guidelines and regulations.