Mycobacterium tuberculosis has afflicted and co-evolved with man over thousands of years. Its success is due to its ability to infect a host and persist in a dormancy state for years1,2. During this period, the host is asymptomatic and not infectious, making the study of this state unmanageable. The bacteria stay in the granuloma, a barrier made by the immune systems cells, until not-well characterised signals or a weakening of the immune system allows the bacteria to escape and develop an active disease3. In recent years, some in vitro approaches have been designed for trying to increase the knowledge of latency4 and a debate about the physiological state of M. tuberculosis during this period has emerged. The number of mutations during a time period can be used as a molecular clock to study the evolution of the pathogen6,7,8,9. There were two studies with apparently contradictory results. Ford et al.10 used bacteria isolated from macaque lesions that mimic those of tuberculosis (TB) latent infection and concluded that generation time for latent TB would be similar to active TB, so the bacteria is physiologically active. On the other hand, Colangeli et al.11, studying an outbreak in New Zealand, concluded that generation times during latency are longer than during active TB. Recently Colangeli et al.12, pairing index cases to their TB contacts as a latency approach, concluded that both studies were correct: during the first two years of latency, the generation time is similar to that of the active disease, while later it starts to increase for long periods of time and a reduced mutation rate is observed. Looking for a novel in vivo approach, we carried out the analysis of isolates from individuals known to have developed several episodes of active TB for studying the evolution of the bacteria during the period between these episodes. Since 2004, all strains of M. tuberculosis have been genotyped in Aragon, Spain, which allowed us to identify the cases of TB relapses. The DNA previously used for genotyping remained in storage and could be used for whole genome sequencing (WGS) to analyse the variability of the isolates in the different episodes of the disease.


Patient selection and risk factors

The search of cases with recurrent TB among the total of cases in Aragon revealed 127 patients from 2004 until 2019 (4.97%). The genotype of the isolates revealed that 114 patients were infected by the same or very similar RFLP pattern strain, which would imply a potential relapse, while 13 patients were infected with non-related strains, i.e., re-infections, so these were not considered for this study. Among the potential relapses, we selected the cases with at least one year between episodes. Eighty-one patients had isolates with less than one year between them, therefore this group was discarded. Based on the time distance between the isolates, the cases were split into two groups: cases with ≥ 1 year but ≤ 2 years between isolates (12 patients), and cases with more than two years between the diagnosis of their isolates (21 patients). Eighteen pairs of M. tuberculosis isolates with available DNA of at least two different episodes were studied: twelve patients in the > 2 years group and six patients in the ≥ 1 but ≤ 2 years (Fig. 1). The lineages of the selected isolates are shown in Fig. 2.

Figure 1
figure 1

Diagram with the discarded and selected cases for the recurrent cases study.

Figure 2
figure 2

Drawing of a minimal spanning tree (not scaled) with all the studied isolates. The number of SNPs and the TB lineages are indicated.

Several selected patients had risk factors to develop TB. At least 38.9% were HIV+ , 22.2% declared a high alcohol consumption, 38.9% were smokers and 16.7% were intravenous (IV) drugs users. The treatment for all patients was the standard for susceptible TB. Despite some of them did not follow it correctly, no drug resistance was developed. We split the patients into two groups: [1–2 years until relapse] and (2–14 years until relapse] in order to investigate if some of these risk factors were related to a shorter or longer relapsing period (time between the first and the second episode). The intervals were fixed according to the results obtain by Colangeli et al.12, being 160 months the maximum time between episodes observed in our study. Results are shown in Table 1.

Table 1 Risk factors of the 18 selected cases.

HIV status was significant (p value = 0.041) between the two groups, showing that HIV positive patients suffered relapse in the first two years more frequently than HIV negative patients. The gender was marginally significant for males, suggesting a trend of males suffering relapses in the first two years more frequently than females (p value = 0.066).

Analysis of the genomes

SNPs versus relapsing period

The number of existent SNPs between the first and its correspondent relapsed isolate ranged from 0 to 8. These SNPs were usually in the relapsed isolates, but we could also find some of them in the earliest isolates that then disappeared, showing different clones co-existing in the patient. The mutation effect of the SNPs and also the functional categories of the affected genes were analyzed; 26.3% belonged to cell wall and cell processes category and 21.1% to intermediary metabolism and respiration category. None of them were present in more than one case, therefore they do not seem to be directly implicated in relapsing. The detailed SNPs can be found in Table 2.

Table 2 SNPs found between the first and the relapsed paired isolates.

In order to represent the number of SNPs developed between the first and the relapsed isolates of the same patient versus the time between the diagnosis of both isolates (in months), resembling in that way the latency period, we reproduced the study of Colangeli et al.12 using the Poisson regression model. Equally to Colageli et al. 2020, we found this correlation not significant (p value = 0.34), meaning that those isolates with a longer relapsing period were not necessarily those with more SNPs. The results are shown in Fig. 3. Otherwise, we observed that pairs of L4.1 had a higher mutation rate per genome per year (0.93 SNPs) than other sublineages (0.58 SNPs in L4.8 and 0.32 in L4.3), and it is above the average found by this study (0.64 SNPs).

Figure 3
figure 3

Scatter plot showing the number of SNPs developed between the first and the relapsed isolate of the patient versus time between the diagnosis of both isolates (in months). A Poisson regression model was used. A trend of increase in the number of SNPs is observed as the months of relapsing period increase, however, it was not significant (p value = 0.34).

Mutation rate versus generation time

In order to analyse the correlation between the mutation rate and the relapsing period, we used the Poisson model, as described by Colangeli et al. 12. The generation time was fixed at 18 h as seen in M. tuberculosis actively replicating in vitro. Results are shown in Fig. 4. The mutation rate tends to diminish in longer relapsing periods, being marginally significant (p value = 0.061).

Figure 4
figure 4

Mutation rate versus time until relapse. Scatter plot showing the number of SNPs that differed between the 18 patients’ paired isolates (y-axis) as a function of the time between episodes (x-axis). The generation time is held constant at 18 h as seen in actively replicating M. tuberculosis in vitro. The relation between the mutation rate and the time between episodes was marginally significantly different from 0.0 (p value = 0.0613), indicating a trend of a higher mutation rate in the first months of relapsing period.

When we considered the data in (1–2) and (2–14) years of relapsing periods, the mutation rate is slightly lower for the second, estimated at 2.728 × 10–10 [95% CI: 1.433 × 10−10, 5.193 × 10−10] mutations per (bp × generation), than in the first period, estimated at 2.798 × 10–10 [95% confidence interval (CI): 1.209 × 10–10, 6.477 × 10−10]. However, this difference was not statistically significant (p value = 0.96). Results are shown in Fig. 5.

Figure 5
figure 5

Changes in mutation rate during the relapsing period with varying generation times. Mutation rate (mutations per (bp × generation)) is shown for generation times ranging from 18 to 320 h for each pair grouped by the number of years between the first and relapsed isolates. The dark blue line is obtained from the regressions and shows the estimated mutation rate for a given generation time (x-axis) and light blue regions show 95% confidence intervals. The first panel shows the relationship between mutation rate and generation time during early relapsing (in ≤ 2 years) based on n = 6 pairs. The second panel shows the relation between mutation rate and generation time for relapse in 2–14 years based on n = 12 pairs. In both panels the grey dashed vertical line is fixed at 18 h, and the horizontal line indicates the mutation rate of 2.798 × 10–10 mutations per (bp × generation) as seen in early relapsing with generation times held constant at 18 h, fixed in both graphs for comparison availability.

IS6110 copies variation between the isolates

All the IS6110 copies of the isolates were analyzed using the WGS data. We found several IS6110 movements between the first and the relapsed isolates as a result of different clones. The relapsed strain gained extra IS6110 copies in P7 (two copies gained) and P12 (one copy gained). On the other hand, we also observed four cases in which the relapsed isolate had lost some IS6110 copies, present in the first isolate: P10 (one copy lost), P13 (three copies lost), P14 (one copy lost) and P15 (one copy lost). All these extra and absent copies usually had a lower number of reads than the fixed copies, indicating that they were not completely extended in the bacteria population. All these movements were observed in strains with more than 2 years between the isolates, except P14 pair (14 months). The IS6110 locations can be found in Table S1.


Studying M. tuberculosis latency in humans is harsh due to the difficulty of isolating the dormant bacteria, which is not possible until the active disease. Much has been published regarding latent TB and the percentages of reactivation and disease, but the latency data in patients who have already passed the disease have not been studied. Different approaches were used to mimic this process10,11,12. This work shows, for the first time, results obtained using isolates of patients with recurrent TB. Aragon, a region in the North of Spain, has a low incidence of TB. Thanks to the surveillance protocol carried out in this region since 2004, all M. tuberculosis isolates are genotyped and registered, allowing to trace the clinical TB history of the patients. Around 5% of the TB cases in our population correspond to recurrent TB. Of them, 89.8% were TB cases with isolates showing identical IS6110-RFLP patterns, indicating a potential relapse. Most of them (71%) were later considered as fail of treatment. In contrast, 10.2% of the patients had isolates with different genotypes, considered as reinfections. Among the total of TB cases in our community, reinfection occurs in 0.5% of the TB cases, reflecting that reinfection is uncommon among our population. These data are in agreement with a previous study in Madrid population, which showed an 87.5% of relapses and 12.5% of reinfections among the cases with recurrent TB13. However, in a study in the Canary Islands, the results showed a higher reinfection percentage (44%) versus the 55% of relapses14. A more extreme result was obtained in a study in London, in which 72.6% of the repeated patients were classified as reinfections against a 27.4% of relapses15. The large variation of the results among the different studies suggests that they largely depend on the population sample studied. It would be very interesting to analyse the reinfection cases in each of the studies to understand the reasons for these differences. Regarding endemic TB regions, a higher percentage of recurrent TB was found. Around 9.5% of TB patients had recurrent TB in Malawi (39.6% had relapse and 14.4% reinfection, the rest was undetermined)16 and a study carried out in India demonstrated that the majority of relapses they had were among HIV negative people (95% of TB recurrences) while the majority of reinfections were among HIV positive people (75% of TB recurrences)17.

Regarding the epidemiological and risk factors of the relapsed TB cases studied, we found that relapse was significantly earlier in HIV positive patients (in the first two years since the first episode) when compared to HIV negative patients (p value = 0.041), what would be in accordance with a compromised immune system. We also found a trend that males suffered relapse earlier than females, which could be linked to other risk factors such as the use of IV drugs, smoking and the HIV status, which were more frequent in males in our study population. Any risk factor was found as significant for causing an earlier reactivation by Colangeli et al. 12, however they recognized that the clinical cases studied did not have in general any comorbidity.

The number of SNPs between the pairs ranged from 0 to 8. Remarkably, three among the 18 pairs had more than 5 SNPs between the first and the relapsed isolate, interpreted as not recent transmission18, even though the bacteria were isolated from the same patient. This could be related to clinical characteristics of the patients, as immunosuppression, HIV status or the treatment adherence. Surprisingly, several SNPs were found in the first isolates that were absent in the relapsed isolates, as if they had reverted. This phenomenon was extreme in P12, in which six out of the seven SNPs found were absent in the relapsed isolate. The explanation could be the presence of different clones in the patient19,20. In this way, in the different disease episodes a different clone was isolated, resulting from different bottlenecks and selective pressures of the original strain21,22. The reinfection with an identical strain has been described as a limitation of these kind of studies, but in our case, it can be discarded as only one of the pairs belonged to a large endemic cluster (P4, with 0 SNPs). The rest of the pairs were infected with orphan or small-outbreak strains of up to four cases, differently from other studies with large endemic clusters and high TB prevalence22.

Same as Colangeli et at.12, we did not find a significant correlation between the number of SNPs and the time between episodes. However, it is possible that P8 (160 months between episodes and 0 SNPs) is altering the trend of SNP accumulation when the time between episodes increases. This is one limitation when working with small sample size, that a single point could have a great impact in the results. None of the SNPs found seemed related with recurrence as all were unique and therefore not common to more than one pair of isolates. It has been described that 0.5 SNPs per genome, per year is the standard mutation rate for M. tuberculosis10. Some studies, where multiple MDR/XDR isolates coming from the same patients were sequenced, have reported that selective pressure and antibiotic resistance can increase this mutation rate as high as more than 3 SNPs17,21. Despite all strains had been under the selective pressure of treatment, they did not achieve such a higher rate, maybe because they were drug susceptible. The mean mutation rate found in our study was 0.64 SNPs, slightly above the standard, due to the high mutation rate found in L4.1, almost double than the standard.

The correlation between the mutation rate and the relapsing period was found just marginally significant (p value = 0.0613), differently to Colangeli et al.12, who found it significant. It is important to remark that the approaches were completely different: they used transmission events to mimic the latency period as the time between the diagnosis of the two cases, while we used isolates from the same patient who had a previous TB episode. We eliminated all patients with less than one year between the diagnosis of the episodes, as this was considered as a treatment failure, while Colangeli et al. 2020 had latency periods from one month, which was not possible in our clinical cases as a minimum of 6 months of treatment was required. We did not find a significant correlation between the mutation rate along the variable generation times analysed when we split the data into [1–2 years] and (2–14 years), we observed just a small difference. This difference was much smaller than that found by Colangeli et al. 2020 (as high as 8 × 10–10 for early latency), suggesting that mutation rate was constant during the relapsing period in recurrent TB cases. The mutation rate found in our study, 2.7 × 10–10, was similar to that found by Ford et al. 2011 (2 × 10–10)10, therefore both more distant from the one found by Colangeli et al. 2020. The reason why our results are similar to those of Ford et al. 2011 could be due to the similarity of the approaches applied, as they used lesions of the same macaques for studying latency and we used relapsed isolates from the same patients.

The analysis of the IS6110 element showed differences in the number of IS6110 copies in six of the pairs studied, affecting more than one IS copy in several pairs. It has been observed that IS6110 transposed more in great starvation conditions23, which could be similar to the conditions the mycobacteria found in the granuloma4. It was surprising that in four of the pairs studied, the relapsed isolates had lost 1 to 3 copies that were present in the first isolates. Noteworthy, the number of reads obtained in the fastQ files for these copies was considerably lower than for the rest of the IS copies. This suggests that those lost copies were not still fixed in the complete bacteria population, therefore a selection among the different clones present in the same patient had taken place24. It could be that the lost copies in the relapsed isolates had some deleterious effect for the mycobacteria as the relapsed bacteria were the ones without that IS copies. The fact that five out of the six pairs with IS6110 movements had more than 2 years of relapsing period supports the idea of IS transposing more during the asymptomatic state of the patient23.

The main limitation to analyse the evolution of the bacteria during the dormancy period is the approach used for resembling this state. There is not a perfect approach, as it is impossible to reproduce what is happening inside the granuloma of a concrete patient, but we think that using isolates of the same patient is the closest way to do it. The difficulty to obtain the complete epidemiological information of the patients is another limitation because it does not allow to determine the accurate development of the disease’s episodes. Another limitation is that some of the SNPs could be the result of a sequencing error or due to laboratory management, what would have a huge impact on the mutation rate. In addition, although there were more cases of potential relapses in our records, DNA of the isolates was not available. We decided not to re-cultivate these stored isolates to avoid more manipulation that could introduce errors such as additional SNPs that were not present in the original strains.

As a conclusion, the patients with HIV seemed to suffer reactivation in the first two years after the initial episode of TB more frequently than HIV negative patients. Besides, IS6110 movements occurred more frequently in patients with more than two years between episodes and it seems that different clones of the original strain could be responsible for the first and the following episodes. No correlation was found between the number of SNPs and the time between episodes, neither between the mutation rate and the relapsing period, just a trend of diminishing in longer time periods. Finally, the mutation rate seemed to be constant along all the period between episodes.

Material and methods

Selection of samples and patients

Of around 2553 cases of TB in Aragon since 2004, we first looked for those with more than one isolate more than 1 year apart and of a similar genotype. We used Bionumerics v6.7 software (v7.6, Applied Maths, Kortrijk, Belgium) to confirm that both isolates coming from the same patient shared an identical IS6110-RFLP pattern. Eighteen pairs of M. tuberculosis isolates with available DNA were included in this study. When there were more than two isolates from the same patient, the two more distant in isolation dates were considered for the evolution study during relapse. All data remained anonymous during the epidemiological search. Our regional ethical committee (Comité de Ética de la Investigación de la Comunidad Autónoma de Aragón, Record No. 20/2018) approved the methodology used in this work, detailed in 18/0336 project.

DNA of the bacterial isolates was obtained using the cetrimonium bromide method, as previously described25. No human DNA was sequenced. All DNA extractions were stored at – 20 °C until sequencing. All the isolates were genotyped by IS6110-RFLP and spoligotyping as previously described26,27. The genetic patterns obtained were stored and analysed in Bionumerics database software.

SNP annotation and lineage identification

Thirty-six isolates corresponding to 18 different patients were sequenced using Ion Torrent technology according to manufacturer’s instructions. The fastQ files obtained were mapped against the reference M. tuberculosis strain H37Rv (NC_000962.3) in order to obtain the Binary Aligned Map (BAM) and Variant Call Format (VCF) files, used for the SNP study. The fastQ files were uploaded in Bionumerics software for the study and comparison of the genomes. The SNP annotation was carried out using Snippy software (default parameters) and Integrative Genomics Viewer (IGV), from the Broad Institute28. The effect of the mutation (synonymous or non-synonymous) was observed using Genewise platform ( All the mutation points are referred to the H37Rv reference strain. For lineage identification, the SNP-based classification stablished by Coll et al. 201429 was used. This classification assigns specific SNPs to each TB lineages and sublineages.

IS6110 location

All the reads containing the first and the last 30 base pairs of the IS6110 sequence were extracted. These reads are formed by the beginning or the ending of the IS6110 along with part of the gene in which the IS is inserted. After the extraction of the sequences, BLAST was made in Tuberculist and Bovilist to know the insertion point. BLAST was also made automatically with the script, but manual BLAST was required for some ambiguous points. The script used in R is in the Supplementary Materials.

Mutation rate calculation

The mutation rate per (bp × generation) was calculated as previously described10, adjusting the parameters to our own data. Briefly, the mutation rate per bp × generation is defined as

$$\mu = \frac{{\mathop \sum \nolimits_{i = 1}^{n} m_{i} }}{{N\mathop \sum \nolimits_{i = 1}^{n} \left( {{{t_{i} } \mathord{\left/ {\vphantom {{t_{i} } g}} \right. \kern-\nulldelimiterspace} g}} \right)}}$$

where μ is the mutation rate, m is the number of SNPs between the first and the relapsed isolate, N is the genome size (since we had, on average, reads covering 97.4% of the M. tuberculosis genome, N = 0.974 × L where L is reference genome size), t is time since infection (in hours, Table 1), and g is generation time (in hours).

Statistical methods

Poisson regression was used to model the variation of mutation rate over a range of generation times. To control the deviations from distributional assumptions a robust variance of Robust Sandwich Estimator was used. Poisson models were used to obtain mutation rates per (bp × generation) by using bp × generation as an offset. Two Poisson models were fit according to the relapsing period (one for 1–2 years including n = 6 pairs and another model for 2–14 years including n = 12 pairs). We also fitted a Poisson model using the relapsing period as a continuous independent variable. The hypothesis that we tested was if the parameter associated to relapsing period was significantly different from 0. To test the Poisson model parameters, a two-sided chi square test using the robust variance was used. Software R version 4.0.5 (2021-03-31) was used to all statistics analysis. Regression Poisson of all models was implemented in R using a generalized linear model function and robust variance control with sandwich package30.