Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Genomic signatures of pre-resistance in Mycobacterium tuberculosis


Recent advances in bacterial whole-genome sequencing have resulted in a comprehensive catalog of antibiotic resistance genomic signatures in Mycobacterium tuberculosis. With a view to pre-empt the emergence of resistance, we hypothesized that pre-existing polymorphisms in susceptible genotypes (pre-resistance mutations) could increase the risk of becoming resistant in the future. We sequenced whole genomes from 3135 isolates sampled over a 17-year period. After reconstructing ancestral genomes on time-calibrated phylogenetic trees, we developed and applied a genome-wide survival analysis to determine the hazard of resistance acquisition. We demonstrate that M. tuberculosis lineage 2 has a higher risk of acquiring resistance than lineage 4, and estimate a higher hazard of rifampicin resistance evolution following isoniazid mono-resistance. Furthermore, we describe loci and genomic polymorphisms associated with a higher risk of resistance acquisition. Identifying markers of future antibiotic resistance could enable targeted therapy to prevent resistance emergence in M. tuberculosis and other pathogens.


Mycobacterium tuberculosis is estimated to have killed 1 billion people over the last 200 years1 and remains one of the world’s most deadly pathogens2. Drug resistance in bacteria, particularly the Enterobacteriaceae and Mycobacterium tuberculosis, imposes an unsustainable burden on health programs worldwide with some strains so extensively resistant that they are untreatable with existing antibiotic therapy3. Although recent advances in bacterial whole-genome sequencing have significantly improved the identification of drug resistance4, post hoc approaches to diagnosis miss the opportunity to preempt the emergence of drug resistance and implement preventive measures prior to the acquisition and spread of antibiotic resistant disease.

An increased risk of drug resistance emergence is often attributed to inadequate implementation of control measures5, but bacterial factors have also been proposed as potential contributors to drug resistance6. Evidence of differential drug resistance acquisition at the M. tuberculosis sublineage level is conflicting. Epidemiological and in vitro studies have suggested that the Beijing family, belonging to lineage 2, is hyper-mutable7 with a propensity to develop resistance at a higher frequency than other lineages8,9,10,11, while others cite evidence to the contrary12,13. Pre-existing resistance to one antibiotic (mono-resistance) is another factor that may influence the acquisition of multidrug-resistance14. Mono-resistance to isoniazid or rifampicin has been associated with increased rates of multidrug-resistance acquisition15,16, but the relative risk of either remains unclear. Similarly, phylogenetic analyses suggest a stepwise progression towards multidrug-resistance, where mutations conferring isoniazid resistance tend to precede those linked to rifampicin resistance17,18,19,20.

Phylogenetic trees have been increasingly used to study pathogen dynamics and evolutionary processes of a wide range of phenotypes of epidemiological interest, including virulence and drug resistance acquisition21,22. A necessary focus on improving the molecular diagnosis of drug resistance has led to the generation of large strain collections of drug resistant pathogens. However, unrepresentative samples of this kind enriched for drug-resistant isolates limit the ability to characterize the evolution and dynamics of drug resistance from a diverse background of ancestral susceptible strains. Inadequate sampling without comprehensive population level coverage or sufficient temporal span compounds this problem, while the monomorphic nature of the M. tuberculosis genome makes constructing time-calibrated phylogenetic trees particularly challenging. As a consequence, a single mutation rate is often applied to the data, but this assumption inappropriately forces lineages and sub-lineages to conform to the same global mutation rate, thus limiting the inferences that can be made from the data.

Overcoming these issues, we present findings from samples collected over a 17-year time span with population level coverage in the hyperendemic suburbs of Lima, Peru. We apply a genome-wide survival analysis to a time-calibrated phylogeny of 3135 M. tuberculosis strains, and show the existence of pre-resistance mutations among drug susceptible genotypes that increase the risk of future drug resistance emergence in M. tuberculosis. We demonstrate significant differences in the acquisition of drug resistance between lineages, on mono-resistant backgrounds, and at the level of nucleotide polymorphisms. Our findings were then tested and replicated in an independent publicly available data set of 1027 whole genomes collected in Samara, Russia, and in a collection of 1573 isolates from multiple countries to demonstrate that they can be globally generalized.


Population structure, genomic analysis, and patient demographics

A total of 3432 M. tuberculosis genomes from Lima (Peru) were analyzed, of which 3135 passed genomic quality filters. Of this, 2037 were part of a population level study carried out in 2009 where sputum samples were taken from all patients presenting tuberculosis symptoms in the Lima areas of Callao and Lima South23 (Supplementary Data File 1). Comparison of drug resistance prevalence between the population level sampling using molecular genotyping and reports of epidemiological data in Peru2 are consistent: 1.5% (32/2037) of samples were rifampicin mono-resistant; 5% were isoniazid mono-resistant (105/2037), and 13% were multidrug-resistant (251/2037) (Supplementary Table 1). The remaining samples were collected from cohort studies covering a 17-year period of research in the regions of Lima and Callao in order to achieve a sufficient temporal span in our sampling window (Supplementary Fig. 1). Both lineage 2 and lineage 4 had a similar distribution of sampling dates (Supplementary Fig. 2).

The isolates were first aligned to the reference genome H37Rv, then lineages and sublineages were assigned using clade specific SNPs24. Lineage 4 (L4, Euro-American) consisted of 2807 samples, while lineage 2 (L2, Beijing) had 327 isolates (Table 1). There was a single representative of lineage 1 (Indo-Oceanic), which was used to root the phylogenetic tree. The remaining samples from the data set, which included 5 M. caprae isolates, were not used in the downstream analysis. Lineage 4 had the highest diversity, comprising 1235 isolates from lineage 4.3 (LAM), 935 from lineage (Haarlem), 271 from lineage 4.1.1 (X-Type), and 312 from lineages denoted as Type T, which encompasses lineages 4.5, 4.7, 4.8 and 4.9. Other minor sublineages included lineage 4.2.2 (TUR) and lineage 4.4. All isolates from Lineage 2 were part of the Beijing sublineage (lineage 2.2), mainly from the sublineage 2.2.1 or Modern Beijing, and with only one representative of the sublineage 2.2.2 or Asia Ancestral25 (Table 1).

Table 1 Population structure.

The alignment of the isolates to the reference genome resulted in 64,586 SNPs, of which 18,022 were singletons (28%). Most SNPs were not widely distributed across the population, and only 8088 variants had a frequency in the dataset higher than 1%. A total of 59,789 SNPs (16,934 singletons, 28%) were identified for lineage 4, and 4821 SNPs (1370 singletons, 28%) for lineage 2. We applied the same analysis to a publicly available data set of 1027 isolates from Samara, Russia, as a validation set where, unlike the Peruvian dataset, lineage 2 constitutes the main lineage. We identified a total of 28,414 SNPs, consistent with previous publications of this data19.

Patient demographic metadata was available for 2220 samples, of which 88% were smear positive, 27% were previously treated for tuberculosis, and 2.8% were HIV positive, consistent with previous population level estimates in Peru26. The median age was 28 years (IQR 21–41). In our cohort, 86% of lineage 2 and 89% of lineage 4 were smear positive.

Phylogenetic analysis and drug resistance emergence

The maximum likelihood phylogeny constructed using these alignments grouped the isolates by lineage similarly to previously published global data sets24 (Fig. 1a, Supplementary Fig. 3). To study the temporal dynamics of drug resistance acquisition at the population level, the maximum likelihood phylogeny was time-calibrated using the sampling dates of the isolates, which extended from 1999 to 2016. Dated phylogenies were built separately for lineage 2 and lineage 4 in order to avoid the confounding effect of the temporal and population structures27. Before time calibration of the phylogeny, we tested the adequacy of the temporal correlation of evolutionary change to reliably infer the model parameters. First, a root-to-tip linear regression of the number of substitutions from the root and the sampling times was fitted to confirm a positive association between time and evolutionary change. As uneven sampling may bias the root-to-tip regression28, a date-randomization test was additionally performed using the full Bayesian model implemented in BactDating29 with the original dataset and 100 randomizations where the sampling times were permuted, representing the expectations of the model parameters in the absence of temporal signal. The substitution rate estimated for the original dataset and for the 100 randomizations was compared to verify a lack of overlap between the 95% credible intervals. Both lineage 4 and lineage 2 datasets showed a clear temporal signal (Supplementary Fig. 4), and thus model parameters could be confidently inferred from the data30,31. We used the relaxed clock model implemented in BactDating29, allowing the mutation rate to vary in each branch independently. We ran the MCMC until convergence of the chains was achieved, with an effective sample size (ESS) of at least 100 (Supplementary Fig. 5). The estimated rate for lineage 2 was 0.45 substitutions per genome per year (0.32–0.57 95% CI), while lineage 4 had a clock rate of 0.299 (0.25–0.36 95% CI) (Fig. 1b). The estimates of the molecular clock for both lineages were consistent with previous reports32. The most common recent ancestor (tMRCA) for our samples was placed at 565 CE (263;826 95% CI) for lineage 4 (Fig. 1c), while lineage 2 had a tMRCA in 1325 CE (1070;1499 95% CI) (Fig. 1d).

Fig. 1: Phylogenetic analysis of 3134 Mycobacterium tuberculosis isolates from Lima, Peru.
figure 1

Colors represent different lineages and sublineages. a Maximum likelihood phylogeny. Scale in number of substitutions per genome. b Violin plots showing the posterior density distribution of the inferred substitution rate in substitutions per genome per year derived by sampling from 107 MCMC iterations. The substitution rate was estimated separately for lineage 4 (blue) and lineage 2 (red). Box plots inside the violin indicate the median value of the distribution (black horizontal line) and the interquartile range. Whiskers denote 1.5x the interquartile range, while outliers are plotted as individual points. c Time-calibrated phylogeny of lineage 4. d Time-calibrated phylogeny of lineage 2.

Drug resistance was inferred for all isolates at the tips of the phylogenetic tree using well-established drug resistance associated SNPs33. In addition to molecular typing of drug resistance, all isolates included in the analysis had Drug Susceptibility Testing (DST) performed either by MODS or by the proportional method in agar. DST and molecular typing showed consistent results for 96% of the samples for rifampicin resistance and 92% for isoniazid resistance. The time of emergence of drug resistant mutations was inferred by reconstructing the ancestral sequences of the internal nodes in the phylogenetic tree. The time of emergence of a specific antibiotic resistance mutation was approximated to the inferred year of the internal node where such mutation first appeared. The phylogenetic estimates of the emergence of drug resistance conferring mutations in lineage 4 occurred around the time of the known introductions of the corresponding drug. In contrast the emergence of drug resistance in lineage 2 was observed to have arisen many years after the introduction of antituberculous drugs. This is consistent with the geographic spread of lineage 4 in Europe together with early widespread use of drugs in this region (Table 2, Fig. 2). For both lineage 2 and lineage 4, the earliest inferred occurrence of resistance was to isoniazid, by the Ser315Thr mutation in the gene KatG, around 1957 (1928;1978 95% CI) for lineage 2, and 1942 (1913;1960 95% CI) for lineage 4, in line with the reported wide introduction of isoniazid in 195234. The rifamycins were first isolated in 195734, and we estimate the date for the first acquisition of resistance to rifampicin due to the rpoB mutation Ser450Leu to have emerged around 1951 (1931;1971 95% CI) for lineage 4 and 1974 (1953;1988 95% CI) for lineage 2. None of the drug resistant nodes reverted to susceptible along the branches of the two phylogenetic trees.

Table 2 First emergence of drug resistance conferring mutations in Lima, Peru.
Fig. 2: Inferred posterior density distribution of the earliest occurrence of resistance to first line antituberculous drugs.
figure 2

Posterior density distribution inferred using a time-calibrated phylogeny for both lineage 4 and lineage 2. Arrows represent the approximate time of antibiotic introduction.

The emergence of compensatory mutations

It has been shown that secondary mutations arising after the acquisition of drug resistance may alleviate the fitness cost associated to antibiotic resistance mutations35, but little is known about their temporal dynamics. Data on M. tuberculosis drug resistance compensatory mechanisms is mainly limited to isoniazid and rifampicin resistance36,37. Non-synonymous mutations in the gene rpoC have been suggested as secondary compensatory mutations for rifampicin associated mutations in the rpoB gene37. A total of 34% (258/755) of lineage 4 isolates harboring rpoB mutations also had rpoC non-synonymous polymorphisms; for lineage 2, 38% (33/87) of isolates with rpoB mutations carried rpoC polymorphisms. No significant differences were observed between lineages in a logistic regression model (OR = 0.85, 95% CI 0.54–1.35, p-value = 0.49). Overall, 62% of rifampicin resistant isolates carried Ser450Leu rpoB mutations (525/842). Rifampicin resistant isolates harboring Ser450Leu rpoB mutations had a higher probability of carrying mutations in the rpoC gene (52%, 272/525) than isolates with other rpoB mutations (6%, 19/317) in a logistic regression model (OR = 16.86, 95% CI 10.55–28.50, p-value = 4 × 1029). Only 3% (9/291) of the isolates carried two non-synonymous mutations in the rpoC gene, while the rest had only one.

To understand the emergence of rpoC non-synonymous mutations, we scanned the phylogenetic branches of rifampicin resistant isolates from the root to the tip, using the inferred sequences of the ancestral nodes to determine the time of emergence of rpoC non-synonymous mutations. The analysis was repeated in 100 bootstrap phylogenies to infer confidence values around our estimates. In both lineage 2 and lineage 4, the emergence of rpoC non-synonymous mutations occurred immediately after or at the same time as the emergence of rifampicin resistance, and continued steadily over time (Fig. 3). For lineage 2, there was not a single emergence of rpoC mutations occurring prior to the rifampicin resistance conferring mutations. In the case of lineage 4, two rpoC mutations emerged before rifampicin resistance: c.765150 G > A and c.765590 C > A. Both mutations appeared once independently in the entire phylogeny. The mutation c.765150 G > A emerged in our dataset around the year 936 CE (727;1110 95% CI), in the MRCA of the clades X-type and Haarlem, all of which present the c.765150 G > A mutation (1206/2807 isolates). For c.765590 C > A, the estimated year of emergence was around 1867 CE (1813;1907 95% CI), and all the tips harboring the mutation belonged to a cluster of isolates part of the LAM11 sublineage (33/2807 isolates). Given that both mutations showed a clear phylogenetic structure and emerged independently only once in our dataset, they were considered phylogenetic mutations and were removed from the final analysis. The emergence of non-synonymous mutations in the rpoC gene was similar for isolates carrying the Ser450Leu rpoB mutation and for those isolates with other rpoB mutations (Supplementary Fig. 6).

Fig. 3: Dynamics of non-synonymous mutations in rpoC.
figure 3

a, b Cumulative number of non-synonymous mutations in rpoC over time. The x-axis represents the years since the inferred time of rifampcin resistance (time 0). Dark blue line shows the cumulative number of mutations for the ML tree, while the 95% confidence interval (shaded area) is inferred by repeating the analysis in 100 bootstrap phylogenies. The analysis was performed separately for a lineage 2 and b lineage 4.

Phylogeographic history of Mycobacterium tuberculosis in Peru

To estimate the year of M. tuberculosis introductions to Peru, we subsampled the Peruvian isolates and analyzed them alongside global representatives of both lineage 2 and lineage 4 for which both collection date and geographic origin were known (Supplementary Data File 2). The phylogenies were time-calibrated using a relaxed clock model as implemented in BactDating29, and MCMC convergence was assessed using the traces of the model parameters (Supplementary Fig. 7).

The phylogeographic history was inferred by reconstructing the ancestral states by maximum likelihood. The geographical origin of the isolates was treated as a discrete character, and we assumed that the time of introduction occurred at the first Peruvian node of each clade.

Two early introductions of lineage 2 from China in 1880 (1856;1905 95% CI) and 1906 (1881;1929 95% CI) accounted for 84% of the Peruvian dataset (275/327 isolates), and 77% (90/117) of the drug resistance isolates as well as 77% (20/26) of the independent resistance evolutionary events (Fig. 4a). Later introductions to Peru from China occurred between 1981 (1969;1989 95% CI) and 2004 (1999;2007 95% CI), with one introduction in 1986 (1974;1994 95% CI) from South Asia representing 6 isolates, one of which is drug resistant.

Fig. 4: Phylogeographic analysis of Mycobacterium tuberculosis introductions to Peru.
figure 4

a, b Inferred introductions of Mycobacterium tuberculosis in Peru. The top part shows a time-calibrated phylogeny, with inferred introductions to Peru highlighted in the nodes with colors representing the country from which the clade was introduced. Peruvian clades are shown in blue. The bottom part shows the estimated year of introduction. The analysis was done separately for a lineage 2 and b lineage 4. For visual representation purposes, only the year of introduction of clades with more than 10 tips are shown for lineage 4.

Lineage 4 was inferred to have been introduced in Peru several times over the years, mainly from Europe and Brazil (Fig. 4b). Lineage 4.3 (LAM) represents the first and main introduction from Brazil around 1512 (1383;1598 95% CI), shortly after the well documented arrival of the Europeans to the continent. Subsequent smaller introductions from Europe and Brazil around 1644 (1545;1713 95% CI) and 1743 (1671;1796 95% CI) shaped the LAM and T type lineages. We inferred that most of L4.1.2.1, part of the Haarlem sublineage, likely evolved from two introductions from Europe around 1693 (1614;1749 95% CI) and 1834 (1781;1869 95% CI). The most likely introductions of the X type clade (L4.1.1) occurred from Brazil in 1692 (1597;1754 95% CI) and South Africa for L4.1.1.3 in 1706 (1615;1773 95% CI).

Between lineage differences in drug resistance acquisition

The risk of acquiring drug resistance was calculated as the Cox Proportional Hazard Ratio (HR) using the time between sensitive internal nodes and the first drug resistant node in the time-calibrated phylogenetic trees. The Kaplan–Meier curves showed that lineage 2 had a higher probability of acquiring drug resistance than lineage 4 (Log-rank test p-value = 1.2 × 109; Fig. 5a). The estimated hazard ratio of drug resistance acquisition for lineage 2 was estimated to be 3.36 when compared to lineage 4 (HR 3.36, 95% CI 2.10–5.38, Likelihood ratio test p-value = 4.25 × 107). A similar trend was observed in the Samara dataset (HR 4.82, 95% CI 3.74–6.21, Likelihood ratio test p-value = 6.8 × 1034; Kaplan–Meier curve Log-rank test p-value = 1 × 1039; Fig. 5b). The risk of drug resistance acquisition was also higher in lineage 2 when compared to all sublineages of lineage 4 in the Peruvian dataset, using LAM3 as a reference (lineage 2 HR 3.32, 95% CI 1.84–6.28, Likelihood ratio test p-value = 1.9 × 104, all other p-values > 0.2; Kaplan–Meier curve Log-rank test p-value = 6.9 × 108; Fig. 5c). To assess the adequacy of our data to the proportional hazard assumption, we calculated the relationship between the Schoenfeld residuals against time. In all cases, a non-significant association between the Schoenfeld residuals and time supported the use of the proportional hazards model (Supplementary Fig. 8).

Fig. 5: Hazard ratio and Kaplan–Meier curve for different sublineages of Mycobacterium tuberculosis.
figure 5

ac Top: Hazard ratio (HR). Points and error bars represent the HR estimate and the 95% CI, respectively. The p-value for the HR was calculated using the likelihood ratio test. Bottom: Kaplan–Meier curve and numbers at risk. Y-axis represents the probability of remaining susceptible to any antibiotic, while the X-axis shows the time in years or the distance in branch length. Shaded areas show the 95% CI. Kaplan–Meier curves were compared and p-values were derived using the log-rank test. a Depicts the HR of lineage 2 compared to lineage 4 in the Peruvian dataset (HR 3.36, 95% CI 2.10–5.38, Likelihood ratio test p-value = 4.25 × 107) and the different Kaplan–Meier curve for lineage 2 and lineage 4 (Log-rank test p-value = 1.2 × 109). b Same metrics for the Samara dataset (HR 4.82, 95% CI 3.74–6.21, Likelihood ratio test p-value = 6.8 × 1034; Kaplan–Meier curve Log-rank test p-value = 1 × 1039). c Shows HR between lineage 2 and the different sublineages of lineage 4 found in the Peruvian dataset (LAM9, LAM3, LAM11, Haarlem, X type and T type), using LAM3 as a reference (lineage 2 HR 3.32, 95% CI 1.84–6.28, Likelihood ratio test p-value = 1.9 × 104, all other p-values > 0.2; Kaplan–Meier curve Log-rank test p-value = 6.9 × 108). Statistical significance of the hazard ratio differences presented next to the CI bars (*p < 0.05; **p < 0.01; ***p < 0.001).

In order to evaluate the robustness of our maximum likelihood phylogeny, we repeated both the dating and the survival analysis on 100 phylogenetic bootstrap replicates. In both lineage 2 and lineage 4, the Cox Proportional Hazard ratio was not significantly different between the maximum likelihood phylogeny and the bootstrap replicates (Supplementary Fig. 9a, b). Additionally, the Kaplan-Meier curve of the 100 replicates was similar to that of the maximum likelihood tree (Supplementary Fig. 9c, d).

Both phylogenetic trees were subsampled to include only the isolates from the 2009 population level study to prevent any biases due to inclusion of datasets enriched for drug resistance isolates (Supplementary Table 1). The results were congruent with those obtained with the entire dataset. Lineage 2 was characterized by a higher risk of drug resistance acquisition when compared to lineage 4 (HR 4.84, 95% CI 2.78–8.45, Likelihood ratio test p-value = 2.7 × 108; Kaplan-Meier curve Log-rank test p-value = 7.9 × 1010; Supplementary Fig. 10a). Moreover, lineage 2 also had a higher hazard ratio than any sublineage of lineage 4 (lineage 2 HR 5.1, 95% CI 2.17–11.9, Likelihood ratio test p-value = 1.8 × 104, all other p-values > 0.2; Kaplan-Meier curve Log-rank test p-value = 3.02 × 107; Supplementary Fig. 10b).

Several confounders may also explain the differential rate of drug resistance acquisition between lineages (Supplementary Fig. 11). We found no significant association between M. tuberculosis sublineages and HIV (Supplementary Fig. 11a) and smear positivity (Supplementary Fig. 11b) in a logistic regression model (n = 2133, all p-values > 0.1). It has been previously shown that prison conditions increase and amplify drug-resistance tuberculosis in Peru38,39. Our study population showed a higher proportion of prison infection in lineage 2 (5.8%, 12/206 patients) when compared to lineage 4 (1.4%, 28/2011 patients, Supplementary Fig. 11c), and a higher risk of lineage 2 infection within the prison population (n = 2217, OR = 4.4, 95% CI 2.1–8.5, p < 0.001). This finding did not explain the higher rate of drug resistance acquisition in lineage 2, since all the lineage 2 samples taken from prisoners belonged to the same cluster and none of them harbored drug resistance conferring mutations. Since previous treatment with antituberculous drugs has been associated to an increased risk of acquiring drug resistance38, we tested the association between previous treatment and the different sublineages of our cohort. Previous treatment history was available for 2236 samples contained metadata regarding previous treatment of tuberculosis. We found no significant association between M. tuberculosis sublineages and previous treatment with antituberculous drugs in a logistic regression model (n = 2236, all p-values > 0.1), suggesting that the between lineage differences in drug resistance acquisition observed in the survival analysis are not confounded by a differential distribution of antibiotics between sublineages (Supplementary Fig. 11d). Behavioral patterns may also affect the dynamics of drug resistance acquisition and transmission. Patient sex (Supplementary Fig. 11e) was not significantly associated with any of the M. tuberculosis sublineages (n = 2205, p-value = 0.3). The mean age for patients with lineage 4 infection was 33.3 (22–41 IQR), while the mean age for patients with lineage 2 was 30.4 (21–36 IQR). Although the distribution of patient age was similar between the two lineages (Supplementary Fig. 11f), lineage 4 showed an incident rate of age 1.09 higher than lineage 2 in a quasi-Poisson model (lineage 4 estimate = 0.09, standard error = 0.02, p = 0.001). On the other hand, age was not significantly associated with a higher risk of drug resistance acquisition in a logistic model (OR 0.999, 95%CI 0.994–1.003, p = 0.69).

The risk of developing MDR TB from isoniazid mono-resistance

To determine the effect of mono-resistance on the acquisition of further multidrug-resistance, the hazard ratio of acquiring rifampicin resistance was calculated for isoniazid mono-resistant ancestral genotypes versus susceptible ancestral strains. Genotypes with an isoniazid mono-resistant background had 15 times the hazard of developing rifampicin resistance tuberculosis relative to wild type susceptible strains (HR 15.12, 95% CI 10.54–21.69, Likelihood ratio test p-value < 1015; Kaplan–Meier curve Log-rank test p-value = 2, 7 × 1063; Fig. 6a). A larger hazard ratio was obtained in the Samara dataset (HR 37.28, 95% CI 18.81–73.88, Likelihood ratio test p-value = 3.4 × 1025; Kaplan–Meier curve p-value = 4.6 × 1063; Fig. 6b), although the low prevalence of mono-resistance clades in the Samara set may bias this estimate.

Fig. 6: Hazard ratio and Kaplan–Meier curve for rifampicin acquisition.
figure 6

a, b Top: Hazard ratio (HR). Points and error bars represent the HR estimate and the 95% CI, respectively. The p-value for the HR was calculated using the likelihood ratio test. Bottom: Kaplan–Meier curve and numbers at risk. Y-axis represents the probability of remaining susceptible to rifampicin, while the X-axis shows the time in years or the distance in branch length. Shaded areas show the 95% confidence interval. P-values for the Kaplan–Meier curves were calculated using the log-rank test. a Depicts the risk of acquiring rifampicin resistance from an already isoniazid mono-resistant background compared to a drug susceptible one (HR 15.12, 95% CI 10.54–21.69, Likelihood ratio test p-value = 1.3 × 1040) and the Kaplan–Meier curves for the different backgrounds (Log-rank test p-value = 2.7 × 1063). b Same metrics for the Samara dataset (HR 37.28, 95% CI 18.81–73.88, Likelihood ratio test p-value = 3.4 × 1025; Kaplan–Meier curve p-value = 4.6 × 1063). Statistical significance of the hazard ratio differences presented next to the CI bars (*p < 0.05; **p < 0.01; ***p < 0.001).

Multidrug-resistance was preceded by rifampicin mono-resistance only one time in the phylogenetic tree. Due to the infrequent occurrence of rifampicin mono-resistance prior to multidrug-resistance emergence, the risk of developing multidrug-resistance from a rifampicin mono-resistance background could not be reliably estimated.

Genomic signatures of drug resistance acquisition

Genome-wide survival analysis was performed using a Cox Proportional Hazard regression model to identify genetic variants in phylogenetic nodes inferred to be drug susceptible but associated with a higher risk of progression towards drug resistance. Resistant nodes were defined as those inferred to be resistant to any antibiotic in order to identify common pathways of increased risk of acquiring resistance regardless of the specific antibiotic. The association analysis was performed in lineage 4 and lineage 2 separately, and we further corrected for population structure using a kinship matrix, which reduced the genomic inflation factor (λ) to 1.16 (Supplementary Fig. 12).

Six variants in drug susceptible ancestral genotypes were associated with a higher risk of acquiring drug resistance in lineage 4 after population and multiple testing correction, three of which were in previously annotated genes (Fig. 7a). The variant with the lowest p-value corresponded to a 9 bp deletion at location 2,604,157 in the locus lppP, which encodes a lipoprotein and has been predicted to be required for growth in macrophages40. This deletion had a frequency of 1.7% in the population, and it evolved 12 times independently along the phylogenetic tree. Genotypes with this variant had a hazard ratio 7.36 times greater than those with an intact lppP (Fig. 7b, HR 7.36 95% CI 3.85–14.04, p-value = 7.46 × 10−10). We replicated our findings in a global data set of 1573 L4 isolates (Supplementary Data File 3), which was relatively enriched for drug resistance (55% of samples were resistant to any drug). The lppP deletion had a frequency of 9% and inferred susceptible genotypes with the deletion had a hazard ratio 3.6 times greater than those without it (HR 3.6, 95% CI 1.9–6.9, p-value = 8.7 × 105). Two synonymous polymorphisms at esx genes were found to be associated with a higher risk of acquiring drug resistance in inferred drug susceptible genotypes. The esx gene family encodes protein secretion systems described to be critical for growth, pathogenesis, and mycobacterial–host interactions41. The two polymorphisms were detected in the gene esxL at position 1,341,044 with a hazard ratio of 3.2 (HR 3.2 95% CI 1.91–5.37, p-value = 1.01 × 106), and at position 2,626,011 in the gene esxO (HR 11.12, 95% CI 5.50–22.5 p-value = 1.52 × 105) with a frequency in the population of 17 and 5%, respectively. In the L4 global dataset, the esxL SNP had a frequency of 19.5% while the esxO polymorphism was present in 10% of the isolates. Inferred susceptible genotypes in the global data set carrying the mutation in esxO had a risk of acquiring drug resistance 3.1 times higher than those with the reference genotype (HR 3.1, 95% CI 1.3–7.3, p-value = 0.009), while those carrying the mutation in esxL had a risk 1.4 higher, although differences where not statistically significant (HR 1.4, 95% CI 0.7–3.0, p-value = 0.3). Visual inspection of the short-read alignments around the described genes was undertaken to confirm high quality alignments over these regions (Supplementary Fig. 13).

Fig. 7: Genome-Wide association study (GWAS) results.
figure 7

a Manhattan plot for GWAS on increased risk of drug resistance acquisition in lineage 4. The red line represents the Bonferroni corrected p-value threshold of 3.37×10−5. Labels show the genes where the significant hits are located. Colors indicate the hazard ratio, with a scale of blue representing hazards ratio lower than 1 and a scale of reds for hazard ratios higher than 1. b Kaplan–Meier curve and numbers at risk of a 9 bp deletion in the gene lppP comparing the probability of remaining susceptible between those nodes without the deletion (blue) and those with it (red). Shaded areas represent the 95% CI. The p-value for the Kaplan–Meier curves was calculated using the log-rank test.

For the gene-based GWAS, non-synonymous variants were aggregated for each locus and a binary matrix was created reflecting whether internal nodes and tips contained at least one non-synonymous polymorphism for each gene. After population and Bonferroni multiple testing correction, a total of 35 variants had a p-value lower than the significance threshold of 4.38 × 105 (Table 3). Functional annotations curated in the Mycobrowser42 showed that most genes were related to metabolism and cellular respiration (frdB, cyp135A1, Rv3113, hemE, pyrG, galK, icd1, Rv1096, fum, galE2, Rv1751, Rv3712, asnB) as well as cell wall processes (kdpA, lppP, cfp2, lytB1, mmpL1, Rv1417, lgt, Rv0226c, caeA, murF, pstB).

Table 3 Gene-based association analysis.

No significant associations were identified for lineage 2 after correcting for population structure, possibly due to the lower diversity of lineage 2 and the strong lineage effect on the phenotype. The analysis could not be replicated in the Samara dataset as the Samara dataset was significantly smaller and hence lacked sufficient statistical power.


This study represents the largest population level genomic analysis of Mycobacterium tuberculosis to date. Our 17-year sampling time frame provided a unique opportunity to study drug resistance acquisition dynamics and evolution. To our knowledge this is the first description and evaluation of pathogen pre-resistance (pre-existing polymorphisms that predispose to the acquisition of future drug resistance).

Using an ancestral state genome-wide survival analysis to move in time through the phylogenetic tree, we show that M. tuberculosis is predisposed to acquire drug resistance mutations at the lineage level, after mono-resistance, and at the level of nucleotide polymorphisms. Identifying pathogen genetic factors that predispose strains to evolve drug resistance could help prevent the acquisition and spread of resistance as well as treatment failure by expanding treatments to those strains most likely to become resistant in the future.

Previous studies of acquired drug resistance at the sublineage level in M. tuberculosis have led to contradictory outcomes, with small sample sizes in fluctuation assays7,12 or using amalgamated sub-population level samples. Here we demonstrate that lineage 2 acquired resistance to antibiotics more rapidly than lineage 4. There were no significant differences observed in drug resistance acquisition between the sub-lineages of the most diverse lineage 4. Even though lineage 2 showed an increased risk in drug-resistance acquisition, lineage 4 evolved resistance earlier than lineage 2 for almost all drugs analyzed, with the exception of streptomycin. This may be explained by the Euro-American distribution of lineage 4 and the earlier widespread implementation of antibiotics in these regions. Our analysis also suggests that the acquisition dynamic of compensatory mutations was similar for both lineage 2 and lineage 4. After rifampicin resistance associated mutations evolve in a clade, non-synonymous mutations in rpoC start to occur and steadily accumulate over time. Thus, pre-resistance mutations emerge independently of compensatory mutations.

The identification and control of mono-resistant strains is a key component of tuberculosis public health infection prevention and control efforts. Mono-resistance is associated with worse clinical outcomes43 and an increased probability of progressing to multidrug-resistance15. At the population level, our study quantifies this risk and shows that isoniazid mono-resistant strains have at least 15 times the hazard of developing multidrug-resistance relative to wild type susceptible strains. Despite the use of new therapeutics, multi-drug resistant tuberculosis continues to require polypharmacy with increased toxicity and longer treatment duration2. Globally, molecular rapid drug resistance surveillance is focused primarily on rifampicin, with the widely implemented GeneXpert MTB/RIF PCR based assay unable to detect isoniazid mono-resistance. Although Drug Susceptibility Testing (DST) is the current gold standard for identification of drug resistance isolates and can detect isoniazid mono-resistance, known diagnostic delays associated with it may limit its use in reducing mono-resistance amplification44,45. Inadequate diagnosis of isoniazid mono-resistance will inevitably lead to inappropriate treatment and could fuel rapid evolution of multidrug-resistance, thus posing a significant threat to tuberculosis control.

We also identified loci associated with higher risk of future drug resistance acquisition. To remain polymorphic, these variants must be under balancing selection and only be positively selected once exposed to drug therapy. Rather than causing resistance directly, these variants could promote resistance acquisition by compensating for the fitness costs of resistance in vivo35 or by increasing drug tolerance46.

At the gene level, most non-synonymous mutations associated with pre-resistance genotypes were located in genes related to cell wall processes and metabolism. Functional studies and prospective clinical trials are warranted to confirm their association with future drug resistance acquisition.

The variant with the lowest p-value corresponded to a 9 bp deletion in the gene lppP present at a frequency of 1.7% in the population and arose 12 times independently. Deletions in lipoproteins have been well characterized in the past47, and lppP has been predicted to be required for growth in macrophages40. Lipoproteins can act as antigenic proteins47, and thus deletions in the genes encoding them may alter the interaction between the bacilli and the macrophages, potentially conferring a selective advantage in the presence of drug and increasing the probability of acquiring drug resistance. This variant was also present in a global dataset for L4 at a frequency of 9%, which could be explained by the higher prevalence of drug resistance isolates in most publicly available data sets.

Two additional synonymous variants were identified in the genes esxL and esxO, which encode the ESAT-6-like proteins esxL and esxO. These genes are part of a family of genes that encode immunogenic secreted proteins that play a role in mycobacterial growth, pathogenesis, and host-pathogen interactions41. Moreover, esxO has been associated with pathogenesis by inducing autophagy in infected macrophages48. Synonymous homoplastic variants in esx genes have been previously identified23, but their phenotypic effects are still unclear.

This study benefited from an unbiased population level coverage of both drug resistant and drug susceptible strains that enabled us to reliably correct for the founder effect and control for the influence of pre-existing population diversity. The large sampling size and time frame—a consequence of 17-years of continued research in the same location—allowed us to generate time-calibrated phylogenies without imposing a global mutation rate. This was a pre-requisite for the downstream analyses and our GWAS survival analysis approach. We were also able to replicate our sublineage and mono-resistance dependent hazards of acquired resistance in the smaller Samara dataset. However, the time scale and size of this publicly available data was insufficient to allow us to confirm the effect of the lppP deletion in a second independent dataset.

Although our phylogenetic analysis reveals trends of drug resistance acquisition over evolutionary time, prospective cohort studies are required to determine the effect of these mutations at the individual patient and household level. Non-bacterial factors are unlikely to influence our findings, since they would have to have been disproportionately and consistently associated to a specific lineage over long periods of time. Nevertheless, we explored the influence of confounding variables on our dataset. There was no difference observed in the proportion of patients receiving previous antituberculous treatment between the two lineages. This makes our findings unlikely to be influenced by differential exposure to drugs among lineages. There was no difference in sputum smear grade between lineages, suggesting that our findings are not a consequence of increased pathogenicity of lineage 2 in comparison to lineage 4. Moreover, other factors that may affect the rates of drug resistance acquisition such as HIV status, sex, or imprisonment, did not show differences between lineages. Even though the age of patients with lineage 4 infection was significantly higher than that of patients with lineage 2, the difference was small. Additionally, patient age was not associated to a higher incidence of drug resistance, and therefore it is unlikely that age differences explain the higher risk of acquiring antibiotic resistance of lineage 2. Differential healthcare systems influence the acquisition and transmission of drug resistance tuberculosis, and thus importation events to Peru of resistant strains from specific lineages could have affected the dynamics of drug resistance acquisition. We showed that lineage 2 in Peru is characterized by two importations around 1900 CE, which is consistent with major immigration events of laborers from China to Peru at the end of the 19th century to work in the railroads, guano mines, and cotton and sugarcane plantations49,50. Conversely, the majority of lineage 4 clades were imported from Europe and Brazil between the 16th and 18th centuries, compatible with European colonial expansion51. Therefore, significant immigration to Peru occurred well before the advent of antibiotics, which limits the influence of imported drug resistant strains. Moreover, the majority of introductions occurring in recent times were of drug susceptible clades. Nevertheless, it is possible that some resistance events may have arisen as a result of importation of resistant strains from countries with different drug selection pressures.

In summary, this population wide 17 year-long epidemiological study of M. tuberculosis genetics provides the first description and evaluation of pre-resistant polymorphisms in susceptible genotypes that predispose to the acquisition of future drug resistance. Prediction of future drug resistance in susceptible pathogens together with targeted expanded therapy has the potential to prevent drug resistance emergence in M. tuberculosis and other pathogens. Prospective cohort studies of participants with and without these polymorphisms should be undertaken to inform clinical trials of personalized pathogen genomic therapy. This ancestral state genome-wide survival analysis could also be employed to predict and prevent the emergence of resistance or indeed any important trait of interest in other organisms.


Ethics approval

Ethical approval for sample collection and processing was obtained from the Institutional Review Board of Universidad Peruana Cayetano Heredia and the Peruvian Ministry of Health for all individual studies from which this data was derived38.

Study design and sample selection

Samples were selected from previous projects taking place across the region of Lima. The first project consisted of a population level study carried out between 2008 and 2010 as part of the population level implementation of Microscopic Observation Drug Susceptibility (MODS) testing52,53. A total of 2139 unselected patients of tuberculosis were collected (Supplementary Data File 1). Of these patients, 284 were analyzed in previous studies (PRJEB5280)23, while 1855 were processed as part of this project (PRJEB39837).

A second set of 213 MDR-TB samples was obtained from a 3-year long household follow-up study conducted between 2010–201338, of which 185 randomly selected samples underwent whole-genome sequencing (PRJEB5280)23.

Additionally, 42 unpublished whole-genome sequences of samples collected from 1999 to 2007 in different regions of Lima were added to the study (PRJEB47846), as well as 575 samples that were collected between 2003 and 2013 as part of the CRyPTIC Consortium (PRJEB32234)54, and 489 samples from the TANDEM Consortium (PRJEB23245)55 taken between 2014 and 2016.

All samples without collection date, as well as reference clinical samples, were excluded from the final list. Drug Susceptibility Testing (DST) was performed either by MODS53 or by the proportions method on agar56.

To replicate our findings, all the analyses were repeated in a publicly available independent dataset of 1027 isolates from Samara, Russia (PRJEB2138), as the sampling was also representative of the population19. We also included a global data set of 1573 publicly available lineage 4 samples relatively enriched for drug resistance (Supplementary Data File 3).

Whole-genome sequence analysis

Quality analysis of the raw reads was performed using FastQC57. A de-novo assembly of the short reads was done with SPAdes genome assembler v3.14.058 across kmers of size 21, 33, 45, 55, 65, 75, 81, 101, 111, and 121. The resulting assembly contigs were mapped to the well annotated H37Rv reference genome (Gene bank: AL123456) using minimap259 with the asm20 option. Single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) were identified with BCFtools mpileup and BCFtools call v1.960 using the multiallelic calling algorithm, keeping the information about every single site in the genome in a VCF file. Lastly, indels were left-aligned and normalized using BCFtools norm. A consensus sequence was created from the VCF file. In order to determine the quality of the variants, the raw reads were mapped against the resulting consensus sequence using the mem algorithm implemented in BWA v0.7.1761, after which the alignments were sorted using SAMtools v1.962 and filtered for possible PCR and optical duplicates using Picard v2.19.063. Local realignment around indels was performed using the GATK v3.8-1-0 ‘IndelRealigner’ module64. The mean coverage for each sample was calculated as the number of mapped bases (excluding soft-clipped bases) divided by the genome size. Samples with a mean coverage lower than 15x were excluded from subsequent analysis. SNPs and indels were detected as described in the previous step.

Variants that did not meet the quality criteria were filtered using a combination of BCFtools filter and custom scripts in Python v3.7.3 with the following cutoffs: minimum Phred-scaled quality score (QUAL) of 20; minimum mapping quality (MQ) of 20; minimum genotype quality (QG) of 20; minimum read position bias (RPB), mapping quality bias (MQB), and strand bias (SP) of 0.001; minimum depth (DP) of 10 and a maximum of 5 times the mean coverage; minimum of reads supporting the alternate allele (AD) of 75% of the total depth in that position, with no less than two reads in the forward (ADF) and the reverse (ADR) strands. Additionally, SNPs within 2 bp of an indel and indels within 3 bp of another indel were removed, as both situations can be indicative of mapping artifacts. Positions that did not meet the quality criteria were annotated using the IUPAC ambiguity codes65. Samples with more than 25 high quality heterozygous calls were removed to avoid the inclusion of putative mixed infections. Variants that overlapped 100 bp intervals around known hypervariable regions, such repetitive elements and transposases66, were removed from the analysis as this may affect the reliability of the alignment. Similarly, recombinant regions in genes coding for proline-glutamate (PE) or proline-proline-glutamate (PPE)67, and SNPs implicated in drug resistance33 were excluded in order to minimize homoplasies that could disrupt the tree topology. The resulting sequences were concatenated to generate a multiple sequence alignment. Sites with a proportion of ambiguous bases higher than 10% were excluded from the analysis. Last, samples with a proportion of ambiguous sites in the alignment higher than 5% were excluded.

The functional consequence of variants was assessed using the Variant Effect predictor (VEP) v104.368

Phylogenetic analyses

All the phylogenetic analyses were performed based on the alignment containing both lineage 4 and lineage 2 samples, as well as separately for each lineage.

A maximum likelihood phylogeny was inferred using RAxML-NG69 with the GTR model, 20 starting trees (10 random and 10 parsimony), 100 bootstrap replicates, and a minimum branch length of 10−9. A Lineage 2 sample randomly selected from our dataset was selected as an outgroup for the Lineage 4 phylogeny. Likewise, a random Lineage 4 isolate was used as a root for the Lineage 2 phylogeny. The tree containing both lineage 4 and lineage 2 samples was rooted using a lineage 1 isolate.

In order to construct a time-calibrated phylogeny, we tested whether there was a detectable amount of evolutionary change between samples collected at different times30,31. This was done in lineage 4 and lineage 2 separately in order to avoid population structure confounding in the temporal signal27. Two different tests of the temporal signal were applied: the root-to-tip regression method and the date-randomization test28. For the former, BactDating29 was used to perform a linear regression between the collection dates and their root-to-tip genetic distance in the maximum likelihood tree. Additionally, we carried out a date-randomization test, where evolutionary rates estimated by BactDating29 were compared between the observed data set and 100 data sets obtained by permutation of sampling dates70.

BactDating29 was used to time-calibrate the tree using the mixed model for 107 iterations to achieve both convergence of the MCMC chains and an effective sample size of at least 100.

The phylogenetic global context of the Peruvian isolates was investigated by subsampling the phylogenies and repeating the analysis alongside publicly available isolates representative of the global diversity of M. tuberculosis (Supplementary Data File 2). To subsample the phylogenies, first a random sample was selected from each phylogenetic cluster with a branch length lower than 1 SNPs per genome. The phylogeny was then divided into clusters of samples 50 SNPs apart, and a maximum of 20 samples for lineage 2 and 5 samples for lineage 4 were randomly selected for each cluster, unless it consisted of only 1 sample, in which case it was ignored. The phylogeny with the Peruvian subsamples and the global representatives was inferred separately for lineage 2 and for lineage 4 as described above.

The subsequent phylogenetic analysis was performed using the R package ape71. Marginal reconstruction of the ancestral sequences was carried out by maximum likelihood as implemented in Phangorn72, including gaps and ambiguity codes to reflect prior probabilities of character states65.

Time-to-event analysis

Time-to-event analysis was performed on the tree using the R package Survival73. The time was measured for all pairs of nodes as the distance between the older and the younger node in the time-calibrated phylogeny. An observation was defined as censored if both nodes were drug sensitive. On the other hand, an event occurred if the older node was drug sensitive and the younger node was drug resistant. Only the first acquisition of resistance was considered. Observations taking place before 1940 were discarded. The Kaplan-Meier survival curve and the Cox proportional hazard ratio were calculated. The Kaplan-Meier curves for different groups were compared using the log-rank test, where the null hypothesis is that there is no difference in survival between the different groups. Differences in the hazard ratio were tested using the likelihood ratio test. The entire pipeline was repeated for 100 phylogenetic bootstrap replicates.

Genome-wide association of predisposition to drug resistance

Missing base calls at the tips were imputed by maximum likelihood using the re-rooting method74 and the IUPAC ambiguous codes to reflect tip state prior probabilities. In short, for each tip the phylogeny was re-rooted at that tip and the marginal probabilities for the missing bases were calculated for that node using the R package phytools75. Association analysis was performed at the gene and SNP level using the variant sites alignment for the tips of the phylogenetic tree, as well as the reconstructed sequences of the internal nodes. The phenotype was defined as leading to resistance in the phylogenetic tree, and thus only drug susceptible nodes that immediately preceded the first resistant node of each branch were considered. For the gene level analysis, non-synonymous variants were aggregated, excluding lineage specific SNPs. Loci with a frequency of non-synonymous variants in the dataset lower than 1% were not considered in the analysis. At the SNP-level, variants with a frequency in the population (tips of the tree) lower than 1% were excluded. Furthermore, only those variants that were polymorphic at the node level were used. Genome-wide association was performed using the Cox proportional hazard model and the time between nodes. In order to correct for population structure, a genetic distance matrix was calculated using SNPs with a frequency in the population higher than 5%, and the eigenvectors were used as covariates in the Cox regression model. The genomic inflation factor (λ) was calculated as the ratio of the median of the empirically observed χ2 to the median of the expected χ2. The p-values were corrected for multiple testing using a Bonferroni correction. Functional annotation of the genomic variants was assessed using Mycobrowser42. Alignments were visually inspected for a random selection of samples using the Integrative genomics viewer (IGV)76 and the R software package Gviz77.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All raw sequencing data are available with accession numbers listed in the Methods section. Samples sequenced as part of this study have been submitted to the European Nucleotide Archive under accessions PRJEB39837 and PRJEB47846.

Publicly available datasets used in this study include PRJEB5280, PRJEB32234, PRJEB23245, and PRJEB2138.

All other publicly available datasets are listed in Supplementary Data File 2 and Supplementary Data File 3.

Code availability

All custom code used in this article can be accessed at


  1. 1.

    Paulson, T. Epidemiology: a mortal foe. Nature 502, S2–S3 (2013).

    ADS  PubMed  Google Scholar 

  2. 2.

    World Health Organization. Global tuberculosis report 2020 (2020).

  3. 3.

    Raviglione, M. XDR-TB: Entering the post-antibiotic era? Int. J. Tuberculosis Lung Dis. 10, 1185–1187 (2006).

    Google Scholar 

  4. 4.

    Walker, T. M. et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect. Dis. 15, 1193–1202 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Gandhi, N. R. et al. Multidrug-resistant and extensively drug-resistant tuberculosis: a threat to global control of tuberculosis. Lancet 375, 1830–1843 (2010).

    PubMed  Google Scholar 

  6. 6.

    Gygli, S. M., Borrell, S., Trauner, A. & Gagneux, S. Antimicrobial resistance in Mycobacterium tuberculosis: mechanistic and evolutionary perspectives. FEMS Microbiol. Rev. 41, 354–373 (2017).

    CAS  PubMed  Google Scholar 

  7. 7.

    Ford, C. B. et al. Mycobacterium tuberculosis mutation rate estimates from different lineages predict substantial differences in the emergence of drug-resistant tuberculosis. Nat. Genet. 45, 784–790 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  8. 8.

    Cox, H. S. et al. The Beijing genotype and drug resistant tuberculosis in the Aral Sea region of Central Asia. Respiratory Res. 6, 134 (2005).

    Google Scholar 

  9. 9.

    Drobniewski, F. Drug-resistant tuberculosis, clinical virulence, and the dominance of the Beijing strain family in Russia. JAMA 293, 2726 (2005).

    CAS  PubMed  Google Scholar 

  10. 10.

    Taype, C. et al. Genetic diversity, population structure and drug resistance of Mycobacterium tuberculosis in Peru. Infect. Genet. Evolution 12, 577–585 (2012).

    CAS  Google Scholar 

  11. 11.

    Pang, Y. et al. Spoligotyping and drug resistance analysis of Mycobacterium tuberculosis strains from national survey in China. PLoS ONE 7, e32976 (2012).

    ADS  CAS  PubMed  PubMed Central  Google Scholar 

  12. 12.

    Werngren, J. & Hoffner, S. E. Drug-susceptible Mycobacterium tuberculosis Beijing genotype does not develop mutation-conferred resistance to rifampin at an elevated rate. J. Clin. Microbiol. 41, 1520–1524 (2003).

    CAS  PubMed  PubMed Central  Google Scholar 

  13. 13.

    Glynn, J. R., Whiteley, J., Bifani, P. J., Kremer, K. & van Soolingen, D. Worldwide occurrence of Beijing/W strains of Mycobacterium tuberculosis: a systematic review. Emerg. Infect. Dis. 8, 843–849 (2002).

    PubMed  PubMed Central  Google Scholar 

  14. 14.

    Borrell, S. et al. Epistasis between antibiotic resistance mutations drives the evolution of extensively drug-resistant tuberculosis. Evolution Med. Public Health 2013, 65–74 (2013).

    Google Scholar 

  15. 15.

    Saunders, N. J. et al. Deep resequencing of serial sputum isolates of Mycobacterium tuberculosis during therapeutic failure due to poor compliance reveals stepwise mutation of key resistance genes on an otherwise stable genetic background. J. Infect. 62, 212–217 (2011).

    PubMed  Google Scholar 

  16. 16.

    Menzies, D. et al. Effect of duration and intermittency of Rifampin on tuberculosis treatment outcomes: a systematic review and meta-analysis. PLoS Med. 6, e1000146 (2009).

    PubMed  PubMed Central  Google Scholar 

  17. 17.

    Eldholm, V. et al. Four decades of transmission of a multidrug-resistant Mycobacterium tuberculosis outbreak strain. Nat. Commun. 6, 7119 (2015).

    ADS  CAS  PubMed  Google Scholar 

  18. 18.

    Cohen, K. A. et al. Evolution of extensively drug-resistant tuberculosis over four decades: whole genome sequencing and dating analysis of Mycobacterium tuberculosis Isolates from KwaZulu-Natal. PLOS Med. 12, e1001880 (2015).

    PubMed  PubMed Central  Google Scholar 

  19. 19.

    Casali, N. et al. Evolution and transmission of drug-resistant tuberculosis in a Russian population. Nat. Genet. 46, 279–286 (2014).

    CAS  PubMed  PubMed Central  Google Scholar 

  20. 20.

    Manson, A. L. et al. Genomic analysis of globally diverse Mycobacterium tuberculosis strains provides insights into the emergence and spread of multidrug resistance. Nat. Genet. 49, 395–402 (2017).

    CAS  PubMed  PubMed Central  Google Scholar 

  21. 21.

    Grenfell, B. T. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332 (2004).

    ADS  CAS  PubMed  Google Scholar 

  22. 22.

    Galvani, A. P. Epidemiology meets evolutionary ecology. Trends Ecol. Evolution 18, 132–139 (2003).

    Google Scholar 

  23. 23.

    Grandjean, L. et al. Convergent evolution and topologically disruptive polymorphisms among multidrug-resistant tuberculosis in Peru. PLoS ONE 12, e0189838 (2017).

    PubMed  PubMed Central  Google Scholar 

  24. 24.

    Coll, F. et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat. Commun. 5, 4812 (2014).

    ADS  CAS  PubMed  Google Scholar 

  25. 25.

    Ajawatanawong, P. et al. A novel ancestral Beijing sublineage of Mycobacterium tuberculosis suggests the transition site to modern Beijing sublineages. Sci. Rep. 9, 1–12 (2019).

    CAS  Google Scholar 

  26. 26.

    World Health Organization. Global tuberculosis report 2009 (2009).

  27. 27.

    Murray, G. G. R. et al. The effect of genetic structure on molecular dating and tests for temporal signal. Methods Ecol. Evol. 7, 80–89 (2016).

    PubMed  Google Scholar 

  28. 28.

    Rieux, A. & Balloux, F. Inferences from tip-calibrated phylogenies: a review and a practical guide. Mol. Ecol. 25, 1911–1924 (2016).

    PubMed  PubMed Central  Google Scholar 

  29. 29.

    Didelot, X., Croucher, N. J., Bentley, S. D., Harris, S. R. & Wilson, D. J. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Res. 46, e134–e134 (2018).

    PubMed  PubMed Central  Google Scholar 

  30. 30.

    Drummond, A. J., Pybus, O. G., Rambaut, A., Forsberg, R. & Rodrigo, A. G. Measurably evolving populations. Trends Ecol. Evol. 18, 481–488 (2003).

    Google Scholar 

  31. 31.

    Biek, R., Pybus, O. G., Lloyd-Smith, J. O. & Didelot, X. Measurably evolving pathogens in the genomic era. Trends Ecol. Evol. 30, 306–313 (2015).

    PubMed  PubMed Central  Google Scholar 

  32. 32.

    Menardo, F., Duchêne, S., Brites, D. & Gagneux, S. The molecular clock of Mycobacterium tuberculosis. PLOS Pathog. 15, e1008067 (2019).

    PubMed  PubMed Central  Google Scholar 

  33. 33.

    Coll, F. et al. Rapid determination of anti-tuberculosis drug resistance from whole-genome sequences. Genome Med. 7, 51 (2015).

    PubMed  PubMed Central  Google Scholar 

  34. 34.

    Keshavjee, S. & Farmer, P. E. Tuberculosis, drug resistance, and the history of modern medicine. N. Engl. J. Med. 367, 931–936 (2012).

    CAS  PubMed  Google Scholar 

  35. 35.

    Gagneux, S. et al. Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc. Natl Acad. Sci. USA 103, 2869–2873 (2006).

    ADS  CAS  PubMed  PubMed Central  Google Scholar 

  36. 36.

    Sherman, D. R. et al. Compensatory ahpC gene expression in isoniazid-resistant Mycobacterium tuberculosis. Science 272, 1641–1643 (1996).

    ADS  CAS  PubMed  Google Scholar 

  37. 37.

    Comas, I. et al. Whole-genome sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat. Genet.44, 106–110 (2011).

    PubMed  PubMed Central  Google Scholar 

  38. 38.

    Grandjean, L. et al. Transmission of multidrug-resistant and drug-susceptible tuberculosis within households: a prospective cohort study. PLOS Med. 12, e1001843 (2015).

    PubMed  PubMed Central  Google Scholar 

  39. 39.

    Warren, J. L. et al. Investigating spillover of multidrug-resistant tuberculosis from a prison: a spatial and molecular epidemiological analysis. BMC Med. 16, 122 (2018).

    PubMed  PubMed Central  Google Scholar 

  40. 40.

    Rengarajan, J., Bloom, B. R. & Rubin, E. J. From the cover: genome-wide requirements for Mycobacterium tuberculosis adaptation and survival in macrophages. Proc. Natl Acad. Sci. 102, 8327–8332 (2005).

    ADS  CAS  PubMed  PubMed Central  Google Scholar 

  41. 41.

    Tufariello, J. M. et al. Separable roles for Mycobacterium tuberculosis ESX-3 effectors in iron acquisition and virulence. Proc. Natl Acad. Sci. USA 113, E348–E357 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  42. 42.

    Kapopoulou, A., Lew, J. M. & Cole, S. T. The MycoBrowser portal: a comprehensive and manually annotated resource for mycobacterial genomes. Tuberculosis 91, 8–13 (2011).

    CAS  PubMed  Google Scholar 

  43. 43.

    Van der Heijden, Y. F. et al. Isoniazid-monoresistant tuberculosis is associated with poor treatment outcomes in Durban, South Africa. Int. J. Tuberculosis Lung Dis. 21, 670–676 (2017).

    Google Scholar 

  44. 44.

    Ahmad, S. & Mokaddas, E. Current status and future trends in the diagnosis and treatment of drug-susceptible and multidrug-resistant tuberculosis. J. Infect. Public Health 7, 75–91 (2014).

    PubMed  Google Scholar 

  45. 45.

    Nguyen, T. N. A., Anton-Le Berre, V., Bañuls, A.-L. & Nguyen, T. V. A. Molecular diagnosis of drug-resistant tuberculosis; a literature review. Front. Microbiol. 10, 794 (2019).

    PubMed  PubMed Central  Google Scholar 

  46. 46.

    Hicks, N. D. et al. Clinically prevalent mutations in Mycobacterium tuberculosis alter propionate metabolism and mediate multidrug tolerance. Nat. Microbiol. 3, 1032–1042 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  47. 47.

    Tsolaki, A. G. et al. Functional and evolutionary genomics of Mycobacterium tuberculosis: Insights from genomic deletions in 100 strains. Proc. Natl Acad. Sci. USA 101, 4865–4870 (2004).

    ADS  CAS  PubMed  PubMed Central  Google Scholar 

  48. 48.

    Mohanty, S. et al. Mycobacterium tuberculosis EsxO (Rv2346c) promotes bacillary survival by inducing oxidative stress mediated genomic instability in macrophages. Tuberculosis 96, 44–57 (2016).

    CAS  PubMed  Google Scholar 

  49. 49.

    Gonzales, M. J. Chinese plantation workers and social conflict in Peru in the late nineteenth century. J. Latin Am. Studies 21, 385–424 (1989).

    Google Scholar 

  50. 50.

    Ritacco, V. et al. Mycobacterium tuberculosis strains of the Beijing genotype are rarely observed in tuberculosis patients in South America. Mem. do Inst. Oswaldo Cruz 103, 489–492 (2008).

    Google Scholar 

  51. 51.

    Brynildsrud, O. B. et al. Global expansion of Mycobacterium tuberculosis lineage 4 shaped by colonial migration and local adaptation. Sci. Adv. 4, eaat5869 (2018).

    ADS  PubMed  PubMed Central  Google Scholar 

  52. 52.

    Moore, D. A. et al. Microscopic-observation drug-susceptibility assay for the diagnosis of TB. N. Engl. J. Med. 355, 1539–1550 (2006).

    CAS  PubMed  PubMed Central  Google Scholar 

  53. 53.

    Caviedes, L. et al. Rapid, efficient detection and drug susceptibility testing of Mycobacterium tuberculosis in sputum by microscopic observation of broth cultures. J. Clin. Microbiol. 38, 1203–1208 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  54. 54.

    The CRyPTIC Consortium; 100000 Genomes Project. Prediction of susceptibility to first-line tuberculosis drugs by DNA sequencing. N. Engl. J. Med. 379, 1403–1415 (2018).

    Google Scholar 

  55. 55.

    Ugarte-Gil, C. et al. Diabetes mellitus among pulmonary tuberculosis patients from 4 tuberculosis-endemic countries: the tandem study. Clin. Infect. Dis. 70, 780–788 (2020).

    PubMed  Google Scholar 

  56. 56.

    Clinical and Laboratory Standards Institute (CLSI). Susceptibility Testing of Mycobacteria, Nocardiae, and Other Aerobic Actinomycetes; Approved Standard—Second Edition tech. rep. (2011).

  57. 57.

    Andrews, S. & Babraham Bioinformatics. FastQC: a quality control tool for high throughput sequence data. Manual (2010).

  58. 58.

    Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).

    MathSciNet  CAS  PubMed  PubMed Central  Google Scholar 

  59. 59.

    Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  60. 60.

    Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  61. 61.

    Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

    PubMed  PubMed Central  Google Scholar 

  62. 62.

    Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    PubMed  PubMed Central  Google Scholar 

  63. 63.

    Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools in Conference on Human Factors in Computing Systems - Proceedings (2016).

  64. 64.

    McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

    CAS  PubMed  PubMed Central  Google Scholar 

  65. 65.

    Cornish-Bowden, A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 13, 3021–3030 (1985).

    CAS  PubMed  PubMed Central  Google Scholar 

  66. 66.

    Cole, S. T. et al. Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998).

    ADS  CAS  PubMed  Google Scholar 

  67. 67.

    Phelan, J. E. et al. Recombination in pe/ppe genes contributes to genetic variation in Mycobacterium tuberculosis lineages. BMC Genomics 17, 151 (2016).

    PubMed  PubMed Central  Google Scholar 

  68. 68.

    McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 1–14 (2016).

    Google Scholar 

  69. 69.

    Kozlov, A. M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35, 4453–4455 (2019).

    CAS  PubMed  PubMed Central  Google Scholar 

  70. 70.

    Duchêne, S., Duchêne, D., Holmes, E. C. & Ho, S. Y. The performance of the date-randomization test in phylogenetic analyses of time-structured virus data. Mol. Biol. Evolution 32, 1895–1906 (2015).

    Google Scholar 

  71. 71.

    Paradis, E., Claude, J. & Strimmer, K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289–290 (2004).

    CAS  PubMed  Google Scholar 

  72. 72.

    Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).

    CAS  PubMed  Google Scholar 

  73. 73.

    Terry M. Therneau & Patricia M. Grambsch. Modeling Survival Data: Extending the Cox Model (Springer, 2000).

  74. 74.

    Yang, Z., Kumar, S. & Nei, M. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141, 1641–1650 (1995).

    CAS  PubMed  PubMed Central  Google Scholar 

  75. 75.

    Revell, L. J. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223 (2012).

    Google Scholar 

  76. 76.

    Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  77. 77.

    Hahne, F. & Ivanek, R. in Statistical Genomics: Methods and Protocols 335–351 (2016).

Download references


We would like to thank the participants of the study. LG was supported by the Wellcome Trust (201470/Z/16/Z), the National Institute of Allergy and Infectious Diseases of the National Institutes of Health under award number 1R01AI146338 and by the GOSH/ICH Biomedical Research Centre. OMK was supported by the Imperial Biomedical Research Centre (NIHR Imperial BRC, grant P45058). XD was supported by the NIHR Health Protection Research Unit in Genomics and Enabling Data. We thank the CRyPTIC project and the Tandem project for making whole-genome data available in the public domain. All authors acknowledge UCL Computer Science Technical Support Group (TSG) and the UCL Department of Computer Science High Performance Computing Cluster.

Author information




A.T.O., L.G., X.D., O.M.K., F.B., J.R.V., R.G., C.B. and D.M. conceived and designed the study. J.C. performed and supervised tuberculosis cultures, DNA extraction, laboratory, and sequencing work. L.G., X.D., F.B. and O.M.K. jointly supervised the research. A.T.O., L.G., F.B. and X.D. performed and advised on computational analyses. A.T.O. and L.G. wrote the manuscript with input from all co-authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Louis Grandjean.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Torres Ortiz, A., Coronel, J., Vidal, J.R. et al. Genomic signatures of pre-resistance in Mycobacterium tuberculosis. Nat Commun 12, 7312 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing