Introduction

Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), affects ~ 10 million people annually1. Despite a recent reduction in TB incidence, increasing drug resistance is making infection control difficult1. New diagnostic tools and treatments are needed in developing countries to inform clinical and disease control activities2. The recommended combination of anti-TB drugs involves a 6-month standard regimen including first-line rifampicin, isoniazid, ethambutol, and pyrazinamide1. The detection of M. tuberculosis resistant to first-line rifampicin (RR-TB) and isoniazid, together called multidrug resistance (MDR-TB), is especially crucial in Pakistan3. The country is in the top eight for RR-TB and MDR-TB burden, with an estimated 6100 of its > 330,000 TB cases in 2019 presenting with RR-TB or greater resistance4.

Khyber Pakhtunkhwa is the third largest province of Pakistan, with an area of 74,521 km2 and a population size of approximately 30.5 million individuals. It has a history of armed conflicts, and TB control has been a neglected area of public health. Recently, a TB control program was launched at Hayatabad Medical Complex Peshawar to monitor disease incidence and perform drug susceptibility testing (DST) to assess levels of RR-TB and MDR-TB. RR-TB is mainly caused by the presence of mutations in the RNA polymerase β subunit (rpoB) gene. Mutations in an 81-bp core region of the rpoB gene, known as the RR-determining region (RRDR), account for 95% of the RR-TB detected in clinical isolates5. However, secondary mutations in the rpoA or rpoC genes can mitigate the initial fitness cost caused by a rpoB mutation6,7. About one-third of RR-TB involves rpoB mutant strains with companion rpoC or rpoA mutations8. Here, by sequencing fifty-one M. tuberculosis, we aimed to summarise the diversity and potential role of rpoB, rpoA, and rpoC mutations in the high MDR incidence setting of Khyber Pakhtunkhwa, to inform diagnostics tools for TB control and implementation. The analysis revealed some mixed MDR infections and heteroresistance, reflecting the complex dynamics of the high transmission setting in Pakistan, and reinforcing the need for robust diagnostic tools.

Materials and methods

A total of 4822 pulmonary and extrapulmonary samples positive for M. tuberculosis were processed between 2016 to 2019 by the provincial Reference Laboratory for TB at Hayatabad Medical Complex Peshawar. Samples were received from all the Programmatic drug management units in the province. All samples were processed using the N-acetyl-l-cysteine–sodium hydroxide (NALC–NaOH) concentration method, using Lowenstein–Jensen medium (LJ) solid culture and liquid MGIT tubes containing 7H9 media9. All samples were subject to phenotypic drug susceptibility testing (DST), performed through the automated BACTEC MGIT 960 system (BD Diagnostic Systems, NJ, USA)10. The DST for resistant isolates was repeated for confirmation of drug resistance profile. Resistance to isoniazid, rifampicin, and other drugs was assessed through the BACTEC MGIT 960 system, with established WHO critical concentrations11. As per WHO guidelines for MGIT DST, there is only a single concentration (1.0 umg/mL) for rifampicin. Fifty-one samples (4 pan-susceptible, 47 drug-resistant, including 44 MDR-TB) were selected for whole genome sequencing (WGS). The source of genomic DNA was MGIT Culture (Liquid Culture) and the same culture generation was used for the phenotypic and genotypic (WGS) characterisation. The study was approved by the Institutional Ethics Committee of KUST Kohat and Provincial Tuberculosis Reference Laboratory, Khyber Pakhtunkhwa (Ref. no.: PTP/PTRL-408/19). All study participants gave informed consent for the collection and use of their biological materials, and all methods were performed in accordance with the relevant guidelines and regulations.

Genomic DNA for the fifty-one isolates was extracted using the acetyl trimethylammonium bromide (CTAB) method12. The DNA underwent WGS at the London School of Hygiene and Tropical Medicine (LSHTM) using the Illumina MiSeq platform (QIAseq FX DNA Library Kit) with a 151 bp paired-end protocol (v3). The quality of the resulting raw sequence data (FASTQ files) was assessed using the FastQC tool (v0.11.8). Reads were trimmed using trimmomatic software (v0.39, LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36). The filtered reads were aligned to the H37Rv genome (Accession No. AL123456.3) using Burrow-Wheeler Aligner (BWA)-mem (v0.6) software13. The GATK tool (HaplotypeCaller v4.1.4.1, -ERC GVCF)14 was used to call single nucleotide polymorphisms (SNPs) and insertions and deletions (indels), and the variants were annotated for function (e.g. non-synonymous mutations, frameshifts). The M. tuberculosis isolates were profiled for lineage and genotypic drug resistance using TB-Profiler software (v3.0.0), which requires a minimum of 10 reads to call variants by default15. Mixed infections were defined as samples with more than one lineage, predicted by TB-Profiler, and supported by > 200 biallelic genomic positions. Mutations were assigned to the strains by comparing their frequency in read coverage to the predicted frequency of the strains in each mixture based on genome-wide variants. All results were compared to a mutational database consisting of > 35 k M. tuberculosis spanning all lineages and drug resistance types16. The effects of mutations on RpoA, RpoB, and RpoC protein structures were investigated using the DynaMut server17. In particular, the effects on protein stability (changes in vibrational entropy, ΔΔG: kcal/mol) and flexibility (ΔΔSVib: kcal/mol-1.K-1) were assessed using models downloaded from the PDB (ID: 6c04). Negative predicted stability changes are destabilising, whilst positive changes are stabilising. Consensus genomes were constructed for all isolates using bcftools (v1.9)18. A multiple sequence alignment was constructed using the consensus genomes, and phylogenetic reconstruction was performed using iqtree software (v1.6.12, -m GTR + G)19.

Results

High quality data was generated for fifty-one isolates with a median of 5,186,316 read pairs per isolate. Trimming removed low-quality reads leading to the retention of on average 70.9% of pairs with both reads and 97.2% with at least one read. Classification of the reads using Kraken software reported high percentages classified as Mycobacterium (median: 99.7%). Mapping of the reads led to an average depth of 216-fold coverage, and 99% of the genome with at least tenfold coverage. Of the fifty-one M. tuberculosis isolates, the majority were from lineage 3 (30, 58.8%), and were sourced from female patients (n = 29, 56.9%) and previously treated cases (n = 30, 58.8%) (Table 1). We identified four mixed infections, consisting of different sub-lineages (4.8 and 2.2.1, 4.9 and 3, 3 and 2.2.1, 3 and 4.6.2.1) (Table 1; Supplementary Table S1; Fig. 1), potentially reflecting the high transmission setting. By comparing the allelic frequencies at each mutation position with the predicted frequency of the strains within a mixture, we assigned mutations to the individual strains wherever possible. The phenotypic and genotypic drug resistance profiles were concordant across all samples at the dominant strain level, but the genomic data revealed the complexity of the four mixed lineage infections. In particular, one sample contained a combination of MDR-TB and isoniazid-resistant strains, whilst the other three isolate samples were combinations of MDR-TB with MDR-TB (n = 2) or pan susceptible (n = 1) strains (Supplementary Table S1).

Table 1 Patient demographic and genotypic data of the 51 M. tuberculosis isolates.
Figure 1
figure 1

Phylogenetic tree of M. tuberculosis strains (n = 51), with their lineages, multi-drug resistance profiles and specific mutations in rpoB, rpoA, and rpoC. The phylogenetic tree was created using a maximum likelihood approach. The first vertical band to the right of the tree denotes the lineage. The second vertical band denotes the drug resistance phenotype. The squares show the presence of a specific mutation.

Isolates with isoniazid resistance (n = 45) presented with underlying haplotypes katG S315T (31/45), fabG1 − 15C > T + katG S315T (6/45), fabG1 − 15C > T (3/45), fabG1 − 15C > T + ahpC − 54C > T (3/45), fabG1 − 15C > T + katG D311G (1/45), and katG V1A (1/45) (Supplementary Table S1). The RR-TB and MDR-TB isolates (n = 46) all had established rifampicin related mutations: rpoB S450L (41/46), R552 (3/46), H445D/L/R (3/46), S874Y (3/46), and S450W/F (2/46) (Table 2). We found two potential new mutations in rpoB (K891Q, P471R) (Table 2), but they appeared on the same background as established resistance markers (rpoB H445L and S450L) (Table 3; Supplementary Table S1). Of the RR-TB and MDR-TB samples (n = 46), 34 (73.9%) had at least one potential compensatory mutation, the majority in rpoC (32/46) and a minority in rpoA (5/46), including three isolates with mutations in both genes. The most common rpoC mutations were V483G (10), E1092D (9), I491T (7), V483A (4), P1040S (3), and K891Q (2), and when observed previously, have been present in predominantly RR-TB strains (Table 2), thereby reflecting their potential compensatory role. The rpoA mutations included D253Y (2/46), V183G (2/46), and D190E (1/46), all appearing to be previously unreported (Table 2), but only V183G was present on the background without rpoC mutations (Table 3). Interestingly, the only potentially unreported rpoC mutation (V517E) was in a mixed infection (MDR-TB/isoniazid-resistant) with rpoB S450W and rpoA D190E present (Table 3, Supplementary Table S1).

Table 2 Non-synonymous mutations in rifampicin candidate genes of M. tuberculosis.
Table 3 Combinations of rpoB, rpoC, and rpoA mutations present in the 51 M. tuberculosis isolates.

We assessed in silico the impact of eight low frequency mutations (rpoB P471R, R552L, S874Y, K891Q; rpoA V183G, D190E; rpoC G332, V517E) on the stability and flexibility of appropriate protein structures (RpoA, RpoB, RpoC) (Table 4; Figs. 2, 3, 4). These mutations all have low frequency in our study (1–3 isolates) and in the 35 k dataset (< 30 isolates). Note, the impact of the potentially unreported rpoA D253Y mutation was not assessed, as residues of D253 for RpoA could not be modelled because of the eight mutations considered, all but two (rpoB R552L; rpoA D190E) had a potentially stabilizing effect. The rpoC G332R mutation had the highest stability and lowest flexibility effects (Table 4) and is found exclusively in RR-TB strains (Table 2). The rpoA D190E led to the least stable and highest flexibility effects. This mutation was present on the uncommon rpoB S450W and rpoC V517E mutation background, where each mutation is found almost exclusively in RR-TB strains (Table 2).

Table 4 Protein structure stability and flexibility effects for low frequency mutations.
Figure 2
figure 2

Effect of the rpoC G332R and V517E point mutations on RpoC protein dynamics. ΔΔG; Free energy difference. ΔΔSVib ENCoM; vibrational entropy energy. The effects have been predicted through the DynaMut online server. The increasing (red) and decreasing (blue) molecular flexibility effects of point mutations have been depicted. The total energy calculated for mutations shows a stabilizing effect on protein structure.

Figure 3
figure 3

Effects of R552L, S874Y, and K891Q mutations on the RpoB protein structure. ΔΔG; Free energy difference. ΔΔSVib ENCoM; vibrational entropy energy. The increasing (red) and decreasing (blue) molecular flexibility effects of point mutations have been depicted.

Figure 4
figure 4

Location of residues in RpoA, RpoB, RpoC proteins.

Discussion

Resistance to rifampicin (RR-TB) has been associated with mutations in an 81 bp hot-spot region of the rpoB gene. However, mutations in rpoA and rpoC, have also been linked to RR-TB, particularly in a compensatory role20. We identified 24 mutations across rpoA, rpoB and rpoC genes in 51 M. tuberculosis (46 RR-TB) isolates, including six mutations [rpoA (2), rpoB (3), and rpoC (1)], which were not found in a large collection (~ 35 k) of global clinical isolates. All RR-TB could be explained by known mutations, including the rpoB S450L (41/46 isolates), which is the most prevalent mutation globally (~ 64%)20 and in Pakistan (~ 74%21; ~ 71%22). Four rpoB mutations (R552L, S874Y, K891Q, P471R) appear not to have been reported by recent genetic studies from Khyber Pakhtunkhwa21,22,23. Some recently reported rpoB mutations in Pakistan (L430P, H445P/Y, L452P21; S450Q, P454H, G455D22; D435G21,22, D435Y21,23), which are of much lower frequency than the S450L mutation and sometimes reported in combination with it, were not present in our RR-TB isolates. Similarly, all isoniazid resistance could be explained by known mutations, including the katG S315T (37/45), which has a high frequency globally (~ 75%) and in Pakistan (> 89%21,23). The majority of RR-TB/MDR-TB isolates had mutations in rpoC (32/46), which is a locus known to restore the fitness of rifampicin-resistant bacteria carrying mutations in rpoB24. Other studies have found a lower prevalence of rpoC mutations in MDR isolates (< 30%)6,25,26, including a study in Pakistan26, and inconsistencies may be due to differences in sampling sites and strategies. Eight mutations in rpoC (G332R, N416S, V483A, V517E/L, G594E, P1040S, E1092D) appear not to have been reported in recent genetic studies from Khyber Pakhtunkhwa21,22,23. The evidence for a prominent compensatory role of rpoA in RR-TB is less strong. Our study identified only three rpoA mutations, of which two appear not to have been reported before, and were in the presence of rpoC mutations. In general, potential rpoA and rpoC compensatory mutations and mechanisms need to be validated experimentally in fitness studies, or through large-scale studies involving phylogenetic analysis of longitudinally collected strains.

Phenotypic DST is considered to be the gold standard method to infer resistance. However, some studies have reported that liquid culture systems (e.g. the MGIT platform) often fail to detect low-levels of RR-TB27,28, including those involving H445D/L/R mutations27,29, which have been linked to treatment failure or relapse30,31,32. The role of such mutations can be investigated using protein structural analysis33, and previous studies have shown that mutations may affect the structural dynamics of proteins, making them weak targets, resulting in drug resistance4,34,35. Our in silico analysis revealed that the rpoC G332R mutation, which is rare (< 0.05%) globally and exclusively in RR-TB isolates, led to high protein stability and less flexibility, which may be favorable features for compensatory effects. This mutation has been previously reported in a South African setting5, where it was observed to originate through homoplastic evolution and thus likely to confer a selective advantage. Several of the other low-frequency mutations in the three genes have been reported previously (rpoB R552L36, rpoB S874Y37 and rpoA V183G6) in different settings and have been proposed as potential compensatory mutations. For other low-frequency mutations including rpoB K891Q, rpoB P471R, rpoC V517E, rpoA D190E and rpoA D253Y, we did not find any previous reports in the literature. These mutations fall outside the rifampicin resistance determining regions and co-occur with known resistance mutations, potentially to compensate for the fitness impact of the resistance mutations.

A potential limitation of the phenotypic DST exposed by our genomic analysis is its robustness to mixed infections and heteroresistance. Only the WHO recommended concentration of the rifampicin drug was assessed, and dominant rifampicin resistance mutants within MDR-TB mixes can complicate the phenotypic interpretation of any low-level rifampicin resistance mutants. In our study, five instances of within-host resistance mechanism diversity were observed as the result of mixed infections. Three instances involved mixed-lineage infections, two of which were a mixture of MDR-TB strains and one was a mixture of a pan-susceptible and an MDR-TB strain. Interestingly, the MDR-TB mixtures contained different resistance-conferring mutations in rpoB, katG and the inhA promoter region, as well as likely compensatory mutations in rpoC and potentially rpoA. These mutations were found in the same proportions within the raw data as lineage variants, indicating the different strains carry resistance markers. Heteroresistance was also observed in isolates with a single lineage. This observation could be indicative of acquired resistance, where mutations have not become fixed in the within-host bacterial population. Almost all single-lineage RR-TB isolates contained the rpoB S450L at a completely fixed level in the sequenced population. Some of these strains possessed additional mutations in rpoB, rpoC and rpoA at unfixed proportions (e.g. rpoB R552L, rpoB S874Y, rpoC V483G and rpoA D190E). Based on this observation it is likely that these mutations were acquired after the rpoB S450L, which confers high-level resistance and serve as compensatory mutations to restore the fitness penalty of the resistance mutation.

The application of sequencing technologies in TB endemic regions will assist clinical management and personalised anti-TB drug approaches, as well as disease control through surveillance activities. The management of MDR-TB strains is essential for the control of TB. In our study, we found that the mutations underlying MDR-TB are established variants. Known and putative drug resistance markers in established loci are detectable using low-cost sequencing-based approaches (e.g. amplicon-based), suitable for a low resource setting. These methods are also robust to mixed infections and heteroresistance35,38, which were present in our study and likely driven by the high transmission setting. Large-scale approaches using whole genome sequencing and analysis will assist with understanding the epidemiology and risk groups or factors underpinning the transmission of TB, as well as provide insights into drug resistance and compensatory mechanisms. Such insights will inform the deployment of anti-TB drug regimens and disease control tools and strategies in high burden settings, such as Pakistan.

Conclusion

Whole genome sequencing for TB clinical management and disease control will have the greatest benefit in complex community outbreaks in endemic regions. Our study in the Khyber Pakhtunkhwa province in Pakistan has provided a characterization of circulating rifampicin resistance and MDR-TB mutations. Such insights will assist sequencing-informed diagnosis for proactive TB patient management, and the deployment of anti-TB drug regimens and surveillance activities.