Introduction

Tuberculosis disease (TB), caused by bacteria in the Mycobacterium tuberculosis complex, is a major global public health problem. Pakistan is a high-burden TB country, being one of eight countries accounting for two-thirds of the estimated 10 million people globally who fell ill with the disease1. In 2019, Pakistan had ~ 570,000 TB cases (incidence rate 263 per 100,000) and 43,900 deaths1, but disease control is being compromised by increasing HIV prevalence and drug resistance. The country has a high burden of rifampicin-resistant TB (RR-TB), as well as multidrug-resistant TB (MDR-TB), which involves additional resistance to isoniazid. Pre-extensive drug resistance (pre-XDR-TB) is prevalent1,2, involving M. tuberculosis that are MDR-TB and resistant to any fluoroquinolone or at least one of the three second-line injectable drugs (capreomycin, kanamycin, amikacin). XDR-TB requires MDR-TB with additional resistance to both a fluoroquinolone and a second-line injectable. In January 2021, WHO updated these definitions of XDR-TB to include other drugs, such as bedaquiline3. Here, we adopt the older version of the definition as the underlying cases were treated within that framework. In 2019, there were ~ 25,000 cases of MDR-/RR-TB1. The National TB control program aims to halve the prevalence of TB in the general population by 2025, but achieving this will require the scaling-up of TB detection and clinical care, as well as improved systems for inferring disease transmission, thereby facilitating further targeted interventions.
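The pre-2021 resistance categories described above can be written as a small decision rule. The following Python sketch is illustrative only: the function name, the category labels, and the choice of which drugs count as fluoroquinolones (those named later in this article) are assumptions, not part of the study's pipeline.

```python
# Illustrative sketch of the pre-2021 resistance categories described in the text.
# The drug sets and labels below are assumptions made for this example.
FLUOROQUINOLONES = {"ofloxacin", "moxifloxacin", "ciprofloxacin"}
SECOND_LINE_INJECTABLES = {"capreomycin", "kanamycin", "amikacin"}


def classify(resistant_to):
    """Return a resistance category given the drugs an isolate is resistant to."""
    drugs = {d.lower() for d in resistant_to}
    rifampicin = "rifampicin" in drugs
    isoniazid = "isoniazid" in drugs
    fluoroquinolone = bool(drugs & FLUOROQUINOLONES)
    injectable = bool(drugs & SECOND_LINE_INJECTABLES)

    if rifampicin and isoniazid:
        if fluoroquinolone and injectable:
            return "XDR-TB"      # MDR plus a fluoroquinolone AND an injectable
        if fluoroquinolone or injectable:
            return "pre-XDR-TB"  # MDR plus one of the two drug groups
        return "MDR-TB"          # rifampicin and isoniazid only
    if rifampicin:
        return "RR-TB"           # rifampicin resistance without isoniazid
    return "not RR/MDR"


print(classify(["rifampicin", "isoniazid", "ofloxacin"]))
```

Under these assumptions, an isolate resistant to rifampicin, isoniazid and ofloxacin is labelled pre-XDR-TB, and adding resistance to amikacin moves it to XDR-TB.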

Whole genome sequencing (WGS) is revolutionizing our understanding of drug resistance and clinical management, as well as transmission patterns, thereby assisting disease control4. M. tuberculosis drug resistance is linked to genomic variants in drug targets or pro-drug activators, including single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels), some occurring in gene–gene interactions. It is therefore possible to predict resistance genotypically for 19 anti-TB drugs and their groups (e.g. fluoroquinolones) using curated libraries of > 1000 mutations across > 30 loci5,6, thereby personalizing treatment. Genotypic predictions are an alternative to bacterial culture-based phenotypic drug susceptibility testing (DST), which can be time-consuming and resource intensive, with reproducibility and inhibitory concentration cut-off challenges for particular drugs5. Further, WGS data can be used to infer the population structure within the M. tuberculosis complex, which is phylo-geographical in nature, with strains falling within distinct (sub-)lineages7, and to identify potential transmission chains through isolates with (near-)identical genomic variation8. The identification of highly virulent strain-types or lineages, drug resistance, and transmission clusters will assist the targeting of limited resources for TB control.

There have been recent studies using WGS to characterize M. tuberculosis genetic diversity in isolates sourced from Pakistan, where the predominant strains, set within lineage 3, are from the Central Asian (CAS) family2,9,10,11,12,13. A recent study of the TB-endemic province of Khyber Pakhtunkhwa (North West Pakistan) found that known mutations in rpoB (e.g. S450L), katG (e.g. S315T), or inhA promoter loci explain the majority of MDR-TB, but there was evidence of complex mixed infections and heteroresistance, which may reflect the high transmission nature of the setting13. An earlier study in the same province found similar MDR-TB mutations, but also additional variants in genes conferring resistance to other first- and second-line drugs, including in pncA (pyrazinamide), embB (ethambutol), gyrA (fluoroquinolones), rrs (aminoglycosides), and rpsL, rrs and gid (streptomycin) loci. Further, acquisition of rifampicin resistance often preceded isoniazid in these isolates, and a high proportion (~ 18%) of pre-MDR isolates had resistance markers for fluoroquinolones, a class of antibiotics that is widely available and used2. Eighteen M. tuberculosis isolates clustered within eight networks, thereby providing evidence of drug-resistant TB transmission in the Khyber Pakhtunkhwa province2. An investigation of XDR-TB isolates sourced across four provinces in Pakistan found similar genes linked to drug resistance as in Khyber Pakhtunkhwa11, and an increased frequency and expression of novel SNP mutations in efflux pump genes, potentially explaining some drug resistance11.

Here, we analyse 535 M. tuberculosis samples with WGS data, collected between years 2003 and 2020, with phenotypic testing of resistance across 12 drugs (rifampicin, isoniazid, ethambutol, pyrazinamide, streptomycin, ofloxacin, moxifloxacin, amikacin, kanamycin, capreomycin, ciprofloxacin, ethionamide). By identifying ~ 38 k SNPs, and inferring genotypic drug resistance across 19 anti-TB drugs (as well as fluoroquinolone and aminoglycoside classes), we sought to understand the phylogeny of M. tuberculosis in the largest Pakistan dataset, identify transmission events, and infer commonly circulating mutations linked to drug resistance. The genetic insights were validated in a large M. tuberculosis collection (n = 34 k) with WGS and drug susceptibility test data7.

Results

Isolates and whole genome sequencing data

A total of 535 M. tuberculosis isolates sourced between years 2003 and 2020 from Pakistan with publicly available WGS and phenotypic susceptibility testing were analysed2,9,10,11,12,13. These isolates covered at least four provinces (Balochistan, Khyber Pakhtunkhwa, Punjab, Sindh), but a high proportion of locations were missing (69.5%), all from one study12 (Table 1). The majority of samples were from lineage 3 (L3 397, 74.2%; CAS strains), but the other main lineages were represented (L4 80, 15.0%, including LAM, T and X strains; L2 36, 6.7%, including Beijing; L1 22, 4.1%) (Table 1; S1 Table).

Table 1 Mycobacterium tuberculosis samples (N = 535).

As expected, phenotypic drug susceptibility testing (DST) was performed most often for the first-line drugs rifampicin (n = 487, 91.0%), isoniazid (n = 487, 91.0%), ethambutol (n = 479, 89.5%), and pyrazinamide (n = 444, 83.0%) (S2 Table). A total of 432 samples (80.7%) were phenotypically resistant to at least one drug (median 3, maximum 10). The number of potential errors in the phenotypic testing appeared modest (218/2430 tests, 9.0%), where established genotypic resistance markers were present in isolates with DST results that implied drug susceptibility. The discordance appeared for nine drugs, but more than half occurred in two drugs (ethambutol 96; pyrazinamide 42) (S2 Table). The majority of isolates were genotypically assessed as MDR-TB (328, 61.3%), with smaller proportions of (pre-)XDR (113, 21.1%) and pan-sensitive (60, 11.2%) (Table 1). There were 31 pre-MDR isolates, and overall there was a high prevalence of rifampicin (460, 86.0%) and isoniazid (435, 81.3%) resistance-associated mutations. Resistance to other drugs was also detected, including ethambutol (385, 72.0%), pyrazinamide (258, 48.2%), streptomycin (238, 44.5%), ethionamide (102, 19.1%), any fluoroquinolone (277, 51.8%) or aminoglycoside (75, 14.0%). Very few isolates appeared resistant to bedaquiline, clofazimine and cycloserine (n < 3; Table 1). Across all lineages, the majority of isolates (> 75%) were at least MDR-TB (S3 Table).

After sequence data alignment, high average coverage was observed across the samples (median 76-fold, range 30–2027-fold). Across the isolates, a total of 37,970 genome-wide SNPs were identified, with 23,741 (62.5%) found in single samples. A phylogenetic tree constructed using the 37,970 genome-wide SNPs revealed the expected clustering by lineage (Fig. 1; S1 Figure).

Figure 1

A phylogenetic tree for the 535 M. tuberculosis isolates constructed using 37,970 SNPs. The surrounding rings of data for each isolate include: lineage (inner), drug resistance status, location, and transmission status (outer).

Evidence of transmission

The median pairwise SNP difference across the 535 isolates was 390 (minimum 0, maximum 1811), with a multi-modal distribution, where modes represent differences within and between lineages (S2 Figure). At a threshold of 10 SNPs, 55 clusters formed consisting of a total of 169 isolates, where the median number of isolates in each cluster was 2 (range 2–22) (S2 Figure). Reducing the cut-off to 5 SNPs gave only 6 fewer clusters (total 49) and 33 fewer clustered isolates (overall 136 isolates) (S4 Table). The 169 transmitted isolates (SNP cut-off 10) were found in three of the four provinces recorded (Khyber Pakhtunkhwa 71/169; Punjab 9/169; Sindh 9/169), identified across all lineages (L1 7/169, L2 21/169, L3 98/169, L4 43/169) and in (pre-)XDR (75/169) samples (S3 Figure; S4 Figure). Most clusters had samples with the same drug resistance phenotype (44/55), and there was some evidence of clusters consisting of more than one location (35/55, excluding missing locations) (S3 Figure; S4 Figure). Comparing the 169 "transmitted" isolates in clusters to the others ("non-transmitted"; n = 366), there were overall differences in lineage (Chi-square P < 6 × 10⁻⁸) and drug resistance (Chi-square P < 5 × 10⁻¹⁵). Specifically, there was marginal evidence of an increased risk of transmission in lineage 2 (odds ratio (OR) = 3.00, P = 0.054) and lineage 4 (OR = 2.49, P = 0.073), compared to lineage 1. Signals of increased risk of transmission were stronger among pre-XDR/XDR isolates (OR = 5.79, P < 5 × 10⁻¹⁴), compared to those with a less resistant status. There was no association between transmission risk and province (Chi-square P = 0.64), but there were high levels of missing location data (S5 Table).
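As a worked check of the lineage comparison, the sketch below rebuilds the clustered versus non-clustered table from the per-lineage counts reported in this section and computes unadjusted odds ratios and a Pearson chi-square test. It uses only the Python standard library; the variable and function names are illustrative, and the closed-form tail probability applies only to 3 degrees of freedom. The exact model behind the reported P values is not stated here, so this is a plausibility check rather than a reproduction of the analysis.

```python
import math

# Counts reported in the text: total isolates and clustered ("transmitted") isolates per lineage.
totals = {"L1": 22, "L2": 36, "L3": 397, "L4": 80}
clustered = {"L1": 7, "L2": 21, "L3": 98, "L4": 43}


def odds_ratio(lineage, reference="L1"):
    """Unadjusted odds ratio of being clustered, relative to the reference lineage."""
    def odds(name):
        return clustered[name] / (totals[name] - clustered[name])
    return odds(lineage) / odds(reference)


def chi_square_statistic():
    """Pearson chi-square statistic for the 2 x 4 table (clustered status by lineage)."""
    n = sum(totals.values())
    n_clustered = sum(clustered.values())
    statistic = 0.0
    for name, lineage_total in totals.items():
        cells = (
            (clustered[name], n_clustered),
            (lineage_total - clustered[name], n - n_clustered),
        )
        for observed, row_total in cells:
            expected = lineage_total * row_total / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def chi_square_tail_df3(x):
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat = chi_square_statistic()
p_value = chi_square_tail_df3(stat)
print(round(odds_ratio("L2"), 2), round(odds_ratio("L4"), 2))
print(round(stat, 1), p_value)
```

With these counts the unadjusted odds ratios round to 3.00 (lineage 2) and 2.49 (lineage 4), matching the reported point estimates, and the chi-square statistic is roughly 39 on 3 degrees of freedom, consistent with the reported P < 6 × 10⁻⁸.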

A genome-wide association study (GWAS) approach was applied to detect loci potentially linked to transmissibility. It revealed nusG, Rv2307B, wag31, proX and murA genes to be the most associated with being in a transmission cluster (P < 10⁻⁵) (S6 Table). Rv2307B (beta = 0.745, P = 1.5 × 10⁻⁸) putatively codes for a glycine rich protein, while proX (beta = 0.706, P = 1.3 × 10⁻⁶) encodes osmoprotectant binding lipoprotein ProX. There were six mutations found in each of these genes, although no clear pattern relating to either phylogenetic or transmission status could be discerned, with mutations found in both transmission and non-transmission samples, as well as many samples having more than one of these mutations. The protein encoded by nusG (beta = 0.791, P = 5.8 × 10⁻¹⁰) participates in transcription elongation, termination and anti-termination. There are five key mutations (S206G, E186A, R124L, A161V, F232C). By locating their position on a phylogenetic tree, only R124L was supported by isolates in more than one clade (S5 Figure). The wag31 gene (beta = 0.912, P = 3 × 10⁻⁷) codes for a cell wall synthesis protein, but only one mutation (G67S) was associated with a single small transmission clade (n = 5) (S5 Figure). The murA gene codes for an enzyme in the peptidoglycan biosynthesis pathway, and had five mutations (E226K, R247L, D318A, H394Y, E414K), but none were found in more than one clade and only two mutations overlapped with transmission samples (H394Y, E226K) (S5 Figure).

The transmission clusters involved six main sub-lineages (1.1.2, 2.2.1, 3, 3.1.2, 4.5, 4.9), and we looked for similar isolates in other populations within the global 34 k dataset. Using a more relaxed cut-off of 20 SNPs difference to allow for greater time between transmission events, three of the sub-lineages (3, 2.2.1, 4.5) revealed similar isolates collected from other countries (Fig. 2). Lineage 2.2.1 had 19 Pakistan isolates linked to 29 global samples, mostly from countries in Europe and Central Asia. Lineage 3 had 8 Pakistan isolates linked to 5 other samples from the UK, while sub-lineage 4.5 had two Pakistan samples linked to a single isolate from the UK.

Figure 2

Phylogenetic trees for sub-lineages involving Pakistan samples and closely-related global isolates from previously published datasets. (a) Sub-lineage 2.2.1 (19 Pakistan, 25 other). (b) Lineage 3 (8 Pakistan, 4 UK).

Drug resistance mutations

The common mutations underlying genotypic drug resistance were in known loci. These included mutations in rpoB (D435GFYV 293/460, S450LFWY 308/460) linked to rifampicin, katG (S315NIT 374/416) and fabG1 (−15C > T 52/416) linked to isoniazid, embB (G406ASDC 51/385, M306ILV 280/385, Q497RKP 40/385) linked to ethambutol, gyrA (A90V 68/277, S91P 22/277, D94GAHYN 195/277) linked to fluoroquinolones, and pncA (118 low frequency < 25/258) linked to pyrazinamide (Table 2). A high proportion of mutations detected were present in the global 34 k dataset, including pncA 93/118, katG 19/38, rpoB 37/39, and embB 21/21. Nearly half of all mutations identified (156/313) were present in single isolates, of which the majority were in the 34 k dataset (101/156) and absent from sensitive strains (S7 Table).

Table 2 Number of samples with known drug resistance-associated mutations.

We investigated isolates that had a DST implying resistance, but no established genetic mutations to explain this phenotype. There were 82 isolates (100/2430 tests; S2 Table) with this discordance across 9 drugs (amikacin (9), capreomycin (2), ciprofloxacin (4), ethambutol (17), isoniazid (25), kanamycin (7), pyrazinamide (24), rifampicin (6), streptomycin (6)). We identified 68 distinct genetic markers in candidate genes to potentially explain the discordance (Table 3). Twenty-nine (42.6%) mutations had strong evidence of being linked with drug resistance, including from functional consequences, homoplasy or global data information7,14. Forty-six (67.6%) mutations were present in the global 34 k dataset, and all of these were absent in sensitive strains (S8 Table), reinforcing them as putatively resistance-related.
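The two kinds of genotype versus phenotype discordance discussed in the Results can be made explicit with a short sketch. The function and category labels below are hypothetical names chosen for illustration; the counts are those reported in the text.

```python
def discordance_type(dst_resistant, known_marker_present):
    """Label one DST result against the genotypic prediction (labels are illustrative)."""
    if dst_resistant and not known_marker_present:
        return "resistant phenotype without known marker"  # candidates for novel mutations
    if known_marker_present and not dst_resistant:
        return "known marker with susceptible phenotype"   # potential phenotypic errors
    return "concordant"


# Per-drug numbers of tests with a resistant DST but no established marker (from the text).
unexplained_resistance = {
    "amikacin": 9, "capreomycin": 2, "ciprofloxacin": 4, "ethambutol": 17,
    "isoniazid": 25, "kanamycin": 7, "pyrazinamide": 24, "rifampicin": 6,
    "streptomycin": 6,
}

TOTAL_TESTS = 2430            # total DST results reported
MARKER_BUT_SUSCEPTIBLE = 218  # tests with a known marker but a susceptible DST

unexplained_total = sum(unexplained_resistance.values())
print(unexplained_total, round(100 * unexplained_total / TOTAL_TESTS, 1))
print(round(100 * MARKER_BUT_SUSCEPTIBLE / TOTAL_TESTS, 1))
```

The per-drug counts sum to the 100 tests quoted above, and 218 of 2430 tests rounds to the 9.0% reported for the opposite direction of discordance.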

Table 3 Putative novel drug resistant mutations.

For rifampicin resistance, we identified three inframe indels in rpoB (1291_1292insGCC, 1294_1296del and 1309_1311del) in three isolates. For isoniazid, several nonsense mutations in katG were found, with 3 mutations leading to premature stop codons (W438*, W204*, Q36*) and a frameshift mutation (587_588insGGT). For ethambutol resistance, variants in the embA promoter region (−42CAT > C, −27TA > T, −16C > A, −8C > A) and embB were observed. For pyrazinamide resistance, several potentially new mutations were found in pncA, including three inframe indels (511_512insTCGCCG, 392_393insGGT and 451_462del), a premature stop codon (S18*), and SNPs in both the coding region (V180A) and the promoter (−7 T > G). For streptomycin resistance, several mutations were found in gid including a premature stop codon (G71*), a frameshift (102_102del), and SNPs (A119D, A82P and D67G). These SNPs were found in the 34 k global dataset, and were likely acquired as the result of homoplasy. The gid A119D mutation was present in 15 isolates (ten different sub-lineages), of which two had DSTs that reported resistance. The gid A82P mutation was present in three isolates from two different sub-lineages, but no DST data were available for these samples. The gid D67G mutation was present in 38 global isolates from five different sub-lineages. Of these, seven isolates had DST data available, with four presenting with resistance.

For second-line injectables, the rrs 878g > a mutation (seen previously2) was observed in four lineage 3 strains with three independent homoplastic acquisitions, indicating it is unlikely to be strain-specific. Mutations in rrs are generally clustered in two regions, with the most common mutations involved in streptomycin resistance located around position 514 and those involved in resistance to amikacin, kanamycin and capreomycin located around position 1401. The rrs 878g > a mutation falls between the two mutation hotspots, and of the three strains which had DST data (amikacin and kanamycin) in this study, two were resistant to both amikacin and kanamycin and the other was sensitive to both. For fluoroquinolones, the gyrA A288D mutation was found in three lineage 3 isolates and was acquired in each sample independently. One isolate tested resistant to ciprofloxacin with no known resistance mutation found in the gyrA and gyrB genes.

Discussion

The use of whole genome sequencing as a diagnostic is gaining traction in low resource and high TB burden settings, where it has the potential to have greater public health impact5,7,15. Portable sequencing platforms and multiplexing of M. tuberculosis isolates are making the application of WGS both timely and cost effective5. Our findings in the largest analysis of isolates from Pakistan to date revealed that lineage 2 and 4 strains, which are pre-XDR and XDR-TB, are potentially being transmitted in the country. Evidence of increased transmission among lineages 2 and 4 is consistent with previous characterisations of these clades as more transmissible7, and therefore their strain-types should be monitored more closely despite the greater prevalence of lineage 3. It is surprising that pre-XDR and XDR-TB samples were found to be clustered more than expected compared to MDR-TB isolates, given the usual fitness cost of drug resistance. This observation suggests that compensatory mutations ought to be investigated in future work. Similarly, the finding that mutations in nusG, Rv2307B, wag31, proX and murA genes may be associated with transmission should be followed up experimentally, where those with variants appearing in more than one clade could be prioritised. Advances in the characterisation of transmission events16, GWAS9,17 and machine learning methods18,19 could enhance the ability to detect mutations linked to transmissibility. However, host factors and host–pathogen genetic interactions are also likely to be important. More broadly, the routine collection, processing and WGS of M. tuberculosis DNA across Pakistan will provide robust insights into mutations underlying drug resistance and geo-temporal dynamics.

Whilst our study uses a convenience sample that is not necessarily representative of the proportions of MDR-TB in the wider Pakistan population, it is enriched by the presence of many mutations that lead to drug resistance. The enrichment of drug resistant isolates from endemic TB regions with high transmission will reveal important resistance mutations, including potential novel variants. To investigate the underlying mechanisms of drug resistance, we compared susceptibility profiles from phenotypic methods and genotypic prediction. This analysis led to the identification of a number of potential new drug resistance mutations, including in genes causing resistance to rifampicin, isoniazid, ethambutol and pyrazinamide. Three inframe indels were found in the rifampicin resistance determining region of rpoB. Inframe indels have not been widely reported as a major mechanism of resistance to rifampicin and it is surprising to see a relatively high number of these mutations in our dataset. Previously unreported nonsense mutations were also found in the katG gene, a locus responsible for resistance to isoniazid. A novel nonsense mutation, frameshift and inframe indels were found in the pncA gene, which codes for the activator of pyrazinamide. Mutations in the promoter region of the pncA gene lead to changes in the expression of PncA and resistance20. The identified −7 T > G promoter mutation is thus likely to cause resistance. However, the functional effects of SNPs found in the coding region of pncA are more difficult to predict20. The pncA V180A mutation has been reported previously to be associated with pyrazinamide resistance20. For streptomycin, we observed several point mutations and a premature stop codon in the gid gene. The gid D67G mutation was found in 38 isolates in the 34 k global dataset7, and 57% (4/7) of those with DST data were phenotypically resistant to streptomycin. The incomplete penetrance of the streptomycin-associated gid D67G mutation could be explained by the relatively low-level resistance conferred by mutations in gid, which could be below established critical cut-offs of minimum inhibitory concentration for susceptibility phenotyping, but still elevated with respect to wild-type.

Overall, our work reinforces that the adoption of WGS platforms as a diagnostic tool, combined with mutational databases of drug resistance markers, will inform clinical decision making. The ability to perform WGS for genomic investigations across time and geography will improve the understanding of transmission dynamics, and inform control programmes to reduce disease burden. The benefits will be greatest in high prevalence TB settings, typically low and middle income countries, such as Pakistan. Although WGS is not currently at a viable level of affordability, it is anticipated that amplicon and whole genome approaches using (portable) next generation platforms will shortly become simple, affordable and accessible rapid diagnostics compared to traditional laboratory-based methods that currently require specialist training, equipment and long culture times. Importantly, there is evidence that WGS is more detailed and accurate in its profiling of drug resistance than traditional DST, thereby likely to improve treatment and mortality outcomes in drug-resistant TB in high-burden countries21.

Methods

Sequence data and processing

WGS data were sourced across six studies2,9,10,11,12,13 (ENA accessions: PRJEB7798, PRJEB10385, PRJEB25972, PRJEB32684, PRJEB43284), where each contributing isolate belongs to a single patient. Phenotypic DSTs were conducted using WHO endorsed methods, as specified in descriptions of the original studies2,9,10,11,12,13. Raw reads were trimmed to remove low-quality sequences in Trimmomatic (v0.39)22, and aligned to the H37Rv reference genome (AL123456) with BWA mem (v0.7.17)23. SNPs and indels called by samtools software24 were processed using gatk GenotypeGVCFs (v4.1.3.0) (gatk.broadinstitute.org). Monomorphic SNPs and variants in non-unique regions of the genome (e.g. pe/ppe genes) were excluded. A multi-FASTA format file was created from the filtered SNP file and H37Rv reference fasta using bedtools makewindows (v2.28.0)25. This multiple alignment was used to construct a phylogenetic tree with IQ-TREE (v1.6.12), involving a general time reversible model with rate heterogeneity set to a discrete Gamma model and an ascertainment bias correction (parameters -m GTR+G+ASC), with 1000 bootstrap samples26. Pairwise distance matrices were calculated in Plink software (v1.90b4)27. Drug resistance and lineages were predicted in silico from raw sequence data using TB-Profiler (v2.4)5. The Pakistan analysis results were compared to a global collection of 34 k M. tuberculosis with WGS and DST data7.
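To illustrate what a pairwise SNP distance matrix contains, the sketch below counts differing sites between aligned sequences in plain Python. It is a stand-in for the Plink step described above, not the command used in the study; the toy sequences, the function names, and the decision to skip sites with missing or gap calls are assumptions made for this example.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, missing="N-"):
    """Count sites where two aligned sequences differ, ignoring missing or gap calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in missing and b not in missing
    )


def pairwise_distances(alignment):
    """Return {(name_1, name_2): distance} for every pair of isolates in the alignment."""
    return {
        (first, second): snp_distance(alignment[first], alignment[second])
        for first, second in combinations(sorted(alignment), 2)
    }


# Toy alignment of variable sites only (hypothetical isolates).
toy_alignment = {
    "iso_A": "ACGTACGTAC",
    "iso_B": "ACGTACGTAT",
    "iso_C": "TCGAACNTAT",
}
distances = pairwise_distances(toy_alignment)
print(distances)
```

In this toy example iso_A and iso_B differ at one site, while iso_C differs from them at three and two sites respectively once the missing call is skipped.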

A cut-off of 10 SNPs difference was established to define transmission clades, and label samples as “transmitted” or “non-transmitted”. A sensitivity analysis was performed to assess the impact of changing the cut-off. Linear mixed models were used to perform a GWAS of transmissibility using SNPs, location and drug resistance, adjusting for M. tuberculosis (sub-)lineage and outbreak-based population structure, implemented in GEMMA (v.1.1.2) (http://www.xzlab.org/software.html). We report association p-values less than a Bonferroni cut-off based on testing 4,000 genes (P < 1.25 × 10⁻⁵). To identify if samples involved in transmission clades (> 10 samples) were similar to others (< 20 SNPs) in the global dataset (n = 34 k)7, we constructed phylogenetic trees using FastTree for the relevant sub-lineages (1.1.2, 2.2.1, 3, 3.1.2, 4.5, 4.9). The likelihoods of ancestral locations were inferred with the ape (v5.0) and phytools packages in R.
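One common way to turn a SNP cut-off into clusters is single-linkage grouping, where isolates are joined whenever a pair lies within the threshold. The sketch below shows that idea with a small union-find structure. The linkage rule, the inclusive comparison (distance at or below the cut-off), the function name and the toy distances are all assumptions for illustration; the text does not specify these details.

```python
def transmission_clusters(isolates, distances, cutoff=10):
    """Group isolates by single linkage: join any pair with distance <= cutoff.

    Returns only groups of two or more isolates; singletons are "non-transmitted".
    """
    parent = {name: name for name in isolates}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (first, second), distance in distances.items():
        if distance <= cutoff:
            parent[find(first)] = find(second)

    groups = {}
    for name in isolates:
        groups.setdefault(find(name), set()).add(name)
    return [members for members in groups.values() if len(members) >= 2]


# Hypothetical pairwise SNP distances between five isolates.
toy_isolates = ["A", "B", "C", "D", "E"]
toy_distances = {
    ("A", "B"): 3, ("B", "C"): 8, ("A", "C"): 11,
    ("D", "E"): 25, ("A", "D"): 400,
}

clusters_10 = transmission_clusters(toy_isolates, toy_distances, cutoff=10)
clusters_5 = transmission_clusters(toy_isolates, toy_distances, cutoff=5)
print(sorted(sorted(c) for c in clusters_10))
print(sorted(sorted(c) for c in clusters_5))
```

With a cut-off of 10, isolates A, B and C form one cluster even though A and C are 11 SNPs apart, because each is linked through B; tightening the cut-off to 5 leaves only A and B clustered. Rerunning with different cut-offs is the kind of sensitivity analysis described above.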