Characterisation of drug-resistant Mycobacterium tuberculosis mutations and transmission in Pakistan

Tuberculosis, caused by Mycobacterium tuberculosis, is a high-burden disease in Pakistan, with multi-drug (MDR) and extensive-drug (XDR) resistance, complicating infection control. Whole genome sequencing (WGS) of M. tuberculosis is being used to infer lineages (strain-types), drug resistance mutations, and transmission patterns—all informing infection control and clinical decision making. Here we analyse WGS data on 535 M. tuberculosis isolates sourced across Pakistan between years 2003 and 2020, to understand the circulating strain-types and mutations related to 12 anti-TB drugs, as well as identify transmission clusters. Most isolates belonged to lineage 3 (n = 397; 74.2%) strain-types, and were MDR (n = 328; 61.3%) and (pre-)XDR (n = 113; 21.1%). By inferring close genomic relatedness between isolates (< 10-SNPs difference), there was evidence of M. tuberculosis transmission, with 55 clusters formed consisting of a total of 169 isolates. Three clusters consist of M. tuberculosis that are similar to isolates found outside of Pakistan. A genome-wide association analysis comparing ‘transmitted’ and ‘non-transmitted’ isolate groups, revealed the nusG gene as most significantly associated with a potential transmissible phenotype (P = 5.8 × 10–10). Overall, our study provides important insights into M. tuberculosis genetic diversity and transmission in Pakistan, including providing information on circulating drug resistance mutations for monitoring activities and clinical decision making.

www.nature.com/scientificreports/ the definition as the underlying cases were treated within that framework. There were ~ 25,000 cases of MDR-/ RR-TB in 2019 1 . The National TB control program aims to reduce by half the prevalence of TB in the general population by 2025, but to achieve this will require the scaling-up of TB detection and clinical care, as well as improved systems for inferring disease transmission, thereby facilitating further targeted interventions. Whole genome sequencing (WGS) is revolutionizing our understanding of drug resistance and clinical management, as well as transmission patterns, thereby assisting disease control 4 . M. tuberculosis drug resistance is linked to genomic variants in drug targets or pro-drug activators, including single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels), some occurring in gene-gene interactions. It is therefore possible to predict resistance genotypically for 19 anti-TB drugs and their groups (e.g. floroquinolines) using curated libraries of > 1000 mutations across > 30 loci 5,6 , thereby personalizing treatment. Genotypic predictions are an alternative to bacterial culture-based phenotypic drug susceptibility testing (DST), which can be timeconsuming and resource intensive, with reproducibility and inhibitory concentration cut-off challenges for particular drugs 5 . Further, WGS data infers the population structure within the M. tuberculosis complex, which is phylo-geographical in nature, with strains falling within distinct (sub-)lineages 7 , and potential transmission chains identified through isolates with (near-)identical genomic variation 8 . The identification of highly virulent strain-types or lineages, drug resistance, and transmission clusters will assist the targeting of limited resources for TB control.
There have been recent studies using WGS to characterize M. tuberculosis genetic diversity in isolates sourced from Pakistan, where the predominant strains are from the Central Asian (CAS) family, set within lineage 3 2,[9][10][11][12][13] . A recent study of TB endemic province of Khyber Pakhtunkhwa (North West Pakistan) found that known mutations in rpoB (e.g. S405L), katG (e.g. S315T), or inhA promoter loci explain the majority of MDR-TB, but there was evidence of complex mixed infections and heteroresistance, which may reflect the high transmission nature of the setting 13 . An earlier study in the same province found similar MDR-TB mutations, but also additional variants in genes conferring resistance to other first and second-line drugs, including in pncA (pyrazinamide), embB (ethambutol), gyrA (fluoroquinolones), rrs (aminoglycosides), rpsL, rrs and gid (streptomycin) loci. Further, acquisition of rifampicin resistance often preceded isoniazid in these isolates, and a high proportion (~ 18%) of pre-MDR isolates had fluoroquinolone resistance markers, being a class of antibiotics that is widely available and used 2 . Eighteen M. tuberculosis isolates clustered within eight networks, thereby providing evidence of drugresistant TB transmission in the Khyber Pakhtunkhwa province 2 . An investigation of XDR-TB isolates sourced across four provinces in Pakistan found similar genes linked to drug resistance as in Khyber Pakhtunkhwa 11 , and an increased frequency and expression of novel SNP mutations in efflux pump genes, potentially explaining some drug resistance 11 .
Here, we analyse 535 M. tuberculosis samples with WGS data, collected between years 2003 and 2020, with phenotypic testing of resistance across 12 drugs (rifampicin, isoniazid, ethambutol, pyrazinamide, streptomycin, ofloxacin, moxifloxacin, amikacin, kanamycin, capreomycin, ciprofloxacin, ethionamide). By identifying ~ 38 k SNPs, and inferring genotypic drug resistance across 19 anti-TB drugs (as well as fluoroquinolone and aminoglycoside classes), we sought to understand the phylogeny of M. tuberculosis in the largest Pakistan dataset, identify transmission events, and infer commonly circulating mutations linked to drug resistance. The genetic insights were validated in a large M. tuberculosis collection (n = 34 k) with WGS and drug susceptibility test data 7 .
After sequence data alignment, high average coverage was observed across the samples (median 76-fold, range 30-2027 fold). Across the isolates, a total of 37,970 genome-wide SNPs were identified, with 23,741 (62.5%) found in single samples. A phylogenetic tree constructed using the 37,970 genome-wide SNPs revealed the expected clustering by lineage ( Fig. 1; S1 Figure).

Evidence of transmission.
The median (range) pairwise SNP differences across the 535 isolates was 390 (minimum 0, maximum 1811), with a multi-modal distribution, where modes represent differences within and between lineages (S2 Figure). At a threshold of 10 SNPs, 55 clusters formed consisting of a total of 169 isolates, www.nature.com/scientificreports/ where the median number of isolates in each cluster was 2 (range: 2-22) (S2 Figure). By reducing the cut-off to 5 SNPs, there were only 6 less clusters (total 49) consisting of a total of 33 isolates (overall 136 isolates) (S4 Table).  Figure). Comparing the 169 "transmitted" isolates in clusters to the others ("non-transmitted"; n = 366), there were overall differences in lineage (Chi-Square, P < 6 × 10 -8 ) and drug resistance (Chi-square P < 5 × 10 -15 ). Specifically, there was marginally weak evidence of an increased risk of transmission in lineage 2 (odds ratio (OR) = 3.00, P = 0.054) and lineage 4 (OR = 2.49, P = 0.073), compared to lineage 1. Signals of increased risk of transmission were stronger among those pre-XDR/XDR (OR = 5.79, www.nature.com/scientificreports/ P < 5 × 10 -14 ), compared to a less resistant status. There was no association between transmission risk and province (Chi-Square P = 0.64), but there were high levels of missing location data (S5 Table).
A genome-wide association study (GWAS) approach was applied to detect loci potentially linked to transmissibility. It revealed nusG, Rv2307B, wag31, proX and murA genes to be the most associated with being in a transmission cluster (P < 10 -5 ) (S6 Table). Rv2307 (beta = 0.745, P = 1.5 × 10 -8 ) putatively codes for a glycine rich protein, while proX (beta = 0.706, P = 1.3 × 10 -6 ) encodes osmoprotectant binding lipoprotein ProX. There were six mutations found in each of these genes, although no clear pattern relating to either phylogenetic or transmission status could be discerned, with mutations found in both transmission and non-transmission samples, as well as many samples having more than one of these mutations. The nusG (beta = 0.791, P = 5.8 × 10 -10 ) encoded protein participates in transcription elongation, termination and anti-termination. There are five key mutations (S206G, E186A, R124L, A161V, F232C). By locating their position on a phylogenetic tree, only R124L was supported by isolates in more than one clade (S5 Figure). The wag31 gene (beta = 0.912, P = 3 × 10 -7 ) codes for a cell wall synthesis protein, but only one mutation (G67S) was associated with a single small transmission clade (n = 5) (S5 Figure). The murA gene codes for a peptidoglycan biosynthesis pathway, and had five mutations (E226K, R247L, D318A, H394Y, E414K), but none were found in more than one clade and only two mutations overlapped with transmission samples (H394Y, E226K) (S5 Figure).
For rifampicin resistance, we identified three inframe indels in rpoB (1291_1292insGCC, 1294_1296del and 1309_1311del) in three isolates. For isoniazid, several nonsense mutations in katG were found, with 3 mutations leading to premature stop codons (W438*, W204*, Q36*) and a frameshift mutation (587_588insGGT). For ethambutol resistance, variants in the embA promoter region (−42CAT > C, −27TA > T-16C > A, −8C > A) and embB www.nature.com/scientificreports/ were observed. For pyrazinamide resistance, several potentially new mutations were found in pncA, including three inframe indels (511_512insTCG CCG , 392_393insGGT and 451_462del), a premature stop codon (S18*), and SNPs in both the coding region (Val180Ala) and the promoter (−7 T > G). For streptomycin resistance, several mutations were found in gid including a premature stop codon (G71*), a frameshift (102_102del), and SNPs (A119D, A82P and D67G). These SNPs were found in the 34 k global dataset, and likely acquired as the result of homoplasy. The gid A119D mutation was present in 15 isolates (ten different sublineages), of which two had DSTs that reported resistance. The gid A82P mutation was present in three isolates from two different sublineages, but no DST data was available for these samples. The gid D67G was present in 38 global isolates from five different sublineages. Of these, seven isolates had DST data available with four presenting with resistance. For second line injectables, the rrs 878g > a mutation (seen previously 2 ) was observed in four lineage 3 strains with three independent homoplastic acquisitions, indicating it is unlikely to be strain-specifc. Mutations in rrs are generally clustered in two regions with the most common mutations involved with streptomycin resistance being located around position 514 and those involved with resistance to amikacin, kanamycin and capreomycin located around 1401. The rrs 878g > a falls between the two mutation hotspots, and of the three strains which had DST data (amikacin and kanamycin) in this study, two were resistant to both amikacin and kanamycin and the other was sensitive to both. For fluoroquinolones, the gyrA A288D mutation was found in three lineage 3 isolates and was acquired in each sample independently. One isolate tested resistant to ciprofloxacin with no known resistance mutation found in the gyrA and gyrB genes.

Discussion
The use of whole genome sequencing as a diagnostic is gaining traction in low resource and high TB burden settings, where it has the potential to have greater public health impact 5,7,15 . Portable sequencing platforms and multiplexing of M. tuberculosis isolates are making the application of WGS, both timely and cost effective 5 . Our findings in the largest analysis of isolates from Pakistan to date revealed that lineage 2 and 4 strains, which are pre-XDR and XDR-TB, are potentially being transmitted in the country. Evidence of increased transmission among lineages 2 and 4 is consistent with previous characterisations of these clades as more transmissible 7 , and therefore their strain-types should be monitored more closely despite greater prevalence of lineage 3. It is surprising that pre-XDR and XDR-TB samples were found to be clustered more than expected compared to MDR-TB isolates given the usual fitness cost of drug resistance. This observation suggests that compensatory mutations ought to be investigated in future work. Similarly, the finding that mutations in nusG, Rv2307B, wag31, proX and murA genes maybe associated with transmission should be followed-up experimentally, where those with variants appearing in more than one clade could be priortised. Advances in the characterisation of transmission events 16 , GWAS 9,17 and machine learning methods 18,19 could enhance the ability to detect mutations linked to transmissiblility. However, host factors and host-pathogen genetic interactions are also likely to be important. More broadly, the routine collection, processing and WGS of M. tuberculosis DNA across Pakistan will provide robust insights into mutations underlying drug resistance and geo-temporal dynamics.
Whilst our study uses a convenience sample that is not necessarily representative of the proportions of MDR-TB in the wider Pakistan population, it is enriched by the presence of many mutations that lead to drug resistance. The enrichment of drug resistant isolates from endemic TB regions with high transmission will reveal important resistance mutations, including potential novel variants. To investigate the underlying mechanisms Table 3. Putative novel drug resistant mutations. *Based on absence in the curated TB-Profiler mutation list; bolded, if not observed in a large TB Global dataset (34 k 7 ); underlined, if with multiple levels of evidence for drug resistance (see S8 Table). www.nature.com/scientificreports/ of drug resistance, we compared susceptibility profiles from phenotypic methods and genotypic prediction. This analysis led to the identification of a number of potential new drug resistance mutations, including in genes causing resistance to rifampicin, isoniazid, ethambutol and pyrazinamide. Three inframe deletions were found in the rifampicin resistance determining region of rpoB. Inframe deletions have not been widely reported as a major mechanism of resistance to rifampicin and it is surprising to see a relatively high number of these mutations in our dataset. Previously unreported nonsense mutations were also found in the katG gene, a locus responsible for resistance to isoniazid. A novel nonsense mutation, frameshift and inframe indels were found in the pncA gene, which codes for the activator of pyrazinamide. Mutations in the promoter region of the pncA gene lead to changes in the expression of PncA and resistance 20 . The identified −7 T > G promoter mutation is thus likely to cause resistance. However the functional effects of SNPs found in the coding region of pncA are more difficult to predict 20 . The pncA V180A mutation has been reported previously to be associated with pyrazinamide resistance 20 . For streptomycin, we observed several point mutations and a premature stop codon in the gid gene. The gid D67G mutation was found in 38 isolates in the 34 k global dataset 7 , of which 57% of those were phenotypically resistant to streptomycin. The incomplete penetrance of the streptomycin-associated gid D67G mutation could be explained by the relative low-level resistance conferred by mutations in gid, which could be below established critical cut-offs of minimum inhibitory concentration for susceptibility phenotyping, but still elevated with respect to wild-type. Overall, our work reinforces that the adoption of WGS platforms as a diagnostic tool, combined with mutational databases of drug resistance markers, will inform clinical decision making. The ability to perform WGS for genomic investigations across time and geography will improve the understanding of transmission dynamics, and inform control programmes to reduce disease burden. The benefits will be greatest in high prevalence TB settings, typically low and middle income countries, such as Pakistan. Although WGS is not currently at a viable level of affordability, it is anticipated that amplicon and whole genome approaches using (portable) next generation platforms will shortly become simple, affordable and accessible rapid diagnostics compared to traditional laboratory-based methods that currently require specialist training, equipment and long culture times. Importantly, there is evidence that WGS is more detailed and accurate in its profiling of drug resistance than traditional DST, thereby likely to improve treatment and mortality outcomes in drug-resistant TB in highburden countries 21 .
A cut-off of 10 SNPs difference was established to define transmission clades, and label samples as "transmitted" or "non-transmitted". A sensitivity analysis was performed to assess the impact of changing the cut-off. Linear mixed models were used perform a GWAS of transmissibility using SNPs, location, drug resistance and adjusting for M. tuberculosis (sub-)lineage and outbreak-based population structure, being implemented in GEMMA (v.1.1.2) (http:// www. xzlab. org/ softw are. html). We report association p-values less than a Bonferroni cut-off based on testing 4,000 genes (P < 1.25 × 10 -5 ). To identify if samples involved in transmission clades (> 10 samples) were similar to others (< 20 SNPs) in the global dataset (n = 34 k) 7 , we constructed phylogenetic trees using FastTree for the relevant sub-lineages (1.1.2, 2.2.1, 3, 3.1.2, 4.5, 4.9). The likelihoods of ancestral locations were inferred with the ape (v5.0) and phytools packages in R. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.