Introduction

In 2017, an estimated 558,000 tuberculosis (TB) cases with multidrug resistance (MDR) or rifampicin (RMP) resistance occurred globally1. Resistance to anti-TB drugs increases the burden of TB because the treatment of drug-resistant TB is generally prolonged and costly, while the outcome is relatively poor2,3. Isoniazid (INH) resistance without concurrent RMP resistance accounts for 7.1% of the new TB cases in the world1. INH-resistant strains serve as a reservoir for further combined drug resistance4,5; empirical treatment after rapid genetic assessment of RMP resistance alone increases the future risk of developing MDR-TB or extensively drug-resistant (XDR)-TB6,7. INH-containing preventive therapy is also heavily affected when the INH-resistance rate is high in a target population.

Extensive studies have been performed to identify the genetic mutations responsible for Mycobacterium tuberculosis (Mtb) drug resistance, and comprehensive lists of drug-resistance-conferring mutations have been provided8,9. However, unidentified genetic variants, including compensatory mutations, may also assist or facilitate the transmission of drug-resistant TB without reducing the fitness of the bacilli10. Investigation of this mechanism would contribute to the effective control of drug-resistant TB, and the accumulation of molecular epidemiological data in many areas around the world would deepen the understanding of the dynamics.

Vietnam is one of the 30 TB high-burden countries, with approximately 124,000 incident cases reported in 20171. According to the national drug resistance survey conducted in 2011, among the new cases, resistance to any anti-TB drug accounted for 32.7% of the cases, and the proportion of INH-resistant TB reached 18.9%11. In our previous study cohort of 489 newly diagnosed patients in a city area, INH resistance was observed in 28.2% of the patients12, whereas RMP resistance was observed in only 4.9% of the patients. The predominant genetic mutations that confer drug resistance in Vietnam, katG-S315T for INH and rpoB-S450L for RMP, are similar to those reported in other Asian countries13,14,15. However, pathogen factors correlated with the transmission of INH-resistant TB have not been fully investigated.

Whole genome sequencing (WGS) has been recently used globally, offering new opportunities in the management of drug-resistant TB, since it can provide a huge amount of information, including genetic variants that are relevant to drug resistance throughout the genome16. WGS data also offer critical insights into the dynamics of TB endemics, transmission route, and the evolutionary patterns of genomic mutations10,17,18,19. Recently, bacterial genome-wide association studies (GWAS) controlling for population structure have also been performed for identifying the genes or genetic variants relevant to the TB phenotype, including drug resistance, by analyzing all single-nucleotide polymorphisms, small and large insertions/deletions (indels), or k-mers obtained from massive short-read data from next-generation sequencers (NGS)20,21.

In this study, we investigated drug resistance-conferring mutations carried by the clinical Mtb isolates from patients newly diagnosed with smear-positive pulmonary TB in Hanoi, Vietnam, by using WGS with a bacterial GWAS approach incorporating linear mixed models (LMMs). We then identified the genetic variants that may be relevant to the success in an extensive spread of INH-resistant strains, in reference to a previously published cohort study in KwaZulu-Natal, South Africa, as the second panel19.

Results

Prevalence of known drug resistance-conferring mutations

Among the 332 Hanoi samples analyzed with WGS, known mutations, which have been registered in the TBProfiler’s mutation database9,22, accounted for 80 (90.9%) of the 88 isolates with phenotypic INH resistance, 12 (92.3%) of the 13 isolates with RMP resistance, 63 (73.3%) of the 86 isolates with streptomycin (SM) resistance, 5 (100.0%) of the 5 isolates with ethambutol resistance, and 5 (55.6%) of the 9 isolates with pyrazinamide (PZA) resistance (Table 1). The most prevalent drug-resistance-conferring mutations were katG-S315T (26.2%) to INH, rpsL-K43R (13.3%) to SM, and rpoB-S450L (2.7%) to RMP. Mutations conferring resistance to the second-line drugs were rrs-A514C (4.5%) to amikacin and those in the fabG1-promoter (2.7%) to ethionamide. Others were rare mutations such as ethR-F110L to ethionamide, and gyrA-A90V and gyrA-D94G to fluoroquinolone (0.6%, 0.3%, and 0.6%, respectively). Among the 87 strains harboring katG-S315T mutations, 54 (62.1%) had at least one other drug-resistance-conferring mutation. Co-occurrence of katG-S315T and rpsL-K43R with or without other known mutations was observed most frequently (31 of the 87 isolates; 35.6%).

Table 1 Frequencies of known mutations conferring resistance to first-line drugs among all isolates, phenotypically resistant isolates, and susceptible isolates in the Hanoi sample set (N = 332).

Other than SNVs, screening of zero or low-depth areas revealed a 353-bp deletion in the pncA region in one isolate, which also covered the pncA promoter and a part of the neighboring Rv2044c. One sample had an in-frame 6-bp deletion (M434-D435) in the rpoB gene; this strain was not resistant to RMP. Eight isolates harboring a 1-bp deletion in the gid gene and two other isolates carrying a 1-bp deletion in the ethA gene were also found (Supplementary Fig. S1).

Mtb lineages/sublineages, drug-resistance-conferring mutations, and genetic clusters

The lineage (L)2 East Asian—mostly the Beijing strains—possessed the mutations that conferred drug resistance most frequently (52.6%); L1 (Indo-Oceanic; 23.2%) carried less and then L4 (Euro-American; 18.3%) the least (P < 0.0001) in Hanoi. The proportion of the above drug resistance was the highest in ancient Beijing strains (59.8%), followed by modern Beijing strains (40.9%), and then by non-Beijing strains (20.8%) (P < 0.0001).

Figure 1 shows the distribution of the strains harboring mutations of interest. Of note, katG-S315T, known as a major INH-resistance mutation in the world23, was reported with a frequency of 26.2% in Hanoi, which accounted for 85.3% of the primary INH-resistance, and it was distributed unevenly among the strains like scattered islands. This S315T mutation, regardless of whether it occurred alone or in combination with other mutations, was more frequently observed in the L2 and L1 strains than in L4 (P < 0.0001) and also more in ancient Beijing than in modern Beijing and non-Beijing strains (59.0%, 40.9%, and 19.4%, respectively, P < 0.001). katG-S315T was often carried by ancient Beijing strains, whereas clustered strains defined by pairwise SNV differences of <6 alone were observed more frequently among modern Beijing strains than among ancient and non-Beijing strains (24.2%, 18.9%, and 9.7%, respectively, P = 0.016).

Figure 1 Distribution of drug resistance-relevant mutations and their presence in correlation with Mtb lineages and clustering. Clustered: defined by a pairwise difference of <6 SNVs. Different colors indicate different mutations.

Framework of bacterial GWAS analysis

To investigate the pathogen factors involved in the wide spread of INH-resistant strains in Hanoi, we used the combination of a representative INH resistance-conferring mutation, katG-S315T (S315T[+]), and membership in a genetic cluster defined by <6 SNV differences (cluster[+]) as the “phenotype” or objective variable for GWAS. katG-S315T alone and SNV-based genetic clustering alone served as phenotypes for comparison. Initially, 31-bp k-mers throughout the genome were set as the “genotype” or explanatory variables, because the presence or absence of such k-mers provides clues for identifying SNVs, indels, or even structural variants without a standard reference sequence for mapping reads20. When candidate genes were obtained from the k-mer analysis, individual variants, including SNVs and indels, were further analyzed for confirmation using other variant-based platforms.
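The combined phenotype described here can be illustrated with a minimal Python sketch. It assumes that an isolate counts as clustered when it lies within five SNVs of at least one other isolate, and that each isolate is represented by a string of alleles at the same variable sites; the function names and the toy data are hypothetical and are not taken from the study's pipeline.

```python
from itertools import combinations


def snv_distance(alleles_a, alleles_b):
    """Count the positions at which two equal-length allele strings differ."""
    return sum(a != b for a, b in zip(alleles_a, alleles_b))


def clustered_isolates(alleles, max_diff=5):
    """Return isolates lying within max_diff SNVs of at least one other isolate."""
    clustered = set()
    for (id_a, seq_a), (id_b, seq_b) in combinations(alleles.items(), 2):
        if snv_distance(seq_a, seq_b) <= max_diff:
            clustered.update((id_a, id_b))
    return clustered


def combined_phenotype(alleles, has_s315t, max_diff=5):
    """Code cluster[+]/S315T[+] as 1 and every other isolate as 0."""
    clustered = clustered_isolates(alleles, max_diff)
    return {iso: int(iso in clustered and has_s315t[iso]) for iso in alleles}


# Toy data (hypothetical): alleles at ten variable sites for four isolates.
alleles = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",  # 1 SNV from A
    "C": "TGCATGCATG",  # 10 SNVs from A, so not clustered
    "D": "ACGAACGTAC",  # 1 SNV from A
}
has_s315t = {"A": True, "B": False, "C": True, "D": True}
phenotype = combined_phenotype(alleles, has_s315t)
```

In this toy set, only isolates A and D are both clustered and katG-S315T-positive. The all-pairs comparison is quadratic in the number of isolates, which is manageable for a few hundred genomes.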

Association between phenotypes and 31-bp k-mers throughout the genome

When k-mer GWAS was performed for our Hanoi samples, controlling for the population structure using LMM, the numbers of genes harboring significant k-mers associated with cluster[+] alone and S315T[+] alone were 7 and 48, respectively, whereas the combined phenotype, cluster[+]/S315T[+], yielded 403 statistically significant genes (Table 2). For comparison, 337 samples from the South African KwaZulu-Natal study were analyzed in the same way for the combined phenotype (Table 2), and 14 genes were eventually extracted in common (Fig. 2a). Of these, eight (Rv1148c, mmpL6, PE_PGRS21, PE_PGRS53, pks12, rpoB, Rv2090, and wag22) were excluded from further analysis because the presence or absence of k-mers appeared in only one sample of either the case or the control group. PE_PGRS10 was further excluded since the presence of k-mers was not confirmed by BLAST search of read sequences from both Hanoi and KwaZulu-Natal samples.

Table 2 Genes harboring k-mers significantly associated with different phenotypes (from (1) to (6)) in Hanoi and KwaZulu-Natal studies.

Figure 2 Venn diagram showing the number of genes significantly associated with the given phenotypes and shared by the two study cohorts. (a) results from k-mer-based GWAS; (b) results from phyOverlap; cluster[+]: clustered (pairwise SNV difference between two isolates is no more than five SNVs); S315T[+]: harboring katG-S315T mutation; Hanoi: Hanoi’s study cohort; KwaZulu-Natal: KwaZulu-Natal’s study cohort.

The remaining five genes, namely, PPE18, gid, emrB, Rv1588c, and pncA, were confirmed by BLAST search and were aligned to the H37Rv reference sequence (Supplementary Fig. S2) and nominated as real candidates. In the Hanoi study population, significant k-mers included 36 k-mers from the PPE18 gene (all had the same P value = 1.840E-09), 33 k-mers from gid (the best P value = 5.213E-08), 31 from emrB (8.579E-08), 4 from Rv1588c (8.579E-08), and 31 from the pncA gene (4.437E-07) (Supplementary Table S1), when 6.208E-06 was applied as the threshold of statistical significance after Bonferroni correction. For the KwaZulu-Natal study population, within the same set of genes, the best P values obtained from 5, 101, 31, 12, and 29 k-mers were 2.998E-08, 2.998E-08, 7.802E-07, 2.453E-06, and 4.802E-11, respectively, with the threshold of significance after Bonferroni correction as 4.591E-06 (Supplementary Table S1). Positive associations with the phenotype were mainly observed in L2 of Hanoi’s samples and L4 of KwaZulu-Natal’s samples (Fig. 3).
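As a rough check on the thresholds quoted above, the Bonferroni correction divides a family-wise significance level by the number of tests. The sketch below assumes a level of 0.05, which the text does not state, and back-calculates the number of tests implied by each reported threshold. The function names are illustrative, and what exactly was counted as a test (for example, distinct k-mer presence/absence patterns) is not specified in this section.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def implied_number_of_tests(alpha, threshold):
    """Invert the correction: how many tests a reported threshold implies."""
    return alpha / threshold


ALPHA = 0.05  # assumption: the family-wise level is not given in the text

hanoi_threshold = 6.208e-06  # reported for the Hanoi samples
kzn_threshold = 4.591e-06    # reported for the KwaZulu-Natal samples

hanoi_tests = implied_number_of_tests(ALPHA, hanoi_threshold)
kzn_tests = implied_number_of_tests(ALPHA, kzn_threshold)

# Best k-mer P values reported for the five candidate genes in Hanoi.
hanoi_best_p = {
    "PPE18": 1.840e-09,
    "gid": 5.213e-08,
    "emrB": 8.579e-08,
    "Rv1588c": 8.579e-08,
    "pncA": 4.437e-07,
}
all_significant = all(p < hanoi_threshold for p in hanoi_best_p.values())
```

Under the 0.05 assumption, the two thresholds correspond to roughly 8,000 and 11,000 tests, respectively, and all five reported Hanoi P values fall below the Hanoi threshold.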

Figure 3 Distribution of k-mers derived from variant (or wild) types of five genes (PPE18/19, gid, emrB, Rv1588c, and pncA) that showed positive or negative associations with clustered strains carrying katG-S315T in the phylogenetic trees of the Hanoi (a) and KwaZulu-Natal (b) study populations.

Next, we attempted to confirm significant k-mers by using a DBGWAS approach to identify all the relevant sequences as “unitigs” (Table 3). Consequently, one significant unitig indicating merged k-mers was identified in PPE18, three were identified in gid, two in emrB, and two in Rv1588c (the best q value in each gene was 5.562E-07, 3.676E-05, 1.704E-05, and 1.704E-05, respectively) for the Hanoi samples (Supplementary Table S2, Fig. S3). For the KwaZulu-Natal samples, one significant unitig was found in PPE18, seven in gid, two in emrB, and one in pncA (the best q value in each gene was 4.856E-10, 7.943E-09, 7.943E-09, and 7.639E-12, respectively) (Supplementary Table S2).

Table 3 Genes with k-mers significantly associated with clustered strains harboring katG-S315T mutation (cluster[+]/S315T[+]) in both Hanoi and KwaZulu-Natal study cohorts, further investigated by using the DBGWAS platform and variant/deletion-based search.

All k-mers annotated with PPE18 in both study sets were mapped to either PPE18, 19, or 60. Nucleotide sequences within these three PPE genes were hardly distinguishable from each other (Supplementary Fig. S4); BWA-MEM mapping or simple BLAST search to the reference could not specify the exact PPE gene that each 31-bp k-mer belonged to. Nevertheless, the analysis of de novo assembled contigs demonstrated that the Hanoi variant k-mers were derived from PPE18 and KwaZulu-Natal’s k-mers initially annotated with PPE18 were derived from PPE19. Such a high degree of sequence similarity was also observed in significant unitigs identified in both study sets with the DBGWAS approach. Therefore, these k-mers and unitigs were designated as PPE18/19 in our study (Table 3).

Individual variant analysis using a genome-wide approach

When mapping the aforementioned significant k-mers of Hanoi samples on the H37Rv genome, we identified variants corresponding to the significant k-mers. Next, we attempted to confirm the variants by SNV- and small indel-based GWAS or BLAST search, and finally identified mutations encoding E99A and A101T in PPE18, E173* (Glu173Stop) in gid, F508S in emrB, P34P in Rv1588c, and Q141P in pncA. They were significantly associated with cluster[+]/S315T[+] (P = 1.941E-09 for both PPE18 SNVs, and P = 2.991E-06, 1.507E-08, 1.511E-08, and 4.919E-07 for the gid, emrB, Rv1588c, and pncA SNVs, respectively) (Table 3). The PROVEAN web server24 and SIFT_4G tool25 predicted gid-E173* and emrB-F508S to be deleterious or to affect protein function.

For the KwaZulu-Natal samples, mapping results showed SNVs in emrB (I461I) and pncA (T153fs) (variant-based GWAS P values = 1.077E-06 and 6.596E-11, respectively). Initially, the SNV in the PPE18/19 was not clearly mapped to the reference for the aforementioned reason. After the in-depth search for de novo assembled contigs, Q286R in PPE19 was identified. Large deletions were found in gid in 101 of the 337 samples (from 120 to 675 bp). Rv1588c did not show any SNV in the corresponding area of the k-mers (Table 3).

SNVs and small indels from VCF files, significantly associated with the cluster[+]/S315T[+] of the Hanoi cohort obtained by GWAS, included 329 SNVs and indels other than the six shown above (Supplementary Table S3) (the best P = 1.754E-13, Supplementary Table S4). None of them were lineage-specific SNVs that have been reported elsewhere22. After further excluding variants in ambiguous PE and PPE genes, synonymous SNVs, a mutation causing S315T, and those with frequency no more than two in case or control samples, 144 SNVs and small indels were extracted in the samples from Hanoi. Of these, 77 SNVs correlated well with the principal component (PC)-7 (Supplementary Table S4), which corresponded to a phylogenetic branch of the ancient Beijing strains (Supplementary Fig. S5), with a high percentage (92.9%) of INH resistance.

We further analyzed the structural variants by detecting zero or low-depth areas after mapping reads to H37Rv (AL123456) and to four complete genomes of Hanoi isolates (AP018033 to AP018036) as references. We found two groups of insertions/deletions associated with this phenotype. The first group included a 1-bp deletion in Rv1043c, a 52-bp deletion in Rv2286c, a 3-bp insertion in Rv0790c, a 1-bp deletion in Rv3230c, and a large deletion in Rv2025c; these correlated well with PC-9 (Table 4), which corresponded to the same branch of ancient Beijing strains shown in the SNV analysis. The second group included a 238-bp deletion in accD2, a 2-bp deletion in eis, a 459-bp insertion in Rv3077, and a 12-bp deletion in Rv2690c; these correlated with PC-12 (Table 4), corresponding to another branch of modern Beijing strains.

Table 4 Genes with deletions/insertions significantly associated with clustered strains harboring katG-S315T mutation among Hanoi samples obtained by GWAS, and corresponding k-mers.

Convergence-based phyOverlap analysis

Seeking variants caused by convergent evolution is an alternative approach for detecting mutations supporting drug resistance, such as compensatory mutations26. We therefore also tried to identify phenotype-associated variants caused by convergent evolution. As expected, rpoB and pncA were also associated with katG-S315T mutation alone. However, no genes or intergenic regions were significantly associated with the presence of cluster[+]/S315T[+] in both the KwaZulu-Natal and our Hanoi panels (Table 5, Supplementary Table S5, Fig. 2b). One PPE gene (PPE47), two PE genes (PE_PGRS55 and PE_PGRS20), and one intergenic region (PE_PGRS3-PE_PGRS4) were extracted in the Hanoi samples only (Table 5).

Table 5 Genes detected by the phyOverlap method and their significant associations with the Hanoi strains harboring the katG-S315T mutation, with and without clustering, and loci shared with the KwaZulu-Natal study population.

PPE46 and PPE47 share large portions of identical nucleotide sequences. By BLAST search for variants in the de novo assembled contigs in addition to genome-wide screening of zero or low-depth areas (<15% of average depth) in the reference genome, we found that 37 (4 in L1, 19 in L2, and 14 in L4) strains had large deletions in PPE47 (Supplementary Fig. S6) and all had fusion with PPE46, resulting in PPE46-like chimeric genes in 35 isolates and PPE47-like chimeric genes in 2 isolates (Supplementary Fig. S7). SNVs identified from 3379708 to 3379763 of AL123456.3 (H37Rv) in the PPE47 region were the main reason for the significant association in the phylogenetic convergence test. In other candidates from phyOverlap, specific variants were not validated, presumably owing to the difficulties in mapping short reads followed by ambiguous base calling within PE_PGRS genes.

Analyses using logistic regression models adjusted for host confounders in Hanoi samples

By multivariate analyses using conventional logistic regression models after adjustment for patients’ gender, age, living area, as well as Mtb lineages, all k-mers from PPE18/19, emrB, and a part of k-mers from gid showed positive associations with clustered strains carrying katG-S315T mutations in the Hanoi study population (adjusted odds ratio [aOR] with 95% confidence interval [CI] = 13.20 [3.49–49.96], 11.98 [3.24–44.29], and 12.42 [2.81–54.90], respectively), whereas Rv1588c and pncA k-mers showed negative associations (aOR with 95% CI = 0.08 [0.02–0.31], and 0.01 [0.00–0.25], respectively). All variants corresponding to significant k-mers showed positive associations with cluster[+]/S315T[+]. The PPE46/47-like chimeric gene also showed positive association with cluster[+]/S315T[+] (aOR with 95% CI = 6.81 [2.13–21.72]) (Supplementary Table S6).
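For readers less familiar with how adjusted odds ratios relate to logistic regression output, the sketch below shows the usual conversion. It assumes Wald-type intervals (coefficient ± 1.96 standard errors on the log-odds scale), which this section does not state explicitly, and uses the reported PPE18/19 values only to illustrate the arithmetic. The function names are hypothetical.

```python
import math


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its SE to an OR and 95% CI."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def coefficient_from_or(or_point, ci_low, ci_high, z=1.96):
    """Back-calculate the coefficient and SE from a reported OR and 95% CI.

    Assumes a Wald interval, i.e. one that is symmetric on the log scale.
    """
    beta = math.log(or_point)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


# Reported aOR for PPE18/19 k-mers in the Hanoi population: 13.20 [3.49-49.96].
beta, se = coefficient_from_or(13.20, 3.49, 49.96)
point, low, high = odds_ratio_with_ci(beta, se)
```

The round trip reproduces the reported interval to within rounding, which is consistent with, though not proof of, a Wald-type interval. The width of the interval reflects the small number of isolates in some cells.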

Discussion

We identified a variety of drug resistance-conferring mutations prevailing in Hanoi, northern Vietnam, which appeared most frequently in the East Asian Mtb lineage L2, particularly in the ancient Beijing sublineage, less frequently in L1, and least frequently in L4. Large deletions that were not detected by conventional variant calling from mapped short reads were also found in pncA. Using the bacterial GWAS approach, we extracted candidate genes that were significantly associated with the clustered strains harboring the katG-S315T mutation and that were common to the two independent data sets (our cohort panel and a previous South African study, the KwaZulu-Natal cohort by Cohen et al.19), suggesting that expansions of INH-resistant TB can be facilitated by pathogen factors, at least in part.

A major drug resistance-conferring mutation, katG-S315T, accounted for 85.3% of the INH-resistant strains in our study population, which was similar to the proportions reported in other studies in Vietnam, e.g., 73.2% by Minh et al.13, 73.6% by Huyen et al.27, and 81.3% by Nguyen et al.15, and to the 78.4% reported in Southeast Asian countries28. The katG-S315T mutation is known to carry a low fitness cost, to have spread among Beijing strains and others, and to be more likely to be clustered23,29,30. The acquisition of the katG-S315T mutation precedes other resistance mutations, including those conferring RMP resistance4,5,19. Indeed, this mutation occurs more frequently in MDR-TB strains than other mutations30,31, and it has an important implication in the transmission and control of MDR-TB.

In this setting, we performed bacterial GWAS with phenotypic combination of clusters defined by <6 SNVs32 and katG-S315T mutation as a surrogate, and searched for pathogenic variants contributing to the spread of INH-resistance. We identified five genes, namely, PPE18/19, gid, emrB, Rv1588c, and pncA, which were shared by two different sample panels in Asia and Africa with different major Mtb lineages, L2 and L4, respectively.

Of the five genes extracted from the k-mer GWAS, except for drug-resistance-conferring genes, the best P values were obtained from PPE18/19 both in the Hanoi population and in the KwaZulu-Natal cohort. The association was mainly observed in L2 of Hanoi’s samples and L4 of KwaZulu-Natal’s samples. PPE18/19 genes are members of a multigene family and share high sequence similarity with another PPE gene, PPE6033. Among these PPE genes, homologous recombination events frequently occur and contribute to the sequence diversity34,35. Among nonsynonymous SNVs and small indels detected in the PPE18 gene of our study population leading to amino acid change in Mtb39A, two SNVs were significantly associated with cluster[+]/S315T[+], suggesting the spread of INH-resistant TB. Although their role in virulence is not fully understood, the PPE18 protein, also known as Mtb39A, was shown to downregulate the proinflammatory response and Th1-type immunity, interacting with host TLR236, and facilitated survival and multiplication of Mtb bacilli in a mouse model37.

The T-cell epitopes in Mtb39A38,39,40 lead to strong T-cell proliferation and IFN-gamma production38. Therefore it is used as a subunit for the human TB vaccine candidate Mtb72F and its successor M72. It has been proved to be immunogenic and can stimulate both cellular and humoral immune response41,42. By comparing with the lists of the T-cell epitopes published elsewhere38,40,43,44, nonsynonymous mutations E99A and A101T in PPE18 identified in the Hanoi population and Q286R in PPE19 in the KwaZulu-Natal’s panel were both found to be included in the sequences acting as T-cell epitopes for cellular immunity. Despite the presence of PPE18 variants in the Hanoi strains, the original T-cell epitope sequences were mostly conserved in either PPE19 or PPE60, when the BLAST search was applied. Variants of these PPE genes associated with spread of INH-resistance strains may thus increase the antigenic diversity of the bacilli, which may help evade or exploit human immunity by unidentified mechanisms through the process of human-Mtb coevolution33,35,45.

Mutations in the gid gene have been reported to be associated with low-level resistance to SM46, which was used as a first-line drug for a long time until recently. Because gid encodes a methyltransferase that is responsible for the methylation of 16S rRNA involved in translational fidelity47, it is conceivable that the gid mutation may modulate the fitness of INH resistance-conferring mutations through a change in mRNA translational fidelity. Our study revealed that gid k-mers with the E173* mutation were significantly associated with cluster[+]/S315T[+] and S315T[+], but not with cluster[+] alone, indicating that a concurrence of gid-E173* and katG-S315T may facilitate transmission even after controlling for population structure. The concurrence of a gid 130-bp deletion and katG-S315T is the first step toward XDR-level drug resistance in Africa19. In our study population, all strains with the gid-E173* mutation had at least one genetic mutation conferring resistance to first-line drugs. This nonsense mutation, gid-E173*, may facilitate the expansion of katG-S315T mutant strains. Indeed, other groups have reported that SM-resistant strains seem to be more clustered in Vietnam48.

Efflux pumps, including emrB, have been reported to be associated with pathogenicity since the up-regulation of efflux gene expression is involved in the development of resistance to anti-TB drugs49 and in a wide array of physiologic processes such as growth kinetics or the transportation of a variety of compounds50. The combination of the katG-S315T mutant with the emrB variant F508S in the Hanoi study may thus increase the drug efflux activity and facilitate Mtb survival and spread by mitigating drug pressure. emrB (Rv0783c) belongs to the major facilitator superfamily (MFS), characteristically energized by the proton motive force (H+ or Na+)50,51, and may confer low-level resistance to RMP52. Although the role of I461I in emrB in KwaZulu-Natal’s population is unknown at present, this synonymous mutation (c.1383C>A, ATC>ATA) yields a codon that is very rare in Mtb codon usage53, and the significance of codon usage bias and tRNA modification should be taken into account, because rare codons are sometimes advantageous to the survival of Mtb under stress conditions54. When exposed to INH, various MFS efflux pump genes were reported to be overexpressed50, and these may induce sustained increased efflux activity with selection and stabilization of drug-resistant mutations55. This may also be relevant to the acquisition of additional drug resistance. Further studies are required for elucidating the function of the mutations in efflux pump genes. Indeed, SNVs in efflux pump genes are often found in XDR strains but not in drug-susceptible strains51, although it is often difficult to attribute the extrusion of a drug to a specific gene50,56. The association pattern characterized by the presence or absence of the emrB variant with the phenotype was quite similar to that of PPE18/19, and this pattern was observed across lineages, in L2 in Hanoi and in L4 in KwaZulu-Natal. Although membrane localization of the PPE proteins may be functionally linked to efflux pump activities, this is currently unknown.

Rv1588c is a partial REP13E12 repeat protein57. Although k-mers carrying the reference sequences in Rv1588c showed negative association with the clustered strains harboring katG-S315T in the two panels, their functional significance remains unclear. As a variant, only a synonymous variant P34P was found in Hanoi, which was associated with ancient Beijing sublineage.

The reference sequence (=wild type) k-mers in pncA were also associated negatively with the phenotype in the Hanoi cohort, and variant-carrying k-mers showed a positive association in the KwaZulu-Natal cohort. As a variant found in Hanoi, Q141P in pncA has been reported as a high-confidence mutation leading to PZA resistance58,59. PZA resistance is often observed among MDR-TB isolates60,61. Thus, the possibility of pncA mutations facilitating the transmission of katG-S315T mutant Mtb isolates may make TB management more challenging.

Convergence-based phyOverlap analysis, which is a different approach, revealed four gene/intergenic regions, found only in the Hanoi study population, whose variation may have been caused by convergent evolution and that were significantly associated with the clustered strains carrying the katG-S315T mutation; all four were located in PE or PPE regions. The impact of genetic variation on the function of PE or PPE proteins remains largely unknown35. However, at least large deletions between PPE46 and PPE47 genes were observed in all three lineages (L1, L2, and L4) in Hanoi, and these were positively associated with the spread of INH-resistant TB. The deletion between the identical sequences of the two PPE genes leads to in-frame gene fusions through homologous recombination62, and their relatively high prevalence, reported as indicating a clonal expansion of Haarlem strains (L4) in Tunisia34, suggests that the generation of new chimeric genes may facilitate antigenic diversity and provide new determinants for the pathogen’s virulence across the lineages.

Further analyses using logistic regression models confirmed that all variants corresponding to significant k-mers of the five genes and even variants detected by the phylogenetic convergence test in Hanoi samples were positively associated with the spread of INH-resistant TB, even after adjustment for other possible confounders.

Our study has some limitations. First, we were not able to trace the epidemiological link among the patients to corroborate the transmission chain. Hanoi is the capital city of Vietnam with ongoing urbanization and a floating population coming from many other provinces; thus, pursuing an epidemiological link is rather difficult. However, we have detailed information on the patients’ residential districts and we included this information in the logistic regression analyses. Second, our samples were obtained in a population-based setting in a single Asian city; to generalize our findings, we analyzed a publicly available African cohort data set by using the same methodology. Third, performing in vitro experiments to elucidate the functional significance of each genetic variant was beyond our scope owing to resource limitations. Nevertheless, the extracted genes were associated with the spread of INH-resistant strains carrying the katG-S315T mutation, and these associations reached statistical significance in the bacterial GWAS approach based on LMM.

Previous studies suggest that the katG gene’s physiological function is not largely reduced by S315T substitution63. Its catalase-peroxidase-peroxynitritase activities may play a part to protect Mtb against reactive oxygen and nitrogen species derived from the phagocyte oxidative burst in human macrophages63,64. This may link KatG with other pathogen factors relevant to immune evasion or virulence such as PE/PPE65,66, although possible additive or synergistic effects on fitness should be further investigated. It is desirable to conduct validation studies in different populations. These Mtb genes are attractive candidates, presumably because of their relevance to the pathogen’s virulence, and they could be important sources to consider in in vitro and in animal models.

In conclusion, WGS data demonstrated the status of primary drug resistance at the gene level in Hanoi city, and bacterial GWAS was performed to identify candidate genes that may facilitate the spread of INH-resistant strains. Our findings provide new insights into the pathogenic mechanisms possibly mediated by the candidate genes, including PE/PPE, by which drug-resistant Mtb can maintain epidemiological fitness and spread in high-burden countries such as those in Asia and Africa.

Methods

Study sites, patient recruitment, and sample collection

This study was a part of our cohort study of patients over 16 years of age who were newly diagnosed with smear-positive pulmonary TB without any treatment history in Hanoi, Vietnam, during 2007–2009; the basic data with clinical interpretation have been published in previous reports12,67,68,69. In brief, the catchment area comprised 7 of the 14 districts in Hanoi, where more than half of the new smear-positive TB patients in the city were diagnosed and treated during the study period12. Sputum specimens were collected before starting the treatment, 92.7% of which were culture-positive, and drug susceptibility testing for first-line drugs was performed using the WHO standard proportional method12. The patients’ clinico-epidemiological information was also collected. Written informed consent was obtained from all the patients.

Ethics statement

This cohort study was approved by the Ethical Committees of the Ministry of Health, Vietnam, National Center for Global Health and Medicine, and the Research Institute of Tuberculosis, Japan Anti-Tuberculosis Association, Japan. All experiments were performed in accordance with relevant guidelines and regulations. In the case of minors, their parents provided written informed consent.

WGS

Mycobacterial DNA samples from Löwenstein-Jensen culture media were extracted using the Isoplant kit (Nippon Gene, Tokyo, Japan) and analyzed using Illumina HiSeq and MiSeq systems (Illumina, San Diego, CA, USA). These experiments were performed using a class II safety cabinet in a biosafety level 3 laboratory to prevent contamination. For HiSeq 2500, a WGS library was prepared using an automated sample preparation system (Agilent Technologies Inc.) with the TruSeq DNA PCR free sample prep kit (Illumina). For MiSeq, a library was prepared from 200 ng of genomic DNA with the TruSeq Nano DNA LT Sample Preparation Kit (Illumina), following the manufacturer’s instructions. Paired-end (2 × 150 bp) sequencing was performed using HiSeq 2500. For MiSeq, paired-end (2 × 250 bp or 2 × 300 bp) sequencing was used. The sequence data are available in the DDBJ/EMBL/GenBank databases under the accession numbers DRA008666-7 and DRA008677.

Extracting single nucleotide variants (SNVs) for Mtb lineages/sublineages and genetic clustering

Briefly, after read trimming and exclusion of severely contaminated samples, sequence reads were mapped to the H37Rv genome (AL123456.3) using BWA-MEM 0.7.15 (https://github.com/lh3/bwa), followed by variant calling with the Genome Analysis Toolkit (GATK version 3.7)70. Only paired-end fastq files with an average depth of more than 25X were accepted for the analysis. The criteria set for identifying SNVs and small indels were a minimum base call quality score of Q30 and a minimum coverage depth of 10X. Drug resistance-conferring mutations, including small indels, and lineage-specific variations were extracted using TB-Profiler (version 0.3.7)9,22. The Beijing genotype of lineage-2 (L2.2) was further classified into ancient and modern Beijing sublineages by detecting the SNV at nucleotide position 649,345, which is equivalent to the presence of IS6110 in the NTF region71.
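The acceptance criteria described above can be expressed as a simple filter. The following Python snippet is a minimal sketch under stated assumptions, not the pipeline used in this study: the `VariantCall` record, its field names, and the two helper functions are hypothetical, and in practice such thresholds are applied within the variant-calling tools themselves.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else in this sketch is hypothetical.
MIN_AVERAGE_DEPTH = 25   # samples need an average depth of more than 25X
MIN_BASE_QUALITY = 30    # minimum base call quality score of Q30
MIN_SITE_DEPTH = 10      # minimum coverage depth of 10X at the variant site


@dataclass
class VariantCall:
    """Hypothetical record for one candidate SNV or small indel."""
    position: int         # coordinate on the H37Rv reference
    base_quality: float   # Phred-scaled base call quality
    depth: int            # read depth covering the site


def sample_passes(average_depth: float) -> bool:
    """Accept a sample only if its average depth exceeds 25X."""
    return average_depth > MIN_AVERAGE_DEPTH


def variant_passes(call: VariantCall) -> bool:
    """Accept a variant only if it meets both the Q30 and 10X criteria."""
    return call.base_quality >= MIN_BASE_QUALITY and call.depth >= MIN_SITE_DEPTH
```

For example, `variant_passes(VariantCall(position=649345, base_quality=35, depth=8))` is rejected because the site depth falls below 10X.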

Large deletions were screened throughout the mapped reads by seeking zero or low-depth areas (<15% of the average depth) using an in-house Python script and then visualized for confirmation with the Integrative Genomics Viewer (IGV) version 2.3.91. For this deletion screening, complete genome sequences of the clinical isolates in our Hanoi cohort (AP018033 to AP018036)72,73, as well as the H37Rv genome, were used as reference sequences. After excluding ambiguous variants in the categories of repetitive sequences, insertion sequences, and phages, genetic clusters were defined by pairwise differences of no more than five SNVs74. A phylogenetic tree was constructed by the maximum likelihood method using RAxML (version 8.2.8)75 and then visualized with plotTree (https://github.com/katholt/plotTree), using a lineage-7 strain, ERR181435, as an out-group.
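The two screening steps in this paragraph can be illustrated as follows. This is a minimal sketch and not the in-house script: the function names, the per-base depth list, and the concatenated-allele strings are hypothetical inputs. It also assumes that clusters are formed by single-linkage merging of isolates within five SNVs of each other, which the text does not specify.

```python
from typing import Dict, List, Tuple


def low_depth_regions(depths: List[int], average_depth: float,
                      fraction: float = 0.15,
                      min_length: int = 1) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs with depth < fraction * average."""
    threshold = fraction * average_depth
    regions = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth < threshold:
            if start is None:
                start = position          # a low-depth run begins here
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    if start is not None and len(depths) - start + 1 >= min_length:
        regions.append((start, len(depths)))  # run extends to the last base
    return regions


def snv_distance(a: str, b: str) -> int:
    """Count differing sites between two equal-length allele strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def genetic_clusters(alleles: Dict[str, str],
                     max_snvs: int = 5) -> List[List[str]]:
    """Group isolates linked by pairwise distances <= max_snvs (single linkage)."""
    names = sorted(alleles)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if snv_distance(alleles[a], alleles[b]) <= max_snvs:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Isolates with no neighbor within the threshold are left unclustered.
    return sorted(group for group in groups.values() if len(group) > 1)
```

Candidate regions returned by `low_depth_regions` would still need visual confirmation, as was done here with IGV. Under single linkage, two isolates more than five SNVs apart can fall into the same cluster if an intermediate isolate links them.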

Analyses of bacterial GWAS

The associations between the phenotypes and the presence or absence of 31-bp short sequences (k-mers) in the genome were investigated using a genome-wide efficient mixed model association algorithm implemented in the GEMMA software (https://github.com/genetics-statistics/GEMMA). First, the DSK software (https://github.com/GATB/dsk) was used to list all the unique 31-bp DNA k-mers, and then their presence or absence in all the samples was analyzed as mentioned above20.
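The k-mer presence/absence pattern that serves as input to the association test can be sketched as below. This is an illustration only and does not reproduce DSK: it works on one sequence string per sample rather than on sequence reads, it does not merge reverse-complement k-mers, and the function names are hypothetical.

```python
from typing import Dict, List, Set

K = 31  # k-mer length used in the study


def kmers(sequence: str, k: int = K) -> Set[str]:
    """Return the set of unique k-mers in one sequence (forward strand only)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def presence_absence(samples: Dict[str, str],
                     k: int = K) -> Dict[str, List[int]]:
    """Map each k-mer to a 0/1 vector over the samples, sorted by sample name."""
    per_sample = {name: kmers(seq, k) for name, seq in samples.items()}
    names = sorted(per_sample)
    all_kmers = sorted(set().union(*per_sample.values()))
    return {
        kmer: [int(kmer in per_sample[name]) for name in names]
        for kmer in all_kmers
    }
```

Each 0/1 vector then plays the role of a genotype for one k-mer, and k-mers sharing an identical pattern across samples carry the same information in the association test.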

DBGWAS, an extended k-mer-based GWAS tool with a compacted De Bruijn graph76, was further used to confirm the genetic variants associated with the phenotypes of interest. Sequence reads were assembled using SPAdes (v3.13.0)77 and Platanus (1.2.4)78 when appropriate, and the generated contigs were used for BLAST search (ncbi-blast-2.8.1+) to identify the location of the phenotype-associated k-mers. Bonferroni correction was applied for multiple testing; the significance threshold after correction was calculated as 0.05 divided by the number of variants identified in the study samples.
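The Bonferroni threshold described above is a single division. The sketch below shows the arithmetic; the function names are hypothetical, and the variant count used in the usage note is an arbitrary illustrative number, not a value from this study.

```python
def bonferroni_threshold(n_variants: int, alpha: float = 0.05) -> float:
    """Return the per-test significance threshold, alpha / number of variants."""
    if n_variants <= 0:
        raise ValueError("n_variants must be a positive integer")
    return alpha / n_variants


def is_significant(p_value: float, n_variants: int, alpha: float = 0.05) -> bool:
    """Report whether a p value falls below the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(n_variants, alpha)
```

With a hypothetical 10,000 tested variants, `bonferroni_threshold(10_000)` gives a threshold of 0.05 / 10,000, so a p value of 1e-7 would pass while 1e-5 would not.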

To investigate whether any variant in the whole genome, including SNVs and indels, was associated with the phenotypes of interest, we used the bugwas R package20 with built-in GEMMA. The bi-allelic SNVs were used to calculate the relatedness matrix of the samples for linear mixed models (LMMs) to control for the population structure. Likelihood ratio tests were used to assess the significance of the associations.
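To make the idea of a relatedness matrix concrete, the sketch below computes one common form, a centered matrix in which each SNV column is mean-centered and the cross-products are averaged over SNVs. This is an assumption made for illustration; the exact matrix computed inside bugwas and GEMMA in this study is not stated in the text, and the function name and 0/1 genotype coding are hypothetical.

```python
from typing import List


def centered_relatedness(genotypes: List[List[int]]) -> List[List[float]]:
    """Compute K = Xc * Xc^T / p from a samples-by-SNVs 0/1 genotype matrix.

    genotypes[i][j] is the allele (0 = reference, 1 = alternative) of
    sample i at bi-allelic SNV j; Xc is the column-centered matrix.
    """
    n = len(genotypes)
    p = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / p
         for k in range(n)]
        for i in range(n)
    ]
```

Closely related isolates share alleles at most sites and therefore receive large positive entries, which is how the LMM accounts for the clonal population structure of Mtb.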

Phylogenetic convergence tests (phyOverlap)26 were also performed to identify the convergent variants associated with the phenotypes. The Benjamini-Hochberg adjustment at a false discovery rate of 0.05 was applied to obtain the q values for phyOverlap.
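The Benjamini-Hochberg adjustment can be sketched as follows. This is a generic implementation of the standard step-up procedure, not code from phyOverlap, and the function name is hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted values (q values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that q values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values
```

A variant is then reported when its q value is below 0.05, which controls the expected proportion of false discoveries among the reported variants rather than the chance of any single false positive.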

To compare the findings obtained from the GWAS analysis of our 332 samples from Vietnam, another set of WGS data from 337 clinical isolates was retrieved from the public database and analyzed in a similar way. These isolates were collected in the KwaZulu-Natal province of South Africa from 2008 to 2013 in a study conducted by Cohen et al.19 to investigate the emergence of drug-resistant TB (hereafter referred to as the KwaZulu-Natal study).

Other statistical analyses

Chi-square and Fisher’s exact tests were performed to compare the differences in the proportions among the groups. A Venn diagram (VennDiagram package, R version 3.4.4) was used to demonstrate the common gene(s) harboring variants, including k-mers, associated with different phenotypes. Possible associations between the given genetic variations and INH-resistant clusters, adjusted for Mtb lineages and patients’ age, gender, and living area, were further studied using logistic regression models. These analyses were performed using STATA version 12 (StataCorp, College Station, TX, USA), and P values less than 0.05 were considered statistically significant.
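For readers who wish to see how a comparison of proportions works, Fisher’s exact test for a 2 × 2 table can be sketched in a few lines. The analyses in this study were run in STATA; the Python function below is a substitute written for illustration, its name is hypothetical, and it uses the common two-sided definition that sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    p_value = sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

Here the rows could represent two groups of isolates and the columns the presence or absence of a variant; the counts passed to the function are the four cell frequencies.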