Introduction

Tuberculosis (TB) remains a public health threat1, especially in developing countries where infection rates are increasing and diagnosis is a significant challenge1. Genotypic variations among clinical isolates of Mycobacterium tuberculosis, the causative agent of TB, are associated with specific phenotypes (e.g., resistance to autophagy)2, which highlights the importance of profiling the population structure of M. tuberculosis in all communities. Worldwide, the incidence of TB has reached an alarming level, especially in sub-Saharan Africa and South-East Asia (300+ per 100,000)3. With ~10.0 million new TB cases in 2014 alone, understanding the basis for disease transmission and screening for drug-resistant strains are essential for an effective control strategy4. In the Middle East, drug-resistant isolates of M. tuberculosis are of significant importance and require further scrutiny. This is particularly important in Saudi Arabia5,6 and Egypt7,8 where disease transmission dynamics are not affected by high levels of infection with human immunodeficiency virus (HIV) found in sub-Saharan Africa9. Migrant workers from countries with high prevalence of tuberculosis, such as South-East Asian countries, are often implicated in TB transmission beyond regional borders and require close monitoring by the public health authority10. This is the case in the Kingdom of Saudi Arabia, especially among foreign workers11, where TB constitutes a significant health problem. Recent reports have indicated the presence of multiple lineages of M. tuberculosis in Saudi Arabia, a country with an active guest workers program12. Annual visitors for religious rituals in Saudi Arabia may also contribute to the transmission of TB. In Egypt, another Middle Eastern country with a stable population where guest workers are rare, TB is still considered a significant problem13. Until now, the population structure of M. tuberculosis in those countries with historic ties to global trade remain elusive. In this study, we perform comparative analyses of whole genome sequences (WGS) to profile the dynamics of TB transmission in a relatively stable community (e.g., Egypt) compared to those undergoing significant social changes (e.g., Saudi Arabia).

Earlier reports provided clear evidence for the emergence of single- and multidrug- resistant tuberculosis (MDR-TB) in the Middle East. More importantly, they indicated the drastic variability in prevalence of single drug-resistant and MDR-TB (ranging from 3–50%)8,14,15,16, most likely related to lack of adherence to TB drug treatment. Earlier studies used growth-based phenotyping methods8,14, while more recent studies have employed a combination of genotypic approaches such as PCR and other targeted sequencing approaches to detect resistance-conferring mutations5,17. Information about the predominance of M. tuberculosis lineages in the Middle East has only been rudimentary and scarce, especially on a genome-wide level, until recently18. Analysis of a large number of M. tuberculosis isolates collected from different countries around the globe identified resistance to isoniazid (katG mutation) as a precursor for the emergence of MDR-TB and other more complex drug resistance profiles19. Unfortunately, there were no isolates included in this study from the Middle East, despite the inclusion of 8,316 other isolates19. More recently, a total of 76 isolates were analyzed using WGS from Saudi Arabia18. Studies have documented increased risks of TB incidence among “Hajj” pilgrims in Saudi Arabia, due to mass-gatherings and a large concentration of people from all over the world20,21. Some of these countries are described as countries with high-incidence of TB19. Furthermore, an increase drug-resistant (DR) strains among MDR-TB particularly to aminoglycoside and fluoroquinolone have been documented in Saudi Arabia recently5. The current study represents the first attempt to utilize WGS of clinical M. tuberculosis isolates from both Egypt and Saudi Arabia to profile the population structure of this important pathogen considering the patient’s country of origin for each isolate. Findings from our analysis suggest the significance of the problem of DR and MDR among M. tuberculosis isolates in the Middle East, particularly in stable communities (up to 35%), is actually much higher than identified before on a global level (18%)19 with little influence of migrant workers on disease transmission dynamics.

Results

Predominant lineages of M. tuberculosis in the Middle East

To characterize M. tuberculosis in the Middle East, we analyzed the genomes of archived isolates from sputum samples of 69 patients at the King Abdulaziz University Hospital (KAUH) in Jeddah, Saudi Arabia, that frequented the chest ward between January 2012 and March 2013. In addition, we analyzed the genomes of archived isolates from 60 patients at Cairo University, Cairo, Egypt, that frequented the university chest clinic during the same period. Of the total 129 samples, 95 isolates (61 from Saudi Arabia and 34 from Egypt) were chosen for genomic analysis based on successful culturing and ability to extract high-quality genomic DNA. For Egypt, all isolates were from Egyptian natives, while isolates from Saudi Arabia were collected from patients native to a range of countries in Africa (Chad, Somalia), the Arabian Peninsula (Saudi Arabia, Yemen), and Asia (Afghanistan, India, Myanmar, Indonesia, the Philippines). Patients ranged in age from 1 year to 76 years old with nearly half of them falling between ages 21 and 40. More demographic information of the patients is presented in Table 1. The results of the examination of sputum samples for the presence of M. tuberculosis and the drug sensitivity tests are compiled in Table 2.

Table 1 Patient demographics information for both Egypt and Saudi Arabia.
Table 2 Drug resistance profile of analyzed M. tuberculosis isolates from Egypt and Saudi Arabia.

To analyze the population structure of M. tuberculosis isolates, we employed a protocol based on 12-locus MIRU-VNTR typing22. Because of the demographic differences between the Egyptian and Saudi Arabian sources of the M. tuberculosis isolates under study (Table 1), we hypothesized the presence of different models for disease transmission and spread dynamics. In Egypt, we expected the predominance of a countrywide group of isolates to cluster in a sympatric pattern (isolates from a single geographical region with a stable population) compared to Saudi Arabia with more diversity of M. tuberculosis lineages in an allopatric pattern (isolates from multiple, isolated geographical regions with a diverse population)23. Out of 25 genotypes identified (out of 34 isolates), the most predominant lineage in Egypt was from the Indian-East African lineage (lineage 3), which was represented in 76.5% of examined isolates (Fig. 1). On the other hand, out of 59 genotypes identified (out of 61 isolates), the Euro-American (lineage 4) was represented in 66% of isolates in Saudi Arabia (Fig. 1), confirming earlier reports24. Analysis on the MIRU-VNTR website25 further categorized the isolates into sub-lineages summarized in Table 3. Moreover, the lineages of 16 isolates (out of 61 isolates) from Saudi Arabia were predicted differently depending on the algorithm used (TBinsight26 vs. MIRU-VNTRplus). To resolve discordance between the two databases, we used PCR of pks15/1 to distinguish between lineages 3 and 4. Interestingly, two isolates from Saudi Arabia, SA34 and SA60, of lineage 4, did not group with any known sub-lineage and appear to be specific to Saudi Arabia. The genomes of these two strains, among others from Egypt and Saudi Arabia, were subjected to further analysis using whole genome sequencing.

Figure 1
figure 1

Lineage distribution based on MIRU-VNTR typing for isolates from both (A) Egypt and (B) Saudi Arabia. If a conflict of lineage assignment occurred, PCR tests were performed to assign the appropriate lineage.

Table 3 Lineage differences between two databases of KSA and Egypt samples.

Sequence variations on a whole genome level

To gain more insights into diversity among the identified lineages of M. tuberculosis from the Middle East, we selected certain isolates for whole genome sequence (WGS) analysis on the Illumina platform. Our selection included isolates with drug resistance phenotypes, representative samples for each MIRU-VNTR cluster (N = 49, 19 from Egypt and 30 from Saudi Arabia), in addition to isolates with unassigned lineages. Sequencing isolate genomes yielded paired-end fragments of 101 bp (~805-fold coverage) or 251 bp (~114-fold coverage), respectively. The total number of reads, number of mapped reads, and coverage data for each genome can be found in Table S1. Whole genome sequences were aligned using bowtie27, and SNPS and insertions/deletions (indels) were identified using Varscan228. Several of these SNPs (N = 20), including those associated with drug resistance (N = 10), were confirmed by classical Sanger sequencing (Table S2). Interestingly, the majority of SNPs were nSNPs (Fig. 2) suggesting a rapidly evolving M. tuberculosis population in those patients. Additionally, the number of SNPs between all isolates was high (>100 SNPs), an indication of recent infection rather than infection relapse29, from a chronic or latent infection.

Figure 2
figure 2

Percentage of synonymous SNPs (sSNP), non-synonymous SNPs (nSNP) and SNPs located in intergenic areas (intergenic). SNPs were included if coverage was more than 20 reads and frequency was ≥50% of the reads.

In another analysis, we compared WGS-based lineage prediction with MIRU-VNTR (TBinsight analysis26) results. In this comparison, lineage assignments of 2 isolates (SA33 and SA47) were discordant where WGS moved them from lineage 4 to lineage 3. Among the 16 isolates that gave different lineage results between TB insight and MIRU-VNTR websites, seven were sequenced, and only one agreed with MIRU-VNTR while all others agreed with TBinsight. Moreover, one previous study of M. tuberculosis genotypes found in Saudi Arabia discovered 2 strains with novel spoligotypes (703777707770371, 467777377413771), unique to Saudi Arabia. To determine if these spoligotypes were represented in our samples, in silico spoligotyping was done for each sequenced sample30. None matched the novel spoligotypes identified by Al-Hajoj, et al.24, a reflection of targeting a different population from the earlier study.

Transmission dynamics of M. tuberculosis isolates within individual countries

To better understand the nature of M. tuberculosis isolate transmission within a population, phylogenetics of isolate variants were analyzed to reconstruct their evolutionary relationship. This phylogenetic analysis was based on the alignment of 17,358 SNPs detected in all 49 sequenced genomes when aligned to the genome of the standard laboratory strain, H37Rv (NC_000962.3). Likelihood ratio tests rejected a molecular clock, so downstream analyses were based on chronogram where rates of evolution were allowed to vary among branches. The resulting tree topology and divergence time estimates among major lineages were consistent with previous estimates31, with the most recent common ancestor (MRCA) of all isolates occurring ~3400 years ago. The phylogeny indicates that the vast majority of isolates from Egypt form a single clade in lineage 3, all of which diverged within the past 500 years, and two remaining isolates forming a clade in lineage 1 with an MRCA of ~150 years (Fig. 3). Nearly 80% of isolates from Egypt form a single clade with a crown age of ~78 years.

Figure 3
figure 3

Evolutionary relationships and divergence dates of isolates from Egypt and Saudi Arabia. Divergence dates are in years, with the 95% confidence intervals of node ages represented by blue bars. The topology was recovered with 100% posterior probability, except for those nodes labeled. Color coded boxes at the tips indicate the nationality of the patient from which the sample was isolated and correspond with the color of countries mapped in Fig. 4. SA tip labels refer to samples isolated in Saudi Arabia, whereas E tip labels refer to those isolated in Egypt.

The isolates from patients in Saudi Arabia were scattered throughout lineages 1–4, with no apparent clustering according to the nationality of the patients. In contrast, isolates from Egypt were significantly more closely related to each other than expected when compared to a null model where nationalities were randomized across the tree (p < 0.01). No other nationalities showed significant phylogenetic clustering. Nevertheless, in some instances isolates from individuals with the same nationality form clades (Fig. 3). For example, isolates from individuals from Somalia form two different clades in lineage 4 and a clade in lineage 3. With few exceptions, divergence dates of closely related isolates from patients in Saudi Arabia were older than 200 years old. Additionally, we found no significant clustering of TB isolates based on patient’s age, or gender (p > 0.05), another indication of a dynamic model of disease transmission. Surprisingly, however, was the lack of clustering among isolates from patients sharing the same nationality. This indicates recent spread of M. tuberculosis among guest workers, confirming region-specific lineage transmission patterns (sympatric model), even within Saudi Arabia with active guest worker programs, refuting our earlier hypothesis.

Drug resistance of M. tuberculosis isolates

In both Egypt and Saudi Arabia, M. tuberculosis isolates were screened for resistance to rifampicin (RMP), isoniazid (INH), ethambutol (EMB), streptomycin (STM), and pyrazinamide (PZA). To further analyze drug resistance among the 49 isolates (from both Egypt and Saudi Arabia) with whole genome sequences, the drug resistance profile of each isolate was re-tested by the group at the University of Wisconsin-Madison against STM, RMP, INH, and EMB. The results of the second screen were used as the gold standard and reference for evaluating genetic diagnostic methods. While no multidrug-resistant (MDR) isolates were identified in Saudi Arabia, almost 35% of isolates from Egypt were considered MDR against 3 drugs and 6% MDR against 4 drugs (Table 2). Both numbers are a sharp increase from the original identification of only 6% MDR isolates. Notably, 3.3% of isolates from Saudi Arabia were resistant to streptomycin and ethambutol (Table 2) but did not meet the criteria for MDR according to the WHO guidelines (no double resistance to both isoniazid and rifampicin)1,32. Of the 30 isolates from Saudi Arabia, three strains were determined to be resistant to STM by the clinical screening, and this phenotype was confirmed. A 4th strain, SA14, found to be sensitive in the original screening in the KSA, was found to be resistant to STM, and 2 more strains, SA51 and SA28, were found to be partially resistant to INH and EMB, respectively. No strains were found to be resistant to RMP. These results identified three additional drug-resistant strains that were missed by the clinical laboratory.

In a separate analysis focusing on isolates with whole genome sequences, the country of origin of patients with drug resistant isolates were considered to further characterize potential origin(s) of spread for drug resistance among M. tuberculosis isolates (Fig. 4). This analysis revealed that drug-resistant strains did not cluster together (p > 0.05); rather, they often grouped with susceptible strains. This was consistent for isolates from both Egypt and Saudi Arabia. Also, drug resistance strains did not cluster together based on patients country of origin (p > 0.05). For all screened isolates, the percentages of pansensitive isolates varied significantly between Egypt (10.5%) and Saudi Arabia (83.3%) (Fig. 4). It is noteworthy to mention here that levels of drug resistance were slightly different from those reported in Table 2, most likely because the improved prediction by whole-genome sequencing.

Figure 4
figure 4

Geographical maps of M. tuberculosis drug-resistant isolates examined in this study. (A) A map of Egypt with an inset pie chart describing percentages of drug-resistant phenotypes among Egyptian isolates with whole sequenced genomes (N = 19). Note the sum of all percentages >100% because of the high levels of MDR among Egypt isolates. (B) A map showing the countries of origin for patients examined in Saudi Arabia. The inset pie chart describing percentages of drug-resistant phenotypes among Saudi Arabian isolates with whole sequenced genomes (N = 30).

Molecular determination of drug resistance (DR)

After obtaining accurate DR profiles for the examined M. tuberculosis isolates, we attempted to analyze this important phenotype in light of the obtained genome-wide variations, particularly single nucleotide polymorphism (SNPs). For isoniazid, all INH resistant isolates (n = 13, one from Saudi Arabia and 12 from Egypt) were explained by the fabG1 (mabA) C-15T promoter variant or KatG Ser315Asn, both of which have been previously implicated in INH resistance33,34. Table S3 represents the variations called for each isolate, including the INH susceptibility testing results. No INH-susceptible isolates had a known resistance-conferring mutation in the queried genes. Two of three isolates from Egypt with no INH susceptibility testing results were likely resistant to INH since the isolates both harbored a KatG Ser315Asn mutation. All RMP resistant isolates harbored a RpoB Ser450Leu variation, which is widely known to confer resistance. Table S4 represents variations for each of the 48 isolates tested for RMP susceptibility. These isolates also harbored a known compensatory mutation in RpoC, Val483Gly, which compensates for the deleterious effects of the Ser450Leu mutation in RpoB35. No RMP susceptible isolates (n = 33) harbored a mutation in the RMP resistance-determining region (RRDR). No RMP-susceptible isolates had compensatory mutations in either RpoC or RpoA, agreeing with previous reports35.

For ethambutol, all EMB resistant isolates had a mutation previously reported to be associated with EMB resistance. Table S5 represents variations in four genes involved in EMB resistance. Interestingly, several isolates had more than one variation across the four genes, suggesting an increased MIC. However, nine of 28 EMB-susceptible isolates also had these variations, except for the EmbB Gly406Asp; this variation only existed in EMB resistant isolates. Three of the four isolates with no EMB susceptibility testing results also had the EmbB Gly406Asp variation, likely pointing to a resistant phenotype. Of all isolates, only three were phenotypically resistant to PZA, all from Saudi Arabia. Table S6 contains variants called in three genes associated with PZA resistance. Furthermore, none of the three had variations in pncA or its promoter region, rpsA, or panD. This phenomenon is not uncommon and has been reported in several reviews36,37,38. We observed only synonymous substitutions in pncA in PZA susceptible isolates (n = 7), which have been reported to not confer resistance39; and only one non-synonymous substitution in rpsA in a single isolate. Isolates from Egypt did not have PZA susceptibility results, however, 14 of these isolates had a non-synonymous mutation in pncA and one had a non-synonymous mutation in rpsA. Finally, only 4 isolates from Saudi Arabia and 3 from Egypt were resistant to STM. Only 2 of the 4 STM resistant isolates had a known resistance-conferring mutation in rpsL (Lys43Arg) (Table S7). Two resistant isolates from Saudi Arabia did not have a known molecular mechanism to explain the resistant phenotype. One susceptible isolate harbored an rrs variation at nucleotide 724, which has not been reported previously. Isolates with no STM susceptibility testing results did not have any variations in the three genes queried.

Discussion

Tuberculosis, and especially drug-resistant strains of M. tuberculosis, represent a major public health threat in developing and underdeveloped countries32. With 10 million new TB cases reported each year40, regional analyses are of paramount importance to help local health authorities control TB on national levels41. Recent analyses of historical and recent genomic data have profiled the nature of tuberculosis disease transmission through global trade, migration, conquests, and even simple travel42,43,44. In the Middle East, most of the reported studies focused on strain lineage distribution and profiling antibiotic sensitivity10,45,46. Unfortunately, scarce information is available on the epidemiology of M. tuberculosis on the genomic level from the Middle East, compared to other regions around the world19. However, several reports have indicated the circulation of DR and MDR in the Middle East47, which highlight the need for more thorough genomic analyses. In this report, a comprehensive SNP analysis of M. tuberculosis was carried out to identify association with drug resistance phenotypes (TB Profiler and related TBDREAM48), and identified poor correlation, most likely because of poor representation of isolates from the Middle East in this database. These results highlight the urgent need for testing of molecular diagnostic methods not only in the Middle East, but also in all understudied regions around the world. It may be that the region of isolation, country of origin, and lineage of the isolate may need to be taken into account when designing molecular diagnostic techniques in the future, and certainly when utilizing them in the field. While sensitivity and specificity information were calculated for RMP, EMB, and STM, the low sample size and <100% specificity or sensitivity reported by TB Profiler did not allow us to make meaningful interpretations of the utility of TB Profiler for the identification of drug-resistant isolates of M. tuberculosis. Drug-resistant strains, however, do not appear to be more contagious or spread more widely than drug susceptible strains, unlike earlier analysis of a M. tuberculosis Beijing cluster of isolates49. Nevertheless, without appropriate diagnosis and treatment of these strains, the risk of spread of drug-resistant strains can only increase. It is noteworthy to mention here that an earlier study on the problems of DR in Saudi Arabia identified a similar pattern for DR to the one reported here with the exception of DR to ethambutol, which was higher in the studied population (4.9%) compared to earlier findings (0.8%)5. A more recent study, however, identified DR in up to 8% of isolates within Saudi Arabia18.

Phylogenetic analysis of M. tuberculosis isolates showed lack of clustering among drug-resistant strains (p > 0.05; Fig. 3). In most cases, drug-resistant isolates formed clades with drug-susceptible isolates. Isolates from patients in Egypt formed two distinct clades in lineages 1 and 3, and overall were significantly more closely related to each other than expected by chance. By contrast, isolates from patients in Saudi Arabia were spread amongst all four lineages, and we were unable to reject the null hypothesis that the nationalities of migrant workers were randomly distributed throughout the phylogeny. The relative homogeneity of communities in Egypt compared to the heterogeneity of communities of migrant workers in Saudi Arabia may account for these contrasting patterns. Moreover, lineage 3 was more prevalent among Egypt isolates which contrasted with an earlier study that indicated the prevalence of lineage 4 (Euro-American clade)13, most probably due to the analysis of more isolates collected from a more diverse population in the earlier study. Our divergence time estimates indicate that the major clades among closely related Egyptian isolates (lineage 3) have all diverged within the past 78 years compared to >200 years for those isolates (lineage 4) from Saudi Arabia (Fig. 3). Few other lineages exhibit clades with similar divergence dates. Thus, while it is evident that tuberculosis is spread amongst people in Egypt, it is more difficult to determine whether migrant workers contract the disease in Saudi Arabia or in their home countries. The evolutionary diversity of isolates among individuals in Saudi Arabia suggests that this latter scenario likely plays some role. For example, it appears more likely that TB was spread among migrant workers from Somalia while they were living in Saudi Arabia than a scenario where three sister pairs of TB strains were independently introduced into Saudi Arabia from Somalia. By contrast, the four isolates from Chadian individuals included in this analysis belonged to the four different TB lineages. In this case, it is equivocal whether the disease was acquired in Saudi Arabia or introduced from Chad.

Overall, both phenotypic and genome-wide analyses provided more insights into the transmission dynamics of tuberculosis among populations within the Middle East with clearly different demography. Surprisingly, a population (e.g. Saudi Arabia residents) with high exposure to cases of tuberculosis with DR and MDR-TB, appear to have less of a problem of DR tuberculosis based on our analysis and others5. On the other hand, a homogenous population structure (mainly Egyptians) had an increasing problem of MDR17, an important highlight to the problem of adherence to drug therapy and a suitable target for the STOP program by the WHO. More importantly, considerations need to be given to genomic analysis of more M. tuberculosis isolates from different countries and to the prevalent lineage as well as drug-resistant genotypes by public health authorities before selection of a suitable diagnostic or drug regiment for a country within the Middle East.

Methods

Ethics statement and isolation of M. tuberculosis

Approval to use patient samples was obtained from the Institutional Review Board of Kaser Elini Faculty of Medicine, Cairo University as well as the Ministry of Health of the Kingdom of Saudi Arabia, and the Research Ethics committee, Faculty of Medicine, King Abdulaziz University. All isolates were de-identified to maintain patient privacy while information related to the patient gender, nationality, age and city of residency were kept to be used for the analysis of transmission dynamics. Approval to use de-identified clinical isolates of M. tuberculosis was also obtained from the University of Wisconsin-Madison’s Institutional Review Board. All methods conducted in this research adhered to the guidelines and regulations approved by the Office of Biological Safety at the University of Wisconsin-Madison.

All clinical isolates of M. tuberculosis were obtained from early morning sputum specimens. Sputum samples were digested and decontaminated by the ready to use N- acetyl- l- cysteine (NALC) 3% NaOH method (Nac-PACTMEA3, Alpha Tec, Inc, Vancouver, Washington, USA) and concentrated by centrifugation at 3,000 rpm for 20 minutes, as described by the manufacturer50. The resulting pellet was resuspended and used in part for detection of M. tuberculosis complex (MTBC) isolates by smear microscopy using fluorochrome stain (Fluo-RAL-Auramine staining kit for Mycobacteria Detection). The remainder of the resuspended pellet was used for culturing in the VersaTREK fluid culture system and for direct MTBC detection by Gene Xpert PCR. All clinical isolates were grown on Löwenstein–Jensen medium slants. Single colonies from each isolate were sub-cultured on Middlebrook 7H10 agar (BD, Franklin Lakes, NJ) supplemented with 0.5% glycerol and 10% ADC (2% glucose, 5% bovine serum albumin, fraction V, and 0.85% NaCl), and incubated at 37 °C under biosafety level 3 (BSL3) laboratory conditions. Single colonies were grown in Middlebrook 7H9 broth (BD Difco) supplemented with 0.5% glycerol and 10% ADC for genomic DNA extraction and drug resistance testing.

Phenotypic analysis

VersaTREK 528 (TREK Diagnostic Systems, Inc. Westlake, OH) was used to determine phenotypic drug resistance in the place of isolation (KAUH) following the manufacturer’s instructions. Inoculated cultures containing rifampicin (30 µg/ml), isoniazid (3 µg/ml or 12 µg/ml), ethambutol (150 µg/ml or 240 µg/ml), streptomycin (30 µg/ml or 120 µg/ml), or pyrazinamide (300 µg/ml) and were monitored for a positive signal for up to 42 days51. For further confirmation following sequence analysis, antibiotic sensitivity of selected isolates was determined using standard, semi-quantitative in vitro susceptibility testing. In this assay, Middlebrook 7H10 agar plates were embedded with BBL Sensi-Disc Susceptibility Test Discs (BD Biosciences, MD) to reach concentrations of 0.2 µg/ml INH, 5 µg/ml RMP, 10 µg/ml STM, 5 µg/ml EMB, or left free of antibiotic as a positive control. Plates were inoculated with the 50 µl of each isolate, directly or serially diluted from early log phase cultures grown in Middlebrook 7H9 broth to an OD600 of 0.1–0.2 and incubated at 37 °C for 21–28 days.

Finally, the GeneXpert Dx automated system (Cepheid AB, Solana, Sweden) was used to detect MTBC and rifampicin resistant genotypes at the KAUH following the manufacturer’s instructions as previously described52,53,54,55. GenoType MTBDRplus (Hain Lifescience GmbH, Nerhren, Germany) was also used to detect MTBC and rifampicin as well as isoniazid resistant genotypes56. GenoType Mycobacterium CM was used to identify nontuberculous mycobacteria (NTM) including M. avium ssp., M. intracellulare, and M. chelonae57.

Genotyping of M. tuberculosis isolates

Genomic DNA was extracted from M. tuberculosis cultures as previously described58. For mycobacterial interspersed repetitive units (MIRU), 12 previously identified loci were examined using published primer sets and PCR reaction mixes22. Amplified loci were run on 2% NuSeive agarose gels alongside 20 and 100 bp ladders. Fragment size data and antibiotic resistance information, if applicable, were entered into MIRU-VNTR plus25 as well as TBinsight26 for analysis to determine the lineages of all isolates. The MIRU-based lineage data was used to select representative samples from each lineage for whole genome sequencing.

Genomic DNA samples were prepared at the University of Wisconsin-Madison Biotechnology Center and sequenced on Illumina HiSeq2500 or MiSeq platforms. Adapters were trimmed from the raw reads using Trimmomatic (v0.36)59 and subsequently aligned using Bowtie2 (v2.2.4)27 using default parameters to the M. tuberculosis H37Rv strain (GenBank accession NC_000962.3). Using SAMtools (v1.3.1)60, the BAM files were sorted and the mpileup function was used to create an mpileup file; a minimum mapping quality of 20 was set, otherwise default parameters were used. Variants were called with VarScan2 (v2.3) mpileup2cns28 with a minimum variant quality of 20, a minimum depth of 10, and the variants parameter was flagged to only output variants; default parameters were used otherwise. All variant types (SNPs, insertions, and deletions) were considered in drug resistance association. A total of 20 SNPs shared between 3–5 different isolates including 10 involved in drug resistance were selected for confirmation by Sanger sequencing. Primers used for sequencing are shown in Table S2.

Genotypic-Phenotypic Association

For each drug, known resistance-conferring genes and variants were identified through a literature search and queried across all genomes (Tables S3–7. For a complete list of genes and genome positions queried in the whole-genome sequencing data, see Table S8. Isolates were considered “explained” if the genotype matched the phenotype (e.g. KatG Ser315Thr associated with INH resistance or no variation in codon 315 in INH susceptible); otherwise, the isolates were considered “unexplained” if the genotype did not match the phenotype (e.g. resistance to RIF detected by DST but no mutation was detected in the RRDR).

Phylogenetic analysis

For phylogenetic analysis, the consensus sequence of each isolate was extracted from CLCBio Genomics Workbench 8 after reference assembly. The reference strain M. tuberculosis H37Rv (NC_000962.3) and several other available M. tuberculosis genomes representing various lineages were included in a phylogenetic tree created using ParSNP, part of the Harvest genome software developed by the University of Maryland Center for Bioinformatics and Computational Biology. In general, ~ 230 genomes were downloaded from GenBank representing isolates from various regions throughout the world, and then aligned to sequenced genomes from KAUH isolates. FastTree 2 within ParSNP was used to build a phylogenetic tree with an approximately-maximum-likelihood method taking into consideration all SNPs, indels, and structural differences found in the genomes61. We then developed a chronogram using an alignment of 17,348 SNPs using Beast v.2.4.762 in order to estimate divergence times among the isolates. The best fitting model of molecular evolution was identified with jModelTest v2.1.463. We used three age priors with a loose uniform distribution to establish hard bounds for the backbone of the chronogram, based on the confidence intervals surrounding dates obtained in the more comprehensively sampled analyses of Bos et al.31. These included a prior on the most recent common ancestor (MRCA) of lineages 1–4 (2.7–5.1 Kya), lineages 2–4 (1.7–3.4 Kya), and lineages 2–3 (1.6–3.1 Kya). We ran two separate analyses, one where a strict clock was enforced and a second analysis where rates were allowed to vary among all branches. We then compared the likelihood of both analyses using likelihood ratio tests. Analyses were run for 200 million generations, with stabilization and convergence assessed using Tracer v1.664.

For In silico spoligotyping and drug resistance prediction, assembled whole genome sequences were uploaded to the CASTB website (http://castb.ri.ncgm.go.jp/CASTB/) where they were analyzed for spoligotype based on SPOLDB465. For genetic prediction of drug resistance, raw fastq files were uploaded to the TB Profiler website which provided information on mutations correlated with drug resistance and mutations located in genes known to be correlated with drug resistance. We also tested if a significant phylogenetic signal existed for drug resistance and for patient’s age, gender, or nationality in order to determine if we could identify modes or locations of disease transmission, and how these may differ in Egypt and Saudi Arabia. Isolates were placed into pseudo-communities according to these traits, and we tested whether the average phylogenetic distance among isolates within these communities was greater or smaller than expected compared a null model. These analyses were conducted using the ses.mpd function in the R package picante66.

Statistical analysis

All association studies between genotype, genome wide studies, and phenotypes of M. tuberculosis isolates were conducted using statistical modules described with each conducted analysis, as detailed above. All other simple statistical analyses of lineage distributions were calculated using statistical packages present in Microsoft Excel and GraphPad 5.0 softwares. A p ≤ 0.05 was considered signficant when tests of phylogenetic signal were used to determine whether drug-resistant isolates were more closely related to each other than expected by chance, or whether patients were significantly clustered accroding to their age, nationality, or gender. These analyses were conducted using the ses.mpd function in picante66.