Genomic analysis of the emergence of drug-resistant strains of Mycobacterium tuberculosis in the Middle East

Tuberculosis (TB) represents a significant challenge to public health authorities, especially with the emergence of drug-resistant (DR) and multidrug-resistant (MDR) isolates of Mycobacterium tuberculosis. We sought to examine the genomic variations among recently isolated strains of M. tuberculosis in two closely related countries with different population demography in the Middle East. Clinical isolates of M. tuberculosis from both Egypt and Saudi Arabia were subjected to phenotypic and genotypic analysis on gene and genome-wide levels. Isolates with MDR phenotypes were highly prevalent in Egypt (up to 35%) despite its relatively stable population structure (sympatric pattern). MDR-TB isolates were not identified in the isolates from Saudi Arabia despite its active guest worker program (allopatric pattern). However, tuberculosis isolates from Saudi Arabia, where lineage 4 was more prevalent (>65%), showed more diversity than isolates from Egypt, where lineage 3 was the most prevalent (>75%). Phylogenetic and molecular dating analyses indicated that lineages from Egypt were recently diverged (~78 years), whereas those from Saudi Arabia were diverged by over 200 years. Interestingly, DR isolates did not appear to cluster together or spread more widely than drug-sensitive isolates, suggesting poor treatment as the main cause for emergence of drug resistance rather than more virulence or more capacity to persist.

www.nature.com/scientificreports www.nature.com/scientificreports/ a country with an active guest workers program 12 . Annual visitors for religious rituals in Saudi Arabia may also contribute to the transmission of TB. In Egypt, another Middle Eastern country with a stable population where guest workers are rare, TB is still considered a significant problem 13 . Until now, the population structure of M. tuberculosis in those countries with historic ties to global trade remain elusive. In this study, we perform comparative analyses of whole genome sequences (WGS) to profile the dynamics of TB transmission in a relatively stable community (e.g., Egypt) compared to those undergoing significant social changes (e.g., Saudi Arabia).
Earlier reports provided clear evidence for the emergence of single-and multidrug-resistant tuberculosis (MDR-TB) in the Middle East. More importantly, they indicated the drastic variability in prevalence of single drug-resistant and MDR-TB (ranging from 3-50%) 8,[14][15][16] , most likely related to lack of adherence to TB drug treatment. Earlier studies used growth-based phenotyping methods 8,14 , while more recent studies have employed a combination of genotypic approaches such as PCR and other targeted sequencing approaches to detect resistance-conferring mutations 5,17 . Information about the predominance of M. tuberculosis lineages in the Middle East has only been rudimentary and scarce, especially on a genome-wide level, until recently 18 . Analysis of a large number of M. tuberculosis isolates collected from different countries around the globe identified resistance to isoniazid (katG mutation) as a precursor for the emergence of MDR-TB and other more complex drug resistance profiles 19 . Unfortunately, there were no isolates included in this study from the Middle East, despite the inclusion of 8,316 other isolates 19 . More recently, a total of 76 isolates were analyzed using WGS from Saudi Arabia 18 . Studies have documented increased risks of TB incidence among "Hajj" pilgrims in Saudi Arabia, due to mass-gatherings and a large concentration of people from all over the world 20,21 . Some of these countries are described as countries with high-incidence of TB 19 . Furthermore, an increase drug-resistant (DR) strains among MDR-TB particularly to aminoglycoside and fluoroquinolone have been documented in Saudi Arabia recently 5 . The current study represents the first attempt to utilize WGS of clinical M. tuberculosis isolates from both Egypt and Saudi Arabia to profile the population structure of this important pathogen considering the patient's country of origin for each isolate. Findings from our analysis suggest the significance of the problem of DR and MDR among M. tuberculosis isolates in the Middle East, particularly in stable communities (up to 35%), is actually much higher than identified before on a global level (18%) 19 with little influence of migrant workers on disease transmission dynamics.

Results predominant lineages of M. tuberculosis in the Middle east.
To characterize M. tuberculosis in the Middle East, we analyzed the genomes of archived isolates from sputum samples of 69 patients at the King Abdulaziz University Hospital (KAUH) in Jeddah, Saudi Arabia, that frequented the chest ward between January 2012 and March 2013. In addition, we analyzed the genomes of archived isolates from 60 patients at Cairo University, Cairo, Egypt, that frequented the university chest clinic during the same period. Of the total 129 samples, 95 isolates (61 from Saudi Arabia and 34 from Egypt) were chosen for genomic analysis based on successful culturing and ability to extract high-quality genomic DNA. For Egypt, all isolates were from Egyptian natives, while isolates from Saudi Arabia were collected from patients native to a range of countries in Africa (Chad, Somalia), the Arabian Peninsula (Saudi Arabia, Yemen), and Asia (Afghanistan, India, Myanmar, Indonesia, the Philippines). Patients ranged in age from 1 year to 76 years old with nearly half of them falling between ages 21 and 40. More demographic information of the patients is presented in Table 1. The results of the examination of sputum samples for the presence of M. tuberculosis and the drug sensitivity tests are compiled in Table 2.
To analyze the population structure of M. tuberculosis isolates, we employed a protocol based on 12-locus MIRU-VNTR typing 22 . Because of the demographic differences between the Egyptian and Saudi Arabian sources of the M. tuberculosis isolates under study (Table 1), we hypothesized the presence of different models for disease transmission and spread dynamics. In Egypt, we expected the predominance of a countrywide group of isolates to cluster in a sympatric pattern (isolates from a single geographical region with a stable population) compared to Saudi Arabia with more diversity of M. tuberculosis lineages in an allopatric pattern (isolates from multiple, isolated geographical regions with a diverse population) 23 . Out of 25 genotypes identified (out of 34 isolates), the most predominant lineage in Egypt was from the Indian-East African lineage (lineage 3), which was represented in 76.5% of examined isolates (Fig. 1). On the other hand, out of 59 genotypes identified (out of 61 isolates), the Euro-American (lineage 4) was represented in 66% of isolates in Saudi Arabia (Fig. 1), confirming earlier reports 24 . Analysis on the MIRU-VNTR website 25 further categorized the isolates into sub-lineages summarized in Table 3. Moreover, the lineages of 16 isolates (out of 61 isolates) from Saudi Arabia were predicted differently depending on the algorithm used (TBinsight 26 vs. MIRU-VNTRplus). To resolve discordance between the two databases, we used PCR of pks15/1 to distinguish between lineages 3 and 4. Interestingly, two isolates from Saudi Arabia, SA34 and SA60, of lineage 4, did not group with any known sub-lineage and appear to be specific to Saudi Arabia. The genomes of these two strains, among others from Egypt and Saudi Arabia, were subjected to further analysis using whole genome sequencing.
Sequence variations on a whole genome level. To gain more insights into diversity among the identified lineages of M. tuberculosis from the Middle East, we selected certain isolates for whole genome sequence (WGS) analysis on the Illumina platform. Our selection included isolates with drug resistance phenotypes, representative samples for each MIRU-VNTR cluster (N = 49, 19 from Egypt and 30 from Saudi Arabia), in addition to isolates with unassigned lineages. Sequencing isolate genomes yielded paired-end fragments of 101 bp (~805-fold coverage) or 251 bp (~114-fold coverage), respectively. The total number of reads, number of mapped reads, and coverage data for each genome can be found in Table S1. Whole genome sequences were aligned using bowtie 27 , and SNPS and insertions/deletions (indels) were identified using Varscan2 28 . Several of these SNPs (N = 20), including those associated with drug resistance (N = 10), were confirmed by classical Sanger sequencing (Table S2). Interestingly, the majority of SNPs were nSNPs (Fig. 2) suggesting a rapidly evolving M. tuberculosis www.nature.com/scientificreports www.nature.com/scientificreports/ population in those patients. Additionally, the number of SNPs between all isolates was high (>100 SNPs), an indication of recent infection rather than infection relapse 29 , from a chronic or latent infection.
In another analysis, we compared WGS-based lineage prediction with MIRU-VNTR (TBinsight analysis 26 ) results. In this comparison, lineage assignments of 2 isolates (SA33 and SA47) were discordant where WGS moved them from lineage 4 to lineage 3. Among the 16 isolates that gave different lineage results between TB insight and MIRU-VNTR websites, seven were sequenced, and only one agreed with MIRU-VNTR while all others agreed with TBinsight. Moreover, one previous study of M. tuberculosis genotypes found in Saudi Arabia discovered 2 strains with novel spoligotypes (703777707770371, 467777377413771), unique to Saudi Arabia. To determine if these spoligotypes were represented in our samples, in silico spoligotyping was done for each sequenced sample 30 . None matched the novel spoligotypes identified by Al-Hajoj, et al. 24 , a reflection of targeting a different population from the earlier study. Likelihood ratio tests rejected a molecular clock, so downstream analyses were based on chronogram where rates of evolution were allowed to vary among branches. The resulting tree topology and divergence time estimates among major lineages were consistent with previous estimates 31 , with the most recent common ancestor (MRCA) of all isolates occurring ~3400 years ago. The phylogeny indicates that the vast majority of isolates from Egypt form a single clade in lineage 3, all of which diverged within the past 500 years, and two www.nature.com/scientificreports www.nature.com/scientificreports/ remaining isolates forming a clade in lineage 1 with an MRCA of ~150 years (Fig. 3). Nearly 80% of isolates from Egypt form a single clade with a crown age of ~78 years.
The isolates from patients in Saudi Arabia were scattered throughout lineages 1-4, with no apparent clustering according to the nationality of the patients. In contrast, isolates from Egypt were significantly more closely related to each other than expected when compared to a null model where nationalities were randomized across the tree (p < 0.01). No other nationalities showed significant phylogenetic clustering. Nevertheless, in some instances isolates from individuals with the same nationality form clades (Fig. 3). For example, isolates from individuals from Somalia form two different clades in lineage 4 and a clade in lineage 3. With few exceptions, divergence dates of closely related isolates from patients in Saudi Arabia were older than 200 years old. Additionally, we found no significant clustering of TB isolates based on patient's age, or gender (p > 0.05), another indication of a dynamic model of disease transmission. Surprisingly, however, was the lack of clustering among isolates from patients sharing the same nationality. This indicates recent spread of M. tuberculosis among guest workers, confirming region-specific lineage transmission patterns (sympatric model), even within Saudi Arabia with active guest worker programs, refuting our earlier hypothesis.
Drug resistance of M. tuberculosis isolates. In both Egypt and Saudi Arabia, M. tuberculosis isolates were screened for resistance to rifampicin (RMP), isoniazid (INH), ethambutol (EMB), streptomycin (STM), and pyrazinamide (PZA). To further analyze drug resistance among the 49 isolates (from both Egypt and Saudi Arabia) with whole genome sequences, the drug resistance profile of each isolate was re-tested by the group at the University of Wisconsin-Madison against STM, RMP, INH, and EMB. The results of the second screen were used as the gold standard and reference for evaluating genetic diagnostic methods. While no multidrug-resistant (MDR) isolates were identified in Saudi Arabia, almost 35% of isolates from Egypt were considered MDR against 3 drugs and 6% MDR against 4 drugs ( Table 2). Both numbers are a sharp increase from the original identification of only 6% MDR isolates. Notably, 3.3% of isolates from Saudi Arabia were resistant to streptomycin and ethambutol (Table 2) but did not meet the criteria for MDR according to the WHO guidelines (no double resistance to both isoniazid and rifampicin) 1,32 . Of the 30 isolates from Saudi Arabia, three strains were determined to be resistant to STM by the clinical screening, and this phenotype was confirmed. A 4 th strain, SA14, found to be sensitive in the original screening in the KSA, was found to be resistant to STM, and 2 more strains, SA51 and SA28, were found to be partially resistant to INH and EMB, respectively. No strains were found to be resistant to RMP. These results identified three additional drug-resistant strains that were missed by the clinical laboratory.
In a separate analysis focusing on isolates with whole genome sequences, the country of origin of patients with drug resistant isolates were considered to further characterize potential origin(s) of spread for drug resistance among M. tuberculosis isolates (Fig. 4). This analysis revealed that drug-resistant strains did not cluster together (p > 0.05); rather, they often grouped with susceptible strains. This was consistent for isolates from both Egypt and Saudi Arabia. Also, drug resistance strains did not cluster together based on patients country of origin (p > 0.05). For all screened isolates, the percentages of pansensitive isolates varied significantly between Egypt  www.nature.com/scientificreports www.nature.com/scientificreports/ (10.5%) and Saudi Arabia (83.3%) (Fig. 4). It is noteworthy to mention here that levels of drug resistance were slightly different from those reported in Table 2, most likely because the improved prediction by whole-genome sequencing.

Molecular determination of drug resistance (DR).
After obtaining accurate DR profiles for the examined M. tuberculosis isolates, we attempted to analyze this important phenotype in light of the obtained genome-wide variations, particularly single nucleotide polymorphism (SNPs). For isoniazid, all INH resistant isolates (n = 13, one from Saudi Arabia and 12 from Egypt) were explained by the fabG1 (mabA) C-15T promoter variant or KatG Ser315Asn, both of which have been previously implicated in INH resistance 33,34 . Table S3 represents the variations called for each isolate, including the INH susceptibility testing results. No INH-susceptible isolates had a known resistance-conferring mutation in the queried genes. Two of three isolates from Egypt with no INH susceptibility testing results were likely resistant to INH since the isolates both harbored a KatG Ser315Asn mutation. All RMP resistant isolates harbored a RpoB Ser450Leu variation, which is widely known to confer resistance. Table S4 represents variations for each of the 48 isolates tested for RMP susceptibility. These isolates also harbored a known compensatory mutation in RpoC, Val483Gly, which compensates for the deleterious effects of the Ser450Leu mutation in RpoB 35 . No RMP susceptible isolates (n = 33) harbored a mutation in   www.nature.com/scientificreports www.nature.com/scientificreports/ the RMP resistance-determining region (RRDR). No RMP-susceptible isolates had compensatory mutations in either RpoC or RpoA, agreeing with previous reports 35 .
For ethambutol, all EMB resistant isolates had a mutation previously reported to be associated with EMB resistance. Table S5 represents variations in four genes involved in EMB resistance. Interestingly, several isolates had more than one variation across the four genes, suggesting an increased MIC. However, nine of 28 EMB-susceptible isolates also had these variations, except for the EmbB Gly406Asp; this variation only existed in EMB resistant isolates. Three of the four isolates with no EMB susceptibility testing results also had the EmbB Gly406Asp variation, likely pointing to a resistant phenotype. Of all isolates, only three were phenotypically resistant to PZA, all from Saudi Arabia. Table S6 contains variants called in three genes associated with PZA resistance. Furthermore, none of the three had variations in pncA or its promoter region, rpsA, or panD. This phenomenon is not uncommon and has been reported in several reviews [36][37][38] . We observed only synonymous substitutions in pncA in PZA susceptible isolates (n = 7), which have been reported to not confer resistance 39 ; and only one non-synonymous substitution in rpsA in a single isolate. Isolates from Egypt did not have PZA susceptibility results, however, 14 of these isolates had a non-synonymous mutation in pncA and one had a non-synonymous mutation in rpsA. Finally, only 4 isolates from Saudi Arabia and 3 from Egypt were resistant to STM. Only 2 of the 4 STM resistant isolates had a known resistance-conferring mutation in rpsL (Lys43Arg) ( Table S7). Two resistant isolates from Saudi Arabia did not have a known molecular mechanism to explain the resistant phenotype. One susceptible isolate harbored an rrs variation at nucleotide 724, which has not been reported previously. Isolates with no STM susceptibility testing results did not have any variations in the three genes queried.

Discussion
Tuberculosis, and especially drug-resistant strains of M. tuberculosis, represent a major public health threat in developing and underdeveloped countries 32 . With 10 million new TB cases reported each year 40 , regional analyses are of paramount importance to help local health authorities control TB on national levels 41 . Recent analyses of historical and recent genomic data have profiled the nature of tuberculosis disease transmission through global trade, migration, conquests, and even simple travel [42][43][44] . In the Middle East, most of the reported studies focused on strain lineage distribution and profiling antibiotic sensitivity 10,45,46 . Unfortunately, scarce information is available on the epidemiology of M. tuberculosis on the genomic level from the Middle East, compared to other regions around the world 19 . However, several reports have indicated the circulation of DR and MDR in the Middle East 47 , Figure 2. Percentage of synonymous SNPs (sSNP), non-synonymous SNPs (nSNP) and SNPs located in intergenic areas (intergenic). SNPs were included if coverage was more than 20 reads and frequency was ≥50% of the reads.
www.nature.com/scientificreports www.nature.com/scientificreports/ which highlight the need for more thorough genomic analyses. In this report, a comprehensive SNP analysis of M. tuberculosis was carried out to identify association with drug resistance phenotypes (TB Profiler and related TBDREAM 48 ), and identified poor correlation, most likely because of poor representation of isolates from the Middle East in this database. These results highlight the urgent need for testing of molecular diagnostic methods not only in the Middle East, but also in all understudied regions around the world. It may be that the region of isolation, country of origin, and lineage of the isolate may need to be taken into account when designing molecular diagnostic techniques in the future, and certainly when utilizing them in the field. While sensitivity and specificity information were calculated for RMP, EMB, and STM, the low sample size and <100% specificity or sensitivity reported by TB Profiler did not allow us to make meaningful interpretations of the utility of TB Profiler for the identification of drug-resistant isolates of M. tuberculosis. Drug-resistant strains, however, do not appear to be more contagious or spread more widely than drug susceptible strains, unlike earlier analysis of a M. tuberculosis Beijing cluster of isolates 49 . Nevertheless, without appropriate diagnosis and treatment of these strains, the risk of spread of drug-resistant strains can only increase. It is noteworthy to mention here that an earlier study on the problems of DR in Saudi Arabia identified a similar pattern for DR to the one reported here with the exception of DR to ethambutol, which was higher in the studied population (4.9%) compared to earlier findings (0.8%) 5 . A more recent study, however, identified DR in up to 8% of isolates within Saudi Arabia 18 .
Phylogenetic analysis of M. tuberculosis isolates showed lack of clustering among drug-resistant strains (p > 0.05; Fig. 3). In most cases, drug-resistant isolates formed clades with drug-susceptible isolates. Isolates from patients in Egypt formed two distinct clades in lineages 1 and 3, and overall were significantly more closely related to each other than expected by chance. By contrast, isolates from patients in Saudi Arabia were spread amongst all four lineages, and we were unable to reject the null hypothesis that the nationalities of migrant workers were randomly distributed throughout the phylogeny. The relative homogeneity of communities in Egypt compared to the heterogeneity of communities of migrant workers in Saudi Arabia may account for these contrasting patterns. Moreover, lineage 3 was more prevalent among Egypt isolates which contrasted with an earlier study that indicated the prevalence of lineage 4 (Euro-American clade) 13 , most probably due to the analysis of more isolates Figure 3. Evolutionary relationships and divergence dates of isolates from Egypt and Saudi Arabia. Divergence dates are in years, with the 95% confidence intervals of node ages represented by blue bars. The topology was recovered with 100% posterior probability, except for those nodes labeled. Color coded boxes at the tips indicate the nationality of the patient from which the sample was isolated and correspond with the color of countries mapped in Fig. 4. SA tip labels refer to samples isolated in Saudi Arabia, whereas E tip labels refer to those isolated in Egypt.
www.nature.com/scientificreports www.nature.com/scientificreports/ collected from a more diverse population in the earlier study. Our divergence time estimates indicate that the major clades among closely related Egyptian isolates (lineage 3) have all diverged within the past 78 years compared to >200 years for those isolates (lineage 4) from Saudi Arabia (Fig. 3). Few other lineages exhibit clades with similar divergence dates. Thus, while it is evident that tuberculosis is spread amongst people in Egypt, it is more difficult to determine whether migrant workers contract the disease in Saudi Arabia or in their home countries. The evolutionary diversity of isolates among individuals in Saudi Arabia suggests that this latter scenario likely plays some role. For example, it appears more likely that TB was spread among migrant workers from Somalia while they were living in Saudi Arabia than a scenario where three sister pairs of TB strains were independently introduced into Saudi Arabia from Somalia. By contrast, the four isolates from Chadian individuals included in this analysis belonged to the four different TB lineages. In this case, it is equivocal whether the disease was acquired in Saudi Arabia or introduced from Chad.
Overall, both phenotypic and genome-wide analyses provided more insights into the transmission dynamics of tuberculosis among populations within the Middle East with clearly different demography. Surprisingly, a population (e.g. Saudi Arabia residents) with high exposure to cases of tuberculosis with DR and MDR-TB, appear to have less of a problem of DR tuberculosis based on our analysis and others 5 . On the other hand, a homogenous population structure (mainly Egyptians) had an increasing problem of MDR 17 , an important highlight to the problem of adherence to drug therapy and a suitable target for the STOP program by the WHO. More importantly, considerations need to be given to genomic analysis of more M. tuberculosis isolates from different countries and to the prevalent lineage as well as drug-resistant genotypes by public health authorities before selection of a suitable diagnostic or drug regiment for a country within the Middle East. www.nature.com/scientificreports www.nature.com/scientificreports/ Methods ethics statement and isolation of M. tuberculosis. Approval to use patient samples was obtained from the Institutional Review Board of Kaser Elini Faculty of Medicine, Cairo University as well as the Ministry of Health of the Kingdom of Saudi Arabia, and the Research Ethics committee, Faculty of Medicine, King Abdulaziz University. All isolates were de-identified to maintain patient privacy while information related to the patient gender, nationality, age and city of residency were kept to be used for the analysis of transmission dynamics. Approval to use de-identified clinical isolates of M. tuberculosis was also obtained from the University of Wisconsin-Madison's Institutional Review Board. All methods conducted in this research adhered to the guidelines and regulations approved by the Office of Biological Safety at the University of Wisconsin-Madison.
All clinical isolates of M. tuberculosis were obtained from early morning sputum specimens. Sputum samples were digested and decontaminated by the ready to use N-acetyl-l-cysteine (NALC) 3% NaOH method (Nac-PACTMEA3, Alpha Tec, Inc, Vancouver, Washington, USA) and concentrated by centrifugation at 3,000 rpm for 20 minutes, as described by the manufacturer 50 . The resulting pellet was resuspended and used in part for detection of M. tuberculosis complex (MTBC) isolates by smear microscopy using fluorochrome stain (Fluo-RAL-Auramine staining kit for Mycobacteria Detection). The remainder of the resuspended pellet was used for culturing in the VersaTREK fluid culture system and for direct MTBC detection by Gene Xpert PCR. All clinical isolates were grown on Löwenstein-Jensen medium slants. Single colonies from each isolate were sub-cultured on Middlebrook 7H10 agar (BD, Franklin Lakes, NJ) supplemented with 0.5% glycerol and 10% ADC (2% glucose, 5% bovine serum albumin, fraction V, and 0.85% NaCl), and incubated at 37 °C under biosafety level 3 (BSL3) laboratory conditions. Single colonies were grown in Middlebrook 7H9 broth (BD Difco) supplemented with 0.5% glycerol and 10% ADC for genomic DNA extraction and drug resistance testing. phenotypic analysis. VersaTREK 528 (TREK Diagnostic Systems, Inc. Westlake, OH) was used to determine phenotypic drug resistance in the place of isolation (KAUH) following the manufacturer's instructions. Inoculated cultures containing rifampicin (30 µg/ml), isoniazid (3 µg/ml or 12 µg/ml), ethambutol (150 µg/ml or 240 µg/ml), streptomycin (30 µg/ml or 120 µg/ml), or pyrazinamide (300 µg/ml) and were monitored for a positive signal for up to 42 days 51 . For further confirmation following sequence analysis, antibiotic sensitivity of selected isolates was determined using standard, semi-quantitative in vitro susceptibility testing. In this assay, Middlebrook 7H10 agar plates were embedded with BBL Sensi-Disc Susceptibility Test Discs (BD Biosciences, MD) to reach concentrations of 0.2 µg/ml INH, 5 µg/ml RMP, 10 µg/ml STM, 5 µg/ml EMB, or left free of antibiotic as a positive control. Plates were inoculated with the 50 µl of each isolate, directly or serially diluted from early log phase cultures grown in Middlebrook 7H9 broth to an OD 600 of 0.1-0.2 and incubated at 37 °C for 21-28 days.
Finally, the GeneXpert Dx automated system (Cepheid AB, Solana, Sweden) was used to detect MTBC and rifampicin resistant genotypes at the KAUH following the manufacturer's instructions as previously described [52][53][54][55] . GenoType MTBDRplus (Hain Lifescience GmbH, Nerhren, Germany) was also used to detect MTBC and rifampicin as well as isoniazid resistant genotypes 56  Genotyping of M. tuberculosis isolates. Genomic DNA was extracted from M. tuberculosis cultures as previously described 58 . For mycobacterial interspersed repetitive units (MIRU), 12 previously identified loci were examined using published primer sets and PCR reaction mixes 22 . Amplified loci were run on 2% NuSeive agarose gels alongside 20 and 100 bp ladders. Fragment size data and antibiotic resistance information, if applicable, were entered into MIRU-VNTR plus 25 as well as TBinsight 26 for analysis to determine the lineages of all isolates. The MIRU-based lineage data was used to select representative samples from each lineage for whole genome sequencing.
Genomic DNA samples were prepared at the University of Wisconsin-Madison Biotechnology Center and sequenced on Illumina HiSeq2500 or MiSeq platforms. Adapters were trimmed from the raw reads using Trimmomatic (v0.36) 59 and subsequently aligned using Bowtie2 (v2.2.4) 27 using default parameters to the M. tuberculosis H37Rv strain (GenBank accession NC_000962.3). Using SAMtools (v1.3.1) 60 , the BAM files were sorted and the mpileup function was used to create an mpileup file; a minimum mapping quality of 20 was set, otherwise default parameters were used. Variants were called with VarScan2 (v2.3) mpileup2cns 28 with a minimum variant quality of 20, a minimum depth of 10, and the variants parameter was flagged to only output variants; default parameters were used otherwise. All variant types (SNPs, insertions, and deletions) were considered in drug resistance association. A total of 20 SNPs shared between 3-5 different isolates including 10 involved in drug resistance were selected for confirmation by Sanger sequencing. Primers used for sequencing are shown in Table S2.
Genotypic-phenotypic Association. For each drug, known resistance-conferring genes and variants were identified through a literature search and queried across all genomes (Tables S3-7. For a complete list of genes and genome positions queried in the whole-genome sequencing data, see Table S8. Isolates were considered "explained" if the genotype matched the phenotype (e.g. KatG Ser315Thr associated with INH resistance or no variation in codon 315 in INH susceptible); otherwise, the isolates were considered "unexplained" if the genotype did not match the phenotype (e.g. resistance to RIF detected by DST but no mutation was detected in the RRDR). phylogenetic analysis. For phylogenetic analysis, the consensus sequence of each isolate was extracted from CLCBio Genomics Workbench 8 after reference assembly. The reference strain M. tuberculosis H37Rv (NC_000962.3) and several other available M. tuberculosis genomes representing various lineages were included in a phylogenetic tree created using ParSNP, part of the Harvest genome software developed by the University of Maryland Center for Bioinformatics and Computational Biology. In general, ~ 230 genomes were www.nature.com/scientificreports www.nature.com/scientificreports/ downloaded from GenBank representing isolates from various regions throughout the world, and then aligned to sequenced genomes from KAUH isolates. FastTree 2 within ParSNP was used to build a phylogenetic tree with an approximately-maximum-likelihood method taking into consideration all SNPs, indels, and structural differences found in the genomes 61 . We then developed a chronogram using an alignment of 17,348 SNPs using Beast v.2.4.7 62 in order to estimate divergence times among the isolates. The best fitting model of molecular evolution was identified with jModelTest v2.1.4 63 . We used three age priors with a loose uniform distribution to establish hard bounds for the backbone of the chronogram, based on the confidence intervals surrounding dates obtained in the more comprehensively sampled analyses of Bos et al. 31 . These included a prior on the most recent common ancestor (MRCA) of lineages 1-4 (2.7-5.1 Kya), lineages 2-4 (1.7-3.4 Kya), and lineages 2-3 (1.6-3.1 Kya). We ran two separate analyses, one where a strict clock was enforced and a second analysis where rates were allowed to vary among all branches. We then compared the likelihood of both analyses using likelihood ratio tests. Analyses were run for 200 million generations, with stabilization and convergence assessed using Tracer v1.6 64 .
For In silico spoligotyping and drug resistance prediction, assembled whole genome sequences were uploaded to the CASTB website (http://castb.ri.ncgm.go.jp/CASTB/) where they were analyzed for spoligotype based on SPOLDB4 65 . For genetic prediction of drug resistance, raw fastq files were uploaded to the TB Profiler website which provided information on mutations correlated with drug resistance and mutations located in genes known to be correlated with drug resistance. We also tested if a significant phylogenetic signal existed for drug resistance and for patient's age, gender, or nationality in order to determine if we could identify modes or locations of disease transmission, and how these may differ in Egypt and Saudi Arabia. Isolates were placed into pseudo-communities according to these traits, and we tested whether the average phylogenetic distance among isolates within these communities was greater or smaller than expected compared a null model. These analyses were conducted using the ses.mpd function in the R package picante 66 . statistical analysis. All association studies between genotype, genome wide studies, and phenotypes of M. tuberculosis isolates were conducted using statistical modules described with each conducted analysis, as detailed above. All other simple statistical analyses of lineage distributions were calculated using statistical packages present in Microsoft Excel and GraphPad 5.0 softwares. A p ≤ 0.05 was considered signficant when tests of phylogenetic signal were used to determine whether drug-resistant isolates were more closely related to each other than expected by chance, or whether patients were significantly clustered accroding to their age, nationality, or gender. These analyses were conducted using the ses.mpd function in picante 66 . www.nature.com/scientificreports www.nature.com/scientificreports/ www.nature.com/scientificreports www.nature.com/scientificreports/