Phylogenomics and antimicrobial resistance of the leprosy bacillus Mycobacterium leprae

Leprosy is a chronic human disease caused by the yet-uncultured pathogen Mycobacterium leprae. Although readily curable with multidrug therapy (MDT), over 200,000 new cases are still reported annually. Here, we obtain M. leprae genome sequences from DNA extracted directly from patients’ skin biopsies using a customized protocol. Comparative and phylogenetic analysis of 154 genomes from 25 countries provides insight into evolution and antimicrobial resistance, uncovering lineages and phylogeographic trends, with the most ancestral strains linked to the Far East. In addition to known MDT-resistance mutations, we detect other mutations associated with antibiotic resistance, and retrace a potential stepwise emergence of extensive drug resistance in the pre-MDT era. Some of the previously undescribed mutations occur in genes that are apparently subject to positive selection, and two of these (ribD, fadD9) are restricted to drug-resistant strains. Finally, nonsense mutations in the nth excision repair gene are associated with greater sequence diversity and drug resistance.

M ycobacterium leprae is the main causative agent of leprosy, a disease that affects the skin, nerves, and mucosa of the upper respiratory tract in humans 1 . A second, distantly related leprosy bacillus, Mycobacterium lepromatosis, was recently discovered in humans and red squirrels (Sciurus vulgaris) 2 . Leprosy is curable with multidrug therapy (MDT), but remains a public health problem in South America, Africa, South and Southeast Asia, and Micronesia, where over 200,000 new leprosy cases are reported each year 3 . MDT, comprising rifampicin, dapsone, and clofazimine, has been used intensively since the 1980s and a few second-line drugs, ofloxacin, minocycline, and clarithromycin, are sometimes employed as therapeutic agents 4 . The emergence of drug-resistant (DR) and multidrug-resistant (MDR) M. leprae is increasingly reported [5][6][7][8][9][10][11][12] . For dapsone, rifampicin and ofloxacin, the resistance mechanism has been attributed to missense mutations in the drug resistance determining regions (DRDR) of the folP1, rpoB, and gyrA genes, respectively.
M. leprae is an obligate intracellular pathogen that has never been cultured axenically but can infect wild or experimental animals. The nine-banded armadillo (Dasypus novemcinctus) or the mouse footpad (MFP) can be used to produce bacilli, but both methods are cumbersome and time-consuming 13 . The genome of M. leprae is the smallest among mycobacteria (3.3 Mb) with 1614 genes encoding proteins and a remarkable 1300 pseudogenes 14 . Such reductive evolution is a hallmark of bacteria that have changed their lifestyle from free-living to strictly hostassociated 15 . Due to its 14-day generation time and the absence of horizontal gene transfer, the genome of M. leprae is highly conserved, with <300 single-nucleotide polymorphisms (SNPs) observed between distantly related strains, and only a few SNPs between close relatives 5, [16][17][18] . Four SNP types (branches 1-4) and 16 SNP subtypes (A-P) were defined by surveying 78 informative SNPs and six single-base insertion/deletions (InDels) 16,19 . Genotyping a large panel of M. leprae strains revealed strong geographical associations and suggested possible routes of dissemination of leprosy 16 whereas, a recent phylogenetic analysis of 16 whole-genome sequences of modern and ancient M. leprae strains, implicated the 3K subtype (branch 0) as the most ancestral 17 .
Leprosy seems to have appeared during the Iron Age (1200-600 BC) and the date of the most recent common ancestor of M. leprae was estimated to be from 2543 BC to 36 AD, based on whole-genome sequence analysis 17 . Similarly, the earliest accepted written record of leprosy is from 600 BC 20 , and the earliest osteological evidence dates from around 300 BC [21][22][23][24][25] . The oldest genomic evidence of leprosy is for samples from 80 to 240 AD in Central Asia 26 .
In this study, we develop and apply methods to isolate and purify M. leprae DNA that enable whole genome sequences to be obtained directly from human biopsy material, thus removing the necessity for passage through animals. This approach was successfully used to generate 120 new M. leprae genome sequences from drug-susceptible and DR strains from around the world, thereby enabling detailed phylogenetic and phylogeographic comparisons to be performed, new mutations associated with antimicrobial resistance to be detected, and the likely origin of leprosy to be proposed.

Results
Isolating M. leprae DNA from human skin biopsies. Genome sequencing has become routine practice in microbiology 27 , especially for micro-organisms that can be readily isolated, which is not the case of the leprosy bacillus. For decades, the sole source of M. leprae DNA suitable for genomics was from bacteria isolated 12 months after infection of armadillos or mice. Recently, we have developed and optimized methods that enable M. leprae DNA to be extracted directly from fresh or formalin-fixed skin biopsies from leprosy patients 28 . These methods include enrichment of M. leprae DNA by array capture 17 but this is less practical for large population-based investigations.
The DNA extraction method used in this study was applied directly to punch biopsies from clinically well-characterized patients of known bacillary index (BI) and exploits the fact that M. leprae resides intracellularly. Host cells are first disrupted and their DNA degraded, leaving the bacilli intact. The bacilli are then lysed and their DNA extracted and used for library preparation. This approach was applied to 106 biopsies whose BI ranged from 0 (no bacilli visible) to 6 (>1000 bacilli per microscopic field) thereby enabling a relationship between BI and sequencing efficiency to be established (Fig. 1). As expected, there was a direct correlation between genome coverage and the BI but, surprisingly, successful coverage could even be achieved with some specimens whose BI was as low as 1+.
Genome analysis of patient and animal cohort. We analyzed a total of 154 M. leprae genomes from 25 countries (Fig. 2, Supplementary Data 1), of which 120 were newly sequenced and 34 were previously published (Supplementary Data 1). The cohort comprised 147 human samples, 6 from red squirrels and 1 from an armadillo that were all naturally infected. Genome sequences were obtained directly from 109 human samples, 30 from bacilli passaged in mice, and 8 from armadillos. Thirty of these strains were from patients who had relapsed or not responded to MDT the remainder (124) were from supposedly drug-susceptible strains (87 were from confirmed primary cases, while disease history was unknown for the others).
A total of 3053 SNPs and 219 InDels (excluding tandem repeats) was found (Supplementary Data 2). The average SNP difference among the 154 genomes was 114. We found a total of 988 non-synonymous alleles (0.62 per protein-coding gene, or 0.61 per kb of protein-coding genes) and 530 synonymous SNPs (0.33 per protein-coding gene or 0.33 per kb of protein-coding genes), and 1763 mutations in intergenic regions and pseudogenes (1.07 mutations per kb of intergenic regions and pseudogenes). The SNP density for each gene is given in Supplementary 1 Correlation between bacillary index and successful sequencing. The content of M. leprae DNA in sequencing libraries derived from human skin biopsies was determined and found to be proportional to the bacillary index (not available for all samples). Empty circles are samples that were not included in the study due to insufficient genome coverage. Sample count and sequencing success rates are given at the top of each category Phylogeny of M. leprae. Phylogenetic analysis using both maximum parsimony (MP) and Bayesian inference resulted in consistent tree topologies and revealed distinct lineages and sublineages of M. leprae (Fig. 3). Strains belonging to the same SNP subtypes 16 clustered within single branches, with the exception of SNP subtype 3K, which is represented by a newly discovered ancestral lineage, termed here 3K-1, and the ancestral lineage referred to earlier as branch 0 17 and termed here 3K-0 (Fig. 3a). All strains from the two most ancestral lineages, 3K-0 and 3K-1, originated from Japan (8), China (1), Korea (1), the Marshall Islands (1), and New Caledonia (1), in agreement with earlier genotyping studies of hundreds of M. leprae strains which confirmed the predominance of the 3K genotype in East Asia, notably in Japan, China, and Korea 16,[29][30][31] .
M. leprae in East Africa showed higher diversity with subtypes 2E, 2F, and 2H representing distinct lineages (Fig. 3a). The geographic distribution of those lineages corroborates the findings of earlier studies reporting the presence of SNP type 2 in Medieval Europe, the Middle East and East Africa 16 . Two Ethiopian isolates, belonging to the 2F subtype, clustered closely with medieval European strains dating from the 11th to 12th century ( Fig. 3), which supports the hypothesis that the ancient Greek and Roman routes 32 connecting Europe, the Middle East, East Africa, and South Asia 16,[33][34][35] contributed to the dissemination of SNP type 2 M. leprae.
West Africa, on the other hand, harbors exclusively SNP type 4, suggesting that overland migration between East and West Africa was limited. SNP subtypes 4N, 4O, and 4P, albeit sharing the same ancestor, do not form a monophyletic clade as previously hypothesized 16 . Rather, the 4O and 4P subtypes cluster together in a branch distinct from 4N (Fig. 3).
Brazil, as expected, contains a great diversity of several M. leprae lineages, with the SNP type 4 and SNP subtype 3I being the most prevalent 36 . The 3I genotype was common in medieval Europe 17,21,37,38 , and is still present in red squirrels in the United Kingdom 2 . The modern Brazilian strain Br2016-45 branched between two medieval strains from Europe (Fig. 3), making it the most ancestral contemporary 3I strain in the Americas to date. The broad diversity of 3I genotypes from Brazil probably derives from multiple introductions from Europe. On the other hand, the strains circulating in the Southern USA and associated with zoonosis from the nine-banded armadillo, I-30, NHDP-55 and NHDP-63 18 , originated much more recently (Fig. 3b), in agreement with the rapid expansion and spread of the armadillo population since its introduction to this region about 150 years ago 39 .
Good representation of most M. leprae lineages enabled identification of lineage-specific markers. A set of 235 SNPs and 25 InDels were specific to single lineages or groups of related lineages (Supplementary Data 2), of which 73 non-synonymous SNPs and 5 InDels were within protein-coding genes. These new lineage-specific markers can be used for future genotyping schemes.
Dating analysis. Dating analysis of M. leprae ( Fig. 3b and Supplementary Figure 1) was done using BEAST v2.4.4 40 and the results were very similar to those obtained from a ten-fold smaller number of contemporary isolates 17 . The most recent common ancestor (TMRCA) of all M. leprae strains was estimated to be 3699 years old (95% Highest Posterior Density (HPD) 2731-4838 ya) and the substitution rate was 7.8 × 10 −9 per site per year. Overlapping results were obtained when using different models (Supplementary Note and Supplementary Table 1), indicating that the dataset was robust and sufficiently informative.
A striking observation is the relative youth of the SNP type 1 lineage and its association with South Asia (Fig. 3b). Earlier studies revealed a predominance of SNP subtype 1D in India and Nepal, followed by 1C, 1A 16 , and 2E, 2G, and 2H [33][34][35] . SNP type 1 predominates in Thailand 41 , Bangladesh, Indonesia, and Philippines 16 . The current phylogeography of M. leprae implies that humans brought leprosy to South Asia from other parts of the continent.
Two strains had known quinolone resistance mutations in the DRDR of gyrA and six harbored different single mutations elsewhere in the gene. Three isolates had a missense mutation in gyrB, including two within the DRDR (Table 1). Five strains harbor deleterious mutations in the ethA gene, encoding a monooxygenase that activates thioamide prodrugs in M. tuberculosis 45,46 . Interestingly, in addition to ethA and nth, three genes (fadD9, ribD, pks4) were mutated almost exclusively in MDR strains occurring 18, 19, and 4 times, respectively (Table 1).
Retracing the emergence of drug resistance in leprosy patients. Prior to the introduction of MDT in the 1980s, patients were treated with dapsone or other antimicrobials as monotherapies of  7,8 . Since genomics uncovered new mutations that are associated with antimicrobial resistance in other bacteria, such as those in ethA and gyrB, this prompted us to try and retrieve the clinical records of six patients whose strains displayed resistance to three or more drugs (dapsone, rifampicin, quinolones, and thioamides). Four of these extensively drug-resistant (XDR) strains were from multibacillary patients in Japan who had received a succession of monotherapies in the pre-MDT era and our genome analysis enabled the chronology of resistance emergence to be retraced. This is illustrated in Fig. 5, and sadly exemplified by the strain from patient Zensho-4 who was diagnosed in 1963 and first treated with protionamide followed by thiambutosin, both of which show cross-resistance and likely require activation by the EthA mono-oxidase that acquired the A25T missense mutation 47 ; then treatment began with dapsone leading to emergence of the T53I mutation in folP1, followed by rifapentine that selected the S456L mutation in rpoB, and continued with ofloxacin, to which resistance arose from the A91V mutation in gyrA and D464N in gyrB. Molecular drug susceptibility testing was performed in 1998 and the patient finally cured by a regimen comprising clofazimine, minocycline, chloramphenicol and levofloxacin/sparfloxacin. The fifth XDR strain was from a newly diagnosed Brazilian case (Br2016-15) with no history of treatment for leprosy, confirming the ongoing transmission of primary antimicrobial resistance, while details of the sixth case could not be recovered.
Genes under positive selection. We also identified genes containing an unusually high number of polymorphisms, multiple alleles, and homoplasies (Supplementary Figure 2), which could be indicative of positive selection 48 . Strikingly, the distribution of these polymorphic sites around the genome was not random as they were often clustered, especially proximal to either side of the origin of replication (Supplementary Figure 3). Protein-changing mutations were found in 540 genes, with an average of 1.77 mutations per gene (STD 2.12). Table 2 contains a ranking of genes with at least five non-synonymous mutations or regions with one or more homoplasy (excluding VNTRs). The most polymorphic gene by far was ML0411 49 encoding the serine-rich antigen, a member of the immunogenic, surface-exposed PPE protein family. Two other known T-cell antigens whose genes display variability are Lsr2 and EsxA (Table 2). Other than nth, three other polymorphic genes (ML1040c, ML1750c, and ML1512c) code for proteins that appear to function in nucleic acid or cyclic nucleotide metabolism ( Table 2).

Discussion
Here we have optimized and applied highly sensitive procedures to extract M. leprae DNA directly from human skin biopsies that is suitable for whole-genome sequencing. The resultant genome sequences were analyzed phylogenetically and used to retrace the origin of the leprosy bacillus, and to identify polymorphisms that had been positively selected during evolution. Such polymorphisms might reflect pressure from the human immune system, from MDT or other forces.
It is striking that the ancestral lineages of M. leprae predominate in East Asia, although we should keep in mind that Central Asia has been understudied, so it would be interesting to sequence more samples spanning the East-West axis of Asia, including the Middle East, where the 3K genotype is also  16 . Nevertheless, given the current data on the distribution of the 3K subtype we can deduce that the ancestor of M. leprae originated within Eurasia, probably in the Far East. Endonuclease III (Nth) and the formamidopyrimidine and endonuclease VIII family (Fpg/Nei) of DNA glycosylases are central to the base excision repair pathway in bacteria 50 . Mycobacterial genomes usually contain a single nth and two fpg/nei genes but M. leprae has lost both fpg/nei orthologues and retained the nth gene. Nth, Fpg, and Nei may have overlapping functions and, in enteric bacteria, mutator phenotypes were observed when nth was inactivated in combination with the fpg and nei genes [51][52][53] . In M. smegmatis, deletion of nth and both the nei homologs resulted in elevated spontaneous mutation frequencies and increased sensitivity to oxidative stress 54 . Therefore, in the absence of Nei, inactivation of nth in M. leprae should lead to increased sequence variability, which is consistent with our results.
Strikingly, all nth mutants were also drug-resistant so Nth loss likely favors emergence of drug resistance, and nth mutations might serve as a surrogate marker for potential drug resistance and treatment failure. A link between a higher mutation rate and drug resistance was observed in strains of M. tuberculosis (which has nth and two fpg/nei genes) belonging to lineage 2, but the molecular basis for this is unknown 55 . For a pathogen with an extremely reduced genome such as M. leprae, a hypermutator phenotype could be detrimental and ultimately lethal.
Drug resistance is alarming for leprosy control. There is growing evidence for primary quinolone resistance in strains of M. leprae from patients who have never been treated with quinolones for leprosy but may have received this drug for other infections 56 . Five new GyrA mutations were identified in this study, but their effect on FQ resistance remains to be determined. Since two of them arose independently in the GyrA intein, which is removed by protein splicing, they may not impact quinolone activity (Fig. 4). Three non-synonymous mutations were found in gyrB (Table 2) and experimental evidence exists for two of them conferring quinolone resistance in in vitro assays or in M. tuberculosis 57,58 . To our knowledge, this is the first report of M. leprae clinical isolates harboring mutations in gyrB. Thus, despite their apparent rarity, mutations in gyrB should be systematically assessed in drug resistance screening.
A range of known and new mutations was detected in the DRDR and elsewhere in rpoB ( Fig. 4; Table 1). Some of these might have a compensatory role in restoring fitness, that is known to be reduced to various degrees in rifampicin-resistant mutants of M. tuberculosis 59,60 . Similarly, compensatory mutations in rpoA and rpoC can occur in rifampicin-resistant M. tuberculosis 44 . The rpoA substitution T187P in the rifampicin-resistant M. leprae strain Br14-5 was shown to be compensatory in M. tuberculosis 44 . Rifampicin-resistant isolates of M. tuberculosis harbor more mutations in rpoC compared to rifampicinsusceptible isolates 44,61 . In our case, we observed no clear correlation between rifampicin resistance and mutations in rpoC, which occurred in two resistant and two wild-type strains.
Arguably the most intriguing finding of the present investigation was the remarkably high frequency of mutations in the fadD9 and ribD genes and in 19/23 cases these occur in strains that have at least one mutation that is associated with resistance to a leprosy drug ( Fig. 4; Table 1). Functional information for fadD9 is scarce and the mutations found are predicted to either abolish protein production (8/16) or to cause detrimental amino acid changes (Fig. 4). In the case of ribD, 14 different missense mutations were found in a group of 17 variant alleles, indicating that this is likely an essential function. From studies with M. tuberculosis it is known that ribD encodes an alternative dihydrofolate reductase, with relatively low activity compared to that conferred by the bona fide dihydrofolate reductase gene, dfrA 62 . In clinical isolates of M. tuberculosis, a promoter mutation causes overexpression of ribD that is associated with resistance to the old drug, para-amino salicylic acid (PAS), and to certain DHFR inhibitors 63 . This suggests that the mutations detected in the M. leprae ribD gene may also confer resistance to PAS and support for this is provided by the fact that vadrine (2-pyridyl-(4)-1,3,4oxydiazolone-(5)-p-aminosalicylate) was used as a drug to treat leprosy before dapsone became widely available 64 . It is thus possible that the ribD mutations we report here arose nearly 60 years ago following treatment with vadrine or another PAS derivative. Our discovery of these mutations and those in fadD9 should encourage further experimentation in order to establish their true role and contribution to antimicrobial resistance, especially to clofazimine.

Methods
Sample collection. Samples were taken from leprosy patients as punch biopsies of skin (preserved in 70% ethanol or formalin-fixed and paraffin-embedded (FFPE)), which is standard diagnostic procedure for leprosy, or from mouse foot-pads. Details about the samples used in this study are given in Supplementary Data 1 and below.
Origin of S15: Strain S15 corresponds to strain 92041 65 , which was isolated from a lepromatous leprosy patient originally from Martinique. The origin of S15 was erroneously attributed to New Caledonia in Monot et al. 16   LRC-1A, an unidentified strain from Japan: A sample obtained from the Leprosy Research Center (LRC) in Tokyo, initially labeled as Airaku-2, did not contain the mutations in folP1 and rpoB previously detected in Airaku-2 by PCR sequencing 7 . Therefore, we renamed this sample to LRC-1A (standing for SNP subtype 1A sample from LRC).
DNA extraction and library preparation. DNA was extracted from 101 human skin biopsies with known BI using a customized in-house protocol combining host tissue digestion and the QIAmp microbiome kit for host DNA depletion, strong bacterial cell lysis and silica-based purification. Punch biopsies (6 mm) in 70% ethanol were first rehydrated in Hank's balanced solution prior to mincing with scissors. Cells were detached from the tissue by 30 min incubation at 37°C with a mixture of 0.5 U of collagenase and dispase, followed by incubation at 56°C with 10 mg/ml of trypsin until complete digestion. Free cells were then suspended in 1 ml of phosphate-buffered saline (PBS) and DNA was extracted using the QIAmp DNA microbiome extraction kit according to the manufacturer's recommendations. Each run of extraction included a batch of five to nine samples and one blank control (500 μl of Hank's balanced solution). The presence of M. leprae was assessed by PCR using RLEP primers 2 prior to library preparation. Libraries were prepared from 50 µl of extracted DNA using the Kapa Hyperprep kit as described previously 2,5 . DNA from FFPE samples was extracted using the truXTRACTM FFPE DNA kit (Covaris) as described previously 28 . Libraries prepared from the extracted DNA were used directly for shotgun sequencing. M. leprae DNA extraction quality was assessed from the percentage of M. leprae DNA present in the library inferred by alignment to the reference genome sequence, with a minimum threshold set at 1%. This threshold was chosen because it yields an average genome coverage of at least 5× per sample in a multiplexed run of 10 samples on one HiSeq 2500 lane (yielding around 20 million reads per sample and 100 bases per read).
Library enrichment. Libraries with low M. leprae content underwent enrichment using whole-genome tiling arrays as described previously 17 . Briefly, Illumina libraries were hybridized onto custom Agilent SureSelect Capture Arrays containing ca. one million DNA probes (60 bp) spanning the entire M. leprae genome (tiled every 4 bp), followed by elution and PCR amplification.
Sequence processing. We took precautions in analyzing the data to avoid falsepositive SNP calls. All raw reads were adapter-and quality-trimmed with Trimmomatic v0.33 68 . The quality settings were "SLIDINGWINDOW:5:15 MIN-LEN:40". Paired-end (PE) data were additionally processed with SeqPrep (https:// github.com/jstjohn/SeqPrep) to merge overlapping pairs. This increases the accuracy of sequence in the overlapping area, avoids problems in estimating coverage and creates longer reads, which facilitates InDel calling. Duplicate reads were omitted from downstream analyses. This is especially important for libraries with insufficient M. leprae DNA fragments, which is not uncommon for low BI samples or samples that are difficult to process, like FFPE samples. In these cases, library enrichment with array-capture, or very deep sequencing often produce a high number of duplicate reads (DNA fragments that were sequenced multiple times, seemingly increasing the overall genome coverage), with each read having dozens or even hundreds of copies. Such reads will amplify possible artefacts and sequence errors, resulting in false SNP calls.
Sequence analysis. Preprocessed reads were mapped onto the M. leprae TN reference genome (GenBank AL450380.1) with Bowtie2 v2.2.5 69 . We filtered out all reads with mapping quality below 8 and omitted repetitive regions in the reference sequence. We also omitted ribosomal RNA (rRNA) genes because alignments in these regions tend to be error prone. This is because rRNA genes are highly conserved in bacteria, so sequences from other species could map to the M. leprae reference sequence. This usually happens when the content of M. leprae DNA in a sequencing library is scarce and is even more pronounced when libraries with low M. leprae content undergo array-capture. However, because lineage-specific mutations were previously observed in the M. leprae rrs gene 30 , we manually checked the alignments corresponding to the rRNA genes and added the curated results to Supplementary Data 2.  70 . To avoid false-positive SNP calls the following cutoffs were applied: minimum overall coverage of five nonduplicated reads, minimum of three non-duplicated reads supporting the SNP, mapping quality score >8, base quality score >15, and a SNP frequency above 80%. InDel calling was done using Platypus v0.8.1 71 followed by manual curation. Completed genome sequences of M. leprae Br4923 and Mycobacterium lepromatosis (GenBank JRPY00000000.1) were aligned against the M. leprae TN reference using LAST 72 using the default parameters for the former and the gamma-centroid option for the latter.
Mixed samples. A large number of missing values, especially in lineage-specific loci, points to the presence of more than one strain in a sequencing library. Although not thoroughly tested, in our opinion mixed data sets are mostly due to technical problems or contamination because in some cases we were able to identify the problematic strains. The possible presence of multiple M. leprae strains in single skin lesions was not tested in this study, but we expect it to be extremely low. Overall, a few mixed data sets were detected and some were removed from this study, except for samples that we deemed important and describe below. Nevertheless, results were not biased because loci with mixed alleles were treated as missing values.
Zensho-4 seems to contain a fraction of another strain (possibly around 40%) that is closely related to it. Only a few loci had mixed alleles, and these include the A91V substitution in gyrA (supported by 62% of reads) and the D464N substitution in gyrB (supported by 41% of reads). The latter was attributed to Zensho-4 for simplicity. Similarly, Zensho-5 seems to contain around 30% of Zensho-4. This is the main reason why we could not detect SNPs specific only to Zensho-5 (Fig. 3a), since such SNPs would be "diluted" with wild-type alleles from Zensho-4 and could not pass the SNP "purity" threshold. We included these two samples in this study because they are multi-drug-resistant and belong to the SNPtype 3K-0. Furthermore, mutations in genes conferring drug resistance from this study match with those from earlier reports of these samples, confirming their identity 73 .
Thai-311 contains <20% of an unidentified 3K-0 strain that belongs to the Kyoto-1/Zensho-5 cluster of strains (Fig. 3a). SNP calling was not significantly affected. Finally, sample Ye2-3 contained around 25% of an unidentified strain belonging to SNP-type 4. Because we only have few samples from Yemen, we decided to keep Ye2-3 in this study.
Phylogeny and dating analysis. Concatenated SNP alignments were used for the analyses. MP trees were constructed in MEGA6 74 using 500 bootstrap replicates. Sites with missing data were partially deleted (80% coverage cutoff), resulting in 3046 variable sites used for the tree calculation. The Subtree-Pruning-Regrafting algorithm was used as the MP search method. Dating analysis and discrete phylogeography were done using BEAST2 v2.4.4 40 . Details are given in the Supplementary Note.
Data availability. Sequence data are available from the NCBI Sequence Read Archive (SRA) under accession number SRP072827. Accession numbers for all samples used in this study are given in the Supplementary Data 1. Other relevant data supporting the findings of the study are available in this published article and its Supplementary Information files, or from the corresponding author upon request.