Whole-genome single nucleotide variant phylogenetic analysis of Mycobacterium tuberculosis Lineage 1 in endemic regions of Asia and Africa

Mycobacterium tuberculosis (Mtb) lineage 1 (L1) contributes considerably to the disease morbidity. While whole genome sequencing (WGS) is increasingly used for studying Mtb, our understanding of genetic diversity of L1 remains limited. Using phylogenetic analysis of WGS data from endemic range in Asia and Africa, we provide an improved genotyping scheme for L1. Mapping deletion patterns of the 68 direct variable repeats (DVRs) in the CRISPR region of the genome onto the phylogeny provided supporting evidence that the CRISPR region evolves primarily by deletion, and hinted at a possible Southeast Asian origin of L1. Both phylogeny and DVR patterns clarified some relationships between different spoligotypes, and highlighted the limited resolution of spoligotyping. We identified a diverse repertoire of drug resistance mutations. Altogether, this study demonstrates the usefulness of WGS data for understanding the genetic diversity of L1, with implications for public health surveillance and TB control. It also highlights the need for more WGS studies in high-burden but underexplored regions.

www.nature.com/scientificreports/ high burden countries such as the Philippines and India, each of which has several hundred thousand L1-infected patients each year 4 . Spoligotyping is another widely used genotyping method, with data available from most countries. It evaluates the presence of 43 out of 68 genomic segments called direct variable repeats (DVRs) located in the CRISPR region, also called the Direct Repeat (DR) region 12 . This scheme classifies Mtb strains with deletions at DVR39-42, 44 and 48 as the East African Indian (EAI) genotype, which is equivalent to L1. Each spoligotype pattern is represented by a series of 0 s and 1 s indicating the absence or presence of each of the 43 DVRs, and is usually abbreviated into a 15-digit octal code with each number representing 3 DVRs sequentially for 14 digits with the last digit as either 0 or 1. Each spoligotype is given an identification number for the ease of use, called the spoligotype international type (SIT). For example, most of the L1.2.2 strains have SIT19 spoligotype or similar, collectively referred to as EAI2. Spoligotypes classify EAI2 into two groups, EAI2_MNL (Manila), associated with the Philippines, and EAI2_NTB (Nonthaburi), associated with Thailand.
Although spoligotyping is cheap and fast, its discriminating power is limited, and is confounded by the presence of common homoplastic spoligotypes such as SIT48 and SIT236, meaning that they could belong to one of several sublineages 5 . It was not known whether the same homoplastic spoligotype in different countries belonged to the same sublineage or not.
This study aims to refine the WGS SNP-based genotyping scheme of L1 to better understand its genetic diversity and population structure in high-burden regions. We performed phylogenetic analysis of 1,764 L1 samples, representing all major sublineages and endemic areas in Asia, Oceania and Africa. While this work was on-going, the revised nomenclature with three major sublineages had been proposed 6 . We adopt this threegroup scheme here, but kept further nomenclature as backward compatible as possible. We compared the SNP phylogeny to the conventional spoligotype and the underappreciated 68-DVR patterns to gain deeper insights into the geographic distribution and dispersal history of L1 sublineages. We also examined genetic clusters and drug resistance mutations. The information is useful for evaluating TB burden and for understanding of the evolutionary history and pathogenesis, with implications for TB control.

Results
Phylogeny of L1 sublineages. WGS data of 1,764 M. tuberculosis (Mtb) L1 isolates from Asia, Oceania and Africa were obtained until November 2020. Most isolates were from Thailand (36%), Vietnam (26%), India (13%) and the Philippines (9%). The remaining were from other Asian countries (5%), Africa (n = 151; 9%) and Oceania (Australia and Hawaii; 2%) (Supplementary Table S1). Accession numbers of the raw paired-end reads of all Mtb isolates used in this study are provided in Supplementary Table S2.
Interestingly, a sublineage in the Napier's scheme 6 not present in our main dataset was L1.2.1, with most isolates being from patients in Europe. Therefore, we inferred a separate phylogeny of L1.2 isolates that included 410 isolates from our dataset and additional 364 isolates from Napier et al. 6 and other recent studies 10,14 (Fig. 2 Fig. S4). Some sublineages correlated well to a unique set of DVR deletions that may be used as markers (Table 1) Supplementary Fig. S1). Nevertheless, those deletions could occur in other sublineages as well, limiting their potential use as reliable markers. Deletions of some DVRs were clearly homoplastic, occurring several times independently in the evolutionary history of L1. DVR62 is a characteristic of both L1.3 and L1.1.1.9, and is deleted together with DVR61 in L1.1.1.10. This deletion was also found in many other sublineages of L1. 1 Table S4). The WGS of the isolates had the read depth of 15-46 (median = 26) and the breadth of coverage at the sequencing depth of 20 between 23-96% (median = 74%). These isolates formed two separate clades, suggesting two independent events of complete deletion. The spoligotype-negative event was related to gene cas1, the deletion of which was associated with more vulnerability to DNA damage 15 .
Geographic diversity of L1. Sublineages of L1 distributed preferentially around the Indian Ocean and the Western Pacific region (Fig. 3 Tables S5 and S6). Among the 1,835 sublineage-specific SNPs, ~ 80% were in non-essential coding regions or in noncoding regions. We recommend using the full set of SNP markers when possible. We also provided a subset of 125 barcoding SNPs, prioritizing synonymous SNPs within essential genes (Supplementary Table S6).
Drug resistance mutations. We used TBprofiler to predict drug resistance based on known genetic markers 16 (Supplementary Fig. S5). The majority of isolates (n = 1,272, 72%) did not possess any resistance   www.nature.com/scientificreports/ conferring mutations and, therefore, were presumably pan-sensitive. Isoniazid resistance was most common (n = 348, 19.7%), followed by ethionamide (n = 201, 11.4%), streptomycin (n = 188, 10.7%), rifampicin (n = 160, 9.1%), ethambutol (n = 99, 5.6%), pyrazinamide (n = 66, 3.7%) and fluoroquinolones (n = 28, 1.6%) (Supplementary Table S7). Mutations conferring resistance to other drugs were found in less than 1% of the isolates. By classifying drug resistance profile based on the latest WHO recommendations (see Methods), we found that 13% (n = 224) of the isolates were resistant to one first-line drug (mono-DR) while 7% (n = 123) were MDR. About 1% of the isolates were pre-XDR. Only one isolate was XDR. Most rifampicin-resistant isolates were also resistant to isoniazid (n = 142, 89%). The prevalence of drug resistance varied across sublineages (Supplementary Figs. S5 and S6) which were correlated with the geography (Fig. 3 (Table 2). This was consistent with a previous report of 19% MDR-TB in the Philippines 11 although the MDR incidence among new TB cases there was only 2% in 2012 17 . We caution that the prevalence of drug resistance may not be representative due to different sampling designs used by source studies.
Mutations associated with drug resistance were diverse. All rifampicin-resistant isolates had variants in the rpoB gene, with 31 distinct alleles associated with changes at 13 amino acid residues. For isoniazid resistance, 99% of the isolates (n = 344) had mutations in katG or fabG1/inhA involving 36 alleles, with katG Ser315Thr being most common (52%), followed by the -15C > T mutation in the fabG1 promoter (38%), with only 9 isolates having both variants (Supplementary Table S7). Among the other two first-line drugs, we identified 44 alleles of pncA among 63 isolates conferring pyrazinamide resistance and 30 alleles in embA, embB and embR conferring ethambutol resistance.
A small proportion of isolates (n = 28, 1.6%) possessed mutations in gyrA or gyrB conferring resistance to fluoroquinolones, with Ala90Val and Asp94(Gly/Ala/Asn) in gyrA being the most common (n = 26). Among those isolates, 18 were qualified as pre-XDR and mostly belonged to L1.1.1 or L1.3.2. The only isolate identified as XDR belonged to L1.1.1.9 from Thailand. In addition to pre-XDR mutations, it also had an insertion in Rv0678 at position 192 that made it likely to resist bedaquiline 18 .

Discussion
Here, we provide an updated classification scheme of Mtb L1 based on phylogenetic analysis of WGS data of samples from L1 endemic regions. The major differences between our nomenclature and the Napier's scheme 6 are in L1.1.2, L1.1.3 and L1.2.2 sublineages (Fig. 1). Several sublineages associated with a specific country or region are recognized in our scheme but not in the Napier's scheme. Comparisons between SNP sublineages and spoligotypes reveal some degree of congruence. There is greater congruence when the spoligotypes are defined by several deletion events, such as L1.1.3 and L1.2.2 (previously L1.2.1), which are equivalent to EAI6_BGD1 and EAI2 respectively. In contrast, spoligotypes defined by a single deletion event are more likely to be homoplastic. The most notable example is SIT48, typically defined by the deletion of DVR62, which occurred several times independently in the evolutionary history of L1. SIT48 can be a member of L1.1.1 or L1.3. Fortunately, since L1.1.1 is largely restricted to MSEA, SIT48 in other regions, such as India or Africa, most likely corresponds to L1.3 whereas SIT48 in MSEA may belong to either L1.1.1 or L1.3 as shown in Fig. 4.
We also clarify several genetic relationships between spoligotypes that could facilitate the development of better genetic groupings of spoligotypes. This is important as spoligotyping is more affordable in high-burden countries and provides data from much larger sample sizes with a higher geographic coverage than WGS.
The geography of isolates with most intact DVRs, similar to L1-MRCA, could be indicative of the region where L1 originated. Such isolates have the SIT236 spoligotype and are found exclusively in certain sublineages of L1.1.1, except for two L1.3.2 isolates (Fig. 4 and Supplementary Table S4). Since all sublineages of L1.1.1 except for L1.1.1.11 are mostly restricted to MSEA, where the diversity of L1 is also greatest (Fig. 3), it is possible that the L1-MRCA might have been present there. This is in contrast with recent genomic studies which predicted the origin of L1 in South Asia 10,19 , where none of the SIT236 isolates were present in our dataset. We note that their inferences were dominated by L1.1.2.2 and L1.3, and may suffer from small sample sizes (~ 300 or lower) and non-representative sampling.
One major limitation of this study is that it was based on WGS data collected for various purposes. Epidemiological findings such as the prevalence of drug resistance and genetic clusters were therefore only suggestive and require further investigation. A few source studies were intended for drug resistant bacteria which biased the estimates of drug resistance frequencies.
Our analysis revealed a vast diversity of resistance-conferring mutation combinations. The results confirmed the presence of the mutations in the rifampin-resistance determining regions (RRDR) in the vast majority (95%) of rifampin resistance isolates, which would allow the GeneXpert system to adequately detect rifampin resistance. Although available commercial line probe assays do not specifically identify mutations in about 37% of www.nature.com/scientificreports/ rifampicin-, 13% of isoniazid-and 55% of ethambutol-resistant isolates in this dataset, they should be able to diagnose the resistance by the absence of hybridization of wild-type probes. The line probe assay specifically detects most fluoroquinolone resistance mutations 20 .

Conclusion
This study refines a genotyping scheme of L1, particularly of the L1.2 sublineages. Many basal isolates of L1.2 were recognized in ISEA samples, suggesting a possible study site to further elucidate the evolutionary history of the clade. Mapping spoligotypes onto the SNP phylogeny clarifies the genotypic identity of some homoplastic spoligotypes. Analysis of drug resistance mutations revealed a vast diversity of mutated alleles, particularly among the first-line drugs. This has implications for the utility of genetic tests for diagnosis of drug-resistant TB. Finally, genetic clusters in L1 are rare and small in size, suggesting a more limited transmission potential of L1, especially in comparison with L2 21 .

Methods
Dataset. We compiled a collection of whole-genome sequencing (WGS) data of 1,764 M. tuberculosis L1 isolates from countries in Asia, Oceania and Africa. This is the main dataset. Most of the data were publicly available while a small proportion was from unpublished studies. Supplementary Table S1 provides a list of data sources and references for the WGS data used in this study. A full list of accession numbers for all 1,764 isolates is in Supplementary Table S2. Supplementary Table S3 shows the distribution of samples across countries and sublineages. We also obtained additional 364 L1.2 isolates from three recent studies 6,10,14 . This set was analyzed together with the 410 L1.2 isolates from the main dataset, referred to as the L1.2 dataset (n = 364 + 410 = 774) (Supplementary Table S2).
As additional sample quality control, we excluded samples if (i) the median read depth below 10 or the median breadth of coverage (depth at least 10) below 10%, or (ii) it was redundant as determined by the BioSample accession number and sample metadata if provided by the original study, or (iii) the processed reads had unusual %GC content, or (iv) SNPs specific to different sublineages were present in substantial proportions, indicative of mixed strains. The total of 1,764 isolates passed this sample quality control.
Joint SNP calling of all 1,764 isolates was performed using GATK GenotypeGVCFs v4.1.6.0 24 . For variant quality control, we dropped SNPs within a region annotated as repeat region, IS6110, PE, PPE, mobile element, phage or 13E12 repeat family. SNPs within known drug-resistant genes were excluded as in Ajawatanawong et al 25 . Finally, SNPs with quality by depth (QD) < 2 or root mean square of mapping quality (MQ) < 40 were also excluded. This resulted in 153,244 SNPs across 1,764 genomes. The additional L1.2 isolates were processed similarly.
Phylogenetic analysis. SNP calls passing the quality filters were converted into a multiple sequence alignment using a custom script based on GATK VariantsToTable (available at https:// github. com/ CENMIG/ snppl et). Point deletions and missing calls were converted into gaps in the alignment. The H37Rv strain (L4) was used as an outgroup. A maximum likelihood (ML) phylogeny was inferred using IQ-TREE v2.1.1 26 . The bestfit substitution model was determined to be GTR + G4 by ModelFinder 27 . The bootstrap branch supports were based on 1,000 replicates. We also inferred an ML phylogeny correcting for ascertainment bias as shown in Supplementary Fig. S8 and obtained consistent bootstrap support for all sublineages 28 .  Each DVR segment was identified as a deletion if one of the following criteria was satisfied: (i) the median read depth was below a cutoff and the proportion of sites with at least one read mapped to was < 50%; (ii) the 25% quantile of read depths was < 3; (iii) the 5% quantile of read depths was 0; (iv) if the overall median depth (across all DVRs) was > 50, the median read depth was smaller than a half of the median read depth. For (i), we used the cutoff value of 5, or half of the baseline median read depth calculated from all DVR regions, whichever was lower. This choice as well as the criteria (ii) and (iii) accounted for low-depth samples. The criterion (iv) accounted for samples with high sequencing depths by adjusting for overall read depth across all DVR regions. Given a DVR pattern, we extracted a relevant set of 43 DVRs to determine the spoligotype. SIT and spoligotype group for each spoligotype was obtained from the SITVIT2 database 31 . The code for DVR calling is available in GitHub at https:// github. com/ ythaw orn/ dvrca ller.
We tested our algorithm using a 480-isolate subset where spoligotypes had been determined experimentally and DVR patterns had been independently identified using a de novo assembly-based approach 5 . The above criteria yielded ~ 6% difference for the spoligotype and 16% difference for the DVR pattern. We also compared our predicted spoligotypes with two other methods, SpoTyping 32 and Galru 33 (Supplementary Table S2). The number of mismatches between each pair of methods ranged between 14%-17%. The difference in predicted spoligotypes was usually only about 1-2 DVR segments. Manual inspection of mapped read depths indicated that our method appeared to be more reliable.
Revising genotyping scheme and identifying lineage-specific SNPs. We used the following criteria to delineate genotypes based on the inferred phylogeny (Fig. 1). First, we preserved the genotypes that have previously been defined. Sublineage of each isolate was identified using two existing SNP-based genotyping schemes for M. tuberculosis L1 3,5 . Second, we assigned a new genotype to a monophyletic clade of previously unclassified strains with at least 10 isolates, bootstrap support ≥ 90%. We also required that all isolates shared at least one SNP specific to the group that was absent outside the group. Third, we expanded the definition of existing sublineages to include more of previously unclassified strains if the final clade also had consistent genomic and epidemiological features such as the DVR pattern and geography, and satisfied the previous conditions.
Once genotypes have been defined, we identified a list of SNPs specific to each genotype using a custom script (available in GitHub at https:// github. com/ ythaw orn/ group-speci fic-varia nts). These identified sublineagespecific SNPs serve as markers for genotyping those sublineages. We annotated variants within an ORF using four states of essentiality: essential (ES), growth defect (GD), nonessential (NE) and growth advantage (GA) 34 . Supplementary Table S5 shows that number of sublineage-SNPs identified. The full list of annotated sublineage-SNPs is in Supplementary Table S6.
Drug resistance mutations. We used TBProfiler 16 to predict resistance of each isolate to the following 19 anti-TB drugs: rifampicin, isoniazid, pyrazinamide, ethambutol, streptomycin, moxifloxacin, ofloxacin, levofloxacin, ciprofloxacin, amikacin, kanamycin, capreomycin, ethionamide, para-aminosalicylic acid, cycloserine, linezolid, bedaquiline, clofazimine and delamanid. Classification of drug resistance type was according to the latest World Health Organization (WHO) recommendations as follows. mono-resistance (mono-DR): one first-line drug (isoniazid, rifampicin, ethambutol, pyrazinamide); polydrug resistance (poly-DR): more than one first-line drug other than isoniazid and rifampicin together; multidrug resistance (MDR): both isoniazid and rifampicin together; pre-XDR: MDR and any fluoroquinolone; extensive drug resistance (XDR): pre-XDR and at least one of bedaquiline or linezolid 35,36 . Genetic clustering. We identified genetic clusters as clades in the phylogeny containing isolates that can be linked via pairwise SNP distances of at most 5, 12 or 20 SNPs 37,38 .

Data availability
Accession numbers of the raw paired-end reads of all Mtb isolates used in this study are provided in Supplementary Table S2.