Whole genomic sequencing based genotyping reveals a specific X3 sublineage restricted to Mexico and related with multidrug resistance

Whole genome sequencing (WGS) has been shown to be superior to traditional procedures of genotyping in tuberculosis (TB), nevertheless, reports of its use in drug resistant TB (DR-TB) isolates circulating in Mexico, are practically unknown. Considering the above the main of this work was to identify and characterize the lineages and genomic transmission clusters present in 67 DR-TB isolates circulating in southeastern Mexico. The results show the presence of three major lineages: L1 (3%), L2 (3%) and L4 (94%), the last one included 16 sublineages. Sublineage 4.1.1.3 (X3) was predominant in 18 (27%) of the isolates, including one genomic cluster, formed by eleven multidrug resistant isolates and sharing the SIT 3278, which seems to be restricted to Mexico. By the use of WGS, it was possible to identify the high prevalence of L4 and a high number of sublineages circulating in the region, also was recognized the presence of a novel X3 sublineage, formed exclusively by multidrug resistant isolates and with restrictive circulation in Mexico for at least the past 17 years.

According to the WHO, ten million people in the world became ill with tuberculosis (TB) in 2018, around 15% of which died from the disease 1 . In Mexico, an estimated 29,000 cases of TB occurred in 2018, of which almost 3% were new and 11% were previously treated cases that presented rifampicin-(RR-TB) and multidrug-(MDR-TB) resistance 1 .
Whole genome sequencing (WGS) of TB isolates has multiple advantages over traditional genotyping techniques (MIRU-VNTR and spoligotyping) since it allows more robust Mycobacterium tuberculosis complex (MTBC) classification into lineages and sublineages, through the identification of a panel of 62 single-nucleotide polymorphisms (SNPs) 2 . The WGS genotyping method facilitates understanding of the dynamics of the disease and enables the implementation of measures focused on containment of the specific types of TB genotypes circulating within a given region [2][3][4] . Moreover, WGS has a greater power of discrimination and resolution of single nucleotide differences between clinical isolates. This information is extremely useful for the establishment of genetic relationships and identification of relationship levels between strains, which can help determine the formation of a genomic cluster, i.e., of cases derived from recent transmission 4,5 . The WGS can also be used as a tool for epidemiological surveillance and control of TB transmission, since it allows a more accurate description of the etiology of an outbreak and better resolution of the transmission topology, enabling definition of the directionality and evolution of transmission 4 www.nature.com/scientificreports/ The studies related to TB genotyping in Mexico using traditional methodologies such as MIRU-VNTR and spoligotyping, have shown a high diversity of circulating lineages; T (20%), X (11%), LAM (6%), EAI (7%), H (3%) and, to a lesser extent, S (1%) and Beijing (1%) 8 . However, an important proportion (20-60%) of the isolates so far characterized are frequently defined as orphans or are misclassified [8][9][10][11] . This lack of information greatly limits the potential use of genotyping techniques to establish clear inferences regarding the lineages in circulation and the design of epidemiological and public health measures. The aim of this study was therefore to identify and characterize, through analysis of WGS, the lineages and genomic transmission clusters present in DR-TB isolates circulating in southeastern Mexico.

Materials and methods
Population. This is a descriptive cross-sectional study, which included 67 genomes of Mycobacterium tuberculosis complex (MTBC) from patients diagnosed with pulmonary TB in the state of Veracruz, Mexico. The MTBC strains were randomly selected from the drug resistant strains bank of the Public Health Institute of Veracruz, including isolates recovered from 2013 to 2016.
DNA extraction and WGS. Genomic DNA was extracted and purified following the CTAB method as previously described 12 . The DNA was quantified using a nanodrop (ThermoScientific, USA), with subsequent adjustment to a concentration of 0.2 ng/µL. The WGS libraries were prepared according to Nextera XT (Illumina, CA., USA) protocol, using 1 ng of DNA previously quantified by Qubit fluorometer (Invitrogen, CA, USA). Quality control of the genomic libraries was determined using TapeStation (Agilent Genomics), which was normalized and sequenced using NexSeq 500 (Illumina, CA., USA) in a 2 × 150 paired-end format.
Bioinformatics analysis. Given the potential presence of contaminant DNA not corresponding to MTBC, the Kraken software V2 13 was first used to classify the WGS reads. Further focus was directed only at those reads that belonged to MTBC species 14 . The WGS analysis, including mapping and variant calling (SNP and INDELS), was performed following a previously reported pipeline 7,15 , which has been described, validated and available online at http://tgu.ibv.csic.es/?page_id=1794. Variants that were present in at least 20 reads and at ≥ 90% of frequency within each isolate were called fixed-SNP (used to detect phylogenetic mutations). Unlike, variants in at least ten reads at ≤ 10% frequency called no fixed-SNP (used to detect antibiotic resistance). The analysis of the polymorphisms related to each anti-tuberculosis drug has been previously described 16 . Phylogenetic analysis, genotyping and identification of genomic transmission clusters. In order to build a phylogeny as well as to identify the genomic transmission, a concatenated alignment was created with the fixed-SNP of all clinical isolates. This alignment consisted in 7596 non-redundant positions. This alignment was used to infer phylogeny using the maximum likelihood phylogenetic approach implemented in MEGA V6 17 , applying a general time-reversible model of nucleotide substitution, with a gamma distribution (GTR + GAMMA) and considering 1000 bootstraps. The tree was visualized in iTOL v. 4 18 .
Moreover, this concatenated fixed-SNP alignment was used to identify the genomic clusters that reveal transmission events. We calculated pairwise genetic distances between each clinical isolate using "ape" R library. A ≤ 12 SNP threshold was applied to delineate the genomic clusters, as proposed by proposed by Walker et al. 5 .
Strains were classified according to the presence of 62 phylogenetic variants associated with lineages and sublineages, as proposed by Coll et al. 2 .
Identification of specific SNPs in the transmission clusters. This analysis was performed using the previously described pipeline 7,15 , according to the following procedure: (1) development of a SNP-call in the 67 isolates that comprise the sample; (2) selection of the variants found in the cluster with the largest number of isolates (C1); (3) comparison of the list of specific SNPs found in the clusters with a global collection of 300,000 SNPs in order to remove SNPs related with homoplasy, SNPs shared with strains from other countries, and mutations related to lineage assignment and drug resistance; and (4) distinction of resulting SNPs by type, essential or non-essential activity of the gene, and identification of synonymous and non-synonymous variants, for generation of the final list.
In silico spoligotyping using WGS. The in silico spoligotyping using the WGS reads was conducted using the program SpoTyping V 2.0, according to the authors instructions 19 . The binary code obtained for each isolate was analyzed with the SITVIT2 platform at http://www.paste ur-guade loupe .fr:8081/SITVI T2/ 20 , in order to identify the sublineage and assign the respective spoligotype international type (SIT).

Analysis of association between patient variables, lineages and genomic transmission clusters.
Information related to the clinical, sociodemographic and geographic location (jurisdiction) of patients at the time of diagnosis was obtained from laboratory records and anonymized. Association between variables was determined based on the calculation of odds ratio (OR), considering a 95% confidence interval (IC 95). Calculation of the Fisher's test was performed considering a value of p < 0.05 as significant in terms of establishing association between variables. All calculations were performed using the software SPSS V.12.

Genotyping, phylogenomic and identification of genomic transmission clusters. All samples
were successfully sequenced. The number or reads varied from 334,031 to 4,391,976. At least the 88% of the reference genome was covered and the average coverage depth of all isolates was 138.63 (ranging from 19 to 269, median 141). Detailed sequencing information of samples analyzed is in Supplementary Table S2.

Identification of specific SNPs in cluster C1 (X3). Excluding those SNPs associated with lineage and
antibiotic resistance-related genes, a subset of 83 SNPs was found specifically in isolates belonging to the C1 cluster and with an X3 sublineage (Supplementary Table S1), of which 26 were identified as synonymous. It is important to highlight that the remaining strains within the X3 sublineage that do not belong to C1:X3, do not share these exclusive SNPs (Fig. 1b). Functional annotation indicates that eight SNPs were detected in non-coding regions, 19 SNPs were found in hypothetical proteins, including six synonymous variants. 56 SNPs were found in the same number of genes with a specific function. These were divided into: 19 SNPs in genes with essential activity, including four synonymous variants, and 37 SNPs in the same number of genes with nonessential activity, which included 16 synonymous variants. Of the 56 variants found in genes with a known function, nine non-synonymous variants were found in genes associated with virulence; pks16, Ace, mmpL7, devS, mmpL3, pks6, mas, fadD22 and dacB; one gene was associated with transmission, emrB, three SNPs were found in genes related to drug resistance, embA, amiB2, and dnaE2, and five were found in genes associated with resistance and virulence, pks16, mmpl7, mmpL3, pks6 and pks4 (Supplementary Table S1).
In silico spoligotyping. A total of 38 spoligotype patterns were detected. Table 2 shows the octal codes, SITs and lineages identified by the in silico spoligotyping analysis conducted in the 67 genomes. A total of 54 (81%) isolates were assigned to a respective SIT and lineage, four (6%) had only a SIT assignation, and nine (13%) of isolates were classified as orphans.
Three main lineages were recorded by the in silico spoligotyping analysis: EAI (L1) and Beijing (L2), with two (3%) isolates each, and Euro-American as the predominant lineage that included 50 (75%) of isolates and considered four sublineages; X including eighteen (26%) of the isolates distributed among four spoligotyping patterns, T with seventeen (25%) of the isolates among ten spoligotyping patterns, LAM with 13% (9) of the isolates among eight spoligotyping patterns, and H with 7% (5) of the isolates in four spoligotyping patterns.
Comparison between the lineages and sublineages assigned by in silico spoligotyping and those defined by WGS phylogenetic variant analysis presented a coincidence of 70% ( Table 2). The main differences were those isolates identified with a SIT but no lineage, and those identified as orphans, which were correctly assigned using WGS analysis (Table 2).
Associations between patient variables, lineages and genomic transmission clusters. No significant association was observed between sublineages and most of the variables recovered. Only one risk association was found, between male sex and development of TB with the X3 sublineage where an OR of 3.8 was observed (IC 1.1-13,), however, the p value was limited (p = 0.053). www.nature.com/scientificreports/ In addition, four statistically significant associations were observed with C1 (Table 3): (1) development of a TB infection with a strain from C1 and presenting MDR-TB (p = 0.0005), unfortunately, the presence of empty cells made it impossible to calculate the OR value; (2) a protective association between residence in the northern part of the state of Veracruz and development of TB with a strain from C1 (p = 0.026); however, it was impossible to perform the respective OR analysis; (3) an association between residence in the central region of the state and development of TB with a strain from C1 (p = 0.039), with an OR of 8.3 (IC 95% 1-69.6); and (4) an association between residence in the region of Xalapa (capital city of the state) and development of TB with a strain from C1 (p = 0.04), with an OR of 5.7 (IC 95% 1.2-26.5). This illustrates the importance of place of residence, and specifically residence in the capital city (Xalapa), as a risk factor in acquisition of a TB infection with a strain from C1.

Discussion
According to the WHO, 24,096 new TB cases were reported in Mexico in 2018, of which less than 3% of the new, and 11% of the previously treated, cases presented resistance to rifampicin, as well as multi-drug resistance 1 . Nevertheless, 29 (43%) of the DR-TB isolates recovered and analyzed in this study came from individuals classified as new cases, remaining 38 (57%) were relapses, readmissions or retreatments, of which 36 (54%) were classified as MDR-TB. This figure shows the burden of DR-TB in the region. This study assigned lineages and sublineages to 100% of the isolates analyzed, something never achieved in previous studies in Mexico using conventional techniques such as MIRU and spoligotyping 8,9,21,22 . This demonstrates the higher resolution of WGS compared to traditional genotyping methods, as previously described [2][3][4][5][6][7]23,24 .
Three major lineages were identified (L1, L2 and L4), with 16 sublineages. L4 presents the highest proportion of isolates (94%) and sublineages (13). This diversity of lineages concurs with previous studies and confirms the predominance of isolates with the Euro-American lineage (L4) in the country [8][9][10]22,25,26 and also in countries from center and South America 27,28 . This success of the transmission has been explained as a consequence of the European colonization in the fifteen and sixteen centuries, with importation of sub-lineages 15,27,29,30 , and also in terms of its adaptation to the immune response of the host in the different locations 31 .
Only 21% of isolates were clustered by WGS analysis, while the use of traditional genotyping techniques in isolates from Mexico has produced percentages of clustering that range from 60 to 70% 8,10,25,32-34 . The overestimation of clustering by traditional genotyping techniques is well recognized, particularly in the case of spoligotyping, whereas WGS has a better clustering discrimination capability and provides better descriptions of transmission clusters that occur in the population, in addition, the inclusion of information from patients is of great help to identify epidemiological links and transmission routes in patients located within identified clusters 4,7,35,36 . Furthermore, L4 genetic clusters detected by MIRU-VNTR have been described as overestimated 37 . Undoubtedly, genotyping and phylogenomic analysis by WGS will have important implications for the genotypic and epidemiological analysis of TB in Mexico.
The most frequent L4 MTBC sublineages were L4.1.1 (X) with 21 (31%) isolates, followed by L4.1.2 (T) with 20 (30%) isolates. The T sublineage is frequently described in Mexico 8,21 , while X has a very low occurrence and has only been described with prevalences ranging from 21 to 29% in two reports from the central and northern regions of the country 11,34 . The high proportion found here is therefore unusual and is the first such finding for isolates circulating in the southern region of the country. However, the relatively small number of isolates included in this study must be taken into account. Considering the above, the incorporation of a greater number of isolates, recovered through an epidemiological-genomic surveillance system 38 , will allow a clearer definition of the lineages/sub lineages circulating in the region, and even identify imported cases derived from migration effects.
A significant association was identified between cluster C1 (X3) and geographical location, and this was most significant in the region of Xalapa, the capital city of the state. Previous studies in this region have shown similar associations between other lineages such as H2 and a specific region in the north of the state of Veracruz (Tuxpan) 10 . These results describe the preferential distribution of certain genotypes in specific geographic regions of the state, and illustrate the value of this analysis in terms of identifying prevalence and the transmission routes of specific lineages, information of importance for the adequate design of specific preventive strategies.
All isolates included in C1 (X3) were found to be MDR-TB and shared almost the same polymorphisms pattern related to resistance against rifampicin, isoniazid and pyrazinamide, with the exception of three isolates (1228-06, 481-14, 1305-14) that were phenotypically susceptible to ethambutol and lacked any of the common mutations frequently found in genes associated with resistance to this drug, apart from isolate 1040-16 that had the mutation G/T at genomic position 4247431, given place to change Met306Ile, nucleotide position 918, in the embB gene.
These data suggest that resistance to ethambutol has arisen in a second moment of transmission of this strain. Finally, two of the isolates showed SNPs that confer a potential pre-XDR-TB and XDR-TB condition, confirming Table 3. Variables with significant association with isolates from the C1 cluster. **p value < 0.05 was considered significant. a OR calculation was impossible to determine. www.nature.com/scientificreports/ the potential of WGS analysis as a tool for predicting drug resistance. This shows the evolution of this strain according to the interaction with their respective hosts. In addition, we described the functional annotation in the 83 variants exclusively identified in the C1 isolates, from which nine non-synonymous variants were in genes implicated in the virulence of M. tuberculosis, one in transmission, two in genes related to drug resistance and five in genes with participation in both of these processes. Altogether, this information evidences the evolution and adaptation of this clone to this region of Mexico, with an increase in virulence and tendency to develop pre-and XDR-TB forms. Further studies that include isolates from different regions of Mexico are necessary in order to evaluate the impact of this clone in the epidemiology of DR-TB and the participation, not only of the non-synonymous variants found in the genes associated with resistance, virulence and transmission, but also of the remaining SNPs found in the isolates that form this apparently new X sublineage of TB.
An important volume of genotyping data has been accumulated using spoligotyping and MIRU-VNTR, which is an important issue when comparative phylogenomic data is obtained by WGS genotyping. The in silico spoligotyping and comparison with phylogenomic analysis showed a concordance close to 70%, with the main differences found in those isolates lacking a lineage or classified as orphans.
A detailed search of these spoligotypes in SITVIT2 and the author´s database confirms that more than 80% of the spoligotypes has been described previously in the country ( Table 2). It was also observed that all of the members of C1 shared the same octal pattern (700076717760771) and SIT (3278, X3). This pattern, according to SITVIT2, has been previously described in a single isolate from Spain and in two isolates from Mexico recovered in 2003, one from the state of Colima and the other from Quintana Roo 33 and, more recently, in five isolates from the state of Veracruz recovered in the period 2012-2013 10 . In both of these reports, the isolates were clustered and presented a phenotypic MDR-TB condition. This information confirms that this X3 sublineage is strongly associated with MDR-TB, is restricted to Mexico and has had national circulation over the entire country for at least 17 years, with recent growing expansion in the central setting of the state of Veracruz. Further WGS studies, including isolates from other states, are necessary in order to determine the level of expansion of this sublineage.
Perhaps the major limitations of this study were related to the restricted number of isolates analyzed, the failure to include fully susceptible isolates and the lack of phenotypic DST studies for second line drugs. Undoubtedly, there is an urgent need to increase the number of studies related to WGS analysis of tuberculosis in Mexico in order to identify in greater detail the diversity of the circulating lineages and the presence and extent of genomic transmission clusters in the various regions of the country, as well as to determine the variables that could function as risk factors of transmission.
We conclude that WGS was extremely useful in terms of defining the lineages in the totality of the isolates analyzed and also in the identification of genomic transmission clusters. It also identified the presence of a cluster comprising a MDR strain with an X3 sublineage, specifically located in the center of the state of Veracruz, but one that has been expanding in Mexico over the last 17 years. Further studies will be required in order to explain the origin of this strain, its transmission routes and its implications for public health.

Data availability
The data underlying the genome sequences presented in this study are available at ENA: https ://www.ebi.ac.uk/ ena. Accession number: PRJEB30933. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.