Whole genome sequencing identifies bacterial factors affecting transmission of multidrug-resistant tuberculosis in a high-prevalence setting

Whole genome sequencing (WGS) can elucidate Mycobacterium tuberculosis (Mtb) transmission patterns but more data is needed to guide its use in high-burden settings. In a household-based TB transmissibility study in Peru, we identified a large MIRU-VNTR Mtb cluster (148 isolates) with a range of resistance phenotypes, and studied host and bacterial factors contributing to its spread. WGS was performed on 61 of the 148 isolates. We compared transmission link inference using epidemiological or genomic data and estimated the dates of emergence of the cluster and antimicrobial drug resistance (DR) acquisition events by generating a time-calibrated phylogeny. Using a set of 12,032 public Mtb genomes, we determined bacterial factors characterizing this cluster and under positive selection in other Mtb lineages. Four of the 61 isolates were distantly related and the remaining 57 isolates diverged ca. 1968 (95%HPD: 1945–1985). Isoniazid resistance arose once and rifampin resistance emerged subsequently at least three times. Emergence of other DR types occurred as recently as within the last year of sampling. We identified five cluster-defining SNPs potentially contributing to transmissibility. In conclusion, clusters (as defined by MIRU-VNTR typing) may be circulating for decades in a high-burden setting. WGS allows for an enhanced understanding of transmission, drug resistance, and bacterial fitness factors.

Tuberculosis (TB) remains among the top ten causes of deaths globally, with 10.4 million new cases in 2016 alone 1 . Peru remains a high burden country for multidrug-resistant (MDR) TB with 117 TB cases reported per 100,000 population in 2016 and approximately 9% being MDR or rifampicin-resistant (RR) 1 . Molecular methods have been instrumental in identifying outbreaks, and single nucleotide polymorphism (SNPs) identified using whole genome sequencing (WGS) have a higher resolution in identifying transmission links compared to traditional genotyping methods such as spoligotyping or Mycobacterium interspersed repetitive unit-variable number tandem repeats (MIRU-VNTR) [2][3][4][5][6][7][8][9][10][11][12] . Yet we don't yet fully understand how to use all the genetic information generated by WGS. By convention, as much as 10% of the genome is excluded 2,3 and in some instances too little remaining variation is found to enable the resolution of transmission chains 13 . Resolving transmission events accurately is particularly challenging in high-burden settings where multiple source case suspects are common. In addition to guiding public health interventions including appropriate contact tracing, identifying the source case can inform patient care in some cases, such as in pediatric TB where the source case microbiological data can inform treatment 14,15 . Furthermore, in high-burden countries the term 'outbreak' may not apply as TB has been 1 Boston children's Hospital, Boston, MA, USA. 2 Harvard Medical School, Boston, MA, USA. 3 Socios en Salud, Lima, Peru. 4 texas A&M University, college Station, tX, USA. 5 imperial college, London, UK. 6 University of South florida, tampa, fL, USA. 7 Brigham and Women's Hospital, Boston, MA, USA. 8 SUnY Downstate Medical center, Brooklyn, nY, USA. 9 Mailman School of Public Health, columbia University, new York, nY, USA. 10 Massachussetts General Hospital, Boston, MA, USA. correspondence and requests for materials should be addressed to A.D. (email: avika. dixit@childrens.harvard.edu) circulating continuously for decades 16 . Given the renewed emphasis on active case finding 17 and the widespread adoption of WGS, an intensification of the latter in high-burden settings is needed.
Control efforts against TB have been undermined by the emergence and spread of drug resistant TB (DR-TB). Current evidence suggests that most cases of DR-TB are a result of transmission rather than de-novo evolution of the bacteria during treatment [18][19][20] . Factors known to contribute to DR-TB transmission include delays in diagnosis and treatment 21 , host factors (e.g. age, immune status [22][23][24], as well as bacterial factors such as fitness and immunogenicity characteristics [25][26][27] . It is well recognized that Mycobacterium tuberculosis (MTB) strains with the same DR-conferring mutations have a range of fitness [28][29][30] . However, to date few molecular fitness determinants have been characterized and seldom in the context of high transmissibility [31][32][33][34] . Improved knowledge of such bacterial factors can inform efforts for transmission interruption by identifying targets for diagnosis, surveillance, and even potential therapeutics targeting fitness mechanisms. Here we use WGS data to examine the largest TB MIRU-VNTR cluster spanning pan-susceptible to MDR-TB isolates that was identified in 4,000 TB patients enrolled in a household transmissibility study. We examine both host data and TB genotypic data to understand the evolution of isolates within this cluster, infer the timing of emergence of antibiotic resistance, and identify genetic bacterial factors unique to this cluster that may have contributed to its success.

Materials and Methods
Study Design. A TB household transmissibility and treatment outcome study was performed in northern Lima, Peru from September 2009 to August 2012. The study procedures including patient enrollment and consent have been previously described 35,36 . Briefly, informed consent was obtained from participants or their parents or guardians. Patients were enrolled if they were diagnosed with pulmonary TB (PTB) at public health clinics and were followed through therapy. Their household contacts were also followed with tuberculin skin testing and monitored for development of TB for a period of 12 months. The following were collected at time of TB diagnosis: clinical signs and symptoms, sociodemographic characteristics (e.g. age, gender, occupation, household type), geographical coordinates of household and health center, co-morbidities (HIV status, diabetes mellitus, renal disorder), as well as alcohol, tobacco and drug use.
Approval was obtained from the Research Ethics Committee of the Peruvian National Institute of Health (Lima, Peru) and the Committee on Human Studies at Harvard Medical School (Boston, MA). All research was performed in accordance with relevant guidelines and regulations.
Culture, DST and genotyping. Lowenstein-Jensen (LJ) culture was performed from sputum specimens using standard NALC-NaOH decontamination. All sputum cultures positive for MTB complex subspecies tuberculosis were subjected to first-line drug susceptibility testing (DST) for isoniazid (critical concentration of 1 µg/mL and 0.2 µg/mL), rifampicin (40 µg/mL), streptomycin (4 µg/mL) and ethambutol (2 µg/mL) using the proportion method. Pyrazinamide susceptibility (100 µg/mL) was measured using the Wayne method 37 . Any resistant strains underwent further DST for second-line agents performed using the indirect proportion method on 7H11 agar as previously described [38][39][40] . DNA was extracted and genotyped by 24-loci MIRU-VNTR using standard methods 41 . A 'cluster' in the study was defined as a group of isolates having an identical MIRU-VNTR pattern.

Whole-genome sequencing (WGS) and variant calling. Sixty-three isolates collected between 2009
and 2012 were available for sequencing from the largest MIRU-VNTR cluster (n = 148) found in the study (Supplementary Table 3) based on the availability of culture for DNA extraction in 2013. DNA extraction, sequencing and read processing is described in the supplementary methods. Raw reads were processed and variants were called using a custom bioinformatics pipeline 42 (supplementary methods).

Phylogenetic analysis.
A multiple sequence alignment was generated as a concatenate of allelic states at all sites found to be variable using a custom script in Perl 5.10.1 42 . We generated an alignment containing only substitutions and another with both substitutions and insertions and deletions (commonly referred to as 'indels'). Variants in genes implicated in drug resistance 43 , transposases, and genes coding for proline-glutamate (PE) or proline-proline-glutamate (PPE) 44 (commonly referred to as PE/PPEs) were excluded from phylogeny building by convention (e.g. due to high levels of recombination 45 ), and also because drug resistance genes are under selective pressure and are expected to bias tree structure. A maximum likelihood phylogenetic tree was built for both alignments using RAxML 8.2.11 46 as implemented in the R package ips 47 (supplementary methods). Estimation of divergence times and timing of drug resistance acquisition was performed using BEAST 1.8.4 48,49 using an uncorrelated lognormal relaxed clock that allows for tree branches to evolve at different rates. The prior on the mean clock rate was assumed to be 0.5 SNP per genome per year based on published data 3 . We calculated the range of dates based on a 95% highest posterior density interval (HPDI) which indicates the smallest interval that includes 95% of the posterior probability distribution 50 . Please see supplementary methods for further details of the phylogenetic analysis.
To determine the number of times DR arose within the group of isolates, we examined the maximum clade credibility tree (where bifurcations denote internode branches supported by a posterior probability >0.5). We then counted the minimum number of paraphyletic switches from a sensitive to a resistant phenotype for each drug supported by bifurcating/high-confidence nodes.
SNPs occurring within a high-transmission cluster were further evaluated against other TB lineages using a large TB WGS database containing 12,032 strains curated from the literature through NCBI and the Reseq TB initiative 51 (Supplementary Data 1). To determine SNPs with phylogenetic convergence, mutations in the high-transmission cluster that were present in more than 5% of strains in at least two other main lineages (i.e. lineages 1, 2 and 3) in addition to lineage 4 were identified as likely being under positive selection. As some of the www.nature.com/scientificreports www.nature.com/scientificreports/ hit SNPs occurred in repetitive regions, we confirmed their accuracy by remapping simulated reads from different reference genomes and using Pacific Biosciences (PacBio) long-read sequences (supplementary methods).
Other data analysis. Variation in PE/PPE regions and indels between closely related strains was confirmed via visualization and checked for false positives due to copy number variants (supplementary methods). To study host-related factors that may be associated with transmission, the propensity to propagate (PTP) method was used as previously described 52 .

Results
Patient and isolate characteristics. A large cluster of 148 isolates, collected from 147 patients, with identical MIRU-VNTR pattern was identified. The majority of the patients in this cluster were male, HIV-negative and had multidrug resistant (MDR) TB (Table 1). About one-half were smear-positive (51.4%) and one-third used alcohol (35.8%) or other intoxicating substances (27.9%). From this cluster, 63 isolates were available for sequencing (supplementary methods). Two patients had isolates that did not meet our sequencing quality criteria and were subsequently excluded. Of those patients with high quality sequence data (n = 61) a higher majority were male with MDR-TB but were otherwise comparable to the superset of 148 ( None of the variables were significantly different between the patient with and without sequencing data using a t-test or a Chi-squared test. *N = 147 unless specified (including one patient with two isolates). **N = 60 unless specified (including one patient with two isolates). ***Mean and range. # Resistance pattern for three of the sequenced strains was inferred based on mutations. www.nature.com/scientificreports www.nature.com/scientificreports/ 371 SNPs and 81 indels in total. Of these, 24 substitutions and one indel occurred in DR conferring regions, and 42 substitutions and 23 indels in PE/PPE genes. With the exclusion of DR conferring regions, the average pairwise SNP difference between isolates was 21.69 (range: 0-84) and 22.7 (range: 1-100), excluding and including the PE/ PPE regions, respectively.
The geographic distribution of strains based on household coordinates, colored by resistance pattern is shown in Fig. 1. Comparison between genetic and geographic distance did not support that the cluster spread in a single geographic direction, even when three most distant strains were excluded (P = 0.2, Supplementary Fig. 1).
The SNP-based phylogenetic tree (Fig. 2, Supplementary Fig. 2) demonstrated that the most genetically homogeneous group consisted of 57 isolates. These formed two main clades, where the first contained isolates that were pan-susceptible or streptomycin mono-resistant, and the second consisted of INH mono-resistant or MDR isolates. Using a prior on the mean clock rate of 0.5 SNP per genome per year 3   www.nature.com/scientificreports www.nature.com/scientificreports/ Epidemiological vs. genotypic data. We compared the use of genetic and epidemiological data for transmission inference. PE/PPE region differences and indels between closely related pairs were visualized in Integrated Genome Viewer and confirmed to be high confidence (supplementary methods). Sequencing data was available for three of seven household contact pairs in the cluster of 147 patients. Only one household link (index/parent M23 -contact/child M12) was consistent with a recent transmission event on the tree and by genetic distance (SNP difference = 1). When high-confidence PE/PPE regions were included, the genetic distance between this pair increased to 3 variants. With the further addition of high confidence indels the pairwise distance increased to 4, in the predicted 5 year interval (Supplementary Data 2). No variation was observed in this pair in DR-related loci. The other two isolate pairs from household contacts (index/parent M02 -contact/ child M01, index/child M45 -contact/parent M58) were 17 and 421 SNPs apart, respectively. Isolates M01 and M02 were genetically closer to other isolates on the tree (M01-M28 6 SNPs apart and M02-M42-M11-M37 all within 4 SNPs of each other). A pair of isolates collected two months apart from a host who had not been on treatment (MDR strains M06 and M10) was found to have no SNP differences outside of PE/PPE regions. In PE/PPE regions, we found 5 high-quality SNPs; similarly, 2 indels were observed in other regions.
Looking at genetic evidence alone for recent transmission using a distance cutoff of ≤5 SNPs (excluding DR and PE/PPE regions) 3 , 139 links among 38 patients were identified (Fig. 3). Other than one pair (index/parent M23 and contact/child M12), none of these belonged to the same household. With the addition of high confidence indels and PE/PPE SNPs we used a cutoff of ≤12 variants: 5 SNPs plus 7 PE/PPE and indel variants based on the two serial isolates available from the same patient described above. Using this added variation and the cutoff of ≤12, there were 104 links among 38 patients, i.e. 25% fewer links than when these variants were excluded. Phylogenetic trees built by including indel variation also had notable differences within the cluster of 57 isolates (Supplementary Fig. 3).
Of the 375 isolates sequenced from a TB outbreak in London 13 , 325 (86.67%) met our quality criteria and were further examined. Using a genetic distance cut-off of ≤5 SNPs (excluding DR and PE/PPE regions), 309 of these isolates diversified into one large interconnected cluster consisting of 31,776 links. Among 38 serial isolates that were collected from 19 patients from that outbreak, the largest difference between a pair with the inclusion of indels and PE/PPE regions was also found to be 7 SNPs. When applying this as the threshold for identifying a genetic link, the interconnected cluster was reduced to 294 strains with 28,230 links, i.e. 11% fewer links than when PE/PPE variants and indels were excluded. www.nature.com/scientificreports www.nature.com/scientificreports/ Host factors. We measured host infectiousness in the cluster using the 'propensity to propagate' (PTP) method 52 and identified five patients as having the highest possible score (PTP > 4). This was related to patients being younger (20-29 years old) males with smear positive PTB and a history of substance use ( Supplementary  Fig. 4). Three of these were identified to be within the network of patients with genetically close MDR isolates (Fig. 3). The mean cluster PTP was also high at 1.699. Drug resistance. First and second-line drug resistance was acquired several times within the core cluster of 57 isolates (  : 1964-1990), the most recent PZA resistance acquisition event was predicted to be within the last year of isolation (95%HPD: 2006-2012). Ethambutol resistance was acquired at least twice within the MDR clade and contemporaneous with RR acquisition in both cases. The mutation Y319S was the most common embB mutation observed. Resistance and Bacterial Fitness. As there were several isolates measured to be resistant by the culture-based method that did not harbor any known resistance mutations, e.g. for EMB and PZA, we attempted a phylogeny-based genome wide association within the group of 61 isolates to identify new mutations associated with resistance (Supplementary Table 2). In addition to identifying the known mutations that confer resistance to isoniazid and rifampicin, we found an association between EMB resistance and a mutation (3778221AG) in the intergenic region between spoU and PE-PGRS51 genes (20 bp from spoU end and 347 bp before PE-PGRS51 start), corresponding to the acquisition of EMB resistance ca. 1980ca. (95%HPD: 1964ca. -1990 shown in Supplementary Fig. 5. We identified 175 mutations that were unique to the core cluster of 57 isolates and were absent from the four more distantly related isolates. Hypothesizing that a subset of these mutations may have contributed to transmissibility of this cluster, we measured which are under positive selection by looking at 12,032 other MTB isolates. Five mutations met our criteria for positive selection i.e. were found to have a frequency of >5% in at least three other TB lineages (lineage 1, 2, 3, and non-LAM-4) (Fig. 4). Of the five, two occurred in genes with known function, esxV which is an ESAT-6 like secreted protein and cobD, a cobalamin biosynthesis protein. As three of the five genes are known to contain repetitive regions, the accuracy of the convergent mutation calls was verified by simulating and remapping Illumina reads carrying the variant, and using PacBio long read sequences that was available for one strain (supplementary methods).

Discussion
In this detailed analysis of a MIRU-VNTR cluster with variable degrees of drug resistance from a high prevalence setting, we show that traditional genotyping methods have a significantly lower resolution in identifying transmission clusters as compared to WGS, particularly when variation in PE/PPE regions, indels are incorporated into the analysis. Additionally, we found complex evolutionary patterns within an otherwise identical MIRU cluster and identified the interplay of host and epidemiological factors contributing to transmission potential of a cluster.
The higher resolution gained by WGS is consistent with prior reports 2,3,53-55 , but the maximum genetic distance we find between the clustered isolates is larger than previously seen and we further estimate the group of LAM-4.3.3 sequenced isolates to have been circulating for over 8 decades in our study community. Previous studies performing WGS of MIRU-VNTR clusters in low prevalence settings have noted shorter genetic distance between isolates 2,3,13 , and in one case the distances were insufficient to reliably and consistently inform contact tracing interventions 13 . It is possible, that certain features of our selected cluster have led to the observation of   www.nature.com/scientificreports www.nature.com/scientificreports/ such high levels of diversity. First, our cluster spans the spectrum of pan-susceptible to resistant against seven drugs. Second, epidemiological links were known for only three pairs of patients in the cluster. Third, our isolates all belong to lineage 4, a lineage that has been noted to be the most phenotypically and genotypically diverse of the TB lineages 56 . However, the proportion of diversity that could be linked directly to drug resistance was low. A parsimonious explanation of the high degree of observed genomic diversity is that the rate of MIRU-VNTR pattern evolution is on average slow and on the order of decades. Despite this, MIRU-VNTR likely offers sufficient resolution in low prevalence settings as most TB cases there tend to be imported 2,3 .
The genes in the PE and PPE families constitute about 10% of the TB genome and have been grouped together based on the proline-glutamate (PE) and proline-proline-glutamate (PPE) signature motifs but members of this family are scattered throughout the genome and have diverse functions 57,58 . Because they carry a high GC content and contain repetitive areas they have been typically excluded from analysis of sequencing data 59 . However, recent advances in sequencing technology allow for longer read lengths and increased throughput, which when combined with more accurate bioinformatics pipelines, makes it possible to call variants in a proportion of these genes with high confidence. We identified several high quality indels and variation in the PE/PPE regions in our dataset. The commonly used cutoff of ≤5 SNPs to infer transmission does not take into account the different evolutionary rate of these regions that may be driven by intra-genomic recombination or other mechanisms. In our study, when identifying closely related strains with the inclusion of high confidence PE/PPE regions, the number of possible links between strains decreased. An accepted standard that accounts for variation in these regions would allow for improved resolution of transmission events. Although, comparison of ancestral relationships in the phylogenetic trees with and without the inclusion of indels did not show significant differences, there were notable differences within the closely related cluster, highlighting that similarity measures that rely on SNPs alone could be misleading. Inclusion of indels and PE/PPE regions in the estimation of divergence dates is limited by our current lack of knowledge regarding their evolutionary rates or clock-like behavior. Prior studies support that at least a proportion of them accumulate variation in a manner consistent with lineage 44 and coupled with our observation that these regions account for an appreciable proportion of variation seen between closely related isolates, including PE/PPE variants is likely to inform transmission inference.
We identified many genomic links using the SNP distance threshold of ≤5 criterion 3 that were not discovered within household contact investigation, providing evidence that household contact investigation is not sufficient to identify and treat secondary TB cases as transmission can occur anywhere in the community 60,61 . Additionally, 2 of 3 case pairs that belonged to the same household were found to have large genetic distances making it more likely that transmission occurred outside the household. Although the dataset used was relatively small, these findings add to the current limited literature on the topic 60,62,63 . Overall our study highlights the utility of WGS in resolving transmission links particularly in high burden settings where several transmission chains may occur simultaneously. WGS of a well characterized cluster through MIRU-VNTR led to identification of several sub-clusters with further granularity achieved from addition of variants in regions that are routinely excluded from these analyses. With decreasing cost of WGS, sequencing data could be integrated with epidemiological investigation in lieu of traditional fingerprinting methods to identify transmission clusters and for reconstruction of contact networks, particularly given the increasing emphasis on active case finding for TB elimination 17 .
Our phylogenetic dating procedures support the conclusion that the acquisition of MDR is not recent in Lima, and that MDR cases, given the observed phylogenetic structure, are mostly related to transmission. This finding is consistent with other studies carried out in other countries, e.g. South Africa 18 . Within the MDR subcluster, ethambutol and pyrazinamide resistance was acquired and transmitted. This finding is similar to that of a study www.nature.com/scientificreports www.nature.com/scientificreports/ in Uganda 64 , where pyrazinamide resistance typically arose in MDR strains with several different causative mutations. This highlights the importance of testing for pyrazinamide resistance in order to determine benefit of its use in MDR-TB treatment regimens 65 .
Phenotypic resistance could not be explained by genotype in a few isolates including in our study. To this effect, we undertook a GWAS procedure to identify drug resistant phenotype-genotype associations. We identified the intergenic SNP 3778221AG 30 bp downstream of the putative tRNA/rRNA methylase gene spoU to be significantly associated with ethambutol resistance. Although a causative mechanism for how this variant modulates EMB susceptibility or fitness is not clear, this finding is supported further by a recent large genome-wide association study of 1452 MTB isolates 66 .
The cluster under study was the largest such cluster observed in the Lima household transmission study. Its transmission success was likely due to both bacterial and host factors. We quantified the host predilection to transmit TB with the PTP measure and found the cluster to have a higher score than the median PTP measure reported by a study in Netherlands 52 . Our study had five patients with a particularly high PTP above the highest reported value of 3·9 52 , potentially contributing to transmission in the population we studied. A few prior studies have characterized bacterial genetic factors that contributed to increased transmissibility 24,67,68 . We add to this literature by identifying five cluster defining SNPs to be under positive selection in a large TB genomic dataset. One of these SNPs (esxV S23L) is a member of the ESAT-6 family of secreted proteins, some of which have been shown to be involved in host-pathogen interactions and may thus have contributed to increased transmissibility 68,69 .
Our study had several limitations. Contact tracing was done within household contacts and hence epidemiological links in the community were possibly missed. Sequencing a subset of isolates from the cluster may have led to missed links along the transmission chain. It may also have led to an underestimation of the diversity. However, the sampled subset demonstrated a substantial amount of diversity, more than would be expected within a cluster with identical MIRU pattern 2,3 . We also cannot exclude that the 2 outer most isolates were mis-assigned the reported MIRU pattern and because of this we focused on the isolates confirmed to be of the same lineage by in silico spoligotyping and the WGS SNP barcode. Finally, it is important to note that our dating estimates are heavily reliant on the molecular clock rate that has been previously reported in the literature.
In summary, our findings add to the evidence challenging the traditional interpretation of a MIRU-VNTR cluster as indicating recent transmission and suggest that the benefits of WGS over MIRU-VNTR may be even more prominent in high prevalence settings when TB transmission has been ongoing without interruption, especially when high confidence PE/PPE and indel genetic variants are included. WGS can also provide insights into biology of MTB to improve our understanding of DR, transmission and host-pathogen interaction.

Data Availability
Mycobacterium tuberculosis genome data were deposited in the NCBI BioProject database (ID: PRJ-NA343736 and PRJNA526078). Individual accession numbers for genomes analyzed in this study are given in Supplementary Data 3.