Tuberculosis (TB) remains among the top ten causes of deaths globally, with 10.4 million new cases in 2016 alone1. Peru remains a high burden country for multidrug-resistant (MDR) TB with 117 TB cases reported per 100,000 population in 2016 and approximately 9% being MDR or rifampicin-resistant (RR)1. Molecular methods have been instrumental in identifying outbreaks, and single nucleotide polymorphism (SNPs) identified using whole genome sequencing (WGS) have a higher resolution in identifying transmission links compared to traditional genotyping methods such as spoligotyping or Mycobacterium interspersed repetitive unit-variable number tandem repeats (MIRU-VNTR)2,3,4,5,6,7,8,9,10,11,12. Yet we don’t yet fully understand how to use all the genetic information generated by WGS. By convention, as much as 10% of the genome is excluded2,3 and in some instances too little remaining variation is found to enable the resolution of transmission chains13. Resolving transmission events accurately is particularly challenging in high-burden settings where multiple source case suspects are common. In addition to guiding public health interventions including appropriate contact tracing, identifying the source case can inform patient care in some cases, such as in pediatric TB where the source case microbiological data can inform treatment14,15. Furthermore, in high-burden countries the term ‘outbreak’ may not apply as TB has been circulating continuously for decades16. Given the renewed emphasis on active case finding17 and the widespread adoption of WGS, an intensification of the latter in high-burden settings is needed.

Control efforts against TB have been undermined by the emergence and spread of drug resistant TB (DR-TB). Current evidence suggests that most cases of DR-TB are a result of transmission rather than de-novo evolution of the bacteria during treatment18,19,20. Factors known to contribute to DR-TB transmission include delays in diagnosis and treatment21, host factors (e.g. age, immune status22,23,24), as well as bacterial factors such as fitness and immunogenicity characteristics25,26,27. It is well recognized that Mycobacterium tuberculosis (MTB) strains with the same DR-conferring mutations have a range of fitness28,29,30. However, to date few molecular fitness determinants have been characterized and seldom in the context of high transmissibility31,32,33,34. Improved knowledge of such bacterial factors can inform efforts for transmission interruption by identifying targets for diagnosis, surveillance, and even potential therapeutics targeting fitness mechanisms. Here we use WGS data to examine the largest TB MIRU-VNTR cluster spanning pan-susceptible to MDR-TB isolates that was identified in 4,000 TB patients enrolled in a household transmissibility study. We examine both host data and TB genotypic data to understand the evolution of isolates within this cluster, infer the timing of emergence of antibiotic resistance, and identify genetic bacterial factors unique to this cluster that may have contributed to its success.

Materials and Methods

Study Design

A TB household transmissibility and treatment outcome study was performed in northern Lima, Peru from September 2009 to August 2012. The study procedures including patient enrollment and consent have been previously described35,36. Briefly, informed consent was obtained from participants or their parents or guardians. Patients were enrolled if they were diagnosed with pulmonary TB (PTB) at public health clinics and were followed through therapy. Their household contacts were also followed with tuberculin skin testing and monitored for development of TB for a period of 12 months. The following were collected at time of TB diagnosis: clinical signs and symptoms, sociodemographic characteristics (e.g. age, gender, occupation, household type), geographical coordinates of household and health center, co-morbidities (HIV status, diabetes mellitus, renal disorder), as well as alcohol, tobacco and drug use.

Approval was obtained from the Research Ethics Committee of the Peruvian National Institute of Health (Lima, Peru) and the Committee on Human Studies at Harvard Medical School (Boston, MA). All research was performed in accordance with relevant guidelines and regulations.

Culture, DST and genotyping

Lowenstein-Jensen (LJ) culture was performed from sputum specimens using standard NALC-NaOH decontamination. All sputum cultures positive for MTB complex subspecies tuberculosis were subjected to first-line drug susceptibility testing (DST) for isoniazid (critical concentration of 1 µg/mL and 0.2 µg/mL), rifampicin (40 µg/mL), streptomycin (4 µg/mL) and ethambutol (2 µg/mL) using the proportion method. Pyrazinamide susceptibility (100 µg/mL) was measured using the Wayne method37. Any resistant strains underwent further DST for second-line agents performed using the indirect proportion method on 7H11 agar as previously described38,39,40. DNA was extracted and genotyped by 24-loci MIRU-VNTR using standard methods41. A ‘cluster’ in the study was defined as a group of isolates having an identical MIRU-VNTR pattern.

Whole-genome sequencing (WGS) and variant calling

Sixty-three isolates collected between 2009 and 2012 were available for sequencing from the largest MIRU-VNTR cluster (n = 148) found in the study (Supplementary Table 3) based on the availability of culture for DNA extraction in 2013. DNA extraction, sequencing and read processing is described in the supplementary methods. Raw reads were processed and variants were called using a custom bioinformatics pipeline42 (supplementary methods).

Phylogenetic analysis

A multiple sequence alignment was generated as a concatenate of allelic states at all sites found to be variable using a custom script in Perl 5.10.142. We generated an alignment containing only substitutions and another with both substitutions and insertions and deletions (commonly referred to as ‘indels’). Variants in genes implicated in drug resistance43, transposases, and genes coding for proline-glutamate (PE) or proline-proline-glutamate (PPE)44 (commonly referred to as PE/PPEs) were excluded from phylogeny building by convention (e.g. due to high levels of recombination45), and also because drug resistance genes are under selective pressure and are expected to bias tree structure. A maximum likelihood phylogenetic tree was built for both alignments using RAxML 8.2.1146 as implemented in the R package ips47 (supplementary methods). Estimation of divergence times and timing of drug resistance acquisition was performed using BEAST 1.8.448,49 using an uncorrelated lognormal relaxed clock that allows for tree branches to evolve at different rates. The prior on the mean clock rate was assumed to be 0.5 SNP per genome per year based on published data3. We calculated the range of dates based on a 95% highest posterior density interval (HPDI) which indicates the smallest interval that includes 95% of the posterior probability distribution50. Please see supplementary methods for further details of the phylogenetic analysis.

To determine the number of times DR arose within the group of isolates, we examined the maximum clade credibility tree (where bifurcations denote internode branches supported by a posterior probability >0.5). We then counted the minimum number of paraphyletic switches from a sensitive to a resistant phenotype for each drug supported by bifurcating/high-confidence nodes.

SNPs occurring within a high-transmission cluster were further evaluated against other TB lineages using a large TB WGS database containing 12,032 strains curated from the literature through NCBI and the Reseq TB initiative51 (Supplementary Data 1). To determine SNPs with phylogenetic convergence, mutations in the high-transmission cluster that were present in more than 5% of strains in at least two other main lineages (i.e. lineages 1, 2 and 3) in addition to lineage 4 were identified as likely being under positive selection. As some of the hit SNPs occurred in repetitive regions, we confirmed their accuracy by remapping simulated reads from different reference genomes and using Pacific Biosciences (PacBio) long-read sequences (supplementary methods).

Other data analysis

Variation in PE/PPE regions and indels between closely related strains was confirmed via visualization and checked for false positives due to copy number variants (supplementary methods). To study host-related factors that may be associated with transmission, the propensity to propagate (PTP) method was used as previously described52.


Patient and isolate characteristics

A large cluster of 148 isolates, collected from 147 patients, with identical MIRU-VNTR pattern was identified. The majority of the patients in this cluster were male, HIV-negative and had multidrug resistant (MDR) TB (Table 1). About one-half were smear-positive (51.4%) and one-third used alcohol (35.8%) or other intoxicating substances (27.9%). From this cluster, 63 isolates were available for sequencing (supplementary methods). Two patients had isolates that did not meet our sequencing quality criteria and were subsequently excluded. Of those patients with high quality sequence data (n = 61) a higher majority were male with MDR-TB but were otherwise comparable to the superset of 148 (Table 1). Sequencing data revealed that 58 isolates belonged to the Latin America-Mediterranean LAM-4.3.3 sublineage, and three isolates were more distant and belonged to the sublineages X -4.1.1, T-4.8, and LAM-4.3.2. In the LAM-4.3.3 group, we found 371 SNPs and 81 indels in total. Of these, 24 substitutions and one indel occurred in DR conferring regions, and 42 substitutions and 23 indels in PE/PPE genes. With the exclusion of DR conferring regions, the average pairwise SNP difference between isolates was 21.69 (range: 0–84) and 22.7 (range: 1–100), excluding and including the PE/PPE regions, respectively.

Table 1 Patient characteristics. None of the variables were significantly different between the patient with and without sequencing data using a t-test or a Chi-squared test.

The geographic distribution of strains based on household coordinates, colored by resistance pattern is shown in Fig. 1. Comparison between genetic and geographic distance did not support that the cluster spread in a single geographic direction, even when three most distant strains were excluded (P = 0.2, Supplementary Fig. 1).

Figure 1
figure 1

Geographic location of strains with resistance pattern. (A) All 148 strains with identical MIRU-VNTR patterns, (B) 61 strains that were sequenced. Maps generated using the plot() function as implemented in R package OutbreakTools70. MDR: multidrug-resistant (resistant to both isoniazid and rifampin).

The SNP-based phylogenetic tree (Fig. 2, Supplementary Fig. 2) demonstrated that the most genetically homogeneous group consisted of 57 isolates. These formed two main clades, where the first contained isolates that were pan-susceptible or streptomycin mono-resistant, and the second consisted of INH mono-resistant or MDR isolates. Using a prior on the mean clock rate of 0.5 SNP per genome per year3 (supplementary methods), the origin of the 61 isolates was estimated around the middle of the 14th century (1336 CE; 95%HPD 855–1680). The LAM-4.3 cluster of 58 isolates diverged ca. 1923 (95%HPD 1856–1967) and innermost cluster of 57 isolates diverged ca. 1968 (95%HPD 1945–1985).

Figure 2
figure 2

Bayesian maximum clade credibility phylogenetic tree created via BEAST (using single nucleotide polymorphisms) of 61 strains with nodes in increasing order of age. Numbers at nodes are posterior means of node ages (years ago). Node ages <15 years are not shown for clarity. Bars represent 95%HPD interval for node age. Color of tip represents drug susceptibility - Green: pan-susceptible, Dark Red: Resistant only to Isoniazid or Rifampicin, Dark Green: Resistant to a drug other than Isoniazid or Rifampicin, Red: multi-drug resistant.

Epidemiological vs. genotypic data

We compared the use of genetic and epidemiological data for transmission inference. PE/PPE region differences and indels between closely related pairs were visualized in Integrated Genome Viewer and confirmed to be high confidence (supplementary methods). Sequencing data was available for three of seven household contact pairs in the cluster of 147 patients. Only one household link (index/parent M23 – contact/child M12) was consistent with a recent transmission event on the tree and by genetic distance (SNP difference = 1). When high-confidence PE/PPE regions were included, the genetic distance between this pair increased to 3 variants. With the further addition of high confidence indels the pairwise distance increased to 4, in the predicted 5 year interval (Supplementary Data 2). No variation was observed in this pair in DR-related loci. The other two isolate pairs from household contacts (index/parent M02 – contact/child M01, index/child M45 – contact/parent M58) were 17 and 421 SNPs apart, respectively. Isolates M01 and M02 were genetically closer to other isolates on the tree (M01-M28 6 SNPs apart and M02-M42-M11-M37 all within 4 SNPs of each other). A pair of isolates collected two months apart from a host who had not been on treatment (MDR strains M06 and M10) was found to have no SNP differences outside of PE/PPE regions. In PE/PPE regions, we found 5 high-quality SNPs; similarly, 2 indels were observed in other regions.

Looking at genetic evidence alone for recent transmission using a distance cutoff of ≤5 SNPs (excluding DR and PE/PPE regions)3, 139 links among 38 patients were identified (Fig. 3). Other than one pair (index/parent M23 and contact/child M12), none of these belonged to the same household. With the addition of high confidence indels and PE/PPE SNPs we used a cutoff of ≤12 variants: 5 SNPs plus 7 PE/PPE and indel variants based on the two serial isolates available from the same patient described above. Using this added variation and the cutoff of ≤12, there were 104 links among 38 patients, i.e. 25% fewer links than when these variants were excluded. Phylogenetic trees built by including indel variation also had notable differences within the cluster of 57 isolates (Supplementary Fig. 3).

Figure 3
figure 3

Inferred transmission network based on genomic distance. Shown is a network graph depicting strains that were less than or equal to 5 single nucleotide polymorphisms (SNPs) apart (with the exclusion of PE/PPE regions and regions coding for drug resistance). Each vertex represents a strain or group of strains, edges are colored to denote number of SNP differences between the connected vertices. IG = Identical Group – consists of strains that had zero SNP difference, IG1 = M09, M14, M01; IG2 = M06, M10; IG3 = M04, M07; IG4 = M18, M28. Legend shows number of different SNPs and corresponding color of the edges. Plot generated using R package igraph.

Of the 375 isolates sequenced from a TB outbreak in London13, 325 (86.67%) met our quality criteria and were further examined. Using a genetic distance cut-off of ≤5 SNPs (excluding DR and PE/PPE regions), 309 of these isolates diversified into one large interconnected cluster consisting of 31,776 links. Among 38 serial isolates that were collected from 19 patients from that outbreak, the largest difference between a pair with the inclusion of indels and PE/PPE regions was also found to be 7 SNPs. When applying this as the threshold for identifying a genetic link, the interconnected cluster was reduced to 294 strains with 28,230 links, i.e. 11% fewer links than when PE/PPE variants and indels were excluded.

Host factors

We measured host infectiousness in the cluster using the ‘propensity to propagate’ (PTP) method52 and identified five patients as having the highest possible score (PTP > 4). This was related to patients being younger (20–29 years old) males with smear positive PTB and a history of substance use (Supplementary Fig. 4). Three of these were identified to be within the network of patients with genetically close MDR isolates (Fig. 3). The mean cluster PTP was also high at 1.699.

Drug resistance

First and second-line drug resistance was acquired several times within the core cluster of 57 isolates (Table 2, Supplementary Fig. 5, and Supplementary Table 1).

Table 2 Drug resistance acquisition in MTB strains based on phenotype.

First line drugs

Isoniazid (INH) resistance was acquired at least twice, and in both times with a katG S315T mutation, ca. 1968 (95%HPD: 1945–1985) and ca. 1972 (95%HPD 1952–1985). Rifampicin resistance (RR) was acquired within the large INH-resistant clade at least twice ca. 1980 (95%HPD: 1964–1990, rpoB D435V) and ca. 1986 (95%HPD: 1973–1996, most frequent variant rpoB S450L). Pyrazinamide (PZA) resistance followed RR and was acquired at least 4 times, the most frequent mutation was pncA Q10R acquired ca. 1980 (95%HPD: 1964–1990), the most recent PZA resistance acquisition event was predicted to be within the last year of isolation (95%HPD: 2006–2012). Ethambutol resistance was acquired at least twice within the MDR clade and contemporaneous with RR acquisition in both cases. The mutation Y319S was the most common embB mutation observed.

Second line drugs

Of the 25 strains tested for ciprofloxacin, one MDR isolate (M43) acquired resistance ca. 1996 (95%HPD: 1979–2009). Similarly, only one (M30, which was also resistant to capreomycin) of the 41 isolates tested for kanamycin acquired resistance ca. 2009 (95%HPD: 2006–2012). We were not able to identify any mutations in gyr or rrs to explain resistance to these two drugs. For capreomycin, seven of forty tested isolates were resistant and carried the tlyA G232D mutation estimated to have been acquired ca. 2005 (95%HPD: 2001–2008).

Resistance and Bacterial Fitness

As there were several isolates measured to be resistant by the culture-based method that did not harbor any known resistance mutations, e.g. for EMB and PZA, we attempted a phylogeny-based genome wide association within the group of 61 isolates to identify new mutations associated with resistance (Supplementary Table 2). In addition to identifying the known mutations that confer resistance to isoniazid and rifampicin, we found an association between EMB resistance and a mutation (3778221AG) in the intergenic region between spoU and PE-PGRS51 genes (20 bp from spoU end and 347 bp before PE-PGRS51 start), corresponding to the acquisition of EMB resistance ca. 1980 (95%HPD: 1964-1990) shown in Supplementary Fig. 5.

We identified 175 mutations that were unique to the core cluster of 57 isolates and were absent from the four more distantly related isolates. Hypothesizing that a subset of these mutations may have contributed to transmissibility of this cluster, we measured which are under positive selection by looking at 12,032 other MTB isolates. Five mutations met our criteria for positive selection i.e. were found to have a frequency of >5% in at least three other TB lineages (lineage 1, 2, 3, and non-LAM-4) (Fig. 4). Of the five, two occurred in genes with known function, esxV which is an ESAT-6 like secreted protein and cobD, a cobalamin biosynthesis protein. As three of the five genes are known to contain repetitive regions, the accuracy of the convergent mutation calls was verified by simulating and remapping Illumina reads carrying the variant, and using PacBio long read sequences that was available for one strain (supplementary methods).

Figure 4
figure 4

Single nucleotide polymorphisms in high-transmission cluster of 57 strains showing phylogenetic convergence and their percent frequency among Mycobacterium tuberculosis lineages. LAM: Latin America-Mediterranean.


In this detailed analysis of a MIRU-VNTR cluster with variable degrees of drug resistance from a high prevalence setting, we show that traditional genotyping methods have a significantly lower resolution in identifying transmission clusters as compared to WGS, particularly when variation in PE/PPE regions, indels are incorporated into the analysis. Additionally, we found complex evolutionary patterns within an otherwise identical MIRU cluster and identified the interplay of host and epidemiological factors contributing to transmission potential of a cluster.

The higher resolution gained by WGS is consistent with prior reports2,3,53,54,55, but the maximum genetic distance we find between the clustered isolates is larger than previously seen and we further estimate the group of LAM-4.3.3 sequenced isolates to have been circulating for over 8 decades in our study community. Previous studies performing WGS of MIRU-VNTR clusters in low prevalence settings have noted shorter genetic distance between isolates2,3,13, and in one case the distances were insufficient to reliably and consistently inform contact tracing interventions13. It is possible, that certain features of our selected cluster have led to the observation of such high levels of diversity. First, our cluster spans the spectrum of pan-susceptible to resistant against seven drugs. Second, epidemiological links were known for only three pairs of patients in the cluster. Third, our isolates all belong to lineage 4, a lineage that has been noted to be the most phenotypically and genotypically diverse of the TB lineages56. However, the proportion of diversity that could be linked directly to drug resistance was low. A parsimonious explanation of the high degree of observed genomic diversity is that the rate of MIRU-VNTR pattern evolution is on average slow and on the order of decades. Despite this, MIRU-VNTR likely offers sufficient resolution in low prevalence settings as most TB cases there tend to be imported2,3.

The genes in the PE and PPE families constitute about 10% of the TB genome and have been grouped together based on the proline-glutamate (PE) and proline-proline-glutamate (PPE) signature motifs but members of this family are scattered throughout the genome and have diverse functions57,58. Because they carry a high GC content and contain repetitive areas they have been typically excluded from analysis of sequencing data59. However, recent advances in sequencing technology allow for longer read lengths and increased throughput, which when combined with more accurate bioinformatics pipelines, makes it possible to call variants in a proportion of these genes with high confidence. We identified several high quality indels and variation in the PE/PPE regions in our dataset. The commonly used cutoff of ≤5 SNPs to infer transmission does not take into account the different evolutionary rate of these regions that may be driven by intra-genomic recombination or other mechanisms. In our study, when identifying closely related strains with the inclusion of high confidence PE/PPE regions, the number of possible links between strains decreased. An accepted standard that accounts for variation in these regions would allow for improved resolution of transmission events. Although, comparison of ancestral relationships in the phylogenetic trees with and without the inclusion of indels did not show significant differences, there were notable differences within the closely related cluster, highlighting that similarity measures that rely on SNPs alone could be misleading. Inclusion of indels and PE/PPE regions in the estimation of divergence dates is limited by our current lack of knowledge regarding their evolutionary rates or clock-like behavior. Prior studies support that at least a proportion of them accumulate variation in a manner consistent with lineage44 and coupled with our observation that these regions account for an appreciable proportion of variation seen between closely related isolates, including PE/PPE variants is likely to inform transmission inference.

We identified many genomic links using the SNP distance threshold of ≤5 criterion3 that were not discovered within household contact investigation, providing evidence that household contact investigation is not sufficient to identify and treat secondary TB cases as transmission can occur anywhere in the community60,61. Additionally, 2 of 3 case pairs that belonged to the same household were found to have large genetic distances making it more likely that transmission occurred outside the household. Although the dataset used was relatively small, these findings add to the current limited literature on the topic60,62,63. Overall our study highlights the utility of WGS in resolving transmission links particularly in high burden settings where several transmission chains may occur simultaneously. WGS of a well characterized cluster through MIRU-VNTR led to identification of several sub-clusters with further granularity achieved from addition of variants in regions that are routinely excluded from these analyses. With decreasing cost of WGS, sequencing data could be integrated with epidemiological investigation in lieu of traditional fingerprinting methods to identify transmission clusters and for reconstruction of contact networks, particularly given the increasing emphasis on active case finding for TB elimination17.

Our phylogenetic dating procedures support the conclusion that the acquisition of MDR is not recent in Lima, and that MDR cases, given the observed phylogenetic structure, are mostly related to transmission. This finding is consistent with other studies carried out in other countries, e.g. South Africa18. Within the MDR subcluster, ethambutol and pyrazinamide resistance was acquired and transmitted. This finding is similar to that of a study in Uganda64, where pyrazinamide resistance typically arose in MDR strains with several different causative mutations. This highlights the importance of testing for pyrazinamide resistance in order to determine benefit of its use in MDR-TB treatment regimens65.

Phenotypic resistance could not be explained by genotype in a few isolates including in our study. To this effect, we undertook a GWAS procedure to identify drug resistant phenotype-genotype associations. We identified the intergenic SNP 3778221AG 30 bp downstream of the putative tRNA/rRNA methylase gene spoU to be significantly associated with ethambutol resistance. Although a causative mechanism for how this variant modulates EMB susceptibility or fitness is not clear, this finding is supported further by a recent large genome-wide association study of 1452 MTB isolates66.

The cluster under study was the largest such cluster observed in the Lima household transmission study. Its transmission success was likely due to both bacterial and host factors. We quantified the host predilection to transmit TB with the PTP measure and found the cluster to have a higher score than the median PTP measure reported by a study in Netherlands52. Our study had five patients with a particularly high PTP above the highest reported value of 3·952, potentially contributing to transmission in the population we studied. A few prior studies have characterized bacterial genetic factors that contributed to increased transmissibility24,67,68. We add to this literature by identifying five cluster defining SNPs to be under positive selection in a large TB genomic dataset. One of these SNPs (esxV S23L) is a member of the ESAT-6 family of secreted proteins, some of which have been shown to be involved in host-pathogen interactions and may thus have contributed to increased transmissibility68,69.

Our study had several limitations. Contact tracing was done within household contacts and hence epidemiological links in the community were possibly missed. Sequencing a subset of isolates from the cluster may have led to missed links along the transmission chain. It may also have led to an underestimation of the diversity. However, the sampled subset demonstrated a substantial amount of diversity, more than would be expected within a cluster with identical MIRU pattern2,3. We also cannot exclude that the 2 outer most isolates were mis-assigned the reported MIRU pattern and because of this we focused on the isolates confirmed to be of the same lineage by in silico spoligotyping and the WGS SNP barcode. Finally, it is important to note that our dating estimates are heavily reliant on the molecular clock rate that has been previously reported in the literature.

In summary, our findings add to the evidence challenging the traditional interpretation of a MIRU-VNTR cluster as indicating recent transmission and suggest that the benefits of WGS over MIRU-VNTR may be even more prominent in high prevalence settings when TB transmission has been ongoing without interruption, especially when high confidence PE/PPE and indel genetic variants are included. WGS can also provide insights into biology of MTB to improve our understanding of DR, transmission and host-pathogen interaction.