Tracing Mycobacterium tuberculosis transmission by whole genome sequencing in a high incidence setting: a retrospective population-based study in East Greenland

In East Greenland, a dramatic increase of tuberculosis (TB) incidence has been observed in recent years. Classical genotyping suggests a genetically similar Mycobacterium tuberculosis (Mtb) strain population as cause, however, precise transmission patterns are unclear. We performed whole genome sequencing (WGS) of Mtb isolates from 98% of culture-positive TB cases through 21 years (n = 182) which revealed four genomic clusters of the Euro-American lineage (mainly sub-lineage 4.8 (n = 134)). The time to the most recent common ancestor of lineage 4.8 strains was found to be 100 years. This sub-lineage further diversified in the 1970s, and massively expanded in the 1990s, a period of lowered TB awareness in Greenland. Despite the low genetic strain diversity, WGS data revealed several recent short-term transmission events in line with the increasing incidence in the region. Thus, the isolated setting and the uniformity of circulating Mtb strains indicated that the majority of East Greenlandic TB cases originated from one or few strains introduced within the last century. Thereby, the study shows the consequences of even short interruptions in TB control efforts in previously TB high incidence areas and demonstrates the potential role of WGS in detecting ongoing micro epidemics, thus guiding public health efforts in the future.

resolution in delineating TB outbreaks, even in settings with highly similar strains [8][9][10][11] . Thus, WGS provides a unique opportunity to study transmission of Mtb in one of the world's least populated regions with one of the highest TB incidences and with a low genetic diversity of circulating Mtb strains. By employing WGS, we reveal the long-term transmission dynamics and explore the potential implications for public health interventions in this TB high incidence setting.

Results
From 1992 through 2012, 287 East Greenlandic TB cases were notified, of which 185 (64%) were Mtb culture-positive. For the remaining 102 cases, the laboratory register held information for cases diagnosed after 2000 (n = 91); four were confirmed by microscopy or PCR, 84 were culture-and microscopy negative and three had had no diagnostic samples examined at Statens Serum Insitut (SSI). Isolates from two patients could not be located, and library preparation failed for one isolate. Therefore, 182 (98%) culture-positive TB cases were included in our study (Table 1).
Mycobacterial interspersed repetitive units-variable number of tandem repeats (MIRU-VNTR) typing showed 10 unique strains and six clusters differentiated into two distinct complexes (Fig. 2). Two large clusters (1139-15 and 1144-17) comprised the majority of isolates (n = 134). WGS was successfully performed for all isolates, reaching an average coverage depth of the reference genome of at least 50 fold (range 51.6-437.4, median 93.5) and 98% of the reference genome was covered by at least one read. The median fraction of the reference genome fulfilling unambiguous criteria for variant detection was 98% (range 97-99).
Four genomic clusters (GCs) as defined by a maximum distance of 12 SNPs and nine ungrouped strains were detected (Fig. 2). The majority of isolates belonged to GC4 (n = 122) and GC3 (n = 26). While all isolates from the large MIRU-VNTR cluster 1144-17 belonged to GC4, WGS divided the 1139-15 cluster into two clearly distinct groups differentiated by 22 single nucleotide polymorphisms (SNPs), with the larger group belonging to GC4 and the smaller group to GC3. Apart from this, GC1 (n = 20) and GC2 (n = 5) correlated with specific MIRU-VNTR genotypes. Interestingly, a few isolates with identical WGS genotype were split by MIRU-VNTR typing, thus, demonstrating the independent evolution of the markers. All isolates were also differentiated in major lineages and sub-lineages according to the schema proposed by Coll et al. 12 . All strains belonged to the Euro-American lineage (lineage 4), more specifically the 4.4.1.1 (GC1), 4.8 sub-lineage (GC2, GC3, and GC4), and one isolate from the 4.3.3 sub-lineage (Fig. 1).
Isolates within GCs were closely connected, with more than half of isolates exhibiting a maximum distance of 5 SNPs to at least one other isolate. Yet, despite the close relationship of isolates, WGS suggested several sub-clusters within GC4 (Fig. 2). Two of these sub-clusters, GC4-A and GC4-B, are of special interest, since they both expanded after 2005 (95% HPD 2002-2008 and 2002-2007, respectively) (Fig. 3). The smallest of these two sub-clusters (GC4-A) comprised 18 highly similar isolates from TB cases diagnosed 2009-2012, including 11 isolates with identical genomes (Fig. 2). Twelve of these 18 TB cases were from the same settlement (Kuummiut). The remaining six cases were either reported as designated contacts to TB cases from Kuummiut (n = 2), or former residents of the settlement (n = 4). The larger sub-cluster GC4-B comprised 31 TB cases diagnosed from 2008-2012, of which 20 had identical genomes. The majority (n = 29/31) were diagnosed in the town of Tasiilaq (n = 29/31). Of the 30 cases diagnosed from 2010-2012, the majority (n = 26) were 16-23 years at time of diagnosis. Thus, they likely attended the same school and/or were part of the youth environment in Tasiilaq. Contact tracing links existed among 15 cases, which linked 4 of the 5 older cases to other cases in the sub-cluster. Contact tracing also confirmed possible transmission within other smaller sub-clusters of GC4, e.g. one sub-cluster including four young adults from smaller settlements, of whom three were brothers and/or have lived together at the boarding home in Tasiilaq. Thus, epidemiological data provided further evidence of transmission events within WGS suggested sub-clusters.
When analyzing the genome-based phylogeny dynamically over time, a simple algorithm that considers a minimum number of closely related isolates in one year was able to detect potential local outbreaks (Supplementary figure 1) likely associated with active transmission in the next year. That is, 79% (70/89) of identified putative outbreak isolates in 1992-2011 were associated with transmission events in the following year, compared to 15% (11/73) of non-outbreak isolates (Supplementary Figure S2).

Discussion
This population-based study included 98% of all East Greenlandic culture-positive TB cases over two decades and allowed us to perform an in-depth analysis of the longitudinal Mtb transmission patterns in this unique, isolated, TB high incidence region. Our data indicated that the source of recent Mtb infections in East Greenland most likely dates back to the introduction of Euro-American strains within the last century, thereby indicating that recent TB in East Greenland is caused by reactivation of latent TB and subsequent transmission, even after decades of effective TB control measures [1][2][3][4] . Of the four genomic clusters detected, the closely related GC3 and GC4 were responsible for the majority of cases with a common ancestor that evolved 30-50 years ago and clonally expanded in the 1990s. In this challenging scenario for TB control, WGS detected several sub-clusters within the highly clonal strain population. Detection of these apparent micro epidemics by WGS is likely crucially important for outbreak characterization and a more targeted public health action in this setting.
Delineating Mtb transmission is difficult in high incidence settings with very similar strain populations and complex social interactions are challenging for TB control measures such as traditional contact tracing. Additionally, classical genotyping, e.g. MIRU-VNTR genotyping, is not efficient under these conditions. Here, WGS confirmed the clonality of Mtb strains in East Greenland. The majority of isolates were from sub-lineage 4.8, which was present in the region throughout the entire study period. This included isolates from the genomic clusters GC2, GC3 and GC4 and seven ungrouped isolates, all of which were within < 100 SNPs of each other. The most common MIRU-VNTR genotypes within this sub-lineage (1139-15 and 1144-17) have only been reported sporadically from the rest of the Danish Kingdom. In contrast, isolates from GC1 correspond to a previously  Dating analysis using Bayesian coalescent analysis suggested the introduction of the 4.8 sub-lineage into East Greenland some 100 years ago. Although we cannot exclude multiple introductions of this globally successful Mtb lineage into the region 15 , the limited genetic strain diversity and the isolated remote setting favours the hypothesis of indigenous evolution of these strain types in East Greenland. Our dating analysis nicely correlates with the foundation of a Danish colony in Tasiilaq in 1894, which promoted contact with Denmark and the rest of Greenland 16 making the introduction of a strain from the Euro-American sub-lineage 4.8 thereafter plausible. An increasing TB incidence and Mtb infection prevalence described in East Greenland in the 1930s supports the hypothesis that Mtb was introduced a few years prior 17 . A Canadian study suggests an almost identical history of Mtb introduction into Nunavik, a high-incidence area in the Canadian arctic with even lower genetic strain diversity 10,11 .
Despite the close relationship of Mtb strains within the same genomic clusters, WGS allowed a detailed investigation of evolutionary history and longitudinal spread of the sub-lineage 4.8 population. Differentiation into the two largest genomic clusters (GC3 and GC4) presumably occurred in the 1970s. Since then, these two genomic clusters have spread to all East Greenlandic towns and settlements. In East Greenland, the 1970s was a time of great changes 18 . Many inhabitants moved from smaller settlements and contact between towns and settlements became easier with motorised boats and regular flights, potentially favouring the spread of Mtb in the following years. Yet, the most remarkable expansion of both genomic clusters occurred around 1990, and it is relevant to consider why this expansion occurred. In the 1980s, almost no TB was reported from East Greenland, or from Greenland in general 5,19 . TB was considered almost eliminated after decades of control efforts which led to a down-scaling of the TB control program. A lowered TB awareness in the 1980s and 1990s, could very likely have favoured conditions for Mtb expansion.
Our WGS analysis allowed a dissection of one MIRU-VNTR cluster (1139-15) into two distinct groups (Fig. 2). As described, one group caused TB in Ittoqqortoormiit (GC3), while the other caused TB in the rest of East Greenland (part of GC4). Ittoqqortoormiit, one of the most isolated towns in Greenland, was populated in 1925 by 87 persons from Tasiilaq 20 . However, the most recent common ancestor of GC3 and GC4 dates   back to the beginning of the 1970s. Thus, the Mtb strains causing TB in Ittoqqortoormiit have been introduced into this town after the 1970s. It is striking that after the initial introduction of an ancestral GC3 strain into Ittoqqortoormiit, transmission of GC3 strains (and of GC1 strains as discussed above) has stayed remarkably confined 13 . Such a geographically confined Mtb transmission is also reported from the similar setting of Nunavik in Canada 11 . As suggested by the Canadian authors, this has important implications for future efforts to control TB in similar settings, since control measures should primarily be focused on the local community when an apparent TB outbreak is recognized.
In 2009, a sudden increase in TB incidence was noted in the small settlement of Kuummiut and subsequently in the rest of East Greenland. This led to speculations whether inhabitants from Kuummiut had brought TB to the rest of the region, supported by reports of the same MIRU-VNTR cluster (1139-15) causing TB in both Kuummiut and Ittoqqortoormiit and a very similar strain causing TB in Tasiilaq (1144-17). However, as discussed above, WGS shows a clear dissociation between 1139-15 strains from Ittoqqortoormiit and the rest of East Greenland. Likewise, the strains that caused TB in Kuummiut were only found in this settlement or in epidemiologically linked individuals (GC4-A). Interestingly, GC4-A strains and strains from a large sub-cluster causing TB in Tasiilaq (GC4-B) were the result of consecutive bacterial expansion in 2005. Thus, it is unlikely that the increasing TB incidence in Kuummiut after 2009 was directly related to the increasing TB incidence in the rest of East Greenland. In combination with the evidence that TB in East Greenland today is most likely caused by the same Mtb strain as in the early 20 th century, our study suggests that increasing TB incidence in recent years was not caused by one single coherent outbreak, but by several smaller outbreaks caused by genetically similar Mtb strains on multiple locations consecutively. Even though a causative interaction with lack of TB awareness in the 1990s is speculative, this study raises a difficult question: when it is safe to downscale TB control efforts in previously high incidence countries?
As mentioned earlier, delineating TB dynamics in high incidence settings is challenged by complex contact patterns and the possibility that very similar isolates spread consecutively. Several studies have applied WGS for outbreak investigations in TB low-and high incidence settings 8,9,21,22 , and the Mtb mutation rates of 0.3-0.5 SNPs per genome per year estimated in other studies 8,9,23,24 are very similar to the overall mutation rate of 0.47 (95% HPD 0.37-0.64) SNPs per genome per year estimated in our analysis. However, while defining a minimum distance of 12 SNPs as a threshold for unlikely recent transmission and a maximum distance of 5 SNPs as the threshold for likely transmission is feasible in low incidence settings 9 , such a threshold is more difficult to define in high incidence settings 22 . Within the four genomic clusters detected in East Greenland, more than 50% of pairwise distances between isolates were within 5 SNPs. Thus, setting a specific lower threshold for likely transmission, e.g. to confirm or reject patient-to-patient transmission, is difficult 22 . The possibility of intra-patient evolution and mixed infections further complicate the deduction of direct links between cases [25][26][27] . Still, WGS identified smaller sub-clusters within the largest genomic cluster (GC4), in which transmission events were likely to have occurred. Since backwards mutation and SNP homoplasy are rare events in Mtb evolutionary history, our finding underlines that WGS enables a specific detection of outbreaks even in clonal strain populations in high incidence settings 26,28,29 . We tested this further by applying a simple SNP-based cut-off at a maximum of 3 SNPs, as proposed by Roetzer et al. as the maximum distance in human-to-human transmission 8 . Interestingly, this approach enabled the detection of "hotspot cases" associated with new cases appearing in the following year, thus likely identifying active transmission chains. This potentially allows for a WGS surveillance system using an SNP-based algorithm for automated outbreak detection and targeted public health action, if performing equally well in other settings and populations ( Supplementary Figures 2 and 3).
The major strengths of this study are the completeness of included samples and the unique data linkage. However, all epidemiological analyses were made retrospectively, which potentially limited the accuracy of the data available. As an example, our analysis of the geographical distribution of samples were based on the place of diagnosis, which does not necessarily reflect a person's place of living or the complex intraregional travel patterns. Another major limitation in our analysis are potentially missing TB cases. First, undetected TB cases in the area could have resulted in underestimating Mtb transmission. Secondly, Mtb DNA could only be obtained from TB cases with microbiologically confirmed Mtb infection, of which the percentage was low. Thus, only two thirds of all notified TB cases were included in the study. However, almost all of the non-included TB cases were cultureand microscopy negative and therefore not likely to cause Mtb transmission to any greater extent.
In conclusion, the combination of WGS and epidemiological data provided a unique understanding of the clonal Mtb expansion in a remote, TB high incidence setting. Since TB in East Greenland in recent years is most likely caused by an Mtb strain introduced into the region within the last century, and subsequently reactivated and expanded during a period of lowered TB awareness, the study provides evidence of the consequences of even small interruptions in the TB control efforts in previously TB high incidence areas. This is an important message for Greenland as well as for similar settings worldwide. WGS can detect sub-clusters caused by scattered cases in a high incidence setting, thus paving the way for WGS-based surveillance for targeted public health action.

Methods
Study design and population. We conducted a population-based study including all culture-positive TB cases diagnosed in East Greenland from 1992-2012. In Greenland, TB is mandatory notifiable to the National Board of Health, and contact tracing is performed routinely by local TB nurses. All notified TB cases with Mtb culture-positive samples were identified in the laboratory register at Statens Serum Institut which receives all diagnostic samples from Greenland. Culture-positive samples are stored at minus 80 °C.
Greenlandic citizens receive a unique personal identifier at birth through the Civil Registration System (CRS) enabling linkage across all public registers. Demographic data and contact tracing data were obtained from TB notifications, medical records or the CRS. Contact tracing data were only available from 2008-2012.
Scientific RepoRts | 6:33180 | DOI: 10.1038/srep33180 Genotyping and sequencing. MIRU-VNTR 24-locus genotyping and DNA extraction from frozen isolates was performed as described elsewhere 30,31 . Libraries for WGS were prepared with the Nextera XT kit and run on Illumina next generation sequencing platforms (MiSeq, HiSeq 2500, Nextseq) as instructed by the manufacturer (Illumina, San Diego, CA). Reads were mapped to the genome of Mtb strain H37Rv (GenBank ID: NC_000962.3) with the exact alignment program SARUMAN 32 . SNPs with a minimum coverage of 10x and 75% allele frequency were called by customised Perl scripts. For the phylogenetic analyses, we excluded positions with SNPs in repetitive regions, in genes associated with drug resistance and neighbouring SNPs within a window of 12 base pairs. Only positions with a clear wild type or SNP allele call according to aforementioned thresholds in all isolates were combined for a concatenated SNP alignment (Supplementary Table S1). Raw data in the form of fastq files were submitted to the EMBL-EBI ENA sequence read archive (Supplementary Table S2).

Phylogenetic analysis.
A minimum spanning tree was created with the MIRU-VNTRplus database website. MIRU-VNTR clusters were defined as strains with identical genotyping patterns and clonal complexes by a maximum difference of two loci 33 . A maximum parsimony tree was calculated from 1,385 concatenated SNPs with Bionumerics ® version 7.5 (Supplementary Table S1) (Applied Math, Sint-Martens-Latem, Belgium).
Phylogenetic lineages were determined according to lineage specific SNPs 12 . Strains were grouped into genomic clusters (GCs) defined by a maximum distance of 12 SNPs between one isolate and the nearest neighbour, as previously suggested as the minimum threshold of transmission 9,34 . Hence, genomic clusters were defined as groups of isolates between which transmission could have occurred. For all cluster-specific variants detected for the four genomic clusters see supplementary Table S3. Within GC4, sub-clusters were defined as groups of isolates clustering together in the phylogenetic tree. Whether these sub-clusters were actual clusters of Mtb transmission was explored by comparing WGS suggested links with contact tracing links and epidemiological data.
Bayesian coalescent analysis. The concatenated SNP alignment was used for a Bayesian coalescent analysis using BEAST v.1.8.2 35 (Supplementary file S4). The mean mutation rate was estimated from sampling dates with a tip-dating approach. Different demographic models were compared using a general time reversible (GTR) substitution model (strong support over HKY substitution model, log 10 Bayes factor > 10) with a discrete gamma distribution with four rate categories, a random starting tree, a lognormal relaxed clock, and assuming a uniform prior distribution and chain lengths of 150,000,000 with 10% burn in. Demographic models were compared with Tracer v.1.5 showing effective sample size (ESS) values in the hundreds and showed adequate mixing and convergence of the Markov chains. The likelihood of the different demographic models did not vary significantly (log 10 Bayes factor < 2), thus the coalescent constant size model (simplest model) were selected for dating analysis. A maximum clade credibility tree was created with TreeAnnotator v1.8.2 to identify diversification events giving the MRCA with 95% HPD.

Supplementary analysis.
In an approach to analyse genome-based phylogeny dynamically over time, we tested an SNP-based outbreak detection algorithm. Potential outbreaks were defined as three or more isolates detected with a maximum distance of 3 SNPs from each other in one calendar year. Additionally, isolates were considered to be connected to an ongoing outbreak if three or more isolates were detected in the following year within a maximum distance of 3 SNPs. Such a connection was considered as an indication of active transmission. Ethical Considerations. The Committee for Scientific Research in Greenland approved the study (approval no. 2012-071304). The project was reported to and followed all instructions from the Danish Data Protection Agency.