Introduction

The deep biosphere harbors a considerable portion of Earth's biomass, with microbial cell numbers roughly equal to those in surface environments (Fry et al., 2009; Schrenk et al., 2010; Kallmeyer et al., 2012). Because of the extremely low flux of nutrients and energy, this is also the environment with the slowest growth rates and the longest estimated division times for prokaryotic cells, perhaps up to 1000 years per generation (Jørgensen, 2011; Morono et al., 2011). However, the cells are alive and exist in a dormant or low-activity state: anaerobic heterotrophic cells from such environments respond quickly to nutrients, reaching doubling times measured in days, not years (Morono et al., 2011). Metagenomic analyses suggest that subsurface environments harbor distinct microorganisms adapted for life in the subsurface (see, for example, Biddle et al., 2011; Castelle et al., 2013), and metatranscriptomic analyses show that organisms from all domains of life are active in the subsurface (Orsi et al., 2013). Moreover, the distribution of phages in the deep subsurface indicates ongoing viral production (Engelhardt et al., 2014). How microorganisms come to inhabit subsurface sediments, however, remains an open question (Schrenk et al., 2010): for example, they may be introduced during sedimentation or transported from either the basement or adjacent sediments.

Some of the most nutrient-rich subsurface environments are oil reservoirs, which contain large amounts of hydrocarbons and fatty acids, and the effect of microbial activities on the former is well documented (Head et al., 2003; Ollivier and Alazard, 2010). In these reservoirs, indigenous microorganisms, that is, those not introduced by recent human activities, are thought to live in biofilms attached to surfaces in the lower zone of the reservoir, where there is an oil–water contact (Magot, 2005; Bennett et al., 2013). Microbial life in oil reservoirs is postulated to be slow because of low levels of electron acceptors and some nutrients, such as phosphorus (Ollivier and Alazard, 2010). Moreover, cell counts are similar to those in other active deep biosphere systems (Bennett et al., 2013).

As the temperature and pressure during oil formation can approach or exceed the upper limits for life (>100–150 °C and up to 100 MPa; Planckaert, 2005), it is unlikely that microorganisms in an oil reservoir originate from the source rock (Stetter and Huber, 1999). Instead, as oil migrates to shallower sediments, the newly formed oil reservoirs become occupied by an indigenous microbiome. As for subsurface sediments in general, two alternative hypotheses explaining where these microorganisms originate from have been proposed. In the first, the ‘burial and isolation’ hypothesis (a corollary of the paleosterilization model of oil biodegradation), the microorganisms buried with the sediments adapt to the new environment as the oil reservoir is formed (Wilhelms et al., 2001). Under this hypothesis, the microbial community within an oil reservoir is isolated, and there is no subsequent migration of other microorganisms into the reservoir from adjacent areas. The support for this hypothesis comes from examination of cool reservoirs with no biodegraded oil. During their geological history, the reservoir sediments were buried at depths where the temperature exceeds the survival limit of hydrocarbon-degrading microorganisms, resulting in ‘paleosterilization’ (Wilhelms et al., 2001). The conjecture is that new organisms have not colonized such oil reservoirs after their subsequent uplift and cooling, and that is why no oil biodegradation is presently observed. In the alternative ‘colonization’ hypothesis, subsurface microorganisms populate oil reservoirs after their formation (Stetter and Huber, 1999; Fry et al., 2009). The microbiomes of some reservoirs may have assembled through a combination of these mechanisms.

Most oil reservoirs in the world have contained oil for several million years (Myr) (Huang and Larter, 2005). For instance, oil generation from the Draupne formation, the main source rock in the North Sea, began 88 Myr ago and peaked 78 Myr ago (Adda, 2012). The sediments of the Troll reservoir are shallow marine sandstones from the middle and late Jurassic (145–150 Myr ago) (Gautier, 2005). If bacteria have resided in the reservoir since the sediment deposition, their populations have been isolated for over 145 Myr. As a result, 900–90 000 single-nucleotide polymorphisms (SNPs) per genome are expected to have accumulated via mutation between the reservoir bacteria and their counterparts outside the reservoir (the wide range reflects our poor knowledge of generation times and mutation rates during growth and dormancy; see Supplementary Methods for calculations). If the reservoirs are not isolated and the microbes colonized the reservoir from surrounding subsurface populations at a later time, we would expect ongoing gene flow, detectable as numerous recombination events between oil reservoirs of any age. Lastly, if the reservoir was colonized after humans started its development, we would expect large tracts of genomic identity between the colonizers and bacteria from, for instance, adjacent marine environments.
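The form of such an estimate can be sketched as follows. The actual calculation is given in Supplementary Methods and is not reproduced here; the function name, the simple neutral-divergence model and all parameter values in the example are illustrative assumptions, and the sketch is not intended to recover the 900–90 000 range.

```python
def expected_snps(isolation_years, generation_time_years,
                  mutation_rate_per_site_per_generation, genome_size_bp):
    """Expected number of differences between two lineages isolated for a given time.

    Assumes a simple neutral model (an assumption of this sketch): both lineages
    accumulate mutations independently, hence the factor of 2.
    """
    generations = isolation_years / generation_time_years
    return 2 * mutation_rate_per_site_per_generation * genome_size_bp * generations


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show how strongly the estimate
    # depends on the assumed generation time.
    for generation_time in (1, 10, 100, 1000):
        snps = expected_snps(145e6, generation_time, 1e-10, 1.8e6)
        print(f"generation time {generation_time} yr: {snps:.0f} expected SNPs")
```

Because the estimate scales linearly with the number of generations, an uncertainty of several orders of magnitude in generation time translates directly into a similarly wide range of expected SNPs.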

The Thermotogae are a bacterial phylum that has been detected in and isolated from numerous oil reservoirs, both continental and offshore (Cappelletti et al., 2014). Some lineages to date are found only in these environments (Ollivier and Alazard, 2010), whereas others appear to be physiologically adapted to the reservoir conditions (Slobodkin et al., 1999; Bonch-Osmolovskaya et al., 2003; Mongodin et al., 2005). Thus, these organisms may be indigenous to oil reservoirs (Grassia et al., 1996). However, an alternative hypothesis of human introduction during oil exploration is not ruled out (Ollivier and Alazard, 2010; Struchtemeyer et al., 2011).

Bacteria from the genus Thermotoga are hyperthermophilic anaerobes and have been detected in and isolated from oil reservoirs, as well as geothermally heated sea floors and continental hot springs (Huber and Hannig, 2006). Analyses of fosmid clones from Thermotoga isolates from geothermally heated sea floors of the Mediterranean and Atlantic (Thermotoga maritima MSB8 and Thermotoga sp. RQ2), and the Kubiki oil reservoir in Japan (Thermotoga petrophila and Thermotoga naphthophila) revealed frequent recombination involving Thermotoga spp. across geographic regions (Nesbø et al., 2006), suggesting that these bacteria are not isolated. This raises the question of whether the subsurface isolates are recent invaders from geographically adjacent marine populations (or vice versa), or whether the two ecologically different habitats harbor distinct subpopulations adapted to their niches. To address this, we investigated the phylogeographic patterns of the Thermotoga spp. by analyzing the genomes of seven new isolates from the Troll oil reservoir in the North Sea and shallow marine hydrothermal vents at the Kuril Islands (Supplementary Table S1) together with the genomes of the Thermotoga isolates from the fosmid-clone study (Nesbø et al., 2006) and three Thermotoga-containing metagenomes. We find evidence for subsurface and marine subpopulations, although the genomes from both habitat types show signs of extensive DNA exchange across ecological boundaries. Therefore, our observations suggest that oil reservoir populations are not isolated and that the reservoirs were likely colonized from marine or already existing subsurface populations.

Materials and methods

Genomic data

Thermotoga spp. were isolated from oil production fluids of the Troll B and Troll C oil platforms, as described in Dipippo et al. (2009), and from shallow marine hydrothermal vents in the Kuril Islands, as described in Svetlichny et al. (1991). A detailed description of the isolation procedure, DNA isolation, genome and fosmid clone sequencing, assembly and annotation is provided in Supplementary Information. The genomes and fosmid clones are available in GenBank under accession numbers CP003408, CP003409, AJII01000000, JSFG01000000, JSFI01000000, JSFH01000000, JSFJ01000000, KP239990 and KP229298. A description of the analysis and assembly of Thermotoga DNA from metagenomes can also be found in Supplementary Information.

Gene content and genome alignments

To determine the gene pool shared among all genomes, a bidirectional top-scoring BLASTP (Altschul, 1997) approach (E-value < 10^-5) was used to identify pairwise orthologs. Unique genes and genes shared only by pairs of genomes were identified using the phylogenetic profile tool in IMG (Markowitz et al., 2014), with a percent identity of >80% and an E-value of < 10^-5.
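The bidirectional top-scoring hit criterion can be illustrated with a minimal sketch. It assumes that BLASTP results have already been parsed into (query, subject, E-value, bit score) tuples; the function names and the input layout are hypothetical and do not describe the scripts used in this study.

```python
def best_hits(hits, max_evalue=1e-5):
    """Map each query to its top-scoring subject among hits passing the E-value cutoff.

    hits: iterable of (query, subject, evalue, bitscore) tuples.
    Ties are resolved in favor of the first hit encountered.
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if evalue >= max_evalue:
            continue  # hit does not pass the E-value < 1e-5 criterion
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, max_evalue=1e-5):
    """Return sorted (gene_a, gene_b) pairs that are each other's top-scoring hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A pair of genes is retained only when the relationship holds in both search directions, which excludes cases where a gene's best match has a better partner elsewhere in the other genome.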

Pan-genome calculations were performed in Panseq (Laing et al., 2010) using a fragment size of 500 bp and 80% identity cutoff for the analyses of TM-group genomes, and 90% identity cutoff for the analyses of TM-group and genomes assembled from metagenomes (to confidently exclude possible contamination). The data matrices of shared core SNPs and shared 500 bp fragments were converted into uncorrected distances and visualized in SplitsTree 4 (Huson and Bryant, 2006) using NeighborNet clustering.
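The conversion of a data matrix into uncorrected distances can be sketched as below. The sketch assumes one equal-length profile per genome (0/1 states for fragment presence/absence, or nucleotides at core SNP positions); the function name and input layout are hypothetical, and in the study itself the distances were obtained from the Panseq matrices and visualized in SplitsTree.

```python
def uncorrected_distances(profiles):
    """Pairwise proportion of differing positions between equal-length profiles.

    profiles: dict mapping genome name -> sequence of states
              (0/1 for presence/absence, or nucleotides for core SNPs).
    Returns a dict keyed by (name_a, name_b) with name_a < name_b.
    """
    names = sorted(profiles)
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pa, pb = profiles[a], profiles[b]
            if len(pa) != len(pb):
                raise ValueError("profiles must have equal length")
            differences = sum(x != y for x, y in zip(pa, pb))
            distances[(a, b)] = differences / len(pa)
    return distances
```

No substitution model is applied: the distance is simply the fraction of positions at which two genomes differ.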

Whole-genome alignments were carried out in MAUVE version 2.3.1 (Darling et al., 2010) using automatically calculated seed weights and minimum locally colinear block (LCB) scores. LCB positions with gaps were removed in GBlocks (Castresana, 2000). The edited LCBs were concatenated in Geneious 6 (www.geneious.com).
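The two post-alignment steps, removal of gapped positions and concatenation of the edited LCBs, can be sketched as follows. This is a simplified stand-in that assumes each LCB is held as a dict of equal-length aligned strings keyed by genome name; GBlocks and Geneious were used in the study, and the function names here are hypothetical.

```python
def strip_gapped_columns(alignment):
    """Drop every alignment column in which at least one sequence has a gap ('-')."""
    names = list(alignment)
    if len({len(alignment[name]) for name in names}) > 1:
        raise ValueError("aligned sequences must have equal length")
    columns = zip(*(alignment[name] for name in names))
    kept = [column for column in columns if "-" not in column]
    return {name: "".join(column[i] for column in kept)
            for i, name in enumerate(names)}


def concatenate_blocks(blocks):
    """Concatenate per-LCB alignments (same genome names in each) in the given order."""
    names = list(blocks[0])
    return {name: "".join(block[name] for block in blocks) for name in names}
```

Stripping gapped columns before concatenation keeps every genome represented at every retained position, so downstream per-site comparisons are made over the same set of sites.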

Quartet decomposition (QD) analysis

Positional orthologs (referred to as gene families) were extracted from the LCBs. Only gene families present in at least four genomes were retained for further analyses. The gene families were aligned in ClustalW version 2 (Larkin et al., 2007), and phylogenetic trees were reconstructed from 100 bootstrap samples in RAxML version 7.3.6 (Stamatakis, 2006) under the GTR+Γ model. Resulting trees were analyzed using the Quartet Decomposition server (Mao et al., 2012), removing quartets with internal branch <0.02 substitutions/site or with any external branch 10 times longer than the internal branch. A detailed description of the analyses is in Supplementary Information.
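The quartet filtering rule can be written as a short predicate. The sketch assumes that branch lengths (in substitutions/site) have already been extracted for each quartet, and it interprets the second criterion as "any external branch more than 10 times the internal branch length"; the function name and this interpretation are assumptions of the sketch, not a description of the Quartet Decomposition server.

```python
def keep_quartet(internal_branch, external_branches,
                 min_internal=0.02, max_external_ratio=10.0):
    """Return True if a quartet passes both branch-length filters.

    internal_branch:   length of the quartet's internal branch.
    external_branches: lengths of the four external branches.
    """
    if internal_branch < min_internal:
        return False  # internal branch too short to be informative
    # Reject quartets with any disproportionately long external branch.
    return all(length <= max_external_ratio * internal_branch
               for length in external_branches)
```

Both filters target quartets whose topology is poorly resolved: very short internal branches carry little signal, and very long external branches are prone to reconstruction artifacts.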

Recombination detection

The relative rate of recombination to mutation and average recombination tract length were assessed using the pairwise program and likelihood look-up tables generated by the complete program in the LDhat package (McVean et al., 2002; Jolley, 2004). Detection of recombinant fragments was carried out in RDP version 4.33 (Martin et al., 2010) and LikeWind (Archibald and Roger, 2002). A detailed description of the analyses is in Supplementary Information.

Results and discussion

Limited genomic diversity within sampled sites

In the remainder of the article, we denote the newly sequenced (Supplementary Table S1) and other isolates with genomes closely related to T. maritima, T. petrophila and T. naphthophila as the ‘TM-group’ of Thermotoga spp.

Examination of the genomes for rearrangements, indels and SNPs reveals very low diversity among the Troll oil reservoir genomes (Supplementary Table S2). Whole-genome alignments resulted in three LCBs, with only a single 85 kb genome rearrangement. The nucleotide identity within each of the three LCBs ranges from 98.7% to 99.4%. Apart from variation in CRISPR (clustered regularly interspaced short palindromic repeat) regions (Supplementary Figure S1), the largest source of diversity is indels (3–23 nt) in 7 genes (Supplementary Table S3). On average, only 100 SNPs were detected between the Troll genomes from different platforms (Supplementary Table S2).

Troll genomes from platform B, Thermotoga sp. strains TBGT17.6.5 and TBGT17.6.6, differ by only seven SNPs, which are located within a hypothetical gene flanking a CRISPR region. Comparison of the Thermotoga sp. TBGT17.6.5 genome with fosmid clones of two additional Troll B isolates also displayed only 6 and 4 SNPs over 42 148 and 33 295 bp of fosmid data, respectively. Similarly, there are only 54 SNPs between the two Troll genomes from platform C. Mapping sequences from a Troll C metagenome (data not shown) onto the Thermotoga sp. CELL2 genome showed 99.6% identity over 364 180 bp (in contrast to 98.5% identity to the T. maritima MSB8 genome over 294 382 bp). Our low diversity observations are concordant with metagenomic studies of the Troll reservoir (data not shown) and two other oil reservoirs in this region (Lewin et al., 2014). Nevertheless, Troll isolates differ in fast-evolving CRISPR loci (Supplementary Figure S1) (Mongodin et al., 2005), suggesting that these are distinct lineages that have encountered different exogenous phage and mobile DNA.

Among the three genomes of the Kuril Islands isolates, Thermotoga sp. EMP and 2812B differ by only 23 SNPs (Supplementary Table S2) and a 693-bp insertion in the EMP_01477 gene encoding flagellar hook-length control protein. The third Kuril Islands genome (Thermotoga sp. Mc24) showed much higher divergence from the other two genomes, as well as from the rest of the TM-group genomes (Supplementary Table S2).

Although we report overall low within-geographic-region diversity of the two sampled sites, this is not a universal feature of the TM-group (Supplementary Table S2). The genomes of T. petrophila and T. naphthophila from the Kubiki oil reservoir in Niigata, Japan (Takahata et al., 2008), are only 96.7% identical. Unlike the Troll reservoir, the Kubiki oil reservoir has been flooded with water to enhance oil production (Takahata et al., 2000, 2008), which could have recently introduced new genetic diversity. The Troll reservoir is also older than the Kubiki oil reservoir, as the sediments of the latter were formed in the early Pliocene and late Miocene, 5–7 Myr ago (Kawai and Totani, 1971; Chakhmankhchev et al., 1996). Thus, the Troll microbiome may also have been isolated for a longer period of time and lost more diversity because of genetic drift. Consistent with the drift scenario, the genomes of the Troll reservoir isolates are smaller and encode fewer genes than any other known Thermotogae genome (Supplementary Table S1), a feature also observed in a Thermotoga sp. isolate from an Alaskan oil reservoir (Mongodin et al., 2005).

High gene content similarity across all TM-group genomes

All of the analyzed genomes share >90% of their genes, and the genomes from the same type of environment (that is, either oil reservoir or marine sediments) share a larger proportion of genes (Figure 1). As the four Troll oil reservoir genomes and two of the Kuril Islands genomes exhibit very high within-group DNA sequence identity and gene content similarity, only one representative genome from each group was selected for all further analyses (Thermotoga sp. CELL2 and Thermotoga sp. 2812B). Genome-wide alignment of seven representative TM-group genomes (T. maritima MSB8, T. petrophila RKU1, T. naphthophila RKU10, Thermotoga sp. RQ2, Thermotoga sp. CELL2, Thermotoga sp. 2812B and Thermotoga sp. Mc24) resulted in 16 LCBs spanning 1 543 882 bp (‘core’ genome), including 83–89% of the individual genomes’ DNA. The observed divergence within the ‘core’ genome is on average 4% (Supplementary Table S2).

Figure 1

Pairwise comparisons of the shared gene pool across 11 TM-group genomes. Genes are designated as shared if they are bidirectional BLASTP hits (E-value < 10^-5). The genomes are classified according to geographic location (Atlantic vs Pacific) and environment type (oil reservoir vs marine vent). The heatmap indicates that the isolates from the same environment type share more genes than the isolates in geographic proximity. Comparisons between genomes from the same type of environment are outlined in bold.

Examination of the accessory 889 066 bp of the TM-group pan-genome showed that individual genomes carry between 29 and 64 unique genes found only in one genome, accounting for 81 630 bp in total (Supplementary Table S4). Also, 134 genes are shared by only two genomes, spanning an additional 43 873 bp of the accessory pan-genome (Supplementary Table S4). Thus, the accessory pan-genome is mainly composed of genes shared by at least three genomes. When genomes are grouped according to the environment type, only two genes are specific to isolates from oil reservoirs (hypothetical proteins; CELL2_08715 and CELL2_08720 in Thermotoga sp. CELL2), whereas nine genes are specific to isolates from marine sediments (a phosphotransferase domain-containing protein (TM0559), a hypothetical protein (TM1012) and seven CRISPR-associated proteins (TM1791, TM1793–TM1795 and TM1800–TM1802 in T. maritima MSB8)). Future biochemical characterization of these proteins is needed to elucidate any contributions they may have to niche adaptation in the two types of environment.
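The arithmetic behind the accessory pan-genome breakdown above can be checked with a few lines. The three base-pair totals are taken from the text; the variable names are ours, and treating the remainder as sequence shared by three or more genomes is a simplifying assumption of this sketch.

```python
# Totals reported in the text (bp)
accessory_bp = 889_066        # accessory pan-genome
unique_bp = 81_630            # genes found in only one genome
shared_by_two_bp = 43_873     # genes shared by exactly two genomes

# Remainder: accessory sequence not accounted for by the two categories above
remainder_bp = accessory_bp - unique_bp - shared_by_two_bp
remainder_fraction = remainder_bp / accessory_bp

print(f"{remainder_bp} bp ({remainder_fraction:.1%}) of the accessory pan-genome")
```

The remainder is 763 563 bp, or roughly 86% of the accessory pan-genome, which is the basis for describing it as mainly composed of genes shared by at least three genomes.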

Lack of consistent phylogenetic signal suggests rampant recombination within the TM-group

In the ‘shared DNA fragments’ network, the genomes from ‘oil reservoirs’ and ‘marine sediments’ form separate groups, with the ‘marine sediments’ genomes further differentiated according to their geographic proximity (Figure 2a). The ‘core SNPs’ network (Figure 2b), on the other hand, does not show such a clear pattern: Thermotoga sp. RQ2 clusters with Thermotoga sp. CELL2 (shared ‘geographical proximity’) and T. maritima MSB8 clusters with Thermotoga sp. 2812B (shared ‘ecological niche’). In addition, phylogenetic histories reconstructed from typical marker genes (Supplementary Figure S2) and from a concatenation of the aligned genome LCBs were incongruent (see below). QD analysis (Zhaxybayeva et al., 2006; Mao et al., 2012) of phylogenetic histories of 1728 gene families present in at least four of the seven TM-group genomes reveals no support for any unique bifurcating phylogenetic tree by the majority of the gene families (Figure 2c and Supplementary Figure S3a). Moreover, evolutionary histories of 857 gene families were at least partially incongruent with the phylogenetic signal of the plurality of gene families. Further screening of gene families resulted in 69 and 25 gene families that strongly support ‘ecological niche’ and ‘geographical proximity’ divisions, respectively (Supplementary Figure S3b). In phylogenetic trees of 33 of the 69 (47%) ‘ecology-specific’ families, the observed groupings are likely because of recent recombination events, as the involved taxa have identical (or nearly identical) DNA sequences. Some of these genes could represent habitat-specific alleles spreading through the population via homologous recombination and selection, as observed in Sulfolobus and Vibrio populations (Shapiro and Polz, 2014). In 27 of the 33 gene families, these recent recombination events are observed between Thermotoga spp. from oil reservoirs in the North Sea and Japan, 8000 km apart. Only 8 of the 25 ‘geography-specific’ families (37%) show evidence of recent recombination, and in 4 cases recombination is between the two Kuril Islands isolates, Thermotoga sp. Mc24 and 2812B. Although some phylogenetic conflict may be a result of uncertainty of phylogenetic reconstruction, incongruence because of gene flow will manifest itself in recombination events recorded among the TM-group lineages.

Figure 2

Phylogenetic relationships among seven TM-group genomes based on (a) presence/absence of shared 500 bp genomic fragments, (b) core SNPs and (c) plurality gene families from the QD analysis. The genomes are classified according to geographic location (Atlantic vs Pacific) and environment type (oil reservoir vs marine vent). For networks (a) and (b), the data were obtained using Panseq (Laing et al., 2010). Core SNPs were required to be present in all 7 genomes, whereas the genomic fragments were considered shared if they were at least 80% identical. Networks were calculated in SplitsTree using the NeighborNet algorithm (Huson and Bryant, 2006) from uncorrected distances (a, b) and quartet topologies supported by the plurality of gene families identified in the QD analysis (c).

Quantification of recombination events reveals high recombination to mutation ratios

In bacteria, recombination events are best described as a gene conversion process (McVean et al., 2002). Thus, for each LCB in the alignment of seven TM-group genomes, we calculated the population mutation rate (θ) and the gene conversion parameter (γ) that can be interpreted as the population rate of recombination due to gene conversion between two distantly linked loci (McVean et al., 2002). Across the TM-group, the estimated average recombination tract length is 6800 bp (ranging between 2000 and 14 600 bp), and the average θ and γ are 0.045 (0.019–0.058) and 2.9 (1.1–3.8), respectively (Supplementary Table S5). The resulting high γ/θ ratios of 24–109 are indicative of high levels of recombination, consistent with the phylogenetic analyses above. The high level of recombination suggests that the low genomic diversity of TM-group bacteria within geographic populations could in part be because of genome homogenization via recombination.
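As a quick consistency check on the figures above, the recombination-to-mutation ratio can be computed from the reported averages. The function name is ours; per-LCB values are in Supplementary Table S5 and are not reproduced here, and a ratio of averages is only an approximation of the per-LCB ratios.

```python
def recombination_to_mutation_ratio(gamma, theta):
    """Ratio of the gene conversion parameter (gamma) to the population mutation rate (theta)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return gamma / theta


# Averages reported in the text: gamma = 2.9, theta = 0.045
average_ratio = recombination_to_mutation_ratio(2.9, 0.045)
print(f"gamma/theta from averages: {average_ratio:.1f}")
```

The ratio obtained from the averages is about 64, which falls within the reported per-LCB range of 24–109.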

The search for specific recombination breakpoints resulted in 330 and 471 recombination events detected in LikeWind (Archibald and Roger, 2002) and RDP (Martin et al., 2010), respectively (Figures 3 and 4 and Supplementary Table S6). As both algorithms rely on phylogenetic evidence, the number of recombination events is likely underestimated because of failure to detect recombination between sister taxa. On average, 67 recombination events per genome are predicted by RDP, ranging from 93 in Thermotoga sp. Mc24 to 48 in Thermotoga sp. CELL2 (Supplementary Table S6).

Figure 3

Visualization of recombination events detected among seven TM-group genomes. The genomes are color coded and arranged on the circumference of a circle. The recombination events with predicted donor and recipient are shown as lines connecting the two genomes and the locations of recombined regions. The line color reflects the donor lineage. The predicted events were required to be significant in three out of four algorithms in RDP (Martin et al., 2010). The diagram was generated using Circos (Krzywinski et al., 2009).

Figure 4

Detected recombination events in seven TM-group genomes. Sliding-window phylogenetic analysis of the alignment of concatenated LCBs was performed in the LikeWind program (Archibald and Roger, 2002) using 1000 bp windows selected in 100 bp increments. (a) Phylogenetic tree reconstructed from the whole alignment. (b, c) Two examples of significantly different tree topologies. (d) Across-the-alignment dynamics of Δlog L, the difference between the log-likelihood of the maximum likelihood (ML) tree of the 1000 bp window and the ML tree reconstructed from the entire alignment. The x axis shows the position along the concatenated alignment. Peaks correspond to tree topologies of sliding alignment windows that are very different from the whole alignment tree topology depicted in (a). The parametric bootstrapping analysis (Archibald and Roger, 2002) indicated that values of Δlog L>12.7 correspond to significantly different tree topologies, and the x axis crosses the y axis at that value. Position of the alignment is color coded by LCB.
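The windowing scheme and significance cutoff described in this legend can be sketched as follows. The sketch only generates window coordinates and applies the Δlog L > 12.7 cutoff; the per-window likelihood calculations performed by LikeWind are assumed to be supplied as input, and the function names and data layout are hypothetical.

```python
def sliding_windows(alignment_length, window=1000, step=100):
    """Yield (start, end) half-open coordinates of full-length windows along the alignment."""
    for start in range(0, alignment_length - window + 1, step):
        yield start, start + window


def significant_windows(delta_log_l, threshold=12.7):
    """Return coordinates of windows whose delta log L exceeds the threshold.

    delta_log_l: dict mapping (start, end) -> delta log L for that window,
                 computed elsewhere (for example, by LikeWind).
    """
    return [coords for coords, value in sorted(delta_log_l.items())
            if value > threshold]
```

Adjacent significant windows overlap by 900 bp, so consecutive hits would typically be merged into a single candidate recombinant region before counting events.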

Recombination events were detected among all isolates regardless of their geographic location (Figure 3 and Supplementary Table S6). However, surprisingly few recombination events were inferred among some genomes of geographically proximal isolates (that is, with T. petrophila and T. naphthophila as recipients and Thermotoga sp. 2812B as donor, and with Thermotoga sp. RQ2 as recipient and T. maritima MSB8 as donor). This was supported by the QD analysis where the two quartets with the highest support and lowest conflict cluster T. maritima MSB8 and Thermotoga sp. 2812B. Interestingly, Thermotoga sp. RQ2 has obtained most DNA from the T. naphthophila lineage, whereas T. naphthophila RKU10 and T. petrophila RKU1 have received most DNA from Thermotoga sp. Mc24. For the remaining genomes, most of the recombinant DNA comes from strains from the same type of environment. This was particularly pronounced for the Thermotoga sp. CELL2 genome, which is predicted to have received much more DNA from the other ‘oil reservoir’ genomes (30 events) than from the geographically closer T. maritima MSB8 (2 events).

Combined evidence from the QD analysis, recombination detection and the examination of the pan-genome suggests high levels of reticulate evolution among TM-group members. Therefore, we infer that TM-group bacteria form a global interacting collective. Similar mixing across large geographic distances has been observed in several other taxonomic groups such as haloarchaea, Prochlorococcus and Vibrio (reviewed in Shapiro and Polz, 2014).

However, such global connectivity does not exclude the existence of local and ecology-driven subpopulations. We find that TM-group bacteria can be divided into two ‘habitat-based’ subpopulations (‘marine sediments’ and ‘subsurface’) that share more DNA with each other than they do with isolates of different habitat types from the same geographical region. These findings could be explained by either preferential gene exchange among bacteria from the same type of environment or preferential retention of acquired genes with a selective advantage in the ecological niche. The high level of recombination among genomes from oil reservoirs of different geological ages (Kubiki and Troll) is not compatible with isolation after burial. Although cells buried in the sediments may contribute to the population’s genetic makeup, our data support the ‘colonization’ hypothesis, under which the oil reservoirs are colonized either from an already existing subsurface population or from surrounding marine populations. In addition, the observation of distinct habitat-based subpopulations attests to indigeneity of Thermotoga in Kubiki and Troll reservoirs.

Other sites harbor Thermotoga spp. populations with similar properties

To investigate whether the observed patterns of extensive recombination and low within-population diversity can be generalized for other TM-group populations, we searched for publicly available metagenomes that harbor TM-group bacteria, using as a criterion >90% identity to Thermotoga genus genes. We identified three such sample sites (Supplementary Table S7): San Juan basin coal bed production water (An et al., 2013), a ‘continental hot spring’ fed from a subterranean reservoir (Great Boiling Spring (GBS), Nevada) (Costa et al., 2009) and the off-shore Tuna oil reservoir in the Bass Strait, Gippsland Basin, Australia (Sutcliffe et al., 2013).

Only 16 protein-coding sequences in the San Juan basin metagenome are TM-group like. In phylogenetic analyses, nine of them cluster with their homologs from the geographically close GBS. Moreover, on 11 of 16 phylogenetic trees, one or more of the ‘oil reservoir’ isolates are the sister taxon either to the CG7 homolog or to the group of CG7 and GBS homologs, supporting a close phylogenetic relationship among subsurface TM-group populations.

Three cellulolytic in situ enrichment cultures from the GBS site include >2000 Thermotogae-like contigs with 90% similarity to the genomes used in this study (Supplementary Table S7), consistent with 16S rRNA analysis (Peacock et al., 2013). Comparisons of Thermotogae-like sequences among the three cultures revealed that they are >98% identical across homologous contigs >5 kb in length, again demonstrating low within-site diversity for TM-group bacteria.

Our assembly of a 2.1-Mb draft genome from a GBS metagenome (referred to as Thermotoga sp. GBS) and the already assembled draft genome of Thermotoga sp. A7A from the Tuna oil reservoir metagenome (Sutcliffe et al., 2013) allowed us to repeat the genome-wide phylogenetic analyses with 9 TM-group genomes. On the ‘core SNPs’ network, Thermotoga sp. GBS is most closely linked to Thermotoga sp. CELL2 (Figure 5). The close relationship between Thermotoga sp. GBS and Thermotoga sp. CELL2, which were isolated 7500 km apart in Nevada and the North Sea, respectively, demonstrates that Thermotoga sp. CELL2 does not represent a lineage specific to either oil or ancient North Sea sediments. A shallow ocean covered Nevada at the time the Troll oil reservoir sediments were laid down (Blakey, 2014), and thus these isolates could have originated from the same marine population and, according to the ‘burial and isolation’ hypothesis, stayed isolated ever since. However, the QD analysis identified 133 gene families that place Thermotoga sp. GBS and Thermotoga sp. CELL2 in different groups with >90% bootstrap support (Supplementary Figure S4). Many of these gene families are likely a result of gene exchange with both marine and other subsurface bacteria, suggesting that contact among lineages from the two environments is an ongoing process. Hence, regardless of how the bacteria entered the oil reservoir sediments, they do not appear to have stayed isolated.

Figure 5

Phylogenetic relationships of nine TM-group genomes. The network was calculated using core SNPs as described in Figure 2 legend, except the included fragments were required to be at least 90% identical. All nodes of the network have at least 96% bootstrap support. The plurality network reconstructed from the QD analysis has the same topology (not shown).

The close connection of Thermotoga sp. A7A to the two genomes from ‘marine sediments’ (T. maritima MSB8 and Thermotoga sp. 2812B; Figure 5) suggests that the Thermotoga population has entered the Tuna reservoir recently from a marine environment. The temperature of the reservoir is reported to be 105 °C (Sutcliffe et al., 2013), which is considerably higher than the proposed 80–90 °C threshold temperature for sustaining life in oil reservoirs (Stetter and Huber, 1999; Wilhelms et al., 2001), suggesting that Thermotoga sp. A7A may be inhabiting the petroleum pipelines, where the temperature is lower. The opposite direction of the migration, from subsurface to the marine habitat, is supported by the position of Thermotoga sp. RQ2 in the ‘core SNPs’ network (Figure 5). We conjecture that this migration happened relatively recently, consistent with a <8 Myr (Abdel-Monem et al., 1975) volcanic origin of the Azores.

Potential causes of low diversity

Not only do we observe low genomic diversity of Thermotogae within sites, but the ‘core SNP’ network shows that Thermotoga sp. GBS (Nevada; hot spring), Thermotoga sp. CELL2 (North Sea; oil reservoir) and Thermotoga sp. RQ2 (Azores; marine sediment) are on shorter branches, and hence have a smaller number of underlying nucleotide substitutions (Figure 5; 98.2–98.5% genome-wide nucleotide identity and 11 254–14 446 core SNPs). Several factors may contribute to the observed low diversity. Given the slower growth rates (and therefore lower mutation supply rates) in the subsurface environments, the limited diversity may simply be because of insufficient time to accumulate substitutions. In addition, subsurface sediments, including oil reservoirs, are among the most stable environments on Earth both spatially and temporally, as temperature and pressure stay constant at a given depth (S Larter, personal communication). This stability may contribute to low within-population diversity, as environmental variability is one of the main drivers in maintaining genetic diversity of a population (Kassen, 2002). If the diversity of phages that can infect these bacteria is smaller than what is seen in mesophilic environments (Weinberger et al., 2012), the selective advantage of rare genotypes would be lower. Finally, the inferred high recombination incidence is likely to further lower the divergence among TM-group genomes.

Why do we see such high levels of recombination among Thermotoga spp.? Hyperthermophiles, like Thermotoga, have to constantly repair their DNA because of the damaging effects of high temperatures (Grogan, 2000). Extensive recombination (repair) has been observed in other hyperthermophilic microorganisms such as Pyrococcus furiosus (Diruggiero et al., 1997), Sulfolobus islandicus (Whitaker et al., 2005) and Persephonella (Mino et al., 2013). Thus, the high levels of recombination observed in Thermotoga might be a by-product of high levels of DNA repair (Johnston et al., 2014).

Scenarios for gene dispersal across subsurface populations

Our analysis suggests larger gene flow among the subsurface populations than between subsurface and marine populations, but what would be the mechanism of gene flow across geographically isolated oil reservoirs? The gene flow may be mediated entirely via the surface, with preferential retention of genes with a selective advantage in oil reservoirs. Indeed, dispersal of thermophilic organisms from the subsurface has been suggested to occur via deep-sea hydrothermal plumes (Summit and Baross, 1999; Dick et al., 2013). Alternatively, some of the contact may occur strictly within the subsurface. Under this scenario, the subsurface populations belong to a large, sparsely populated, slow-growing collective of subsurface thermophiles that exists in areas with favorable conditions (anoxic, 55–90 °C, organic material and so on). The existence of such subsurface communities has been inferred from detection of thermophiles in hydrothermal vent plumes associated with seismic events (Summit and Baross, 1999) and from direct sampling of crustal fluids (Huber et al., 2006). Organisms from this collective may have colonized the oil reservoirs and GBS, as illustrated in Supplementary Figure S5. Under this scenario, the exchange between oil reservoirs is mediated by the exchange with microorganisms from surrounding sediments.

Concluding remarks

Our genomic and metagenomic analyses suggest that different subsurface populations of Thermotoga are not isolated and thus favor ‘colonization’ over the ‘burial and isolation’ scenario. Buried marine Thermotoga may instead be a substantial source of marine immigrants entering the subsurface collective. The phylogenetic analyses further suggest that some Thermotoga likely colonized their oil reservoirs a long time ago, whereas others have entered reservoirs recently, probably during commercial development. In both cases, gene flow appears to be an ongoing process. Although some gene flow likely involves dispersal via the surface (for example, burial, oil seeps and hydrothermal vent plumes), followed by subsequent selection for genes beneficial for subsurface life, we suggest that within-subsurface gene exchange is nonnegligible. Future investigations are needed to determine how such gene flow is achieved.

The level of diversity among Thermotoga spp. genomes is strikingly low when compared with, for instance, marine Vibrio populations, in which individual cells may carry up to 200 genes unique to their genomes (Cordero and Polz, 2014). This might be a result of the extreme environment Thermotoga spp. inhabit. As discussed above, low diversity may be generated by several alternative, but not mutually exclusive, mechanisms. Investigating genome diversity in Thermotogae living at less extreme temperatures, such as Mesotoga and Petrotoga lineages, may help elucidate the effect of temperature on population-level genome diversity and phylogeography.