Global metagenomic survey reveals a new bacterial candidate phylum in geothermal springs

Analysis of the increasing wealth of metagenomic data collected from diverse environments can lead to the discovery of novel branches on the tree of life. Here we analyse 5.2 Tb of metagenomic data collected globally to discover a novel bacterial phylum (‘Candidatus Kryptonia') found exclusively in high-temperature pH-neutral geothermal springs. This lineage had remained hidden as a taxonomic ‘blind spot' because of mismatches in the primers commonly used for ribosomal gene surveys. Genome reconstruction from metagenomic data combined with single-cell genomics results in several high-quality genomes representing four genera from the new phylum. Metabolic reconstruction indicates a heterotrophic lifestyle with conspicuous nutritional deficiencies, suggesting the need for metabolic complementarity with other microbes. Co-occurrence patterns identifies a number of putative partners, including an uncultured Armatimonadetes lineage. The discovery of Kryptonia within previously studied geothermal springs underscores the importance of globally sampled metagenomic data in detection of microbial novelty, and highlights the extraordinary diversity of microbial life still awaiting discovery.


Supplementary Note 1. CRISPR-Cas
We used 99 CRISPR-associated (cas) gene sequence alignments and hidden Markov models from the TIGRFAM database (originally built by Haft et al. 14 and later expanded by Zhang et al. 15 ) to precisely find and identify Cas family members within the scaffolds of the 'Ca. Kryptonia' genomes. We recovered and classified the corresponding CRISPR type for complete and partial CRISPR-Cas loci in all genomes following the unified CRISPR classification from 2011 (ref. 16 ).
In the Dewar Creek genomes (GFM JGI-4, and SAGs JGI-5 -JGI-17), we identified the presence of three CRISPR repeat-spacer arrays located at different genomic loci, two of them containing associated cas genes (Supplementary Data 2). We classified one of the CRISPR-Cas systems as type I-B that contained 34 spacers, and harbored a conserved repeat (here referred as repeat_I-B; Supplementary Fig. 11) that was also found in the separate array (31 spacers) without cas genes associated. The second cas locus was found to be a highly unusual completed fusion between two main CRISPR-Cas types (type I and III; subtypes I-B and III-A). This fusion shares the genes responsible for spacer acquisition (cas2 and cas1) and crRNA processing (cas6), containing also the whole set of essential genes needed in the CRISPR ribonucleoprotein (crRNP) complex (cas5, cas7, and cas8 for the type I-B; and csm2, csm3, csm4, csm5, and cas10 for the type III-A) and the target degradation (cas3, and csm6 for type I-B and type III-A, respectively) 17,18 . Additionally, the fusion integrates a cas4 gene (recently related with possible programmed cell death in bacteria 19 ), a csx1 gene (usually related in subtype III-U), and three hypothetical genes without any known domain assigned (Supplementary Fig. 5). Interestingly, the CRISPR-Cas fusion array was associated with a singular hybrid repeat (from now on repeat_fusion) where the 5' end of the sequence agrees better with type I repeat patterns and the 3' end matches with type III-B repeat pattern ( Supplementary Fig. 11), and as many as 150 spacer sequences ( Supplementary Fig. 5). After the reconstruction of the three repeat-spacer arrays using GFM JGI-4 and the corresponding SAGs, we did not observe any spacer multi-variation across the Dewar Creek genomes, indicating that we might have collected a clonal population or maybe these CRISPR-Cas system are not dynamic in terms of spacer acquisition (Supplementary Data 2).
We used the combination of the 3 genomes found in the Gongxiaoshe sample (GFM JGI-3, SAG JGI-20, and SAG JGI-22) to identify the same type I-B/III-A fusion ( Supplementary Fig. 5). This hybrid cas gene arrangement was associated with the same repeat_fusion sequence found in Dewar Creek genomes and we recovered a distinct spacer set across the genomes. The same occurred in the repeat_fusion-spacer locus without any cas genes found in GFM-JGI-3 and SAG JGI-20 ( Supplementary Fig.  5). This finding clearly indicated activity in the spacer acquisition for this CRISPR system. Additionally, we detected repeat-spacer arrays containing repeat_I-B in all Gongxiaoshe genomes. These arrays also presented multi-variation in the spacer content although they were either not associated with cas genes or associated with an incomplete array of cas genes (assigned to type-I due to the presence of the marker gene cas3).
We used the above mentioned hybrid cas gene arrangement as an anchor to reconstruct the fragmented CRISPR-Cas system of Jinze genomes GFM JGI-2 ( Supplementary Fig. 5) and SAG JGI-25. We were unable to recruit the corresponding cas genes associated with the hybrid system in SAGs JGI-23 and JGI-24. We detected a total number of 2, 126, 71, and 53 spacers connected with the repeat_fusion across 1, 4, 2 and 4 different loci for the GFM JGI-2, SAG JGI-23, SAG JGI-24, and SAG JGI-25, respectively. All the spacers detected were unique for each genome, strongly suggesting a dynamic and active CRISPR-Cas system. Moreover, the four Jinze genomes presented an almost identical repeat_I-B (one C:A SNP in position 6 of the repeat) to the repeat in the genomes from Dewar Creek and Gongxiaoshe. We were able to recruit the corresponding type I-B cas genes in the 3 Jinze SAGs but not in the GFM JGI-2 genome. In total, we collected unique sets of 2, 18, 9 and 57 spacers associated with repeat_I-B for Jinze genomes GFM JGI-2, SAG JGI-23, SAG JGI-24, and SAG JGI-25, respectively, also hinting a dynamic activity of this CRISPR-Cas system.
From the GBS sample, where we have a single genome (GFM JGI-1), we only identified those cas genes present in the type I-B/III-A fusion in two separate scaffolds ( Supplementary Fig. 5). Due to the fragmentation of GFM JGI-1 genome we were unable to recruit any spacers associated using the assembled data. We used then a different algorithm (CRISPR assembler "Crass" 20 ) that uses directly the raw reads to search for CRISPR spacers and repeats (see Methods). We recovered in total 56 spacers associated with the GFM JGI-1 sample.
We used the IMG/M database to search for high sequence identity of the total collection of 795 unique spacer groups that we clustered from all 'Ca. Kryptonia' genomes. After removing self-hits (spacers from its own source), we exclusively detected 29 perfect spacer hits (for both sequence identity and length) across five distinct scenarios: (i) 3 spacers from Jinze 'Ca. Kryptonia' genomes hit Jinze metagenomic scaffolds, (ii) 5 spacers from Gongxiaoshe 'Ca. Kryptonia' genomes hit Gongxiaoshe metagenomic scaffolds, (iii) 15 spacers from Jinze 'Ca. Kryptonia' genomes hit Gongxiaoshe metagenomic scaffolds, (iv) 4 spacers from Gongxiaoshe 'Ca. Kryptonia' genomes hit Jinze metagenomic scaffolds, and (v) 2 spacers from GBS JGI-1 genome hit GBS virome sample (Supplementary Data 3). All of the targeted scaffolds (ranging from 0.2-7.2 Kbp) were assigned to phage or "unknown" based on their gene content (Supplementary Data 5). We hypothesize that most of these scaffolds are viral also based on the number of hypothetical genes, and their length and directionality 21 . We found 11 shared spacers between Jinze and Gongxiaoshe genomes (Supplementary Data 2). Interestingly, one of these spacers hit the same phage scaffold detected in both metagenomes ( Supplementary Fig. 5) indicating the presence of the same infective phage in these ecosystems. We also considered 41 other non-perfect spacer matches (at least 90% spacer identity over 91% sequence length) within the same above-mentioned scenarios (Supplementary Data 3).
Notably, we observed no perfect hits (for both sequence and length) for any of the 'Ca. Kryptonia' spacer groups recruited against any other metagenome. Across all samples used in this comparison we detected a 92% identity (33/36 nt) over 100% length match of a spacer (group_302) from Gongxiaoshe SAG JGI-22 genome with 2 similar scaffolds detected in Yellowstone National Park samples 22 . This may indicate the presence of similar players across similar environments, although the scaffold gene content did not provide sufficient phylogenetic information for taxonomic assignment. Moreover, we noticed a striking spacer match of 89% identity (32/36 nt) over 100% of the sequence length from a Gongxiaoshe spacer (group_291) to a scaffold from a virome study in the GBS location harboring a terminase gene. Lastly, we were able to connect spacers from Dewar Creek and Jinze to spacers coming from the GBS metagenome (other than JGI-1). Based on the detection of all these proto-spacers (original spacer sequence location: e.g phage), we identified a PAM (proto-spacer adjacent motif) for the type I-B CRISPR system that agreed with the requirement for this type (CCn-Protospacer; Supplementary Fig. 11). In the case of the type I-B/III-A CRISPR-Cas fusion, we were unable to detect any PAM, in accordance with the lack of PAM for all type III system described so far 16 .
Taken together, these data indicate the activity of the CRISPR systems of 'Ca. Kryptonia' and strongly support the presence of the same or very similar 'Ca. Kryptonia' infecting phages across similar environments.

Supplementary Note 2. Metabolic and functional features of Candidatus Kryptonia Extreme thermophilic adaptation
Reverse gyrase is an enzyme introducing positive supercoils into circular DNA, which has been strongly associated with extreme thermophilic and hyperthermophilic lifestyle, i. e. optimal growth temperature of above 70°C 23 . A reverse gyrase gene was found in representatives of all 'Ca. Kryptonia' genotypes suggesting that most if not all members of this candidate lineage are extreme thermophiles or hyperthermophiles. Since no other lineages in the FCB superphylum harbor this gene, we hypothesized that it has been acquired by horizontal transfer from other extreme thermophilic or hyperthermophilic microbes. Alignment of reverse gyrase sequences and phylogenetic tree reconstruction placed 'Ca. Kryptonia' sequences into a branch distinct from all other bacterial sequences and pointed to crenarchaeotes of the Thermoproteales order as the likely source of the reverse gyrase gene in 'Ca. Kryptonia' (Supplementary Fig. 7). This suggests an event independent of proposed horizontal gene transfer of reverse gyrase from archaea into the ancestors of two bacterial phyla with hyperthermophilic representatives, Aquificae and Thermotogae 24 . We used the GC content of the 'Ca. Kryptonia' 16S rRNA gene to estimate its optimal growth temperature 25 . All 'Ca. Kryptonia' genotypes had 16S genes with GC content of 61%, which corresponds to the predicted optimal growth temperature between 66°C and 76°C. This range is in agreement with the temperatures of geothermal spring sites in which 'Ca. Kryptonia' has been found.
Another potential signature of extreme thermophilic or hyperthermophilic lifestyle of 'Ca. Kryptonia' is the absence of dihydrouridine synthase genes in all 'Ca. Kryptonia' genotypes. These enzymes perform posttranscriptional reduction of uridines in RNAs and are common in bacteria, eukaryotes and euryarchaeota 26 . Since dihydrouridine ring is not aromatic and non-planar, uridine reduction prevents stacking interactions with other nucleosides, thereby increasing RNA flexibility. Psychrophilic bacteria and archaea are known to have many dihydrouridine modifications in their tRNAs 27 , whereas thermophilic tRNAs often lack dihydrouridine and carry other types of modifications, such as methylated nucleosides, which stabilize their secondary structure 28 . A diverse complement of potential tRNA methyltransferases was identified in different 'Ca. Kryptonia' genomes including up to four copies of SpoU family methylases (identified by matches to Pfam family PF00588), which include bacterial tRNA methyltransferases trmH, trmJ and trmL [29][30][31] . Only two of 'Ca. Kryptonia' SpoU-like methylases (exemplified by JGI1_01862 and JGI1_00473 from GFM-1) have putative orthologs in other members of the FCB superphylum suggesting that the remaining enzymes may be involved in thermophilic adaptation.
Extreme thermophilic and hyperthermophilic bacteria and archaea produce a variety of compatible solutes, some of which have not only osmoprotective, but also thermoprotective properties 32 . These include amino acids and derivatives, such as aspartate and beta-glutamate, sugars and derivatives, such as trehalose and mannosylglycerate, and phosphorylated compounds, such as diglycerol phosphate and di-myo-inositol phosphate. All 'Ca. Kryptonia' genomes have myo-inositol phosphate synthase gene (exemplified by JGI1_02220 from GFM-1), but no di-myo-inositol phosphate synthase gene was found. However, the genomes from all locations encode multiple glycosyltransferase genes, most of them of unknown specificity (see below). Some of these glycosyltransferases could be responsible for synthesis of sugar-derived compatible solutes. In addition, genomes from the Dewar Creek hot spring harbor a bifunctional trehalose-phosphate synthase/trehalose-bisphosphatase (exemplified by JGI4_00881 in GFM-4), which produces trehalose, a compatible solute with some thermostabilizing properties 33 .
Ca. Kryptonia replication machinery. Whereas multiple lines of evidence supported affiliation of 'Ca. Kryptonia' genotypes within the FCB superphylum, they did not stably associate with any established phylum within this group. Since members of the FCB superphylum are extremely diverse and adapted to a broad variety of lifestyles, ranging from obligate photoautotrophy 34 to obligate endosymbiosis 35 , we analyzed replication machinery of 'Ca. Kryptonia' and its FCB relatives in an attempt to refine its placement within the superphylum. Bacterial replication is an essential process carried out by protein apparatus well conserved over large phylogenetic distances 36 . However, since bacterial replication machinery is distinct from that of archaea and eukaryotes, replication proteins were not part of the universal single-copy gene set. Orthologs of E. coli dnaE, dnaN and holA coding for DNA polymerase III alpha, beta and delta subunits, respectively, were collected from 'Ca. Kryptonia' and a set of reference FCB genomes in IMG. These were identified as proteins with hits to signature Pfam models (PF07733, PF02767 and PF13177) that were subjected to further refinement based on bi-directional best hits. Protein sequences were aligned, active site residues were analyzed and phylogenetic trees were constructed ( Supplementary Figs. 12-14). 'Ca. Kryptonia' sequences in dnaE, dnaN and holA trees were again robustly associated with the FCB superphylum, but their specific affiliation was different. In the dnaN tree ( Supplementary Fig. 12), 'Ca. Kryptonia' appeared as a basal lineage to the Ignavibacteria and Chlorobi; in the dnaE tree ( Supplementary Fig. 13), 'Ca. Kryptonia' are a sister lineage to Ignavibacteria; and in the holA tree ( Supplementary Fig. 14), 'Ca. Kryptonia' does not cluster with either phylum. Analysis of active site residues in DNA polymerase subunits also suggests that 'Ca. Kryptonia' is a distinct lineage within the FCB superphylum. For instance, in the dnaN binding pocket a histidine residue corresponding to H175 in E. coli is nearly universally conserved in FCB sequences, but it is replaced with tyrosine in 'Ca. Kryptonia' (Supplementary Fig. 12). In addition, clamp-binding pockets in 'Ca. Kryptonia' dnaE and holA are poorly conserved. The consensus sequence of clamp-binding site in 'Ca. Kryptonia' dnaE is QV[QAG]LF, which is similar to the E. coli consensus for clampbinding sites (QL[S/D]LF) (Supplementary Fig. 13D). However, the consensus sequence of clamp-binding site in holA is LLLPVM, which is different both from 'Ca. Kryptonia' dnaE consensus and from consensus clamp-binding sites in other members of FCB superphylum (Supplementary Fig. 14). Other potential clamp-binding proteins in 'Ca. Kryptonia' include polB (DNA polymerase II) and error-prone DNA polymerase detected in 'Ca. Kryptobacter tengchongensis' JGI-24 and 'Ca. Chrysopegis kryptomonas' JGI-23. Consensus sequence of the clamp binding site in the former is DG[LI]NF and QQHLF in the latter, which is reminiscent of 'Ca. Kryptonia' dnaE clampbinding site and E. coli consensus clamp-binding site. Although the lack of sequence conservation between different clamp-binding sites does not always indicate the difference in interaction mechanism 8 , it is uncommon in the representatives of the FCB superphylum and supports 'Ca. Kryptonia' distinct position within this group.

Proteases
A putative cysteine endopeptidase from the peptidase family C25 was recovered and found to be distantly related to the well-characterized gingipains from the anaerobic periodontal pathogen Porphyromonas gingivalis 37 (IMG gene id: 2599801257). A conserved domain for the propeptide (pfam08126) and the highly conserved Cys-His catalytic diad 38 were identified, and lend support for the presence of an unusual secreted protease from 'Ca. Kryptonia' that could be active given the circumneutral pH found within the geothermal spring habitat.

CAzymes
Similar to that of Bacteroidetes and thermophilic sister phylum Ignavibacteria, the 'Ca. Kryptonia' genomes harbor an expanded repertoire of glycosyl hydrolases, predicted enzymes containing carbohydrate-binding modules (CBMs), and other saccharolytic machinery for polysaccharide processing and degradation (Supplementary Data 14). In line with these predictions, we identified a TonB-dependent porin (SusC-like protein) and a glycan-binding SusD-like protein that formed the SusC/D outer membrane transport system for the putative uptake of oligosaccharides 39 . The SusC/D transport system was conserved across the majority of the 'Ca. Kryptonia' genomes, but notably lacking from Great Boiling Springs 'Ca. Thermokryptus mobilis' GFM JGI-1.
A plethora of glycosyltransferases of varying substrate specificity were identified (Supplementary Data 14), with the vast majority comprising seven CAzy (http://www.cazy.org/) families and the largest number constituting family GT41 with specificity for peptide β-N-acetylglucosaminyltransferase 40 . Twenty-three glycosyltransferase family GT41 homologs were identified in GFMs JGI-3 and JGI-4, and genome-wide estimates of combined glycoside hydrolases (GHs) and polysaccharide lyases (PLs) yielded an average 53 GHs/PLs per genome, comparable to members of the Bacteroidetes found in marine habitats and the human gut capable of organic matter degradation 41 .

Bacillithiol
Notably, the gene complement bshA (glycosyltransferase), bshB (N-acetylhydrolase), and bshC (cysteine-adding enzyme) for the synthesis of bacillithiol, a major lowmolecular-weight thiol in Bacillus subtilis and related bacteria 42 , were identified across all the 'Ca. Kryptonia' genotypes. Recently, bacillithiol has been found to contribute to oxidative stress resistance in Staphylococcus aureus 43 . In this context, the genetic potential for bacillithiol synthesis in 'Ca. Kryptonia' might confer a functionally similar trait as in S. aureus and provide a protective role against reactive oxygen species. Additionally, we identified homologs for phytoene desaturase (crtI), beta-carotene 3hydroxylase (crtZ), and a fusion type lycopene β-cyclase 44 for the biosynthesis of zeaxanthin glycosides. Curiously, the genetic potential for carotenoid biosynthesis was restricted to GFM JGI-3 from Gongxiaoshe pool, Yunnan Province, China and the complement SAGs including divergent genotype JGI-24.

Ether-linked membrane lipids
Unusually, we identified two geranylgeranylglyceryl phosphate synthases, which catalyze the first step in the synthesis of ether-linked membrane lipids in archaea, and a geranylgeranyl reductase suggestive of the incorporation of isoprendoid lipids into their membrane. The 'Ca. Kryptonia' enzymes had close matches to clade IIb, and displayed homology to those previously identified within the Bacteroidetes with confirmed activity based on a radiolabelled substrate assay 45 . However, as with members of the Bacteroidetes, the 'Ca. Kryptonia' genomes do not encode an sn-glycerol-1-phosphate dehydrogenase gene and therefore the presumptive function for the predicted geranylgeranylglyceryl phosphate synthases remains enigmatic.

Supplementary Note 3. Co-occurrence of 'Ca. Kryptonia' with other lineages in shotgun metagenomic data
Since many standard primers used in SSU rRNA surveys of microbial community composition are biased against 'Ca. Kryptonia,' we used SSU rRNA profiles obtained from shotgun metagenomic data for 22 geothermal sites (Supplementary Table 5) to assess co-occurrence of 'Ca. Kryptonia' genotypes with other lineages and attempt to identify its metabolic partners. Mapping of raw reads to 16S and 18S rRNA sequences and their assembly retrieved 1,579 sequences longer than 300 nt. Since many of them contained introns, these sequences were trimmed to retain only nucleotides aligning to the 16S rRNA covariance model and clustered with similarly trimmed reference 16S sequences retrieved from the SILVA database. Clustering by UCLUST at 90%, 92% and 94% identity resulted in 136, 152 and 171 clusters, respectively, which had representatives from two or more hot spring samples. Computation of Spearman's rankorder correlation showed that the clusters most strongly correlated with 'Ca. Kryptonia' did not change depending on the clustering cutoff; therefore 90% identity clusters were used in the subsequent analysis. 16S rRNA clusters at 90% identity most highly correlated with the abundance of 'Ca. Kryptonia' genotypes listed in Supplementary  Table 6 include Armatimonadetes, 3 different lineages of Chloroflexi and Thermus spp. nitrate reductase, Nap, in 'Ca. Kryptonia' is less clear. In most organisms, this enzyme functions in dissimilatory nitrate reduction to ammonia, a pathway thought to be more favorable under conditions of low nitrate concentration, but high electron donor availability 51 . So while Thermus spp. may be responsible for denitrification at high nitrate concentrations, 'Ca. Kryptonia' may take over as a nitrate reducer at lower nitrate levels. However, in other organisms, periplasmic nitrate reductase alone is capable of supporting anaerobic growth 52 . Further, unlike bacterial respiratory nitrate reductase and assimilatory nitrate reductase, the active center of Nap is located in the periplasm, so no transmembrane nitrate/nitrite exchange is necessary 53 . Nitrous oxide does not require any transporters to reach the cytosol; but in any case nitrous oxide reductases are periplasmic or extracellular enzymes.