Metagenome-assembled genomes provide new insight into the microbial diversity of two thermal pools in Kamchatka, Russia

Culture-independent methods have contributed substantially to our understanding of global microbial diversity. Recently developed algorithms to construct whole genomes from environmental samples have further refined, corrected and revolutionized understanding of the tree of life. Here, we assembled draft metagenome-assembled genomes (MAGs) from environmental DNA extracted from two hot springs within an active volcanic ecosystem on the Kamchatka peninsula, Russia. This hydrothermal system has been intensively studied previously with regard to geochemistry, chemoautotrophy, microbial isolation, and microbial diversity. We assembled genomes of bacteria and archaea using DNA that had previously been characterized via 16S rRNA gene clone libraries. We recovered 36 MAGs, 29 of medium to high quality, and inferred their placement in a phylogenetic tree consisting of 3,240 publicly available microbial genomes. We highlight MAGs that were taxonomically assigned to groups previously underrepresented in available genome data. This includes several archaea (Korarchaeota, Bathyarchaeota and Aciduliprofundum) and one potentially new species within the bacterial genus Sulfurihydrogenibium. Putative functions in both pools were compared and are discussed in the context of their diverging geochemistry. This study adds comprehensive information about phylogenetic diversity and functional potential within two hot springs in the caldera of Kamchatka.


Solexa sequencing
Sequencing library preparations for Solexa3 84 bp paired-end sequencing were performed by the UC Davis Genome Center DNA Technologies Core Facility (http://dnatech.genomecenter.ucdavis.edu/), where the samples were sequenced on two lanes.

Construction of the microbial phylogenetic tree
The genomes used by Hug  PhyloSift builds an alignment of the concatenated sequences for a set of core markers for each taxon. For more information about the available core marker gene sets see Darling et al. 2014 (updated last on 2018-02-12). We used 37 of these single-copy marker genes. The amino acid alignment of these 37 concatenated genes was trimmed using trimAl v.1.2 3 . Columns with gaps in more than 5% of the sequences were removed, as well as taxa with with less than 75% of the concatenated sequences.
MAGs from ARK and ZAV that did not meet this threshold were manually kept in the alignment. The final alignment 4 comprised 3,240 taxa (Supplementary Table S3) and 5,459 amino acid positions. This alignment was then used to build a new phylogenetic tree in RAxML v. 8.2.10 on the CIPRES Science Gateway web server 5 . First, we searched for the best protein substitution model of the alignment with its empirical base frequencies using the Bayesian Information Criterion (BIC) within RAxML. Then, this substitution model; i.e, LG plus CAT (after Le and Gascuel) 6 , was used to infer a phylogenetic tree. We chose the rapid bootstrapping algorithm (flag -f a) to find the best scoring maximum likelihood tree with 10 starting trees in one run with the number of 3 bootstraps automatically determined (MRE-based bootstopping criterion). One hundred fifty bootstrap replicates were conducted. The full tree inference required 2,236 computational hours on the CIPRES supercomputer (raw tree file in newick format available on figshare 7 ). The Interactive Tree Of Life website iTOL was used to finalize and polish the tree for publication 8 . Taxa from the genera Buchnera (Enterobacteriaceae, endosymbiont of aphids) and Mycoplasma were removed manually from the tree during visual inspection due to them being at the end of extraordinarily long evolutionary branches.
For a detailed walk-through of the analyses, please refer to the associated Jupyter notebooks for ZAV 9 and for ARK 10     This file needs to be unzipped and the resulting FASTA sequence file formatted as a BLAST database using the makeblastdb program.

Execution ---------
The BLAST hits to mitochondrial genomes are filtered for hits over 98.6% identity and at least 120 bases long.
BLAST is used to screen the input sequences against a database of the rRNA gene sequences . 60% of length covered with foreign hits, or less than 200 bp that are NOT covered Each contributing hits must be 100 bp or longer with identity >= 98% The best match to chromosomes from unrelated organisms is longer than the best match to chromosomes from the related organism group 16 Legends for the supplementary Excel tables: Table S1: Summary statistics of quality filtering and metagenomic assembly.
Arkashin Schurf (ARK) and Zavarzin Spring (ZAV) were assembled separately resulting in two metagenomes. Bp = basepairs, nt = nucleotides. Here for each metagenome-assembled genome (MAG), we report basic assembly statistics, including total assembly size, contig N50 and GC content, quality measurements, including completion, contamination and strain heterogeneity, and taxonomic inference. Table with all the taxa that were used to construct the tree in Fig. 2            Shown are all complete KEGG pathways; i.e., gene pathways of which all genes (blocks) were represented that were found in one pool but not in the other; Arkashin Schurf (ARK) and Zavarzin Spring (ZAV). For each pathway its KEGG-ID, name and pathway module are given. Presence of pathways was predicted based on the retrieval of homologous genes.  Shown are all complete KEGG pathways; i.e., gene pathways of which all genes (blocks) were represented that were found in both pools; Arkashin Schurf (ARK) and Zavarzin Spring (ZAV). For each pathway its KEGG-ID, name and pathway module are given. In total there were 78 complete environmental information processing KEGG pathways. For clarity, only mineral and organic ion transport, drug resistance, and bacterial secretion systems are shown. Presence of pathways was predicted based on the retrieval of homologous genes.