A distinct and active bacterial community in cold oxygenated fluids circulating beneath the western flank of the Mid-Atlantic ridge

The rock-hosted, oceanic crustal aquifer is one of the largest ecosystems on Earth, yet little is known about its indigenous microorganisms. Here we provide the first phylogenetic and functional description of an active microbial community residing in the cold oxic crustal aquifer. Using subseafloor observatories, we recovered crustal fluids and found that the geochemical composition is similar to bottom seawater, as are cell abundances. However, based on relative abundances and functional potential of key bacterial groups, the crustal fluid microbial community is heterogeneous and markedly distinct from seawater. Potential rates of autotrophy and heterotrophy in the crust exceeded those of seawater, especially at elevated temperatures (25 °C) and deeper in the crust. Together, these results reveal an active, distinct, and diverse bacterial community engaged in both heterotrophy and autotrophy in the oxygenated crustal aquifer, providing key insight into the role of microbial communities in the ubiquitous cold dark subseafloor biosphere.


Metagenomic Sequence Analysis
Sequences were assessed for Illumina adapter sequences in the 3'-end of the read using cutadapt v1.7.1 (parameters: -a AGATCGGAAGAGC -e 0.08 -overlap=3) 1 . Sequences were then trimmed based on quality scores using Trimmomactic v0.33 with a sliding window of 10bp and an average quality score cutoff of 28 2 . All sequences trimmed below 75bp in length were discarded (parameters: SLIDINGWINDOW:10:28 MINLEN:75). Only read pairs for which both mate survived trimming were retained for assembly and read coverage analysis. For each sample library, quality trimmed sequences were assembled using IDBA-UD v1.1.1 using the default parameters 3 . Contigs < 500bp in length were removed from consideration for further analysis.
Assemblies and putative CDS were retrieved for samples from the Guaymas (Taxon object ID: 3300001683) and Abe, Lau Basin (Taxon object ID: 3300001681) hydrothermal vent plumes.
Both sets of assemblies were generated using IDBA. Assemblies from sediment sampled at 5 cmbsf from the south Pacific (unpublished) and at 75 cmbsf from the Arctic Mid-Ocean ridge (DDBJ/EMBL/Genbank Accession: LAZR00000000) 4 were processed using Prodigal to generate putative CDS. The sample from the south Pacific was assembled using IDBA-UD. The sample from the Arctic Mid-Ocean ridge was assembled using SPAdes v3.0.0 5 .
From the hmmsearch results, putative CDS were assigned to TIGRFAM roles based on the best match. For each metagenome, the relative abundance of each TIGRFAM role was determined (no. of putative CDS assigned to a specific role ÷ total no. of putative CDS assigned to all TIGRFAM roles). The relative abundance of the 115 identified TIGRFAM roles for each sample was visualized using principal component analysis (PCA) to determine the relationship between samples. PCA was performed using the Python library sklearn v0.16.1. Values underwent dimensionality reduction, while being fit to the model.

Assessment of Carbon Fixation Potential
Utilizing the IMG annotations, a database of genes representing the necessary and essential components of the major carbon fixation pathways searched using BLASTp v2.2.30+ (parameters: -evalue 0.001 -max_target_seqs 3) against the putative CDS for each NP sample 8 .
The database contained genes necessary to identify the reductive acetyl-CoA/Wood-Ljunghal pathway, the reverse citric acid (rTCA) cycle, the Calvin-Bensson-Bassham (CBB) cycle, the 3hydroxypropionate/4-hydroxybutyrate (3-OH-propionate/4-OH-butryrate) cycle, and the 3hydroxypropionate (3-OH-propionate) bicycle. Successful matches were limited to matches with > 30% amino acid identity (AAID) and > 30% protein alignment. Putative carbon fixation relevant CDS were used to recruit sequences from quality trimmed sequence libraries, which had been normalized by random sampling to the size of the smallest library (27,737,142 sequences), using BWA v0.6.1-r104 (parameters: bwa aln -n 0; bwa samse -n 0) 9 . Utilizing the output SAM file, the number of sequences assigned to a specific carbon fixation gene was determined.
The quality trimmed metagenomes were queried using Meta-RNA (parameters: -m ssu -e 1e-10) to identify putative 16S rRNA gene fragments 10 . Putative 16S rRNA fragments were then assembled using EMIRGE 11 using the SILVA SSURef_111 12 as a guide (emirge_amplicon.py, parameters: -l 113 -i 163 -s 33 -a 32 -phred33). As part of the EMIRGE assembly process, a length-normalized estimate of relative abundance was determined and full length 16S rRNA sequences with > 1% estimated abundance were identified.
Putative taxonomies were assigned to the assembled full length 16S rRNA sequences using mothur v1.34.4 13 . 16S rRNA sequences were aligned to a SILVA SSURef111 database containing sequences from all three domains of life (align.seqs). Sequences that failed to align in this step were removed (remove.seqs). The remaining sequences were classified based on the SILVA taxonomy (classify.seqs, parameters: cutoff=80, iters=1000).
To determine the putative taxonomies of certain carbon fixation processes (rTCA and Calvin cycles), the putative carbon fixation CDS were BLASTp (parameters: -evalue 1e-5 -max_target_seqs 5) against the NCBI RefSeq database 14 . MEGAN4 was then used to determine the last common ancestor (LCA) of the top five hits for each putative carbon fixation CDS 15 .
Utilizing the number of sequences recruited to each putative carbon fixation CDS, CDS with the same taxonomic assignment, up to the Genus level, had their sequence counts combined and was compared to the total number of sequences recruited for that gene.
For the 100 essential phylogenetic markers gene analysis, putative CDS identified were BLASTp against the NCBI RefSeq database (as above). MEGAN4 was used to assign an LCA to each putative phylogenetic marker based on the top five best RefSeq matches, as described in 16 . Each putative phylogenetic marker was assigned taxonomy at the Class level. Putative markers without assignments at the Class level were consider to lack a taxonomic assignment, though many possess assignments at higher taxonomic levels. The percent relative abundance of each putative phylogenetic marker within the NP metagenomes was determined using libraries normalized by random subsampling to the size of the smallest metagenome and were recruited to the putative phylogenetic markers using BWA (as above). The number of recruited sequences was determined from the SAM file and the sequence count of putative marker genes with the same taxonomic assignment were combined (no. of sequences assigned to putative phylogenetic markers within a Class ÷ total no. of sequences recruited to all putative phylogenetic markers [including those without assignments] × 100).

Metagenomic Results
A total of 29.9 Gbp sequence was generated, with an average of 7.5 ± 1.1 Gbp per sample. The number of putative CDS predicted by Prodigal was on average ± 2.3% of the number of putative CDS predicted using the IMG annotation pipeline. In total, the IMG pipeline predicted 937,375 putative CDS from the North Pond metagenomes, with 234,343 ± 146,386 putative CDS per sample.

SUPPLEMENTARY TABLES AND FIGURE LEGENDS
Supplementary