A catalogue of 136 microbial draft genomes from Red Sea metagenomes

Earth is expected to continue warming and the Red Sea is a model environment for understanding the effects of global warming on ocean microbiomes due to its unusually high temperature, salinity and solar irradiance. However, most microbial diversity analyses of the Red Sea have been limited to cultured representatives and single marker gene analyses, hence neglecting the substantial uncultured majority. Here, we report 136 microbial genomes (completion minus contamination is ≥50%) assembled from 45 metagenomes from eight stations spanning the Red Sea and taken from multiple depths between 10 to 500 m. Phylogenomic analysis showed that most of the retrieved genomes belong to seven different phyla of known marine microbes, but more than half representing currently uncultured species. The open-access data presented here is the largest number of Red Sea representative microbial genomes reported in a single study and will help facilitate future studies in understanding the physiology of these microorganisms and how they have adapted to the relatively harsh conditions of the Red Sea.


Background & Summary
The Red Sea is an ideal marine environment to study microbial adaptation to physical conditions atypical of global oceans: high temperature, high salinity, and high irradiance. In late summer 2011, we undertook the King Abdullah University of Science and Technology (KAUST) Red Sea Expedition (KRSE2011) in the eastern Red Sea in order to map its diversity along environmental gradients that occur with changes in latitude, longitude, and depth 1 . This time of year is not only when temperatures and evaporation (and hence salinity) are highest, but also when a foreign water mass called the Gulf of Aden Intermediate Water (GAIW) intrudes into the Red Sea 1,2 (Fig. 1). The GAIW brings nutrient-rich water to the Red Sea, providing nitrogen, phosphorus, and other elements to this otherwise oligotrophic sea, and is likely to introduce important microbial diversity.
Insights into the taxonomic, evolutionary, and functional diversity of the Red Sea have largely been based on studies of pure cultures [3][4][5] and single marker genes such as the 16S rRNA 6,7 , or internal transcribed spacer 8 . Recently, investigations of microbial ecology have steered towards whole genome-based culture-independent methods notably single-cell genomics and metagenomics 9,10 . Single-cell genomics is an exciting field that recovers complete and partial single cell genomes from complex environments, albeit the need of specialised equipment, high cost and relatively low throughput [11][12][13] . Metagenomics is paving the way forward by harnessing the recent wave of sequencing technology and bioinformatics advancements to recover genomes of individual populations or populations of closely related organisms [14][15][16] . Application of these methods has resulted in the recovery of numerous genomes of uncultivated microorganisms that have provided surprising insights into the diversity and function of microbial communities 10,14,[17][18][19] .
During the KRSE2011, eight stations were sampled along a cruise track from south to north, capturing gradients in temperature, salinity, oxygen, and nutrients, including the unique GAIW water mass ( Fig. 1 and Table 1 (available online only)). At each station, samples were collected from the surface to mesopelagic depths (10,25,50,100,200, and 500 m), except for stations 12 and 34, which had depths shallower than 500 m ( Fig. 1 and Table 1 (available online only)), in order to capture a greater variation in environmental parameters and microbial diversity. Here, we successfully reconstructed 136 genomes from 45 individually assembled metagenomes (Figs 1 and 2, Tables 1 and 2 (available online only), Data Citation 1) by differential read coverage and tetranucleotide frequency methods. Of these, 43 were 'near-complete' with an estimated completion minus contamination of ≥90%, while the other 93 draft genomes had completion minus contamination of ≥50% (Table 2 (available online only)). The green lines represent the three Gulf of Aden Intermediate Water (GAIW) sampling points. The numbers within the circles represent the number of genomes recovered from each of the sample. Colors represent the high (dark red) to low (dark blue) water temperature. A total of 45 samples of 20 l each were collected and filtered through a series of filters. For this study, DNA extraction was performed on the small microbial fractions (between 0.1 to 1.2 μm). Extracted DNA was sequenced on the Illumina HiSeq 2,000 generating paired-end reads (2 × 93 bp). Reads from each metagenome were cleaned and assembled individually. Genomes were binned based on tetranucleotide and coverage-based method, refined and quality checked. All 136 genomes were annotated by IMG/ER and taxonomically assigned based on genome trees inferred from single-copy genes.  domains based on 122 and 120 single-copy marker genes, respectively. The clades represented by the triangles are collapsed at the phylum (P) level except for phyla containing genomes from this study which are expanded at the class (C) level and highlighted in red. Certain phyla have genome representatives only at the phylum level (Thaumarchaeota, Marinimicrobia, Cyanobacteria, and Bdellovibrionaeota). Numbers in parentheses indicate the count of recovered genomes from a particular taxonomic level. Dashed lines indicate nodes for class level. Robustness of the tree is indicated by black circles (size of circles scaled from 80 to 100% bootstrap support values). Trees were inferred independently. The archaeal tree was rooted with the DPANN superphylum 9 while the bacterial tree was 'arbitrarily' rooted with the phylum Chloroflexi 42 but should be treated as unrooted. To our knowledge, this is the largest number of microbial genomes from the Red Sea to be reported in a single study.
Phylogenomic analysis based on sets of single-copy marker genes universal to either the bacterial or archaeal domain showed that the 136 genomes encompassed seven phyla across these domains: Thaumarchaeota, Euryarchaeota, Actinobacteria, Cyanobacteria, Bdellovibrionaeota, Proteobacteria, and Marinimicrobia ( Fig. 2 and Table 2 (available online only)). As expected, most of the recovered genomes were affiliated with known marine microorganisms such as phototrophic Prochlorococcus 20,21 and Synechococcus 22,23 ; representative of clades first discovered in the Sargasso Sea (SAR86, SAR116, SAR324 and SAR406) [24][25][26] ; common marine bacteria in tropical biomes such as Alteromonas macleodii 27 ; an ammonia oxidizing thaumarchaeon from the genus Nitrosopelagicus 28 ; euryarchaeotal Marine Group II organisms reported to be abundant in surface waters 29 ; members of the Alpha-and Gamma-proteobacteria such as Aeromicrobium, Erythrobacter, Maritimibacter, Idiomarina, Marinobacter, Candidatus Thioglobus (SUP05 cluster) and several unclassified Gammaproteobacteria, consistent with the high relative abundance of these two groups in the recent Tara Oceans survey 30 . Additionally, actinobacterial Acidiimicrobia and Nocardioides genomes thought to be responsible for secondary metabolite production in marine ecosystems 31 were recovered from the metagenomes. An important strength of this dataset is the recovery of multiple, closely-related genomes from different stations or depths in the Red Sea (Data Citation 2). When complemented with physicochemical data 1 , genome plasticity between these organisms to confer fitness under varying conditions can be investigated in future studies.
To allow easy access to the genomes, all 136 genomes were functionally annotated and deposited into the National Centre for Biotechnology Information (NCBI) and Integrated Microbial Genomes (IMG) databases 32 . The wealth of metagenomic and genomic data described here greatly expands the repertoire of microbial genomic information from the Red Sea which might help to better understand the effects of global warming to ocean microbiomes. These datasets will also strengthen studies to better understand the drivers of marine nutrient cycling, help approaches for bioprospecting for novel thermo-and halo-philic enzymes, and allow for a better understanding of microbial adaptation strategies against high temperature, salinity and solar irradiance.

Genome binning, refinement, and annotation
For each metagenome, genome bins were recovered based on tetranucleotide frequencies and read coverage using MetaBAT v0. 26.1 (ref. 37) with default parameters. The completeness and contamination of the bins were assessed using CheckM v1.0.3 (ref. 38) using the lineage-specific workflow (Table 2 (available online only)). Bins were further refined using the CheckM 'merge' and 'outliers' commands which merge bins with complementary sets of marker genes to improve completeness and remove contigs from bins which appear to be outliers relative to reference GC and tetranucleotide distributions in order to reduce contamination 38 . The FinishM v0.0.7 (https://github.com/wwood/finishm) 'roundup' workflow which comprise of 'wander' and 'gapfill' modes was used to scaffold contigs together and fill gaps within individual bins. The 'wander' mode uses a de Bruijn graph (kmer length of 51 bp and coverage cutoff of 5) to determine contig ends which are connected while the 'gapfill' mode align the reads to regions of ambiguous nucleotides and replaces them with the appropriate nucleotides. Genome bins that passed the quality filter of completion minus contamination of ≥50% were submitted to IMG/ER 32 for gene calling and functional annotation.

Genome tree construction
The archaeal and bacterial genome trees (Fig. 2) were inferred from the concatenation of 122 and 120 proteins, respectively, identified as being present in ≥90% of the genomes in their respective domains and, when present, single-copy in ≥95% of genomes ( Supplementary Tables 1 and 2). These marker genes were aligned using HMMER v3.1b1 (ref. 39) and the tree inference from the concatenated alignment with FastTree v2.1.7 (ref. 40) under the WAG+GAMMA models (Data Citation 2). Support values were determined using 100 non-parametric bootstrap replicates 41 . The archaeal tree was rooted with the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanohaloarchaeota, and Nanoarchaeota) superphylum in concordance with a recent large-scale phylogenomic study 9 while the bacterial tree was 'arbitrarily' rooted with the phylum Chloroflexi 42 but should be treated as unrooted. The trees were visualized in ARB 43 , annotated by iTOL 44

Code availability
All versions of third-party software and scripts used in this study are described and referenced accordingly in the Methods sub-sections for ease of access and reproducibility.

Data Records
The raw Illumina sequencing paired-end reads (Table 1 (available online only)), 45 assembled metagenome sequences (Table 1 (available online only)) and 136 assembled genome sequences (Table 2 (available online only)), generated from the KAUST Red Sea Expedition 2011, are available from NCBI databases (Data Citation 1). The genome trees and associated fasta amino acid alignment files are available from Figshare (Data Citation 2).

Technical Validation
To validate the completeness and contamination of the genomes, we accessed the number of marker genes present in all bacterial and archaeal genomes using CheckM 38 . The genomes were also manually cleaned from vector contamination by comparing against the UniVec core database (ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/).

Usage Notes
The annotated genome assemblies can be downloaded and accessed via the Integrated Microbial Genomes (IMG) system (https://img.jgi.doe.gov/cgi-bin/m/main.cgi). The IMG genome IDs are provided in Table 2 (available online only).