Background & Summary

The Red Sea is an ideal marine environment to study microbial adaptation to physical conditions atypical of global oceans: high temperature, high salinity, and high irradiance. In late summer 2011, we undertook the King Abdullah University of Science and Technology (KAUST) Red Sea Expedition (KRSE2011) in the eastern Red Sea in order to map its diversity along environmental gradients that occur with changes in latitude, longitude, and depth1. This time of year is not only when temperatures and evaporation (and hence salinity) are highest, but also when a foreign water mass called the Gulf of Aden Intermediate Water (GAIW) intrudes into the Red Sea1,2 (Fig. 1). The GAIW brings nutrient-rich water to the Red Sea, providing nitrogen, phosphorus, and other elements to this otherwise oligotrophic sea, and is likely to introduce important microbial diversity.

Figure 1: Experimental workflow for this study.

The circles superimposed on the 3D map of the Red Sea show the sampling points of the King Abdullah University of Science and Technology Red Sea Expedition 2011. The green lines represent the three Gulf of Aden Intermediate Water (GAIW) sampling points. The numbers within the circles represent the number of genomes recovered from each sample. Colors represent water temperature from high (dark red) to low (dark blue). A total of 45 samples of 20 l each were collected and passed through a series of filters. For this study, DNA was extracted from the small microbial size fraction (0.1 to 1.2 μm). Extracted DNA was sequenced on the Illumina HiSeq 2000, generating paired-end reads (2×93 bp). Reads from each metagenome were cleaned and assembled individually. Genomes were binned using tetranucleotide frequency and coverage-based methods, then refined and quality checked. All 136 genomes were annotated by IMG/ER and taxonomically assigned based on genome trees inferred from single-copy genes.

Insights into the taxonomic, evolutionary, and functional diversity of the Red Sea have largely been based on studies of pure cultures3–5 and single marker genes such as the 16S rRNA gene6,7 or the internal transcribed spacer8. Recently, investigations of microbial ecology have shifted towards whole-genome-based, culture-independent methods, notably single-cell genomics and metagenomics9,10. Single-cell genomics recovers complete and partial single-cell genomes from complex environments, albeit with the need for specialised equipment, high cost, and relatively low throughput11–13. Metagenomics harnesses recent advances in sequencing technology and bioinformatics to recover genomes of individual populations or populations of closely related organisms14–16. Application of these methods has resulted in the recovery of numerous genomes of uncultivated microorganisms that have provided surprising insights into the diversity and function of microbial communities10,14,17–19.

During the KRSE2011, eight stations were sampled along a cruise track from south to north, capturing gradients in temperature, salinity, oxygen, and nutrients, including the unique GAIW water mass (Fig. 1 and Table 1 (available online only)). At each station, samples were collected from the surface to mesopelagic depths (10, 25, 50, 100, 200, and 500 m), except for stations 12 and 34, which had depths shallower than 500 m (Fig. 1 and Table 1 (available online only)), in order to capture a greater variation in environmental parameters and microbial diversity. Here, we successfully reconstructed 136 genomes from 45 individually assembled metagenomes (Figs 1 and 2, Tables 1 and 2 (available online only), Data Citation 1) by differential read coverage and tetranucleotide frequency methods. Of these, 43 were ‘near-complete’ with an estimated completion minus contamination of ≥90%, while the other 93 draft genomes had completion minus contamination of ≥50% (Table 2 (available online only)). To our knowledge, this is the largest number of microbial genomes from the Red Sea to be reported in a single study.
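The two quality tiers described above reduce to a simple score: estimated completeness minus estimated contamination, both in percent. The following is a minimal sketch of that arithmetic, not the authors' code; the function names, the "rejected" label, and the example bins are hypothetical and serve only as illustration.

```python
def quality_score(completeness: float, contamination: float) -> float:
    """Return completion minus contamination (both given in percent)."""
    return completeness - contamination


def quality_tier(completeness: float, contamination: float) -> str:
    """Assign a tier using the >=90% and >=50% thresholds stated in the text.

    The label "rejected" is an assumption for bins below the 50% cutoff.
    """
    score = quality_score(completeness, contamination)
    if score >= 90.0:
        return "near-complete"
    if score >= 50.0:
        return "draft"
    return "rejected"


if __name__ == "__main__":
    # Hypothetical bins: (name, completeness %, contamination %).
    example_bins = [("bin_A", 97.2, 1.1), ("bin_B", 71.5, 4.0), ("bin_C", 52.0, 6.3)]
    for name, comp, cont in example_bins:
        print(name, quality_tier(comp, cont))
```

Subtracting contamination from completeness penalises bins that look complete only because they mix sequence from several populations.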

Table 1 Characteristics of the 45 Red Sea metagenomic samples
Figure 2: Phylogenetic trees for the archaeal (green lines; top left) and bacterial (blue lines; bottom right) domains based on 122 and 120 single-copy marker genes, respectively.

The clades represented by the triangles are collapsed at the phylum (P) level except for phyla containing genomes from this study which are expanded at the class (C) level and highlighted in red. Certain phyla have genome representatives only at the phylum level (Thaumarchaeota, Marinimicrobia, Cyanobacteria, and Bdellovibrionaeota). Numbers in parentheses indicate the count of recovered genomes from a particular taxonomic level. Dashed lines indicate nodes for class level. Robustness of the tree is indicated by black circles (size of circles scaled from 80 to 100% bootstrap support values). Trees were inferred independently. The archaeal tree was rooted with the DPANN superphylum9 while the bacterial tree was ‘arbitrarily’ rooted with the phylum Chloroflexi42 but should be treated as unrooted.

Table 2 Characteristics of the 136 genomes reported in this study

Phylogenomic analysis based on sets of single-copy marker genes universal to either the bacterial or archaeal domain showed that the 136 genomes encompassed seven phyla across these domains: Thaumarchaeota, Euryarchaeota, Actinobacteria, Cyanobacteria, Bdellovibrionaeota, Proteobacteria, and Marinimicrobia (Fig. 2 and Table 2 (available online only)). As expected, most of the recovered genomes were affiliated with known marine microorganisms such as phototrophic Prochlorococcus20,21 and Synechococcus22,23; representatives of clades first discovered in the Sargasso Sea (SAR86, SAR116, SAR324 and SAR406)24–26; common marine bacteria in tropical biomes such as Alteromonas macleodii27; an ammonia-oxidizing thaumarchaeon from the genus Nitrosopelagicus28; euryarchaeotal Marine Group II organisms reported to be abundant in surface waters29; and members of the Alpha- and Gamma-proteobacteria such as Aeromicrobium, Erythrobacter, Maritimibacter, Idiomarina, Marinobacter, Candidatus Thioglobus (SUP05 cluster) and several unclassified Gammaproteobacteria, consistent with the high relative abundance of these two groups in the recent Tara Oceans survey30. Additionally, actinobacterial Acidiimicrobia and Nocardioides genomes, thought to be responsible for secondary metabolite production in marine ecosystems31, were recovered from the metagenomes. An important strength of this dataset is the recovery of multiple, closely related genomes from different stations or depths in the Red Sea (Data Citation 2). When complemented with physicochemical data1, these genomes allow future studies to investigate the genome plasticity that confers fitness under varying conditions.

To allow easy access to the genomes, all 136 genomes were functionally annotated and deposited into the National Center for Biotechnology Information (NCBI) and Integrated Microbial Genomes (IMG) databases32. The wealth of metagenomic and genomic data described here greatly expands the repertoire of microbial genomic information from the Red Sea, which might help to better understand the effects of global warming on ocean microbiomes. These datasets will also strengthen studies of the drivers of marine nutrient cycling, support bioprospecting for novel thermophilic and halophilic enzymes, and allow for a better understanding of microbial adaptation strategies against high temperature, salinity and solar irradiance.

Methods

Metagenomic sequencing and assembly

Seawater samples were collected from eight stations and from different depths (10, 25, 50, 100, 200, and 500 m; locations are shown in Fig. 1) during summer as part of KRSE2011 (ref. 1). Genomic DNA was extracted from the 0.1–1.2 μm size fraction using an established phenol-chloroform extraction protocol1,33. Paired-end libraries (2×100 bp) were prepared using the Nextera DNA Library Prep Kit (Illumina) and sequenced on a HiSeq 2000 (Illumina). Reads were quality checked and trimmed using PRINSEQ v0.20.4 (ref. 34), generating read lengths of ~93 bp and a total of ~10 million reads per sample with median insert sizes ranging from 183 to 366 bp1 (Data Citation 1). Trimmed reads from each metagenome were assembled individually (Table 1 (available online only)) using IDBA-UD v1.1.1 (ref. 35) with the ‘--pre-correction’ option. To obtain a coverage profile of the contigs from each metagenomic assembly, the trimmed reads were mapped back to the contigs using BWA v0.7.12 (ref. 36) with the bwa-mem algorithm.
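The coverage profile mentioned above can be pictured as the mean read depth per contig: the total number of aligned bases divided by the contig length. The sketch below illustrates that idea under simplifying assumptions; it takes alignments as plain (contig, start, end) tuples rather than parsing real BWA output, and the function name and example values are hypothetical. It is not the procedure used by the authors or by MetaBAT.

```python
from collections import defaultdict


def mean_coverage(contig_lengths, alignments):
    """Estimate mean read depth per contig.

    contig_lengths: dict mapping contig name -> length in bp.
    alignments: iterable of (contig, start, end) with half-open coordinates,
                a simplified stand-in for mapped reads.
    """
    aligned_bases = defaultdict(int)
    for contig, start, end in alignments:
        aligned_bases[contig] += end - start
    # Contigs without any mapped read get a coverage of 0.0.
    return {
        contig: aligned_bases[contig] / length
        for contig, length in contig_lengths.items()
    }


if __name__ == "__main__":
    # Hypothetical contigs and three ~93 bp reads.
    lengths = {"contig_1": 1000, "contig_2": 500, "contig_3": 800}
    reads = [("contig_1", 0, 93), ("contig_1", 50, 143), ("contig_2", 10, 103)]
    print(mean_coverage(lengths, reads))
```

Contigs that originate from the same population are expected to show similar depth within a sample, which is why coverage is a useful binning signal alongside sequence composition.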

Genome binning, refinement, and annotation

For each metagenome, genome bins were recovered based on tetranucleotide frequencies and read coverage using MetaBAT v0.26.1 (ref. 37) with default parameters. The completeness and contamination of the bins were assessed using the lineage-specific workflow of CheckM v1.0.3 (ref. 38) (Table 2 (available online only)). Bins were further refined using the CheckM ‘merge’ and ‘outliers’ commands, which respectively merge bins with complementary sets of marker genes to improve completeness and remove contigs that appear to be outliers relative to reference GC and tetranucleotide distributions to reduce contamination38. The FinishM v0.0.7 (https://github.com/wwood/finishm) ‘roundup’ workflow, which comprises the ‘wander’ and ‘gapfill’ modes, was used to scaffold contigs together and fill gaps within individual bins. The ‘wander’ mode uses a de Bruijn graph (kmer length of 51 bp and coverage cutoff of 5) to determine which contig ends are connected, while the ‘gapfill’ mode aligns reads to regions of ambiguous nucleotides and replaces them with the appropriate nucleotides. Genome bins that passed the quality filter of completion minus contamination of ≥50% were submitted to IMG/ER32 for gene calling and functional annotation.
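To make the tetranucleotide signal used in binning concrete, the following sketch computes a normalised tetranucleotide frequency vector for a single contig. It is an illustration only and does not reproduce MetaBAT's internal model. Collapsing each 4-mer with its reverse complement is an assumption made here (it is a common convention because contig strand is arbitrary), and all names are hypothetical.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# The 256 possible 4-mers collapse to 136 canonical tetramers.
CANONICAL_TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def tetranucleotide_frequencies(seq: str) -> list:
    """Return frequencies of canonical tetramers, ordered as CANONICAL_TETRAMERS.

    Windows containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in CANONICAL_TETRAMERS]


if __name__ == "__main__":
    freqs = tetranucleotide_frequencies("ACGTACGTAC")
    print(len(freqs), round(sum(freqs), 6))
```

Contigs from the same genome tend to have similar frequency vectors, so distances between these vectors, combined with coverage, can be used to group contigs into bins.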

Genome tree construction

The archaeal and bacterial genome trees (Fig. 2) were inferred from the concatenation of 122 and 120 proteins, respectively, identified as being present in ≥90% of the genomes in their respective domains and, when present, single-copy in ≥95% of genomes (Supplementary Tables 1 and 2). These marker genes were aligned using HMMER v3.1b1 (ref. 39), and the trees were inferred from the concatenated alignments with FastTree v2.1.7 (ref. 40) under the WAG+GAMMA model (Data Citation 2). Support values were determined using 100 non-parametric bootstrap replicates41. The archaeal tree was rooted with the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanohaloarchaeota, and Nanoarchaeota) superphylum in concordance with a recent large-scale phylogenomic study9, while the bacterial tree was ‘arbitrarily’ rooted with the phylum Chloroflexi42 but should be treated as unrooted. The trees were visualized in ARB43, annotated with iTOL44 and edited in Illustrator CC 2014 (Adobe).
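The marker selection criteria stated above (present in ≥90% of genomes and, when present, single-copy in ≥95% of genomes) can be expressed as a simple filter over per-genome copy numbers. The sketch below is a hypothetical illustration of those two thresholds, not the pipeline used to derive the 122 archaeal and 120 bacterial markers; the input layout, function name, and example genes are assumptions.

```python
def select_markers(copy_counts, min_presence=0.90, min_single_copy=0.95):
    """Select genes that are near-universal and near-always single-copy.

    copy_counts: dict mapping gene name -> list of copy numbers, one entry
                 per reference genome (0 means the gene is absent).
    """
    selected = []
    for gene, counts in copy_counts.items():
        present = [c for c in counts if c > 0]
        if not counts or not present:
            continue
        # Fraction of genomes that carry the gene at all.
        presence = len(present) / len(counts)
        # Among genomes carrying the gene, fraction with exactly one copy.
        single_copy = sum(1 for c in present if c == 1) / len(present)
        if presence >= min_presence and single_copy >= min_single_copy:
            selected.append(gene)
    return selected


if __name__ == "__main__":
    # Hypothetical copy numbers across 20 reference genomes.
    example = {
        "gene_ok": [1] * 19 + [0],        # 95% present, always single-copy
        "gene_rare": [1] * 10 + [0] * 10,  # present in only 50% of genomes
        "gene_multi": [1] * 16 + [2] * 4,  # universal but often duplicated
    }
    print(select_markers(example))
```

Restricting the concatenated alignment to such genes limits missing data and reduces the risk of comparing paralogous copies.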

Code availability

All versions of third-party software and scripts used in this study are described and referenced accordingly in the Methods sub-sections for ease of access and reproducibility.

Data Records

The raw Illumina sequencing paired-end reads (Table 1 (available online only)), 45 assembled metagenome sequences (Table 1 (available online only)) and 136 assembled genome sequences (Table 2 (available online only)), generated from the KAUST Red Sea Expedition 2011, are available from NCBI databases (Data Citation 1). The genome trees and associated fasta amino acid alignment files are available from Figshare (Data Citation 2).

Technical Validation

To validate the completeness and contamination of the genomes, we assessed the number of marker genes present in all bacterial and archaeal genomes using CheckM38. The genomes were also manually cleaned of vector contamination by comparison against the UniVec Core database (ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/).

Usage Notes

The annotated genome assemblies can be downloaded and accessed via the Integrated Microbial Genomes (IMG) system (https://img.jgi.doe.gov/cgi-bin/m/main.cgi). The IMG genome IDs are provided in Table 2 (available online only).

Additional Information

How to cite this article: Haroon, M. F. et al. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. Data 3:160050 doi: 10.1038/sdata.2016.50 (2016).