Background & Summary

Microbial communities are key drivers of marine biogeochemistry. Our understanding of the incredible complexity and diversity of natural microbial populations has been greatly enhanced by the advent of cultivation-independent techniques for sequencing DNA directly from an environmental sample. Despite progress in describing the complexity of these natural systems, many gaps remain in our understanding of the distribution of genes and organisms in the oceans as well as the selective forces that structure community composition and distribution across space and time.

While previous large-scale marine sequencing efforts such as the Global Ocean Survey1,2 and Tara Oceans3 expeditions have greatly expanded our understanding of ocean microbiomes, these ecosystems remain vastly undersampled. The oceans present many challenges for sampling, including both their dynamic nature–e.g. weather, turbulence, movements of water masses, and mixing–as well as their remoteness. Further, understanding the forces that shape these communities requires detailed physical and chemical measurements associated with individual samples to provide information on the selective pressures that might play a role.

Here we present whole community metagenomic data from 610 samples collected in the Atlantic and Pacific Oceans. These data represent snapshots of microbial communities sampled across space and time, and are associated with physical and chemical measurements which are of value in addressing integrative research questions. The first set of metagenomes, collected under the auspices of the bioGEOTRACES component of the international GEOTRACES program4, comprises 480 samples collected in 2010-2011 (Data Citation 1). These samples come from 91 stations visited over four major cruise transects, with 2-10 depths (median 5) sampled at each station (Fig. 1; Table 1). An extensive suite of physical and chemical measurements, comprising over 147 unique data types including salinity, oxygen, temperature, nutrients, and detailed trace metal concentrations4, are available for these samples.

Figure 1: GEOTRACES metagenomic sampling locations.
figure 1

(a) Global map indicating the location of each sampling station where metagenomes were collected on the indicated cruise. Sample locations are shown in relation to sites sampled during two other large-scale marine metagenome sampling projects, the GOS and TARA datasets13 for context. (b) Depth distribution of metagenome samples along each of the four GEOTRACES cruises. Transect distances are calculated relative to the first station sampled in the indicated orientation. The depth distribution of samples collected below 250 m are not shown to scale for clarity (ranging from 281–5601 m; see Table 3 (available online only)).

Table 1 Summary of GEOTRACES cruise metagenome samples.

The second set of metagenomes contains time series data collected at two long-term ocean study sites: Station ALOHA in the North Pacific Subtropical Gyre, sampled as part of the Hawai’i Ocean Time-series (HOT) program5, and the Bermuda-Atlantic Time-series Study (BATS) Station6 in the Sargasso Sea (Data Citation 2). Water samples were collected every month for two years (2003-2004) at both locations, and we sequenced libraries from 3 depths per month (between 1-180 m), representing surface water, the deep chlorophyll maximum, and the bottom of the euphotic zone (Fig. 2; Table 2). Two additional samples collected from each site in 2009 are also included (Table 3 (available online only)). This temporal sampling scheme provides opportunities to compare and contrast variations within and between these two oligotrophic ocean regimes across seasonal, inter-annual and intra-annual time scales. Station ALOHA, for example, remains stably stratified throughout much of the summer7, and is often considered to be a N-limited ecosystem8,9; BATS, on the other hand, is subject to deep winter mixing events and is generally considered to be a P-limited system6. Both HOT and BATS metagenomes are associated with concurrent measurements of numerous other parameters, including physical characterization (e.g., light, temperature, salinity), nutrient concentrations, biological process rate measurements, and in situ cell concentrations5,6.

Figure 2: Time-series metagenome sampling.
figure 2

Black dots indicate the time and depth of each sample sequenced from Station ALOHA (Hawai’i Ocean Time-Series; a, b) and the Bermuda-Atlantic Time Series station (BATS; c, d). Sampling scheme is depicted in the context of total bacterial counts (a, c) and chlorophyll abundance (b, d) data from HOT22 and BATS6. The middle depth samples were chosen to track the deep chlorophyll maximum.

Table 2 Summary of time series metagenome samples.
Table 3 Detailed sampling location and accession information for all metagenomic datasets.

The complete dataset contains over 5 terabases (in 1.67×1010 paired-end reads) of raw sequence data (Table 4) (Data Citation 1 and Data Citation 2). In addition to the paired-end reads, we also include a set of assembled contigs from each metagenome library (Data Citation 3 for GEOTRACES and Data Citation 4 for HOT and BATS). As these metagenomes represent the microbial community in whole water samples, sequences from bacteria (39% of reads), archaea (4%), eukaryotes (1%) and viruses (2%) are present in roughly the same proportions observed in other marine datasets10. Future improvements in reference databases will likely continue to reduce the number of unidentified reads and refine read recruitment.

Table 4 Per-library summary statistics of the metagenome datasets.

We anticipate that these data will be useful for addressing a wide variety of research questions and generating new hypotheses across a broad range of disciplines including, but not limited to, microbial ecology, population genetics, evolution, and oceanography. In particular, the physical, chemical, and biological measurements associated with these samples enable studies of the relationships between microbial community structure, functional potential, biogeochemical cycles, and specific environmental variables.

Methods

Whole water samples were collected onto 0.2 μm filters and preserved using previously described protocols for qPCR sampling1114. Briefly, water was transferred from the appropriate Niskin bottle into a clean 500 mL amber bottle which had been washed three times with seawater from the same Niskin bottle. Replicate filters were prepared from each water sample by passing 100 mL of seawater through a 25mm diameter, 0.2 μm pore size polycarbonate filter under vacuum (9 in Hg maximum pressure). Filters were then chased with 3 mL of sterile preservation solution (10 mM Tris, pH 8.0; 100 mM EDTA; 0.5 M NaCl) and then immediately transferred to cryovials and stored at -80 °C. All glassware and collection bottles were cleaned in 10% bleach followed by extensive rinsing with 18 mΩ water (Millipore Milli-Q).

Total community DNA was extracted using a phenol/chloroform-based extraction method15 that was slightly modified for these samples. Lysing Matrix E beads (MP Biomedicals), 400 ul Phenol:Chloroform:IAA (25:24:1) and 400 ul 2x TENS buffer (100 mM Tris-HCL pH 8.0, 40 mM EDTA, 200 mM NaCl, 2% SDS for 2x buffer) were added to a microcentrifuge tube containing the filter and then vigorously agitated using a beadbeater for 40 seconds. After spinning at 19,000 xg for 5 minutes, the aqueous phase was transferred into a Phase Lock Gel tube (5 Prime), mixed with an equal volume of chloroform, and then spun at ~27,000 x g for 5 minutes. The supernatant was removed and mixed with an equal volume of AMPure XP beads (Beckman Coulter), and incubated at room temperature for 10 minutes. Beads were washed twice with 75% ethanol, dried, and resuspended in 20 uL ultrapure glass distilled water (Teknova). Total DNA yield was quantified using the PicoGreen assay (ThermoFisher) with yields ranging from ~10-2600 ng total DNA.

Sequencing libraries were prepared and sequenced by the MIT BioMicro Center. Libraries were constructed using the NextEra XT kit (Illumina) on an automated Tecan Freedom EVO robotics platform, starting from 1ng of input DNA. Relevant adapter sequences for downstream quality trimming are 5’-CTGTCTCTTATACACATCTCCGAGCCCACGAGAC-3’ and 5’-CTGTCTCTTATACACATCTGACGCTGCCGACGA-3’. Target library insert length was ~250 nt. The resulting libraries were sequenced using the Illumina NextSeq platform to produce 150+150 nt paired reads. Sixteen metagenomes were multiplexed on each lane, and a median total of ~25 million raw paired-end reads was obtained for each sample (range: ~2.6–323 million, due to variations in library loading).

To characterize the overall taxonomic content of the metagenomes (see Background & Summary), low quality regions of sequencing data and Illumina adapter sequences were first removed using Trimmomatic (V0.36)16. The trimmed reads were then assigned taxonomy using Kaiju (V1.5.0)17 in MEM mode with the SEG low complexity filter enabled. Kaiju classification employed a database containing the NCBI nr database (consisting of 103 million bacterial, archaeal viral, fungal, and microbial eukaryotic protein sequences; accessed 2017-05-16)18, the Moore Foundation Marine Microbial Eukaryote Transcriptome Sequencing Project dataset19, and 729 marine single cell genomes20. Figures were generated using Ocean Data View (V4.7.10; http://odv.awi.de) and R (V3.3.2; https://www.R-project.org), with the assistance of a number of R-based tools (tidyverse: https://CRAN.R-project.org/package=tidyverse; sp: https://CRAN.R-project.org/package=sp; geosphere: https://CRAN.R-project.org/package=geosphere; patchwork: https://github.com/thomasp85/patchwork; ggworldmap: https://github.com/thackl/ggworldmap).

Metagenome assemblies of each library were generated using metaSPAdes21 (v3.9.0 and v3.10.1). Paired-end reads were first quality trimmed with Trimmomatic as above, and then used as input for the metaSPAdes algorithm (with the default --meta settings). Assembled contigs shorter than 200bp were discarded.

Code availability

No custom code was used to generate or process these data. Software versions and any relevant variables and parameters employed are as follows:

Trimmomatic (V0.36): -phred33 ILLUMINACLIP:NexteraPE-PE.fa:2:30:10

LEADING:3 TRAILING:3 SLIDINGWINDOW:10:20 MINLEN:75

Kaiju (V1.5.0): -a mem -x

SPAdes (V3.9.0): --meta

SPAdes (V3.10.1): --meta

Data Records

The raw Illumina sequencing reads and sets of assembled contigs for all metagenomes are available from the NCBI Sequence Read Archive (Data Citations 1, 2, 3, 4). Accession numbers, sample date/location, cruise information, and library size for each metagenome can be found in Table 3 (available online only).

Technical Validation

To confirm the reliability of the automated library preparation steps and ensure that sample cross-contamination was minimized, we randomly included blank buffer samples among our samples and verified that these did not yield successful libraries. Prior to sequencing, the quality of the Illumina libraries was assessed on a Fragment Analyzer (Advanced Analytical) to ensure that the median insert size and overall distribution was in the expected range (peak fragment length ~1200 bp, range 300-3000 bp), with a total yield >1 ng; libraries which did not meet these criteria were reprepared. Whenever possible, libraries with relatively low sequencing coverage (<1×107 paired-end reads) were subjected to additional rounds of sequencing. Sequencing quality (as assessed by per-base average sequence quality scores, quality over the length of the read, kmer overrepresentation, etc) was monitored by the MIT BioMicro Center’s automated sequencing analysis pipelines, and libraries were resequenced if necessary.

Usage Notes

All metagenomes are associated with standardized GEOTRACES, HOT, and BATS bottle identification numbers to enable cross-referencing with the relevant databases of physical, chemical, and biological measurements (Table 3 (available online only)). GEOTRACES data can be accessed from the British Oceanographic Data Centre (https://www.bodc.ac.uk/geotraces/). HOT data can be accessed from http://hahana.soest.hawaii.edu/hot/hot-dogs/index.html, and BATS data from http://bats.bios.edu/. GEOTRACES data are periodically updated, and users are encouraged to access the most recent data release. Note that each individual metagenome may not have associated measurements for all possible parameters. Access to data from all of these sources is subject to their respective data use policies.

The assembled contigs for each metagenome (Data Citations 3 and Data Citations 4) are deposited in the NCBI Sequence Read Archive as analysis objects with accession numbers as listed in Table 3 (available online only). Assemblies are also available from the iMicrobe database: Geotraces samples at https://www.imicrobe.us/#/projects/277; HOT samples at https://www.imicrobe.us/#/projects/271; and BATS samples at https://www.imicrobe.us/#/projects/276.

Some use cases for the metagenomes could benefit from overlapping the paired-end data to create longer reads for downstream analysis. In our experience, on average 51% of paired reads overlap. We also note that these reads and assembled contigs may contain a small amount of contamination arising from the sampling, library preparation, and/or sequencing steps. While we worked to minimize this as much as possible, note that the data described here have not been pre-screened.

Additional information

How to cite this article: Biller, S. J. et al. Marine microbial metagenomes sampled across space and time. Sci. Data 5:180176 doi: 10.1038/sdata.2018.176 (2018).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.