High spatial resolution global ocean metagenomes from Bio-GO-SHIP repeat hydrography transects

Detailed descriptions of microbial communities have lagged far behind physical and chemical measurements in the marine environment. Here, we present 971 globally distributed surface ocean metagenomes collected at high spatio-temporal resolution. Our low-cost metagenomic sequencing protocol produced 3.65 terabases of data, where the median number of base pairs per sample was 3.41 billion. The median distance between sampling stations was 26 km. The metagenomic libraries described here were collected as a part of a biological initiative for the Global Ocean Ship-based Hydrographic Investigations Program, or “Bio-GO-SHIP.” One of the primary aims of GO-SHIP is to produce high spatial and vertical resolution measurements of key state variables to directly quantify climate change impacts on ocean environments. By similarly collecting marine metagenomes at high spatiotemporal resolution, we expect that this dataset will help answer questions about the link between microbial communities and biogeochemical fluxes in a changing ocean.

www.nature.com/scientificdata

In comparison, systematic and sustained biological measurements of the microbial component of ocean ecosystems have lagged far behind.
We present a dataset of 971 ocean surface water metagenomes collected at high spatio-temporal resolution as part of a novel Bio-GO-SHIP sampling program, in an effort to more mechanistically link marine microbial traits and biodiversity to both chemical and hydrodynamic ecosystem fluxes. Samples were collected in the Atlantic, Pacific, and Indian Ocean basins (Fig. 1, Table 1). This effort has been supported by GO-SHIP, SOCCOM, the Plymouth Marine Laboratory Atlantic Meridional Transect (PML AMT), and three National Science Foundation (NSF) Dimensions of Biodiversity funded cruises (AE1319, BVAL46, and NH1418) (Table 2). Whereas the median distance between Tara Oceans sampling stations was 709 km and the median distance between bioGEOTRACES sampling stations was 191 km, the median distance between sampling stations in the current Bio-GO-SHIP dataset is 26.5 km (Fig. 2). In addition, the majority of Bio-GO-SHIP samples were collected every 4–6 hours, allowing for analysis of diel fluctuations in microbial composition and gene content [12]. We anticipate that our high-resolution sampling scheme will allow for a more detailed examination of the relationship between the broad range of geochemical parameters measured across the various cruises (Table 2) and microbial diversity and traits.
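As an illustration of how a station-spacing statistic of this kind can be computed, the following minimal sketch takes the median great-circle (haversine) distance between consecutive stations along a single transect. The function names, the Earth radius constant, and the example coordinates are assumptions made for illustration; the spacing reported above may have been derived with a different procedure.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_station_spacing_km(stations):
    """Median distance between consecutive stations of one transect.

    `stations` is a list of (latitude, longitude) tuples in sampling order.
    """
    gaps = [haversine_km(*a, *b) for a, b in zip(stations, stations[1:])]
    return median(gaps)


# Hypothetical transect: four stations along 30°W, spaced 0.25° of latitude apart.
transect = [(10.00, -30.0), (10.25, -30.0), (10.50, -30.0), (10.75, -30.0)]
spacing = median_station_spacing_km(transect)
print(f"median spacing: {spacing:.1f} km")  # prints "median spacing: 27.8 km"
```

For a multi-cruise dataset, one reasonable choice would be to compute the gaps within each cruise separately and pool them before taking the median, so that jumps between cruises do not count as station spacing.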
Due to their rapid generation times and high diversity, microbial genomes integrate the impact of environmental change [13] and can be used as a 'biosensor' of subtle biogeochemical regimes that cannot be identified from physical parameters alone [12,14–16]. Thus, the fields of microbial ecology and oceanography would benefit from coordinated, high-resolution measurements of marine 'omics products (i.e., metagenomes, metatranscriptomes, metaproteomes, etc.). This dataset provides an important example of the benefits of a high spatial and temporal resolution sampling regime. In addition, our data highlight the need for increased sampling of marine metagenomes in the Central and Western Pacific Ocean (Fig. 1), areas above 50°N and 50°S (Fig. 2), and below the euphotic zone. We hope and expect that these challenges will be addressed by the scientific community in the coming decade.

Table 2. Publicly available metadata variables collected on Bio-GO-SHIP cruises. These data may be updated as additional samples or stations are processed by the principal investigators of each dataset. Another 48 metadata variables not listed here were collected aboard the GO-SHIP, PML AMT, and NSF cruises and may be available upon request from CCHDO, BODC, or SOCCOM. *C13.5 is a partial occupation of the A13.5 GO-SHIP line that was aborted due to COVID-19. Thus, CTD casts corresponding to DNA collection were only performed at 8 stations.

A total of 971 metagenomic libraries from 932 locations were prepared using Illumina-specific Nextera DNA transposase adapters and a Tagment … As quality control for the tagmentation products, dimers shorter than 150 nucleotides were removed using a buffered solution (1 M NaCl, 1 mM EDTA, 10 mM Tris-HCl, 44.4 M PEG-8000, 0.055% Tween-20 final concentration) of Sera-mag SpeedBeads (ThermoFisher, Waltham, MA). Metagenomic libraries were quantified using a Qubit dsDNA HS Assay kit (ThermoFisher, Waltham, MA) and a Synergy 2 Microplate Reader (BioTek, Winooski, VT). Libraries were then pooled at equimolar concentrations. Pooled library concentration was verified using a KAPA qPCR platform (Roche, Basel, Switzerland). Finally, dimer removal and read size distribution were checked using a 2100 Bioanalyzer high sensitivity DNA trace (Agilent, Santa Clara, CA).

A total of 54 samples were sequenced on two Illumina HiSeq 4000 lanes using 150 bp paired-end chemistry with 300 cycles (Illumina, San Diego, CA). A further 666 samples were sequenced on three Illumina NovaSeq S4 flowcells, and the remaining 251 samples were sequenced on a combination of S1 and SP flowcells using 150 bp paired-end chemistry with 300 cycles. The sequencing strategy produced a total of 2.42 × 10¹⁰ reads, or 3.65 × 10¹² bp. The median number of bases per sample was 3.41 billion (range: 61,400,000–21.4 billion). Prior to read trimming and quality filtering, 74% of all forward and reverse reads had an average quality score ≥Q25 (Table 1). The sequencing cost per bp in US dollars was $8.2 × 10⁻⁹.
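The reported sequencing totals can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the derived quantities (mean read length, mean bases per sample, and approximate costs) are illustrative back-calculations from rounded inputs, not values reported by the authors.

```python
# Values as reported in the text.
total_reads = 2.42e10
total_bases = 3.65e12
n_samples = 971
median_bases_per_sample = 3.41e9
cost_per_bp_usd = 8.2e-9

# Derived quantities (illustrative, computed from the rounded figures above).
mean_read_length = total_bases / total_reads        # roughly 151 bp, consistent with 150 bp reads
mean_bases_per_sample = total_bases / n_samples     # roughly 3.76 billion bp
total_cost_usd = total_bases * cost_per_bp_usd      # roughly 29,930 USD
median_sample_cost_usd = median_bases_per_sample * cost_per_bp_usd  # roughly 28 USD

print(f"mean read length:       {mean_read_length:.1f} bp")
print(f"mean bases per sample:  {mean_bases_per_sample:.3g} bp")
print(f"total sequencing cost:  ${total_cost_usd:,.0f}")
print(f"cost of median sample:  ${median_sample_cost_usd:.2f}")
```

The mean read length comes out slightly above 150 bp only because the reported totals are rounded to three significant figures. The mean per-sample yield exceeds the median, which is consistent with the wide, right-skewed range of bases per sample.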

Data Records
The majority of the samples here were collected under the auspices of the international GO-SHIP program and the national programs that contribute to it [21–24]. Links to publicly available metadata variables collected via CTD cast are provided in Table 2. All sequencing products associated with the Bio-GO-SHIP program can be found under BioProject ID PRJNA656268 hosted by the National Center for Biotechnology Information Sequence Read Archive (SRA) [30]. SRA accession numbers associated with each metagenome file are provided in Supplementary Table 1.

Technical Validation
To ensure that no contamination of metagenomes occurred, negative controls were used. To ensure optimum paired-end short read sequencing, a 2100 Bioanalyzer high sensitivity DNA trace (Agilent, Santa Clara, CA) was used for each library to confirm that ~90% of the sequence fragments were above 200 bp and below 600 bp in length (Table 3). A Qubit (ThermoFisher, Waltham, MA) and a KAPA qPCR platform (Roche, Basel, Switzerland) were used to ensure that all pooled libraries were submitted for sequencing at a concentration >15 nM.

Table 3. Sequencing run breakdown of Bio-GO-SHIP metagenomes including technical validation statistics. *Run 1 was concentrated via SpeedVac to 15 nM and bead size-selected such that 90% of fragments were between 200–600 bp by the UC Davis Genome Center DNA Technologies Core prior to sequencing. Final values for this run are not available.
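The two acceptance criteria described in this section (fragment size distribution and pooled library concentration) can be expressed as a simple check. This is a minimal sketch under stated assumptions: the `PoolQC` record, the field names, and the example values are hypothetical, and treating "~90%" as a strict 0.90 cutoff is an assumption made here for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the text; the strict 0.90 cutoff is an assumption,
# since the text states the fragment criterion only approximately ("~90%").
MIN_FRACTION_200_600_BP = 0.90
MIN_CONCENTRATION_NM = 15.0


@dataclass
class PoolQC:
    """Hypothetical record of validation measurements for one pooled library."""
    name: str
    fraction_200_600_bp: float  # fraction of fragments between 200 and 600 bp
    concentration_nm: float     # pooled library concentration in nM


def passes_validation(pool: PoolQC) -> bool:
    """Return True if the pool meets both criteria described in the text."""
    size_ok = pool.fraction_200_600_bp >= MIN_FRACTION_200_600_BP
    conc_ok = pool.concentration_nm > MIN_CONCENTRATION_NM
    return size_ok and conc_ok


# Hypothetical example values, not taken from Table 3.
pools = [
    PoolQC("pool_A", fraction_200_600_bp=0.93, concentration_nm=18.2),
    PoolQC("pool_B", fraction_200_600_bp=0.85, concentration_nm=20.0),
]
results = {pool.name: passes_validation(pool) for pool in pools}
for name, ok in results.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```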