Microbial metagenomes and metatranscriptomes during a coastal phytoplankton bloom

Metagenomic and metatranscriptomic time-series data covering a 52-day period in the fall of 2016 provide an inventory of bacterial and archaeal community genes, transcripts, and taxonomy during an intense dinoflagellate bloom in Monterey Bay, CA, USA. The dataset comprises 84 metagenomes (0.8 terabases), 82 metatranscriptomes (1.1 terabases), and 88 16S rRNA amplicon libraries from samples collected on 41 dates. The dataset also includes 88 18S rRNA amplicon libraries, characterizing the taxonomy of the eukaryotic community during the bloom. Accompanying the sequence data are chemical and biological measurements associated with each sample. These datasets will facilitate studies of the structure and function of marine bacterial communities during episodic phytoplankton blooms.


Background & Summary
In pelagic marine ecosystems, a major proportion of primary production is transformed by heterotrophic microbes on the scale of hours to days [1][2][3]. Much of this rapidly processed primary production is made available in the form of dissolved organic carbon (DOC), released from phytoplankton by direct excretion or through trophic interactions. Bacterial uptake of DOC produces living biomass and regenerates inorganic nutrients [1].
Monterey Bay is a coastal ecosystem with high primary production driven by frequent upwelling of nutrient-rich waters 4,5. Intense phytoplankton blooms can develop 6, and these vary in taxonomic composition. In 2016, the fall phytoplankton bloom (Fig. 1) was dominated by an unusually intense bloom of the dinoflagellate Akashiwo sanguinea 7. A. sanguinea cell abundances reached 4.9 × 10⁶ cells L⁻¹, and chlorophyll a concentrations reached 57 µg L⁻¹ (at ~6 m depth) over the period spanning mid-September to mid-November. Here we present metagenomic, metatranscriptomic, and iTag data on the bacterial and archaeal communities during a 52-day period spanning this unusual plankton bloom in Monterey Bay (Table 1). iTag data on the eukaryotic microbial communities provide contextual information on community dynamics of the bloom-forming phytoplankton and grazer communities.
... and 7 m depth at approximately 10 a.m. PST. Samples were collected daily except from October 7 to November 1, when the ESP was offline for repair. ESP filters were preserved with RNAlater at the completion of sample collection and stored in the instrument until retrieval. While the ESP was offline, grab samples were collected by Niskin bottle at the M0 mooring site 2-3 times per week. Sampling time, sampling depth, and filters were the same as for the ESP samples, except that filters were flash frozen in liquid nitrogen.
Environmental data (temperature, salinity, chlorophyll a fluorescence, light transmission, and dissolved O₂ concentrations) were collected by a CTD instrument mounted with the ESP 9. Additional environmental data were obtained from grab samples collected at the M0 mooring 2-3 times per week [total dimethylsulfoniopropionate concentration (DMSPt), dissolved DMSP concentration (DMSPd), DMSPd consumption rate, chlorophyll a, and cell counts by flow cytometry and microscopy] 10,11 (Online-only Table 1).
DNA/RNA extraction. Total community nucleic acids for metagenome, metatranscriptome, and 16S iTag sequencing were obtained from the same 0.22 µm filter (0.22-5.0 μm size fraction) using the ZymoBIOMICS DNA/RNA Miniprep Kit (Zymo Research, Irvine CA). At the start of extraction, internal standards were added to the lysis buffer tube (see Usage Notes), and the filter was cut into small pieces under sterile conditions to facilitate extraction. RNA was treated according to the manufacturer's instructions with in-column DNase I treatment. After elution, RNA was treated with Turbo DNase (Invitrogen, Carlsbad CA) and concentrated using Zymo RNA Clean and Concentrator (Zymo Research). Except for a few cases of low nucleic acid yields, duplicate filters were sequenced for each sample date.
DNA for 18S rRNA gene sequencing was extracted from the 5.0 μm filters using the DNeasy Plant Mini Kit (Qiagen, Venlo NL) with modifications. Filters were cut into pieces and added into a prepared lysis tube containing ~200 µl of 1:1 mixed 0.1 and 0.5 mm zirconia/silica beads (Biospec Products, Bartlesville, OK) and 400 μl Buffer AP1. Internal standards (see Usage Notes) were added just prior to extraction. Three freeze-thaw cycles were performed using liquid nitrogen and a 65 °C water bath. Following freeze-thaw, bead beating was performed for 10 min, followed by centrifugation at 8,000 rpm for 10 min to remove foam. Following centrifugation, 45 μl of proteinase K (>600 mAU/ml, Qiagen) was added to each tube and incubated at 55 °C for 90 min with gentle rotation. Filters were then removed and the tubes incubated at 55 °C for 1 h. The DNeasy kit protocol was resumed at the RNase A addition step. Final DNA was eluted in 75 μl of diluted (1:10) TE buffer.
Metagenome sequencing and analysis. Sequence data were generated at the Department of Energy (DOE) Joint Genome Institute (JGI) using Illumina technology. Libraries were constructed and sequenced using the HiSeq-2000 1TB platform (2 × 151 bp). For assembly, reads were trimmed and screened, and those with no mate pair were removed using BFC (v r181) 12. Remaining reads were assembled using SPAdes (v 3.11.1) 13. The read set was mapped to the final assembly and coverage information generated using BBMap (v 37.78) 14 with default parameters. Assembled metagenomes were processed through the DOE JGI Metagenome Annotation Pipeline (MAP) and loaded into the Integrated Microbial Genomes and Microbiomes (IMG/M) platform 15,16.
Metatranscriptome sequencing and analysis. Sequence data were generated at the DOE JGI using Illumina technology. Libraries were constructed and sequenced using the HiSeq-2500 1TB platform (2 × 151 bp). Metatranscriptome reads were assembled using MEGAHIT (v 1.1.2) 17. Cleaned reads were mapped to the assembly using BBMap.
To generate an overview of microbial community composition during the bloom (Figs 2 and 3), the 16S and 18S rRNA amplicon libraries (raw reads) were primer-trimmed using Cutadapt (v 1.18) 21 and analyzed using QIIME2 (v 2018.6) 22 . The DADA2 23 plugin in QIIME2 was used to generate exact sequence variants (ESVs), which were classified using the QIIME2 naive Bayes classifier trained on 99% Operational Taxonomic Units (OTUs) from the SILVA rRNA database (v 132) 24 after trimming to the primer region. Taxonomic bar plots were generated using QIIME2.

Data Records
The raw Illumina sequencing reads for metagenomes, metatranscriptomes, and 16S rRNA and 18S rRNA iTags are available from the NCBI Sequence Read Archive under 342 separate project IDs (summarised in Online-only Table 2) which we have gathered under a single BioProject umbrella ID 25 .
Contigs assembled within each individual metagenome and metatranscriptome are available from the JGI Integrated Microbial Genomes portal (Online-only Table 2).
Chemical and biological data associated with each sample are available at the Biological and Chemical Oceanography Data Management Office (BCO-DMO) 9,10. Measured parameters include temperature, salinity, depth, light transmission, concentrations of dissolved oxygen and chlorophyll, concentration and consumption rates of DMSP, and cell counts for heterotrophic bacteria, Synechococcus, Akashiwo, and photosynthetic eukaryotes.

Technical Validation
For metagenomic and metatranscriptomic Illumina data, BBDuk (version 37.95; https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/) was used to remove contaminants, trim reads that contained adapter sequence, and trim reads where quality dropped to zero. BBDuk was used to remove reads that contained four or more 'N' bases, had an average quality score across the read <3, or had a minimum length ≤51 bp or 33% of the full read length. Reads mapped with BBMap to masked human, cat, dog and mouse references at >93% identity were separated into a chaff file. Reads aligned to common microbial contaminants were also separated into a chaff file. For metatranscriptomic data, reads containing ribosomal RNA and known JGI spike-in sequences were removed and placed into separate fastq files. The internal DNA and mRNA standards added for quantification purposes at the nucleic acid extraction step (see Usage Notes) were recovered at 0.5-5.0% of sequences as expected.
For 16S rRNA and 18S rRNA, BBDuk was used to remove contaminants and trim reads that contained adapter sequence. This program was also used to remove reads that contained one or more 'N' bases, had an average quality score across the read of <10, or had a minimum length ≤51 bp or 33% of the full read length. Reads mapped with BBMap to masked human, cat, dog and mouse references at >93% identity or aligned to common microbial contaminants were separated into a chaff file. The 16S and 18S rRNA reads amplified from the internal DNA standards added for quantification purposes (see Usage Notes) were recovered at their expected level (0.5-5.0% of sequences).
[Figure caption] Samples were collected at ~6 m, and 16S rRNA genes were amplified from community DNA in the 0.22 to 5.0 µm size range. Taxonomic groups were defined based on exact sequence variants using DADA2 in QIIME 2 (https://qiime2.org) and assigned taxonomy with the naive Bayes q2-feature-classifier trained using the 515F/806R region from 99% operational taxonomic units from the SILVA 132 16S rRNA database. Assignments of the 30 most abundant taxa are given at the family level.
Sequence datasets were checked for consistency with the expected composition of coastal marine microbial communities. Taxonomic assignments of 16S and 18S rRNA ESVs matched those of marine microbes common in coastal areas in general 26,27 and in Monterey Bay seawater in particular 11 (Figs 2 and 3). Taxonomic assignments of protein-encoding genes from metagenomic datasets were likewise representative of coastal and Monterey Bay microbial communities, and were consistent with the iTag datasets.
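The read-level discard rules described in this section (number of 'N' bases, average quality, and minimum length) can be illustrated with a short Python sketch. This is not BBDuk and does not reproduce its behavior: the class and function names are hypothetical, the average quality is assumed to be a simple arithmetic mean of Phred scores, and the length rule is interpreted here as discarding reads that are ≤51 bp or ≤33% of the full read length. Only the threshold values come from the text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReadFilter:
    """Discard thresholds (values from the text; field names are hypothetical)."""
    n_bases_to_discard: int            # discard if the read has this many 'N' bases or more
    min_mean_quality: float            # discard if the mean Phred score is below this
    min_length_bp: int = 51            # discard if the read length is at or below this
    min_length_fraction: float = 0.33  # ...or at or below this fraction of full length


# Shotgun (metagenome/metatranscriptome) and amplicon (16S/18S) thresholds.
SHOTGUN = ReadFilter(n_bases_to_discard=4, min_mean_quality=3)
AMPLICON = ReadFilter(n_bases_to_discard=1, min_mean_quality=10)


def keep_read(seq: str, quals: Sequence[int], full_length: int, rule: ReadFilter) -> bool:
    """Return True if a trimmed read passes all three discard rules."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality arrays must have equal length")
    if not seq:
        return False
    # Rule 1: too many ambiguous bases.
    if seq.upper().count("N") >= rule.n_bases_to_discard:
        return False
    # Rule 2: mean quality too low (assumed arithmetic mean of Phred scores).
    if sum(quals) / len(quals) < rule.min_mean_quality:
        return False
    # Rule 3: read too short, in absolute terms or relative to the full read length.
    if len(seq) <= rule.min_length_bp:
        return False
    if len(seq) <= rule.min_length_fraction * full_length:
        return False
    return True
```

Under these assumptions, a 120 bp read with one 'N' base and uniform quality 30 would be kept by the shotgun rule but discarded by the stricter amplicon rule.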

Usage Notes
Sample processing included the addition of internal standards to allow for calculation of volume-based absolute copy numbers for each gene or transcript type (i.e., counts L⁻¹ rather than % of sequence library) 28,29. The DNA standards consisted of genomic DNA from Thermus thermophilus DSM7039 HB8 29 and Blautia producta strain VPI 4299 (American Type Culture Collection, Manassas, VA). mRNA standards consisted of custom-designed 1006 nt artificial transcripts 29. Artificial transcript sequences are available at Addgene Plasmid Repository (https://www.addgene.org; products MTST5 and MTST6). All four standards (two DNA and two mRNA) were added to the 0.22 μm pore size samples at the initiation of nucleic acid extraction. In the case of 18S iTag samples, genomic DNA from Arabidopsis (BioChain Institute, Inc., Newark, CA) and Mus musculus (Millipore Sigma, Burlington MA) was similarly added to the 5.0 μm pore size samples at initiation of extraction. Added amounts of internal standards were estimated at ~1% of final yields of DNA or mRNA based on prior recoveries from similar filters. Actual yields averaged ~2% of reads. The internal standards should be removed from the raw data prior to analysis. Information on how internal standards can be used for volume-based quantification is available elsewhere 29,30.
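As a rough illustration of the volume-based quantification that the internal standards make possible, the following Python sketch converts read counts into copies per liter. It assumes the simplest form of the calculation: a single recovery ratio derived from one standard, with no correction for gene, genome, or transcript length. The function names, parameter names, and example numbers are hypothetical and are not taken from the published protocols 29,30, which should be consulted for the actual procedure.

```python
def copies_per_liter(
    feature_reads: float,
    standard_reads: float,
    standard_copies_added: float,
    volume_filtered_l: float,
) -> float:
    """Estimate absolute abundance (copies per liter) of a gene or transcript.

    Simplifying assumption: every read assigned to the internal standard
    represents standard_copies_added / standard_reads molecules in the sample,
    and the same ratio applies to the feature of interest.
    """
    if standard_reads <= 0:
        raise ValueError("no reads recovered from the internal standard")
    if volume_filtered_l <= 0:
        raise ValueError("filtered volume must be positive")
    copies_per_read = standard_copies_added / standard_reads
    return feature_reads * copies_per_read / volume_filtered_l


def standard_fraction(standard_reads: float, total_reads: float) -> float:
    """Fraction of the library made up of internal-standard reads."""
    if total_reads <= 0:
        raise ValueError("total read count must be positive")
    return standard_reads / total_reads


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    estimate = copies_per_liter(
        feature_reads=150,
        standard_reads=20_000,
        standard_copies_added=1.0e9,
        volume_filtered_l=1.0,
    )
    print(f"estimated copies per liter: {estimate:.3g}")
    print(f"standard fraction: {standard_fraction(20_000, 1_000_000):.1%}")
```

The second helper gives a quick check of whether the recovered standard fraction falls in the 0.5-5.0% range reported in the Technical Validation section.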
Environmental data collected in association with the nucleic acid samples are given in Online-only Table 1. Available data differ between sampling dates depending on whether sampling was done by the ESP, from Niskin grab samples, or both.