Metagenome sequencing and 768 microbial genomes from cold seep in South China Sea

Cold seep microbial communities are distinctive ecosystems that provide unique models for understanding survival strategies in distinct deep-sea environments. In this study, 23 metagenomes were generated from samples collected in the Site-F cold seep field in the South China Sea, including the sea water closely above the invertebrate communities, the cold seep fluids, the fluids under the invertebrate communities and the sediment column around the seep vent. Using binning tools, we retrieved a total of 768 metagenome-assembled genomes (MAGs) that were estimated to be >60% complete. Of the MAGs, 61 were estimated to be >90% complete, while an additional 105 were >80% complete. Phylogenomic analysis revealed 597 bacterial and 171 archaeal MAGs, nearly all of which were distantly related to known cultivated isolates. Among the 768 MAGs, the abundant bacterial phyla included Proteobacteria, Desulfobacterota, Bacteroidota, Patescibacteria and Chloroflexota, while the abundant archaeal phyla included Asgardarchaeota, Thermoplasmatota and Thermoproteota. These results provide a dataset available for further interrogation of deep-sea microbial ecology.

Measurement(s): metagenome assembled genomes
Technology Type(s): metagenome sequencing and genome binning
Sample Characteristic - Organism: microorganism
Sample Characteristic - Environment: marine cold seep biome
Sample Characteristic - Location: South China Sea


Background & Summary
Cold seeps are seafloor manifestations of methane-rich fluid migration from the sedimentary subsurface and support unique communities fuelled by chemosynthetic interactions 1. The microorganisms inhabiting cold seeps transform the chemical energy in methane into products that sustain rich benthic communities around the gas leaks 2. Next-generation sequencing methods have greatly improved insight into seep microbiomes and will advance microbial ecology from describing diversity and distribution patterns to understanding adaptive survival strategies in deep-sea environments.
The cold seep at Site F (also known as Formosa Ridge) is one of the active cold seeps on the north-eastern slope of the South China Sea (SCS) 3, where natural gas hydrate is exposed on the seafloor and covered by chemosynthetic communities mainly comprising deep-sea mussels and galatheid crabs 4. Its geochemical characteristics have been described by in situ detection using the Raman insertion Probe (RiP) system and integrated sensors 5–7. The horizontal and vertical variations in methane concentrations showed contrasting trends from the center of the flourishing communities to the margin of the sediments 6. No CH₄ or H₂S Raman peaks were detected in the cold seep fluids, while dissolved CH₄ was identified in the fluids under the lush chemosynthetic communities, and the sediment pore water profiles collected near the cold seep were characterized by the loss of SO₄²⁻ and increased CH₄, H₂S and HS⁻ peaks 5,7. As the microbial communities in deep-sea cold seeps are often shaped by the geochemical components of seepage solutions, we collected samples from the Site-F cold seep field in 2017, including the sea water closely above the invertebrate communities, the cold seep fluids, the fluids under the invertebrate communities and the sediment column around the seep vent (Fig. 1 and Table 1). The metagenomes were sequenced on the Illumina HiSeq X Ten platform, with each metagenome yielding between approximately 52.7 and 80.6 Gbp of clean bases (Table 2). We further obtained 768 metagenome-assembled genomes (MAGs) of environmental Bacteria and Archaea estimated to be >60% complete with <20% contamination (Supplementary Table 1). Of the MAGs, 61 were estimated to be >90% complete, while an additional 105 were >80% complete. There were 59 high-quality MAGs (completeness > 90% and contamination < 5%), accounting for 7.68% of the total.
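The quality thresholds quoted above can be illustrated with a minimal Python sketch. It is not the authors' code: the `Mag` record, the helper names and the four example bins are hypothetical, and only the cut-offs (>60% completeness and <20% contamination for retention, >90% and <5% for high quality) and the 59-of-768 count come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mag:
    """Hypothetical record holding quality estimates for one bin."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


def passes_retention(mag, min_comp=60.0, max_cont=20.0):
    """Retention rule from the text: >60% complete and <20% contaminated."""
    return mag.completeness > min_comp and mag.contamination < max_cont


def is_high_quality(mag):
    """High-quality rule from the text: >90% complete and <5% contaminated."""
    return mag.completeness > 90.0 and mag.contamination < 5.0


def percent_of_total(count, total):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Invented example bins, used only to exercise the two rules.
example_bins = [
    Mag("bin_a", 95.2, 1.3),   # retained and high quality
    Mag("bin_b", 91.0, 7.5),   # retained, but contamination is too high for high quality
    Mag("bin_c", 72.4, 12.0),  # retained only
    Mag("bin_d", 55.0, 2.0),   # discarded: completeness is too low
]

retained = [m for m in example_bins if passes_retention(m)]
high_quality = [m for m in retained if is_high_quality(m)]

print([m.name for m in retained])
print([m.name for m in high_quality])
print(percent_of_total(59, 768))  # share of high-quality MAGs reported in the text
```

The last line reproduces the reported share: 59 high-quality MAGs out of 768 is 7.68%.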
The recovered anaerobic methanotrophic archaea (ANME), aerobic methanotrophic bacteria Methylococcales, sulfate-reducing Desulfobacterales, and sulfide-oxidizing Campylobacterales and Thiotrichales (Supplementary Table 2) match well the microbial metabolisms most favoured at methane seeps in terms of substrate supply. Meanwhile, the phylogenomic analysis suggests that this set of draft genomes includes highly sought-after genomes that lack cultured representatives, such as the archaea Bathyarchaeota (30), Aenigmarchaeota (29), Heimdallarchaeota (20) and Pacearchaeota (10), and the bacteria Patescibacteria (44), WOR-3 (23), Zixibacteria (13), Marinisomatota (12) and Eisenbacteria (6), among others (Fig. 2). In addition, there are some potentially new phyla, including NPL-UPA2 (7), UBP15 (4), FCPU426 (2) and SM23-31 (2). All the non-redundant draft metagenome-assembled genomes described here were deposited at the National Center for Biotechnology Information (NCBI). These data will hopefully provide a resource for downstream analysis, acting as references for large-scale comparative genomics within globally vital phylogenetic groups and allowing the exploration of novel microbial metabolisms.

Methods
Sampling. Samples were retrieved from a cold seep field in the northern SCS by the research vessel KEXUE during a cruise in September 2017 (Fig. 1a and Table 1). The water closely above the invertebrate communities was collected with an in situ water sampling cylinder mounted on the FAXIAN Remotely Operated Vehicle (ROV) during dives 164 and 165 (sample ID: SW_1 and SW_2, respectively). The cold seep fluid was collected at the gas plumes during dive 166 (sample ID: SW_3), and the fluid under the invertebrate communities was collected during dive 167 (sample ID: SW_4). About 15 L of water from each sample was filtered through a 0.22 μm polycarbonate membrane (Millipore, Bedford, MA, USA). The membranes were stored at −80 °C and used for DNA extraction. A sediment core was collected by the ROV in the reductive sediment area near the invertebrate communities during dive 157. A thin outer layer (< 1 cm) of the push core was discarded to avoid contamination. The black reduced sediment core, 20 cm in length, was sliced into 2-cm layers with push-core equipment (sample ID: RS_1 ~ RS_10). Another sediment core was collected at the same site with a deep-sea light-weight monitorable and controllable long-coring system 8, and the layers at 0~300 cm below the seafloor (cmbsf) were collected from this core and sliced into 35-cm subsamples (sample ID: RS_11 ~ RS_19). All subsamples were stored at −80 °C until DNA extraction. Environmental data (CH₄, H₂S and SO₄²⁻) were detected in situ by a deep-sea laser Raman spectrometer mounted on the ROV, as previously reported 5,9.
DNA extraction. A schematic overview of the workflow in this study is shown in Fig. 1b. [...] (SRA) under the accession numbers SRR13892585~SRR13892607 (Table 2), and within the BioProject accession number PRJNA707313.
Genome binning. The initial de novo assembly was carried out using MEGAHIT v1.1.3 with default parameters 10. Short genomic assemblies (< 1,000 bp) that could have biased the subsequent analysis were first excluded. Genomes were then binned based on their tetranucleotide frequency, differential coverage, GC content and codon usage, using different binning tools, including MetaBAT 2, MaxBin 2.0 and CONCOCT, as implemented in the MetaWRAP v1.2.1 pipeline (default parameters) (Supplementary Table 1) 11–13. The binning results were refined using the MetaWRAP package (parameters: -c 60 -x 20) 14, and all the produced bin sets were aggregated and dereplicated at 95% average nucleotide identity (ANI) using dRep v2.3.2 (parameters: -comp 60 -con 20 -sa 0.9) 15. Taxonomic classification of each bin was determined by CheckM v1.0.3 and GTDB-Tk with default parameters (Supplementary Table 2) 16,17. The bin quality assessment (completeness > 60% and contamination < 20%) [...]
[...] Table 2) and reference genomes of Bacteria and Archaea available in RefSeq (Supplementary Table 3). The scale bar corresponds to 3.00 substitutions per amino acid position. The number of draft genomes in each node is provided. The branches with red dots have no cultured representatives.
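Two of the sequence-composition steps named in the binning description, excluding assemblies shorter than 1,000 bp and computing tetranucleotide frequencies and GC content, can be sketched in Python as follows. This is an illustrative simplification, not the MetaBAT 2, MaxBin 2.0, CONCOCT or MetaWRAP implementation: the function names and toy contigs are hypothetical, reverse-complement k-mers are not merged, and coverage and codon usage are not modelled.

```python
from collections import Counter
from itertools import product

MIN_CONTIG_LEN = 1000  # assemblies shorter than this are excluded, per the text


def filter_contigs(contigs, min_len=MIN_CONTIG_LEN):
    """Return only the contigs that are at least `min_len` bp long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_len}


def tetranucleotide_frequencies(seq):
    """Relative frequency of each of the 256 ACGT 4-mers in `seq`.

    Windows containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


def gc_content(seq):
    """Fraction of G and C bases among the A, C, G and T bases of `seq`."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Toy contigs, invented for illustration only.
toy_contigs = {
    "contig_long": "ACGT" * 300,   # 1,200 bp, kept
    "contig_short": "ACGT" * 100,  # 400 bp, excluded
}

kept = filter_contigs(toy_contigs)
profile = tetranucleotide_frequencies(kept["contig_long"])
print(sorted(kept))
print(round(gc_content(kept["contig_long"]), 2))
```

Contigs from the same genome tend to share similar composition profiles, which is why such vectors are useful as one binning signal alongside differential coverage.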

Data records
This project has been deposited at DDBJ/ENA/GenBank under the BioProject accession no. PRJNA707313, with the Sequence Read Archive records deposited under the accessions SRR13892585~SRR13892607. Other data are available through figshare 49, including the FASTA files containing the contigs of all 768 MAGs and the phylogenetic tree in Newick format.
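As a small consistency check, the accession range given above can be expanded into individual run identifiers. The sketch below is only an illustration: the helper name `expand_accession_range` is hypothetical, and it assumes the range is contiguous, as the "~" notation suggests.

```python
def expand_accession_range(first, last):
    """Expand an inclusive accession range that shares one alphabetic prefix."""
    prefix = first.rstrip("0123456789")
    if last.rstrip("0123456789") != prefix:
        raise ValueError("accessions do not share the same prefix")
    width = len(first) - len(prefix)  # keep any zero padding of the numeric part
    start = int(first[len(prefix):])
    end = int(last[len(prefix):])
    if end < start:
        raise ValueError("range end precedes range start")
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = expand_accession_range("SRR13892585", "SRR13892607")
print(len(runs))           # number of runs in the range
print(runs[0], runs[-1])   # first and last accession
```

The range contains 23 accessions, which agrees with the 23 metagenomes described in this study.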

Technical Validation
Potential contamination of samples was limited by following guidelines for analyses of microbial communities 50,51. Briefly, the samples were pre-treated in a sterile station in the laboratory of the research vessel KEXUE. DNA extractions took place within a dedicated laboratory space under a laminar flow hood using aseptic techniques (such as surface sterilisation, DNA-OFF, sterile plasticware and aerosol barrier pipette tips). Sample processing was completed within 2 days, using the same batch of PowerSoil DNA Isolation Kit for all sediment samples and of PowerWater DNA Isolation Kit for all water filter samples. The filtered and trimmed Illumina reads were evaluated for sequencing quality using fastp v0.20.1 (https://github.com/OpenGene/fastp) with default parameters 52. The Q score for the reads of each sample was calculated and showed that more than 90% of reads scored Q30 (Table 2), indicating that most reads had low error rates. Metagenome data were assembled and refined into MAGs using the automated quality control steps and assembly procedures described in the manuscript. To ensure the assembly quality of the contigs, several k-mer sizes (21, 29, 39, 59, 79, 99, 119, 141) were selected in the MEGAHIT assembly procedure. For binning, stricter standards were applied, and the binned sequences were reassembled to ensure the best result.
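The Q30 statistic mentioned above can be illustrated with a short Python sketch that computes the fraction of base calls with a Phred score of at least 30. This is not how fastp is implemented; it assumes Phred+33 encoded FASTQ quality strings (the usual encoding for current Illumina output), it counts bases rather than reads, and the function name and toy quality strings are hypothetical.

```python
def q30_fraction(quality_strings, offset=33, threshold=30):
    """Fraction of base calls whose Phred score is at least `threshold`.

    `offset` is the ASCII offset of the quality encoding (assumed Phred+33).
    Returns 0.0 when no bases are supplied.
    """
    total = 0
    passing = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Toy quality strings: 'I' encodes Q40, '?' encodes Q30 and '#' encodes Q2.
toy_qualities = ["IIII", "??##"]
print(q30_fraction(toy_qualities))
```

A Phred score of 30 corresponds to an estimated error probability of 1 in 1,000 per base call, which is why a high Q30 share is read as evidence of low sequencing error.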

Code availability
The above methods indicate the programs used for analysis within the relevant sections. The code used to analyse individual data packages is deposited at https://github.com/zhcosa/MAGs-from-cold-seep.