A comprehensive genomic catalog from global cold seeps

Cold seeps harbor abundant and diverse microbes that have tremendous potential for biological applications and a significant influence on biogeochemical cycles. Although recent metagenomic studies have expanded our understanding of the community and function of seep microorganisms, knowledge of the diversity and genetic repertoire of global seep microbes is lacking. Here, we compiled 165 metagenomic datasets from 16 cold seep sites across the globe to construct a comprehensive gene and genome catalog. The non-redundant gene catalog comprised 147 million genes, 36% of which could not be assigned a function using currently available databases. A total of 3,164 species-level representative metagenome-assembled genomes (MAGs) were obtained, most of which (94%) belonged to novel species. Among them, 81 ANME species covered all subclades except ANME-2d, and 23 syntrophic SRB species spanned the Seep-SRB1a, Seep-SRB1g, and Seep-SRB2 clades. The non-redundant gene and MAG catalog is a valuable resource that will help deepen our understanding of the functions of cold seep microbiomes.

of significant importance for both local and global biogeochemical cycles, underscoring the essential role of these microorganisms in deep-sea ecosystems. Although previous findings have revealed various lineages of ANME and SRB in seep sediments [13][14][15], there is currently no comprehensive genome catalog of these lineages in cold seep sediments globally. Extensive, high-quality reference genomes of the global seep microbiome could improve the resolution and accuracy of taxonomic and functional analyses and provide the opportunity for large-scale comparative genomics [16][17][18][19], especially for elucidating the physiological basis of ANME-SRB interactions.
Here, we collected metagenomic sequence data from 165 sediment samples at 16 cold seeps across the Pacific, Atlantic, and Arctic Oceans (Fig. 1), encompassing gas hydrates (n = 4), methane seeps (n = 14), oil and gas seeps (n = 4), mud volcanoes (n = 2) and asphalt volcanoes (n = 1). The sediment samples span different depths and redox conditions, from the oxic sediment-water interface to anoxic layers down to 68.55 m below the sea floor (mbsf) (Supplementary Table 1). The non-redundant gene catalog was constructed from these metagenomes and comprises a total of 147,289,169 protein clusters (Fig. 2). The mapping ratios of the non-redundant gene catalog to clean reads of the 165 metagenomes averaged 62%. This is the most comprehensive gene catalog generated from the cold seep sediment microbiome to date, corresponding to half the size of the global microbial gene catalog (GMGC v1; 303 million) [16], roughly the size of the global topsoil microbiome gene catalog (~160 million) [20], three times the size of the ocean microbial reference gene catalog (OM-RGC v2; ~47 million) [21], and six times the size of the Tibetan Glacier gene catalog (TG2G; ~25 million) [17].
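The size comparisons in the preceding paragraph follow from simple ratios. The sketch below recomputes them from the figures quoted in the text; the sizes of the other catalogs are the rounded values given there, and the dictionary keys and the function name `size_ratios` are illustrative choices, not identifiers from any published pipeline.

```python
# Catalog sizes (number of non-redundant genes) as quoted in the text.
# Values for the other catalogs are the rounded figures cited there.
CATALOG_SIZES = {
    "cold seep (this study)": 147_289_169,
    "GMGC v1": 303_000_000,
    "global topsoil": 160_000_000,
    "OM-RGC v2": 47_000_000,
    "TG2G": 25_000_000,
}


def size_ratios(reference, sizes):
    """Return the size of `reference` divided by the size of each other catalog."""
    ref_size = sizes[reference]
    return {
        name: ref_size / size
        for name, size in sizes.items()
        if name != reference
    }


ratios = size_ratios("cold seep (this study)", CATALOG_SIZES)
for name, ratio in ratios.items():
    print(f"cold seep catalog is {ratio:.2f}x the size of {name}")
```

The ratios come out close to 0.5, 0.9, 3 and 6, in line with the "half", "roughly the size of", "three times" and "six times" statements.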
A total of 3,164 species-level MAGs were recovered in this study. The total mapping ratios of all these MAGs to clean reads of the 165 metagenomes averaged 27%. These MAGs covered various prokaryotic lineages spanning 113 phyla (97 bacterial and 16 archaeal). The phyla with the largest diversity of recovered species included the bacterial phyla Chloroflexota (n = 371), Proteobacteria (n = 335), Desulfobacterota (n = 306), Planctomycetota (n = 190), Patescibacteria (n = 152) and Bacteroidota (n = 151) and the archaeal phyla Halobacteriota (n = 129), Thermoplasmatota (n = 108), Thermoproteota (n = 98), Asgardarchaeota (n = 95) and Nanoarchaeota (n = 47) (Fig. 3b). Overall, ~94% of the recovered species are not represented in current databases (Fig. 3c), suggesting that cold seep sediments harbor a rich diversity of previously undescribed microbes. The non-redundant MAG catalog considerably expands the known phylogenetic diversity of the cold seep microbiome and is an unparalleled genome resource for it. The compendium of ANME (Fig. 4) and syntrophic SRB MAGs (Fig. 5) expands the currently known diversity of these groups in cold seeps and will help clarify the physiological basis of their interactions and their evolutionary histories.

Functional annotation and taxonomic classification of the non-redundant gene catalog.
The representative amino acid sequences from each cluster were functionally annotated using eggNOG-mapper (v2.1.9; default parameters) [43,44]. The functional annotations, including those for eggNOG 5.0, Pfam 33.1, KEGG, EC, GO, and CAZy, were derived from the eggNOG-mapper results. We found that 64% of the non-redundant genes had a hit in at least one of the following databases: eggNOG (n = 88,929,242; ~60%), Pfam (n = 85,404,569; ~58%), KEGG (n = 48,756,524; ~33%), EC (n = 27,619,712; ~19%), GO (n = 5,966,227; ~4%) and CAZy (n = 1,514,988; ~1%) (Fig. 2a,b). Among the genes annotated with the eggNOG database (Fig. 2c), the predominant category was "Function unknown" (n = 17,018,774). This category includes proteins that have not yet been characterized or for which there is insufficient information to assign a specific function. Approximately 40% of genes (n = 58,359,927; Fig. 2b) could not be assigned to an eggNOG orthologous group, similar to the percentage observed in the OM-RGC v2 (~39%) [21] and higher than that in the GMGC v1 (~27%) [16]. According to the eggNOG database annotation, half of the genes (~51%), comprising 58,359,927 unannotated genes and 17,018,774 genes labeled as "Function unknown", were functionally unidentified, suggesting that cold seeps harbor numerous unknown functional genes.
MMseqs2 taxonomy (v13.45111; parameter: --tax-lineage 1) [45] was used to assign taxonomic labels to each representative amino acid sequence, using the GTDB R207 as a reference database [46]. The MMseqs2 taxonomy uses an approximate 2bLCA (lowest common ancestor, LCA) approach (--lca-mode: 2bLCA). A notable percentage of the non-redundant sequences (n = 44,441,531; ~30%) could not be classified as belonging to any prokaryotes in the GTDB, suggesting that these sequences may be attributed to novel prokaryotes (Fig. 2d). Approximately 9% (n = 13,154,825) of the non-redundant sequences could be identified only as either bacteria or archaea and could not be further classified at the phylum level (Fig. 2d). The results of taxonomic classification further confirm that this gene catalog contains many untapped genetic resources.
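The percentages reported for the functional and taxonomic annotation can be rechecked from the gene counts given above. The following is a minimal sketch that uses only counts quoted in the text; the variable and function names are illustrative.

```python
# Total number of non-redundant genes (protein clusters) in the catalog.
TOTAL_GENES = 147_289_169

# Genes with a hit in each functional database, as reported in the text.
DB_HITS = {
    "eggNOG": 88_929_242,
    "Pfam": 85_404_569,
    "KEGG": 48_756_524,
    "EC": 27_619_712,
    "GO": 5_966_227,
    "CAZy": 1_514_988,
}

FUNCTION_UNKNOWN = 17_018_774      # eggNOG category "Function unknown"
UNCLASSIFIED_IN_GTDB = 44_441_531  # no prokaryotic assignment in GTDB R207
DOMAIN_ONLY = 13_154_825           # assigned to a domain but not to a phylum


def percent(count, total=TOTAL_GENES):
    """Express a gene count as a percentage of the catalog."""
    return 100.0 * count / total


# Genes without an eggNOG orthologous group are the complement of the eggNOG hits.
no_eggnog = TOTAL_GENES - DB_HITS["eggNOG"]

# "Functionally unidentified" = no eggNOG group + "Function unknown".
unidentified = no_eggnog + FUNCTION_UNKNOWN

for name, hits in DB_HITS.items():
    print(f"{name}: {percent(hits):.1f}%")
print(f"no eggNOG group: {percent(no_eggnog):.1f}%")
print(f"functionally unidentified: {percent(unidentified):.1f}%")
print(f"unclassified in GTDB: {percent(UNCLASSIFIED_IN_GTDB):.1f}%")
print(f"domain-level only: {percent(DOMAIN_ONLY):.1f}%")
```

The complement of the eggNOG hits reproduces the 58,359,927 unannotated genes, and adding the "Function unknown" genes gives the ~51% figure.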
Fig. 4 Phylogenetic tree of ANME genomes and related archaea. The phylogenetic tree was constructed from 41 previously published ANME genomes and 135 MAGs belonging to Halobacteriota from this study. The tree was constructed by the maximum likelihood method using a concatenated alignment of 53 conserved archaeal single-copy marker genes.
Metagenomic binning and non-redundant MAG catalog construction.
Assembled contigs were filtered by length (>1000 bp) for subsequent binning. BWA software (v0.7.17; BWA-MEM algorithm) [47] was used to align short reads back to the filtered contigs, and the alignments were sorted with SAMtools (v1.9) [48]. The contig depth profiles were produced using jgi_summarize_bam_contig_depths for running metabat2, maxbin2, SemiBin, Rosella and VAMB, while for running concoct, concoct_coverage_table.py was used. The binning process was performed using the metaWRAP binning module (v1.3.2; parameters: -metabat2, -maxbin2, -concoct, -universal) [35], SemiBin with single_easy_bin mode (v1.4.0; default parameters) [49], and Rosella (v0.4.1; default parameters; https://github.com/rhysnewell/rosella). The number of metagenomic samples collected from S11 (n = 13) and RS (n = 19) was larger than that obtained from other sites, making it computationally challenging to bin the co-assemblies of the samples from these sites. Thus, individual assemblies from the S11 and RS sites were concatenated and binned separately using the VAMB tool in "bin-split" mode (v3.0.2; parameters: --minfasta 200000 -o C) [50]. Afterward, the bins obtained with each binning tool were integrated and refined using the Bin_refinement module of the metaWRAP pipeline (v1.3.2; parameters: -c 50 -x 10) [35]. The completeness and contamination of refined bins were evaluated with CheckM (v1.2.1) [51]. Then, the resulting 8,654 MAGs were checked by GUNC (v1.0.5; default parameters) [52] to remove genomes potentially containing chimerism based on "pass.GUNC". All MAGs were dereplicated at the species level using dRep (v3.4.0; parameters: -comp 50 -con 10) [53] with an average nucleotide identity (ANI) cutoff value of 95%. Representative genomes were selected based on the dRep scores derived from genome completeness, contamination and N50. A total of 3,164 MAGs with the highest dRep score from each species cluster were selected as the species representatives. MAGpurify software (v2.1.1; default parameters) [54] was used to identify and remove putative contaminant contigs from each MAG based on the clade-markers, tetra-freq, gc-content, and clean-bin modules. Importantly, the resulting representative genomes should be considered population genomes within species [55].
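The choice of one representative per species cluster can be illustrated with a minimal sketch. The scoring function here is a simplified stand-in that rewards completeness and N50 and penalizes contamination; its weights, the `Mag` record and the cluster labels are assumptions made for illustration and should not be read as dRep's actual implementation or output.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Mag:
    name: str
    cluster: str          # hypothetical species-cluster label (95% ANI)
    completeness: float   # percent, e.g. from CheckM
    contamination: float  # percent, e.g. from CheckM
    n50: int              # contig N50 in bp


def score(mag):
    """Simplified quality score with assumed weights (not dRep's formula)."""
    return mag.completeness - 5.0 * mag.contamination + 0.5 * math.log10(mag.n50)


def pick_representatives(mags):
    """Keep the highest-scoring MAG of each species cluster."""
    best = {}
    for mag in mags:
        current = best.get(mag.cluster)
        if current is None or score(mag) > score(current):
            best[mag.cluster] = mag
    return best


# Toy input: two MAGs in one cluster and a singleton cluster.
mags = [
    Mag("bin_a", "sp1", completeness=92.0, contamination=3.0, n50=40_000),
    Mag("bin_b", "sp1", completeness=95.0, contamination=8.0, n50=60_000),
    Mag("bin_c", "sp2", completeness=60.0, contamination=1.0, n50=10_000),
]
representatives = pick_representatives(mags)
for cluster, mag in sorted(representatives.items()):
    print(cluster, mag.name, round(score(mag), 2))
```

In this toy input, `bin_a` is chosen over `bin_b` because the contamination penalty outweighs the small gain in completeness and N50.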
MAGs were taxonomically classified using the GTDB-Tk toolkit (v2.1.1) [56,57] with default parameters against the R207 database. According to the taxonomic classification, four species clusters, with medium- or high-quality representatives (CSMAG_1499, CSMAG_2247, CSMAG_2329, and CSMAG_3128), were not assigned to any existing phylum. They did not cluster together and were included in different clades, exhibiting low relative evolutionary divergence values ranging from 0.32 to 0.43. These results suggest that these species belong to undescribed phyla. Additionally, 44 classes, 184 orders, 412 families, 1,043 genera and 2,984 species lacked classification assignments based on the GTDB R207 (Fig. 3c), representing potential novel lineages.

Data Records
Details for the non-redundant gene catalog, the functional annotation and taxonomic classification for gene clusters, non-redundant MAGs, and phylogenetic trees are available in the Figshare repository [71]. All non-redundant MAGs are deposited in the NCBI database under BioProject PRJNA950938 (ref. 72) with the accession numbers detailed in Supplementary Table 2.

Technical Validation
To maximize the number of genes and ensure the quality of the genes, we selected assembled contigs with a length greater than 500 bp to predict CDSs, as suggested in previous studies [17,73,74]. Then, we selected assembled contigs by length (>1000 bp) for metagenomic binning. The quality of MAGs was strictly controlled according to the following standards: (1) completeness >50% and contamination <10%; (2) genome sequences without potential chimerism (details in Supplementary Table 2); and (3) genome sequences without potential misassigned contigs.
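The first two quality standards can be expressed as a simple filter. The sketch below is illustrative only: the `BinQuality` record, its field names and the example values are hypothetical, and the third standard (removal of misassigned contigs) is a contig-level cleanup step rather than a genome-level filter, so it is not modeled here.

```python
from dataclasses import dataclass

MIN_COMPLETENESS = 50.0   # percent; standard (1)
MAX_CONTAMINATION = 10.0  # percent; standard (1)


@dataclass(frozen=True)
class BinQuality:
    name: str
    completeness: float   # percent
    contamination: float  # percent
    passes_gunc: bool     # True if the chimerism check reports a pass; standard (2)


def passes_quality(b):
    """Apply the completeness, contamination and chimerism standards."""
    return (
        b.completeness > MIN_COMPLETENESS
        and b.contamination < MAX_CONTAMINATION
        and b.passes_gunc
    )


# Hypothetical example bins.
bins = [
    BinQuality("bin_1", 88.5, 2.1, True),   # kept
    BinQuality("bin_2", 45.0, 1.0, True),   # too incomplete
    BinQuality("bin_3", 97.0, 12.4, True),  # too contaminated
    BinQuality("bin_4", 75.0, 4.0, False),  # flagged as potentially chimeric
]
kept = [b.name for b in bins if passes_quality(b)]
print(kept)
```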

Usage Notes
The dataset compiled and analyzed in this study is the largest of its kind from cold seep sediment environments. Researchers could use the gene catalog of seeps to compare genes of interest to those in other habitats, such as glaciers, polar regions and hydrothermal vents, to study the habitat specificity of genes. The compendium of ANME MAGs could be used to investigate the distributional pattern of ANME archaeal communities in global cold seeps and their ecological niche partitioning. Furthermore, the evolutionary and physiological basis of ANME-SRB interactions could also be explored.

Fig. 1 Overview of the studied areas and bioinformatics workflow. (a) Geographic distribution of the 16 global cold seep sites where metagenomic sequencing data were collected. The map was drawn using the maptools and ggplot2 packages in R v4.0.3. (b) Numbers and proportions of cold seep samples classified according to their types and depths. (c) Overview of the computational pipeline used to generate the non-redundant gene and MAG catalogs.

Fig. 2 Functional and taxonomic characterization of the non-redundant gene catalog. (a) An overview of annotations for the non-redundant gene catalog. Non-annotation indicates that these genes were not annotated in any of the following databases: eggNOG, Pfam, KEGG, EC, GO and CAZy. (b) Number of genes with functional annotations across the six functional databases. Vertical bars represent the number of genes unique (color) to each functional database or shared (black) between different functional databases. Horizontal bars in the lower panel indicate the total number of genes with functional annotations in each database. (c) Functional annotations at the COG category level. S: Function unknown. (d) Breakdown of taxonomic classifications for the non-redundant gene catalog.

Fig. 3 Quality and novelty of non-redundant MAGs. (a) Genome statistics for the representative species of non-redundant MAGs. (b) Taxonomic classification (domain and phylum levels) of the species-level representative MAGs. (c) Taxonomic novelty of the representative species.
[Fig. 4 tree tip labels not recoverable from extraction; legible fragments include ANME-2d (Ca. Methanoperedens) reference genomes and Methanosarcinales and Syntropharchaeales MAGs from this study.]
Fig. 5 [start of caption missing] ... (including syntrophic SRB, namely, HotSeep-1, Seep-SRB2, Seep-SRB1a and Seep-SRB1g, and non-syntrophic SRB) and 327 MAGs assigned to Desulfobacterota from this study. The concatenated multiple ...
[Fig. 5 tree tip labels not recoverable from extraction; legible fragments include Seep-SRB1a, Seep-SRB1g, Seep-SRB2 and Eth-SRB1 reference genomes and the lineages Desulfobacteria, Deferrisomatia, Binatia, Zymogenia, Syntrophia, Syntrophobacteria and Desulfobulbia.]