Main

Bacteria and archaea account for an estimated 10^29 cells in the oceans and are essential components that underpin global biogeochemical fluxes and ecological processes5. They are characterized by broad taxonomic and metabolic diversity and can undergo rapid evolutionary adaptations in response to environmental changes. Recent advancements in sequencing technologies have lifted the barrier imposed by uncultivability, and have thus enabled genome-resolved metagenomics to shed light on marine biodiversity. In particular, landmark projects such as Global Ocean Sampling1 (GOS) and the Tara Oceans Expedition2 have significantly expanded our understanding of the oceanic microbial inventory on a planetary scale.

Despite these global sequencing efforts, only a few studies have applied a comprehensive approach to assess the functional diversity of the global marine microbiome4,6. A similar approach in terms of scale was performed by Nayfach et al.3 (2020), albeit with a focus on terrestrial and host-associated microbiomes and without experimental validation of the predicted biotechnological potential3. In relation to biotechnological potential, preliminary experimental data exist for genes involved in phospeptin and pythonamide biosynthetic pathways in ocean microbiomes4. Thus, robust experimental evidence is required to assess the usefulness and therefore value of these global microbiome datasets for their future exploitation to advance biotechnological and biomedical applications. To aid this approach, microorganisms from various marine ecosystems, including the difficult-to-assess polar oceans and the deep sea, must be included to explore the vast microbial diversity. Consequently, a two-step approach was chosen to address the current gaps: (1) we generated a comprehensive and unified catalogue based on genome-resolved metagenomics covering all major marine ecosystems including polar oceans and the deep sea; and (2) we applied deep learning-based bioinformatics in combination with experimental approaches to provide robust evidence that ocean microbiomes are a valuable resource for marine bioprospecting.

To realize our approach and thereby provide a step change in ocean-based bioprospecting, we have analysed publicly available marine metagenomes from National Center for Biotechnology Information (NCBI), European Bioinformatics Institute (EBI) and Joint Genome Institute (JGI) from the period of August 2009 to July 2020. We generated 43,191 metagenome-assembled genomes (MAGs) across 3,470 microbial genera and 138 phyla. Combining these MAGs with public marine bacterial and archaeal genomes from NCBI, Ocean Microbiomics Database (OMD) and OceanDNA4,6, we constructed a unified global ocean microbiome genome catalogue (GOMC). The GOMC markedly expands the known marine microbial diversity with numerous novel MAGs across various taxonomic ranks. By profiling the abundance of bacterial and archaeal MAGs, we identified biogeographic patterns of microbiomes on a global scale. Through comprehensive statistical analyses driven by a vast array of genomes, our study unveils microbial adaptive traits encoded in their genomes, such as genome size and preference for CRISPR–Cas or antibiotic resistance gene (ARG) defence systems. We also identified a novel CRISPR–Cas9 system, several antimicrobial peptides (AMPs) and highly active halophilic PETases that degrade plastics and demonstrated their respective activities in the laboratory. Thus, our unified catalogue represents a valuable resource for future studies, not only in terms of advancing our understanding of global microbial diversity, but also for how this diversity can be sustainably exploited for mitigating environmental pollution and for benefitting mankind through advancing biotechnological and biomedical applications.

Expansion of the global ocean microbiome

We collected 237.02 Tb of sequence data from 24,395 publicly available marine metagenomes, covering a broad range of marine environments, from pole to pole (latitude ranging from 77.90 °S to 89.99 °N) and from the surface ocean to hadal trenches (Extended Data Fig. 1). From these metagenomes, we reconstructed a collection of 43,191 medium- to high-quality MAGs with an average completeness of 82.33% and an average potential contamination of 1.79% (Fig. 1a,b). A total of 26, 79, 304, 1,185 and 5,783 MAGs could not be assigned to known taxa in the Genome Taxonomy Database (GTDB) at the phylum, class, order, family and genus level, respectively. At the species level, a large proportion of bacterial and archaeal MAGs (43.37% and 43.89%, respectively), accounting for 20,295 MAGs, could not be assigned to any known taxon. To provide an exhaustive marine microbial genome catalogue, we further integrated marine microbial genomes from three additional sources, namely the OMD4, OceanDNA6 and 8,050 public genomes from NCBI, resulting in a non-redundant catalogue comprising 24,195 genomes, which constitutes the GOMC (Extended Data Fig. 1 and Supplementary Table 1). A total of 9,937 MAGs, accounting for 41.07% of the GOMC, were newly recovered in the current study, most of which (82.06%) represent potential novel species that were not available in previous databases (Fig. 1c and Supplementary Table 1). These specific MAGs were recovered mainly from the bathypelagic zone (3,713 MAGs), sediment (1,371 MAGs) and host-associated (1,250 MAGs) ecosystems (Fig. 1d). Our newly recovered MAGs significantly increased the known diversity of marine microbiomes, constituting 65% of the genomes for the Thermoproteota and Halobacteriota phyla (Fig. 1d and Extended Data Fig. 2a), and accounting for more than 85% of Campylobacterota and Desulfobacterota genomes (Fig. 1d and Extended Data Fig. 2b).
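The proportions quoted above can be re-derived from the reported counts. The following is a minimal arithmetic sketch in Python, not part of the study's pipeline; the variable and function names are illustrative assumptions.

```python
# Counts quoted in the text; the variable names are illustrative only.
GOMC_GENOMES = 24_195       # non-redundant genomes in the GOMC
NEW_IN_GOMC = 9_937         # GOMC genomes newly recovered in this study
TOP_SOURCES = {             # ecosystems that contributed the most new MAGs
    "bathypelagic": 3_713,
    "sediment": 1_371,
    "host-associated": 1_250,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


new_share = percent(NEW_IN_GOMC, GOMC_GENOMES)   # share of the GOMC that is new
top_total = sum(TOP_SOURCES.values())            # new MAGs from the three ecosystems
top_share = percent(top_total, NEW_IN_GOMC)      # their share of the new MAGs

print(f"new genomes: {new_share}% of the GOMC")
print(f"top three ecosystems: {top_total} MAGs ({top_share}% of new MAGs)")
```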

Fig. 1: Geographic and ecosystem distribution of MAGs.

a, Geographic distribution of 43,191 newly recovered MAGs. BATH, bathypelagic; DCM, deep chlorophyll maximum layer; MES, mesopelagic; SRF, surface water. b, The collection of 43,191 MAGs with medium or higher quality that form the basis of this study. The central dot plot displays the distribution of completeness and contamination for all MAGs recovered in this study. The top bar plot indicates the percentage of MAGs within specific completeness ranges, while the right bar plot shows the percentage within specific contamination ranges. The grey bar plot embedded in the centre illustrates the number of taxonomically unclassified MAGs across taxonomic ranks. c, A Venn diagram showing the specific or shared species-level genomes among the newly assembled genomes, NCBI, OMD and OceanDNA. d, Contribution of the current study and extant published databases to each bacterial and archaeal phylum. The inset table presents the original ecosystems of the 9,937 specific MAGs in this study.

In addition to genome cataloguing, we further explored the biogeographic implications of our database (Extended Data Fig. 3a and Supplementary Note 1). Previous studies have investigated the marine microbial communities, particularly in the context of ocean microbiome dynamics, mostly by amplicon sequencing7,8,9, with a few exceptions utilizing metagenomes10. Here we introduce the implementation of uniform manifold approximation and projection (UMAP) to unveil biogeographic patterns within marine microbiomes11 (Supplementary Note 1). Our analyses identified 56 distinct metagenomic provinces (MPs) (ANOSIM test, R = 0.61, P < 0.01) (Extended Data Fig. 3b). Globally, MPs were not confined to geographically clustered sampling sites but exhibited large-scale biogeographical partitioning (Supplementary Note 1). The absence of strict geographical constraints on the distribution of MPs raises questions about the role of ocean connectivity in shaping microbial biogeography12. It is plausible that water masses facilitate the dispersal of microbial communities across large distances, contributing to the observed global-scale patterns10. MPs were primarily restricted to specific ocean depths with few exceptions across adjacent depth boundaries, and thus exhibited a clear depth profile (Extended Data Fig. 3c). This depth-related segregation suggests the existence of strong environmental filtering. Besides the role of MPs in delineating ecological patterns, they represent a framework for identifying genomic properties against distinguishing features of MPs on a broader geographic scale, as exemplified in the subsequent analysis focusing on defence systems (Extended Data Fig. 3d).
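For readers unfamiliar with the ANOSIM statistic reported above, the sketch below shows how R is defined: the difference between the mean rank of between-group and within-group dissimilarities, scaled by M/2, where M is the number of sample pairs. It is a plain-Python illustration on toy data, not the authors' implementation; the function names, the toy distance matrix and the permutation count are assumptions.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dist, labels):
    """ANOSIM R for a square dissimilarity matrix and one group label per sample."""
    pairs = list(combinations(range(len(labels)), 2))
    ranks = average_ranks([dist[a][b] for a, b in pairs])
    within = [r for r, (a, b) in zip(ranks, pairs) if labels[a] == labels[b]]
    between = [r for r, (a, b) in zip(ranks, pairs) if labels[a] != labels[b]]
    m = len(pairs)  # M = n(n - 1) / 2
    return (sum(between) / len(between) - sum(within) / len(within)) / (m / 2)


def anosim(dist, labels, permutations=999, seed=0):
    """Return (R, permutation P value) by shuffling the group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy example: six samples on a line forming two well-separated groups.
coords = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in coords] for a in coords]
toy_labels = ["A", "A", "A", "B", "B", "B"]
r_value, p_value = anosim(toy_dist, toy_labels)
print(r_value, p_value)
```

R approaches 1 when every between-group dissimilarity exceeds every within-group one and is close to 0 when grouping carries no information, so the reported R = 0.61 indicates substantial but not complete separation of the provinces.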

Implications of large marine bacterial genomes

Evolutionary theory predicts that high environmental variability selects for larger genomes with increased metabolic potential13. This has been documented in terrestrial and freshwater ecosystems but is less known for marine habitats14. In GOMC, we discovered 303 large genomes with estimated genome sizes of at least 8 Mb. Among them, three newly recovered MAGs from the Planctomycetota phylum with genome sizes ranging from 16.7 to 18.4 Mb extended the known upper limit of marine bacterial genome size (Fig. 2a, Supplementary Note 2 and Supplementary Table 1). These genomes were recovered from two samples from the Cariaco Basin, an anoxic marine basin situated on the northern continental shelf of Venezuela in the Caribbean Sea15. Their closest relative, Pirellulaceae bacterium, with a genome size of 11.7 Mb, was discovered in the upper layer of the anoxic pelagic system of the Black Sea16 (Supplementary Note 2). Although the two environments differ in several physicochemical properties, they are both characterized by a fluctuating supply of nutrients and significant redoxclines. This suggests that the larger environmental variability in these ecosystems might impose selection pressure that benefits bacteria with large genomes17. To further investigate the relationship between genome features and size, we assessed the variations in overall genome characteristics (Extended Data Fig. 4a). Although smaller genomes tend to employ higher coding density, no consistent trend was observed between coding density and genome size across major bacterial phyla. However, we identified an increase in gene length and intergenic length with genome size. Similarly, larger genomes tend to have a higher GC content with a maximum of around 75%, which might be attributed to a combination of intrinsic mutation bias and possibly also environmental factors18,19.
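The genome characteristics compared above (GC content, coding density and gene length) have simple operational definitions. The sketch below illustrates them on toy input; it is an assumption about how such metrics are typically computed, not the study's code, and the function names and the 1-based inclusive coordinate convention are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def coding_density(genome_length: int, genes) -> float:
    """Fraction of the genome covered by genes.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping genes are merged so that no base is counted twice.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(genes):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / genome_length


def mean_gene_length(genes) -> float:
    """Average gene length in bases for (start, end) pairs."""
    lengths = [end - start + 1 for start, end in genes]
    return sum(lengths) / len(lengths)


# Toy genome of 100 bases with three genes, two of which overlap.
toy_genes = [(1, 30), (21, 50), (61, 70)]
print(gc_content("GGCCAATT"))
print(coding_density(100, toy_genes))
print(mean_gene_length(toy_genes))
```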

Fig. 2: Genome size and functional domain variation in Planctomycetota genomes.

a, Phylogenetic tree of Pirellulaceae in Planctomycetota. Outer bars indicate the genome size. b, Heat map illustrating the distribution of the top 33 functional domains across genomes in the Planctomycetota phylum. Each row corresponds to a distinct Pfam domain and each column represents an individual genome. Genomes are arranged in ascending order on the basis of their size, as shown in the bar plot (top). The colour gradient from blue to red signifies the number of proteins associated with the respective functional domain within each genome. Warmer colours indicate a larger number of proteins, providing a visual representation of the Pfam domain composition across the analysed genomes. Right, the ordering of Pfam domains from top to bottom is determined by their R^2 values obtained from the phylogenetic regression analysis within the specific phylum.

Additionally, we examined the trend between functional gene content and genome size, guided by the hypothesis that larger genomes preferentially accumulate genes involved in genome stability, cell cycle progression, signal transduction and gene regulation. Utilizing phylogenetic regression analyses, the reconstruction of ancestral proteomes, and exploring the associations between genome size and gene copies, we identified 77 Pfam domains that potentially underpin the expansion of genome size (Supplementary Note 2 and Extended Data Fig. 4b). Most of these domains exhibited a significant positive correlation with genome size across a wide taxonomic range (Fig. 2b and Extended Data Fig. 5a,b), and demonstrated a broad spectrum of functional roles, such as nutrient acquisition, responsiveness to environmental stimuli and interactions with other organisms. For instance, the methyltransferase domains (PF08241 and PF13649) appear to be a significant predictor of genome size. Studies have demonstrated roles for bacterial DNA methylation in gene regulation, genome stability and defence mechanisms20,21. Thus, bacteria with larger genomes may encode a greater diversity of genes and regulatory elements, contributing to increased complexity in DNA methylation patterns20. These organisms may also invest in an elaborate defence system, utilizing DNA methylation to protect against phage infection and foreign DNA21. The von Willebrand factor type A domain (PF13519), which serves as a key structural motif influencing bacterial adhesion, biofilm formation and cellular interactions22, emerges as another notable indicator. Its modular nature facilitates crucial protein–protein interactions, essential for bacterial adhesion in diverse environmental contexts, and contributes to ligand recognition, affecting bacterial colonization and community dynamics23. Notably, we observed a significant positive correlation between genome size and WD40 motif-containing proteins (PF00400). 
WD40 is an ancient protein domain family that was originally identified in eukaryotes but was subsequently also found in bacteria, especially in those with increased phenotypic complexity24,25. Proteins featuring the WD40 motif often function as scaffolds for protein–protein interactions, with a potential role in the formation of the distinctive intracytoplasmic membrane in Planctomycetes, and thereby promoting eukaryote-like intracellular compartmentalization26,27.

Trade-offs between CRISPR–Cas and ARG systems

The CRISPR–Cas system, which serves as a microbial ‘immune’ system, is crucial for preventing invasion by foreign nucleic acids. Since its discovery, the distribution of CRISPR–Cas systems across microbial phylogeny and diverse ecosystems has attracted substantial interest28. In GOMC, we identified 5,127 Cas operons of 40 types from 3,212 MAGs (around 15%), among which 1,708 also contained complete CRISPR arrays (Fig. 3a and Supplementary Table 1). Notably, Firmicutes_B possessed the highest fraction of MAGs encoding Cas operons, whereas other taxa, such as the Margulisbacteria and Rhizobiales, rarely encode Cas proteins (Fig. 3a). These results align well with previous studies with respect to the overall presence of Cas operons and their uneven distribution across phylogenies29. To investigate the potential factors driving this taxonomic bias in the presence of Cas operons, we interrogated the influence of temperature, a previously reported variable influencing the abundance of CRISPR–Cas systems in bacteria and archaea30. We predicted the optimal growth temperature for all the genomes in the GOMC, and found that microorganisms that encode Cas proteins exhibited a significantly higher average optimal growth temperature compared with those without Cas operons across most phyla (Extended Data Fig. 6a). Consistently, the fraction of MAGs encoding CRISPR–Cas systems was significantly higher in thermophiles than in psychrophiles, and in hydrothermal vents than in open-ocean water samples (Fig. 3b,c and Extended Data Fig. 3d). In addition to temperature, host-associated ecosystems exhibited a higher frequency of CRISPR–Cas encoding MAGs compared to open oceans (Extended Data Fig. 3d). Anaerobic ecosystems also demonstrated relatively higher prevalence of Cas operons, including intestinal microbiomes, engineered wastewater anaerobic microbiomes and microbiomes from terrestrial deep subsurface ecosystems (Fig. 3c), which might be owing to host-associated conditions and/or low oxygen concentration, as previously reported28.

Fig. 3: The distribution of defence systems.

a, Bar plot indicating the frequency of Cas operons in different lineages of all GOMC genomes. Only lineages with more than 50 genomes are presented, and the blue line represents the genome number of each lineage. b, Bar plots displaying the incidence rate of Cas operon in GOMC genomes with different optimal growth temperatures. The grey line displays the incidence rate of CRISPR array. c, Bar plots displaying the incidence rate of Cas operons in genomes from different ecosystems. d, Line plots showing the fractions of genomes encoding ARG with or without the presence of Cas operons. Boxes represent the difference of these two ratios with a blue box indicating the fraction by which the absence of Cas operons increased the frequency of ARG, and a red box indicating the fraction by which the absence of Cas operons decreased the frequency of ARG. e, The trend indicates a decrease in the upper limit number of ARGs with increased number of Cas operons.

Despite the canonical role of microbial CRISPR–Cas systems in thwarting foreign DNA invasion, their potential effect on the acquisition of adaptive traits, such as antibiotic resistance capacity31, remains an intriguing area of inquiry. We examined the frequency of ARGs in genomes that either encoded Cas operons or lacked them across various lineages. We observed a significantly lower frequency of ARGs in genomes that simultaneously encoded Cas compared with genomes without Cas in several microbial phyla. These phyla inhabit environments that favour the selection of CRISPR–Cas defence systems, including Thermoplasmatota and Halobacteriota from hydrothermal vents32, as well as Patescibacteria, WOR-3, Gemmatimonadota, Marinisomatota and Firmicutes, which are commonly observed in anaerobic or host-associated environments30 (Fig. 3d and Extended Data Fig. 6b). However, the presence of Cas did not decrease the fraction of MAGs encoding ARGs in the remaining phyla (Fig. 3d). Further investigation considering not only the presence or absence but also the number of Cas operons in each genome revealed that their number appears to restrict the upper limit on the number of ARGs that the genome can potentially encode (nested ANOVA test, P < 0.001). Consequently, as the number of Cas operons increases, the number of ARGs and mobile genetic elements decreases (Fig. 3e and Extended Data Fig. 6c). Notably, previous studies have reported divergent trends, with some identifying an inverse relationship between CRISPR–Cas and ARGs in selected pathogenic strains33,34, whereas genes associated with fosfomycin and rifampicin resistance were reported to be more common in Escherichia coli genomes with CRISPR–Cas system34. These observations align with our findings depicted in Fig. 3d, suggesting a lack of a consistent monotonic trend. Here, we noticed a significantly higher proportion of MAGs encoding ARGs in MPs from the open ocean. 
Conversely, higher fractions of MAGs encoding CRISPR–Cas immunity systems were observed in host-associated MPs, suggesting that the protection against foreign DNA might be of greater importance in these environments compared with open-ocean environments. MAGs encoding both immune systems were observed across most of the MPs associated with various marine ecosystems, although their frequencies were relatively low (Extended Data Fig. 3d).
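The per-lineage comparison summarized in Fig. 3d can be expressed as a difference of two fractions: the share of genomes encoding ARGs among those lacking Cas operons minus the share among those encoding them. The sketch below illustrates that bookkeeping on invented toy records; the function name, the record layout and the data are hypothetical, and no significance testing is included.

```python
from collections import defaultdict


def arg_fraction_shift(genomes):
    """Compare ARG frequency with and without Cas operons per phylum.

    genomes: iterable of (phylum, has_cas, has_arg) records.
    Returns {phylum: (frac_with_cas, frac_without_cas, difference)},
    where difference = frac_without_cas - frac_with_cas. A positive
    difference means ARGs are more frequent when Cas operons are absent.
    """
    # counts[phylum][has_cas] = [number of genomes, number encoding ARGs]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for phylum, has_cas, has_arg in genomes:
        cell = counts[phylum][bool(has_cas)]
        cell[0] += 1
        cell[1] += int(bool(has_arg))

    result = {}
    for phylum, by_cas in counts.items():
        n_cas, arg_cas = by_cas[True]
        n_no, arg_no = by_cas[False]
        if n_cas == 0 or n_no == 0:
            continue  # both groups are needed for a comparison
        with_cas = arg_cas / n_cas
        without_cas = arg_no / n_no
        result[phylum] = (with_cas, without_cas, without_cas - with_cas)
    return result


# Invented toy records, for illustration only.
toy_records = (
    [("PhylumX", True, True)] * 1 + [("PhylumX", True, False)] * 3
    + [("PhylumX", False, True)] * 3 + [("PhylumX", False, False)] * 1
    + [("PhylumY", False, True)] * 2  # no Cas-encoding genomes, so skipped
)
shifts = arg_fraction_shift(toy_records)
print(shifts)
```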

A CRISPR–Cas9 system with robust in vitro activity

We have demonstrated the potential of the GOMC database as a valuable resource for exploring novel genome editing tools. Taking the most widely used Cas9 system as an example, we identified 88 contigs containing Cas9 operons and complete CRISPR arrays, among which 36 had Cas9 proteins more than 950 amino acids in size35 (Supplementary Table 2). From these, we selected the shortest one (ocean microbiome CRISPR–Cas9 system (Om1Cas9); 1,054 amino acids) from newly recovered genomes for experimental testing (Supplementary Table 3a). Om1Cas9 utilizes a guide RNA scaffold consisting of a 37-bp mature CRISPR RNA (crRNA) and a 72-bp trans-activating crRNA (tracrRNA) (Extended Data Fig. 7a,b) and specifically recognizes 3′ NNGG protospacer adjacent motif (PAM) sequences for targeting of double-stranded DNA (dsDNA) (Extended Data Fig. 7c). We conducted digestion experiments by incubating Om1Cas9 ribonucleoprotein complexes and dsDNA substrates at temperatures ranging from 22 to 42 °C. The results demonstrated that Om1Cas9 can effectively cleave dsDNA across the tested temperature range, displaying robust in vitro editing performance (Extended Data Fig. 7d and Supplementary Fig. 1). Furthermore, we integrated the Om1Cas9 sequence into the pX458 plasmid and evaluated its activity using human cells (Supplementary Table 3b). Specifically, we selected five target sites in the haemoglobin subunit gamma (HBG) gene and BCL11a enhancer regions, and designed corresponding guide RNA spacers with appropriate PAMs to explore the practical application of Om1Cas9 in the treatment of β-thalassaemia (Supplementary Table 3c). Om1Cas9 showed a cleavage efficiency of 17.08–37.44% and 14.89–93.83% at the HBG and BCL11a enhancer gene loci, respectively, in the HEK293T cell line derived from embryonic kidney cells (Extended Data Fig. 7e,f). 
This case study demonstrates the efficacy and highlights the potential of utilizing the GOMC resources for identifying novel CRISPR–Cas systems for various biotechnological applications.

AMPs with efficacy against a range of pathogens

Marine microbial communities have the ability to synthesize secondary metabolites with significant ecological, biotechnological and therapeutic application potentials4. These molecules are encoded by biosynthetic gene clusters (BGCs). In our study, we predicted a total of 64,217 BGCs of 66 different types, with lengths ranging from 1,001 to 576,743 bp (Extended Data Fig. 8a). To address redundancy and incompleteness inherent in individual BGCs, we clustered all BGCs into 13,063 gene cluster families (GCFs) (Fig. 4a). Remarkably, approximately 25.49% (16,369 BGCs) of the BGCs from 5,793 GCFs were only remotely similar (cosine distance > 0.2) to any annotated GCFs in the BiG-FAM reference database36. Approximately 60.83% of these novel BGCs were specifically encoded by the newly reconstructed MAGs in our study. Most of the novel BGCs were from Proteobacteria (38.36%) and Bacteroidota (10.65%) (Extended Data Fig. 8b), with ribosomally synthesized and post-translationally modified peptides (RiPPs) (43.12%) and terpenes (23.12%) being the most dominant types (Extended Data Fig. 8c). Furthermore, we identified a total of 419 archaeal GCFs, with 233 archaeal-specific domains, mainly from Halobacteriota and Thermoplasmatota. Among the bacterial phyla, Proteobacteria, Actinobacteriota and Firmicutes had the highest diversity of GCFs, with more than 80% of their GCFs being phylum-specific (Extended Data Fig. 8d), implying that the biosynthesis of certain secondary metabolites might be restricted to specific taxa. Furthermore, we extrapolated the trend of GCF-coding potential across various taxonomic ranks and identified Proteobacteria, Actinobacteriota, Bacteroidota and Planctomycetota as having the highest potential for producing secondary metabolites (Extended Data Fig. 8e). This analysis also emphasized the efficacy of genus-level classification in assessing coding potential (Extended Data Fig. 8f,g), further corroborating previous findings36.
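The novelty criterion above (cosine distance > 0.2 to any annotated reference GCF) can be illustrated with a short sketch. It assumes each BGC and each reference GCF is represented by a numeric feature vector; the vectors, the function names and the use of the minimum distance over all references are illustrative assumptions rather than a description of the BiG-FAM query procedure.

```python
import math


def cosine_distance(u, v) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def is_remote(bgc_vector, reference_vectors, threshold: float = 0.2) -> bool:
    """True if the BGC is farther than `threshold` from every reference."""
    nearest = min(cosine_distance(bgc_vector, ref) for ref in reference_vectors)
    return nearest > threshold


# Toy feature vectors, for illustration only.
references = [[2.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
print(is_remote([1.0, 1.0, 0.0], references))  # same direction as a reference
print(is_remote([0.0, 0.0, 1.0], references))  # orthogonal to all references
```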

Fig. 4: Identification of biosynthetic gene clusters and AMPs.

a, Comparison of biosynthetic gene clusters among phyla. The number of unique GCFs detected in each phylum is displayed by the bar chart. b, SEM examination of five bacterial strains treated with cAMP_87 and non-AMP negative control group, revealing leakage of cell contents and disruption of the cell wall and membrane. The experiments were conducted in triplicate, yielding consistent results, and a representative image is provided for illustration.

We conducted extensive data mining to identify potential novel AMPs from putative BGCs, which often exhibit various antibacterial and antitumor activities37 (Extended Data Fig. 8h). We identified 1,079 putative AMPs from 629 BGCs, of which 121 unique candidate AMPs (cAMPs) were identified from 115 BGCs using deep learning models (Supplementary Table 4). The cAMPs were mainly derived from lanthipeptide class II and lanthipeptide class I BGCs from Actinobacteriota (31 cAMPs), Firmicutes (27 cAMPs) and Proteobacteria (21 cAMPs) (Extended Data Fig. 8h). Out of the 121 candidate AMPs, 117 showed high potential to be novel cAMPs, indicating a rich source of unexplored AMPs in the marine microbiome38.

To validate and characterize their antimicrobial activity, we successfully synthesized 63 cAMPs with fewer than 50 amino acids by solid-phase peptide synthesis (Supplementary Table 4). We examined their antimicrobial activity against five bacterial strains, including Gram-positive Staphylococcus aureus (ATCC 12600) and Bacillus subtilis (ATCC 6051), as well as Gram-negative E. coli (ATCC 25922), Klebsiella pneumoniae (ATCC 13883), and Vibrio vulnificus (ATCC 27562). Preliminary examination identified ten cAMPs with antimicrobial activity that inhibited the growth of at least one strain (Extended Data Fig. 9a and Supplementary Table 4). Of note, one of the tested cAMPs (cAMP_87) showed the lowest minimal inhibitory concentration (MIC) and minimal bactericidal concentration (MBC) of 4 μM against the S. aureus and B. subtilis strains, whereas for the other three strains, the MIC was 16 μM and the MBC remained below 32 μM (Extended Data Fig. 9a,b). The 22-amino-acid peptide cAMP_87 was initially identified from a novel bacterium of the Salinibacteraceae family. The structure predicted by AlphaFold2 showed that cAMP_87 adopts an alpha-helical conformation, consistent with a typical structure of AMPs39 (Extended Data Fig. 9c,d). Both scanning electron microscope (SEM) and transmission electron microscope (TEM) images revealed damage to the bacterial membrane upon exposure to cAMP_87 (Fig. 4b and Extended Data Fig. 9e). Thus, cAMP_87 exhibits broad-spectrum and potent antibacterial activity against both Gram-negative and Gram-positive bacteria. Our findings indicate that novel marine bacterial genomes have great potential for AMP mining, pointing to the unexplored space of novel antibiotics in marine microbial genomes.

Deep-sea PETases depolymerize PET film

We also constructed a global ocean microbiome protein catalogue (GOPC) by predicting open reading frames (ORFs) of the assembled contigs (Extended Data Fig. 1). The GOPC contains more than 2,458 million unique genes, surpassing the gene count of Ocean Microbial Reference Gene Catalogue40 (OM-RGC_v2), providing a more comprehensive resource for novel enzyme mining for various biotechnological applications. The enzymatic breakdown of polyethylene terephthalate (PET) has attracted increasing attention since the discovery of a novel PET hydrolase (IsPETase) from a PET-assimilating bacterial strain41,42. Techniques such as directed evolution have significantly improved the catalytic efficiency for PET degradation and recycling43,44,45. We conducted a targeted search against GOPC using the IsPETase sequence as a reference to discover novel PET hydrolases. We identified 1,598 IsPETase homologous sequences from various marine ecosystems containing the conserved Ser-Asp-His catalytic triad46 (Extended Data Fig. 10a). These sequences showed significant phylogenetic diversity and formed distinct clades that were not constrained by their geographic origins (Extended Data Fig. 10b,c). To identify PET hydrolases with robust performance under different conditions and enzymatic stability, we focused on PETase candidates associated with extreme marine environments47. Consequently, we selected three sequences from the hadal trench and another three sequences from hydrothermal vents for heterologous expression in E. coli and subsequent in vitro biochemical characterization (Extended Data Fig. 10d and Supplementary Table 5).

To test the hydrolytic activities of the six heterologously expressed deep-sea PETases (dsPETases), commercial GfPET films (ES301445, Goodfellow) were used as substrates. The total concentration of the main hydrolysis products including mono (2-hydroxyethyl) terephthalic acid (MHET) and terephthalic acid (TPA) (Fig. 5a), served as a proxy of catalytic activity44. Among the six candidates, three halophilic PET hydrolases (dsPETase05 from the North Su hydrothermal vent, and dsPETase01 and dsPETase06 from the Mariana Trench) exhibited superior catalytic activity against amorphous GfPET films, especially under elevated NaCl concentrations (Fig. 5b and Extended Data Fig. 10e). Their catalytic activities increased with higher NaCl concentrations and reached their peak performance at 4.5 M or 5.3 M NaCl at 37 °C. Compared to IsPETase, these three halophilic PET hydrolases displayed 12.0-, 16.0- and 5.6-fold higher activity, respectively (Fig. 5b). However, no significant catalytic activity was observed for dsPETase02, dsPETase03 and dsPETase04 under varying saline conditions (Extended Data Fig. 10e). The optimum temperature for dsPETase05 from the vent plume water (55 °C) was higher than those for dsPETase01 and dsPETase06 derived from the hadal trench water (40 and 45 °C) (Fig. 5c). Compared with the salt-intolerant IsPETase, the three halophilic dsPETases showed 11.8- to 44.3-fold higher activities under the optimum saline and temperature conditions (Fig. 5c). We conducted an incubation experiment using the most active dsPETase05, with IsPETase serving as the control, to visualize the PET depolymerization process. The incubation systems contained in-lab prepared solvent-cast PET (scPET) films with an average thickness of 28 μm and PETases with 300 and 500 nM concentrations. During three days of incubation, dsPETase05 showed more significant visible degradation of the scPET film than IsPETase (Fig. 5d). 
Notably, after three days of incubation with 500 nM dsPETase05, all scPET was degraded into small fragments. dsPETase05 achieved a depolymerization rate of 83%, substantially higher than the 41% achieved by IsPETase. Similarly, at a reduced enzyme concentration of 300 nM, dsPETase05 still exhibited a higher depolymerization rate of 41%, compared with 27% for IsPETase.

Fig. 5: Hydrolytic activities of halophilic dsPETases.

a, Schematic depolymerization of PET catalysed by PETase, mainly producing MHET, TPA and ethylene glycol (EG) as soluble products. b, Halophilic properties of dsPETases. Hydrolytic activities towards amorphous GfPET films proxied by the concentrations of total released products (the sum of MHET and TPA, analysed by HPLC). The reactions catalysed by 50 nM dsPETases were carried out in pH 9.0 Tris-HCl buffer for 120 h at a series of NaCl concentrations. The activity of IsPETase in the absence of NaCl was determined in parallel as a reference. c, Hydrolytic activities of three halophilic dsPETases towards GfPET films under a range of temperatures. The reactions were initiated by adding enzymes to their optimal saline concentrations, which were 5.3 M of NaCl for dsPETase01 and dsPETase05, and 4.5 M of NaCl for dsPETase06. All reactions were conducted in triplicate. The bars and circles represent the mean and individual values, respectively, and error bars represent the s.d. of the replicated experiments. d, Visible degradation of scPET films by halophilic dsPETase05 under optimal saline and temperature conditions. The reaction catalysed by IsPETase under NaCl-free Tris-HCl buffer (pH 9.0) at 37 °C was set as reference. In each sample, 3 mg of scPET was incubated in a total volume of 3 ml, with 300 or 500 nM of enzyme as indicated. The experiments were conducted in triplicate with consistent results, and one representative figure is shown.

Discussion

We conducted extensive data collection and analysis of ocean metagenomes and marine microbial genomes from worldwide distributed samples. Our study significantly contributes to the expanding knowledge on marine microbiomes with the establishment of the GOMC database, comprising 24,195 species-level genomes. This comprehensive resource, alongside our protein database GOPC, provides valuable insights into the intrinsic biological diversity of marine environments and opens promising avenues for bioprospecting. Although previous MAGs-based studies have offered preliminary insights into the role of the marine system in maintaining biological diversity1,2,3,9,48, our research extends these findings and introduces avenues for sustainable exploration and exploitation.

The interactions between microorganisms and their environment are of paramount importance within marine ecosystems owing to the dynamics of oceanic habitats. Factors such as salinity, temperature fluctuations, light availability and significant changes in pressure from the surface to the sea floor impose unique selective pressures on microbial populations, shaping their (co)evolution49,50. The evolutionary process results in disparities in genome size and the variations of adaptive mechanisms such as defence systems. The observed augmentation in bacterial genome size demonstrates a complex association with the proliferation of distinct functional domains that are crucial for nutrient acquisition, responsiveness to environmental stimuli and interactions with other organisms20,21,22,23,24,25,26,27. The differential abundance and uneven distribution of defence systems across ecosystems reflect the competitive nature of oceans, despite their dilute environment. The presence of CRISPR–Cas systems underscores their importance in shaping microbial survival strategies in dynamic marine ecosystems31,34,51. Despite complex and sometimes contradictory patterns, the correlation between CRISPR–Cas occurrences and ARG defence systems suggests potential trade-offs between adaptive immunity and the acquisition of new genetic material31,33,34,51. This genomic plasticity not only contributes to the ecological success of specific lineages in certain ecosystems but also highlights the rich diversity of marine microorganisms, offering unique opportunities for potent bioprospecting.

Drawing from these insights, our study leverages the repository of marine microbial genomes as a fundamental resource for genome mining. This approach enables the discovery of genetic tools and novel bioactive compounds. Our investigation unveils valuable information about newly identified CRISPR–Cas9 systems, AMPs and plastic-degrading enzymes, showcasing the diverse molecular arsenal encoded within the microbial communities of the marine environment. For instance, our newly identified CRISPR–Cas9 systems in the GOMC hold great application potential in various fields of research and biotechnology52,53. Taxa such as Streptomyces, Micromonospora and Pseudomonas_E emerge as promising candidates for bioprospecting endeavours54,55,56, facilitating the exploration of novel BGCs and bioactive compounds. The observed diversity in the BGC coding potential serves as a tangible demonstration of the extensive range of genetic diversity within marine microbial communities. Notably, the experimentally validated AMPs show promise as antibiotics against multiple pathogens. The insights obtained from our experimental in vitro work can in turn improve deep learning algorithms tailored for the precise identification of AMPs57. This synergistic relationship between computational prediction and experimental validation might establish a positive feedback loop, where each iteration refines and strengthens the efficacy of both methodologies37,57. Additionally, our databases harbour significant potential for discovering novel enzymes, exemplified by the identification of PETases for plastic degradation and waste management practices41,42,43. Together, the deep learning-based genome mining of ocean microbiomes in combination with in vitro verification holds great promise for addressing global challenges from antimicrobial shortages to ocean pollution, emphasizing the critical role of marine microbiomes in advancing human well-being and environmental sustainability.

Methods

Marine metagenome sequence data collection

We downloaded and reanalysed marine metagenome sequencing data from NCBI, EBI and JGI published between August 2009 and July 2020. For datasets from the NCBI database, we first screened 49 marine-related taxonomy IDs (Supplementary Table 6), including those with keywords for various marine environments, ecosystems, animals, plants and other related terms. Based on these taxonomy IDs, we used NCBI’s E-utilities tool to obtain the corresponding sample and Sequence Read Archive information. For datasets from the EBI database, we first downloaded the metadata of all classification systems, and then manually screened them according to 27 keywords related to the ocean (Supplementary Table 6). For datasets from the JGI database, we directly downloaded sample information based on the same set of keywords as used for the EBI database. To reduce potential redundancy between databases, we double-checked the datasets and removed duplicates obtained from different databases. A total of 24,395 marine metagenomic samples with more than 230 Tb of sequencing data were downloaded from these three databases15,17,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108. All of the downloaded datasets were further checked, and samples from rRNA gene amplicon sequencing, metatranscriptomic sequencing and non-marine environments were removed and not considered in this study.

Metagenome assembly, binning and quality evaluation

Sequencing reads of metagenome samples were obtained using sratoolkit (v2.10.8), and reads with low sequencing quality, PCR duplication and adapter contamination were trimmed by SOAPnuke (v1.5.6). Clean data from each sample were assembled into contigs by megahit (v1.1) with parameters “--min-count 2 --k-min 33 --k-max 63 --k-step 10”109. Contigs longer than 1,000 bp were subjected to the MetaBAT2 (v2.12.1)110 module of MetaWRAP (v1.1.5)111 software for binning analysis to generate the MAGs. A total of 119,843 MAGs were reconstructed after individual-sample binning of all marine metagenomic samples in the current study. The quality score (QS), defined as ‘completeness − 5 × contamination’, of each MAG was estimated with CheckM (v1.0.12)112, and low-quality genomes according to the MIMAG standard (completeness <50% or contamination >10% or QS < 50) were removed from downstream analysis. Genomes with completeness ≥90% and contamination ≤5% were defined as near-complete genomes, those with completeness ≥70% and contamination ≤10% as high-quality genomes, and those with completeness ≥50% and contamination ≤10% as medium-quality genomes. Finally, 43,191 MAGs met the criterion of medium or higher quality of the MIMAG standard (completeness ≥ 50%, contamination ≤ 10% and QS ≥ 50) and were kept for downstream analysis and included in the GOMC. It should be noted that more MAGs would probably be reconstructed from these metagenomic datasets by different genome binning approaches, for instance, by multiple-sample binning methods based on differential coverage across all samples from similar environments4,113. We suggest that researchers adopt diverse and flexible binning approaches if they intend to recover as many genomes as possible, although the multiple-sample binning approach requires significantly more computing resources.
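
The quality thresholds above can be sketched as a small classifier. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, the inputs are assumed to be CheckM completeness and contamination percentages, and applying the QS filter before assigning a tier is an assumption about ordering.

```python
def quality_score(completeness: float, contamination: float) -> float:
    """QS as defined in the text: completeness - 5 x contamination."""
    return completeness - 5.0 * contamination


def classify_mag(completeness: float, contamination: float) -> str:
    """Assign a quality tier to one MAG (hypothetical helper).

    Inputs are percentages, e.g. 92.3 and 1.8.
    """
    qs = quality_score(completeness, contamination)
    # Low quality: fails any one of the three retention criteria.
    if completeness < 50 or contamination > 10 or qs < 50:
        return "low"
    if completeness >= 90 and contamination <= 5:
        return "near-complete"
    if completeness >= 70 and contamination <= 10:
        return "high"
    return "medium"


# Illustrative values only, not taken from the study.
for comp, cont in [(95.0, 5.0), (75.0, 4.0), (55.0, 1.0), (92.0, 9.0)]:
    print(comp, cont, classify_mag(comp, cont))
```

Note that a genome can pass the completeness and contamination cut-offs and still be discarded by the QS criterion, as in the last illustrative case (92% complete, 9% contaminated, QS = 47).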

Genome catalogue construction and taxonomic assessment

The 43,191 newly recovered MAGs from this study, together with two recently published marine microbial genome datasets (OMD4 and OceanDNA6) and 8,050 qualified publicly available marine bacterial and archaeal genomes (completeness ≥50%, contamination ≤10% and QS ≥ 50) from NCBI114 (CNSA accession number: DATAmic13, retrieved from NCBI on 31 May 2020), were combined and dereplicated at the species level using dRep (v2.6.2)115 with parameter settings of “-comp 50 -con 10 -pa 0.9 --S_ani 0.95 --cov_thresh 0.3”. Taxonomic annotation was performed using the Genome Taxonomy Database Toolkit (GTDB-Tk, v2.1.1) with the classify_wf function under default parameter settings (dataset r207v2)116. The bacterial and archaeal phylogenetic trees of GOMC were constructed by FastTree (v2.1.10)117 using the protein sequence alignments produced by GTDB-Tk and visualized by iTOL (v5.0)118.

Functional annotation of GOMC genomes

ORFs of genomes in GOMC were predicted using Prokka (v1.14.6)119, and functional annotation of the predicted ORFs was conducted. Protein families (Pfams) were annotated using InterProScan (v5.0) against the Pfam (v43) database. The ARGs were identified using RGI (v5.2.0) against the CARD (v3.2.9) database, and only the ‘Perfect’ and ‘Strict’ annotation results with protein length >50 amino acids and identity >30% were retained. CRISPR–Cas genes and arrays were searched and identified using CRISPRCasTyper (v1.6.1)120. The acr-aca operons, including anti-CRISPR (Acr) proteins and Acr-associated (Aca) proteins, were identified using AcaFinder121 with parameters “-l 800 -i 300 -b 10”. The optimal growth temperature (OGT) was estimated using OGT_prediction with the excluding_genome_size_and_rRNA regression model122. Based on the predicted OGT, we divided MAGs into six categories: psychrophile (OGT < 10 °C), psychrotroph (10 °C ≤ OGT < 20 °C), mesophile (20 °C ≤ OGT < 40 °C), thermotolerant (40 °C ≤ OGT < 55 °C), thermophile (55 °C ≤ OGT < 85 °C) and hyperthermophile (OGT ≥ 85 °C).
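
The six temperature categories form half-open intervals, which can be expressed compactly with a sorted list of boundaries. The sketch below is illustrative only: the function name is hypothetical, and the input is assumed to be a predicted OGT in degrees Celsius.

```python
from bisect import bisect_right

# Upper boundaries (°C) of the first five categories, as listed in the text.
OGT_BOUNDS = [10, 20, 40, 55, 85]
OGT_LABELS = [
    "psychrophile",      # OGT < 10
    "psychrotroph",      # 10 <= OGT < 20
    "mesophile",         # 20 <= OGT < 40
    "thermotolerant",    # 40 <= OGT < 55
    "thermophile",       # 55 <= OGT < 85
    "hyperthermophile",  # OGT >= 85
]


def classify_ogt(ogt_celsius: float) -> str:
    """Map a predicted OGT to its category (hypothetical helper).

    bisect_right places a value equal to a boundary in the upper
    category, which matches the '<=' on the lower bound of each interval.
    """
    return OGT_LABELS[bisect_right(OGT_BOUNDS, ogt_celsius)]


print(classify_ogt(4.0))   # psychrophile
print(classify_ogt(10.0))  # psychrotroph
print(classify_ogt(37.0))  # mesophile
```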

Genome mining of CRISPR–Cas operons from diverse ecosystems

To compare the distribution of CRISPR–Cas systems in different ecosystems, we downloaded 19,483 genomes of eight ecosystems from the Genomes from Earth’s Microbiomes (GEM) catalogue3, including 7,335 genomes of aquatic freshwater, 2,461 of terrestrial soil, 1,955 of mammalian digestive system, 1,910 of wastewater anaerobic digestor, 1,735 of aquatic non-marine saline and alkaline, 1,579 of aquatic thermal springs, 508 of terrestrial deep subsurface and 2,000 randomly selected genomes of human digestive system. All genomes from each ecosystem were then clustered at 95% average nucleotide identity using dRep (v2.6.2)115 with parameters “-comp 50 -con 10 -pa 0.9 --S_ani 0.95 --cov_thresh 0.3”. In addition, 968 non-redundant glacier genomes were downloaded from the Tibetan Glacier Genome and Gene (TG2G) catalogue123. In total, 10,274 non-redundant genomes from nine ecosystems were used for the detection of CRISPR–Cas operons and acr-aca operons using CRISPRCasTyper (v1.6.1)120 and AcaFinder121, respectively, as described above. MobileElementFinder (v1.1.2) was used to investigate the mobile genetic elements including transposon, insertion, integrative conjugative elements and integrative mobilizable elements of all genomes124. The results showed that many microorganisms encode anti-CRISPR genes that inhibit the CRISPR–Cas function to avoid self-targeting immunity34,125 and facilitate the acquisition of novel beneficial functions under complex regulation mechanisms126,127.

Assessing CRISPR–Cas9 activity

From the CRISPRCasTyper prediction results, we identified 88 genomes in GOMC containing a complete Cas9 operon and CRISPR array in one contig and selected 36 Cas9 proteins longer than 950 amino acids. We predicted their 3D structures using AlphaFold2 (v2.3.0)128 and checked the conserved key residues of the active centre, leading to a total of 26 Cas9 proteins showing the conserved structure of the key residues (Supplementary Table 2). The genome GOMC.bin.16150 (CNSA accession no. CNA0069409), which contained the shortest Cas9 protein (Ocean Microbiome Cas9, Om1Cas9) with a length of 1,054 amino acids among the newly recovered genomes in GOMC, was selected for the experimental demonstration of potential genome editing activity.

The CRISPR locus of Om1Cas9 was designed and synthesized into pACYC184 plasmid. To identify the mature crRNA and tracrRNA of Om1Cas9, we transformed the plasmid expressing the Cas9 system into E. coli BL21. We extracted total RNA using the RNA Extraction Kit (Tiangen) and constructed the small RNA sequencing library using the MGIEasy Small RNA Library preparation kit (MGI). The small RNA library was sequenced on a DNBSEQ-G400 sequencer generating single-end 100 bp reads. The sequencing reads were trimmed and mapped to the 10 kb DNA sequence flanking the CRISPR locus to identify the crRNA and tracrRNA sequences (Supplementary Table 3a and Supplementary Fig. 6a). The small RNA sequencing reads are deposited in the China National GeneBank Sequence Archive (CNSA) database under the dataset number MDB0000002.

For protein expression, the Om1Cas9 sequence was cloned and inserted into the pET28a (+) vector, and transformed into the E. coli BL21 (DE3) for expression. The plasmid maps are available in Supplementary Table 3b. Ni-NTA chromatography and size-exclusion chromatography were used to purify target proteins. To assess the targeting capability of Om1Cas9 to generate specific double-stranded DNA breaks, we conducted experiments using the DocMF platform to validate the positive cutting event of Om1Cas9 against the opposite PAM library as described previously and the PAM sequence logos were generated using ggseqlogo129,130 (Supplementary Table 3c). All plasmids and primer sequences were synthesized by BGI Tech Solutions.

In vitro activity assay of Om1Cas9

The amplified AAVS1 gene DNA fragments from 293T cell genomic DNA were used as dsDNA cleavage substrates, and the PCR primer pair sequences are shown in Supplementary Table 3d. All guide RNAs were transcribed in vitro using T7 RNA polymerase (AM1354, Invitrogen) following the manufacturer’s instructions. Nuclease and guide RNA (IVT) were combined to form active RNP complexes at a concentration of 100 nM in a 1:1 molar ratio in 1× NEBuffer r3.1 (NEB, US). The 50 nM DNA substrates were incubated with the formed active RNP complexes at temperatures ranging from 22 °C to 42 °C for 30 min for cleavage. After incubation, samples were analysed by electrophoresis on a 1% agarose gel.

We constructed the editing plasmids by using the pX458 vector backbone which included Om1Cas9 ORFs with 2A-puromycin resistant marker (2A-EGFP). Five potentially effective editing sites in HBG and BCL11a genes were selected according to the Om1Cas9 PAM sequences131,132 (Supplementary Table 3e). The guide RNA oligonucleotides were synthesized and inserted downstream of the U6 promoter through the BsaI recognition site after annealing.

The HEK293T (CRL-3216 ATCC) cells were seeded into 12-well plates and transfected with a total of 2 μg Om1Cas9 plasmids using the Lipofectamine 3000 kit (Invitrogen, UK) (2.4 μl per well). Blank plasmids were used as the negative control. The frozen cell line was provided by Servicebio Technology Co., Ltd. (No. STCC10301G) and was identified by short tandem repeat (STR) profiling. Mycoplasma contamination was not detected using the Myc-PCR Mycoplasma Detection Kit (Yeasen Biotechnology Co., Ltd., No. 40601ES10). After transfection, the cells were cultured in Dulbecco’s modified Eagle medium (10566016, Gibco) for 3 days. The treated cells were then collected and the transfection efficiency was calculated by Countstar Rigel S2 (Countstar) using the GFP channel. There was no significant difference in the transfection rate between the negative control and the Om1Cas9 plasmids. Genomic DNA was extracted using the TIANamp Genomic DNA Kit (DP304 Tiangen) and quantified using Nanodrop. The target sites were amplified from genomic DNA (PrimeSTAR GXL DNA Polymerase, TAKARA). The primers are listed in Supplementary Table 3f. The purified amplification mixture was used for the library construction using the MGIEasy PCR-Free Library Prep Set (MGI), and the libraries were subjected to deep sequencing on the DNBSEQ-G400 sequencer using the paired-end 150 bp mode. CRISPResso2 was used to analyse the insertions and deletions (indels) of the target amplicon sequencing data133.

The prediction of marine microbial natural products

Secondary metabolite BGCs were identified using antiSMASH (v5.0)134 with default parameters. Similar or identical BGCs were grouped into GCFs using BiG-SLiCE (v1.1.0)135 with the parameters “--threshold 99999 --complete”. For each BGC, we calculated its cosine distances to all BGCs in the BiG-FAM database, selecting the minimum value as the distance between the target BGC and BiG-FAM136. Subsequently, we computed the mean cosine distance of all BGCs within a GCF to determine the distance between the GCF and the BiG-FAM database. Finally, GCFs with a distance larger than 0.2 were identified as novel GCFs36. Furthermore, a GCF presence/absence matrix was generated and then used as “incidence raw data” in the R package iNEXT (v2.0.20) to estimate the potential diversity of GCFs for the top 20 dominant phyla and genera, respectively4,36. For cAMP identification, a total of 1,079 putative core peptides mined from diverse BGCs were subjected to the unified deep learning pipeline comprising the Attention, LSTM and BERT models as previously described137, and only the peptides with prediction scores of > 0.5 in all three models were considered as candidate AMPs.
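
The novelty criterion and the three-model consensus filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BiG-SLiCE or AMP pipeline code: each BGC is assumed to be a numeric feature vector, all function names are hypothetical, and treating a zero vector as maximally distant is a convention chosen here.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as distance 1.0 (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def bgc_distance(bgc, reference_bgcs):
    """Distance of one BGC to the reference set: the minimum over all references."""
    return min(cosine_distance(bgc, ref) for ref in reference_bgcs)


def gcf_distance(gcf_bgcs, reference_bgcs):
    """Distance of a GCF to the reference set: mean of its members' distances."""
    return mean(bgc_distance(bgc, reference_bgcs) for bgc in gcf_bgcs)


def is_novel_gcf(gcf_bgcs, reference_bgcs, threshold=0.2):
    """A GCF is called novel when its distance exceeds the threshold (0.2 in the text)."""
    return gcf_distance(gcf_bgcs, reference_bgcs) > threshold


def is_candidate_amp(scores, cutoff=0.5):
    """Keep a peptide only if every model's score exceeds the cutoff."""
    return all(score > cutoff for score in scores.values())


# Toy feature vectors, for illustration only.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(is_novel_gcf([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], reference))  # False
print(is_novel_gcf([[0.0, 0.0, 1.0], [0.0, 0.1, 1.0]], reference))  # True
print(is_candidate_amp({"attention": 0.9, "lstm": 0.7, "bert": 0.6}))  # True
```

Taking the minimum per BGC and then the mean per GCF means that a family is flagged as novel only when its members are, on average, far from their closest known relatives.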

AMP synthesis and activity assessment

All the AMPs used in this study were synthesized by solid-phase peptide synthesis by Sangon Biotech, with their accurate molecular masses determined by mass spectrometry. The purity of all peptides, determined by high-performance liquid chromatography, was higher than 90%.

The bacterial inhibition assays were conducted as described previously with slight modifications137. Five bacterial strains were streaked on Luria-Bertani (LB) agar plates and incubated at 37 °C overnight. Single colonies were picked into LB culture medium and shaken at 120 r.p.m. at 37 °C overnight. The resulting LB bacterial suspension was adjusted to a predetermined starting concentration with OD600 of 0.1 and then diluted 1,000 times for the inhibition test. Freeze-dried AMP powder was first dissolved in double-distilled water to a final concentration of 2.4 mM. We set three groups to test AMP antibacterial activity: (1) blank group, 200 μl of LB medium; (2) negative control group, 100 μl of LB medium, 95 μl of bacterial culture and 5 μl of sterile water; and (3) test group, 100 μl of LB medium, 95 μl of bacterial culture and 5 μl of AMP mother solution (with final working AMP concentration of 60 μM). Experiments were performed in 96-well plates with a working volume of 200 μl. The OD600 value of each well was measured after 12 h cultivation at 37 °C. All experiments were performed with three independent technical replicates.

MIC determination of AMPs was performed by broth microdilution as described in Clinical and Laboratory Standards Institute guidelines138. In brief, the five strains mentioned above were inoculated in cation-adjusted Mueller-Hinton broth (CaMHB, QDRS Biotec, 11865) and incubated at 37 °C overnight. The cultures were diluted 1:100 using fresh CaMHB and cultured to the exponential phase (OD600 of 0.4–0.6). The cell concentrations were then adjusted to approximately 5.5 × 10^6 colony-forming units per ml. Subsequently, 10-μl aliquots were transferred into 96-well plates containing 100 μl of serially twofold-diluted AMP-CaMHB solutions, giving final AMP concentrations ranging from 1,024 µM to 1 µM. After incubating at 37 °C for 16–18 h, the MIC values were determined as the minimum concentration of AMP at which bacteria showed no detectable growth. All assays were performed in technical triplicate, and the entire experiment was repeated three times for robustness and reliability. Following the MIC examination, samples obtained from the 2×, 1× and 0.5× MIC wells were inoculated onto CaMHB agar plates. Plated samples devoid of any visible bacterial colony growth were used to identify the MBC for each strain and the respective AMP. All experiments were conducted in triplicate.
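
The arithmetic behind the broth microdilution set-up can be sketched as below. This is an illustration only: the function names are hypothetical, the growth calls are invented for the example, and the MIC is read simply as the lowest tested concentration with no detectable growth, as stated in the text.

```python
from typing import Dict, List, Optional


def twofold_series(highest: float = 1024.0, lowest: float = 1.0) -> List[float]:
    """Twofold dilution series from highest down to lowest (µM), inclusive."""
    series = []
    conc = highest
    while conc >= lowest:
        series.append(conc)
        conc /= 2.0
    return series


def final_inoculum(stock_cfu_per_ml: float,
                   inoculum_ul: float = 10.0,
                   well_ul: float = 100.0) -> float:
    """CFU per ml in the well after adding the inoculum to the AMP solution."""
    return stock_cfu_per_ml * inoculum_ul / (inoculum_ul + well_ul)


def read_mic(growth: Dict[float, bool]) -> Optional[float]:
    """Lowest concentration without detectable growth, or None if all wells grew.

    `growth` maps concentration (µM) to True when growth was detected.
    """
    clear_wells = [conc for conc, grew in growth.items() if not grew]
    return min(clear_wells, default=None)


series = twofold_series()
print(len(series))            # 11 concentrations: 1024, 512, ..., 1
print(final_inoculum(5.5e6))  # CFU per ml in each well

# Hypothetical plate read-out: growth below 16 µM only.
example = {conc: conc < 16 for conc in series}
print(read_mic(example))      # 16.0
```

With a 10-μl inoculum added to 100 μl, a stock of 5.5 × 10^6 colony-forming units per ml corresponds to about 5 × 10^5 colony-forming units per ml in the well.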

Damage to the bacterial cell envelope by AMPs was visualized by TEM and SEM. cAMP_87-treated and untreated samples were fixed with 2.5% glutaraldehyde in 0.1 M phosphate buffer at 4 °C for 4 h. The carbon-coated grids were placed in the bacterial solution for 3 min for adsorption of bacteria, dried using a wedge of filter paper, and stained with 0.2% uranyl acetate for approximately 5 s. Samples were observed in the STEM mode of an electron microscope (Zeiss Crossbeam550). Freshly cultured bacteria were diluted to a final OD600 of 1.0 in CaMHB broth. Subsequently, bacterial cells were treated with cAMP_87 at a concentration of 1× MIC in CaMHB for 5 h. The untreated control samples were prepared by supplementing the same volume of sterile water. The bacterial samples were then fixed with 2.5% glutaraldehyde in 0.1 M phosphate buffer at 4 °C for 4 h. The fixed bacteria were dehydrated through an ethanol gradient and dried with a critical point drier (Leica EM CPD300). The treated bacteria were then mounted, sputter coated with platinum using a sputter coater (Cressington 108) and imaged using a field emission SEM (FEI Quanta FEG 250).

Construction of the gene set

Coding sequence (CDS) regions of all metagenomic assembled contigs were predicted using MetaGeneMark (v3.38)139, and all predicted CDS sequences were pooled and redundant sequences were removed using the easy-linclust function of MMseqs2 (v12.113e3)140 with the parameters “--cov-mode 1 -c 0.99 --min-seq-id 0.95” to construct a unique global ocean microbiome protein catalogue (GOPC). The GOPC constructed in this study contained a total of more than 2,458 million unique genes (Extended Data Fig. 1). Comparison of the gene catalogue constructed in our study with those published previously, including the Ocean Microbial Reference Gene Catalogue (OM-RGC_v2)40, the Global Microbial Gene Catalog (GMGC10)141 and the microbial gene catalogue of the mangrove ecosystem (Mangrove)142, revealed significantly improved comprehensiveness of the current marine microbial gene catalogue (Extended Data Fig. 1). Functional annotation of the unique genes in GOPC was carried out against the KEGG database (v87.0) by kofamscan (v1.3.0). The results of functional annotation showed that ~803 million genes were annotated to 10,287 KOs against the KEGG database, leaving the majority unannotated. This indicates that there are still plenty of novel functions in marine ecosystems to be explored.

The identification of PETase sequences from the deep sea

To identify potentially active PETase proteins in marine ecosystems, DIAMOND (v0.8.23.85)143 was used to search the GOPC against the sequences of the typical PET hydrolase IsPETase (GenBank: GAP38373.1)144,145 as a reference with an E-value cutoff of <10^−5. A total of 3,954 hits were obtained from GOPC, exhibiting a hit rate of 0.011 hits per Mb, consistent with the previous study146. Each of the 3,954 hit sequences was aligned to the reference sequences using MUSCLE (v3.8.31) to check whether the Ser-Asp-His catalytic triad was contained, resulting in 1,598 aligned sequences containing the conserved catalytic triad. The multi-sequence alignment of PETase candidates was carried out by MAFFT (v7.407), and the phylogenetic tree was constructed by FastTree (v2.1.10)117. SignalP (v5.0b) was used to detect the signal peptide sequences of all PETase candidates; 893 candidates with signal peptides were retained, and the signal peptide amino acids were removed before downstream analysis147,148. Among the 893 candidates, 295 were identified from marine surface water, while 86 and 22 were identified from the bathypelagic zone and hydrothermal vents, respectively. We picked out the candidates from the extreme marine environments, which were assumed to be more stable in hostile conditions. Finally, three PETase candidates from the deepest parts of the Mariana Trench (10,400 m) and Kermadec Trench (9,177 m), and three sequences from two hydrothermal vents at different depths were selected for subsequent biochemical characterization (Supplementary Table 5). The six target amino acid sequences were aligned using the ClustalW algorithm in MEGA X149, and the alignment results and amino acid residues were analysed and visualized by ESPript (v3.0)150.
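
The catalytic-triad check on an aligned candidate can be sketched as follows. This is an illustrative sketch, not the procedure used in the study: the function names, the toy sequences and the triad positions are all hypothetical, and the input is assumed to be a gapped pairwise alignment of the reference and one candidate, with positions numbered on the ungapped reference.

```python
from typing import Dict


def column_of_reference_position(ref_aligned: str, position: int) -> int:
    """Map a 1-based position in the ungapped reference to a 0-based alignment column."""
    seen = 0
    for column, residue in enumerate(ref_aligned):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"reference has fewer than {position} residues")


def has_catalytic_triad(ref_aligned: str,
                        cand_aligned: str,
                        triad: Dict[int, str]) -> bool:
    """Return True if the candidate carries the expected residue at every triad column.

    `triad` maps 1-based reference positions to one-letter residues,
    e.g. {3: "S", 6: "D", 9: "H"} for the toy reference below.
    """
    if len(ref_aligned) != len(cand_aligned):
        raise ValueError("aligned sequences must have equal length")
    for position, expected in triad.items():
        column = column_of_reference_position(ref_aligned, position)
        if ref_aligned[column] != expected:
            raise ValueError(f"reference lacks {expected} at position {position}")
        if cand_aligned[column] != expected:
            return False
    return True


# Toy alignment for illustration; these are not real PETase sequences.
reference = "MK-SAGDLLHW"
toy_triad = {3: "S", 6: "D", 9: "H"}
print(has_catalytic_triad(reference, "MRTSG-DIVHF", toy_triad))  # True
print(has_catalytic_triad(reference, "MRTAG-DIVHF", toy_triad))  # False
```

Mapping positions through the gapped reference keeps the check correct when the alignment introduces gaps upstream of a catalytic residue.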

Protein purification and the assessment of PET hydrolysis activity

Genes encoding PETases were commercially synthesized by BGI Research, with codon optimization for E. coli. The N-terminal signal peptides of the enzymes were truncated before synthesis. The synthesized genes were subcloned into pET32a-LIC plasmid downstream of the TEV protease cleavage site. The constructs were subsequently transformed into competent E. coli Rosetta-gami 2 (DE3) (Novagen) for protein expression. Protein purification was conducted as previously described145.

Amorphous GfPET film, ES301445 (Goodfellow), was cut with a hole puncher into round pieces 6 mm in diameter to serve as substrates. The GfPET film was incubated with 50 nM of IsPETase or dsPETases in 500 μl of Tris-HCl buffer (pH 9.0) containing NaCl at 0, 0.6, 1.2, 1.9, 2.8, 3.7, 4.5 or 5.3 M. The reaction mixture was incubated at 37 °C for 48 or 120 h, then the amount of hydrolysed products was used as a proxy of activity and visualized using GraphPad Prism (v9.5.1). The halophilic dsPETases were subsequently applied to catalyse PET hydrolysis under a series of temperatures ranging from 25 to 75 °C, in 500 μl of Tris-HCl buffer (pH 9.0) containing 4.5 M NaCl for dsPETase06, and 5.3 M NaCl for dsPETase01 and dsPETase05. Since the pH of Tris-HCl buffer changes significantly with temperature, all the buffers were prepared by adjusting pH at the same temperatures as the reaction conditions. The hydrolysis products MHET and TPA were analysed and quantified using HPLC as previously described151.

Hydrolytic activities of IsPETase and dsPETases were also evaluated using solvent-cast PET film (scPET) as substrate. scPET was prepared as reported151,152 with slight modifications where appropriate. In brief, 4 ml of GfPET dissolved in 1,1,1,3,3,3-hexafluoro-2-propanol (HFIP) (40 mg ml^−1) was cast on a flat glass sheet with a diameter of 10 cm. After overnight evaporation of HFIP at ambient temperature, followed by incubation in 75% ethanol for 2 h, the resulting scPET film was peeled off and cut into small pieces for degradation assessment. The reactions were carried out with 3 mg of scPET film in glass tubes in a total volume of 3 ml. The reaction mixtures for dsPETase05 contained Tris-HCl buffer (pH 9.0) with 5.3 M NaCl, and the incubation temperature was 55 °C. The reaction catalysed by IsPETase in NaCl-free Tris-HCl buffer (pH 9.0) at 37 °C was set as reference. Enzyme concentrations of 300 or 500 nM were used, and reactions were run in technical triplicate. After 3 days of incubation, the soluble PET hydrolysis products were analysed by HPLC and the depolymerization rate was calculated according to the theoretical amount of MHET units, which is 15.6 μmol in 3 mg of scPET.
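
The theoretical amount of MHET units and the resulting depolymerization rate can be reproduced with a short calculation. This is a sketch under stated assumptions: the repeat-unit molar mass of 192.17 g/mol is computed here from the formula C10H8O4 and is not given in the text, the function names are hypothetical, released product is assumed to be the sum of MHET and TPA measured by HPLC, and the example concentrations are invented.

```python
# Molar mass of one PET repeat unit (C10H8O4), from standard atomic masses.
# This value is an assumption of the sketch, not stated in the text.
PET_REPEAT_UNIT_G_PER_MOL = 192.17


def theoretical_units_umol(pet_mg: float) -> float:
    """Micromoles of repeat (MHET) units contained in pet_mg of PET."""
    return pet_mg / PET_REPEAT_UNIT_G_PER_MOL * 1000.0


def depolymerization_percent(mhet_mM: float,
                             tpa_mM: float,
                             volume_ml: float = 3.0,
                             pet_mg: float = 3.0) -> float:
    """Percentage of repeat units released as soluble MHET plus TPA.

    Concentration in mM multiplied by volume in ml gives micromoles.
    """
    released_umol = (mhet_mM + tpa_mM) * volume_ml
    return 100.0 * released_umol / theoretical_units_umol(pet_mg)


print(round(theoretical_units_umol(3.0), 1))  # 15.6, matching the value in the text
# Hypothetical HPLC read-out of 1.0 mM MHET and 0.3 mM TPA:
print(round(depolymerization_percent(1.0, 0.3), 1))
```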

Statistics and reproducibility

Statistical analyses were performed in RStudio with R (v4.0.2–4.3.1). Specific statistical tests used for individual analyses are detailed in the figure legends. Unless otherwise specified in methods and legends, statistical tests were two-sided. Fig. 1a and the map in Extended Data Fig. 1 were generated using the R packages maps (v3.4.2) and scatterpie (v0.2.3). Bar, box, violin and heat map plots were created using the R package ggplot2 (v3.5.1). Each boxplot displays the distribution of data as follows: the box represents the interquartile range (IQR), with the median marked by a horizontal line inside the box. The whiskers extend to the largest and smallest values within 1.5 times the IQR from the hinges. Outliers beyond the whiskers are plotted individually. Wherever applicable, individual data points were plotted above the bar or violin plots to depict the original distribution of the data. Phylogenetic trees were visualized using either iTOL (v5.0)118 or the R package ggtree (v3.8.2) as specified in Methods.
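
The boxplot convention described above can be illustrated with a short calculation. This is a sketch in Python rather than the R code used for the figures: the function name is hypothetical, quartiles are assumed to be computed by linear interpolation between order statistics, and the data are invented.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Quartiles, whisker ends and outliers following the 1.5 x IQR rule."""
    data = sorted(values)
    # 'inclusive' interpolates linearly between order statistics.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Invented data with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats["median"], stats["whisker_high"], stats["outliers"])
```

The whiskers stop at actual observations rather than at the fences themselves, so the single extreme value in this example is reported as an outlier and the upper whisker ends at 9.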

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.