Background & Summary

Peatlands occupy only 3% of the total land area but contain 600–650 Gt C1,2,3, nearly double the biomass stored in the world’s tropical rainforests4. Peatland ecosystems thus have a significant impact on global carbon storage, greenhouse gas emissions, and climate regulation. Although peatlands are widely distributed across the Earth, some of the most carbon-dense (80–120 Gt C) peatlands exist in the tropics due to the sheer depth of peat deposits that occur within a relatively small area (0.4 million km²)5. Peatlands in South-East Asia (SEA) account for 56% of those that occur in the tropics and store nearly 77% of the carbon locked in tropical peatlands globally2. Since 1990, tropical peatlands in SEA have been extensively repurposed (20–40% of land area) for forestry (pulpwood plantations) and agriculture (oil palm plantations)6, resulting in a substantial loss of sequestered carbon through peat oxidation (2.5 Gt C between 1990 and 2015)5. Peat oxidation is largely driven by microorganisms, and thus our ability to predict carbon outcomes relies on identifying those with the capacity to sequester or emit carbon. In the context of the data presented in this study, the term microorganisms refers to bacteria and archaea only.

Microorganisms in SEA tropical peat microbiomes have largely been identified and studied using marker-based approaches7,8,9,10,11. While such studies have helped uncover the high microbial diversity9,10,11 that exists within SEA tropical peatlands and its association with environmental variables7,8,9,10,11, they do not provide direct insights into microbial carbon metabolism. Identifying and understanding the role of specific microorganisms that turn over carbon in the community requires reference genomes which link microbial identity with metabolic capacity. However, genomically resolving such communities is highly challenging owing to their sheer complexity. So far, the carbon-processing potential of tropical peat microbiomes remains either genomically unresolved or poorly resolved at best. Recent studies in temperate peatlands, however, have shown that the functional potential of peat microbiomes can be genomically resolved using metagenomic approaches12,13.

Here, we deeply sequenced peat metagenomes proximal to and distant from oil palm trees in an oil palm plantation. Oil palm plantations on peatlands are of particular interest as they are hotspots of microbially-driven carbon emissions. Using assembly-based approaches, we reconstructed 764 sub-species level (99% gANI) metagenome-assembled genomes (MAGs) with a completeness ≥50% and redundancy ≤10% which cluster into 333 species-level (95% gANI) MAGs (245 bacterial and 88 archaeal) (Fig. 1). Of these, 38 bacterial and 13 archaeal genomes are near-complete (completeness ≥90%, redundancy ≤5%, number of unique tRNAs ≥18), while an additional 207 bacterial and 91 archaeal genomes are substantially complete (completeness ≥70%, redundancy ≤10%). The MAGs have a median size of 3.23 Mbp (range: 0.43–10.91 Mbp), median N50 of 6.34 kbp (range: 3.3–104.84 kbp) and encode a total of 2,530,130 protein-coding genes. The sub-species-level collection spans 14 different bacterial and archaeal phyla with a majority (53.1%) belonging to the phyla Acidobacteriota and Thermoplasmatota, both of which occur widely in peatlands and in acidic soils14,15. Within these phyla, the MAGs provide maximum phylogenetic gain for the orders UBA7540 (13 genomes; phylogenetic gain: 33%) and UBA184 (56 genomes; phylogenetic gain: 76.72%). To our knowledge, this catalogue represents the largest collection of microbial genomes from a tropical peat ecosystem.

Fig. 1

Maximum likelihood tree of bacterial species-level MAGs. The phylogenetic tree was constructed using a concatenated set of 120 conserved bacterial marker genes. Concentric rings (moving outward) represent genome completeness and redundancy. The bar plot represents the size of the MAG in Mbp.

Carbon-processing potential of MAGs was determined using a comprehensive marker-gene-based approach which integrates gene functional annotations from multiple databases such as KEGG16, dbCAN17, PFAM18, and TIGRFAM19. The ability to respire a broad spectrum of carbon substrates such as amino acids, polysaccharides, and fatty acids was widespread across both bacterial and archaeal species, but the capacity to fix carbon was detected only in a few genomes (Fig. 2). Fermentative pathways which produce alcohols and organic acids such as ethanol and acetate, as well as hydrogen and carbon dioxide, were also prevalent. None of the archaeal genomes encode pathways to convert fermentative end-products into methane. However, the capacity to oxidise methane was detected in a small fraction of MAGs (34 MAGs; 10.2%) from the phyla Acidobacteriota, Actinobacteriota, Proteobacteria, Desulfobacterota_B, Thermoplasmatota, Thermoproteota, and Micrarchaeota. In contrast, the capacity to oxidise non-methane trace gases such as methanol (93 MAGs; 27.9%), ethanol (69 MAGs; 20.7%), hydrogen (103 MAGs; 30.9%), and carbon monoxide (177 MAGs; 53.2%) was detected in a larger number of MAGs. Interestingly, 38 of the 93 MAGs capable of oxidising methanol belong to the phylum Acidobacteriota, members of which are not typically linked to methanol consumption.

Fig. 2

Genome-resolved carbon-processing potential of bacterial and archaeal species-level MAGs. Heatmap showing the presence (black) or absence (white) of key carbon-processing pathways across MAGs within phyla containing a minimum of ten genomes with a completeness ≥70% and redundancy ≤10%. Phylum names are shown either below or next to the heatmap slice corresponding to MAGs from the particular phylum. MAGs within each phylum are clustered based on the occurrence frequency of different pathways. Carbon-processing pathways are grouped and color-coded on the left for visual clarity.

Overall, we expect our genome database and metagenomes to be widely useful as a reference for metatranscriptomic experiments, comparative studies, and genome-guided isolation efforts. Availability of statistics describing the prevalence of carbon-processing functions across microbial populations will help fill existing knowledge gaps about their diversity, distribution, and metabolism. These data are particularly timely as carbon emissions from repurposed tropical peatlands continue to accelerate at an unprecedented rate, posing a grave threat to our climate.

Methods

Sample collection

Peat samples proximal to (0.5–1 m) and distant from (≥5 m) oil palm trees were collected as part of a time-stamped fertilizer intervention experiment in an oil palm plantation located in Jambi, Indonesia (103°49ʹ32.23ʺ E, 1°40ʹ58.24ʺ S). The plantation was considered young as the age of drainage was ≤10 years8. The sampling location, local weather conditions, and peat physicochemical parameters have been previously described8. The mineral fertilizer intervention involved a single application of NPK (16:16:16; P as P₂O₅, and K as K₂O; 1.6–1.8 kg palm⁻¹) and urea (0.5–1 kg palm⁻¹) following local practices20,21. Peat samples were collected from four oil palm trees at two time-points before fertilizer application (day 1, which was 2015-01-14, and day 4) and four time-points after it (days 6, 7, 10, and 14). All peat samples were collected from a depth of 0–20 cm using an auger and flash frozen in liquid nitrogen on site.

Metagenome sequencing and assembly

Genomic DNA was extracted from all samples using the Zymo Research Soil MidiPrep kit (Zymo Research, CA, USA). Shotgun DNA libraries were prepared from a total of 36 samples using the TruSeq DNA library preparation kit (Illumina, San Diego, CA, USA) and sequenced with 2 × 250 bp chemistry on the Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) at SCELSE (https://www.scelse.sg), Nanyang Technological University, Singapore. We generated a total of 133.7 Gbp of raw sequence data, with each sample containing, on average, 3.8 Gbp.

Raw sequence reads were processed using Cutadapt v3.422 with parameters: --error-rate 0.2, --minimum-length 75, --no-indels to remove Illumina sequence adapters. Low-quality regions from adapter-free reads were trimmed using bbmap v38.96 (https://sourceforge.net/projects/bbmap/) with parameters: trimq = 20, qtrim = rl, minlen = 75. Overall, 121.6 Gbp of sequence data were retained after quality filtering.

Samples were then assembled de-novo both individually and co-assembled in groups using MEGAHIT v1.2.723 with parameters: --k-min 27, --k-max 197, --k-step 10 on 48-core compute nodes with 2 TB RAM. Co-assemblies of proximal and distant peat samples were performed separately by first pooling samples from each time-point and then from all the time-points. Assemblies were length-filtered to retain only contigs ≥1 kbp and renamed using the rename.sh script (bbmap v38.96) with parameters: minscaf = 1000. This resulted in a total of 10.35 million contigs (equivalent to 21.05 Gbp) with a median N50 of 1.96 kbp (range: 1.55–3.27 kbp). Read containment and across-sample contig coverage were estimated by cross-mapping each assembly against quality-filtered reads from all the samples using Bowtie2 v2.4.524 with parameters: --no-unal, -X 1000, SAMtools v1.625, and the jgi_summarize_bam_contig_depths script (METABAT2 v1.2.926). Summaries of individual samples, assemblies, and assembled contigs ≥1 kbp are available on figshare27 in the files “jopf_sample_data.csv”, “jopf_assembly_summary.csv”, “jopf_single_sample_assemblies.tar.gz” and “jopf_co_assemblies.tar.gz” respectively.
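For readers unfamiliar with the assembly statistics quoted above, the following minimal Python sketch shows how a ≥1 kbp length filter and an N50 value can be computed from a list of contig lengths. It is an illustration only and not the code used in this study; the function names and the example lengths are hypothetical.

```python
from typing import Iterable, List


def filter_by_length(lengths: Iterable[int], min_len: int = 1000) -> List[int]:
    """Keep only contig lengths of at least min_len bp."""
    return [n for n in lengths if n >= min_len]


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("cannot compute N50 of an empty assembly")
    half_total = sum(ordered) / 2
    running = 0
    for n in ordered:
        running += n
        if running >= half_total:
            return n
    return ordered[-1]  # not reached for non-empty input


# Hypothetical contig lengths in bp (not taken from the dataset).
contig_lengths = [500, 800, 1200, 2000, 3500, 5000]
kept = filter_by_length(contig_lengths)
print(len(kept), n50(kept))
```

In this toy example the two contigs shorter than 1 kbp are discarded before the N50 is computed, mirroring the order of operations described in the text.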

Genome binning

Genome bins were recovered from each assembly using METABAT2 v1.2.926, CONCOCT v1.1.028, and MaxBin2 v2.2.729 with parameters: -min_contig_length 2500, all of which use a combination of differential coverage and tetranucleotide frequency information. Bins obtained using the three binning algorithms were then pooled and processed using DASTool v1.1.230 with parameters: --score_threshold 0, --write_bin_evals 1, --search_engine diamond, and --write_bins 1 to achieve a unified set of non-redundant bins. This resulted in a total of 1,535 genome bins with a median size of 3.49 Mbp (range: 0.46–17.88 Mbp) and a median N50 of 6.76 kbp (range: 3.04–134.04 kbp).

Genome refinement and dereplication

Genome bins were first refined using refineM v0.1.231 with parameters: --cov_corr 0.8. Contigs were removed (a) if either their GC content or tetranucleotide frequency exceeded reference-based thresholds (98th percentile) or (b) if across-sample coverage correlation was <80%. Bins were further refined using reference-based approaches implemented in MAGpurify v2.1.232. Contigs were removed if they contained taxonomically discordant marker genes or known contaminants, lacked a concordant marker gene, or aligned poorly to conspecific genomes (when available) from the IGGdb database32. Refined bins with completeness ≥50% and redundancy ≤10% were designated as MAGs. Species and sub-species-level collections were generated by dereplicating the MAGs using dRep v3.4.033 with parameters: -sa 0.95/0.99 --S_algorithm fastANI.

Genome quality assessment and taxonomic classification

MAG statistics were estimated using CheckM v1.1.334 with parameters: lineage_wf, -t 24 --pplacer_threads 1, --tab_table and are summarised in Fig. 3. Transfer RNA (tRNA) gene sequences were predicted with tRNAscan-SE v2.0.935 using kingdom-specific HMM models. Taxonomic annotation was performed using GTDB-Tk v2.1.136,37,38,39,40,41,42,43,44,45 and the Genome Taxonomy Database r20746. MAGs were classified as near-complete31,47 if they had completeness ≥90%, redundancy ≤5%, and ≥18 unique tRNAs, or as medium-quality otherwise. MAG statistics and taxonomic labels for the species and sub-species-level collections are available on figshare27 in the file “jopf_mag_quality_summary.csv”.
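The quality tiers described above can be expressed as a simple rule. The Python sketch below is an illustration under stated assumptions: the function name and the label for bins that fail the MAG threshold are hypothetical, and the inputs are assumed to be CheckM-style completeness and redundancy percentages plus a count of unique tRNAs.

```python
def classify_mag(completeness: float, redundancy: float, unique_trnas: int) -> str:
    """Assign a quality tier using the thresholds stated in the text.

    completeness and redundancy are percentages; unique_trnas is the
    number of distinct tRNA types detected in the genome.
    """
    # Near-complete: completeness >= 90%, redundancy <= 5%, >= 18 unique tRNAs.
    if completeness >= 90 and redundancy <= 5 and unique_trnas >= 18:
        return "near-complete"
    # All remaining MAGs (completeness >= 50%, redundancy <= 10%) are medium-quality.
    if completeness >= 50 and redundancy <= 10:
        return "medium-quality"
    # Hypothetical label: such bins were not designated as MAGs in the study.
    return "below-threshold"


# Hypothetical example values (not taken from the dataset).
print(classify_mag(95.2, 1.8, 20))
print(classify_mag(95.2, 1.8, 15))
print(classify_mag(42.0, 3.0, 19))
```

Note that a genome with high completeness but fewer than 18 unique tRNAs falls into the medium-quality tier under this rule.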

Fig. 3

Genome statistics for the 764 sub-species-level MAGs. Histograms (from left to right, starting from the top) show genome completeness, redundancy, genome size, number of contigs, contig N50, mean contig length, length of the longest contig, and the number of tRNAs corresponding to the 20 standard amino acids identified in each MAG.

Estimating phylogenetic diversity and gain

Phylogenetic diversity and gain were estimated by constructing kingdom-specific maximum likelihood trees integrating species-level MAGs from this study and reference genomes from GTDB r20746. Phylogenetic trees were constructed using GTDB-Tk v2.1.136,37,38,39,40,41,42,43,44,45 with parameters: de_novo_wf, --bacteria/--archaea, and --outgroup_taxon p__Patescibacteria/p__Altiarchaeota. Relative taxon phylogenetic diversity and phylogenetic gain were computed using GenomeTreeTk v0.1.6 (https://github.com/dparks1134/GenomeTreeTk) with parameters: pd_clade.
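As a conceptual aid, the sketch below computes phylogenetic diversity (PD) and phylogenetic gain on a toy rooted tree. It assumes the commonly used definitions: PD is the total branch length spanned by a set of leaves, and gain is the share of total PD contributed only by the newly added genomes. The exact calculation in GenomeTreeTk may differ in detail; the tree encoding, function names, and branch lengths here are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Toy tree encoding (an assumption for illustration):
# node name -> (parent name or None for the root, length of branch to parent).
Tree = Dict[str, Tuple[Optional[str], float]]


def phylogenetic_diversity(tree: Tree, leaves: Iterable[str]) -> float:
    """Sum the branch lengths on the union of all leaf-to-root paths."""
    visited: Set[str] = set()
    total = 0.0
    for leaf in leaves:
        node: Optional[str] = leaf
        # Walk towards the root, counting each branch only once.
        while node is not None and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def phylogenetic_gain(tree: Tree, reference: Iterable[str], new: Iterable[str]) -> float:
    """Percentage of total PD that exists only because of the new leaves."""
    reference = list(reference)
    pd_reference = phylogenetic_diversity(tree, reference)
    pd_all = phylogenetic_diversity(tree, reference + list(new))
    return 100.0 * (pd_all - pd_reference) / pd_all


# Hypothetical tree with one reference genome and two new MAGs.
toy_tree: Tree = {
    "root": (None, 0.0),
    "A": ("root", 1.0),
    "B": ("root", 1.0),
    "ref1": ("A", 1.0),
    "mag1": ("A", 2.0),
    "mag2": ("B", 3.0),
}
gain = phylogenetic_gain(toy_tree, ["ref1"], ["mag1", "mag2"])
print(f"{gain:.1f}%")
```

Here the reference genome alone spans 2 units of branch length and the full tree spans 8, so the two new MAGs account for 75% of the total.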

De-novo trees comprising only MAGs from this study were constructed using GTDB-Tk v2.1.136,37,38,39,40,41,42,43,44,45 with parameters: de_novo_wf, --bacteria/--archaea, --skip_gtdb_refs, and --outgroup_taxon p__Acidobacteriota/p__Thermoplasmatota. Trees were visualised and annotated using iTOL v648. Unrooted Archaeal/Bacterial trees with and without GTDB reference genomes are available on figshare27 in the files “jopf_bacteria_with_gtdb_r207_refs_unrooted.tree”, “jopf_archaea_with_gtdb_r207_refs_unrooted.tree”, “jopf_bacteria_unrooted.tree”, and “jopf_archaea_unrooted.tree”.

Functional annotation

Carbon-processing potential of the MAGs was estimated using METABOLIC v4.049 which integrates functional annotations from KEGG16, dbCAN217, PFAM18, TIGRFAM19, and custom HMMs for specific metabolic functions. Metabolic pathways were considered present if the MAG contained at least one associated marker gene or absent otherwise. Presence/absence of carbon-processing pathways in MAGs is available on figshare27 in the file “jopf_carbon_processing_pathways.csv”.
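The presence/absence rule stated above (a pathway is present if a MAG carries at least one associated marker gene) can be sketched as follows. This is not METABOLIC's own code; the function names, pathway labels, and marker identifiers are hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Set


def pathway_presence(mag_genes: Set[str],
                     pathway_markers: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Mark a pathway present if the MAG has at least one of its marker genes."""
    return {name: bool(mag_genes & markers)
            for name, markers in pathway_markers.items()}


def prevalence(presence_by_mag: Dict[str, Dict[str, bool]], pathway: str) -> float:
    """Percentage of MAGs in which the given pathway is present."""
    hits = sum(1 for calls in presence_by_mag.values() if calls[pathway])
    return 100.0 * hits / len(presence_by_mag)


# Hypothetical marker sets and gene annotations (placeholder identifiers).
markers = {
    "methanol oxidation": {"marker_1", "marker_2"},
    "carbon fixation": {"marker_3"},
}
annotations = {
    "MAG_1": {"marker_1", "other_gene"},
    "MAG_2": {"marker_2", "marker_3"},
    "MAG_3": {"other_gene"},
    "MAG_4": {"marker_2"},
}
presence = {mag: pathway_presence(genes, markers)
            for mag, genes in annotations.items()}
print(prevalence(presence, "methanol oxidation"), prevalence(presence, "carbon fixation"))
```

Because a single marker gene is sufficient for a positive call, this rule indicates potential rather than a complete pathway, which is consistent with the caution given in the Usage Notes.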

Data Records

Raw metagenomes and metagenome-assembled genomes are available on NCBI BioProject PRJNA88352850. Datasets and data products generated from the raw data are available on figshare27.

Technical Validation

MAGs reported in this study consist only of those that met or exceeded the medium-quality threshold as defined in Bowers et al.51.

Usage Notes

Users/researchers should independently assess the accuracy of genes, contigs, and functional assignments for genomes of interest prior to downstream analysis.