Background & Summary

Acid mine drainage (AMD) is a type of acidic (pH < 4) and metal-enriched water that results from the accelerated oxidative dissolution of exposed minerals, principally sulfides, and is associated with mining1,2. The strong acidity and heavy metal toxicity of AMD have caused severe pollution of surrounding water systems and soils2,3,4, making AMD one of the most serious environmental problems arising during the mining of mineral resources5,6. Metabolically active acidophilic microorganisms have been observed in AMD7,8, including microbes primarily from the Bacteria (such as Proteobacteria, Nitrospirae, Actinobacteria, Firmicutes, and Acidobacteria) and Archaea domains9.

Microbes in AMD play a key role in the bioremediation of AMD environments10,11. For example, Acidithiobacillus12, one of the most common genera in AMD, includes microbes with chemolithotrophic metabolisms that are able to oxidize Fe2+ and sulfur compounds (such as Acidithiobacillus ferrooxidans, Acidithiobacillus ferridurans, and Acidithiobacillus ferrivorans)9,13,14, or oxidize sulfur compounds alone (such as Acidithiobacillus caldus, Acidithiobacillus thiooxidans, and Acidithiobacillus albertensis)15,16,17. Sulfate-reducing bacteria (SRB), a group of diverse anaerobic microorganisms that are ubiquitous in natural habitats, have been utilized in AMD remediation11.

Secondary metabolite biosynthetic gene clusters (smBGCs) found in AMD microbes are important resources for the synthesis of antibacterial and anticancer drugs18,19. A previous study reported that microbes including Penicillium sp., Penicillium rubrum, Penicillium solitum, Penicillium clavigerum, Chaetomium funicola, and Pithomyces sp. were isolated and cultivated from water and sediment samples in a pit lake formed by the former Berkeley copper mine, and that several of them produced valuable secondary metabolites20. For example, berkelic acid, a secondary metabolite of Penicillium sp., had anti-OVCAR-3 activity in NCI-DTP 60; berkeleydione, a terpenoid secondary metabolite of Penicillium rubrum, showed selective activity against non-small cell lung cancer NCI-H460 in NCI-DTP 60; and the CHCl3 extract of Penicillium solitum strongly inhibited MMP-3 and caspase-1. In addition, cyclodipeptide synthases (CDPSs), which synthesize cyclodipeptides, the precursors of 2,5-diketopiperazines, were found in this study to be encoded by 23 metagenome-assembled genomes (MAGs) (LMSG_G000006317.1–LMSG_G000006339.1) in Diplorickettsiaceae21,22,23. Therefore, mining smBGCs from AMD may reveal valuable secondary metabolites18.

In this study, data were collected from 111 samples from nine mineral types, and the GTDB species representative assignments of the binned MAGs and the putative novel smBGCs were analyzed. The same method was used to reanalyze public metagenomic datasets consisting of 68 samples of eight mineral types from seven countries. In total, this study obtained analysis results for metagenomic datasets covering 179 samples of 13 projects across 13 mineral types from seven countries (Table 1, Supplementary Table 1, Figs. 1a,2). A total of 7,007 MAGs mined from the datasets met or exceeded the medium-quality level of the MIMAG standard24, including 981 MAGs determined to be high quality (Table 2, Supplementary Table 2). Further taxonomic analysis by GTDB-Tk showed that 1,394 MAGs were classified into 150 existing genera, while 5,613 MAGs were not assigned to existing genera; a total of 667 MAGs could be assigned to 154 GTDB species representatives, while 6,340 MAGs were not assigned (Fig. 3, Supplementary Table 2). Overall, 11,856 smBGCs in eight categories were obtained from the 7,007 MAGs (Table 3, Supplementary Table 3), and 10,899 smBGCs were identified as putative novel smBGCs, candidates for discovering novel secondary metabolites, by querying each smBGC sequence against the NCBI nucleotide sequence collection (Supplementary Table 3). The analysis of the number of smBGCs in all mineral types showed that the greatest number of smBGCs was found in polymetallic mines, and the second largest number was found in copper mines. The descending order of smBGC abundance in the remaining mineral types was as follows: lead-zinc mines, antimony mines, pyrite-copper mines, pyrite mines, coal mines, nickel-copper mines, magnetite mines, tin-zinc mines, iron mines, arsenic mines, and lignite mines (Fig. 4a).

Table 1 Data information for each mineral type.
Fig. 1

Geographic distribution of sampling sites in this study. (a) Geographic distribution of sampling sites for all samples (the latitude and longitude of SRS1810936 were retrieved according to the geographic location of this sample). (b) Geographic distribution of sampling sites for the acid mine drainage (AMD) metagenomic datasets for China.

Fig. 2

Base number distributions of samples from 13 types of minerals. The median base number of samples was similar among lead-zinc mines, antimony mines, pyrite-copper mines, magnetite mines, tin-zinc mines, and polymetallic mines. The upper and lower whiskers extend from the hinges to the highest and lowest values within 1.5 × the interquartile range, respectively. The outlier points (black) are those outside that range.

Table 2 Quality control standards and metagenome-assembled genome (MAG) numbers in each quality level.
Fig. 3

Maximum-likelihood phylogenetic trees of bacterial and archaeal MAGs at the phylum level. Major lineages are assigned arbitrary colours and named. Lineages with GTDB representative species assignment are highlighted with red dots, while lineages with existing genus assignment (genus with NCBI taxonomy ID) are marked with purple triangles. (a) Maximum-likelihood phylogenetic tree of bacterial MAGs inferred from a concatenated alignment of 120 bacterial single-copy marker genes. The tree includes 40 named bacterial phyla. (b) Maximum-likelihood phylogenetic tree of archaeal MAGs inferred from a concatenated alignment of 122 archaeal single-copy marker genes. The tree includes 8 named archaeal phyla.

Table 3 Numbers and percentages of smBGCs in eight categories classified by BiG-SCAPE.
Fig. 4

Secondary metabolite biosynthetic gene cluster (smBGC) distributions in 13 types of minerals. (a) The number of smBGCs in different types of minerals. (b) Relative frequency of smBGC types across 13 types of minerals.

Methods

The workflow of data processing is depicted in Supplementary Fig. 1.

Data source

AMD metagenomic datasets of 179 samples from 13 mineral types obtained from seven countries were used to analyze GTDB species representative assignment for the binned MAGs and putative novel smBGCs (Table 1, Supplementary Table 1, Figs. 1a,2), including 68 public and 111 private samples. The datasets of 68 publicly available samples were downloaded from the SRA database (up to November 17, 2020) using the following search strategies: (((((Mine AMD) OR acid mine drainage) OR mine tailings) OR acidic stream) AND WGS [Strategy]) AND METAGENOMIC [Source] and (mine drainage metagenome [Organism]) AND WGS [Strategy] AND METAGENOMIC [Source], and the Illumina sequence data were kept. A total of 111 private samples across nine mineral types were collected and sequenced in this study. Among them, 87 samples across four mineral types newly collected in this study came from the same mineral types as the datasets downloaded from the SRA database, and 24 samples were obtained from five new mineral types. A total of 122 samples from 10 mineral types constituted the AMD metagenomic datasets for China (Table 1, Fig. 1b).

Quality control of raw data and metagenomic assembly

Trimmomatic is a flexible and efficient preprocessing tool for Illumina next-generation sequencing reads, used primarily to filter adapter and low-quality sequences25. Quality control of the raw data for the 179 samples in this study was performed using Trimmomatic (version 0.39) with a Phred quality score cutoff of 20 and a minimum read length of 50 to remove low-quality sequences. MEGAHIT and metaSPAdes are both widely used tools for metagenome assembly28,29,30. MetaSPAdes performs better in assembly than the other assembly tools, but it is time-consuming and requires very high memory26,27. Although metaSPAdes can provide high-quality assemblies across diverse data sets, MEGAHIT can provide acceptable assemblies with low memory usage and computational time31. Therefore, considering the large volume of AMD samples analyzed and the computational resources available, we chose MEGAHIT28,29 as the software for metagenome assembly. Metagenome assembly was performed with MEGAHIT (version 1.2.9) in meta-sensitive mode to generate assembled contigs.
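The quality-control and assembly steps described above can be sketched as command lines assembled in Python. This is a minimal illustration under stated assumptions, not the exact invocation used in this study: the `trimmomatic` wrapper name, the output file naming, and the use of a 4-base `SLIDINGWINDOW` step to apply the Phred cutoff are assumptions, since the text specifies only the cutoff of 20, the minimum length of 50, and the meta-sensitive mode.

```python
from pathlib import Path


def trimmomatic_cmd(r1, r2, out_dir, phred_cutoff=20, min_len=50):
    """Build a paired-end Trimmomatic command as an argument list."""
    out = Path(out_dir)
    # Hypothetical naming scheme: sample ID is the text before the first "_".
    stem = Path(r1).name.split("_")[0]
    return [
        "trimmomatic", "PE", str(r1), str(r2),
        str(out / f"{stem}_1P.fq.gz"), str(out / f"{stem}_1U.fq.gz"),
        str(out / f"{stem}_2P.fq.gz"), str(out / f"{stem}_2U.fq.gz"),
        # Assumption: the Phred cutoff is applied over a 4-base sliding window.
        f"SLIDINGWINDOW:4:{phred_cutoff}",
        # Drop reads shorter than the minimum length after trimming.
        f"MINLEN:{min_len}",
    ]


def megahit_cmd(r1_paired, r2_paired, out_dir):
    """Build a MEGAHIT command that uses the meta-sensitive preset."""
    return [
        "megahit", "-1", str(r1_paired), "-2", str(r2_paired),
        "--presets", "meta-sensitive", "-o", str(out_dir),
    ]


# Example with hypothetical file names; pass the lists to subprocess.run to execute.
qc = trimmomatic_cmd("S1_R1.fq.gz", "S1_R2.fq.gz", "qc")
asm = megahit_cmd(qc[4], qc[6], "assembly_S1")
```

Building argument lists rather than shell strings avoids quoting problems when sample names contain unusual characters.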

Metagenomic binning

Compared to individual binning software, automated pipelines that integrate multiple binning methods, such as MAGO, MetaWRAP or DAS Tool, combine the strengths of a flexible set of established binning algorithms to generate more or better bins32,33,34. MetaWRAP is a widely used tool for the metagenome binning of both environmental35,36,37,38,39,40,41 and host-associated42,43,44 samples, and it obtained the largest number of high-quality draft genomes in tested datasets with relatively low computational requirements33,45. Additionally, MAGO uses DAS Tool for bin refinement, and MetaWRAP outperformed DAS Tool for datasets of varied complexity33. Therefore, we selected MetaWRAP for metagenomic binning in this study. For each assembly, contigs were binned using the binning module (parameters: --maxbin2 --concoct --metabat2) and consolidated into a de-replicated bin set using the bin_refinement module (parameters: -c 50 -x 5), and the quality of the bins was further improved using the reassemble_bins module within MetaWRAP (version 1.3.2). A total of 8,035 binned MAGs were obtained from the 179 samples by MetaWRAP, taking 1,224 hours of wall time on an HPC with multiple 2.10 GHz Intel Xeon E7-4380 CPUs and 2 TB of RAM.

The completeness and contamination of all MAGs were estimated using CheckM (version 1.1.2) with a lineage-specific workflow46,47. Based on these results, we selected 7,007 MAGs that were estimated to be at least 50% complete, with less than 5% contamination, and that had a quality score of >50 (ref. 36). As additional indicators of completeness, we identified tRNA genes using tRNAscan-SE (version 2.0.9)48 and rRNA genes using Infernal (version 1.1.2)49 with models from the Rfam database50. Based on these results, 981 of the 7,007 MAGs were classified as high quality based on the MIMAG standard (≥90% completeness, ≤5% contamination, ≥18/20 tRNA genes and the presence of 5S, 16S and 23S rRNA genes), with the remainder classified as medium quality (Table 2, Supplementary Table 2).
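The selection and quality-tier criteria above can be expressed as a short classifier. The sketch below uses the thresholds stated in the text; the quality score formula (completeness minus five times contamination) is an assumption, since the text gives only the cutoff, and the `MagStats` container and its field names are hypothetical.

```python
from dataclasses import dataclass

REQUIRED_RRNAS = frozenset({"5S", "16S", "23S"})


@dataclass(frozen=True)
class MagStats:
    completeness: float    # percent, e.g. from CheckM
    contamination: float   # percent, e.g. from CheckM
    trna_types: int        # number of the 20 standard tRNA types detected
    rrnas: frozenset       # rRNA genes detected, e.g. {"5S", "16S"}


def quality_score(mag):
    # Assumption: the commonly used definition, completeness - 5 x contamination.
    return mag.completeness - 5.0 * mag.contamination


def classify(mag):
    """Return 'high', 'medium', or 'rejected' for one MAG."""
    passes_filter = (
        mag.completeness >= 50
        and mag.contamination < 5
        and quality_score(mag) > 50
    )
    if not passes_filter:
        return "rejected"
    is_high = (
        mag.completeness >= 90
        and mag.contamination <= 5  # already implied by the filter above
        and mag.trna_types >= 18
        and REQUIRED_RRNAS <= mag.rrnas  # all three rRNA genes present
    )
    return "high" if is_high else "medium"


# Example with made-up values: complete genome but missing 5S and 23S rRNA.
example = MagStats(95.0, 1.0, 19, frozenset({"16S"}))
tier = classify(example)  # "medium"
```

Many near-complete MAGs fall into the medium tier only because rRNA genes assemble and bin poorly from short reads, which is why the rRNA and tRNA checks are kept separate from the completeness filter.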

Taxonomic assignment for bacterial and archaeal genomes

GTDB-Tk is a computationally efficient tool providing objective taxonomic assignment for bacterial and archaeal genomes based on the Genome Taxonomy Database (GTDB, http://gtdb.ecogenomic.org), and it is widely used for the classification of draft genomes obtained directly from environmental and human-associated samples51. The taxonomy of each MAG was initially assigned using GTDB-Tk (version 1.4.0) based on the GTDB taxonomy R05-RS95 (ref. 52), and 48 phyla (eight archaeal phyla and 40 bacterial phyla) were obtained. GTDB-Tk analysis of the 7,007 MAGs required 23 hours of wall time on an HPC with multiple 2.10 GHz Intel Xeon E7-4380 CPUs and 2 TB of RAM.

Based on the results of the GTDB-Tk analysis, a total of 1,707 MAGs were assigned to archaeal phyla, while 5,300 MAGs were assigned to bacterial phyla; 6,026 medium-quality MAGs were assigned to seven archaeal phyla and 38 bacterial phyla, while 981 high-quality MAGs were classified into four archaeal phyla and 31 bacterial phyla (Supplementary Table 2). In the genus-level analysis, a total of 1,394 MAGs were classified into 150 existing genera, while 5,613 MAGs were not assigned. A total of 667 MAGs were assigned to GTDB representative genomes of 154 species, while 6,340 MAGs were not assigned to any GTDB species representative; these data provide a large number of microbial resources for further research in the field of AMD bioremediation. A. ferrooxidans, A. ferrivorans, and A. thiooxidans have been demonstrated to be functional in AMD recovery9,14,16. In this study, A. ferrooxidans was found in copper mines, and A. ferrivorans and A. thiooxidans were found in polymetallic mines (Supplementary Table 2).

Constructing a phylogeny of nonredundant MAGs

dRep reduces the computational time for pairwise genome comparisons by sequentially applying a fast, inaccurate estimation of genome distance and a slow, accurate measure of average nucleotide identity, thereby achieving a 28-fold increase in speed with perfect recall and precision compared to previously developed algorithms53. All of the 7,007 qualified MAGs were aggregated and de-replicated at 95% average nucleotide identity (ANI) using dRep (version 3.2.0, parameters: -comp 50 -con 5 -sa 0.95 -pa 0.9), resulting in a total of 1,992 species-level qualified MAGs54. A maximum-likelihood phylogeny of these 1,992 de-replicated MAGs was then inferred from a concatenation of 120 bacterial or 122 archaeal marker genes produced by GTDB-Tk51. Bacterial and archaeal approximate maximum-likelihood trees were built using FastTree (version 2.1.10) with WAG + GAMMA models47,55,56,57, and visualized with iTOL58.

A striking feature of these trees is the large number of major lineages without assignment of a GTDB species representative (Fig. 3)51. There were 24 phyla in Bacteria without assignment of a GTDB species representative, and very few MAGs were assigned to GTDB species representatives of Bacteria in the 16 phyla of Proteobacteria, Actinobacteriota, Nitrospirota, Firmicutes_E, Firmicutes, SZUA-79, Bacteroidota, Campylobacterota, Desulfobacterota, Spirochaetota, Firmicutes_B, Patescibacteria, Acidobacteriota, Aquificota, Bdellovibrionota, and Deinococcota (Fig. 3a). No MAGs were assigned to GTDB species representatives of Archaea in the phyla of Halobacteriota, Methanobacteriota, Thermoproteota, Asgardarchaeota, and Aenigmatarchaeota, and very few MAGs were assigned to GTDB species representatives of Archaea in the phyla of Nanoarchaeota, Micrarchaeota, and Thermoplasmatota (Fig. 3b).

Mining of secondary metabolite biosynthetic gene clusters

Antibiotics & Secondary Metabolite Analysis Shell (antiSMASH, https://antismash.secondarymetabolites.org) is a tool that enables rapid identification, annotation, and analysis of smBGCs in genomes59. Since its first release in 2011, it has been the most widely used bioinformatics software for predicting smBGCs and the standard tool for smBGC mining60. A total of 11,856 putative smBGCs were mined from the 7,007 qualified MAGs across 13 mineral types using antiSMASH (version 5.1.2) called with the following options: --cf-create-clusters --cb-general --cb-knownclusters --cb-subclusters --asf --pfam2go --smcog-trees --genefinding-tool prodigal; in addition, contigs shorter than 5 kb were ignored. antiSMASH analysis of the 7,007 MAGs required 24 hours of wall time on an HPC with multiple 2.60 GHz Intel (R) Xeon (R) Gold 6126 CPUs and 196 GB of RAM.

Each putative smBGC was queried against the NCBI nucleotide sequence collection (downloaded 27 Jan 2021) using the command ‘blastn’ within the NCBI BLAST+ package (version 2.11)61 with an E-value cutoff of 1 × 10−1. Using a threshold of 75% identity over 80% of the query length, 10,899 (91.93%) of the 11,856 putative smBGCs had no matching hit and were therefore identified as putative novel smBGCs. Although many modular clusters were fragmented, we identified over 154 smBGC regions >50 kb in length and more than 1,834 >30 kb. These smBGCs were further classified into eight categories using BiG-SCAPE with default parameters62. Among these eight smBGC categories, terpene was the largest, with 3,751 smBGCs (31.64%) (Table 3, Supplementary Table 3).
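The novelty criterion above can be sketched as a filter over tabular blastn output. This is a simplified illustration, not the script used in this study: it assumes the hits were written in a custom tabular layout with the columns listed in `COLUMNS`, and it evaluates query coverage per individual alignment rather than merging multiple alignments to the same subject.

```python
import csv
import io

# Assumed column layout of the tabular BLAST output (hypothetical choice).
COLUMNS = ("qseqid", "sseqid", "pident", "length", "qlen", "qstart", "qend", "evalue")


def known_queries(blast_tsv, min_identity=75.0, min_query_cov=0.80):
    """Return IDs of queries with at least one hit meeting both thresholds."""
    known = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        aligned = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        coverage = aligned / int(hit["qlen"])
        if float(hit["pident"]) >= min_identity and coverage >= min_query_cov:
            known.add(hit["qseqid"])
    return known


def putative_novel(all_ids, blast_tsv):
    """Return the smBGC IDs that have no qualifying database hit."""
    known = known_queries(blast_tsv)
    return [bgc_id for bgc_id in all_ids if bgc_id not in known]


# Toy input with made-up hits; bgc4 has no hit at all.
toy_hits = io.StringIO(
    "bgc1\tsubjA\t92.0\t9000\t10000\t1\t9000\t0.0\n"     # 92% identity, 90% coverage
    "bgc2\tsubjB\t70.0\t9500\t10000\t1\t9500\t1e-50\n"   # identity below 75%
    "bgc3\tsubjC\t99.0\t3000\t10000\t1\t3000\t0.0\n"     # coverage below 80%
)
novel = putative_novel(["bgc1", "bgc2", "bgc3", "bgc4"], toy_hits)
```

Queries absent from the BLAST output are counted as putatively novel, so the full list of smBGC identifiers must be supplied separately from the hit table.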

Data Records

The raw data from the 111 private samples were deposited in NODE (https://www.biosino.org/node/project/detail/OEP001841)63, GSA (CRA006735)64, and NCBI SRA (PRJNA666025)65. A total of 7,007 MAGs with completeness ≥50%, contamination ≤5%, and a quality score of >50 (the medium-quality level of the MIMAG standard) were obtained from 13 mineral types by metagenomic assembly and binning47. A total of 981 (14.00%) MAGs were assigned as high quality according to the MIMAG standard24. All 7,007 MAGs from the current study have been deposited in eLMSG (an eLibrary of Microbial Systematics and Genomics, https://www.biosino.org/elmsg/index) under accession numbers LMSG_G000004334.1–LMSG_G000011340.1 (ref. 66), NODE (https://www.biosino.org/node/analysis/detail/OEZ008530)67, and GenBank (PRJNA834572)68.

All 11,856 putative smBGCs from the 7,007 MAGs of 13 mineral types were deposited in NODE (https://www.biosino.org/node/analysis/detail/OEZ008529)69 and GenBank (KFVK00000000)70. The classes of secondary metabolites synthesized by each smBGC across 13 mineral types were assigned (Fig. 4b). Non-ribosomal peptide synthetases (NRPS), ribosomally synthesized and post-translationally modified peptides (RiPPs), and terpenes were found in all mineral types. The 13 mineral types in this study had relatively low numbers of smBGCs in the remaining smBGC categories, including type I polyketide synthase (PKS I), PKSother, and PKS-NRP_hybrids. Saccharides were found only in pyrite-copper mines.

Technical Validation

To ensure that the datasets from the SRA database contained only AMD metagenomic data, the metadata of these datasets in the SRA database and the associated scientific literature were manually curated. To select metagenomic datasets, only datasets for which the library strategy was WGS and the library source was METAGENOMIC were chosen. Because the pH of AMD is usually 2–4 (ref. 1), datasets whose pH values were greater than 4, such as SRS1650501-SRS1650503, SRS872561, SRS962537, SRS963313, SRS963552, SRS963574, SRS963594, SRS963611, and SRS963627, were removed to further filter the AMD metagenomic datasets. For datasets that did not provide pH values, metadata in the SRA database and in the scientific literature were reviewed to retain only AMD metagenomic datasets71,72,73,74,75.

The latitude and longitude of SRS1810936 were retrieved according to the geographic location of this sample. The mineral types of SRS5255199, SRS5255198, SRS5255197, and SRS2947527 were obtained through manual review of the metadata in the SRA database and the scientific literature76.

The number and types of smBGCs varied even within the same dRep cluster (Supplementary Table 4). Therefore, we used the 7,007 MAGs before de-replication for smBGC prediction. A total of 6,026 of the 7,007 MAGs were of medium quality according to the MIMAG standard24. Mining smBGCs from draft genomes with antiSMASH can cause the number of detected gene clusters to be artificially high, and some contigs with gene cluster fragments might be left undetected77. To obtain more reliable smBGCs, we ignored contigs shorter than 5 kb, increasing the chance that the smBGCs we mined have roles in secondary metabolite synthesis78. Although many modular clusters were fragmented, we identified over 154 smBGC regions >50 kb in length and more than 1,834 >30 kb.

We used linear regression in GraphPad Prism (version 9.3.1) to examine how sample size was associated with the diversity of secondary metabolite biosynthetic gene clusters. The total number of smBGCs in each sample showed a moderate positive correlation (R2 = 0.3620) with the total length of quality MAGs in each sample (Fig. 5a), indicating that the number of smBGCs may also be affected by other factors.
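For readers who wish to reproduce this kind of check outside GraphPad Prism, the coefficient of determination of a simple linear regression can be computed as follows. This is a generic ordinary least squares sketch with made-up numbers, not the data or software used in this study.

```python
def linear_fit(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical values: x = total MAG length per sample, y = smBGC count per sample.
slope, intercept, r_squared = linear_fit([0, 1, 2, 3], [1, 3, 2, 4])
```

An R2 well below 1, as reported above, means that total MAG length explains only part of the variation in smBGC counts between samples.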

Fig. 5

The diversity of secondary metabolite biosynthetic gene clusters (smBGCs) in different mineral types and geographic locations. (a) Correlation between the total number of smBGCs in each sample and the total length of quality MAGs in each sample. (b) smBGC counts per Gigabase (the total number of smBGCs in each sample divided by the total length of quality MAGs in each sample) plotted according to mineral type. (c) smBGC counts per Gigabase (the total number of smBGCs in each sample divided by the total length of quality MAGs in each sample) plotted according to geographic location. Data were analyzed using one-way ANOVA followed by Tukey’s test (*P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001).

The box plots of smBGC counts per Gigabase in different geographic locations or mineral types were generated using GraphPad Prism (version 9.3.1). One-way ANOVA followed by Tukey’s test, also performed in GraphPad Prism (version 9.3.1), was used to analyze the differences among groups (P < 0.05). Notably, by geographic location, smBGCs were most abundant in Canada: Ontario and USA: Pennsylvania, Scalp Level, while by mineral type, coal mines and nickel-copper mines had relatively greater abundances of smBGCs (Fig. 5b,c).
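The normalization behind these box plots (the total number of smBGCs in a sample divided by the total length of its quality MAGs, expressed per Gigabase) can be sketched as below. The sample tuples and group labels are hypothetical illustrations, and the statistical testing performed in GraphPad Prism is not reproduced here.

```python
from collections import defaultdict


def bgc_per_gigabase(n_bgcs, total_mag_bp):
    """smBGC count normalized by total quality-MAG length, per 1e9 bases."""
    if total_mag_bp <= 0:
        raise ValueError("total MAG length must be positive")
    return n_bgcs * 1e9 / total_mag_bp


def density_by_group(samples):
    """Group per-sample densities by a label such as mineral type or location.

    `samples` is an iterable of (group, n_bgcs, total_mag_bp) tuples.
    """
    groups = defaultdict(list)
    for group, n_bgcs, total_bp in samples:
        groups[group].append(bgc_per_gigabase(n_bgcs, total_bp))
    return dict(groups)


# Made-up example: three samples from two mineral types.
densities = density_by_group([
    ("coal", 50, 250_000_000),
    ("coal", 10, 100_000_000),
    ("copper", 30, 300_000_000),
])
```

Normalizing by recovered MAG length rather than by raw sequencing depth keeps samples with few binned genomes from appearing poor in smBGCs simply because less sequence was assembled.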

Usage Notes

To our knowledge, the datasets analyzed in this study are the largest collection of AMD metagenomic datasets considered to date. Among the 68 samples from the SRA database, only 11 (16%) were from China. Through the collection and sequencing of 111 AMD samples in this study, metagenomic data of AMD in southeastern China were obtained. These data complement the publicly available datasets and provide a better overview of the putative novel microorganisms and secondary metabolite resources in the AMD environment. These datasets can be further employed in research on AMD bioremediation, the mining of novel secondary metabolites for drug synthesis, and the analysis of gene functions, metabolic pathways, and carbon, nitrogen, phosphorus, and sulfur (CNPS) cycles in AMD.