Assessing the genetic diversity of Cu resistance in mine tailings through high-throughput recovery of full-length copA genes

Characterizing the genetic diversity of microbial copper (Cu) resistance at the community level remains challenging, mainly due to the polymorphism of the core functional gene copA. In this study, a local BLASTN method using a copA database built in this study was developed to recover full-length putative copA sequences from an assembled tailings metagenome; these sequences were then screened for potentially functioning CopA using conserved metal-binding motifs, inferred by evolutionary trace analysis of CopA sequences from known Cu resistant microorganisms. In total, 99 putative copA sequences were recovered from the tailings metagenome, out of which 70 were found with high potential to be functioning in Cu resistance. Phylogenetic analysis of selected copA sequences detected in the tailings metagenome showed that topology of the copA phylogeny is largely congruent with that of the 16S-based phylogeny of the tailings microbial community obtained in our previous study, indicating that the development of copA diversity in the tailings might be mainly through vertical descent with few lateral gene transfer events. The method established here can be used to explore copA (and potentially other metal resistance genes) diversity in any metagenome and has the potential to exhaust the full-length gene sequences for downstream analyses.


Results and Discussion
In generating the metagenomes, the MiSeq sequencing yielded > 3.9 billion bp and > 14.8 million reads after quality control for the 7 tailings samples. The 7 individual tailings metagenomes were pooled together for high-quality assembly and subsequent copA gene recalling. In total, 8,566,357 reads (greater than 300 bp) were used for metagenomic assembly which generated 82,334 contigs with an N 50 of 1,700 bp and a longest contig of 123,516 bp. The data amount used for assembly in this study is much higher than earlier studies [27][28][29] and comparable to recent studies [30][31][32] where the characterized soil microbial communities have much higher complexity. This allows a deeper coverage of the functional gene diversity in the tailings microbial communities. Moreover, though metagenomes of microbial communities in acid mine drainage have been well studied 33,34 , this is, to our knowledge, the first report on microbial metagenomes from neutral mine tailings which more closely resemble natural soils than acidic tailings. While natural soil always harbors a microbial community with a complexity whose fine decoding is still beyond the capacity of current bioinformatics tools 35 , the tailings microbial metagenomes of lower diversity and community structure complexity characterized here may be a useful proxy for studying soil ecosystem functioning.
The Mt Isa Cu-Pb-Zn tailings are neutral and saline (EC > 2 mS/cm; in the dry season it is much higher as found in our previous study 36 ) substrates and contain an average Cu concentration more than 16-fold greater than the background soil values 37 (Table 1). In addition to the high levels of salinity and Cu, the tailings are also high in total Pb and Zn concentrations. The saline and metal stresses may have exerted a strong selective pressure on the microorganisms within the system, as indicated by the extremely low microbial biomass (Table 1) and the microbial diversity, which is dominated by either thermophiles or halophiles as found in our previous studies 38,39 . The dominance of extremophiles may explain the high GC contents (all > 60%) of the metagenome libraries 40 .
The strong selective effects of metal stresses were also reflected by the significantly high abundance of resistance genes for heavy metals in the metagenomes, based on the annotations by MG-RAST pipeline. In comparison with a local soil close to the tailings site sampled, the tailings metagenomes contain average cop, czc (coding for multiple metal resistance) and ars (coding for arsenic resistance) gene abundances of 2.8, 2.5 and 1.7 folds, respectively, of those of the soil (unpublished results; the metagenome data will be released soon in MG-RAST). The heavy metals in the soil are close to background crustal values (Fig. 1). Increased levels of toxins can lead to the enrichment of relevant resistance genes, as reported for both antibiotic and heavy metal resistance genes [41][42][43][44][45] . Enrichment of copA has also been reported in various metal-contaminated environments such as paddy soil 3 , arable soil 14   and a reference soil (n = 3) used in this study (B) and the corresponding abundances of resistance genes for these metals in the metagenomes from the tailings/soil samples (A). ars, resistance genes for As; cop, resistance genes for Cu; czc, resistance genes for multi-metals (e.g., Co, Pb, Zn).
However, to our knowledge, statistical results from shotgun metagenomic sequencing for cop genes have not been reported until recently 8 . Considering the polymorphism of cop systems, the results in this study may cover a deeper diversity of Cu resistance genes than those which are amplicon-based and rely on the specificity of primers.
A copA dataset containing 122 sequences annotated as copA obtained from full genomes available in Genbank has been built in this study. The abundance of copA selected from each phylum generally reflects the abundance of the available copA for that phylum in Genbank. Specification tests of the local BLASTN were done using copA sequences from the local database as well as full genomes containing putative copA. For the copA sequences from the database, the method was able to unambiguously identify all these genes with sensitivity and specificity at 100% and full coverage; at an elevated e-value, fragmented sequences of identity > 80% can be found. At an e-value of 10 −4 , we were also able to locate copA-like genes in the three genomes of Acidithiobacillus ferrooxidans, Rubrobacter radiotolerans and Thioalkalivibrio sulfidophilus, who are phylogenetically close to those of abundant species in the tailings, with more than 10 hits with an identity > 80% and an alignment length > 40 bp (the cutoff used for subsequent contig annotation in this study). Additionally, 6 hits were returned with an identity > 86% and alignment length > 40 bp when the BLASTN was applied to the genome of Thermodesulfobacterium geofontis. These results demonstrate a good coverage of copA diversity in the database and the tool established here is theoretically able to cover the novel copA diversity in our tailings metagenome.
In total, 99 full-length putative copA genes were recovered from the tailings metagenome. All of these genes had high similarity (> 70%) to copA sequences from the genomes in Genbank. As the local BLASTN method is similarity-based, the number of copA genes identified can vary largely upon how novel the subjected microbial gene pool is and the diversity of the copA database. The threshold of alignment length in local BLASTN results was arbitrarily set as 40 bp to limit the number of contigs subject for subsequent copA gene annotation. However, the number of copA sequences recovered can be increased if the threshold is reduced, since Gupta et al. 26 found a minimum alignment length of 17 bp for putative new genes using a similar method for ARGs. Considering that the tailings microbial communities in this study are fairly low in species diversity, the local BLASTN method we have established has the potential to exhaust the copA diversity therein and can also be a novel tool to explore copA diversity in more complex soil environments.
Annotation of new genes is mostly sequence-similarity-based 47 , including the early studies detecting copA homologs. While both multicopper oxidase-like and P-type ATPase-like copA have been referred to in the early literature 9,48 , more recent studies state these as two "types" of copA 14 . However, we propose that the so-called two types of copA are not homologs and are incorrectly associated due to the limited number of available sequences used for similarity comparison in early studies. A wide range of protein sequences was used for phylogenetic analysis in this study. These were sequences that were verified experimentally or annotated as CopA and CopA-like Cu translocating proteins obtained from reported Cu resistant microorganisms. The sequences clearly separated into two distinct groups, one containing typical multicopper oxidases CueO and the other containing the typical P-type ATPase CtpA and ZosA ( Fig. 2; Table 2). The former group includes the protein products of copA genes detected in early studies from the plasmids of P. syringae and E. coli., and the latter group includes P-type ATPase Cu translocating proteins from A. fulgidus, E. hirae and L. pneumophila whose crystal structure has been resolved. These two groups of CopA may have distinct roles in Cu resistance, as implied by the experimental assays in vitro of typical CopA. Members of multicopper oxidase group CopA have been reported to be able to oxidize substrates of laccase, like phenol 10,49 , while members of ATPase CopA have been found to bind and transport Cu(I) 19,50,51 . Meanwhile, in the E. coli genome, the model microorganism for cop system studies, both groups of copA are present and their roles are found to be different. Plasmid-borne pcoA functions synergistically with chromosomal copA, and the protein PcoA can oxidise Cu(I) carried by PcoC and substitute the role of chromosomal CueO 10,52,53 . Our analysis here indicated that many model species for copA studies contain both groups of copA, such as E. hirae and P. syringae (Fig. 2). The different roles of the two groups of copA may also be implied by their different sequence and tertiary structure. Multicopper oxidase CopA has no crystal structure resolved so far but prediction by SWISS-MODEL 54 showed that EcPcoA and PsCopA have similar crystal structure with laccase, which is distinct from the transmembrane figuration of the ATPase CopA detected in A. fulgidus and L. pneumophila 16,19,50,55 . ET analysis indicated that the two groups of CopA have different metal binding motifs (Supplementary Figure 1). The ATPase group typically uses Cys as a metal binding residue in the motifs of CXXC (with HXXH as a variant) and CXC (the typical transmembrane metal binding motif), while the multicopper group uses His for metal binding in the form of HXH which is highly conservative within the group. Taken together, we suggest that CopA homologous to PcoA should be named as PcoA which is a multicopper oxidase, to differentiate it from ATPase CopA. Consequently, the metal binding motifs detected here can be used to screen for copA candidates with a high potential of functioning in Cu resistance.
The 99 full-length putative copA sequences were translated and aligned with the known CopA used for ET analysis. The presence of metal binding motifs abovementioned was screened in the putative CopA; For ATPase like CopA, the ATP binding motifs (GDGIN) were also used for screening. In total, 70 putative CopA were thought to have a high chance of functioning in Cu resistance.
Phylogenetic analysis was performed on the CopA sequences affiliated with the dominant species detected in the tailings (Fig. 3). The topology of the tree is largely consistent with that of the 16S rRNA gene based phylogeny. This indicates that the copA gene diversity was mainly controlled by vertical descent in the history of tailings community evolution, at least among the dominant species. Similar conclusions have been drawn for merA genes responsible for mercury resistance 2 . P-type ATPase metal Scientific RepoRts | 5:13258 | DOi: 10.1038/srep13258 homeostasis genes are probably ancient genes and have been essential for microbial survival, and thus lateral gene transfer plays a minor role in their evolution 56 .
It is worth noting that the 29 putative copA sequences with no metal binding or phosphatase binding domains found may also function for Cu-resistance. For one thing, it is possible the combined presence of all the metal binding motifs found here are not essential for Cu resistance. Studies on EcCopA have found that the metal binding domain C 479 PC 481 motif is essential for Cu resistance but the other two C 14 XXC 17 and C 110 XXC 113 are not 57,58 . Meanwhile, some Cu resistance ability may have evolved with sequences containing alternative metal binding domains. For example, in the alignment of ATPase group CopA, about half of the members lack the C 84 XXC 87 motif; and CopA of Sulfolobus solfataricus, in which the Cu resistance function has been experimentally verified, uses YPC instead of CPC as a transmembrane metal binding domain (Supplementary Figure 1). Furthermore, it must be highlighted that the CopA proteins are relatively divergent in terms of both sequence composition and length. Inaccurate alignments can be made and lead to failure for detecting motifs when the sequences are too divergent 59 . Therefore, the accuracy of alignment methods may largely determine the ability to detect the metal binding motifs of the subject sequences. For example, the alignment method used in this study failed to align CueO of E. coli with the others, and manual adjustments were required to locate the three experimentally-verified metal binding domains 60 .

Figure 2. A phylogenetic tree based on the alignment of the CopA and reference protein sequences used for ET analysis in this study.
Microbial species containing CopA sequences from both the P-type ATPase and Multicopper oxidase groups are highlighted in colour. When the Cu resistance function has been verified this is mentioned for those taxa.
Scientific RepoRts | 5:13258 | DOi: 10.1038/srep13258 Conclusions A local BLASTN method was established in this study to recover full-length copA genes in an assembled tailings metagenome. The detected copA genes were further screened for potentially functioning copA by screening active sites in their encoding proteins, which were inferred by ET analysis using known CopA. The method established here can be used to recover full-length copA (and potentially other resistance genes) in any assembled metagenomes.

Materials and Methods
Sampling and DNA extraction. Tailings samples were sampled in June in 2013 from a field trial site located at Mount Isa (Mt Isa) tailings impoundment. Mt Isa (20.73 °S, 139.5 °E) is located in northwest Queensland, Australia. Mt Isa has a semi-arid climate with an annual pan evaporation of 2800 mm and an average rainfall of 400 mm, with the vast majority of the rain falling during the wet season (November to February). The tailings used for field trial were highly weathered and collected from a tailings storage facility (TD5) that contained mixed streams of Cu and Pb-Zn tailings and was decommissioned about  Table 2. Selected microbial species used for evolutionary trace analysis in this study. The MIC of Cu, and whether the protein function has been determined is included. Genomes of these species all harbour copA or copA-like copper translocating genes. a minimum inhibitory concentration as of the strains of the species studied and under the specific test conditions in the corresponding references; b functions of copA genes have been experimentally verified or the crystal structure of CopA proteins has been resolved; c referred as Cu resistant but MIC is unknown or not provided. 40 years ago, which had received streams of Cu and Pb-Zn tailings for decades. The details of tailings properties and treatment setup can be found in our previous studies 38,39 . Briefly, the tailings mainly consist of quartz, dolomite, pyrite, gypsum and kaolinite, and contain 0.13 ± 0.03% of total Cu, 0.18 ± 0.02% of total Pb, and 0.29 ± 0.01% of total Zn. A field revegetation trial was established in 2010, and within 3 years the woodchips amendment and/or revegetation treatments did not substantially change the tailings physiochemical properties and the dominant microbial species. Dominant microbial species were affiliated with Rubrobacter spp. of Actinobacteria, Truepera spp. of Deinococcus-Thermus and Thioalkalivibrio spp. and Thiobacillus spp. of Proteobacteria 39,61 . Tailings sample were stored in an iced container and shipped to the laboratory within 24 hours for DNA extraction. For physiochemical analyses part of the tailings were oven-dried at 40 °C, sieved through a 2 mm screen and mixed thoroughly before use. Methods for physiochemical analyses, such as electrical conductivity (EC), cation exchange capacity (CEC), total organic carbon (TOC), microbial biomass carbon (MBC) and total elemental concentrations (Table 1), can be found in our previous studies 39, 61 .
DNA was extracted from 24 tailings samples (8 plots with 3 replicates each plot). DNA extraction was done using commercial kits after cell enrichment by sucrose density centrifugation; detailed methods can be found in our previous studies 39,61 . For MiSeq shotgun sequencing, replicates were combined to obtain enough DNA, and one sample of pure tailings failed the quality test for sequencing. Therefore, 7 independent tailings DNA samples representing the gene pool in the tailings landscape were sequenced using Illumina MiSeq platform in this study. The DNA concentrations and quality were measured using the Qubit ® 2.0 Fluorometer (Thermo Fisher Scientific). DNA concentrations of the 7 samples ranged from 4.5 to 25.2 ng/μ l, giving between 0.23 to 1.51 μ g of DNA to use for the sequencing. MiSeq sequencing. DNA sequencing libraries were prepared using the Illumina TruSeq DNA LT Sample Prep Kit (Illumina), following the standard manufacturer's protocol (Part #15026486 Rev. C July 2012) with modifications as described below. Genomic DNA was fragmented by sonication (Covaris S2) using the "Whole-genome Resequencing" settings in the protocol, after which the DNA was purified using AMPure XP beads. The purified, fragmented DNA was end repaired and again purified using AMPure XP beads. The purified, end repaired DNA was size-selected using a double-SPRI method to obtain insert sizes of approximately 600 bp (actual average insert size ranges from 456-667 bp). The size-selected DNA was A-tailed and then had adapters ligated, followed by an AMPure XP purification. This ligated product was amplified by PCR to produce the final library. The final individual libraries were visualized and quantified on the Agilent BioAnalyzer 2100 using the High Sensitivity DNA Kit. The libraries were then pooled in an equimolar ratio, and the pool was quantified by qPCR using the KAPA Illumina Library Quantification Kit (KAPA Biosystems). The library pool was sequenced on the Illumina MiSeq (MiSeq control software v2.0.5/Real Time Analysis 1.18), using the MiSeq Reagent Kit v3 (600-cycle) with paired-end 300 bp reads.

Shotgun metagenomic analyses.
For functional annotations, the paired-end raw data was uploaded to MG-RAST server (Rapid Annotation using Subsystems Technologies for Metagenomes) 62 . Sequences were annotated to functional categories against M5NR database using BLASTX at an e-value cutoff of 1 × 10 −5 . Results of general descriptors and annotations were downloaded from MR-RAST server for download analyses. Gene abundance was counted manually for cop, czc and ars genes, which are responsible for Cu, Pb/Zn and As resistance, respectively. For metagenomic assembly, the raw reads were trimmed using Skewer 63 to have a Q > 25 and a minimum length of 200 bp. Quality trimmed reads were used for de-novo assembly using the open source software package Mira (4.0rc5) 23 on a 24 core, 192 GB server with multithreading enabled. The seven tailings libraries were combined to improve the quality of assembly 64 . The contig dataset was then subjected to local BLASTN for searching of copA-like genes.
Local BLASTN. A local BLASTN method, similar to ARGAnnot by Gupta et al. 26 , was established to recall copA-like genes in the assembled metagenomic database. Briefly, nucleotide sequences annotated as copA in Genbank were retrieved manually and formed a local copA database (Supplementary Material 1). Then the assembled metagenomic dataset was aligned against the copA database using the embedded BLASTN method in BioEdit 65 . Then the results were sorted based on aligned length and a threshold of 40 bp was used for screening copA-like genes in the metagenomic dataset. All the contigs containing candidate copA were picked out manually and then subjected to gene finding using Glimmer 66 . For copA-like gene finding, the selected contigs were aligned against the Genbank database using BLASTN one-by-one and the sequence region annotated as copA or heavy metal resistance genes were compared with the open reader framework (ORF) found by Glimmer. The candidate ORF was then manually curated again against the Genbank database. The closest copA sequences were also retrieved for phylogenetic analysis. Maximum-likelihood phylogenetic trees were constructed for all the copA sequences in the database and selected found copA sequences from the tailings metagenome, using the method described below.
Local BLASTN against the copA database was done on a Window 7 computer equipped with dual-core 2.8 GHz CPU and 8 G RAM. The assembled dataset was cut into 6 sub-datasets and BLASTN for each dataset used up to 6 computing hours.
Specificity tests were done for the local BLASTN method using 1) 4 copA sequences included in the database (gi_294009986_Sphingobium japonicum; gi_258541105_Acetobacter pasteurianus; gi_347756788_Micavibrio aeruginosavorus; gi_162145846_Gluconacetobacter diazotrophicus), 2) three complete genomes (gi_198282148_A. ferrooxidans, gi_627776062_R. radiotolerans, and gi_220933193_T. sulfidophilus who are phylogenetically close to those of abundant species in the tailings) containing copA novel but with homologs in the copA database and 3) a complete genome (gb_CP002829.1_T. geofontis) belonging to phylum Thermodesulfobacteria which is not included in the database and probably containing novel copA.
Evolutionary trace analysis. Evolutionary trace was done following the method by Wilkins et al. 67 .
Since microorganisms harbouring copA homologs can also be Cu-sensitive 68 , gene sequences for ET analysis in this study were selected based on three criteria ( Table 2): 1) the microorganisms must be reported as Cu-resistant; 2) the full genome and/or complete sequence of plasmid(s) must be available in the Genbank; and 3) the gene must be annotated as copA or copA-like translocating genes in the full genome. In addition, five CopA protein sequences whose gene functions have been verified experimentally through mutagenesis or whose crystal structure has been resolved were also obtained from UniProt 69 . Information on active sites of these verified CopA was gathered from the literature 10,[16][17][18][19]50 . Information on active sites of plasmid-encoded PsCopA was obtained from the predictions of UniPro since no crystal structure has been resolved so far for this group of CopA, and crystal structure of EcCueO and other multicopper oxidases were used as references. Furthermore, two multicopper oxidases, LccA (laccase) of Haloferax volcanii and CueO of E. coli for Cu homeostasis, and two P-type ATPases, CtpA of Mycobacterium smegmatis for cadmium transport and ZosA of Bacillus subtilis for Zn transport, 38 CopA sequences of known or highly possible to have Cu-resistance functions were aligned using the method of MUSCLE 70 embedded within MEGA 6 71 . The crystal structures of reference Scientific RepoRts | 5:13258 | DOi: 10.1038/srep13258 proteins, LccA, CueO, CtpA and ZosA, have also been resolved [72][73][74][75] . For active sites comparison, the Cu-binding motifs were searched manually based on the available information of verified proteins said above. For phylogenetic analysis, the alignment in FASTA format was used to determine the conserved regions using Gblocks 0.91 b online 76 . Two highly conserved regions, one at the C-terminal and one at the N-terminal, were found, and the variable beginning and tail parts were trimmed in MEGA 6. The aligned regions of the alignment was then used for the construction of a Maximum-Likelihood tree using default values within MEGA.
Screening of CopA containing active sites found in ET analysis. All the CopA sequences detected from the tailings metagenome were aligned with the reference CopA (those with verified functions) using the alignment method of MUSCLE, as described above. The active site regions were checked manually based on the alignment for screening of CopA, gaps were manually adjusted to refine the alignment, and those containing the conserved motifs were determined as highly possible to be functioning as P-type ATPase Cu translocating proteins.