Searching for signatures across microbial communities: Metagenomic analysis of soil samples from mangrove and other ecosystems

In this study, we categorize the microbial community in mangrove sediment samples from four different locations within a vast mangrove system in Kerala, India. We compared this data to other samples taken from the other known mangrove data, a tropical rainforest, and ocean sediment. An examination of the microbial communities from a large mangrove forest that stretches across southwestern India showed strong similarities across the higher taxonomic levels. When ocean sediment and a single isolate from a tropical rain forest were included in the analysis, a strong pattern emerged with Bacteria from the phylum Proteobacteria being the prominent taxon among the forest samples. The ocean samples were predominantly Archaea, with Euryarchaeota as the dominant phylum. Principal component and functional analyses grouped the samples isolated from forests, including those from disparate mangrove forests and the tropical rain forest, from the ocean. Our findings show similar patterns in samples were isolated from forests, and these were distinct from the ocean sediment isolates. The taxonomic structure was maintained to the level of class, and functional analysis of the genes present also displayed these similarities. Our report for the first time shows the richness of microbial diversity in the Kerala coast and its differences with tropical rain forest and ocean microbiome.

Taxonomic Diversity across the Kerala samples. The taxonomic classifications of genes were assigned to the RefSeq annotation source 19 using the Best Hit Classification algorithm of MG-RAST. A total of 13269130 representative sequences were assigned to different taxonomies using the RefSeq database from all the four datasets accounting for more than 99% of the reads for PYN, MAL, VL1 and PGD samples respectively (Table 1). There were also a number of sequences that could not be assigned to the highest taxonomy levels, with an average of 15.5% of the reads either unassigned, unclassified, or identified as other. Collectively, bacterial sequences dominated the overall reads accounting for 96.81% of the total assigned reads, with Eukaryota and Archaea assigned 1.8% and 1.32% respectively. Less than 1% of the reads mapped to viruses. Sequences from all the datasets were assigned to 28 different bacterial, 5 archaeal, and 34 eukaryotic phyla. All four samples shared Proteobacteria as the phyla with the most reads assigned, having an average of 65.70% (Supplementary Table 1). Similar rich dominance of Proteobacteria was found in Brazilian oil contaminated mangroves 20 . Other dominant bacterial phyla included Bacteriodes (11.83%), Firmicutes (5.56%) and Actinobacteria (3.61%), but there was some variation in the ranking of each of these among the Kerala samples. Although all samples shared Proteobacteria as the predominant phylum, there was diversity in the percentage of the reads assigned the classes within that taxon ( Fig. 1 and  Fig. 2 and Table 2). The remainder of the reads in Archaea mapped predominantly to Crenarchaeota (10.43% ± 2.16) and Thaumarchaeota (6.43 ± 8.47), although the order and the percentage varied across the four Kerala locations. Of the 1.32% of the assigned reads that mapped to Eukaryota,  90% of those reads were found across 8 phyla ( Fig. 3 and Table 2), with the majority (18.04% ± 3.55) mapping to the phylum Streptophyta, closely followed by Ascomycota (16.4% ± 1.78) in all of the samples except PGD. Unlike what was seen in Bacteria and Archaea, a large number of the Eukaryota reads (14.94% ± 0.88) could not be classified to a specific phylum. An analysis of the genera found in these sediment samples showed unique differences at each sampling location, with some similarities. The reads that mapped to each genus were calculated, and only those genera with more than 1% of the total mapped reads were noted (Fig. 4). Burkholderia and Geobacter were the only genera that were found in each of the sampling locations. Each location had a different predominant genus, and while three locations had most of the same dominant genera (PYN, MAL and VL1), PGD was quite different in both the number and members identified.VL1 had Sulfuricurvum, which at 6% had almost double the reads as the next genus (Sulfurimonas). Burkholderia was the most prominent in PGD, with Acidovorax also prominent. Geobacter, Gramella and Shewanella were the dominant genera found at the PYN location, and Marinobacter was the most prominent in MAL.
Comparison of metagenomes from different soil samples. Sediment samples from three different ecosystems were compared to the Kerala mangrove samples. These included four samples from a Brazilian mangrove forest 21 , sediment from a tropical rain forest in Puerto Rico 22 , and four ocean sediment samples from the South China Sea 23 . A Principal Component Analysis (PCA) of all the samples clearly grouped the forest isolates together on one axis, clearly separating them from the ocean samples (Fig. 5).
MG-RAST's organism tree tool was used to examine patterns across the different sediment types. At the highest taxonomic level (Super Kingdom) clear differences were seen between the ocean sediment samples, and those isolated from the two types of forests. The four ocean samples had an average of 868769.5 (±123583) reads, with an average of more than 99% assigned to a taxonomic category (Table 1 and Fig. 6). In contrast, the majority of the reads isolated from any of the forest samples mapped to Bacteria, with Archaea being less than 5% of the assigned total in any sample (Table 1 and Fig. 6). All of the sediment samples had approximately 1.5% of the reads assigned to Eukaryota. Very few virus reads were found. A more detailed look at the two taxonomic levels with the majority of the reads assigned (Archaea and Bacteria) showed some striking differences between the assigned phyla within these two groups. Most notably, the ocean sediment samples had the highest archaeal reads, while all of the arboreal samples are predominantly bacterial (Table 1). An average 9. 3% of the ocean reads mapped to Archaea, and more than 95% of the arboreal reads, both mangrove and rain forest, are bacterial. Those bacterial reads, regardless of the ecology of the samples, map to Proteobacteria across all samples (Fig. 6), but that similarity does not extend beyond the phylum. Within this phylum, the Kerala mangrove samples have either Gamma-or Betaproteobacteria as the class that has the most reads assigned to it, with the Brazilian mangrove and Puerto Rican forest samples both having Gammaproteobacteria as the dominant class (Supplementary Table 4). The ocean samples have either Beta-or Alphaproteobacteria as the most dominant class. The lack of consensus, both within and between the groups, is also seen with classes within the Bacteriodes and Firmicutes phyla (Supplementary Table 4). The only bacterial phylum where all groups shared a similar structure was the Actinobacteria, with Actinomycetales as the dominant class (Supplementary Table 4).
The ocean sediments had a high abundance of reads that map to the archaeal kingdom and within that, Thaumarchaeota as the dominant phylum (Fig. 6). Archaea reads were less than 5% of the total from the samples isolated from forests, but across all of them, Euryarchaeota was the dominant phylum, followed by Crenarchaeota.  All of the samples, both from the mangrove and rain forests, had Methanomicrobia as the dominant class in Euryarchaeota. They also shared Thermoprotei as the most prevalent Crenarchaeota class. Reads that mapped to these particular classes were not even found (or barely registered) in any of the ocean samples, which instead had Thermococci and Methanococci as the most prevalent classes among the ocean reads that mapped to the Euryarchaeota phyla.
Less than 2% of the total reads from any sample mapped to Eukaryota. Most of these reads mapped to Streptophyta in the Indian and Brazilian samples, but the majority of these could not be classified to any  Table 4). The dominant class in the sample from the rain forest was Ascomycota. Streptophyta was also an important part of the ocean eukaryotic reads, but the Bacillariophyta, which contains the diatoms, was the dominant phylum. There was no clear consensus in a dominant class across the groups.

Functional analysis.
A functional comparison that mapped the abundance profiles to different categories across the Subsystems 24 was performed for each of the Indian mangroves samples. All four samples had similar distributions and abundance of reads that were mapped at the highest levels of subsystem categorization (Supplementary Table 2 Table 3 and Fig. 7). The forest samples had similar profiles, the exception being Protein Metabolism, where the India mangrove sample (PGD) had many fewer reads (Supplementary Table 2), and the ocean sample with a clear majority of mapping reads. The subsystems that had the highest Z score when comparing the average across the three forest samples to the ocean sample were Carbohydrates, Clustering-based subsystems, Protein metabolism and Amino acids derivatives (Supplementary Table 3 and Fig. 7).

Discussion
Similarities and differences seen across the Kerala mangrove samples. All of the Kerala mangrove samples shared certain microbial community structures. Bacteria were the predominant organisms, averaging 81.72% of the total reads across all four locations. Metagenomic analysis of mangrove at Cardoso Island State Park, Brazil, through 16S rRNA pyrosequencing showed similar dominance of Proteobacteria (88% of overall sequence) irrespective of soil depth through 16S rRNA pyrosequence 25 . Within the Bacteria, Proteobacteria was the dominant phylum, followed by Bacteriodes, Firmicutes and then Actinobacteria. Similar bacterial phylum dominancy was found using PCR-Clone based metagenomic library screening 26 as well as 16S rRNA ribo-typing 27 . The similarities within Kerala datasets seen across the higher taxonomic levels (Kingdom and Phylum) did not continue at the class level, where the composition of the bacteria varied at each location (Fig. 1), especially among the classes within the Proteobacteria and Bacteriodes phyla. Archaea followed next with the most reads, but with an average of 1.8%, they were not considered a dominant member of the Kerala microbial community. Despite their small numbers, they had a remarkable consistency in their taxonomic structure, maintaining the same divisions even to the class level (Fig. 2). Previous whole metagenome study in Brazilian mangroves sediments by Andreote et al. 21 have found similar abundance (0-3.4%) of archaea. Furthermore, mangrove soil sediment from Saudi Arabia 28 has also showed similar percentage abundance (3.5%). The consistent level of archaea in mangrove samples could denote its importance in the ecosystem such as N cycle by Thaumarchaeota and the favorable conditions for Methanogens 29 . Eukaryota had almost as many reads assigned to it as the Archaea, with 1.32% of the total, and all four samples had similar distributions at the taxon class level (Fig. 3).
Genera that had more than 1% of the reads mapped to them were examined at each sampling location, revealing some similarities and differences across the Kerala isolates (Fig. 4). Only two genera, Burkholderia and Geobacter, had more than one percent of the sequences map to them in each of the four samples. The mangrove samples from Kerala (India) were affected by anthropogenic activities, hence, it is very likely that the high dominance of Burkholderia could be due to its role involved in degradation of various compounds such as polycyclic aromatic hydrocarbons (PAHs), diesel, kerosene, naphthalene, and phenol 30,31 . Burkholderia sp. was also abundant in Okinawa (Japan) oil contaminated mangrove sediments 32 . Three of the locations (MAL, PYN and VL1) shared most of the genera that had more than 1% of the total reads, but those found in PGD at similar levels were mostly unique. The two predominant genera found in the VL1 were Sulfuricurvum and Sulfurimonas, with 6.2% and 3.66% of the total reads mapping to them. Both of these genera have been associated with sulfur oxidation [33][34][35][36][37] . Sulfuricurvum is in high concentration only in VL1, but Sulfurimonas is also a predominant organism in PYN, with 1.19% of the reads mapping to this genus in this sample. MAL and PYN were in close proximity compared to the other samples, and they shared Marinobacter as a dominant genus (with 3.93% and 0.79% of the reads, respectively). Marinobacter has been described as being ubiquitous across the global oceans 38 , and is known to degrade hydrocarbons 39 and fix nitrogen 40 . This genus is also of interest as it has been shown to be one of the few bacterial genera known to "bloom" when oil or oil constituents are introduced into seawater 41 , and its predominant presence in the co-located MAL and PYN, as compared to VL1 and PGD could suggest some pollution in that environment when the samples were taken.
Although there were differences in the bacterial membership at certain taxonomic levels, this was not repeated when the functional capacity of the four samples was compared. The reads across all four samples had similar abundance patterns that were assigned to the 28 different functional categories used by Subsystems 22 . No significant differences were seen in sulfur metabolism, despite VL1 having the two most abundant genera being sulfur oxidizers.

Similarities and differences across global samples.
To look for similarities that might be shared across similar ecosystems, the Indian mangrove samples were compared with four samples isolated from mangrove forests in Brazil 21 , a tropical forest in Puerto Rico 22 and four samples collected sediment from the South China sea 23 . A principal component analysis clustered the samples taken from forests, which includes both the mangrove samples and the tropical rain forest isolate, closer to the y-axis than those samples collected in the ocean (Fig. 5). A taxonomic analysis revealed supported this finding. At the highest taxonomic levels, all of the forest samples had similar taxonomic patterns (Fig. 6), with the ocean samples being distinctly different. Bacteria were the dominant kingdom in all samples taken from forests with Archaea less than 5%, however, the ocean isolates have almost 12.9% archaeal reads. Previously, more than 87% of the microbial biomass was seen to be dominated by Archaea in deep subsurface sediments 42 . In addition, Antarctic circumpolar continental shelf waters have been shown to be highly dominated by Archaea 43 . Dominance of archaea in deep ocean subsurface and bacteria in arboreal samples could be explain according to the theory proposed by Valentine 44 which states that bacteria can adapt to the changing environment while archaea can sustain in the nutrient limited environment. At the phylum taxon, all forest samples were predominantly Proteobacteria. This was also true of the ocean samples, but as they accounted for less than 10% of the total reads they were not the dominant ocean phylum. Nevertheless, Proteobacteria was found to be the most dominant in all the ecosystems within the bacterial kingdom. Saline environments have shown to harbor Proteobacteria in the past 45,46 . The mangrove samples from Brazil and India had a mixture of Bacteriodes, Firmicutes and Actinobacteria as a significant part of the bacterial phyla diversity, which corroborates to mangroves data generated from other part of India 47 , but the Puerto Rican rainforest had two additional phyla (Acidobacteria and Planctomycetes) that were only a small part of the diversity in the mangroves. Acidobacteria have been found to be dominant in rain forest sample 48,49 . Ocean samples, which also included reads that mapped to Bacteriodes, Firmicutes and Actinobacteria.
Similarities were also seen in the structure of the Archaea across the four ecosystems examined. Of the Archaea reads that were present, the forest samples contained mainly Euryarchaeota. The ocean samples were primarily archaeal and mostly of the phylum Thaumarchaeota. The finding is in corroboration with the work of Quaiser et al. 50 , which showed the dominance of archaea by Group-I Thaumarchaeota in the sediment of Marmara Sea. They found that the archaeal amo (ammonia mono-oxygenase) genes were highly abundant in Marmara Sea suggestion the dominance of Thaumarchaeota in ammonia oxidation. Marine Group-I (MG) group were also found to be the most dominated group among the archaeal kingdom in deep Mediterranean Sea 51 . Studies have shown that the seasonal variation affects the archaeal diversity wherein Marine Group-I (MG) and Euryarchaeota MG II.b dominates during winter and Euryarchaeota MG II.b during summer 52 . Archaeal diversity can also be influenced by the difference in zones (depth) of the sea/ocean; the oxic/anoxic interface zone featured high dominance of Marine Group-I (MG) archaea. However, significant reduction was exhibited in sulfatemethane transition zone and methanic zone 53 . Similar patterns were also seen in the Eukaryota, which were not in the majority in any sample. Across both the forest and ocean samples the Streptophyta, which include the green plants, were a predominant phylum found in the eukaryotic reads. The forest samples also had reads that mapped to the phyla Ascomycota (fungi) in significant numbers and a large number in Chordata (vertebrates), Cnidaria, and Arthropoda (insects and arachnids). These same phyla were also present in the ocean. The mangrove and ocean samples had a number of reads mapping to the phylum Bacillariophyta (data not shown), which include the diatoms. Bacillariophyta play a crucial role in generation of organic carbon soluble in organic compounds in the ocean bottom and also produces exopolysaccharides (EPS) which stabilizes the sedimentary materials 54 .
A similar pattern that groups the forest samples distinctly from the ocean isolates was also seen in the functional analysis. While all samples had similar functional profiles in some of the subsystems, the forest samples were clearly distinct from the ocean samples in several of the subsystems examined. Interestingly, the ocean samples examined had more reads that mapped to genes active in the subsystems defined as Carbohydrates, Amino Acids and Derivatives, Protein Metabolism, Respiration, and RNA Metabolism (Fig. 7), but had fewer reads that mapped to the Virulence, Disease and Defense.

Conclusions
A metagenomic analysis of isolates from soil sediment in four different locations across the Kerala, India mangrove forest ecosystem showed strong similarities across higher taxonomic divisions extending to the level of phylum. Comparisons of the Indian mangrove isolates to samples from mangrove in Brazil, and to a tropical rain forest in North America showed similar patterns, with most of the reads mapping to phylum Proteobacteria within the Bacteria kingdom. Fewer reads were found that mapped to Archaea, but those that were present predominantly to the phylum Euryarchaeota. Similar numbers of reads in each of the isolates from mangrove or rainforest, mapped to Eukaryota, which showed comparable divisions across the phyla Streptophyta (green plants), Ascomycota (fungi), Chordata (vertebrates), Cnidaria and Arthropoda (insects and arachnids). However, differences were noted, when these samples were compared to isolates taken from ocean sediments. The ocean samples had, on average, larger number of Archaea, with almost all the reads mapping to the phylum Thaumarchaeota. Like the forest samples, the predominant bacterial phylum in the ocean samples mapped to the phylum Proteobacteria.
This finding shows strong patterns in metagenomics structure across the samples taken from the two types of forest ecosystems. It shows distinct patterns that unite the forest samples and differentiates them from an ocean sediment ecosystem.  Table 4). Following collection, each soil sample was preserved at −20 °C prior to DNA isolation.

Methods
Genomic DNA extraction and Sequencing. Genomic  (http://metagenomics.anl.gov/) 55 and processed following their standard protocol. Briefly, the mate-pairs were joined with overlap setting of 8 base pairs (bp) and a maximum difference of 10% and were then processed for quality control. Low-quality regions were trimmed off using SolexaQA 56 , de-replicated, and analyzed the artificially duplicated reads (ADRs) 57 using Duplicate Read Inferred Sequencing Error Estimation (DRISEE) 58 . The near-exact matches against model organisms including fly, mouse, cow and human were removed using Bowtie 59 . Coding regions in DNA sequences of 75 bp and longer were predicted using FragGeneScan 60 . Protein clusters with 90% identity were built using the UCLUSTimplementation 61 in Quantitative Insights into Microbial Ecology (QIIME). A representative of each cluster is subjected to similarity analysis using BLAST-like alignment tool (BLAT) 62 . Sequence similarity searches to identify proteins and mapped annotations are computed against a MG-RAST protein databases M5NR 19 , Genbank 63,64 , SEED 23 , Integrated Microbial Genomes & Microbiomes (IMG) 65 , Universal Protein Resource (UniProt) 40 , Kyoto Encyclopedia of Genes and Genomes (KEGG) 66 and Evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG) 67 .
Taxonomic and Functional Analysis. Taxonomic assignments were carried out against the RefSeq protein database, Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) 55 . This database is an integration of many sequence databases into one single, searchable database. Cut-offs included a maximum E-value of 1 × 10 −5 , a minimum percentage identity of 60%, and a minimum alignment length of 15 were used. The subsystems platform annotation that assigns genes to functional roles 68,69 , was used for functional analysis comparisons, with the same cut-off values used for the taxonomic assignments.
Comparison of metagenomes from different soil samples. Different sediment samples isolated from mangroves in Brazil 21 , a tropical forest in Puerto Rico 22 , and different sediment samples collected from the South China Sea 23 were compared to the Kerala, India soil samples (Table 1), which were publically available at MG-RAST. We selected samples that could be linked to public data. A PCA provided by MG-RAST was used to examine dimensions of maximal variation, and an examination of the taxonomic diversity across all samples was conducted using the Organism tree tool. Taxonomic assignments used M5NR with the same cut-offs used for the Kerala samples, and an average was made across all the samples from a similar location/ecosystem for final comparison. The subsystems platform was used to assigned functional roles, with settings similar to those described above. When functional comparisons were made across metagenomes from geographic and ecological variants, a single isolate that was chosen based on the number of reads to represent each of the geographic groups (India-PGD