The mangrove ecosystem plays a crucial role by acting as a buffer zone between land and sea, maintaining the sea level and protecting the coast1. Mangroves are a crucial component of the food chain in the saline coastal biome of the tropics and subtropics. Mangrove trees convert solar energy into organic matter via photosynthesis, with their leaves and branches serving as a source of energy and providing a habitat for a variety of aquatic organisms, which in turn, support a higher level in the food chain. This ecosystem is an enormous food web, supplying a myriad of microorganisms with both protection and nutrients2,3. It is considered to be one of the most critical in tropical regions, and also one of the most vulnerable to global climate change4.

The complexity of the mangrove microbial communities has generated deep interest among microbial ecologists. The dynamic environment of the mangrove ecosystem, brought about by the regular tidal variations, pH, temperature, salinity, light, rainfall and nutrient availability provides an excellent environment for a wide range of organisms with diversified functional roles5. Studies have shown that microbial communities play a vital role in this ecosystem, being essential for biogeochemical cycles and biocycling of most nutrients, including nitrogen6,7. Except for absorption of nutrients, transformation processes of nutrients can also be important for avoiding eutrophication8.

The impact that mangrove forests and their supporting ecosystem extends beyond their local ecological impact. Explorations of this ecosystem have revealed some interesting bioactive compounds that could have a potential impact on human health. These include some of the novel compounds with cytotxoic activity9,10, some p-aminoacetophenonic acids that may have antibiotic properties11,12,13, some indolesesquiterpenes that appear to have selective anti-HIV activity14,15 and an anti-fibrotic compound9.

Mangroves cover up to 152,000 km2 globally8, but their range is decreasing due to pollution, urbanization and other human activities. Conservation and research on mangroves are imperative due to their dominant role in the marine food chain and the recognition that they may play a role in human health by their potential ecological, industrial and pharmaceutical impacts5. There is an increasing urgency to understand the structural and functional architecture that underlies the mangrove ecosystem, and the microbial community is an important part that remains unexplored. With more than 98% of microbes currently unculturable16, examination of environmental samples by metagenomic analysis is now the only avenue available for exploring these communities. Until now, the few metagenomic studies in mangroves have been concentrated South America17. The South American mangrove ecosystems are important, but constitute only 11% of the range of mangroves across the globe18. In this study, we carried out the first detailed analysis of microbial communities within the mangrove ecosystems from four different locations in mangrove forests across Kerala, India, and compare them to a similar Brazilian ecosystem, a tropical rainforest, and samples taken from ocean sediment.


Sequencing, quality control and annotation of proteins

A total of 64413433 sequencing reads were obtained from the four samples from Kerala with an average length of 254 ± 12.75 bases. Following quality trimming, a total of 54833910 (85.13% of the total) remained, including for 7809729 reads for PGD (average read length of 408 ± 138 bp), 7303470 reads for MAL (average read length of 408 ± 138 bp), 6414174 reads for VL1 (average read length of 200 ± 68 bp), and 5920270 reads for PYN (average read length of 193 ± 64 bp). A total of 18.6 GB (64 million reads) were obtained across all four of the Kerala mangrove samples, and 54 million reads (85.13%) were retained for further analysis following quality control. The sequences for these reads are publicly available from the MG-RAST server under the following IDs: 4671371, 4671370, 467136 and 4671368. Of the sequences that passed quality control, each had a certain percentage that mapped to proteins of known (PGD= 39%, MAL= 41.2%, VL1= 35.4%, and PYN= 35.4%) and also to proteins of unknown function (PGD= 7.5%, MAL= 41.2%, VL1= 44.5%, and PYN= 44.1%). Asymptotic rarefaction curves were generated for each sample indicating that the majority of taxonomic diversity was covered in the samples (data not shown).

Taxonomic Diversity across the Kerala samples

The taxonomic classifications of genes were assigned to the RefSeq annotation source19 using the Best Hit Classification algorithm of MG-RAST. A total of 13269130 representative sequences were assigned to different taxonomies using the RefSeq database from all the four datasets accounting for more than 99% of the reads for PYN, MAL, VL1 and PGD samples respectively (Table 1). There were also a number of sequences that could not be assigned to the highest taxonomy levels, with an average of 15.5% of the reads either unassigned, unclassified, or identified as other. Collectively, bacterial sequences dominated the overall reads accounting for 96.81% of the total assigned reads, with Eukaryota and Archaea assigned 1.8% and 1.32% respectively. Less than 1% of the reads mapped to viruses. Sequences from all the datasets were assigned to 28 different bacterial, 5 archaeal, and 34 eukaryotic phyla. All four samples shared Proteobacteria as the phyla with the most reads assigned, having an average of 65.70% (Supplementary Table 1). Similar rich dominance of Proteobacteria was found in Brazilian oil contaminated mangroves20. Other dominant bacterial phyla included Bacteriodes (11.83%), Firmicutes (5.56%) and Actinobacteria (3.61%), but there was some variation in the ranking of each of these among the Kerala samples. Although all samples shared Proteobacteria as the predominant phylum, there was diversity in the percentage of the reads assigned the classes within that taxon (Fig. 1 and Table 2), with PYN and MAL both sharing Gammaproteobacteria as the dominant class, and VL1 and PGD with the majority of their Proteobacteria reads mapped to Betaproteobacteria. A similar lack of structure is seen within the phylum Bacteriodes. Both PYN and MAL have Flavobacteriia as the dominant class, but VL1 has most of its Bacteriodes reads map to Bacteroidia, and in PGD it is Flavobacteria. The classes within Firmicutes and Actinobacteria phyla have similar diversity and percentage of reads assigned to their top classes across all of the Kerala samples.

Table 1 Summary of reads and distribution across kingdoms in the samples from India, Brazil, Puerto Rico and from the isolates taken from the South China Sea.
Figure 1
figure 1

Taxonomic structure and diversity of the reads mapping to Bacteria. Kingdom, phyla and class divisions across the four sampling locations in Kerala, India, with class divisions noted across the top four phyla.

Table 2 Class taxonomic distribution across the samples isolated from a mangrove forest in Kerala, India.

A majority of the reads assigned to Archaea (1.32 ± 0.44% of total reads) mapped to the phylum Euryarchaeota (81.77% ± 7.02) across all four samples (Fig. 2 and Table 2). The remainder of the reads in Archaea mapped predominantly to Crenarchaeota (10.43% ± 2.16) and Thaumarchaeota (6.43 ± 8.47), although the order and the percentage varied across the four Kerala locations. Of the 1.32% of the assigned reads that mapped to Eukaryota, 90% of those reads were found across 8 phyla (Fig. 3 and Table 2), with the majority (18.04% ± 3.55) mapping to the phylum Streptophyta, closely followed by Ascomycota (16.4% ± 1.78) in all of the samples except PGD. Unlike what was seen in Bacteria and Archaea, a large number of the Eukaryota reads (14.94% ± 0.88) could not be classified to a specific phylum.

Figure 2
figure 2

Taxonomic structure and diversity of the reads mapping to Archaea. Kingdom, phyla and class divisions across the four sampling locations in Kerala, India, with class divisions noted across the top three phyla.

Figure 3
figure 3

Taxonomic structure and diversity of the reads mapping to Eukaryota. Kingdom, phyla and class divisions across the four sampling locations in Kerala, India, with class divisions noted across the top four phyla.

An analysis of the genera found in these sediment samples showed unique differences at each sampling location, with some similarities. The reads that mapped to each genus were calculated, and only those genera with more than 1% of the total mapped reads were noted (Fig. 4). Burkholderia and Geobacter were the only genera that were found in each of the sampling locations. Each location had a different predominant genus, and while three locations had most of the same dominant genera (PYN, MAL and VL1), PGD was quite different in both the number and members identified.VL1 had Sulfuricurvum, which at 6% had almost double the reads as the next genus (Sulfurimonas). Burkholderia was the most prominent in PGD, with Acidovorax also prominent. Geobacter, Gramella and Shewanella were the dominant genera found at the PYN location, and Marinobacter was the most prominent in MAL.

Figure 4
figure 4

Geographic locations and diversity in the sampling sites in the mangrove ecosystem of Kerala, India. Genera with more than 1% of the mapped reads are listed, with their rank and indications of their abundance are shown, and organisms shared across sampling locations are indicated by shading or boxes. (Image source: andInkscapev0.92 was used for sample site labelling).

Comparison of metagenomes from different soil samples

Sediment samples from three different ecosystems were compared to the Kerala mangrove samples. These included four samples from a Brazilian mangrove forest21, sediment from a tropical rain forest in Puerto Rico22, and four ocean sediment samples from the South China Sea23. A Principal Component Analysis (PCA) of all the samples clearly grouped the forest isolates together on one axis, clearly separating them from the ocean samples (Fig. 5).

Figure 5
figure 5

Principal Component Analysis and geographic location of samples compared across the globe. A. PCA analysis comparing mangrove forests, a rain forest, and four ocean sediment samples. B. Geographic locations of the samples that were compared in this analysis. (Image source- Link: Inkscape v0.92 was used for labelling).

MG-RAST’s organism tree tool was used to examine patterns across the different sediment types. At the highest taxonomic level (Super Kingdom) clear differences were seen between the ocean sediment samples, and those isolated from the two types of forests. The four ocean samples had an average of 868769.5 (±123583) reads, with an average of more than 99% assigned to a taxonomic category (Table 1 and Fig. 6). In contrast, the majority of the reads isolated from any of the forest samples mapped to Bacteria, with Archaea being less than 5% of the assigned total in any sample (Table 1 and Fig. 6). All of the sediment samples had approximately 1.5% of the reads assigned to Eukaryota. Very few virus reads were found.

Figure 6
figure 6

Taxonomic structure at the highest taxonomic levels across all samples. Averages of reads across samples from the same geographic location are pictured, with diversity at the kingdom level (large circles), different phyla within those kingdoms (smaller circles), and classes presented.

A more detailed look at the two taxonomic levels with the majority of the reads assigned (Archaea and Bacteria) showed some striking differences between the assigned phyla within these two groups. Most notably, the ocean sediment samples had the highest archaeal reads, while all of the arboreal samples are predominantly bacterial (Table 1). An average 9. 3% of the ocean reads mapped to Archaea, and more than 95% of the arboreal reads, both mangrove and rain forest, are bacterial. Those bacterial reads, regardless of the ecology of the samples, map to Proteobacteria across all samples (Fig. 6), but that similarity does not extend beyond the phylum. Within this phylum, the Kerala mangrove samples have either Gamma- or Betaproteobacteria as the class that has the most reads assigned to it, with the Brazilian mangrove and Puerto Rican forest samples both having Gammaproteobacteria as the dominant class (Supplementary Table 4). The ocean samples have either Beta- or Alphaproteobacteria as the most dominant class. The lack of consensus, both within and between the groups, is also seen with classes within the Bacteriodes and Firmicutes phyla (Supplementary Table 4). The only bacterial phylum where all groups shared a similar structure was the Actinobacteria, with Actinomycetales as the dominant class (Supplementary Table 4).

The ocean sediments had a high abundance of reads that map to the archaeal kingdom and within that, Thaumarchaeota as the dominant phylum (Fig. 6). Archaea reads were less than 5% of the total from the samples isolated from forests, but across all of them, Euryarchaeota was the dominant phylum, followed by Crenarchaeota. All of the samples, both from the mangrove and rain forests, had Methanomicrobia as the dominant class in Euryarchaeota. They also shared Thermoprotei as the most prevalent Crenarchaeota class. Reads that mapped to these particular classes were not even found (or barely registered) in any of the ocean samples, which instead had Thermococci and Methanococci as the most prevalent classes among the ocean reads that mapped to the Euryarchaeota phyla.

Less than 2% of the total reads from any sample mapped to Eukaryota. Most of these reads mapped to Streptophyta in the Indian and Brazilian samples, but the majority of these could not be classified to any particular class (Supplementary Table 4). The dominant class in the sample from the rain forest was Ascomycota. Streptophyta was also an important part of the ocean eukaryotic reads, but the Bacillariophyta, which contains the diatoms, was the dominant phylum. There was no clear consensus in a dominant class across the groups.

Functional analysis

A functional comparison that mapped the abundance profiles to different categories across the Subsystems24 was performed for each of the Indian mangroves samples. All four samples had similar distributions and abundance of reads that were mapped at the highest levels of subsystem categorization (Supplementary Table 2).

A single sample was chosen from each of the groups for a functional comparison (India mangrove = PGD/4671368.3, Brazil mangrove = BrMgv2/4451034.3, Puerto Rico Forest = 4446153.3, South China Sea = E201-2/4487295.3). The numbers of hits for each of the functional levels were summed, and the percentages of hits per each level compared to the total number were compared (Supplementary Table 3 and Fig. 7). The forest samples had similar profiles, the exception being Protein Metabolism, where the India mangrove sample (PGD) had many fewer reads (Supplementary Table 2), and the ocean sample with a clear majority of mapping reads. The subsystems that had the highest Z score when comparing the average across the three forest samples to the ocean sample were Carbohydrates, Clustering-based subsystems, Protein metabolism and Amino acids derivatives (Supplementary Table 3 and Fig. 7).

Figure 7
figure 7

Subsystem functional analysis of representatives from each of the ecosystems analyzed showing the top ten subsystems that the reads were assigned to. Reads that map to genes assigned to specific subsystems are demonstrated for a single representative from each geographic location compared.


Similarities and differences seen across the Kerala mangrove samples

All of the Kerala mangrove samples shared certain microbial community structures. Bacteria were the predominant organisms, averaging 81.72% of the total reads across all four locations. Metagenomic analysis of mangrove at Cardoso Island State Park, Brazil, through 16S rRNA pyrosequencing showed similar dominance of Proteobacteria (88% of overall sequence) irrespective of soil depth through 16S rRNA pyrosequence25. Within the Bacteria, Proteobacteria was the dominant phylum, followed by Bacteriodes, Firmicutes and then Actinobacteria. Similar bacterial phylum dominancy was found using PCR-Clone based metagenomic library screening26 as well as 16S rRNA ribo-typing27. The similarities within Kerala datasets seen across the higher taxonomic levels (Kingdom and Phylum) did not continue at the class level, where the composition of the bacteria varied at each location (Fig. 1), especially among the classes within the Proteobacteria and Bacteriodes phyla. Archaea followed next with the most reads, but with an average of 1.8%, they were not considered a dominant member of the Kerala microbial community. Despite their small numbers, they had a remarkable consistency in their taxonomic structure, maintaining the same divisions even to the class level (Fig. 2). Previous whole metagenome study in Brazilian mangroves sediments by Andreote et al.21 have found similar abundance (0–3.4%) of archaea. Furthermore, mangrove soil sediment from Saudi Arabia28 has also showed similar percentage abundance (3.5%). The consistent level of archaea in mangrove samples could denote its importance in the ecosystem such as N cycle by Thaumarchaeota and the favorable conditions for Methanogens29. Eukaryota had almost as many reads assigned to it as the Archaea, with 1.32% of the total, and all four samples had similar distributions at the taxon class level (Fig. 3).

Genera that had more than 1% of the reads mapped to them were examined at each sampling location, revealing some similarities and differences across the Kerala isolates (Fig. 4). Only two genera, Burkholderia and Geobacter, had more than one percent of the sequences map to them in each of the four samples. The mangrove samples from Kerala (India) were affected by anthropogenic activities, hence, it is very likely that the high dominance of Burkholderia could be due to its role involved in degradation of various compounds such as polycyclic aromatic hydrocarbons (PAHs), diesel, kerosene, naphthalene, and phenol30,31. Burkholderia sp. was also abundant in Okinawa (Japan) oil contaminated mangrove sediments32. Three of the locations (MAL, PYN and VL1) shared most of the genera that had more than 1% of the total reads, but those found in PGD at similar levels were mostly unique. The two predominant genera found in the VL1 were Sulfuricurvum and Sulfurimonas, with 6.2% and 3.66% of the total reads mapping to them. Both of these genera have been associated with sulfur oxidation33,34,35,36,37. Sulfuricurvum is in high concentration only in VL1, but Sulfurimonas is also a predominant organism in PYN, with 1.19% of the reads mapping to this genus in this sample. MAL and PYN were in close proximity compared to the other samples, and they shared Marinobacter as a dominant genus (with 3.93% and 0.79% of the reads, respectively). Marinobacter has been described as being ubiquitous across the global oceans38, and is known to degrade hydrocarbons39 and fix nitrogen40. This genus is also of interest as it has been shown to be one of the few bacterial genera known to “bloom” when oil or oil constituents are introduced into seawater41, and its predominant presence in the co-located MAL and PYN, as compared to VL1 and PGD could suggest some pollution in that environment when the samples were taken.

Although there were differences in the bacterial membership at certain taxonomic levels, this was not repeated when the functional capacity of the four samples was compared. The reads across all four samples had similar abundance patterns that were assigned to the 28 different functional categories used by Subsystems22. No significant differences were seen in sulfur metabolism, despite VL1 having the two most abundant genera being sulfur oxidizers.

Similarities and differences across global samples

To look for similarities that might be shared across similar ecosystems, the Indian mangrove samples were compared with four samples isolated from mangrove forests in Brazil21, a tropical forest in Puerto Rico22 and four samples collected sediment from the South China sea23. A principal component analysis clustered the samples taken from forests, which includes both the mangrove samples and the tropical rain forest isolate, closer to the y-axis than those samples collected in the ocean (Fig. 5). A taxonomic analysis revealed supported this finding. At the highest taxonomic levels, all of the forest samples had similar taxonomic patterns (Fig. 6), with the ocean samples being distinctly different. Bacteria were the dominant kingdom in all samples taken from forests with Archaea less than 5%, however, the ocean isolates have almost 12.9% archaeal reads. Previously, more than 87% of the microbial biomass was seen to be dominated by Archaea in deep subsurface sediments42. In addition, Antarctic circumpolar continental shelf waters have been shown to be highly dominated by Archaea43. Dominance of archaea in deep ocean subsurface and bacteria in arboreal samples could be explain according to the theory proposed by Valentine44 which states that bacteria can adapt to the changing environment while archaea can sustain in the nutrient limited environment. At the phylum taxon, all forest samples were predominantly Proteobacteria. This was also true of the ocean samples, but as they accounted for less than 10% of the total reads they were not the dominant ocean phylum. Nevertheless, Proteobacteria was found to be the most dominant in all the ecosystems within the bacterial kingdom. Saline environments have shown to harbor Proteobacteria in the past45,46. The mangrove samples from Brazil and India had a mixture of Bacteriodes, Firmicutes and Actinobacteria as a significant part of the bacterial phyla diversity, which corroborates to mangroves data generated from other part of India47, but the Puerto Rican rainforest had two additional phyla (Acidobacteria and Planctomycetes) that were only a small part of the diversity in the mangroves. Acidobacteria have been found to be dominant in rain forest sample48,49. Ocean samples, which also included reads that mapped to Bacteriodes, Firmicutes and Actinobacteria.

Similarities were also seen in the structure of the Archaea across the four ecosystems examined. Of the Archaea reads that were present, the forest samples contained mainly Euryarchaeota. The ocean samples were primarily archaeal and mostly of the phylum Thaumarchaeota. The finding is in corroboration with the work of Quaiser et al.50, which showed the dominance of archaea by Group-I Thaumarchaeota in the sediment of Marmara Sea. They found that the archaeal amo (ammonia mono-oxygenase) genes were highly abundant in Marmara Sea suggestion the dominance of Thaumarchaeota in ammonia oxidation. Marine Group-I (MG) group were also found to be the most dominated group among the archaeal kingdom in deep Mediterranean Sea51. Studies have shown that the seasonal variation affects the archaeal diversity wherein Marine Group-I (MG) and Euryarchaeota MG II.b dominates during winter and Euryarchaeota MG II.b during summer52. Archaeal diversity can also be influenced by the difference in zones (depth) of the sea/ocean; the oxic/anoxic interface zone featured high dominance of Marine Group-I (MG) archaea. However, significant reduction was exhibited in sulfatemethane transition zone and methanic zone53. Similar patterns were also seen in the Eukaryota, which were not in the majority in any sample. Across both the forest and ocean samples the Streptophyta, which include the green plants, were a predominant phylum found in the eukaryotic reads. The forest samples also had reads that mapped to the phyla Ascomycota (fungi) in significant numbers and a large number in Chordata (vertebrates), Cnidaria, and Arthropoda (insects and arachnids). These same phyla were also present in the ocean. The mangrove and ocean samples had a number of reads mapping to the phylum Bacillariophyta (data not shown), which include the diatoms. Bacillariophyta play a crucial role in generation of organic carbon soluble in organic compounds in the ocean bottom and also produces exopolysaccharides (EPS) which stabilizes the sedimentary materials54.

A similar pattern that groups the forest samples distinctly from the ocean isolates was also seen in the functional analysis. While all samples had similar functional profiles in some of the subsystems, the forest samples were clearly distinct from the ocean samples in several of the subsystems examined. Interestingly, the ocean samples examined had more reads that mapped to genes active in the subsystems defined as Carbohydrates, Amino Acids and Derivatives, Protein Metabolism, Respiration, and RNA Metabolism (Fig. 7), but had fewer reads that mapped to the Virulence, Disease and Defense.


A metagenomic analysis of isolates from soil sediment in four different locations across the Kerala, India mangrove forest ecosystem showed strong similarities across higher taxonomic divisions extending to the level of phylum. Comparisons of the Indian mangrove isolates to samples from mangrove in Brazil, and to a tropical rain forest in North America showed similar patterns, with most of the reads mapping to phylum Proteobacteria within the Bacteria kingdom. Fewer reads were found that mapped to Archaea, but those that were present predominantly to the phylum Euryarchaeota. Similar numbers of reads in each of the isolates from mangrove or rainforest, mapped to Eukaryota, which showed comparable divisions across the phyla Streptophyta (green plants), Ascomycota (fungi), Chordata (vertebrates), Cnidaria and Arthropoda (insects and arachnids). However, differences were noted, when these samples were compared to isolates taken from ocean sediments. The ocean samples had, on average, larger number of Archaea, with almost all the reads mapping to the phylum Thaumarchaeota. Like the forest samples, the predominant bacterial phylum in the ocean samples mapped to the phylum Proteobacteria.

This finding shows strong patterns in metagenomics structure across the samples taken from the two types of forest ecosystems. It shows distinct patterns that unite the forest samples and differentiates them from an ocean sediment ecosystem.


Soil samples

Approximately 250 g of soil were collected at 20 cm in depth from the surface in four different locations within Kerala, India and included Pyannur (PYN, Coordinates: 12.1050687, 75.2058), Panangod (PGD, Cord: 9.8959941, 76.326094), Vallarpadam(VL1, Cord: 9.9994138, 76.253705) and Madakal (MAL, Cord: 9.9091896, 76.30629) (Fig. 4B). Physical characteristics of the samples sites are provided (Supplementary Table 4). Following collection, each soil sample was preserved at −20 °C prior to DNA isolation.

Genomic DNA extraction and Sequencing

Genomic DNA was extracted using MoBio Powersoil DNA isolation Kit (MO BIO Laboratories, Inc., California) as per manufacturer’s instructions. Sample quality was verified by gel electrophoresis. Paired end read sequencing was performed on an Illumina HiSeq platform 2500 at SciGenome Labs Pvt Ltd, Cochin (India). The raw fastQ files were uploaded in NCBI SRA database with accession numbers SRR2844600, SRR2844601, SRR2844602 and SRR2844616.

Quality Control and Annotation Pipeline

The paired end fastQ read files of all the samples were uploaded to the Metagenome Rapid Annotation using Subsystem Technology (MG-RAST) server ( and processed following their standard protocol. Briefly, the mate-pairs were joined with overlap setting of 8 base pairs (bp) and a maximum difference of 10% and were then processed for quality control. Low-quality regions were trimmed off using SolexaQA56, de-replicated, and analyzed the artificially duplicated reads (ADRs)57 using Duplicate Read Inferred Sequencing Error Estimation (DRISEE)58. The near-exact matches against model organisms including fly, mouse, cow and human were removed using Bowtie59. Coding regions in DNA sequences of 75 bp and longer were predicted using FragGeneScan60. Protein clusters with 90% identity were built using the UCLUSTimplementation61 in Quantitative Insights into Microbial Ecology (QIIME). A representative of each cluster is subjected to similarity analysis using BLAST-like alignment tool (BLAT)62. Sequence similarity searches to identify proteins and mapped annotations are computed against a MG-RAST protein databases M5NR19, Genbank63,64, SEED23, Integrated Microbial Genomes & Microbiomes (IMG)65, Universal Protein Resource (UniProt)40, Kyoto Encyclopedia of Genes and Genomes (KEGG)66 and Evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG)67.

Taxonomic and Functional Analysis

Taxonomic assignments were carried out against the RefSeq protein database, Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST)55. This database is an integration of many sequence databases into one single, searchable database. Cut-offs included a maximum E-value of 1 × 10−5, a minimum percentage identity of 60%, and a minimum alignment length of 15 were used. The subsystems platform annotation that assigns genes to functional roles68,69, was used for functional analysis comparisons, with the same cut-off values used for the taxonomic assignments.

Comparison of metagenomes from different soil samples

Different sediment samples isolated from mangroves in Brazil21, a tropical forest in Puerto Rico22, and different sediment samples collected from the South China Sea23 were compared to the Kerala, India soil samples (Table 1), which were publically available at MG-RAST. We selected samples that could be linked to public data. A PCA provided by MG-RAST was used to examine dimensions of maximal variation, and an examination of the taxonomic diversity across all samples was conducted using the Organism tree tool. Taxonomic assignments used M5NR with the same cut-offs used for the Kerala samples, and an average was made across all the samples from a similar location/ecosystem for final comparison. The subsystems platform was used to assigned functional roles, with settings similar to those described above. When functional comparisons were made across metagenomes from geographic and ecological variants, a single isolate that was chosen based on the number of reads to represent each of the geographic groups (India-PGD-4671368.3, Brazil-BrMgv2-4451034.3, Puerto Rico Rain forest- 4446153.3, South China Sea-E201-1-4487294.3).

Availability of data and materials

The raw fastQ files for the Kerala India samples were uploaded in NCBI SRA database with accession numbers SRR2844600, SRR2844601, SRR2844602 and SRR2844616. In addition, the sequences for these reads are publicly available from the MG-RAST server under the following IDs: 4671371, 4671370, 467136 and 4671368.