309 metagenome assembled microbial genomes from deep sediment samples in the Gulfs of Kathiawar Peninsula

Prokaryoplankton genomes from the deep marine sediments are less explored compared to shallow shore sediments. The Gulfs of Kathiawar peninsula experience varied currents and inputs from different on-shore activities. Any perturbations would directly influence the microbiome and their normal homeostasis. Advancements in reconstructing genomes from metagenomes allows us to understand the role of individual unculturable microbes in ecological niches like the Gulf sediments. Here, we report 309 bacterial and archaeal genomes assembled from metagenomics data of deep sediments from sites in the Gulf of Khambhat and Gulf of Kutch as well as a sample from the Arabian Sea. Phylogenomics classified them into 5 archaeal and 18 bacterial phyla. The genomes will facilitate understanding of the physiology, adaptation and impact of on-shore anthropogenic activities on the deep sediment microbes.

www.nature.com/scientificdata www.nature.com/scientificdata/ onshore anthropogenic activities may have imparted unique features to the microflora of the Gulfs. Study of microbial diversity and functions in the mentioned Gulfs have largely been focused on cultivation based approaches and very few molecular studies have been conducted on the shore sediments. Additionally, the presence of several on-shore industries like fertilizer, chemicals, oil refineries, power plants and ASSBRY (Alang Ship Breaking Yard) may have also influenced the deeper sediment microbiome leading to their variable gene profile 7 . Our previous insights into the pelagic sediment resistome profile by metagenomics approach have shown that the deeper sediments, earlier thought to be primeval are actually hosting microbes with a concerning number of resistance genes 7,8 . This acted as a propeller to the present study wherein we tried to look deeper into the metagenomics data of the samples collected from the Gulfs of Kathiawar Peninsula and a sample from the Arabian Sea by sorting individual prokaryoplankton genomes from the data using the binning approach.
We successfully reconstructed 309 Metagenome Assembled Genomes (MAGs) from the nine sediment metagenomics sequences ( Table 1) from Gulf of Khambhat (GOC), Gulf of Kutch (GOK) and Arabian Sea (A) by differential coverage approach and considering the GC percent and tetranucleotide frequencies. Out of the 309 MAGs, 39 were archaeal genomes (Online-only Table 1) and 270 were bacterial genomes (Online-only Table 2). Seventy-one were high quality drafts with a completeness of ≥90% and contamination <10%, 120 were medium quality (completeness: 70-90%, contamination: <10%) and the remaining 118 were draft genomes with a final completeness of >50%. The distribution of the bins as per the MIMAG quality standards 9 is described in Table 2.
To the best of our knowledge, this is the first report of multiple MAGs from the studied sites.
Single nucleotide polymorphisms were correlated to quality of bins to understand the influence of strain heterogeneity on the fragmentation of the MAGs (Fig. 1). Phylogenomic analysis revealed that the archaeal populations were quite different in two Gulfs, with GOC bins (n = 15) encompassing 3 major phyla: Thaumarchaeota and Aenigmarchaeota from the DPANN superphylum andBathyarchaeota. The GOK genomes (n = 24) were falling under the Bathyarchaeota, Thaumarchaeota, Euryarchaeota and the Korarchaeota phyla ( Figs. 2 and 3). Based on the community profile assessment of the samples by considering all the reads, the above mentioned archaeal phyla represented <3% of the total microbial population at each sample site. Majority of the phyla were those reported earlier in the marine and estuarine environments, with most having few or no cultured representatives 10,11 . The observed genomes like Thaumarchaeota have been reported to be nitrifiers in the sediment niche, thus, the insights into their gene content will provide details on the functional significance of the archaea in the  www.nature.com/scientificdata www.nature.com/scientificdata/ respective sample site. Genomes from Thaumarchaeota were recovered from both the sites (Fig. 2). Nevertheless, the difference in the populations observed in two Gulfs can also be studied based on the predicted roles of the genomes and correlation with the niche properties. Bin quality www.nature.com/scientificdata www.nature.com/scientificdata/ Among the bacterial members, five phyla were commonly observed between the Gulfs viz., Proteobacteria, Zixibacteria, Gemmatimonadetes, Dadabacteria and Planctomycetes (Figs. 2 and 4). Among the common bacterial phyla, Proteobacteria majorly comprised of Gammaproteobacteria members which are the most abundant reported bacteria in the marine sediments and have been reported to perform versatile roles including metabolite production, hydrocarbon degradation, acetate assimilation and many more 12,13 . Zixibacteria and Dadabacteria MAGs have been reported from marine environments as an evolutionary phyla and these have been observed to play role in the nutrient cycling of the niche 14,15 . Apart from these, few genomes in GOC encompassed Bacteroidetes, FCB superphylum, Armatimonadates, Acidobacteria, Chloroflexi and Aminicenantes phyla; while those in GOK were falling under Actinobacteria, KSB1, Saccharibacteria (TM7), Nitrospinae, Caldithrix, Verrucomicrobia and Balneolaeota. Species belonging to Nitrospinae are reported to be exclusively abundant in marine niche, where they play a role in nitrite oxidization, as well as these are ubiquitously observed in sites demanding thermoprotection 16,17 . Community profiling of the samples by considering all the reads revealed that the MAGs identified within Proteobacteria (>40%) and Chloroflexi (~15%) phyla represented a substantial population, while rest of the MAGs corresponded to 0.01% to 5% of the total microbial community at each sample site (details in Supplementary Table 1a and b).
The genomic bins described here would prominently enhance the repertoire of microbial genomic information from the Gulfs of Kathiawar Peninsula. It will also provide the insights for better understanding the effects of on-shore activities on the microbiome of deeper sediment in the Gulfs. In the long term, the data will fortify further applications of the genomic information for 1) understanding the microbes involved in the marine nutrient

Methods
Sample collection and whole metagenome sequencing. The sediment samples were collected and sequenced for whole metagenomics using Illumina HiSeq platform as described earlier 7,8 . In brief, one-meterlong sediment cores were collected from 9 locations across the 2 Gulfs, Gujarat state and open Arabian Sea by sailing through boats. The cores were maintained in cold storage and processed by cutting into halves without disturbing the sediments. 10 cm of sediment from top, middle and bottom of the core each was distributed into 3 sterile 50 ml collection containers. They were further used for assessment of physicochemical properties, metagenomic DNA isolation and culturing purpose. DNA was isolated in multiplicates to reach desired quantity using the MoBio Power Soil DNA isolation Kit (Qiagen, Germany). The DNA from each core section was pooled in equimolar concentrations for whole metagenomics sequencing using HiSeq 4000 (Macrogen Inc., Korea). No internal reference or control was used during the sequencing. The sequences were quality filtered for adaptor removal and a minimum quality score of 20.
Metagenomics assembly. The quality filtered reads of four GOK samples were used for pooled assembly using CLC Genomics Workbench v11.0 with default parameters except a k-mer size of 31 and a minimum contig  Fig. 4 Bacterial clades having ≥10 MAGs classified as the same level are collapsed and represented by triangles, size of which is proportional to the number of genomes collapsed in the taxa level which is also mentioned in the parentheses. Validity of the tree is indicated by filled black circles, size indicates bootstrap between 80 to 100%. The Akkermansiaceae (phylum: Verrucomicrobia) was arbitrarily taken as the root, the tree may be considered as unrooted 34 . www.nature.com/scientificdata www.nature.com/scientificdata/ length of 1 Kb, which resulted into 478 Mb of assembled data. Similarly, the four samples of GOC along with the A sample were included in another set of pooled assembly of 779 Mb. The raw reads from each of the individual nine samples were mapped against the two assemblies for coverage estimation using CLC Genomics Workbench. The coverage and the BAM files were obtained for further binning process. Genome binning. Metagenome assembled genomes (MAGs) were binned from both the assemblies using Maxbin v2.0 6 using the full reference marker set of genes for bacteria and archaea. More than 900 bins were initially obtained from the pooled datasets. The quality of bins was checked by CheckM v1.1.0 using the lineage-specific workflow and the bins were assessed based on its completeness and contamination values 18 . Contigs with outlier values for the GC percentage and tetranucleotide frequency were removed from the bins for lowering the contamination levels. The assessed bins were further refined using RefineM v0.0.22 by individual genomic properties, taxonomy and SSU based approaches. Further, the individual output of the genomic properties was used as input for the other methods, viz., output bins from the genomic property refined program was further filtered by taxonomic method and so on. The refined bins were re-assessed using CheckM v1.1.0 18 . It was observed that the refinement using the genomic properties which screens for any outlier contigs/scaffoled in a MAG in terms of GC percent, tetranuceotide frequency and coverage did improve the bins in terms of their completeness. While, the taxa based refinement gave an overall improvement of ~2% for few bins and a reduction in the contamination by removal of duplicate or miss-assigned single copy gene encoding contigs. SSU based refinement had no major impact on the MAGs in the study. The bins were then sorted into high quality, medium quality, draft and/or low quality genomes. Out of >900 bins, 309 bins that were falling up to the draft genome category were checked for taxonomic classification using the GTDB-tk v0.3.3 classifier 19 . However, for final submission as suggested by NCBI team, few of the NCBI taxonomic synonyms from GTDB-tk classification were considered. The strain names in the nomenclature were assigned as "sample number of the mapped reads -pooled assembly against which the reads were mapped followed by the number of bin from the mapped sample", as an example for the strain CS3-K071, CS3 indicates Gulf of Khambhat/Cambay Sample 3 which was mapped against the pooled assembly of Kutch samples and 071 is the bin number from the total bins generated from this mapping. The bins were also submitted to RAST v2.0 for annotation and the number of Protein Encoding Genes (PEGs) for each MAG were inferred from the same for preliminary functional assessment prior to NCBI submission 20 .
SNP estimation of the MAGs. SNPs were called for each MAG (n = 275, bins generated from all nine samples as a pool were omitted from SNP analysis) (Supplementary Table 1a and b) to assess their genetic diversity as described earlier 21  Phylogenomic tree construction. The archaeal and bacterial trees were inferred using the insert genome set into species tree app in the Kbase 24 . The annotated bins from NCBI were uploaded as GenBank file and a genome set was prepared using the app Batch Create Genome Set v1.2.0 along with one genome from the database (as default parameter was to take minimum one reference genome). The tree was computed by the alignment of a pre-decided subset of COG (Clusters of Orthologous Groups) domains using FastTree v. 2.1.10 25 , by maximum likelihood phylogeny. The tree was further annotated by iTOL v5.0 26 . The reference genome was hidden during visualization, keeping only the MAGs under the study.

Data records
The raw metagenomics reads and their corresponding pooled assemblies are available from EBI and NCBI 27,28 , respectively as detailed in Table 1. The sample-wise metagenome assemblies and pooled assemblies (GOC-A and GOK) are available under the Bioproject Id as mentioned in Table 1 29,30 . The 309 assembled genome sequences and their functional annotations are available from NCBI database 31,32 via biosample and genome accession numbers as detailed in Online-only Tables 1 and 2. The tree files corresponding to the figures and with reference genomes can be accessed throughfigshare 33 .

Technical Validation
The quality of MAGs was assessed using CheckM to validate the completeness and contamination of the bins. The genomes were also manually assessed at each point for similar bins by considering the parameters like GC, genome statistics and the number of genes.